
[net-next,v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set

Message ID 20150811143846.672A92039@prod-mail-relay10.akamai.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Jason Baron Aug. 11, 2015, 2:38 p.m. UTC
From: Jason Baron <jbaron@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the effective write
buffer space can be much lower than what was set using SO_SNDBUF. For example,
we may have set the buffer to 100kb, but we may only be able to write 10kb. In
this scenario, poll()/select()/epoll() will continuously return POLLOUT, followed
by -EAGAIN from write(), in a very tight loop.
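
For reference, a minimal userspace sketch of that spin (illustrative only; it
assumes a connected, non-blocking TCP socket with SO_SNDBUF already set):

/*
 * Illustrative userspace loop showing the spin: under tcp memory
 * pressure poll() keeps reporting POLLOUT, but write() immediately
 * fails with EAGAIN, so the loop burns cpu without making progress.
 */
#include <errno.h>
#include <poll.h>
#include <unistd.h>

static ssize_t send_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		ssize_t n;

		if (poll(&pfd, 1, -1) < 0)	/* reports POLLOUT... */
			return -1;

		n = write(fd, buf + off, len - off);
		if (n < 0) {
			if (errno == EAGAIN)	/* ...but write() still fails */
				continue;	/* -> tight poll()/write() loop */
			return -1;
		}
		off += n;
	}
	return off;
}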

Introduce sk->sk_effective_sndbuf, such that we can track the 'effective' size
of the sndbuf, when we have a short write due to memory pressure. By using the
sk->sk_effective_sndbuf instead of the sk->sk_sndbuf when we are under memory
pressure, we can delay the POLLOUT until 1/3 of the buffer clears as we normally
do. There is no issue here when SO_SNDBUF is not set, since the tcp layer will
auto-tune sk->sk_sndbuf.
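
For context, the 1/3 figure falls out of the existing write space helpers; a
simplified sketch of the arithmetic (not the actual kernel code, see
sk_stream_wspace()/sk_stream_min_wspace() in include/net/sock.h):

/*
 * Simplified sketch of the writeable check:
 *
 *   wspace     = sndbuf - wmem_queued       free space in the send buffer
 *   min_wspace = wmem_queued / 2
 *
 * POLLOUT is reported once wspace >= min_wspace, i.e. once
 * wmem_queued <= 2/3 * sndbuf, so the wakeup fires after roughly 1/3
 * of the buffer has drained.  With this patch, "sndbuf" becomes the
 * smaller sk_effective_sndbuf while under memory pressure.
 */
static inline int stream_writeable_sketch(int sndbuf, int wmem_queued)
{
	return (sndbuf - wmem_queued) >= (wmem_queued >> 1);
}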

Although sk->sk_effective_sndbuf, and even sk->sk_sndbuf, can change in
subsequent calls to tcp_poll() (queued via wakeups), we can guarantee that we
will eventually satisfy the write space check in all cases, since the write
queue will eventually drain to 0, and sk->sk_sndbuf and sk->sk_effective_sndbuf
will be non-zero when used in the write space check. If the write queue does not
drain to 0, then the client is not responding, and we will eventually time out
the socket.

In my testing, this brought a single thread's cpu usage down from 100% to ~1%
while maintaining the same level of throughput.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
Changes from V1:
	-Add READ_ONCE() in sk_stream_wspace() (Eric Dumazet)
	-Restrict sk_set_effective_sndbuf() to the write() path
---
 include/net/sock.h | 16 ++++++++++++++++
 net/core/sock.c    |  1 +
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c     | 20 +++++++++++++++-----
 4 files changed, 33 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Aug. 11, 2015, 2:49 p.m. UTC | #1
On Tue, 2015-08-11 at 14:38 +0000, Jason Baron wrote:
> From: Jason Baron <jbaron@akamai.com>

> In my testing, this brought a single thread's cpu usage down from 100% to ~1%
> while maintaining the same level of throughput.
> 

Hi Jason. Could you give more details on this test ?

How many flows are competing ?



Jason Baron Aug. 11, 2015, 3:03 p.m. UTC | #2
On 08/11/2015 10:49 AM, Eric Dumazet wrote:
> On Tue, 2015-08-11 at 14:38 +0000, Jason Baron wrote:
>> From: Jason Baron <jbaron@akamai.com>
> 
>> In my testing, this brought a single thread's cpu usage down from 100% to ~1%
>> while maintaining the same level of throughput.
>>
> 
> Hi Jason. Could you give more details on this test ?
> 
> How many flows are competing ?
> 
>

Yes, so the test case I'm using to test against is somewhat contrived.
In that I am simply allocating around 40,000 sockets that are idle to
create a 'permanent' memory pressure in the background. Then, I have
just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.

That said, we encountered this issue initially where we had 10,000+
flows and whenever the system would get into memory pressure, we would
see all the cpus spin at 100%.

So the testcase I wrote, was just a simplistic version for testing. But
I am going to try and test against the more realistic workload where
this issue was initially observed.
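
A rough sketch of that kind of contrived setup (the address, port, counts, and
buffer size below are illustrative placeholders, not the actual test program;
it also assumes a listening test server and a suitably raised fd limit):

/*
 * Rough sketch of the contrived test: hold open a large number of idle
 * TCP connections to build background tcp memory pressure, then drive a
 * single flow with SO_SNDBUF set through the poll()/write() loop.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define IDLE_SOCKETS	40000
#define SNDBUF_BYTES	(100 * 1024)		/* ~100kb, as in the example */

static int tcp_connect(const struct sockaddr_in *addr)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(9000),	/* placeholder test server port */
		.sin_addr   = { .s_addr = htonl(INADDR_LOOPBACK) },
	};
	int sndbuf = SNDBUF_BYTES;
	int i, fd;

	for (i = 0; i < IDLE_SOCKETS; i++)	/* idle flows: background pressure */
		tcp_connect(&addr);

	fd = tcp_connect(&addr);		/* the single active flow */
	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

	/* ... poll()/write() loop on fd, as sketched earlier ... */
	return 0;
}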

Thanks,

-Jason

Eric Dumazet Aug. 11, 2015, 4:12 p.m. UTC | #3
On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:

> 
> Yes, so the test case I'm using to test against is somewhat contrived.
> In that I am simply allocating around 40,000 sockets that are idle to
> create a 'permanent' memory pressure in the background. Then, I have
> just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.
> 
> That said, we encountered this issue initially where we had 10,000+
> flows and whenever the system would get into memory pressure, we would
> see all the cpus spin at 100%.
> 
> So the testcase I wrote, was just a simplistic version for testing. But
> I am going to try and test against the more realistic workload where
> this issue was initially observed.
> 

Note that I am still trying to understand why we need to increase socket
structure, for something which is inherently a problem of sharing memory
with an unknown (potentially big) number of sockets.

I suggested to use a flag (one bit).

If set, then we should fallback to tcp_wmem[0] (each socket has 4096
bytes, so that we can avoid starvation)
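
For reference, a rough sketch of how such a fallback might look (the flag name
and helper below are purely hypothetical illustrations, not an actual
implementation):

/*
 * Hypothetical sketch of the "one bit" idea: if the socket hit memory
 * pressure on the write path, judge write space against tcp_wmem[0]
 * (4096 bytes by default) instead of the full sk_sndbuf.
 * SOCK_WMEM_PRESSURE is an invented name, not an existing flag.
 */
static inline int sk_stream_wspace_onebit(const struct sock *sk)
{
	int limit = sk->sk_sndbuf;

	if (test_bit(SOCK_WMEM_PRESSURE, &sk->sk_socket->flags))
		limit = sysctl_tcp_wmem[0];

	return limit - sk->sk_wmem_queued;
}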



Jason Baron Aug. 11, 2015, 5:59 p.m. UTC | #4
On 08/11/2015 12:12 PM, Eric Dumazet wrote:
> On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
> 
>>
>> Yes, so the test case I'm using to test against is somewhat contrived.
>> In that I am simply allocating around 40,000 sockets that are idle to
>> create a 'permanent' memory pressure in the background. Then, I have
>> just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.
>>
>> That said, we encountered this issue initially where we had 10,000+
>> flows and whenever the system would get into memory pressure, we would
>> see all the cpus spin at 100%.
>>
>> So the testcase I wrote, was just a simplistic version for testing. But
>> I am going to try and test against the more realistic workload where
>> this issue was initially observed.
>>
> 
> Note that I am still trying to understand why we need to increase socket
> structure, for something which is inherently a problem of sharing memory
> with an unknown (potentially big) number of sockets.
> 

I was trying to mirror the wakeups when SO_SNDBUF is not set, where we
continue to trigger on 1/3 of the buffer being available, as the
sk->sndbuf is shrunk. And I saw this value as dynamic depending on
number of sockets and read/write buffer usage. So that's where I was
coming from with it.

Also, at least with the .config I have the tcp_sock structure didn't
increase in size (although struct sock did go up by 8 and not 4).

> I suggested to use a flag (one bit).
> 
> If set, then we should fallback to tcp_wmem[0] (each socket has 4096
> bytes, so that we can avoid starvation)
> 
> 
> 

Ok, I will test this approach.

Thanks,

-Jason
Jason Baron Aug. 21, 2015, 8:55 p.m. UTC | #5
On 08/11/2015 01:59 PM, Jason Baron wrote:
> 
> 
> On 08/11/2015 12:12 PM, Eric Dumazet wrote:
>> On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
>>
>>>
>>> Yes, so the test case I'm using to test against is somewhat contrived.
>>> In that I am simply allocating around 40,000 sockets that are idle to
>>> create a 'permanent' memory pressure in the background. Then, I have
>>> just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.
>>>
>>> That said, we encountered this issue initially where we had 10,000+
>>> flows and whenever the system would get into memory pressure, we would
>>> see all the cpus spin at 100%.
>>>
>>> So the testcase I wrote, was just a simplistic version for testing. But
>>> I am going to try and test against the more realistic workload where
>>> this issue was initially observed.
>>>
>>
>> Note that I am still trying to understand why we need to increase socket
>> structure, for something which is inherently a problem of sharing memory
>> with an unknown (potentially big) number of sockets.
>>
> 
> I was trying to mirror the wakeups when SO_SNDBUF is not set, where we
> continue to trigger on 1/3 of the buffer being available, as the
> sk->sndbuf is shrunk. And I saw this value as dynamic depending on
> number of sockets and read/write buffer usage. So that's where I was
> coming from with it.
> 
> Also, at least with the .config I have the tcp_sock structure didn't
> increase in size (although struct sock did go up by 8 and not 4).
> 
>> I suggested to use a flag (one bit).
>>
>> If set, then we should fallback to tcp_wmem[0] (each socket has 4096
>> bytes, so that we can avoid starvation)
>>
>>
>>
> 
> Ok, I will test this approach.

Hi Eric,

So I created a test here with 20,000 streams, and if I set SO_SNDBUF
high enough on the server side, I can create tcp memory pressure above
tcp_mem[2]. In this case, with the 'one bit' approach using tcp_wmem[0]
as the wakeup threshold, I can still observe the 100% cpu spinning issue,
since we don't guarantee tcp_wmem[0] above tcp_mem[2]; with this v2 patch,
cpu usage is minimal (1-2%). So the 'one bit' approach definitely
alleviates the spinning between tcp_mem[1] and tcp_mem[2], but not above
tcp_mem[2] in my testing.

Maybe nobody cares about this case (you are getting what you ask for by
using SO_SNDBUF), but it seems to me that it would be nice to avoid this
sort of behavior. I also like the fact that with sk_effective_sndbuf, we
keep doing wakeups when 1/3 of the write buffer empties, which keeps the
wakeup behavior consistent. In theory this would matter for a
high-latency, high-bandwidth link, but in the testing I did, I didn't
observe any throughput differences between this v2 patch and the 'one
bit' approach.

As I mentioned, with this v2 'struct sock' grows by 4 bytes, but struct
tcp_sock does not increase in size. Since this is tcp-specific, we could
instead add sk_effective_sndbuf only to struct tcp_sock.

So the 'one bit' approach definitely seems to me to be an improvement,
but I wanted to get feedback on this testing before deciding how to
proceed.

Thanks,

-Jason

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 43c6abc..86598e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -243,6 +243,8 @@  struct cg_proto;
   *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *	@sk_sndbuf: size of send buffer in bytes
+  *	@sk_effective_sndbuf: Less than or equal to the size of sk_sndbuf when
+  *			      under memory pressure, 0 otherwise.
   *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *	@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
@@ -380,6 +382,7 @@  struct sock {
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
+	int			sk_effective_sndbuf;
 	struct sk_buff_head	sk_write_queue;
 	kmemcheck_bitfield_begin(flags);
 	unsigned int		sk_shutdown  : 2,
@@ -779,6 +782,14 @@  static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline void sk_set_effective_sndbuf(struct sock *sk)
+{
+	if (sk->sk_wmem_queued > sk->sk_sndbuf)
+		sk->sk_effective_sndbuf = sk->sk_sndbuf;
+	else
+		sk->sk_effective_sndbuf = sk->sk_wmem_queued;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
@@ -789,6 +800,11 @@  static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
+	int effective_sndbuf = READ_ONCE(sk->sk_effective_sndbuf);
+
+	if (effective_sndbuf)
+		return effective_sndbuf - sk->sk_wmem_queued;
+
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d..4fce879 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2309,6 +2309,7 @@  void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_allocation	=	GFP_KERNEL;
 	sk->sk_rcvbuf		=	sysctl_rmem_default;
 	sk->sk_sndbuf		=	sysctl_wmem_default;
+	sk->sk_effective_sndbuf =	0;
 	sk->sk_state		=	TCP_CLOSE;
 	sk_set_socket(sk, sock);
 
diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a..7c175e7 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@  void sk_stream_write_space(struct sock *sk)
 
 	if (sk_stream_is_writeable(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
+		sk->sk_effective_sndbuf = 0;
 
 		rcu_read_lock();
 		wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 45534a5..d935ad5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -923,8 +923,10 @@  new_segment:
 
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
-			if (!skb)
+			if (!skb) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			skb_entail(sk, skb);
 			copy = size_goal;
@@ -939,8 +941,10 @@  new_segment:
 			tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
-		if (!sk_wmem_schedule(sk, copy))
+		if (!sk_wmem_schedule(sk, copy)) {
+			sk_set_effective_sndbuf(sk);
 			goto wait_for_memory;
+		}
 
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
@@ -1163,8 +1167,10 @@  new_segment:
 						  select_size(sk, sg),
 						  sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
-			if (!skb)
+			if (!skb) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			/*
 			 * Check whether we can use HW checksum.
@@ -1200,8 +1206,10 @@  new_segment:
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
 
-			if (!sk_page_frag_refill(sk, pfrag))
+			if (!sk_page_frag_refill(sk, pfrag)) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
@@ -1214,8 +1222,10 @@  new_segment:
 
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
-			if (!sk_wmem_schedule(sk, copy))
+			if (!sk_wmem_schedule(sk, copy)) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
 						       pfrag->page,