| Message ID | 20150811143846.672A92039@prod-mail-relay10.akamai.com |
| --- | --- |
| State | Changes Requested, archived |
| Delegated to | David Miller |
On Tue, 2015-08-11 at 14:38 +0000, Jason Baron wrote:
> From: Jason Baron <jbaron@akamai.com>
>
> In my testing, this brought a single thread's cpu usage down from 100% to ~1%
> while maintaining the same level of throughput.

Hi Jason. Could you give more details on this test?

How many flows are competing?
On 08/11/2015 10:49 AM, Eric Dumazet wrote:
> On Tue, 2015-08-11 at 14:38 +0000, Jason Baron wrote:
>> From: Jason Baron <jbaron@akamai.com>
>>
>> In my testing, this brought a single thread's cpu usage down from 100% to ~1%
>> while maintaining the same level of throughput.
>
> Hi Jason. Could you give more details on this test?
>
> How many flows are competing?

Yes, the test case I'm using here is somewhat contrived: I simply allocate
around 40,000 idle sockets to create 'permanent' memory pressure in the
background. Then I have just 1 flow that sets SO_SNDBUF, which results in the
poll(), write() loop.

That said, we encountered this issue initially with 10,000+ flows: whenever
the system got into memory pressure, we would see all the cpus spin at 100%.

So the testcase I wrote was just a simplistic version of that. I am going to
try to reproduce against the more realistic workload where the issue was
initially observed.

Thanks,

-Jason
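For illustration, the loop described above has roughly the following shape. This is a minimal sketch, not the posted test program: the peer address, port, and buffer sizes are invented, error handling is omitted, and it assumes a peer that accepts connections but never reads, so the background sockets keep tcp memory pinned.

/*
 * Sketch of the contrived test: NIDLE background connections whose send
 * queues are filled once and left idle, plus one flow with a small
 * SO_SNDBUF driven by a poll()/write() loop.
 */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NIDLE	40000
#define PORT	9000

static int tcp_connect(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET };
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	a.sin_port = htons(PORT);
	inet_pton(AF_INET, "127.0.0.1", &a.sin_addr);
	connect(fd, (struct sockaddr *)&a, sizeof(a));
	return fd;
}

int main(void)
{
	static int fds[NIDLE];
	int sndbuf = 4096, fd, i;
	char buf[4096];

	memset(buf, 'x', sizeof(buf));

	/* Background pressure: fill each connection's send queue once,
	 * then leave it idle so the queued memory stays pinned. */
	for (i = 0; i < NIDLE; i++) {
		fds[i] = tcp_connect();
		fcntl(fds[i], F_SETFL, O_NONBLOCK);
		while (write(fds[i], buf, sizeof(buf)) > 0)
			;
	}

	/* The one flow under test: small SO_SNDBUF, poll()/write() loop. */
	fd = tcp_connect();
	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
	fcntl(fd, F_SETFL, O_NONBLOCK);

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };

		/* Under the reported bug, poll() keeps claiming POLLOUT
		 * while write() keeps failing, so this loop spins. */
		poll(&pfd, 1, -1);
		if (pfd.revents & POLLOUT)
			write(fd, buf, sizeof(buf));
	}
}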
On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
> Yes, the test case I'm using here is somewhat contrived: I simply allocate
> around 40,000 idle sockets to create 'permanent' memory pressure in the
> background. Then I have just 1 flow that sets SO_SNDBUF, which results in
> the poll(), write() loop.
>
> That said, we encountered this issue initially with 10,000+ flows: whenever
> the system got into memory pressure, we would see all the cpus spin at 100%.
>
> So the testcase I wrote was just a simplistic version of that. I am going to
> try to reproduce against the more realistic workload where the issue was
> initially observed.

Note that I am still trying to understand why we need to grow the socket
structure for something which is inherently a problem of sharing memory with
an unknown (potentially large) number of sockets.

I suggested using a flag (one bit): if it is set, we fall back to
tcp_wmem[0], so each socket keeps 4096 bytes and we avoid starvation.
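For concreteness, the suggestion might look roughly like the sketch below. The bit name sk_memalloc_fallback is invented here (no such field exists in any posted patch), the set/clear sites on the memory-pressure paths are not shown, and the layering is glossed over (sk_stream_wspace() is protocol-agnostic while tcp_wmem is TCP-specific):

/* Hypothetical sketch of the one-bit fallback: while the bit is set,
 * write space is computed against tcp_wmem[0] instead of sk_sndbuf,
 * so every socket keeps a minimal 4096-byte budget and none starves.
 */
static inline int sk_stream_wspace(const struct sock *sk)
{
	int sndbuf = sk->sk_sndbuf;

	if (sk->sk_memalloc_fallback)	/* the suggested one bit */
		sndbuf = min(sndbuf, sysctl_tcp_wmem[0]);

	return sndbuf - sk->sk_wmem_queued;
}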
On 08/11/2015 12:12 PM, Eric Dumazet wrote:
> On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
>> Yes, the test case I'm using here is somewhat contrived: I simply allocate
>> around 40,000 idle sockets to create 'permanent' memory pressure in the
>> background. Then I have just 1 flow that sets SO_SNDBUF, which results in
>> the poll(), write() loop.
>>
>> That said, we encountered this issue initially with 10,000+ flows:
>> whenever the system got into memory pressure, we would see all the cpus
>> spin at 100%.
>>
>> So the testcase I wrote was just a simplistic version of that. I am going
>> to try to reproduce against the more realistic workload where the issue
>> was initially observed.
>
> Note that I am still trying to understand why we need to grow the socket
> structure for something which is inherently a problem of sharing memory
> with an unknown (potentially large) number of sockets.

I was trying to mirror the wakeups for the case where SO_SNDBUF is not set,
where we continue to trigger on 1/3 of the buffer being available as
sk->sk_sndbuf is shrunk. I saw this value as dynamic, depending on the number
of sockets and read/write buffer usage; that's where I was coming from with
it.

Also, at least with the .config I have, the tcp_sock structure didn't
increase in size (although struct sock did go up by 8 bytes, not 4).

> I suggested using a flag (one bit): if it is set, we fall back to
> tcp_wmem[0], so each socket keeps 4096 bytes and we avoid starvation.

Ok, I will test this approach.

Thanks,

-Jason
On 08/11/2015 01:59 PM, Jason Baron wrote:
> On 08/11/2015 12:12 PM, Eric Dumazet wrote:
>> On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
>>> Yes, the test case I'm using here is somewhat contrived: I simply
>>> allocate around 40,000 idle sockets to create 'permanent' memory
>>> pressure in the background. Then I have just 1 flow that sets SO_SNDBUF,
>>> which results in the poll(), write() loop.
>>>
>>> That said, we encountered this issue initially with 10,000+ flows:
>>> whenever the system got into memory pressure, we would see all the cpus
>>> spin at 100%.
>>>
>>> So the testcase I wrote was just a simplistic version of that. I am
>>> going to try to reproduce against the more realistic workload where the
>>> issue was initially observed.
>>
>> Note that I am still trying to understand why we need to grow the socket
>> structure for something which is inherently a problem of sharing memory
>> with an unknown (potentially large) number of sockets.
>
> I was trying to mirror the wakeups for the case where SO_SNDBUF is not
> set, where we continue to trigger on 1/3 of the buffer being available as
> sk->sk_sndbuf is shrunk. I saw this value as dynamic, depending on the
> number of sockets and read/write buffer usage; that's where I was coming
> from with it.
>
> Also, at least with the .config I have, the tcp_sock structure didn't
> increase in size (although struct sock did go up by 8 bytes, not 4).
>
>> I suggested using a flag (one bit): if it is set, we fall back to
>> tcp_wmem[0], so each socket keeps 4096 bytes and we avoid starvation.
>
> Ok, I will test this approach.

Hi Eric,

So I created a test here with 20,000 streams, and if I set SO_SNDBUF high
enough on the server side, I can create tcp memory pressure above tcp_mem[2].
In this case, with the 'one bit' approach using tcp_wmem[0] as the wakeup
threshold, I can still observe the 100% cpu spinning issue, while with this
v2 patch cpu usage stays minimal (1-2%). The reason is that tcp_wmem[0] is
not guaranteed above tcp_mem[2], so the tcp_wmem[0] threshold may never be
reached.

So the 'one bit' definitely alleviates the spinning between tcp_mem[1] and
tcp_mem[2], but not above tcp_mem[2] in my testing. Maybe nobody cares about
this case (you are getting what you asked for by using SO_SNDBUF), but it
seems to me that it would be nice to avoid this sort of behavior.

I also like the fact that with sk_effective_sndbuf we keep doing wakeups on
1/3 of the write buffer emptying, which keeps the wakeup behavior consistent.
In theory this would matter on a high-latency, high-bandwidth link, but in
the testing I did, I didn't observe any throughput difference between this v2
patch and the 'one bit' approach.

As I mentioned, with this v2 'struct sock' grows by 4 bytes while struct
tcp_sock does not increase. Since this is tcp specific, we could instead add
sk_effective_sndbuf only to struct tcp_sock.

So the 'one bit' approach definitely seems to be an improvement, but I wanted
to get feedback on this testing before deciding how to proceed.

Thanks,

-Jason
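For reference, the 1/3 figure falls out of how the stream layer defines writeability; in kernels of this vintage, include/net/sock.h reads roughly:

/* A socket is reported writeable once free space reaches half of what
 * is queued:
 *
 *	sndbuf - queued >= queued / 2  <=>  queued <= 2/3 * sndbuf
 *
 * i.e. once at least one third of the send buffer is free.  With
 * sk_effective_sndbuf standing in for sk_sndbuf (as in the v2 patch
 * below), the same rule keeps generating wakeups as the queue drains
 * under memory pressure.
 */
static inline int sk_stream_min_wspace(const struct sock *sk)
{
	return sk->sk_wmem_queued >> 1;
}

static inline bool sk_stream_is_writeable(const struct sock *sk)
{
	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
}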
diff --git a/include/net/sock.h b/include/net/sock.h
index 43c6abc..86598e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -243,6 +243,8 @@ struct cg_proto;
  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
  *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
  *	@sk_sndbuf: size of send buffer in bytes
+ *	@sk_effective_sndbuf: Less than or equal to the size of sk_sndbuf when
+ *			      under memory pressure, 0 otherwise.
  *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  *	@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
@@ -380,6 +382,7 @@ struct sock {
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
+	int			sk_effective_sndbuf;
 	struct sk_buff_head	sk_write_queue;
 	kmemcheck_bitfield_begin(flags);
 	unsigned int		sk_shutdown  : 2,
@@ -779,6 +782,14 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline void sk_set_effective_sndbuf(struct sock *sk)
+{
+	if (sk->sk_wmem_queued > sk->sk_sndbuf)
+		sk->sk_effective_sndbuf = sk->sk_sndbuf;
+	else
+		sk->sk_effective_sndbuf = sk->sk_wmem_queued;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
@@ -789,6 +800,11 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
+	int effective_sndbuf = READ_ONCE(sk->sk_effective_sndbuf);
+
+	if (effective_sndbuf)
+		return effective_sndbuf - sk->sk_wmem_queued;
+
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d..4fce879 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2309,6 +2309,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_allocation	=	GFP_KERNEL;
 	sk->sk_rcvbuf		=	sysctl_rmem_default;
 	sk->sk_sndbuf		=	sysctl_wmem_default;
+	sk->sk_effective_sndbuf = 0;
 	sk->sk_state		=	TCP_CLOSE;
 	sk_set_socket(sk, sock);
 
diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a..7c175e7 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
 	if (sk_stream_is_writeable(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
+		sk->sk_effective_sndbuf = 0;
 
 		rcu_read_lock();
 		wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 45534a5..d935ad5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -923,8 +923,10 @@ new_segment:
 		skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 					  skb_queue_empty(&sk->sk_write_queue));
-		if (!skb)
+		if (!skb) {
+			sk_set_effective_sndbuf(sk);
 			goto wait_for_memory;
+		}
 
 		skb_entail(sk, skb);
 		copy = size_goal;
@@ -939,8 +941,10 @@ new_segment:
 			tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
-		if (!sk_wmem_schedule(sk, copy))
+		if (!sk_wmem_schedule(sk, copy)) {
+			sk_set_effective_sndbuf(sk);
 			goto wait_for_memory;
+		}
 
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
@@ -1163,8 +1167,10 @@ new_segment:
 						  select_size(sk, sg),
 						  sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
-			if (!skb)
+			if (!skb) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			/*
 			 * Check whether we can use HW checksum.
@@ -1200,8 +1206,10 @@ new_segment:
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
 
-			if (!sk_page_frag_refill(sk, pfrag))
+			if (!sk_page_frag_refill(sk, pfrag)) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
@@ -1214,8 +1222,10 @@ new_segment:
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
-			if (!sk_wmem_schedule(sk, copy))
+			if (!sk_wmem_schedule(sk, copy)) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			err = skb_copy_to_page_nocache(sk, &msg->msg_iter,
 						       skb, pfrag->page,
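In short, the v2 patch works as follows: each wait_for_memory path snapshots the current (shrunken) queue size into sk_effective_sndbuf, sk_stream_wspace() then reports write space against that snapshot (read with READ_ONCE() since poll() may look at it without the socket lock), and sk_stream_write_space() clears the snapshot once the socket becomes writeable again, restoring the normal sk_sndbuf-based accounting.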