Message ID: 1335173934.3293.84.camel@edumazet-glaptop
State: Accepted, archived
Delegated to: David Miller
On 04/23/2012 02:38 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating TCP performance problems on 10Gb+ links, we found a
> tcp sender was dropping a lot of incoming ACKs because of the sk_rcvbuf
> limit in sk_add_backlog(), especially if the receiver doesn't use
> GRO/LRO and sends one ACK every two MSS segments.
>
> A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
> value (87380), allowing a too small backlog.
>
> A TCP ACK, even being small, can consume nearly the same truesize space
> as outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense
> and is fast to compute.
>
> Performance results on netperf, single flow, receiver with disabled
> GRO/LRO: 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
> increments at sender.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Cc: Rick Jones <rick.jones2@hp.com>
> ---
>  net/ipv4/tcp_ipv4.c | 3 ++-
>  net/ipv6/tcp_ipv6.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 917607e..cf97e98 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1752,7 +1752,8 @@ process:
>  			if (!tcp_prequeue(sk, skb))
>  				ret = tcp_v4_do_rcv(sk, skb);
>  		}
> -	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
> +	} else if (unlikely(sk_add_backlog(sk, skb,
> +					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
>  		bh_unlock_sock(sk);
>  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>  		goto discard_and_relse;

This will increase what can be queued for arriving segments in general
and not for ACKs specifically, yes? (A possible issue that would have
come up with my previous wondering about just increasing SO_RCVBUF as
SO_SNDBUF was increasing.)
Perhaps only add sk->sk_sndbuf to the limit if the arriving segment
contains no data?

rick

> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index b04e6d8..5fb19d3 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1654,7 +1654,8 @@ process:
>  			if (!tcp_prequeue(sk, skb))
>  				ret = tcp_v6_do_rcv(sk, skb);
>  		}
> -	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
> +	} else if (unlikely(sk_add_backlog(sk, skb,
> +					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
>  		bh_unlock_sock(sk);
>  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>  		goto discard_and_relse;

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2012-04-23 at 10:14 -0700, Rick Jones wrote:
>
> This will increase what can be queued for arriving segments in general
> and not for ACKs specifically yes? (A possible issue that would have
> come-up with my previous wondering about just increasing SO_RCVBUF as
> SO_SNDBUF was increasing). Perhaps only add sk->sk_sndbuf to the limit
> if the arriving segment contains no data?

That's the backlog limit that we tweak here.

It's not a big deal if we allow a few more packets to come in and later
drop them if we hit the real rcvbuf limit. (ACKs won't consume space,
since they are freed as soon as they are processed.)

By the way, we used to have a (sk_rcvbuf << 1) limit in the past.
Before commit c377411f2494a931ff we had:

	if (sk->sk_backlog.len >= max(sk->sk_backlog.limit,
				      sk->sk_rcvbuf << 1))
		return -ENOBUFS;

We probably had drops in the past but didn't notice, since we lacked a
counter for those drops.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Apr 2012 19:23:15 +0200

> On Mon, 2012-04-23 at 10:14 -0700, Rick Jones wrote:
>
>> This will increase what can be queued for arriving segments in general
>> and not for ACKs specifically yes? (A possible issue that would have
>> come-up with my previous wondering about just increasing SO_RCVBUF as
>> SO_SNDBUF was increasing). Perhaps only add sk->sk_sndbuf to the limit
>> if the arriving segment contains no data?
>
> Thats the backlog limit that we tweak here.
>
> Its not a big deal if we allow a bit more packets to come and later drop
> them if we hit the real rcvbuf limit. (ACKS wont consume space, since
> they are freed as soon as processed)

Hmmm... why don't we just acknowledge reality and special-case ACKs?

If a TCP packet is dataless we should just let it go through no matter
what and with no limits. It is by definition transient and will not get
queued up into the socket past this backlog stage.

This proposed patch allows non-dataless packets to eat more space in
the backlog, thus the concern and slight pushback. And from another
perspective, having the stack process data packets which will just get
dropped when we try to attach them to the receive queue is just wasted
work.
On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:
> Hmmm... why don't we just acknowledge reality and special case ACKs?

Yes, why not.

> If a TCP packet is dataless we should just let it go through no matter
> what and with no limits. It is by definition transient and will not
> get queued up into the socket past this backlog stage.

Even being transient, we need a limit. Without copybreak, an ACK can
cost 2048+256 bytes.

In my 10Gbit tests (standard netperf using 16K buffers), I've seen
backlogs of 300 ACK packets...

> This proposed patch allows non-dataless packets to eat more space in
> the backlog, thus the concern and slight pushback. And from another
> perspective, having the stack process data packets which will just
> get dropped when we try to attach it to the receive queue is just
> wasted work.

We could try to coalesce ACKs before backlogging them. I'll work on
this.

Thanks
On 04/23/2012 01:37 PM, Eric Dumazet wrote:
> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> backlogs of 300 ACK packets...

Probably better to call that something other than 16K buffers - the
send size was probably 16K, which reflected SO_SNDBUF at the time the
data socket was created, but clearly SO_SNDBUF grew in that timeframe.
And those values are "standard" for netperf only in the context of
(default) Linux - on other platforms the defaults in the stack, and so
in netperf, are probably different.

The classic/migrated classic tests report only the initial socket
buffer sizes, not what they become by the end of the test:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00     941.06

To see what they are at the end of the test requires more direct use of
the omni path. Either by way of test type:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t omni
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () port 0 AF_INET
Local       Remote      Local  Elapsed Throughput Throughput
Send Socket Recv Socket Send   Time               Units
Size        Size        Size   (sec)
Final       Final
266640      87380       16384  10.00   940.92     10^6bits/s

or omni output selection:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -- -k lss_size_req,lss_size,lss_size_end,rsr_size_req,rsr_size,rsr_size_end
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () port 0 AF_INET
LSS_SIZE_REQ=-1
LSS_SIZE=16384
LSS_SIZE_END=266640
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=87380

BTW, does it make sense that the SO_SNDBUF size on the netperf side
(lss_size_end - 2.6.38-14-generic kernel) grew larger than the
SO_RCVBUF on the netserver side (3.2.0-rc4+)?

rick jones

PS - here is data flowing the other way:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t TCP_MAERTS -- -k lsr_size_req,lsr_size,lsr_size_end,rss_size_req,rss_size,rss_size_end
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () port 0 AF_INET
LSR_SIZE_REQ=-1
LSR_SIZE=87380
LSR_SIZE_END=4194304
RSS_SIZE_REQ=-1
RSS_SIZE=16384
RSS_SIZE_END=65536
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Apr 2012 22:37:26 +0200

> We could try to coalesce ACKs before backlogging them. I'll work on
> this.

Great idea, although I wonder about the effect this could have on RTT
measurements. Instead of having N RTT measurements, we'd have just one.

Granted, what happens right now wrt. RTT measurements with such huge
ACK backlogs isn't all that nice either.

Ideally, perhaps, we'd do a timestamp diff at the time we insert the
packet into the backlog. That way we wouldn't gain the RTT inaccuracy
introduced by such queueing delays and ACK backlogs.

Another way to look at it is that the coalesced scheme would actually
improve RTT measurements, since the most accurate (and least "delayed")
of the timestamps would be the only one processed :-)
On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
> On 04/23/2012 01:37 PM, Eric Dumazet wrote:
>> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
>> backlogs of 300 ACK packets...
>
> Probably better to call that something other than 16K buffers - the send
> size was probably 16K, which reflected SO_SNDBUF at the time the data
> socket was created, but clearly SO_SNDBUF grew in that timeframe.

Maybe I was not clear: the application does sendmsg() of 16KB buffers.

Yet, in the small time it takes to perform this operation, softirq can
queue up to 300 packets coming from the other side.
On Mon, 2012-04-23 at 17:01 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 23 Apr 2012 22:37:26 +0200
>
>> We could try to coalesce ACKs before backlogging them. I'll work on
>> this.
>
> Great idea, although I wonder about the effect this could have on RTT
> measurements. Instead of having N RTT measurements, we'd have just
> one.
>
> Granted, what happens right now wrt. RTT measurements with such huge
> ACK backlogs isn't all that nice either.
>
> Ideally, perhaps, we'd do a timestamp diff at the time we insert the
> packet into the backlog. That way we wouldn't gain the RTT inaccuracy
> introduced by such queueing delays and ACK backlogs.
>
> Another way to look at it is that the coalesced scheme would actually
> improve RTT measurements, since the most accurate (and least
> "delayed") of the timestamps would be the only one processed :-)

The big part of the work is not the coalescing itself, but counting the
number of ACKs that are going to be carried into the TCP stack, if we
want cwnd to be updated correctly. Basically I'll have to add a new skb
field (in cb[]) to properly count the number of ACKs 'included' in a
single packet.

About the RTT: some congestion modules need TCP_CONG_RTT_STAMP, but the
timestamp is taken when backlog processing is done, that is after the
backlog in/out delay.
On 04/23/2012 02:30 PM, Eric Dumazet wrote: > On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote: >> Probably better to call that something other than 16K buffers - the send >> size was probably 16K, which reflected SO_SNDBUF at the time the data >> socket was created, but clearly SO_SNDBUF grew in that timeframe. >> > > > Maybe I was not clear : Application does sendmsg() of 16KB buffers. I'd probably call that a 16K send test. The root of the issue being there being "send buffers" and "send socket buffers" (and their receive versions). My "canonical" test - at least one that appears in most of my contemporary scripts uses a 64K send size for the bulk transfer tests. I switch back-and-forth between tests which allow the socket buffer size to be determined automagically, and those where I set both sides' socket buffers to 1M via the test-specific -s and -S options. In "netperf speak" those would probably be "x64K" and "1Mx64k" respectively. More generally "<socket buffer size>x<send size>" (I rarely set/specify the receive size in those tests, leaving it at whatever SO_RCVBUF is at the start. > Yet, in the small time it takes to perform this operation, softirq can > queue up to 300 packets coming from the other side. There is more to it than just queue-up 16 KB right? rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/23/2012 02:51 PM, Rick Jones wrote:
> I'd probably call that a 16K send test. The root of the issue being
> there being "send buffers" and "send socket buffers" (and their receive
> versions).
>
>> Yet, in the small time it takes to perform this operation, softirq can
>> queue up to 300 packets coming from the other side.
>
> There is more to it than just queue-up 16 KB right?

I should have added that 300 ACKs seems huge as a backlog. At
ack-every-other that is 300 * 1448 * 2, or 868800 bytes' worth of ACKs.
That sounds like a great deal more than just one 16KB send's worth of
being held off. I mean, at 10GbE speeds (using your 54 usec for 64KB)
that represents data which took something like three quarters of a
millisecond to transmit on the wire.

rick
On Mon, 2012-04-23 at 14:51 -0700, Rick Jones wrote:
> On 04/23/2012 02:30 PM, Eric Dumazet wrote:
>> Yet, in the small time it takes to perform this operation, softirq can
>> queue up to 300 packets coming from the other side.
>
> There is more to it than just queue-up 16 KB right?

At full rate, we send 825,000 packets per second, and should receive
412,000 ACKs per second if the receiver is standard TCP.

The ACKs are not smooth, because the receiver also has a huge backlog
issue and can send trains of ACKs. (I have seen backlogs on the
receiver using more than 500 us to be processed.)

If the copyin(16KB) from user to kernel takes some us (preempt,
irqs...), it's pretty easy to catch an ACK train in this window.
On 04/23/2012 03:05 PM, Eric Dumazet wrote:
> On Mon, 2012-04-23 at 14:51 -0700, Rick Jones wrote:
>> There is more to it than just queue-up 16 KB right?
>
> At full rate, we send 825.000 packets per second, and should receive
> 412.000 ACKS per second if receiver is standard TCP.
>
> The ACK are not smooth, because receiver also have a huge backlog issue
> and can send train of ACKS. (I have seen backlogs on receiver using more
> than 500 us to be processed)
>
> If the copyin(16KB) from user to kernel takes some us (preempt,
> irqs...), its pretty easy to catch an ACK train in this window.

Is it at all possible to have the copies happen without the connection
being locked? If indeed it is possible to be held off with the
connection locked for the better part of 3/4 of a millisecond, just
what will that do to 40 or 100 GbE? If you've been seeing queues of
300 ACKs at 10 GbE, that would be 3000 at 100 GbE, and assuming those
are all in a 2048-byte buffer, that's 6MB just of ACKs. I suppose
100GbE does mean non-trivial quantities of buffering anyway, but that
does still seem rather high.

rick

Thank goodness for GRO's ACK stretching as an ACK avoidance heuristic,
I guess...
> On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:
>
>> Hmmm... why don't we just acknowledge reality and special case ACKs?
>
> Yes why not.
>
>> If a TCP packet is dataless we should just let it go
>> through no matter what and with no limits.
>> It is by definition transient and will not
>> get queued up into the socket past this backlog stage.
>
> Even being transient we need a limit. Without copybreak, an
> ACK can cost 2048+256 bytes.
>
> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> backlogs of 300 ACK packets...

What about forcing a copybreak for ACKs when above the rx buffer size?
That way you avoid the cost of the copy in the normal case when the
data will be freed, but avoid the memory overhead when a lot of ACKs
(or rx data) is queued.

David
On Tue, 2012-04-24 at 09:44 +0100, David Laight wrote:
> What about forcing a copybreak for acks when above the rx buffer size?
> That way you avoid the cost of the copy in teh normal case when
> the data will be freed, but avoid the memory overhead when a lot
> of acks (or rx data) is queued.

That's noise, as the minimal truesize of an ACK packet is 512 + 256 on
x86_64.

The fact that ixgbe provides 1024 + 256 could be fixed in the driver;
it's a 4-line change actually. Then you already have a minimal skb.
On Mon, 23 Apr 2012, Rick Jones wrote:
> Is it at all possible to have the copies happen without the connection being
> locked? If indeed it is possible to be held-off with the connection locked
> for the better part of 3/4 of a millisecond, just what will that do to 40 or
> 100 GbE? If you've been seeing queues of 300 ACKs at 10 GbE that would be
> 3000 at 100 GbE, and assuming those are all in a 2048 byte buffer thats 6MB
> just of ACKs. I suppose 100GbE does mean non-trivial quantities of buffering
> anyway but that does still seem rather high.

At some point people will need to realize that it is not business as
usual if one tries to use the current network protocols at speeds
above 1G. There is a reason for high-speed networks implementing new
protocols like RDMA techniques and the lossless characteristics of a
network.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 917607e..cf97e98 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1752,7 +1752,8 @@ process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v4_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
+	} else if (unlikely(sk_add_backlog(sk, skb,
+					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b04e6d8..5fb19d3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1654,7 +1654,8 @@ process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v6_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
+	} else if (unlikely(sk_add_backlog(sk, skb,
+					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;