Patchwork [2/2,net-next] tcp: sk_add_backlog() is too agressive for TCP

login
register
mail settings
Submitter Eric Dumazet
Date April 23, 2012, 9:38 a.m.
Message ID <1335173934.3293.84.camel@edumazet-glaptop>
Download mbox | patch
Permalink /patch/154376/
State Accepted
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - April 23, 2012, 9:38 a.m.
From: Eric Dumazet <edumazet@google.com>

While investigating TCP performance problems on 10Gb+ links, we found a
tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit
in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends
one ACK every two MSS segments.

A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
value (87380), allowing a too small backlog.

A TCP ACK, even being small, can consume nearly same truesize space than
outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
is fast to compute.

Performance results on netperf, single flow, receiver with disabled
GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
increments at sender.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
---
 net/ipv4/tcp_ipv4.c |    3 ++-
 net/ipv6/tcp_ipv6.c |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 23, 2012, 5:14 p.m.
On 04/23/2012 02:38 AM, Eric Dumazet wrote:
> From: Eric Dumazet<edumazet@google.com>
>
> While investigating TCP performance problems on 10Gb+ links, we found a
> tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit
> in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends
> one ACK every two MSS segments.
>
> A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
> value (87380), allowing a too small backlog.
>
> A TCP ACK, even being small, can consume nearly same truesize space than
> outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
> is fast to compute.
>
> Performance results on netperf, single flow, receiver with disabled
> GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
> increments at sender.
>
> Signed-off-by: Eric Dumazet<edumazet@google.com>
> Cc: Neal Cardwell<ncardwell@google.com>
> Cc: Tom Herbert<therbert@google.com>
> Cc: Maciej Żenczykowski<maze@google.com>
> Cc: Yuchung Cheng<ycheng@google.com>
> Cc: Ilpo Järvinen<ilpo.jarvinen@helsinki.fi>
> Cc: Rick Jones<rick.jones2@hp.com>
> ---
>   net/ipv4/tcp_ipv4.c |    3 ++-
>   net/ipv6/tcp_ipv6.c |    3 ++-
>   2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 917607e..cf97e98 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1752,7 +1752,8 @@ process:
>   			if (!tcp_prequeue(sk, skb))
>   				ret = tcp_v4_do_rcv(sk, skb);
>   		}
> -	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
> +	} else if (unlikely(sk_add_backlog(sk, skb,
> +					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
>   		bh_unlock_sock(sk);
>   		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>   		goto discard_and_relse;

This will increase what can be queued for arriving segments in general 
and not for ACKs specifically yes?  (A possible issue that would have 
come-up with my previous wondering about just increasing SO_RCVBUF as 
SO_SNDBUF was increasing).  Perhaps only add sk->sk_sndbuf to the limit 
if the arriving segment contains no data?

rick


> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index b04e6d8..5fb19d3 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1654,7 +1654,8 @@ process:
>   			if (!tcp_prequeue(sk, skb))
>   				ret = tcp_v6_do_rcv(sk, skb);
>   		}
> -	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
> +	} else if (unlikely(sk_add_backlog(sk, skb,
> +					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
>   		bh_unlock_sock(sk);
>   		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>   		goto discard_and_relse;
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 23, 2012, 5:23 p.m.
On Mon, 2012-04-23 at 10:14 -0700, Rick Jones wrote:

> 
> This will increase what can be queued for arriving segments in general 
> and not for ACKs specifically yes?  (A possible issue that would have 
> come-up with my previous wondering about just increasing SO_RCVBUF as 
> SO_SNDBUF was increasing).  Perhaps only add sk->sk_sndbuf to the limit 
> if the arriving segment contains no data?

Thats the backlog limit that we tweak here.

Its not a big deal if we allow a bit more packets to come and later drop
them if we hit the real rcvbuf limit. (ACKS wont consume space, since
they are freed as soon as processed)

By the way, we used to have (sk_rcvbuf << 1) limit in the past.

Before commit c377411f2494a931ff we had :

if (sk->sk_backlog.len >= max(sk->sk_backlog.limit, sk->sk_rcvbuf << 1)) 
	return -ENOBUFS

We probably had drops in the past but didnt notice, since we lacked 
a counter for those drops.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 23, 2012, 8:01 p.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Apr 2012 19:23:15 +0200

> On Mon, 2012-04-23 at 10:14 -0700, Rick Jones wrote:
> 
>> 
>> This will increase what can be queued for arriving segments in general 
>> and not for ACKs specifically yes?  (A possible issue that would have 
>> come-up with my previous wondering about just increasing SO_RCVBUF as 
>> SO_SNDBUF was increasing).  Perhaps only add sk->sk_sndbuf to the limit 
>> if the arriving segment contains no data?
> 
> Thats the backlog limit that we tweak here.
> 
> Its not a big deal if we allow a bit more packets to come and later drop
> them if we hit the real rcvbuf limit. (ACKS wont consume space, since
> they are freed as soon as processed)

Hmmm... why don't we just acknowledge reality and special case ACKs?

If a TCP packet is dataless we should just let it go through no matter
what and with no limits.  It is by definition transient and will not
get queued up into the socket past this backlog stage.

This proposed patch allows non-dataless packets to eat more space in
the backlog, thus the concern and slight pushback.  And from another
perspective, having the stack process data packets which will just
get dropped when we try to attach it to the receive queue is just
wasted work.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 23, 2012, 8:37 p.m.
On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:

> Hmmm... why don't we just acknowledge reality and special case ACKs?
> 

Yes why not.


> If a TCP packet is dataless we should just let it go through no matter
> what and with no limits.  It is by definition transient and will not
> get queued up into the socket past this backlog stage.
> 

Even being transient we need a limit. Without copybreak, an ACK can cost
2048+256 bytes.

In my 10Gbit tests (standard netperf using 16K buffers), I've seen
backlogs of 300 ACK packets...

> This proposed patch allows non-dataless packets to eat more space in
> the backlog, thus the concern and slight pushback.  And from another
> perspective, having the stack process data packets which will just
> get dropped when we try to attach it to the receive queue is just
> wasted work.

We could try to coalesce ACKs before backlogging them. I'll work on
this.

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 23, 2012, 8:57 p.m.
On 04/23/2012 01:37 PM, Eric Dumazet wrote:
> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> backlogs of 300 ACK packets...

Probably better to call that something other than 16K buffers - the send 
size was probably 16K, which reflected SO_SNDBUF at the time the data 
socket was created, but clearly SO_SNDBUF grew in that timeframe.

And those values are "standard" for netperf only in the context of 
(default) Linux - on other platforms the defaults in the stack and so in 
netperf are probably different.

The classic/migrated classic tests report only the initial socket buffer 
sizes, not what they become by the end of the test:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    10.00     941.06

To see what they are at the end of the test requires more direct use of 
the omni path.  Either by way of test type:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t omni
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () 
port 0 AF_INET
Local       Remote      Local  Elapsed Throughput Throughput
Send Socket Recv Socket Send   Time               Units
Size        Size        Size   (sec)
Final       Final
266640      87380       16384  10.00   940.92     10^6bits/s

or omni output selection:

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -- -k 
lss_size_req,lss_size,lss_size_end,rsr_size_req,rsr_size,rsr_size_end
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET
LSS_SIZE_REQ=-1
LSS_SIZE=16384
LSS_SIZE_END=266640
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=87380

BTW, does it make sense that the SO_SNDBUF size on the netperf side 
(lss_size_end - 2.6.38-14-generic kernel) grew larger than the SO_RCVBUF 
on the netserver side? (3.2.0-rc4+)

rick jones

PS - here is data flowing the other way:
raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t TCP_MAERTS 
-- -k lsr_size_req,lsr_size,lsr_size_end,rss_size_req,rss_size,rss_size_end
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
192.168.1.3 () port 0 AF_INET
LSR_SIZE_REQ=-1
LSR_SIZE=87380
LSR_SIZE_END=4194304
RSS_SIZE_REQ=-1
RSS_SIZE=16384
RSS_SIZE_END=65536

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 23, 2012, 9:01 p.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Apr 2012 22:37:26 +0200

> We could try to coalesce ACKs before backlogging them. I'll work on
> this.

Great idea, although I wonder about the effect this could have on RTT
measurements.  Instead of having N RTT measurements, we'd have just
one.

Granted, what happens right now wrt. RTT measurements with such huge
ACK backlogs isn't all that nice either.

Ideally, perhaps, we'd do a timestamp diff at the time we insert the
packet into the backlog.  That way we wouldn't gain the RTT inaccuracy
introduced by such queueing delays and ACK backlogs.

Another way to look at it is that the coalesced scheme would actually
improve RTT measurements, since the most accurate (and least
"delayed") of the timestamps would be the only one processed :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 23, 2012, 9:30 p.m.
On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
> On 04/23/2012 01:37 PM, Eric Dumazet wrote:
> > In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> > backlogs of 300 ACK packets...
> 
> Probably better to call that something other than 16K buffers - the send 
> size was probably 16K, which reflected SO_SNDBUF at the time the data 
> socket was created, but clearly SO_SNDBUF grew in that timeframe.
> 


Maybe I was not clear : Application does sendmsg() of 16KB buffers.

Yet, in the small time it takes to perform this operation, softirq can
queue up to 300 packets coming from the other side.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 23, 2012, 9:38 p.m.
On Mon, 2012-04-23 at 17:01 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 23 Apr 2012 22:37:26 +0200
> 
> > We could try to coalesce ACKs before backlogging them. I'll work on
> > this.
> 
> Great idea, although I wonder about the effect this could have on RTT
> measurements.  Instead of having N RTT measurements, we'd have just
> one.
> 
> Granted, what happens right now wrt. RTT measurements with such huge
> ACK backlogs isn't all that nice either.
> 
> Ideally, perhaps, we'd do a timestamp diff at the time we insert the
> packet into the backlog.  That way we wouldn't gain the RTT inaccuracy
> introduced by such queueing delays and ACK backlogs.
> 
> Another way to look at it is that the coalesced scheme would actually
> improve RTT measurements, since the most accurate (and least
> "delayed") of the timestamps would be the only one processed :-)

The big part of the work is not doing the coalesce, but also counting
the number of ACKS that are going to be carried into TCP stack if we
want cwnd being updated correctly.

Basically I'll have to add a new skb field (in cb[]) to properly count
number of ACKS 'included' in a single packet.

About the RTT, some congestion modules need TCP_CONG_RTT_STAMP but time
is taken when backlog processing is done, that is after backlog in/out



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 23, 2012, 9:51 p.m.
On 04/23/2012 02:30 PM, Eric Dumazet wrote:
> On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
>> Probably better to call that something other than 16K buffers - the send
>> size was probably 16K, which reflected SO_SNDBUF at the time the data
>> socket was created, but clearly SO_SNDBUF grew in that timeframe.
>>
>
>
> Maybe I was not clear : Application does sendmsg() of 16KB buffers.

I'd probably call that a 16K send test.  The root of the issue being 
there being "send buffers" and "send socket buffers" (and their receive 
versions).

My "canonical" test - at least one that appears in most of my 
contemporary scripts uses a 64K send size for the bulk transfer tests. 
I switch back-and-forth between tests which allow the socket buffer size 
to be determined automagically, and those where I set both sides' socket 
buffers to 1M via the test-specific -s and -S options.  In "netperf 
speak" those would probably be "x64K" and "1Mx64k" respectively.  More 
generally "<socket buffer size>x<send size>" (I rarely set/specify the 
receive size in those tests, leaving it at whatever SO_RCVBUF is at the 
start.

> Yet, in the small time it takes to perform this operation, softirq can
> queue up to 300 packets coming from the other side.

There is more to it than just queue-up 16 KB right?

rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 23, 2012, 9:56 p.m.
On 04/23/2012 02:51 PM, Rick Jones wrote:
> On 04/23/2012 02:30 PM, Eric Dumazet wrote:
>> On Mon, 2012-04-23 at 13:57 -0700, Rick Jones wrote:
>>> Probably better to call that something other than 16K buffers - the send
>>> size was probably 16K, which reflected SO_SNDBUF at the time the data
>>> socket was created, but clearly SO_SNDBUF grew in that timeframe.
>>>
>>
>>
>> Maybe I was not clear : Application does sendmsg() of 16KB buffers.
>
> I'd probably call that a 16K send test. The root of the issue being
> there being "send buffers" and "send socket buffers" (and their receive
> versions).
>
> My "canonical" test - at least one that appears in most of my
> contemporary scripts uses a 64K send size for the bulk transfer tests. I
> switch back-and-forth between tests which allow the socket buffer size
> to be determined automagically, and those where I set both sides' socket
> buffers to 1M via the test-specific -s and -S options. In "netperf
> speak" those would probably be "x64K" and "1Mx64k" respectively. More
> generally "<socket buffer size>x<send size>" (I rarely set/specify the
> receive size in those tests, leaving it at whatever SO_RCVBUF is at the
> start.
>
>> Yet, in the small time it takes to perform this operation, softirq can
>> queue up to 300 packets coming from the other side.
>
> There is more to it than just queue-up 16 KB right?

I should have added that 300 ACKs seems huge as a backlog.  At 
ack-every-other that is 300 * 1448 * 2 or 868800 bytes worth of ACKs. 
That sounds like a great deal more than just one 16KB send's worth of 
being held-off.  I mean at 10Gbe speeds (using your 54 usec for 64KB) 
that represents data which took something like three quarters of a 
millisecond to transmit on the wire.

rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 23, 2012, 10:05 p.m.
On Mon, 2012-04-23 at 14:51 -0700, Rick Jones wrote:
> On 04/23/2012 02:30 PM, Eric Dumazet wrote:

> > Yet, in the small time it takes to perform this operation, softirq can
> > queue up to 300 packets coming from the other side.
> 
> There is more to it than just queue-up 16 KB right?

At full rate, we send 825.000 packets per second, and should receive
412.000 ACKS per second if receiver is standard TCP.

The ACK are not smooth, because receiver also have a huge backlog issue
and can send train of ACKS. (I have seen backlogs on receiver using more
than 500 us to be  processed)

If the copyin(16KB) from user to kernel takes some us (preempt,
irqs...), its pretty easy to catch an ACK train in this window.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 23, 2012, 10:16 p.m.
On 04/23/2012 03:05 PM, Eric Dumazet wrote:
> On Mon, 2012-04-23 at 14:51 -0700, Rick Jones wrote:
>> On 04/23/2012 02:30 PM, Eric Dumazet wrote:
>
>>> Yet, in the small time it takes to perform this operation, softirq can
>>> queue up to 300 packets coming from the other side.
>>
>> There is more to it than just queue-up 16 KB right?
>
> At full rate, we send 825.000 packets per second, and should receive
> 412.000 ACKS per second if receiver is standard TCP.
>
> The ACK are not smooth, because receiver also have a huge backlog issue
> and can send train of ACKS. (I have seen backlogs on receiver using more
> than 500 us to be  processed)
>
> If the copyin(16KB) from user to kernel takes some us (preempt,
> irqs...), its pretty easy to catch an ACK train in this window.

Is it at all possible to have the copies happen without the connection 
being locked?  If indeed it is possible to be held-off with the 
connection locked for the better part of 3/4 of a millisecond, just what 
will that do to 40 or 100 GbE?  If you've been seeing queues of 300 ACKs 
at 10 GbE that would be 3000 at 100 GbE, and assuming those are all in a 
2048 byte buffer thats 6MB just of ACKs.  I suppose 100GbE does mean 
non-trivial quantities of buffering anyway but that does still seem 
rather high.

rick
thank goodness for GRO's ACK stretching as an ACK avoidance heuristic I 
guess...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight - April 24, 2012, 8:44 a.m.
> On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:
> 
> > Hmmm... why don't we just acknowledge reality and special case ACKs?
> > 
> 
> Yes why not.
> 
> 
> > If a TCP packet is dataless we should just let it go 
> > through no matter what and with no limits.
> >  It is by definition transient and will not
> > get queued up into the socket past this backlog stage.
> > 
> 
> Even being transient we need a limit. Without copybreak, an 
> ACK can cost 2048+256 bytes.
> 
> In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> backlogs of 300 ACK packets...

What about forcing a copybreak for acks when above the rx buffer size?
That way you avoid the cost of the copy in teh normal case when
the data will be freed, but avoid the memory overhead when a lot
of acks (or rx data) is queued.

	David


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 24, 2012, 8:53 a.m.
On Tue, 2012-04-24 at 09:44 +0100, David Laight wrote:
> > On Mon, 2012-04-23 at 16:01 -0400, David Miller wrote:
> > 
> > > Hmmm... why don't we just acknowledge reality and special case ACKs?
> > > 
> > 
> > Yes why not.
> > 
> > 
> > > If a TCP packet is dataless we should just let it go 
> > > through no matter what and with no limits.
> > >  It is by definition transient and will not
> > > get queued up into the socket past this backlog stage.
> > > 
> > 
> > Even being transient we need a limit. Without copybreak, an 
> > ACK can cost 2048+256 bytes.
> > 
> > In my 10Gbit tests (standard netperf using 16K buffers), I've seen
> > backlogs of 300 ACK packets...
> 
> What about forcing a copybreak for acks when above the rx buffer size?
> That way you avoid the cost of the copy in teh normal case when
> the data will be freed, but avoid the memory overhead when a lot
> of acks (or rx data) is queued.

Thats noise, as the minimal truesize of an ACK packet is 512 + 256 on
x86_64

The fact that ixgbe provides 1024 + 256 could be fixed in the driver,
its a 4 lines change actually. Then you already have a minimal skb.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter - April 24, 2012, 3:25 p.m.
On Mon, 23 Apr 2012, Rick Jones wrote:

> Is it at all possible to have the copies happen without the connection being
> locked?  If indeed it is possible to be held-off with the connection locked
> for the better part of 3/4 of a millisecond, just what will that do to 40 or
> 100 GbE?  If you've been seeing queues of 300 ACKs at 10 GbE that would be
> 3000 at 100 GbE, and assuming those are all in a 2048 byte buffer thats 6MB
> just of ACKs.  I suppose 100GbE does mean non-trivial quantities of buffering
> anyway but that does still seem rather high.

At some point people will need to realize that it is not business as usual
if one tries to use the current network porotocols at speeds above 1G.

There is a reason for high speed networks implementing new protocols like
RDMA techniques and lossless characteristics of a network.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 917607e..cf97e98 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1752,7 +1752,8 @@  process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v4_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
+	} else if (unlikely(sk_add_backlog(sk, skb,
+					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b04e6d8..5fb19d3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1654,7 +1654,8 @@  process:
 			if (!tcp_prequeue(sk, skb))
 				ret = tcp_v6_do_rcv(sk, skb);
 		}
-	} else if (unlikely(sk_add_backlog(sk, skb, sk->sk_rcvbuf))) {
+	} else if (unlikely(sk_add_backlog(sk, skb,
+					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;