TCP rx window autotuning harmful at LAN context

Message ID 87bps8fkaw.fsf@basil.nowhere.org
State Rejected, archived
Delegated to: David Miller

Commit Message

Andi Kleen March 11, 2009, 10:03 a.m. UTC
David Miller <davem@davemloft.net> writes:

> From: Marian Ďurkovič <md@bts.sk>
> Date: Mon, 9 Mar 2009 21:05:05 +0100
>
>> Well, in practice that was always limited by receive window size, which
>> was by default 64 kB on most operating systems. So this undesirable behavior
>> was limited to hosts where receive window was manually increased to huge
>> values.
>
> You say "was" as if this was a recent change.  Linux has been doing
> receive buffer autotuning for at least 5 years if not longer.

I think his point was that only now does it become a visible problem
as >= 1GB of memory is widespread, which leads to 4MB rx buffer sizes.

Perhaps this points to the default buffer sizing heuristics to 
be too aggressive for >= 1GB?

Perhaps something like this patch? Marian, does that help?

-Andi

TCP: Lower per socket RX buffer sizing threshold 

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 net/ipv4/tcp.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Marian Ďurkovič March 11, 2009, 11:03 a.m. UTC | #1
On Wed, Mar 11, 2009 at 11:03:35AM +0100, Andi Kleen wrote:
> > You say "was" as if this was a recent change.  Linux has been doing
> > receive buffer autotuning for at least 5 years if not longer.
> 
> I think his point was that only now does it become a visible problem
> as >= 1GB of memory is widespread, which leads to 4MB rx buffer sizes.

Yes, exactly! We ran into this after a number of workstations were upgraded
at once to new hardware with 2GB of RAM.

> Perhaps this points to the default buffer sizing heuristics to 
> be too aggressive for >= 1GB?
> 
> Perhaps something like this patch? Marian, does that help?

Sure - as it lowers the maximum from 4MB to 2MB, the net result is that
RTTs at 100 Mbps immediately dropped from 267 msec to:

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8992ms
rtt min/avg/max/mdev = 134.417/134.770/134.911/0.315 ms

Still, this is too high for a 100 Mbps network, since the RTTs with a 64 KB
static rx buffer look like this (with no performance penalty):

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 5.163/5.355/5.476/0.102 ms

I.e. the patch helps significantly, as expected; however, having one static
limit for all NIC speeds and for the whole range of RTTs is suboptimal
in principle.


   Thanks & kind regards,

       M.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 11, 2009, 1:30 p.m. UTC | #2
From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 11 Mar 2009 11:03:35 +0100

> Perhaps this points to the default buffer sizing heuristics to 
> be too aggressive for >= 1GB?

It's necessary, Andi: you can't fill a transcontinental
connection without at least a 4MB receive buffer.

Did you read the commit message of the change that increased
the limit?
Marian Ďurkovič March 11, 2009, 2:56 p.m. UTC | #3
On Wed, Mar 11, 2009 at 04:01:49PM +0100, Andi Kleen wrote:
> On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
> > From: Andi Kleen <andi@firstfloor.org>
> > Date: Wed, 11 Mar 2009 11:03:35 +0100
> > 
> > > Perhaps this points to the default buffer sizing heuristics to 
> > > be too aggressive for >= 1GB?
> > 
> > It's necessary Andi, you can't fill a connection on a trans-
> > continental connection without at least a 4MB receive buffer.
> 
> Seems pretty arbitrary to me. It's the value for a given bandwidth*latency
> product, but why not half or twice the bandwidth? I don't think
> that number is written in stone like you claim.

Besides being arbitrary, it's also incorrect. The defaults in
tcp.c set both tcp_wmem and tcp_rmem to 4 MB, ignoring the fact
that this results in a 4 MB send buffer but only a 3 MB
receive buffer due to other defaults (tcp_adv_win_scale=2).
 
Indeed, 3MB*(1538/1448)/100Mbps is equal to 267.3 msec
- i.e. exactly the latency we're seeing.

   With kind regards,

         M.



Andi Kleen March 11, 2009, 3:01 p.m. UTC | #4
On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Wed, 11 Mar 2009 11:03:35 +0100
> 
> > Perhaps this points to the default buffer sizing heuristics to 
> > be too aggressive for >= 1GB?
> 
> It's necessary Andi, you can't fill a connection on a trans-
> continental connection without at least a 4MB receive buffer.

Seems pretty arbitrary to me. It's the value for a given bandwidth*latency
product, but why not half or twice the bandwidth? I don't think
that number is written in stone like you claim.

Anyways it was just a test patch and it indeed seems to address
the problem at least partly.

-Andi
John Heffner March 11, 2009, 3:34 p.m. UTC | #5
On Wed, Mar 11, 2009 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
>> From: Andi Kleen <andi@firstfloor.org>
>> Date: Wed, 11 Mar 2009 11:03:35 +0100
>>
>> > Perhaps this points to the default buffer sizing heuristics to
>> > be too aggressive for >= 1GB?
>>
>> It's necessary Andi, you can't fill a connection on a trans-
>> continental connection without at least a 4MB receive buffer.
>
> Seems pretty arbitrary to me. It's the value for a given bandwidth*latency
> product, but why not half or twice the bandwidth? I don't think
> that number is written in stone like you claim.

It is of course just a number, though not exactly arbitrary -- it's
approximately the required value for transcontinental 100 Mbps paths.
Choosing the value is a matter of engineering trade-offs, and seemed
like a reasonable cap at this time.

Any cap so much lower that it would give a small bound for LAN
latencies would bring us back to the bad old days where you couldn't
get anything more than 10 Mbps on the wide area.

  -John

Patch

Index: linux-2.6.28-test/net/ipv4/tcp.c
===================================================================
--- linux-2.6.28-test.orig/net/ipv4/tcp.c	2009-02-09 11:06:52.000000000 +0100
+++ linux-2.6.28-test/net/ipv4/tcp.c	2009-03-11 11:01:53.000000000 +0100
@@ -2757,9 +2757,9 @@ 
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
-	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
-	max_share = min(4UL*1024*1024, limit);
+	/* Set per-socket limits to no more than 1/256 the pressure threshold */
+	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 8);
+	max_share = min(2UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
 	sysctl_tcp_wmem[1] = 16*1024;