Message ID: 87bps8fkaw.fsf@basil.nowhere.org
State: Rejected, archived
Delegated to: David Miller
On Wed, Mar 11, 2009 at 11:03:35AM +0100, Andi Kleen wrote:
> > You say "was" as if this was a recent change. Linux has been doing
> > receive buffer autotuning for at least 5 years if not longer.
>
> I think his point was that only now does it become a visible problem,
> as >= 1GB of memory is widespread, which leads to 4MB rx buffer sizes.

Yes, exactly! We ran into this after a number of workstations were
upgraded at once to new hardware with 2GB of RAM.

> Perhaps this points to the default buffer sizing heuristics to
> be too aggressive for >= 1GB?
>
> Perhaps something like this patch? Marian, does that help?

Sure - as it lowers the maximum from 4MB to 2MB, the net result is that
RTTs at 100 Mbps immediately went down from 267 msec to:

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8992ms
rtt min/avg/max/mdev = 134.417/134.770/134.911/0.315 ms

Still this is too high for a 100 Mbps network, since the RTTs with a
64 KB static rx buffer look like this (with no performance penalty):

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 5.163/5.355/5.476/0.102 ms

I.e. the patch significantly helps as expected; however, having one
static limit for all NIC speeds as well as for the whole range of RTTs
is suboptimal in principle.

Thanks & kind regards,
M.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 11 Mar 2009 11:03:35 +0100

> Perhaps this points to the default buffer sizing heuristics to
> be too aggressive for >= 1GB?

It's necessary Andi, you can't fill a connection on a transcontinental
connection without at least a 4MB receive buffer.

Did you read the commit message of the change that increased the limit?
On Wed, Mar 11, 2009 at 04:01:49PM +0100, Andi Kleen wrote:
> On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
> > From: Andi Kleen <andi@firstfloor.org>
> > Date: Wed, 11 Mar 2009 11:03:35 +0100
> >
> > > Perhaps this points to the default buffer sizing heuristics to
> > > be too aggressive for >= 1GB?
> >
> > It's necessary Andi, you can't fill a connection on a
> > transcontinental connection without at least a 4MB receive buffer.
>
> Seems pretty arbitrary to me. It's the value for a given
> bandwidth*latency product, but why not half or twice the bandwidth?
> I don't think that number is written in stone like you claim.

Besides being arbitrary, it's also incorrect. The defaults in tcp.c set
both tcp_wmem and tcp_rmem to 4 MB, ignoring the fact that this results
in a 4MB send buffer but only a 3 MB receive buffer due to other
defaults (tcp_adv_win_scale=2).

Indeed, 3MB*(1538/1448)/100Mbps is equal to 267.3 msec - i.e. exactly
the latency we're seeing.

With kind regards,
M.
On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Wed, 11 Mar 2009 11:03:35 +0100
>
> > Perhaps this points to the default buffer sizing heuristics to
> > be too aggressive for >= 1GB?
>
> It's necessary Andi, you can't fill a connection on a transcontinental
> connection without at least a 4MB receive buffer.

Seems pretty arbitrary to me. It's the value for a given
bandwidth*latency product, but why not half or twice the bandwidth?
I don't think that number is written in stone like you claim.

Anyway, it was just a test patch and it indeed seems to address the
problem, at least partly.

-Andi
On Wed, Mar 11, 2009 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Mar 11, 2009 at 06:30:58AM -0700, David Miller wrote:
>> From: Andi Kleen <andi@firstfloor.org>
>> Date: Wed, 11 Mar 2009 11:03:35 +0100
>>
>> > Perhaps this points to the default buffer sizing heuristics to
>> > be too aggressive for >= 1GB?
>>
>> It's necessary Andi, you can't fill a connection on a transcontinental
>> connection without at least a 4MB receive buffer.
>
> Seems pretty arbitrary to me. It's the value for a given
> bandwidth*latency product, but why not half or twice the bandwidth?
> I don't think that number is written in stone like you claim.

It is of course just a number, though not exactly arbitrary -- it's
approximately the required value for transcontinental 100 Mbps paths.
Choosing the value is a matter of engineering trade-offs, and it seemed
like a reasonable cap at this time. Any cap so much lower that it would
give a small bound for LAN latencies would bring us back to the bad old
days where you couldn't get anything more than 10 Mbps on the wide area.

-John
Index: linux-2.6.28-test/net/ipv4/tcp.c
===================================================================
--- linux-2.6.28-test.orig/net/ipv4/tcp.c	2009-02-09 11:06:52.000000000 +0100
+++ linux-2.6.28-test/net/ipv4/tcp.c	2009-03-11 11:01:53.000000000 +0100
@@ -2757,9 +2757,9 @@
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
-	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
-	max_share = min(4UL*1024*1024, limit);
+	/* Set per-socket limits to no more than 1/256 the pressure threshold */
+	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 8);
+	max_share = min(2UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
 	sysctl_tcp_wmem[1] = 16*1024;