tcp: sysctl for initial receive window

Message ID: 20120921085502.4534.20232.stgit@dragon
State: Rejected, archived
Delegated to: David Miller

Commit Message

Jesper Dangaard Brouer Sept. 21, 2012, 8:55 a.m. UTC
Make it possible to adjust the TCP default initial advertised receive
window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.

The window size is this value multiplied by the MSS of the connection.
The default value is (still) 10, as described in commit 356f039822b
(TCP: increase default initial receive window.)

Allow minimum value of 1, but recommend against setting value below 2
in the documentation.

It is possible to control/override this value per route table entry via
the iproute2 option initrwnd.  Having the global default exported via
sysctl helps determine the default setting and makes it easier to
adjust.
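
A minimal sketch of the arithmetic described above, written as user-space
C rather than kernel code.  The helper names init_rcvwnd_segments() and
init_rcvwnd_bytes() are hypothetical.  The scaling for an MSS above 1460
mirrors the max_t(u32, (1460 * init_cwnd) / mss, 2) expression in the
patch; the other limits (receive buffer size, window clamp) are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel function: number of segments
 * initially advertised for a given sysctl setting and MSS.
 * Very large settings could overflow the multiplication; this
 * sketch ignores that case.
 */
static uint32_t init_rcvwnd_segments(uint32_t init_recv_window, uint32_t mss)
{
	assert(mss > 0);

	if (mss > 1460) {
		/* Keep roughly setting * 1460 bytes, but never fewer
		 * than two segments, as in the patch.
		 */
		uint32_t scaled = (1460 * init_recv_window) / mss;

		return scaled > 2 ? scaled : 2;
	}
	return init_recv_window;
}

/* Window in bytes: segment count multiplied by the MSS. */
static uint32_t init_rcvwnd_bytes(uint32_t init_recv_window, uint32_t mss)
{
	return init_rcvwnd_segments(init_recv_window, mss) * mss;
}
```

With the default of 10 and an MSS of 1460, init_rcvwnd_bytes(10, 1460)
gives 14600 bytes.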

Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 Documentation/networking/ip-sysctl.txt |   12 ++++++++++++
 include/net/tcp.h                      |    1 +
 net/ipv4/sysctl_net_ipv4.c             |    9 +++++++++
 net/ipv4/tcp_input.c                   |    6 +++---
 net/ipv4/tcp_output.c                  |    8 +++++---
 5 files changed, 30 insertions(+), 6 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Sept. 21, 2012, 3:25 p.m. UTC | #1
On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
> Make it possible to adjust the TCP default initial advertised receive
> window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
> 
> The window size is this value multiplied by the MSS of the connection.
> The default value is (still) 10, as described in commit 356f039822b
> (TCP: increase default initial receive window.)
> 
> Allow minimum value of 1, but recommend against setting value below 2
> in the documentation.
> 
> It is possible to control/override this value per route table entry via
> the iproute2 option initrwnd.  Having the global default exported via
> sysctl helps determine the default setting and makes it easier to
> adjust.

I was wondering why it's not symmetric:

If we add a sysctl for initial receive window, we need another one for
initial send window ?

Thanks


Jesper Dangaard Brouer Sept. 21, 2012, 5:34 p.m. UTC | #2
On Fri, 2012-09-21 at 17:25 +0200, Eric Dumazet wrote:
> On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
> > Make it possible to adjust the TCP default initial advertised receive
> > window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
> > 
> > The window size is this value multiplied by the MSS of the connection.
> > The default value is (still) 10, as described in commit 356f039822b
> > (TCP: increase default initial receive window.)
> > 
> > Allow minimum value of 1, but recommend against setting value below 2
> > in the documentation.
> > 
> > It is possible to control/override this value per route table entry via
> > the iproute2 option initrwnd.  Having the global default exported via
> > sysctl helps determine the default setting and makes it easier to
> > adjust.
> 
> I was wondering why it's not symmetric:
> 
> If we add a sysctl for initial receive window, we need another one for
> initial send window ?

Yes, that was also part of my plan (I just didn't have time to complete
it).  I'll implement the sysctl for the initial congestion window next
week.

I just wanted some initial feedback on whether this sysctl approach is
acceptable.
David Miller Sept. 21, 2012, 5:56 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Sep 2012 17:25:11 +0200

> On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
>> Make it possible to adjust the TCP default initial advertised receive
>> window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
>> 
>> The window size is this value multiplied by the MSS of the connection.
>> The default value is (still) 10, as described in commit 356f039822b
>> (TCP: increase default initial receive window.)
>> 
>> Allow minimum value of 1, but recommend against setting value below 2
>> in the documentation.
>> 
>> It is possible to control/override this value per route table entry via
>> the iproute2 option initrwnd.  Having the global default exported via
>> sysctl helps determine the default setting and makes it easier to
>> adjust.
> 
> I was wondering why it's not symmetric:
> 
> If we add a sysctl for initial receive window, we need another one for
> initial send window ?

Unlike the routing configuration, this is susceptible to serious abuse.

All it takes is for one jackass vendor to say that this should be set
to 1,000 in sysctl.conf when using their product.

Whereas setting it on a per-route basis forces the person doing it
to actually consider that there might be ramifications that have to
do with the paths on which you are making this adjustment.

I would only let this in if you hard limited the setting to its
current setting, 10.  So people could decrease it.
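
A rough user-space model of the hard limit suggested above, assuming a
write is accepted only when it falls between 1 and 10.  In the kernel
this would presumably be expressed by giving the table entry an upper
bound through .extra2 next to the .extra1 lower bound that the patch
already sets for proc_dointvec_minmax.  The names INIT_RCVWND_MIN,
INIT_RCVWND_MAX and set_init_recv_window() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define INIT_RCVWND_MIN 1
#define INIT_RCVWND_MAX 10	/* assumption: cap at the current default */

/* Hypothetical model of a min/max-checked sysctl write: an
 * out-of-range value is rejected and the old setting is kept.
 */
static int set_init_recv_window(int *setting, int requested)
{
	assert(setting != NULL);

	if (requested < INIT_RCVWND_MIN || requested > INIT_RCVWND_MAX)
		return -EINVAL;

	*setting = requested;
	return 0;
}
```

Under this model a request for 4 succeeds, while a request for 1000
fails with -EINVAL and leaves the previous value in place.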

Jesper Dangaard Brouer Sept. 21, 2012, 6:32 p.m. UTC | #4
On Fri, 2012-09-21 at 13:56 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 21 Sep 2012 17:25:11 +0200
> 
> > On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
> >> Make it possible to adjust the TCP default initial advertised receive
> >> window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
> >> 
> >> The window size is this value multiplied by the MSS of the connection.
> >> The default value is (still) 10, as described in commit 356f039822b
> >> (TCP: increase default initial receive window.)
> >> 
> >> Allow minimum value of 1, but recommend against setting value below 2
> >> in the documentation.
> >> 
> >> It is possible to control/override this value per route table entry via
> >> the iproute2 option initrwnd.  Having the global default exported via
> >> sysctl helps determine the default setting and makes it easier to
> >> adjust.
> > 
> > I was wondering why it's not symmetric:
> > 
> > If we add a sysctl for initial receive window, we need another one for
> > initial send window ?
> 
> Unlike the routing configuration, this is susceptible to serious abuse.

Are you talking about this patch, for the "tcp_init_recv_window" initial
advertised receive window?


> All it takes is for one jackass vendor to say that this should be set
> to 1,000 in sysctl.conf when using their product.

I do see your point with jackass vendors.

But (for tcp_init_recv_window) it's not a problem, because this is
limited by tcp_rmem[1] (halved by default due to tcp_adv_win_scale), and
can be further limited by window clamping (and we also cut it if
tcp_adv_win_scale > 14).
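
A simplified sketch of the limits listed above, as user-space C.  The
win_from_space() helper is modeled on the kernel's tcp_win_from_space()
of that era; effective_init_window() is a hypothetical name, and
treating a window_clamp of 0 as "no clamp" is an assumption of this
sketch.  The 87380 used in the note below is the commonly documented
default for tcp_rmem[1].  Window scaling is not modeled.

```c
#include <assert.h>

/* Usable window for a given buffer size: a positive scale reserves
 * space >> scale for overhead, a non-positive scale keeps only
 * space >> -scale.
 */
static int win_from_space(int space, int adv_win_scale)
{
	return adv_win_scale <= 0 ?
		(space >> (-adv_win_scale)) :
		space - (space >> adv_win_scale);
}

/* Hypothetical helper: the requested initial window, bounded by
 * the receive buffer and by an optional window clamp.
 */
static int effective_init_window(int init_segments, int mss, int rcvbuf,
				 int adv_win_scale, int window_clamp)
{
	long long wanted = (long long)init_segments * mss;
	int space = win_from_space(rcvbuf, adv_win_scale);
	int win;

	assert(init_segments > 0 && mss > 0 && rcvbuf > 0);

	win = wanted < space ? (int)wanted : space;
	if (window_clamp > 0 && win > window_clamp)
		win = window_clamp;
	return win;
}
```

With a receive buffer of 87380 bytes and a scale of 1, even a setting of
1000 segments yields 43690 bytes in this model, which is the point made
above: the buffer size bounds the window no matter how large the sysctl
value is.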


> Whereas setting it on a per-route basis forces the person doing it
> to actually consider that there might be ramifications that have to
> do with the paths on which you are making this adjustment.

As I mentioned above, this also requires some extra work and
consideration to make this go out of bounds.

> I would only let this in if you hard limited the setting to its
> current setting, 10.  So people could decrease it.

That would defeat the purpose of the patch.  Perhaps we could allow a
sensible max... (but this max is already being controlled as described).
David Miller Sept. 21, 2012, 6:48 p.m. UTC | #5
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 21 Sep 2012 20:32:06 +0200

> On Fri, 2012-09-21 at 13:56 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Fri, 21 Sep 2012 17:25:11 +0200
>> 
>> I would only let this in if you hard limited the setting to its
>> current setting, 10.  So people could decrease it.
> 
> That would defeat the purpose of the patch.  Perhaps we could allow a
> sensible max... (but this max is already being controlled as described).

Any new max which is truly sensible, could be the new default, and we
would apply the same amount of vetting for such a thing.
Jan Engelhardt Sept. 25, 2012, 5:29 a.m. UTC | #6
On Friday 2012-09-21 10:55, Jesper Dangaard Brouer wrote:
>diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>index c7fc107..684131c 100644
>--- a/Documentation/networking/ip-sysctl.txt
>+++ b/Documentation/networking/ip-sysctl.txt
>@@ -257,6 +257,18 @@ tcp_frto_response - INTEGER
> 		  to the values prior timeout
> 	Default: 0 (rate halving based)
> 
>+tcp_init_recv_window - INTEGER
>+	Default initial advertised receive window.  Actual window size
>+	is this value multiplied by the MSS of the connection.  Its

	is this value multiplied by the MSS of the connection.  It is

>+	possible to control/override this value per route table entry
>+	via the iproute2 option initrwnd.
>+	Minimum value is 1, but 2 is the recommended minimum.
>+	The effective max value, is limited by the sockets receive

	The effective max value is limited by the sockets receive

>+	buffer size (default tcp_rmem[1], and possibly scaled by
>+	tcp_adv_win_scale), and can further be limited by window

	tcp_adv_win_scale) and can further be limited by window

>+	clamp.

	clamping.

>+	Default: 10
>+
> tcp_keepalive_time - INTEGER
> 	How often TCP sends out keepalive messages when keepalive is enabled.
> 	Default: 2hours.

The "recommended minimum" is somewhat strange from a language POV,
since the recommendation is actually to _not touch_ the option at all
(because the default works and there is potential abuse as Dave
mentions).

Eric Dumazet Sept. 26, 2012, 11:53 a.m. UTC | #7
On Fri, 2012-09-21 at 14:48 -0400, David Miller wrote: 
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Fri, 21 Sep 2012 20:32:06 +0200
> > That would defeat the purpose of the patch.  Perhaps we could allow a
> > sensible max... (but this max is already being controlled as described).
> 
> Any new max which is truly sensible, could be the new default, and we
> would apply the same amount of vetting for such a thing.


In Linux we have a very conservative and complex rwin control at the
beginning of a TCP session. It matters only for the very first packets,
if applications are reasonably fast at draining their receive queue.
(They mostly are.)

Last time I had to take a look (after truesize changes), I was kind of
worried to not find a good reason why we were doing this.

We now have:

- rcvbuf autotuning, letting rwin grow up to 3MB or so
- Better truesize tracking
- global/cgroup tcp mem accounting/pressure
- TCP coalescing to minimize the effect of bad citizen packets
    (very low len/truesize ratio)
- People tracking TCP stack inefficiencies and working on new CCs...
   (An example is Joe Touch's I-D
   http://tools.ietf.org/html/draft-touch-tcpm-automatic-iw-03, which
   proposes increasing IW over a longer period of time, as opposed to
   revisiting constants every few years.)
- ...

TCP congestion control is controlled by the sender, driven by the ACK
coming back from receiver, and initial rwin should not change CC at all,
unless we deliberately constrain rwin to a too small value.

We did the 3 -> 10 change only two years ago.
And 3 was really too small even 5 years ago.

Browsers had to open simultaneous sessions to the same server only to
work around this limit, and they still do.

I would just remove the 10 'hard constant' (not so hard, since it
was 3 only 2 years ago), and let tcp_rmem[1]/SO_RCVBUF decide the
initial receive window.
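
For reference, a small sketch of how an application sets SO_RCVBUF on a
TCP socket with the standard setsockopt()/getsockopt() calls.  The
helper name tcp_socket_with_rcvbuf() is hypothetical.  On Linux the
value read back is typically larger than the one requested, because the
kernel accounts for bookkeeping overhead, and the option is normally set
before connect() or listen() so that it can influence the window
negotiated for the connection.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, request a receive
 * buffer size, and report the size the kernel actually applied.
 * Returns the socket descriptor, or -1 on error.
 */
static int tcp_socket_with_rcvbuf(int requested_bytes, int *effective_bytes)
{
	socklen_t len = sizeof(*effective_bytes);
	int fd;

	assert(effective_bytes != NULL);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       &requested_bytes, sizeof(requested_bytes)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       effective_bytes, &len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The caller owns the returned descriptor and should close() it when done.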



Yuchung Cheng Oct. 1, 2012, 10:36 p.m. UTC | #8
On Wed, Sep 26, 2012 at 4:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-09-21 at 14:48 -0400, David Miller wrote:
>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>> Date: Fri, 21 Sep 2012 20:32:06 +0200
>> > That would defeat the purpose of the patch.  Perhaps we could allow a
>> > sensible max... (but this max is already being controlled as described).
>>
>> Any new max which is truly sensible, could be the new default, and we
>> would apply the same amount of vetting for such a thing.
>
>
> In Linux we have a very conservative and complex rwin control at the
> beginning of a TCP session. It matters only for the very first packets,
> if applications are reasonably fast at draining their receive queue.
> (They mostly are.)
>
> Last time I had to take a look (after truesize changes), I was kind of
> worried to not find a good reason why we were doing this.
>
> We now have:
>
> - rcvbuf autotuning, letting rwin grow up to 3MB or so
> - Better truesize tracking
> - global/cgroup tcp mem accounting/pressure
> - TCP coalescing to minimize the effect of bad citizen packets
>     (very low len/truesize ratio)
> - People tracking TCP stack inefficiencies and working on new CCs...
>    (An example is Joe Touch's I-D
>    http://tools.ietf.org/html/draft-touch-tcpm-automatic-iw-03, which
>    proposes increasing IW over a longer period of time, as opposed to
>    revisiting constants every few years.)
> - ...
>
> TCP congestion control is controlled by the sender, driven by the ACK
> coming back from receiver, and initial rwin should not change CC at all,
> unless we deliberately constrain rwin to a too small value.
>
> We did the 3 -> 10 change only two years ago.
> And 3 was really too small even 5 years ago.
>
> Browsers had to open simultaneous sessions to the same server only to
> work around this limit, and they still do.
>
> I would just remove the 10 'hard constant' (not so hard, since it
> was 3 only 2 years ago), and let tcp_rmem[1]/SO_RCVBUF decide the
> initial receive window.

I like this idea a lot. Got a patch for us to try?


Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7fc107..684131c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -257,6 +257,18 @@  tcp_frto_response - INTEGER
 		  to the values prior timeout
 	Default: 0 (rate halving based)
 
+tcp_init_recv_window - INTEGER
+	Default initial advertised receive window.  Actual window size
+	is this value multiplied by the MSS of the connection.  Its
+	possible to control/override this value per route table entry
+	via the iproute2 option initrwnd.
+	Minimum value is 1, but 2 is the recommended minimum.
+	The effective max value, is limited by the sockets receive
+	buffer size (default tcp_rmem[1], and possibly scaled by
+	tcp_adv_win_scale), and can further be limited by window
+	clamp.
+	Default: 10
+
 tcp_keepalive_time - INTEGER
 	How often TCP sends out keepalive messages when keepalive is enabled.
 	Default: 2hours.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8cb00c..3334852 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,6 +292,7 @@  extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern u32 sysctl_tcp_init_recv_window;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 9205e49..9bb6608 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@ 
 #include <net/tcp_memcontrol.h>
 
 static int zero;
+static int one = 1;
 static int two = 2;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
@@ -794,6 +795,14 @@  static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "tcp_init_recv_window",
+		.data		= &sysctl_tcp_init_recv_window,
+		.maxlen		= sizeof(sysctl_tcp_init_recv_window),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e2bec81..bbf7a33 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -356,14 +356,14 @@  static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_fixup_rcvbuf(struct sock *sk)
 {
 	u32 mss = tcp_sk(sk)->advmss;
-	u32 icwnd = TCP_DEFAULT_INIT_RCVWND;
+	u32 icwnd = sysctl_tcp_init_recv_window;
 	int rcvmem;
 
-	/* Limit to 10 segments if mss <= 1460,
+	/* Limit to default 10 segments if mss <= 1460,
 	 * or 14600/mss segments, with a minimum of two segments.
 	 */
 	if (mss > 1460)
-		icwnd = max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
+		icwnd = max_t(u32, (1460 * icwnd) / mss, 2);
 
 	rcvmem = SKB_TRUESIZE(mss + MAX_TCP_HEADER);
 	while (tcp_win_from_space(rcvmem) < mss)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfe6ffe..5f3b26d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,8 @@  int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
  */
 int sysctl_tcp_tso_win_divisor __read_mostly = 3;
 
+u32 sysctl_tcp_init_recv_window __read_mostly = TCP_DEFAULT_INIT_RCVWND;
+
 int sysctl_tcp_mtu_probing __read_mostly = 0;
 int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
 
@@ -235,14 +237,14 @@  void tcp_select_initial_window(int __space, __u32 mss,
 	}
 
 	/* Set initial window to a value enough for senders starting with
-	 * initial congestion window of TCP_DEFAULT_INIT_RCVWND. Place
+	 * initial congestion window of sysctl_tcp_init_recv_window. Place
 	 * a limit on the initial window when mss is larger than 1460.
 	 */
 	if (mss > (1 << *rcv_wscale)) {
-		int init_cwnd = TCP_DEFAULT_INIT_RCVWND;
+		int init_cwnd = sysctl_tcp_init_recv_window;
 		if (mss > 1460)
 			init_cwnd =
-			max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
+			max_t(u32, (1460 * init_cwnd) / mss, 2);
 		/* when initializing use the value from init_rcv_wnd
 		 * rather than the default from above
 		 */