net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()

Message ID 4E48B0C3.2010203@ctc-g.co.jp
State Rejected, archived
Delegated to: David Miller

Commit Message

Jun.Kondo Aug. 15, 2011, 5:38 a.m. UTC
CTC has the following requirements:

1. to ensure high throughput from the beginning of
tcp connection at normal times by acquiring large
default transmission buffer value

2. to limit the block time of the write in order to
prevent the timeout of upper layer applications
even when the connection has low throughput, such
as low rate streaming


The root of the issue:

Requirement 2 cannot be achieved with a configuration that
satisfies requirement 1.

The current behavior is as follows:

A write blocks when the TCP transmission buffer (wmem)
becomes full.
Before the socket becomes writable again, one third of the
transmission buffer must be freed, because the free space
has to reach sk_wmem_queued/2.

When throughput is low, the application times out before
that much buffer space has been freed, which affects the
streaming service.
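
The relation between "one third" and sk_wmem_queued/2 can be
sketched in userspace as follows. This is a hypothetical model,
not kernel code: the function names are invented for
illustration, and it assumes that the free space is computed as
sk_sndbuf - sk_wmem_queued.

```c
#include <stdbool.h>

/* Hypothetical userspace model. The parameters mirror the kernel fields
 * sk_sndbuf and sk_wmem_queued, but none of this is kernel code. */
static int model_wspace(int sndbuf, int wmem_queued)
{
    /* Free space left in the send buffer (assumed definition). */
    return sndbuf - wmem_queued;
}

static int model_min_wspace(int wmem_queued)
{
    /* Same shift as the unpatched sk_stream_min_wspace(). */
    return wmem_queued >> 1;
}

static bool model_writeable(int sndbuf, int wmem_queued)
{
    /* Writable once the free space reaches half of what is queued. */
    return model_wspace(sndbuf, wmem_queued) >= model_min_wspace(wmem_queued);
}
```

With a 96000-byte buffer, the model reports writable only once
the queued data has dropped to 64000 bytes, that is, once 32000
bytes (one third of the buffer) are free.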


The effect of the patch:

Setting sysctl_tcp_lowat to an integer n (e.g. 4) changes
the required free space from sk_wmem_queued >> 1 to
sk_wmem_queued >> n, so a blocked writer resumes sooner and
the timeout does not occur in a low-throughput network
environment.

We also think that the fixed threshold of one third of the
transmission buffer (sk_wmem_queued/2) is too rigid, and
that it should be configurable.
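
A rough sketch of what a configurable shift does to the wake-up
threshold. The helpers are hypothetical userspace code, "lowat"
stands in for sysctl_tcp_lowat, and the clamp on the shift is an
assumption of this sketch (the patch does not bound the value).

```c
#include <stdbool.h>

/* Hypothetical model of the patched check; not kernel code. */
static bool model_writeable_lowat(int sndbuf, int wmem_queued, unsigned lowat)
{
    /* Clamp the shift so it stays below the width of int.
     * This guard is an assumption of the sketch, not part of the patch. */
    if (lowat > 30)
        lowat = 30;
    return (sndbuf - wmem_queued) >= (wmem_queued >> lowat);
}

/* Smallest number of bytes that must drain from a completely full
 * buffer before the model reports the socket writable again. */
static int model_free_needed(int sndbuf, unsigned lowat)
{
    int freed = 0;

    while (!model_writeable_lowat(sndbuf, sndbuf - freed, lowat))
        freed++;
    return freed;
}
```

For a hypothetical 102000-byte buffer, model_free_needed()
returns 34000 with a shift of 1 (one third of the buffer) and
6000 with a shift of 4 (one seventeenth).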

------------------------------------------
Jun.Kondo
ITOCHU TECHNO-SOLUTIONS Corporation(CTC)
tel:+81-3-6238-6607
fax:+81-3-5226-2369
------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jun.Kondo Aug. 19, 2011, 9:28 a.m. UTC | #1
You suggested using non-blocking writes, but we think
that would require rewriting the Apache code.
That is, we would have to make a modification to Apache that
depends on the architecture.
With this patch, the problem can be handled by a small
configuration change on the kernel side, for applications
that are difficult to change on the application side.



(2011/08/15 14:47), David Miller wrote:
> From: "Jun.Kondo"<jun.kondo@ctc-g.co.jp>
> Date: Mon, 15 Aug 2011 14:38:11 +0900
>
>> 2. to limit the block time of the write in order to
>> prevent the timeout of upper layer applications
>> even when the connection has low throughput, such
>> as low rate streaming
> Use non-blocking writes if you want this behavior.
>


David Miller Aug. 19, 2011, 9:43 a.m. UTC | #2
From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Fri, 19 Aug 2011 18:28:45 +0900

> You suggested using non-blocking writes, but we think
> that would require rewriting the Apache code.
> That is, we would have to make a modification to Apache that
> depends on the architecture.
> With this patch, the problem can be handled by a small
> configuration change on the kernel side, for applications
> that are difficult to change on the application side.

The kernel provides the facilities necessary to achieve your
goals.  It is a userspace problem.
Jun.Kondo Aug. 22, 2011, 12:33 a.m. UTC | #3
With this patch, we want to prevent timeouts on a network that has low throughput but is still available.

But in the current implementation, for both blocking and
non-blocking sockets, we think user processes cannot tell
in detail why a write to the socket buffer failed.

Is it (really) a network problem?
Or does wmem not have enough free space for the write?

As stated above, we think it is difficult for user processes to handle a timeout when writing to the socket buffer
if wmem is configured to a large value (to ensure high throughput over a high-latency network, like 3G).


(2011/08/19 18:43), David Miller wrote:
> From: "Jun.Kondo"<jun.kondo@ctc-g.co.jp>
> Date: Fri, 19 Aug 2011 18:28:45 +0900
>
>> You suggested using non-blocking writes, but we think
>> that would require rewriting the Apache code.
>> That is, we would have to make a modification to Apache that
>> depends on the architecture.
>> With this patch, the problem can be handled by a small
>> configuration change on the kernel side, for applications
>> that are difficult to change on the application side.
> The kernel provides the facilities necessary to achieve your
> goals.  It is a userspace problem.
>
Hagen Paul Pfeifer Aug. 22, 2011, 2:21 p.m. UTC | #4
On Mon, 22 Aug 2011 09:33:52 +0900, "Jun.Kondo" wrote:
> With this patch, we want to prevent timeouts on a network that
> has low throughput but is still available.
>
> But in the current implementation, for both blocking and
> non-blocking sockets, we think user processes cannot tell
> in detail why a write to the socket buffer failed.

For your application it should not matter WHY the data cannot be written to
the peer. It may be that the peer closed the window, that there is some
scheduling bottleneck, or whatever else. A blocking socket means for you
that some data is in the pipe, waiting to be transmitted. This is the
knowledge that you require, and you should deal with it. A blocking socket
does not mean FAILED; a failure is reported via ECONNRESET or another error.
So everything is fine when your socket blocks. Probably you should adjust
your Apache timeouts or other parts of the program logic.

> As stated above, we think it is difficult for user processes to handle
> a timeout when writing to the socket buffer if wmem is configured to a
> large value (to ensure high throughput over a high-latency network,
> like 3G).

No, you should adjust your code and account for the fact that the socket
has data in the pipe. That's all.

Changing tcp_lowat

David Miller Aug. 22, 2011, 6:35 p.m. UTC | #5
From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Mon, 22 Aug 2011 09:33:52 +0900

> Is it (really) a network problem?
> Or does wmem not have enough free space for the write?

Oh yes you can indeed make this determination, by using the socket
timeouts via the SO_RCVTIMEO and SO_SNDTIMEO socket options.

Timeouts, when hit, will return -EINTR, whereas lack of buffer space
on a non-blocking socket will return -EAGAIN.

I think you simply are unaware of the facilities available in the BSD
socket API.
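
A minimal sketch of the two facilities named above, assuming a
Linux system. The helper names and the enum are hypothetical.
MSG_DONTWAIT is used here to get a non-blocking attempt on a
single call instead of switching the whole descriptor to
O_NONBLOCK. The exact errno reported for an expired SO_SNDTIMEO
is worth checking against socket(7) on the target system, so the
sketch reports EAGAIN and EINTR as separate outcomes rather than
assuming which one a timeout produces.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical classification of a single send attempt. */
enum send_status { SEND_OK, SEND_WOULD_BLOCK, SEND_INTERRUPTED, SEND_ERROR };

/* Bound how long a blocking send() may wait for buffer space. */
static int set_send_timeout(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/* Attempt one send without blocking and classify the result. */
static enum send_status try_send(int fd, const void *buf, size_t len,
                                 ssize_t *sent)
{
    ssize_t n = send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);

    if (n >= 0) {
        *sent = n;              /* may be fewer bytes than requested */
        return SEND_OK;
    }
    *sent = 0;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return SEND_WOULD_BLOCK; /* no buffer space right now */
    if (errno == EINTR)
        return SEND_INTERRUPTED; /* interrupted before any data moved */
    return SEND_ERROR;           /* e.g. ECONNRESET or EPIPE */
}
```

An application can treat SEND_WOULD_BLOCK as "data is still
queued, try again after poll()" and reserve its own timeout
handling for the case where no progress is made at all.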
Jun.Kondo Aug. 25, 2011, 4:46 a.m. UTC | #6
Currently, once the transmission buffer becomes full, it is not
possible to write again unless there is one third of free space
in the transmission buffer.

Our modification request is not intended to change the behavior
of the OS itself, but to make the value "one third"
configurable rather than fixed.

Thus it would still be possible to set the value to 1/3.

So, could you please tell us why it is not acceptable to make
it configurable, and why the value of 1/3 must remain
fixed?


David Miller Aug. 25, 2011, 5 a.m. UTC | #7
From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Thu, 25 Aug 2011 13:46:58 +0900

> Currently, once the transmission buffer becomes full, it is not
> possible to write again unless there is one third of free space
> in the transmission buffer.

Then use a non-blocking socket if you don't want to block.

We're talking in circles, and will walk down the same discussions
again.  You have still not shown what real limitation is created
by the way things work currently.

I've said everything that I can, and I will thus recuse myself from
the rest of this discussion since I really can't add anything more.
Jun.Kondo Sept. 9, 2011, 1:33 a.m. UTC | #8
The clients of this system are cellular phones, and the
quality of the communication line to a client varies
widely with its location and the congestion situation.

In terms of line speed, it can be around 9 Mbps
when it is fast, but 8 kbps when it is slow.

The customer's requirement is to provide stable service
in both situations.

- In normal situation, acquire large default transmission
   buffer value, and ensure high throughput from the
   beginning of tcp connection

- On the other hand, even when the connection has low
   throughput, such as low rate streaming, transmit data
   without timeout

However, when the throughput is low, it takes a long time
for the transmission buffer to be freed, and the timeout
occurs during that period.
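
To give a feel for the numbers, here is a small sketch of the
drain time at the two line speeds mentioned above. The buffer
size of 102000 bytes is a made-up example value, the function
names are hypothetical, and the closed form ignores integer
rounding in the shift.

```c
/* Bytes that must drain from a full send buffer before a blocked
 * writer resumes, for a threshold of wmem_queued >> shift.
 * Derived from free >= (sndbuf - free) >> shift, ignoring rounding. */
static long bytes_to_free(long sndbuf, unsigned shift)
{
    return sndbuf / ((1L << shift) + 1);
}

/* Seconds needed to transmit `bytes` at `bits_per_sec`. */
static double drain_seconds(long bytes, long bits_per_sec)
{
    return (double)bytes * 8.0 / (double)bits_per_sec;
}
```

With these assumptions, a shift of 1 requires 34000 bytes to
drain, which takes 34 seconds at 8 kbps but only about 0.03
seconds at 9 Mbps. A shift of 4 requires 6000 bytes, or 6
seconds at 8 kbps.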

Of course, the connection would not be dropped if the
application timeout were extended, but end users will not
wait patiently for as long as 1 minute.
Therefore, we do not want to extend the timeout value.

By making the threshold at which a blocked socket becomes
writable again configurable, and setting it to a small
value, it will be possible to return data to the client
without a timeout occurring.

So, we think the issue can be solved with this
modification.


David Miller Sept. 9, 2011, 2:17 a.m. UTC | #9
From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Fri, 09 Sep 2011 10:33:58 +0900

> - In normal situation, acquire large default transmission
>   buffer value, and ensure high throughput from the
>   beginning of tcp connection

You should never do this.  You should use the default buffer sizes and
as a result the kernel's TCP stack automatically adjusts the send and
receive buffers in response to the link characteristics.

When you set explicit buffer sizes, this turns off the TCP stack's
auto-tuning mechanism.
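
A minimal sketch of the alternative: leave SO_SNDBUF unset and
only read back the size the kernel is currently using. The
helper name is hypothetical; getsockopt() with SO_SNDBUF is the
standard socket call.

```c
#include <sys/socket.h>

/* Return the current send buffer size of `fd` in bytes, or -1 on error.
 * Only reads the value; it never calls setsockopt(SO_SNDBUF), so the
 * size stays under the kernel's control. */
static int current_sndbuf(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &len) < 0)
        return -1;
    return value;
}
```

Calling this at intervals on a live connection is one way to
observe how the send buffer changes over time without pinning it.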

Every argument made in support of your proposed feature is based upon
a false premise of one kind or another, and this is yet another example
of this.

Patch

--- linux-mainline/include/net/sock.h.orig	2011-07-27 14:26:43.000000000 +0900
+++ linux-mainline/include/net/sock.h	2011-08-15 11:40:20.000000000 +0900
@@ -604,9 +604,11 @@  static inline int sk_acceptq_is_full(str
 /*
  * Compute minimal free write space needed to queue new packets.
  */
+extern __u32 sysctl_tcp_lowat;
+
 static inline int sk_stream_min_wspace(struct sock *sk)
 {
-	return sk->sk_wmem_queued >> 1;
+	return sk->sk_wmem_queued >> sysctl_tcp_lowat;
 }
 
 static inline int sk_stream_wspace(struct sock *sk)
--- linux-mainline/net/core/sock.c.orig	2011-07-24 05:04:06.000000000 +0900
+++ linux-mainline/net/core/sock.c	2011-08-15 11:34:27.000000000 +0900
@@ -217,6 +217,9 @@  __u32 sysctl_rmem_max __read_mostly = SK
 __u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
 __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 
+__u32 sysctl_tcp_lowat = 1;
+EXPORT_SYMBOL(sysctl_tcp_lowat);
+
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
@@ -1330,6 +1333,8 @@  void __init sk_init(void)
 		sysctl_wmem_max = 131071;
 		sysctl_rmem_max = 131071;
 	}
+
+	sysctl_tcp_lowat = 1;
 }
 
 /*
--- linux-mainline/net/core/sysctl_net_core.c.orig	2011-05-29 06:01:16.000000000 +0900
+++ linux-mainline/net/core/sysctl_net_core.c	2011-08-15 11:05:38.000000000 +0900
@@ -168,6 +168,13 @@  static struct ctl_table net_core_table[]
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+	{
+		.procname	= "tcp_lowat",
+		.data		= &sysctl_tcp_lowat,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",