diff mbox

[RFC,net-next] tcp: add a global sysctl to control TCP delayed ack

Message ID 1358334345-28980-1-git-send-email-amwang@redhat.com
State RFC, archived
Delegated to: David Miller

Commit Message

Amerigo Wang Jan. 16, 2013, 11:05 a.m. UTC
According to the previous discussion, there seems to be no
reasonable heuristic.

This is similar to the TCP_QUICKACK socket option, but for people
who can't modify the source code and still want to control
TCP delayed ACK behavior.

Does this make sense?

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>

---
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Laight Jan. 16, 2013, 12:22 p.m. UTC | #1
> According to the previous discussion, there seems to be no
> reasonable heuristic.
> 
> This is similar to the TCP_QUICKACK socket option, but for people
> who can't modify the source code and still want to control
> TCP delayed ACK behavior.
> 
> Does this make sense?

A sysctl is a bit of a big hammer; it probably isn't necessary
to disable delayed acks on all connections.

IIRC the related problems I saw were really on the sending
side when Nagle is disabled and it is doing 'slow start'.

Globally disabling on connections that have Nagle disabled
might be a possibility - but it is the Nagle parameter
at the other end that matters.

Perhaps the sending side, after sending 4 small frames immediately,
could send 1 or 2 additional full sized frames in order to
provoke an ack (IIRC an ack is sent if there are 2 full sized
frames of data unacked).

The other problem is that 'slow start' is restarted very
aggressively - whenever there is no unacked data.
If you have a very low latency connection and aren't doing
continuous bulk transfer it is restarted for every short
burst of transmits - effectively after every received ack.
There really ought to be a moderate idle time
before 'slow start' is restarted.

	David
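
To make the "2 full sized frames" rule mentioned above concrete, here is a minimal sketch of a receiver-side decision. It is a simplified model, not the kernel's actual check; the function name and parameters are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (assumption, not kernel code): ACK at once when
 * quick-ACK mode is on, or when at least two full-sized segments'
 * worth of received data is still unacknowledged. Otherwise the ACK
 * may be delayed. */
static bool ack_immediately(uint32_t unacked_bytes, uint32_t mss, bool quickack)
{
	return quickack || (uint64_t)unacked_bytes >= 2ull * mss;
}
```

Under this model, four 100-byte frames with an MSS of 1460 do not trigger an immediate ACK unless quick-ACK mode is on, which matches the stall described above.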



Amerigo Wang Jan. 17, 2013, 9:21 a.m. UTC | #2
On Wed, 2013-01-16 at 12:22 +0000, David Laight wrote:
> > According to the previous discussion, there seems to be no
> > reasonable heuristic.
> > 
> > This is similar to the TCP_QUICKACK socket option, but for people
> > who can't modify the source code and still want to control
> > TCP delayed ACK behavior.
> > 
> > Does this make sense?
> 
> A sysctl is a bit of a big hammer; it probably isn't necessary
> to disable delayed acks on all connections.

You mean make this sysctl per-socket? But we don't have per-socket or
per-connection sysctls for networking, do we?

> 
> IIRC the related problems I saw were really on the sending
> side when Nagle is disabled and it is doing 'slow start'.
> 
> Globally disabling on connections that have Nagle disabled
> might be a possibility - but it is the Nagle parameter
> at the other end that matters.
> 
> Perhaps the sending side, after sending 4 small frames immediately,
> could send 1 or 2 additional full sized frames in order to
> provoke an ack (IIRC an ack is sent if there are 2 full sized
> frames of data unacked).
> 
> The other problem is that 'slow start' is restarted very
> aggressively - whenever there is no unacked data.
> If you have a very low latency connection and aren't doing
> continuous bulk transfer it is restarted for every short
> burst of transmits - effectively after every received ack.
> There really ought to be a moderate idle time
> before 'slow start' is restarted.
> 

These situations are not easy at all to detect.

Thanks.

Thomas Graf Jan. 17, 2013, 12:34 p.m. UTC | #3
On 01/16/13 at 12:22pm, David Laight wrote:
> A sysctl is a bit of a big hammer; it probably isn't necessary
> to disable delayed acks on all connections.
> 
> IIRC the related problems I saw were really on the sending
> side when Nagle is disabled and it is doing 'slow start'.
> 
> Globally disabling on connections that have Nagle disabled
> might be a possibility - but it is the Nagle parameter
> at the other end that matters.
> 
> Perhaps the sending side, after sending 4 small frames immediately,
> could send 1 or 2 additional full sized frames in order to
> provoke an ack (IIRC an ack is sent if there are 2 full sized
> frames of data unacked).
> 
> The other problem is that 'slow start' is restarted very
> aggressively - whenever there is no unacked data.
> If you have a very low latency connection and aren't doing
> continuous bulk transfer it is restarted for every short
> burst of transmits - effectively after every received ack.
> There really ought to be a moderate idle time
> before 'slow start' is restarted.

Not that I disagree with this fundamentally, but we already
have a socket option to enable the functionality. All this
patch does is make the same functionality available to
users who are not able to modify the application.

We can argue about making it available as a route metric
exclusively though.
--
David Laight Jan. 17, 2013, 1:25 p.m. UTC | #4
> Not that I disagree with this fundamentally, but we already
> have a socket option to enable the functionality. All this
> patch does is make the same functionality available to
> users who are not able to modify the application.

My reading of the TCP_QUICKACK documentation is that it is a request
to send an ack now - rather than to permanently disable delayed acks.
Having to do an extra system call after every recv() call
is rather OTT.

Or did you mean some other socket option?

	David
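
For reference, the per-call cost described above looks roughly like this in an application. This is a minimal sketch for Linux: tcp(7) documents that the TCP_QUICKACK flag is not permanent, so the option is set again after each read. The helper names are hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Ask the kernel to switch this socket to quick-ACK mode.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
static int enable_quickack(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}

/* recv() wrapper (hypothetical helper) that re-arms TCP_QUICKACK after
 * every successful read, because the stack may fall back to delayed
 * ACKs on its own. This is the extra system call per read. */
static ssize_t recv_quickack(int fd, void *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len, 0);

	if (n > 0 && enable_quickack(fd) < 0)
		return -1;
	return n;
}
```

An application would call recv_quickack() wherever it currently calls recv(), which is exactly the source change that the proposed sysctl is meant to avoid.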



Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 4976564..8fc96f2 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -605,6 +605,11 @@  tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 100
 
+tcp_quick_ack - BOOLEAN
+	If set, sockets are kept out of delayed ACK (pingpong) mode.
+	Applications can still override this with TCP_QUICKACK.
+	Default: off
+
 UDP variables:
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 614af8b..0ba0c26 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -291,6 +291,7 @@  extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_quick_ack;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a25e1d2..9b4bb75 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -767,6 +767,13 @@  static struct ctl_table ipv4_table[] = {
 		.extra2		= &two,
 	},
 	{
+		.procname	= "tcp_quick_ack",
+		.data		= &sysctl_tcp_quick_ack,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0905997..3f68482 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -100,6 +100,7 @@  int sysctl_tcp_thin_dupack __read_mostly;
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_abc __read_mostly;
 int sysctl_tcp_early_retrans __read_mostly = 2;
+int sysctl_tcp_quick_ack __read_mostly;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
@@ -4081,7 +4082,8 @@  static void tcp_fin(struct sock *sk)
 	case TCP_ESTABLISHED:
 		/* Move to CLOSE_WAIT */
 		tcp_set_state(sk, TCP_CLOSE_WAIT);
-		inet_csk(sk)->icsk_ack.pingpong = 1;
+		if (!sysctl_tcp_quick_ack)
+			inet_csk(sk)->icsk_ack.pingpong = 1;
 		break;
 
 	case TCP_CLOSE_WAIT:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 667a6ad..44eff34 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -174,8 +174,9 @@  static void tcp_event_data_sent(struct tcp_sock *tp,
 	/* If it is a reply for ato after last received
 	 * packet, enter pingpong mode.
 	 */
-	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
-		icsk->icsk_ack.pingpong = 1;
+	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato &&
+	    !sysctl_tcp_quick_ack)
+		icsk->icsk_ack.pingpong = 1;
 }
 
 /* Account for an ACK we sent. */
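
The condition changed in the tcp_event_data_sent() hunk can be exercised outside the kernel with a small userspace model. This is only a sketch: the function name is hypothetical, plain integers stand in for the icsk_ack fields, and quick_ack stands in for the proposed sysctl_tcp_quick_ack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model (assumption, not kernel code) of the patched check:
 * enter pingpong mode only when the reply is sent within 'ato' ticks
 * of the last received packet and the global quick-ACK knob is off. */
static bool should_enter_pingpong(uint32_t now, uint32_t lrcvtime,
				  uint32_t ato, int quick_ack)
{
	/* Unsigned subtraction keeps the elapsed time correct even when
	 * the tick counter has wrapped around between the two samples. */
	return (uint32_t)(now - lrcvtime) < ato && !quick_ack;
}
```

With quick_ack set to a non-zero value the function never returns true, mirroring how the patch stops this path from setting icsk_ack.pingpong.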