Message ID | 1358334345-28980-1-git-send-email-amwang@redhat.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |

> According to previous discussion, it seems there is no
> reasonable heuristics.
>
> Similar to TCP_QUICK_ACK option, but for people who can't
> modify the source code and still wants to control
> TCP delayed ACK behavior.
>
> Makes any sense?

A sysctl is a bit of a big hammer, it probably isn't necessary
to disable delayed acks on all connections.

IIRC the related problems I saw were really on the sending
side when Nagle is disabled and it is doing 'slow start'.

Globally disabling on connections that have Nagle disabled
might be a possibility - but it is the Nagle parameter
at the other end that matters.

Perhaps the sending side, after sending 4 small frames immediately,
could send 1 or 2 additional full sized frames in order to
provoke an ack (IIRC an ack is sent if there are 2 full sized
frames of data unacked).

The other problem is that 'slow start' is restarted very
aggressively - whenever there is no unacked data.
If you have a very low latency connection and aren't doing
continuous bulk transfer it is restarted for every short
burst of transmits - effectively after every received ack.
There really ought to have to be a moderate idle time
before 'slow start' is restarted.

	David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
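
The slow-start point in the message above can be illustrated with a small userspace model. This is only a sketch of the behaviour as the message describes it, not kernel code: `struct cwnd_model`, both function names, and the idle threshold parameter are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a sender's congestion state (not a kernel struct). */
struct cwnd_model {
	uint32_t cwnd;              /* current congestion window, in packets */
	uint32_t initial_cwnd;      /* window used when slow start restarts */
	uint32_t packets_in_flight; /* sent but not yet acknowledged */
	uint32_t last_send_ms;      /* timestamp of the last transmit */
};

/* Behaviour as described in the message: restart as soon as nothing is unacked. */
bool restart_when_drained(struct cwnd_model *m)
{
	if (m->packets_in_flight != 0)
		return false;
	m->cwnd = m->initial_cwnd;
	return true;
}

/* Suggested alternative: also require a moderate idle time before restarting. */
bool restart_after_idle(struct cwnd_model *m, uint32_t now_ms,
			uint32_t idle_threshold_ms)
{
	if (m->packets_in_flight != 0)
		return false;
	/* Unsigned subtraction keeps the comparison correct across wraparound. */
	if ((uint32_t)(now_ms - m->last_send_ms) < idle_threshold_ms)
		return false;
	m->cwnd = m->initial_cwnd;
	return true;
}
```

With the second variant, a low-latency connection that sends short bursts keeps its window between bursts, as long as the gaps stay below the (assumed) threshold.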

On Wed, 2013-01-16 at 12:22 +0000, David Laight wrote:
> > According to previous discussion, it seems there is no
> > reasonable heuristics.
> >
> > Similar to TCP_QUICK_ACK option, but for people who can't
> > modify the source code and still wants to control
> > TCP delayed ACK behavior.
> >
> > Makes any sense?
>
> A sysctl is a bit of a big hammer, it probably isn't necessary
> to disable delayed acks on all connections.

You mean make this sysctl per-socket? But we don't have per-socket
or per-connection sysctl for networking, do we?

>
> IIRC the related problems I saw were really on the sending
> side when Nagle is disabled and it is doing 'slow start'.
>
> Globally disabling on connections that have Nagle disabled
> might be a possibility - but it is the Nagle parameter
> at the other end that matters.
>
> Perhaps the sending side, after sending 4 small frames immediately,
> could send 1 or 2 additional full sized frames in order to
> provoke an ack (IIRC an ack is sent if there are 2 full sized
> frames of data unacked).
>
> The other problem is that 'slow start' is restarted very
> aggressively - whenever there is no unacked data.
> If you have a very low latency connection and aren't doing
> continuous bulk transfer it is restarted for every short
> burst of transmits - effectively after every received ack.
> There really ought to have to be a moderate idle time
> before 'slow start' is restarted.
>

These situations are not easy at all to detect.

Thanks.

On 01/16/13 at 12:22pm, David Laight wrote:
> A sysctl is a bit of a big hammer, it probably isn't necessary
> to disable delayed acks on all connections.
>
> IIRC the related problems I saw were really on the sending
> side when Nagle is disabled and it is doing 'slow start'.
>
> Globally disabling on connections that have Nagle disabled
> might be a possibility - but it is the Nagle parameter
> at the other end that matters.
>
> Perhaps the sending side, after sending 4 small frames immediately,
> could send 1 or 2 additional full sized frames in order to
> provoke an ack (IIRC an ack is sent if there are 2 full sized
> frames of data unacked).
>
> The other problem is that 'slow start' is restarted very
> aggressively - whenever there is no unacked data.
> If you have a very low latency connection and aren't doing
> continuous bulk transfer it is restarted for every short
> burst of transmits - effectively after every received ack.
> There really ought to have to be a moderate idle time
> before 'slow start' is restarted.

Not that I disagree with this fundamentally but we already
have a socket option to enable the functionality. All this
patch does is making the same functionality available to
users that are not able to make modification on the
application level.

We can argue about making it available as route metric
exclusively though.

> Not that I disagree with this fundamentally but we already
> have a socket option to enable the functionality. All this
> patch does is making the same functionality available to
> users that are not able to make modification on the
> application level.

My reading of TCP_QUICKACK documentation is that it is a
request to send an ack now - rather than permanently disable
delayed acks.

Having to do an extra system call after every rcv() call
is rather OTT.

Or did you mean some other socket option?

	David
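
For readers unfamiliar with the cost being discussed, here is a minimal sketch of what re-arming the option after every receive might look like on Linux. `TCP_QUICKACK` and `setsockopt()` are the real Linux interfaces; the helper names `rearm_quickack` and `recv_quickack` are hypothetical and chosen only for this illustration. It assumes, as the message above reads the documentation, that the option is not permanent and so has to be set again after each read.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: ask the kernel to switch fd into quick-ACK mode. */
int rearm_quickack(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}

/*
 * Hypothetical wrapper: receive data, then set TCP_QUICKACK again.
 * This is the "extra system call after every receive" pattern.
 * Returns the byte count from recv(), or -1 on any failure.
 */
ssize_t recv_quickack(int fd, void *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len, 0);

	if (n > 0 && rearm_quickack(fd) < 0)
		return -1;
	return n;
}
```

Each successful call to `recv_quickack()` costs two system calls instead of one, which is the overhead the reply objects to, and it requires changing the application source, which is what the proposed sysctl tries to avoid.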

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 4976564..8fc96f2 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -605,6 +605,11 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 100
 
+tcp_quick_ack - BOOLEAN
+	Globally enables or disables TCP delayed ACK. The applications
+	can still change the quick ACK mode by TCP_QUICK_ACK option.
+	Default: off
+
 UDP variables:
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 614af8b..0ba0c26 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -291,6 +291,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_quick_ack;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a25e1d2..9b4bb75 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -767,6 +767,13 @@ static struct ctl_table ipv4_table[] = {
 		.extra2		= &two,
 	},
 	{
+		.procname	= "tcp_quick_ack",
+		.data		= &sysctl_tcp_quick_ack,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0905997..3f68482 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -100,6 +100,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_abc __read_mostly;
 int sysctl_tcp_early_retrans __read_mostly = 2;
+int sysctl_tcp_quick_ack __read_mostly;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
@@ -4081,7 +4082,8 @@ static void tcp_fin(struct sock *sk)
 	case TCP_ESTABLISHED:
 		/* Move to CLOSE_WAIT */
 		tcp_set_state(sk, TCP_CLOSE_WAIT);
-		inet_csk(sk)->icsk_ack.pingpong = 1;
+		if (!sysctl_tcp_quick_ack)
+			inet_csk(sk)->icsk_ack.pingpong = 1;
 		break;
 
 	case TCP_CLOSE_WAIT:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 667a6ad..44eff34 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -174,8 +174,9 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
 	/* If it is a reply for ato after last received
 	 * packet, enter pingpong mode.
 	 */
-	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
-		icsk->icsk_ack.pingpong = 1;
+	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato &&
+	    !sysctl_tcp_quick_ack)
+		icsk->icsk_ack.pingpong = 1;
 }
 
 /* Account for an ACK we sent. */
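
To make the effect of the tcp_output.c hunk easier to follow, the condition can be modelled in userspace. This is a rough sketch, not kernel code: `struct ack_model` and `model_event_data_sent` are hypothetical stand-ins for the `icsk_ack` fields and `tcp_event_data_sent()`, and the `quick_ack` argument stands in for the proposed `sysctl_tcp_quick_ack` global.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the icsk_ack fields touched by the patch. */
struct ack_model {
	uint32_t lrcvtime; /* time the last data packet was received */
	uint32_t ato;      /* current delayed-ACK timeout estimate */
	bool pingpong;     /* set when the connection looks interactive */
};

/*
 * Mirrors the patched condition: a reply sent within ato of the last
 * received packet enters pingpong mode, unless quick_ack is set.
 */
void model_event_data_sent(struct ack_model *a, uint32_t now, bool quick_ack)
{
	if ((uint32_t)(now - a->lrcvtime) < a->ato && !quick_ack)
		a->pingpong = true;
}
```

In this model, with `quick_ack` set the function never sets `pingpong` on transmit. Note that the patch only guards the two places shown in the diff; whether that is enough to suppress delayed ACKs in every case is not something this sketch can show.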

According to previous discussion, it seems there is no
reasonable heuristics.

Similar to TCP_QUICK_ACK option, but for people who can't
modify the source code and still wants to control
TCP delayed ACK behavior.

Makes any sense?

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>
---