Message ID | 1430947585-2375-2-git-send-email-edumazet@google.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On Wed, May 6, 2015 at 5:26 PM, Eric Dumazet <edumazet@google.com> wrote: > With the advent of small rto timers in datacenter TCP, > (ip route ... rto_min x), the following can happen : > > 1) Qdisc is full, transmit fails. > > TCP sets a timer based on icsk_rto to retry the transmit, without > exponential backoff. > With low icsk_rto, and lot of sockets, all cpus are servicing timer > interrupts like crazy. > Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) > and 500ms (TCP_RESOURCE_PROBE_INTERVAL) > > 2) Receivers can send zero windows if they don't drain their receive queue. > > TCP sends zero window probes, based on icsk_rto current value, with > exponential backoff. > With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in > some cases), sender can abort in less than one or two minutes ! > If receiver stops the sender, it obviously doesn't care of very tight > rto. Probability of dropping the ACK reopening the window is not > worth the risk. > > Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these > events (but not normal RTO based retransmits) > > A followup patch adds a new SNMP counter, as it would have helped a lot > diagnosing this issue. > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> neal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/net/tcp.h b/include/net/tcp.h index 6d204f3f9df8..7a2248a35b13 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1043,14 +1043,31 @@ static inline bool tcp_is_cwnd_limited(const struct sock *sk) return tp->is_cwnd_limited; } -static inline void tcp_check_probe_timer(struct sock *sk) +/* Something is really bad, we could not queue an additional packet, + * because qdisc is full or receiver sent a 0 window. + * We do not want to add fuel to the fire, or abort too early, + * so make sure the timer we arm now is at least 200ms in the future, + * regardless of current icsk_rto value (as it could be ~2ms) + */ +static inline unsigned long tcp_probe0_base(const struct sock *sk) { - const struct tcp_sock *tp = tcp_sk(sk); - const struct inet_connection_sock *icsk = inet_csk(sk); + return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN); +} - if (!tp->packets_out && !icsk->icsk_pending) +/* Variant of inet_csk_rto_backoff() used for zero window probes */ +static inline unsigned long tcp_probe0_when(const struct sock *sk, + unsigned long max_when) +{ + u64 when = (u64)tcp_probe0_base(sk) << inet_csk(sk)->icsk_backoff; + + return (unsigned long)min_t(u64, when, max_when); +} + +static inline void tcp_check_probe_timer(struct sock *sk) +{ + if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending) inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0, - icsk->icsk_rto, TCP_RTO_MAX); + tcp_probe0_base(sk), TCP_RTO_MAX); } static inline void tcp_init_wl(struct tcp_sock *tp, u32 seq) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index df2ca615cd0c..cf8b20ff6658 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3233,7 +3233,7 @@ static void tcp_ack_probe(struct sock *sk) * This function is not for random using! */ } else { - unsigned long when = inet_csk_rto_backoff(icsk, TCP_RTO_MAX); + unsigned long when = tcp_probe0_when(sk, TCP_RTO_MAX); inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0, when, TCP_RTO_MAX); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index a369e8a70b2c..b76c719e1979 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3490,7 +3490,7 @@ void tcp_send_probe0(struct sock *sk) probe_max = TCP_RESOURCE_PROBE_INTERVAL; } inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0, - inet_csk_rto_backoff(icsk, probe_max), + tcp_probe0_when(sk, probe_max), TCP_RTO_MAX); }