diff mbox

[v2,net] tcp: zero retrans_stamp if all retrans were acked

Message ID f58c4b02028e0750e583518608891d20742c0f38.1415128437.git.mleitner@redhat.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Marcelo Leitner Nov. 4, 2014, 7:15 p.m. UTC
Ueki Kohei reported that when we are using NewReno with connections that
have a very low traffic, we may timeout the connection too early if a
second loss occurs after the first one was successfully acked but no
data was transfered later. Below is his description of it:

When SACK is disabled, and a socket suffers multiple separate TCP
retransmissions, that socket's ETIMEDOUT value is calculated from the
time of the *first* retransmission instead of the *latest*
retransmission.

This happens because the tcp_sock's retrans_stamp is set once then never
cleared.

Take the following connection:

                      Linux                    remote-machine
                        |                           |
         send#1---->(*1)|--------> data#1 --------->|
                  |     |                           |
                 RTO    :                           :
                  |     |                           |
                 ---(*2)|----> data#1(retrans) ---->|
                  | (*3)|<---------- ACK <----------|
                  |     |                           |
                  |     :                           :
                  |     :                           :
                  |     :                           :
                16 minutes (or more)                :
                  |     :                           :
                  |     :                           :
                  |     :                           :
                  |     |                           |
         send#2---->(*4)|--------> data#2 --------->|
                  |     |                           |
                 RTO    :                           :
                  |     |                           |
                 ---(*5)|----> data#2(retrans) ---->|
                  |     |                           |
                  |     |                           |
                RTO*2   :                           :
                  |     |                           |
                  |     |                           |
      ETIMEDOUT<----(*6)|                           |

(*1) One data packet sent.
(*2) Because no ACK packet is received, the packet is retransmitted.
(*3) The ACK packet is received. The transmitted packet is acknowledged.

At this point the first "retransmission event" has passed and been
recovered from. Any future retransmission is a completely new "event".

(*4) After 16 minutes (to correspond with retries2=15), a new data
packet is sent. Note: No data is transmitted between (*3) and (*4).

The socket's timeout SHOULD be calculated from this point in time, but
instead it's calculated from the prior "event" 16 minutes ago.

(*5) Because no ACK packet is received, the packet is retransmitted.
(*6) At the time of the 2nd retransmission, the socket returns
ETIMEDOUT.


Therefore, now we clear retrans_stamp as soon as all data during the
loss window is fully acked.

Reported-by: Ueki Kohei
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
---

Notes:
    v1->v2: fixed compilation issue noticed by Neal

 net/ipv4/tcp_input.c | 60 +++++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 29 deletions(-)

Comments

Neal Cardwell Nov. 4, 2014, 8:10 p.m. UTC | #1
On Tue, Nov 4, 2014 at 2:15 PM, Marcelo Ricardo Leitner
<mleitner@redhat.com> wrote:
...
> Therefore, now we clear retrans_stamp as soon as all data during the
> loss window is fully acked.
>
> Reported-by: Ueki Kohei
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> ---
>
> Notes:
>     v1->v2: fixed compilation issue noticed by Neal
>
>  net/ipv4/tcp_input.c | 60 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 31 insertions(+), 29 deletions(-)

Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>

Code looks fine, and it passes Yuchung's packetdrill test case for this.

Thanks for finding and fixing this, Marcelo.

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Leitner Nov. 4, 2014, 8:51 p.m. UTC | #2
On 04-11-2014 18:10, Neal Cardwell wrote:
> On Tue, Nov 4, 2014 at 2:15 PM, Marcelo Ricardo Leitner
> <mleitner@redhat.com> wrote:
> ...
>> Therefore, now we clear retrans_stamp as soon as all data during the
>> loss window is fully acked.
>>
>> Reported-by: Ueki Kohei
>> Cc: Neal Cardwell <ncardwell@google.com>
>> Cc: Yuchung Cheng <ycheng@google.com>
>> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> ---
>>
>> Notes:
>>      v1->v2: fixed compilation issue noticed by Neal
>>
>>   net/ipv4/tcp_input.c | 60 +++++++++++++++++++++++++++-------------------------
>>   1 file changed, 31 insertions(+), 29 deletions(-)
>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: Neal Cardwell <ncardwell@google.com>
>
> Code looks fine, and it passes Yuchung's packetdrill test case for this.
>
> Thanks for finding and fixing this, Marcelo.

And thank you guys for all the assistance on it. Btw, would you send me that 
packetdrill script? I'm curious to see how such corner case could be written 
on it.

Regards,
Marcelo

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 5, 2014, 1:20 a.m. UTC | #3
On Tue, 2014-11-04 at 18:51 -0200, Marcelo Ricardo Leitner wrote:

> And thank you guys for all the assistance on it. Btw, would you send me that 
> packetdrill script? I'm curious to see how such corner case could be written 
> on it.

One of the script I saw was :

You might have to adapt preconditions (tcp_rmem[]/tcp_wmem[])

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// Set a 10s timeout
+.000 setsockopt(3, SOL_TCP, TCP_USER_TIMEOUT, [10000], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0
+.000 < S 0:0(0) win 32792 <mss 1460,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 6>
+.010 < . 1:1(0) ack 1 win 257
+.000 accept(3, ..., ...) = 4
+.000 write(4, ..., 1000) = 1000
+.000 > P. 1:1001(1000) ack 1
+.625 > P. 1:1001(1000) ack 1
+.020 < . 1:1(0) ack 1001 win 257
// Purposely write more after the specified timeout for testing
+11.0 write(4, ..., 1000) = 1000
+.000 > P. 1001:2001(1000) ack 1
+1.25 > P. 1001:2001(1000) ack 1
// socket is killed when the 2nd RTO fires at +2.50 w/o this patch
// so the next write returns ETIMEOUT
+2.60 write(4, ..., 1000) = 1000


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Leitner Nov. 5, 2014, 3:32 p.m. UTC | #4
On 04-11-2014 23:20, Eric Dumazet wrote:
> On Tue, 2014-11-04 at 18:51 -0200, Marcelo Ricardo Leitner wrote:
>
>> And thank you guys for all the assistance on it. Btw, would you send me that
>> packetdrill script? I'm curious to see how such corner case could be written
>> on it.
>
> One of the script I saw was :
>
> You might have to adapt preconditions (tcp_rmem[]/tcp_wmem[])
>
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> // Set a 10s timeout
> +.000 setsockopt(3, SOL_TCP, TCP_USER_TIMEOUT, [10000], 4) = 0
> +.000 bind(3, ..., ...) = 0
> +.000 listen(3, 1) = 0
> +.000 < S 0:0(0) win 32792 <mss 1460,nop,wscale 7>
> +.000 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 6>
> +.010 < . 1:1(0) ack 1 win 257
> +.000 accept(3, ..., ...) = 4
> +.000 write(4, ..., 1000) = 1000
> +.000 > P. 1:1001(1000) ack 1
> +.625 > P. 1:1001(1000) ack 1
> +.020 < . 1:1(0) ack 1001 win 257
> // Purposely write more after the specified timeout for testing
> +11.0 write(4, ..., 1000) = 1000
> +.000 > P. 1001:2001(1000) ack 1
> +1.25 > P. 1001:2001(1000) ack 1
> // socket is killed when the 2nd RTO fires at +2.50 w/o this patch
> // so the next write returns ETIMEOUT
> +2.60 write(4, ..., 1000) = 1000

Cool, thanks!

Marcelo

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 5, 2014, 10 p.m. UTC | #5
From: Neal Cardwell <ncardwell@google.com>
Date: Tue, 4 Nov 2014 15:10:31 -0500

> On Tue, Nov 4, 2014 at 2:15 PM, Marcelo Ricardo Leitner
> <mleitner@redhat.com> wrote:
> ...
>> Therefore, now we clear retrans_stamp as soon as all data during the
>> loss window is fully acked.
>>
>> Reported-by: Ueki Kohei
>> Cc: Neal Cardwell <ncardwell@google.com>
>> Cc: Yuchung Cheng <ycheng@google.com>
>> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> ---
>>
>> Notes:
>>     v1->v2: fixed compilation issue noticed by Neal
>>
>>  net/ipv4/tcp_input.c | 60 +++++++++++++++++++++++++++-------------------------
>>  1 file changed, 31 insertions(+), 29 deletions(-)
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: Neal Cardwell <ncardwell@google.com>
> 
> Code looks fine, and it passes Yuchung's packetdrill test case for this.
> 
> Thanks for finding and fixing this, Marcelo.

Applied, thanks everyone.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a12b455928e52211efdc6b471ef54de6218f5df0..88fa2d1606859de25419d0d45c3095f6d410d42b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2315,6 +2315,35 @@  static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 
 /* Undo procedures. */
 
+/* We can clear retrans_stamp when there are no retransmissions in the
+ * window. It would seem that it is trivially available for us in
+ * tp->retrans_out, however, that kind of assumptions doesn't consider
+ * what will happen if errors occur when sending retransmission for the
+ * second time. ...It could the that such segment has only
+ * TCPCB_EVER_RETRANS set at the present time. It seems that checking
+ * the head skb is enough except for some reneging corner cases that
+ * are not worth the effort.
+ *
+ * Main reason for all this complexity is the fact that connection dying
+ * time now depends on the validity of the retrans_stamp, in particular,
+ * that successive retransmissions of a segment must not advance
+ * retrans_stamp under any conditions.
+ */
+static bool tcp_any_retrans_done(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+
+	if (tp->retrans_out)
+		return true;
+
+	skb = tcp_write_queue_head(sk);
+	if (unlikely(skb && TCP_SKB_CB(skb)->sacked & TCPCB_EVER_RETRANS))
+		return true;
+
+	return false;
+}
+
 #if FASTRETRANS_DEBUG > 1
 static void DBGUNDO(struct sock *sk, const char *msg)
 {
@@ -2410,6 +2439,8 @@  static bool tcp_try_undo_recovery(struct sock *sk)
 		 * is ACKed. For Reno it is MUST to prevent false
 		 * fast retransmits (RFC2582). SACK TCP is safe. */
 		tcp_moderate_cwnd(tp);
+		if (!tcp_any_retrans_done(sk))
+			tp->retrans_stamp = 0;
 		return true;
 	}
 	tcp_set_ca_state(sk, TCP_CA_Open);
@@ -2430,35 +2461,6 @@  static bool tcp_try_undo_dsack(struct sock *sk)
 	return false;
 }
 
-/* We can clear retrans_stamp when there are no retransmissions in the
- * window. It would seem that it is trivially available for us in
- * tp->retrans_out, however, that kind of assumptions doesn't consider
- * what will happen if errors occur when sending retransmission for the
- * second time. ...It could the that such segment has only
- * TCPCB_EVER_RETRANS set at the present time. It seems that checking
- * the head skb is enough except for some reneging corner cases that
- * are not worth the effort.
- *
- * Main reason for all this complexity is the fact that connection dying
- * time now depends on the validity of the retrans_stamp, in particular,
- * that successive retransmissions of a segment must not advance
- * retrans_stamp under any conditions.
- */
-static bool tcp_any_retrans_done(const struct sock *sk)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
-
-	if (tp->retrans_out)
-		return true;
-
-	skb = tcp_write_queue_head(sk);
-	if (unlikely(skb && TCP_SKB_CB(skb)->sacked & TCPCB_EVER_RETRANS))
-		return true;
-
-	return false;
-}
-
 /* Undo during loss recovery after partial ACK or using F-RTO. */
 static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
 {