diff mbox series

[RFC,RESEND] tcp: avoid F-RTO if SACK and timestamps are disabled

Message ID 20180613165543.0F92DA09E2@unicorn.suse.cz
State RFC, archived
Delegated to: David Miller
Headers show
Series [RFC,RESEND] tcp: avoid F-RTO if SACK and timestamps are disabled | expand

Commit Message

Michal Kubecek June 13, 2018, 4:55 p.m. UTC
When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
timestamps (either because of (mis)configuration or because the other
endpoint does not advertise them), specific pattern loss can make RTO grow
exponentially until the sender is only able to send one packet per two
minutes (TCP_RTO_MAX).

One way to reproduce is to

  - make sure the connection uses neither SACK nor timestamps
  - let tp->reorder grow enough so that lost packets are retransmitted
    after RTO (rather than when high_seq - snd_una > reorder * MSS)
  - let the data flow stabilize
  - drop multiple sender packets in "every second" pattern
  - either there is no new data to send or acks received in response to new
    data are also window updates (i.e. not dupacks by definition)

In this scenario, the sender keeps cycling between retransmitting first
lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
again. In this loop, the sender only gets

  (a) acks for retransmitted segments (possibly together with old ones)
  (b) window updates

Without timestamps, neither can be used for RTT estimator and without SACK,
we have no newly sacked segments to estimate RTT either. Therefore each
timeout doubles RTO and without usable RTT samples so that there is nothing
to counter the exponential growth.

While disabling both SACK and timestamps doesn't make any sense, the
resulting behaviour is so pathological that it deserves an improvement.
(Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
case both SACK and timestamps are disabled so that the sender falls back to
traditional slow start retransmission.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Michal Kubecek June 13, 2018, 4:57 p.m. UTC | #1
On Wed, Jun 13, 2018 at 06:55:43PM +0200, Michal Kubecek wrote:
> When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> timestamps (either because of (mis)configuration or because the other
> endpoint does not advertise them), specific pattern loss can make RTO grow
> exponentially until the sender is only able to send one packet per two
> minutes (TCP_RTO_MAX).
> 
> One way to reproduce is to
> 
>   - make sure the connection uses neither SACK nor timestamps
>   - let tp->reorder grow enough so that lost packets are retransmitted
>     after RTO (rather than when high_seq - snd_una > reorder * MSS)
>   - let the data flow stabilize
>   - drop multiple sender packets in "every second" pattern
>   - either there is no new data to send or acks received in response to new
>     data are also window updates (i.e. not dupacks by definition)
> 
> In this scenario, the sender keeps cycling between retransmitting first
> lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
> again. In this loop, the sender only gets
> 
>   (a) acks for retransmitted segments (possibly together with old ones)
>   (b) window updates
> 
> Without timestamps, neither can be used for RTT estimator and without SACK,
> we have no newly sacked segments to estimate RTT either. Therefore each
> timeout doubles RTO and without usable RTT samples so that there is nothing
> to counter the exponential growth.
> 
> While disabling both SACK and timestamps doesn't make any sense, the
> resulting behaviour is so pathological that it deserves an improvement.
> (Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
> case both SACK and timestamps are disabled so that the sender falls back to
> traditional slow start retransmission.
> 
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>

I was able to illustrate the issue using a packetdrill script. It cheats
a bit by setting net.ipv4.tcp_reordering to 30 so that it we can get to
the issue more quickly. In this case, we don't have more data to send
but it's not essential; the issue can be reproduced even with sending of
new data in F-RTO, it would only make everything more complicated.

I was able to run the same script on kernels 4.17-rc6, 4.12 (SLE15) and
4.4 (SLE12-SP2). Kernel 3.12 required minor modifications but not in the
important part (the slow start is a bit slower there).

---------------------------------------------------------------------------
--tolerance_usecs=10000

// flush cached TCP metrics
0.000  `ip tcp_metrics flush all`
+0.000 `sysctl -q net.ipv4.tcp_reordering=20`


// establish a connection
+0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0

+0.100 < S 0:0(0) win 40000 <mss 1000>
+0.000 > S. 0:0(0) ack 1 <mss 1460>
+0.100 < . 1:1(0) ack 1 win 40000
+0.000 accept(3, ..., ...) = 4

// Send 10 data segments.
+0.100 write(4, ..., 30000) = 30000
// For some reason (unknown yet), GSO packets are only 2000 bytes long
+0.000 > . 1:2001(2000) ack 1
+0.000 > . 2001:4001(2000) ack 1
+0.000 > . 4001:6001(2000) ack 1
+0.000 > . 6001:8001(2000) ack 1
+0.000 > . 8001:10001(2000) ack 1
+0.100 < . 1:1(0) ack 2001 win 38000
+0.000 > . 10001:12001(2000) ack 1
+0.000 > . 12001:14001(2000) ack 1
+0.001 < . 1:1(0) ack 4001 win 36000
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:18001(2000) ack 1
+0.001 < . 1:1(0) ack 6001 win 34000
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:22001(2000) ack 1
+0.001 < . 1:1(0) ack 8001 win 32000
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:26001(2000) ack 1
+0.001 < . 1:1(0) ack 10001 win 30000
+0.000 > . 26001:28001(2000) ack 1
+0.000 > P. 28001:30001(2000) ack 1

// loss of 12001:13001, 14001:15001, ..., 28001:29001
+0.100 < . 1:1(0) ack 12001 win 30000	// original ack
+0.000 < . 1:1(0) ack 12001 win 30000	// 13001:14001
+0.000 < . 1:1(0) ack 12001 win 30000	// 15001:16001
+0.000 < . 1:1(0) ack 12001 win 30000	// 17001:18001
+0.000 < . 1:1(0) ack 12001 win 30000	// 19001:20001
+0.000 < . 1:1(0) ack 12001 win 30000	// 21001:22001
+0.000 < . 1:1(0) ack 12001 win 30000	// 13001:24001
+0.000 < . 1:1(0) ack 12001 win 30000	// 25001:26001
+0.000 < . 1:1(0) ack 12001 win 30000	// 27001:28001
+0.000 < . 1:1(0) ack 12001 win 30000	// 29001:30001

// RTO 300ms
+0.270~+0.330 > . 12001:13001(1000) ack 1
+0.100 < . 1:1(0) ack 14001 win 38000
// RTO 600ms
+0.540~+0.660 > . 14001:15001(1000) ack 1
+0.100 < . 1:1(0) ack 16001 win 38000
// RTO 1200ms
+1.050~+1.350 > . 16001:17001(1000) ack 1
+0.100 < . 1:1(0) ack 18001 win 38000
// RTO 2400ms
+2.100~+2.700 > . 18001:19001(1000) ack 1
+0.100 < . 1:1(0) ack 20001 win 38000
// RTO 4800ms
+4.200~+5.400 > . 20001:21001(1000) ack 1
+0.100 < . 1:1(0) ack 22001 win 38000
// RTO 9600ms
+8.400~+10.800 > . 22001:23001(1000) ack 1
+0.100 < . 1:1(0) ack 24001 win 38000
// RTO 19200ms
+16.800~+21.600 > . 24001:25001(1000) ack 1

+1.000 `sysctl -q net.ipv4.tcp_reordering=3`
---------------------------------------------------------------------------

And this is what happens on current snapshot of master branch with
either net.ipv4.tcp_frto=0 or with the RFC patch:

---------------------------------------------------------------------------
--tolerance_usecs=10000

// flush cached TCP metrics
0.000  `ip tcp_metrics flush all`
+0.000 `sysctl -q net.ipv4.tcp_reordering=20`


// establish a connection
+0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0

+0.100 < S 0:0(0) win 40000 <mss 1000>
+0.000 > S. 0:0(0) ack 1 <mss 1460>
+0.100 < . 1:1(0) ack 1 win 40000
+0.000 accept(3, ..., ...) = 4

// Send 10 data segments.
+0.100 write(4, ..., 30000) = 30000
// For some reason (unknown yet), GSO packets are only 2000 bytes long
+0.000 > . 1:2001(2000) ack 1
+0.000 > . 2001:4001(2000) ack 1
+0.000 > . 4001:6001(2000) ack 1
+0.000 > . 6001:8001(2000) ack 1
+0.000 > . 8001:10001(2000) ack 1
+0.100 < . 1:1(0) ack 2001 win 38000
+0.000 > . 10001:12001(2000) ack 1
+0.000 > . 12001:14001(2000) ack 1
+0.001 < . 1:1(0) ack 4001 win 36000
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:18001(2000) ack 1
+0.001 < . 1:1(0) ack 6001 win 34000
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:22001(2000) ack 1
+0.001 < . 1:1(0) ack 8001 win 32000
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:26001(2000) ack 1
+0.001 < . 1:1(0) ack 10001 win 30000
+0.000 > . 26001:28001(2000) ack 1
+0.000 > P. 28001:30001(2000) ack 1

// loss of 12001:13001, 14001:15001, ..., 28001:29001
+0.100 < . 1:1(0) ack 12001 win 30000	// original ack
+0.000 < . 1:1(0) ack 12001 win 30000	// 13001:14001
+0.000 < . 1:1(0) ack 12001 win 30000	// 15001:16001
+0.000 < . 1:1(0) ack 12001 win 30000	// 17001:18001
+0.000 < . 1:1(0) ack 12001 win 30000	// 19001:20001
+0.000 < . 1:1(0) ack 12001 win 30000	// 21001:22001
+0.000 < . 1:1(0) ack 12001 win 30000	// 13001:24001
+0.000 < . 1:1(0) ack 12001 win 30000	// 25001:26001
+0.000 < . 1:1(0) ack 12001 win 30000	// 27001:28001
+0.000 < . 1:1(0) ack 12001 win 30000	// 29001:30001

// RTO 300ms
+0.270~+0.330 > . 12001:13001(1000) ack 1
+0.100 < . 1:1(0) ack 14001 win 38000

+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:17001(1000) ack 1
+0.100 < . 1:1(0) ack 16001 win 38000

+0.000 > . 17001:18001(1000) ack 1
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:21001(1000) ack 1
+0.100 < . 1:1(0) ack 18001 win 38000
+0.001 < . 1:1(0) ack 20001 win 36000
+0.001 < . 1:1(0) ack 21001 win 35000

+0.000 > . 21001:22001(1000) ack 1
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:25001(1000) ack 1
+0.000 > . 25001:26001(1000) ack 1
+0.000 > . 26001:28001(2000) ack 1
+0.000 > . 28001:29001(1000) ack 1
+0.000 > P. 29001:30001(1000) ack 1
+0.100 < . 1:1(0) ack 22001 win 38000
+0.001 < . 1:1(0) ack 24001 win 36000
+0.001 < . 1:1(0) ack 26001 win 34000
+0.001 < . 1:1(0) ack 28001 win 32000
+0.001 < . 1:1(0) ack 30001 win 30000

+1.000 `sysctl -q net.ipv4.tcp_reordering=3`
---------------------------------------------------------------------------

Michal Kubecek
Yuchung Cheng June 13, 2018, 5:32 p.m. UTC | #2
On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>
> When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> timestamps (either because of (mis)configuration or because the other
> endpoint does not advertise them), specific pattern loss can make RTO grow
> exponentially until the sender is only able to send one packet per two
> minutes (TCP_RTO_MAX).
>
> One way to reproduce is to
>
>   - make sure the connection uses neither SACK nor timestamps
>   - let tp->reorder grow enough so that lost packets are retransmitted
>     after RTO (rather than when high_seq - snd_una > reorder * MSS)
>   - let the data flow stabilize
>   - drop multiple sender packets in "every second" pattern
>   - either there is no new data to send or acks received in response to new
>     data are also window updates (i.e. not dupacks by definition)
>
> In this scenario, the sender keeps cycling between retransmitting first
> lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
> again. In this loop, the sender only gets
>
>   (a) acks for retransmitted segments (possibly together with old ones)
>   (b) window updates
>
> Without timestamps, neither can be used for RTT estimator and without SACK,
> we have no newly sacked segments to estimate RTT either. Therefore each
> timeout doubles RTO and without usable RTT samples so that there is nothing
> to counter the exponential growth.
>
> While disabling both SACK and timestamps doesn't make any sense, the
> resulting behaviour is so pathological that it deserves an improvement.
> (Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
> case both SACK and timestamps are disabled so that the sender falls back to
> traditional slow start retransmission.
>
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Yuchung Cheng <ycheng@google.com>

Thanks for the patch (and packedrill test)! I would encourage
submitting an errata to F-RTO RFC about this case.

> ---
>  net/ipv4/tcp_input.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 355d3dffd021..ed603f987b72 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2001,7 +2001,8 @@ void tcp_enter_loss(struct sock *sk)
>          */
>         tp->frto = net->ipv4.sysctl_tcp_frto &&
>                    (new_recovery || icsk->icsk_retransmits) &&
> -                  !inet_csk(sk)->icsk_mtup.probe_size;
> +                  !inet_csk(sk)->icsk_mtup.probe_size &&
> +                  (tcp_is_sack(tp) || tp->rx_opt.tstamp_ok);
>  }
>
>  /* If ACK arrived pointing to a remembered SACK, it means that our
> --
> 2.17.1
>
Eric Dumazet June 13, 2018, 5:48 p.m. UTC | #3
On 06/13/2018 10:32 AM, Yuchung Cheng wrote:
> On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>
>> When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
>> timestamps (either because of (mis)configuration or because the other
>> endpoint does not advertise them), specific pattern loss can make RTO grow
>> exponentially until the sender is only able to send one packet per two
>> minutes (TCP_RTO_MAX).

>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> 
> Thanks for the patch (and packedrill test)! 

Yes, thanks a lot Michal.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Ilpo Järvinen June 14, 2018, 8:42 a.m. UTC | #4
On Wed, 13 Jun 2018, Yuchung Cheng wrote:

> On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> > timestamps (either because of (mis)configuration or because the other
> > endpoint does not advertise them), specific pattern loss can make RTO grow
> > exponentially until the sender is only able to send one packet per two
> > minutes (TCP_RTO_MAX).
> >
> > One way to reproduce is to
> >
> >   - make sure the connection uses neither SACK nor timestamps
> >   - let tp->reorder grow enough so that lost packets are retransmitted
> >     after RTO (rather than when high_seq - snd_una > reorder * MSS)
> >   - let the data flow stabilize
> >   - drop multiple sender packets in "every second" pattern

Hmm? What is deterministically dropping every second packet for a 
particular flow that has RTOs in between?

Years back I was privately contacted by somebody from a middlebox vendor 
for a case with very similar exponentially growing RTO due to the FRTO 
heuristic. It turned out that they didn't want to send dupacks for 
out-of-order packets because they wanted to keep the TCP side of their 
deep packet inspection middlebox primitive. He claimed that the middlebox 
doesn't need to send dupacks because there could be such a TCP 
implementation that too doesn't do them either (not that he had anything 
to point to besides their middlebox ;-)), which according to him was 
not required because of his intepretation of RFC793 (IIRC). ...Nevermind 
anything that has occurred since that era.

...Back then, I also envisioned in that mail exchange with him that a 
middlebox could break FRTO by always forcing a drop on the key packet
FRTO depends on. Ironically, that is exactly what is required to trigger 
this issue? Sure, every a heuristic can be fooled if a deterministic (or
crafted) pattern is introduced to defeat that particular heuristic. ...But 
I'd prefer that networks "dropping every second packet" of a flow to be 
fixed rather than FRTO?

In addition, one could even argue that the sender is sending whole the 
time with lower and lower rate (given the exponentially increasing RTO) 
and still gets losses, so that a further rate reduction would be the 
correct action. ...But take this intuitive reasoning with some grain of 
salt (that is, I can see reasons myself to disagree with it :-)).

> >   - either there is no new data to send or acks received in response to new
> >     data are also window updates (i.e. not dupacks by definition)

Can you explain what exactly do you mean with this "no new data to send" 
condition here as F-RTO is/should not be used if there's no new data to 
send?!?

...Or, why is the receiver going against SHOULD in RFC5681:
   "A TCP receiver SHOULD send an immediate duplicate ACK when an out-
   of-order segment arrives."
? ...And yes, I know there's this very issue with window updates masking 
duplicate ACKs in Linux TCP receiver but I was met with some skepticism 
on whether fixing it is worth it or not.

> > In this scenario, the sender keeps cycling between retransmitting first
> > lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
> > again. In this loop, the sender only gets
> >
> >   (a) acks for retransmitted segments (possibly together with old ones)
> >   (b) window updates
> >
> > Without timestamps, neither can be used for RTT estimator and without SACK,
> > we have no newly sacked segments to estimate RTT either. Therefore each
> > timeout doubles RTO and without usable RTT samples so that there is nothing
> > to counter the exponential growth.
> >
> > While disabling both SACK and timestamps doesn't make any sense, the
> > resulting behaviour is so pathological that it deserves an improvement.
> > (Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
> > case both SACK and timestamps are disabled so that the sender falls back to
> > traditional slow start retransmission.
> >
> > Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> 
> Thanks for the patch (and packedrill test)! I would encourage
> submitting an errata to F-RTO RFC about this case.

Unless there's a convincing explination how such a drop pattern would 
occur in real world except due to serious brokeness/misconfiguration on 
network side (that should not be there), I'm not that sure it's exactly
what erratas are meant for.
Michal Kubecek June 14, 2018, 9:34 a.m. UTC | #5
On Thu, Jun 14, 2018 at 11:42:43AM +0300, Ilpo Järvinen wrote:
> On Wed, 13 Jun 2018, Yuchung Cheng wrote:
> 
> > On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > >
> > > When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> > > timestamps (either because of (mis)configuration or because the other
> > > endpoint does not advertise them), specific pattern loss can make RTO grow
> > > exponentially until the sender is only able to send one packet per two
> > > minutes (TCP_RTO_MAX).
> > >
> > > One way to reproduce is to
> > >
> > >   - make sure the connection uses neither SACK nor timestamps
> > >   - let tp->reorder grow enough so that lost packets are retransmitted
> > >     after RTO (rather than when high_seq - snd_una > reorder * MSS)
> > >   - let the data flow stabilize
> > >   - drop multiple sender packets in "every second" pattern
> 
> Hmm? What is deterministically dropping every second packet for a 
> particular flow that has RTOs in between?

AFAIK the customer we managed to push to investigate the primary source
of the packet loss identified some problems with their load balancing
solution but I don't have more details. For the record, the loss didn't
last through the phase of RTO growing exponentially (so that there were
no lost retransmissions) but did last long enough to drop at least 20
packets. With the exponential growth, that was enough for RTO to reach
TCP_RTO_MAX (120s) and make the connection essentially stalled.

Actually, it doesn't need to be exactly "every second". As long as you
don't lose two consecutive segments (which would allow you to fall back
in step (2a)), you can have more than one received segments between them
and get the same issue.

> Years back I was privately contacted by somebody from a middlebox vendor 
> for a case with very similar exponentially growing RTO due to the FRTO 
> heuristic. It turned out that they didn't want to send dupacks for 
> out-of-order packets because they wanted to keep the TCP side of their 
> deep packet inspection middlebox primitive. He claimed that the middlebox 
> doesn't need to send dupacks because there could be such a TCP 
> implementation that too doesn't do them either (not that he had anything 
> to point to besides their middlebox ;-)), which according to him was 
> not required because of his intepretation of RFC793 (IIRC). ...Nevermind 
> anything that has occurred since that era.
> 
> ...Back then, I also envisioned in that mail exchange with him that a 
> middlebox could break FRTO by always forcing a drop on the key packet
> FRTO depends on. Ironically, that is exactly what is required to trigger 
> this issue? Sure, every a heuristic can be fooled if a deterministic (or
> crafted) pattern is introduced to defeat that particular heuristic.

OK, let me elaborate a bit more about the background. Within last few
months, we had six different reports of TCP stalls (typically for NFS
connections alternating between idle period and bulk transfers) which
started after an upgrade from SLE11 (with 3.0 kernel) to SLE12 SP2 or
SP3 (both 4.4 kernel).

Two of them were analysed down to the NAS on the other side which was
sending SACK blocks violating the RFC in two different ways - as
described in thread "TCP one-by-one acking - RFC interpretation
question".

Three of them do not seem to show any apparent RFC violation and the
problem is only in RTO doubling with each retransmission while there are
no usable replies that could be used for RTT estimate (in the absence of
both SACK and timestamps).

For the sake of completeness, there was also one report from two days
ago which looked almost the same but in the end it turned out that in
this case, SLES (with Firefox) was the receiver and sender was actually
Windows 2016 server with Microsoft IIS.

> I'd prefer that networks "dropping every second packet" of a flow to be 
> fixed rather than FRTO?

Yes, that was my first reaction that their primary focus should be the
lossy network. However, it's not behaving like this all the time, the
periods of loss are relatively short - but long enough to trigger the
"RTO loop".

> In addition, one could even argue that the sender is sending whole the 
> time with lower and lower rate (given the exponentially increasing RTO) 
> and still gets losses, so that a further rate reduction would be the 
> correct action. ...But take this intuitive reasoning with some grain of 
> salt (that is, I can see reasons myself to disagree with it :-)).

As I explained above, the loss was over by the time of first RTO
retransmission. I should probably have made that clear in the commit
message.

> > >   - either there is no new data to send or acks received in response to new
> > >     data are also window updates (i.e. not dupacks by definition)
> 
> Can you explain what exactly do you mean with this "no new data to send" 
> condition here as F-RTO is/should not be used if there's no new data to 
> send?!?

AFAICS RFC 5682 is not explicit about this and offers multiple options.
Anyway, this is not essential and in most of the customer provided
captures, it wasn't the case.

> ...Or, why is the receiver going against SHOULD in RFC5681:
>    "A TCP receiver SHOULD send an immediate duplicate ACK when an out-
>    of-order segment arrives."
> ? ...And yes, I know there's this very issue with window updates masking 
> duplicate ACKs in Linux TCP receiver but I was met with some skepticism 
> on whether fixing it is worth it or not.

Normally, we would have timestamps (and even SACK). Without them, you
cannot reliably recognize a dupack with changed window size from
a spontaneous window update.

> > Acked-by: Yuchung Cheng <ycheng@google.com>
> > 
> > Thanks for the patch (and packedrill test)! I would encourage
> > submitting an errata to F-RTO RFC about this case.
> 
> Unless there's a convincing explination how such a drop pattern would 
> occur in real world except due to serious brokeness/misconfiguration on 
> network side (that should not be there), I'm not that sure it's exactly
> what erratas are meant for.

As explained above, this commit was not inspired by some theoretical
study trying to find dark corner cases, it was result of investigation
of reports from  multiple customer encountering the problem in
real-life.  Sure, there was always something bad, namely SACK/timestamps
being disabled and network losing packets, but the effect (one packet
per two minutes) is so disastrous that I believe it should be handled.

Michal Kubecek
Ilpo Järvinen June 14, 2018, 10:18 a.m. UTC | #6
On Wed, 13 Jun 2018, Michal Kubecek wrote:

> On Wed, Jun 13, 2018 at 06:55:43PM +0200, Michal Kubecek wrote:
> > When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> > timestamps (either because of (mis)configuration or because the other
> > endpoint does not advertise them), specific pattern loss can make RTO grow
> > exponentially until the sender is only able to send one packet per two
> > minutes (TCP_RTO_MAX).
> > 
> > One way to reproduce is to
> > 
> >   - make sure the connection uses neither SACK nor timestamps
> >   - let tp->reorder grow enough so that lost packets are retransmitted
> >     after RTO (rather than when high_seq - snd_una > reorder * MSS)
> >   - let the data flow stabilize
> >   - drop multiple sender packets in "every second" pattern
> >   - either there is no new data to send or acks received in response to new
> >     data are also window updates (i.e. not dupacks by definition)
> > 
> > In this scenario, the sender keeps cycling between retransmitting first
> > lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
> > again. In this loop, the sender only gets
> > 
> >   (a) acks for retransmitted segments (possibly together with old ones)
> >   (b) window updates
> > 
> > Without timestamps, neither can be used for RTT estimator and without SACK,
> > we have no newly sacked segments to estimate RTT either. Therefore each
> > timeout doubles RTO and without usable RTT samples so that there is nothing
> > to counter the exponential growth.
> > 
> > While disabling both SACK and timestamps doesn't make any sense, the
> > resulting behaviour is so pathological that it deserves an improvement.
> > (Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
> > case both SACK and timestamps are disabled so that the sender falls back to
> > traditional slow start retransmission.
> > 
> > Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> 
> I was able to illustrate the issue using a packetdrill script. It cheats
> a bit by setting net.ipv4.tcp_reordering to 30 so that it we can get to
> the issue more quickly. In this case, we don't have more data to send
> but it's not essential; the issue can be reproduced even with sending of
> new data in F-RTO, it would only make everything more complicated.
> 
> I was able to run the same script on kernels 4.17-rc6, 4.12 (SLE15) and
> 4.4 (SLE12-SP2). Kernel 3.12 required minor modifications but not in the
> important part (the slow start is a bit slower there).
> 
> ---------------------------------------------------------------------------
> --tolerance_usecs=10000
> 
> // flush cached TCP metrics
> 0.000  `ip tcp_metrics flush all`
> +0.000 `sysctl -q net.ipv4.tcp_reordering=20`
> 
> 
> // establish a connection
> +0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
> +0.000 bind(3, ..., ...) = 0
> +0.000 listen(3, 1) = 0
> 
> +0.100 < S 0:0(0) win 40000 <mss 1000>
> +0.000 > S. 0:0(0) ack 1 <mss 1460>
> +0.100 < . 1:1(0) ack 1 win 40000
> +0.000 accept(3, ..., ...) = 4
> 
> // Send 10 data segments.
> +0.100 write(4, ..., 30000) = 30000
> // For some reason (unknown yet), GSO packets are only 2000 bytes long
> +0.000 > . 1:2001(2000) ack 1
> +0.000 > . 2001:4001(2000) ack 1
> +0.000 > . 4001:6001(2000) ack 1
> +0.000 > . 6001:8001(2000) ack 1
> +0.000 > . 8001:10001(2000) ack 1
> +0.100 < . 1:1(0) ack 2001 win 38000
> +0.000 > . 10001:12001(2000) ack 1
> +0.000 > . 12001:14001(2000) ack 1
> +0.001 < . 1:1(0) ack 4001 win 36000
> +0.000 > . 14001:16001(2000) ack 1
> +0.000 > . 16001:18001(2000) ack 1
> +0.001 < . 1:1(0) ack 6001 win 34000
> +0.000 > . 18001:20001(2000) ack 1
> +0.000 > . 20001:22001(2000) ack 1
> +0.001 < . 1:1(0) ack 8001 win 32000
> +0.000 > . 22001:24001(2000) ack 1
> +0.000 > . 24001:26001(2000) ack 1
> +0.001 < . 1:1(0) ack 10001 win 30000
> +0.000 > . 26001:28001(2000) ack 1
> +0.000 > P. 28001:30001(2000) ack 1
> 
> // loss of 12001:13001, 14001:15001, ..., 28001:29001
> +0.100 < . 1:1(0) ack 12001 win 30000	// original ack
> +0.000 < . 1:1(0) ack 12001 win 30000	// 13001:14001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 15001:16001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 17001:18001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 19001:20001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 21001:22001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 13001:24001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 25001:26001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 27001:28001
> +0.000 < . 1:1(0) ack 12001 win 30000	// 29001:30001
> 
> // RTO 300ms
> +0.270~+0.330 > . 12001:13001(1000) ack 1

Lets analyze this case:
ca_state = CA_Loss

> +0.100 < . 1:1(0) ack 14001 win 38000

snd_una advances => icsk_retransmits = 0

...The lack of new data segments here seems very relevant to me and it 
hides from you what is really happening under the hood...

> // RTO 600ms
> +0.540~+0.660 > . 14001:15001(1000) ack 1

The above should already result false for FRTO in this case:
                   (new_recovery || icsk->icsk_retransmits) &&

...But it doesn't. If there would be the new data segment they would show 
to you that we're running a FRTO bogus undo here (with a burst of new 
data segments before the second RTO). The bogus undo on that ACK causes 
ca_state to switch away from CA_Loss and FRTO can then reoccur even though 
it was not intended. Please, try with this patch:
  https://patchwork.ozlabs.org/patch/883654/


...Since you're dealing with non-SACK flows here, you might want to 
consider the other fixes in that same series too as they all fix bad 
brokeness. I should do an updated version for that series but I've been 
waiting for the TCP testsuite to be published...
Ilpo Järvinen June 14, 2018, 11:51 a.m. UTC | #7
On Thu, 14 Jun 2018, Michal Kubecek wrote:

> On Thu, Jun 14, 2018 at 11:42:43AM +0300, Ilpo Järvinen wrote:
> > On Wed, 13 Jun 2018, Yuchung Cheng wrote:
> > > On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > > >
> > > > When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> > > > timestamps (either because of (mis)configuration or because the other
> > > > endpoint does not advertise them), specific pattern loss can make RTO grow
> > > > exponentially until the sender is only able to send one packet per two
> > > > minutes (TCP_RTO_MAX).
> > > >
> > > > One way to reproduce is to
> > > >
> > > >   - make sure the connection uses neither SACK nor timestamps
> > > >   - let tp->reorder grow enough so that lost packets are retransmitted
> > > >     after RTO (rather than when high_seq - snd_una > reorder * MSS)
> > > >   - let the data flow stabilize
> > > >   - drop multiple sender packets in "every second" pattern
> > 
> > Hmm? What is deterministically dropping every second packet for a 
> > particular flow that has RTOs in between?
> 
> AFAIK the customer we managed to push to investigate the primary source
> of the packet loss identified some problems with their load balancing
> solution but I don't have more details. For the record, the loss didn't
> last through the phase of RTO growing exponentially (so that there were
> no lost retransmissions) but did last long enough to drop at least 20
> packets. With the exponential growth, that was enough for RTO to reach
> TCP_RTO_MAX (120s) and make the connection essentially stalled.
> 
> Actually, it doesn't need to be exactly "every second". As long as you
> don't lose two consecutive segments (which would allow you to fall back
> in step (2a)), you can have more than one received segments between them
> and get the same issue.
> 
> > Years back I was privately contacted by somebody from a middlebox vendor 
> > for a case with very similar exponentially growing RTO due to the FRTO 
> > heuristic. It turned out that they didn't want to send dupacks for 
> > out-of-order packets because they wanted to keep the TCP side of their 
> > deep packet inspection middlebox primitive. He claimed that the middlebox 
> > doesn't need to send dupacks because there could be such a TCP 
> > implementation that too doesn't do them either (not that he had anything 
> > to point to besides their middlebox ;-)), which according to him was 
> > not required because of his intepretation of RFC793 (IIRC). ...Nevermind 
> > anything that has occurred since that era.
> > 
> > ...Back then, I also envisioned in that mail exchange with him that a 
> > middlebox could break FRTO by always forcing a drop on the key packet
> > FRTO depends on. Ironically, that is exactly what is required to trigger 
> > this issue? Sure, every a heuristic can be fooled if a deterministic (or
> > crafted) pattern is introduced to defeat that particular heuristic.
> 
> OK, let me elaborate a bit more about the background. Within last few
> months, we had six different reports of TCP stalls (typically for NFS
> connections alternating between idle period and bulk transfers) which
> started after an upgrade from SLE11 (with 3.0 kernel) to SLE12 SP2 or
> SP3 (both 4.4 kernel).
> 
> Two of them were analysed down to the NAS on the other side which was
> sending SACK blocks violating the RFC in two different ways - as
> described in thread "TCP one-by-one acking - RFC interpretation
> question".
> 
> Three of them do not seem to show any apparent RFC violation and the
> problem is only in RTO doubling with each retransmission while there are
> no usable replies that could be used for RTT estimate (in the absence of
> both SACK and timestamps).
> 
> For the sake of completeness, there was also one report from two days
> ago which looked almost the same but in the end it turned out that in
> this case, SLES (with Firefox) was the receiver and sender was actually
> Windows 2016 server with Microsoft IIS.
> 
> > I'd prefer that networks "dropping every second packet" of a flow to be 
> > fixed rather than FRTO?
> 
> Yes, that was my first reaction that their primary focus should be the
> lossy network. However, it's not behaving like this all the time, the
> periods of loss are relatively short - but long enough to trigger the
> "RTO loop".
>
> > In addition, one could even argue that the sender is sending whole the 
> > time with lower and lower rate (given the exponentially increasing RTO) 
> > and still gets losses, so that a further rate reduction would be the 
> > correct action. ...But take this intuitive reasoning with some grain of 
> > salt (that is, I can see reasons myself to disagree with it :-)).
> 
> As I explained above, the loss was over by the time of first RTO
> retransmission. I should probably have made that clear in the commit
> message.

Right. I had a preconveived mental image from another, similar case which 
requires a loss after each RTO but it does seem to be case here (and I 
only took look into the drill after sending the first mail out).

> > > >   - either there is no new data to send or acks received in response to new
> > > >     data are also window updates (i.e. not dupacks by definition)
> > 
> > Can you explain what exactly do you mean with this "no new data to send" 
> > condition here as F-RTO is/should not be used if there's no new data to 
> > send?!?
> 
> AFAICS RFC 5682 is not explicit about this and offers multiple options.
> Anyway, this is not essential and in most of the customer provided
> captures, it wasn't the case.

Lacking the new segments is essential for hiding the actual bug as the 
trace would look weird otherwise with a burst of new data segments (due 
to the other bug).

> > ...Or, why is the receiver going against SHOULD in RFC5681:
> >    "A TCP receiver SHOULD send an immediate duplicate ACK when an out-
> >    of-order segment arrives."
> > ? ...And yes, I know there's this very issue with window updates masking 
> > duplicate ACKs in Linux TCP receiver but I was met with some skepticism 
> > on whether fixing it is worth it or not.
> 
> Normally, we would have timestamps (and even SACK). Without them, you
> cannot reliably recognize a dupack with changed window size from
> a spontaneous window update.

No! The window should not update window on ACKs the receiver intends to 
designate as "duplicate ACKs". That is not without some potential cost 
though as it requires delaying window updates up to the next cumulative 
ACK. In the non-SACK series one of the changes is fixing this for
non-SACK Linux TCP flows.

> > > Acked-by: Yuchung Cheng <ycheng@google.com>
> > > 
> > > Thanks for the patch (and packedrill test)! I would encourage
> > > submitting an errata to F-RTO RFC about this case.
> > 
> > Unless there's a convincing explination how such a drop pattern would 
> > occur in real world except due to serious brokeness/misconfiguration on 
> > network side (that should not be there), I'm not that sure it's exactly
> > what erratas are meant for.
> 
> As explained above, this commit was not inspired by some theoretical
> study trying to find dark corner cases, it was result of investigation
> of reports from  multiple customer encountering the problem in
> real-life.  Sure, there was always something bad, namely SACK/timestamps
> being disabled and network losing packets, but the effect (one packet
> per two minutes) is so disastrous that I believe it should be handled.

Yes, I was guessing/hoping that the every second packet stuff was just for 
the reproducer (but of course it could have been something that is 
just very broken - it wouldn't surprise me that much :-(). Now, however,
I think if the FRTO undo bug is fixed, the main problem that remains is 
that with new data to send, then FRTO logic will not send anything after 
the first ACK and there won't be second ACK leading to that RTO loop. 
Given that FRTO RFC already recommends falling back to normal RTO recovery 
in the cases where there's no new data to send, I don't think there's
anything that needs to be changed in the RFC that would warrant an errata.
Michal Kubecek June 14, 2018, 1:18 p.m. UTC | #8
On Thu, Jun 14, 2018 at 02:51:18PM +0300, Ilpo Järvinen wrote:
> On Thu, 14 Jun 2018, Michal Kubecek wrote:
> > On Thu, Jun 14, 2018 at 11:42:43AM +0300, Ilpo Järvinen wrote:
> > > On Wed, 13 Jun 2018, Yuchung Cheng wrote:
> > > > On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > 
> > AFAICS RFC 5682 is not explicit about this and offers multiple options.
> > Anyway, this is not essential and in most of the customer provided
> > captures, it wasn't the case.
> 
> Lacking the new segments is essential for hiding the actual bug as the 
> trace would look weird otherwise with a burst of new data segments (due 
> to the other bug).

The trace wouldn't look so nice but it can be reproduced even with more
data to send. I've copied an example below. I couldn't find a really
nice one quickly so that first few retransmits (17:22:13.865105 through
17:23:05.841105) are without new data but starting at 17:23:58.189150,
you can see that sending new (previously unsent) data may not suffice to
break the loop.

> > Normally, we would have timestamps (and even SACK). Without them, you
> > cannot reliably recognize a dupack with changed window size from
> > a spontaneous window update.
> 
> No! The window should not update window on ACKs the receiver intends to 
> designate as "duplicate ACKs". That is not without some potential cost 
> though as it requires delaying window updates up to the next cumulative 
> ACK. In the non-SACK series one of the changes is fixing this for
> non-SACK Linux TCP flows.

That sounds like a reasonable change (at least at the first glance,
I didn't think about it too deeply) but even if we fix Linux stack to
behave like this, we cannot force everyone else to do the same.

Michal Kubecek


17:22:13.660030 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1101588007:1101650727, ack 1871152053, win 28, length 62720
17:22:13.660039 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294151936, win 12146, length 0
17:22:13.660047 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 62720:125440, ack 1, win 28, length 62720
17:22:13.660050 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294178816, win 12146, length 0
17:22:13.660052 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294196736, win 12146, length 0
17:22:13.660131 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 125440:188160, ack 1, win 28, length 62720
17:22:13.660142 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294223616, win 12146, length 0
17:22:13.660164 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 188160:250880, ack 1, win 28, length 62720
17:22:13.660171 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294250496, win 12146, length 0
17:22:13.660177 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294277376, win 12146, length 0
17:22:13.660181 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294304256, win 12146, length 0
17:22:13.660185 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294331136, win 12146, length 0
17:22:13.660196 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294349056, win 12146, length 0
17:22:13.660212 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 250880:313600, ack 1, win 28, length 62720
17:22:13.660224 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 313600:376320, ack 1, win 28, length 62720
17:22:13.660266 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294384896, win 12146, length 0
17:22:13.660292 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294411776, win 12146, length 0
17:22:13.660294 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294438656, win 12146, length 0
17:22:13.660295 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294465536, win 12146, length 0
17:22:13.660353 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 376320:439040, ack 1, win 28, length 62720
17:22:13.660377 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 439040:501760, ack 1, win 28, length 62720
17:22:13.660391 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294501376, win 12146, length 0
17:22:13.660396 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294519296, win 12146, length 0
17:22:13.660400 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 501760:564480, ack 1, win 28, length 62720
17:22:13.660409 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294555136, win 12146, length 0
17:22:13.660420 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294582016, win 12146, length 0
17:22:13.660434 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294608896, win 12146, length 0
17:22:13.660458 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 564480:627200, ack 1, win 28, length 62720
17:22:13.660515 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294644736, win 12146, length 0
17:22:13.660527 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294671616, win 12146, length 0
17:22:13.660540 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294698496, win 12146, length 0
17:22:13.660541 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294725376, win 12146, length 0
17:22:13.660542 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294743296, win 12146, length 0
17:22:13.660580 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 627200:689920, ack 1, win 28, length 62720
17:22:13.660597 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 689920:752640, ack 1, win 28, length 62720     <--- first loss
17:22:13.660642 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294770176, win 12146, length 0
17:22:13.660648 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 752640:815360, ack 1, win 28, length 62720
17:22:13.660655 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294797056, win 12146, length 0
17:22:13.660662 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294823936, win 12146, length 0
17:22:13.660666 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 815360:878080, ack 1, win 28, length 62720
17:22:13.660672 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294850816, win 12146, length 0
17:22:13.660696 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294877696, win 12146, length 0
17:22:13.660704 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 878080:940800, ack 1, win 28, length 62720
17:22:13.660765 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294913536, win 12146, length 0
17:22:13.660779 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 4294940416, win 12146, length 0
17:22:13.660791 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 0, win 12146, length 0
17:22:13.660793 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 26880, win 12146, length 0
17:22:13.660795 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 53760, win 12146, length 0
17:22:13.660821 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 940800:1003520, ack 1, win 28, length 62720
17:22:13.660837 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1003520:1066240, ack 1, win 28, length 62720
17:22:13.660890 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 80640, win 12146, length 0
17:22:13.660897 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1066240:1128960, ack 1, win 28, length 62720
17:22:13.660923 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 107520, win 12146, length 0
17:22:13.660928 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 134400, win 12146, length 0
17:22:13.660932 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 161280, win 12146, length 0
17:22:13.660936 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1128960:1191680, ack 1, win 28, length 62720
17:22:13.660944 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 188160, win 12146, length 0
17:22:13.661015 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 215040, win 12146, length 0
17:22:13.661044 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 241920, win 12146, length 0
17:22:13.661045 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 268800, win 12146, length 0
17:22:13.661047 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 295680, win 12146, length 0
17:22:13.661048 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 322560, win 12135, length 0
17:22:13.661106 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1191680:1254400, ack 1, win 28, length 62720
17:22:13.661139 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1254400:1317120, ack 1, win 28, length 62720
17:22:13.661145 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 349440, win 12146, length 0
17:22:13.661148 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 358400, win 12146, length 0
17:22:13.661149 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 358400, win 12146, length 0
17:22:13.661150 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 358400, win 12146, length 0
17:22:13.661151 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 358400, win 12146, length 0
17:22:13.661153 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 456960, win 12130, length 0
17:22:13.661155 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1317120:1379840, ack 1, win 28, length 62720
17:22:13.661178 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1379840:1442560, ack 1, win 28, length 62720
17:22:13.661192 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1442560:1505280, ack 1, win 28, length 62720
17:22:13.661264 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 483840, win 12146, length 0
17:22:13.661286 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 510720, win 12146, length 0
17:22:13.661292 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1505280:1568000, ack 1, win 28, length 62720
17:22:13.661299 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 537600, win 12146, length 0
17:22:13.661303 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 564480, win 12146, length 0
17:22:13.661308 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1568000:1630720, ack 1, win 28, length 62720
17:22:13.661317 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 582400, win 12140, length 0
17:22:13.661390 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 609280, win 12146, length 0
17:22:13.661411 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 636160, win 12146, length 0
17:22:13.661412 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 663040, win 12146, length 0
17:22:13.661429 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 689920, win 12146, length 0
17:22:13.661430 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 716800, win 12146, length 0
17:22:13.661437 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1630720:1693440, ack 1, win 28, length 62720
17:22:13.661445 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 725760, win 12146, length 0
17:22:13.661447 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661454 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1693440:1756160, ack 1, win 28, length 62720
17:22:13.661508 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0                    <--- first dupack
17:22:13.661513 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1756160:1818880, ack 1, win 28, length 62720
17:22:13.661520 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661524 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661527 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661530 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661532 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661635 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661637 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661638 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661640 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661641 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661642 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661653 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1818880:1881600, ack 1, win 28, length 62720
17:22:13.661757 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661761 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661764 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661768 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1881600:1944320, ack 1, win 28, length 62720
17:22:13.661778 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661782 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661886 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661891 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661894 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661897 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661900 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661902 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1944320:2007040, ack 1, win 28, length 62720
17:22:13.661928 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.661931 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662016 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662020 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662023 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662026 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662029 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662032 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2007040:2069760, ack 1, win 28, length 62720
17:22:13.662039 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662042 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662132 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662136 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662139 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662142 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662145 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662148 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2069760:2132480, ack 1, win 28, length 62720
17:22:13.662154 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662263 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662267 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662269 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662272 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662275 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662385 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662390 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2132480:2195200, ack 1, win 28, length 62720
17:22:13.662397 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662400 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662402 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662405 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662408 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662508 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662512 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662515 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2195200:2257920, ack 1, win 28, length 62720
17:22:13.662522 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662525 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662527 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662530 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662633 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662637 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662640 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662643 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2257920:2320640, ack 1, win 28, length 62720
17:22:13.662649 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662652 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662759 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662763 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662766 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.662881 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 734720, win 12146, length 0
17:22:13.865105 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 734720:743680, ack 1, win 28, length 8960      <--- first retransmit
17:22:13.865227 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 851200, win 12033, length 0
17:22:14.273092 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 851200:860160, ack 1, win 28, length 8960
17:22:14.273207 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 869120, win 12016, length 0
17:22:15.089125 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 869120:878080, ack 1, win 28, length 8960
17:22:15.089244 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 887040, win 11999, length 0
17:22:16.725135 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 887040:896000, ack 1, win 28, length 8960
17:22:16.725269 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 976640, win 11912, length 0
17:22:19.997144 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 976640:985600, ack 1, win 28, length 8960
17:22:19.997257 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1039360, win 11851, length 0
17:22:26.545096 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1039360:1048320, ack 1, win 28, length 8960
17:22:26.545212 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1057280, win 11834, length 0
17:22:39.629137 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1057280:1066240, ack 1, win 28, length 8960
17:22:39.629268 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1173760, win 11721, length 0
17:23:05.841105 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1173760:1182720, ack 1, win 28, length 8960
17:23:05.841229 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1290240, win 11613, length 0
17:23:58.189150 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1290240:1299200, ack 1, win 28, length 8960
17:23:58.189268 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11670, length 0
17:23:58.189310 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2320640:2383360, ack 1, win 28, length 62720   <--- new data
17:23:58.189416 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0                   <--- ack but window update
17:23:58.189424 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0
17:23:58.189458 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0
17:23:58.189466 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0
17:23:58.189475 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2383360:2446080, ack 1, win 28, length 62720   <--- more new data
17:23:58.189575 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0
17:23:58.189620 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0
17:23:58.189623 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1344000, win 11689, length 0
17:25:42.769136 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1344000:1352960, ack 1, win 28, length 8960    <--- retransmit only after RTO
17:25:42.769243 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1361920, win 11672, length 0
17:27:43.085128 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1361920:1370880, ack 1, win 28, length 8960
17:27:43.085240 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11631, length 0
17:27:43.085261 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2446080:2508800, ack 1, win 28, length 62720
17:27:43.085363 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085425 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085430 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085433 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085437 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2508800:2571520, ack 1, win 28, length 62720
17:27:43.085458 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085461 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085531 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085578 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:27:43.085581 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1469440, win 11678, length 0
17:29:43.405123 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1469440:1478400, ack 1, win 28, length 8960
17:29:43.405249 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11614, length 0
17:29:43.405288 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2571520:2634240, ack 1, win 28, length 62720
17:29:43.405400 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405408 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405446 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405454 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405462 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2634240:2696960, ack 1, win 28, length 62720
17:29:43.405502 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405579 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405626 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:29:43.405629 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1594880, win 11671, length 0
17:31:43.725113 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1594880:1603840, ack 1, win 28, length 8960
17:31:43.725273 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1657600, win 11610, length 0
17:33:44.045093 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1657600:1666560, ack 1, win 28, length 8960
17:33:44.045248 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1675520, win 11636, length 0
17:35:44.365137 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1675520:1684480, ack 1, win 28, length 8960
17:35:44.365319 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11642, length 0
17:35:44.365345 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2696960:2759680, ack 1, win 28, length 62720
17:35:44.365370 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2759680:2822400, ack 1, win 28, length 62720
17:35:44.365463 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365467 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365509 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365513 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365517 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2822400:2885120, ack 1, win 28, length 62720
17:35:44.365541 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365563 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365567 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365623 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365670 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365674 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365678 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365682 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2885120:2947840, ack 1, win 28, length 62720
17:35:44.365801 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365850 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365854 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:35:44.365894 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1783040, win 11689, length 0
17:37:44.685086 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1783040:1792000, ack 1, win 28, length 8960
17:37:44.685204 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 1899520, win 11576, length 0
17:39:45.005099 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 1899520:1908480, ack 1, win 28, length 8960
17:39:45.005228 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 2947840, win 11616, length 0
17:39:45.005304 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 2947840:3010560, ack 1, win 28, length 62720
17:39:45.005339 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3010560:3073280, ack 1, win 28, length 62720
17:39:45.005385 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3073280:3136000, ack 1, win 28, length 62720
17:39:45.005408 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3136000:3198720, ack 1, win 28, length 62720
17:39:45.005430 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3198720:3261440, ack 1, win 28, length 62720
17:39:45.005458 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3261440:3324160, ack 1, win 28, length 62720
17:39:45.005516 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3324160:3386880, ack 1, win 28, length 62720
17:39:45.005572 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3386880:3449600, ack 1, win 28, length 62720
17:39:45.005595 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3449600:3512320, ack 1, win 28, length 62720
17:39:45.005616 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3512320:3575040, ack 1, win 28, length 62720
17:39:45.005654 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3575040:3637760, ack 1, win 28, length 62720
17:39:45.005675 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3637760:3700480, ack 1, win 28, length 62720
17:39:45.005710 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3700480:3763200, ack 1, win 28, length 62720
17:39:45.005739 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 2956800, win 11851, length 0
17:39:45.005765 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3763200:3825920, ack 1, win 28, length 62720
17:39:45.005798 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3825920:3888640, ack 1, win 28, length 62720
17:39:45.005824 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 2974720, win 11841, length 0
17:39:45.005826 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3001600, win 11827, length 0
17:39:45.005827 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3019520, win 11961, length 0
17:39:45.005829 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3046400, win 11947, length 0
17:39:45.005831 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3073280, win 11932, length 0
17:39:45.005832 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3100160, win 11918, length 0
17:39:45.005834 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3127040, win 11904, length 0
17:39:45.005835 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3136000, win 11898, length 0
17:39:45.005837 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3162880, win 12029, length 0
17:39:45.005838 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3171840, win 12023, length 0
17:39:45.005840 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3198720, win 12009, length 0
17:39:45.005841 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3207680, win 12004, length 0
17:39:45.005842 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3234560, win 11990, length 0
17:39:45.005844 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3252480, win 11980, length 0
17:39:45.005845 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3297280, win 12092, length 0
17:39:45.005859 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3888640:3951360, ack 1, win 28, length 62720
17:39:45.005892 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 3951360:4014080, ack 1, win 28, length 62720
17:39:45.005899 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3395840, win 12146, length 0
17:39:45.005903 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3422720, win 12146, length 0
17:39:45.005904 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3449600, win 12146, length 0
17:39:45.005906 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3512320, win 12115, length 0
17:39:45.005923 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 4014080:4076800, ack 1, win 28, length 62720
17:39:45.005946 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3539200, win 12146, length 0
17:39:45.005948 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3566080, win 12146, length 0
17:39:45.005996 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3601920, win 12146, length 0
17:39:45.005998 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3628800, win 12146, length 0
17:39:45.005999 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3646720, win 12146, length 0
17:39:45.006002 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 4076800:4139520, ack 1, win 28, length 62720
17:39:45.006040 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3673600, win 12146, length 0
17:39:45.006043 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3700480, win 12146, length 0
17:39:45.006045 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 4139520:4202240, ack 1, win 28, length 62720
17:39:45.006085 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 4202240:4264960, ack 1, win 28, length 62720
17:39:45.006087 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3718400, win 12146, length 0
17:39:45.006111 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 4264960:4327680, ack 1, win 28, length 62720
17:39:45.006135 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3745280, win 12146, length 0
17:39:45.006137 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3772160, win 12146, length 0
17:39:45.006139 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3790080, win 12146, length 0
17:39:45.006168 IP 10.30.59.58.1556 > 10.31.112.14.30284: Flags [.], seq 4327680:4390400, ack 1, win 28, length 62720
17:39:45.006197 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3816960, win 12146, length 0
17:39:45.006200 IP 10.31.112.14.30284 > 10.30.59.58.1556: Flags [.], ack 3843840, win 12146, length 0
Ilpo Järvinen June 15, 2018, 8:05 a.m. UTC | #9
On Thu, 14 Jun 2018, Michal Kubecek wrote:

> On Thu, Jun 14, 2018 at 02:51:18PM +0300, Ilpo Järvinen wrote:
> > On Thu, 14 Jun 2018, Michal Kubecek wrote:
> > > On Thu, Jun 14, 2018 at 11:42:43AM +0300, Ilpo Järvinen wrote:
> > > > On Wed, 13 Jun 2018, Yuchung Cheng wrote:
> > > > > On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > > 
> > > AFAICS RFC 5682 is not explicit about this and offers multiple options.
> > > Anyway, this is not essential and in most of the customer provided
> > > captures, it wasn't the case.
> > 
> > Lacking the new segments is essential for hiding the actual bug as the 
> > trace would look weird otherwise with a burst of new data segments (due 
> > to the other bug).
> 
> The trace wouldn't look so nice but it can be reproduced even with more
> data to send. I've copied an example below. I couldn't find a really
> nice one quickly so that first few retransmits (17:22:13.865105 through
> 17:23:05.841105) are without new data but starting at 17:23:58.189150,
> you can see that sending new (previously unsent) data may not suffice to
> break the loop.

My point was that the new data segment bursts that occur if the sender 
isn't application limited indicate that there's something going wrong
with FRTO. And that wrong is also what is causing that RTO loop because
the sender doesn't see the previous FRTO recovery on second RTO. With 
my FRTO undo fix, (new_recovery || icsk->icsk_retransmits) will be false
and that will prevent the RTO loop.

> > > Normally, we would have timestamps (and even SACK). Without them, you
> > > cannot reliably recognize a dupack with changed window size from
> > > a spontaneous window update.
> > 
> > No! The window should not update window on ACKs the receiver intends to 
> > designate as "duplicate ACKs". That is not without some potential cost 
> > though as it requires delaying window updates up to the next cumulative 
> > ACK. In the non-SACK series one of the changes is fixing this for
> > non-SACK Linux TCP flows.
> 
> That sounds like a reasonable change (at least at the first glance,
> I didn't think about it too deeply) but even if we fix Linux stack to
> behave like this, we cannot force everyone else to do the same.

Unfortunately I don't know what the other stacks besides Linux do. But 
for Linux, the cause for the changing receiver window is the receiver 
window auto-tuning and I'm not sure if other stacks have a similar 
feature (or if that affects (almost) all ACKs like in Linux).
Michal Kubecek June 15, 2018, 9:27 a.m. UTC | #10
On Fri, Jun 15, 2018 at 11:05:03AM +0300, Ilpo Järvinen wrote:
> On Thu, 14 Jun 2018, Michal Kubecek wrote:
> > The trace wouldn't look so nice but it can be reproduced even with more
> > data to send. I've copied an example below. I couldn't find a really
> > nice one quickly so that first few retransmits (17:22:13.865105 through
> > 17:23:05.841105) are without new data but starting at 17:23:58.189150,
> > you can see that sending new (previously unsent) data may not suffice to
> > break the loop.
> 
> My point was that the new data segment bursts that occur if the sender 
> isn't application limited indicate that there's something going wrong
> with FRTO. And that wrong is also what is causing that RTO loop because
> the sender doesn't see the previous FRTO recovery on second RTO. With 
> my FRTO undo fix, (new_recovery || icsk->icsk_retransmits) will be false
> and that will prevent the RTO loop.

Yes, it would prevent the loop in this case (except it would be a bit
later, after second RTO rather than after first). But I'm not convinced
the logic of the patch is correct. If I understand it correctly, it
essentially changes "presumption of innocence" (if we get an ack past
what we retransmitted, we assume it was received earlier - i.e. would
have been sacked before if SACK was in use) to "presumption of guilt"
(whenever a retransmitted segment is acked, we assume nothing else acked
with it was received earlier). Or that you trade false negatives for
false positives.

Maybe I understand it wrong but it seems that you de facto prevent
Step (3b) from ever happening in non-SACK case.

> > > No! The window should not update window on ACKs the receiver intends to 
> > > designate as "duplicate ACKs". That is not without some potential cost 
> > > though as it requires delaying window updates up to the next cumulative 
> > > ACK. In the non-SACK series one of the changes is fixing this for
> > > non-SACK Linux TCP flows.
> > 
> > That sounds like a reasonable change (at least at the first glance,
> > I didn't think about it too deeply) but even if we fix Linux stack to
> > behave like this, we cannot force everyone else to do the same.
> 
> Unfortunately I don't know what the other stacks besides Linux do. But 
> for Linux, the cause for the changing receiver window is the receiver 
> window auto-tuning and I'm not sure if other stacks have a similar 
> feature (or if that affects (almost) all ACKs like in Linux).

The capture from my previous e-mail and some others I have seen indicate
that at least some implementations do not take care to never change
window size when responding to an out-of-order segment.  That means that
even if we change linux TCP this way (we might still need to send
a separate window update in some cases), we still cannot rely on others
doing the same.

I checked the capture attached to my previous e-mail again and there is
one thing where our F-RTO implementation (in 4.4, at least) is wrong,
IMHO. While the first ACK after "new data" (sent in (2b)) was a window
update (and therefore not dupack by definition) so that we could take
neither (3a) nor (3b), in some iterations there were further acks which
did not change window size. The text just before Step 1 says

  The F-RTO algorithm does not specify actions for receiving
  a segment that neither acknowledges new data nor is a duplicate
  acknowledgment.  The TCP sender SHOULD ignore such segments and
  wait for a segment that either acknowledges new data or is
  a duplicate acknowledgment.

My understanding is that this means that while the first ack after new
data is correctly ignored, the following ack which preserves window size
should be recognized as a dupack and (3a) should be taken.

Michal Kubecek
Ilpo Järvinen June 15, 2018, 10:35 a.m. UTC | #11
On Fri, 15 Jun 2018, Michal Kubecek wrote:

> On Fri, Jun 15, 2018 at 11:05:03AM +0300, Ilpo Järvinen wrote:
> > On Thu, 14 Jun 2018, Michal Kubecek wrote:
> > 
> > My point was that the new data segment bursts that occur if the sender 
> > isn't application limited indicate that there's something going wrong
> > with FRTO. And that wrong is also what is causing that RTO loop because
> > the sender doesn't see the previous FRTO recovery on second RTO. With 
> > my FRTO undo fix, (new_recovery || icsk->icsk_retransmits) will be false
> > and that will prevent the RTO loop.
> 
> Yes, it would prevent the loop in this case (except it would be a bit
> later, after second RTO rather than after first).

Hmm, I'm actually wrong about the new data missing bit I think. After
reading more code I'm quite sure conventional RTO recovery is triggered 
right away (as long as that bogus undo that ends the recovery wouldn't
occur first like it does without my fix). So no second RTO would be 
needed.

> But I'm not convinced
> the logic of the patch is correct. If I understand it correctly, it
> essentially changes "presumption of innocence" (if we get an ack past
> what we retransmitted, we assume it was received earlier - i.e. would
> have been sacked before if SACK was in use) to "presumption of guilt"
> (whenever a retransmitted segment is acked, we assume nothing else acked
> with it was received earlier). Or that you trade false negatives for
> false positives.

FRTO depends on knowing for sure what packet (original pre-RTO one or 
something that was transmitted post-RTO) triggered the ACK. If FRTO
isn't sure that the ACK was generated by a post-RTO packet, it must
not assume innocence! This change in practice affects just the time while 
the segment rexmitted by RTO is still there, that is, processing in step 2 
(if we get a cumulative ACK beyond it because the next loss is not for the 
subsequent segment but for some later segment, FLAG_ORIG_SACK_ACKED is set 
and we'll incorrectly do step 3b while still in FRTO has only reached step 
2 for real; this is fixed by my patch). ...The decision about 
positive/negative only occurs _after_ that in step 3.

> Maybe I understand it wrong but it seems that you de facto prevent
> Step (3b) from ever happening in non-SACK case.

Only if any of skb that was ACKed had been retransmitted. There shouldn't 
be any in step 3 because the RTO rexmit was ACKed (and also because 
of how new_recovery variable works wrt. earlier recoveries). Thus, in
step 3 the fix won't clear FLAG_ORIG_SACK_ACKED flag allowing FRTO to 
detect spurious RTOs using step 3b.

> > > > No! The window should not update window on ACKs the receiver intends to 
> > > > designate as "duplicate ACKs". That is not without some potential cost 
> > > > though as it requires delaying window updates up to the next cumulative 
> > > > ACK. In the non-SACK series one of the changes is fixing this for
> > > > non-SACK Linux TCP flows.
> > > 
> > > That sounds like a reasonable change (at least at the first glance,
> > > I didn't think about it too deeply) but even if we fix Linux stack to
> > > behave like this, we cannot force everyone else to do the same.
> > 
> > Unfortunately I don't know what the other stacks besides Linux do. But 
> > for Linux, the cause for the changing receiver window is the receiver 
> > window auto-tuning and I'm not sure if other stacks have a similar 
> > feature (or if that affects (almost) all ACKs like in Linux).
> 
> The capture from my previous e-mail and some others I have seen indicate
> that at least some implementations do not take care to never change
> window size when responding to an out-of-order segment.  That means that
> even if we change linux TCP this way (we might still need to send
> a separate window update in some cases), we still cannot rely on others
> doing the same.

Those implementations ignore what is a duplicate ACK (RFC5681, which
is also pointed into by RFC5682 for its defination):
  DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
      "duplicate" ... (e)
      the advertised window in the incoming acknowledgment equals the
      advertised window in the last incoming acknowledgment.

Not sending duplicate ACKs also means that fast recovery will not work
for those flows but that may not show up performance wise as long as you 
have enough capacity for any unnecessary rexmits the forced RTO recovery 
is going to do. RTO recovery may actually improve completion times for 
non-SACK flows as NewReno recovers only one lost pkt/RTT where as RTO 
recovery needs log(outstanding packets) RTTs at worst. For a large number 
of losses in a window, the log is going to win.

> I checked the capture attached to my previous e-mail again and there is
> one thing where our F-RTO implementation (in 4.4, at least) is wrong,
> IMHO. While the first ACK after "new data" (sent in (2b)) was a window
> update (and therefore not dupack by definition) so that we could take
> neither (3a) nor (3b), in some iterations there were further acks which
> did not change window size. The text just before Step 1 says
> 
>   The F-RTO algorithm does not specify actions for receiving
>   a segment that neither acknowledges new data nor is a duplicate
>   acknowledgment.  The TCP sender SHOULD ignore such segments and
>   wait for a segment that either acknowledges new data or is
>   a duplicate acknowledgment.
> 
> My understanding is that this means that while the first ack after new
> data is correctly ignored, the following ack which preserves window size
> should be recognized as a dupack and (3a) should be taken.

Linux FRTO never gets that far (without my fix) if the ACK in step 2
covers beyond the RTO rexmit because 3b is prematurely invoked, that's 
why you never see what would occur if 3a is taken. TCP thinks it's not 
recovering anymore and therefore can send only new data (if there's some 
available).

This is what I tried to tell earlier, with new data there you see there's 
something else wrong too with FRTO besides the RTO loop.
Yuchung Cheng June 27, 2018, 11:56 p.m. UTC | #12
On Fri, Jun 15, 2018 at 3:35 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Fri, 15 Jun 2018, Michal Kubecek wrote:
>
>> On Fri, Jun 15, 2018 at 11:05:03AM +0300, Ilpo Järvinen wrote:
>> > On Thu, 14 Jun 2018, Michal Kubecek wrote:
>> >
>> > My point was that the new data segment bursts that occur if the sender
>> > isn't application limited indicate that there's something going wrong
>> > with FRTO. And that wrong is also what is causing that RTO loop because
>> > the sender doesn't see the previous FRTO recovery on second RTO. With
>> > my FRTO undo fix, (new_recovery || icsk->icsk_retransmits) will be false
>> > and that will prevent the RTO loop.
>>
>> Yes, it would prevent the loop in this case (except it would be a bit
>> later, after second RTO rather than after first).
>
> Hmm, I'm actually wrong about the new data missing bit I think. After
> reading more code I'm quite sure conventional RTO recovery is triggered
> right away (as long as that bogus undo that ends the recovery wouldn't
> occur first like it does without my fix). So no second RTO would be
> needed.
>
>> But I'm not convinced
>> the logic of the patch is correct. If I understand it correctly, it
>> essentially changes "presumption of innocence" (if we get an ack past
>> what we retransmitted, we assume it was received earlier - i.e. would
>> have been sacked before if SACK was in use) to "presumption of guilt"
>> (whenever a retransmitted segment is acked, we assume nothing else acked
>> with it was received earlier). Or that you trade false negatives for
>> false positives.
>
> FRTO depends on knowing for sure what packet (original pre-RTO one or
> something that was transmitted post-RTO) triggered the ACK. If FRTO
> isn't sure that the ACK was generated by a post-RTO packet, it must
> not assume innocence! This change in practice affects just the time while
> the segment rexmitted by RTO is still there, that is, processing in step 2
> (if we get a cumulative ACK beyond it because the next loss is not for the
> subsequent segment but for some later segment, FLAG_ORIG_SACK_ACKED is set
> and we'll incorrectly do step 3b while still in FRTO has only reached step
> 2 for real; this is fixed by my patch). ...The decision about
> positive/negative only occurs _after_ that in step 3.
>
>> Maybe I understand it wrong but it seems that you de facto prevent
>> Step (3b) from ever happening in non-SACK case.
>
> Only if any of skb that was ACKed had been retransmitted. There shouldn't
> be any in step 3 because the RTO rexmit was ACKed (and also because
> of how new_recovery variable works wrt. earlier recoveries). Thus, in
> step 3 the fix won't clear FLAG_ORIG_SACK_ACKED flag allowing FRTO to
> detect spurious RTOs using step 3b.
>
>> > > > No! The window should not update window on ACKs the receiver intends to
>> > > > designate as "duplicate ACKs". That is not without some potential cost
>> > > > though as it requires delaying window updates up to the next cumulative
>> > > > ACK. In the non-SACK series one of the changes is fixing this for
>> > > > non-SACK Linux TCP flows.
>> > >
>> > > That sounds like a reasonable change (at least at the first glance,
>> > > I didn't think about it too deeply) but even if we fix Linux stack to
>> > > behave like this, we cannot force everyone else to do the same.
>> >
>> > Unfortunately I don't know what the other stacks besides Linux do. But
>> > for Linux, the cause for the changing receiver window is the receiver
>> > window auto-tuning and I'm not sure if other stacks have a similar
>> > feature (or if that affects (almost) all ACKs like in Linux).
>>
>> The capture from my previous e-mail and some others I have seen indicate
>> that at least some implementations do not take care to never change
>> window size when responding to an out-of-order segment.  That means that
>> even if we change linux TCP this way (we might still need to send
>> a separate window update in some cases), we still cannot rely on others
>> doing the same.
>
> Those implementations ignore what is a duplicate ACK (RFC5681, which
> is also pointed into by RFC5682 for its defination):
>   DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
>       "duplicate" ... (e)
>       the advertised window in the incoming acknowledgment equals the
>       advertised window in the last incoming acknowledgment.
>
> Not sending duplicate ACKs also means that fast recovery will not work
> for those flows but that may not show up performance wise as long as you
> have enough capacity for any unnecessary rexmits the forced RTO recovery
> is going to do. RTO recovery may actually improve completion times for
> non-SACK flows as NewReno recovers only one lost pkt/RTT where as RTO
> recovery needs log(outstanding packets) RTTs at worst. For a large number
> of losses in a window, the log is going to win.
>
>> I checked the capture attached to my previous e-mail again and there is
>> one thing where our F-RTO implementation (in 4.4, at least) is wrong,
>> IMHO. While the first ACK after "new data" (sent in (2b)) was a window
>> update (and therefore not dupack by definition) so that we could take
>> neither (3a) nor (3b), in some iterations there were further acks which
>> did not change window size. The text just before Step 1 says
>>
>>   The F-RTO algorithm does not specify actions for receiving
>>   a segment that neither acknowledges new data nor is a duplicate
>>   acknowledgment.  The TCP sender SHOULD ignore such segments and
>>   wait for a segment that either acknowledges new data or is
>>   a duplicate acknowledgment.
>>
>> My understanding is that this means that while the first ack after new
>> data is correctly ignored, the following ack which preserves window size
>> should be recognized as a dupack and (3a) should be taken.
>
> Linux FRTO never gets that far (without my fix) if the ACK in step 2
> covers beyond the RTO rexmit because 3b is prematurely invoked, that's
> why you never see what would occur if 3a is taken. TCP thinks it's not
> recovering anymore and therefore can send only new data (if there's some
> available).
>
> This is what I tried to tell earlier, with new data there you see there's
> something else wrong too with FRTO besides the RTO loop.
agreed. Ilpo do you mind re-submitting your fix
https://patchwork.ozlabs.org/patch/883654/ (IIRC I already acked-by)

tcptest suite may have to wait due to some internal workload Neal is juggling.

>
>
> --
>  i.
Ilpo Järvinen June 29, 2018, 10:17 a.m. UTC | #13
On Wed, 27 Jun 2018, Yuchung Cheng wrote:

> On Fri, Jun 15, 2018 at 3:35 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Fri, 15 Jun 2018, Michal Kubecek wrote:
> >
> >> My understanding is that this means that while the first ack after new
> >> data is correctly ignored, the following ack which preserves window size
> >> should be recognized as a dupack and (3a) should be taken.
> >
> > Linux FRTO never gets that far (without my fix) if the ACK in step 2
> > covers beyond the RTO rexmit because 3b is prematurely invoked, that's
> > why you never see what would occur if 3a is taken. TCP thinks it's not
> > recovering anymore and therefore can send only new data (if there's some
> > available).
> >
> > This is what I tried to tell earlier, with new data there you see there's
> > something else wrong too with FRTO besides the RTO loop.
>
> agreed. Ilpo do you mind re-submitting your fix
> https://patchwork.ozlabs.org/patch/883654/ (IIRC I already acked-by)

Resent as individual patch. And, no you didn't ack it but my impression
in general is that we have converged into agreement about the sender side 
fixes including this one but I'm hesitant to interpret such vague 
impression about agreement alone as acked-by.

(Now with hindsight, maybe I should have interpreted this statement
of yours above as acked-by and added it but I already did send it out 
without adding it).

> tcptest suite may have to wait due to some internal workload Neal is 
> juggling. 

Ok, no problem. Thanks for keeping me updated.
diff mbox series

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 355d3dffd021..ed603f987b72 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2001,7 +2001,8 @@  void tcp_enter_loss(struct sock *sk)
 	 */
 	tp->frto = net->ipv4.sysctl_tcp_frto &&
 		   (new_recovery || icsk->icsk_retransmits) &&
-		   !inet_csk(sk)->icsk_mtup.probe_size;
+		   !inet_csk(sk)->icsk_mtup.probe_size &&
+		   (tcp_is_sack(tp) || tp->rx_opt.tstamp_ok);
 }
 
 /* If ACK arrived pointing to a remembered SACK, it means that our