diff mbox

[1/1] tcp: fix spurious undos in CA_Open

Message ID alpine.DEB.2.00.1112121359470.15137@wel-95.cs.helsinki.fi
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Ilpo Järvinen Dec. 12, 2011, 12:05 p.m. UTC
On Mon, 28 Nov 2011, Neal Cardwell wrote:

> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
>
> >> Also, one (serious) word of caution! This change, by its extending of
> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
> >> managed to introduces subtle breaking to the state machine. Please check
> >> that the problem similar to what was fixed by
> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
> >> some other form causing spurious undos). ...To me it seems that
> >> tcp_packet_delayed might be similarly confused after the given patch.
> >
> > Neil, please look into this so we can have any such issues fixed
> > in time for the next merge window.
> 
> Absolutely. I will look into the areas that Ilpo mentioned.

Unfortunately nothing has happened? So I finally had to come up something 
myself. ...Compile tested only.

--
[PATCH 1/1] tcp: fix spurious undos in CA_Open

There's a landmine in tcp_packet_delayed. In CA_Open the
retrans_stamp is very eagerly cleared which falsely triggers
packet-delayed detection. Therefore this code cannot be used
in CA_Open if the retrans_stamp was already cleared.

This became issue once DSACK detection was extended to happen
in CA_Open state too (f698204bd0b, tcp: allow undo from
reordered DSACKs). Essentially the whole congestion control
is disabled because the undos now always restore the previous
window. I wonder if this was already in production... ...at
least the Internet didn't melt ;-).

Compile tested only.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

Comments

Neal Cardwell Dec. 12, 2011, 7:40 p.m. UTC | #1
On Mon, Dec 12, 2011 at 7:05 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 28 Nov 2011, Neal Cardwell wrote:
>
>> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
>>
>> >> Also, one (serious) word of caution! This change, by its extending of
>> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
>> >> managed to introduces subtle breaking to the state machine. Please check
>> >> that the problem similar to what was fixed by
>> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
>> >> some other form causing spurious undos). ...To me it seems that
>> >> tcp_packet_delayed might be similarly confused after the given patch.
>> >
>> > Neil, please look into this so we can have any such issues fixed
>> > in time for the next merge window.
>>
>> Absolutely. I will look into the areas that Ilpo mentioned.
>
> Unfortunately nothing has happened? So I finally had to come up something
> myself. ...Compile tested only.

Sorry for the radio silence, but I have been looking at this. So far
it's been a time-consuming process, as we've uncovered a number of
pre-existing issues in tcp_packet_delayed that seem like quite serious
bugs:

1) tcp_packet_delayed implicitly assumes, but does not verify, that
the ACK in question (whose ECR field precedes retrans_stamp) covers
all retransmitted segments. So if we retransmit segments 1 and 2, and
then get an ACK for 1 that has an ECR indicating it was for the
original copy of segment 1, then we undo,  even in cases where segment
2 really was lost and needed to be retransmitted.

2) tcp_packet_delayed implicitly assumes that if we retransmit a
packet that was truly lost then any ACK for the retransmitted instance
of that packet will echo the timestamp field in the retransmitted
packet. But if the receiver is implementing delayed ACKs and correctly
implements timestamps per RFC 1323, then the receiver will be obeying
this clause: "Thus, when delayed ACKs are in use, the receiver should
reply with the TSval field from the earliest unacknowledged segment."
Thus tcp_packet_delayed can cause spurious undos in the following
circumstances:

- send segment 1 with tsval 10
- receiver is doing delayed ACKs and does not ACK 1 but sets TS.Recent to 10
- send segment 2 with tsval 20
- segment 2 is lost in transit
- RTO, retransmit segment 2 with tsval 30
- receiver sends delayed ACK for packets 1 and 2 with TS.Recent of 10
- sender receives ACK for packet 2 with tsecho of 10, lower than 30,
and thinks RTO was spurious

For better or worse, both of these bugs appear to be inherited from
the Eifel Detection Algorithm RFC (RFC 3522).

In any case, we've been working on coming up with a solution for these issues.

> --
> [PATCH 1/1] tcp: fix spurious undos in CA_Open
>
> There's a landmine in tcp_packet_delayed. In CA_Open the
> retrans_stamp is very eagerly cleared which falsely triggers
> packet-delayed detection. Therefore this code cannot be used
> in CA_Open if the retrans_stamp was already cleared.
>
> This became issue once DSACK detection was extended to happen
> in CA_Open state too (f698204bd0b, tcp: allow undo from
> reordered DSACKs). Essentially the whole congestion control
> is disabled because the undos now always restore the previous
> window. I wonder if this was already in production... ...at
> least the Internet didn't melt ;-).

I'm pretty sure this commit is unnecessary. It seems like a NOP. Note
that tcp_may_undo and tcp_packet_delayed are never called in state
TCP_CA_Open. tcp_may_undo is only called from tcp_try_undo_recovery,
tcp_try_undo_partial, and tcp_try_undo_loss, and to call any of these
functions you must be in either TCP_CA_Loss or TCP_CA_Recovery. Only
tcp_try_undo_dsack is called from TCP_CA_Open, but it makes no
reference to retrans_stamp.

Please let me know if I'm missing something, or if there's a specific
scenario you're concerned about.

thanks,
neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ilpo Järvinen Dec. 14, 2011, 8:57 a.m. UTC | #2
On Mon, 12 Dec 2011, Neal Cardwell wrote:

> On Mon, Dec 12, 2011 at 7:05 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Mon, 28 Nov 2011, Neal Cardwell wrote:
> >
> >> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
> >>
> >> >> Also, one (serious) word of caution! This change, by its extending of
> >> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
> >> >> managed to introduces subtle breaking to the state machine. Please check
> >> >> that the problem similar to what was fixed by
> >> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
> >> >> some other form causing spurious undos). ...To me it seems that
> >> >> tcp_packet_delayed might be similarly confused after the given patch.
> >> >
> >> > Neil, please look into this so we can have any such issues fixed
> >> > in time for the next merge window.
> >>
> >> Absolutely. I will look into the areas that Ilpo mentioned.
> >
> > Unfortunately nothing has happened? So I finally had to come up something
> > myself. ...Compile tested only.
> 
> Sorry for the radio silence, but I have been looking at this. So far
> it's been a time-consuming process, as we've uncovered a number of
> pre-existing issues in tcp_packet_delayed that seem like quite serious
> bugs:
> 
> 1) tcp_packet_delayed implicitly assumes, but does not verify, that
> the ACK in question (whose ECR field precedes retrans_stamp) covers
> all retransmitted segments. So if we retransmit segments 1 and 2, and
> then get an ACK for 1 that has an ECR indicating it was for the
> original copy of segment 1, then we undo,  even in cases where segment
> 2 really was lost and needed to be retransmitted.

I think this might be an mis-implementation in tcp_try_undo_partial.

> 2) tcp_packet_delayed implicitly assumes that if we retransmit a
> packet that was truly lost then any ACK for the retransmitted instance
> of that packet will echo the timestamp field in the retransmitted
> packet. But if the receiver is implementing delayed ACKs and correctly
> implements timestamps per RFC 1323, then the receiver will be obeying
> this clause: "Thus, when delayed ACKs are in use, the receiver should
> reply with the TSval field from the earliest unacknowledged segment."
> Thus tcp_packet_delayed can cause spurious undos in the following
> circumstances:
>
> - send segment 1 with tsval 10
> - receiver is doing delayed ACKs and does not ACK 1 but sets TS.Recent to 10
> - send segment 2 with tsval 20
> - segment 2 is lost in transit
> - RTO, retransmit segment 2 with tsval 30
> - receiver sends delayed ACK for packets 1 and 2 with TS.Recent of 10
> - sender receives ACK for packet 2 with tsecho of 10, lower than 30,
> and thinks RTO was spurious

About this I'm actually aware of, I have even some long email postponed 
related to this to discuss in context of 1323bis (as 1323bis seems pretty 
much standstill I'm not in a big hurry :-)). I think 1323 algo is flawed 
anyway and it was fixed/clarified/better in the bis (which doesn't 
address this well enough though).

I discovered this while I setup a network with ACK losses to figure out 
other issue. I discovered that the RTT measurements occassionally took 
"the wrong TS" bloating the RTT estimate. As RTT samples included RTO 
delays, with enough ACK losses the RTO eventually hits the max value. 
...As a result, not very nice performance was acquired for that such a 
flow.

I think there are two possible solutions:
  1) Echo latest timestamp for below window packets (I'm not sure if 
  there's can of worms w.r.t. security in this approach). A good thing
  in this would be that it would really solve the rexmit ambiguity problem 
  which the current approach does not do.
  2) Ignore packets with DSACK in RTT measurement.

1323bis is not of much help in this which TS to echo (which is why I 
intented to mention it on tcpm). IIRC, it depended on which takes 
precedence: RFC793 in-window check or the given TS algorithm, which is 
not discussed anywhere.

> For better or worse, both of these bugs appear to be inherited from
> the Eifel Detection Algorithm RFC (RFC 3522).

RFC3522 does not have the issue 1) as it's working after RTO+first ACK 
only. The second issue is valid for it though.

> In any case, we've been working on coming up with a solution for these issues.
> 
> > --
> > [PATCH 1/1] tcp: fix spurious undos in CA_Open
> >
> > There's a landmine in tcp_packet_delayed. In CA_Open the
> > retrans_stamp is very eagerly cleared which falsely triggers
> > packet-delayed detection. Therefore this code cannot be used
> > in CA_Open if the retrans_stamp was already cleared.
> >
> > This became issue once DSACK detection was extended to happen
> > in CA_Open state too (f698204bd0b, tcp: allow undo from
> > reordered DSACKs). Essentially the whole congestion control
> > is disabled because the undos now always restore the previous
> > window. I wonder if this was already in production... ...at
> > least the Internet didn't melt ;-).
> 
> I'm pretty sure this commit is unnecessary. It seems like a NOP. Note
> that tcp_may_undo and tcp_packet_delayed are never called in state
> TCP_CA_Open. tcp_may_undo is only called from tcp_try_undo_recovery,
> tcp_try_undo_partial, and tcp_try_undo_loss, and to call any of these
> functions you must be in either TCP_CA_Loss or TCP_CA_Recovery. Only
> tcp_try_undo_dsack is called from TCP_CA_Open, but it makes no
> reference to retrans_stamp.

It seems that I've missed this difference in tcp_try_undo_dsack (well, 
seen it but never thought it that much). I think you're right. Thanks for 
taking a look.

Some WARN_ON to enforce this !CA_Open condition there might not hurt 
though.
Neal Cardwell Dec. 18, 2011, 6:23 p.m. UTC | #3
On Wed, Dec 14, 2011 at 3:57 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 12 Dec 2011, Neal Cardwell wrote:
>
>> On Mon, Dec 12, 2011 at 7:05 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> > On Mon, 28 Nov 2011, Neal Cardwell wrote:
>> >
>> >> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
>> >>
>> >> >> Also, one (serious) word of caution! This change, by its extending of
>> >> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
>> >> >> managed to introduces subtle breaking to the state machine. Please check
>> >> >> that the problem similar to what was fixed by
>> >> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
>> >> >> some other form causing spurious undos). ...To me it seems that
>> >> >> tcp_packet_delayed might be similarly confused after the given patch.
>> >> >
>> >> > Neil, please look into this so we can have any such issues fixed
>> >> > in time for the next merge window.
>> >>
>> >> Absolutely. I will look into the areas that Ilpo mentioned.
>> >
>> > Unfortunately nothing has happened? So I finally had to come up something
>> > myself. ...Compile tested only.
>>
>> Sorry for the radio silence, but I have been looking at this. So far
>> it's been a time-consuming process, as we've uncovered a number of
>> pre-existing issues in tcp_packet_delayed that seem like quite serious
>> bugs:
>>
>> 1) tcp_packet_delayed implicitly assumes, but does not verify, that
>> the ACK in question (whose ECR field precedes retrans_stamp) covers
>> all retransmitted segments. So if we retransmit segments 1 and 2, and
>> then get an ACK for 1 that has an ECR indicating it was for the
>> original copy of segment 1, then we undo,  even in cases where segment
>> 2 really was lost and needed to be retransmitted.
>
> I think this might be an mis-implementation in tcp_try_undo_partial.

Yeah, I'm pretty sure this is a bug. There needs to be some additional
logic, whether in tcp_try_undo_partial or tcp_packet_delayed, or
somewhere else, to fix this bug.


>> 2) tcp_packet_delayed implicitly assumes that if we retransmit a
>> packet that was truly lost then any ACK for the retransmitted instance
>> of that packet will echo the timestamp field in the retransmitted
>> packet. But if the receiver is implementing delayed ACKs and correctly
>> implements timestamps per RFC 1323, then the receiver will be obeying
>> this clause: "Thus, when delayed ACKs are in use, the receiver should
>> reply with the TSval field from the earliest unacknowledged segment."
>> Thus tcp_packet_delayed can cause spurious undos in the following
>> circumstances:
>>
>> - send segment 1 with tsval 10
>> - receiver is doing delayed ACKs and does not ACK 1 but sets TS.Recent to 10
>> - send segment 2 with tsval 20
>> - segment 2 is lost in transit
>> - RTO, retransmit segment 2 with tsval 30
>> - receiver sends delayed ACK for packets 1 and 2 with TS.Recent of 10
>> - sender receives ACK for packet 2 with tsecho of 10, lower than 30,
>> and thinks RTO was spurious
>
> About this I'm actually aware of, I have even some long email postponed
> related to this to discuss in context of 1323bis (as 1323bis seems pretty
> much standstill I'm not in a big hurry :-)). I think 1323 algo is flawed
> anyway and it was fixed/clarified/better in the bis (which doesn't
> address this well enough though).
>
> I discovered this while I setup a network with ACK losses to figure out
> other issue. I discovered that the RTT measurements occassionally took
> "the wrong TS" bloating the RTT estimate. As RTT samples included RTO
> delays, with enough ACK losses the RTO eventually hits the max value.
> ...As a result, not very nice performance was acquired for that such a
> flow.
>
> I think there are two possible solutions:
>  1) Echo latest timestamp for below window packets (I'm not sure if
>  there's can of worms w.r.t. security in this approach). A good thing
>  in this would be that it would really solve the rexmit ambiguity problem
>  which the current approach does not do.
>  2) Ignore packets with DSACK in RTT measurement.
>
> 1323bis is not of much help in this which TS to echo (which is why I
> intented to mention it on tcpm). IIRC, it depended on which takes
> precedence: RFC793 in-window check or the given TS algorithm, which is
> not discussed anywhere.

Interesting. I hope the standards get improved in this area.


>> For better or worse, both of these bugs appear to be inherited from
>> the Eifel Detection Algorithm RFC (RFC 3522).
>
> RFC3522 does not have the issue 1) as it's working after RTO+first ACK
> only. The second issue is valid for it though.

Actually RFC 3522 does have issue 1) because it is also in effect for
fast retransmit / fast recovery; RFC 3522 says "If the Eifel detection
algorithm is used, the following steps MUST be taken by a TCP sender,
but only upon initiation of loss recovery, i.e., when either the
timeout-based retransmit or the fast retransmit is sent."

In addition, RFC 3517 allows senders to send out multiple
retransmissions upon the first ACK that initiates fast recovery:
Section 5., Algorithm Details, says "(5) In order to take advantage of
potential additional available cwnd, proceed to step (C) below. ...
(C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more
segments." As a consequence, senders can have multiple retransmissions
in flight at the point they get the first ACK during fast recovery, at
which point they run the Eifel detection algorithm.

So I'm pretty sure 1) is an issue both in the RFCs and in Linux.


>> In any case, we've been working on coming up with a solution for these issues.
>>
>> > --
>> > [PATCH 1/1] tcp: fix spurious undos in CA_Open
>> >
>> > There's a landmine in tcp_packet_delayed. In CA_Open the
>> > retrans_stamp is very eagerly cleared which falsely triggers
>> > packet-delayed detection. Therefore this code cannot be used
>> > in CA_Open if the retrans_stamp was already cleared.
>> >
>> > This became issue once DSACK detection was extended to happen
>> > in CA_Open state too (f698204bd0b, tcp: allow undo from
>> > reordered DSACKs). Essentially the whole congestion control
>> > is disabled because the undos now always restore the previous
>> > window. I wonder if this was already in production... ...at
>> > least the Internet didn't melt ;-).
>>
>> I'm pretty sure this commit is unnecessary. It seems like a NOP. Note
>> that tcp_may_undo and tcp_packet_delayed are never called in state
>> TCP_CA_Open. tcp_may_undo is only called from tcp_try_undo_recovery,
>> tcp_try_undo_partial, and tcp_try_undo_loss, and to call any of these
>> functions you must be in either TCP_CA_Loss or TCP_CA_Recovery. Only
>> tcp_try_undo_dsack is called from TCP_CA_Open, but it makes no
>> reference to retrans_stamp.
>
> It seems that I've missed this difference in tcp_try_undo_dsack (well,
> seen it but never thought it that much). I think you're right. Thanks for
> taking a look.
>
> Some WARN_ON to enforce this !CA_Open condition there might not hurt
> though.

Sounds reasonable. I'll think about this as I work on coming up with a
fix for the 1) and 2) bugs discussed above.

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 52b5c2d..7d8934d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2640,8 +2640,14 @@  static void tcp_cwnd_down(struct sock *sk, int flag)
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline int tcp_packet_delayed(const struct tcp_sock *tp)
+static inline int tcp_packet_delayed(const struct sock *sk)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ca_state == TCP_CA_Open && !tp->retrans_stamp)
+		return 0;
+
 	return !tp->retrans_stamp ||
 		(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
 		 before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp));
@@ -2701,9 +2707,11 @@  static void tcp_undo_cwr(struct sock *sk, const bool undo_ssthresh)
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
-static inline int tcp_may_undo(const struct tcp_sock *tp)
+static inline int tcp_may_undo(const struct sock *sk)
 {
-	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(sk));
 }
 
 /* People celebrate: "We love our President!" */
@@ -2711,7 +2719,7 @@  static int tcp_try_undo_recovery(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk)) {
 		int mib_idx;
 
 		/* Happy end! We did not retransmit anything
@@ -2788,7 +2796,7 @@  static int tcp_try_undo_partial(struct sock *sk, int acked)
 	/* Partial ACK arrived. Force Hoe's retransmit. */
 	int failed = tcp_is_reno(tp) || (tcp_fackets_out(tp) > tp->reordering);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk)) {
 		/* Plain luck! Hole if filled with delayed
 		 * packet, rather than with a retransmit.
 		 */
@@ -2815,7 +2823,7 @@  static int tcp_try_undo_loss(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk)) {
 		struct sk_buff *skb;
 		tcp_for_write_queue(skb, sk) {
 			if (skb == tcp_send_head(sk))