
[net,4/5] tcp: prevent bogus undos when SACK is not enabled

Message ID 1520427569-14365-5-git-send-email-ilpo.jarvinen@helsinki.fi
State Changes Requested, archived
Delegated to: David Miller
Headers show
Series tcp: fixes to non-SACK TCP | expand

Commit Message

Ilpo Järvinen March 7, 2018, 12:59 p.m. UTC
A bogus undo may/will trigger when the loss recovery state is
kept until snd_una is above high_seq. If tcp_any_retrans_done
is zero, retrans_stamp is cleared in this transient state. On
the next ACK, tcp_try_undo_recovery again executes and
tcp_may_undo will always return true because tcp_packet_delayed
has this condition:
    return !tp->retrans_stamp || ...

Check for the false fast retransmit transient condition in
tcp_packet_delayed to avoid bogus undos. Since snd_una may have
advanced on this ACK but CA state still remains unchanged,
prior_snd_una needs to be passed instead of tp->snd_una.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 50 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 16 deletions(-)

Comments

Neal Cardwell March 7, 2018, 8:19 p.m. UTC | #1
On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> A bogus undo may/will trigger when the loss recovery state is
> kept until snd_una is above high_seq. If tcp_any_retrans_done
> is zero, retrans_stamp is cleared in this transient state. On
> the next ACK, tcp_try_undo_recovery again executes and
> tcp_may_undo will always return true because tcp_packet_delayed
> has this condition:
>     return !tp->retrans_stamp || ...
>
> Check for the false fast retransmit transient condition in
> tcp_packet_delayed to avoid bogus undos. Since snd_una may have
> advanced on this ACK but CA state still remains unchanged,
> prior_snd_una needs to be passed instead of tp->snd_una.

This one also seems like a case where it would be nice to have a
specific packet-by-packet example, or trace, or packetdrill scenario.
Something that we might be able to translate into a test, or at least
to document the issue more explicitly.

Thanks!
neal
Yuchung Cheng March 7, 2018, 11:48 p.m. UTC | #2
On Wed, Mar 7, 2018 at 12:19 PM, Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > A bogus undo may/will trigger when the loss recovery state is
> > kept until snd_una is above high_seq. If tcp_any_retrans_done
> > is zero, retrans_stamp is cleared in this transient state. On
> > the next ACK, tcp_try_undo_recovery again executes and
> > tcp_may_undo will always return true because tcp_packet_delayed
> > has this condition:
> >     return !tp->retrans_stamp || ...
> >
> > Check for the false fast retransmit transient condition in
> > tcp_packet_delayed to avoid bogus undos. Since snd_una may have
> > advanced on this ACK but CA state still remains unchanged,
> > prior_snd_una needs to be passed instead of tp->snd_una.
>
> This one also seems like a case where it would be nice to have a
> specific packet-by-packet example, or trace, or packetdrill scenario.
> Something that we might be able to translate into a test, or at least
> to document the issue more explicitly.
I am hesitant to add further logic to make undo "perfect" in non-SACK
cases because undo is very complicated and SACK is extremely well
supported today, so a trace demonstrating how severe this issue is
would be appreciated.

>
> Thanks!
> neal
Ilpo Järvinen March 9, 2018, 2:11 p.m. UTC | #3
On Wed, 7 Mar 2018, Yuchung Cheng wrote:

> On Wed, Mar 7, 2018 at 12:19 PM, Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > > A bogus undo may/will trigger when the loss recovery state is
> > > kept until snd_una is above high_seq. If tcp_any_retrans_done
> > > is zero, retrans_stamp is cleared in this transient state. On
> > > the next ACK, tcp_try_undo_recovery again executes and
> > > tcp_may_undo will always return true because tcp_packet_delayed
> > > has this condition:
> > >     return !tp->retrans_stamp || ...
> > >
> > > Check for the false fast retransmit transient condition in
> > > tcp_packet_delayed to avoid bogus undos. Since snd_una may have
> > > advanced on this ACK but CA state still remains unchanged,
> > > prior_snd_una needs to be passed instead of tp->snd_una.
> >
> > This one also seems like a case where it would be nice to have a
> > specific packet-by-packet example, or trace, or packetdrill scenario.
> > Something that we might be able to translate into a test, or at least
> > to document the issue more explicitly.
>
> I am hesitant to add further logic to make undo "perfect" in non-SACK
> cases because undo is very complicated and SACK is extremely well
> supported today, so a trace demonstrating how severe this issue is
> would be appreciated.

This is not just some remote corner case, for which I might understand
your "making undo perfect" comment. These undos result in a burst that,
at worst, triggers additional buffer overflows because the correct CC
action is cancelled. Unfortunately I don't currently have permission to
publish the time-seq graph of it, but I've tried to improve the
changelog messages so that you can better understand the conditions
under which the problem occurs.

The SACK case remains the same even after this change. I did rework the
logic a bit, though (passing prior_snd_una rather than a flag around),
to make it more obvious, but the change is unfortunately lengthy (no
matter what I pass through the call chain).
Eric Dumazet March 9, 2018, 2:32 p.m. UTC | #4
On 03/09/2018 06:11 AM, Ilpo Järvinen wrote:
> On Wed, 7 Mar 2018, Yuchung Cheng wrote:
> 
>> On Wed, Mar 7, 2018 at 12:19 PM, Neal Cardwell <ncardwell@google.com> wrote:
>>>
>>> On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
>>>> A bogus undo may/will trigger when the loss recovery state is
>>>> kept until snd_una is above high_seq. If tcp_any_retrans_done
>>>> is zero, retrans_stamp is cleared in this transient state. On
>>>> the next ACK, tcp_try_undo_recovery again executes and
>>>> tcp_may_undo will always return true because tcp_packet_delayed
>>>> has this condition:
>>>>      return !tp->retrans_stamp || ...
>>>>
>>>> Check for the false fast retransmit transient condition in
>>>> tcp_packet_delayed to avoid bogus undos. Since snd_una may have
>>>> advanced on this ACK but CA state still remains unchanged,
>>>> prior_snd_una needs to be passed instead of tp->snd_una.
>>>
>>> This one also seems like a case where it would be nice to have a
>>> specific packet-by-packet example, or trace, or packetdrill scenario.
>>> Something that we might be able to translate into a test, or at least
>>> to document the issue more explicitly.
>>
>> I am hesitant to add further logic to make undo "perfect" in non-SACK
>> cases because undo is very complicated and SACK is extremely well
>> supported today, so a trace demonstrating how severe this issue is
>> would be appreciated.
> 
> This is not just some remote corner case, for which I might understand
> your "making undo perfect" comment. These undos result in a burst that,
> at worst, triggers additional buffer overflows because the correct CC
> action is cancelled. Unfortunately I don't currently have permission to
> publish the time-seq graph of it, but I've tried to improve the
> changelog messages so that you can better understand the conditions
> under which the problem occurs.
> 
> The SACK case remains the same even after this change. I did rework the
> logic a bit, though (passing prior_snd_una rather than a flag around),
> to make it more obvious, but the change is unfortunately lengthy (no
> matter what I pass through the call chain).
> 
> 

Can you then provide packetdrill test(s) if you cannot provide traces?

If we cannot have tests, we will likely have future regressions, since
most development is done with SACK in mind.

The fact that these bugs have gone unnoticed for years is concerning.

We aim to update packetdrill and to publish our regression suite (about
1500 TCP tests) soon.

CC'ing Willem, who is working on that.
David Miller March 9, 2018, 3:23 p.m. UTC | #5
From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)

> Unfortunately I don't have now permission to publish the time-seq
> graph about it but I've tried to improve the changelog messages so
> that you can better understand under which conditions the problem
> occurs.

It is indeed extremely unfortunate that you wish to justify a change
for which you cannot provide the supporting data at all.
David Miller March 9, 2018, 3:28 p.m. UTC | #6
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 9 Mar 2018 06:32:21 -0800

> We aim to update packetdrill and to publish our regression suite
> (about 1500 TCP tests) soon.
> 
> CC'ing Willem, who is working on that.

Excellent news!
Ilpo Järvinen March 9, 2018, 7:23 p.m. UTC | #7
On Fri, 9 Mar 2018, David Miller wrote:

> From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)
> 
> > Unfortunately I don't have now permission to publish the time-seq
> > graph about it but I've tried to improve the changelog messages so
> > that you can better understand under which conditions the problem
> > occurs.
> 
> It is indeed extremely unfortunate that you wish to justify a change
> for which you cannot provide the supporting data at all.

Well, permission for the time-seq graphs was/is still simply pending,
and then my patience with holding back v2 ran out :-).

Given that I perceived right from the start that the main purpose of
that "data" was to create a packetdrill test for that test suite (which
is still in the soon-to-be-published state after all these years ;-)),
I didn't see significant benefit in a time-seq graph compared with the
more verbose description that is now in place in the changelog. In
fact, I think a time-seq graph from well into the flow would be less
useful than the current explanation in the changelog if one wants to
come up with a packetdrill test, but I'm no packetdrill expert.
Ilpo Järvinen March 13, 2018, 10:24 a.m. UTC | #8
On Fri, 9 Mar 2018, David Miller wrote:

> From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)
> 
> > Unfortunately I don't have now permission to publish the time-seq
> > graph about it but I've tried to improve the changelog messages so
> > that you can better understand under which conditions the problem
> > occurs.
> 
> It is indeed extremely unfortunate that you wish to justify a change
> for which you cannot provide the supporting data at all.

Here is the time-seq graph of the issue:

https://www.cs.helsinki.fi/u/ijjarvin/linux/nonsackbugs/recovery_undo_bug.pdf

First the correct CC action (window reduction) occurs; then the bogus
undo causes a burst back up to the window at which the congestion losses
occurred earlier; because of the burst, some packets are again lost to
congestion.

The sender is actually somewhat lucky here: if only one packet were lost
instead of three, the same process would repeat for the next recovery
(as the cumulative-ACK-to-high_seq condition would reoccur).

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e20f9ad..b689915 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,6 +98,7 @@  int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_UPDATE_TS_RECENT	0x4000 /* tcp_replace_ts_recent() */
 #define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_send_challenge_ack()	*/
 #define FLAG_ACK_MAYBE_DELAYED	0x10000 /* Likely a delayed ACK */
+#define FLAG_PACKET_DELAYED	0x20000 /* 0 rexmits or tstamps reveal delayed pkt */
 
 #define FLAG_ACKED		(FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP		(FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -2243,10 +2244,19 @@  static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct sock *sk,
+				      const u32 prior_snd_una)
 {
-	return !tp->retrans_stamp ||
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tp->retrans_stamp) {
+		/* Sender will be in a transient state with cleared
+		 * retrans_stamp during false fast retransmit prevention
+		 * mechanism
+		 */
+		return !tcp_false_fast_retrans_possible(sk, prior_snd_una);
+	}
+	return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2336,17 +2346,20 @@  static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
 	tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct sock *sk, const int flag)
 {
-	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	return tp->undo_marker &&
+	       (!tp->undo_retrans || (flag & FLAG_PACKET_DELAYED));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk, flag)) {
 		int mib_idx;
 
 		/* Happy end! We did not retransmit anything
@@ -2393,11 +2406,11 @@  static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo, const int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (frto_undo || tcp_may_undo(tp)) {
+	if (frto_undo || tcp_may_undo(sk, flag)) {
 		tcp_undo_cwnd_reduction(sk, true);
 
 		DBGUNDO(sk, "partial loss");
@@ -2636,7 +2649,7 @@  static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 	bool recovered = !before(tp->snd_una, tp->high_seq);
 
 	if ((flag & FLAG_SND_UNA_ADVANCED) &&
-	    tcp_try_undo_loss(sk, false))
+	    tcp_try_undo_loss(sk, false, flag))
 		return;
 
 	if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2649,7 +2662,7 @@  static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 		 */
 		if ((flag & FLAG_ORIG_SACK_ACKED) &&
 		    (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
-		    tcp_try_undo_loss(sk, true))
+		    tcp_try_undo_loss(sk, true, flag))
 			return;
 
 		if (after(tp->snd_nxt, tp->high_seq)) {
@@ -2672,7 +2685,7 @@  static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 
 	if (recovered) {
 		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
-		tcp_try_undo_recovery(sk);
+		tcp_try_undo_recovery(sk, flag);
 		return;
 	}
 	if (tcp_is_reno(tp)) {
@@ -2688,11 +2701,12 @@  static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 }
 
 /* Undo during fast recovery after partial ACK. */
-static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
+static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una,
+				 const int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->undo_marker && tcp_packet_delayed(tp)) {
+	if (tp->undo_marker && (flag & FLAG_PACKET_DELAYED)) {
 		/* Plain luck! Hole if filled with delayed
 		 * packet, rather than with a retransmit. Check reordering.
 		 */
@@ -2795,13 +2809,17 @@  static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		case TCP_CA_Recovery:
 			if (tcp_is_reno(tp))
 				tcp_reset_reno_sack(tp);
-			if (tcp_try_undo_recovery(sk))
+			if (tcp_try_undo_recovery(sk, flag))
 				return;
 			tcp_end_cwnd_reduction(sk);
 			break;
 		}
 	}
 
+	if (icsk->icsk_ca_state >= TCP_CA_Recovery &&
+	    tcp_packet_delayed(sk, prior_snd_una))
+		flag |= FLAG_PACKET_DELAYED;
+
 	/* E. Process state. */
 	switch (icsk->icsk_ca_state) {
 	case TCP_CA_Recovery:
@@ -2809,7 +2827,7 @@  static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 			if (tcp_is_reno(tp) && is_dupack)
 				tcp_add_reno_sack(sk);
 		} else {
-			if (tcp_try_undo_partial(sk, prior_snd_una))
+			if (tcp_try_undo_partial(sk, prior_snd_una, flag))
 				return;
 			/* Partial ACK arrived. Force fast retransmit. */
 			do_lost = tcp_is_reno(tp) ||