tcp: fix tcp_retransmit_skb() to maintain MSS invariant

Message ID 1330698449-2969-1-git-send-email-ncardwell@google.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Neal Cardwell March 2, 2012, 2:27 p.m. UTC
This commit fixes tcp_retransmit_skb() to respect the invariant that
an skb in the write queue that might be SACKed (that is, that precedes
tcp_send_head()) is either less than tcp_skb_mss(skb) or an integral
multiple of tcp_skb_mss(skb).

Various parts of the TCP code maintain or assume this invariant,
including at least tcp_write_xmit(), tcp_mss_split_point(),
tcp_match_skb_to_sack(), and tcp_shifted_skb().
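
Stated as a predicate, the invariant looks roughly like this (an
illustrative kernel-style helper, not part of the patch; note that
tcp_skb_mss() is 0 for a non-GSO skb, which trivially satisfies the
invariant):

    /* True iff skb honors the write-queue MSS invariant described above. */
    static bool tcp_skb_mss_aligned(const struct sk_buff *skb)
    {
        unsigned int mss = tcp_skb_mss(skb);

        /* A non-GSO skb carries a single (possibly sub-MSS) segment. */
        if (!mss)
            return true;
        return skb->len < mss || skb->len % mss == 0;
    }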

tcp_retransmit_skb() did not maintain this invariant. It checked the
current MSS and called tcp_fragment() to make sure that the skb we're
retransmitting is at most cur_mss, but in the process it took the
excess bytes and created an arbitrary-length skb (one that is not
necessarily an integral multiple of its MSS) and inserted it in the
write queue after the skb we're retransmitting.

One potential indirect effect of this problem is tcp_shifted_skb()
creating a coalesced SACKed skb that has a pcount that is 1 too large
for its length. This happened because tcp_shifted_skb() assumed that
skbs are integral multiples of MSS, so you can just add pcounts of
input skbs to find the pcount of the output skb. Suspected specific
symptoms of this problem include the WARN_ON(len > skb->len) in
tcp_fragment() firing, as the 1-too-large pcount ripples through to
tcp_mark_head_lost() trying to chop off too many bytes to mark as
lost.
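
To make the failure mode concrete, here is the arithmetic with made-up
numbers (cur_mss = 1000; illustrative only, not from an actual trace):

    unsigned int mss = 1000;
    /* Two sub-MSS skbs left adjacent in the queue by the buggy split: */
    unsigned int len_a = 600, pcount_a = 1;
    unsigned int len_b = 300, pcount_b = 1;

    /* tcp_shifted_skb() coalesces them by summing pcounts: */
    unsigned int merged_len = len_a + len_b;          /* 900 bytes */
    unsigned int merged_pcount = pcount_a + pcount_b; /* 2, though 900 bytes
                                                       * is one packet at this MSS */

    /* tcp_mark_head_lost() later converts pcount back into bytes: */
    unsigned int lost = merged_pcount * mss;          /* 2000 > 900, so
                                                       * WARN_ON(len > skb->len) fires */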

It's also possible this bug is related to recent reports of sacked_out
becoming negative.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_output.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 43 insertions(+), 1 deletions(-)

Comments

Yuchung Cheng March 2, 2012, 4:28 p.m. UTC | #1
On Fri, Mar 2, 2012 at 6:27 AM, Neal Cardwell <ncardwell@google.com> wrote:
> This commit fixes tcp_retransmit_skb() to respect the invariant that
> an skb in the write queue that might be SACKed (that is, that precedes
> tcp_send_head()) is either less than tcp_skb_mss(skb) or an integral
> multiple of tcp_skb_mss(skb).
>
> Various parts of the TCP code maintain or assume this invariant,
> including at least tcp_write_xmit(), tcp_mss_split_point(),
> tcp_match_skb_to_sack(), and tcp_shifted_skb().
>
> tcp_retransmit_skb() did not maintain this invariant. It checked the
> current MSS and called tcp_fragment() to make sure that the skb we're
> retransmitting is at most cur_mss, but in the process it took the
> excess bytes and created an arbitrary-length skb (one that is not
> necessarily an integral multiple of its MSS) and inserted it in the
> write queue after the skb we're retransmitting.
>
> One potential indirect effect of this problem is tcp_shifted_skb()
> creating a coalesced SACKed skb that has a pcount that is 1 too large
> for its length. This happened because tcp_shifted_skb() assumed that
> skbs are integral multiples of MSS, so you can just add pcounts of
> input skbs to find the pcount of the output skb. Suspected specific
> symptoms of this problem include the WARN_ON(len > skb->len) in
> tcp_fragment() firing, as the 1-too-large pcount ripples through to
> tcp_mark_head_lost() trying to chop off too many bytes to mark as
> lost.
>
> It's also possible this bug is related to recent reports of sacked_out
> becoming negative.
>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>

I especially like the comment about the invariant, which is less
explicit in other parts of the GSO code.


Ilpo Järvinen March 2, 2012, 8:25 p.m. UTC | #2
On Fri, 2 Mar 2012, Neal Cardwell wrote:

> This commit fixes tcp_retransmit_skb() to respect the invariant that
> an skb in the write queue that might be SACKed (that is, that precedes
> tcp_send_head()) is either less than tcp_skb_mss(skb) or an integral
> multiple of tcp_skb_mss(skb).
> 
> Various parts of the TCP code maintain or assume this invariant,
> including at least tcp_write_xmit(), tcp_mss_split_point(),
> tcp_match_skb_to_sack(), and tcp_shifted_skb().

What invariant? We've never had such an invariant as you describe?!?
It should be perfectly OK for the last part of the super-skb to have
less than MSS bytes. The last one can be of arbitrary size, and you
should be able to shift/merge as many of them (the non-full last
segments) as you want and can fit into a super-skb. The only condition
which should allow/need resplitting these is the retransmit after
reneging, and therefore we can keep the pcount changes minimal to avoid
bogus "window adjustments" while nothing was changed on the wire for
real. ...One more thought: perhaps the reneging + partially re-SACKed
case is broken for real, but it should be solved while the reneging is
being processed, by doing all those window-affecting pcount tweaks only
then. It is not a too likely scenario in the first place (though I
suppose this could even lead to a negative pcount at worst, which is
certainly a bad thing to happen).

Please note though that some SACK(tag) code still needs to maintain
the MSS boundaries decided at send time while shifting partial skbs,
but even that would not be necessary if the DSACK code were more clever
than it is at the moment. I used to have some plans to fix this too,
but it would have depended on the rbtree stuff, which wasn't/isn't
there because that patchset is still in a rots-in-some-git state.

> tcp_retransmit_skb() did not maintain this invariant. It checked the
> current MSS and called tcp_fragment() to make sure that the skb we're
> retransmitting is at most cur_mss, but in the process it took the
> excess bytes and created an arbitrary-length skb (one that is not
> necessarily an integral multiple of its MSS) and inserted it in the
> write queue after the skb we're retransmitting.

Also, if the MSS changes after userspace writes, we'll have the very
same "problem" (and other "normal behavior" leaves those non-full skbs
too, e.g. due to a slow application). ...This should really be a
non-issue if the code is properly written.

> One potential indirect effect of this problem is tcp_shifted_skb()
> creating a coalesced SACKed skb that has a pcount that is 1 too large
> for its length. This happened because tcp_shifted_skb() assumed that
> skbs are integral multiples of MSS, so you can just add pcounts of
> input skbs to find the pcount of the output skb.

...It does _not_ assume what you say! ...Instead, it does this on
purpose. Anything else is the wrong thing, because it would affect
window calculations just to get our clever GSO hacks to work.

SACK code should maintain the original pcounts from send time, not
change them! So if we sent MSS+1 and MSS+1, we should have (1+1)+(1+1)
as the pcount after shifting, instead of 1+1+1. ...This keeping of the
original pcounts is the real invariant I used while writing the new
sacktag code with shifting.
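
In concrete terms (a standalone sketch of the arithmetic, not kernel code):

    unsigned int mss = 1000;
    unsigned int len = mss + 1;                   /* each sent skb is MSS+1 bytes */
    unsigned int pcount = (len + mss - 1) / mss;  /* DIV_ROUND_UP: 2 per skb */

    /* Shift/merge preserving the sent-time pcounts: */
    unsigned int preserved = pcount + pcount;             /* (1+1)+(1+1) = 4 */

    /* Recomputing from the merged length instead: */
    unsigned int recomputed = (2 * len + mss - 1) / mss;  /* 3 */

    /* The difference would show up as a bogus change in the packet
     * accounting although nothing changed on the wire. */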

> Suspected specific
> symptoms of this problem include the WARN_ON(len > skb->len) in
> tcp_fragment() firing, as the 1-too-large pcount ripples through to
> tcp_mark_head_lost() trying to chop off too many bytes to mark as
> lost.

Why would we be splitting a SACKed segment there?!? ...Ah, it seems that
it indeed is possible now after the non-FACK mode fragmenting; however, it
is totally useless work and seems to be breaking stuff. I think that the
proper solution is to prevent splitting of already SACKed stuff. ...Nice
find btw. :-)
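
One plausible shape for such a guard in tcp_mark_head_lost(), sketched
as a guess at the form rather than the actual submitted patch (see the
follow-up proposal referenced below):

    /* Hypothetical: never fragment an skb the receiver has already
     * SACKed; there is nothing in it left to mark as lost.
     */
    if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
        break;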

> It's also possible this bug is related to recent reports of sacked_out
> becoming negative.

Perhaps, but those are much more likely due to the recent changes
breaking stuff instead (I hadn't seen those reports about sacked_out
before the three latest fixes; the tcp_fragment stuff is reportedly
older though)...
Yuchung Cheng March 3, 2012, 3:05 a.m. UTC | #3
On Fri, Mar 2, 2012 at 12:25 PM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> Why would we be splitting a SACKed segment there?!? ...Ah, it seems that
> it indeed is possible now after the non-FACK mode fragmenting; however, it
> is totally useless work and seems to be breaking stuff. I think that the
> proper solution is to prevent splitting of already SACKed stuff. ...Nice
> find btw. :-)

Bingo! I totally agree this is the bug. It looks like a one-line patch now.

We _really_ need to refactor the cryptic tcp_mark_head_lost(). I have a patch
ready for review once the recent SACK turmoil settles.
Neal Cardwell March 3, 2012, 7:44 a.m. UTC | #4
On Fri, Mar 2, 2012 at 10:05 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On Fri, Mar 2, 2012 at 12:25 PM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
>> Why would we be splitting a SACKed segment there?!? ...Ah, it seems that
>> it indeed is possible now after the non-FACK mode fragmenting; however, it
>> is totally useless work and seems to be breaking stuff. I think that the
>> proper solution is to prevent splitting of already SACKed stuff. ...Nice
>> find btw. :-)
>
> Bingo! I totally agree this is the bug. It looks like a one-line patch now.

Agreed. I sent out a proposal:
  http://patchwork.ozlabs.org/patch/144408/

Thanks, Ilpo, for filling in the details. The comment in
tcp_match_skb_to_sack() that says "Round if necessary so that SACKs
cover only full MSSes and/or the remaining small portion (if
present)", plus the analogous MSS alignment logic in
tcp_mss_split_point(), had me convinced that the MSS alignment
invariant needed for tcp_match_skb_to_sack() to work correctly was
actually intended.

neal
Ilpo Järvinen March 5, 2012, 9:31 a.m. UTC | #5
On Sat, 3 Mar 2012, Neal Cardwell wrote:

> On Fri, Mar 2, 2012 at 10:05 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > On Fri, Mar 2, 2012 at 12:25 PM, Ilpo Järvinen
> > <ilpo.jarvinen@helsinki.fi> wrote:
> >> Why would we be splitting a SACKed segment there?!? ...Ah, it seems that
> >> it indeed is possible now after the non-FACK mode fragmenting; however, it
> >> is totally useless work and seems to be breaking stuff. I think that the
> >> proper solution is to prevent splitting of already SACKed stuff. ...Nice
> >> find btw. :-)
> >
> > Bingo! I totally agree this is the bug. It looks like a one-line patch now.
> 
> Agreed. I sent out a proposal:
>   http://patchwork.ozlabs.org/patch/144408/
> 
> Thanks, Ilpo, for filling in the details.

This unnecessary fragmenting probably also ate up some of the benefits 
from the recombining efforts done by shift/merge.

> The comment in
> tcp_match_skb_to_sack() that says "Round if necessary so that SACKs
> cover only full MSSes and/or the remaining small portion (if
> present)", plus the analogous MSS alignment logic in
> tcp_mss_split_point(), had me convinced that the MSS alignment
> invariant needed for tcp_match_skb_to_sack() to work correctly was
> actually intended.

tcp_mss_split_point() is there to work around receive window non-MSS
oddities (when the whole window is almost fully utilized), plus some
logic to enforce the Nagle marker skb. That stuff always works with
unsent data anyway, so it's much simpler to handle than skbs for which
we've already made some real commitments.
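
For illustration, the alignment step in tcp_mss_split_point() roughly
boils down to this (made-up numbers; mss = 1000, the window permits
2500 more bytes):

    unsigned int window = 2500, mss = 1000;
    unsigned int split = window - window % mss;  /* 2000: send two full
                                                  * MSSes, leave the
                                                  * sub-MSS tail unsent */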

tcp_match_skb_to_sack() tries to find the original MSS boundaries that
GSO/TSO generated, and align to them if the SACK range contains
non-aligned octets. Again, this is there to avoid the pcount tweaks
that would be necessary if we split into two skbs that have a
less-than-MSS-sized tail part, and also to prevent receiver-driven
attacks that split skbs into very small pieces. ...Hopefully it also
serves as an incentive to fix those middleboxes/end hosts which are in
violation here by not maintaining the original packet boundaries :-).
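
The rounding itself is simple; a sketch with made-up offsets (mss = 1000,
an skb covering bytes [0, 3000), a SACK edge landing mid-segment at 1400):

    bool in_sack = false;                          /* the skb's prefix was not SACKed */
    unsigned int mss = 1000, pkt_len = 1400;
    unsigned int new_len = (pkt_len / mss) * mss;  /* 1000: full MSSes only */

    if (!in_sack && new_len < pkt_len)
        new_len += mss;                            /* 2000: round the other way so the
                                                    * non-SACKed part keeps full MSSes */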

However, now that I look at it in this new light, tcp_match_skb_to_sack()
is also subject to a similar kind of bug, i.e., it can fragment already
shifted/merged skbs (although it's not that likely one would hit a
scenario that actually triggers it, as it requires a partial re-SACK).
Fixing this will be significantly more complicated though, and it seems
to require that the sacktag logic inspect ->sacked much earlier, making
it again harder to follow :-(.

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ff3b6d..13034ad 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2070,6 +2070,48 @@  static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 	}
 }
 
+/* So we can retransmit skb, fragment it to be cur_mss bytes. In
+ * addition, we must maintain the invariant that whatever skbs we
+ * leave in the write queue are integral multiples of the MSS or a
+ * remaining small sub-MSS portion. This means we fragment the skb
+ * into potentially three skbs in the write queue:
+ *
+ *  (1) The first skb of exactly 1*cur_mss, which we will retransmit now.
+ *  (2) A "bulk" skb that is an integral multiple of the cur_mss
+ *  (3) A "left-over" skb that has any remaining portion smaller than cur_mss
+ *
+ * Since either of the two required fragmentation operations can fail
+ * (e.g. due to ENOMEM), and we want this invariant to be maintained
+ * if either fails, we chop off (3) first and then chop off (1).
+ *
+ * Returns non-zero if an error occurred which prevented the full splitting.
+ */
+static int tcp_retrans_mss_split(struct sock *sk, struct sk_buff *skb,
+				 unsigned int cur_mss)
+{
+	int err;
+	unsigned int len;
+
+	/* Chop off any "left-over" at end that is not aligned to cur_mss. */
+	if (cur_mss != tcp_skb_mss(skb)) {
+		len = skb->len - skb->len % cur_mss;
+		if (len < skb->len) {
+			err = tcp_fragment(sk, skb, len, cur_mss);
+			if (err < 0)
+				return err;
+		}
+	}
+
+	/* Chop off a single MSS at the beginning to retransmit now. */
+	if (skb->len > cur_mss) {
+		err = tcp_fragment(sk, skb, cur_mss, cur_mss);
+		if (err < 0)
+			return err;
+	}
+
+	return 0;
+}
+
 /* This retransmits one SKB.  Policy decisions and retransmit queue
  * state updates are done by the caller.  Returns non-zero if an
  * error occurred which prevented the send.
@@ -2115,7 +2157,7 @@  int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		return -EAGAIN;
 
 	if (skb->len > cur_mss) {
-		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
+		if (tcp_retrans_mss_split(sk, skb, cur_mss))
 			return -ENOMEM; /* We'll try again later. */
 	} else {
 		int oldpcount = tcp_skb_pcount(skb);