Message ID: 1330698449-2969-1-git-send-email-ncardwell@google.com
State: Changes Requested, archived
Delegated to: David Miller
On Fri, Mar 2, 2012 at 6:27 AM, Neal Cardwell <ncardwell@google.com> wrote:
> This commit fixes tcp_retransmit_skb() to respect the invariant that
> an skb in the write queue that might be SACKed (that is, one that
> precedes tcp_send_head()) is either less than tcp_skb_mss(skb) or an
> integral multiple of tcp_skb_mss(skb).
>
> Various parts of the TCP code maintain or assume this invariant,
> including at least tcp_write_xmit(), tcp_mss_split_point(),
> tcp_match_skb_to_sack(), and tcp_shifted_skb().
>
> tcp_retransmit_skb() did not maintain this invariant. It checked the
> current MSS and called tcp_fragment() to make sure that the skb we're
> retransmitting is at most cur_mss bytes, but in the process it took
> the excess bytes, created an arbitrary-length skb (one that is not
> necessarily an integral multiple of its MSS), and inserted it in the
> write queue after the skb we're retransmitting.
>
> One potential indirect effect of this problem is tcp_shifted_skb()
> creating a coalesced SACKed skb that has a pcount that is 1 too large
> for its length. This happened because tcp_shifted_skb() assumed that
> skbs are integral multiples of the MSS, so it could simply add the
> pcounts of the input skbs to find the pcount of the output skb.
> Suspected specific symptoms of this problem include the
> WARN_ON(len > skb->len) in tcp_fragment() firing, as the 1-too-large
> pcount ripples through to tcp_mark_head_lost(), which then tries to
> chop off too many bytes to mark as lost.
>
> It's also possible this bug is related to recent reports of sacked_out
> becoming negative.
>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Acked-by: Yuchung Cheng <ycheng@google.com>

I especially like the comment about the invariant, which is less
explicit in other parts of the GSO code.
> ---
>  net/ipv4/tcp_output.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 43 insertions(+), 1 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4ff3b6d..13034ad 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2070,6 +2070,48 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
>  	}
>  }
>
> +/* So we can retransmit skb, fragment it to be cur_mss bytes. In
> + * addition, we must maintain the invariant that whatever skbs we
> + * leave in the write queue are integral multiples of the MSS or a
> + * remaining small sub-MSS portion. This means we fragment the skb
> + * into potentially three skbs in the write queue:
> + *
> + * (1) The first skb of exactly 1*cur_mss, which we will retransmit now.
> + * (2) A "bulk" skb that is an integral multiple of the cur_mss
> + * (3) A "left-over" skb that has any remaining portion smaller than cur_mss
> + *
> + * Since either of the two required fragmentation operations can fail
> + * (e.g. due to ENOMEM), and we want this invariant to be maintained
> + * if either fails, we chop off (3) first and then chop off (1).
> + *
> + * Returns non-zero if an error occurred which prevented the full splitting.
> + */
> +static int tcp_retrans_mss_split(struct sock *sk, struct sk_buff *skb,
> +				 unsigned int cur_mss)
> +{
> +	int err;
> +	unsigned int len;
> +
> +	/* Chop off any "left-over" at end that is not aligned to cur_mss. */
> +	if (cur_mss != tcp_skb_mss(skb)) {
> +		len = skb->len - skb->len % cur_mss;
> +		if (len < skb->len) {
> +			err = tcp_fragment(sk, skb, len, cur_mss);
> +			if (err < 0)
> +				return err;
> +		}
> +	}
> +
> +	/* Chop off a single MSS at the beginning to retransmit now. */
> +	if (skb->len > cur_mss) {
> +		err = tcp_fragment(sk, skb, cur_mss, cur_mss);
> +		if (err < 0)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
>  /* This retransmits one SKB. Policy decisions and retransmit queue
>   * state updates are done by the caller. Returns non-zero if an
>   * error occurred which prevented the send.
> @@ -2115,7 +2157,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
>  		return -EAGAIN;
>
>  	if (skb->len > cur_mss) {
> -		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
> +		if (tcp_retrans_mss_split(sk, skb, cur_mss))
>  			return -ENOMEM; /* We'll try again later. */
>  	} else {
>  		int oldpcount = tcp_skb_pcount(skb);
> --
> 1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Mar 2012, Neal Cardwell wrote:

> This commit fixes tcp_retransmit_skb() to respect the invariant that
> an skb in the write queue that might be SACKed (that is, that precedes
> tcp_send_head()) is either less than tcp_skb_mss(skb) or an integral
> multiple of tcp_skb_mss(skb).
>
> Various parts of the TCP code maintain or assume this invariant,
> including at least tcp_write_xmit(), tcp_mss_split_point(),
> tcp_match_skb_to_sack(), and tcp_shifted_skb().

What invariant? We've never had the one you describe?!? It should be
perfectly ok for the last part of the super-skb to have less than MSS
bytes. The last one can be of arbitrary size, and you should be able to
shift/merge as many of them (the non-full last segments) as you want and
can fit into a super-skb. The only condition which should allow/need
resplitting these is the retransmit after reneging, and therefore we can
keep the pcount changes minimal to avoid bogus "window adjustments"
while nothing was changed on the wire for real.

...One more thought: perhaps the reneging + partial re-SACKed case is
potentially broken for real, but it should be solved while reneging is
being processed, by doing all those window-affecting pcount tweaks only
then. Not a too likely scenario in the first place (but I suppose this
could lead even to a negative pcount at worst, which is certainly a bad
thing to happen).

Please note though that some SACK(tag) code still needs to maintain the
send-time-decided MSS boundaries while shifting partial segments, but
even that would not be necessary if the DSACK code were more clever than
it is at the moment. I used to have some plans to fix this too, but it
would have depended on the rbtree stuff, which wasn't/isn't there
because that patchset is still in rottens-in-some-git state.

> tcp_retransmit_skb() did not maintain this invariant. It checked the
> current MSS and called tcp_fragment() to make sure that the skb we're
> retransmitting is at most cur_mss, but in the process it took the
> excess bytes and created an arbitrary-length skb (one that is not
> necessarily an integral multiple of its MSS) and inserted it in the
> write queue after the skb we're retransmitting.

Also, if the MSS changes after userspace writes, we'll have the very
same "problem" (and other "normal behavior" leaves those non-full skbs
around too, e.g. due to a slow application). ...This should really be a
non-issue if the code is properly written.

> One potential indirect effect of this problem is tcp_shifted_skb()
> creating a coalesced SACKed skb that has a pcount that is 1 too large
> for its length. This happened because tcp_shifted_skb() assumed that
> skbs are integral multiples of MSS, so you can just add pcounts of
> input skbs to find the pcount of the output skb.

...It does _not_ assume what you say! ...Instead, it does this on
purpose. Anything else is the wrong thing to do, because it affects
window calculations just to get our clever GSO hacks to work. The SACK
code should maintain the original pcounts from send time, not change
them! So if we sent MSS+1 and MSS+1, we should have (1+1)+(1+1) as the
pcount after shifting instead of 1+1+1. ...This keeping of original
pcounts is the real invariant I used while writing the new sacktag code
with shifting.

> Suspected specific symptoms of this problem include the
> WARN_ON(len > skb->len) in tcp_fragment() firing, as the 1-too-large
> pcount ripples through to tcp_mark_head_lost() trying to chop off too
> many bytes to mark as lost.

Why would we be splitting a SACKed segment there?!? ...Ah, it seems that
it indeed is possible now after the non-FACK mode fragment; however, it
is totally useless work and seems to be breaking stuff. I think that the
proper solution is to prevent splitting of already SACKed stuff. ...Nice
find btw. :-)

> It's also possible this bug is related to recent reports of sacked_out
> becoming negative.

Perhaps, but those are much more likely due to the recent changes
breaking stuff instead (I hadn't seen those reports about sacked_out
before the three latest fixes; the tcp_fragment stuff is reportedly
older though)...
On Fri, Mar 2, 2012 at 12:25 PM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:

> Why would we be splitting a SACKed segment there?!? ...Ah, it seems
> that it indeed is possible now after the non-FACK mode fragment;
> however, it is totally useless work and seems to be breaking stuff. I
> think that the proper solution is to prevent splitting of already
> SACKed stuff. ...Nice find btw. :-)

Bingo! I totally agree this is the bug. Looks like a one-line patch now.
We _really_ need to refactor the cryptic tcp_mark_head_lost(). I have a
patch ready for review once the recent SACK turmoil settles.
On Fri, Mar 2, 2012 at 10:05 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On Fri, Mar 2, 2012 at 12:25 PM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
>> Why would we be splitting a SACKed segment there?!? ...Ah, it seems
>> that it indeed is possible now after the non-FACK mode fragment;
>> however, it is totally useless work and seems to be breaking stuff. I
>> think that the proper solution is to prevent splitting of already
>> SACKed stuff. ...Nice find btw. :-)
>
> Bingo! I totally agree this is the bug. Looks like a one-line patch now.

Agreed. I sent out a proposal:
http://patchwork.ozlabs.org/patch/144408/

Thanks, Ilpo, for filling in the details. The comment in
tcp_match_skb_to_sack() that says "Round if necessary so that SACKs
cover only full MSSes and/or the remaining small portion (if present)",
plus the analogous MSS-alignment logic in tcp_mss_split_point(), had me
convinced that the MSS-alignment invariant needed for
tcp_match_skb_to_sack() to work correctly was actually intended.

neal
On Sat, 3 Mar 2012, Neal Cardwell wrote:

> On Fri, Mar 2, 2012 at 10:05 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > On Fri, Mar 2, 2012 at 12:25 PM, Ilpo Järvinen
> > <ilpo.jarvinen@helsinki.fi> wrote:
> >> Why would we be splitting a SACKed segment there?!? ...Ah, it seems
> >> that it indeed is possible now after the non-FACK mode fragment;
> >> however, it is totally useless work and seems to be breaking stuff.
> >> I think that the proper solution is to prevent splitting of already
> >> SACKed stuff. ...Nice find btw. :-)
> >
> > Bingo! I totally agree this is the bug. Looks like a one-line patch now.
>
> Agreed. I sent out a proposal:
> http://patchwork.ozlabs.org/patch/144408/
>
> Thanks, Ilpo, for filling in the details.

This unnecessary fragmenting probably also ate up some of the benefits
from the recombining efforts done by shift/merge.

> The comment in tcp_match_skb_to_sack() that says "Round if necessary
> so that SACKs cover only full MSSes and/or the remaining small portion
> (if present)", plus the analogous MSS alignment logic in
> tcp_mss_split_point(), had me convinced that this MSS alignment
> invariant that was needed for tcp_match_skb_to_sack() to work
> correctly was actually intended.

tcp_mss_split_point() is there to work around receive-window non-MSS
oddities (when the whole window is almost fully utilized), plus some
logic to enforce the Nagle marker skb. That stuff always works with
unsent data anyway, so it's much simpler to handle than skbs for which
we've already made some real commitments.

tcp_match_skb_to_sack() tries to find the original MSS boundaries that
GSO/TSO generated and to align to them if the SACK range contains
non-aligned octets. Again, this is there to avoid the pcount tweaks that
would be necessary if we split into two skbs with a less-than-MSS sized
tail part, and also to prevent receiver-driven split-to-very-small-skbs
attacks. ...Hopefully it also serves as an incentive to fix those
middleboxes/end hosts which are in violation here by not maintaining the
original packet boundaries :-).

However, now that I look at it in this new light, tcp_match_skb_to_sack()
is subject to a similar kind of bug, i.e., it can fragment already
shifted/merged skbs (although it's not that likely one would get a
scenario to actually trigger that; it requires a partial re-SACK).
Fixing this will be significantly more complicated though, and seems to
require that the sacktag logic inspect ->sacked much earlier, making it
again harder to follow :-(.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ff3b6d..13034ad 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2070,6 +2070,48 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 	}
 }
 
+/* So we can retransmit skb, fragment it to be cur_mss bytes. In
+ * addition, we must maintain the invariant that whatever skbs we
+ * leave in the write queue are integral multiples of the MSS or a
+ * remaining small sub-MSS portion. This means we fragment the skb
+ * into potentially three skbs in the write queue:
+ *
+ * (1) The first skb of exactly 1*cur_mss, which we will retransmit now.
+ * (2) A "bulk" skb that is an integral multiple of the cur_mss
+ * (3) A "left-over" skb that has any remaining portion smaller than cur_mss
+ *
+ * Since either of the two required fragmentation operations can fail
+ * (e.g. due to ENOMEM), and we want this invariant to be maintained
+ * if either fails, we chop off (3) first and then chop off (1).
+ *
+ * Returns non-zero if an error occurred which prevented the full splitting.
+ */
+static int tcp_retrans_mss_split(struct sock *sk, struct sk_buff *skb,
+				 unsigned int cur_mss)
+{
+	int err;
+	unsigned int len;
+
+	/* Chop off any "left-over" at end that is not aligned to cur_mss. */
+	if (cur_mss != tcp_skb_mss(skb)) {
+		len = skb->len - skb->len % cur_mss;
+		if (len < skb->len) {
+			err = tcp_fragment(sk, skb, len, cur_mss);
+			if (err < 0)
+				return err;
+		}
+	}
+
+	/* Chop off a single MSS at the beginning to retransmit now. */
+	if (skb->len > cur_mss) {
+		err = tcp_fragment(sk, skb, cur_mss, cur_mss);
+		if (err < 0)
+			return err;
+	}
+
+	return 0;
+}
+
 /* This retransmits one SKB. Policy decisions and retransmit queue
  * state updates are done by the caller. Returns non-zero if an
  * error occurred which prevented the send.
@@ -2115,7 +2157,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		return -EAGAIN;
 
 	if (skb->len > cur_mss) {
-		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
+		if (tcp_retrans_mss_split(sk, skb, cur_mss))
 			return -ENOMEM; /* We'll try again later. */
 	} else {
 		int oldpcount = tcp_skb_pcount(skb);
This commit fixes tcp_retransmit_skb() to respect the invariant that an
skb in the write queue that might be SACKed (that is, one that precedes
tcp_send_head()) is either less than tcp_skb_mss(skb) or an integral
multiple of tcp_skb_mss(skb).

Various parts of the TCP code maintain or assume this invariant,
including at least tcp_write_xmit(), tcp_mss_split_point(),
tcp_match_skb_to_sack(), and tcp_shifted_skb().

tcp_retransmit_skb() did not maintain this invariant. It checked the
current MSS and called tcp_fragment() to make sure that the skb we're
retransmitting is at most cur_mss bytes, but in the process it took the
excess bytes, created an arbitrary-length skb (one that is not
necessarily an integral multiple of its MSS), and inserted it in the
write queue after the skb we're retransmitting.

One potential indirect effect of this problem is tcp_shifted_skb()
creating a coalesced SACKed skb that has a pcount that is 1 too large
for its length. This happened because tcp_shifted_skb() assumed that
skbs are integral multiples of the MSS, so it could simply add the
pcounts of the input skbs to find the pcount of the output skb.
Suspected specific symptoms of this problem include the
WARN_ON(len > skb->len) in tcp_fragment() firing, as the 1-too-large
pcount ripples through to tcp_mark_head_lost(), which then tries to
chop off too many bytes to mark as lost.

It's also possible this bug is related to recent reports of sacked_out
becoming negative.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_output.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 43 insertions(+), 1 deletions(-)