Message ID | 1283859552.2338.402.camel@edumazet-laptop |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
Headers | show |
On Tue, 7 Sep 2010, Eric Dumazet wrote: > Le lundi 06 septembre 2010 à 22:30 -0700, David Miller a écrit : > > > > The small 78 byte window is why the sending system is splitting up the > > writes into smaller pieces. > > > > I presume that the system advertises exactly a 78 byte window because > > this is how large the commands are. But this is an extremely foolish > > and baroque thing to do, and it's why you are having problems. > > I am not sure why TSO added a "Bound mss with half of window" > requirement for tcp_sync_mss() I've thought it is more related on window behavior in general and much much older than TSO (it certainly seems to be old one). I guess we might run to some SWS issue if MSS < rwin < 2*MSS with your patch that are avoided by the current approach? > I tried with MSS=1000 and WIN=1000, and segment size chosen is 500 Perhaps worth to try with MSS=999 WIN=1000 pair too and see what happens? > With WIN=78, 78/2->39 is then capped to 48 (68U - tp->tcp_header_len) > > Is there a hard requirement about segment size being at most half the > window ? > > Following patch solves the problem for me : > > [PATCH] tcp: bound mss to window in tcp_sync_mss() > > Leandro Melo de Sales noticed that if a peer announces a very small > initial tcp window (78 in his case), first sent frames have unnecessary > small lengths (48 in his case) > > CLNT->SRV [SYN] Seq=0 Win=5840 Len=0 MSS=1460 > SRV->CLNT [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78 > CLNT->SRV [ACK] Seq=1 Ack=1 Win=5840 Len=0 > CLNT->SRV [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=48 > CLNT->SRV [PSH, ACK] Seq=49 Ack=1 Win=5840 Len=30 > SRV->CLNT [ACK] Seq=1 Ack=49 Win=78 Len=0 > SRV->CLNT [RST, ACK] Seq=1 Ack=79 Win=78 Len=0 > > tcp_sync_mss() bounds mss to half the window, while it could use full > window: > > CLNT->SRV [SYN] Seq=0 Win=5840 Len=0 MSS=1460 > SRV->CLNT [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78 > CLNT->SRV [ACK] Seq=1 Ack=1 Win=5840 Len=0 > CLNT->SRV [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78 > SRV->CLNT [ACK] Seq=1 Ack=79 Win=78 Len=0 > > Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> > --- > include/net/tcp.h | 9 +++++++++ > net/ipv4/tcp_output.c | 2 +- > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index eaa9582..c262676 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -481,6 +481,15 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) > return pktsize; > } > > +/* Bound MSS / TSO packet size with the window */ > +static inline int tcp_bound_to_wnd(struct tcp_sock *tp, int pktsize) > +{ > + if (tp->max_window && pktsize > tp->max_window) > + return max(tp->max_window, 68U - tp->tcp_header_len); > + else > + return pktsize; > +} > + > > /* tcp.c */ > extern void tcp_get_info(struct sock *, struct tcp_info *); > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index de3bd84..49cdbe4 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1224,7 +1224,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) > icsk->icsk_mtup.search_high = pmtu; > > mss_now = tcp_mtu_to_mss(sk, pmtu); > - mss_now = tcp_bound_to_half_wnd(tp, mss_now); > + mss_now = tcp_bound_to_wnd(tp, mss_now); > > /* And store cached results */ > icsk->icsk_pmtu_cookie = pmtu; I don't quite follow why tcp.c caller should then be left as is? Though it seems to me that it is more complex than tcp_sync_mss case.
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 07 Sep 2010 13:39:12 +0200 > I am not sure why TSO added a "Bound mss with half of window" > requirement for tcp_sync_mss() Because if we TSO to the full window the link doesn't fill up properly. We definitely do this on purpose, that's for sure. :-) I'll look into the full version control history to unearth the exact reason so we can validate your patch, good find Eric. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Is there a hard requirement about segment size being at most half the > window ? In the deep, dark, past of other stacks, there were concerns about devolving into stop-start. I suppose from a theoretical standpoint, there is a chance that MSS==window, coupled with immediate ACKnowledgement on the receiver could send zero windows that might elicit undesirable probes? rick jones -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 7 Sep 2010 15:15:25 +0300 (EEST) [ Alexey, problem is that when receiver's maximum window is miniscule (f.e. equal to MSS :-), we never send full MSS sized frames due to our sender size SWS implementation. ] > On Tue, 7 Sep 2010, Eric Dumazet wrote: > >> Le lundi 06 septembre 2010 à 22:30 -0700, David Miller a écrit : >> >> >> > The small 78 byte window is why the sending system is splitting up the >> > writes into smaller pieces. >> > >> > I presume that the system advertises exactly a 78 byte window because >> > this is how large the commands are. But this is an extremely foolish >> > and baroque thing to do, and it's why you are having problems. >> >> I am not sure why TSO added a "Bound mss with half of window" >> requirement for tcp_sync_mss() > > I've thought it is more related on window behavior in general and > much much older than TSO (it certainly seems to be old one). > > I guess we might run to some SWS issue if MSS < rwin < 2*MSS with your > patch that are avoided by the current approach? Right, this clamping is part of RFC1122 silly window syndrome avoidance. In ancient times we used to do this straight in sendmsg(), which had the comment: /* We also need to worry about the window. If * window < 1/2 the maximum window we've seen * from this host, don't use it. This is * sender side silly window prevention, as * specified in RFC1122. (Note that this is * different than earlier versions of SWS * prevention, e.g. RFC813.). What we * actually do is use the whole MSS. Since * the results in the right edge of the packet * being outside the window, it will be queued * for later rather than sent. */ But in January 2000, Alexey Kuznetsov moved this logic into tcp_sync_mss(). netdev-vger-2.6 commit is: -------------------- commit 214d457eb454a70f0f373371de044403834d8042 Author: davem <davem> Date: Tue Jan 18 08:24:09 2000 +0000 Merge in bug fixes and small enhancements from Alexey for the TCP/UDP softnet mega-merge from the other day. -------------------- Well, what is the SWS sender rule? RFC1122 states that we should send data if (wher U == usable window, D == data queued up but not yet sent): 1) if a maximum-sized segment can be sent, i.e, if: min(D,U) >= Eff.snd.MSS; 2) or if the data is pushed and all queued data can be sent now, i.e., if: [SND.NXT = SND.UNA and] PUSHED and D <= U (the bracketed condition is imposed by the Nagle algorithm); 3) or if at least a fraction Fs of the maximum window can be sent, i.e., if: [SND.NXT = SND.UNA and] min(D.U) >= Fs * Max(SND.WND); 4) or if data is PUSHed and the override timeout occurs. The recommmended value for the "Fs" fraction is 1/2, so that's where the "one-half" logic comes from. The current code implements this by pre-chopping the MSS to half the largest window we've seen advertised by the peer, f.e. when doing sendmsg() packetization. Packets are packetized to this 1/2 MAX_WINDOW MSS, represented by mss_now. Then the send path SWS logic will send any packet that is at least "mss_now". Effectively we try to implement test #1 and #3 above at the same time by just making test #1 and chopping the Eff.snd.MSS by half the largest receive window we've seen advertised by the peer. This strategy seems to break down when the peer's MSS and the maximum receive window we'll ever see from the peer are the same order of magnitude. It seems that the conditions above really need to be checked in the right order, and because we try to combine a later test (case #3) with an earlier test (case #1) we don't send a full sized frame in this special scenerio. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello! > [ Alexey, problem is that when receiver's maximum window is miniscule > (f.e. equal to MSS :-), we never send full MSS sized frames due to > our sender size SWS implementation. ] I see. The problem was that we do early packetization. If we chop frames to real mss (> max_window/2), we cannot do #3 (min(D.U) >= Fs * Max(SND.WND)) without subsequent fragmentation. The solution to chop to max_window/2 was mine and it was intended to solve the problem for devices with _large_ mtu, where max_window is comparable with mtu not because window is small, but because mtu is large. This case was important from performance viewpoint, fragmentation would destroy our smart early packetization technique. What's about the case of sane mtu and small max_window, that case was not simply ignored as "not-so-important", it also had a rational explanation, see below. If the issue must be resolved, I would suggest to: 1. Complicate mss = min(mss, max_window/2). Probably, do something like: if (max_window >= 65536 /* just a guess */) mss = min(mss, max_window/2); else mss = min(mss, max_window); 2. Add SWS avoidance checks in tcp_write_xmit(). It should detect condition when end_seq > tcp_wnd_end(tp) (like now), but proceed with fragmentation when tp->snd_nxt == tp->snd_una && tp->snd_wnd >= tp->max_window/2. Luckily, all this logic is already there due to TSO, only conditions when to fragment are to be adjusted a little. Frankly, I am not so sure that the issue should be resolved. There is one more aspect, not related to SWS. When mss > max_window/2 we can have only one segment in pipe, which is not good. When mss==max_window and we never see full sized frame sent, this looks strange, but I bet it is still better for performance under almost any curcumstances. I do not know actual context, of course. I can guess the situation can be like this: "I set window=mss exactly to see only one packet in flight all the times. Why the hell linux tries to send two mss/2 sized?" :-) > In ancient times we used to do this straight in sendmsg(), which had > the comment: > > /* We also need to worry about the window. If > * window < 1/2 the maximum window we've seen > * from this host, don't use it. This is > * sender side silly window prevention, as > * specified in RFC1122. (Note that this is > * different than earlier versions of SWS > * prevention, e.g. RFC813.). What we > * actually do is use the whole MSS. Since > * the results in the right edge of the packet > * being outside the window, it will be queued > * for later rather than sent. > */ BTW the comment was good, but this logic was not actually implemented. The code below this comment was incorrect, it chopped segments at tail of write_queue (with seq > snd_nxt) based on window calculation at head of queue, so that it did not work. Actually, this check can be done not earlier than in tcp_write_xmit(). Alexey -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 7, 2010 at 8:39 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 06 septembre 2010 à 22:30 -0700, David Miller a écrit : > > >> The small 78 byte window is why the sending system is splitting up the >> writes into smaller pieces. >> >> I presume that the system advertises exactly a 78 byte window because >> this is how large the commands are. But this is an extremely foolish >> and baroque thing to do, and it's why you are having problems. > > I am not sure why TSO added a "Bound mss with half of window" > requirement for tcp_sync_mss() > > I tried with MSS=1000 and WIN=1000, and segment size chosen is 500 > > With WIN=78, 78/2->39 is then capped to 48 (68U - tp->tcp_header_len) > > Is there a hard requirement about segment size being at most half the > window ? > > Following patch solves the problem for me : > > [PATCH] tcp: bound mss to window in tcp_sync_mss() > > Leandro Melo de Sales noticed that if a peer announces a very small > initial tcp window (78 in his case), first sent frames have unnecessary > small lengths (48 in his case) > > CLNT->SRV [SYN] Seq=0 Win=5840 Len=0 MSS=1460 > SRV->CLNT [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78 > CLNT->SRV [ACK] Seq=1 Ack=1 Win=5840 Len=0 > CLNT->SRV [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=48 > CLNT->SRV [PSH, ACK] Seq=49 Ack=1 Win=5840 Len=30 > SRV->CLNT [ACK] Seq=1 Ack=49 Win=78 Len=0 > SRV->CLNT [RST, ACK] Seq=1 Ack=79 Win=78 Len=0 > > tcp_sync_mss() bounds mss to half the window, while it could use full > window: > > CLNT->SRV [SYN] Seq=0 Win=5840 Len=0 MSS=1460 > SRV->CLNT [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78 > CLNT->SRV [ACK] Seq=1 Ack=1 Win=5840 Len=0 > CLNT->SRV [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78 > SRV->CLNT [ACK] Seq=1 Ack=79 Win=78 Len=0 > > Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> > --- > include/net/tcp.h | 9 +++++++++ > net/ipv4/tcp_output.c | 2 +- > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index eaa9582..c262676 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -481,6 +481,15 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) > return pktsize; > } > > +/* Bound MSS / TSO packet size with the window */ > +static inline int tcp_bound_to_wnd(struct tcp_sock *tp, int pktsize) > +{ > + if (tp->max_window && pktsize > tp->max_window) > + return max(tp->max_window, 68U - tp->tcp_header_len); > + else > + return pktsize; > +} > + > /* tcp.c */ > extern void tcp_get_info(struct sock *, struct tcp_info *); > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index de3bd84..49cdbe4 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1224,7 +1224,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) > icsk->icsk_mtup.search_high = pmtu; > > mss_now = tcp_mtu_to_mss(sk, pmtu); > - mss_now = tcp_bound_to_half_wnd(tp, mss_now); > + mss_now = tcp_bound_to_wnd(tp, mss_now); > > /* And store cached results */ > icsk->icsk_pmtu_cookie = pmtu; > > > Hi Eric and others, Good. I have tested and it is working... This is the exactly translation for code what I mentioned. Very nice... Will this patch be applied? Thank you, Leandro. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: ツ Leandro Melo de Sales <leandroal@gmail.com> Date: Wed, 8 Sep 2010 11:06:43 -0300 > Good. I have tested and it is working... This is the exactly > translation for code what I mentioned. Very nice... > > Will this patch be applied? No because the patch is wrong, see the rest of the discussion. That line Eric is removing in his patch needs to be there because it is part of our implementation of sender side Silly Window Syndrome avoidance. We'll have to fix this another way. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
2010/9/8 David Miller <davem@davemloft.net>: > From: ツ Leandro Melo de Sales <leandroal@gmail.com> > Date: Wed, 8 Sep 2010 11:06:43 -0300 > >> Good. I have tested and it is working... This is the exactly >> translation for code what I mentioned. Very nice... >> >> Will this patch be applied? > > No because the patch is wrong, see the rest of the discussion. > > That line Eric is removing in his patch needs to be there > because it is part of our implementation of sender side > Silly Window Syndrome avoidance. > > We'll have to fix this another way. > Yes, just after I reply asking this question, I read the other discussions, mainly the comments from Alexey about SWS and TSO. Is someone working on trying to fix this another way? Thank you, Leandro. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/net/tcp.h b/include/net/tcp.h index eaa9582..c262676 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -481,6 +481,15 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) return pktsize; } +/* Bound MSS / TSO packet size with the window */ +static inline int tcp_bound_to_wnd(struct tcp_sock *tp, int pktsize) +{ + if (tp->max_window && pktsize > tp->max_window) + return max(tp->max_window, 68U - tp->tcp_header_len); + else + return pktsize; +} + /* tcp.c */ extern void tcp_get_info(struct sock *, struct tcp_info *); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index de3bd84..49cdbe4 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1224,7 +1224,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) icsk->icsk_mtup.search_high = pmtu; mss_now = tcp_mtu_to_mss(sk, pmtu); - mss_now = tcp_bound_to_half_wnd(tp, mss_now); + mss_now = tcp_bound_to_wnd(tp, mss_now); /* And store cached results */ icsk->icsk_pmtu_cookie = pmtu;