Message ID | alpine.DEB.2.00.0912021745200.7024@wel-95.cs.helsinki.fi |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On Wed, 2 Dec 2009 18:05:24 +0200 (EET), "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> > > In one of the cases, also the sg end dies (the 4th case). I
> > > suppose that was running earlier kernel already?
> >
> > This case was the one with tcp_frto=2 and tcp_timestamps=0 on houba.
>
> I suppose we're confused, I was refering to .4. case, did you perhaps
> mix that up with the latest set of tests which yields .8.?

You're right, I was talking about .8.

> In the recent work, the most suspicious things are the new timeout
> things, I'll read them through once I've some time (but so far I've
> not found anything wrong in them but I of course can miss something
> subtle). ...I've added Damian as CC if he has some idea. If you want
> you can try with a trivial revert of that stuff, I've included a
> patch for that below.

I will try it.

I just discovered something not good: I use the tuxonice[1] branch on houba for 2.6.32*, although my 2.6.31 kernel is vanilla[2].

[1] git://git.kernel.org/pub/scm/linux/kernel/git/nigelc/tuxonice-head.git
[2] git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

Now I'm going to compile kernels again and again and run more tests ... :)
Frederic Leroy schrieb: > Le Wed, 2 Dec 2009 18:05:24 +0200 (EET), > "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> a écrit : > >>>> In one of the cases, also the sg end dies (the 4th case). I >>>> suppose that was running earlier kernel already? >>> This case was the one with tcp_frto=2 and tcp_timestamps=0 on houba. >> I suppose we're confused, I was refering to .4. case, did you perhaps >> mix that up with the latest set of tests which yields .8.? > > You're right, I was talking about .8 > >> In the recent work, the most suspicious things are the new timeout >> things, I'll read them through once I've some time (but so far I've >> not found anything wrong in them but I of course can miss something >> subtle). ...I've added Damian as CC if he has some idea. If you want >> you can try with a trivial revert of that stuff, I've included a >> patch for that below. > > I will try it. Hi, could you please printk retrans_stamp just before the return in include/net/tcp.h:retransmits_timed_out()? If the value is not monotonically increasing but is reset to 0 at some point, this might lead to problems in tcp_write_timeout(). It's the only idea I have now. > > I just discover something not good. I use tuxonice[1] branch on houba > for 2.6.32*. > Although my 2.6.31 kernel is vanilla[2]. > > [1] git://git.kernel.org/pub/scm/linux/kernel/git/nigelc/tuxonice-head.git > [2] git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git > > Now, I'm gonna go compile kernels again and again and make more > tests ... :) > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 02, 2009 at 06:05:24PM +0200, Ilpo Järvinen wrote:
>
> In the recent work, the most suspicious things are the new timeout things,
> I'll read them through once I've some time (but so far I've not found
> anything wrong in them but I of course can miss something subtle).
> ...I've added Damian as CC if he has some idea. If you want you can try
> with a trivial revert of that stuff, I've included a patch for that below.
> [PATCH] Revert new RTO backoff stuff

So I recompiled my kernel from the linus-stable tree. It behaves as badly as the tuxonice 2.6.32-rc5. All further test are on linus-stable tree.

I ran 3 tests with your patch. All 3 copies worked well, so the problem is in the new RTO backoff.

I made only one trace (.9).
On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote:
> could you please printk retrans_stamp just before the return in
> include/net/tcp.h:retransmits_timed_out()?
> If the value is not monotonically increasing but is reset to 0 at some
> point, this might lead to problems in tcp_write_timeout().
> It's the only idea I have now.

Your idea is good.
Only one out of 4 values is not null.

The corresponding logs on http://wwW.starox.org/pub/scp_stall are .10

I made 2 attempts. The printk lines corresponding to .10 are those after the line
"wlan1 enter promiscuous mode"
I've added Greg as CC to make him aware of this issue in early as it now affects 2.6.32 too (rather important to get dealt quickly in stable once we have a tested solution since TCP is pretty broken with the silent deaths this problem seems to cause). ...One possibility would be to just queue the tested revert to stable and sort this thing out for 2.6.33 in net-2.6. Opinions, Dave?, Greg? Now back to the issue... You said in the other mail that "All further test are on linus-stable tree.", which has this contradiction that Linus does not maintain stable trees. Which exactly was the tree used for the .9. test, Linus' tree or the 2.6.31 stable tree? I suppose the former since the revert wouldn't apply to 2.6.31 so I just want to confirm. On Thu, 3 Dec 2009, Frederic Leroy wrote: > On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote: > > could you please printk retrans_stamp just before the return in > > include/net/tcp.h:retransmits_timed_out()? > > If the value is not monotonically increasing but is reset to 0 at some > > point, this might lead to problems in tcp_write_timeout(). > > It's the only idea I have now. > > Your idea is good. > Only one out of 4 value is not null. > > Logs corresponding on http://wwW.starox.org/pub/scp_stall is .10 > > I make 2 attempts. Printk corresponding to .10 are those after the line > "wlan1 enter promiscuous mode" Nice thinking indeed Damian, thanks. ...But but, where exactly did you print? ...There are multiple returns and the return false branch is expected to have a zero retrans_stamp in a typical case but that is not a problem because we never use the value. 
...Anyway, if I'm wrong with my suspicion and it still holds that we have a zero retrans_stamp in the subtraction too, it could have something to do with this snippet:

static void tcp_try_to_open(struct sock *sk, int flag)
{
	struct tcp_sock *tp = tcp_sk(sk);

	tcp_verify_left_out(tp);

	if (!tp->frto_counter && tp->retrans_out == 0)
		tp->retrans_stamp = 0;

...It bit me last time when FRTO was enabled after a very small modification (without running a full verification after the trivial looking modification). ...So I've worked around this clearing for FRTO as you can see :-).

Also, we have another mystery to solve: the fast retransmission is not triggered for some reason (or alternatively not captured into a log), even in the working .9. case. It would be easy to see whether it works at all from TCP's point of view by looking into the MIBs once you have some transfers in a working configuration:

	grep -A1 TCP /proc/net/netstat

...luckily this fast retransmit issue is less crucial, as almost all people are pretty happy already if their RTO-based recovery works even if the fast recovery would not. So figuring it out can be postponed (if one has to prioritize) until the silent death issue is out of the way.
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 3 Dec 2009 12:29:39 +0200 (EET)

> I've added Greg as CC to make him aware of this issue in early as it now
> affects 2.6.32 too (rather important to get dealt quickly in stable once
> we have a tested solution since TCP is pretty broken with the silent
> deaths this problem seems to cause). ...One possibility would be to just
> queue the tested revert to stable and sort this thing out for 2.6.33 in
> net-2.6.

What revert? Zero context provided here...
Ilpo Järvinen schrieb: > I've added Greg as CC to make him aware of this issue in early as it now > affects 2.6.32 too (rather important to get dealt quickly in stable once > we have a tested solution since TCP is pretty broken with the silent > deaths this problem seems to cause). ...One possibility would be to just > queue the tested revert to stable and sort this thing out for 2.6.33 in > net-2.6. > > Opinions, Dave?, Greg? > > Now back to the issue... > > You said in the other mail that "All further test are on linus-stable > tree.", which has this contradiction that Linus does not maintain stable > trees. Which exactly was the tree used for the .9. test, Linus' tree or > the 2.6.31 stable tree? I suppose the former since the revert wouldn't > apply to 2.6.31 so I just want to confirm. > > > On Thu, 3 Dec 2009, Frederic Leroy wrote: >> On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote: >>> could you please printk retrans_stamp just before the return in >>> include/net/tcp.h:retransmits_timed_out()? >>> If the value is not monotonically increasing but is reset to 0 at some >>> point, this might lead to problems in tcp_write_timeout(). >>> It's the only idea I have now. >> Your idea is good. >> Only one out of 4 value is not null. >> >> Logs corresponding on http://wwW.starox.org/pub/scp_stall is .10 >> >> I make 2 attempts. Printk corresponding to .10 are those after the line >> "wlan1 enter promiscuous mode" > > Nice thinking indeed Damian, thanks. ...But but, where exactly did you > print? ...There are multiple returns and the return false branch is > expected to have a zero retrans_stamp in a typical case but that is not > a problem because we never use the value. Yes, it's the retrans_stamp in the subtraction I suspected to be 0. I also suspect this to happen only in the ca_state < CA_Loss case, so one first solution might be to return true whenever retrans_stamp == 0. 
Unluckily, I still cannot reproduce the scp stalls here, so it would be nice if Frederic printed retrans_stamp together with icsk_ca_state and icsk_retransmits, please. Damian > ...Anyway, if I'm wrong with my suspicion and it still holds that we have > zero retrans_stamp in the substraction too, it could have something to do > with this snippet: > > static void tcp_try_to_open(struct sock *sk, int flag) > { > struct tcp_sock *tp = tcp_sk(sk); > > tcp_verify_left_out(tp); > > if (!tp->frto_counter && tp->retrans_out == 0) > tp->retrans_stamp = 0; > > ...It bit me last time when FRTO was enabled after very small modification > (without running a full verification after the trivial looking > modification). ...So I've worked around this clearing for FRTO as you > can see :-). > > > Also, we have the another mystery to be solved, the fast retransmission is > not triggered for some reason (or alternatively not captured in to a > log), even in the working .9. case. It would be easy to see whether it > works at all from TCP point of view by looking into mibs once you have > have some transfers in a working configuration: > > grep -A1 TCP /proc/net/netstat > > ...luckily this fast retransmit issue is less crucial as almost all people > are pretty happy already if their RTO-based recovery works even if the > fast recovery would not. So figuring it out can be postponed (if one has > to prioritize) until the silent death issue is out of the way. > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le Thu, 3 Dec 2009 12:29:39 +0200 (EET), "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> a écrit : > Opinions, Dave?, Greg? > > Now back to the issue... > > You said in the other mail that "All further test are on linus-stable > tree.", which has this contradiction that Linus does not maintain > stable trees. Which exactly was the tree used for the .9. test Sorry I'm confused and so confuse you. For .9 .10 and now I'm only using : git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git > Linus' tree or the 2.6.31 stable tree? I suppose the former since the > revert wouldn't apply to 2.6.31 so I just want to confirm. I didn't keep the source of the old 2.6.31 kernel I have. So it's either git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git or git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6-stable.git > Nice thinking indeed Damian, thanks. ...But but, where exactly did > you print? ...There are multiple returns and the return false branch > is expected to have a zero retrans_stamp in a typical case but that > is not a problem because we never use the value. Here is the code : http://www.starox.org/pub/scp_stall/printk_retrans_stamp.patch > ...Anyway, if I'm wrong with my suspicion and it still holds that we > have zero retrans_stamp in the substraction too, it could have > something to do with this snippet: > > static void tcp_try_to_open(struct sock *sk, int flag) > { > struct tcp_sock *tp = tcp_sk(sk); > > tcp_verify_left_out(tp); > > if (!tp->frto_counter && tp->retrans_out == 0) > tp->retrans_stamp = 0; > > ...It bit me last time when FRTO was enabled after very small > modification (without running a full verification after the trivial > looking modification). ...So I've worked around this clearing for > FRTO as you can see :-). :) > Also, we have the another mystery to be solved, the fast > retransmission is not triggered for some reason (or alternatively not > captured in to a log), even in the working .9. case. 
It would be easy > to see whether it works at all from TCP point of view by looking into > mibs once you have have some transfers in a working configuration: > > grep -A1 TCP /proc/net/netstat I will try this evening. I can do test only outside office hours.
Damian Lukowski schrieb: > Ilpo Järvinen schrieb: >> I've added Greg as CC to make him aware of this issue in early as it now >> affects 2.6.32 too (rather important to get dealt quickly in stable once >> we have a tested solution since TCP is pretty broken with the silent >> deaths this problem seems to cause). ...One possibility would be to just >> queue the tested revert to stable and sort this thing out for 2.6.33 in >> net-2.6. >> >> Opinions, Dave?, Greg? >> >> Now back to the issue... >> >> You said in the other mail that "All further test are on linus-stable >> tree.", which has this contradiction that Linus does not maintain stable >> trees. Which exactly was the tree used for the .9. test, Linus' tree or >> the 2.6.31 stable tree? I suppose the former since the revert wouldn't >> apply to 2.6.31 so I just want to confirm. >> >> >> On Thu, 3 Dec 2009, Frederic Leroy wrote: >>> On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote: >>>> could you please printk retrans_stamp just before the return in >>>> include/net/tcp.h:retransmits_timed_out()? >>>> If the value is not monotonically increasing but is reset to 0 at some >>>> point, this might lead to problems in tcp_write_timeout(). >>>> It's the only idea I have now. >>> Your idea is good. >>> Only one out of 4 value is not null. >>> >>> Logs corresponding on http://wwW.starox.org/pub/scp_stall is .10 >>> >>> I make 2 attempts. Printk corresponding to .10 are those after the line >>> "wlan1 enter promiscuous mode" >> Nice thinking indeed Damian, thanks. ...But but, where exactly did you >> print? ...There are multiple returns and the return false branch is >> expected to have a zero retrans_stamp in a typical case but that is not >> a problem because we never use the value. > > Yes, it's the retrans_stamp in the subtraction I suspected to be 0. > I also suspect this to happen only in the ca_state < CA_Loss case, > so one first solution might be to return true whenever retrans_stamp == 0. 
return false, of course. > Unluckily, I still cannot reproduce the scp stalls here, so it would be nice > if Frederic printed retrans_stamp together with icsk_ca_state and > icsk_retransmits, please. > > Damian > >> ...Anyway, if I'm wrong with my suspicion and it still holds that we have >> zero retrans_stamp in the substraction too, it could have something to do >> with this snippet: >> >> static void tcp_try_to_open(struct sock *sk, int flag) >> { >> struct tcp_sock *tp = tcp_sk(sk); >> >> tcp_verify_left_out(tp); >> >> if (!tp->frto_counter && tp->retrans_out == 0) >> tp->retrans_stamp = 0; >> >> ...It bit me last time when FRTO was enabled after very small modification >> (without running a full verification after the trivial looking >> modification). ...So I've worked around this clearing for FRTO as you >> can see :-). >> >> >> Also, we have the another mystery to be solved, the fast retransmission is >> not triggered for some reason (or alternatively not captured in to a >> log), even in the working .9. case. It would be easy to see whether it >> works at all from TCP point of view by looking into mibs once you have >> have some transfers in a working configuration: >> >> grep -A1 TCP /proc/net/netstat >> >> ...luckily this fast retransmit issue is less crucial as almost all people >> are pretty happy already if their RTO-based recovery works even if the >> fast recovery would not. So figuring it out can be postponed (if one has >> to prioritize) until the silent death issue is out of the way. >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 3 Dec 2009, Frederic Leroy wrote:

> On Thu, 3 Dec 2009 12:29:39 +0200 (EET),
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
>
> > Opinions, Dave?, Greg?
> >
> > Now back to the issue...
> >
> > You said in the other mail that "All further test are on linus-stable
> > tree.", which has this contradiction that Linus does not maintain
> > stable trees. Which exactly was the tree used for the .9. test
>
> Sorry I'm confused and so confuse you.
> For .9 .10 and now I'm only using :
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

Thanks for the confirmation.

> > Nice thinking indeed Damian, thanks. ...But but, where exactly did
> > you print? ...There are multiple returns and the return false branch
> > is expected to have a zero retrans_stamp in a typical case but that
> > is not a problem because we never use the value.
>
> Here is the code :
> http://www.starox.org/pub/scp_stall/printk_retrans_stamp.patch

So I was wrong.

> > Also, we have the another mystery to be solved, the fast
> > retransmission is not triggered for some reason (or alternatively not
> > captured in to a log), even in the working .9. case. It would be easy
> > to see whether it works at all from TCP point of view by looking into
> > mibs once you have have some transfers in a working configuration:
> >
> > grep -A1 TCP /proc/net/netstat
>
> I will try this evening. I can do test only outside office hours.

Ok. Please take one capture before the test and another after it, so that it is easy to compare what changed.
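The before/after comparison requested above can be sketched in C. This is only an illustration: /proc/net/netstat prints pairs of lines (a line of counter names, then a line of values, both starting with the same prefix such as "TcpExt:"), and the helper names netstat_counter() and netstat_delta() are hypothetical, not part of any existing tool. The sample values in the usage note are made up.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in a names/values line pair in the layout used by
 * /proc/net/netstat, e.g.
 *   "TcpExt: TCPFastRetrans TCPTimeouts"
 *   "TcpExt: 3 7"
 * Returns 0 and stores the value on success, -1 if the name is absent.
 * (Hypothetical helper, for illustration only.)
 */
static int netstat_counter(const char *names, const char *values,
                           const char *wanted, unsigned long long *out)
{
    char name[64];
    unsigned long long val;
    int n_off, v_off;

    /* Skip the leading "TcpExt:" style prefix on both lines. */
    if (sscanf(names, "%63s%n", name, &n_off) != 1 ||
        sscanf(values, "%63s%n", name, &v_off) != 1)
        return -1;
    names += n_off;
    values += v_off;

    /* Walk both lines in lockstep: the i-th name owns the i-th value. */
    while (sscanf(names, "%63s%n", name, &n_off) == 1 &&
           sscanf(values, "%llu%n", &val, &v_off) == 1) {
        if (strcmp(name, wanted) == 0) {
            *out = val;
            return 0;
        }
        names += n_off;
        values += v_off;
    }
    return -1;
}

/* Change of one counter between a "before" and an "after" capture. */
static long long netstat_delta(const char *names, const char *before,
                               const char *after, const char *wanted)
{
    unsigned long long b, a;

    if (netstat_counter(names, before, wanted, &b) ||
        netstat_counter(names, after, wanted, &a))
        return -1;
    return (long long)(a - b);
}
```

For example, with the names line "TcpExt: TCPFastRetrans TCPTimeouts", a before line "TcpExt: 3 7" and an after line "TcpExt: 3 12", netstat_delta() reports 0 for TCPFastRetrans and 5 for TCPTimeouts, which is the pattern one would expect if only RTO-based recovery ran during the test.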
On Thu, 3 Dec 2009, Damian Lukowski wrote: > > On Thu, 3 Dec 2009, Frederic Leroy wrote: > >> On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote: > >>> could you please printk retrans_stamp just before the return in > >>> include/net/tcp.h:retransmits_timed_out()? > >>> If the value is not monotonically increasing but is reset to 0 at some > >>> point, this might lead to problems in tcp_write_timeout(). > >>> It's the only idea I have now. > >> Your idea is good. > >> Only one out of 4 value is not null. > >> > >> Logs corresponding on http://wwW.starox.org/pub/scp_stall is .10 > >> > >> I make 2 attempts. Printk corresponding to .10 are those after the line > >> "wlan1 enter promiscuous mode" > > > > Nice thinking indeed Damian, thanks. ...But but, where exactly did you > > print? ...There are multiple returns and the return false branch is > > expected to have a zero retrans_stamp in a typical case but that is not > > a problem because we never use the value. > > Yes, it's the retrans_stamp in the subtraction I suspected to be 0. > I also suspect this to happen only in the ca_state < CA_Loss case, > so one first solution might be to return true whenever retrans_stamp == 0. I suppose adding || !tp->retrans_stamp into the false condition is fine as long as we don't then have a connection that can cause a connection to hang there forever for some reason (this needs to be understood well enough, not just test driven in stables :-)). > Unluckily, I still cannot reproduce the scp stalls here, so it would be nice > if Frederic printed retrans_stamp together with icsk_ca_state and > icsk_retransmits, please. It wouldn't hurt to know tp->packets_out and tp->retrans_out too, that might have some significant w.r.t what happens because of FRTO.
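The effect under discussion can be modelled in user space. The sketch below is not the kernel code: struct sketch_sock and both functions are hypothetical stand-ins, and it assumes the check has the shape "now - retrans_stamp >= limit" with an early "return false" when nothing has been retransmitted. It only shows why a cleared (zero) retrans_stamp makes the elapsed time look huge, and how the proposed "|| !tp->retrans_stamp" guard changes the outcome.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few socket fields named in the thread. */
struct sketch_sock {
    uint32_t retrans_stamp;   /* time of first retransmission, 0 if cleared */
    unsigned int retransmits; /* RTO retransmissions so far */
};

/* Unguarded check: time elapsed since retrans_stamp against a limit. */
static bool timed_out_unguarded(const struct sketch_sock *sk,
                                uint32_t now, uint32_t limit)
{
    if (!sk->retransmits)
        return false;
    /* With retrans_stamp == 0 this is simply "now >= limit". */
    return (now - sk->retrans_stamp) >= limit;
}

/* Guarded variant: a cleared stamp is never treated as a timeout. */
static bool timed_out_guarded(const struct sketch_sock *sk,
                              uint32_t now, uint32_t limit)
{
    if (!sk->retransmits || !sk->retrans_stamp)
        return false;
    return (now - sk->retrans_stamp) >= limit;
}
```

With now = 500000, limit = 1000 and a socket whose retrans_stamp was cleared to 0 after one retransmission, the unguarded check reports a timeout immediately while the guarded one does not. As noted above, the guard alone does not show that such a connection cannot then hang forever; that still has to be understood separately.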
Frederic Leroy wrote: > Le Thu, 3 Dec 2009 12:29:39 +0200 (EET), > "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> a écrit : > >> Opinions, Dave?, Greg? >> >> Now back to the issue... >> >> You said in the other mail that "All further test are on linus-stable >> tree.", which has this contradiction that Linus does not maintain >> stable trees. Which exactly was the tree used for the .9. test > > Sorry I'm confused and so confuse you. > For .9 .10 and now I'm only using : > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git > >> Linus' tree or the 2.6.31 stable tree? I suppose the former since the >> revert wouldn't apply to 2.6.31 so I just want to confirm. > > I didn't keep the source of the old 2.6.31 kernel I have. > So it's either > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git > or > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6-stable.git > >> Nice thinking indeed Damian, thanks. ...But but, where exactly did >> you print? ...There are multiple returns and the return false branch >> is expected to have a zero retrans_stamp in a typical case but that >> is not a problem because we never use the value. > > Here is the code : > http://www.starox.org/pub/scp_stall/printk_retrans_stamp.patch > >> ...Anyway, if I'm wrong with my suspicion and it still holds that we >> have zero retrans_stamp in the substraction too, it could have >> something to do with this snippet: >> >> static void tcp_try_to_open(struct sock *sk, int flag) >> { >> struct tcp_sock *tp = tcp_sk(sk); >> >> tcp_verify_left_out(tp); >> >> if (!tp->frto_counter && tp->retrans_out == 0) >> tp->retrans_stamp = 0; >> >> ...It bit me last time when FRTO was enabled after very small >> modification (without running a full verification after the trivial >> looking modification). ...So I've worked around this clearing for >> FRTO as you can see :-). 
> > :) > >> Also, we have the another mystery to be solved, the fast >> retransmission is not triggered for some reason (or alternatively not >> captured in to a log), even in the working .9. case. It would be easy >> to see whether it works at all from TCP point of view by looking into >> mibs once you have have some transfers in a working configuration: >> >> grep -A1 TCP /proc/net/netstat > > I will try this evening. I can do test only outside office hours. If you don't mind, could you also post the output of "sysctl -a | grep net.ipv4.tcp", please. The tars you posted (proc_net_tcp.tbz2) seem to be empty. Thanks. Best regards, Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ilpo Järvinen wrote:

[snipped]

> Also, we have the another mystery to be solved, the fast retransmission is
> not triggered for some reason (or alternatively not captured in to a
> log), even in the working .9. case. It would be easy to see whether it
> works at all from TCP point of view by looking into mibs once you have
> have some transfers in a working configuration:
>
> grep -A1 TCP /proc/net/netstat
>
> ...luckily this fast retransmit issue is less crucial as almost all people
> are pretty happy already if their RTO-based recovery works even if the
> fast recovery would not. So figuring it out can be postponed (if one has
> to prioritize) until the silent death issue is out of the way.
>
>

I looked at the working .9 case, the stream from 192.168.1.15 to 192.168.1.19.
I don't think it is a mystery that fast retransmit does not trigger.
The condition SACKED_DATA > 3 * SMSS is simply not fulfilled.
Neither are there 3 non-contiguous SACK sequences.
The segments sent are too small :-(
Interestingly, it seems to me that in this case non-SACK would be better than SACK.
Or did I miss something?

Hey, we could cook up a draft for this problem ;-)

Anyway, the real problem is that RTO does not trigger...

Best regards,
Arnd
On Thu, 3 Dec 2009, Arnd Hannemann wrote:

> Ilpo Järvinen wrote:
>
> [snipped]
>
> > Also, we have the another mystery to be solved, the fast retransmission is
> > not triggered for some reason (or alternatively not captured in to a
> > log), even in the working .9. case. It would be easy to see whether it
> > works at all from TCP point of view by looking into mibs once you have
> > have some transfers in a working configuration:
> >
> > grep -A1 TCP /proc/net/netstat
> >
> > ...luckily this fast retransmit issue is less crucial as almost all people
> > are pretty happy already if their RTO-based recovery works even if the
> > fast recovery would not. So figuring it out can be postponed (if one has
> > to prioritize) until the silent death issue is out of the way.
> >
> >
>
> I looked at the working .9 case stream from 192.168.1.15 to 192.168.1.19.
> I don't think it is a mystery that fast retransmit does not trigger.
> The condition SACKED_DATA > 3* SMSS is simply not fulfilled.
> Neither are there 3 non-continuous SACK sequences.
> The segments sent are too small :-(
> Interesting though, seems to me in this case non-SACK would be better than SACK.
> Or did I miss something?

Yes, a particularly big one: Linux does not count SACKed bytes but packets.
In the first recovery, plenty of packets are SACKed:

135 sack 1 {2598:2646}>
108 sack 1 {2598:2694}>
121 sack 1 {2598:2742}>
95 sack 1 {2598:2790}>
426 sack 1 {2598:2838}>

fackets_out should be 6 now, which is way more than 3, the default tp->reordering.

> Hey we could cook up a draft for this problem ;-)
>
> Anyway, real problem is, RTO does not trigger...

There are two problems. ...Both are real. ;-) But the significance of one is much worse than that of the other.
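The disagreement above comes down to counting bytes versus counting packets. A simplified C model of the two rules follows; it is a sketch, not the kernel or RFC algorithm. The names are hypothetical, and three values are assumptions chosen to match the trace: 48-byte segments (the step between successive SACK right edges), a single 48-byte hole starting at 2550, and an SMSS of 1460 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* One SACK block, [start, end) in sequence space. */
struct sack_block {
    uint32_t start;
    uint32_t end;
};

/* Byte-based rule as quoted in the thread: SACKED_DATA > 3 * SMSS. */
static bool byte_rule(struct sack_block b, uint32_t smss)
{
    return (b.end - b.start) > 3 * smss;
}

/*
 * Packet-based count: segments from the first unacknowledged byte up to
 * the highest SACKed byte, holes included. Assumes equally sized
 * segments, which is a simplification made for this sketch.
 */
static unsigned int fackets_from_block(uint32_t snd_una, struct sack_block b,
                                       uint32_t seg_len)
{
    return (b.end - snd_una) / seg_len;
}

/* Packet-based rule: the count must exceed the reordering threshold. */
static bool packet_rule(unsigned int fackets_out, unsigned int reordering)
{
    return fackets_out > reordering;
}
```

For the last block in the trace, {2598:2838}, only 240 bytes are SACKed, far below 3 * 1460, so the byte rule does not fire. Under the assumptions above the packet count is (2838 - 2550) / 48 = 6, which exceeds a reordering threshold of 3, matching the fackets_out value of 6 given in the mail.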
Ilpo Järvinen wrote: > On Thu, 3 Dec 2009, Arnd Hannemann wrote: > >> Ilpo Järvinen wrote: >> >> [snipped] >> >>> Also, we have the another mystery to be solved, the fast retransmission is >>> not triggered for some reason (or alternatively not captured in to a >>> log), even in the working .9. case. It would be easy to see whether it >>> works at all from TCP point of view by looking into mibs once you have >>> have some transfers in a working configuration: >>> >>> grep -A1 TCP /proc/net/netstat >>> >>> ...luckily this fast retransmit issue is less crucial as almost all people >>> are pretty happy already if their RTO-based recovery works even if the >>> fast recovery would not. So figuring it out can be postponed (if one has >>> to prioritize) until the silent death issue is out of the way. >>> >>> >> I looked at the working .9 case stream from 192.168.1.15 to 192.168.1.19. >> I don't think it is a mystery that fast retransmit does not trigger. >> The condition SACKED_DATA > 3* SMSS is simply not fulfilled. >> Neither are there 3 non-continuous SACK sequences. >> The segments sent are too small :-( >> Interesting though, seems to me in this case non-SACK would be better than SACK. >> Or did I miss something? > > Yes, a particularly big one, linux does not count SACKs bytes but packets. > In the first recovery, plenty of packets are SACKed: > > 135 sack 1 {2598:2646}> > 108 sack 1 {2598:2694}> > 121 sack 1 {2598:2742}> > 95 sack 1 {2598:2790}> > 426 sack 1 {2598:2838}> > > fackets_out should be 6 now which is way more than 3 which is the > default tp->reordering. Ok, you probable know better than me. But, aren't the SKBs collapsed to SMSS size segments and then counted? I thought so. The 3*SMSS restriction is from RFC 3517, but of course you know. > >> Hey we could cook up a draft for this problem ;-) >> >> Anyway, real problem is, RTO does not trigger... > > There are two problems. ...Both are real. 
;-) But significance of the > other is much worse than the other. I agree. I'm already trying to get scp stalling, but no luck so far. Neither with artificially dropping packets, nor using WLAN :-( Best regards, Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 3 Dec 2009, Arnd Hannemann wrote: > Ilpo Järvinen wrote: > > On Thu, 3 Dec 2009, Arnd Hannemann wrote: > > > >> Ilpo Järvinen wrote: > >> > >> [snipped] > >> > >>> Also, we have the another mystery to be solved, the fast retransmission is > >>> not triggered for some reason (or alternatively not captured in to a > >>> log), even in the working .9. case. It would be easy to see whether it > >>> works at all from TCP point of view by looking into mibs once you have > >>> have some transfers in a working configuration: > >>> > >>> grep -A1 TCP /proc/net/netstat > >>> > >>> ...luckily this fast retransmit issue is less crucial as almost all people > >>> are pretty happy already if their RTO-based recovery works even if the > >>> fast recovery would not. So figuring it out can be postponed (if one has > >>> to prioritize) until the silent death issue is out of the way. > >>> > >>> > >> I looked at the working .9 case stream from 192.168.1.15 to 192.168.1.19. > >> I don't think it is a mystery that fast retransmit does not trigger. > >> The condition SACKED_DATA > 3* SMSS is simply not fulfilled. > >> Neither are there 3 non-continuous SACK sequences. > >> The segments sent are too small :-( > >> Interesting though, seems to me in this case non-SACK would be better than SACK. > >> Or did I miss something? > > > > Yes, a particularly big one, linux does not count SACKs bytes but packets. > > In the first recovery, plenty of packets are SACKed: > > > > 135 sack 1 {2598:2646}> > > 108 sack 1 {2598:2694}> > > 121 sack 1 {2598:2742}> > > 95 sack 1 {2598:2790}> > > 426 sack 1 {2598:2838}> > > > > fackets_out should be 6 now which is way more than 3 which is the > > default tp->reordering. > > Ok, you probable know better than me. > But, aren't the SKBs collapsed to SMSS size segments and then > counted? I thought so. > The 3*SMSS restriction is from RFC 3517, but of course you know. 
On the sender side (for SACKed skbs) we should still retain the segment counter for the collapsed skb (at least in the SACK code this was my intention, but it could be that there is something wrong in that area). Besides, I think I've seen the fast rexmit missing in the "sack 3" (i.e., three holes) case too, so that would point to some other bug.

Btw, we can potentially go well beyond an MSS sized collapse too for the sacked skbs, as long as there is room in sg frags. It's a different story for the rexmits though, but that "collapse" is not significant here I think.

> >> Hey we could cook up a draft for this problem ;-)
> >>
> >> Anyway, real problem is, RTO does not trigger...
> >
> > There are two problems. ...Both are real. ;-) But significance of the
> > other is much worse than the other.
>
> I agree.
> I'm already trying to get scp stalling, but no luck so far. Neither with
> artificially dropping packets, nor using WLAN :-(

I got it to happen, but sadly scp stalled because of another issue related to rtt bloat (check this thread in the archive if you're interested). I think that might need some clarification for 1323bis too, but I'm currently thinking it through before giving my input/feedback on that on tcpm.

Are you sure you drop in the right direction, i.e., in the ACK/scp flow control direction, which sends those small packets? Data direction losses seem somewhat insignificant here.
On Thu, Dec 03, 2009 at 12:29:39PM +0200, Ilpo Järvinen wrote:
> I've added Greg as CC to make him aware of this issue in early as it now
> affects 2.6.32 too (rather important to get dealt quickly in stable once
> we have a tested solution since TCP is pretty broken with the silent
> deaths this problem seems to cause). ...One possibility would be to just
> queue the tested revert to stable and sort this thing out for 2.6.33 in
> net-2.6.
>
> Opinions, Dave?, Greg?

As always, if you have a patch that you want in the -stable tree, send
it, with the git commit id of the patch that is in Linus's tree, to the
stable@kernel.org email address so I don't lose it.

thanks,

greg k-h
From: Greg KH <gregkh@suse.de>
Date: Thu, 3 Dec 2009 10:32:07 -0800

> On Thu, Dec 03, 2009 at 12:29:39PM +0200, Ilpo Järvinen wrote:
>> I've added Greg as CC to make him aware of this issue in early as it now
>> affects 2.6.32 too (rather important to get dealt quickly in stable once
>> we have a tested solution since TCP is pretty broken with the silent
>> deaths this problem seems to cause). ...One possibility would be to just
>> queue the tested revert to stable and sort this thing out for 2.6.33 in
>> net-2.6.
>>
>> Opinions, Dave?, Greg?
>
> As always, if you have a patch that you want in the -stable tree, send
> it, with the git commit id of the patch that is in Linus's tree, to the
> stable@kernel.org email address so I don't loose it.

Yep, we'll do that once we track this down and make a final decision
on how to fix it.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index fbe427a..da07602 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -311,12 +311,9 @@ tcp_no_metrics_save - BOOLEAN
 	connections.
 
 tcp_orphan_retries - INTEGER
-	This value influences the timeout of a locally closed TCP connection,
-	when RTO retransmissions remain unacknowledged.
-	See tcp_retries2 for more details.
-
-	The default value is 7.
-	If your machine is a loaded WEB server,
+	How may times to retry before killing TCP connection, closed
+	by our side. Default value 7 corresponds to ~50sec-16min
+	depending on RTO. If you machine is loaded WEB server,
 	you should think about lowering this value, such sockets
 	may consume significant resources. Cf. tcp_max_orphans.
 
@@ -330,28 +327,16 @@ tcp_retrans_collapse - BOOLEAN
 	certain TCP stacks.
 
 tcp_retries1 - INTEGER
-	This value influences the time, after which TCP decides, that
-	something is wrong due to unacknowledged RTO retransmissions,
-	and reports this suspicion to the network layer.
-	See tcp_retries2 for more details.
-
-	RFC 1122 recommends at least 3 retransmissions, which is the
-	default.
+	How many times to retry before deciding that something is wrong
+	and it is necessary to report this suspicion to network layer.
+	Minimal RFC value is 3, it is default, which corresponds
+	to ~3sec-8min depending on RTO.
 
 tcp_retries2 - INTEGER
-	This value influences the timeout of an alive TCP connection,
-	when RTO retransmissions remain unacknowledged.
-	Given a value of N, a hypothetical TCP connection following
-	exponential backoff with an initial RTO of TCP_RTO_MIN would
-	retransmit N times before killing the connection at the (N+1)th RTO.
-
-	The default value of 15 yields a hypothetical timeout of 924.6
-	seconds and is a lower bound for the effective timeout.
-	TCP will effectively time out at the first RTO which exceeds the
-	hypothetical timeout.
-
-	RFC 1122 recommends at least 100 seconds for the timeout,
-	which corresponds to a value of at least 8.
+	How may times to retry before killing alive TCP connection.
+	RFC1122 says that the limit should be longer than 100 sec.
+	It is too small number. Default value 15 corresponds to ~13-30min
+	depending on RTO.
 
 tcp_rfc1337 - BOOLEAN
 	If set, the TCP stack behaves conforming to RFC1337. If unset,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..983367e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -469,7 +469,6 @@ extern void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 				      int nonagle);
 extern int tcp_may_send_now(struct sock *sk);
 extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
-extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
 extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
@@ -522,17 +521,6 @@ extern int tcp_mtu_to_mss(struct sock *sk, int pmtu);
 extern int tcp_mss_to_mtu(struct sock *sk, int mss);
 extern void tcp_mtup_init(struct sock *sk);
 
-static inline void tcp_bound_rto(const struct sock *sk)
-{
-	if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
-		inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
-}
-
-static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
-{
-	return (tp->srtt >> 3) + tp->rttvar;
-}
-
 static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
 {
 	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
@@ -1259,29 +1247,6 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
-/* This function calculates a "timeout" which is equivalent to the timeout of a
- * TCP connection after "boundary" unsucessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN.
- */
-static inline bool retransmits_timed_out(const struct sock *sk,
-					 unsigned int boundary)
-{
-	unsigned int timeout, linear_backoff_thresh;
-
-	if (!inet_csk(sk)->icsk_retransmits)
-		return false;
-
-	linear_backoff_thresh = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
-
-	if (boundary <= linear_backoff_thresh)
-		timeout = ((2 << boundary) - 1) * TCP_RTO_MIN;
-	else
-		timeout = ((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN +
-			  (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
-
-	return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= timeout;
-}
-
 static inline struct sk_buff *tcp_send_head(struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..6322e62 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -685,7 +685,7 @@ static inline void tcp_set_rto(struct sock *sk)
 	 *    is invisible. Actually, Linux-2.4 also generates erratic
 	 *    ACKs in some circumstances.
 	 */
-	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
+	inet_csk(sk)->icsk_rto = (tp->srtt >> 3) + tp->rttvar;
 
 	/* 2. Fixups made earlier cannot be right.
 	 *    If we do not estimate RTO correctly without them,
@@ -696,7 +696,8 @@ static inline void tcp_set_rto(struct sock *sk)
 	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
 	 * guarantees that rto is higher.
 	 */
-	tcp_bound_rto(sk);
+	if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
+		inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
 }
 
 /* Save metrics learned by this TCP session.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..702ce88 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -328,29 +328,26 @@ static void do_pmtu_discovery(struct sock *sk, struct iphdr *iph, u32 mtu)
  *
  */
 
-void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+void tcp_v4_err(struct sk_buff *skb, u32 info)
 {
-	struct iphdr *iph = (struct iphdr *)icmp_skb->data;
-	struct tcphdr *th = (struct tcphdr *)(icmp_skb->data + (iph->ihl << 2));
-	struct inet_connection_sock *icsk;
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
 	struct tcp_sock *tp;
 	struct inet_sock *inet;
-	const int type = icmp_hdr(icmp_skb)->type;
-	const int code = icmp_hdr(icmp_skb)->code;
+	const int type = icmp_hdr(skb)->type;
+	const int code = icmp_hdr(skb)->code;
 	struct sock *sk;
-	struct sk_buff *skb;
 	__u32 seq;
-	__u32 remaining;
 	int err;
-	struct net *net = dev_net(icmp_skb->dev);
+	struct net *net = dev_net(skb->dev);
 
-	if (icmp_skb->len < (iph->ihl << 2) + 8) {
+	if (skb->len < (iph->ihl << 2) + 8) {
 		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
 		return;
 	}
 
 	sk = inet_lookup(net, &tcp_hashinfo, iph->daddr, th->dest,
-			iph->saddr, th->source, inet_iif(icmp_skb));
+			iph->saddr, th->source, inet_iif(skb));
 	if (!sk) {
 		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
 		return;
@@ -370,7 +367,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 	if (sk->sk_state == TCP_CLOSE)
 		goto out;
 
-	icsk = inet_csk(sk);
 	tp = tcp_sk(sk);
 	seq = ntohl(th->seq);
 	if (sk->sk_state != TCP_LISTEN &&
@@ -397,39 +393,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		}
 
 		err = icmp_err_convert[code].errno;
-		/* check if icmp_skb allows revert of backoff
-		 * (see draft-zimmermann-tcp-lcd) */
-		if (code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH)
-			break;
-		if (seq != tp->snd_una || !icsk->icsk_retransmits ||
-		    !icsk->icsk_backoff)
-			break;
-
-		icsk->icsk_backoff--;
-		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
-					 icsk->icsk_backoff;
-		tcp_bound_rto(sk);
-
-		skb = tcp_write_queue_head(sk);
-		BUG_ON(!skb);
-
-		remaining = icsk->icsk_rto - min(icsk->icsk_rto,
-				tcp_time_stamp - TCP_SKB_CB(skb)->when);
-
-		if (remaining) {
-			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-						  remaining, TCP_RTO_MAX);
-		} else if (sock_owned_by_user(sk)) {
-			/* RTO revert clocked out retransmission,
-			 * but socket is locked. Will defer. */
-			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-						  HZ/20, TCP_RTO_MAX);
-		} else {
-			/* RTO revert clocked out retransmission.
-			 * Will retransmit now */
-			tcp_retransmit_timer(sk);
-		}
-
 		break;
 	case ICMP_TIME_EXCEEDED:
 		err = EHOSTUNREACH;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index cdb2ca7..c520fb6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -137,14 +137,13 @@ static int tcp_write_timeout(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int retry_until;
-	bool do_reset;
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
 			dst_negative_advice(&sk->sk_dst_cache);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
-		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
+		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
@@ -156,15 +155,13 @@ static int tcp_write_timeout(struct sock *sk)
 			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
 
 			retry_until = tcp_orphan_retries(sk, alive);
-			do_reset = alive ||
-				   !retransmits_timed_out(sk, retry_until);
 
-			if (tcp_out_of_resources(sk, do_reset))
+			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
 				return 1;
 		}
 	}
 
-	if (retransmits_timed_out(sk, retry_until)) {
+	if (icsk->icsk_retransmits >= retry_until) {
 		/* Has it gone just too far? */
 		tcp_write_err(sk);
 		return 1;
@@ -282,7 +279,7 @@ static void tcp_probe_timer(struct sock *sk)
  *	The TCP retransmit timer.
  */
 
-void tcp_retransmit_timer(struct sock *sk)
+static void tcp_retransmit_timer(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -388,7 +385,7 @@ void tcp_retransmit_timer(struct sock *sk)
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
-	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
+	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
 		__sk_dst_reset(sk);
 
 out:;
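For anyone checking the numbers in the documentation hunk above, the arithmetic of the retransmits_timed_out() helper that this revert removes can be reproduced in userspace. The sketch below is only an illustration, not the kernel code: it works in milliseconds instead of jiffies, assumes TCP_RTO_MIN is 200 ms and TCP_RTO_MAX is 120 s, and uses hypothetical function names.

```c
#include <stdbool.h>
#include <stdint.h>

#define RTO_MIN_MS 200u     /* assumption: TCP_RTO_MIN = HZ/5 */
#define RTO_MAX_MS 120000u  /* assumption: TCP_RTO_MAX = 120*HZ */

/* Integer log2, standing in for the kernel's ilog2(). */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Hypothetical timeout after "boundary" exponentially backed-off
 * retransmissions starting at RTO_MIN_MS, linear once the RTO has
 * reached RTO_MAX_MS. Same formula as in the removed helper. */
static uint32_t hypothetical_timeout_ms(unsigned int boundary)
{
	unsigned int thresh = ilog2_u32(RTO_MAX_MS / RTO_MIN_MS);

	if (boundary <= thresh)
		return ((2u << boundary) - 1) * RTO_MIN_MS;
	return ((2u << thresh) - 1) * RTO_MIN_MS +
	       (boundary - thresh) * RTO_MAX_MS;
}

/* Elapsed-time check against the stamp of the first retransmission;
 * the unsigned subtraction mirrors tcp_time_stamp - retrans_stamp. */
static bool timed_out(uint32_t now_ms, uint32_t retrans_stamp_ms,
		      unsigned int boundary)
{
	return (uint32_t)(now_ms - retrans_stamp_ms) >=
	       hypothetical_timeout_ms(boundary);
}
```

Under these assumptions hypothetical_timeout_ms(15) is 924600 ms, i.e. the 924.6 seconds quoted for tcp_retries2, and hypothetical_timeout_ms(8) is 102200 ms, just above the 100 seconds RFC 1122 asks for. The sketch also shows why the value of retrans_stamp matters: with a stamp of 0 the elapsed time equals the current clock value, so timed_out() can return true immediately. Whether that actually happens in the kernel is not established in this thread.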