| Field | Value |
|---|---|
| Message ID | 1432657435.4060.267.camel@edumazet-glaptop2.roam.corp.google.com |
| State | RFC, archived |
| Delegated to | David Miller |
Hi Eric,

On Tue, May 26, 2015 at 09:23:55AM -0700, Eric Dumazet wrote:
> On Tue, 2015-05-26 at 10:25 -0400, Ido Yariv wrote:
> > The Tail Loss Probe RFC specifies that the PTO value should be set to
> > max(2 * SRTT, 10ms), where SRTT is the smoothed round-trip time.
> >
> > The PTO value is converted to jiffies, so the timer might expire
> > prematurely. This is especially problematic on systems in which HZ=100.
> >
> > To work around this issue, increase the number of jiffies by one,
> > ensuring that the timeout won't expire in less than 10ms.
> >
> > Signed-off-by: Ido Yariv <idox.yariv@intel.com>
> > ---
> >  net/ipv4/tcp_output.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 534e5fd..6f57d3d 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2207,7 +2207,7 @@ bool tcp_schedule_loss_probe(struct sock *sk)
> >  	if (tp->packets_out == 1)
> >  		timeout = max_t(u32, timeout,
> >  				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
> > -	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
> > +	timeout = max_t(u32, timeout, msecs_to_jiffies(10) + 1);
> >
> >  	/* If RTO is shorter, just schedule TLP in its place. */
> >  	tlp_time_stamp = tcp_time_stamp + timeout;
>
> Have you really hit an issue, or did you send this patch after all these
> msecs_to_jiffies() discussions on lkml/netdev ?

This actually fixed a specific issue I ran into. The issue caused a
degradation in throughput in a benchmark which sent relatively small
chunks of data (100KB) in a loop. The impact was quite substantial -
with this patch, throughput increased by up to 50% on the platform this
was tested on.

> Not sure this is the right fix.
>
> TLP was really tested with an effective min delay of 10ms.
>
> Adding 10% for the sake of crazy HZ=100 builds seems a high price.
> (All recent TCP changes were tested with HZ=1000 BTW ...)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 534e5fdb04c11152bae36f47a786e8b10b823cd3..5321df89af9b59c6727395c489e6f9b2770dcd5e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2208,6 +2208,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
>  		timeout = max_t(u32, timeout,
>  				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
>  	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
> +#if HZ <= 100
> +	timeout = max_t(u32, timeout, 2);
> +#endif
>
>  	/* If RTO is shorter, just schedule TLP in its place. */
>  	tlp_time_stamp = tcp_time_stamp + timeout;

This was actually the first incarnation of this patch. However, while
the impact of this issue is greatest when HZ=100, it can impact other
settings as well. For instance, if HZ=250, the timer could expire after
a bit over 8ms instead of 10ms, and after 9ms for HZ=1000.

By increasing the number of jiffies, we ensure that we'll wait at least
10ms and never less than that, so for HZ=1000, it'll be anywhere between
10ms and 11ms instead of 9ms and 10ms.

Thanks,
Ido.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2015-05-26 at 13:02 -0400, Ido Yariv wrote:
> Hi Eric,
>
> On Tue, May 26, 2015 at 09:23:55AM -0700, Eric Dumazet wrote:
> > On Tue, 2015-05-26 at 10:25 -0400, Ido Yariv wrote:
> > > The Tail Loss Probe RFC specifies that the PTO value should be set to
> > > max(2 * SRTT, 10ms), where SRTT is the smoothed round-trip time.
> > >
> > > The PTO value is converted to jiffies, so the timer might expire
> > > prematurely. This is especially problematic on systems in which HZ=100.
> > >
> > > To work around this issue, increase the number of jiffies by one,
> > > ensuring that the timeout won't expire in less than 10ms.
> > >
> > > Signed-off-by: Ido Yariv <idox.yariv@intel.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 534e5fd..6f57d3d 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -2207,7 +2207,7 @@ bool tcp_schedule_loss_probe(struct sock *sk)
> > >  	if (tp->packets_out == 1)
> > >  		timeout = max_t(u32, timeout,
> > >  				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
> > > -	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
> > > +	timeout = max_t(u32, timeout, msecs_to_jiffies(10) + 1);
> > >
> > >  	/* If RTO is shorter, just schedule TLP in its place. */
> > >  	tlp_time_stamp = tcp_time_stamp + timeout;
> >
> > Have you really hit an issue, or did you send this patch after all these
> > msecs_to_jiffies() discussions on lkml/netdev ?
>
> This actually fixed a specific issue I ran into. This issue caused a
> degradation in throughput in a benchmark which sent relatively small
> chunks of data (100KB) in a loop. The impact was quite substantial -
> with this patch, throughput increased by up to 50% on the platform this
> was tested on.

Really ? You have more problems if your benchmark relies on TLP.

Please share your setup, because I suspect you hit other more serious
bugs.
> This was actually the first incarnation of this patch. However, while
> the impact of this issue when HZ=100 is the greatest, it can also
> impact other settings. For instance, if HZ=250, the timer could expire
> after a bit over 8ms instead of 10ms, and after 9ms for HZ=1000.
>
> By increasing the number of jiffies, we ensure that we'll wait at least
> 10ms but never less than that, so for HZ=1000, it'll be anywhere between
> 10ms and 11ms instead of 9ms and 10ms.

Yes, but we do not want to blindly increase this timeout. We tested with
this exact value a few years ago: between 9 and 10 ms, not between 10
and 11 ms with an added 10% in max latencies.
Hi Eric,

On Tue, May 26, 2015 at 10:13:40AM -0700, Eric Dumazet wrote:
> On Tue, 2015-05-26 at 13:02 -0400, Ido Yariv wrote:
> > Hi Eric,
> >
> > On Tue, May 26, 2015 at 09:23:55AM -0700, Eric Dumazet wrote:
> > > Have you really hit an issue, or did you send this patch after all these
> > > msecs_to_jiffies() discussions on lkml/netdev ?
> >
> > This actually fixed a specific issue I ran into. This issue caused a
> > degradation in throughput in a benchmark which sent relatively small
> > chunks of data (100KB) in a loop. The impact was quite substantial -
> > with this patch, throughput increased by up to 50% on the platform this
> > was tested on.
>
> Really ? You have more problems if your benchmark relies on TLP.
>
> Please share your setup, because I suspect you hit other more serious
> bugs.

The platform this was tested on was an embedded platform with a wifi
module (11n, 20MHz). The other end was a computer running Windows, and
the benchmarking software was IxChariot. The whole setup was running in
a shielded box with minimal interference.

As it turned out, the throughput was limited by the congestion window.
Further analysis led to TLP - the fact that its timer was expiring
prematurely impacted cwnd, which in turn prevented the wireless driver
from having enough skbs to buffer and send.

Increasing the size of the chunks being sent had a similar impact on
throughput, presumably because the congestion window had enough time to
increase.

Changing the congestion control algorithm from cubic/reno to Westwood
also had a similar impact on throughput.

> > This was actually the first incarnation of this patch. However, while
> > the impact of this issue when HZ=100 is the greatest, it can also impact
> > other settings as well. For instance, if HZ=250, the timer could expire
> > after a bit over 8ms instead of 10ms, and 9ms for HZ=1000.
> >
> > By increasing the number of jiffies, we ensure that we'll wait at least
> > 10ms but never less than that, so for HZ=1000, it'll be anywhere between
> > 10ms and 11ms instead of 9ms and 10ms.
>
> Yes, but we do not want to blindly increase this timeout, tested few
> years ago with this exact value : between 9 and 10 ms. Not between 10
> and 11 ms, with an added 10% in max latencies.

I understand, and I also suspect that having it expire after 9ms will
have very little impact, if any.

Since it mainly affects HZ=100 systems, we can simply go with having at
least 2 jiffies on these systems, and leave everything else as is.

However, if the 10ms has a special meaning (I couldn't find reasoning
for it in the RFC), making sure this timer doesn't expire prematurely
could be beneficial. I'm afraid this was not tested on the setup
mentioned above though.

Thanks,
Ido.
On Tue, 2015-05-26 at 13:55 -0400, Ido Yariv wrote:
> Hi Eric,
>
> I understand, and I also suspect that having it expire after 9ms will
> have very little impact, if at all.
>
> Since it mainly affects HZ=100 systems, we can simply go with having at
> least 2 jiffies on these systems, and leave everything else as is.
>
> However, if the 10ms has a special meaning (couldn't find reasoning for
> it in the RFC), making sure this timer doesn't expire prematurely could
> be beneficial. I'm afraid this was not tested on the setup mentioned
> above though.

The RFC did not explain how the 10ms delay was implemented. That is the
dark side of RFCs: they are full of 'unsaid things', like OS bugs.

What is not said in the TLP paper is this: Linux timers have a 'jiffy'
granularity that might be 1/100, 1/250, 1/1000, or even 1/64 on Alpha
processors...

Fact is, we did the TLP implementation, experimentation, and paper at
the same time, and we do not want to change the current behavior on
HZ=1000 hosts. This is the kind of change that would require a lot of
testing at Google.

Please resend your patch so that only HZ <= 100 is changed, and we will
happily acknowledge it.

Thanks
On Tue, 2015-05-26 at 13:55 -0400, Ido Yariv wrote:
> The platform this was tested on was an embedded platform with a wifi
> module (11n, 20MHz). The other end was a computer running Windows, and
> the benchmarking software was IxChariot. The whole setup was running
> in a shielded box with minimal interference.
>
> As it seems, the throughput was limited by the congestion window.
> Further analysis led to TLP - the fact that its timer was expiring
> prematurely impacted cwnd, which in turn prevented the wireless driver
> from having enough skbs to buffer and send.
>
> Increasing the size of the chunks being sent had a similar impact on
> throughput, presumably because the congestion window had enough time
> to increase.
>
> Changing the congestion control algorithm from cubic/reno to Westwood
> also had a similar impact on throughput.

Have you tested what results you had by completely disabling TLP ?

Maybe a timer of 10 to 20ms is too short anyway in your testbed.
Hi Eric,

On Tue, May 26, 2015 at 11:25:21AM -0700, Eric Dumazet wrote:
> On Tue, 2015-05-26 at 13:55 -0400, Ido Yariv wrote:
> > The platform this was tested on was an embedded platform with a wifi
> > module (11n, 20MHz). The other end was a computer running Windows,
> > and the benchmarking software was IxChariot. The whole setup was
> > running in a shielded box with minimal interference.
> >
> > As it seems, the throughput was limited by the congestion window.
> > Further analysis led to TLP - the fact that its timer was expiring
> > prematurely impacted cwnd, which in turn prevented the wireless
> > driver from having enough skbs to buffer and send.
> >
> > Increasing the size of the chunks being sent had a similar impact on
> > throughput, presumably because the congestion window had enough time
> > to increase.
> >
> > Changing the congestion control algorithm from cubic/reno to
> > Westwood also had a similar impact on throughput.
>
> Have you tested what results you had by completely disabling TLP ?
>
> Maybe a timer of 10 to 20ms is too short anyway in your testbed.

Yes, I have (by writing 2 to /proc/sys/net/ipv4/tcp_early_retrans), and
it had a similar effect. That was actually the first workaround for
this issue, before it was traced to TLP.

With 10ms to 20ms timers the throughput was just the same as with TLP
disabled altogether, so it seems that was just long enough.

Cheers,
Ido.
```diff
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 534e5fdb04c11152bae36f47a786e8b10b823cd3..5321df89af9b59c6727395c489e6f9b2770dcd5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2208,6 +2208,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 		timeout = max_t(u32, timeout,
 				(rtt + (rtt >> 1) + TCP_DELACK_MAX));
 	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
+#if HZ <= 100
+	timeout = max_t(u32, timeout, 2);
+#endif
 
 	/* If RTO is shorter, just schedule TLP in its place. */
 	tlp_time_stamp = tcp_time_stamp + timeout;
```