
[net-next] tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

Message ID 20171022033808.12641-1-den@klaipeden.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Koichiro Den Oct. 22, 2017, 3:38 a.m. UTC
When retransmission from the TSQ handler was introduced in commit
f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
skbs' timestamps were updated at the time of the actual transmission.
Since the later commit 385e20706fac ("tcp: use tp->tcp_mstamp in output
path"), this is no longer done. That commit's comment says "We try to
refresh tp->tcp_mstamp only when necessary", and at present
tcp_tsq_handler and tcp_v4_mtu_reduced are affected by this. The latter
is fine, since it is rare enough.

As for the former, even though possible retransmissions on the tasklet
come just after the destructor runs in NET_RX softirq handling, the time
between them can be non-negligibly large, to the extent that
tcp_rack_advance or RTO rearming is affected if other (remaining) RX,
BLOCK and (preceding) TASKLET softirq handlers are unexpectedly heavy.

So, in the same way as tcp_write_timer_handler does, calling tcp_mstamp_refresh
ensures the accuracy of the algorithms relying on it.

Signed-off-by: Koichiro Den <den@klaipeden.com>
---
 net/ipv4/tcp_output.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Oct. 22, 2017, 3:52 a.m. UTC | #1
On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> When retransmission on TSQ handler was introduced in the commit
> f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> skbs' timestamps were updated on the actual transmission. In the later
> commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
> being done so. In the commit, the comment says "We try to refresh
> tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
> it's rare enough.
> 
> About the former, even though possible retransmissions on the tasklet
> comes just after the destructor run in NET_RX softirq handling, the time
> between them could be nonnegligibly large to the extent that
> tcp_rack_advance or rto rearming be affected if other (remaining) RX,
> BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.
> 
> So in the same way as tcp_write_timer_handler does, doing tcp_mstamp_refresh
> ensures the accuracy of algorithms relying on it.
> 
> Signed-off-by: Koichiro Den <den@klaipeden.com>
> ---
>  net/ipv4/tcp_output.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Very nice catch, thanks a lot Koichiro.

This IMO would target net tree, since it is a bug fix.

Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")

Thanks !

We should have caught that in our regression packetdrill tests...
Eric Dumazet Oct. 22, 2017, 3:52 a.m. UTC | #2
On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:

> Very nice catch, thanks a lot Koichiro.
> 
> This IMO would target net tree, since it is a bug fix.
> 
> Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")

Reviewed-by: Eric Dumazet <edumazet@google.com>
Koichiro Den Oct. 22, 2017, 4:10 a.m. UTC | #3
On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:
> On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> > When retransmission on TSQ handler was introduced in the commit
> > f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> > skbs' timestamps were updated on the actual transmission. In the later
> > commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
> > being done so. In the commit, the comment says "We try to refresh
> > tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> > tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
> > it's rare enough.
> > 
> > About the former, even though possible retransmissions on the tasklet
> > comes just after the destructor run in NET_RX softirq handling, the time
> > between them could be nonnegligibly large to the extent that
> > tcp_rack_advance or rto rearming be affected if other (remaining) RX,
> > BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.
> > 
> > So in the same way as tcp_write_timer_handler does, doing tcp_mstamp_refresh
> > ensures the accuracy of algorithms relying on it.
> > 
> > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Very nice catch, thanks a lot Koichiro.
> 
> This IMO would target net tree, since it is a bug fix.
> 
> Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Ok I will submit it to net tree. Thanks!
> 
> Thanks !
> 
> We should have caught that in our regression packetdrill tests...
In "remote" mode testing, which does not rely on tun xmit, I agree it could be
caught. If it would be better to write, test and attach a script, please let me know.

Thank you.
Eric Dumazet Oct. 22, 2017, 5:21 a.m. UTC | #4
On Sun, 2017-10-22 at 13:10 +0900, Koichiro Den wrote:
> On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:
> > On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> > > When retransmission on TSQ handler was introduced in the commit
> > > f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> > > skbs' timestamps were updated on the actual transmission. In the later
> > > commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
> > > being done so. In the commit, the comment says "We try to refresh
> > > tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> > > tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
> > > it's rare enough.
> > > 
> > > About the former, even though possible retransmissions on the tasklet
> > > comes just after the destructor run in NET_RX softirq handling, the time
> > > between them could be nonnegligibly large to the extent that
> > > tcp_rack_advance or rto rearming be affected if other (remaining) RX,
> > > BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.
> > > 
> > > So in the same way as tcp_write_timer_handler does, doing tcp_mstamp_refresh
> > > ensures the accuracy of algorithms relying on it.
> > > 
> > > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Very nice catch, thanks a lot Koichiro.
> > 
> > This IMO would target net tree, since it is a bug fix.
> > 
> > Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
> Ok I will submit it to net tree. Thanks!
> > 
> > Thanks !
> > 
> > We should have caught that in our regression packetdrill tests...
> In its "remote" mode testing not relying on tun xmit, I agree it could be
> caught. If it's better to write, test and attach the script, please let me know.

Packetdrill in the normal (local) mode will do it.

Push fq packet scheduler on tun0, and packets will be held long enough
in FQ that TSQ will be effective.

Adding TCP TS support to the following packetdrill test should demonstrate
the issue and whether your patch cures it.

// Test if TSQ applies to retransmits.

`tc qdisc replace dev tun0 root fq quantum 1514 initial_quantum 1514 flow_limit 5
sysctl -e -q net.ipv4.tcp_min_tso_segs=1`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
 +.01 < . 1:1(0) ack 1 win 1500
   +0 accept(3, ..., ...) = 4

   +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
// send one frame per ms
   +0 setsockopt(4, SOL_SOCKET, SO_MAX_PACING_RATE, [1514000], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1000000], 4) = 0

   +0 write(4, ..., 120000) = 120000

   +0 > .  1:1461(1460) ack 1
+.001 > .  1461:2921(1460) ack 1
+.001 > .  2921:4381(1460) ack 1
+.001 > .  4381:5841(1460) ack 1
+.001 > .  5841:7301(1460) ack 1
+.001 > .  7301:8761(1460) ack 1
+.001 > .  8761:10221(1460) ack 1
+.001 > .  10221:11681(1460) ack 1
+.001 > .  11681:13141(1460) ack 1
+.001 > .  13141:14601(1460) ack 1
 +.01 < . 1:1(0) ack 14601 win 1500

+.001 > .  14601:16061(1460) ack 1
+.001 > .  16061:17521(1460) ack 1
+.001 > .  17521:18981(1460) ack 1
+.001 > .  18981:20441(1460) ack 1
+.001 > .  20441:21901(1460) ack 1
+.001 > .  21901:23361(1460) ack 1
+.001 > .  23361:24821(1460) ack 1
+.001 > .  24821:26281(1460) ack 1
+.001 > .  26281:27741(1460) ack 1
+.001 > .  27741:29201(1460) ack 1
+.001 > .  29201:30661(1460) ack 1
+.001 > .  30661:32121(1460) ack 1
+.001 > .  32121:33581(1460) ack 1
+.001 > .  33581:35041(1460) ack 1
+.001 > .  35041:36501(1460) ack 1
+.001 > .  36501:37961(1460) ack 1
+.001 > .  37961:39421(1460) ack 1
+.001 > .  39421:40881(1460) ack 1
+.001 > .  40881:42341(1460) ack 1
+.001 > .  42341:43801(1460) ack 1
 +.01 < .  1:1(0) ack 43801 win 1500

+.001 > .  43801:45261(1460) ack 1
+.001 > .  45261:46721(1460) ack 1
+.001 > .  46721:48181(1460) ack 1
+.001 > .  48181:49641(1460) ack 1
+.001 > .  49641:51101(1460) ack 1
+.001 > .  51101:52561(1460) ack 1
+.001 > .  52561:54021(1460) ack 1
+.001 > .  54021:55481(1460) ack 1
+.001 > .  55481:56941(1460) ack 1
+.001 > .  56941:58401(1460) ack 1
+.001 > .  58401:59861(1460) ack 1
+.001 > .  59861:61321(1460) ack 1
+.001 > .  61321:62781(1460) ack 1
+.001 > .  62781:64241(1460) ack 1
+.001 > .  64241:65701(1460) ack 1
+.001 > .  65701:67161(1460) ack 1
+.001 > .  67161:68621(1460) ack 1
+.001 > .  68621:70081(1460) ack 1
+.001 > .  70081:71541(1460) ack 1
+.001 > .  71541:73001(1460) ack 1
+.001 > .  73001:74461(1460) ack 1
+.001 > .  74461:75921(1460) ack 1
+.001 > .  75921:77381(1460) ack 1
+.001 > .  77381:78841(1460) ack 1
+.001 > .  78841:80301(1460) ack 1
+.001 > .  80301:81761(1460) ack 1
+.001 > .  81761:83221(1460) ack 1
+.001 > .  83221:84681(1460) ack 1
+.001 > .  84681:86141(1460) ack 1
+.001 > .  86141:87601(1460) ack 1
+.001 > .  87601:89061(1460) ack 1
+.001 > .  89061:90521(1460) ack 1
+.001 > .  90521:91981(1460) ack 1
+.001 > .  91981:93441(1460) ack 1
+.001 > .  93441:94901(1460) ack 1
+.001 > .  94901:96361(1460) ack 1
+.001 > .  96361:97821(1460) ack 1
+.001 > .  97821:99281(1460) ack 1
+.001 > .  99281:100741(1460) ack 1
+.001 > .  100741:102201(1460) ack 1

// Ok lets trigger a bunch of rtx !
+.001 < .  1:1(0) ack 43801 win 1500 <nop,nop, sack 78841:102201>
+.001 > .  43801:45261(1460) ack 1
+.001 > .  45261:46721(1460) ack 1
+.001 > .  46721:48181(1460) ack 1
+.001 > .  48181:49641(1460) ack 1
+.001 > .  49641:51101(1460) ack 1
+.001 > .  51101:52561(1460) ack 1
+.001 > .  52561:54021(1460) ack 1
+.001 > .  54021:55481(1460) ack 1
+.001 > .  55481:56941(1460) ack 1
+.001 > .  56941:58401(1460) ack 1
+.001 > .  58401:59861(1460) ack 1
+.001 > .  59861:61321(1460) ack 1
+.001 > .  61321:62781(1460) ack 1
+.001 > .  62781:64241(1460) ack 1
+.001 > .  64241:65701(1460) ack 1
+.001 > .  65701:67161(1460) ack 1
Koichiro Den Oct. 22, 2017, 12:59 p.m. UTC | #5
On Sat, 2017-10-21 at 22:21 -0700, Eric Dumazet wrote:
> On Sun, 2017-10-22 at 13:10 +0900, Koichiro Den wrote:
> > On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:
> > > On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> > > > When retransmission on TSQ handler was introduced in the commit
> > > > f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> > > > skbs' timestamps were updated on the actual transmission. In the later
> > > > commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
> > > > being done so. In the commit, the comment says "We try to refresh
> > > > tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> > > > tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
> > > > it's rare enough.
> > > > 
> > > > About the former, even though possible retransmissions on the tasklet
> > > > comes just after the destructor run in NET_RX softirq handling, the time
> > > > between them could be nonnegligibly large to the extent that
> > > > tcp_rack_advance or rto rearming be affected if other (remaining) RX,
> > > > BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.
> > > > 
> > > > So in the same way as tcp_write_timer_handler does, doing
> > > > tcp_mstamp_refresh
> > > > ensures the accuracy of algorithms relying on it.
> > > > 
> > > > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > > > ---
> > > >  net/ipv4/tcp_output.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > Very nice catch, thanks a lot Koichiro.
> > > 
> > > This IMO would target net tree, since it is a bug fix.
> > > 
> > > Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
> > 
> > Ok I will submit it to net tree. Thanks!
> > > 
> > > Thanks !
> > > 
> > > We should have caught that in our regression packetdrill tests...
> > 
> > In its "remote" mode testing not relying on tun xmit, I agree it could be
> > caught. If it's better to write, test and attach the script, please let me
> > know.
> 
> Packetdrill in the normal (local) mode will do it.
> 
> Push fq packet scheduler on tun0, and packets will be held long enough
> in FQ that TSQ will be effective.
Ah yes, I missed it, thank you.
> 
> Adding TCP TS support on folloginw packetdrill test should demonstrate
> the issue and if your patch cures it.
Thanks for the demo script. I thought making the issue described in my commit
message obvious with a packetdrill script on its own was a bit difficult without
a heavy load on the test bed (or intentionally injecting time-consuming code).

IIUC, checking whether the TCP TS val reflects a slightly later timestamp than
the last reception that triggered the TSQ handler sounds much better. Though it
is still probabilistic: with some probability the one-millisecond delay is
reflected in the TS val and the packetdrill test passes; otherwise the test
fails. Am I correct? To be honest, I am still wondering what the best way is to
make the issue obvious.
> // Test if TSQ applies to retransmits.
> 
> `tc qdisc replace dev tun0 root fq quantum 1514 initial_quantum 1514
> flow_limit 5
> sysctl -e -q net.ipv4.tcp_min_tso_segs=1`
> 
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
> 
>    +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>  +.01 < . 1:1(0) ack 1 win 1500
>    +0 accept(3, ..., ...) = 4
> 
>    +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
> // send one frame per ms
>    +0 setsockopt(4, SOL_SOCKET, SO_MAX_PACING_RATE, [1514000], 4) = 0
>    +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1000000], 4) = 0
> 
>    +0 write(4, ..., 120000) = 120000
> 
>    +0 > .  1:1461(1460) ack 1
> +.001 > .  1461:2921(1460) ack 1
> +.001 > .  2921:4381(1460) ack 1
> +.001 > .  4381:5841(1460) ack 1
> +.001 > .  5841:7301(1460) ack 1
> +.001 > .  7301:8761(1460) ack 1
> +.001 > .  8761:10221(1460) ack 1
> +.001 > .  10221:11681(1460) ack 1
> +.001 > .  11681:13141(1460) ack 1
> +.001 > .  13141:14601(1460) ack 1
>  +.01 < . 1:1(0) ack 14601 win 1500
> 
> +.001 > .  14601:16061(1460) ack 1
> +.001 > .  16061:17521(1460) ack 1
> +.001 > .  17521:18981(1460) ack 1
> +.001 > .  18981:20441(1460) ack 1
> +.001 > .  20441:21901(1460) ack 1
> +.001 > .  21901:23361(1460) ack 1
> +.001 > .  23361:24821(1460) ack 1
> +.001 > .  24821:26281(1460) ack 1
> +.001 > .  26281:27741(1460) ack 1
> +.001 > .  27741:29201(1460) ack 1
> +.001 > .  29201:30661(1460) ack 1
> +.001 > .  30661:32121(1460) ack 1
> +.001 > .  32121:33581(1460) ack 1
> +.001 > .  33581:35041(1460) ack 1
> +.001 > .  35041:36501(1460) ack 1
> +.001 > .  36501:37961(1460) ack 1
> +.001 > .  37961:39421(1460) ack 1
> +.001 > .  39421:40881(1460) ack 1
> +.001 > .  40881:42341(1460) ack 1
> +.001 > .  42341:43801(1460) ack 1
>  +.01 < .  1:1(0) ack 43801 win 1500
> 
> +.001 > .  43801:45261(1460) ack 1
> +.001 > .  45261:46721(1460) ack 1
> +.001 > .  46721:48181(1460) ack 1
> +.001 > .  48181:49641(1460) ack 1
> +.001 > .  49641:51101(1460) ack 1
> +.001 > .  51101:52561(1460) ack 1
> +.001 > .  52561:54021(1460) ack 1
> +.001 > .  54021:55481(1460) ack 1
> +.001 > .  55481:56941(1460) ack 1
> +.001 > .  56941:58401(1460) ack 1
> +.001 > .  58401:59861(1460) ack 1
> +.001 > .  59861:61321(1460) ack 1
> +.001 > .  61321:62781(1460) ack 1
> +.001 > .  62781:64241(1460) ack 1
> +.001 > .  64241:65701(1460) ack 1
> +.001 > .  65701:67161(1460) ack 1
> +.001 > .  67161:68621(1460) ack 1
> +.001 > .  68621:70081(1460) ack 1
> +.001 > .  70081:71541(1460) ack 1
> +.001 > .  71541:73001(1460) ack 1
> +.001 > .  73001:74461(1460) ack 1
> +.001 > .  74461:75921(1460) ack 1
> +.001 > .  75921:77381(1460) ack 1
> +.001 > .  77381:78841(1460) ack 1
> +.001 > .  78841:80301(1460) ack 1
> +.001 > .  80301:81761(1460) ack 1
> +.001 > .  81761:83221(1460) ack 1
> +.001 > .  83221:84681(1460) ack 1
> +.001 > .  84681:86141(1460) ack 1
> +.001 > .  86141:87601(1460) ack 1
> +.001 > .  87601:89061(1460) ack 1
> +.001 > .  89061:90521(1460) ack 1
> +.001 > .  90521:91981(1460) ack 1
> +.001 > .  91981:93441(1460) ack 1
> +.001 > .  93441:94901(1460) ack 1
> +.001 > .  94901:96361(1460) ack 1
> +.001 > .  96361:97821(1460) ack 1
> +.001 > .  97821:99281(1460) ack 1
> +.001 > .  99281:100741(1460) ack 1
> +.001 > .  100741:102201(1460) ack 1
> 
> // Ok lets trigger a bunch of rtx !
> +.001 < .  1:1(0) ack 43801 win 1500 <nop,nop, sack 78841:102201>
> +.001 > .  43801:45261(1460) ack 1
> +.001 > .  45261:46721(1460) ack 1
> +.001 > .  46721:48181(1460) ack 1
> +.001 > .  48181:49641(1460) ack 1
> +.001 > .  49641:51101(1460) ack 1
> +.001 > .  51101:52561(1460) ack 1
> +.001 > .  52561:54021(1460) ack 1
> +.001 > .  54021:55481(1460) ack 1
> +.001 > .  55481:56941(1460) ack 1
> +.001 > .  56941:58401(1460) ack 1
> +.001 > .  58401:59861(1460) ack 1
> +.001 > .  59861:61321(1460) ack 1
> +.001 > .  61321:62781(1460) ack 1
> +.001 > .  62781:64241(1460) ack 1
> +.001 > .  64241:65701(1460) ack 1
> +.001 > .  65701:67161(1460) ack 1
Eric Dumazet Oct. 22, 2017, 4:49 p.m. UTC | #6
On Sun, 2017-10-22 at 21:59 +0900, Koichiro Den wrote:
> On Sat, 2017-10-21 at 22:21 -0700, Eric Dumazet wrote:
> > On Sun, 2017-10-22 at 13:10 +0900, Koichiro Den wrote:
> > > On Sat, 2017-10-21 at 20:52 -0700, Eric Dumazet wrote:
> > > > On Sun, 2017-10-22 at 12:38 +0900, Koichiro Den wrote:
> > > > > When retransmission on TSQ handler was introduced in the commit
> > > > > f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
> > > > > skbs' timestamps were updated on the actual transmission. In the later
> > > > > commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
> > > > > being done so. In the commit, the comment says "We try to refresh
> > > > > tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
> > > > > tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
> > > > > it's rare enough.
> > > > > 
> > > > > About the former, even though possible retransmissions on the tasklet
> > > > > comes just after the destructor run in NET_RX softirq handling, the time
> > > > > between them could be nonnegligibly large to the extent that
> > > > > tcp_rack_advance or rto rearming be affected if other (remaining) RX,
> > > > > BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.
> > > > > 
> > > > > So in the same way as tcp_write_timer_handler does, doing
> > > > > tcp_mstamp_refresh
> > > > > ensures the accuracy of algorithms relying on it.
> > > > > 
> > > > > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > > > > ---
> > > > >  net/ipv4/tcp_output.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > Very nice catch, thanks a lot Koichiro.
> > > > 
> > > > This IMO would target net tree, since it is a bug fix.
> > > > 
> > > > Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
> > > 
> > > Ok I will submit it to net tree. Thanks!
> > > > 
> > > > Thanks !
> > > > 
> > > > We should have caught that in our regression packetdrill tests...
> > > 
> > > In its "remote" mode testing not relying on tun xmit, I agree it could be
> > > caught. If it's better to write, test and attach the script, please let me
> > > know.
> > 
> > Packetdrill in the normal (local) mode will do it.
> > 
> > Push fq packet scheduler on tun0, and packets will be held long enough
> > in FQ that TSQ will be effective.
> Ah yes, I missed it, thank you.
> > 
> > Adding TCP TS support on folloginw packetdrill test should demonstrate
> > the issue and if your patch cures it.
> Thanks for the demo script. I thought making the issue described in my commit
> message obvious with a packetdrill script on its own was a bit difficult without
> heavy load on test bed (or intentionally injecting time-consuming code).
> 
> IIUC, whether or not TCP TS val reflects a bit later timestamp than its last
> reception, which triggered TSQ handler, sounds much better. Though it is still
> like "With some probability, one millisecond delay is reflected on TS val, so
> the packetdrill test passes. Otherwise we can say that the test fails."
> Am I correct? To be honest I am still wondering what is the best way to make the
> issue obvious.

Here is the test I cooked.

Note that it actually passes just fine, so I am wondering if your patch
is needed after all...


# cat fq-pacing-tsq-rtx-ts.pkt
// Test if TSQ applies to retransmits.

`tc qdisc replace dev tun0 root fq quantum 1514 initial_quantum 1514 flow_limit 5
sysctl -e -q net.ipv4.tcp_min_tso_segs=1`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1460,sackOK,TS val 190 ecr 0>
   +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 5000 ecr 190>
 +.01 < . 1:1(0) ack 1 win 65535 <nop,nop,TS val 200 ecr 5000>
   +0 accept(3, ..., ...) = 4

   +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
// send one frame per ms
   +0 setsockopt(4, SOL_SOCKET, SO_MAX_PACING_RATE, [1514000], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1000000], 4) = 0

   +0 write(4, ..., 120000) = 120000

+.000 > . 1:1449(1448) ack 1 <nop, nop, TS val 400 ecr 200>
+.001 > . 1449:2897(1448) ack 1 <nop, nop, TS val 401 ecr 200>
+.001 > . 2897:4345(1448) ack 1 <nop, nop, TS val 402 ecr 200>
+.001 > . 4345:5793(1448) ack 1 <nop, nop, TS val 403 ecr 200>
+.001 > . 5793:7241(1448) ack 1 <nop, nop, TS val 404 ecr 200>
+.001 > . 7241:8689(1448) ack 1 <nop, nop, TS val 405 ecr 200>
+.001 > . 8689:10137(1448) ack 1 <nop, nop, TS val 406 ecr 200>
+.001 > . 10137:11585(1448) ack 1 <nop, nop, TS val 407 ecr 200>
+.001 > . 11585:13033(1448) ack 1 <nop, nop, TS val 408 ecr 200>
+.001 > . 13033:14481(1448) ack 1 <nop, nop, TS val 409 ecr 200>
 +.01 < . 1:1(0) ack 14481 win 65535 <nop, nop, TS val 220 ecr 409>

+.000 > . 14481:15929(1448) ack 1 <nop, nop, TS val 420 ecr 220>
+.001 > . 15929:17377(1448) ack 1 <nop, nop, TS val 421 ecr 220>
+.001 > . 17377:18825(1448) ack 1 <nop, nop, TS val 422 ecr 220>
+.001 > . 18825:20273(1448) ack 1 <nop, nop, TS val 423 ecr 220>
+.001 > . 20273:21721(1448) ack 1 <nop, nop, TS val 424 ecr 220>
+.001 > . 21721:23169(1448) ack 1 <nop, nop, TS val 425 ecr 220>
+.001 > . 23169:24617(1448) ack 1 <nop, nop, TS val 426 ecr 220>
+.001 > . 24617:26065(1448) ack 1 <nop, nop, TS val 427 ecr 220>
+.001 > . 26065:27513(1448) ack 1 <nop, nop, TS val 428 ecr 220>
+.001 > . 27513:28961(1448) ack 1 <nop, nop, TS val 429 ecr 220>
+.001 > . 28961:30409(1448) ack 1 <nop, nop, TS val 430 ecr 220>
+.001 > . 30409:31857(1448) ack 1 <nop, nop, TS val 431 ecr 220>
+.001 > . 31857:33305(1448) ack 1 <nop, nop, TS val 432 ecr 220>
+.001 > . 33305:34753(1448) ack 1 <nop, nop, TS val 433 ecr 220>
+.001 > . 34753:36201(1448) ack 1 <nop, nop, TS val 434 ecr 220>
+.001 > . 36201:37649(1448) ack 1 <nop, nop, TS val 435 ecr 220>
+.001 > P. 37649:39097(1448) ack 1 <nop, nop, TS val 436 ecr 220>
+.001 > . 39097:40545(1448) ack 1 <nop, nop, TS val 437 ecr 220>
+.001 > . 40545:41993(1448) ack 1 <nop, nop, TS val 438 ecr 220>
+.001 > . 41993:43441(1448) ack 1 <nop, nop, TS val 439 ecr 220>
 +.01 < .  1:1(0) ack 43441 win 65535 <nop, nop, TS val 240 ecr 439>

+.000 > . 43441:44889(1448) ack 1 <nop, nop, TS val 450 ecr 240>
+.001 > . 44889:46337(1448) ack 1 <nop, nop, TS val 451 ecr 240>
+.001 > . 46337:47785(1448) ack 1 <nop, nop, TS val 452 ecr 240>
+.001 > . 47785:49233(1448) ack 1 <nop, nop, TS val 453 ecr 240>
+.001 > . 49233:50681(1448) ack 1 <nop, nop, TS val 454 ecr 240>
+.001 > . 50681:52129(1448) ack 1 <nop, nop, TS val 455 ecr 240>
+.001 > . 52129:53577(1448) ack 1 <nop, nop, TS val 456 ecr 240>
+.001 > . 53577:55025(1448) ack 1 <nop, nop, TS val 457 ecr 240>
+.001 > . 55025:56473(1448) ack 1 <nop, nop, TS val 458 ecr 240>
+.001 > . 56473:57921(1448) ack 1 <nop, nop, TS val 459 ecr 240>
+.001 > . 57921:59369(1448) ack 1 <nop, nop, TS val 460 ecr 240>
+.001 > . 59369:60817(1448) ack 1 <nop, nop, TS val 461 ecr 240>
+.001 > . 60817:62265(1448) ack 1 <nop, nop, TS val 462 ecr 240>
+.001 > . 62265:63713(1448) ack 1 <nop, nop, TS val 463 ecr 240>
+.001 > . 63713:65161(1448) ack 1 <nop, nop, TS val 464 ecr 240>
+.001 > . 65161:66609(1448) ack 1 <nop, nop, TS val 465 ecr 240>
+.001 > . 66609:68057(1448) ack 1 <nop, nop, TS val 466 ecr 240>
+.001 > . 68057:69505(1448) ack 1 <nop, nop, TS val 467 ecr 240>
+.001 > P. 69505:70953(1448) ack 1 <nop, nop, TS val 468 ecr 240>
+.001 > . 70953:72401(1448) ack 1 <nop, nop, TS val 469 ecr 240>
+.001 > . 72401:73849(1448) ack 1 <nop, nop, TS val 470 ecr 240>
+.001 > . 73849:75297(1448) ack 1 <nop, nop, TS val 471 ecr 240>
+.001 > . 75297:76745(1448) ack 1 <nop, nop, TS val 472 ecr 240>
+.001 > . 76745:78193(1448) ack 1 <nop, nop, TS val 473 ecr 240>
+.001 > . 78193:79641(1448) ack 1 <nop, nop, TS val 474 ecr 240>
+.001 > . 79641:81089(1448) ack 1 <nop, nop, TS val 475 ecr 240>
+.001 > . 81089:82537(1448) ack 1 <nop, nop, TS val 476 ecr 240>
+.001 > . 82537:83985(1448) ack 1 <nop, nop, TS val 477 ecr 240>
+.001 > . 83985:85433(1448) ack 1 <nop, nop, TS val 478 ecr 240>
+.001 > . 85433:86881(1448) ack 1 <nop, nop, TS val 479 ecr 240>
+.001 > . 86881:88329(1448) ack 1 <nop, nop, TS val 480 ecr 240>
+.001 > . 88329:89777(1448) ack 1 <nop, nop, TS val 481 ecr 240>
+.001 > . 89777:91225(1448) ack 1 <nop, nop, TS val 482 ecr 240>
+.001 > . 91225:92673(1448) ack 1 <nop, nop, TS val 483 ecr 240>
+.001 > . 92673:94121(1448) ack 1 <nop, nop, TS val 484 ecr 240>
+.001 > . 94121:95569(1448) ack 1 <nop, nop, TS val 485 ecr 240>
+.001 > . 95569:97017(1448) ack 1 <nop, nop, TS val 486 ecr 240>
+.001 > . 97017:98465(1448) ack 1 <nop, nop, TS val 487 ecr 240>
+.001 > . 98465:99913(1448) ack 1 <nop, nop, TS val 488 ecr 240>
+.001 > . 99913:101361(1448) ack 1 <nop, nop, TS val 489 ecr 240>

// Ok lets trigger a bunch of rtx !
+.001 < .  1:1(0) ack 43441 win 65535 <nop,nop, TS val 280 ecr 489, nop, nop, sack 78841:101361>
+.000 > . 43441:44889(1448) ack 1 <nop, nop, TS val 500 ecr 280>
+.001 > . 44889:46337(1448) ack 1 <nop, nop, TS val 501 ecr 280>
+.001 > . 46337:47785(1448) ack 1 <nop, nop, TS val 502 ecr 280>
+.001 > . 47785:49233(1448) ack 1 <nop, nop, TS val 503 ecr 280>
+.001 > . 49233:50681(1448) ack 1 <nop, nop, TS val 504 ecr 280>
+.001 > . 50681:52129(1448) ack 1 <nop, nop, TS val 505 ecr 280>
+.001 > . 52129:53577(1448) ack 1 <nop, nop, TS val 506 ecr 280>
+.001 > . 53577:55025(1448) ack 1 <nop, nop, TS val 507 ecr 280>
+.001 > . 55025:56473(1448) ack 1 <nop, nop, TS val 508 ecr 280>
+.001 > . 56473:57921(1448) ack 1 <nop, nop, TS val 509 ecr 280>
+.001 > . 57921:59369(1448) ack 1 <nop, nop, TS val 510 ecr 280>
+.001 > . 59369:60817(1448) ack 1 <nop, nop, TS val 511 ecr 280>
+.001 > . 60817:62265(1448) ack 1 <nop, nop, TS val 512 ecr 280>
+.001 > . 62265:63713(1448) ack 1 <nop, nop, TS val 513 ecr 280>
+.001 > . 63713:65161(1448) ack 1 <nop, nop, TS val 514 ecr 280>



A "tcpdump -p -n -s 0 -i any port 8080" while packetdrill runs shows, for the RTX phase:

09:43:46.008860 IP 192.0.2.1.33257 > 192.168.105.33.8080: Flags [.], ack 43441, win 65535, options [nop,nop,TS val 280 ecr 895244774,nop,nop,sack 1 {78841:101361}], length 0
09:43:46.008882 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 43441:44889, ack 1, win 28960, options [nop,nop,TS val 895244777 ecr 280], length 1448: HTTP
09:43:46.009832 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 44889:46337, ack 1, win 28960, options [nop,nop,TS val 895244777 ecr 280], length 1448: HTTP
09:43:46.010831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 46337:47785, ack 1, win 28960, options [nop,nop,TS val 895244777 ecr 280], length 1448: HTTP
09:43:46.011832 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 47785:49233, ack 1, win 28960, options [nop,nop,TS val 895244777 ecr 280], length 1448: HTTP
09:43:46.012831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 49233:50681, ack 1, win 28960, options [nop,nop,TS val 895244777 ecr 280], length 1448: HTTP
09:43:46.013832 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 50681:52129, ack 1, win 28960, options [nop,nop,TS val 895244777 ecr 280], length 1448: HTTP
09:43:46.014832 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 52129:53577, ack 1, win 28960, options [nop,nop,TS val 895244778 ecr 280], length 1448: HTTP
09:43:46.015831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 53577:55025, ack 1, win 28960, options [nop,nop,TS val 895244779 ecr 280], length 1448: HTTP
09:43:46.016831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 55025:56473, ack 1, win 28960, options [nop,nop,TS val 895244780 ecr 280], length 1448: HTTP
09:43:46.017831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 56473:57921, ack 1, win 28960, options [nop,nop,TS val 895244781 ecr 280], length 1448: HTTP
09:43:46.018831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 57921:59369, ack 1, win 28960, options [nop,nop,TS val 895244782 ecr 280], length 1448: HTTP
09:43:46.019832 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 59369:60817, ack 1, win 28960, options [nop,nop,TS val 895244783 ecr 280], length 1448: HTTP
09:43:46.020831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 60817:62265, ack 1, win 28960, options [nop,nop,TS val 895244784 ecr 280], length 1448: HTTP
09:43:46.021831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 62265:63713, ack 1, win 28960, options [nop,nop,TS val 895244785 ecr 280], length 1448: HTTP
09:43:46.022831 IP 192.168.105.33.8080 > 192.0.2.1.33257: Flags [.], seq 63713:65161, ack 1, win 28960, options [nop,nop,TS val 895244786 ecr 280], length 1448: HTTP

We do see the TS val being properly incremented (after the initial burst to hit the TSQ limit).
Eric Dumazet Oct. 22, 2017, 5:11 p.m. UTC | #7
On Sun, 2017-10-22 at 09:49 -0700, Eric Dumazet wrote:

> Here is the test I cooked.
> 
> Note that it actually passes just fine, do I am wondering if your patch
> is needed after all..
> 

Oh, this is because tcp_mstamp_refresh() is called from tcp_write_xmit(),
and tcp_write_xmit() is called directly from tcp_tsq_handler().

So it looks like your patch has no observable effect.

(Even with usec TS as we use here at Google, it has no effect, since the TCP
stack does not consider RTT samples valid if the skb was ever
retransmitted.)
Koichiro Den Oct. 23, 2017, 3:26 a.m. UTC | #8
On Sun, 2017-10-22 at 10:11 -0700, Eric Dumazet wrote:
> On Sun, 2017-10-22 at 09:49 -0700, Eric Dumazet wrote:
> 
> > Here is the test I cooked.
> > 
> > Note that it actually passes just fine, so I am wondering if your patch
> > is needed after all..
> > 
> 
> Oh this is because tcp_mstamp_refresh() is called from tcp_write_xmit()
> and tcp_write_xmit() is directly called from tcp_tsq_handler()

Thanks for all your help. I guess we should focus on the retransmit call path,
which comes just before that, shouldn't we?
> 
> So it looks like your patch has no observable effect.
> 
> (Even with usec TS as we use here at Google, it has no effect since TCP
> stack does not consider RTT samples being valid if skb was ever
> retransmitted)
> 
> 

Is that point about the tcp_rack_advance part specifically, or in general?

Btw, perf probe makes the original point described in my commit message apparent
as follows:

$ perf probe 'T1=tcp_write_xmit:25 tm=+1464(%bx):u64'
$ perf probe 'T2=tcp_xmit_retransmit_queue+0
              l_o=+1664(%di):u32  # tp->lost_out
              r_o=+1520(%di):u32  # tp->retrans_out
              tm=+1464(%di):u64'  # skb_mstamp will be set to this.
$ perf trace -aT --no-syscalls --call-graph dwarf \
  -e probe:T1/max-stack=4/ \
  -e probe:T2/max-stack=1/ |& \
  sed 's/..kernel.kallsyms..//;s/probe://' # to make it a bit more readable :p

$ # run the fq-pacing-tsq-rtx-ts.pkt test Eric provided:

[before patch applied]
-----8<------8<-----
2560581.604 T2:(ffffffff817da5e0) l_o=25 r_o=0 tm=2560581404)
                                       tcp_xmit_retransmit_queue.part.33
                                       tcp_xmit_recovery.part.53
2560581.614 T1:(ffffffff817d7e74) tm=2560581612)  # <--- (1)
                                       tcp_write_xmit
                                       __tcp_push_pending_frames
                                       tcp_rcv_established
                                       tcp_v4_do_rcv

/*
 * ^
 * | could be unexpectedly long, e.g. due to BLOCK_SOFTIRQ
 * v
 */
2560582.215 T2:(ffffffff817da5e0) l_o=25 r_o=5 tm=2560581612)
/* <- same as (1) above */
                                       tcp_xmit_retransmit_queue.part.33
                                       tcp_tasklet_func
----->8------->8-----


[after patch applied]
-----8<------8<-----
 97846.150 T2:(ffffffff817da5e0) l_o=25 r_o=0 tm=97846132)
                                       tcp_xmit_retransmit_queue.part.33
                                       tcp_xmit_recovery.part.53
 97846.159 T1:(ffffffff817d7e74) tm=97846158)    # <--- (1)
                                       tcp_write_xmit
                                       __tcp_push_pending_frames
                                       tcp_rcv_established
                                       tcp_v4_do_rcv
/*
 * ^
 * | could be unexpectedly long, e.g. due to BLOCK_SOFTIRQ
 * v
 */
 97847.144 T2:(ffffffff817da5e0) l_o=25 r_o=5 tm=97847134)
/* <- not the same as (1).
 * never affected now that we ensure it's refreshed */
                                       tcp_xmit_retransmit_queue.part.33
                                       tcp_tasklet_func
----->8------->8-----

Now I wonder whether this is more of a theoretical improvement than a fix for
one specific bug.
Eric Dumazet Oct. 23, 2017, 3:40 a.m. UTC | #9
On Mon, 2017-10-23 at 12:26 +0900, Koichiro Den wrote:

> Now I wonder whether this is more of a theoretical improvement than a fix for
> one specific bug.


Note that I said that your patch was fine and I added a 'Reviewed-by:'
tag.


What I meant is that it has no direct effect on correctness of TCP
stack. I could not cook a packetdrill test that shows the difference
before and after your patch.

BTW, in the following sequence :

A)   Fetch high-res timestamp and store in X
B)   Use X

B) Can use a quite old value of X, depending on scheduling (preempt
kernels or interrupt handling)

TCP really does not care of how accurate X is, it is a best effort.

For RTX packets, it is even more the case, since TCP does not take RTT
samples from packets that were retransmitted.
Koichiro Den Oct. 23, 2017, 4:28 a.m. UTC | #10
On Sun, 2017-10-22 at 20:40 -0700, Eric Dumazet wrote:
> On Mon, 2017-10-23 at 12:26 +0900, Koichiro Den wrote:
> 
> > Now I wonder whether this is more of a theoretical improvement than a fix
> > for one specific bug.
> 
> 
> Note that I said that your patch was fine and I added a 'Reviewed-by:'
> tag.
Sure, sorry about my confusing comment.
> 
> 
> What I meant is that it has no direct effect on correctness of TCP
> stack. I could not cook a packetdrill test that shows the difference
> before and after your patch.
> 
> BTW, in the following sequence :
> 
> A)   Fetch high-res timestamp and store in X
> B)   Use X
> 
> B) Can use a quite old value of X, depending on scheduling (preempt
> kernels or interrupt handling)
> 
> TCP really does not care of how accurate X is, it is a best effort.
> 
Agreed. In, e.g., a hard interrupt storm, this extra refresh just
makes the expected delay smaller under the same conditions.
> For RTX packets, it is even more the case, since TCP does not take RTT
> samples from packets that were retransmitted.
Indeed, meaning that the tcp_clean_rtx_queue implementation never takes such samples.
But it seems to me there is some possibility the RACK algorithm will take it.
Eric Dumazet Oct. 23, 2017, 4:31 a.m. UTC | #11
On Mon, 2017-10-23 at 13:28 +0900, Koichiro Den wrote:

> Indeed, meaning that the tcp_clean_rtx_queue implementation never takes such samples.
> But it seems to me there is some possibility the RACK algorithm will take it.

As long as tp->tcp_mstamp is monotonically increasing (and it is ;) ),
RACK will behave properly.

I am not sure we need to discuss this forever frankly ;)
Koichiro Den Oct. 23, 2017, 4:36 a.m. UTC | #12
On Sun, 2017-10-22 at 21:31 -0700, Eric Dumazet wrote:
> On Mon, 2017-10-23 at 13:28 +0900, Koichiro Den wrote:
> 
> > Indeed, meaning that the tcp_clean_rtx_queue implementation never takes such
> > samples. But it seems to me there is some possibility the RACK algorithm
> > will take it.
> 
> As long as tp->tcp_mstamp is monotonically increasing (and it is ;) ),
> RACK will behave properly.
> 
> I am not sure we need to discuss this forever frankly ;)
>  
> 
> 
Yeah, sure, thanks for your time and helpful comments.

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 988733f289c8..7a4d9b719b47 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -743,8 +743,10 @@  static void tcp_tsq_handler(struct sock *sk)
 		struct tcp_sock *tp = tcp_sk(sk);
 
 		if (tp->lost_out > tp->retrans_out &&
-		    tp->snd_cwnd > tcp_packets_in_flight(tp))
+		    tp->snd_cwnd > tcp_packets_in_flight(tp)) {
+			tcp_mstamp_refresh(tp);
 			tcp_xmit_retransmit_queue(sk);
+		}
 
 		tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle,
 			       0, GFP_ATOMIC);