diff mbox series

[net-next] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

Message ID 20180703011726.8301-1-jmaxwell37@gmail.com
State Superseded, archived
Delegated to: David Miller
Headers show
Series [net-next] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy | expand

Commit Message

Jon Maxwell July 3, 2018, 1:17 a.m. UTC
Every time the TCP retransmission timer fires. It checks to see if there is a 
timeout before scheduling the next retransmit timer. The retransmit interval 
between each retransmission increases exponentially. The issue is that in order 
for the timeout to occur the retransmit timer needs to fire again. If the user 
timeout check happens after the 9th retransmit for example. It needs to wait for 
the 10th retransmit timer to fire in order to evaluate whether a timeout has 
occurred or not. If the interval is large enough then the timeout will be 
inaccurate.

For example with a TCP_USER_TIMEOUT of 10 seconds without patch:

1st retransmit:

22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]

Last retransmit:

22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]

Timeout:

send: Connection timed out
Sun Jul  1 22:25:34 EDT 2018

We can see that last retransmit took ~7 seconds. Which pushed the total 
timeout to ~15 seconds instead of the expected 10 seconds. This gets more 
inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.

Fix this by recalculating the last retransmit interval so that it fires when
the timeout should occur. Only implement when icsk->icsk_user_timeout is set. 

Test results with the patch is the expected 10 second timeout:

1st retransmit:

01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]

Last retransmit:

01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]

Timeout:

send: Connection timed out
Mon Jul  2 01:38:09 EDT 2018

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
 net/ipv4/tcp_timer.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Neal Cardwell July 3, 2018, 3 a.m. UTC | #1
On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <jmaxwell37@gmail.com> wrote:
>
> Every time the TCP retransmission timer fires. It checks to see if there is a
> timeout before scheduling the next retransmit timer. The retransmit interval
> between each retransmission increases exponentially. The issue is that in order
> for the timeout to occur the retransmit timer needs to fire again. If the user
> timeout check happens after the 9th retransmit for example. It needs to wait for
> the 10th retransmit timer to fire in order to evaluate whether a timeout has
> occurred or not. If the interval is large enough then the timeout will be
> inaccurate.
>
> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>
> 1st retransmit:
>
> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Sun Jul  1 22:25:34 EDT 2018
>
> We can see that last retransmit took ~7 seconds. Which pushed the total
> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>
> Fix this by recalculating the last retransmit interval so that it fires when
> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>
> Test results with the patch is the expected 10 second timeout:
>
> 1st retransmit:
>
> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Mon Jul  2 01:38:09 EDT 2018
>
> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> ---
>  net/ipv4/tcp_timer.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928..94491a481722 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct net *net = sock_net(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
> +       __u32 time_remaining = 0;
>
>         if (tp->fastopen_rsk) {
>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>                 /* Use normal (exponential) backoff */
>                 icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>         }
> +       if (icsk->icsk_user_timeout) {
> +               time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
> +                              (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
> +               if (time_remaining < icsk->icsk_rto)
> +                       icsk->icsk_rto = time_remaining;
> +       }

Thanks, a more precise user timeout sounds nice. A couple thoughts:

(a) The  icsk->icsk_rto is in jiffies, and the time_remaining is in
msecs, so it looks like there is a units mismatch here in the
comparisons and assignment.

(b) It also seems like the time_remaining could be negative, because
(a) the icsk_user_timeout is not involved in the baseline RTO
calculation, so that perhaps the first RTO to fire might be beyond the
icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
timer handler can be delayed beyond the targeted icsk_user_timeout.
But time_remaining is a __u32, and  icsk->icsk_rto is also a __u32, so
it seems like a negative number in time_remaining would usually be
treated as a very large unsigned positive number in this comparison:

+               if (time_remaining < icsk->icsk_rto)

(c) If the user timeout is changed between RTO expirations to push the
user timeout further in the future,  then it seems like this commit
will have side effects that left the icsk->icsk_rto in a weird state
that does not do the expected exponential backoff correctly.

(d) There are also wrapping issues to watch out for, since the
tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
milliseconds, which will wrap every 49 days or so. Seems like the code
is OK in that respect.

(e) It also might be nice to put this logic in a helper, rather than
growing the body of tcp_retransmit_timer().

What about something like (pseudocode):

--

static __u32 tcp_clamp_rto_to_user_timeout(sk):
   rto = icsk->icsk_rto;
   if (!icsk->icsk_user_timeout)
      return rto;
   elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
   user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
   if (elapsed >= user_timeout)
      rto = 1;  /* user timeout has passed; fire ASAP */
   else
      rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
   return rto;

tcp_retransmit_timer():
...
  rto = tcp_clamp_rto_to_user_timeout(sk);
  inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);

--

neal
Jon Maxwell July 3, 2018, 3:15 a.m. UTC | #2
On Tue, Jul 3, 2018 at 1:00 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <jmaxwell37@gmail.com> wrote:
>>
>> Every time the TCP retransmission timer fires. It checks to see if there is a
>> timeout before scheduling the next retransmit timer. The retransmit interval
>> between each retransmission increases exponentially. The issue is that in order
>> for the timeout to occur the retransmit timer needs to fire again. If the user
>> timeout check happens after the 9th retransmit for example. It needs to wait for
>> the 10th retransmit timer to fire in order to evaluate whether a timeout has
>> occurred or not. If the interval is large enough then the timeout will be
>> inaccurate.
>>
>> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>>
>> 1st retransmit:
>>
>> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Last retransmit:
>>
>> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Timeout:
>>
>> send: Connection timed out
>> Sun Jul  1 22:25:34 EDT 2018
>>
>> We can see that last retransmit took ~7 seconds. Which pushed the total
>> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
>> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>>
>> Fix this by recalculating the last retransmit interval so that it fires when
>> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>>
>> Test results with the patch is the expected 10 second timeout:
>>
>> 1st retransmit:
>>
>> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Last retransmit:
>>
>> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Timeout:
>>
>> send: Connection timed out
>> Mon Jul  2 01:38:09 EDT 2018
>>
>> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
>> ---
>>  net/ipv4/tcp_timer.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 3b3611729928..94491a481722 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>         struct tcp_sock *tp = tcp_sk(sk);
>>         struct net *net = sock_net(sk);
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>> +       __u32 time_remaining = 0;
>>
>>         if (tp->fastopen_rsk) {
>>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>>                 /* Use normal (exponential) backoff */
>>                 icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>         }
>> +       if (icsk->icsk_user_timeout) {
>> +               time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
>> +                              (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
>> +               if (time_remaining < icsk->icsk_rto)
>> +                       icsk->icsk_rto = time_remaining;
>> +       }
>
> Thanks, a more precise user timeout sounds nice. A couple thoughts:
>
> (a) The  icsk->icsk_rto is in jiffies, and the time_remaining is in
> msecs, so it looks like there is a units mismatch here in the
> comparisons and assignment.
>
> (b) It also seems like the time_remaining could be negative, because
> (a) the icsk_user_timeout is not involved in the baseline RTO
> calculation, so that perhaps the first RTO to fire might be beyond the
> icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
> timer handler can be delayed beyond the targeted icsk_user_timeout.
> But time_remaining is a __u32, and  icsk->icsk_rto is also a __u32, so
> it seems like a negative number in time_remaining would usually be
> treated as a very large unsigned positive number in this comparison:
>
> +               if (time_remaining < icsk->icsk_rto)
>
> (c) If the user timeout is changed between RTO expirations to push the
> user timeout further in the future,  then it seems like this commit
> will have side effects that left the icsk->icsk_rto in a weird state
> that does not do the expected exponential backoff correctly.
>
> (d) There are also wrapping issues to watch out for, since the
> tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
> milliseconds, which will wrap every 49 days or so. Seems like the code
> is OK in that respect.
>
> (e) It also might be nice to put this logic in a helper, rather than
> growing the body of tcp_retransmit_timer().
>
> What about something like (pseudocode):
>
> --
>
> static __u32 tcp_clamp_rto_to_user_timeout(sk):
>    rto = icsk->icsk_rto;
>    if (!icsk->icsk_user_timeout)
>       return rto;
>    elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
>    user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
>    if (elapsed >= user_timeout)
>       rto = 1;  /* user timeout has passed; fire ASAP */
>    else
>       rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
>    return rto;
>
> tcp_retransmit_timer():
> ...
>   rto = tcp_clamp_rto_to_user_timeout(sk);
>   inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
>

Thanks Neal, that looks like a good idea. Let me test that out in my reproducer.

Regards

Jon

> --
>
> neal
Jon Maxwell July 3, 2018, 4:42 a.m. UTC | #3
On Tue, Jul 3, 2018 at 1:15 PM, Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> On Tue, Jul 3, 2018 at 1:00 PM, Neal Cardwell <ncardwell@google.com> wrote:
>> On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <jmaxwell37@gmail.com> wrote:
>>>
>>> Every time the TCP retransmission timer fires. It checks to see if there is a
>>> timeout before scheduling the next retransmit timer. The retransmit interval
>>> between each retransmission increases exponentially. The issue is that in order
>>> for the timeout to occur the retransmit timer needs to fire again. If the user
>>> timeout check happens after the 9th retransmit for example. It needs to wait for
>>> the 10th retransmit timer to fire in order to evaluate whether a timeout has
>>> occurred or not. If the interval is large enough then the timeout will be
>>> inaccurate.
>>>
>>> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>>>
>>> 1st retransmit:
>>>
>>> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Last retransmit:
>>>
>>> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Timeout:
>>>
>>> send: Connection timed out
>>> Sun Jul  1 22:25:34 EDT 2018
>>>
>>> We can see that last retransmit took ~7 seconds. Which pushed the total
>>> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
>>> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>>>
>>> Fix this by recalculating the last retransmit interval so that it fires when
>>> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>>>
>>> Test results with the patch is the expected 10 second timeout:
>>>
>>> 1st retransmit:
>>>
>>> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Last retransmit:
>>>
>>> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Timeout:
>>>
>>> send: Connection timed out
>>> Mon Jul  2 01:38:09 EDT 2018
>>>
>>> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
>>> ---
>>>  net/ipv4/tcp_timer.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> index 3b3611729928..94491a481722 100644
>>> --- a/net/ipv4/tcp_timer.c
>>> +++ b/net/ipv4/tcp_timer.c
>>> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>>         struct tcp_sock *tp = tcp_sk(sk);
>>>         struct net *net = sock_net(sk);
>>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>> +       __u32 time_remaining = 0;
>>>
>>>         if (tp->fastopen_rsk) {
>>>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>>> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>>>                 /* Use normal (exponential) backoff */
>>>                 icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>>         }
>>> +       if (icsk->icsk_user_timeout) {
>>> +               time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
>>> +                              (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
>>> +               if (time_remaining < icsk->icsk_rto)
>>> +                       icsk->icsk_rto = time_remaining;
>>> +       }
>>
>> Thanks, a more precise user timeout sounds nice. A couple thoughts:
>>
>> (a) The  icsk->icsk_rto is in jiffies, and the time_remaining is in
>> msecs, so it looks like there is a units mismatch here in the
>> comparisons and assignment.
>>
>> (b) It also seems like the time_remaining could be negative, because
>> (a) the icsk_user_timeout is not involved in the baseline RTO
>> calculation, so that perhaps the first RTO to fire might be beyond the
>> icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
>> timer handler can be delayed beyond the targeted icsk_user_timeout.
>> But time_remaining is a __u32, and  icsk->icsk_rto is also a __u32, so
>> it seems like a negative number in time_remaining would usually be
>> treated as a very large unsigned positive number in this comparison:
>>
>> +               if (time_remaining < icsk->icsk_rto)
>>
>> (c) If the user timeout is changed between RTO expirations to push the
>> user timeout further in the future,  then it seems like this commit
>> will have side effects that left the icsk->icsk_rto in a weird state
>> that does not do the expected exponential backoff correctly.
>>
>> (d) There are also wrapping issues to watch out for, since the
>> tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
>> milliseconds, which will wrap every 49 days or so. Seems like the code
>> is OK in that respect.
>>
>> (e) It also might be nice to put this logic in a helper, rather than
>> growing the body of tcp_retransmit_timer().
>>
>> What about something like (pseudocode):
>>
>> --
>>
>> static __u32 tcp_clamp_rto_to_user_timeout(sk):
>>    rto = icsk->icsk_rto;
>>    if (!icsk->icsk_user_timeout)
>>       return rto;
>>    elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
>>    user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
>>    if (elapsed >= user_timeout)
>>       rto = 1;  /* user timeout has passed; fire ASAP */
>>    else
>>       rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
>>    return rto;
>>
>> tcp_retransmit_timer():
>> ...
>>   rto = tcp_clamp_rto_to_user_timeout(sk);
>>   inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
>>
>
> Thanks Neal, that looks like a good idea. Let me test that out in my reproducer.
>
> Regards
>
> Jon
>

Thanks for your input and suggestions Neal.

Results were positive in the reproducer. I'll tidy the patch up a bit and submit
as v1 with your ideas.

>> --
>>
>> neal
diff mbox series

Patch

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3b3611729928..94491a481722 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -407,6 +407,7 @@  void tcp_retransmit_timer(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	__u32 time_remaining = 0;
 
 	if (tp->fastopen_rsk) {
 		WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
@@ -535,6 +536,12 @@  void tcp_retransmit_timer(struct sock *sk)
 		/* Use normal (exponential) backoff */
 		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
 	}
+	if (icsk->icsk_user_timeout) {
+		time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
+			       (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
+		if (time_remaining < icsk->icsk_rto)
+			icsk->icsk_rto = time_remaining;
+	}
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
 	if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0))
 		__sk_dst_reset(sk);