
tcp: bound RTO to minimum

Message ID 1314229310-8074-1-git-send-email-hagen@jauu.net
State Rejected, archived
Delegated to: David Miller

Commit Message

Hagen Paul Pfeifer Aug. 24, 2011, 11:41 p.m. UTC
Check if the calculated RTO is less than TCP_RTO_MIN. If this is true,
adjust the value to TCP_RTO_MIN.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
 include/net/tcp.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

Comments

Hagen Paul Pfeifer Aug. 24, 2011, 11:43 p.m. UTC | #1
This should do the trick, Eric, Ilpo?

Hagen
Yuchung Cheng Aug. 25, 2011, 1:50 a.m. UTC | #2
On Wed, Aug 24, 2011 at 4:41 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> Check if calculated RTO is less then TCP_RTO_MIN. If this is true we
> adjust the value to TCP_RTO_MIN.
>
but tp->rttvar is already lower-bounded via tcp_rto_min()?

static inline void tcp_set_rto(struct sock *sk)
{
...

  /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
   * guarantees that rto is higher.
   */
  tcp_bound_rto(sk);
}
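
A simplified user-space model of the arithmetic above (illustrative names
and millisecond units, not the actual kernel source): the RTT estimator
keeps the variance term at or above tcp_rto_min(), and the RTO is the
smoothed RTT plus that floored variance, so it already cannot drop below
the minimum.

/* Toy model of the existing lower bound; the real code keeps srtt
 * scaled by 8 and works in jiffies, this only shows the shape.
 */
#include <stdio.h>

#define RTO_MIN_MS 200

/* the estimator never lets the variance term fall below rto_min */
static unsigned int floored_rttvar(unsigned int rttvar, unsigned int rto_min)
{
	return rttvar < rto_min ? rto_min : rttvar;
}

/* shape of __tcp_set_rto(): smoothed RTT plus the variance term */
static unsigned int compute_rto(unsigned int srtt, unsigned int rttvar)
{
	return srtt + floored_rttvar(rttvar, RTO_MIN_MS);
}

int main(void)
{
	/* even a 1 ms RTT with zero measured variance ... */
	unsigned int rto = compute_rto(1, 0);

	/* ... yields rto >= RTO_MIN_MS, so an additional
	 * "else if (icsk_rto < TCP_RTO_MIN)" clamp can never fire.
	 */
	printf("rto = %u ms\n", rto); /* prints 201 */
	return 0;
}
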
Eric Dumazet Aug. 25, 2011, 5:28 a.m. UTC | #3
On Wednesday, 24 August 2011 at 18:50 -0700, Yuchung Cheng wrote:
> On Wed, Aug 24, 2011 at 4:41 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> > Check if calculated RTO is less then TCP_RTO_MIN. If this is true we
> > adjust the value to TCP_RTO_MIN.
> >
> but tp->rttvar is already lower-bounded via tcp_rto_min()?
> 
> static inline void tcp_set_rto(struct sock *sk)
> {
> ...
> 
>   /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>    * guarantees that rto is higher.
>    */
>   tcp_bound_rto(sk);
> }

Yes, and furthermore, we also rate-limit ICMP, so in my tests I reach
icsk_rto > 1 sec within a few rounds:

07:16:13.010633 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 3833540215:3833540263(48) ack 2593537670 win 305
07:16:13.221111 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:13.661151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:14.541153 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:16.301152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
<from this point, icsk_rto=1.76sec >
07:16:18.061158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:19.821158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:21.581018 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:23.341156 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:25.101151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:26.861155 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:28.621158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:30.381152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
07:16:32.141157 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 

The real question is: do we really want to process ~1000 timer interrupts
per TCP session, ~2000 skb alloc/free/build/handling operations, possibly
~1000 ARP requests, only to make TCP recover in ~1 sec when connectivity
returns? This just doesn't scale.

On a server handling ~1,000,000 (long-lived) sessions, using
application-side keepalives (say one message sent every minute on each
session), a temporary connectivity disruption _could_ make it enter a
critical zone, burning CPU and memory.

It seems TCP-LCD (RFC6069) depends very much on ICMP being rate limited.

I'll have to check what happens with multiple sessions: we might have
CPUs fighting over a single inetpeer and throttling, thus allowing
backoff to increase after all.
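
The mechanism under discussion, reduced to a toy model (illustrative names
and units, not kernel symbols): RFC 6069 undoes one step of the exponential
backoff for each ICMP destination-unreachable that arrives for an
outstanding retransmission, so while ICMPs keep coming the retransmission
timer stays near its unbacked-off value, and only the ICMP rate limit lets
it grow, as in the trace above.

/* rto = base_rto << backoff, capped at RTO_MAX; ICMP feedback reverts
 * one doubling per message. Purely illustrative.
 */
#include <stdio.h>

#define RTO_MAX_MS 120000

struct conn {
	unsigned int base_rto_ms;	/* srtt-derived RTO, already >= RTO_MIN */
	unsigned int backoff;		/* exponential backoff counter */
};

static unsigned int current_rto(const struct conn *c)
{
	unsigned long long rto = (unsigned long long)c->base_rto_ms << c->backoff;

	return rto > RTO_MAX_MS ? RTO_MAX_MS : (unsigned int)rto;
}

static void on_rto_expired(struct conn *c)  { c->backoff++; }
static void on_icmp_unreach(struct conn *c) { if (c->backoff) c->backoff--; }

int main(void)
{
	struct conn c = { .base_rto_ms = 220, .backoff = 0 };
	int i;

	/* without ICMP feedback: classic doubling */
	for (i = 0; i < 4; i++)
		on_rto_expired(&c);
	printf("no ICMP feedback:  rto = %u ms\n", current_rto(&c)); /* 3520 */

	/* with an ICMP for (nearly) every retransmission, each revert
	 * cancels a doubling and the probes stay roughly RTO_MIN apart,
	 * which is the worst case when ICMPs are not rate-limited.
	 */
	for (i = 0; i < 4; i++)
		on_icmp_unreach(&c);
	printf("backoff reverted:  rto = %u ms\n", current_rto(&c)); /* 220 */
	return 0;
}
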



Alexander Zimmermann Aug. 25, 2011, 7:28 a.m. UTC | #4
Hi Eric,

On 25.08.2011 at 07:28, Eric Dumazet wrote:

> Le mercredi 24 août 2011 à 18:50 -0700, Yuchung Cheng a écrit :
>> On Wed, Aug 24, 2011 at 4:41 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>>> Check if calculated RTO is less then TCP_RTO_MIN. If this is true we
>>> adjust the value to TCP_RTO_MIN.
>>> 
>> but tp->rttvar is already lower-bounded via tcp_rto_min()?
>> 
>> static inline void tcp_set_rto(struct sock *sk)
>> {
>> ...
>> 
>>  /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>>   * guarantees that rto is higher.
>>   */
>>  tcp_bound_rto(sk);
>> }
> 
> Yes, and furthermore, we also limit ICMP rate, so in in my tests, I
> reach in a few rounds icsk_rto > 1sec
> 
> 07:16:13.010633 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 3833540215:3833540263(48) ack 2593537670 win 305
> 07:16:13.221111 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:13.661151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:14.541153 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:16.301152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> <from this point, icsk_rto=1.76sec >
> 07:16:18.061158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:19.821158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:21.581018 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:23.341156 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:25.101151 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:26.861155 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:28.621158 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:30.381152 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 07:16:32.141157 IP 10.2.1.2.59352 > 10.2.1.1.ssh: P 0:48(48) ack 1 win 305 
> 
> Real question is : do we really want to process ~1000 timer interrupts
> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> requests, only to make tcp revover in ~1sec when connectivity returns
> back. This just doesnt scale.

Maybe a stupid question, but 1000? With a minRTO of 200 ms and a maximum
probing time of 120 s, we get 600 retransmits in a worst-case scenario
(assuming that we get an ICMP for every RTO retransmission). No?

> 
> On a server handling ~1.000.000 (long living) sessions, using
> application side keepalives (say one message sent every minute on each
> session), a temporary connectivity disruption _could_ makes it enter a
> critical zone, burning cpu and memory.
> 
> It seems TCP-LCD (RFC6069) depends very much on ICMP being rate limited.

That is right. We assume that a server/router only sends ICMPs when it
has free cycles.

> 
> I'll have to check what happens on multiple sessions : We might have
> cpus fighting on a single inetpeer and throtle, thus allowing backoff to
> increase after all. 
> 
> 
> 

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//

Eric Dumazet Aug. 25, 2011, 8:26 a.m. UTC | #5
On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
> Hi Eric,
> 
> Am 25.08.2011 um 07:28 schrieb Eric Dumazet:

> > Real question is : do we really want to process ~1000 timer interrupts
> > per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> > requests, only to make tcp revover in ~1sec when connectivity returns
> > back. This just doesnt scale.
> 
> maybe a stupid question, but 1000?. With an minRTO of 200ms and a maximum
> probing time of 120s, we 600 retransmits in a worst-case-senario
> (assumed that we get for every rot retransmission an icmp). No?

Where is the "max probing time of 120s" asserted?

It is not the case on my machine: I have way more retransmits than that,
even if they are spaced 1600 ms apart:

07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)

Old kernels were performing up to 15 retries, doing exponential backoff.

Now it's kind of unlimited, according to experimental results.



Alexander Zimmermann Aug. 25, 2011, 8:44 a.m. UTC | #6
On 25.08.2011 at 10:26, Eric Dumazet wrote:

> Le jeudi 25 août 2011 à 09:28 +0200, Alexander Zimmermann a écrit :
>> Hi Eric,
>> 
>> Am 25.08.2011 um 07:28 schrieb Eric Dumazet:
> 
>>> Real question is : do we really want to process ~1000 timer interrupts
>>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>> requests, only to make tcp revover in ~1sec when connectivity returns
>>> back. This just doesnt scale.
>> 
>> maybe a stupid question, but 1000?. With an minRTO of 200ms and a maximum
>> probing time of 120s, we 600 retransmits in a worst-case-senario
>> (assumed that we get for every rot retransmission an icmp). No?
> 
> Where is asserted the "max probing time of 120s" ? 
> 
> It is not the case on my machine :
> I have way more retransmits than that, even if spaced by 1600 ms
> 
> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> 
> Old kernels where performing up to 15 retries, doing exponential backoff.

Yes, I know. And in combination with RFC 6069 we have to convert this;
see Section 7.1

and

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6fa12c85031485dff38ce550c24f10da23b0adaa

Is the transformation broken? Damian?


> 
> Now its kind of unlimited, according to experimental results.

OK, unlimited is not what I expected...


> 
> 
> 

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//

Arnd Hannemann Aug. 25, 2011, 8:46 a.m. UTC | #7
Hi,

On 25.08.2011 at 10:26, Eric Dumazet wrote:
> Le jeudi 25 août 2011 à 09:28 +0200, Alexander Zimmermann a écrit :
>> Hi Eric,
>>
>> Am 25.08.2011 um 07:28 schrieb Eric Dumazet:
> 
>>> Real question is : do we really want to process ~1000 timer interrupts
>>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>> requests, only to make tcp revover in ~1sec when connectivity returns
>>> back. This just doesnt scale.
>>
>> maybe a stupid question, but 1000?. With an minRTO of 200ms and a maximum
>> probing time of 120s, we 600 retransmits in a worst-case-senario
>> (assumed that we get for every rot retransmission an icmp). No?
> 
> Where is asserted the "max probing time of 120s" ? 
> 
> It is not the case on my machine :
> I have way more retransmits than that, even if spaced by 1600 ms
> 
> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> 
> Old kernels where performing up to 15 retries, doing exponential backoff.
> 
> Now its kind of unlimited, according to experimental results.

That shouldn't be the case. It should stop after the same time as a TCP
connection with an RTO of the minimum RTO doing 15 retries (tcp_retries2=15)
with exponential backoff. So it should be around 900s*. But it could be that
this doesn't work as expected because of the icsk_retransmits wraparound.

* 200ms + 400ms + 800ms ...
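
That footnote, worked out (a sketch of the time-based give-up threshold;
the exact in-kernel computation may count the intervals slightly
differently):

/* retries2 = 15 doublings starting at RTO_MIN = 200 ms, capped at
 * RTO_MAX = 120 s. Spanning retries2 + 1 RTO intervals reproduces the
 * ~924 s figure mentioned below.
 */
#include <stdio.h>

int main(void)
{
	double rto = 0.2, total = 0.0;
	int retries2 = 15;
	int i;

	for (i = 0; i <= retries2; i++) {
		total += rto;		/* wait one RTO, then retransmit */
		rto *= 2.0;		/* exponential backoff ... */
		if (rto > 120.0)
			rto = 120.0;	/* ... capped at TCP_RTO_MAX */
	}
	printf("give-up threshold: %.1f s\n", total); /* 924.6 */
	return 0;
}
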

Best regards,
Arnd
Eric Dumazet Aug. 25, 2011, 9:09 a.m. UTC | #8
On Thursday, 25 August 2011 at 10:46 +0200, Arnd Hannemann wrote:
> Hi,
> 
> Am 25.08.2011 10:26, schrieb Eric Dumazet:
> > Le jeudi 25 août 2011 à 09:28 +0200, Alexander Zimmermann a écrit :
> >> Hi Eric,
> >>
> >> Am 25.08.2011 um 07:28 schrieb Eric Dumazet:
> > 
> >>> Real question is : do we really want to process ~1000 timer interrupts
> >>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> >>> requests, only to make tcp revover in ~1sec when connectivity returns
> >>> back. This just doesnt scale.
> >>
> >> maybe a stupid question, but 1000?. With an minRTO of 200ms and a maximum
> >> probing time of 120s, we 600 retransmits in a worst-case-senario
> >> (assumed that we get for every rot retransmission an icmp). No?
> > 
> > Where is asserted the "max probing time of 120s" ? 
> > 
> > It is not the case on my machine :
> > I have way more retransmits than that, even if spaced by 1600 ms
> > 
> > 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> > 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> > 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> > 
> > Old kernels where performing up to 15 retries, doing exponential backoff.
> > 
> > Now its kind of unlimited, according to experimental results.
> 
> That shouldn't be. It should stop after the same time a TCP connection with an
> RTO of Minimum RTO which is doing 15 retries (tcp_retries2=15) and doing exponential backoff.
> So it should be around 900s*. But it could be that because of the icsk_retransmit wrapover
> this doesn't work as expected.
> 
> * 200ms + 400ms + 800ms ...

It is 924 seconds with retries2=15 (the default value).

I said ~1000 probes.

If ICMPs are not rate-limited, that could be about 924*5 probes, instead
of 15 probes on old kernels.

Maybe we should refine the thing a bit, to not reverse backoff unless
rto is > some_threshold.

With 10s as the value, that would give at most 92 tries.

I mean, what is the gain of being able to restart a frozen TCP session with
a 1 sec latency instead of 10 s if it was blocked for more than 60 seconds?
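
Back-of-the-envelope, the regimes above compare like this (illustrative
arithmetic only):

/* rough probe counts over the ~924 s give-up window */
#include <stdio.h>

int main(void)
{
	double window = 924.6;	/* retries2=15 threshold */

	/* classic exponential backoff: the 15 retransmissions themselves */
	printf("exponential backoff:      15 probes\n");
	/* ICMP rate-limited, as in the trace: rto hovers around 1-2 s */
	printf("backoff mostly reverted: ~%.0f probes\n", window / 1.0);
	/* unlimited ICMPs, rto pinned near 200 ms: the "924*5" case */
	printf("backoff fully reverted:  ~%.0f probes\n", window / 0.2);
	/* revert only once rto exceeds 10 s: the "at most 92 tries" case */
	printf("revert only above 10 s:  ~%.0f probes\n", window / 10.0);
	return 0;
}
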



Arnd Hannemann Aug. 25, 2011, 9:46 a.m. UTC | #9
Hi Eric,

On 25.08.2011 at 11:09, Eric Dumazet wrote:
> Le jeudi 25 août 2011 à 10:46 +0200, Arnd Hannemann a écrit :
>> Am 25.08.2011 10:26, schrieb Eric Dumazet:
>>> Le jeudi 25 août 2011 à 09:28 +0200, Alexander Zimmermann a écrit :
>>>> Am 25.08.2011 um 07:28 schrieb Eric Dumazet:
>>>
>>>>> Real question is : do we really want to process ~1000 timer interrupts
>>>>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>>>> requests, only to make tcp revover in ~1sec when connectivity returns
>>>>> back. This just doesnt scale.
>>>>
>>>> maybe a stupid question, but 1000?. With an minRTO of 200ms and a maximum
>>>> probing time of 120s, we 600 retransmits in a worst-case-senario
>>>> (assumed that we get for every rot retransmission an icmp). No?
>>>
>>> Where is asserted the "max probing time of 120s" ? 
>>>
>>> It is not the case on my machine :
>>> I have way more retransmits than that, even if spaced by 1600 ms
>>>
>>> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
>>> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
>>> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
>>>
>>> Old kernels where performing up to 15 retries, doing exponential backoff.
>>>
>>> Now its kind of unlimited, according to experimental results.
>>
>> That shouldn't be. It should stop after the same time a TCP connection with an
>> RTO of Minimum RTO which is doing 15 retries (tcp_retries2=15) and doing exponential backoff.
>> So it should be around 900s*. But it could be that because of the icsk_retransmit wrapover
>> this doesn't work as expected.
>>
>> * 200ms + 400ms + 800ms ...
> 
> It is 924 second with retries2=15 (default value)
> 
> I said ~1000 probes.
> 
> If ICMP are not rate limited, that could be about 924*5 probes, instead
> of 15 probes on old kernels.

At a rate of 5 packets/s if RTT is zero, yes. I would like to say: so
what? But your example with millions of idle connections stands.

> Maybe we should refine the thing a bit, to not reverse backoff unless
> rto is > some_threshold.
> 
> Say 10s being the value, that would give at most 92 tries.

I personally think that 10s would be too large and eliminate the benefit of the
algorithm, so I would prefer a different solution.

In the case of one bulk-data TCP session, which was transmitting hundreds of
packets/s before the connectivity disruption, that worst-case rate of
5 packets/s really seems conservative enough.

However, in the case of a lot of idle connections, which were transmitting
only a few packets per minute, we might increase the rate drastically for
a certain period until it throttles down. You are saying that we have a
problem here, correct?

Do you think it would be possible without much hassle to use a kind of "global"
rate limiting only for these probe packets of a TCP connection?
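
A minimal token-bucket sketch of such a limiter (user-space toy code;
nothing like this exists in the kernel, and all names and numbers are made
up for illustration): a single bucket shared by all sockets is refilled at
a fixed rate, and a retransmission that cannot take a token keeps its
backed-off timer instead of reverting it.

#include <stdbool.h>
#include <stdio.h>

struct probe_limiter {
	double tokens;
	double max_tokens;	/* burst size */
	double rate;		/* probes allowed per second, globally */
	double last;		/* time of last refill, in seconds */
};

static bool probe_allowed(struct probe_limiter *pl, double now)
{
	pl->tokens += (now - pl->last) * pl->rate;
	if (pl->tokens > pl->max_tokens)
		pl->tokens = pl->max_tokens;
	pl->last = now;

	if (pl->tokens >= 1.0) {
		pl->tokens -= 1.0;
		return true;	/* OK to revert the backoff and probe now */
	}
	return false;		/* keep the exponential backoff this time */
}

int main(void)
{
	struct probe_limiter pl = { .tokens = 10, .max_tokens = 10,
				    .rate = 10, .last = 0.0 };
	int i, allowed = 0;

	/* a million idle sessions waking up at once would drain the bucket
	 * immediately; only the burst plus ~rate probes/s get through.
	 */
	for (i = 0; i < 1000; i++)
		if (probe_allowed(&pl, 0.001 * i))
			allowed++;
	printf("probes allowed in the first second: %d\n", allowed); /* ~20 */
	return 0;
}
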

> I mean, what is the gain to be able to restart a frozen TCP session with
> a 1sec latency instead of 10s if it was blocked more than 60 seconds ?

I'm afraid it does a lot, especially in highly dynamic environments. You
don't just have the additional latency; you may actually miss the full
period where connectivity was there, and then retransmit right into the
next connectivity-disrupted period.

Best regards,
Arnd



Eric Dumazet Aug. 25, 2011, 10:02 a.m. UTC | #10
On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
> Hi Eric,
> 
> Am 25.08.2011 11:09, schrieb Eric Dumazet:

> > Maybe we should refine the thing a bit, to not reverse backoff unless
> > rto is > some_threshold.
> > 
> > Say 10s being the value, that would give at most 92 tries.
> 
> I personally think that 10s would be too large and eliminate the benefit of the
> algorithm, so I would prefer a different solution.
> 
> In case of one bulk data TCP session, which was transmitting hundreds of packets/s
> before the connectivity disruption those worst case rate of 5 packet/s really
> seems conservative enough.
> 
> However in case of a lot of idle connections, which were transmitting only
> a number of packets per minute. We might increase the rate drastically for
> a certain period until it throttles down. You say that we have a problem here
> correct?
> 
> Do you think it would be possible without much hassle to use a kind of "global"
> rate limiting only for these probe packets of a TCP connection?
> 
> > I mean, what is the gain to be able to restart a frozen TCP session with
> > a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
> 
> I'm afraid it does a lot, especially in highly dynamic environments. You
> don't have just the additional latency, you may actually miss the full
> period where connectivity was there, and then just retransmit into the next
> connectivity disrupted period.

The problem with this is that with short and synchronized timers, all
sessions will flood at the same time and you'll get congestion this
time.

The reason for exponential backoff is also to smooth the restarts of
sessions, because timers are randomized.



Ilpo Järvinen Aug. 25, 2011, 10:14 a.m. UTC | #11
On Thu, 25 Aug 2011, Eric Dumazet wrote:

> Le jeudi 25 août 2011 à 11:46 +0200, Arnd Hannemann a écrit :
> > Hi Eric,
> > 
> > Am 25.08.2011 11:09, schrieb Eric Dumazet:
> 
> > > Maybe we should refine the thing a bit, to not reverse backoff unless
> > > rto is > some_threshold.
> > > 
> > > Say 10s being the value, that would give at most 92 tries.
> > 
> > I personally think that 10s would be too large and eliminate the benefit of the
> > algorithm, so I would prefer a different solution.
> > 
> > In case of one bulk data TCP session, which was transmitting hundreds of packets/s
> > before the connectivity disruption those worst case rate of 5 packet/s really
> > seems conservative enough.
> > 
> > However in case of a lot of idle connections, which were transmitting only
> > a number of packets per minute. We might increase the rate drastically for
> > a certain period until it throttles down. You say that we have a problem here
> > correct?
> > 
> > Do you think it would be possible without much hassle to use a kind of 
> > "global" rate limiting only for these probe packets of a TCP connection?
> >
> > > I mean, what is the gain to be able to restart a frozen TCP session with
> > > a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
> > 
> > I'm afraid it does a lot, especially in highly dynamic environments. You
> > don't have just the additional latency, you may actually miss the full
> > period where connectivity was there, and then just retransmit into the next
> > connectivity disrupted period.
> 
> Problem with this is that with short and synchronized timers, all
> sessions will flood at the same time and you'll get congestion this
> time.
>
> The reason for exponential backoff is also to smooth the restarts of
> sessions, because timers are randomized.

But if you get real congestion, the system will self-regulate using
exponential backoff due to the lack of ICMPs for some of the connections?
Arnd Hannemann Aug. 25, 2011, 10:15 a.m. UTC | #12
Hi Eric,

On 25.08.2011 at 12:02, Eric Dumazet wrote:
> Le jeudi 25 août 2011 à 11:46 +0200, Arnd Hannemann a écrit :
>> Hi Eric,
>>
>> Am 25.08.2011 11:09, schrieb Eric Dumazet:
> 
>>> Maybe we should refine the thing a bit, to not reverse backoff unless
>>> rto is > some_threshold.
>>>
>>> Say 10s being the value, that would give at most 92 tries.
>>
>> I personally think that 10s would be too large and eliminate the benefit of the
>> algorithm, so I would prefer a different solution.
>>
>> In case of one bulk data TCP session, which was transmitting hundreds of packets/s
>> before the connectivity disruption those worst case rate of 5 packet/s really
>> seems conservative enough.
>>
>> However in case of a lot of idle connections, which were transmitting only
>> a number of packets per minute. We might increase the rate drastically for
>> a certain period until it throttles down. You say that we have a problem here
>> correct?
>>
>> Do you think it would be possible without much hassle to use a kind of "global"
>> rate limiting only for these probe packets of a TCP connection?
>>
>>> I mean, what is the gain to be able to restart a frozen TCP session with
>>> a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
>>
>> I'm afraid it does a lot, especially in highly dynamic environments. You
>> don't have just the additional latency, you may actually miss the full
>> period where connectivity was there, and then just retransmit into the next
>> connectivity disrupted period.
> 
> Problem with this is that with short and synchronized timers, all
> sessions will flood at the same time and you'll get congestion this
> time.

Why do you think the timers are "synchronized"? If you have congestion
then you will do exponential backoff.

> The reason for exponential backoff is also to smooth the restarts of
> sessions, because timers are randomized.

If the RTOs of these sessions were "randomized", they keep this randomization
even if backoffs are reverted; at least they should.

Best regards
Arnd



Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..9b5f4bf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -520,6 +520,8 @@  static inline void tcp_bound_rto(const struct sock *sk)
 {
 	if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
 		inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
+	else if (inet_csk(sk)->icsk_rto < TCP_RTO_MIN)
+		inet_csk(sk)->icsk_rto = TCP_RTO_MIN;
 }
 
 static inline u32 __tcp_set_rto(const struct tcp_sock *tp)