diff mbox

[net-next,v2] tcp: avoid reducing cwnd when ACK+DSACK is received

Message ID 1420719609-18638-1-git-send-email-sebastien.barre@uclouvain.be
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Sébastien Barré Jan. 8, 2015, 12:20 p.m. UTC
When the peer has delayed ack enabled, it may reply to a probe with an
ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
such ACK+DSACK will be missed and only at next, higher ack will the TLP
episode be considered done. Since the DSACK is not present anymore,
this will cost a cwnd reduction.

This patch ensures that this scenario does not cause a cwnd reduction, since
receiving an ACK+DSACK indicates that both the initial segment and the probe
have been received by the peer.

Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>

---

Changes:
Applied Neal's comments:
-adapted commit first line
-moved logic to if condition, and removed is_tlp_dupack

Thanks Neal for those comments !

 net/ipv4/tcp_input.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

Comments

Eric Dumazet Jan. 8, 2015, 3:07 p.m. UTC | #1
On Thu, 2015-01-08 at 13:20 +0100, Sébastien Barré wrote:
> When the peer has delayed ack enabled, it may reply to a probe with an
> ACK+D-SACK, with ack value set to tlp_high_seq.


The most probable cause would not be a delayed ack (rto should be much
bigger), but a lost ACK.


>  In the current code,
> such ACK+DSACK will be missed and only at next, higher ack will the TLP
> episode be considered done. Since the DSACK is not present anymore,
> this will cost a cwnd reduction.
> 
> This patch ensures that this scenario does not cause a cwnd reduction, since
> receiving an ACK+DSACK indicates that both the initial segment and the probe
> have been received by the peer.

Do you have at hand a packetdrill test to demonstrate that the patch
works ?

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2015, 3:19 p.m. UTC | #2
On Thu, 2015-01-08 at 13:20 +0100, Sébastien Barré wrote:
> When the peer has delayed ack enabled, it may reply to a probe with an
> ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
> such ACK+DSACK will be missed and only at next, higher ack will the TLP
> episode be considered done. Since the DSACK is not present anymore,
> this will cost a cwnd reduction.
> 
...
>  net/ipv4/tcp_input.c | 30 +++++++++++++-----------------
>  1 file changed, 13 insertions(+), 17 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 075ab4d..cf63a29 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3363,29 +3363,25 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
>  static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	bool is_tlp_dupack = (ack == tp->tlp_high_seq) &&
> -			     !(flag & (FLAG_SND_UNA_ADVANCED |
> -				       FLAG_NOT_DUP | FLAG_DATA_SACKED));
>  

No idea why you removed this bool, it really helped to understand the
code. This makes your patch looks more complex than needed.

>  	/* Mark the end of TLP episode on receiving TLP dupack or when
>  	 * ack is after tlp_high_seq.
> +	 * With delayed acks, we may also get a regular ACK+DSACK, in which
> +	 * case we don't want to reduce the cwnd either.
>  	 */
> -	if (is_tlp_dupack) {
> +	if (((ack == tp->tlp_high_seq) &&
> +	     !(flag & (FLAG_SND_UNA_ADVANCED |
> +		       FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
> +	    (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
>  		tp->tlp_high_seq = 0;
> -		return;
> -	}


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sébastien Barré Jan. 8, 2015, 3:24 p.m. UTC | #3
Hi Eric,

Le 08/01/2015 16:07, Eric Dumazet a écrit :
>
>>   In the current code,
>> such ACK+DSACK will be missed and only at next, higher ack will the TLP
>> episode be considered done. Since the DSACK is not present anymore,
>> this will cost a cwnd reduction.
>>
>> This patch ensures that this scenario does not cause a cwnd reduction, since
>> receiving an ACK+DSACK indicates that both the initial segment and the probe
>> have been received by the peer.
> Do you have at hand a packetdrill test to demonstrate that the patch
> works ?
Not currently, but that would be very good indeed.
I will prepare one and send it.

regards,

Sébastien.
>
> Thanks !
>
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sébastien Barré Jan. 8, 2015, 3:39 p.m. UTC | #4
Le 08/01/2015 16:19, Eric Dumazet a écrit :
> On Thu, 2015-01-08 at 13:20 +0100, Sébastien Barré wrote:
>> When the peer has delayed ack enabled, it may reply to a probe with an
>> ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
>> such ACK+DSACK will be missed and only at next, higher ack will the TLP
>> episode be considered done. Since the DSACK is not present anymore,
>> this will cost a cwnd reduction.
>>
> ...
>>   net/ipv4/tcp_input.c | 30 +++++++++++++-----------------
>>   1 file changed, 13 insertions(+), 17 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 075ab4d..cf63a29 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3363,29 +3363,25 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
>>   static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
>>   {
>>   	struct tcp_sock *tp = tcp_sk(sk);
>> -	bool is_tlp_dupack = (ack == tp->tlp_high_seq) &&
>> -			     !(flag & (FLAG_SND_UNA_ADVANCED |
>> -				       FLAG_NOT_DUP | FLAG_DATA_SACKED));
>>   
> No idea why you removed this bool, it really helped to understand the
> code. This makes your patch looks more complex than needed.
I did not remove it in v1 
(http://www.spinics.net/lists/netdev/msg308653.html)
The removal was a request from Neal 
(http://www.spinics.net/lists/netdev/msg308758.html)

I think he found it special to have part of the logic apart, in a bool, 
and part of it in the if condition.
One possible option is to restore is_tlp_dupack and include the DSACK 
check in it, although this will
not do much more than moving complexity in the bool definition. But 
indeed, that might make
the patch more readable.

What do you and Neal think ?

regards,

Sébastien.
>
>>   	/* Mark the end of TLP episode on receiving TLP dupack or when
>>   	 * ack is after tlp_high_seq.
>> +	 * With delayed acks, we may also get a regular ACK+DSACK, in which
>> +	 * case we don't want to reduce the cwnd either.
>>   	 */
>> -	if (is_tlp_dupack) {
>> +	if (((ack == tp->tlp_high_seq) &&
>> +	     !(flag & (FLAG_SND_UNA_ADVANCED |
>> +		       FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
>> +	    (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
>>   		tp->tlp_high_seq = 0;
>> -		return;
>> -	}
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neal Cardwell Jan. 8, 2015, 3:43 p.m. UTC | #5
> Le 08/01/2015 16:07, Eric Dumazet a écrit :
>> Do you have at hand a packetdrill test to demonstrate that the patch
>> works ?

I cooked up the packetdrill test below when Sebastien sent out his v1
a few weeks ago. It fails on a kernel without his patch, and passes on
a kernel with his patch.

The code change looks fine to me, but if Eric prefers that the
expression be assigned to a bool before the check, that also sounds
fine to me.

neal


------------
// Establish a connection.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0     setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.020 < . 1:1(0) ack 1 win 257
+0    accept(3, ..., ...) = 4

// Send 1 packet.
+0    write(4, ..., 1000) = 1000
+0    > P. 1:1001(1000) ack 1

// Loss probe retransmission.
// packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
// In this case, this means: 1.5*RTT + 200ms = 230ms
+.230 > P. 1:1001(1000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10 }%

// Receiver ACKs at tlp_high_seq with a DSACK,
// indicating they received the original packet and probe.
+.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
+0    %{ assert tcpi_snd_cwnd == 10 }%

// Send another packet.
+0    write(4, ..., 1000) = 1000
+0    > P. 1001:2001(1000) ack 1

// Receiver ACKs above tlp_high_seq, which should end the TLP episode
// if we haven't already. We should not reduce cwnd.
+.020 < . 1:1(0) ack 2001 win 257
+0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neal Cardwell Jan. 8, 2015, 3:49 p.m. UTC | #6
On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
<sebastien.barre@uclouvain.be> wrote:
> What do you and Neal think ?

My preference is to have the whole expression detecting the case where
the receiver got the probe packet encoded in a single expression. I
don't have a strong feeling about whether it should be stored in a
bool (to reduce the size of the diff) or written directly into the if
() expression (to reduce the size of the code). I'll defer to Eric on
which he thinks is better. :-)

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neal Cardwell Jan. 8, 2015, 3:52 p.m. UTC | #7
On Thu, Jan 8, 2015 at 10:49 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
> <sebastien.barre@uclouvain.be> wrote:
>> What do you and Neal think ?
>
> My preference is to have the whole expression detecting the case where
> the receiver got the probe packet encoded in a single expression.

If we do store it in a bool, then "is_tlp_dupack" is no longer quite
accurate (we're expanding the check to include ACKs that are not
dupacks), so I'd suggest a name like "is_probe_rcvd".

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2015, 3:59 p.m. UTC | #8
On Thu, 2015-01-08 at 16:24 +0100, Sébastien Barré wrote:
> Hi Eric,
> 
> Le 08/01/2015 16:07, Eric Dumazet a écrit :
> >
> >>   In the current code,
> >> such ACK+DSACK will be missed and only at next, higher ack will the TLP
> >> episode be considered done. Since the DSACK is not present anymore,
> >> this will cost a cwnd reduction.
> >>
> >> This patch ensures that this scenario does not cause a cwnd reduction, since
> >> receiving an ACK+DSACK indicates that both the initial segment and the probe
> >> have been received by the peer.
> > Do you have at hand a packetdrill test to demonstrate that the patch
> > works ?
> Not currently, but that would be very good indeed.
> I will prepare one and send it.

Here is a skeleton you might adapt :

// TLP test when a sender does not have any new segments for transmission.
// In the absence of ACKs, the sender retransmits an old segment as a probe
// after 2 RTTs.

// Tolerate RTO variations
--tolerance_usecs=15000

// Set up production config.
`../common/defaults.sh
sysctl -q net.ipv4.tcp_early_retrans=3
`

// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100  < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0     > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+0.100 < . 1:1(0) ack 1 win 257
+0     accept(3, ..., ...) = 4

// Send 10 MSS.
+0.100 write(4, ..., 10000) = 10000
+0 > . 1:5001(5000) ack 1
+0 > P. 5001:10001(5000) ack 1

// TLP retransmission in 2 RTTs.
+0.200 > P. 9001:10001(1000) ack 1
+0 %{
assert tcpi_snd_cwnd == 10, 'tcpi_snd_cwnd=%d' % tcpi_snd_cwnd
assert tcpi_unacked == 10
}%

+0.100 < . 1:1(0) ack 10001 win 257 <sack 9001:1001,nop,nop>

+0 %{
assert tcpi_snd_cwnd == 20, 'tcpi_snd_cwnd=%d' % tcpi_snd_cwnd
assert tcpi_unacked == 0, 'tcpi_unacked=%d' % tcpi_unacked
}%



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sébastien Barré Jan. 8, 2015, 4 p.m. UTC | #9
Le 08/01/2015 16:43, Neal Cardwell a écrit :
>> Le 08/01/2015 16:07, Eric Dumazet a écrit :
>>> Do you have at hand a packetdrill test to demonstrate that the patch
>>> works ?
> I cooked up the packetdrill test below when Sebastien sent out his v1
> a few weeks ago. It fails on a kernel without his patch, and passes on
> a kernel with his patch.
Thanks a lot for this ! I will include it in our test suite then.
I understand that there is convergence on having a
bool called is_probe_rcvd with the whole logic.
Eric, is this ok for you ?

Thanks,

Sébastien.
>
> The code change looks fine to me, but if Eric prefers that the
> expression be assigned to a bool before the check, that also sounds
> fine to me.
>
> neal
>
>
> ------------
> // Establish a connection.
> 0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0     setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0    bind(3, ..., ...) = 0
> +0    listen(3, 1) = 0
>
> +0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
> +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
> +.020 < . 1:1(0) ack 1 win 257
> +0    accept(3, ..., ...) = 4
>
> // Send 1 packet.
> +0    write(4, ..., 1000) = 1000
> +0    > P. 1:1001(1000) ack 1
>
> // Loss probe retransmission.
> // packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
> // In this case, this means: 1.5*RTT + 200ms = 230ms
> +.230 > P. 1:1001(1000) ack 1
> +0    %{ assert tcpi_snd_cwnd == 10 }%
>
> // Receiver ACKs at tlp_high_seq with a DSACK,
> // indicating they received the original packet and probe.
> +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
> +0    %{ assert tcpi_snd_cwnd == 10 }%
>
> // Send another packet.
> +0    write(4, ..., 1000) = 1000
> +0    > P. 1001:2001(1000) ack 1
>
> // Receiver ACKs above tlp_high_seq, which should end the TLP episode
> // if we haven't already. We should not reduce cwnd.
> +.020 < . 1:1(0) ack 2001 win 257
> +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2015, 4:25 p.m. UTC | #10
On Thu, 2015-01-08 at 10:49 -0500, Neal Cardwell wrote:
> On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
> <sebastien.barre@uclouvain.be> wrote:
> > What do you and Neal think ?
> 
> My preference is to have the whole expression detecting the case where
> the receiver got the probe packet encoded in a single expression. I
> don't have a strong feeling about whether it should be stored in a
> bool (to reduce the size of the diff) or written directly into the if
> () expression (to reduce the size of the code). I'll defer to Eric on
> which he thinks is better. :-)

There is no shame using helpers with nice names to help understand this
TCP stack. Even if the helper is used exactly once.

In this case, it seems we test 2 different conditions, so this could use
2 helpers with self describing names.

When I see : 

> +     if (((ack == tp->tlp_high_seq) &&
> +          !(flag & (FLAG_SND_UNA_ADVANCED |
> +                    FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
> +         (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
>               tp->tlp_high_seq = 0;

My brain hurts.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2015, 5:17 p.m. UTC | #11
On Thu, 2015-01-08 at 08:25 -0800, Eric Dumazet wrote:
> On Thu, 2015-01-08 at 10:49 -0500, Neal Cardwell wrote:
> > On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
> > <sebastien.barre@uclouvain.be> wrote:
> > > What do you and Neal think ?
> > 
> > My preference is to have the whole expression detecting the case where
> > the receiver got the probe packet encoded in a single expression. I
> > don't have a strong feeling about whether it should be stored in a
> > bool (to reduce the size of the diff) or written directly into the if
> > () expression (to reduce the size of the code). I'll defer to Eric on
> > which he thinks is better. :-)
> 
> There is no shame using helpers with nice names to help understand this
> TCP stack. Even if the helper is used exactly once.
> 
> In this case, it seems we test 2 different conditions, so this could use
> 2 helpers with self describing names.
> 
> When I see : 
> 
> > +     if (((ack == tp->tlp_high_seq) &&
> > +          !(flag & (FLAG_SND_UNA_ADVANCED |
> > +                    FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
> > +         (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
> >               tp->tlp_high_seq = 0;
> 
> My brain hurts.

BTW, tlp_high_seq is used no matter what (initial value is 0).

While the probability ACK seq being exactly 0 is small (1 out of 2^32),
probability that !before(ack, tp->tlp_high_seq) being a false positive
is very high. Initial sequence number are supposed to be random.

I wonder if setting tp->tlp_high_seq to 0 is the right thing to do.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2015, 5:27 p.m. UTC | #12
On Thu, 2015-01-08 at 09:17 -0800, Eric Dumazet wrote:

> BTW, tlp_high_seq is used no matter what (initial value is 0).
> 
> While the probability ACK seq being exactly 0 is small (1 out of 2^32),
> probability that !before(ack, tp->tlp_high_seq) being a false positive
> is very high. Initial sequence number are supposed to be random.
> 

Oh well, calls to tcp_process_tlp_ack() are guarded by :

if (tp->tlp_high_seq)
    tcp_process_tlp_ack(sk, ack, flag);



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuchung Cheng Jan. 9, 2015, 7:43 p.m. UTC | #13
On Thu, Jan 8, 2015 at 8:25 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Thu, 2015-01-08 at 10:49 -0500, Neal Cardwell wrote:
> > On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
> > <sebastien.barre@uclouvain.be> wrote:
> > > What do you and Neal think ?
> >
> > My preference is to have the whole expression detecting the case where
> > the receiver got the probe packet encoded in a single expression. I
> > don't have a strong feeling about whether it should be stored in a
> > bool (to reduce the size of the diff) or written directly into the if
> > () expression (to reduce the size of the code). I'll defer to Eric on
> > which he thinks is better. :-)
>
> There is no shame using helpers with nice names to help understand this
> TCP stack. Even if the helper is used exactly once.
>
> In this case, it seems we test 2 different conditions, so this could use
> 2 helpers with self describing names.
>
> When I see :
>
> > +     if (((ack == tp->tlp_high_seq) &&
> > +          !(flag & (FLAG_SND_UNA_ADVANCED |
> > +                    FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
> > +         (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
> >               tp->tlp_high_seq = 0;
>
> My brain hurts.
Sebastien: I suggest breaking down by ACK types for readability. e.g.,

/* This routine deals with acks during a TLP episode.
 * We mark the end of a TLP episode on receiving TLP dupack or when
 * ack is after tlp_high_seq.
 * Ref: loss detection algorithm in draft-dukkipati-tcpm-tcp-loss-probe.
 */
static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (before(ack, tp->tlp_high_seq))
                return;

        if (flag & FLAG_DSACKING_ACK) {
                /* This DSACK means original and TLP probe arrived; no loss */
                tp->tlp_high_seq = 0;
        } else if (after(ack, tp->tlp_high_seq)) {
                /* ACK advances: there was a loss, so reduce cwnd. Reset
                 * tlp_high_seq in tcp_init_cwnd_reduction()
                 */
                tcp_init_cwnd_reduction(sk);
                tcp_set_ca_state(sk, TCP_CA_CWR);
                tcp_end_cwnd_reduction(sk);
                tcp_try_keep_open(sk);
                NET_INC_STATS_BH(sock_net(sk),
                                 LINUX_MIB_TCPLOSSPROBERECOVERY);
        } else if (!(flag & (FLAG_SND_UNA_ADVANCED |
                             FLAG_NOT_DUP | FLAG_DATA_SACKED))) {
                /* Pure dupack: original and TLP probe arrived; no loss */
                tp->tlp_high_seq = 0;
        }
}

>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neal Cardwell Jan. 9, 2015, 8:36 p.m. UTC | #14
On Fri, Jan 9, 2015 at 2:43 PM, Yuchung Cheng <ycheng@google.com> wrote:
> Sebastien: I suggest breaking down by ACK types for readability. e.g.,

I like this approach, as well.  It breaks up the logic into smaller
pieces that are easier to comment and understand, without having
helper bool flags to read and mentally track. And this version passes
all our internal TLP tests, FWIW.

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sébastien Barré Jan. 10, 2015, 11:51 a.m. UTC | #15
All,

Le 09/01/2015 20:43, Yuchung Cheng a écrit :
>
> Sebastien: I suggest breaking down by ACK types for readability. e.g.,
>
> /* This routine deals with acks during a TLP episode.
>   * We mark the end of a TLP episode on receiving TLP dupack or when
>   * ack is after tlp_high_seq.
>   * Ref: loss detection algorithm in draft-dukkipati-tcpm-tcp-loss-probe.
>   */
> static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
> {
>          struct tcp_sock *tp = tcp_sk(sk);
>
>          if (before(ack, tp->tlp_high_seq))
>                  return;
>
>          if (flag & FLAG_DSACKING_ACK) {
>                  /* This DSACK means original and TLP probe arrived; no loss */
>                  tp->tlp_high_seq = 0;
>          } else if (after(ack, tp->tlp_high_seq)) {
>                  /* ACK advances: there was a loss, so reduce cwnd. Reset
>                   * tlp_high_seq in tcp_init_cwnd_reduction()
Indeed, hadn't seen that.
>                   */
>                  tcp_init_cwnd_reduction(sk);
>                  tcp_set_ca_state(sk, TCP_CA_CWR);
>                  tcp_end_cwnd_reduction(sk);
>                  tcp_try_keep_open(sk);
>                  NET_INC_STATS_BH(sock_net(sk),
>                                   LINUX_MIB_TCPLOSSPROBERECOVERY);
>          } else if (!(flag & (FLAG_SND_UNA_ADVANCED |
>                               FLAG_NOT_DUP | FLAG_DATA_SACKED))) {
>                  /* Pure dupack: original and TLP probe arrived; no loss */
>                  tp->tlp_high_seq = 0;
>          }
> }
That looks much more readable compared to my v2.
It is currently passing our tests (These are in fact MPTCP tests appart 
from Neal's packetdrill that I will add, but actually the MPTCP stack 
happens to reveal this situation quite easily, I think because in MPTCP, 
we store the send queue in the "meta-flow", which currently cannot be 
used for tail loss probes).

As probably everyone will be happy with this (Eric as well ?), I suggest 
I prepare a v3 once all our tests are passed as well, with Yuchung's 
structure and Neal's packetdrill test in the commit text. Will also add 
proper credit as there is now stuff from several people in those few 
lines now :-).

Looks good ?

Thanks again for your fast and helpful interactions !

Sébastien.

>
>>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 10, 2015, 5:37 p.m. UTC | #16
On Sat, 2015-01-10 at 12:51 +0100, Sébastien Barré wrote:

> That looks much more readable compared to my v2.
> It is currently passing our tests (These are in fact MPTCP tests appart 
> from Neal's packetdrill that I will add, but actually the MPTCP stack 
> happens to reveal this situation quite easily, I think because in MPTCP, 
> we store the send queue in the "meta-flow", which currently cannot be 
> used for tail loss probes).
> 
> As probably everyone will be happy with this (Eric as well ?), I suggest 
> I prepare a v3 once all our tests are passed as well, with Yuchung's 
> structure and Neal's packetdrill test in the commit text. Will also add 
> proper credit as there is now stuff from several people in those few 
> lines now :-).
> 
> Looks good ?

Definitely !

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Jan. 12, 2015, 11:52 a.m. UTC | #17
From: Yuchung Cheng

> Sent: 09 January 2015 19:44

...
> Sebastien: I suggest breaking down by ACK types for readability. e.g.,

> 

> /* This routine deals with acks during a TLP episode.

>  * We mark the end of a TLP episode on receiving TLP dupack or when

>  * ack is after tlp_high_seq.

>  * Ref: loss detection algorithm in draft-dukkipati-tcpm-tcp-loss-probe.

>  */

> static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)

> {

>         struct tcp_sock *tp = tcp_sk(sk);

> 

>         if (before(ack, tp->tlp_high_seq))

>                 return;

> 

>         if (flag & FLAG_DSACKING_ACK) {

>                 /* This DSACK means original and TLP probe arrived; no loss */

>                 tp->tlp_high_seq = 0;


I think I'd add a 'return' here.

>         } else if (after(ack, tp->tlp_high_seq)) {

>                 /* ACK advances: there was a loss, so reduce cwnd. Reset

>                  * tlp_high_seq in tcp_init_cwnd_reduction()

>                  */

>                 tcp_init_cwnd_reduction(sk);

>                 tcp_set_ca_state(sk, TCP_CA_CWR);

>                 tcp_end_cwnd_reduction(sk);

>                 tcp_try_keep_open(sk);

>                 NET_INC_STATS_BH(sock_net(sk),

>                                  LINUX_MIB_TCPLOSSPROBERECOVERY);


and here

>         } else if (!(flag & (FLAG_SND_UNA_ADVANCED |

>                              FLAG_NOT_DUP | FLAG_DATA_SACKED))) {

>                 /* Pure dupack: original and TLP probe arrived; no loss */

>                 tp->tlp_high_seq = 0;

>         }

> }


	David
Neal Cardwell Jan. 12, 2015, 3:02 p.m. UTC | #18
On Mon, Jan 12, 2015 at 6:52 AM, David Laight <David.Laight@aculab.com> wrote:
>>         if (flag & FLAG_DSACKING_ACK) {
>>                 /* This DSACK means original and TLP probe arrived; no loss */
>>                 tp->tlp_high_seq = 0;
>
> I think I'd add a 'return' here.

What's the benefit of adding 'return' in those two spots? That adds
extra code to read, with no change in behavior, and no increase in
maintainability.

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..cf63a29 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3363,29 +3363,25 @@  static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
 static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	bool is_tlp_dupack = (ack == tp->tlp_high_seq) &&
-			     !(flag & (FLAG_SND_UNA_ADVANCED |
-				       FLAG_NOT_DUP | FLAG_DATA_SACKED));
 
 	/* Mark the end of TLP episode on receiving TLP dupack or when
 	 * ack is after tlp_high_seq.
+	 * With delayed acks, we may also get a regular ACK+DSACK, in which
+	 * case we don't want to reduce the cwnd either.
 	 */
-	if (is_tlp_dupack) {
+	if (((ack == tp->tlp_high_seq) &&
+	     !(flag & (FLAG_SND_UNA_ADVANCED |
+		       FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
+	    (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
 		tp->tlp_high_seq = 0;
-		return;
-	}
-
-	if (after(ack, tp->tlp_high_seq)) {
+	} else if (after(ack, tp->tlp_high_seq)) {
 		tp->tlp_high_seq = 0;
-		/* Don't reduce cwnd if DSACK arrives for TLP retrans. */
-		if (!(flag & FLAG_DSACKING_ACK)) {
-			tcp_init_cwnd_reduction(sk);
-			tcp_set_ca_state(sk, TCP_CA_CWR);
-			tcp_end_cwnd_reduction(sk);
-			tcp_try_keep_open(sk);
-			NET_INC_STATS_BH(sock_net(sk),
-					 LINUX_MIB_TCPLOSSPROBERECOVERY);
-		}
+		tcp_init_cwnd_reduction(sk);
+		tcp_set_ca_state(sk, TCP_CA_CWR);
+		tcp_end_cwnd_reduction(sk);
+		tcp_try_keep_open(sk);
+		NET_INC_STATS_BH(sock_net(sk),
+				 LINUX_MIB_TCPLOSSPROBERECOVERY);
 	}
 }