[v3,net,1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

Message ID 1520936711-16784-2-git-send-email-ilpo.jarvinen@helsinki.fi
State Deferred
Delegated to: David Miller
Headers show
Series
  • tcp: fixes to non-SACK TCP
Related show

Commit Message

Ilpo Järvinen March 13, 2018, 10:25 a.m.
A miscalculation for the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO as it may include segments that were not lost and therefore did
not need retransmissions in the slow start following the RTO. Then
tcp_slow_start will add the excess into cwnd bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Comments

Yuchung Cheng March 27, 2018, 2:07 a.m. | #1
On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> A miscalculation for the number of acknowledged packets occurs during
> RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> any non-retransmitted skbs. The reason is that pkts_acked value
> calculated in tcp_clean_rtx_queue is not correct for slow start after
> RTO as it may include segments that were not lost and therefore did
> not need retransmissions in the slow start following the RTO. Then
> tcp_slow_start will add the excess into cwnd bloating it and
> triggering a burst.
>
> Instead, we want to pass only the number of retransmitted segments
> that were covered by the cumulative ACK (and potentially newly sent
> data segments too if the cumulative ACK covers that far).
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..4a26c09 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (tcp_is_reno(tp)) {
> +                       /* Due to discontinuity on RTO in the artificial
> +                        * sacked_out calculations, TCP must restrict
> +                        * pkts_acked without SACK to rexmits and new data
> +                        * segments
> +                        */
> +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> +                               pkts_acked = rexmit_acked + newdata_acked;
> +
My understanding is there are two problems

1) your fix: the reordering logic in tcp-remove_reno_sacks requires
precise cumulatively acked count, not newly acked count?


2) current code: pkts_acked can substantially over-estimate the newly
delivered pkts in both SACK and non-SACK cases. For example, let's say
99/100 packets are already sacked, and the next ACK acks 100 pkts.
pkts_acked == 100 but really only one packet is delivered. It's wrong
to inform congestion control that 100 packets have just delivered.
AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
the newly delivered packets.

A better fix for both SACK and non-SACK, seems to be moving
ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
calibrated? this is what BBR is currently doing to avoid these pitfalls.


>                         tcp_remove_reno_sacks(sk, pkts_acked);
>                 } else {
>                         int delta;
> --
> 2.7.4
>
Ilpo Järvinen March 27, 2018, 2:23 p.m. | #2
On Mon, 26 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > A miscalculation for the number of acknowledged packets occurs during
> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > any non-retransmitted skbs. The reason is that pkts_acked value
> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > RTO as it may include segments that were not lost and therefore did
> > not need retransmissions in the slow start following the RTO. Then
> > tcp_slow_start will add the excess into cwnd bloating it and
> > triggering a burst.
> >
> > Instead, we want to pass only the number of retransmitted segments
> > that were covered by the cumulative ACK (and potentially newly sent
> > data segments too if the cumulative ACK covers that far).
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..4a26c09 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >         long seq_rtt_us = -1L;
> >         long ca_rtt_us = -1L;
> >         u32 pkts_acked = 0;
> > +       u32 rexmit_acked = 0;
> > +       u32 newdata_acked = 0;
> >         u32 last_in_flight = 0;
> >         bool rtt_update;
> >         int flag = 0;
> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
> > -                       if (sacked & TCPCB_SACKED_RETRANS)
> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
> >                                 tp->retrans_out -= acked_pcount;
> > +                               rexmit_acked += acked_pcount;
> > +                       }
> >                         flag |= FLAG_RETRANS_DATA_ACKED;
> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >                         last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 reord = start_seq;
> >                         if (!after(scb->end_seq, tp->high_seq))
> >                                 flag |= FLAG_ORIG_SACK_ACKED;
> > +                       else
> > +                               newdata_acked += acked_pcount;
> >                 }
> >
> >                 if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (tcp_is_reno(tp)) {
> > +                       /* Due to discontinuity on RTO in the artificial
> > +                        * sacked_out calculations, TCP must restrict
> > +                        * pkts_acked without SACK to rexmits and new data
> > +                        * segments
> > +                        */
> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> > +                               pkts_acked = rexmit_acked + newdata_acked;
> > +
> My understanding is there are two problems
> 
> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> precise cumulatively acked count, not newly acked count?

While I'm not entirely sure if you intented to say that my fix is broken 
or not, I thought this very difference alot while making the fix and I 
believe that this fix is needed because of the discontinuity at RTO 
(sacked_out is cleared as we set L-bits + lost_out). This is an artifact 
in the imitation of sacked_out for non-SACK but at RTO we can't keep that 
in sync because we set L-bits (and have no S-bits to guide us). Thus, we 
cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.

In tcp_remove_reno_sacks acked - sacked_out is being used to calculate 
tp->delivered, using plain cumulative acked causes congestion control 
breakage later as call to tcp_cong_control will directly use the 
difference in tp->delivered.

This boils down the exact definition of tp->delivered (the one given in 
the header is not detailed enough). I guess you might have better idea
what it exactly is since one of you has added it? There are subtle things 
in the defination that can make it entirely unsuitable for cc decisions. 
Should those segments that we (possibly) already counted into 
tp->delivered during (potentially preceeding) CA_Recovery be added to it 
for _second time_ or not? This fix avoids such double counting (the 
price is that we might underestimate). For SACK+ACK losses, the similar 
question is: Should those segments that we missed counting into 
tp->delivered during preceeding CA_Recovery (due to losing enough SACKs) 
be added into tp->delivered now during RTO recovery or not (I'm not 
proposing we fix this unless we want to fix the both issues at the same 
time here as its impact with SACK is not that significant)? Is 
tp->delivered supposed to under- or overestimate (in the cases we're not 
sure what/when something happened)? ...If it's overestimating under any 
circumstances (for the current ACK), we cannot base our cc decision on it.

tcp_check_reno_reordering might like to have the cumulatively acked count 
but due to the forementioned discontinuity we anyway cannot accurately 
provide in CA_Loss what tcp_limit_reno_sacked expects (and the 
other CA states are unaffected by this fix). While we could call 
tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it wouldn't 
remove the fundamental discontinuity problem that we have for what 
tcp_limit_reno_sacked assumes about sacked_out. It might actually be so
that tcp_limit_reno_sacked is just never going to work after we zero 
tp->sacked_out (this would actually be the problem #3) or at least ends 
up underestimating the reordering degree by the amount we cleared from 
sacked_out at RTO?

> 2) current code: pkts_acked can substantially over-estimate the newly
> delivered pkts in both SACK and non-SACK cases. For example, let's say
> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
> pkts_acked == 100 but really only one packet is delivered. It's wrong
> to inform congestion control that 100 packets have just delivered.
> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
> the newly delivered packets.
>
> A better fix for both SACK and non-SACK, seems to be moving
> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
> calibrated? this is what BBR is currently doing to avoid these pitfalls.

Unfortunately that would not fix the problematic calculation of 
tp->delivered.

But you might be right that there's a problem #2 (it's hard to notice 
for real though as most of the cc modules don't seem to use it for 
anything). However, this is for pkts_acked so are you sure what the 
cc modules exactly expect: the shrinkage in # of outstanding packets or 
the number of newly delivered packets?

Looking (only) quickly into the modules (that use it other than in 
CA_Open):
- Somewhat suspicious delayed ACK detection logic in cdg
- HTCP might want shrinkage in # of outstanding packets (but I'm not 
entirely sure) for throughput measurement
- I guess TCP illinois is broken (assumes it's newly delivered packets) 
but it would not need to use it at all as it just proxies the value in 
a local variable into tcp_illinois_cong_avoid that has the correct acked 
readily available.

There are separate cong_avoid and cong_control callbacks that are more 
obviously oriented for doing cc where as I'd think pkts_acked callback 
seems more oriented to for RTT-sample related operations. Therefore, I'm 
not entirely sure I this is what is wanted for pkts_acked, especially 
given the HTCP example.
Yuchung Cheng March 27, 2018, 3:06 p.m. | #3
On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> >
>> > A miscalculation for the number of acknowledged packets occurs during
>> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> > any non-retransmitted skbs. The reason is that pkts_acked value
>> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> > RTO as it may include segments that were not lost and therefore did
>> > not need retransmissions in the slow start following the RTO. Then
>> > tcp_slow_start will add the excess into cwnd bloating it and
>> > triggering a burst.
>> >
>> > Instead, we want to pass only the number of retransmitted segments
>> > that were covered by the cumulative ACK (and potentially newly sent
>> > data segments too if the cumulative ACK covers that far).
>> >
>> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>> > ---
>> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>> >  1 file changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> > index 9a1b3c1..4a26c09 100644
>> > --- a/net/ipv4/tcp_input.c
>> > +++ b/net/ipv4/tcp_input.c
>> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >         long seq_rtt_us = -1L;
>> >         long ca_rtt_us = -1L;
>> >         u32 pkts_acked = 0;
>> > +       u32 rexmit_acked = 0;
>> > +       u32 newdata_acked = 0;
>> >         u32 last_in_flight = 0;
>> >         bool rtt_update;
>> >         int flag = 0;
>> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
>> > -                       if (sacked & TCPCB_SACKED_RETRANS)
>> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
>> >                                 tp->retrans_out -= acked_pcount;
>> > +                               rexmit_acked += acked_pcount;
>> > +                       }
>> >                         flag |= FLAG_RETRANS_DATA_ACKED;
>> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>> >                         last_ackt = skb->skb_mstamp;
>> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                                 reord = start_seq;
>> >                         if (!after(scb->end_seq, tp->high_seq))
>> >                                 flag |= FLAG_ORIG_SACK_ACKED;
>> > +                       else
>> > +                               newdata_acked += acked_pcount;
>> >                 }
>> >
>> >                 if (sacked & TCPCB_SACKED_ACKED) {
>> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (tcp_is_reno(tp)) {
>> > +                       /* Due to discontinuity on RTO in the artificial
>> > +                        * sacked_out calculations, TCP must restrict
>> > +                        * pkts_acked without SACK to rexmits and new data
>> > +                        * segments
>> > +                        */
>> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
>> > +                               pkts_acked = rexmit_acked + newdata_acked;
>> > +
>> My understanding is there are two problems
>>
>> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
>> precise cumulatively acked count, not newly acked count?
>
> While I'm not entirely sure if you intented to say that my fix is broken
> or not, I thought this very difference alot while making the fix and I
> believe that this fix is needed because of the discontinuity at RTO
> (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>
> In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> tp->delivered, using plain cumulative acked causes congestion control
> breakage later as call to tcp_cong_control will directly use the
> difference in tp->delivered.
>
> This boils down the exact definition of tp->delivered (the one given in
> the header is not detailed enough). I guess you might have better idea
> what it exactly is since one of you has added it? There are subtle things
> in the defination that can make it entirely unsuitable for cc decisions.
> Should those segments that we (possibly) already counted into
> tp->delivered during (potentially preceeding) CA_Recovery be added to it
> for _second time_ or not? This fix avoids such double counting (the
Where is the double counting, assuming normal DUPACK behavior?

In the non-sack case:

1. upon receiving a DUPACK, we assume one packet has been delivered by
incrementing tp->delivered in tcp_add_reno_sack()
2. upon receiving a partial ACK or an ACK that acks recovery point
(high_seq), tp->delivered is incremented by (cumulatively acked -
#dupacks) in tcp_remove_reno_sacks()

therefore tp->delivered is tracking the # of packets delivered
(sacked, acked, DUPACK'd) with the most information it could have
inferred.

From congestion control's perspective, it cares about the delivery
information (e.g. how much), not the sequences (what or how).

I am pointing out that

1) your fix may fix one corner CC packet accounting issue in the
non-SACK and CA_Loss
2) but it does not fix the other (major) CC packet accounting issue in
the SACK case
3) it breaks the dynamic dupthresh / reordering in the non-SACK case
as tcp_check_reno_reordering requires strictly cum. ack.

Therefore I prefer
a) using tp->delivered to fix (1)(2)
b) perhaps improving tp->delivered SACK emulation code in the non-SACK case


> price is that we might underestimate). For SACK+ACK losses, the similar
> question is: Should those segments that we missed counting into
> tp->delivered during preceeding CA_Recovery (due to losing enough SACKs)
> be added into tp->delivered now during RTO recovery or not (I'm not
> proposing we fix this unless we want to fix the both issues at the same
> time here as its impact with SACK is not that significant)? Is
> tp->delivered supposed to under- or overestimate (in the cases we're not
> sure what/when something happened)? ...If it's overestimating under any
> circumstances (for the current ACK), we cannot base our cc decision on it.
>
> tcp_check_reno_reordering might like to have the cumulatively acked count
> but due to the forementioned discontinuity we anyway cannot accurately
> provide in CA_Loss what tcp_limit_reno_sacked expects (and the
> other CA states are unaffected by this fix). While we could call
> tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it wouldn't
> remove the fundamental discontinuity problem that we have for what
> tcp_limit_reno_sacked assumes about sacked_out. It might actually be so
> that tcp_limit_reno_sacked is just never going to work after we zero
> tp->sacked_out (this would actually be the problem #3) or at least ends
> up underestimating the reordering degree by the amount we cleared from
> sacked_out at RTO?
>
>> 2) current code: pkts_acked can substantially over-estimate the newly
>> delivered pkts in both SACK and non-SACK cases. For example, let's say
>> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
>> pkts_acked == 100 but really only one packet is delivered. It's wrong
>> to inform congestion control that 100 packets have just delivered.
>> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
>> the newly delivered packets.
>>
>> A better fix for both SACK and non-SACK, seems to be moving
>> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
>> calibrated? this is what BBR is currently doing to avoid these pitfalls.
>
> Unfortunately that would not fix the problematic calculation of
> tp->delivered.
>
> But you might be right that there's a problem #2 (it's hard to notice
> for real though as most of the cc modules don't seem to use it for
> anything). However, this is for pkts_acked so are you sure what the
> cc modules exactly expect: the shrinkage in # of outstanding packets or
> the number of newly delivered packets?
>
> Looking (only) quickly into the modules (that use it other than in
> CA_Open):
> - Somewhat suspicious delayed ACK detection logic in cdg
> - HTCP might want shrinkage in # of outstanding packets (but I'm not
> entirely sure) for throughput measurement
> - I guess TCP illinois is broken (assumes it's newly delivered packets)
> but it would not need to use it at all as it just proxies the value in
> a local variable into tcp_illinois_cong_avoid that has the correct acked
> readily available.
>
> There are separate cong_avoid and cong_control callbacks that are more
> obviously oriented for doing cc where as I'd think pkts_acked callback
> seems more oriented to for RTT-sample related operations. Therefore, I'm
> not entirely sure I this is what is wanted for pkts_acked, especially
> given the HTCP example.
>
>
> --
>  i.
Ilpo Järvinen March 28, 2018, 12:45 p.m. | #4
On Tue, 27 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> <ilpo.jarvinen@helsinki.fi> wrote:
> >> >
> >> > A miscalculation for the number of acknowledged packets occurs during
> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> > RTO as it may include segments that were not lost and therefore did
> >> > not need retransmissions in the slow start following the RTO. Then
> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> > triggering a burst.
> >> >
> >> > Instead, we want to pass only the number of retransmitted segments
> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> > data segments too if the cumulative ACK covers that far).
> >> >
> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> >> > ---
> >>
> >> My understanding is there are two problems
> >>
> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> >> precise cumulatively acked count, not newly acked count?
> >
> > While I'm not entirely sure if you intented to say that my fix is broken
> > or not, I thought this very difference alot while making the fix and I
> > believe that this fix is needed because of the discontinuity at RTO
> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> >
> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > tp->delivered, using plain cumulative acked causes congestion control
> > breakage later as call to tcp_cong_control will directly use the
> > difference in tp->delivered.
> >
> > This boils down the exact definition of tp->delivered (the one given in
> > the header is not detailed enough). I guess you might have better idea
> > what it exactly is since one of you has added it? There are subtle things
> > in the defination that can make it entirely unsuitable for cc decisions.
> > Should those segments that we (possibly) already counted into
> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> > for _second time_ or not? This fix avoids such double counting (the
> Where is the double counting, assuming normal DUPACK behavior?
> 
> In the non-sack case:
> 
> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> incrementing tp->delivered in tcp_add_reno_sack()

1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity 
I've tried to point out quite many times already)...

> 2. upon receiving a partial ACK or an ACK that acks recovery point
> (high_seq), tp->delivered is incremented by (cumulatively acked -
> #dupacks) in tcp_remove_reno_sacks()

...and this won't happen correctly anymore after RTO (since non-SACK 
won't keep #dupacks due to the discontinuity). Thus we end up adding
cumulatively acked - 0 to tp->delivered on those ACKs.

> therefore tp->delivered is tracking the # of packets delivered
> (sacked, acked, DUPACK'd) with the most information it could have
> inferred.

Since you didn't answer any of my questions about tp->delivered directly, 
let me rephrase them to this example (non-SACK, of course):

4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for 
three segments. How much should tp->delivered be incremented? 2 or 3?

...I think 2 is the right answer.

> From congestion control's perspective, it cares about the delivery
> information (e.g. how much), not the sequences (what or how).

I guess you must have missed my point. I'm talking about "how much" 
whole the time. It's just about when can we account for it (and when not).

> I am pointing out that
> 
> 1) your fix may fix one corner CC packet accounting issue in the
> non-SACK and CA_Loss
> 2) but it does not fix the other (major) CC packet accounting issue in
> the SACK case

You say major but it's major if and only if one is using one of the 
affected cc modules which were not that many and IMHO not very mainstream 
anyway (check my list from the previous email, not that I'd mind if they'd 
get fixed too). The rest of the cc modules do not seem to use the incorrect
value even if some of them have the pkts_acked callback.

Other CC packet accounting (regardless of SACK) is based on tp->delivered 
and tp->delivered is NOT similarly miscounted with SACK because the logic 
depend on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK 
loss).

> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> as tcp_check_reno_reordering requires strictly cum. ack.

I think that the current non-SACK reordering detection logic is not that 
sound after RTO (but I'm yet to prove this). ...There seems to be some 
failure modes which is why I even thought of disabling the whole thing
for non-SACK RTO recoveries and as it now seems you do care more about 
non-SACK than you initial claimed, I might even have motivation to fix 
more those corners rather than the worst bugs only ;-). ...But I'll make 
the tp->delivered fix only in this patch to avoid any change this part of 
the code.

> Therefore I prefer
> a) using tp->delivered to fix (1)(2)
> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case

Somehow I get an impression that you might assume/say here that 
tp->delivered is all correct for non-SACK? ...It isn't without a fix.
Therefore a) won't help any for non-SACK. Tp->delivered for non-SACK is 
currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect 
pkts_acked is fed to it. ...Thus, b) is an intermediate step in the 
miscalculation chain I'm fixing with this change. B) resolves also (1) 
without additional changes to logic!

I've included below the updated change that only fixes tp->delivered 
calculation (not tested besides compiling). ...But I think it's worse than 
my previous version because tcp_remove_reno_sacks then assumes 
non-sensical L|S skbs (but there seem to be other, worse limitations in 
sacked_out imitation after RTO because we've all skbs marked with L-bit so 
it's not that big deal for me as most of the that imitation code is no-op 
anyway after RTO).


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..e11748d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 
 	if (acked > 0) {
 		/* One ACK acked hole. The rest eat duplicate ACKs. */
-		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
 		if (acked - 1 >= tp->sacked_out)
 			tp->sacked_out = 0;
 		else
@@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 	tcp_verify_left_out(tp);
 }
 
+static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
+{
+	if (acked > 0)
+		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
+}
+
 static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
 {
 	tp->sacked_out = 0;
@@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
+						      pkts_acked :
+						      rexmit_acked + newdata_acked);
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;
Yuchung Cheng March 28, 2018, 2:14 p.m. | #5
On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Tue, 27 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> >> <ilpo.jarvinen@helsinki.fi> wrote:
>> >> >
>> >> > A miscalculation for the number of acknowledged packets occurs during
>> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> >> > any non-retransmitted skbs. The reason is that pkts_acked value
>> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> >> > RTO as it may include segments that were not lost and therefore did
>> >> > not need retransmissions in the slow start following the RTO. Then
>> >> > tcp_slow_start will add the excess into cwnd bloating it and
>> >> > triggering a burst.
>> >> >
>> >> > Instead, we want to pass only the number of retransmitted segments
>> >> > that were covered by the cumulative ACK (and potentially newly sent
>> >> > data segments too if the cumulative ACK covers that far).
>> >> >
>> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>> >> > ---
>> >>
>> >> My understanding is there are two problems
>> >>
>> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
>> >> precise cumulatively acked count, not newly acked count?
>> >
>> > While I'm not entirely sure if you intented to say that my fix is broken
>> > or not, I thought this very difference alot while making the fix and I
>> > believe that this fix is needed because of the discontinuity at RTO
>> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
>> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
>> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
>> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>> >
>> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
>> > tp->delivered, using plain cumulative acked causes congestion control
>> > breakage later as call to tcp_cong_control will directly use the
>> > difference in tp->delivered.
>> >
>> > This boils down the exact definition of tp->delivered (the one given in
>> > the header is not detailed enough). I guess you might have better idea
>> > what it exactly is since one of you has added it? There are subtle things
>> > in the defination that can make it entirely unsuitable for cc decisions.
>> > Should those segments that we (possibly) already counted into
>> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
>> > for _second time_ or not? This fix avoids such double counting (the
>> Where is the double counting, assuming normal DUPACK behavior?
>>
>> In the non-sack case:
>>
>> 1. upon receiving a DUPACK, we assume one packet has been delivered by
>> incrementing tp->delivered in tcp_add_reno_sack()
>
> 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> I've tried to point out quite many times already)...
>
>> 2. upon receiving a partial ACK or an ACK that acks recovery point
>> (high_seq), tp->delivered is incremented by (cumulatively acked -
>> #dupacks) in tcp_remove_reno_sacks()
>
> ...and this won't happen correctly anymore after RTO (since non-SACK
> won't keep #dupacks due to the discontinuity). Thus we end up adding
> cumulatively acked - 0 to tp->delivered on those ACKs.
>
>> therefore tp->delivered is tracking the # of packets delivered
>> (sacked, acked, DUPACK'd) with the most information it could have
>> inferred.
>
> Since you didn't answer any of my questions about tp->delivered directly,
> let me rephrase them to this example (non-SACK, of course):
>
> 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> three segments. How much should tp->delivered be incremented? 2 or 3?
>
> ...I think 2 is the right answer.
>
>> From congestion control's perspective, it cares about the delivery
>> information (e.g. how much), not the sequences (what or how).
>
> I guess you must have missed my point. I'm talking about "how much"
> whole the time. It's just about when can we account for it (and when not).
>
>> I am pointing out that
>>
>> 1) your fix may fix one corner CC packet accounting issue in the
>> non-SACK and CA_Loss
>> 2) but it does not fix the other (major) CC packet accounting issue in
>> the SACK case
>
> You say major but it's major if and only if one is using one of the
> affected cc modules which were not that many and IMHO not very mainstream
> anyway (check my list from the previous email, not that I'd mind if they'd
> get fixed too). The rest of the cc modules do not seem to use the incorrect
> value even if some of them have the pkts_acked callback.
>
> Other CC packet accounting (regardless of SACK) is based on tp->delivered
> and tp->delivered is NOT similarly miscounted with SACK because the logic
> depend on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> loss).
>
>> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
>> as tcp_check_reno_reordering requires strictly cum. ack.
>
> I think that the current non-SACK reordering detection logic is not that
> sound after RTO (but I'm yet to prove this). ...There seems to be some
> failure modes which is why I even thought of disabling the whole thing
> for non-SACK RTO recoveries and as it now seems you do care more about
> non-SACK than you initial claimed, I might even have motivation to fix
> more those corners rather than the worst bugs only ;-). ...But I'll make
> the tp->delivered fix only in this patch to avoid any change this part of
> the code.
>
>> Therefore I prefer
>> a) using tp->delivered to fix (1)(2)
>> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
>
> Somehow I get an impression that you might assume/say here that
[resending in plaintext]
That's wrong impression. Perhaps it's worth re-iterating what I agree
and disagree

1. [agree] there's accounting issue in non-SACK as you discovered
which causes CC misbehavior

2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
    => that field is not used by common C.C. (you said so too)

3. [disagree] adjusting pkts_acked may not affect reordering
accounting in non-sack


For cubic or reno, the main source is the "acked_sacked" passed into
tcp_cong_avoid(). that variable is derived from tp->delivered.
Therefore we need to fix that to address the problem in (1)

I have yet to read your code. Will read later today.

> tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> Therefore a) won't help any for non-SACK. Tp->delivered for non-SACK is
> currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect
> pkts_acked is fed to it. ...Thus, b) is an intermediate step in the
> miscalculation chain I'm fixing with this change. B) resolves also (1)
> without additional changes to logic!
>
> I've included below the updated change that only fixes tp->delivered
> calculation (not tested besides compiling). ...But I think it's worse than
> my previous version because tcp_remove_reno_sacks then assumes
> non-sensical L|S skbs (but there seem to be other, worse limitations in
> sacked_out imitation after RTO because we've all skbs marked with L-bit so
> it's not that big deal for me as most of the that imitation code is no-op
> anyway after RTO).
>
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..e11748d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>
>         if (acked > 0) {
>                 /* One ACK acked hole. The rest eat duplicate ACKs. */
> -               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
>                 if (acked - 1 >= tp->sacked_out)
>                         tp->sacked_out = 0;
>                 else
> @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>         tcp_verify_left_out(tp);
>  }
>
> +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> +{
> +       if (acked > 0)
> +               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> +}
> +
>  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
>  {
>         tp->sacked_out = 0;
> @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (tcp_is_reno(tp)) {
> +                       tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> +                                                     pkts_acked :
> +                                                     rexmit_acked + newdata_acked);
>                         tcp_remove_reno_sacks(sk, pkts_acked);
>                 } else {
>                         int delta;
Yuchung Cheng March 28, 2018, 3:04 p.m. | #6
On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ycheng@google.com> wrote:
>
> On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> >> <ilpo.jarvinen@helsinki.fi> wrote:
> >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >> >
> >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> >> <ilpo.jarvinen@helsinki.fi> wrote:
> >> >> >
> >> >> > A miscalculation for the number of acknowledged packets occurs during
> >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> >> > RTO as it may include segments that were not lost and therefore did
> >> >> > not need retransmissions in the slow start following the RTO. Then
> >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> >> > triggering a burst.
> >> >> >
> >> >> > Instead, we want to pass only the number of retransmitted segments
> >> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> >> > data segments too if the cumulative ACK covers that far).
> >> >> >
> >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> >> >> > ---
> >> >>
> >> >> My understanding is there are two problems
> >> >>
> >> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> >> >> precise cumulatively acked count, not newly acked count?
> >> >
> >> > While I'm not entirely sure if you intented to say that my fix is broken
> >> > or not, I thought this very difference alot while making the fix and I
> >> > believe that this fix is needed because of the discontinuity at RTO
> >> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> >> >
> >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> >> > tp->delivered, using plain cumulative acked causes congestion control
> >> > breakage later as call to tcp_cong_control will directly use the
> >> > difference in tp->delivered.
> >> >
> >> > This boils down the exact definition of tp->delivered (the one given in
> >> > the header is not detailed enough). I guess you might have better idea
> >> > what it exactly is since one of you has added it? There are subtle things
> >> > in the defination that can make it entirely unsuitable for cc decisions.
> >> > Should those segments that we (possibly) already counted into
> >> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> >> > for _second time_ or not? This fix avoids such double counting (the
> >> Where is the double counting, assuming normal DUPACK behavior?
> >>
> >> In the non-sack case:
> >>
> >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> >> incrementing tp->delivered in tcp_add_reno_sack()
> >
> > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > I've tried to point out quite many times already)...
> >
> >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> >> #dupacks) in tcp_remove_reno_sacks()
> >
> > ...and this won't happen correctly anymore after RTO (since non-SACK
> > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > cumulatively acked - 0 to tp->delivered on those ACKs.
> >
> >> therefore tp->delivered is tracking the # of packets delivered
> >> (sacked, acked, DUPACK'd) with the most information it could have
> >> inferred.
> >
> > Since you didn't answer any of my questions about tp->delivered directly,
> > let me rephrase them to this example (non-SACK, of course):
> >
> > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> > three segments. How much should tp->delivered be incremented? 2 or 3?
> >
> > ...I think 2 is the right answer.
> >
> >> From congestion control's perspective, it cares about the delivery
> >> information (e.g. how much), not the sequences (what or how).
> >
> > I guess you must have missed my point. I'm talking about "how much"
> > whole the time. It's just about when can we account for it (and when not).
> >
> >> I am pointing out that
> >>
> >> 1) your fix may fix one corner CC packet accounting issue in the
> >> non-SACK and CA_Loss
> >> 2) but it does not fix the other (major) CC packet accounting issue in
> >> the SACK case
> >
> > You say major but it's major if and only if one is using one of the
> > affected cc modules which were not that many and IMHO not very mainstream
> > anyway (check my list from the previous email, not that I'd mind if they'd
> > get fixed too). The rest of the cc modules do not seem to use the incorrect
> > value even if some of them have the pkts_acked callback.
> >
> > Other CC packet accounting (regardless of SACK) is based on tp->delivered
> > and tp->delivered is NOT similarly miscounted with SACK because the logic
> > depend on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> > loss).
> >
> >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> >> as tcp_check_reno_reordering requires strictly cum. ack.
> >
> > I think that the current non-SACK reordering detection logic is not that
> > sound after RTO (but I'm yet to prove this). ...There seems to be some
> > failure modes which is why I even thought of disabling the whole thing
> > for non-SACK RTO recoveries and as it now seems you do care more about
> > non-SACK than you initial claimed, I might even have motivation to fix
> > more those corners rather than the worst bugs only ;-). ...But I'll make
> > the tp->delivered fix only in this patch to avoid any change this part of
> > the code.
> >
> >> Therefore I prefer
> >> a) using tp->delivered to fix (1)(2)
> >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
> >
> > Somehow I get an impression that you might assume/say here that
> [resending in plaintext]
> That's wrong impression. Perhaps it's worth re-iterating what I agree
> and disagree
>
> 1. [agree] there's accounting issue in non-SACK as you discovered
> which causes CC misbehavior
>
> 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
>     => that field is not used by common C.C. (you said so too)
>
> 3. [disagree] adjusting pkts_acked may not affect reordering
> accounting in non-sack
>
>
> For cubic or reno, the main source is the "acked_sacked" passed into
> tcp_cong_avoid(). that variable is derived from tp->delivered.
> Therefore we need to fix that to address the problem in (1)
>
> I have yet to read your code. Will read later today.
Your patch looks good. Some questions / comments:

1. Why only apply to CA_Loss and not also CA_Recovery? this may
mitigate GRO issue I noted in other thread.

2. minor style nit: can we adjust tp->delivered directly in
tcp_clean_rtx_queue(). I am personally against calling "non-sack" reno
(e.g. tcp_*reno*()) because its really confusing w/ Reno congestion
control when SACK is used. One day I'd like to rename all these *reno*
to _nonsack_.

Thanks

>
> > tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> > Therefore a) won't help any for non-SACK. Tp->delivered for non-SACK is
> > currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect
> > pkts_acked is fed to it. ...Thus, b) is an intermediate step in the
> > miscalculation chain I'm fixing with this change. B) resolves also (1)
> > without additional changes to logic!
> >
> > I've included below the updated change that only fixes tp->delivered
> > calculation (not tested besides compiling). ...But I think it's worse than
> > my previous version because tcp_remove_reno_sacks then assumes
> > non-sensical L|S skbs (but there seem to be other, worse limitations in
> > sacked_out imitation after RTO because we've all skbs marked with L-bit so
> > it's not that big deal for me as most of the that imitation code is no-op
> > anyway after RTO).
> >
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..e11748d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >
> >         if (acked > 0) {
> >                 /* One ACK acked hole. The rest eat duplicate ACKs. */
> > -               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> >                 if (acked - 1 >= tp->sacked_out)
> >                         tp->sacked_out = 0;
> >                 else
> > @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >         tcp_verify_left_out(tp);
> >  }
> >
> > +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> > +{
> > +       if (acked > 0)
> > +               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> > +}
> > +
> >  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
> >  {
> >         tp->sacked_out = 0;
> > @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >         long seq_rtt_us = -1L;
> >         long ca_rtt_us = -1L;
> >         u32 pkts_acked = 0;
> > +       u32 rexmit_acked = 0;
> > +       u32 newdata_acked = 0;
> >         u32 last_in_flight = 0;
> >         bool rtt_update;
> >         int flag = 0;
> > @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
> > -                       if (sacked & TCPCB_SACKED_RETRANS)
> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
> >                                 tp->retrans_out -= acked_pcount;
> > +                               rexmit_acked += acked_pcount;
> > +                       }
> >                         flag |= FLAG_RETRANS_DATA_ACKED;
> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >                         last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 reord = start_seq;
> >                         if (!after(scb->end_seq, tp->high_seq))
> >                                 flag |= FLAG_ORIG_SACK_ACKED;
> > +                       else
> > +                               newdata_acked += acked_pcount;
> >                 }
> >
> >                 if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (tcp_is_reno(tp)) {
> > +                       tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> > +                                                     pkts_acked :
> > +                                                     rexmit_acked + newdata_acked);
> >                         tcp_remove_reno_sacks(sk, pkts_acked);
> >                 } else {
> >                         int delta;
Ilpo Järvinen April 4, 2018, 10:42 a.m. | #7
On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ycheng@google.com> wrote:
> >
> > On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> > <ilpo.jarvinen@helsinki.fi> wrote:
> > > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> > >
> > >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> > >> <ilpo.jarvinen@helsinki.fi> wrote:
> > >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> > >> >
> > >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > >> >> <ilpo.jarvinen@helsinki.fi> wrote:
> > >> >> >
> > >> >> > A miscalculation for the number of acknowledged packets occurs during
> > >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> > >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > >> >> > RTO as it may include segments that were not lost and therefore did
> > >> >> > not need retransmissions in the slow start following the RTO. Then
> > >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> > >> >> > triggering a burst.
> > >> >> >
> > >> >> > Instead, we want to pass only the number of retransmitted segments
> > >> >> > that were covered by the cumulative ACK (and potentially newly sent
> > >> >> > data segments too if the cumulative ACK covers that far).
> > >> >> >
> > >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > >> >> > ---
> > >> >>
> > >> >> My understanding is there are two problems
> > >> >>
> > >> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> > >> >> precise cumulatively acked count, not newly acked count?
> > >> >
> > >> > While I'm not entirely sure if you intented to say that my fix is broken
> > >> > or not, I thought this very difference alot while making the fix and I
> > >> > believe that this fix is needed because of the discontinuity at RTO
> > >> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> > >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> > >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> > >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> > >> >
> > >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > >> > tp->delivered, using plain cumulative acked causes congestion control
> > >> > breakage later as call to tcp_cong_control will directly use the
> > >> > difference in tp->delivered.
> > >> >
> > >> > This boils down the exact definition of tp->delivered (the one given in
> > >> > the header is not detailed enough). I guess you might have better idea
> > >> > what it exactly is since one of you has added it? There are subtle things
> > >> > in the defination that can make it entirely unsuitable for cc decisions.
> > >> > Should those segments that we (possibly) already counted into
> > >> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> > >> > for _second time_ or not? This fix avoids such double counting (the
> > >> Where is the double counting, assuming normal DUPACK behavior?
> > >>
> > >> In the non-sack case:
> > >>
> > >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> > >> incrementing tp->delivered in tcp_add_reno_sack()
> > >
> > > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > > I've tried to point out quite many times already)...
> > >
> > >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> > >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> > >> #dupacks) in tcp_remove_reno_sacks()
> > >
> > > ...and this won't happen correctly anymore after RTO (since non-SACK
> > > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > > cumulatively acked - 0 to tp->delivered on those ACKs.
> > >
> > >> therefore tp->delivered is tracking the # of packets delivered
> > >> (sacked, acked, DUPACK'd) with the most information it could have
> > >> inferred.
> > >
> > > Since you didn't answer any of my questions about tp->delivered directly,
> > > let me rephrase them to this example (non-SACK, of course):
> > >
> > > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > > Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> > > three segments. How much should tp->delivered be incremented? 2 or 3?
> > >
> > > ...I think 2 is the right answer.
> > >
> > >> From congestion control's perspective, it cares about the delivery
> > >> information (e.g. how much), not the sequences (what or how).
> > >
> > > I guess you must have missed my point. I'm talking about "how much"
> > > whole the time. It's just about when can we account for it (and when not).
> > >
> > >> I am pointing out that
> > >>
> > >> 1) your fix may fix one corner CC packet accounting issue in the
> > >> non-SACK and CA_Loss
> > >> 2) but it does not fix the other (major) CC packet accounting issue in
> > >> the SACK case
> > >
> > > You say major but it's major if and only if one is using one of the
> > > affected cc modules which were not that many and IMHO not very mainstream
> > > anyway (check my list from the previous email, not that I'd mind if they'd
> > > get fixed too). The rest of the cc modules do not seem to use the incorrect
> > > value even if some of them have the pkts_acked callback.
> > >
> > > Other CC packet accounting (regardless of SACK) is based on tp->delivered
> > > and tp->delivered is NOT similarly miscounted with SACK because the logic
> > > depend on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> > > loss).
> > >
> > >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> > >> as tcp_check_reno_reordering requires strictly cum. ack.
> > >
> > > I think that the current non-SACK reordering detection logic is not that
> > > sound after RTO (but I'm yet to prove this). ...There seems to be some
> > > failure modes which is why I even thought of disabling the whole thing
> > > for non-SACK RTO recoveries and as it now seems you do care more about
> > > non-SACK than you initial claimed, I might even have motivation to fix
> > > more those corners rather than the worst bugs only ;-). ...But I'll make
> > > the tp->delivered fix only in this patch to avoid any change this part of
> > > the code.
> > >
> > >> Therefore I prefer
> > >> a) using tp->delivered to fix (1)(2)
> > >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
> > >
> > > Somehow I get an impression that you might assume/say here that
> > [resending in plaintext]
> > That's wrong impression. Perhaps it's worth re-iterating what I agree
> > and disagree
> >
> > 1. [agree] there's accounting issue in non-SACK as you discovered
> > which causes CC misbehavior
> >
> > 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
> >     => that field is not used by common C.C. (you said so too)
> >
> > 3. [disagree] adjusting pkts_acked may not affect reordering
> > accounting in non-sack
> >
> >
> >
> > For cubic or reno, the main source is the "acked_sacked" passed into
> > tcp_cong_avoid(). that variable is derived from tp->delivered.
> > Therefore we need to fix that to address the problem in (1)
> >
> > I have yet to read your code. Will read later today.
> Your patch looks good. Some questions / comments:

Just to be sure that we understand each other the correct way around this 
time, I guess it resolved both of your "disagree" points above?

> 1. Why only apply to CA_Loss and not also CA_Recovery? this may
> mitigate GRO issue I noted in other thread.

Hmm, that's indeed a good idea. I'll give it some more thought but my 
initial impression is that it should work.

I initially thought that the GRO issues just had to do with missing xmit 
opportunities and was confused by the use of "mitigate" here but then 
realized you meant also (or perhaps mainly?) that GRO causes similar 
bursts (once the too small sacked_out runs out).

> 2. minor style nit: can we adjust tp->delivered directly in
> tcp_clean_rtx_queue(). I am personally against calling "non-sack" reno
> (e.g. tcp_*reno*()) because its really confusing w/ Reno congestion
> control when SACK is used. One day I'd like to rename all these *reno*
> to _nonsack_.

That's what I actually did first but put ended up putting it into own 
function because of line lengths (not a particularly good reason).

...So yes, I can put it into the tcp_clean_rtx_queue in the next version.

I'll also try to keep that "reno" thing in my mind.
Neal Cardwell April 4, 2018, 5:17 p.m. | #8
On Wed, Apr 4, 2018 at 6:42 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
wrote:

> On Wed, 28 Mar 2018, Yuchung Cheng wrote:
...
> > Your patch looks good. Some questions / comments:

The patch looks good to me as well (I read the Mar 28 version). Thanks for
this fix, Ilpo.

> Just to be sure that we understand each other the correct way around this
> time, I guess it resolved both of your "disagree" points above?

> > 1. Why only apply to CA_Loss and not also CA_Recovery? this may
> > mitigate GRO issue I noted in other thread.

> Hmm, that's indeed a good idea. I'll give it some more thought but my
> initial impression is that it should work.

Great. I would agree with Yuchung's suggestion to apply this fix to
CA_Recovery as well.

> > 2. minor style nit: can we adjust tp->delivered directly in
> > tcp_clean_rtx_queue(). I am personally against calling "non-sack" reno
> > (e.g. tcp_*reno*()) because its really confusing w/ Reno congestion
> > control when SACK is used. One day I'd like to rename all these *reno*
> > to _nonsack_.

> That's what I actually did first but put ended up putting it into own
> function because of line lengths (not a particularly good reason).

> ...So yes, I can put it into the tcp_clean_rtx_queue in the next version.

> I'll also try to keep that "reno" thing in my mind.

Either approach here sounds fine to me. Though personally I find it more
readable to keep all the non-SACK helpers together and consistently named,
including this one (even if the name is confusing, at least it is
consistent... :-).

neal

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..4a26c09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@  static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3058,10 @@  static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3074,8 @@  static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3157,14 @@  static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			/* Due to discontinuity on RTO in the artificial
+			 * sacked_out calculations, TCP must restrict
+			 * pkts_acked without SACK to rexmits and new data
+			 * segments
+			 */
+			if (icsk->icsk_ca_state == TCP_CA_Loss)
+				pkts_acked = rexmit_acked + newdata_acked;
+
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;