[bpf-next,0/7] bpf: Propagate cn to TCP

Message ID 20190323080542.173569-1-brakmo@fb.com

Message

Lawrence Brakmo March 23, 2019, 8:05 a.m. UTC
This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.

Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patchset adds support
for cgroup skb BPF programs to send congestion notifications in their
return value when the packets are TCP packets. Rather than the current
1 for keeping the packet and 0 for dropping it, they can now return
the following (a short illustrative sketch follows the list):
    NET_XMIT_SUCCESS    (0)    - continue with packet output
    NET_XMIT_DROP       (1)    - drop packet and do cn
    NET_XMIT_CN         (2)    - continue with packet output and do cn
    -EPERM                     - drop packet
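
For illustration only (this is not one of the patches), a minimal cgroup skb
egress program using these return codes could look like the sketch below; the
header paths follow the samples/bpf conventions, and the length check is an
arbitrary stand-in for real per-cgroup rate/credit state such as the token
bucket kept by hbm_out_kern.c in this series:

    // SPDX-License-Identifier: GPL-2.0
    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    /* Same values as the kernel's NET_XMIT_* codes listed above. */
    #define NET_XMIT_SUCCESS 0
    #define NET_XMIT_DROP    1
    #define NET_XMIT_CN      2

    SEC("cgroup_skb/egress")
    int cn_example(struct __sk_buff *skb)
    {
        if (skb->len > 1200)
            return NET_XMIT_CN;      /* send the packet, but signal cn */

        return NET_XMIT_SUCCESS;     /* send the packet normally */
    }

    char _license[] SEC("license") = "GPL";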

There is also support for setting the probe timer to a small value,
specified by a sysctl, when a packet is dropped by the call to
queue_xmit in __tcp_transmit_skb and there are no other packets in
flight.
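
As a rough illustration of where this check could live, here is a hedged
sketch (the sysctl name and exact placement are assumptions for the example,
not the actual patch):

    /* In __tcp_transmit_skb(), after the packet has been handed down: */
    err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);

    /* Sketch: if the lower layers reported drop/cn and nothing else is in
     * flight, arm a short probe0 timer instead of waiting for the default
     * probe timer to fire.
     */
    if (unlikely(err > 0) && !tcp_sk(sk)->packets_out &&
        sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms) {
            unsigned long timeout = msecs_to_jiffies(
                    sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms);

            inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0, timeout,
                                      TCP_RTO_MAX);
    }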

In addition, HBM programs are modified to collect and return more
statistics.

The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.
Smaller probe timers improve the performance of Cubic and DCTCP when the
rates are small enough that there are times when HBM cannot send a packet
every RTT while maintaining the bandwidth limit.

The following results were obtained for rate limits of 1Gbps and 200Mbps
between two servers using netperf with a single flow. We also show how
reducing the max delayed ACK timer can improve the performance when
using Cubic.

A following patch will add support for fq's Earliest Departure Time (EDT).

The command used was:
  ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                   -s=<server running netserver>
  where:
     <rate>   is 1000 or 200
     --no_cn  disables congestion (cwr) notifications
     dctcp    uses DCTCP instead of the default Cubic

                       Cubic                    DCTCP
Lim,Prob,DA    Mbps cwnd cred drops  Mbps cwnd cred drops
------------   ---- ---- ---- -----  ---- ---- ---- -----
  1G, 0,40       35  462 -320 67%     995    1 -212  0.05%
  1G, 0,40,cn   349    3 -229  0.15   995    1 -212  0.05
  1G, 0, 5,cn   941    2 -189  0.13   995    1 -212  0.05

200M, 0,40,cn    50    3 -152  0.34    31    3 -203  0.50
200M, 0, 5,cn    43    2 -202  0.48    33    3 -199  0.50
200M,20, 5,cn   199    2 -209  0.38   199    1 -214  0.30

Notes:
  --no_cn has no effect with DCTCP
  Lim = rate limit
  Prob = Probe timer
  DA = maximum delayed ACK timer
  cred = credit in packets
  drops = % packets dropped

brakmo (7):
  bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
  bpf: cgroup inet skb programs can return 0 to 3
  bpf: Update __cgroup_bpf_run_filter_skb with cn
  bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
  bpf: sysctl for probe_on_drop
  bpf: Add cn support to hbm_out_kern.c
  bpf: Add more stats to HBM

 include/linux/bpf.h        | 50 +++++++++++++++++++++++++++++
 include/linux/filter.h     |  3 +-
 include/net/netns/ipv4.h   |  1 +
 kernel/bpf/cgroup.c        | 25 ++++++++++++---
 kernel/bpf/syscall.c       | 12 +++++++
 kernel/bpf/verifier.c      | 16 +++++++--
 net/ipv4/ip_output.c       | 39 ++++++++++++----------
 net/ipv4/sysctl_net_ipv4.c | 10 ++++++
 net/ipv4/tcp_ipv4.c        |  1 +
 net/ipv4/tcp_output.c      | 18 +++++++++--
 net/ipv6/ip6_output.c      | 22 +++++++------
 samples/bpf/do_hbm_test.sh | 10 ++++--
 samples/bpf/hbm.c          | 51 +++++++++++++++++++++++++++--
 samples/bpf/hbm.h          |  9 +++++-
 samples/bpf/hbm_kern.h     | 66 ++++++++++++++++++++++++++++++++++++--
 samples/bpf/hbm_out_kern.c | 48 +++++++++++++++++++--------
 16 files changed, 321 insertions(+), 60 deletions(-)

Comments

Eric Dumazet March 23, 2019, 9:12 a.m. UTC | #1
On 03/23/2019 01:05 AM, brakmo wrote:
> This patchset adds support for propagating congestion notifications (cn)
> to TCP from cgroup inet skb egress BPF programs.
> 
> Current cgroup skb BPF programs cannot trigger TCP congestion window
> reductions, even when they drop a packet. This patch-set adds support
> for cgroup skb BPF programs to send congestion notifications in the
> return value when the packets are TCP packets. Rather than the
> current 1 for keeping the packet and 0 for dropping it, they can
> now return:
>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>     NET_XMIT_DROP       (1)    - drop packet and do cn
>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>     -EPERM                     - drop packet
>

I believe I already mentioned this model is broken, if you have any virtual
device before the cgroup BPF program.

Please think about offloading the pacing/throttling in the NIC,
there is no way we will report back to tcp stack instant notifications.

This patch series is going way too far for my taste.

This idea is not new, you were at Google when it was experimented by Nandita and
others, and we know it is not worth the pain.
Alexei Starovoitov March 23, 2019, 3:41 p.m. UTC | #2
On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/23/2019 01:05 AM, brakmo wrote:
> > This patchset adds support for propagating congestion notifications (cn)
> > to TCP from cgroup inet skb egress BPF programs.
> > 
> > Current cgroup skb BPF programs cannot trigger TCP congestion window
> > reductions, even when they drop a packet. This patch-set adds support
> > for cgroup skb BPF programs to send congestion notifications in the
> > return value when the packets are TCP packets. Rather than the
> > current 1 for keeping the packet and 0 for dropping it, they can
> > now return:
> >     NET_XMIT_SUCCESS    (0)    - continue with packet output
> >     NET_XMIT_DROP       (1)    - drop packet and do cn
> >     NET_XMIT_CN         (2)    - continue with packet output and do cn
> >     -EPERM                     - drop packet
> >
> 
> I believe I already mentioned this model is broken, if you have any virtual
> device before the cgroup BPF program.
> 
> Please think about offloading the pacing/throttling in the NIC,
> there is no way we will report back to tcp stack instant notifications.

I don't think 'offload to google proprietary nic' is a suggestion
that folks can practically follow.
Very few NICs can offload pacing to hw and there are plenty of limitations.
This patch set represents a pure sw solution that works and scales to millions of flows.

> This patch series is going way too far for my taste.

I would really appreciate if you can do a technical review of the patches.
Our previous approach didn't quite work due to complexity around locked/non-locked socket.
This is a cleaner approach.
Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
This approach is better since it works for other protocols and can be
used by qdiscs w/o any bpf.

> This idea is not new, you were at Google when it was experimented by Nandita and
> others, and we know it is not worth the pain.

google networking needs are different from the rest of the world.

Thank you.
Lawrence Brakmo March 24, 2019, 1:14 a.m. UTC | #3
On 3/23/19, 10:12 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

    
    
    On 03/23/2019 01:05 AM, brakmo wrote:
    > This patchset adds support for propagating congestion notifications (cn)
    > to TCP from cgroup inet skb egress BPF programs.
    > 
    > Current cgroup skb BPF programs cannot trigger TCP congestion window
    > reductions, even when they drop a packet. This patch-set adds support
    > for cgroup skb BPF programs to send congestion notifications in the
    > return value when the packets are TCP packets. Rather than the
    > current 1 for keeping the packet and 0 for dropping it, they can
    > now return:
    >     NET_XMIT_SUCCESS    (0)    - continue with packet output
    >     NET_XMIT_DROP       (1)    - drop packet and do cn
    >     NET_XMIT_CN         (2)    - continue with packet output and do cn
    >     -EPERM                     - drop packet
    >
    
    I believe I already mentioned this model is broken, if you have any virtual
    device before the cgroup BPF program.

Current qdiscs can return values 0 to 2, so how is this different from the cgroup
BPF program returning these values? I understand that virtual devices before
the cgroup or qdisc may not propagate these values (and I would say the
problem is then with the virtual device), but not everyone uses virtual
devices like that. For those setups, HBM is not appropriate if they are using
Cubic (however, it works with DCTCP or with fq's EDT).
    
    Please think about offloading the pacing/throttling in the NIC,
    there is no way we will report back to tcp stack instant notifications.
    
Not everyone has the ability to offload to the NIC.

    This patch series is going way too far for my taste.
    
Too far which way? I'm simply extending to cgroup egress BPF programs a
mechanism that has been present in qdiscs for a while. I understand if
it has no use in your environment, but we believe it has in ours.

    This idea is not new, you were at Google when it was experimented by Nandita and
    others, and we know it is not worth the pain.
    
There was no eBPF at that time. We like the flexibility we get by programming
the algorithms in eBPF.

These are not intrusive changes; they simply extend the current limited return
values from cgroup skb egress BPF programs to be more in line with qdiscs.
Eric Dumazet March 24, 2019, 5:36 a.m. UTC | #4
On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/23/2019 01:05 AM, brakmo wrote:
>>> This patchset adds support for propagating congestion notifications (cn)
>>> to TCP from cgroup inet skb egress BPF programs.
>>>
>>> Current cgroup skb BPF programs cannot trigger TCP congestion window
>>> reductions, even when they drop a packet. This patch-set adds support
>>> for cgroup skb BPF programs to send congestion notifications in the
>>> return value when the packets are TCP packets. Rather than the
>>> current 1 for keeping the packet and 0 for dropping it, they can
>>> now return:
>>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>>>     NET_XMIT_DROP       (1)    - drop packet and do cn
>>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>>>     -EPERM                     - drop packet
>>>
>>
>> I believe I already mentioned this model is broken, if you have any virtual
>> device before the cgroup BPF program.
>>
>> Please think about offloading the pacing/throttling in the NIC,
>> there is no way we will report back to tcp stack instant notifications.
> 
> I don't think 'offload to google proprietary nic' is a suggestion
> that folks can practically follow.
> Very few NICs can offload pacing to hw and there are plenty of limitations.
> This patch set represents a pure sw solution that works and scales to millions of flows.
> 
>> This patch series is going way too far for my taste.
> 
> I would really appreciate if you can do a technical review of the patches.
> Our previous approach didn't quite work due to complexity around locked/non-locked socket.
> This is a cleaner approach.
> Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
> This approach is better since it works for other protocols and can be
> used by qdiscs w/o any bpf.
> 
>> This idea is not new, you were at Google when it was experimented by Nandita and
>> others, and we know it is not worth the pain.
> 
> google networking needs are different from the rest of the world.
> 

This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.

We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
is not enough.

All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.

Please show us experimental results before we have to carry these huge hacks.

Thank you.
Eric Dumazet March 24, 2019, 5:48 a.m. UTC | #5
On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:

>> Please think about offloading the pacing/throttling in the NIC,
>> there is no way we will report back to tcp stack instant notifications.
> 
> I don't think 'offload to google proprietary nic' is a suggestion
> that folks can practically follow.
> Very few NICs can offload pacing to hw and there are plenty of limitations.
> This patch set represents a pure sw solution that works and scales to millions of flows.

Offloading is not a feature unique to a "Google nic"; this is ridiculous.

By offloading I meant : Any hop that might not be controlled by linux stack.

Think about VMs, and think about pacers that can be found on various existing NICs,
like Mellanox ones. For them, it is trivial to implement L4S-style capabilities.

The beauty of ECN is that it relies on standard TCP feedback, not on some immediate
reaction from one intermediate hop sending another signal back to the sender.

HTTP/3 is planned to use QUIC (over UDP), and we do not want to add TCP-only features
to the IP fast path.

Lets have experimental results first, please ?
Eric Dumazet March 24, 2019, 5:58 a.m. UTC | #6
On 03/23/2019 06:14 PM, Lawrence Brakmo wrote:

> There was no eBPF at that time. We like the flexibility we get by programing
> the algorithms in eBPF.

eBPF being there does not mean we can adopt whatever research work comes along;
we want evaluations of the costs and benefits.

> 
> These are not intrusive changes, they simply extend the current limited return
> values form cgroup skb egress BPF programs to be more in line with qdiscs.

They are quite intrusive changes to the IP layer; we cannot deny this.
Alexei Starovoitov March 24, 2019, 4:19 p.m. UTC | #7
On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> > On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 03/23/2019 01:05 AM, brakmo wrote:
> >>> This patchset adds support for propagating congestion notifications (cn)
> >>> to TCP from cgroup inet skb egress BPF programs.
> >>>
> >>> Current cgroup skb BPF programs cannot trigger TCP congestion window
> >>> reductions, even when they drop a packet. This patch-set adds support
> >>> for cgroup skb BPF programs to send congestion notifications in the
> >>> return value when the packets are TCP packets. Rather than the
> >>> current 1 for keeping the packet and 0 for dropping it, they can
> >>> now return:
> >>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
> >>>     NET_XMIT_DROP       (1)    - drop packet and do cn
> >>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
> >>>     -EPERM                     - drop packet
> >>>
> >>
> >> I believe I already mentioned this model is broken, if you have any virtual
> >> device before the cgroup BPF program.
> >>
> >> Please think about offloading the pacing/throttling in the NIC,
> >> there is no way we will report back to tcp stack instant notifications.
> > 
> > I don't think 'offload to google proprietary nic' is a suggestion
> > that folks can practically follow.
> > Very few NICs can offload pacing to hw and there are plenty of limitations.
> > This patch set represents a pure sw solution that works and scales to millions of flows.
> > 
> >> This patch series is going way too far for my taste.
> > 
> > I would really appreciate if you can do a technical review of the patches.
> > Our previous approach didn't quite work due to complexity around locked/non-locked socket.
> > This is a cleaner approach.
> > Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
> > This approach is better since it works for other protocols and can be
> > used by qdiscs w/o any bpf.
> > 
> >> This idea is not new, you were at Google when it was experimented by Nandita and
> >> others, and we know it is not worth the pain.
> > 
> > google networking needs are different from the rest of the world.
> > 
> 
> This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.
> 
> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
> is not enough.
> 
> All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.
> 
> Please show us experimental results before we have to carry these huge hacks.

I have a feeling we're looking at different patches.
All of your comments seem to be talking about something else.
I have a hard time connecting them to this patch set.
Have you read the cover letter and patches of _this_ set?

The results are in the cover letter and there is no change in behavior in the networking stack.
Patches 1, 2, 3 are simple refactoring of bpf-cgroup return codes.
Patch 4 is the only one that touches net/ipv4/ip_output.c, and only to pass
the return code to the tcp/udp layer.
The concept works for both, despite your claim that this is tcp-only.

The cover letter also explains why bpf_skb_ecn_set_ce is not enough.
Please realize that existing qdiscs are already doing this.
The patchset allows bpf-cgroup to do the same.

If you were arguing about the sysctl knob in patch 5, that I would understand.
Instead of a knob we can hard code the value for now. But please explain
what issues you see with that knob.

Also to be fair I applied your
commit f11216b24219 ("bpf: add skb->tstamp r/w access from tc clsact and cg skb progs")
without demanding 'experimental results' because the feature makes sense.
Yet, you folks didn't produce a _single_ example or test result since then.
This is not cool.
Eric Dumazet March 25, 2019, 8:33 a.m. UTC | #8
On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
>>> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/23/2019 01:05 AM, brakmo wrote:
>>>>> This patchset adds support for propagating congestion notifications (cn)
>>>>> to TCP from cgroup inet skb egress BPF programs.
>>>>>
>>>>> Current cgroup skb BPF programs cannot trigger TCP congestion window
>>>>> reductions, even when they drop a packet. This patch-set adds support
>>>>> for cgroup skb BPF programs to send congestion notifications in the
>>>>> return value when the packets are TCP packets. Rather than the
>>>>> current 1 for keeping the packet and 0 for dropping it, they can
>>>>> now return:
>>>>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>>>>>     NET_XMIT_DROP       (1)    - drop packet and do cn
>>>>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>>>>>     -EPERM                     - drop packet
>>>>>
>>>>
>>>> I believe I already mentioned this model is broken, if you have any virtual
>>>> device before the cgroup BPF program.
>>>>
>>>> Please think about offloading the pacing/throttling in the NIC,
>>>> there is no way we will report back to tcp stack instant notifications.
>>>
>>> I don't think 'offload to google proprietary nic' is a suggestion
>>> that folks can practically follow.
>>> Very few NICs can offload pacing to hw and there are plenty of limitations.
>>> This patch set represents a pure sw solution that works and scales to millions of flows.
>>>
>>>> This patch series is going way too far for my taste.
>>>
>>> I would really appreciate if you can do a technical review of the patches.
>>> Our previous approach didn't quite work due to complexity around locked/non-locked socket.
>>> This is a cleaner approach.
>>> Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
>>> This approach is better since it works for other protocols and can be
>>> used by qdiscs w/o any bpf.
>>>
>>>> This idea is not new, you were at Google when it was experimented by Nandita and
>>>> others, and we know it is not worth the pain.
>>>
>>> google networking needs are different from the rest of the world.
>>>
>>
>> This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.
>>
>> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
>> is not enough.
>>
>> All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.
>>
>> Please show us experimental results before we have to carry these huge hacks.
> 
> I have a feeling we're looking at different patches.
> All of your comments seems to be talking about something else.
> I have a hard time connecting them to this patch set.
> Have you read the cover letter and patches of _this_ set?
> 
> The results are in the cover letter and there is no change in behavior in networking stack.
> Patches 1, 2, 3 are simple refactoring of bpf-cgroup return codes.
> Patch 4 is the only one that touches net/ipv4/ip_output.c only to pass
> the return code to tcp/udp layer.
> The concept works for both despite of what you're claiming being tcp only.
> 
> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> Please realize that existing qdiscs already doing this.
> The patchset allows bpf-cgroup to do the same.

Not the same thing I am afraid.

> 
> If you were arguing about sysctl knob in patch 5 that I would understand.
> Instead of a knob we can hard code the value for now. But please explain
> what issues you see with that knob.
> 
> Also to be fair I applied your
> commit f11216b24219 ("bpf: add skb->tstamp r/w access from tc clsact and cg skb progs")
> without demanding 'experimental results' because the feature makes sense.
> Yet, you folks didn't produce a _single_ example or test result since then.
> This is not cool.

That patch did not change the fast path at all, Alexei.

Look at [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls

This part changes ip stacks.
Eric Dumazet March 25, 2019, 8:48 a.m. UTC | #9
On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> 
> 
> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:

>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
>> Please realize that existing qdiscs already doing this.
>> The patchset allows bpf-cgroup to do the same.
> 
> Not the same thing I am afraid.

To be clear Alexei :

Existing qdiscs set the CE mark on a packet, exactly like a router would do.
Simple and universal.
This can be stacked, and done far away from the sender.

We do not _call_ back local TCP to propagate cn.

We simply rely on the fact that incoming ACK will carry the needed information,
and TCP stack already handles the case just fine.

Larry's cover letter does not really explain why we need to handle a corner case
(local drops) with such intrusive changes.

TCP Small Queues should already make sure local drops are nonexistent.
Alexei Starovoitov March 26, 2019, 4:27 a.m. UTC | #10
On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> > 
> > 
> > On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> 
> >> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >> Please realize that existing qdiscs already doing this.
> >> The patchset allows bpf-cgroup to do the same.
> > 
> > Not the same thing I am afraid.
> 
> To be clear Alexei :
> 
> Existing qdisc set CE mark on a packet, exactly like a router would do.
> Simple and universal.
> This can be stacked, and done far away from the sender.
> 
> We do not _call_ back local TCP to propagate cn.

How do you classify NET_XMIT_CN ?
It's exactly a local callback to indicate CN to tcp from the layers below tcp.
A tc-bpf prog returning the 'drop' code means drop+cn,
whereas a cgroup-bpf prog returning 'drop' means drop only.
This patch set fixes that discrepancy.

> We simply rely on the fact that incoming ACK will carry the needed information,
> and TCP stack already handles the case just fine.
> 
> Larry cover letter does not really explain why we need to handle a corner case
> (local drops) with such intrusive changes.

Only after so many rounds of back and forth do I think I'm starting
to understand your 'intrusive change' comment.
I think you're referring to:
-       return ip_finish_output2(net, sk, skb);
+       ret = ip_finish_output2(net, sk, skb);
+       return ret ? : ret_bpf;

right?
And is your concern that this change slows down the ip stack?
Please spell out your concerns in a more verbose way to avoid this guessing game.

I've looked at the assembler and indeed this change pessimizes the tail call
optimization. What kind of benchmark do you want to see?
As an alternative we can do it under the static_key that cgroup-bpf is already under.
It will be a larger number of lines changed, but the tail call
optimization can be preserved.
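
For concreteness, a hedged sketch of that alternative in ip_finish_output()
could look like this (cgroup_bpf_enabled is the existing static-key test;
ret_bpf is assumed to hold the verdict saved from the egress hook, as in the
diff above):

    /* Sketch only: keep the common-case tail call to ip_finish_output2()
     * and pay for the extra return-code plumbing only when cgroup-bpf is
     * actually in use.
     */
    if (cgroup_bpf_enabled) {
            int ret = ip_finish_output2(net, sk, skb);

            return ret ? : ret_bpf;
    }

    return ip_finish_output2(net, sk, skb);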

> TCP Small Queues already should make sure local drops are non existent.

tsq doesn't help.
Here is the comment from patch 5:
"When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
and packets_out is 0, it is beneficial to set a small probe timer.
Otherwise, the throughput for the flow can suffer because it may need to
depend on the probe timer to start sending again. "

In other words, when we're clamping aggregated cgroup bandwidth with this facility
we may need to drop the only in-flight packet for a flow,
and the flow then restarts only after the default 200ms probe timer.
In such a case it's well under the tsq limit.

Thinking about tsq... I think it would be very interesting to add a bpf hook
there as well and have a programmable, dynamic tsq limit, but that is
orthogonal to this patch set. We will explore that idea separately.
Eric Dumazet March 26, 2019, 8:06 a.m. UTC | #11
On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
>>>
>>>
>>> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
>>
>>>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
>>>> Please realize that existing qdiscs already doing this.
>>>> The patchset allows bpf-cgroup to do the same.
>>>
>>> Not the same thing I am afraid.
>>
>> To be clear Alexei :
>>
>> Existing qdisc set CE mark on a packet, exactly like a router would do.
>> Simple and universal.
>> This can be stacked, and done far away from the sender.
>>
>> We do not _call_ back local TCP to propagate cn.
> 
> How do you classify NET_XMIT_CN ?


> It's exactly local call back to indicate CN into tcp from layers below tcp.
> tc-bpf prog returning 'drop' code means drop+cn
> whereas cgroup-bpf prog returning 'drop' means drop only.
> This patch set is fixing this discrepancy.


Except this does not work universally.
team or bonding or any tunnel won't propagate this signal.
(This is not something that can be fixed either, since a qdisc can be installed there.)

This is a wrong design, and we want to get rid of it, not 'fix' it.

Back to my original concerns :

1) Larry just upstreamed the bpf_skb_ecn_set_ce().
   This part was fine, we did the same in fq_codel and fq already.

2)
 This ability to inject local drops directly into the TCP stack is exactly confirming
 why I complained earlier that policers are often wrong. Trying to cope with them
 is a nightmare.

I feel you need this because your ebpf code was only able to accept or drop a packet (aka a policer).

The EDT model allows for extending skb->tstamp without breaking the TSQ logic: no local drops (a short sketch follows at the end of this message).

One of ECN's goals is to avoid drops in the first place, so adding something that favors
local drops sounds like a step in the wrong direction.
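
For illustration, a minimal sketch of the EDT style described above (assuming
skb->tstamp write access from cg skb programs per commit f11216b24219, and
using a fixed per-packet gap as a stand-in for a real rate computation):

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    SEC("cgroup_skb/egress")
    int edt_pace_example(struct __sk_buff *skb)
    {
        __u64 now = bpf_ktime_get_ns();

        /* Never set a departure time in the past. */
        if (skb->tstamp < now)
            skb->tstamp = now;

        /* Push the earliest departure time out by 100us; a downstream fq
         * qdisc paces the packet instead of the program dropping it.
         */
        skb->tstamp += 100000;

        return 1;    /* classic allow */
    }

    char _license[] SEC("license") = "GPL";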
Eric Dumazet March 26, 2019, 8:13 a.m. UTC | #12
On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:

> In other words when we're clamping aggregated cgroup bandwidth with this facility
> we may need to drop the only in flight packet for this flow
> and the flow restarts after default 200ms probe timer.

TLP should send a probe whose timing does not depend on 200ms but rather on the RTT.
Alexei Starovoitov March 26, 2019, 3:07 p.m. UTC | #13
On Tue, Mar 26, 2019 at 01:06:23AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> >>>
> >>>
> >>> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> >>
> >>>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >>>> Please realize that existing qdiscs already doing this.
> >>>> The patchset allows bpf-cgroup to do the same.
> >>>
> >>> Not the same thing I am afraid.
> >>
> >> To be clear Alexei :
> >>
> >> Existing qdisc set CE mark on a packet, exactly like a router would do.
> >> Simple and universal.
> >> This can be stacked, and done far away from the sender.
> >>
> >> We do not _call_ back local TCP to propagate cn.
> > 
> > How do you classify NET_XMIT_CN ?
> 
> 
> > It's exactly local call back to indicate CN into tcp from layers below tcp.
> > tc-bpf prog returning 'drop' code means drop+cn
> > whereas cgroup-bpf prog returning 'drop' means drop only.
> > This patch set is fixing this discrepancy.
> 
> 
> Except this does not work universally.
> team or bonding or any tunnel wont propagate this signal.
> (This is not something that can be fixed either, since qdisc can be installed there)
> 
> This is a wrong design, and we want to get rid of it, and not 'fix it'.

So after 20+ years the linux qdisc design is wrong?
bpf is about choice. We have to give people tools to experiment even
when we philosophically disagree on the design.
We need to make sure that new changes don't hurt existing use cases though.
The performance bar has to remain high.

> Back to my original concerns :
> 
> 1) Larry just upstreamed the bpf_skb_ecn_set_ce().
>    This part was fine, we did the same in fq_codel an fq already.

ecn doesn't work with all congestion controls.

> 2)
>  This ability to inject directly to TCP stack local drops is exactly confirming
>  why I complained earlier about why policers are often wrong. Trying to cope with them
>  is a nightmare.
> 
> I feel you need this because your ebpf code was only able to accept or drop a packet (aka a policer)

I've tried to explain several times that it's not about a policer.
That's why we called it NRM: Network Resource Manager.
The goal is to control networking resources in a containerized environment
where the container scope is a cgroup.
We could have done some of this logic from the tc-bpf layer, but the overhead is prohibitive there,
since tc sees all packets while we need to control only certain cgroups and
work properly within the cgroup hierarchy.

> EDT model allows for extending the skb->tstamp and not breaking TSQ logic : no local drops.

We already discussed this. EDT is not applicable in all cases.

> One of the ECN goal is to avoid drops in the first place, so adding something to favor
> local drops sounds a step in the wrong direction.

According to this statement, BBR is then a wrong design, since it ignores ecn?

Anyway, to move forward...
We're going to explore a few ways to reduce the tail call pessimization in patch 4 and
will proceed with the rest as-is.
Eric Dumazet March 26, 2019, 3:43 p.m. UTC | #14
On 03/26/2019 08:07 AM, Alexei Starovoitov wrote:

> so after 20+ years linux qdisc design is wrong?

Yes, it is how it is: return values cannot be propagated back to the TCP stack in all cases.

When a packet is queued to Qdisc 1, there is no way we can return
a value that can represent what the packet becomes when dequeued later and queued into Qdisc 2.

Also, some qdiscs take their drop decision later (e.g. codel and fq_codel), so ->enqueue() will
return a success which might be a lie.


> bpf is about choice. We have to give people tools to experiment even
> when we philosophically disagree on the design.

Maybe, but I feel that for the moment the choice is only for FB, and the rest
of the world has to re-invent private ebpf code in order to benefit from all of this.

I doubt many TCP users will have the skills/money to benefit from this.

Meanwhile, we as a community have to maintain a TCP/IP stack with added hooks and complexity.

It seems TCP stack became a playground for experiments.
Alexei Starovoitov March 26, 2019, 5:01 p.m. UTC | #15
On Tue, Mar 26, 2019 at 08:43:11AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/26/2019 08:07 AM, Alexei Starovoitov wrote:
> 
> > so after 20+ years linux qdisc design is wrong?
> 
> Yes it is how it is, return values can not be propagated back to the TCP stack in all cases.
> 
> When a packet is queued to Qdisc 1, there is no way we can return
> a value that can represent what the packet becomes when dequeued later and queued into Qdisc 2.
> 
> Also some qdisc take their drop decision later (eg : codel and fq_codel), so ->enqueue() will
> return a success which might be a lie.

Root and children qdiscs already propagate the return value.
Different, unrelated qdiscs are just like different physical switches:
they can communicate only via an on-the-wire protocol.
Nothing wrong with that.
But not everything can or should communicate over the wire.

> > bpf is about choice. We have to give people tools to experiment even
> > when we philosophically disagree on the design.
> 
> Maybe, but I feel that for the moment, the choice is only for FB, and rest
> of the world has to re-invent private ebpf code in order to benefit from all of this.

Clearly a misunderstanding. Please see samples/bpf/hbm_out_kern.c.
The source code of the bpf program is not only public, but the algorithm is well documented.

> I doubt many TCP users will have the skills/money to benefit from this.

The majority of bpf features are actively used, and often not by the authors
who introduced them.

> Meanwhile, we as a community have to maintain a TCP/IP stack with added hooks and complexity.

Your herculean effort to keep the tcp stack in excellent shape
is greatly appreciated. No doubt about that.

> It seems TCP stack became a playground for experiments.

Exactly.
The next tcp congestion control algorithm should be implementable in bpf.
If the kernel were extensible to that degree, there likely would have been
no need to bypass it and invent quic.
Eric Dumazet March 26, 2019, 6:07 p.m. UTC | #16
On 03/26/2019 10:01 AM, Alexei Starovoitov wrote:

> The next tcp congestion control algorithm should be implementable in bpf.

Congestion control does not need per-packet decisions, and can indeed be
implemented in bpf.

Some researchers have even implemented them in user space (CCP, if I remember well).

> If kernel was extensible to that degree likely there would have been
> no need to bypass it and invent quic.

QUIC is a complete redesign, and works around fundamental things
that TCP cannot change.

bpf in TCP could not have changed the rise of QUIC or any other
modern transport, really.

Lets stop this discussion, we are digressing :)