Message ID | 20190323080542.173569-1-brakmo@fb.com |
---|---|
Series | bpf: Propagate cn to TCP |
On 03/23/2019 01:05 AM, brakmo wrote:
> This patchset adds support for propagating congestion notifications (cn)
> to TCP from cgroup inet skb egress BPF programs.
>
> Current cgroup skb BPF programs cannot trigger TCP congestion window
> reductions, even when they drop a packet. This patch-set adds support
> for cgroup skb BPF programs to send congestion notifications in the
> return value when the packets are TCP packets. Rather than the
> current 1 for keeping the packet and 0 for dropping it, they can
> now return:
>   NET_XMIT_SUCCESS (0) - continue with packet output
>   NET_XMIT_DROP (1)    - drop packet and do cn
>   NET_XMIT_CN (2)      - continue with packet output and do cn
>   -EPERM               - drop packet

I believe I already mentioned this model is broken if you have any virtual
device before the cgroup BPF program.

Please think about offloading the pacing/throttling in the NIC; there is no
way we will report back to the TCP stack instant notifications.

This patch series is going way too far for my taste.

This idea is not new. You were at Google when it was experimented with by
Nandita and others, and we know it is not worth the pain.
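For readers following along, here is a minimal sketch of what a cgroup inet
egress program could look like under the return-code convention quoted above.
The credit map, its refill policy, and the program name are invented for
illustration; only the return values follow the cover letter, and they only
have an effect with this patch set applied.

    /* Sketch only: byte accounting is illustrative, not from the patch set. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define NET_XMIT_SUCCESS 0  /* continue with packet output        */
    #define NET_XMIT_DROP    1  /* drop packet and do cn              */
    #define NET_XMIT_CN      2  /* continue with packet output, do cn */

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } credit SEC(".maps");

    SEC("cgroup_skb/egress")
    int egress_cn(struct __sk_buff *skb)
    {
        __u32 key = 0;
        __u64 *bytes_left = bpf_map_lookup_elem(&credit, &key);

        if (!bytes_left)
            return NET_XMIT_SUCCESS;

        if (*bytes_left >= skb->len) {
            /* Within the cgroup's budget: send normally. A real program
             * would refill the credit over time and use atomic updates. */
            *bytes_left -= skb->len;
            return NET_XMIT_SUCCESS;
        }

        /* Over budget: let the packet go out but ask the local TCP sender
         * to reduce its congestion window. Returning NET_XMIT_DROP instead
         * would drop the packet and still signal cn. */
        return NET_XMIT_CN;
    }

    char _license[] SEC("license") = "GPL";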
On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
> On 03/23/2019 01:05 AM, brakmo wrote:
> > This patchset adds support for propagating congestion notifications (cn)
> > to TCP from cgroup inet skb egress BPF programs.
> > [...]
>
> I believe I already mentioned this model is broken if you have any virtual
> device before the cgroup BPF program.
>
> Please think about offloading the pacing/throttling in the NIC; there is no
> way we will report back to the TCP stack instant notifications.

I don't think 'offload to google proprietary nic' is a suggestion that folks
can practically follow. Very few NICs can offload pacing to hw and there are
plenty of limitations. This patch set represents a pure sw solution that
works and scales to millions of flows.

> This patch series is going way too far for my taste.

I would really appreciate it if you could do a technical review of the
patches. Our previous approach didn't quite work due to complexity around
locked/non-locked sockets. This is a cleaner approach. Either we go with
this one or we will add a bpf hook into __tcp_transmit_skb. This approach is
better since it works for other protocols and can be used by qdiscs w/o any
bpf.

> This idea is not new. You were at Google when it was experimented with by
> Nandita and others, and we know it is not worth the pain.

google networking needs are different from the rest of the world.

Thank you.
On 3/23/19, 10:12 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

> On 03/23/2019 01:05 AM, brakmo wrote:
> > This patchset adds support for propagating congestion notifications (cn)
> > to TCP from cgroup inet skb egress BPF programs.
> > [...]
>
> I believe I already mentioned this model is broken if you have any virtual
> device before the cgroup BPF program.

Current qdiscs can return values 0 to 2, so how is this different from the
cgroup BPF program returning these values? I understand that virtual devices
before the cgroup or qdisc may not propagate these values (and I would say
the problem is then with the virtual device), but not everyone uses virtual
devices like that. For them, HBM is not appropriate if they are using Cubic
(however, it works with DCTCP or with fq's EDT).

> Please think about offloading the pacing/throttling in the NIC; there is no
> way we will report back to the TCP stack instant notifications.

Not everyone has the ability to offload to the NIC.

> This patch series is going way too far for my taste.

Too far in which way? I'm simply extending a mechanism that has been present
for a while with qdiscs to cgroup egress BPF programs. I understand if it
has no use in your environment, but we believe it has in ours.

> This idea is not new. You were at Google when it was experimented with by
> Nandita and others, and we know it is not worth the pain.

There was no eBPF at that time. We like the flexibility we get by
programming the algorithms in eBPF. These are not intrusive changes; they
simply extend the current limited return values from cgroup skb egress BPF
programs to be more in line with qdiscs.
On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>> [...]
>> Please think about offloading the pacing/throttling in the NIC; there is no
>> way we will report back to the TCP stack instant notifications.
>
> I don't think 'offload to google proprietary nic' is a suggestion that folks
> can practically follow. Very few NICs can offload pacing to hw and there are
> plenty of limitations. This patch set represents a pure sw solution that
> works and scales to millions of flows.
>
> [...]
>
> google networking needs are different from the rest of the world.

This has nothing to do with Google against Facebook really; it is a bit sad
you react like this, Alexei.

We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers
to show that this strategy is not enough.

All recent discussions about ECN (TCP Prague and SCE) do not _require_
instant feedback to the sender.

Please show us experimental results before we have to carry these huge hacks.

Thank you.
On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>> Please think about offloading the pacing/throttling in the NIC; there is no
>> way we will report back to the TCP stack instant notifications.
>
> I don't think 'offload to google proprietary nic' is a suggestion that folks
> can practically follow. Very few NICs can offload pacing to hw and there are
> plenty of limitations. This patch set represents a pure sw solution that
> works and scales to millions of flows.

Offloading is not a feature unique to a "Google nic"; this is ridiculous.

By offloading I meant: any hop that might not be controlled by the linux
stack. Think about VMs, and think about pacers that exist on various NICs,
like Mellanox ones. For them, it is trivial to implement L4S style
capabilities.

The beauty of ECN is that it relies on standard TCP feedback, not on some
immediate reaction from one intermediate hop sending another signal back to
the sender.

HTTP/3 is planned to use QUIC (over UDP); we do not want to add TCP-only
features in the IP fast path.

Let's have experimental results first, please?
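As a concrete illustration of the CE-marking approach described here, a
cgroup egress program can already mark ECT packets with bpf_skb_ecn_set_ce()
and rely on the normal ACK feedback, without any new return codes. The
backlog map and the 64 KB threshold below are invented for the example; only
the helper itself is the one referenced in this thread.

    /* Sketch only: congestion detection is illustrative. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define CE_THRESHOLD_BYTES (64 * 1024)

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } backlog SEC(".maps");

    SEC("cgroup_skb/egress")
    int mark_ce(struct __sk_buff *skb)
    {
        __u32 key = 0;
        __u64 *queued = bpf_map_lookup_elem(&backlog, &key);

        if (queued && *queued > CE_THRESHOLD_BYTES)
            /* Sets CE if the packet is ECT; a no-op for non-ECN traffic.
             * The sender learns about it from the echoed ACKs. */
            bpf_skb_ecn_set_ce(skb);

        return 1;   /* pre-series convention: 1 = allow the packet */
    }

    char _license[] SEC("license") = "GPL";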
On 03/23/2019 06:14 PM, Lawrence Brakmo wrote:
> There was no eBPF at that time. We like the flexibility we get by
> programming the algorithms in eBPF.

eBPF being there does not mean we can adopt whatever research work; we want
evaluations of the costs and benefits.

> These are not intrusive changes; they simply extend the current limited
> return values from cgroup skb egress BPF programs to be more in line with
> qdiscs.

They are quite intrusive changes to the IP layer, we can not deny this.
On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
>> [...]
>> google networking needs are different from the rest of the world.
>
> This has nothing to do with Google against Facebook really; it is a bit sad
> you react like this, Alexei.
>
> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers
> to show that this strategy is not enough.
>
> All recent discussions about ECN (TCP Prague and SCE) do not _require_
> instant feedback to the sender.
>
> Please show us experimental results before we have to carry these huge hacks.

I have a feeling we're looking at different patches. All of your comments
seem to be talking about something else; I have a hard time connecting them
to this patch set. Have you read the cover letter and patches of _this_ set?

The results are in the cover letter, and there is no change in behavior in
the networking stack. Patches 1, 2, 3 are simple refactoring of bpf-cgroup
return codes. Patch 4 is the only one that touches net/ipv4/ip_output.c, and
only to pass the return code to the tcp/udp layer. The concept works for
both, despite your claim that this is tcp-only.

The cover letter also explains why bpf_skb_ecn_set_ce is not enough.
Please realize that existing qdiscs are already doing this.
The patchset allows bpf-cgroup to do the same.

If you were arguing about the sysctl knob in patch 5, that I would
understand. Instead of a knob we can hard-code the value for now. But please
explain what issues you see with that knob.
Also, to be fair, I applied your commit f11216b24219 ("bpf: add skb->tstamp
r/w access from tc clsact and cg skb progs") without demanding 'experimental
results' because the feature makes sense. Yet, you folks didn't produce a
_single_ example or test result since then. This is not cool.
On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
>> [...]
>
> [...]
> The cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> Please realize that existing qdiscs are already doing this.
> The patchset allows bpf-cgroup to do the same.

Not the same thing, I am afraid.
> If you were arguing about the sysctl knob in patch 5, that I would
> understand. Instead of a knob we can hard-code the value for now. But
> please explain what issues you see with that knob.
>
> Also, to be fair, I applied your commit f11216b24219 ("bpf: add skb->tstamp
> r/w access from tc clsact and cg skb progs") without demanding
> 'experimental results' because the feature makes sense. Yet, you folks
> didn't produce a _single_ example or test result since then. This is not
> cool.

This patch did not change the fast path at all, Alexei.

Look at [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls.
This part changes the ip stack.
On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
>> The cover letter also explains why bpf_skb_ecn_set_ce is not enough.
>> Please realize that existing qdiscs are already doing this.
>> The patchset allows bpf-cgroup to do the same.
>
> Not the same thing, I am afraid.

To be clear, Alexei:

Existing qdiscs set CE mark on a packet, exactly like a router would do.
Simple and universal. This can be stacked, and done far away from the sender.

We do not _call_ back local TCP to propagate cn. We simply rely on the fact
that incoming ACKs will carry the needed information, and the TCP stack
already handles that case just fine.

Larry's cover letter does not really explain why we need to handle a corner
case (local drops) with such intrusive changes. TCP Small Queues should
already make sure local drops are non-existent.
On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
>> [...]
>
> To be clear, Alexei:
>
> Existing qdiscs set CE mark on a packet, exactly like a router would do.
> Simple and universal. This can be stacked, and done far away from the sender.
>
> We do not _call_ back local TCP to propagate cn.

How do you classify NET_XMIT_CN ?
It's exactly a local call back to indicate CN into tcp from layers below tcp.
A tc-bpf prog returning the 'drop' code means drop+cn, whereas a cgroup-bpf
prog returning 'drop' means drop only. This patch set is fixing this
discrepancy.

> We simply rely on the fact that incoming ACKs will carry the needed
> information, and the TCP stack already handles that case just fine.
>
> Larry's cover letter does not really explain why we need to handle a corner
> case (local drops) with such intrusive changes.

Only after so many rounds of back and forth do I think I'm starting to
understand your 'intrusive change' comment. I think you're referring to:

-	return ip_finish_output2(net, sk, skb);
+	ret = ip_finish_output2(net, sk, skb);
+	return ret ? : ret_bpf;

right? And your concern is that this change slows down the ip stack?
Please spell out your concerns in a more verbose way to avoid this guessing
game. I've looked at the assembler and indeed this change pessimizes the
tailcall optimization. What kind of benchmark do you want to see?
As an alternative we can do it under the static_key that cgroup-bpf is under.
It will be a larger number of lines changed, but the tailcall optimization
can be preserved.

> TCP Small Queues should already make sure local drops are non-existent.

tsq doesn't help. Here is the comment from patch 5:
"When a packet is dropped when calling queue_xmit in __tcp_transmit_skb and
packets_out is 0, it is beneficial to set a small probe timer. Otherwise,
the throughput for the flow can suffer because it may need to depend on the
probe timer to start sending again."

In other words, when we're clamping aggregated cgroup bandwidth with this
facility we may need to drop the only in-flight packet for this flow, and
the flow restarts after the default 200ms probe timer. In such a case it's
well under the tsq limit.

Thinking about tsq... I think it would be very interesting to add a bpf hook
there as well and have a programmable and dynamic tsq limit, but that is
orthogonal to this patch set. We will explore this idea separately.
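For context on the NET_XMIT_CN point, the small self-contained program below
shows the convention being argued about: a positive verdict returned from the
layers below TCP leads to a congestion-window reduction in the sender, while
net_xmit_eval() keeps CN from being reported as a packet loss. The constants
and net_xmit_eval() mirror the kernel definitions from memory, and
pretend_queue_xmit() is a stand-in; this is an illustration, not kernel
source.

    #include <stdio.h>

    #define NET_XMIT_SUCCESS 0x00
    #define NET_XMIT_DROP    0x01  /* skb dropped              */
    #define NET_XMIT_CN      0x02  /* congestion notification  */

    /* CN does not mean the packet was lost, so it is not an error
     * from the caller's point of view. */
    static int net_xmit_eval(int e)
    {
        return e == NET_XMIT_CN ? 0 : e;
    }

    /* Stand-in for the verdict coming back from a qdisc or, with this
     * series, from a cgroup egress BPF program. */
    static int pretend_queue_xmit(int verdict)
    {
        return verdict;
    }

    int main(void)
    {
        for (int v = NET_XMIT_SUCCESS; v <= NET_XMIT_CN; v++) {
            int err = pretend_queue_xmit(v);

            /* In the TCP sender, a positive return here is where the
             * congestion-window reduction happens before the error is
             * re-evaluated for the caller. */
            printf("verdict %d -> cwnd reduction: %s, error to caller: %d\n",
                   v, err > 0 ? "yes" : "no", net_xmit_eval(err));
        }
        return 0;
    }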
On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
>> [...]
>> We do not _call_ back local TCP to propagate cn.
>
> How do you classify NET_XMIT_CN ?
> It's exactly a local call back to indicate CN into tcp from layers below tcp.
> A tc-bpf prog returning the 'drop' code means drop+cn, whereas a cgroup-bpf
> prog returning 'drop' means drop only. This patch set is fixing this
> discrepancy.

Except this does not work universally. team or bonding or any tunnel won't
propagate this signal. (This is not something that can be fixed either,
since a qdisc can be installed there.)

This is a wrong design, and we want to get rid of it, not 'fix it'.

Back to my original concerns:

1) Larry just upstreamed bpf_skb_ecn_set_ce(). This part was fine; we did
the same in fq_codel and fq already.

2) This ability to inject local drops directly into the TCP stack is exactly
confirming why I complained earlier about policers often being wrong. Trying
to cope with them is a nightmare. I feel you need this because your ebpf
code was only able to accept or drop a packet (aka a policer).

The EDT model allows for extending skb->tstamp and not breaking the TSQ
logic: no local drops.

One of the ECN goals is to avoid drops in the first place, so adding
something to favor local drops sounds like a step in the wrong direction.
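For completeness, here is a rough sketch of the EDT (earliest departure
time) model being referred to: a clsact egress program stretches skb->tstamp
(writable since commit f11216b24219) and sch_fq paces the flow instead of
anything being dropped. The rate, the single-slot map, and the non-atomic
bookkeeping are invented for the example.

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    #define RATE_BYTES_PER_SEC 12500000ULL      /* ~100 Mbit/s, made up */
    #define NSEC_PER_SEC       1000000000ULL

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } next_departure SEC(".maps");

    SEC("tc")
    int edt_pace(struct __sk_buff *skb)
    {
        __u32 key = 0;
        __u64 now = bpf_ktime_get_ns();
        __u64 *next = bpf_map_lookup_elem(&next_departure, &key);
        __u64 delay;

        if (!next)
            return TC_ACT_OK;

        if (*next < now)
            *next = now;

        /* fq holds the skb until this timestamp, so the flow is paced
         * rather than policed: no local drops, TSQ stays intact. */
        skb->tstamp = *next;

        delay = (__u64)skb->len * NSEC_PER_SEC / RATE_BYTES_PER_SEC;
        *next += delay;    /* not atomic; acceptable only in a sketch */

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";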
On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> In other words, when we're clamping aggregated cgroup bandwidth with this
> facility we may need to drop the only in-flight packet for this flow, and
> the flow restarts after the default 200ms probe timer.

TLP should send a probe that does not depend on 200ms but more on the RTT.
On Tue, Mar 26, 2019 at 01:06:23AM -0700, Eric Dumazet wrote:
> On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
>> [...]
>> How do you classify NET_XMIT_CN ?
>> It's exactly a local call back to indicate CN into tcp from layers below tcp.
>> A tc-bpf prog returning the 'drop' code means drop+cn, whereas a cgroup-bpf
>> prog returning 'drop' means drop only. This patch set is fixing this
>> discrepancy.
>
> Except this does not work universally. team or bonding or any tunnel won't
> propagate this signal. (This is not something that can be fixed either,
> since a qdisc can be installed there.)
>
> This is a wrong design, and we want to get rid of it, not 'fix it'.

so after 20+ years the linux qdisc design is wrong?
bpf is about choice. We have to give people tools to experiment even when we
philosophically disagree on the design. We need to make sure that new
changes don't hurt existing use cases though. The performance bar has to
remain high.

> Back to my original concerns:
>
> 1) Larry just upstreamed bpf_skb_ecn_set_ce(). This part was fine; we did
> the same in fq_codel and fq already.

ecn doesn't work with all congestion controls.

> 2) This ability to inject local drops directly into the TCP stack is exactly
> confirming why I complained earlier about policers often being wrong. Trying
> to cope with them is a nightmare. I feel you need this because your ebpf
> code was only able to accept or drop a packet (aka a policer).

I've tried to explain several times that it's not about a policer. That's
why we called it NRM: Network Resource Manager. The goal is to control
networking resources in a containerized environment where the container
scope is a cgroup. We could have done some of this logic from the tc-bpf
layer, but the overhead is prohibitive there, since tc sees all packets
while we need to control only certain cgroups and work properly within the
cgroup hierarchy.

> The EDT model allows for extending skb->tstamp and not breaking the TSQ
> logic: no local drops.

we already discussed this. EDT is not applicable in all cases.

> One of the ECN goals is to avoid drops in the first place, so adding
> something to favor local drops sounds like a step in the wrong direction.

according to this statement BBR is a wrong design then, since it ignores ecn?

Anyway, to move forward... We're going to explore a few ways to reduce the
tailcall pessimization in patch 4 and will proceed with the rest as-is.
On 03/26/2019 08:07 AM, Alexei Starovoitov wrote:
> so after 20+ years the linux qdisc design is wrong?

Yes, it is how it is: return values can not be propagated back to the TCP
stack in all cases.

When a packet is queued to Qdisc 1, there is no way we can return a value
that can represent what the packet becomes when dequeued later and queued
into Qdisc 2.

Also, some qdiscs take their drop decision later (e.g. codel and fq_codel),
so ->enqueue() will return a success which might be a lie.

> bpf is about choice. We have to give people tools to experiment even when we
> philosophically disagree on the design.

Maybe, but I feel that for the moment the choice is only for FB, and the
rest of the world has to re-invent private ebpf code in order to benefit
from all of this. I doubt many TCP users will have the skills/money to
benefit from this. Meanwhile, we as a community have to maintain a TCP/IP
stack with added hooks and complexity.

It seems the TCP stack became a playground for experiments.
On Tue, Mar 26, 2019 at 08:43:11AM -0700, Eric Dumazet wrote:
> On 03/26/2019 08:07 AM, Alexei Starovoitov wrote:
>> so after 20+ years the linux qdisc design is wrong?
>
> Yes, it is how it is: return values can not be propagated back to the TCP
> stack in all cases.
>
> When a packet is queued to Qdisc 1, there is no way we can return a value
> that can represent what the packet becomes when dequeued later and queued
> into Qdisc 2.
>
> Also, some qdiscs take their drop decision later (e.g. codel and fq_codel),
> so ->enqueue() will return a success which might be a lie.

root and children qdiscs propagate the return value already. Different
unrelated qdiscs are just like different physical switches: they can
communicate only via an on-the-wire protocol. Nothing wrong with that. But
not everything can and should communicate over the wire.

> > bpf is about choice. We have to give people tools to experiment even when we
> > philosophically disagree on the design.
>
> Maybe, but I feel that for the moment the choice is only for FB, and the
> rest of the world has to re-invent private ebpf code in order to benefit
> from all of this.

Clearly a misunderstanding. Please see samples/bpf/hbm_out_kern.c
Not only is the source code of the bpf program public, the algorithm is well
documented.

> I doubt many TCP users will have the skills/money to benefit from this.

The majority of bpf features are actively used, and often not by the authors
who introduced them.

> Meanwhile, we as a community have to maintain a TCP/IP stack with added
> hooks and complexity.

Your herculean effort to keep the tcp stack in excellent shape is greatly
appreciated. No doubt about that.

> It seems the TCP stack became a playground for experiments.

Exactly. The next tcp congestion control algorithm should be implementable
in bpf. If the kernel was extensible to that degree, likely there would have
been no need to bypass it and invent quic.
On 03/26/2019 10:01 AM, Alexei Starovoitov wrote:
> The next tcp congestion control algorithm should be implementable in bpf.

Congestion control does not need per-packet decisions, and can indeed be
implemented in bpf. Some researchers have even implemented congestion
control in user space (CCP, if I remember well).

> If the kernel was extensible to that degree, likely there would have been
> no need to bypass it and invent quic.

QUIC is a complete re-design, and works around fundamental things that TCP
can not change. bpf in TCP could not have changed the rise of QUIC or any
other modern transport, really.

Let's stop this discussion, we are digressing :)