Message ID | 20200227032013.12385-1-dsahern@kernel.org |
---|---|
Headers | show |
Series | Add support for XDP in egress path | expand |
David Ahern <dsahern@kernel.org> writes: > From: David Ahern <dsahern@gmail.com> > > This series adds support for XDP in the egress path by introducing > a new XDP attachment type, BPF_XDP_EGRESS, and adding an if_link API > for attaching the program to a netdevice and reporting the program. > The intent is to emulate the current RX path for XDP as much as > possible to maintain consistency and symmetry in the 2 paths with > their APIs and when the programs are run: at first touch in the Rx > path and last touch in the Tx path. > > The intent is to be able to run bpf programs on all packets regardless > of how they got to the xmit function of the netdevice - as an skb or a > redirected xdp frame. This is a missing primitive for XDP allowing > solutions to build small, targeted programs properly distributed in the > networking path allowing for example an egress firewall / ACL / traffic > verification or packet manipulation and encapping an entire ethernet > frame whether it is locally generated traffic, forwarded via the slow > path (ie., full stack processing) or xdp redirected frames. I'm totally on board with these goals! As for this: > Attempting to tag the EGRESS path as yet another mode is inconsistent > on a number of levels - from the current usage of XDP_FLAGS to options > passed to the verifier for restricting xdp_md accesses. Using the API > as listed above maintains consistency with all existing code. You *are* effectively tagging the EGRESS path as another mode: You are restricting which fields of the context object the program can access, and you're restricting where the program can be attached. I am pretty sure we will end up accumulating more differences, either in more metadata that is only available in one mode (we've already discussed exposing TX qlen on egress programs), or even helpers that only make sense in one mode. So it doesn't make sense to discuss whether egress programs are a distinct type from ingress programs: They clearly are. What we are discussing is how to encode this type difference. You are proposing to encode it using expected_attach_type as a subtype identifier instead of using a new type number. There is already precedence for this with the tracing programs, and I do think it makes sense - ingress and egress XDP programs are clearly related, just as (e.g.) fentry/fexit/freplace programs are. However, my issue with this encoding is that it is write-only: You can't inspect a BPF program already loaded into the kernel and tell which type it is. So my proposal would be to make it explicit: Expose the expected_attach_type as a new field in bpf_prog_info so userspace can query it, and clearly document it as, essentially, a program subtype that can significantly affect how a program is treated by the kernel. -Toke
On Thu, Feb 27, 2020 at 3:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > However, my issue with this encoding is that it is write-only: You can't > inspect a BPF program already loaded into the kernel and tell which type > it is. So my proposal would be to make it explicit: Expose the > expected_attach_type as a new field in bpf_prog_info so userspace can > query it, and clearly document it as, essentially, a program subtype > that can significantly affect how a program is treated by the kernel. You had the same request for "freplace" target prog. My answer to both is still the same: Please take a look at drgn and the script that Andrey posted. All this information is trivial to extract from the kernel without introducing new uapi.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Thu, Feb 27, 2020 at 3:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> However, my issue with this encoding is that it is write-only: You can't >> inspect a BPF program already loaded into the kernel and tell which type >> it is. So my proposal would be to make it explicit: Expose the >> expected_attach_type as a new field in bpf_prog_info so userspace can >> query it, and clearly document it as, essentially, a program subtype >> that can significantly affect how a program is treated by the kernel. > > You had the same request for "freplace" target prog. Yes, and for the same reason; I'm being consistent here :) > My answer to both is still the same: > Please take a look at drgn and the script that Andrey posted. > All this information is trivial to extract from the kernel > without introducing new uapi. I'm sorry, but having this kind of write-only data structure is a horrible API design; and saying "you can just parse the internal kernel data structures to see what is going on" is a cop-out. The whole point of having a stable UAPI is to make it possible for people to write applications that make use of kernel features with an expectation that those applications will keep working. XDP is a networking feature; people building networking applications shouldn't have to chase internal kernel APIs just to keep their packet processing programs working. Besides, it's already UAPI - there's a setter for it! If we introduce this new egress program type that is still going to be a de facto new program type with different semantics than the RX program, whether we try to hide the fact or not. Even if we end up completely changing the internal data structures, we're still going to have to support someone loading a program with type==XDP and attach_type == XDP_EGRESS. So why can't we allow the user to query the state of a previously loaded program as well? -Toke
On Thu, Feb 27, 2020 at 06:06:57PM +0100, Toke Høiland-Jørgensen wrote: > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > > > On Thu, Feb 27, 2020 at 3:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> However, my issue with this encoding is that it is write-only: You can't > >> inspect a BPF program already loaded into the kernel and tell which type > >> it is. So my proposal would be to make it explicit: Expose the > >> expected_attach_type as a new field in bpf_prog_info so userspace can > >> query it, and clearly document it as, essentially, a program subtype > >> that can significantly affect how a program is treated by the kernel. > > > > You had the same request for "freplace" target prog. > > Yes, and for the same reason; I'm being consistent here :) > > > My answer to both is still the same: > > Please take a look at drgn and the script that Andrey posted. > > All this information is trivial to extract from the kernel > > without introducing new uapi. > > I'm sorry, but having this kind of write-only data structure is a > horrible API design; and saying "you can just parse the internal kernel > data structures to see what is going on" is a cop-out. The whole point > of having a stable UAPI is to make it possible for people to write > applications that make use of kernel features with an expectation that > those applications will keep working. XDP is a networking feature; > people building networking applications shouldn't have to chase internal > kernel APIs just to keep their packet processing programs working. You're mistaking human needs for tooling needs. Humans want to see all sorts of internal things and they can tweak drgn script to look at them on the live kernel and examine kdump's vmcore with the same drgn script. Tooling needs uapi to work on a live system. There are 26 attach_types that are used by various cgroup based progs, sockmap, lirc, flow_dissector and tracing. Many of these attach types are used in production and not a single time _tools_ had a need to retrieve that field from the kernel. Show me the tool that needs to read expected_attach_type back and then we can discuss how to expose it in uapi. There is another side of this. struct bpf_prog->pages field currently is not exposed in uapi and I've seen tools doing cat /proc/pid/fdinfo/fd|grep "memlock:" just to retrive 'pages' field. That's the case where 'struct bpf_prog_info' should be extended.
From: David Ahern <dsahern@gmail.com> This series adds support for XDP in the egress path by introducing a new XDP attachment type, BPF_XDP_EGRESS, and adding an if_link API for attaching the program to a netdevice and reporting the program. The intent is to emulate the current RX path for XDP as much as possible to maintain consistency and symmetry in the 2 paths with their APIs and when the programs are run: at first touch in the Rx path and last touch in the Tx path. The intent is to be able to run bpf programs on all packets regardless of how they got to the xmit function of the netdevice - as an skb or a redirected xdp frame. This is a missing primitive for XDP allowing solutions to build small, targeted programs properly distributed in the networking path allowing for example an egress firewall / ACL / traffic verification or packet manipulation and encapping an entire ethernet frame whether it is locally generated traffic, forwarded via the slow path (ie., full stack processing) or xdp redirected frames. This set adds initial support for XDP in the egress path to the tun driver, mostly for XDP_DROP (e.g., ACL on the ingress path to the VM). XDP_TX on Rx means send the packet out the device it arrived on; given that, XDP_Tx for the Tx path is treated as equivalent to XDP_PASS - ie., continue on the Tx path. XDP_REDIRECT in the Tx path can be handled in a follow set; conceptually it is no different than the Rx path - the frame is handed off to another device though loops do need to be detected. The expectation is that the current mode levels (skb, driver and hardware) will also apply to the egress path. The current patch set focuses on the driver mode with tun. Attempting to tag the EGRESS path as yet another mode is inconsistent on a number of levels - from the current usage of XDP_FLAGS to options passed to the verifier for restricting xdp_md accesses. Using the API as listed above maintains consistency with all existing code. As for libbpf, we believe that an API that mirrors and is consistent with the existing xdp functions for the ingress path is the right way to go for the egress path. Meaning, the libbpf API is replicated with _egress in the name, but the libbpf implementation shares common code to the degree possible. Some of the patches in this set were originally part of Jason Wang's work "XDP offload with virtio-net" [1]. The work overlaps work done by me, so we decided to consolidate the work into the egress path first with the offload option expected to be a follow on. At this point we are recognizing enough differences in the use case that we are diverging again. For instance, there is agreement that running the offloaded program should be done in vhost process context making the "cost" attributable to the VM (vhost or qemu), but it is more important for the host managed programs to be in the primary packet path. As an example, packets should not be queueud to the vhost ring buffer until *after* the XDP program for the Tx path has been run otherwise there is the potential for the ring buffer to fill with packets that the XDP program intends to drop (e.g., host managed ACL). v4: - updated cover letter - patches related to code movement between tuntap, headers and vhost are dropped; previous RFC ran the XDP program in vhost context vs this set which runs them before queueing to vhost. As a part of this moved invocation of egress program to tun_net_xmit and tun_xdp_xmit. - renamed do_xdp_generic to do_xdp_generic_rx to emphasize is called in the Rx path; added rx argument to do_xdp_generic_core since it is used for both directions and needs to know which queue values to set in xdp_buff v3: - reworked the patches - splitting patch 1 from RFC v2 into 3, combining patch 2 from RFC v2 into the first 3, combining patches 6 and 7 from RFC v2 into 1 since both did a trivial rename and export. Reordered the patches such that kernel changes are first followed by libbpf and an enhancement to a sample. - moved small xdp related helper functions from tun.c to tun.h to make tun_ptr_free usable from the tap code. This is needed to handle the case of tap builtin and tun built as a module. - pkt_ptrs added to `struct tun_file` and passed to tun_consume_packets rather than declaring pkts as an array on the stack. v2: - New XDP attachment type: Jesper, Toke and Alexei discussed whether to introduce a new program type. Since this set adds a way to attach regular XDP program to the tx path, as per Alexei's suggestion, a new attachment type BPF_XDP_EGRESS is introduced. - libbpf API changes: Alexei had suggested _opts() style of API extension. Considering it two new libbpf APIs are introduced which are equivalent to existing APIs. New ones can be extended easily. Please see individual patches for details. xdp1 sample program is modified to use new APIs. - tun: Some patches from previous set are removed as they are irrelevant in this series. They will in introduced later. [1]: https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net David Ahern (9): net: Add XDP setup and query commands for Tx programs xdp: Add xdp_txq_info to xdp_buff net: Add IFLA_XDP_EGRESS for XDP programs in the egress path net: core: rename netif_receive_generic_xdp to do_generic_xdp_core net: core: Rename do_xdp_generic to do_xdp_generic_rx and export tun: set egress XDP program tun: Support xdp in the Tx path for skb tun: Support xdp in the Tx path for xdp_frames libbpf: Add egress XDP support Prashant Bhole (2): net: Add BPF_XDP_EGRESS as a bpf_attach_type samples/bpf: xdp1, add egress XDP support drivers/net/tun.c | 156 ++++++++++++++++++++++++++--- include/linux/netdevice.h | 8 +- include/net/xdp.h | 5 + include/uapi/linux/bpf.h | 7 +- include/uapi/linux/if_link.h | 1 + net/core/dev.c | 59 +++++++---- net/core/filter.c | 34 +++++-- net/core/rtnetlink.c | 114 ++++++++++++++++++++- samples/bpf/xdp1_user.c | 30 +++++- tools/include/uapi/linux/bpf.h | 1 + tools/include/uapi/linux/if_link.h | 1 + tools/lib/bpf/libbpf.c | 2 + tools/lib/bpf/libbpf.h | 6 ++ tools/lib/bpf/libbpf.map | 3 + tools/lib/bpf/netlink.c | 52 ++++++++-- 15 files changed, 418 insertions(+), 61 deletions(-)