[RFC,v4,bpf-next,00/11] Add support for XDP in egress path

Message ID 20200227032013.12385-1-dsahern@kernel.org

Message

David Ahern Feb. 27, 2020, 3:20 a.m. UTC
From: David Ahern <dsahern@gmail.com>

This series adds support for XDP in the egress path by introducing
a new XDP attachment type, BPF_XDP_EGRESS, and adding an if_link API
for attaching the program to a netdevice and reporting the program.
The intent is to emulate the current Rx path for XDP as much as
possible to maintain consistency and symmetry between the two paths -
in their APIs and in when the programs are run: at first touch in the
Rx path and at last touch in the Tx path.
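For reference, a minimal sketch of loading a program against the
proposed attach point via the bpf(2) syscall - BPF_XDP_EGRESS is the
value this series adds (not yet in released uapi); the two-instruction
body just returns XDP_PASS:

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>

int load_egress_prog(void)
{
    /* trivial program: return XDP_PASS */
    struct bpf_insn insns[] = {
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = 0, .imm = XDP_PASS },
        { .code = BPF_JMP | BPF_EXIT },
    };
    union bpf_attr attr = {};

    attr.prog_type = BPF_PROG_TYPE_XDP;
    attr.expected_attach_type = BPF_XDP_EGRESS;  /* added by this series */
    attr.insns = (__u64)(unsigned long)insns;
    attr.insn_cnt = 2;
    attr.license = (__u64)(unsigned long)"GPL";

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}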

The intent is to be able to run bpf programs on all packets regardless
of how they got to the xmit function of the netdevice - as an skb or a
redirected xdp frame. This is a missing primitive for XDP: it allows
solutions to be built from small, targeted programs properly distributed
in the networking path - for example, an egress firewall / ACL, traffic
verification, or packet manipulation such as encapping an entire Ethernet
frame - whether the traffic is locally generated, forwarded via the slow
path (i.e., full stack processing), or an xdp redirected frame.
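As a rough illustration of the egress ACL case, a minimal sketch of
such a program - the hard-coded address stands in for a map lookup,
and the section naming convention for egress attach is not settled by
this cover letter:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_egress_acl(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph = data + sizeof(*eth);

    /* bounds check covers both headers before any access */
    if (data + sizeof(*eth) + sizeof(*iph) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    /* hard-coded 10.0.0.1 stands in for a map-based ACL lookup */
    if (iph->daddr == bpf_htonl(0x0a000001))
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";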

This set adds initial support for XDP in the egress path to the tun
driver, mostly for XDP_DROP (e.g., an ACL on the ingress path to the
VM). XDP_TX on Rx means send the packet out the device it arrived
on; given that, XDP_TX in the Tx path is treated as equivalent to
XDP_PASS - i.e., continue on the Tx path. XDP_REDIRECT in the Tx path
can be handled in a follow-on set; conceptually it is no different
from the Rx path - the frame is handed off to another device, though
loops do need to be detected.
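In other words, a driver's Tx hook would map verdicts roughly as
follows (illustrative only, not the actual tun code):

static bool xdp_egress_pass(struct bpf_prog *prog, struct xdp_buff *xdp)
{
    u32 act = bpf_prog_run_xdp(prog, xdp);

    switch (act) {
    case XDP_TX:        /* on egress, equivalent to XDP_PASS */
    case XDP_PASS:
        return true;    /* continue down the Tx path */
    case XDP_REDIRECT:  /* possible follow-on; treated as drop for now */
    default:
        return false;   /* XDP_DROP and unknown verdicts */
    }
}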

The expectation is that the current mode levels (skb, driver and hardware)
will also apply to the egress path. The current patch set focuses on
driver mode with tun. Attempting to tag the EGRESS path as yet another
mode is inconsistent on a number of levels - from the current usage of
XDP_FLAGS to the options passed to the verifier for restricting xdp_md
accesses. Using the API as listed above maintains consistency with all
existing code.

As for libbpf, we believe that an API that mirrors and is consistent
with the existing xdp functions for the ingress path is the right way
to go for the egress path. That is, the libbpf API is replicated with
_egress in the name, but the libbpf implementation shares code with
the ingress path to the degree possible.
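As a rough sketch of what that mirrored API could look like from an
application - the _egress function name here is illustrative; the real
names are in the libbpf patch:

#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/libbpf.h>

int attach_both(int rx_prog_fd, int tx_prog_fd)
{
    int ifindex = if_nametoindex("tap0");
    int err;

    /* existing ingress attach */
    err = bpf_set_link_xdp_fd(ifindex, rx_prog_fd, XDP_FLAGS_DRV_MODE);
    if (err)
        return err;

    /* illustrative egress counterpart mirroring the ingress API */
    return bpf_set_link_xdp_fd_egress(ifindex, tx_prog_fd, XDP_FLAGS_DRV_MODE);
}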

Some of the patches in this set were originally part of Jason
Wang's work "XDP offload with virtio-net" [1]. That work overlapped
with work done by me, so we decided to consolidate into the egress
path first, with the offload option expected to be a follow on.

At this point we are recognizing enough differences in the use cases
that we are diverging again. For instance, there is agreement that
running the offloaded program should be done in vhost process context,
making the "cost" attributable to the VM (vhost or qemu), but it is
more important for host-managed programs to be in the primary packet
path. As an example, packets should not be queued to the vhost ring
buffer until *after* the XDP program for the Tx path has been run;
otherwise there is the potential for the ring buffer to fill with
packets that the XDP program intends to drop (e.g., a host-managed ACL).
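A condensed, illustrative shape of that ordering in the tun xmit path
(helper names such as run_xdp_egress and the xdp_egress_prog member
are placeholders; queue selection and locking are elided):

static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct tun_struct *tun = netdev_priv(dev);
    struct tun_file *tfile = pick_tx_queue(tun, skb);  /* selection elided */
    struct bpf_prog *prog = rcu_dereference(tun->xdp_egress_prog);

    /* egress program runs before the frame reaches the ring, so a
     * dropped packet never consumes ring space */
    if (prog && run_xdp_egress(prog, skb) == XDP_DROP) {
        kfree_skb(skb);
        return NETDEV_TX_OK;
    }

    ptr_ring_produce(&tfile->tx_ring, skb);  /* queue toward vhost */
    return NETDEV_TX_OK;
}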

v4:
- updated cover letter
- patches related to code movement between tuntap, headers and vhost
  are dropped; the previous RFC ran the XDP program in vhost context,
  whereas this set runs it before queueing to vhost. As part of this,
  the invocation of the egress program moved to tun_net_xmit and
  tun_xdp_xmit.
- renamed do_xdp_generic to do_xdp_generic_rx to emphasize that it is
  called in the Rx path; added an rx argument to do_xdp_generic_core
  since it is used for both directions and needs to know which queue
  values to set in the xdp_buff (see the sketch after this list)
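Illustrative shape of that shared helper - the real patch wires up
actual queue structures; the extra parameters here are for exposition
only, and xdp_txq_info is the struct added in patch 2:

static u32 do_xdp_generic_core(struct sk_buff *skb, struct xdp_buff *xdp,
                               struct bpf_prog *xdp_prog, bool rx,
                               struct xdp_rxq_info *rxq,
                               struct xdp_txq_info *txq)
{
    if (rx)
        xdp->rxq = rxq;  /* first touch: Rx queue context */
    else
        xdp->txq = txq;  /* last touch: new Tx queue context */

    return bpf_prog_run_xdp(xdp_prog, xdp);
}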

v3:
- reworked the patches - splitting patch 1 from RFC v2 into 3, combining
  patch 2 from RFC v2 into the first 3, combining patches 6 and 7 from
  RFC v2 into 1 since both did a trivial rename and export. Reordered
  the patches such that kernel changes are first followed by libbpf and
  an enhancement to a sample.

- moved small xdp related helper functions from tun.c to tun.h to make
  tun_ptr_free usable from the tap code. This is needed to handle the
  case of tap built in and tun built as a module.

- pkt_ptrs added to `struct tun_file` and passed to tun_consume_packets
  rather than declaring pkts as an array on the stack.

v2:
- New XDP attachment type: Jesper, Toke and Alexei discussed whether
  to introduce a new program type. Since this set adds a way to attach
  a regular XDP program to the Tx path, as per Alexei's suggestion, a
  new attachment type, BPF_XDP_EGRESS, is introduced.

- libbpf API changes:
  Alexei had suggested an _opts() style of API extension. Accordingly,
  two new libbpf APIs are introduced which are equivalent to existing
  APIs; the new ones can be extended easily. Please see the individual
  patches for details. The xdp1 sample program is modified to use the
  new APIs.

- tun: Some patches from the previous set are removed as they are
  irrelevant in this series. They will be introduced later.

[1]: https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net


David Ahern (9):
  net: Add XDP setup and query commands for Tx programs
  xdp: Add xdp_txq_info to xdp_buff
  net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  net: core: rename netif_receive_generic_xdp to do_generic_xdp_core
  net: core: Rename do_xdp_generic to do_xdp_generic_rx and export
  tun: set egress XDP program
  tun: Support xdp in the Tx path for skb
  tun: Support xdp in the Tx path for xdp_frames
  libbpf: Add egress XDP support

Prashant Bhole (2):
  net: Add BPF_XDP_EGRESS as a bpf_attach_type
  samples/bpf: xdp1, add egress XDP support

 drivers/net/tun.c                  | 156 ++++++++++++++++++++++++++---
 include/linux/netdevice.h          |   8 +-
 include/net/xdp.h                  |   5 +
 include/uapi/linux/bpf.h           |   7 +-
 include/uapi/linux/if_link.h       |   1 +
 net/core/dev.c                     |  59 +++++++----
 net/core/filter.c                  |  34 +++++--
 net/core/rtnetlink.c               | 114 ++++++++++++++++++++-
 samples/bpf/xdp1_user.c            |  30 +++++-
 tools/include/uapi/linux/bpf.h     |   1 +
 tools/include/uapi/linux/if_link.h |   1 +
 tools/lib/bpf/libbpf.c             |   2 +
 tools/lib/bpf/libbpf.h             |   6 ++
 tools/lib/bpf/libbpf.map           |   3 +
 tools/lib/bpf/netlink.c            |  52 ++++++++--
 15 files changed, 418 insertions(+), 61 deletions(-)

Comments

Toke Høiland-Jørgensen Feb. 27, 2020, 11:55 a.m. UTC | #1
David Ahern <dsahern@kernel.org> writes:

> From: David Ahern <dsahern@gmail.com>
>
> This series adds support for XDP in the egress path by introducing
> a new XDP attachment type, BPF_XDP_EGRESS, and adding an if_link API
> for attaching the program to a netdevice and reporting the program.
> The intent is to emulate the current RX path for XDP as much as
> possible to maintain consistency and symmetry in the 2 paths with
> their APIs and when the programs are run: at first touch in the Rx
> path and last touch in the Tx path.
>
> The intent is to be able to run bpf programs on all packets regardless
> of how they got to the xmit function of the netdevice - as an skb or a
> redirected xdp frame. This is a missing primitive for XDP allowing
> solutions to build small, targeted programs properly distributed in the
> networking path allowing for example an egress firewall / ACL / traffic
> verification or packet manipulation and encapping an entire ethernet
> frame whether it is locally generated traffic, forwarded via the slow
> path (ie., full stack processing) or xdp redirected frames.

I'm totally on board with these goals!

As for this:

> Attempting to tag the EGRESS path as yet another mode is inconsistent
> on a number of levels - from the current usage of XDP_FLAGS to options
> passed to the verifier for restricting xdp_md accesses. Using the API
> as listed above maintains consistency with all existing code.

You *are* effectively tagging the EGRESS path as another mode: You are
restricting which fields of the context object the program can access,
and you're restricting where the program can be attached. I am pretty
sure we will end up accumulating more differences, either in more
metadata that is only available in one mode (we've already discussed
exposing TX qlen on egress programs), or even helpers that only make
sense in one mode.

So it doesn't make sense to discuss whether egress programs are a
distinct type from ingress programs: They clearly are. What we are
discussing is how to encode this type difference. You are proposing to
encode it using expected_attach_type as a subtype identifier instead of
using a new type number. There is already precedent for this with the
tracing programs, and I do think it makes sense - ingress and egress XDP
programs are clearly related, just as (e.g.) fentry/fexit/freplace
programs are.

However, my issue with this encoding is that it is write-only: You can't
inspect a BPF program already loaded into the kernel and tell which type
it is. So my proposal would be to make it explicit: Expose the
expected_attach_type as a new field in bpf_prog_info so userspace can
query it, and clearly document it as, essentially, a program subtype
that can significantly affect how a program is treated by the kernel.
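Something like this from the tooling side - note that
info.expected_attach_type does not exist today; it is exactly the
field I'm proposing to add:

#include <stdio.h>
#include <bpf/bpf.h>

void show_prog_subtype(int prog_fd)
{
    struct bpf_prog_info info = {};
    __u32 len = sizeof(info);

    if (bpf_obj_get_info_by_fd(prog_fd, &info, &len))
        return;

    printf("type=%u\n", info.type);
    /* proposed addition to bpf_prog_info:
     * printf("attach=%u\n", info.expected_attach_type);
     */
}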

-Toke
Alexei Starovoitov Feb. 27, 2020, 4:22 p.m. UTC | #2
On Thu, Feb 27, 2020 at 3:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> However, my issue with this encoding is that it is write-only: You can't
> inspect a BPF program already loaded into the kernel and tell which type
> it is. So my proposal would be to make it explicit: Expose the
> expected_attach_type as a new field in bpf_prog_info so userspace can
> query it, and clearly document it as, essentially, a program subtype
> that can significantly affect how a program is treated by the kernel.

You had the same request for "freplace" target prog.
My answer to both is still the same:
Please take a look at drgn and the script that Andrey posted.
All this information is trivial to extract from the kernel
without introducing new uapi.
Toke Høiland-Jørgensen Feb. 27, 2020, 5:06 p.m. UTC | #3
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Feb 27, 2020 at 3:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> However, my issue with this encoding is that it is write-only: You can't
>> inspect a BPF program already loaded into the kernel and tell which type
>> it is. So my proposal would be to make it explicit: Expose the
>> expected_attach_type as a new field in bpf_prog_info so userspace can
>> query it, and clearly document it as, essentially, a program subtype
>> that can significantly affect how a program is treated by the kernel.
>
> You had the same request for "freplace" target prog.

Yes, and for the same reason; I'm being consistent here :)

> My answer to both is still the same:
> Please take a look at drgn and the script that Andrey posted.
> All this information is trivial to extract from the kernel
> without introducing new uapi.

I'm sorry, but having this kind of write-only data structure is a
horrible API design; and saying "you can just parse the internal kernel
data structures to see what is going on" is a cop-out. The whole point
of having a stable UAPI is to make it possible for people to write
applications that make use of kernel features with an expectation that
those applications will keep working. XDP is a networking feature;
people building networking applications shouldn't have to chase internal
kernel APIs just to keep their packet processing programs working.

Besides, it's already UAPI - there's a setter for it! If we introduce
this new egress program type, it is still going to be a de facto new
program type with different semantics than the RX program, whether we
try to hide the fact or not. Even if we end up completely changing the
internal data structures, we're still going to have to support someone 
loading a program with type==XDP and attach_type == XDP_EGRESS. So why
can't we allow the user to query the state of a previously loaded
program as well?

-Toke
Alexei Starovoitov Feb. 27, 2020, 6:37 p.m. UTC | #4
On Thu, Feb 27, 2020 at 06:06:57PM +0100, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, Feb 27, 2020 at 3:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> However, my issue with this encoding is that it is write-only: You can't
> >> inspect a BPF program already loaded into the kernel and tell which type
> >> it is. So my proposal would be to make it explicit: Expose the
> >> expected_attach_type as a new field in bpf_prog_info so userspace can
> >> query it, and clearly document it as, essentially, a program subtype
> >> that can significantly affect how a program is treated by the kernel.
> >
> > You had the same request for "freplace" target prog.
> 
> Yes, and for the same reason; I'm being consistent here :)
> 
> > My answer to both is still the same:
> > Please take a look at drgn and the script that Andrey posted.
> > All this information is trivial to extract from the kernel
> > without introducing new uapi.
> 
> I'm sorry, but having this kind of write-only data structure is a
> horrible API design; and saying "you can just parse the internal kernel
> data structures to see what is going on" is a cop-out. The whole point
> of having a stable UAPI is to make it possible for people to write
> applications that make use of kernel features with an expectation that
> those applications will keep working. XDP is a networking feature;
> people building networking applications shouldn't have to chase internal
> kernel APIs just to keep their packet processing programs working.

You're mistaking human needs for tooling needs.
Humans want to see all sorts of internal things, and they can tweak a
drgn script to look at them on a live kernel, or examine kdump's vmcore
with the same drgn script.
Tooling needs uapi to work on a live system.
There are 26 attach_types that are used by various cgroup based progs,
sockmap, lirc, flow_dissector and tracing. Many of these attach types
are used in production and not a single time _tools_ had a need to
retrieve that field from the kernel.
Show me the tool that needs to read expected_attach_type back and
then we can discuss how to expose it in uapi.

There is another side to this. The struct bpf_prog->pages field
currently is not exposed in uapi, and I've seen tools doing
cat /proc/pid/fdinfo/fd | grep "memlock:"
just to retrieve the 'pages' field. That's the case where
'struct bpf_prog_info' should be extended.