[v5,bpf-next,00/11] net: Add support for XDP in egress path

Message ID 20200513014607.40418-1-dsahern@kernel.org

Message

David Ahern May 13, 2020, 1:45 a.m. UTC
From: David Ahern <dahern@digitalocean.com>

This series adds support for XDP in the egress path by introducing
a new XDP attachment type, BPF_XDP_EGRESS, and adding a UAPI to
if_link.h for attaching the program to a netdevice and reporting
the program. This allows bpf programs to be run on redirected xdp
frames with the context showing the Tx device.

This is a missing primitive for XDP, allowing solutions to be built from
small, targeted programs properly distributed along the networking path,
enabling, for example, an egress firewall/ACL/traffic verification or
packet manipulation based on data specific to the egress device.

Nothing about running a program in the Tx path requires driver specific
resources like the Rx path has. Thus, programs can be run in core
code and attached to the net_device struct similar to skb mode. The
egress attach is done using the new XDP_FLAGS_EGRESS_MODE flag, and
is reported by the kernel using the XDP_ATTACHED_EGRESS_CORE attach
mode along with IFLA_XDP_EGRESS_PROG_ID, making the API similar to the
existing XDP APIs.

The egress program is run in bq_xmit_all before invoking ndo_xdp_xmit.
This is similar to cls_bpf programs which run before the call to
ndo_start_xmit. Together the 2 locations cover all packets about to be
sent to a device for Tx.
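
For illustration only (this program is not taken from the patches, and the
comment about return codes is an assumption based on the hook placement
described above), a per-device egress check can be as small as:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>

  /* allowed source MAC for this egress device (per-VM data) */
  static const unsigned char allowed_smac[ETH_ALEN] = {
          0x52, 0x54, 0x00, 0x12, 0x34, 0x56
  };

  SEC("xdp")
  int xdp_egress_acl(struct xdp_md *ctx)
  {
          void *data = (void *)(long)ctx->data;
          void *data_end = (void *)(long)ctx->data_end;
          struct ethhdr *eth = data;
          int i;

          if (data + sizeof(*eth) > data_end)
                  return XDP_DROP;

          /* drop spoofed frames; XDP_PASS lets the frame continue on to
           * ndo_xdp_xmit (assumed semantics of the egress hook)
           */
          for (i = 0; i < ETH_ALEN; i++)
                  if (eth->h_source[i] != allowed_smac[i])
                          return XDP_DROP;

          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";

Attaching would then presumably be a matter of passing the new flag to the
existing libbpf setter, e.g. bpf_set_link_xdp_fd(ifindex, prog_fd,
XDP_FLAGS_EGRESS_MODE); the exact entry point is added in the libbpf patch.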

xdp egress programs are not run on skbs, so a cls-bpf counterpart
should also be attached to the device to cover all packets -
xdp_frames and skbs.

v5:
- rebased to top of bpf-next
- dropped skb path; cls-bpf provides an option for the same functionality
  without having to take a performance hit (e.g., disabling GSO).
- updated fall through notation to 'fallthrough;' statement per
  checkpatch warning

v4:
- added space in bpftool help in patch 12 - Toke
- updated to top of bpf-next

v3:
- removed IFLA_XDP_EGRESS and dropped back to XDP_FLAGS_EGRESS_MODE
  as the uapi to specify the attach. This caused the ordering of the
  patches to change with the uapi now introduced in the second patch
  and 2 refactoring patches are dropped. Samples and test programs
  updated to use the new API.

v2:
- changed rx checks in xdp_is_valid_access to any expected_attach_type
- add xdp_egress argument to bpftool prog rst document
- do not allow IFLA_XDP and IFLA_XDP_EGRESS in the same config. There
  is no way to rollback IFLA_XDP if IFLA_XDP_EGRESS fails.
- comments from Andrii on libbpf

v1:
- add selftests
- flip the order of xdp generic patches as requested by Toke
- fixed the count arg to do_xdp_egress_frame - Toke
- remove meta data invalidate in __xdp_egress_frame - Toke
- fixed data_hard_start in __xdp_egress_frame - Jesper
- refactored convert_to_xdp_frame to reuse buf to frame code - Jesper
- added missed refactoring patch when generating patch set

RFC v5:
- updated cover letter
- moved running of the ebpf program from ndo_{start,xdp}_xmit to core
  code. Dropped all tun and vhost related changes.
- added egress support to bpftool

RFC v4:
- updated cover letter
- patches related to code movement between tuntap, headers and vhost
  are dropped; the previous RFC ran the XDP program in vhost context,
  whereas this set runs it before queueing to vhost. As part of this,
  the invocation of the egress program moved to tun_net_xmit and
  tun_xdp_xmit.
- renamed do_xdp_generic to do_xdp_generic_rx to emphasize it is called
  in the Rx path; added rx argument to do_xdp_generic_core since it
  is used for both directions and needs to know which queue values to
  set in xdp_buff

RFC v3:
- reworked the patches - splitting patch 1 from RFC v2 into 3, combining
  patch 2 from RFC v2 into the first 3, combining patches 6 and 7 from
  RFC v2 into 1 since both did a trivial rename and export. Reordered
  the patches such that kernel changes are first followed by libbpf and
  an enhancement to a sample.

- moved small xdp related helper functions from tun.c to tun.h to make
  tun_ptr_free usable from the tap code. This is needed to handle the
  case of tap builtin and tun built as a module.

- pkt_ptrs added to `struct tun_file` and passed to tun_consume_packets
  rather than declaring pkts as an array on the stack.

RFC v2:
- New XDP attachment type: Jesper, Toke and Alexei discussed whether
  to introduce a new program type. Since this set adds a way to attach
  a regular XDP program to the tx path, as per Alexei's suggestion, a
  new attachment type BPF_XDP_EGRESS is introduced.

- libbpf API changes:
  Alexei had suggested an _opts() style of API extension. Based on that,
  two new libbpf APIs are introduced which are equivalent to the existing
  APIs. The new ones can be extended easily. Please see individual patches
  for details. xdp1 sample program is modified to use new APIs.

- tun: Some patches from the previous set are removed as they are
  not relevant to this series. They will be introduced later.


David Ahern (11):
  net: Refactor convert_to_xdp_frame
  net: uapi for XDP programs in the egress path
  net: Add XDP setup and query commands for Tx programs
  net: Add BPF_XDP_EGRESS as a bpf_attach_type
  xdp: Add xdp_txq_info to xdp_buff
  net: set XDP egress program on netdevice
  net: Support xdp in the Tx path for xdp_frames
  libbpf: Add egress XDP support
  bpftool: Add support for XDP egress
  selftest: Add xdp_egress attach tests
  samples/bpf: add XDP egress support to xdp1

 include/linux/netdevice.h                     |   7 +
 include/net/xdp.h                             |  35 +++--
 include/uapi/linux/bpf.h                      |   3 +
 include/uapi/linux/if_link.h                  |   6 +-
 kernel/bpf/devmap.c                           |  19 ++-
 net/core/dev.c                                | 147 ++++++++++++++++--
 net/core/filter.c                             |  26 ++++
 net/core/rtnetlink.c                          |  23 ++-
 samples/bpf/xdp1_user.c                       |  11 +-
 .../bpf/bpftool/Documentation/bpftool-net.rst |   4 +-
 .../bpftool/Documentation/bpftool-prog.rst    |   2 +-
 tools/bpf/bpftool/bash-completion/bpftool     |   4 +-
 tools/bpf/bpftool/net.c                       |   6 +-
 tools/bpf/bpftool/netlink_dumper.c            |   5 +
 tools/bpf/bpftool/prog.c                      |   2 +-
 tools/include/uapi/linux/bpf.h                |   3 +
 tools/include/uapi/linux/if_link.h            |   6 +-
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/lib/bpf/libbpf.h                        |   1 +
 tools/lib/bpf/netlink.c                       |   6 +
 .../bpf/prog_tests/xdp_egress_attach.c        |  56 +++++++
 .../selftests/bpf/progs/test_xdp_egress.c     |  12 ++
 .../bpf/progs/test_xdp_egress_fail.c          |  16 ++
 23 files changed, 358 insertions(+), 44 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_egress_attach.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_egress.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_egress_fail.c

Comments

Toke Høiland-Jørgensen May 13, 2020, 10:43 a.m. UTC | #1
David Ahern <dsahern@kernel.org> writes:

> From: David Ahern <dahern@digitalocean.com>
>
> This series adds support for XDP in the egress path by introducing
> a new XDP attachment type, BPF_XDP_EGRESS, and adding a UAPI to
> if_link.h for attaching the program to a netdevice and reporting
> the program. This allows bpf programs to be run on redirected xdp
> frames with the context showing the Tx device.
>
> This is a missing primitive for XDP, allowing solutions to be built from
> small, targeted programs properly distributed along the networking path,
> enabling, for example, an egress firewall/ACL/traffic verification or
> packet manipulation based on data specific to the egress device.
>
> Nothing about running a program in the Tx path requires driver specific
> resources like the Rx path has. Thus, programs can be run in core
> code and attached to the net_device struct similar to skb mode. The
> egress attach is done using the new XDP_FLAGS_EGRESS_MODE flag, and
> is reported by the kernel using the XDP_ATTACHED_EGRESS_CORE attach
> mode along with IFLA_XDP_EGRESS_PROG_ID, making the API similar to the
> existing XDP APIs.
>
> The egress program is run in bq_xmit_all before invoking ndo_xdp_xmit.
> This is similar to cls_bpf programs which run before the call to
> ndo_start_xmit. Together the 2 locations cover all packets about to be
> sent to a device for Tx.
>
> xdp egress programs are not run on skbs, so a cls-bpf counterpart
> should also be attached to the device to cover all packets -
> xdp_frames and skbs.
>
> v5:
> - rebased to top of bpf-next
> - dropped skb path; cls-bpf provides an option for the same functionality
>   without having to take a performance hit (e.g., disabling GSO).

I don't like this. It makes the egress hook asymmetrical with the ingress
hook (ingress hook sees all traffic, egress only some of it). If the
performance hit of disabling GSO is the concern, maybe it's better to
wait until we figure out how to deal with that (presumably by
multi-buffer XDP)?

-Toke
David Ahern May 13, 2020, 7:37 p.m. UTC | #2
On 5/13/20 4:43 AM, Toke Høiland-Jørgensen wrote:
> I don't like this. It makes the egress hook asymmetrical with the ingress
> hook (ingress hook sees all traffic, egress only some of it). If the
> performance hit of disabling GSO is the concern, maybe it's better to
> wait until we figure out how to deal with that (presumably by
> multi-buffer XDP)?

XDP is for accelerated networking. Disabling a h/w offload feature to
use a s/w feature is just wrong. But it is more than just disabling GSO,
and multi-buffer support for XDP is still not going to solve the
problem. XDP is free form allowing any packet modifications - pushing
and popping headers - and, for example, that blows up all of the skb
markers for mac, network, transport and their inner versions. Walking
the skb after an XDP program has run to reset the markers does not make
sense. Combine this with the generic xdp overhead (e.g., handling skb
clone and linearize), and the whole thing just does not make sense.

We have to accept there are a lot of use cases / code paths that simply
cannot be converted to work with both skbs and xdp_frames. The qdisc code
is one example. This is another. Requiring a tc program for the skb path
is an acceptable trade-off.
John Fastabend May 15, 2020, 10:54 p.m. UTC | #3
David Ahern wrote:
> On 5/13/20 4:43 AM, Toke Høiland-Jørgensen wrote:
> > I don't like this. It makes the egress hook asymmetrical with the ingress
> > hook (ingress hook sees all traffic, egress only some of it). If the
> > performance hit of disabling GSO is the concern, maybe it's better to
> > wait until we figure out how to deal with that (presumably by
> > multi-buffer XDP)?
> 
> XDP is for accelerated networking. Disabling a h/w offload feature to
> use a s/w feature is just wrong. But it is more than just disabling GSO,
> and multi-buffer support for XDP is still not going to solve the
> problem. XDP is free form allowing any packet modifications - pushing
> and popping headers - and, for example, that blows up all of the skb
> markers for mac, network, transport and their inner versions. Walking
> the skb after an XDP program has run to reset the markers does not make
> sense. Combine this with the generic xdp overhead (e.g., handling skb
> clone and linearize), and the whole thing just does not make sense.
> 
> We have to accept there are a lot of use cases / code paths that simply
> cannot be converted to work with both skbs and xdp_frames. The qdisc code
> is one example. This is another. Requiring a tc program for the skb path
> is an acceptable trade-off.

Hi David,

Another way to set up egress programs that I had been thinking about is to
build a prog_array map with a slot per interface then after doing the
redirect (or I guess the tail call program can do the redirect) do the
tail call into the "egress" program.

From a programming side this would look like,


  ---> ingress xdp bpf                BPF_MAP_TYPE_PROG_ARRAY
         redirect(ifindex)            +---------+
         tail_call(ifindex)           |         |
                      |               +---------+
                      +-------------> | ifindex | 
                                      +---------+
                                      |         |
                                      +---------+


         return XDP_REDIRECT
                        |
                        +-------------> xdp_xmit


The controller would then update the BPF_MAP_TYPE_PROG_ARRAY instead of
attaching to egress interface itself as in the series here. I think it
would only require that tail call program return XDP_REDIRECT so the
driver knows to follow through with the redirect. OTOH the egress program
can decide to DROP or PASS as well. The DROP case is straight forward,
packet gets dropped. The PASS case is interesting because it will cause
the packet to go to the stack. Which may or may not be expected I guess.
We could always lint the programs or force the programs to return only
XDP_REDIRECT/XDP_PASS from libbpf side.
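
For concreteness, a rough sketch of that layout in BPF C (the fixed ifindex
and the map size are placeholders, not a worked-out demux):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
          __uint(max_entries, 1024);       /* the right-sizing question below */
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u32));
  } egress_progs SEC(".maps");

  SEC("xdp")
  int xdp_ingress(struct xdp_md *ctx)
  {
          __u32 ifindex = 42;   /* placeholder: chosen by the demux logic */

          bpf_redirect(ifindex, 0);                   /* set redirect target */
          bpf_tail_call(ctx, &egress_progs, ifindex); /* per-ifindex egress prog */
          return XDP_REDIRECT;                        /* no entry: plain redirect */
  }

  /* installed into egress_progs[ifindex] by the controller */
  SEC("xdp")
  int xdp_egress_policy(struct xdp_md *ctx)
  {
          /* policy lives here; XDP_DROP drops, XDP_PASS goes to the stack */
          return XDP_REDIRECT;  /* let the driver follow through */
  }

  char _license[] SEC("license") = "GPL";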

Would there be any differences from my example and your series from the
datapath side? I think from the BPF program side the only difference
would be return codes XDP_REDIRECT vs XDP_PASS. The control plane is
different however. I don't have a good sense of one being better than
the other. Do you happen to see some reason to prefer native xdp egress
program types over prog array usage?

From the performance side I suspect they will be more or less equivalent.

On the positive side using a PROG_ARRAY doesn't require a new attach
point. A con might be right-sizing the PROG_ARRAY to map to interfaces?
Do you have 1000's of interfaces here? Or some unknown number of
interfaces? I've had building resizable hash/array maps for awhile
on my todo list so could add that for other use cases as well if that
was the only problem.

Sorry for the late reply it took me a bit of time to mull over the
patches.

Thanks,
John
David Ahern May 15, 2020, 11:15 p.m. UTC | #4
On 5/15/20 4:54 PM, John Fastabend wrote:
> Hi David,
> 
> Another way to set up egress programs that I had been thinking about is to
> build a prog_array map with a slot per interface then after doing the
> redirect (or I guess the tail call program can do the redirect) do the
> tail call into the "egress" program.
> 
> From a programming side this would look like,
> 
> 
>   ---> ingress xdp bpf                BPF_MAP_TYPE_PROG_ARRAY
>          redirect(ifindex)            +---------+
>          tail_call(ifindex)           |         |
>                       |               +---------+
>                       +-------------> | ifindex | 
>                                       +---------+
>                                       |         |
>                                       +---------+
> 
> 
>          return XDP_REDIRECT
>                         |
>                         +-------------> xdp_xmit
> 
> 
> The controller would then update the BPF_MAP_TYPE_PROG_ARRAY instead of
> attaching to egress interface itself as in the series here. I think it
> would only require that tail call program return XDP_REDIRECT so the
> driver knows to follow through with the redirect. OTOH the egress program
> can decide to DROP or PASS as well. The DROP case is straight forward,
> packet gets dropped. The PASS case is interesting because it will cause
> the packet to go to the stack. Which may or may not be expected I guess.
> We could always lint the programs or force the programs to return only
> XDP_REDIRECT/XDP_PASS from libbpf side.
> 
> Would there be any differences from my example and your series from the
> datapath side? I think from the BPF program side the only difference
> would be return codes XDP_REDIRECT vs XDP_PASS. The control plane is
> different however. I don't have a good sense of one being better than
> the other. Do you happen to see some reason to prefer native xdp egress
> program types over prog array usage?

host ingress to VM is one use case; VM to VM on the same host is another.

> 
> From the performance side I suspect they will be more or less equivalent.
> 
> On the positive side using a PROG_ARRAY doesn't require a new attach
> point. A con might be right-sizing the PROG_ARRAY to map to interfaces?
> Do you have 1000's of interfaces here? Or some unknown number of

1000ish is probably the right ballpark - up to 500 VMs on a host each
with a public and private network connection. From there each interface
can have their own firewall (ingress and egress; most likely VM unique
data, but to be flexible potentially different programs e.g., blacklist
vs whitelist). Each VM will definitely have its own network data - mac
and network addresses, and since VMs are untrusted packet validation in
both directions is a requirement.

With respect to lifecycle management of the programs and the data,
putting VM specific programs and maps on VM specific taps simplifies
management. VM terminates, taps are deleted, programs and maps
disappear. So no validator thread needed to handle stray data / programs
from the inevitable cleanup problems when everything is lumped into 1
program / map or even array of programs and maps.

To me the distributed approach is the simplest and best. The program on
the host nics can be stupid simple; no packet parsing beyond the
ethernet header. Its job is just a traffic demuxer very much like a
switch. All VM logic and data is local to the VM's interfaces.


> interfaces? I've had building resizable hash/array maps for awhile
> on my todo list so could add that for other use cases as well if that
> was the only problem.
> 
> Sorry for the late reply it took me a bit of time to mull over the
> patches.
> 
> Thanks,
> John
>
David Ahern May 18, 2020, 3:40 a.m. UTC | #5
I am trying to understand the resistance here. There are ingress/egress
hooks for most of the layers - tc, netfilter, and even within bpf APIs.
Clearly there is a need for this kind of symmetry across the APIs, so
why the resistance or hesitation for XDP?

Stacking programs on the Rx side into the host was brought up 9
revisions ago when the first patches went out. It makes for an
unnecessarily complicated design and is antithetical to the whole
Unix/Linux philosophy of small focused programs linked together to
provide a solution.

Can you elaborate on your concerns?
Toke Høiland-Jørgensen May 18, 2020, 9:08 a.m. UTC | #6
David Ahern <dsahern@gmail.com> writes:

> On 5/13/20 4:43 AM, Toke Høiland-Jørgensen wrote:
>> I don't like this. It makes the egress hook asymmetrical with the ingress
>> hook (ingress hook sees all traffic, egress only some of it). If the
>> performance hit of disabling GSO is the concern, maybe it's better to
>> wait until we figure out how to deal with that (presumably by
>> multi-buffer XDP)?
>
> XDP is for accelerated networking. Disabling a h/w offload feature to
> use a s/w feature is just wrong. But it is more than just disabling GSO,
> and multi-buffer support for XDP is still not going to solve the
> problem. XDP is free form allowing any packet modifications - pushing
> and popping headers - and, for example, that blows up all of the skb
> markers for mac, network, transport and their inner versions. Walking
> the skb after an XDP program has run to reset the markers does not make
> sense. Combine this with the generic xdp overhead (e.g., handling skb
> clone and linearize), and the whole thing just does not make sense.

I can see your point that fixing up the whole skb after the program has
run is not a good idea. But to me that just indicates that the hook is
in the wrong place: that it really should be in the driver, executed at
a point where the skb data structure is no longer necessary (similar to
how the ingress hook is before the skb is generated).

Otherwise, what you're proposing is not an egress hook, but rather a
'post-REDIRECT hook', which is strictly less powerful. This may or may
not be useful in its own right, but let's not pretend it's a full egress
hook. Personally I feel that the egress hook is what we should be going
for, not this partial thing.

-Toke
David Ahern May 18, 2020, 2:44 p.m. UTC | #7
On 5/18/20 3:08 AM, Toke Høiland-Jørgensen wrote:
> I can see your point that fixing up the whole skb after the program has
> run is not a good idea. But to me that just indicates that the hook is
> in the wrong place: that it really should be in the driver, executed at
> a point where the skb data structure is no longer necessary (similar to
> how the ingress hook is before the skb is generated).

Have you created a cls_bpf program to modify skbs? Have you looked at
the helpers, the restrictions and the tight management of skb changes?
Have you followed the skb from create to device handoff through the
drivers? Have you looked at the history of encapsulations, gso handling,
offloads, ...? I have and it drove home that the skb path and xdp paths
are radically different. XDP is meant to be light and fast, and trying
to cram an skb down the xdp path is a dead end.

> 
> Otherwise, what you're proposing is not an egress hook, but rather a
> 'post-REDIRECT hook', which is strictly less powerful. This may or may
> not be useful in its own right, but let's not pretend it's a full egress
> hook. Personally I feel that the egress hook is what we should be going
> for, not this partial thing.

You are hand waving. Be specific, with details.

Less powerful how? There are only so many operations you can do to a
packet. What do you want to do and what can't be done with this proposed
change? Why must it be done as XDP vs proper synergy between the 2 paths.
Toke Høiland-Jørgensen May 18, 2020, 6 p.m. UTC | #8
David Ahern <dsahern@gmail.com> writes:

> On 5/18/20 3:08 AM, Toke Høiland-Jørgensen wrote:
>> I can see your point that fixing up the whole skb after the program has
>> run is not a good idea. But to me that just indicates that the hook is
>> in the wrong place: that it really should be in the driver, executed at
>> a point where the skb data structure is no longer necessary (similar to
>> how the ingress hook is before the skb is generated).
>
> Have you created a cls_bpf program to modify skbs? Have you looked at
> the helpers, the restrictions and the tight management of skb changes?
> Have you followed the skb from create to device handoff through the
> drivers? Have you looked at the history of encapsulations, gso handling,
> offloads, ...?

Have you tried re-reading the first sentence of the paragraph you're
replying to? You know, the one that started with "I can see your point
that..."

>> Otherwise, what you're proposing is not an egress hook, but rather a
>> 'post-REDIRECT hook', which is strictly less powerful. This may or may
>> not be useful in its own right, but let's not pretend it's a full egress
>> hook. Personally I feel that the egress hook is what we should be going
>> for, not this partial thing.
>
> You are hand waving. Be specific, with details.

Are you deliberately trying to antagonise me or something? It's a really
odd way to try to make your case...

> Less powerful how? There are only so many operations you can do to a
> packet. What do you want to do and what can't be done with this proposed
> change? Why must it be done as XDP vs proper synergy between the 2 paths.

I meant 'less powerful' in the obvious sense: it only sees a subset of
the packets going out of the interface. And so I worry that it will (a)
make an already hard to use set of APIs even more confusing, and (b)
turn out to not be enough so we'll end up needing a "real" egress hook.

As I said in my previous email, a post-REDIRECT hook may or may not be
useful in its own right. I'm kinda on the fence about that, but am
actually leaning towards it being useful; however, I am concerned that
it'll end up being redundant if we do get a full egress hook.

-Toke
John Fastabend May 18, 2020, 6:10 p.m. UTC | #9
David Ahern wrote:
> On 5/15/20 4:54 PM, John Fastabend wrote:
> > Hi David,
> > 
> > Another way to set up egress programs that I had been thinking about is to
> > build a prog_array map with a slot per interface then after doing the
> > redirect (or I guess the tail call program can do the redirect) do the
> > tail call into the "egress" program.
> > 
> > From a programming side this would look like,
> > 
> > 
> >   ---> ingress xdp bpf                BPF_MAP_TYPE_PROG_ARRAY
> >          redirect(ifindex)            +---------+
> >          tail_call(ifindex)           |         |
> >                       |               +---------+
> >                       +-------------> | ifindex | 
> >                                       +---------+
> >                                       |         |
> >                                       +---------+
> > 
> > 
> >          return XDP_REDIRECT
> >                         |
> >                         +-------------> xdp_xmit
> > 
> > 
> > The controller would then update the BPF_MAP_TYPE_PROG_ARRAY instead of
> > attaching to egress interface itself as in the series here. I think it
> > would only require that tail call program return XDP_REDIRECT so the
> > driver knows to follow through with the redirect. OTOH the egress program
> > can decide to DROP or PASS as well. The DROP case is straight forward,
> > packet gets dropped. The PASS case is interesting because it will cause
> > the packet to go to the stack. Which may or may not be expected I guess.
> > We could always lint the programs or force the programs to return only
> > XDP_REDIRECT/XDP_PASS from libbpf side.
> > 
> > Would there be any differences from my example and your series from the
> > datapath side? I think from the BPF program side the only difference
> > would be return codes XDP_REDIRECT vs XDP_PASS. The control plane is
> > different however. I don't have a good sense of one being better than
> > the other. Do you happen to see some reason to prefer native xdp egress
> > program types over prog array usage?
> 
> host ingress to VM is one use case; VM to VM on the same host is another.

But host ingress to VM would still work with tail calls because the XDP
packet came from another XDP program. At least that is how I understand
it.

VM to VM case, again using tail calls on the sending VM ingress hook
would work also.

> 
> > 
> > From the performance side I suspect they will be more or less equivalent.
> > 
> > On the positive side using a PROG_ARRAY doesn't require a new attach
> > point. A con might be right-sizing the PROG_ARRAY to map to interfaces?
> > Do you have 1000's of interfaces here? Or some unknown number of
> 
> 1000ish is probably the right ballpark - up to 500 VMs on a host each
> with a public and private network connection. From there each interface
> can have their own firewall (ingress and egress; most likely VM unique
> data, but to be flexible potentially different programs e.g., blacklist
> vs whitelist). Each VM will definitely have its own network data - mac
> and network addresses, and since VMs are untrusted packet validation in
> both directions is a requirement.

Understood and makes sense.

> 
> With respect to lifecycle management of the programs and the data,
> putting VM specific programs and maps on VM specific taps simplifies
> management. VM terminates, taps are deleted, programs and maps
> disappear. So no validator thread needed to handle stray data / programs
> from the inevitable cleanup problems when everything is lumped into 1
> program / map or even array of programs and maps.

OK. Also presumably you already have a hook into this event to insert
the tc filter programs so its probably a natural hook for mgmt.

> 
> To me the distributed approach is the simplest and best. The program on
> the host nics can be stupid simple; no packet parsing beyond the
> > ethernet header. Its job is just a traffic demuxer very much like a
> switch. All VM logic and data is local to the VM's interfaces.

IMO it seems more natural and efficient to use a tail call. But, I
can see how if the ingress program is a l2/l3 switch and the VM hook
is a l2/l3 filter it feels more like a switch+firewall layout we
would normally use on a "real" (v)switch. Also I think the above point
where cleanup is free because of the tap tear down is a win.

> 
> 
> > interfaces? I've had building resizable hash/array maps for awhile
> > on my todo list so could add that for other use cases as well if that
> > was the only problem.
> > 
> > Sorry for the late reply it took me a bit of time to mull over the
> > patches.
> > 
> > Thanks,
> > John
> > 

Pulling in below because I think it was for me.

> I am trying to understand the resistance here. There are ingress/egress
> hooks for most of the layers - tc, netfilter, and even within bpf APIs.
> Clearly there is a need for this kind of symmetry across the APIs, so
> why the resistance or hesitation for XDP?

Because I don't see it as necessary and it adds another xdp interface. I
also didn't fully understand why it would be useful.

> 
> Stacking programs on the Rx side into the host was brought up 9
> revisions ago when the first patches went out. It makes for an
> unnecessarily complicated design and is antithetical to the whole
> Unix/Linux philosophy of small focused programs linked together to
> provide a solution.

I know it was brought up earlier and at the time the hook was also being
used for skbs. This sort of convinced me it was different from the tail
call example. Once skbs usage become impractical it seems like the
same datapath can be implemented with xdp+prog_array. As I understand
it this is still the case. The datapath could be implemented as a set
of xdp+prog_array hooks but the mgmt life-cycle is different and also
the mental model is a bit different. At least the mental model of the
BPF developer has to be different.

> Can you elaborate on your concerns?

Just understanding the use case.

My summary is the series gives us a few nice things: (a) it allows the
control plane to be simpler because programs will not need to be
explicitly garbage collected, (b) we don't have to guess a right size
for a program array map because we don't have to manage a map at all,
and (c) it helps the bpf programmer's mental model by using separate
attach points for each function.

I can see how (a) and (b) will be useful so no objections from my
side to merge the series.
Daniel Borkmann May 18, 2020, 9:06 p.m. UTC | #10
On 5/18/20 8:00 PM, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@gmail.com> writes:
>> On 5/18/20 3:08 AM, Toke Høiland-Jørgensen wrote:
[...]
>> Less powerful how? There are only so many operations you can do to a
>> packet. What do you want to do and what can't be done with this proposed
>> change? Why must it be done as XDP vs proper synergy between the 2 paths.
> 
> I meant 'less powerful' in the obvious sense: it only sees a subset of
> the packets going out of the interface. And so I worry that it will (a)
> make an already hard to use set of APIs even more confusing, and (b)
> turn out to not be enough so we'll end up needing a "real" egress hook.
> 
> As I said in my previous email, a post-REDIRECT hook may or may not be
> useful in its own right. I'm kinda on the fence about that, but am
> actually leaning towards it being useful; however, I am concerned that
> it'll end up being redundant if we do get a full egress hook.

I tend to agree with this. From a user point of view, say, one that has used
the ingress XDP path before, the expectation would very likely be that an XDP
"egress hook" would see all the traffic similarly as on the ingress side, but
since the skb path has been dropped in this revision - I agree with you, David,
that it makes sense to do so - calling it XDP "egress" then feels a bit misleading
wrt expectations. I'd assume we'd see a lot of confused users on this very list
asking why their BPF program doesn't trigger.

So given we neither call this hook on the skb path, nor XDP_TX nor AF_XDP's TX
path, I was wondering also wrt the discussion with John if it makes sense to
make this hook a property of the devmap _itself_, for example, to have a default
BPF prog upon devmap creation or a dev-specific override that is passed on map
update along with the dev. At least this would make it very clear where this is
logically tied to and triggered from, and if needed (?) would provide potentially
more flexibility on specifying BPF progs to be called while also solving your
use-case.

Thanks,
Daniel
Daniel Borkmann May 18, 2020, 9:23 p.m. UTC | #11
On 5/18/20 4:44 PM, David Ahern wrote:
> On 5/18/20 3:08 AM, Toke Høiland-Jørgensen wrote:
>> I can see your point that fixing up the whole skb after the program has
>> run is not a good idea. But to me that just indicates that the hook is
>> in the wrong place: that it really should be in the driver, executed at
>> a point where the skb data structure is no longer necessary (similar to
>> how the ingress hook is before the skb is generated).
> 
> Have you created a cls_bpf program to modify skbs? Have you looked at
> the helpers, the restrictions and the tight management of skb changes?
> Have you followed the skb from create to device handoff through the
> drivers? Have you looked at the history of encapsulations, gso handling,
> offloads, ...? I have and it drove home that the skb path and xdp paths
> are radically different. XDP is meant to be light and fast, and trying
> to cram an skb down the xdp path is a dead end.

Agree, it's already challenging in itself to abstract the skb internals and
protocol specifics away for tc BPF programs while keeping them reasonably
fast (e.g. not destroying skb GSO specifics, etc). A good example is the whole
bpf_skb_adjust_room() flags mess. :/ The buffer would have to be an XDP
one straight from socket layer and stay that way as an xdp-buff down to the
driver, not the other way around where you'd pay the price of back'n'forth
conversion to xdp-buff and then passing it to the driver while handling/
fixing up all the skb details after the BPF prog was run. AF_XDP's xmit would
be more suited for something like that.

Thanks,
Daniel
David Ahern May 18, 2020, 11:37 p.m. UTC | #12
On 5/18/20 12:00 PM, Toke Høiland-Jørgensen wrote:
> I meant 'less powerful' in the obvious sense: it only sees a subset of
> the packets going out of the interface. And so I worry that it will (a)
> make an already hard to use set of APIs even more confusing, and (b)
> turn out to not be enough so we'll end up needing a "real" egress hook.
> 
> As I said in my previous email, a post-REDIRECT hook may or may not be
> useful in its own right. I'm kinda on the fence about that, but am
> actually leaning towards it being useful; however, I am concerned that
> it'll end up being redundant if we do get a full egress hook.
> 

I made the changes to mlx5 to run programs in the driver back in early
March. I have looked at both i40e and mlx5 xmit functions all the way to
h/w handoff to get 2 vendor perspectives. With xdp I can push any header
I want - e.g., mpls - and as soon as I do the markers are wrong. Take a
look at mlx5e_sq_xmit and how it gets the transport header offset. Or
i40e_tso. Those markers are necessary for the offloads so there is no
'post skb' location to run a bpf program in the driver and have the
result be sane for hardware handoff.
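
To make the point concrete, the kind of program meant here is something like
the following sketch (perfectly legal XDP, but if it were run on an
skb-backed buffer it would leave skb->mac/network/transport header offsets
pointing at the wrong places):

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  SEC("xdp")
  int xdp_push_mpls(struct xdp_md *ctx)
  {
          struct ethhdr *new_eth, *old_eth;
          struct ethhdr copy;
          void *data, *data_end;
          __be32 *label;

          if (bpf_xdp_adjust_head(ctx, -4))       /* room for one MPLS label */
                  return XDP_DROP;

          data = (void *)(long)ctx->data;
          data_end = (void *)(long)ctx->data_end;
          if (data + ETH_HLEN + 4 > data_end)
                  return XDP_DROP;

          new_eth = data;
          old_eth = data + 4;

          /* move the ethernet header to the new start and retype it */
          __builtin_memcpy(&copy, old_eth, ETH_HLEN);
          copy.h_proto = bpf_htons(ETH_P_MPLS_UC);
          __builtin_memcpy(new_eth, &copy, ETH_HLEN);

          /* label 20, bottom of stack, TTL 64 */
          label = data + ETH_HLEN;
          *label = bpf_htonl((20 << 12) | (1 << 8) | 64);

          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";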

[ as an aside, a co-worker just happened to hit something like this
today (unrelated to xdp). He called dev_queue_xmit with a large,
manually crafted packet and no skb markers. Both the boxes (connected
back to back) had to be rebooted.]

From what I can see there are 3 ways to run an XDP program on skbs in
the Tx path:
1. disable hardware offloads (which is nonsense - you don't disable H/W
acceleration for S/W acceleration),

2. neuter XDP egress and not allow bpf_xdp_adjust_head (that is a key
feature of XDP), or

3. walk the skb afterwards and reset the markers (performance killer).

I have stared at this code for months; I would love for someone to prove
me wrong.
David Ahern May 18, 2020, 11:52 p.m. UTC | #13
On 5/18/20 12:10 PM, John Fastabend wrote:
>>
>> host ingress to VM is one use case; VM to VM on the same host is another.
> 
> But host ingress to VM would still work with tail calls because the XDP
> packet came from another XDP program. At least that is how I understand
> it.
> 
> VM to VM case, again using tail calls on the sending VM ingress hook
> would work also.

understood. I realize I can attach the program array all around, I just
see that as complex control plane / performance hit depending on how the
programs are wired up.

>>
>> With respect to lifecycle management of the programs and the data,
>> putting VM specific programs and maps on VM specific taps simplifies
>> management. VM terminates, taps are deleted, programs and maps
>> disappear. So no validator thread needed to handle stray data / programs
>> from the inevitable cleanup problems when everything is lumped into 1
>> program / map or even array of programs and maps.
> 
> OK. Also presumably you already have a hook into this event to insert
> the tc filter programs so its probably a natural hook for mgmt.

For VMs there is no reason to have an skb at all, so no tc filter program.

> 
>>
>> To me the distributed approach is the simplest and best. The program on
>> the host nics can be stupid simple; no packet parsing beyond the
>> ethernet header. Its job is just a traffic demuxer very much like a
>> switch. All VM logic and data is local to the VM's interfaces.
> 
> IMO it seems more natural and efficient to use a tail call. But, I
> can see how if the ingress program is a l2/l3 switch and the VM hook
> is a l2/l3 filter it feels more like a switch+firewall layout we
> would normally use on a "real" (v)switch. Also I think the above point
> where cleanup is free because of the tap tear down is a win.

Exactly. To the VM, the host is part of the network. The host should be
passing the packets as fast and as simply as possible from ingress nic
to vm. It can be done completely as xdp frames and doing so reduces the
CPU cycles per packet in the host (yes, there are caveats to that
statement).

VM to host nic, and VM to VM have their own challenges which need to be
tackled next.

But the end goal is to have all VM traffic touched by the host as xdp
frames and without creating a complex control plane. The distributed
approach is much simpler and cleaner - and seems to follow what Cilium
is doing to a degree, or that is my interpretation of

"By attaching to the TC ingress hook of the host side of this veth pair
Cilium can monitor and enforce policy on all traffic exiting a
container. By attaching a BPF program to the veth pair associated with
each container and routing all network traffic to the host side virtual
devices with another BPF program attached to the tc ingress hook as well
Cilium can monitor and enforce policy on all traffic entering or exiting
the node."

https://docs.cilium.io/en/v1.7/architecture/
David Ahern May 19, 2020, 12:02 a.m. UTC | #14
On 5/18/20 3:06 PM, Daniel Borkmann wrote:
> So given we neither call this hook on the skb path, nor XDP_TX nor
> AF_XDP's TX
> path, I was wondering also wrt the discussion with John if it makes
> sense to
> make this hook a property of the devmap _itself_, for example, to have a
> default
> BPF prog upon devmap creation or a dev-specific override that is passed
> on map
> update along with the dev. At least this would make it very clear where
> this is
> logically tied to and triggered from, and if needed (?) would provide
> potentially
> more flexibility on specifying BPF progs to be called while also
> solving your
> use-case.
> 

You lost me on the 'property of the devmap.' The programs need to be per
netdevice, and devmap is an array of devices. Can you elaborate?
John Fastabend May 19, 2020, 6:04 a.m. UTC | #15
David Ahern wrote:
> On 5/18/20 12:10 PM, John Fastabend wrote:
> >>
> >> host ingress to VM is one use case; VM to VM on the same host is another.
> > 
> > But host ingress to VM would still work with tail calls because the XDP
> > packet came from another XDP program. At least that is how I understand
> > it.
> > 
> > VM to VM case, again using tail calls on the sending VM ingress hook
> > would work also.
> 
> understood. I realize I can attach the program array all around, I just
> see that as complex control plane / performance hit depending on how the
> programs are wired up.
> 

Hard to argue without a specific program. I think it could go either way.
I'll concede the control plane might be more complex but not so convinced
about performance. Either way having a program attached to the life cycle
of the VM seems like something that would be nice to have. In the tc skb
case if we attach to a qdisc it is removed automatically when the device
is removed. Having something similar for xdp is probably a good thing.

Worth following up in Daniel's thread. Another way to do that instead of
having the program associated with the ifindex is to have it associated
with the devmap entry. Basically, when we add an entry in the devmap, if
we had a program fd associated with it, they could both be released when
the devmap entry is removed. This will happen automatically if the ifindex
is removed. But, rather than fragment threads too much I'll wait for
Daniel's reply.

> >>
> >> With respect to lifecycle management of the programs and the data,
> >> putting VM specific programs and maps on VM specific taps simplifies
> >> management. VM terminates, taps are deleted, programs and maps
> >> disappear. So no validator thread needed to handle stray data / programs
> >> from the inevitable cleanup problems when everything is lumped into 1
> >> program / map or even array of programs and maps.
> > 
> > OK. Also presumably you already have a hook into this event to insert
> > the tc filter programs so its probably a natural hook for mgmt.
> 
> For VMs there is no reason to have an skb at all, so no tc filter program.

+1 nice win for sure.

> 
> > 
> >>
> >> To me the distributed approach is the simplest and best. The program on
> >> the host nics can be stupid simple; no packet parsing beyond the
> >> ethernet header. Its job is just a traffic demuxer very much like a
> >> switch. All VM logic and data is local to the VM's interfaces.
> > 
> > IMO it seems more natural and efficient to use a tail call. But, I
> > can see how if the ingress program is a l2/l3 switch and the VM hook
> > is a l2/l3 filter it feels more like a switch+firewall layout we
> > would normally use on a "real" (v)switch. Also I think the above point
> > where cleanup is free because of the tap tear down is a win.
> 
> Exactly. To the VM, the host is part of the network. The host should be
> passing the packets as fast and as simply as possible from ingress nic
> to vm. It can be done completely as xdp frames and doing so reduces the
> CPU cycles per packet in the host (yes, there are caveats to that
> statement).
> 
> VM to host nic, and VM to VM have their own challenges which need to be
> tackled next.
> 
> But the end goal is to have all VM traffic touched by the host as xdp
> frames and without creating a complex control plane. The distributed
> approach is much simpler and cleaner - and seems to follow what Cilium
> is doing to a degree, or that is my interpretation of

+1 agree everything as xdp pkt is a great goal.

> 
> "By attaching to the TC ingress hook of the host side of this veth pair
> Cilium can monitor and enforce policy on all traffic exiting a
> container. By attaching a BPF program to the veth pair associated with
> each container and routing all network traffic to the host side virtual
> devices with another BPF program attached to the tc ingress hook as well
> Cilium can monitor and enforce policy on all traffic entering or exiting
> the node."
> 
> https://docs.cilium.io/en/v1.7/architecture/

In many configurations there are no egress hooks though because policy (the
firewall piece) is implemented as part of the ingress hook. Because the
ingress TC hook "knows" where it will redirect a packet it can also run
the policy logic for that pod/VM/etc.
Daniel Borkmann May 19, 2020, 1:31 p.m. UTC | #16
On 5/19/20 2:02 AM, David Ahern wrote:
> On 5/18/20 3:06 PM, Daniel Borkmann wrote:
>> So given we neither call this hook on the skb path, nor XDP_TX nor
>> AF_XDP's TX
>> path, I was wondering also wrt the discussion with John if it makes
>> sense to
>> make this hook a property of the devmap _itself_, for example, to have a
>> default
>> BPF prog upon devmap creation or a dev-specific override that is passed
>> on map
>> update along with the dev. At least this would make it very clear where
>> this is
>> logically tied to and triggered from, and if needed (?) would provide
>> potentially
>> more flexibility on specifiying BPF progs to be called while also
>> solving your
>> use-case.
> 
> You lost me on the 'property of the devmap.' The programs need to be per
> netdevice, and devmap is an array of devices. Can you elaborate?

I meant that the dev{map,hash} would get extended in a way where the
__dev_map_update_elem() receives an (ifindex, BPF prog fd) tuple from
user space and holds the program's ref as long as it is in the map slot.
Then, upon redirect to the given device in the devmap, we'd execute the
prog as well in order to also allow for XDP_DROP policy in there. Upon
map update when we drop the dev from the map slot, we also release the
reference to the associated BPF prog. What I mean to say wrt 'property
of the devmap' is that this program is _only_ used in combination with
redirection to devmap, so given we are not solving all the other egress
cases for reasons mentioned, it would make sense to tie it logically to
the devmap which would also make it clear from a user perspective _when_
the prog is expected to run.
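
Roughly, and purely as a sketch (struct and field names made up here, not
settled), the user-space side could then look like:

  #include <linux/types.h>
  #include <bpf/bpf.h>

  /* hypothetical devmap value: the device plus an optional prog to run
   * before ndo_xdp_xmit on redirect to this slot
   */
  struct devmap_val {
          __u32 ifindex;
          int   prog_fd;          /* -1: no egress prog for this slot */
  };

  static int devmap_set_egress(int devmap_fd, __u32 key,
                               __u32 tx_ifindex, int egress_prog_fd)
  {
          struct devmap_val val = {
                  .ifindex = tx_ifindex,
                  .prog_fd = egress_prog_fd,
          };

          /* deleting or overwriting the slot later also drops the prog ref */
          return bpf_map_update_elem(devmap_fd, &key, &val, BPF_ANY);
  }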

Thanks,
Daniel
Jesper Dangaard Brouer May 19, 2020, 2:21 p.m. UTC | #17
On Tue, 19 May 2020 15:31:20 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 5/19/20 2:02 AM, David Ahern wrote:
> > On 5/18/20 3:06 PM, Daniel Borkmann wrote:  
> >>
> >> So given we neither call this hook on the skb path, nor XDP_TX nor
> >> AF_XDP's TX path, I was wondering also wrt the discussion with
> >> John if it makes sense to make this hook a property of the devmap
> >> _itself_, for example, to have a default BPF prog upon devmap
> >> creation or a dev-specific override that is passed on map update
> >> along with the dev. At least this would make it very clear where
> >> this is logically tied to and triggered from, and if needed (?)
> >> would provide potentially more flexibility on specifying BPF
> >> progs to be called while also solving your use-case.  
> > 
> > You lost me on the 'property of the devmap.' The programs need to be per
> > netdevice, and devmap is an array of devices. Can you elaborate?  
> 
> I meant that the dev{map,hash} would get extended in a way where the
> __dev_map_update_elem() receives an (ifindex, BPF prog fd) tuple from
> user space and holds the program's ref as long as it is in the map slot.
> Then, upon redirect to the given device in the devmap, we'd execute the
> prog as well in order to also allow for XDP_DROP policy in there. Upon
> map update when we drop the dev from the map slot, we also release the
> reference to the associated BPF prog. What I mean to say wrt 'property
> of the devmap' is that this program is _only_ used in combination with
> redirection to devmap, so given we are not solving all the other egress
> cases for reasons mentioned, it would make sense to tie it logically to
> the devmap which would also make it clear from a user perspective _when_
> the prog is expected to run.

Yes, I agree.

I also have a use-case for 'cpumap' (cc. Lorenzo as I asked him to
work on it).  We want to run another XDP program on the CPU that
receives the xdp_frame, and then allow it to XDP redirect again.
It would make a lot of sense to attach this XDP program via inserting
a BPF prog fd into the map as a value.

Notice that we would also need another expected-attach-type for this
case, as we want to allow the XDP program to read xdp_md->ingress_ifindex,
but we don't have xdp_rxq_info any longer. Thus, we need to remap that
to xdp_frame->dev_rx->ifindex (instead of rxq->dev->ifindex).

The practical use-case is the espressobin mvneta based ARM64 board,
which can only receive IRQs + RX-frames on CPU-0, but the hardware has more
TX-queues that we would like to take advantage of on both CPUs.
Toke Høiland-Jørgensen May 19, 2020, 2:52 p.m. UTC | #18
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 5/19/20 2:02 AM, David Ahern wrote:
>> On 5/18/20 3:06 PM, Daniel Borkmann wrote:
>>> So given we neither call this hook on the skb path, nor XDP_TX nor
>>> AF_XDP's TX
>>> path, I was wondering also wrt the discussion with John if it makes
>>> sense to
>>> make this hook a property of the devmap _itself_, for example, to have a
>>> default
>>> BPF prog upon devmap creation or a dev-specific override that is passed
>>> on map
>>> update along with the dev. At least this would make it very clear where
>>> this is
>>> logically tied to and triggered from, and if needed (?) would provide
>>> potentially
>>> more flexibility on specifying BPF progs to be called while also
>>> solving your
>>> use-case.
>> 
>> You lost me on the 'property of the devmap.' The programs need to be per
>> netdevice, and devmap is an array of devices. Can you elaborate?
>
> I meant that the dev{map,hash} would get extended in a way where the
> __dev_map_update_elem() receives an (ifindex, BPF prog fd) tuple from
> user space and holds the program's ref as long as it is in the map slot.
> Then, upon redirect to the given device in the devmap, we'd execute the
> prog as well in order to also allow for XDP_DROP policy in there. Upon
> map update when we drop the dev from the map slot, we also release the
> reference to the associated BPF prog. What I mean to say wrt 'property
> of the devmap' is that this program is _only_ used in combination with
> redirection to devmap, so given we are not solving all the other egress
> cases for reasons mentioned, it would make sense to tie it logically to
> the devmap which would also make it clear from a user perspective _when_
> the prog is expected to run.

I would be totally on board with this. Also makes sense for the
multicast map type, if you want to fix up the packet after the redirect,
just stick the fixer-upper program into the map along with the ifindex.

-Toke
David Ahern May 19, 2020, 4:37 p.m. UTC | #19
On 5/19/20 7:31 AM, Daniel Borkmann wrote:
> I meant that the dev{map,hash} would get extended in a way where the
> __dev_map_update_elem() receives an (ifindex, BPF prog fd) tuple from
> user space and holds the program's ref as long as it is in the map slot.
> Then, upon redirect to the given device in the devmap, we'd execute the
> prog as well in order to also allow for XDP_DROP policy in there. Upon
> map update when we drop the dev from the map slot, we also release the
> reference to the associated BPF prog. What I mean to say wrt 'property
> of the devmap' is that this program is _only_ used in combination with
> redirection to devmap, so given we are not solving all the other egress
> cases for reasons mentioned, it would make sense to tie it logically to
> the devmap which would also make it clear from a user perspective _when_
> the prog is expected to run.

Thanks. I will take a look at this.
Lorenzo Bianconi May 19, 2020, 4:58 p.m. UTC | #20
> On Tue, 19 May 2020 15:31:20 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
> > On 5/19/20 2:02 AM, David Ahern wrote:
> > > On 5/18/20 3:06 PM, Daniel Borkmann wrote:  
> > >>
> > >> So given we neither call this hook on the skb path, nor XDP_TX nor
> > >> AF_XDP's TX path, I was wondering also wrt the discussion with
> > >> John if it makes sense to make this hook a property of the devmap
> > >> _itself_, for example, to have a default BPF prog upon devmap
> > >> creation or a dev-specific override that is passed on map update
> > >> along with the dev. At least this would make it very clear where
> > >> this is logically tied to and triggered from, and if needed (?)
> > >> would provide potentially more flexibility on specifying BPF
> > >> progs to be called while also solving your use-case.  
> > > 
> > > You lost me on the 'property of the devmap.' The programs need to be per
> > > netdevice, and devmap is an array of devices. Can you elaborate?  
> > 
> > I meant that the dev{map,hash} would get extended in a way where the
> > __dev_map_update_elem() receives an (ifindex, BPF prog fd) tuple from
> > user space and holds the program's ref as long as it is in the map slot.
> > Then, upon redirect to the given device in the devmap, we'd execute the
> > prog as well in order to also allow for XDP_DROP policy in there. Upon
> > map update when we drop the dev from the map slot, we also release the
> > reference to the associated BPF prog. What I mean to say wrt 'property
> > of the devmap' is that this program is _only_ used in combination with
> > redirection to devmap, so given we are not solving all the other egress
> > cases for reasons mentioned, it would make sense to tie it logically to
> > the devmap which would also make it clear from a user perspective _when_
> > the prog is expected to run.
> 
> Yes, I agree.
> 
> I also have a use-case for 'cpumap' (cc. Lorenzo as I asked him to
> work on it).  We want to run another XDP program on the CPU that
> receives the xdp_frame, and then allow it to XDP redirect again.
> It would make a lot of sense, to attach this XDP program via inserting
> an BPF-prog-fd into the map as a value.
> 
> Notice that we would also need another expected-attach-type for this
> case, as we want to allow XDP program to read xdp_md->ingress_ifindex,
> but we don't have xdp_rxq_info any-longer. Thus, we need to remap that
> to xdp_frame->dev_rx->ifindex (instead of rxq->dev->ifindex).

Here I am looking at how we can extend cpumaps in order to pass the qsize
and a bpf program file descriptor from userspace when adding an element
to the map, and to allow cpu_map_update_elem() to load the program (e.g.
similar to dev_change_xdp_fd()).
Doing so we can have an approach similar to the veth xdp implementation.
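
As a sketch of the value layout being considered (names are placeholders
only):

  #include <linux/types.h>
  #include <bpf/bpf.h>

  /* hypothetical cpumap value: today's qsize plus an optional prog fd
   * that cpu_map_update_elem() would load
   */
  struct cpumap_val {
          __u32 qsize;            /* ptr_ring size, as today */
          int   prog_fd;          /* XDP prog to run on the remote CPU, or -1 */
  };

  static int cpumap_set_prog(int cpumap_fd, __u32 cpu, __u32 qsize, int prog_fd)
  {
          struct cpumap_val val = { .qsize = qsize, .prog_fd = prog_fd };

          return bpf_map_update_elem(cpumap_fd, &cpu, &val, BPF_ANY);
  }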

Regards,
Lorenzo

> 
> The practical use-case is the espressobin mvneta based ARM64 board,
> which can only receive IRQs + RX-frames on CPU-0, but the hardware has more
> TX-queues that we would like to take advantage of on both CPUs.
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>