Message ID | 20200709013008.3900892-1-liuhangbin@gmail.com |
---|---|
Series | xdp: add a new helper for dev map multicast support |
On 7/9/20 3:30 AM, Hangbin Liu wrote:
> This patch is for xdp multicast support, which has been discussed before[0].
> The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
> a software switch that can forward XDP frames to multiple ports.
>
> To achieve this, an application needs to specify a group of interfaces
> to forward a packet to. It is also common to want to exclude one or more
> physical interfaces from the forwarding operation - e.g., to forward a
> packet to all interfaces in the multicast group except the interface it
> arrived on. While this could be done simply by adding more groups, this
> quickly leads to a combinatorial explosion in the number of groups an
> application has to maintain.
>
> To avoid the combinatorial explosion, we propose to include the ability
> to specify an "exclude group" as part of the forwarding operation. This
> needs to be a group (instead of just a single port index), because a
> physical interface can be part of a logical grouping, such as a bond
> device.
>
> Thus, the logical forwarding operation becomes a "set difference"
> operation, i.e. "forward to all ports in group A that are not also in
> group B". This series implements such an operation using device maps to
> represent the groups. This means that the XDP program specifies two
> device maps, one containing the list of netdevs to redirect to, and the
> other containing the exclude list.

Could you move this description into patch 1/3 instead of the cover
letter? Mostly given this helps understanding the rationale wrt the exclusion
map, which is otherwise lacking from just looking at the patch itself.

Assuming you have a bond, how does this look in practice for your mentioned
ovs-like data plane in XDP? The map for 'group A' is shared among all XDP
progs and the map for 'group B' is managed per prog? The BPF_F_EXCLUDE_INGRESS
is clear, but how would this look wrt forwarding from a phys dev /to/ the
bond iface w/ XDP?

Also, what about tc BPF helper support for the case where not every device
might have native XDP (but they could still share the maps)?

> To achieve this, I re-implement a new helper bpf_redirect_map_multi()
> to accept two maps, the forwarding map and the exclude map. If users
> don't want to use an exclude map and simply want to stop redirecting back
> to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.
>
> The 2nd and 3rd patches are for usage sample and testing purposes, so
> no effort has been made on performance optimisation. I ran the same tests
> with pktgen (pkt size 64) to compare with xdp_redirect_map(). Here is the
> test result (the veth peer has a dummy xdp program with XDP_DROP directly):
>
> Version         | Test                                   | Native | Generic
> 5.8 rc1         | xdp_redirect_map i40e->i40e            | 10.0M  | 1.9M
> 5.8 rc1         | xdp_redirect_map i40e->veth            | 12.7M  | 1.6M
> 5.8 rc1 + patch | xdp_redirect_map i40e->i40e            | 10.0M  | 1.9M
> 5.8 rc1 + patch | xdp_redirect_map i40e->veth            | 12.3M  | 1.6M
> 5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e      | 7.2M   | 1.5M
> 5.8 rc1 + patch | xdp_redirect_map_multi i40e->veth      | 8.5M   | 1.3M
> 5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e+veth | 3.0M   | 0.98M
>
> The bpf_redirect_map_multi() is slower than bpf_redirect_map() as we loop
> the arrays and do clone skb/xdpf. The native path is slower than generic
> path as we send skbs by pktgen. So the result looks reasonable.
>
> Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
> suggestions and help on implementation.
>
> [0] https://xdp-project.net/#Handling-multicast
>
> v6: converted helper return types from int to long
>
> v5:
> a) Check devmap_get_next_key() return value.
> b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
> c) In function dev_map_enqueue_multi(), consume xdpf for the last
>    obj instead of the first one.
> d) Update helper description and code comments to explain that we
>    use NULL target value to distinguish multicast and unicast
>    forwarding.
> e) Update memory model, memory id and frame_sz in xdpf_clone().
> f) Split the tests from sample and add a bpf kernel selftest patch.
>
> v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo
>
> v3: Based on Toke's suggestion, do the following update
> a) Update bpf_redirect_map_multi() description in bpf.h.
> b) Fix exclude_ifindex checking order in dev_in_exclude_map().
> c) Fix one more xdpf clone in dev_map_enqueue_multi().
> d) Go find next one in dev_map_enqueue_multi() if the interface is not
>    able to forward instead of aborting the whole loop.
> e) Remove READ_ONCE/WRITE_ONCE for ex_map.
>
> v2: Add new helper bpf_xdp_redirect_map_multi() which could accept
>     include/exclude maps directly.
>
> Hangbin Liu (3):
>   xdp: add a new helper for dev map multicast support
>   sample/bpf: add xdp_redirect_map_multicast test
>   selftests/bpf: add xdp_redirect_multi test
>
>  include/linux/bpf.h                          |  20 ++
>  include/linux/filter.h                       |   1 +
>  include/net/xdp.h                            |   1 +
>  include/uapi/linux/bpf.h                     |  22 +++
>  kernel/bpf/devmap.c                          | 154 ++++++++++++++++
>  kernel/bpf/verifier.c                        |   6 +
>  net/core/filter.c                            | 109 ++++++++++-
>  net/core/xdp.c                               |  29 +++
>  samples/bpf/Makefile                         |   3 +
>  samples/bpf/xdp_redirect_map_multi_kern.c    |  57 ++++++
>  samples/bpf/xdp_redirect_map_multi_user.c    | 166 +++++++++++++++++
>  tools/include/uapi/linux/bpf.h               |  22 +++
>  tools/testing/selftests/bpf/Makefile         |   4 +-
>  .../bpf/progs/xdp_redirect_multi_kern.c      |  90 +++++++++
>  .../selftests/bpf/test_xdp_redirect_multi.sh | 164 +++++++++++++++++
>  .../selftests/bpf/xdp_redirect_multi.c       | 173 ++++++++++++++++++
>  16 files changed, 1015 insertions(+), 6 deletions(-)
>  create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
>  create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
>  create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
>  create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c
>
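
The "set difference" forwarding described in the cover letter can be modelled in a few lines of plain C. This is only a userspace sketch under stated assumptions, not the kernel implementation from the patch: `struct dev_group`, `group_contains()` and `multicast_targets()` are hypothetical names, device groups are modelled as plain arrays of ifindexes rather than device maps, and the `exclude_ingress` argument stands in for the BPF_F_EXCLUDE_INGRESS flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device group: a list of ifindexes. */
struct dev_group {
	const int *ifindex;
	size_t len;
};

/* Return true if ifindex is a member of g; a NULL group is empty. */
static bool group_contains(const struct dev_group *g, int ifindex)
{
	if (!g)
		return false;
	for (size_t i = 0; i < g->len; i++)
		if (g->ifindex[i] == ifindex)
			return true;
	return false;
}

/*
 * Compute "all ports in include that are not also in exclude".
 * If exclude_ingress is set, the ingress port is skipped as well,
 * mirroring the intent of BPF_F_EXCLUDE_INGRESS.
 * Writes at most out_cap targets to out and returns how many were written.
 */
static size_t multicast_targets(const struct dev_group *include,
				const struct dev_group *exclude,
				int ingress_ifindex, bool exclude_ingress,
				int *out, size_t out_cap)
{
	size_t n = 0;

	assert(include && out);
	for (size_t i = 0; i < include->len && n < out_cap; i++) {
		int dev = include->ifindex[i];

		if (exclude_ingress && dev == ingress_ifindex)
			continue;
		if (group_contains(exclude, dev))
			continue;
		out[n++] = dev;
	}
	return n;
}
```

With an include group of {1, 2, 3, 4}, an exclude group of {3} and a packet arriving on ifindex 1 with `exclude_ingress` set, the sketch yields targets 2 and 4. Passing NULL as the exclude group corresponds to the "null map + BPF_F_EXCLUDE_INGRESS" case.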
On Fri, Jul 10, 2020 at 12:37:59AM +0200, Daniel Borkmann wrote:
> On 7/9/20 3:30 AM, Hangbin Liu wrote:
> > [...]
> > Thus, the logical forwarding operation becomes a "set difference"
> > operation, i.e. "forward to all ports in group A that are not also in
> > group B". This series implements such an operation using device maps to
> > represent the groups. This means that the XDP program specifies two
> > device maps, one containing the list of netdevs to redirect to, and the
> > other containing the exclude list.
>
> Could you move this description as part of patch 1/3 instead of cover
> letter? Mostly given this helps understanding the rationale wrt exclusion
> map which is otherwise lacking from just looking at the patch itself.

OK, I will.

> Assuming you have a bond, how does this look in practice for your mentioned
> ovs-like data plane in XDP? The map for 'group A' is shared among all XDP
> progs and the map for 'group B' is managed per prog? The BPF_F_EXCLUDE_INGRESS

Yes, kind of, since we have two maps as parameters. The 'group A map' (include map)
will be shared between the interfaces in the same group/vlan. The 'group B map'
(exclude map) is interface specific. Each interface will hold its own exclude map.

Since most of the time each interface only excludes itself, a null map +
BPF_F_EXCLUDE_INGRESS should be enough.

For the bond situation, e.g. an active-backup bond0 with eth1 + eth2 as slaves:
if eth1 is the active interface, we can add eth2 to the exclude map.

> is clear, but how would this look wrt forwarding from a phys dev /to/ the
> bond iface w/ XDP?

As the bond interface doesn't support native XDP, this forwarding only works for
physical slave interfaces.

For generic xdp, maybe we can forward to the bond interface directly, but I
haven't tried.

> Also, what about tc BPF helper support for the case where not every device
> might have native XDP (but they could still share the maps)?

I haven't tried tc BPF. This helper works for both generic and native xdp
forwarding. I think it should also work if we load the prog in native
xdp mode on one interface and generic xdp mode on another interface,
shouldn't it?

Thanks
Hangbin
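
The active-backup case mentioned above (eth1 active, eth2 placed in the exclude map) amounts to "every slave of the bond except the active one". A minimal sketch of how a control-plane program might compute that list is given here. It is plain userspace C under stated assumptions: `bond_exclude_list()` is a hypothetical name, the ifindex values are made up, and actually writing the result into a device map is left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build the exclude list for an active-backup bond: every slave
 * except the currently active one. Writes at most cap entries to
 * exclude and returns how many were written.
 */
static size_t bond_exclude_list(const int *slaves, size_t n_slaves,
				int active_ifindex,
				int *exclude, size_t cap)
{
	size_t n = 0;

	assert(slaves && exclude);
	for (size_t i = 0; i < n_slaves && n < cap; i++)
		if (slaves[i] != active_ifindex)
			exclude[n++] = slaves[i];
	return n;
}
```

For slaves {3, 4} with ifindex 3 active, the list contains only 4. The list would need to be recomputed whenever the active slave changes.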
On 7/10/20 9:36 AM, Hangbin Liu wrote:
> On Fri, Jul 10, 2020 at 12:37:59AM +0200, Daniel Borkmann wrote:
>> On 7/9/20 3:30 AM, Hangbin Liu wrote:
>>> [...]
>>
>> Could you move this description as part of patch 1/3 instead of cover
>> letter? Mostly given this helps understanding the rationale wrt exclusion
>> map which is otherwise lacking from just looking at the patch itself.
>
> OK, I will
>
>> Assuming you have a bond, how does this look in practice for your mentioned
>> ovs-like data plane in XDP? The map for 'group A' is shared among all XDP
>> progs and the map for 'group B' is managed per prog? The BPF_F_EXCLUDE_INGRESS
>
> Yes, kind of. Since we have two maps as parameter. The 'group A map'(include map)
> will be shared between the interfaces in same group/vlan. The 'group B map'
> (exclude map) is interface specific. Each interface will hold it's own exclude map.
>
> As most time each interface only exclude itself, a null map + BPF_F_EXCLUDE_INGRESS
> should be enough.
>
> For bond situation. e.g. A active-backup bond0 with eth1 + eth2 as slaves.
> If eth1 is active interface, we can add eth2 to the exclude map.

Right, but what about the other direction, where one device forwards to a bond?
Presumably eth1 + eth2 are in the include map and shared also between other
ifaces? Given the logic for the bond mode is on bond0, so one layer higher, how
do you determine which of eth1 + eth2 to send to in the BPF prog? A daemon
listening for link events via arp or mii monitor that then updates the include
map? Ideally it would be nice to have some sort of a bond0 pass-through for the
XDP buffer so it eventually ends up at one of the two through the native logic,
e.g. what do you do when it's configured in xor mode or when the slave dev is
selected via hash or some other user logic (e.g. via team driver); how would
this be modeled via the inclusion map? I guess the issue can be regarded
independently of this set, but given you explicitly mention bond here as a use
case for the exclusion map, I was wondering how you solve the inclusion one for
bond devices for your data plane?

>> is clear, but how would this look wrt forwarding from a phys dev /to/ the
>> bond iface w/ XDP?
>
> As bond interface doesn't support native XDP, This forwarding only works for
> physical slave interfaces.
>
> For generic xdp, maybe we can forward to bond interface directly, but I
> haven't tried.
>
>> Also, what about tc BPF helper support for the case where not every device
>> might have native XDP (but they could still share the maps)?
>
> I haven't tried tc BPF. This helper works for both generic and native xdp
> forwarding. I think it should also works if we load the prog with native
> xdp mode in one interface and generic xdp mode in another interface, couldn't
> we?

Yes, that would work, though generic XDP comes with its own set of issues, but
presumably this sort of traffic could be considered slow-path anyway.

Thanks,
Daniel
On 7/10/20 9:02 AM, Daniel Borkmann wrote:
> Right, but what about the other direction where one device forwards to a bond,
> presumably eth1 + eth2 are in the include map and shared also between other
> ifaces? Given the logic for the bond mode is on bond0, so one layer higher, how
> do you determine which of eth1 + eth2 to send to in the BPF prog? Daemon listening
> for link events via arp or mii monitor and then update include map? Ideally would
> be nice to have some sort of a bond0 pass-through for the XDP buffer so it ends
> up eventually at one of the two through the native logic, e.g. what do you do when
> it's configured in xor mode or when slave dev is selected via hash or some other
> user logic (e.g. via team driver); how would this be modeled via inclusion map? I
> guess the issue can be regarded independently to this set, but given you mention
> explicitly bond here as a use case for the exclusion map, I was wondering how you
> solve the inclusion one for bond devices for your data plane?

The bond driver does not support xdp_xmit, and I do not believe there is a good
ROI for adapting it to handle xdp buffers.

For round robin and active-backup modes it is straightforward to adapt the
new ndo_get_xmit_slave to work with ebpf. That is not the case for any of the
modes that use a hash on the skb. E.g., for L3+L4 hashing I found it easier to
replicate the algorithm in bpf than to try to adapt the bond code to work with
XDP buffers. I put that in the category of 'XDP is advanced networking that
requires unraveling the generic for a specific deployment.'

In short, for bonds and Tx the bpf program needs to pick the slave device.
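
To illustrate what "replicate the algorithm in bpf" could look like for L3+L4 hashing, here is a rough sketch of the slave selection step written as plain userspace C. It follows the layer3+4 formula as given in the kernel bonding documentation (ports combined, XORed with the addresses, folded by 16 and 8 bits, then reduced modulo the slave count). Everything here is an assumption for illustration: `struct flow4`, `l34_hash()` and `pick_slave()` are hypothetical names, fields are taken in host byte order for simplicity, and the hash used by a given kernel version may differ, so the result is not guaranteed to match the bond driver's own choice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IPv4 flow key; all fields in host byte order. */
struct flow4 {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* layer3+4 style hash: ports, XOR addresses, then fold the upper bits down. */
static uint32_t l34_hash(const struct flow4 *f)
{
	uint32_t hash = ((uint32_t)f->sport << 16) | f->dport;

	hash ^= f->saddr ^ f->daddr;
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash;
}

/* Pick one slave ifindex for the flow by reducing the hash modulo n. */
static int pick_slave(const struct flow4 *f, const int *slaves, uint32_t n)
{
	assert(n > 0);
	return slaves[l34_hash(f) % n];
}
```

Because the choice depends only on the flow key, all packets of one flow go to the same slave. In an XDP program the selected ifindex would then be the redirect target instead of the bond device itself.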