[PATCHv7,bpf-next,0/3] xdp: add a new helper for dev map multicast support

Message ID 20200714063257.1694964-1-liuhangbin@gmail.com

Message

Hangbin Liu July 14, 2020, 6:32 a.m. UTC
This patch series adds XDP multicast support, which has been discussed before[0].
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I implement a new helper, bpf_redirect_map_multi(),
which accepts two maps: the forwarding map and the exclude map. If users
don't want to use an exclude map and simply want to stop redirecting back
to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.
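
As a rough illustration (this is not the actual sample code in patch 2, see
samples/bpf/xdp_redirect_map_multi_kern.c for that), an XDP program using the
new helper could look like the sketch below. It assumes the updated uapi
header from this series for BPF_FUNC_redirect_map_multi and
BPF_F_EXCLUDE_INGRESS; the map sizes and program/map names are arbitrary.

// SPDX-License-Identifier: GPL-2.0
/* Minimal usage sketch of the proposed bpf_redirect_map_multi() helper. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Proposed helper signature from this series; with the series applied it
 * would normally come from the regenerated bpf_helper_defs.h instead of
 * being declared by hand here. */
static long (*bpf_redirect_map_multi)(void *map, void *ex_map, __u64 flags) =
	(void *)BPF_FUNC_redirect_map_multi;

/* Group A: devices to forward to. */
struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 32);
} forward_map SEC(".maps");

/* Group B: devices to exclude (restricted to DEVMAP_HASH as of v7). */
struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 32);
} exclude_map SEC(".maps");

/* Forward to every device in forward_map that is not in exclude_map. */
SEC("xdp")
int xdp_redirect_multi(struct xdp_md *ctx)
{
	return bpf_redirect_map_multi(&forward_map, &exclude_map, 0);
}

/* No exclude map: forward to all of forward_map except the ingress device. */
SEC("xdp")
int xdp_redirect_multi_exclude_ingress(struct xdp_md *ctx)
{
	return bpf_redirect_map_multi(&forward_map, NULL, BPF_F_EXCLUDE_INGRESS);
}

char _license[] SEC("license") = "GPL";

User space would then populate both maps with bpf_map_update_elem(), e.g.
using the ifindex as both key and value, and attach the program to the
ingress device; the sample's user-space loader in patch 2 does essentially
this.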

The 2nd and 3rd patches are a usage sample and a selftest, so no effort
has been made on performance optimisation. I ran the same tests with
pktgen (packet size 64) to compare with xdp_redirect_map(). Here are the
test results (the veth peer has a dummy XDP program that simply returns XDP_DROP):

Version         | Test                                   | Native | Generic
5.8 rc1         | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
5.8 rc1         | xdp_redirect_map       i40e->veth      |  12.7M |   1.6M
5.8 rc1 + patch | xdp_redirect_map       i40e->i40e      |  10.0M |   1.9M
5.8 rc1 + patch | xdp_redirect_map       i40e->veth      |  12.3M |   1.6M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e      |   7.2M |   1.5M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->veth      |   8.5M |   1.3M
5.8 rc1 + patch | xdp_redirect_map_multi i40e->i40e+veth |   3.0M |  0.98M

bpf_redirect_map_multi() is slower than bpf_redirect_map() because we loop
over the maps and clone the skb/xdpf. The native path is slower than the
generic path because we send skbs with pktgen. So the results look reasonable.
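
For reference, the dummy program on the veth peer mentioned above is
essentially just an unconditional drop; a minimal sketch (the program and
section names are illustrative):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Drop every frame immediately so the veth peer does not become the
 * bottleneck when measuring the redirect path. */
SEC("xdp")
int xdp_drop_prog(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";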

Last but not least, thanks a lot to Jiri, Eelco, Toke and Jesper for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v7: Fix the helper flag check.
    Limit *ex_map* to DEVMAP_HASH only and update dev_in_exclude_map()
    to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume the xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) In dev_map_enqueue_multi(), go on to the next interface if one is not
   able to forward, instead of aborting the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add a new helper bpf_xdp_redirect_map_multi() which accepts
include/exclude maps directly.

Hangbin Liu (3):
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: add xdp_redirect_multi test

 include/linux/bpf.h                           |  20 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  26 +++
 kernel/bpf/devmap.c                           | 140 ++++++++++++++
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             | 111 ++++++++++-
 net/core/xdp.c                                |  29 +++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  57 ++++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 166 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  26 +++
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       |  90 +++++++++
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 164 +++++++++++++++++
 .../selftests/bpf/xdp_redirect_multi.c        | 173 ++++++++++++++++++
 16 files changed, 1011 insertions(+), 6 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

Comments

Toke Høiland-Jørgensen July 14, 2020, 12:29 p.m. UTC | #1
Hangbin Liu <liuhangbin@gmail.com> writes:

> [...]
>
> v7: Fix helper flag check
>     Limit the *ex_map* to use DEVMAP_HASH only and update function
>     dev_in_exclude_map() to get better performance.

Did it help? The performance numbers in the table above are the same as
in v6...

-Toke
David Ahern July 14, 2020, 5:12 p.m. UTC | #2
On 7/14/20 6:29 AM, Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:
> 
>> [...]
>>
>> v7: Fix helper flag check
>>     Limit the *ex_map* to use DEVMAP_HASH only and update function
>>     dev_in_exclude_map() to get better performance.
> 
> Did it help? The performance numbers in the table above are the same as
> in v6...
> 

If there is only 1 entry in the exclude map, then the numbers should be
about the same.
Toke Høiland-Jørgensen July 14, 2020, 9:53 p.m. UTC | #3
David Ahern <dsahern@gmail.com> writes:

> On 7/14/20 6:29 AM, Toke Høiland-Jørgensen wrote:
>> Hangbin Liu <liuhangbin@gmail.com> writes:
>> 
>>> [...]
>>>
>>> v7: Fix helper flag check
>>>     Limit the *ex_map* to use DEVMAP_HASH only and update function
>>>     dev_in_exclude_map() to get better performance.
>> 
>> Did it help? The performance numbers in the table above are the same as
>> in v6...
>> 
>
> If there is only 1 entry in the exclude map, then the numbers should be
> about the same.

I would still expect the lack of the calls to devmap_get_next_key() to
at least provide a small speedup, no? That the numbers are completely
unchanged looks a bit suspicious...

-Toke
Hangbin Liu July 15, 2020, 3:45 a.m. UTC | #4
On Tue, Jul 14, 2020 at 11:12:59AM -0600, David Ahern wrote:
> >> [...]
> >>
> >> v7: Fix helper flag check
> >>     Limit the *ex_map* to use DEVMAP_HASH only and update function
> >>     dev_in_exclude_map() to get better performance.
> > 
> > Did it help? The performance numbers in the table above are the same as
> > in v6...
> > 
> 
> If there is only 1 entry in the exclude map, then the numbers should be
> about the same.

Yes, I didn't re-run the test, because when doing the testing I used a null
exclude map + the BPF_F_EXCLUDE_INGRESS flag. So the perf numbers should be no
different from the last patch.

Thanks
Hangbin
Hangbin Liu July 15, 2020, 12:31 p.m. UTC | #5
On Tue, Jul 14, 2020 at 11:53:20PM +0200, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@gmail.com> writes:
> >>> [...]
> >>>
> >>> v7: Fix helper flag check
> >>>     Limit the *ex_map* to use DEVMAP_HASH only and update function
> >>>     dev_in_exclude_map() to get better performance.
> >> 
> >> Did it help? The performance numbers in the table above are the same as
> >> in v6...
> >> 
> >
> > If there is only 1 entry in the exclude map, then the numbers should be
> > about the same.
> 
> I would still expect the lack of the calls to devmap_get_next_key() to
> at least provide a small speedup, no? That the numbers are completely
> unchanged looks a bit suspicious...

As I replied to David, I didn't re-run the test, as I thought there should
not be much difference since the exclude map only has 1 entry.

There should be a small speedup compared with the previous patch. But as the
test system was re-installed and rebooted, there will be some jitter in the
test results, so it would be a little hard to observe the improvement.

Thanks
Hangbin