mbox series

[ovs-dev,v3,00/11] netdev-doca

Message ID 20260401091318.2671624-1-elibr@nvidia.com
Headers show
Series netdev-doca | expand

Message

Eli Britstein April 1, 2026, 9:13 a.m. UTC
Introduce a new netdev type - netdev-doca.
In order to compile, need to install doca on the build machine.

v2-v1:
- Fixed licence comment headers.
- Abandoned dpdk-extra patch. It is a configuration issue and there is
  also [1].
- Added co-author to some of the commits.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2026-February/430134.html

v3-v2:
- There is a lot of code that can be shared with netdev-dpdk. Refactored it
  to enable sharing.
- Added documentation files.
- Styling fixes.
- Reworked ovs-rcu and refmap patches.
- Added support for github actions (CI) to compile with doca.
- Simplified acinclude.m4.
- Reworked sysfs access. Most of the accesses are now with doca-API.
- Reworked error paths - returned values, messages and rollbacks.

Ariel Levkovich (1):
  acinclude.m4: Add '--with-doca' option.

Eli Britstein (8):
  packets: Move ETH_TYPE_LLDP to be a public define.
  netdev-dpdk-private: Refactor declarations from netdev-dpdk.
  netdev-dpdk: Change access from dev->common.xxx to common->xxx.
  netdev-dpdk: Make 'started' field atomic.
  netdev-dpdk: Direct mempool usage.
  netdev-dpdk: Refactor common functions for reuse by netdev-doca.
  unixctl: Introduce unixctl_mem_stream().
  netdev-doca: Introduce doca netdev.

Gaetan Rivet (2):
  ovs-rcu: Add support for embedded variant.
  refmap: Introduce reference map.

 .ci/doca-build.sh                     |   36 +
 .ci/doca-install.sh                   |   20 +
 .github/workflows/build-and-test.yml  |   58 +
 Documentation/automake.mk             |    2 +
 Documentation/howto/doca.rst          |  143 ++
 Documentation/howto/index.rst         |    1 +
 Documentation/intro/install/doca.rst  |  104 +
 Documentation/intro/install/index.rst |    1 +
 Makefile.am                           |    2 +
 NEWS                                  |    4 +
 acinclude.m4                          |  174 ++
 configure.ac                          |    3 +-
 lib/automake.mk                       |   13 +
 lib/dpdk.c                            |   32 +-
 lib/dpdk.h                            |    1 +
 lib/guarded-list.c                    |   10 +
 lib/guarded-list.h                    |    2 +
 lib/netdev-doca.c                     | 2898 +++++++++++++++++++++++++
 lib/netdev-doca.h                     |  159 ++
 lib/netdev-dpdk-private.h             |  290 +++
 lib/netdev-dpdk.c                     | 2163 +++++++++---------
 lib/ovs-doca.c                        |  812 +++++++
 lib/ovs-doca.h                        |  113 +
 lib/ovs-lldp.c                        |    1 -
 lib/ovs-rcu.c                         |  110 +-
 lib/ovs-rcu.h                         |   39 +
 lib/packets.h                         |    7 +-
 lib/refmap.c                          |  485 +++++
 lib/refmap.h                          |  130 ++
 lib/unixctl.c                         |   44 +-
 lib/unixctl.h                         |    3 +
 tests/automake.mk                     |    1 +
 tests/library.at                      |    5 +
 tests/ofproto-macros.at               |    1 +
 tests/test-aa.c                       |    2 -
 tests/test-rcu.c                      |  131 ++
 tests/test-refmap.c                   |  894 ++++++++
 utilities/checkpatch_dict.txt         |    4 +
 vswitchd/bridge.c                     |    5 +
 vswitchd/ovs-vswitchd.c               |    3 +
 vswitchd/vswitch.ovsschema            |    9 +-
 vswitchd/vswitch.xml                  |   95 +-
 42 files changed, 7802 insertions(+), 1208 deletions(-)
 create mode 100755 .ci/doca-build.sh
 create mode 100755 .ci/doca-install.sh
 create mode 100644 Documentation/howto/doca.rst
 create mode 100644 Documentation/intro/install/doca.rst
 create mode 100644 lib/netdev-doca.c
 create mode 100644 lib/netdev-doca.h
 create mode 100644 lib/netdev-dpdk-private.h
 create mode 100644 lib/ovs-doca.c
 create mode 100644 lib/ovs-doca.h
 create mode 100644 lib/refmap.c
 create mode 100644 lib/refmap.h
 create mode 100644 tests/test-refmap.c

Comments

Eelco Chaudron April 9, 2026, 12:16 p.m. UTC | #1
On 1 Apr 2026, at 11:13, Eli Britstein wrote:

> Introduce a new netdev type - netdev-doca.
Hi Eli/Gaetan,

Not sure if you’ve noticed, but this series has failing tests due to a
UBSan issue in patch 1. Maybe you can take a quick look and fix this so
that all unit tests pass.

I would suggest setting up a GitHub Action in your local repo so these
tests run before submitting a series. That way, you can be sure it passes
at least the basic checks.

I’ll start reviewing the series, but I’ll be on PTO at the end of next
week for about a week, so a full review may take some time. I’ll try to
send comments on individual patches so you can continue working on what
has been reviewed so far.

Cheers,
Eelco


Example error:

../../lib/ovs-rcu.c:292:9: runtime error: call to function rule_destroy_cb through pointer to incorrect function type 'void (*)(void *)'
/home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/ovs-router.c:1244: note: rule_destroy_cb defined here
    #0 0x560e4675ce49 in ovsrcu_run_cbset /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/ovs-rcu.c:292:9
    #1 0x560e4675bef1 in ovsrcu_call_postponed /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/ovs-rcu.c:370:9
    #2 0x560e4687e368 in time_poll /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/timeval.c:325:17
    #3 0x560e467f76d4 in poll_block /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/poll-loop.c:364:14
    #4 0x560e4613c401 in main /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../vswitchd/ovs-vswitchd.c:149:9
    #5 0x7efe9d42a1c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
    #6 0x7efe9d42a28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
    #7 0x560e46027fa4 in _start (/home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/vswitchd/ovs-vswitchd+0x74efa4) (BuildId: a540f0f744c789e977aecf68f1c17a835856d3f8)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../lib/ovs-rcu.c:292:9


[...]
Eli Britstein April 9, 2026, 2:18 p.m. UTC | #2
On 09/04/2026 15:16, Eelco Chaudron wrote:
> External email: Use caution opening links or attachments
>
>
> On 1 Apr 2026, at 11:13, Eli Britstein wrote:
>
>> Introduce a new netdev type - netdev-doca.
> Hi Eli/Gaetan,
>
> Not sure if you’ve noticed, but this series has failing tests due to a
> UBSan issue in patch 1. Maybe you can take a quick look and fix this so
> that all unit tests pass.
>
> I would suggest setting up a GitHub Action in your local repo so these
> tests run before submitting a series. That way, you can be sure it passes
> at least the basic checks.
>
> I’ll start reviewing the series, but I’ll be on PTO at the end of next
> week for about a week, so a full review may take some time. I’ll try to
> send comments on individual patches so you can continue working on what
> has been reviewed so far.
>
> Cheers,
> Eelco
>
>
> Example error:
>
> ../../lib/ovs-rcu.c:292:9: runtime error: call to function rule_destroy_cb through pointer to incorrect function type 'void (*)(void *)'
> /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/ovs-router.c:1244: note: rule_destroy_cb defined here
>      #0 0x560e4675ce49 in ovsrcu_run_cbset /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/ovs-rcu.c:292:9
>      #1 0x560e4675bef1 in ovsrcu_call_postponed /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/ovs-rcu.c:370:9
>      #2 0x560e4687e368 in time_poll /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/timeval.c:325:17
>      #3 0x560e467f76d4 in poll_block /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../lib/poll-loop.c:364:14
>      #4 0x560e4613c401 in main /home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/../../vswitchd/ovs-vswitchd.c:149:9
>      #5 0x7efe9d42a1c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
>      #6 0x7efe9d42a28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
>      #7 0x560e46027fa4 in _start (/home/runner/work/ovs/ovs/openvswitch-3.7.90/_build/sub/vswitchd/ovs-vswitchd+0x74efa4) (BuildId: a540f0f744c789e977aecf68f1c17a835856d3f8)
>
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../lib/ovs-rcu.c:292:9

Sure, I'll take that into consideration.

So far, I found some issues myself that will be fixed in v4:

- refmap still may fail in the scenario you described

- there is no destroy of the mempool

- demote some error messages to WARN if adding a non-existent port or 
invalid port

- some documentation fixes and improvements


Also, currently there is no support to build deb/rpm with doca support. 
I'm working on that, so it will probably affect the flags stuff around 
acinclude.m4.

Enjoy your PTO.

>
> [...]
>
Eelco Chaudron April 30, 2026, 9:25 a.m. UTC | #3
On 1 Apr 2026, at 11:13, Eli Britstein wrote:

> Introduce a new netdev type - netdev-doca.
> In order to compile, need to install doca on the build machine.

Hi Eli,

I was running the DOCA test suite patch[1] on this series, and I get a
few errors.  The earlier tunnel related errors are gone though.

   95: conntrack - IPv4 fragmentation                  FAILED (system-traffic.at:4881)
   97: conntrack - IPv4 fragmentation expiry           FAILED (system-traffic.at:5002)
  105: conntrack - IPv6 fragmentation                  FAILED (system-traffic.at:5264)
  107: conntrack - IPv6 fragmentation expiry           FAILED (system-traffic.at:5390)

These tests pass fine with check-dpdk and check-dpdk-offloads.

I did not investigate the issues, as I'm focussing on the general
review of the series, so please take a look.

Cheers,

Eelco


[1]: https://patchwork.ozlabs.org/project/openvswitch/patch/e88f13b8266bd6bb792a522b8b323e76923f376e.1777380423.git.echaudro@redhat.com/
Eli Britstein May 3, 2026, 11:57 a.m. UTC | #4
On 30/04/2026 12:25, Eelco Chaudron wrote:
> External email: Use caution opening links or attachments
>
>
> On 1 Apr 2026, at 11:13, Eli Britstein wrote:
>
>> Introduce a new netdev type - netdev-doca.
>> In order to compile, need to install doca on the build machine.
> Hi Eli,
>
> I was running the DOCA test suite patch[1] on this series, and I get a
> few errors.  The earlier tunnel related errors are gone though.
>
>     95: conntrack - IPv4 fragmentation                  FAILED (system-traffic.at:4881)
>     97: conntrack - IPv4 fragmentation expiry           FAILED (system-traffic.at:5002)
>    105: conntrack - IPv6 fragmentation                  FAILED (system-traffic.at:5264)
>    107: conntrack - IPv6 fragmentation expiry           FAILED (system-traffic.at:5390)
>
> These tests pass fine with check-dpdk and check-dpdk-offloads.
>
> I did not investigate the issues, as I'm focussing on the general
> review of the series, so please take a look.

Thanks Eelco for this info.

I looked into it.

The RC is that there is a UAF in the following scenario:

1. A frag packet is added to frag_list in ipf module, and not returned 
to the mempool.
2. the datapath is destroyed. At this point, we first remove the ports, 
and only then destroy CT/IPF.
3. When a doca port is destroyed, it frees its packet mempool 
(rte_mempool_free). in this process, the memory is zeroed in dpdk code: 
lib/eal/common/malloc_elem.c -> malloc_elem_free() -> memset(ptr, 0, 
data_len).
4. When ipf_destroy occurs, the packet is already free. It crashes when 
IPF tries to do dp_packet_delete().

With DPDK, there is this commit that frees a mempool only when it's 
"full", meaning all its mbufs are returned to the pool:

91fccdad72a2 ("netdev-dpdk: Free mempool only when no in-use mbufs.")

With this, the mempool is not destroyed at all unless another dpdk port 
is configured, then the sweep mechanism will destroy it.

IMO, this behavior is a kind of a WA, for which I'm not sure what is the 
correct solution.

I will add similar doca-sweep for now, but I think this is not correct.

WDYT?

PS: I had a lot of trouble running the tests. There are few fixes to the 
infrastructure that I will post.

>
> Cheers,
>
> Eelco
>
>
> [1]: https://patchwork.ozlabs.org/project/openvswitch/patch/e88f13b8266bd6bb792a522b8b323e76923f376e.1777380423.git.echaudro@redhat.com/
>
Eelco Chaudron May 4, 2026, 7:30 a.m. UTC | #5
On 3 May 2026, at 13:57, Eli Britstein wrote:

> On 30/04/2026 12:25, Eelco Chaudron wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 1 Apr 2026, at 11:13, Eli Britstein wrote:
>>
>>> Introduce a new netdev type - netdev-doca.
>>> In order to compile, need to install doca on the build machine.
>> Hi Eli,
>>
>> I was running the DOCA test suite patch[1] on this series, and I get a
>> few errors.  The earlier tunnel related errors are gone though.
>>
>>     95: conntrack - IPv4 fragmentation                  FAILED (system-traffic.at:4881)
>>     97: conntrack - IPv4 fragmentation expiry           FAILED (system-traffic.at:5002)
>>    105: conntrack - IPv6 fragmentation                  FAILED (system-traffic.at:5264)
>>    107: conntrack - IPv6 fragmentation expiry           FAILED (system-traffic.at:5390)
>>
>> These tests pass fine with check-dpdk and check-dpdk-offloads.
>>
>> I did not investigate the issues, as I'm focussing on the general
>> review of the series, so please take a look.
>
> Thanks Eelco for this info.
>
> I looked into it.
>
> The RC is that there is a UAF in the following scenario:
>
> 1. A frag packet is added to frag_list in ipf module, and not returned to the mempool.
> 2. the datapath is destroyed. At this point, we first remove the ports, and only then destroy CT/IPF.
> 3. When a doca port is destroyed, it frees its packet mempool (rte_mempool_free). in this process, the memory is zeroed in dpdk code: lib/eal/common/malloc_elem.c -> malloc_elem_free() -> memset(ptr, 0, data_len).
> 4. When ipf_destroy occurs, the packet is already free. It crashes when IPF tries to do dp_packet_delete().
>
> With DPDK, there is this commit that frees a mempool only when it's "full", meaning all its mbufs are returned to the pool:
>
> 91fccdad72a2 ("netdev-dpdk: Free mempool only when no in-use mbufs.")
>
> With this, the mempool is not destroyed at all unless another dpdk port is configured, then the sweep mechanism will destroy it.
>
> IMO, this behavior is a kind of a WA, for which I'm not sure what is the correct solution.
>
> I will add similar doca-sweep for now, but I think this is not correct.
>
> WDYT?

I've included Mike and Kevin who worked on ipf before. Maybe they
have some more insight into how to fix this.
>
> PS: I had a lot of trouble running the tests. There are few fixes to the infrastructure that I will post.

Interested to find out what your problems are, as I had no problems
following the documentation. Anyhow, I can include your changes (if you
reply to the RFC patch), or you can include my patch in your series.

Cheers,

Eelco
Mike Pattrick May 4, 2026, 3:57 p.m. UTC | #6
On Mon, May 4, 2026 at 3:30 AM Eelco Chaudron <echaudro@redhat.com> wrote:

>
>
> On 3 May 2026, at 13:57, Eli Britstein wrote:
>
> > On 30/04/2026 12:25, Eelco Chaudron wrote:
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On 1 Apr 2026, at 11:13, Eli Britstein wrote:
> >>
> >>> Introduce a new netdev type - netdev-doca.
> >>> In order to compile, need to install doca on the build machine.
> >> Hi Eli,
> >>
> >> I was running the DOCA test suite patch[1] on this series, and I get a
> >> few errors.  The earlier tunnel related errors are gone though.
> >>
> >>     95: conntrack - IPv4 fragmentation                  FAILED (
> system-traffic.at:4881)
> >>     97: conntrack - IPv4 fragmentation expiry           FAILED (
> system-traffic.at:5002)
> >>    105: conntrack - IPv6 fragmentation                  FAILED (
> system-traffic.at:5264)
> >>    107: conntrack - IPv6 fragmentation expiry           FAILED (
> system-traffic.at:5390)
> >>
> >> These tests pass fine with check-dpdk and check-dpdk-offloads.
> >>
> >> I did not investigate the issues, as I'm focussing on the general
> >> review of the series, so please take a look.
> >
> > Thanks Eelco for this info.
> >
> > I looked into it.
> >
> > The RC is that there is a UAF in the following scenario:
> >
> > 1. A frag packet is added to frag_list in ipf module, and not returned
> to the mempool.
> > 2. the datapath is destroyed. At this point, we first remove the ports,
> and only then destroy CT/IPF.
> > 3. When a doca port is destroyed, it frees its packet mempool
> (rte_mempool_free). in this process, the memory is zeroed in dpdk code:
> lib/eal/common/malloc_elem.c -> malloc_elem_free() -> memset(ptr, 0,
> data_len).
> > 4. When ipf_destroy occurs, the packet is already free. It crashes when
> IPF tries to do dp_packet_delete().
> >
> > With DPDK, there is this commit that frees a mempool only when it's
> "full", meaning all its mbufs are returned to the pool:
> >
> > 91fccdad72a2 ("netdev-dpdk: Free mempool only when no in-use mbufs.")
> >
> > With this, the mempool is not destroyed at all unless another dpdk port
> is configured, then the sweep mechanism will destroy it.
> >
> > IMO, this behavior is a kind of a WA, for which I'm not sure what is the
> correct solution.
> >
> > I will add similar doca-sweep for now, but I think this is not correct.
> >
> > WDYT?
>
> I've included Mike and Kevin who worked on ipf before. Maybe they
> have some more insight into how to fix this.
>

The commit message indicates a non-ipf motivation for the mp_sweep thread,
and the current behaviour is very convenient because it allows ipf to age
off those packets as part of the frag expiry mechanism.

But I think it makes sense for ipf to walk the frags_list and purge any
fragments that come from an interface before we destroy the iface. Aside
from the UAF, the current behaviour could also enable a fragment from a
deleted interface to interfere with the packet transmission of a new
interface.

However, in the current implementation of ipf that would be a very
expensive operation. We could add another index on in_port to help reduce
that cost. We may also want to revisit how the ipf_lock works, a read/write
lock on the frag_lists (including exp_list, complete_list) and a separate
mutex in each ipf_list. This would improve parallelization, at the expense
of more complex code.

I could create an RFC for this if others agree.

Cheers,
M


> >
> > PS: I had a lot of trouble running the tests. There are few fixes to the
> infrastructure that I will post.
>
> Interested to find out what your problems are, as I had no problems
> following the documentation. Anyhow, I can include your changes (if you
> reply to the RFC patch), or you can include my patch in your series.
>
> Cheers,
>
> Eelco
>
>