
net: add initial support for AF_XDP network backend

Message ID 20230622215824.2173343-1-i.maximets@ovn.org
State New
Series: net: add initial support for AF_XDP network backend

Commit Message

Ilya Maximets June 22, 2023, 9:58 p.m. UTC
AF_XDP is a network socket family that allows communication directly
with the network device driver in the kernel, bypassing most or all
of the kernel networking stack.  In essence, the technology is
pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
and works with any network interface without driver modifications.
Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
require access to character devices or Unix sockets.  Only access to
the network interface itself is necessary.

This patch implements a network backend that communicates with the
kernel by creating an AF_XDP socket.  A chunk of userspace memory
is shared between QEMU and the host kernel.  Four ring buffers (Tx, Rx,
Fill and Completion) are placed in that memory along with a pool of
memory buffers for the packet data.  Data transmission is done by
allocating one of the buffers, copying packet data into it and
placing the pointer into the Tx ring.  After transmission, the device
returns the buffer via the Completion ring.  On Rx, the device takes
a buffer from the pre-populated Fill ring, writes the packet data into
it and places the buffer into the Rx ring.
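
Roughly, the per-packet flow described above maps onto the libxdp
(xsk.h) ring helpers as in the sketch below.  This is only an
illustration of the mechanism, not the actual net/af-xdp.c code;
umem/socket setup, Completion ring handling and error handling are
omitted, and the function names are made up:

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <xdp/xsk.h>

/* Tx: take a free umem frame, copy the packet in, post it to the Tx
 * ring and kick the kernel with a syscall. */
static void xsk_tx_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                       void *umem_area, uint64_t frame_addr,
                       const void *pkt, uint32_t len)
{
    uint32_t idx;

    if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
        return; /* Tx ring is full; retry later. */
    }
    memcpy(xsk_umem__get_data(umem_area, frame_addr), pkt, len);
    xsk_ring_prod__tx_desc(tx, idx)->addr = frame_addr;
    xsk_ring_prod__tx_desc(tx, idx)->len = len;
    xsk_ring_prod__submit(tx, 1);
    /* In non-zero-copy modes transmission happens inside this syscall. */
    sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}

/* Rx: consume descriptors from the Rx ring and give the buffers back
 * to the kernel via the Fill ring. */
static void xsk_rx_batch(struct xsk_ring_cons *rx, struct xsk_ring_prod *fill,
                         void *umem_area)
{
    uint32_t idx_rx, idx_fill, i;
    uint32_t n = xsk_ring_cons__peek(rx, 32, &idx_rx);

    if (!n) {
        return;
    }
    /* A real implementation would handle a temporarily full Fill ring. */
    xsk_ring_prod__reserve(fill, n, &idx_fill);
    for (i = 0; i < n; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);
        void *data = xsk_umem__get_data(umem_area, desc->addr);

        /* ... forward (data, desc->len) to the peer device ... */
        (void)data;
        *xsk_ring_prod__fill_addr(fill, idx_fill + i) = desc->addr;
    }
    xsk_ring_prod__submit(fill, n);
    xsk_ring_cons__release(rx, n);
}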

The AF_XDP network backend takes care of the communication with the
host kernel and the network interface and forwards packets to/from
the peer device in QEMU.

Usage example:

  -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1

An XDP program bridges the socket with a network interface.  It can be
attached to the interface in two different modes:

1. skb - this mode should work for any interface and doesn't require
         driver support, at the cost of lower performance.

2. native - this mode requires support from the driver and allows
            bypassing skb allocation in the kernel and potentially
            using zero-copy while getting packets in/out of userspace.

By default, QEMU will try to use native mode and fall back to skb.
The mode can be forced via the 'mode' option.  To force copying even
in native mode, use the 'force-copy=on' option.  This might be useful
if there is some issue with the driver.
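
For example, a hypothetical command line forcing the generic skb mode,
or forcing copy mode on a driver with problematic zero-copy support,
might look like this (interface and id values are illustrative):

  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=skb
  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,force-copy=on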

The 'queues=N' option specifies how many device queues should be
opened.  Note that all the queues that are not opened are still
functional and can receive traffic, but it will not be delivered to
QEMU.  So, the number of device queues should generally match the
QEMU configuration, unless the device is shared with something else
and the traffic redirection to the appropriate queues is correctly
configured at the device level (e.g. with ethtool -N).
The 'start-queue=M' option can be used to specify the queue id from
which QEMU should start configuring the 'N' queues.  It might also be
necessary to use this option with certain NICs, e.g. MLX5 NICs.  See
the docs for examples.
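
As an illustration (interface name, MAC address and queue id are made
up), steering a guest's traffic to queue 3 with an ntuple rule and
opening only that queue might look like:

  ethtool -N ens6f1np1 flow-type ether dst 00:16:35:AF:AA:5C action 3

  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1,start-queue=3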

In the general case, QEMU will need the CAP_NET_ADMIN and CAP_SYS_ADMIN
capabilities in order to load the default XSK/XDP programs onto the
network interface and configure BTF maps.  It is possible, however,
to run with only CAP_NET_RAW.  For that to work, an external process
with admin capabilities will need to pre-load the default XSK program
and pass an open file descriptor for this program's 'xsks_map' to the
QEMU process on startup.  The network backend will need to be configured
with 'inhibit=on' to avoid loading of the programs.  The file
descriptor for 'xsks_map' can be passed via the 'xsks-map-fd=N' option.
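
For example, assuming a privileged helper has already loaded the XSK
program and passed the 'xsks_map' file descriptor to QEMU as fd 25
(the number is arbitrary), the backend could be configured as:

  -netdev af-xdp,ifname=ens6f1np1,id=guest1,inhibit=on,xsks-map-fd=25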

There are a few performance challenges with the current network backends.

The first is that they do not support IO threads.  This means that the
data path is handled by the main thread in QEMU and may slow down other
work or be slowed down by other work.  It also means that taking
advantage of multi-queue is generally not possible today.

Another is that the data path goes through the device emulation code,
which is not really optimized for performance.  The fastest "frontend"
device is virtio-net.  But it's not optimized for heavy traffic either,
because it expects such use cases to be handled via some implementation
of vhost (user, kernel, vdpa).  In practice, we have virtio
notifications and RCU lock/unlock on a per-packet basis and not very
efficient accesses to guest memory.  The communication channels between
backend and frontend devices do not allow passing more than one packet
at a time either.

Some of these challenges can be avoided in the future by adding better
batching into device emulation or by implementing vhost-af-xdp variant.

There are also a few kernel limitations.  AF_XDP sockets do not
support any kind of checksum or segmentation offloading.  Buffers
are limited to a page size (4K), i.e. the MTU is limited.  Multi-buffer
support is not implemented for AF_XDP today.  Also, transmission in
all non-zero-copy modes is synchronous, i.e. done in a syscall.
That doesn't allow high packet rates on virtual interfaces.

However, keeping in mind all of these challenges, the current
implementation of the AF_XDP backend shows decent performance while
running on top of a physical NIC with zero-copy support.

Test setup:

2 VMs running on 2 physical hosts connected via ConnectX-6 Dx cards.
The network backend is configured to open the NIC directly in native
mode.  The driver supports zero-copy.  The NIC is configured to use
1 queue.

Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
for PPS testing.

iperf3 result:
 TCP stream      : 19.1 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
 Tx only         : 3.4 Mpps
 Rx only         : 2.0 Mpps
 L2 FWD Loopback : 1.5 Mpps

In skb mode the same setup shows much lower performance, similar to
a setup where the pair of physical NICs is replaced with a veth pair:

iperf3 result:
  TCP stream      : 9 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
  Tx only         : 1.2 Mpps
  Rx only         : 1.0 Mpps
  L2 FWD Loopback : 0.7 Mpps

Results in skb mode or over the veth pair are close to the results of
a tap backend with vhost=on and segmentation offloading disabled,
bridged with a NIC.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 MAINTAINERS                                   |   4 +
 hmp-commands.hx                               |   2 +-
 meson.build                                   |  14 +
 meson_options.txt                             |   2 +
 net/af-xdp.c                                  | 501 ++++++++++++++++++
 net/clients.h                                 |   5 +
 net/meson.build                               |   3 +
 net/net.c                                     |   6 +
 qapi/net.json                                 |  54 +-
 qemu-options.hx                               |  61 ++-
 .../ci/org.centos/stream/8/x86_64/configure   |   1 +
 scripts/meson-buildoptions.sh                 |   3 +
 tests/docker/dockerfiles/debian-amd64.docker  |   1 +
 13 files changed, 654 insertions(+), 3 deletions(-)
 create mode 100644 net/af-xdp.c

Comments

Jason Wang June 25, 2023, 7:06 a.m. UTC | #1
On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> AF_XDP is a network socket family that allows communication directly
> with the network device driver in the kernel, bypassing most or all
> of the kernel networking stack.  In the essence, the technology is
> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> and works with any network interfaces without driver modifications.
> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> require access to character devices or unix sockets.  Only access to
> the network interface itself is necessary.
>
> This patch implements a network backend that communicates with the
> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> Fill and Completion) are placed in that memory along with a pool of
> memory buffers for the packet data.  Data transmission is done by
> allocating one of the buffers, copying packet data into it and
> placing the pointer into Tx ring.  After transmission, device will
> return the buffer via Completion ring.  On Rx, device will take
> a buffer form a pre-populated Fill ring, write the packet data into
> it and place the buffer into Rx ring.
>
> AF_XDP network backend takes on the communication with the host
> kernel and the network interface and forwards packets to/from the
> peer device in QEMU.
>
> Usage example:
>
>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>
> XDP program bridges the socket with a network interface.  It can be
> attached to the interface in 2 different modes:
>
> 1. skb - this mode should work for any interface and doesn't require
>          driver support.  With a caveat of lower performance.
>
> 2. native - this does require support from the driver and allows to
>             bypass skb allocation in the kernel and potentially use
>             zero-copy while getting packets in/out userspace.
>
> By default, QEMU will try to use native mode and fall back to skb.
> Mode can be forced via 'mode' option.  To force 'copy' even in native
> mode, use 'force-copy=on' option.  This might be useful if there is
> some issue with the driver.
>
> Option 'queues=N' allows to specify how many device queues should
> be open.  Note that all the queues that are not open are still
> functional and can receive traffic, but it will not be delivered to
> QEMU.  So, the number of device queues should generally match the
> QEMU configuration, unless the device is shared with something
> else and the traffic re-direction to appropriate queues is correctly
> configured on a device level (e.g. with ethtool -N).
> 'start-queue=M' option can be used to specify from which queue id
> QEMU should start configuring 'N' queues.  It might also be necessary
> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> for examples.
>
> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> capabilities in order to load default XSK/XDP programs to the
> network interface and configure BTF maps.

I think you mean "BPF" actually?

>  It is possible, however,
> to run only with CAP_NET_RAW.

Qemu often runs without any privileges, so we need to fix it.

I think adding support for SCM_RIGHTS via monitor would be a way to go.


> For that to work, an external process
> with admin capabilities will need to pre-load default XSK program
> and pass an open file descriptor for this program's 'xsks_map' to
> QEMU process on startup.  Network backend will need to be configured
> with 'inhibit=on' to avoid loading of the programs.  The file
> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
>
> There are few performance challenges with the current network backends.
>
> First is that they do not support IO threads.

The current networking code needs some major refactoring to support IO
threads, which I'm not sure is worthwhile.

> This means that data
> path is handled by the main thread in QEMU and may slow down other
> work or may be slowed down by some other work.  This also means that
> taking advantage of multi-queue is generally not possible today.
>
> Another thing is that data path is going through the device emulation
> code, which is not really optimized for performance.  The fastest
> "frontend" device is virtio-net.  But it's not optimized for heavy
> traffic either, because it expects such use-cases to be handled via
> some implementation of vhost (user, kernel, vdpa).  In practice, we
> have virtio notifications and rcu lock/unlock on a per-packet basis
> and not very efficient accesses to the guest memory.  Communication
> channels between backend and frontend devices do not allow passing
> more than one packet at a time as well.
>
> Some of these challenges can be avoided in the future by adding better
> batching into device emulation or by implementing vhost-af-xdp variant.

It might require you to register (pin) the whole guest memory with the
XSK, or there could be a copy.  Both of them are sub-optimal.

A really interesting project would be to do AF_XDP passthrough; then we
don't need to care about pinning and copying, and we would get ultra
speed in the guest.  (But again, it might need BPF support in virtio-net.)

>
> There are also a few kernel limitations.  AF_XDP sockets do not
> support any kinds of checksum or segmentation offloading.  Buffers
> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> support is not implemented for AF_XDP today.  Also, transmission in
> all non-zero-copy modes is synchronous, i.e. done in a syscall.
> That doesn't allow high packet rates on virtual interfaces.
>
> However, keeping in mind all of these challenges, current implementation
> of the AF_XDP backend shows a decent performance while running on top
> of a physical NIC with zero-copy support.
>
> Test setup:
>
> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> Network backend is configured to open the NIC directly in native mode.
> The driver supports zero-copy.  NIC is configured to use 1 queue.
>
> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> for PPS testing.
>
> iperf3 result:
>  TCP stream      : 19.1 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>  Tx only         : 3.4 Mpps
>  Rx only         : 2.0 Mpps
>  L2 FWD Loopback : 1.5 Mpps

I don't object to merging this backend (considering we've already
merged netmap) once the code is fine, but the numbers are not amazing,
so I wonder: what is the use case for this backend?

Thanks
Jason Wang June 26, 2023, 6:32 a.m. UTC | #2
On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > AF_XDP is a network socket family that allows communication directly
> > with the network device driver in the kernel, bypassing most or all
> > of the kernel networking stack.  In the essence, the technology is
> > pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> > and works with any network interfaces without driver modifications.
> > Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> > require access to character devices or unix sockets.  Only access to
> > the network interface itself is necessary.
> >
> > This patch implements a network backend that communicates with the
> > kernel by creating an AF_XDP socket.  A chunk of userspace memory
> > is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> > Fill and Completion) are placed in that memory along with a pool of
> > memory buffers for the packet data.  Data transmission is done by
> > allocating one of the buffers, copying packet data into it and
> > placing the pointer into Tx ring.  After transmission, device will
> > return the buffer via Completion ring.  On Rx, device will take
> > a buffer form a pre-populated Fill ring, write the packet data into
> > it and place the buffer into Rx ring.
> >
> > AF_XDP network backend takes on the communication with the host
> > kernel and the network interface and forwards packets to/from the
> > peer device in QEMU.
> >
> > Usage example:
> >
> >   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >
> > XDP program bridges the socket with a network interface.  It can be
> > attached to the interface in 2 different modes:
> >
> > 1. skb - this mode should work for any interface and doesn't require
> >          driver support.  With a caveat of lower performance.
> >
> > 2. native - this does require support from the driver and allows to
> >             bypass skb allocation in the kernel and potentially use
> >             zero-copy while getting packets in/out userspace.
> >
> > By default, QEMU will try to use native mode and fall back to skb.
> > Mode can be forced via 'mode' option.  To force 'copy' even in native
> > mode, use 'force-copy=on' option.  This might be useful if there is
> > some issue with the driver.
> >
> > Option 'queues=N' allows to specify how many device queues should
> > be open.  Note that all the queues that are not open are still
> > functional and can receive traffic, but it will not be delivered to
> > QEMU.  So, the number of device queues should generally match the
> > QEMU configuration, unless the device is shared with something
> > else and the traffic re-direction to appropriate queues is correctly
> > configured on a device level (e.g. with ethtool -N).
> > 'start-queue=M' option can be used to specify from which queue id
> > QEMU should start configuring 'N' queues.  It might also be necessary
> > to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> > for examples.
> >
> > In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> > capabilities in order to load default XSK/XDP programs to the
> > network interface and configure BTF maps.
>
> I think you mean "BPF" actually?
>
> >  It is possible, however,
> > to run only with CAP_NET_RAW.
>
> Qemu often runs without any privileges, so we need to fix it.
>
> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>
>
> > For that to work, an external process
> > with admin capabilities will need to pre-load default XSK program
> > and pass an open file descriptor for this program's 'xsks_map' to
> > QEMU process on startup.  Network backend will need to be configured
> > with 'inhibit=on' to avoid loading of the programs.  The file
> > descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
> >
> > There are few performance challenges with the current network backends.
> >
> > First is that they do not support IO threads.
>
> The current networking codes needs some major recatoring to support IO
> threads which I'm not sure is worthwhile.
>
> > This means that data
> > path is handled by the main thread in QEMU and may slow down other
> > work or may be slowed down by some other work.  This also means that
> > taking advantage of multi-queue is generally not possible today.
> >
> > Another thing is that data path is going through the device emulation
> > code, which is not really optimized for performance.  The fastest
> > "frontend" device is virtio-net.  But it's not optimized for heavy
> > traffic either, because it expects such use-cases to be handled via
> > some implementation of vhost (user, kernel, vdpa).  In practice, we
> > have virtio notifications and rcu lock/unlock on a per-packet basis
> > and not very efficient accesses to the guest memory.  Communication
> > channels between backend and frontend devices do not allow passing
> > more than one packet at a time as well.
> >
> > Some of these challenges can be avoided in the future by adding better
> > batching into device emulation or by implementing vhost-af-xdp variant.
>
> It might require you to register(pin) the whole guest memory to XSK or
> there could be a copy. Both of them are sub-optimal.
>
> A really interesting project is to do AF_XDP passthrough, then we
> don't need to care about pin and copy and we will get ultra speed in
> the guest. (But again, it might needs BPF support in virtio-net).
>
> >
> > There are also a few kernel limitations.  AF_XDP sockets do not
> > support any kinds of checksum or segmentation offloading.  Buffers
> > are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> > support is not implemented for AF_XDP today.  Also, transmission in
> > all non-zero-copy modes is synchronous, i.e. done in a syscall.
> > That doesn't allow high packet rates on virtual interfaces.
> >
> > However, keeping in mind all of these challenges, current implementation
> > of the AF_XDP backend shows a decent performance while running on top
> > of a physical NIC with zero-copy support.
> >
> > Test setup:
> >
> > 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> > Network backend is configured to open the NIC directly in native mode.
> > The driver supports zero-copy.  NIC is configured to use 1 queue.
> >
> > Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> > for PPS testing.
> >
> > iperf3 result:
> >  TCP stream      : 19.1 Gbps
> >
> > dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >  Tx only         : 3.4 Mpps
> >  Rx only         : 2.0 Mpps
> >  L2 FWD Loopback : 1.5 Mpps
>
> I don't object to merging this backend (considering we've already
> merged netmap) once the code is fine, but the number is not amazing so
> I wonder what is the use case for this backend?

A more ambitious approach would be to reuse DPDK via dedicated threads;
then we could reuse any of its PMDs, such as AF_XDP.

Thanks

>
> Thanks
Ilya Maximets June 26, 2023, 1:12 p.m. UTC | #3
On 6/26/23 08:32, Jason Wang wrote:
> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>
>>> AF_XDP is a network socket family that allows communication directly
>>> with the network device driver in the kernel, bypassing most or all
>>> of the kernel networking stack.  In the essence, the technology is
>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>> and works with any network interfaces without driver modifications.
>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>> require access to character devices or unix sockets.  Only access to
>>> the network interface itself is necessary.
>>>
>>> This patch implements a network backend that communicates with the
>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>> Fill and Completion) are placed in that memory along with a pool of
>>> memory buffers for the packet data.  Data transmission is done by
>>> allocating one of the buffers, copying packet data into it and
>>> placing the pointer into Tx ring.  After transmission, device will
>>> return the buffer via Completion ring.  On Rx, device will take
>>> a buffer form a pre-populated Fill ring, write the packet data into
>>> it and place the buffer into Rx ring.
>>>
>>> AF_XDP network backend takes on the communication with the host
>>> kernel and the network interface and forwards packets to/from the
>>> peer device in QEMU.
>>>
>>> Usage example:
>>>
>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>
>>> XDP program bridges the socket with a network interface.  It can be
>>> attached to the interface in 2 different modes:
>>>
>>> 1. skb - this mode should work for any interface and doesn't require
>>>          driver support.  With a caveat of lower performance.
>>>
>>> 2. native - this does require support from the driver and allows to
>>>             bypass skb allocation in the kernel and potentially use
>>>             zero-copy while getting packets in/out userspace.
>>>
>>> By default, QEMU will try to use native mode and fall back to skb.
>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>> some issue with the driver.
>>>
>>> Option 'queues=N' allows to specify how many device queues should
>>> be open.  Note that all the queues that are not open are still
>>> functional and can receive traffic, but it will not be delivered to
>>> QEMU.  So, the number of device queues should generally match the
>>> QEMU configuration, unless the device is shared with something
>>> else and the traffic re-direction to appropriate queues is correctly
>>> configured on a device level (e.g. with ethtool -N).
>>> 'start-queue=M' option can be used to specify from which queue id
>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>> for examples.
>>>
>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>> capabilities in order to load default XSK/XDP programs to the
>>> network interface and configure BTF maps.
>>
>> I think you mean "BPF" actually?

"BPF Type Format maps" kind of makes some sense, but yes. :)

>>
>>>  It is possible, however,
>>> to run only with CAP_NET_RAW.
>>
>> Qemu often runs without any privileges, so we need to fix it.
>>
>> I think adding support for SCM_RIGHTS via monitor would be a way to go.

I looked through the code and it seems like we can run completely
non-privileged as far as the kernel is concerned.  We'll need an API
modification in libxdp though.

The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
the base socket creation.  Binding and other configuration don't
require any privileges.  So, we could create a socket externally
and pass it to QEMU.  That should work, unless it's an oversight from
the kernel side that needs to be patched. :)  libxdp doesn't have
a way to specify an externally created socket today, so we'll need
to change that.  Should be easy to do though.  I can explore.

In case the bind syscall actually does need CAP_NET_RAW for some
reason, we could change the kernel and allow non-privileged bind
by utilizing, e.g., SO_BINDTODEVICE, i.e. let the privileged
process bind the socket to a particular device, so QEMU can't
bind it to a random one.  Might be a good use case to allow even
if not strictly necessary.
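
For illustration, the privileged-helper side of such a scheme could look
roughly like the sketch below.  This is not part of this patch; the unix
socket setup and error handling are omitted:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef AF_XDP
#define AF_XDP 44   /* Older libc headers may not define it. */
#endif

/* Pass an already-created AF_XDP socket fd over a unix socket,
 * so that QEMU itself doesn't need CAP_NET_RAW. */
static int send_xsk_fd(int unix_sock, int xsk_fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char space[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.space, .msg_controllen = sizeof(u.space),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &xsk_fd, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}

/* Privileged part: only this socket() call needs CAP_NET_RAW;
 * binding and further configuration could then be done by QEMU. */
static int create_xsk_fd(void)
{
    return socket(AF_XDP, SOCK_RAW, 0);
}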

>>
>>
>>> For that to work, an external process
>>> with admin capabilities will need to pre-load default XSK program
>>> and pass an open file descriptor for this program's 'xsks_map' to
>>> QEMU process on startup.  Network backend will need to be configured
>>> with 'inhibit=on' to avoid loading of the programs.  The file
>>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
>>>
>>> There are few performance challenges with the current network backends.
>>>
>>> First is that they do not support IO threads.
>>
>> The current networking codes needs some major recatoring to support IO
>> threads which I'm not sure is worthwhile.
>>
>>> This means that data
>>> path is handled by the main thread in QEMU and may slow down other
>>> work or may be slowed down by some other work.  This also means that
>>> taking advantage of multi-queue is generally not possible today.
>>>
>>> Another thing is that data path is going through the device emulation
>>> code, which is not really optimized for performance.  The fastest
>>> "frontend" device is virtio-net.  But it's not optimized for heavy
>>> traffic either, because it expects such use-cases to be handled via
>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>>> have virtio notifications and rcu lock/unlock on a per-packet basis
>>> and not very efficient accesses to the guest memory.  Communication
>>> channels between backend and frontend devices do not allow passing
>>> more than one packet at a time as well.
>>>
>>> Some of these challenges can be avoided in the future by adding better
>>> batching into device emulation or by implementing vhost-af-xdp variant.
>>
>> It might require you to register(pin) the whole guest memory to XSK or
>> there could be a copy. Both of them are sub-optimal.

A single copy by itself shouldn't be a huge problem, right?
vhost-user and -kernel do copy packets.

>>
>> A really interesting project is to do AF_XDP passthrough, then we
>> don't need to care about pin and copy and we will get ultra speed in
>> the guest. (But again, it might needs BPF support in virtio-net).

I suppose, if we're doing pass-through we need a new device type and a
driver in the kernel/dpdk.  There is no point pretending it's a
virtio-net and translating between different ring layouts.  Or is there?

>>
>>>
>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>> support any kinds of checksum or segmentation offloading.  Buffers
>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>>> support is not implemented for AF_XDP today.  Also, transmission in
>>> all non-zero-copy modes is synchronous, i.e. done in a syscall.
>>> That doesn't allow high packet rates on virtual interfaces.
>>>
>>> However, keeping in mind all of these challenges, current implementation
>>> of the AF_XDP backend shows a decent performance while running on top
>>> of a physical NIC with zero-copy support.
>>>
>>> Test setup:
>>>
>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
>>> Network backend is configured to open the NIC directly in native mode.
>>> The driver supports zero-copy.  NIC is configured to use 1 queue.
>>>
>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>>> for PPS testing.
>>>
>>> iperf3 result:
>>>  TCP stream      : 19.1 Gbps
>>>
>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>  Tx only         : 3.4 Mpps
>>>  Rx only         : 2.0 Mpps
>>>  L2 FWD Loopback : 1.5 Mpps
>>
>> I don't object to merging this backend (considering we've already
>> merged netmap) once the code is fine, but the number is not amazing so
>> I wonder what is the use case for this backend?

I don't think there is a use case right now that would significantly benefit
from the current implementation, so I'm fine if the merge is postponed.
It is noticeably more performant than a tap with vhost=on in terms of PPS,
so that might be one case.  Taking into account that just the RCU lock and
unlock in the virtio-net code takes more time than a packet copy, some
batching on the QEMU side should improve performance significantly.  And it
shouldn't be too hard to implement.

Performance over virtual interfaces may potentially be improved by creating
a kernel thread for async Tx, similarly to what io_uring allows.  Currently,
Tx on non-zero-copy interfaces is synchronous, and that doesn't allow it to
scale well.

So, I do think that there is a potential in this backend.

The main benefit, assuming we can reach performance comparable with other
high-performance backends (vhost-user), I think, is the fact that it's
Linux-native and doesn't require talking to any other devices
(like chardevs/sockets), except for the network interface itself, i.e. it
could be easier to manage in complex environments.

> A more ambitious method is to reuse DPDK via dedicated threads, then
> we can reuse any of its PMD like AF_XDP.

Linking with DPDK will make configuration much more complex.  I don't
think it makes sense to bring it in for AF_XDP specifically.  Might be
a separate project though, sure.  

Best regards, Ilya Maximets.
Jason Wang June 27, 2023, 2:54 a.m. UTC | #4
On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/26/23 08:32, Jason Wang wrote:
> > On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>
> >>> AF_XDP is a network socket family that allows communication directly
> >>> with the network device driver in the kernel, bypassing most or all
> >>> of the kernel networking stack.  In the essence, the technology is
> >>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> >>> and works with any network interfaces without driver modifications.
> >>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> >>> require access to character devices or unix sockets.  Only access to
> >>> the network interface itself is necessary.
> >>>
> >>> This patch implements a network backend that communicates with the
> >>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> >>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> >>> Fill and Completion) are placed in that memory along with a pool of
> >>> memory buffers for the packet data.  Data transmission is done by
> >>> allocating one of the buffers, copying packet data into it and
> >>> placing the pointer into Tx ring.  After transmission, device will
> >>> return the buffer via Completion ring.  On Rx, device will take
> >>> a buffer form a pre-populated Fill ring, write the packet data into
> >>> it and place the buffer into Rx ring.
> >>>
> >>> AF_XDP network backend takes on the communication with the host
> >>> kernel and the network interface and forwards packets to/from the
> >>> peer device in QEMU.
> >>>
> >>> Usage example:
> >>>
> >>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>>
> >>> XDP program bridges the socket with a network interface.  It can be
> >>> attached to the interface in 2 different modes:
> >>>
> >>> 1. skb - this mode should work for any interface and doesn't require
> >>>          driver support.  With a caveat of lower performance.
> >>>
> >>> 2. native - this does require support from the driver and allows to
> >>>             bypass skb allocation in the kernel and potentially use
> >>>             zero-copy while getting packets in/out userspace.
> >>>
> >>> By default, QEMU will try to use native mode and fall back to skb.
> >>> Mode can be forced via 'mode' option.  To force 'copy' even in native
> >>> mode, use 'force-copy=on' option.  This might be useful if there is
> >>> some issue with the driver.
> >>>
> >>> Option 'queues=N' allows to specify how many device queues should
> >>> be open.  Note that all the queues that are not open are still
> >>> functional and can receive traffic, but it will not be delivered to
> >>> QEMU.  So, the number of device queues should generally match the
> >>> QEMU configuration, unless the device is shared with something
> >>> else and the traffic re-direction to appropriate queues is correctly
> >>> configured on a device level (e.g. with ethtool -N).
> >>> 'start-queue=M' option can be used to specify from which queue id
> >>> QEMU should start configuring 'N' queues.  It might also be necessary
> >>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> >>> for examples.
> >>>
> >>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> >>> capabilities in order to load default XSK/XDP programs to the
> >>> network interface and configure BTF maps.
> >>
> >> I think you mean "BPF" actually?
>
> "BPF Type Format maps" kind of makes some sense, but yes. :)
>
> >>
> >>>  It is possible, however,
> >>> to run only with CAP_NET_RAW.
> >>
> >> Qemu often runs without any privileges, so we need to fix it.
> >>
> >> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>
> I looked through the code and it seems like we can run completely
> non-privileged as far as kernel concerned.  We'll need an API
> modification in libxdp though.
>
> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
> a base socket creation.  Binding and other configuration doesn't
> require any privileges.  So, we could create a socket externally
> and pass it to QEMU.

That's the way TAP works for example.

>  Should work, unless it's an oversight from
> the kernel side that needs to be patched. :)  libxdp doesn't have
> a way to specify externally created socket today, so we'll need
> to change that.  Should be easy to do though.  I can explore.

Please do that.

>
> In case the bind syscall will actually need CAP_NET_RAW for some
> reason, we could change the kernel and allow non-privileged bind
> by utilizing, e.g. SO_BINDTODEVICE.  i.e., let the privileged
> process bind the socket to a particular device, so QEMU can't
> bind it to a random one.  Might be a good use case to allow even
> if not strictly necessary.

Yes.

>
> >>
> >>
> >>> For that to work, an external process
> >>> with admin capabilities will need to pre-load default XSK program
> >>> and pass an open file descriptor for this program's 'xsks_map' to
> >>> QEMU process on startup.  Network backend will need to be configured
> >>> with 'inhibit=on' to avoid loading of the programs.  The file
> >>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
> >>>
> >>> There are few performance challenges with the current network backends.
> >>>
> >>> First is that they do not support IO threads.
> >>
> >> The current networking codes needs some major recatoring to support IO
> >> threads which I'm not sure is worthwhile.
> >>
> >>> This means that data
> >>> path is handled by the main thread in QEMU and may slow down other
> >>> work or may be slowed down by some other work.  This also means that
> >>> taking advantage of multi-queue is generally not possible today.
> >>>
> >>> Another thing is that data path is going through the device emulation
> >>> code, which is not really optimized for performance.  The fastest
> >>> "frontend" device is virtio-net.  But it's not optimized for heavy
> >>> traffic either, because it expects such use-cases to be handled via
> >>> some implementation of vhost (user, kernel, vdpa).  In practice, we
> >>> have virtio notifications and rcu lock/unlock on a per-packet basis
> >>> and not very efficient accesses to the guest memory.  Communication
> >>> channels between backend and frontend devices do not allow passing
> >>> more than one packet at a time as well.
> >>>
> >>> Some of these challenges can be avoided in the future by adding better
> >>> batching into device emulation or by implementing vhost-af-xdp variant.
> >>
> >> It might require you to register(pin) the whole guest memory to XSK or
> >> there could be a copy. Both of them are sub-optimal.
>
> A single copy by itself shouldn't be a huge problem, right?

Probably.

> vhost-user and -kernel do copy packets.
>
> >>
> >> A really interesting project is to do AF_XDP passthrough, then we
> >> don't need to care about pin and copy and we will get ultra speed in
> >> the guest. (But again, it might needs BPF support in virtio-net).
>
> I suppose, if we're doing pass-through we need a new device type and a
> driver in the kernel/dpdk.  There is no point pretending it's a
> virtio-net and translating between different ring layouts.

Yes.

>  Or is there?
>
> >>
> >>>
> >>> There are also a few kernel limitations.  AF_XDP sockets do not
> >>> support any kinds of checksum or segmentation offloading.  Buffers
> >>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> >>> support is not implemented for AF_XDP today.  Also, transmission in
> >>> all non-zero-copy modes is synchronous, i.e. done in a syscall.
> >>> That doesn't allow high packet rates on virtual interfaces.
> >>>
> >>> However, keeping in mind all of these challenges, current implementation
> >>> of the AF_XDP backend shows a decent performance while running on top
> >>> of a physical NIC with zero-copy support.
> >>>
> >>> Test setup:
> >>>
> >>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> >>> Network backend is configured to open the NIC directly in native mode.
> >>> The driver supports zero-copy.  NIC is configured to use 1 queue.
> >>>
> >>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> >>> for PPS testing.
> >>>
> >>> iperf3 result:
> >>>  TCP stream      : 19.1 Gbps
> >>>
> >>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >>>  Tx only         : 3.4 Mpps
> >>>  Rx only         : 2.0 Mpps
> >>>  L2 FWD Loopback : 1.5 Mpps
> >>
> >> I don't object to merging this backend (considering we've already
> >> merged netmap) once the code is fine, but the number is not amazing so
> >> I wonder what is the use case for this backend?
>
> I don't think there is a use case right now that would significantly benefit
> from the current implementation, so I'm fine if the merge is postponed.

Just to be clear, I don't want to postpone this if we decide to
invest in/enhance it.  I will go through the code and get back.

> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> So, that might be one case.  Taking into account that just rcu lock and
> unlock in virtio-net code takes more time than a packet copy, some batching
> on QEMU side should improve performance significantly.  And it shouldn't be
> too hard to implement.
>
> Performance over virtual interfaces may potentially be improved by creating
> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> scale well.

Interestingly, there is actually a lot of "duplication" between
io_uring and AF_XDP:

1) both have a similar memory model (user-registered memory)
2) both use rings for communication

I wonder if we can let io_uring talk directly to AF_XDP.

>
> So, I do think that there is a potential in this backend.
>
> The main benefit, assuming we can reach performance comparable with other
> high-performance backends (vhost-user), I think, is the fact that it's
> Linux-native and doesn't require talking with any other devices
> (like chardevs/sockets), except for a network interface itself. i.e. it
> could be easier to manage in complex environments.

Yes.

>
> > A more ambitious method is to reuse DPDK via dedicated threads, then
> > we can reuse any of its PMD like AF_XDP.
>
> Linking with DPDK will make configuration much more complex.  I don't
> think it makes sense to bring it in for AF_XDP specifically.  Might be
> a separate project though, sure.

Right.

Thanks

>
> Best regards, Ilya Maximets.
>
Stefan Hajnoczi June 27, 2023, 8:56 a.m. UTC | #5
Can multiple VMs share a host netdev by filtering incoming traffic
based on each VM's MAC address and directing it to the appropriate
XSK? If yes, then I think AF_XDP is interesting when SR-IOV or similar
hardware features are not available.

The idea of an AF_XDP passthrough device seems interesting because it
would minimize the overhead and avoid some of the existing software
limitations (mostly in QEMU's networking subsystem) that you
described. I don't know whether the AF_XDP API is suitable or can be
extended to build a hardware emulation interface, but it seems
plausible. When Stefano Garzarella played with io_uring passthrough
into the guest, one of the issues was guest memory translation (since
the guest doesn't use host userspace virtual addresses). I guess
AF_XDP would need an API for adding/removing memory translations or
operate in a mode where addresses are relative offsets from the start
of the umem regions (but this may be impractical if it limits where
the guest can allocate packet payload buffers).

Whether you pursue the passthrough approach or not, making -netdev
af-xdp work in an environment where QEMU runs unprivileged seems like
the most important practical issue to solve.

Stefan
Ilya Maximets June 27, 2023, 10:46 p.m. UTC | #6
On 6/27/23 04:54, Jason Wang wrote:
> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 6/26/23 08:32, Jason Wang wrote:
>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>
>>>>> AF_XDP is a network socket family that allows communication directly
>>>>> with the network device driver in the kernel, bypassing most or all
>>>>> of the kernel networking stack.  In the essence, the technology is
>>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>>> and works with any network interfaces without driver modifications.
>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>> require access to character devices or unix sockets.  Only access to
>>>>> the network interface itself is necessary.
>>>>>
>>>>> This patch implements a network backend that communicates with the
>>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>> memory buffers for the packet data.  Data transmission is done by
>>>>> allocating one of the buffers, copying packet data into it and
>>>>> placing the pointer into Tx ring.  After transmission, device will
>>>>> return the buffer via Completion ring.  On Rx, device will take
>>>>> a buffer form a pre-populated Fill ring, write the packet data into
>>>>> it and place the buffer into Rx ring.
>>>>>
>>>>> AF_XDP network backend takes on the communication with the host
>>>>> kernel and the network interface and forwards packets to/from the
>>>>> peer device in QEMU.
>>>>>
>>>>> Usage example:
>>>>>
>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>
>>>>> XDP program bridges the socket with a network interface.  It can be
>>>>> attached to the interface in 2 different modes:
>>>>>
>>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>>          driver support.  With a caveat of lower performance.
>>>>>
>>>>> 2. native - this does require support from the driver and allows to
>>>>>             bypass skb allocation in the kernel and potentially use
>>>>>             zero-copy while getting packets in/out userspace.
>>>>>
>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>>>> some issue with the driver.
>>>>>
>>>>> Option 'queues=N' allows to specify how many device queues should
>>>>> be open.  Note that all the queues that are not open are still
>>>>> functional and can receive traffic, but it will not be delivered to
>>>>> QEMU.  So, the number of device queues should generally match the
>>>>> QEMU configuration, unless the device is shared with something
>>>>> else and the traffic re-direction to appropriate queues is correctly
>>>>> configured on a device level (e.g. with ethtool -N).
>>>>> 'start-queue=M' option can be used to specify from which queue id
>>>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>>>> for examples.
>>>>>
>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>> network interface and configure BTF maps.
>>>>
>>>> I think you mean "BPF" actually?
>>
>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>
>>>>
>>>>>  It is possible, however,
>>>>> to run only with CAP_NET_RAW.
>>>>
>>>> Qemu often runs without any privileges, so we need to fix it.
>>>>
>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>>
>> I looked through the code and it seems like we can run completely
>> non-privileged as far as kernel concerned.  We'll need an API
>> modification in libxdp though.
>>
>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>> a base socket creation.  Binding and other configuration doesn't
>> require any privileges.  So, we could create a socket externally
>> and pass it to QEMU.
> 
> That's the way TAP works for example.
> 
>>  Should work, unless it's an oversight from
>> the kernel side that needs to be patched. :)  libxdp doesn't have
>> a way to specify externally created socket today, so we'll need
>> to change that.  Should be easy to do though.  I can explore.
> 
> Please do that.

I have a prototype:
  https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3

I need to test it out and then submit a PR to the xdp-tools project.

> 
>>
>> In case the bind syscall will actually need CAP_NET_RAW for some
>> reason, we could change the kernel and allow non-privileged bind
>> by utilizing, e.g. SO_BINDTODEVICE.  i.e., let the privileged
>> process bind the socket to a particular device, so QEMU can't
>> bind it to a random one.  Might be a good use case to allow even
>> if not strictly necessary.
> 
> Yes.

I will propose something for the kernel as well.  We might want something
more granular though, e.g. binding to a queue instead of a device, in
case we want better control in the device-sharing scenario.

> 
>>
>>>>
>>>>
>>>>> For that to work, an external process
>>>>> with admin capabilities will need to pre-load default XSK program
>>>>> and pass an open file descriptor for this program's 'xsks_map' to
>>>>> QEMU process on startup.  Network backend will need to be configured
>>>>> with 'inhibit=on' to avoid loading of the programs.  The file
>>>>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
>>>>>
>>>>> There are few performance challenges with the current network backends.
>>>>>
>>>>> First is that they do not support IO threads.
>>>>
>>>> The current networking codes needs some major recatoring to support IO
>>>> threads which I'm not sure is worthwhile.
>>>>
>>>>> This means that data
>>>>> path is handled by the main thread in QEMU and may slow down other
>>>>> work or may be slowed down by some other work.  This also means that
>>>>> taking advantage of multi-queue is generally not possible today.
>>>>>
>>>>> Another thing is that data path is going through the device emulation
>>>>> code, which is not really optimized for performance.  The fastest
>>>>> "frontend" device is virtio-net.  But it's not optimized for heavy
>>>>> traffic either, because it expects such use-cases to be handled via
>>>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>>>>> have virtio notifications and rcu lock/unlock on a per-packet basis
>>>>> and not very efficient accesses to the guest memory.  Communication
>>>>> channels between backend and frontend devices do not allow passing
>>>>> more than one packet at a time as well.
>>>>>
>>>>> Some of these challenges can be avoided in the future by adding better
>>>>> batching into device emulation or by implementing vhost-af-xdp variant.
>>>>
>>>> It might require you to register(pin) the whole guest memory to XSK or
>>>> there could be a copy. Both of them are sub-optimal.
>>
>> A single copy by itself shouldn't be a huge problem, right?
> 
> Probably.
> 
>> vhost-user and -kernel do copy packets.
>>
>>>>
>>>> A really interesting project is to do AF_XDP passthrough, then we
>>>> don't need to care about pin and copy and we will get ultra speed in
>>>> the guest. (But again, it might needs BPF support in virtio-net).
>>
>> I suppose, if we're doing pass-through we need a new device type and a
>> driver in the kernel/dpdk.  There is no point pretending it's a
>> virtio-net and translating between different ring layouts.
> 
> Yes.
> 
>>  Or is there?
>>
>>>>
>>>>>
>>>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>>>> support any kinds of checksum or segmentation offloading.  Buffers
>>>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>>>>> support is not implemented for AF_XDP today.  Also, transmission in
>>>>> all non-zero-copy modes is synchronous, i.e. done in a syscall.
>>>>> That doesn't allow high packet rates on virtual interfaces.
>>>>>
>>>>> However, keeping in mind all of these challenges, current implementation
>>>>> of the AF_XDP backend shows a decent performance while running on top
>>>>> of a physical NIC with zero-copy support.
>>>>>
>>>>> Test setup:
>>>>>
>>>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
>>>>> Network backend is configured to open the NIC directly in native mode.
>>>>> The driver supports zero-copy.  NIC is configured to use 1 queue.
>>>>>
>>>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>>>>> for PPS testing.
>>>>>
>>>>> iperf3 result:
>>>>>  TCP stream      : 19.1 Gbps
>>>>>
>>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>>>  Tx only         : 3.4 Mpps
>>>>>  Rx only         : 2.0 Mpps
>>>>>  L2 FWD Loopback : 1.5 Mpps
>>>>
>>>> I don't object to merging this backend (considering we've already
>>>> merged netmap) once the code is fine, but the number is not amazing so
>>>> I wonder what is the use case for this backend?
>>
>> I don't think there is a use case right now that would significantly benefit
>> from the current implementation, so I'm fine if the merge is postponed.
> 
> Just to be clear, I don't want to postpone this if we decide to
> invest/enhance it. I will go through the codes and get back.

Ack.  Thanks.

> 
>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
>> So, that might be one case.  Taking into account that just rcu lock and
>> unlock in virtio-net code takes more time than a packet copy, some batching
>> on QEMU side should improve performance significantly.  And it shouldn't be
>> too hard to implement.
>>
>> Performance over virtual interfaces may potentially be improved by creating
>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
>> scale well.
> 
> Interestingly, actually, there are a lot of "duplication" between
> io_uring and AF_XDP:
> 
> 1) both have similar memory model (user register)
> 2) both use ring for communication
> 
> I wonder if we can let io_uring talks directly to AF_XDP.

Well, if we submit poll() in the QEMU main loop via io_uring, then we can
avoid the cost of synchronous Tx for non-zero-copy modes, i.e. for
virtual interfaces.  The io_uring thread in the kernel will be able to
perform the transmission for us.

But yeah, there are way too many way too similar ring buffer interfaces
in the kernel.

> 
>>
>> So, I do think that there is a potential in this backend.
>>
>> The main benefit, assuming we can reach performance comparable with other
>> high-performance backends (vhost-user), I think, is the fact that it's
>> Linux-native and doesn't require talking with any other devices
>> (like chardevs/sockets), except for a network interface itself. i.e. it
>> could be easier to manage in complex environments.
> 
> Yes.
> 
>>
>>> A more ambitious method is to reuse DPDK via dedicated threads, then
>>> we can reuse any of its PMD like AF_XDP.
>>
>> Linking with DPDK will make configuration much more complex.  I don't
>> think it makes sense to bring it in for AF_XDP specifically.  Might be
>> a separate project though, sure.
> 
> Right.
> 
> Thanks
> 
>>
>> Best regards, Ilya Maximets.
>>
>
Ilya Maximets June 27, 2023, 11:10 p.m. UTC | #7
On 6/27/23 10:56, Stefan Hajnoczi wrote:
> Can multiple VMs share a host netdev by filtering incoming traffic
> based on each VM's MAC address and directing it to the appropriate
> XSK? If yes, then I think AF_XDP is interesting when SR-IOV or similar
> hardware features are not available.

Good point.  Thanks!

Yes, they can.  Traffic can be redirected via 'ethtool -N', similarly
to the example in the patch, or, potentially, via a custom XDP program.
Different QEMU instances may then use different start-queue arguments
and use their own range of queues this way.
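
For illustration (MAC addresses and queue ids are made up), two VMs
sharing one physical interface could each get a dedicated queue:

  ethtool -N ens6f1np1 flow-type ether dst 00:16:35:AF:AA:5C action 2
  ethtool -N ens6f1np1 flow-type ether dst 00:16:35:AF:AA:5D action 3

and then be started with '...,queues=1,start-queue=2' and
'...,queues=1,start-queue=3' respectively.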

> 
> The idea of an AF_XDP passthrough device seems interesting because it
> would minimize the overhead and avoid some of the existing software
> limitations (mostly in QEMU's networking subsystem) that you
> described. I don't know whether the AF_XDP API is suitable or can be
> extended to build a hardware emulation interface, but it seems
> plausible. When Stefano Garzarella played with io_uring passthrough
> into the guest, one of the issues was guest memory translation (since
> the guest doesn't use host userspace virtual addresses). I guess
> AF_XDP would need an API for adding/removing memory translations or
> operate in a mode where addresses are relative offsets from the start
> of the umem regions

Actually, addresses in AF_XDP rings are already offsets from the
start of the umem region.  For example, xsk_umem__get_data is
implemented as &((char *)umem_area)[addr]; inside libxdp.  So, that
should not be an issue.

> (but this may be impractical if it limits where
> the guest can allocate packet payload buffers).

Yeah, we will either need to:

a. register the whole guest memory as umem and offset buffer pointers
   in the guest driver by the start of guest physical memory.

   (I'm not that familiar with the QEMU memory subsystem.  Does guest
    physical memory always start at 0?  I know that's not always true
    for real hardware.)

b. or require the guest driver to allocate a chunk of aligned contiguous
   memory and copy all the packets there on Tx, and populate the Fill
   ring only with buffers from that area, assuming guest pages align
   with host pages.  Again, a single copy might not be that bad, but
   it's hard to tell what the actual impact will be without testing.

> 
> Whether you pursue the passthrough approach or not, making -netdev
> af-xdp work in an environment where QEMU runs unprivileged seems like
> the most important practical issue to solve.

Yes, working on it.  Doesn't seem to be hard to do, but I need to test.

Best regards, Ilya Maximets.
Jason Wang June 28, 2023, 3:27 a.m. UTC | #8
On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/27/23 04:54, Jason Wang wrote:
> > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 6/26/23 08:32, Jason Wang wrote:
> >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>
> >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>
> >>>>> AF_XDP is a network socket family that allows communication directly
> >>>>> with the network device driver in the kernel, bypassing most or all
> >>>>> of the kernel networking stack.  In the essence, the technology is
> >>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> >>>>> and works with any network interfaces without driver modifications.
> >>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> >>>>> require access to character devices or unix sockets.  Only access to
> >>>>> the network interface itself is necessary.
> >>>>>
> >>>>> This patch implements a network backend that communicates with the
> >>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> >>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> >>>>> Fill and Completion) are placed in that memory along with a pool of
> >>>>> memory buffers for the packet data.  Data transmission is done by
> >>>>> allocating one of the buffers, copying packet data into it and
> >>>>> placing the pointer into Tx ring.  After transmission, device will
> >>>>> return the buffer via Completion ring.  On Rx, device will take
> >>>>> a buffer form a pre-populated Fill ring, write the packet data into
> >>>>> it and place the buffer into Rx ring.
> >>>>>
> >>>>> AF_XDP network backend takes on the communication with the host
> >>>>> kernel and the network interface and forwards packets to/from the
> >>>>> peer device in QEMU.
> >>>>>
> >>>>> Usage example:
> >>>>>
> >>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>>>>
> >>>>> XDP program bridges the socket with a network interface.  It can be
> >>>>> attached to the interface in 2 different modes:
> >>>>>
> >>>>> 1. skb - this mode should work for any interface and doesn't require
> >>>>>          driver support.  With a caveat of lower performance.
> >>>>>
> >>>>> 2. native - this does require support from the driver and allows to
> >>>>>             bypass skb allocation in the kernel and potentially use
> >>>>>             zero-copy while getting packets in/out userspace.
> >>>>>
> >>>>> By default, QEMU will try to use native mode and fall back to skb.
> >>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
> >>>>> mode, use 'force-copy=on' option.  This might be useful if there is
> >>>>> some issue with the driver.
> >>>>>
> >>>>> Option 'queues=N' allows to specify how many device queues should
> >>>>> be open.  Note that all the queues that are not open are still
> >>>>> functional and can receive traffic, but it will not be delivered to
> >>>>> QEMU.  So, the number of device queues should generally match the
> >>>>> QEMU configuration, unless the device is shared with something
> >>>>> else and the traffic re-direction to appropriate queues is correctly
> >>>>> configured on a device level (e.g. with ethtool -N).
> >>>>> 'start-queue=M' option can be used to specify from which queue id
> >>>>> QEMU should start configuring 'N' queues.  It might also be necessary
> >>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> >>>>> for examples.
> >>>>>
> >>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> >>>>> capabilities in order to load default XSK/XDP programs to the
> >>>>> network interface and configure BTF maps.
> >>>>
> >>>> I think you mean "BPF" actually?
> >>
> >> "BPF Type Format maps" kind of makes some sense, but yes. :)
> >>
> >>>>
> >>>>>  It is possible, however,
> >>>>> to run only with CAP_NET_RAW.
> >>>>
> >>>> Qemu often runs without any privileges, so we need to fix it.
> >>>>
> >>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
> >>
> >> I looked through the code and it seems like we can run completely
> >> non-privileged as far as kernel concerned.  We'll need an API
> >> modification in libxdp though.
> >>
> >> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
> >> a base socket creation.  Binding and other configuration doesn't
> >> require any privileges.  So, we could create a socket externally
> >> and pass it to QEMU.
> >
> > That's the way TAP works for example.
> >
> >>  Should work, unless it's an oversight from
> >> the kernel side that needs to be patched. :)  libxdp doesn't have
> >> a way to specify externally created socket today, so we'll need
> >> to change that.  Should be easy to do though.  I can explore.
> >
> > Please do that.
>
> I have a prototype:
>   https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
>
> Need to test it out and then submit PR to xdp-tools project.
>
> >
> >>
> >> In case the bind syscall will actually need CAP_NET_RAW for some
> >> reason, we could change the kernel and allow non-privileged bind
> >> by utilizing, e.g. SO_BINDTODEVICE.  i.e., let the privileged
> >> process bind the socket to a particular device, so QEMU can't
> >> bind it to a random one.  Might be a good use case to allow even
> >> if not strictly necessary.
> >
> > Yes.
>
> Will propose something for a kernel as well.  We might want something
> more granular though, e.g. bind to a queue instead of a device.  In
> case we want better control in the device sharing scenario.

I may miss something but the bind is already done at dev plus queue
right now, isn't it?


>
> >
> >>
> >>>>
> >>>>
> >>>>> For that to work, an external process
> >>>>> with admin capabilities will need to pre-load default XSK program
> >>>>> and pass an open file descriptor for this program's 'xsks_map' to
> >>>>> QEMU process on startup.  Network backend will need to be configured
> >>>>> with 'inhibit=on' to avoid loading of the programs.  The file
> >>>>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
> >>>>>
> >>>>> There are few performance challenges with the current network backends.
> >>>>>
> >>>>> First is that they do not support IO threads.
> >>>>
> >>>> The current networking codes needs some major recatoring to support IO
> >>>> threads which I'm not sure is worthwhile.
> >>>>
> >>>>> This means that data
> >>>>> path is handled by the main thread in QEMU and may slow down other
> >>>>> work or may be slowed down by some other work.  This also means that
> >>>>> taking advantage of multi-queue is generally not possible today.
> >>>>>
> >>>>> Another thing is that data path is going through the device emulation
> >>>>> code, which is not really optimized for performance.  The fastest
> >>>>> "frontend" device is virtio-net.  But it's not optimized for heavy
> >>>>> traffic either, because it expects such use-cases to be handled via
> >>>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
> >>>>> have virtio notifications and rcu lock/unlock on a per-packet basis
> >>>>> and not very efficient accesses to the guest memory.  Communication
> >>>>> channels between backend and frontend devices do not allow passing
> >>>>> more than one packet at a time as well.
> >>>>>
> >>>>> Some of these challenges can be avoided in the future by adding better
> >>>>> batching into device emulation or by implementing vhost-af-xdp variant.
> >>>>
> >>>> It might require you to register(pin) the whole guest memory to XSK or
> >>>> there could be a copy. Both of them are sub-optimal.
> >>
> >> A single copy by itself shouldn't be a huge problem, right?
> >
> > Probably.
> >
> >> vhost-user and -kernel do copy packets.
> >>
> >>>>
> >>>> A really interesting project is to do AF_XDP passthrough, then we
> >>>> don't need to care about pin and copy and we will get ultra speed in
> >>>> the guest. (But again, it might needs BPF support in virtio-net).
> >>
> >> I suppose, if we're doing pass-through we need a new device type and a
> >> driver in the kernel/dpdk.  There is no point pretending it's a
> >> virtio-net and translating between different ring layouts.
> >
> > Yes.
> >
> >>  Or is there?
> >>
> >>>>
> >>>>>
> >>>>> There are also a few kernel limitations.  AF_XDP sockets do not
> >>>>> support any kinds of checksum or segmentation offloading.  Buffers
> >>>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> >>>>> support is not implemented for AF_XDP today.  Also, transmission in
> >>>>> all non-zero-copy modes is synchronous, i.e. done in a syscall.
> >>>>> That doesn't allow high packet rates on virtual interfaces.
> >>>>>
> >>>>> However, keeping in mind all of these challenges, current implementation
> >>>>> of the AF_XDP backend shows a decent performance while running on top
> >>>>> of a physical NIC with zero-copy support.
> >>>>>
> >>>>> Test setup:
> >>>>>
> >>>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> >>>>> Network backend is configured to open the NIC directly in native mode.
> >>>>> The driver supports zero-copy.  NIC is configured to use 1 queue.
> >>>>>
> >>>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> >>>>> for PPS testing.
> >>>>>
> >>>>> iperf3 result:
> >>>>>  TCP stream      : 19.1 Gbps
> >>>>>
> >>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >>>>>  Tx only         : 3.4 Mpps
> >>>>>  Rx only         : 2.0 Mpps
> >>>>>  L2 FWD Loopback : 1.5 Mpps
> >>>>
> >>>> I don't object to merging this backend (considering we've already
> >>>> merged netmap) once the code is fine, but the number is not amazing so
> >>>> I wonder what is the use case for this backend?
> >>
> >> I don't think there is a use case right now that would significantly benefit
> >> from the current implementation, so I'm fine if the merge is postponed.
> >
> > Just to be clear, I don't want to postpone this if we decide to
> > invest/enhance it. I will go through the codes and get back.
>
> Ack.  Thanks.
>
> >
> >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> >> So, that might be one case.  Taking into account that just rcu lock and
> >> unlock in virtio-net code takes more time than a packet copy, some batching
> >> on QEMU side should improve performance significantly.  And it shouldn't be
> >> too hard to implement.
> >>
> >> Performance over virtual interfaces may potentially be improved by creating
> >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> >> scale well.
> >
> > Interestingly, actually, there are a lot of "duplication" between
> > io_uring and AF_XDP:
> >
> > 1) both have similar memory model (user register)
> > 2) both use ring for communication
> >
> > I wonder if we can let io_uring talks directly to AF_XDP.
>
> Well, if we submit poll() in QEMU main loop via io_uring, then we can
> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> virtual interfaces.  io_uring thread in the kernel will be able to
> perform transmission for us.

It would be nice if we can use iothread/vhost other than the main loop
even if io_uring can use kthreads. We can avoid the memory translation
cost.

Thanks

>
> But yeah, there are way too many way too similar ring buffer interfaces
> in the kernel.
>
> >
> >>
> >> So, I do think that there is a potential in this backend.
> >>
> >> The main benefit, assuming we can reach performance comparable with other
> >> high-performance backends (vhost-user), I think, is the fact that it's
> >> Linux-native and doesn't require talking with any other devices
> >> (like chardevs/sockets), except for a network interface itself. i.e. it
> >> could be easier to manage in complex environments.
> >
> > Yes.
> >
> >>
> >>> A more ambitious method is to reuse DPDK via dedicated threads, then
> >>> we can reuse any of its PMD like AF_XDP.
> >>
> >> Linking with DPDK will make configuration much more complex.  I don't
> >> think it makes sense to bring it in for AF_XDP specifically.  Might be
> >> a separate project though, sure.
> >
> > Right.
> >
> > Thanks
> >
> >>
> >> Best regards, Ilya Maximets.
> >>
> >
>
Stefan Hajnoczi June 28, 2023, 7:45 a.m. UTC | #9
On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 6/27/23 04:54, Jason Wang wrote:
> > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>
> > >> On 6/26/23 08:32, Jason Wang wrote:
> > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > >>>>
> > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > >> So, that might be one case.  Taking into account that just rcu lock and
> > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > >> too hard to implement.
> > >>
> > >> Performance over virtual interfaces may potentially be improved by creating
> > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > >> scale well.
> > >
> > > Interestingly, actually, there are a lot of "duplication" between
> > > io_uring and AF_XDP:
> > >
> > > 1) both have similar memory model (user register)
> > > 2) both use ring for communication
> > >
> > > I wonder if we can let io_uring talks directly to AF_XDP.
> >
> > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > virtual interfaces.  io_uring thread in the kernel will be able to
> > perform transmission for us.
>
> It would be nice if we can use iothread/vhost other than the main loop
> even if io_uring can use kthreads. We can avoid the memory translation
> cost.

The QEMU event loop (AioContext) has io_uring code
(utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
on patches to re-enable it and will probably send them in July. The
patches also add an API to submit arbitrary io_uring operations so
that you can do stuff besides file descriptor monitoring. Both the
main loop and IOThreads will be able to use io_uring on Linux hosts.

Stefan
Jason Wang June 28, 2023, 7:59 a.m. UTC | #10
On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >
> > > On 6/27/23 04:54, Jason Wang wrote:
> > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > >>
> > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >>>>
> > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > >> too hard to implement.
> > > >>
> > > >> Performance over virtual interfaces may potentially be improved by creating
> > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > >> scale well.
> > > >
> > > > Interestingly, actually, there are a lot of "duplication" between
> > > > io_uring and AF_XDP:
> > > >
> > > > 1) both have similar memory model (user register)
> > > > 2) both use ring for communication
> > > >
> > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > >
> > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > perform transmission for us.
> >
> > It would be nice if we can use iothread/vhost other than the main loop
> > even if io_uring can use kthreads. We can avoid the memory translation
> > cost.
>
> The QEMU event loop (AioContext) has io_uring code
> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> on patches to re-enable it and will probably send them in July. The
> patches also add an API to submit arbitrary io_uring operations so
> that you can do stuff besides file descriptor monitoring. Both the
> main loop and IOThreads will be able to use io_uring on Linux hosts.

Just to make sure I understand. If we still need a copy from guest to
io_uring buffer, we still need to go via memory API for GPA which
seems expensive.

Vhost seems to be a shortcut for this.

Thanks

>
> Stefan
>
Stefan Hajnoczi June 28, 2023, 8:14 a.m. UTC | #11
On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > >
> > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > >>
> > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >>>>
> > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > >> too hard to implement.
> > > > >>
> > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > >> scale well.
> > > > >
> > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > io_uring and AF_XDP:
> > > > >
> > > > > 1) both have similar memory model (user register)
> > > > > 2) both use ring for communication
> > > > >
> > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > >
> > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > perform transmission for us.
> > >
> > > It would be nice if we can use iothread/vhost other than the main loop
> > > even if io_uring can use kthreads. We can avoid the memory translation
> > > cost.
> >
> > The QEMU event loop (AioContext) has io_uring code
> > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > on patches to re-enable it and will probably send them in July. The
> > patches also add an API to submit arbitrary io_uring operations so
> > that you can do stuff besides file descriptor monitoring. Both the
> > main loop and IOThreads will be able to use io_uring on Linux hosts.
>
> Just to make sure I understand. If we still need a copy from guest to
> io_uring buffer, we still need to go via memory API for GPA which
> seems expensive.
>
> Vhost seems to be a shortcut for this.

I'm not sure how exactly you're thinking of using io_uring.

Simply using io_uring for the event loop (file descriptor monitoring)
doesn't involve an extra buffer, but the packet payload still needs to
reside in AF_XDP umem, so there is a copy between guest memory and
umem. If umem encompasses guest memory, it may be possible to avoid
copying the packet payload.
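
(For reference, that per-packet copy on the Tx side of the netdev
approach looks roughly like this, sketched with the xsk.h helpers;
includes are omitted, and alloc_umem_frame() plus the xsk wrapper
struct are placeholders, not the actual patch code:)

  static void xsk_tx_one(struct xsk_socket_info *xsk, int xsk_fd,
                         void *umem_area, const void *pkt, __u32 pkt_len)
  {
      __u32 idx;

      if (xsk_ring_prod__reserve(&xsk->tx, 1, &idx) == 1) {
          struct xdp_desc *desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);

          desc->addr = alloc_umem_frame(xsk);   /* grab a free umem buffer */
          desc->len  = pkt_len;

          /* The copy between the (already HVA-translated) guest buffer
           * handed over by device emulation and umem. */
          memcpy(xsk_umem__get_data(umem_area, desc->addr), pkt, pkt_len);

          xsk_ring_prod__submit(&xsk->tx, 1);
          if (xsk_ring_prod__needs_wakeup(&xsk->tx)) {
              sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);  /* Tx kick */
          }
      }
  }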

Stefan
Jason Wang June 28, 2023, 8:18 a.m. UTC | #12
On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > >
> > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > >>
> > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >>>>
> > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > >> too hard to implement.
> > > > > >>
> > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > >> scale well.
> > > > > >
> > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > io_uring and AF_XDP:
> > > > > >
> > > > > > 1) both have similar memory model (user register)
> > > > > > 2) both use ring for communication
> > > > > >
> > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > >
> > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > perform transmission for us.
> > > >
> > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > cost.
> > >
> > > The QEMU event loop (AioContext) has io_uring code
> > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > on patches to re-enable it and will probably send them in July. The
> > > patches also add an API to submit arbitrary io_uring operations so
> > > that you can do stuff besides file descriptor monitoring. Both the
> > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> >
> > Just to make sure I understand. If we still need a copy from guest to
> > io_uring buffer, we still need to go via memory API for GPA which
> > seems expensive.
> >
> > Vhost seems to be a shortcut for this.
>
> I'm not sure how exactly you're thinking of using io_uring.
>
> Simply using io_uring for the event loop (file descriptor monitoring)
> doesn't involve an extra buffer, but the packet payload still needs to
> reside in AF_XDP umem, so there is a copy between guest memory and
> umem.

So there would be a translation from GPA to HVA (unless io_uring
support 2 stages) which needs to go via qemu memory core. And this
part seems to be very expensive according to my test in the past.

> If umem encompasses guest memory,

It requires you to pin the whole guest memory and a GPA to HVA
translation is still required.

Thanks

>it may be possible to avoid
> copying the packet payload.
>
> Stefan
>
Stefan Hajnoczi June 28, 2023, 8:25 a.m. UTC | #13
On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > >
> > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > >>
> > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >>>>
> > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > >> too hard to implement.
> > > > > > >>
> > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > >> scale well.
> > > > > > >
> > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > io_uring and AF_XDP:
> > > > > > >
> > > > > > > 1) both have similar memory model (user register)
> > > > > > > 2) both use ring for communication
> > > > > > >
> > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > >
> > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > perform transmission for us.
> > > > >
> > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > cost.
> > > >
> > > > The QEMU event loop (AioContext) has io_uring code
> > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > on patches to re-enable it and will probably send them in July. The
> > > > patches also add an API to submit arbitrary io_uring operations so
> > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > >
> > > Just to make sure I understand. If we still need a copy from guest to
> > > io_uring buffer, we still need to go via memory API for GPA which
> > > seems expensive.
> > >
> > > Vhost seems to be a shortcut for this.
> >
> > I'm not sure how exactly you're thinking of using io_uring.
> >
> > Simply using io_uring for the event loop (file descriptor monitoring)
> > doesn't involve an extra buffer, but the packet payload still needs to
> > reside in AF_XDP umem, so there is a copy between guest memory and
> > umem.
>
> So there would be a translation from GPA to HVA (unless io_uring
> support 2 stages) which needs to go via qemu memory core. And this
> part seems to be very expensive according to my test in the past.

Yes, but in the current approach where AF_XDP is implemented as a QEMU
netdev, there is already QEMU device emulation (e.g. virtio-net)
happening. So the GPA to HVA translation will happen anyway in device
emulation.

Are you thinking about AF_XDP passthrough where the guest directly
interacts with AF_XDP?

> > If umem encompasses guest memory,
>
> It requires you to pin the whole guest memory and a GPA to HVA
> translation is still required.

Ilya mentioned that umem uses relative offsets instead of absolute
memory addresses. In the AF_XDP passthrough case this means no address
translation needs to be added to AF_XDP.

Regarding pinning - I wonder if that's something that can be refined
in the kernel by adding an AF_XDP flag that enables on-demand pinning
of umem. That way only rx and tx buffers that are currently in use
will be pinned. The disadvantage is the runtime overhead to pin/unpin
pages. I'm not sure whether it's possible to implement this, I haven't
checked the kernel code.

Stefan
Ilya Maximets June 28, 2023, 11:15 a.m. UTC | #14
On 6/28/23 05:27, Jason Wang wrote:
> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 6/27/23 04:54, Jason Wang wrote:
>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>
>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>
>>>>>>> AF_XDP is a network socket family that allows communication directly
>>>>>>> with the network device driver in the kernel, bypassing most or all
>>>>>>> of the kernel networking stack.  In the essence, the technology is
>>>>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>>>>> and works with any network interfaces without driver modifications.
>>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>>>> require access to character devices or unix sockets.  Only access to
>>>>>>> the network interface itself is necessary.
>>>>>>>
>>>>>>> This patch implements a network backend that communicates with the
>>>>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>>>> memory buffers for the packet data.  Data transmission is done by
>>>>>>> allocating one of the buffers, copying packet data into it and
>>>>>>> placing the pointer into Tx ring.  After transmission, device will
>>>>>>> return the buffer via Completion ring.  On Rx, device will take
>>>>>>> a buffer form a pre-populated Fill ring, write the packet data into
>>>>>>> it and place the buffer into Rx ring.
>>>>>>>
>>>>>>> AF_XDP network backend takes on the communication with the host
>>>>>>> kernel and the network interface and forwards packets to/from the
>>>>>>> peer device in QEMU.
>>>>>>>
>>>>>>> Usage example:
>>>>>>>
>>>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>>>
>>>>>>> XDP program bridges the socket with a network interface.  It can be
>>>>>>> attached to the interface in 2 different modes:
>>>>>>>
>>>>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>>>>          driver support.  With a caveat of lower performance.
>>>>>>>
>>>>>>> 2. native - this does require support from the driver and allows to
>>>>>>>             bypass skb allocation in the kernel and potentially use
>>>>>>>             zero-copy while getting packets in/out userspace.
>>>>>>>
>>>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>>>>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>>>>>> some issue with the driver.
>>>>>>>
>>>>>>> Option 'queues=N' allows to specify how many device queues should
>>>>>>> be open.  Note that all the queues that are not open are still
>>>>>>> functional and can receive traffic, but it will not be delivered to
>>>>>>> QEMU.  So, the number of device queues should generally match the
>>>>>>> QEMU configuration, unless the device is shared with something
>>>>>>> else and the traffic re-direction to appropriate queues is correctly
>>>>>>> configured on a device level (e.g. with ethtool -N).
>>>>>>> 'start-queue=M' option can be used to specify from which queue id
>>>>>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>>>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>>>>>> for examples.
>>>>>>>
>>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>>>> network interface and configure BTF maps.
>>>>>>
>>>>>> I think you mean "BPF" actually?
>>>>
>>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>>>
>>>>>>
>>>>>>>  It is possible, however,
>>>>>>> to run only with CAP_NET_RAW.
>>>>>>
>>>>>> Qemu often runs without any privileges, so we need to fix it.
>>>>>>
>>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>>>>
>>>> I looked through the code and it seems like we can run completely
>>>> non-privileged as far as kernel concerned.  We'll need an API
>>>> modification in libxdp though.
>>>>
>>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>>>> a base socket creation.  Binding and other configuration doesn't
>>>> require any privileges.  So, we could create a socket externally
>>>> and pass it to QEMU.
>>>
>>> That's the way TAP works for example.
>>>
>>>>  Should work, unless it's an oversight from
>>>> the kernel side that needs to be patched. :)  libxdp doesn't have
>>>> a way to specify externally created socket today, so we'll need
>>>> to change that.  Should be easy to do though.  I can explore.
>>>
>>> Please do that.
>>
>> I have a prototype:
>>   https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
>>
>> Need to test it out and then submit PR to xdp-tools project.
>>
>>>
>>>>
>>>> In case the bind syscall will actually need CAP_NET_RAW for some
>>>> reason, we could change the kernel and allow non-privileged bind
>>>> by utilizing, e.g. SO_BINDTODEVICE.  i.e., let the privileged
>>>> process bind the socket to a particular device, so QEMU can't
>>>> bind it to a random one.  Might be a good use case to allow even
>>>> if not strictly necessary.
>>>
>>> Yes.
>>
>> Will propose something for a kernel as well.  We might want something
>> more granular though, e.g. bind to a queue instead of a device.  In
>> case we want better control in the device sharing scenario.
> 
> I may miss something but the bind is already done at dev plus queue
> right now, isn't it?


Yes, the bind() syscall will bind the socket to the dev+queue pair.  I was
talking about SO_BINDTODEVICE, which only ties the socket to a particular
device, but not to a queue.

Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
assuming a privileged process does:

  fd = socket(AF_XDP, ...);
  setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);

And sends fd to a non-privileged process.  That non-privileged process
will be able to call:

  bind(fd, <device>, <random queue>);

It will have to use the same device, but can choose any queue, if that
queue is not already busy with another socket.

So, I was thinking about maybe implementing something like an XDP_BINDTOQID
option.  This way the privileged process may call:

  setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);

And the kernel will later be able to refuse bind() to any other queue for
this particular socket.

Not sure if that is necessary though.
Since we're allocating the socket in the privileged process, that process
may add the socket to the BPF map on the correct queue id.  This way the
non-privileged process will not be able to receive any packets from any
other queue on this socket, even if bound to it.  And no other AF_XDP
socket will be able to bind to that other queue either.  So, a rogue QEMU
would be able to hog one extra queue, but it would not be able to
intercept any traffic from it, AFAICT.  May not be a huge problem
after all.
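
(A rough sketch of that flow, assuming libbpf for the map update; umem
registration, ring setup, includes and the fd-passing helper are omitted
or made up, and whether the map update may happen before bind() is
exactly the kind of kernel detail that would need checking:)

  /* Privileged process: create the socket and pin it to one queue by
   * installing it into the XDP program's xsks_map at that queue id. */
  int xsk_fd = socket(AF_XDP, SOCK_RAW, 0);
  __u32 queue_id = 2;                         /* example value */

  bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, BPF_ANY);
  send_fd_to_qemu(xsk_fd);                    /* e.g. via SCM_RIGHTS */

  /* Non-privileged QEMU: bind the received fd to the device and queue.
   * Binding to any other queue buys nothing, since the map only points
   * to this socket for queue_id. */
  struct sockaddr_xdp sxdp = {
      .sxdp_family   = AF_XDP,
      .sxdp_ifindex  = if_nametoindex("ens6f1np1"),
      .sxdp_queue_id = queue_id,
  };
  bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));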

SO_BINDTODEVICE would still be nice to have.  Especially for cases where
we give the whole device to one VM.

Best regards, Ilya Maximets.
Jason Wang June 29, 2023, 5:25 a.m. UTC | #15
On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > >
> > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > >>
> > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > >> too hard to implement.
> > > > > > > >>
> > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > >> scale well.
> > > > > > > >
> > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > io_uring and AF_XDP:
> > > > > > > >
> > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > 2) both use ring for communication
> > > > > > > >
> > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > >
> > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > perform transmission for us.
> > > > > >
> > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > cost.
> > > > >
> > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > on patches to re-enable it and will probably send them in July. The
> > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > >
> > > > Just to make sure I understand. If we still need a copy from guest to
> > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > seems expensive.
> > > >
> > > > Vhost seems to be a shortcut for this.
> > >
> > > I'm not sure how exactly you're thinking of using io_uring.
> > >
> > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > doesn't involve an extra buffer, but the packet payload still needs to
> > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > umem.
> >
> > So there would be a translation from GPA to HVA (unless io_uring
> > support 2 stages) which needs to go via qemu memory core. And this
> > part seems to be very expensive according to my test in the past.
>
> Yes, but in the current approach where AF_XDP is implemented as a QEMU
> netdev, there is already QEMU device emulation (e.g. virtio-net)
> happening. So the GPA to HVA translation will happen anyway in device
> emulation.

Just to make sure we're on the same page.

I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
QEMU netdev, it would be very hard to achieve that if we stick to
using the Qemu memory core translations which need to take care about
too much extra stuff. That's why I suggest using vhost in io threads
which only cares about ram so the translation could be very fast.

>
> Are you thinking about AF_XDP passthrough where the guest directly
> interacts with AF_XDP?

This could be another way to solve, since it won't use Qemu's memory
core to do the translation.

>
> > > If umem encompasses guest memory,
> >
> > It requires you to pin the whole guest memory and a GPA to HVA
> > translation is still required.
>
> Ilya mentioned that umem uses relative offsets instead of absolute
> memory addresses. In the AF_XDP passthrough case this means no address
> translation needs to be added to AF_XDP.

I don't see how it can avoid the translations as it works at the level
of HVA. But what guests submit is PA or even IOVA.

What's more, guest memory could be backed by different memory
backends, this means a single umem may not even work.

>
> Regarding pinning - I wonder if that's something that can be refined
> in the kernel by adding an AF_XDP flag that enables on-demand pinning
> of umem. That way only rx and tx buffers that are currently in use
> will be pinned. The disadvantage is the runtime overhead to pin/unpin
> pages. I'm not sure whether it's possible to implement this, I haven't
> checked the kernel code.

It requires the device to do page faults which is not commonly
supported nowadays.

Thanks

>
> Stefan
>
Stefan Hajnoczi June 29, 2023, 12:35 p.m. UTC | #16
On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > >
> > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > >>
> > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > >> too hard to implement.
> > > > > > > > >>
> > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > >> scale well.
> > > > > > > > >
> > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > io_uring and AF_XDP:
> > > > > > > > >
> > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > 2) both use ring for communication
> > > > > > > > >
> > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > >
> > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > perform transmission for us.
> > > > > > >
> > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > cost.
> > > > > >
> > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > >
> > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > seems expensive.
> > > > >
> > > > > Vhost seems to be a shortcut for this.
> > > >
> > > > I'm not sure how exactly you're thinking of using io_uring.
> > > >
> > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > umem.
> > >
> > > So there would be a translation from GPA to HVA (unless io_uring
> > > support 2 stages) which needs to go via qemu memory core. And this
> > > part seems to be very expensive according to my test in the past.
> >
> > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > happening. So the GPA to HVA translation will happen anyway in device
> > emulation.
>
> Just to make sure we're on the same page.
>
> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> QEMU netdev, it would be very hard to achieve that if we stick to
> using the Qemu memory core translations which need to take care about
> too much extra stuff. That's why I suggest using vhost in io threads
> which only cares about ram so the translation could be very fast.

What does using "vhost in io threads" mean? Is that a vhost kernel
approach where userspace dedicates threads (the stuff that Mike
Christie has been working on)? I haven't looked at how Mike's recent
patches work, but I wouldn't call that approach QEMU IOThreads,
because the threads probably don't run the AioContext event loop and
instead execute vhost kernel code the entire time.

But despite these questions, I think I'm beginning to understand that
you're proposing a vhost_net.ko AF_XDP implementation instead of a
userspace QEMU AF_XDP netdev implementation. I wonder if any
optimizations can be made when the AF_XDP user is kernel code instead
of userspace code.

> >
> > Are you thinking about AF_XDP passthrough where the guest directly
> > interacts with AF_XDP?
>
> This could be another way to solve, since it won't use Qemu's memory
> core to do the translation.
>
> >
> > > > If umem encompasses guest memory,
> > >
> > > It requires you to pin the whole guest memory and a GPA to HVA
> > > translation is still required.
> >
> > Ilya mentioned that umem uses relative offsets instead of absolute
> > memory addresses. In the AF_XDP passthrough case this means no address
> > translation needs to be added to AF_XDP.
>
> I don't see how it can avoid the translations as it works at the level
> of HVA. But what guests submit is PA or even IOVA.

In a passthrough scenario the guest is doing AF_XDP, so it writes
relative umem offsets, thereby eliminating address translation
concerns (the addresses are not PAs or IOVAs). However, this approach
probably won't work well with memory hotplug - or at least it will end
up becoming a memory translation mechanism in order to support memory
hotplug.

>
> What's more, guest memory could be backed by different memory
> backends, this means a single umem may not even work.

Maybe. I don't know the nature of umem. If there can be multiple vmas
in the umem range, then there should be no issue mixing different
memory backends.

>
> >
> > Regarding pinning - I wonder if that's something that can be refined
> > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > of umem. That way only rx and tx buffers that are currently in use
> > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > pages. I'm not sure whether it's possible to implement this, I haven't
> > checked the kernel code.
>
> It requires the device to do page faults which is not commonly
> supported nowadays.

I don't understand this comment. AF_XDP processes each rx/tx
descriptor. At that point it can getuserpages() or similar in order to
pin the page. When the memory is no longer needed, it can put those
pages. No fault mechanism is needed. What am I missing?

Stefan
Jason Wang June 30, 2023, 7:41 a.m. UTC | #17
On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > >
> > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > >> too hard to implement.
> > > > > > > > > >>
> > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > >> scale well.
> > > > > > > > > >
> > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > >
> > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > 2) both use ring for communication
> > > > > > > > > >
> > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > >
> > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > perform transmission for us.
> > > > > > > >
> > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > cost.
> > > > > > >
> > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > >
> > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > seems expensive.
> > > > > >
> > > > > > Vhost seems to be a shortcut for this.
> > > > >
> > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > >
> > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > umem.
> > > >
> > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > part seems to be very expensive according to my test in the past.
> > >
> > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > happening. So the GPA to HVA translation will happen anyway in device
> > > emulation.
> >
> > Just to make sure we're on the same page.
> >
> > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > QEMU netdev, it would be very hard to achieve that if we stick to
> > using the Qemu memory core translations which need to take care about
> > too much extra stuff. That's why I suggest using vhost in io threads
> > which only cares about ram so the translation could be very fast.
>
> What does using "vhost in io threads" mean?

It means a vhost userspace dataplane that is implemented via io threads.

> Is that a vhost kernel
> approach where userspace dedicates threads (the stuff that Mike
> Christie has been working on)? I haven't looked at how Mike's recent
> patches work, but I wouldn't call that approach QEMU IOThreads,
> because the threads probably don't run the AioContext event loop and
> instead execute vhost kernel code the entire time.
>
> But despite these questions, I think I'm beginning to understand that
> you're proposing a vhost_net.ko AF_XDP implementation instead of a
> userspace QEMU AF_XDP netdev implementation.

Sorry for being unclear, but I'm not proposing that.

> I wonder if any
> optimizations can be made when the AF_XDP user is kernel code instead
> of userspace code.

The only possible way to go is to adapt the AF_XDP umem memory model to
vhost, and I'm not sure there is anything to gain from that.

>
> > >
> > > Are you thinking about AF_XDP passthrough where the guest directly
> > > interacts with AF_XDP?
> >
> > This could be another way to solve, since it won't use Qemu's memory
> > core to do the translation.
> >
> > >
> > > > > If umem encompasses guest memory,
> > > >
> > > > It requires you to pin the whole guest memory and a GPA to HVA
> > > > translation is still required.
> > >
> > > Ilya mentioned that umem uses relative offsets instead of absolute
> > > memory addresses. In the AF_XDP passthrough case this means no address
> > > translation needs to be added to AF_XDP.
> >
> > I don't see how it can avoid the translations as it works at the level
> > of HVA. But what guests submit is PA or even IOVA.
>
> In a passthrough scenario the guest is doing AF_XDP, so it writes
> relative umem offsets, thereby eliminating address translation
> concerns (the addresses are not PAs or IOVAs). However, this approach
> probably won't work well with memory hotplug - or at least it will end
> up becoming a memory translation mechanism in order to support memory
> hotplug.

Ok.

>
> >
> > What's more, guest memory could be backed by different memory
> > backends, this means a single umem may not even work.
>
> Maybe. I don't know the nature of umem. If there can be multiple vmas
> in the umem range, then there should be no issue mixing different
> memory backends.

If I understand correctly, a single umem requires contiguous VA at least.
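
For illustration, a minimal sketch of registering a umem over one
contiguous anonymous mapping (assuming libxdp's <xdp/xsk.h>; NUM_FRAMES
and the error handling here are placeholders, not the actual QEMU code):

  #include <sys/mman.h>
  #include <xdp/xsk.h>            /* <bpf/xsk.h> with older libbpf */

  #define NUM_FRAMES 4096          /* placeholder */

  static struct xsk_umem *create_umem(struct xsk_ring_prod *fq,
                                      struct xsk_ring_cons *cq)
  {
      size_t size = (size_t) NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
      struct xsk_umem *umem;
      void *buf;

      /* One virtually contiguous region backs all packet buffers. */
      buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) {
          return NULL;
      }

      /* NULL config -> library defaults for ring sizes and frame size. */
      if (xsk_umem__create(&umem, buf, size, fq, cq, NULL)) {
          munmap(buf, size);
          return NULL;
      }
      return umem;
  }

Rx/Tx descriptors then carry offsets relative to the start of that
region, which is why the area handed to the kernel has to be a single
contiguous VA range.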

>
> >
> > >
> > > Regarding pinning - I wonder if that's something that can be refined
> > > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > > of umem. That way only rx and tx buffers that are currently in use
> > > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > > pages. I'm not sure whether it's possible to implement this, I haven't
> > > checked the kernel code.
> >
> > It requires the device to do page faults which is not commonly
> > supported nowadays.
>
> I don't understand this comment. AF_XDP processes each rx/tx
> descriptor. At that point it can getuserpages() or similar in order to
> pin the page. When the memory is no longer needed, it can put those
> pages. No fault mechanism is needed. What am I missing?

Ok, I think I kind of get you, you mean doing pinning while processing
rx/tx buffers? It's not easy since GUP itself is not very fast, it may
hit PPS for sure.

Thanks

>
> Stefan
>
Jason Wang June 30, 2023, 7:44 a.m. UTC | #18
On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/28/23 05:27, Jason Wang wrote:
> > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 6/27/23 04:54, Jason Wang wrote:
> >>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>
> >>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>
> >>>>>>> AF_XDP is a network socket family that allows communication directly
> >>>>>>> with the network device driver in the kernel, bypassing most or all
> >>>>>>> of the kernel networking stack.  In the essence, the technology is
> >>>>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> >>>>>>> and works with any network interfaces without driver modifications.
> >>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> >>>>>>> require access to character devices or unix sockets.  Only access to
> >>>>>>> the network interface itself is necessary.
> >>>>>>>
> >>>>>>> This patch implements a network backend that communicates with the
> >>>>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> >>>>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> >>>>>>> Fill and Completion) are placed in that memory along with a pool of
> >>>>>>> memory buffers for the packet data.  Data transmission is done by
> >>>>>>> allocating one of the buffers, copying packet data into it and
> >>>>>>> placing the pointer into Tx ring.  After transmission, device will
> >>>>>>> return the buffer via Completion ring.  On Rx, device will take
> >>>>>>> a buffer form a pre-populated Fill ring, write the packet data into
> >>>>>>> it and place the buffer into Rx ring.
> >>>>>>>
> >>>>>>> AF_XDP network backend takes on the communication with the host
> >>>>>>> kernel and the network interface and forwards packets to/from the
> >>>>>>> peer device in QEMU.
> >>>>>>>
> >>>>>>> Usage example:
> >>>>>>>
> >>>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>>>>>>
> >>>>>>> XDP program bridges the socket with a network interface.  It can be
> >>>>>>> attached to the interface in 2 different modes:
> >>>>>>>
> >>>>>>> 1. skb - this mode should work for any interface and doesn't require
> >>>>>>>          driver support.  With a caveat of lower performance.
> >>>>>>>
> >>>>>>> 2. native - this does require support from the driver and allows to
> >>>>>>>             bypass skb allocation in the kernel and potentially use
> >>>>>>>             zero-copy while getting packets in/out userspace.
> >>>>>>>
> >>>>>>> By default, QEMU will try to use native mode and fall back to skb.
> >>>>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
> >>>>>>> mode, use 'force-copy=on' option.  This might be useful if there is
> >>>>>>> some issue with the driver.
> >>>>>>>
> >>>>>>> Option 'queues=N' allows to specify how many device queues should
> >>>>>>> be open.  Note that all the queues that are not open are still
> >>>>>>> functional and can receive traffic, but it will not be delivered to
> >>>>>>> QEMU.  So, the number of device queues should generally match the
> >>>>>>> QEMU configuration, unless the device is shared with something
> >>>>>>> else and the traffic re-direction to appropriate queues is correctly
> >>>>>>> configured on a device level (e.g. with ethtool -N).
> >>>>>>> 'start-queue=M' option can be used to specify from which queue id
> >>>>>>> QEMU should start configuring 'N' queues.  It might also be necessary
> >>>>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> >>>>>>> for examples.
> >>>>>>>
> >>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> >>>>>>> capabilities in order to load default XSK/XDP programs to the
> >>>>>>> network interface and configure BTF maps.
> >>>>>>
> >>>>>> I think you mean "BPF" actually?
> >>>>
> >>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
> >>>>
> >>>>>>
> >>>>>>>  It is possible, however,
> >>>>>>> to run only with CAP_NET_RAW.
> >>>>>>
> >>>>>> Qemu often runs without any privileges, so we need to fix it.
> >>>>>>
> >>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
> >>>>
> >>>> I looked through the code and it seems like we can run completely
> >>>> non-privileged as far as kernel concerned.  We'll need an API
> >>>> modification in libxdp though.
> >>>>
> >>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
> >>>> a base socket creation.  Binding and other configuration doesn't
> >>>> require any privileges.  So, we could create a socket externally
> >>>> and pass it to QEMU.
> >>>
> >>> That's the way TAP works for example.
> >>>
> >>>>  Should work, unless it's an oversight from
> >>>> the kernel side that needs to be patched. :)  libxdp doesn't have
> >>>> a way to specify externally created socket today, so we'll need
> >>>> to change that.  Should be easy to do though.  I can explore.
> >>>
> >>> Please do that.
> >>
> >> I have a prototype:
> >>   https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
> >>
> >> Need to test it out and then submit PR to xdp-tools project.
> >>
> >>>
> >>>>
> >>>> In case the bind syscall will actually need CAP_NET_RAW for some
> >>>> reason, we could change the kernel and allow non-privileged bind
> >>>> by utilizing, e.g. SO_BINDTODEVICE.  i.e., let the privileged
> >>>> process bind the socket to a particular device, so QEMU can't
> >>>> bind it to a random one.  Might be a good use case to allow even
> >>>> if not strictly necessary.
> >>>
> >>> Yes.
> >>
> >> Will propose something for a kernel as well.  We might want something
> >> more granular though, e.g. bind to a queue instead of a device.  In
> >> case we want better control in the device sharing scenario.
> >
> > I may miss something but the bind is already done at dev plus queue
> > right now, isn't it?
>
>
> Yes, the bind() syscall will bind socket to the dev+queue.  I was talking
> about SO_BINDTODEVICE that only ties the socket to a particular device,
> but not a queue.
>
> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
> assuming a privileged process does:
>
>   fd = socket(AF_XDP, ...);
>   setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);
>
> And sends fd to a non-privileged process.  That non-privileged process
> will be able to call:
>
>   bind(fd, <device>, <random queue>);
>
> It will have to use the same device, but can choose any queue, if that
> queue is not already busy with another socket.
>
> So, I was thinking maybe implementing something like XDP_BINDTOQID option.
> This way the privileged process may call:
>
>   setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);
>
> And later kernel will be able to refuse bind() for any other queue for
> this particular socket.

Not sure; if file descriptor passing works, we probably don't need another way.

>
> Not sure if that is necessary though.
> Since we're allocating the socket in the privileged process, that process
> may add the socket to the BPF map on the correct queue id.  This way the
> non-privileged process will not be able to receive any packets from any
> other queue on this socket, even if bound to it.  And no other AF_XDP
> socket will be able to be bound to that other queue as well.

I think that's by design; or is there anything wrong with this model?

> So, the
> rogue QEMU will be able to hog one extra queue, but it will not be able
> to intercept traffic any from it, AFAICT.  May not be a huge problem
> after all.
>
> SO_BINDTODEVICE would still be nice to have.  Especially for cases where
> we give the whole device to one VM.

Then we need to use AF_XDP in the guest which seems to be a different
topic. Alibaba is working on the AF_XDP support for virtio-net.

Thanks

>
> Best regards, Ilya Maximets.
>
Ilya Maximets June 30, 2023, 3:01 p.m. UTC | #19
On 6/30/23 09:44, Jason Wang wrote:
> On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 6/28/23 05:27, Jason Wang wrote:
>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 6/27/23 04:54, Jason Wang wrote:
>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>
>>>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>
>>>>>>>>> AF_XDP is a network socket family that allows communication directly
>>>>>>>>> with the network device driver in the kernel, bypassing most or all
>>>>>>>>> of the kernel networking stack.  In the essence, the technology is
>>>>>>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>>>>>>> and works with any network interfaces without driver modifications.
>>>>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>>>>>> require access to character devices or unix sockets.  Only access to
>>>>>>>>> the network interface itself is necessary.
>>>>>>>>>
>>>>>>>>> This patch implements a network backend that communicates with the
>>>>>>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>>>>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>>>>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>>>>>> memory buffers for the packet data.  Data transmission is done by
>>>>>>>>> allocating one of the buffers, copying packet data into it and
>>>>>>>>> placing the pointer into Tx ring.  After transmission, device will
>>>>>>>>> return the buffer via Completion ring.  On Rx, device will take
>>>>>>>>> a buffer form a pre-populated Fill ring, write the packet data into
>>>>>>>>> it and place the buffer into Rx ring.
>>>>>>>>>
>>>>>>>>> AF_XDP network backend takes on the communication with the host
>>>>>>>>> kernel and the network interface and forwards packets to/from the
>>>>>>>>> peer device in QEMU.
>>>>>>>>>
>>>>>>>>> Usage example:
>>>>>>>>>
>>>>>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>>>>>
>>>>>>>>> XDP program bridges the socket with a network interface.  It can be
>>>>>>>>> attached to the interface in 2 different modes:
>>>>>>>>>
>>>>>>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>>>>>>          driver support.  With a caveat of lower performance.
>>>>>>>>>
>>>>>>>>> 2. native - this does require support from the driver and allows to
>>>>>>>>>             bypass skb allocation in the kernel and potentially use
>>>>>>>>>             zero-copy while getting packets in/out userspace.
>>>>>>>>>
>>>>>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>>>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>>>>>>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>>>>>>>> some issue with the driver.
>>>>>>>>>
>>>>>>>>> Option 'queues=N' allows to specify how many device queues should
>>>>>>>>> be open.  Note that all the queues that are not open are still
>>>>>>>>> functional and can receive traffic, but it will not be delivered to
>>>>>>>>> QEMU.  So, the number of device queues should generally match the
>>>>>>>>> QEMU configuration, unless the device is shared with something
>>>>>>>>> else and the traffic re-direction to appropriate queues is correctly
>>>>>>>>> configured on a device level (e.g. with ethtool -N).
>>>>>>>>> 'start-queue=M' option can be used to specify from which queue id
>>>>>>>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>>>>>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>>>>>>>> for examples.
>>>>>>>>>
>>>>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>>>>>> network interface and configure BTF maps.
>>>>>>>>
>>>>>>>> I think you mean "BPF" actually?
>>>>>>
>>>>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>>>>>
>>>>>>>>
>>>>>>>>>  It is possible, however,
>>>>>>>>> to run only with CAP_NET_RAW.
>>>>>>>>
>>>>>>>> Qemu often runs without any privileges, so we need to fix it.
>>>>>>>>
>>>>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>>>>>>
>>>>>> I looked through the code and it seems like we can run completely
>>>>>> non-privileged as far as kernel concerned.  We'll need an API
>>>>>> modification in libxdp though.
>>>>>>
>>>>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>>>>>> a base socket creation.  Binding and other configuration doesn't
>>>>>> require any privileges.  So, we could create a socket externally
>>>>>> and pass it to QEMU.
>>>>>
>>>>> That's the way TAP works for example.
>>>>>
>>>>>>  Should work, unless it's an oversight from
>>>>>> the kernel side that needs to be patched. :)  libxdp doesn't have
>>>>>> a way to specify externally created socket today, so we'll need
>>>>>> to change that.  Should be easy to do though.  I can explore.
>>>>>
>>>>> Please do that.
>>>>
>>>> I have a prototype:
>>>>   https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
>>>>
>>>> Need to test it out and then submit PR to xdp-tools project.

The change is now accepted:
  https://github.com/xdp-project/xdp-tools/commit/740c839806a02517da5bce7bd0ccaba908b3f675

I can update the QEMU patch with support for passing socket fds.  It may
look like this:

 -netdev af-xdp,eth0,queues=2,inhibit=on,sock-fds=fd1,fd2

We'll need an fd per queue.  And we may require these fds to be already
added to the xsks map, so QEMU doesn't need xsks-map-fd.
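
A rough sketch of what the privileged side might look like (the helper
name, the unix-socket plumbing and the error handling are assumptions,
not actual code; depending on the kernel, the map insertion may have to
happen only after the socket is bound):

  #include <bpf/bpf.h>           /* bpf_map_update_elem() */
  #include <string.h>
  #include <sys/socket.h>        /* AF_XDP needs recent kernel headers */
  #include <unistd.h>

  /* xsks_map_fd: fd of the pre-loaded program's xsks_map.
   * queue_id:    device queue this socket is meant for.
   * unix_fd:     connected unix socket to the QEMU process.  */
  static int create_and_send_xsk(int xsks_map_fd, unsigned int queue_id,
                                 int unix_fd)
  {
      int xsk_fd = socket(AF_XDP, SOCK_RAW, 0);   /* needs CAP_NET_RAW */

      if (xsk_fd < 0) {
          return -1;
      }

      /* Tie the socket to its queue in the map, so the unprivileged
       * process never has to touch the map itself.  (Depending on the
       * kernel, this step may need to be done after bind().) */
      if (bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0)) {
          close(xsk_fd);
          return -1;
      }

      /* Hand the fd over with SCM_RIGHTS. */
      char byte = 0;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      union {
          struct cmsghdr align;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &xsk_fd, sizeof(int));

      return sendmsg(unix_fd, &msg, 0) < 0 ? -1 : 0;
  }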

I'd say we'll need to compile support for that conditionally, based on
the availability of xsk_umem__create_with_fd(), as it may not be available
in distributions for some time.
An alternative is to require libxdp >= 1.4.0, which is not released yet.

The last restriction will be that QEMU will need 32 MB of RLIMIT_MEMLOCK
per queue for umem registration, but that should not be a huge deal, right?
An alternative is to have CAP_IPC_LOCK.
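
For reference, a minimal sketch of how a management process could raise
the limit before starting QEMU (the 32 MB per queue figure is the
estimate above; this is an illustration, not QEMU code):

  #include <sys/resource.h>

  static int bump_memlock_limit(unsigned int n_queues)
  {
      const rlim_t per_queue = 32ULL << 20;            /* 32 MB */
      struct rlimit r = {
          .rlim_cur = per_queue * n_queues,
          .rlim_max = per_queue * n_queues,
      };

      /* Raising the hard limit needs privileges (CAP_SYS_RESOURCE). */
      return setrlimit(RLIMIT_MEMLOCK, &r);
  }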


And I'd keep the xsks-map-fd parameter for setups that do not have latest
libxdp and can allow CAP_NET_RAW.  So, they could still do:

 -netdev af-xdp,eth0,queues=2,inhibit=on,xsks-map-fd=fd

What do you think?


>>>>
>>>>>
>>>>>>
>>>>>> In case the bind syscall will actually need CAP_NET_RAW for some
>>>>>> reason, we could change the kernel and allow non-privileged bind
>>>>>> by utilizing, e.g. SO_BINDTODEVICE.  i.e., let the privileged
>>>>>> process bind the socket to a particular device, so QEMU can't
>>>>>> bind it to a random one.  Might be a good use case to allow even
>>>>>> if not strictly necessary.
>>>>>
>>>>> Yes.
>>>>
>>>> Will propose something for a kernel as well.  We might want something
>>>> more granular though, e.g. bind to a queue instead of a device.  In
>>>> case we want better control in the device sharing scenario.
>>>
>>> I may miss something but the bind is already done at dev plus queue
>>> right now, isn't it?
>>
>>
>> Yes, the bind() syscall will bind socket to the dev+queue.  I was talking
>> about SO_BINDTODEVICE that only ties the socket to a particular device,
>> but not a queue.
>>
>> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
>> assuming a privileged process does:
>>
>>   fd = socket(AF_XDP, ...);
>>   setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);
>>
>> And sends fd to a non-privileged process.  That non-privileged process
>> will be able to call:
>>
>>   bind(fd, <device>, <random queue>);
>>
>> It will have to use the same device, but can choose any queue, if that
>> queue is not already busy with another socket.
>>
>> So, I was thinking maybe implementing something like XDP_BINDTOQID option.
>> This way the privileged process may call:
>>
>>   setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);
>>
>> And later kernel will be able to refuse bind() for any other queue for
>> this particular socket.
> 
> Not sure; if file descriptor passing works, we probably don't need another way.
> 
>>
>> Not sure if that is necessary though.
>> Since we're allocating the socket in the privileged process, that process
>> may add the socket to the BPF map on the correct queue id.  This way the
>> non-privileged process will not be able to receive any packets from any
>> other queue on this socket, even if bound to it.  And no other AF_XDP
>> socket will be able to be bound to that other queue as well.
> 
> I think that's by design; or is there anything wrong with this model?

No, should be fine.  I've posted a simple SO_BINDTODEVICE change to bpf-next
as an RFC for now since the tree is closed:
  https://lore.kernel.org/netdev/20230630145831.2988845-1-i.maximets@ovn.org/

Will re-send a non-RFC once it is open (after 10th of July, IIRC).

> 
>> So, the
>> rogue QEMU will be able to hog one extra queue, but it will not be able
>> to intercept traffic any from it, AFAICT.  May not be a huge problem
>> after all.
>>
>> SO_BINDTODEVICE would still be nice to have.  Especially for cases where
>> we give the whole device to one VM.
> 
> Then we need to use AF_XDP in the guest which seems to be a different
> topic. Alibaba is working on the AF_XDP support for virtio-net.
> 
> Thanks
> 
>>
>> Best regards, Ilya Maximets.
>>
>
Stefan Hajnoczi July 3, 2023, 9:03 a.m. UTC | #20
On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > > >> too hard to implement.
> > > > > > > > > > >>
> > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > >> scale well.
> > > > > > > > > > >
> > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > >
> > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > >
> > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > >
> > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > > perform transmission for us.
> > > > > > > > >
> > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > cost.
> > > > > > > >
> > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > >
> > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > seems expensive.
> > > > > > >
> > > > > > > Vhost seems to be a shortcut for this.
> > > > > >
> > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > >
> > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > umem.
> > > > >
> > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > part seems to be very expensive according to my test in the past.
> > > >
> > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > emulation.
> > >
> > > Just to make sure we're on the same page.
> > >
> > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > using the Qemu memory core translations which need to take care about
> > > too much extra stuff. That's why I suggest using vhost in io threads
> > > which only cares about ram so the translation could be very fast.
> >
> > What does using "vhost in io threads" mean?
>
> It means a vhost userspace dataplane that is implemented via io threads.

AFAIK this does not exist today. QEMU's built-in devices that use
IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
vhost-user, or vDPA but not built-in devices that use IOThreads. The
built-in devices implement VirtioDeviceClass callbacks directly and
use AioContext APIs to run in IOThreads.

Do you have an idea for using vhost code for built-in devices? Maybe
it's fastest if you explain your idea and its advantages instead of me
guessing.

> > > > Regarding pinning - I wonder if that's something that can be refined
> > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > > > of umem. That way only rx and tx buffers that are currently in use
> > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > > > pages. I'm not sure whether it's possible to implement this, I haven't
> > > > checked the kernel code.
> > >
> > > It requires the device to do page faults which is not commonly
> > > supported nowadays.
> >
> > I don't understand this comment. AF_XDP processes each rx/tx
> > descriptor. At that point it can getuserpages() or similar in order to
> > pin the page. When the memory is no longer needed, it can put those
> > pages. No fault mechanism is needed. What am I missing?
>
> Ok, I think I kind of get you, you mean doing pinning while processing
> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> hit PPS for sure.

Yes. It's not as fast as permanently pinning rx/tx buffers, but it
supports unpinned guest RAM.

There are variations on this approach, like keeping a certain amount
of pages pinned after they have been used so the cost of
pinning/unpinning can be avoided when the same pages are reused in the
future, but I don't know how effective that is in practice.
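
Purely as an illustration of that variation, here is a sketch of a tiny
pin cache (pin_range()/unpin_range() are hypothetical stand-ins for a
GUP-style pin/unpin, not a real kernel API):

  #include <stdbool.h>
  #include <stdint.h>

  #define PIN_CACHE_SLOTS 64

  /* Hypothetical helpers standing in for GUP-style pin/unpin. */
  extern int pin_range(uint64_t addr, uint64_t len);
  extern void unpin_range(uint64_t addr, uint64_t len);

  struct pin_entry {
      uint64_t addr;          /* page-aligned start of a pinned range */
      uint64_t len;
      bool     valid;
  };

  static struct pin_entry pin_cache[PIN_CACHE_SLOTS];

  /* Make sure [addr, addr + len) is pinned, reusing a cached pin when
   * the same buffer shows up again.  Direct-mapped by page number. */
  static int pin_cached(uint64_t addr, uint64_t len)
  {
      struct pin_entry *e = &pin_cache[(addr >> 12) % PIN_CACHE_SLOTS];

      if (e->valid && addr >= e->addr && addr + len <= e->addr + e->len) {
          return 0;                         /* hit: nothing to do */
      }
      if (e->valid) {
          unpin_range(e->addr, e->len);     /* evict the previous range */
          e->valid = false;
      }
      if (pin_range(addr, len)) {
          return -1;
      }
      e->addr = addr;
      e->len = len;
      e->valid = true;
      return 0;
  }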

Is there a more efficient approach without relying on hardware page
fault support?

My understanding is that hardware page fault support is not yet
deployed. We'd be left with pinning guest RAM permanently or using a
runtime pinning/unpinning approach like I've described.

Stefan
Jason Wang July 5, 2023, 6:02 a.m. UTC | #21
On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > > >> scale well.
> > > > > > > > > > > >
> > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > >
> > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > > >
> > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > > >
> > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > > > perform transmission for us.
> > > > > > > > > >
> > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > > cost.
> > > > > > > > >
> > > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > > >
> > > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > > seems expensive.
> > > > > > > >
> > > > > > > > Vhost seems to be a shortcut for this.
> > > > > > >
> > > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > > >
> > > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > > umem.
> > > > > >
> > > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > > part seems to be very expensive according to my test in the past.
> > > > >
> > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > > emulation.
> > > >
> > > > Just to make sure we're on the same page.
> > > >
> > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > > using the Qemu memory core translations which need to take care about
> > > > too much extra stuff. That's why I suggest using vhost in io threads
> > > > which only cares about ram so the translation could be very fast.
> > >
> > > What does using "vhost in io threads" mean?
> >
> > It means a vhost userspace dataplane that is implemented via io threads.
>
> AFAIK this does not exist today. QEMU's built-in devices that use
> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> built-in devices implement VirtioDeviceClass callbacks directly and
> use AioContext APIs to run in IOThreads.

Yes.

>
> Do you have an idea for using vhost code for built-in devices? Maybe
> it's fastest if you explain your idea and its advantages instead of me
> guessing.

It's something like I'd proposed in [1]:

1) a vhost that is implemented via IOThreads
2) memory translation is done via vhost memory table/IOTLB

The advantages are:

1) No 3rd application like DPDK application
2) Attack surface is reduced
3) Better understanding/interactions with device model for things like
RSS and IOMMU

There could be some disadvantages but it's not obvious to me :)

It's something like linking SPDK/DPDK to Qemu.
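
To illustrate why that translation can be fast, here is a sketch of a
vhost-style lookup over a flat, RAM-only region table (illustrative
types only, not the actual vhost structures):

  #include <stddef.h>
  #include <stdint.h>

  struct mem_region {
      uint64_t gpa;            /* guest physical start */
      uint64_t size;
      uint64_t hva;            /* host virtual start */
  };

  /* Table sorted by gpa; for plain RAM this is only a few entries, and
   * anything not listed (I/O, ROM, ...) simply isn't translatable. */
  static void *gpa_to_hva(const struct mem_region *tbl, size_t n,
                          uint64_t gpa)
  {
      size_t lo = 0, hi = n;

      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;

          if (gpa < tbl[mid].gpa) {
              hi = mid;
          } else if (gpa - tbl[mid].gpa >= tbl[mid].size) {
              lo = mid + 1;
          } else {
              return (void *)(uintptr_t)(tbl[mid].hva + (gpa - tbl[mid].gpa));
          }
      }
      return NULL;
  }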

>
> > > > > Regarding pinning - I wonder if that's something that can be refined
> > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > > > > of umem. That way only rx and tx buffers that are currently in use
> > > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > > > > pages. I'm not sure whether it's possible to implement this, I haven't
> > > > > checked the kernel code.
> > > >
> > > > It requires the device to do page faults which is not commonly
> > > > supported nowadays.
> > >
> > > I don't understand this comment. AF_XDP processes each rx/tx
> > > descriptor. At that point it can getuserpages() or similar in order to
> > > pin the page. When the memory is no longer needed, it can put those
> > > pages. No fault mechanism is needed. What am I missing?
> >
> > Ok, I think I kind of get you, you mean doing pinning while processing
> > rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> > hit PPS for sure.
>
> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> supports unpinned guest RAM.

Right, it's a balance between pin and PPS. PPS seems to be more
important in this case.

>
> There are variations on this approach, like keeping a certain amount
> of pages pinned after they have been used so the cost of
> pinning/unpinning can be avoided when the same pages are reused in the
> future, but I don't know how effective that is in practice.
>
> Is there a more efficient approach without relying on hardware page
> fault support?

I guess so, I see some slides that say device page fault is very slow.

>
> My understanding is that hardware page fault support is not yet
> deployed. We'd be left with pinning guest RAM permanently or using a
> runtime pinning/unpinning approach like I've described.

Probably.

Thanks

>
> Stefan
>
Stefan Hajnoczi July 6, 2023, 7:08 p.m. UTC | #22
On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > > > >> scale well.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > > > >
> > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > > > >
> > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > > > > perform transmission for us.
> > > > > > > > > > >
> > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > > > cost.
> > > > > > > > > >
> > > > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > > > >
> > > > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > > > seems expensive.
> > > > > > > > >
> > > > > > > > > Vhost seems to be a shortcut for this.
> > > > > > > >
> > > > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > > > >
> > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > > > umem.
> > > > > > >
> > > > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > > > part seems to be very expensive according to my test in the past.
> > > > > >
> > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > > > emulation.
> > > > >
> > > > > Just to make sure we're on the same page.
> > > > >
> > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > > > using the Qemu memory core translations which need to take care about
> > > > > too much extra stuff. That's why I suggest using vhost in io threads
> > > > > which only cares about ram so the translation could be very fast.
> > > >
> > > > What does using "vhost in io threads" mean?
> > >
> > > It means a vhost userspace dataplane that is implemented via io threads.
> >
> > AFAIK this does not exist today. QEMU's built-in devices that use
> > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > built-in devices implement VirtioDeviceClass callbacks directly and
> > use AioContext APIs to run in IOThreads.
>
> Yes.
>
> >
> > Do you have an idea for using vhost code for built-in devices? Maybe
> > it's fastest if you explain your idea and its advantages instead of me
> > guessing.
>
> It's something like I'd proposed in [1]:
>
> 1) a vhost that is implemented via IOThreads
> 2) memory translation is done via vhost memory table/IOTLB
>
> The advantages are:
>
> 1) No 3rd application like DPDK application
> 2) Attack surface is reduced
> 3) Better understanding/interactions with device model for things like
> RSS and IOMMU
>
> There could be some disadvantages but it's not obvious to me :)

Why is QEMU's native device emulation API not the natural choice for
writing built-in devices? I don't understand why the vhost interface
is desirable for built-in devices.

>
> It's something like linking SPDK/DPDK to Qemu.

Sergio Lopez tried loading vhost-user devices as shared libraries that
run in the QEMU process. It worked as an experiment but wasn't pursued
further.

I think that might make sense in specific cases where there is an
existing vhost-user codebase that needs to run as part of QEMU.

In this case the AF_XDP code is new, so it's not a case of moving
existing code into QEMU.

>
> >
> > > > > > Regarding pinning - I wonder if that's something that can be refined
> > > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > > > > > of umem. That way only rx and tx buffers that are currently in use
> > > > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > > > > > pages. I'm not sure whether it's possible to implement this, I haven't
> > > > > > checked the kernel code.
> > > > >
> > > > > It requires the device to do page faults which is not commonly
> > > > > supported nowadays.
> > > >
> > > > I don't understand this comment. AF_XDP processes each rx/tx
> > > > descriptor. At that point it can getuserpages() or similar in order to
> > > > pin the page. When the memory is no longer needed, it can put those
> > > > pages. No fault mechanism is needed. What am I missing?
> > >
> > > Ok, I think I kind of get you, you mean doing pinning while processing
> > > rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> > > hit PPS for sure.
> >
> > Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> > supports unpinned guest RAM.
>
> Right, it's a balance between pin and PPS. PPS seems to be more
> important in this case.
>
> >
> > There are variations on this approach, like keeping a certain amount
> > of pages pinned after they have been used so the cost of
> > pinning/unpinning can be avoided when the same pages are reused in the
> > future, but I don't know how effective that is in practice.
> >
> > Is there a more efficient approach without relying on hardware page
> > fault support?
>
> I guess so, I see some slides that say device page fault is very slow.
>
> >
> > My understanding is that hardware page fault support is not yet
> > deployed. We'd be left with pinning guest RAM permanently or using a
> > runtime pinning/unpinning approach like I've described.
>
> Probably.
>
> Thanks
>
> >
> > Stefan
> >
>
Jason Wang July 7, 2023, 1:43 a.m. UTC | #23
On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > > > > >> scale well.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > > > > > perform transmission for us.
> > > > > > > > > > > >
> > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > > > > cost.
> > > > > > > > > > >
> > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > > > > >
> > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > > > > seems expensive.
> > > > > > > > > >
> > > > > > > > > > Vhost seems to be a shortcut for this.
> > > > > > > > >
> > > > > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > > > > >
> > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > > > > umem.
> > > > > > > >
> > > > > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > > > > part seems to be very expensive according to my test in the past.
> > > > > > >
> > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > > > > emulation.
> > > > > >
> > > > > > Just to make sure we're on the same page.
> > > > > >
> > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > > > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > > > > using the Qemu memory core translations which need to take care about
> > > > > > too much extra stuff. That's why I suggest using vhost in io threads
> > > > > > which only cares about ram so the translation could be very fast.
> > > > >
> > > > > What does using "vhost in io threads" mean?
> > > >
> > > > It means a vhost userspace dataplane that is implemented via io threads.
> > >
> > > AFAIK this does not exist today. QEMU's built-in devices that use
> > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > > vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > > built-in devices implement VirtioDeviceClass callbacks directly and
> > > use AioContext APIs to run in IOThreads.
> >
> > Yes.
> >
> > >
> > > Do you have an idea for using vhost code for built-in devices? Maybe
> > > it's fastest if you explain your idea and its advantages instead of me
> > > guessing.
> >
> > It's something like I'd proposed in [1]:
> >
> > 1) a vhost that is implemented via IOThreads
> > 2) memory translation is done via vhost memory table/IOTLB
> >
> > The advantages are:
> >
> > 1) No 3rd application like DPDK application
> > 2) Attack surface is reduced
> > 3) Better understanding/interactions with device model for things like
> > RSS and IOMMU
> >
> > There could be some disadvantages but it's not obvious to me :)
>
> Why is QEMU's native device emulation API not the natural choice for
> writing built-in devices? I don't understand why the vhost interface
> is desirable for built-in devices.

Unless the memory helpers (like address translation) are optimized
enough to sustain this 10M+ PPS.

Not sure if this is too hard, but last time I benchmarked, perf told me
most of the time was spent in the translation.

Using vhost is a workaround since its memory model is much simpler,
so it can skip lots of memory sections like I/O and ROM etc.

Thanks

>
> >
> > It's something like linking SPDK/DPDK to Qemu.
>
> Sergio Lopez tried loading vhost-user devices as shared libraries that
> run in the QEMU process. It worked as an experiment but wasn't pursued
> further.
>
> I think that might make sense in specific cases where there is an
> existing vhost-user codebase that needs to run as part of QEMU.
>
> In this case the AF_XDP code is new, so it's not a case of moving
> existing code into QEMU.
>
> >
> > >
> > > > > > > Regarding pinning - I wonder if that's something that can be refined
> > > > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > > > > > > of umem. That way only rx and tx buffers that are currently in use
> > > > > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > > > > > > pages. I'm not sure whether it's possible to implement this, I haven't
> > > > > > > checked the kernel code.
> > > > > >
> > > > > > It requires the device to do page faults which is not commonly
> > > > > > supported nowadays.
> > > > >
> > > > > I don't understand this comment. AF_XDP processes each rx/tx
> > > > > descriptor. At that point it can getuserpages() or similar in order to
> > > > > pin the page. When the memory is no longer needed, it can put those
> > > > > pages. No fault mechanism is needed. What am I missing?
> > > >
> > > > Ok, I think I kind of get you, you mean doing pinning while processing
> > > > rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> > > > hit PPS for sure.
> > >
> > > Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> > > supports unpinned guest RAM.
> >
> > Right, it's a balance between pin and PPS. PPS seems to be more
> > important in this case.
> >
> > >
> > > There are variations on this approach, like keeping a certain amount
> > > of pages pinned after they have been used so the cost of
> > > pinning/unpinning can be avoided when the same pages are reused in the
> > > future, but I don't know how effective that is in practice.
> > >
> > > Is there a more efficient approach without relying on hardware page
> > > fault support?
> >
> > I guess so, I see some slides that say device page fault is very slow.
> >
> > >
> > > My understanding is that hardware page fault support is not yet
> > > deployed. We'd be left with pinning guest RAM permanently or using a
> > > runtime pinning/unpinning approach like I've described.
> >
> > Probably.
> >
> > Thanks
> >
> > >
> > > Stefan
> > >
> >
>
Ilya Maximets July 7, 2023, 11:21 a.m. UTC | #24
On 7/7/23 03:43, Jason Wang wrote:
> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>
>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just rcu lock and
>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching
>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  And it shouldn't be
>>>>>>>>>>>>>>>> too hard to implement.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating
>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
>>>>>>>>>>>>>>>> scale well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between
>>>>>>>>>>>>>>> io_uring and AF_XDP:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) both have similar memory model (user register)
>>>>>>>>>>>>>>> 2) both use ring for communication
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can
>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
>>>>>>>>>>>>>> virtual interfaces.  io_uring thread in the kernel will be able to
>>>>>>>>>>>>>> perform transmission for us.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop
>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation
>>>>>>>>>>>>> cost.
>>>>>>>>>>>>
>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The
>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so
>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the
>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts.
>>>>>>>>>>>
>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to
>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
>>>>>>>>>>> seems expensive.
>>>>>>>>>>>
>>>>>>>>>>> Vhost seems to be a shortcut for this.
>>>>>>>>>>
>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
>>>>>>>>>>
>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring)
>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to
>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
>>>>>>>>>> umem.
>>>>>>>>>
>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this
>>>>>>>>> part seems to be very expensive according to my test in the past.
>>>>>>>>
>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU
>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
>>>>>>>> emulation.
>>>>>>>
>>>>>>> Just to make sure we're on the same page.
>>>>>>>
>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
>>>>>>> using the Qemu memory core translations which need to take care about
>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads
>>>>>>> which only cares about ram so the translation could be very fast.
>>>>>>
>>>>>> What does using "vhost in io threads" mean?
>>>>>
>>>>> It means a vhost userspace dataplane that is implemented via io threads.
>>>>
>>>> AFAIK this does not exist today. QEMU's built-in devices that use
>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
>>>> built-in devices implement VirtioDeviceClass callbacks directly and
>>>> use AioContext APIs to run in IOThreads.
>>>
>>> Yes.
>>>
>>>>
>>>> Do you have an idea for using vhost code for built-in devices? Maybe
>>>> it's fastest if you explain your idea and its advantages instead of me
>>>> guessing.
>>>
>>> It's something like I'd proposed in [1]:
>>>
>>> 1) a vhost that is implemented via IOThreads
>>> 2) memory translation is done via vhost memory table/IOTLB
>>>
>>> The advantages are:
>>>
>>> 1) No 3rd application like DPDK application
>>> 2) Attack surface were reduced
>>> 3) Better understanding/interactions with device model for things like
>>> RSS and IOMMU
>>>
>>> There could be some dis-advantages but it's not obvious to me :)
>>
>> Why is QEMU's native device emulation API not the natural choice for
>> writing built-in devices? I don't understand why the vhost interface
>> is desirable for built-in devices.
> 
> Unless the memory helpers (like address translations) were optimized
> fully to satisfy this 10M+ PPS.
> 
> Not sure if this is too hard, but last time I benchmark, perf told me
> most of the time spent in the translation.
> 
> Using a vhost is a workaround since its memory model is much more
> simpler so it can skip lots of memory sections like I/O and ROM etc.

So, we can have a thread running as part of the QEMU process that implements
vhost functionality for a virtio-net device, and this thread has an
optimized way to access memory.  What prevents the current virtio-net
emulation code from accessing memory in the same optimized way?  I.e. we
likely don't actually need to implement the whole vhost-virtio communication
protocol just to get faster memory access from the device emulation code.
I mean, if vhost can access that memory faster, why can't the device itself?

With that we could probably split the "datapath" part of the virtio-net
emulation into a separate thread driven by an iothread loop.

Then add a batch API for communication with the network backend (af-xdp) to
avoid per-packet calls.
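
Purely as an illustration, a minimal sketch of what such a batch interface
could look like (hypothetical; none of these names exist in QEMU today, they
only show the shape of the API):

/*
 * Hypothetical batch interface between the virtio-net emulation and a
 * network backend such as af-xdp.  The idea is to queue packets while
 * draining the TX virtqueue and hand the whole burst to the backend with
 * a single call (for AF_XDP: one xsk_ring_prod__reserve()/__submit()
 * plus one sendto()) instead of one call per packet.
 */
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

#define NET_TX_BATCH_MAX 64

struct net_tx_batch {
    struct iovec pkts[NET_TX_BATCH_MAX];  /* one iovec per packet     */
    size_t       n_pkts;                  /* number of queued packets */
};

/* Backend hook: transmit all queued packets, return how many were sent. */
typedef size_t (*net_send_batch_fn)(void *backend_opaque,
                                    const struct net_tx_batch *batch);

/* Device side: collect packets into the batch instead of calling the
 * backend once per packet; flush via net_send_batch_fn when full. */
static inline bool net_tx_batch_add(struct net_tx_batch *b,
                                    void *buf, size_t len)
{
    if (b->n_pkts == NET_TX_BATCH_MAX) {
        return false;  /* caller should flush the batch first */
    }
    b->pkts[b->n_pkts].iov_base = buf;
    b->pkts[b->n_pkts].iov_len  = len;
    b->n_pkts++;
    return true;
}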

These are 3 more or less independent tasks that should allow performance
similar to a full-fledged vhost control and dataplane implementation
inside QEMU.

Or am I missing something? (Probably)

> 
> Thanks
> 
>>
>>>
>>> It's something like linking SPDK/DPDK to Qemu.
>>
>> Sergio Lopez tried loading vhost-user devices as shared libraries that
>> run in the QEMU process. It worked as an experiment but wasn't pursued
>> further.
>>
>> I think that might make sense in specific cases where there is an
>> existing vhost-user codebase that needs to run as part of QEMU.
>>
>> In this case the AF_XDP code is new, so it's not a case of moving
>> existing code into QEMU.
>>
>>>
>>>>
>>>>>>>> Regarding pinning - I wonder if that's something that can be refined
>>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
>>>>>>>> of umem. That way only rx and tx buffers that are currently in use
>>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
>>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't
>>>>>>>> checked the kernel code.
>>>>>>>
>>>>>>> It requires the device to do page faults which is not commonly
>>>>>>> supported nowadays.
>>>>>>
>>>>>> I don't understand this comment. AF_XDP processes each rx/tx
>>>>>> descriptor. At that point it can getuserpages() or similar in order to
>>>>>> pin the page. When the memory is no longer needed, it can put those
>>>>>> pages. No fault mechanism is needed. What am I missing?
>>>>>
>>>>> Ok, I think I kind of get you, you mean doing pinning while processing
>>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
>>>>> hit PPS for sure.
>>>>
>>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
>>>> supports unpinned guest RAM.
>>>
>>> Right, it's a balance between pin and PPS. PPS seems to be more
>>> important in this case.
>>>
>>>>
>>>> There are variations on this approach, like keeping a certain amount
>>>> of pages pinned after they have been used so the cost of
>>>> pinning/unpinning can be avoided when the same pages are reused in the
>>>> future, but I don't know how effective that is in practice.
>>>>
>>>> Is there a more efficient approach without relying on hardware page
>>>> fault support?
>>>
>>> I guess so, I see some slides that say device page fault is very slow.
>>>
>>>>
>>>> My understanding is that hardware page fault support is not yet
>>>> deployed. We'd be left with pinning guest RAM permanently or using a
>>>> runtime pinning/unpinning approach like I've described.
>>>
>>> Probably.
>>>
>>> Thanks
>>>
>>>>
>>>> Stefan
>>>>
>>>
>>
>
Jason Wang July 10, 2023, 3:51 a.m. UTC | #25
On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/7/23 03:43, Jason Wang wrote:
> > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>
> >> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>
> >>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>
> >>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> >>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> >>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just rcu lock and
> >>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching
> >>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  And it shouldn't be
> >>>>>>>>>>>>>>>> too hard to implement.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating
> >>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> >>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> >>>>>>>>>>>>>>>> scale well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between
> >>>>>>>>>>>>>>> io_uring and AF_XDP:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) both have similar memory model (user register)
> >>>>>>>>>>>>>>> 2) both use ring for communication
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can
> >>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> >>>>>>>>>>>>>> virtual interfaces.  io_uring thread in the kernel will be able to
> >>>>>>>>>>>>>> perform transmission for us.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop
> >>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation
> >>>>>>>>>>>>> cost.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> >>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> >>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The
> >>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so
> >>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the
> >>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts.
> >>>>>>>>>>>
> >>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to
> >>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
> >>>>>>>>>>> seems expensive.
> >>>>>>>>>>>
> >>>>>>>>>>> Vhost seems to be a shortcut for this.
> >>>>>>>>>>
> >>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> >>>>>>>>>>
> >>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring)
> >>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to
> >>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
> >>>>>>>>>> umem.
> >>>>>>>>>
> >>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> >>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this
> >>>>>>>>> part seems to be very expensive according to my test in the past.
> >>>>>>>>
> >>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU
> >>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> >>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
> >>>>>>>> emulation.
> >>>>>>>
> >>>>>>> Just to make sure we're on the same page.
> >>>>>>>
> >>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> >>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
> >>>>>>> using the Qemu memory core translations which need to take care about
> >>>>>>> too much extra stuff. That's why I suggest using vhost in io threads
> >>>>>>> which only cares about ram so the translation could be very fast.
> >>>>>>
> >>>>>> What does using "vhost in io threads" mean?
> >>>>>
> >>>>> It means a vhost userspace dataplane that is implemented via io threads.
> >>>>
> >>>> AFAIK this does not exist today. QEMU's built-in devices that use
> >>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> >>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> >>>> built-in devices implement VirtioDeviceClass callbacks directly and
> >>>> use AioContext APIs to run in IOThreads.
> >>>
> >>> Yes.
> >>>
> >>>>
> >>>> Do you have an idea for using vhost code for built-in devices? Maybe
> >>>> it's fastest if you explain your idea and its advantages instead of me
> >>>> guessing.
> >>>
> >>> It's something like I'd proposed in [1]:
> >>>
> >>> 1) a vhost that is implemented via IOThreads
> >>> 2) memory translation is done via vhost memory table/IOTLB
> >>>
> >>> The advantages are:
> >>>
> >>> 1) No 3rd application like DPDK application
> >>> 2) Attack surface were reduced
> >>> 3) Better understanding/interactions with device model for things like
> >>> RSS and IOMMU
> >>>
> >>> There could be some dis-advantages but it's not obvious to me :)
> >>
> >> Why is QEMU's native device emulation API not the natural choice for
> >> writing built-in devices? I don't understand why the vhost interface
> >> is desirable for built-in devices.
> >
> > Unless the memory helpers (like address translations) were optimized
> > fully to satisfy this 10M+ PPS.
> >
> > Not sure if this is too hard, but last time I benchmark, perf told me
> > most of the time spent in the translation.
> >
> > Using a vhost is a workaround since its memory model is much more
> > simpler so it can skip lots of memory sections like I/O and ROM etc.
>
> So, we can have a thread running as part of QEMU process that implements
> vhost functionality for a virtio-net device.  And this thread has an
> optimized way to access memory.  What prevents current virtio-net emulation
> code accessing memory in the same optimized way?

Current emulation uses the memory core accessors, which need to take care
of a lot of stuff like MMIO or even P2P.  Such things were not a
consideration since day 0 of vhost.  You can do some experiments on this,
e.g. just drop packets after fetching them from the TX ring.

> i.e. we likely don't
> actually need to implement the whole vhost-virtio communication protocol
> in order to have faster memory access from the device emulation code.
> I mean, if vhost can access device memory faster, why device itself can't?

I'm not saying it can't, but it would end up with something similar to
vhost.  And that's why I'm saying using vhost is a shortcut (at least
for a POC).

Thanks

>
> With that we could probably split the "datapath" part of the virtio-net
> emulation into a separate thread driven by iothread loop.
>
> Then add batch API for communication with a network backend (af-xdp) to
> avoid per-packet calls.
>
> These are 3 more or less independent tasks that should allow the similar
> performance to a full fledged vhost control and dataplane implementation
> inside QEMU.
>
> Or am I missing something? (Probably)
>
> >
> > Thanks
> >
> >>
> >>>
> >>> It's something like linking SPDK/DPDK to Qemu.
> >>
> >> Sergio Lopez tried loading vhost-user devices as shared libraries that
> >> run in the QEMU process. It worked as an experiment but wasn't pursued
> >> further.
> >>
> >> I think that might make sense in specific cases where there is an
> >> existing vhost-user codebase that needs to run as part of QEMU.
> >>
> >> In this case the AF_XDP code is new, so it's not a case of moving
> >> existing code into QEMU.
> >>
> >>>
> >>>>
> >>>>>>>> Regarding pinning - I wonder if that's something that can be refined
> >>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
> >>>>>>>> of umem. That way only rx and tx buffers that are currently in use
> >>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
> >>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't
> >>>>>>>> checked the kernel code.
> >>>>>>>
> >>>>>>> It requires the device to do page faults which is not commonly
> >>>>>>> supported nowadays.
> >>>>>>
> >>>>>> I don't understand this comment. AF_XDP processes each rx/tx
> >>>>>> descriptor. At that point it can getuserpages() or similar in order to
> >>>>>> pin the page. When the memory is no longer needed, it can put those
> >>>>>> pages. No fault mechanism is needed. What am I missing?
> >>>>>
> >>>>> Ok, I think I kind of get you, you mean doing pinning while processing
> >>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> >>>>> hit PPS for sure.
> >>>>
> >>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> >>>> supports unpinned guest RAM.
> >>>
> >>> Right, it's a balance between pin and PPS. PPS seems to be more
> >>> important in this case.
> >>>
> >>>>
> >>>> There are variations on this approach, like keeping a certain amount
> >>>> of pages pinned after they have been used so the cost of
> >>>> pinning/unpinning can be avoided when the same pages are reused in the
> >>>> future, but I don't know how effective that is in practice.
> >>>>
> >>>> Is there a more efficient approach without relying on hardware page
> >>>> fault support?
> >>>
> >>> I guess so, I see some slides that say device page fault is very slow.
> >>>
> >>>>
> >>>> My understanding is that hardware page fault support is not yet
> >>>> deployed. We'd be left with pinning guest RAM permanently or using a
> >>>> runtime pinning/unpinning approach like I've described.
> >>>
> >>> Probably.
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>> Stefan
> >>>>
> >>>
> >>
> >
>
Ilya Maximets July 10, 2023, 10:56 a.m. UTC | #26
On 7/10/23 05:51, Jason Wang wrote:
> On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 7/7/23 03:43, Jason Wang wrote:
>>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>
>>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>
>>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
>>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just rcu lock and
>>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching
>>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  And it shouldn't be
>>>>>>>>>>>>>>>>>> too hard to implement.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating
>>>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
>>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
>>>>>>>>>>>>>>>>>> scale well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between
>>>>>>>>>>>>>>>>> io_uring and AF_XDP:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) both have similar memory model (user register)
>>>>>>>>>>>>>>>>> 2) both use ring for communication
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can
>>>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
>>>>>>>>>>>>>>>> virtual interfaces.  io_uring thread in the kernel will be able to
>>>>>>>>>>>>>>>> perform transmission for us.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop
>>>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation
>>>>>>>>>>>>>>> cost.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
>>>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
>>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The
>>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so
>>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the
>>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to
>>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
>>>>>>>>>>>>> seems expensive.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Vhost seems to be a shortcut for this.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
>>>>>>>>>>>>
>>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring)
>>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to
>>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
>>>>>>>>>>>> umem.
>>>>>>>>>>>
>>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
>>>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this
>>>>>>>>>>> part seems to be very expensive according to my test in the past.
>>>>>>>>>>
>>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU
>>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
>>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
>>>>>>>>>> emulation.
>>>>>>>>>
>>>>>>>>> Just to make sure we're on the same page.
>>>>>>>>>
>>>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
>>>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
>>>>>>>>> using the Qemu memory core translations which need to take care about
>>>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads
>>>>>>>>> which only cares about ram so the translation could be very fast.
>>>>>>>>
>>>>>>>> What does using "vhost in io threads" mean?
>>>>>>>
>>>>>>> It means a vhost userspace dataplane that is implemented via io threads.
>>>>>>
>>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
>>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
>>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
>>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
>>>>>> use AioContext APIs to run in IOThreads.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
>>>>>> it's fastest if you explain your idea and its advantages instead of me
>>>>>> guessing.
>>>>>
>>>>> It's something like I'd proposed in [1]:
>>>>>
>>>>> 1) a vhost that is implemented via IOThreads
>>>>> 2) memory translation is done via vhost memory table/IOTLB
>>>>>
>>>>> The advantages are:
>>>>>
>>>>> 1) No 3rd application like DPDK application
>>>>> 2) Attack surface were reduced
>>>>> 3) Better understanding/interactions with device model for things like
>>>>> RSS and IOMMU
>>>>>
>>>>> There could be some dis-advantages but it's not obvious to me :)
>>>>
>>>> Why is QEMU's native device emulation API not the natural choice for
>>>> writing built-in devices? I don't understand why the vhost interface
>>>> is desirable for built-in devices.
>>>
>>> Unless the memory helpers (like address translations) were optimized
>>> fully to satisfy this 10M+ PPS.
>>>
>>> Not sure if this is too hard, but last time I benchmark, perf told me
>>> most of the time spent in the translation.
>>>
>>> Using a vhost is a workaround since its memory model is much more
>>> simpler so it can skip lots of memory sections like I/O and ROM etc.
>>
>> So, we can have a thread running as part of QEMU process that implements
>> vhost functionality for a virtio-net device.  And this thread has an
>> optimized way to access memory.  What prevents current virtio-net emulation
>> code accessing memory in the same optimized way?
> 
> Current emulation using memory core accessors which needs to take care
> of a lot of stuff like MMIO or even P2P. Such kind of stuff is not
> considered since day0 of vhost. You can do some experiment on this e.g
> just dropping packets after fetching it from the TX ring.

If I'm reading that right, the virtio implementation uses address space
caching by utilizing a memory listener and pre-translated addresses of
the interesting memory regions.  It then performs address_space_read_cached(),
which bypasses all the memory address translation logic on a cache hit.
That sounds pretty similar to how the memory table is prepared for vhost.
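
For reference, a minimal sketch of that cached access pattern using the
existing MemoryRegionCache helpers (error handling and endianness
conversion trimmed; the function itself is illustrative, not actual
virtio code):

/* Sketch (QEMU-internal code): pre-translate a guest-physical range once
 * via a MemoryRegionCache, then read from it repeatedly without going
 * through the full memory core dispatch.  This is roughly what the
 * virtio code does for the descriptor/avail/used rings. */
#include "qemu/osdep.h"
#include "exec/memory.h"

static uint16_t read_split_avail_idx(AddressSpace *as, hwaddr avail_pa)
{
    MemoryRegionCache cache = MEMORY_REGION_CACHE_INVALID;
    uint16_t idx = 0;

    /* Translation and bounds checking happen here, once. */
    if (address_space_cache_init(&cache, as, avail_pa, 4, false) < 0) {
        return 0;
    }

    /* avail->idx lives at offset 2 in the split-ring layout; this read
     * hits the cached mapping on the fast path. */
    address_space_read_cached(&cache, 2, &idx, sizeof(idx));

    address_space_cache_destroy(&cache);
    return idx;
}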

> 
>> i.e. we likely don't
>> actually need to implement the whole vhost-virtio communication protocol
>> in order to have faster memory access from the device emulation code.
>> I mean, if vhost can access device memory faster, why device itself can't?
> 
> I'm not saying it can't but it would end up with something similar to
> vhost. And that's why I'm saying using vhost is a shortcut (at least
> for a POC).
> 
> Thanks
> 
>>
>> With that we could probably split the "datapath" part of the virtio-net
>> emulation into a separate thread driven by iothread loop.
>>
>> Then add batch API for communication with a network backend (af-xdp) to
>> avoid per-packet calls.
>>
>> These are 3 more or less independent tasks that should allow the similar
>> performance to a full fledged vhost control and dataplane implementation
>> inside QEMU.
>>
>> Or am I missing something? (Probably)
>>
>>>
>>> Thanks
>>>
>>>>
>>>>>
>>>>> It's something like linking SPDK/DPDK to Qemu.
>>>>
>>>> Sergio Lopez tried loading vhost-user devices as shared libraries that
>>>> run in the QEMU process. It worked as an experiment but wasn't pursued
>>>> further.
>>>>
>>>> I think that might make sense in specific cases where there is an
>>>> existing vhost-user codebase that needs to run as part of QEMU.
>>>>
>>>> In this case the AF_XDP code is new, so it's not a case of moving
>>>> existing code into QEMU.
>>>>
>>>>>
>>>>>>
>>>>>>>>>> Regarding pinning - I wonder if that's something that can be refined
>>>>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
>>>>>>>>>> of umem. That way only rx and tx buffers that are currently in use
>>>>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
>>>>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't
>>>>>>>>>> checked the kernel code.
>>>>>>>>>
>>>>>>>>> It requires the device to do page faults which is not commonly
>>>>>>>>> supported nowadays.
>>>>>>>>
>>>>>>>> I don't understand this comment. AF_XDP processes each rx/tx
>>>>>>>> descriptor. At that point it can getuserpages() or similar in order to
>>>>>>>> pin the page. When the memory is no longer needed, it can put those
>>>>>>>> pages. No fault mechanism is needed. What am I missing?
>>>>>>>
>>>>>>> Ok, I think I kind of get you, you mean doing pinning while processing
>>>>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
>>>>>>> hit PPS for sure.
>>>>>>
>>>>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
>>>>>> supports unpinned guest RAM.
>>>>>
>>>>> Right, it's a balance between pin and PPS. PPS seems to be more
>>>>> important in this case.
>>>>>
>>>>>>
>>>>>> There are variations on this approach, like keeping a certain amount
>>>>>> of pages pinned after they have been used so the cost of
>>>>>> pinning/unpinning can be avoided when the same pages are reused in the
>>>>>> future, but I don't know how effective that is in practice.
>>>>>>
>>>>>> Is there a more efficient approach without relying on hardware page
>>>>>> fault support?
>>>>>
>>>>> I guess so, I see some slides that say device page fault is very slow.
>>>>>
>>>>>>
>>>>>> My understanding is that hardware page fault support is not yet
>>>>>> deployed. We'd be left with pinning guest RAM permanently or using a
>>>>>> runtime pinning/unpinning approach like I've described.
>>>>>
>>>>> Probably.
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>
>>>>
>>>
>>
>
Stefan Hajnoczi July 10, 2023, 3:14 p.m. UTC | #27
On Thu, 6 Jul 2023 at 21:43, Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > > > > > >> scale well.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > > > > > > perform transmission for us.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > > > > > cost.
> > > > > > > > > > > >
> > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > > > > > >
> > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > > > > > seems expensive.
> > > > > > > > > > >
> > > > > > > > > > > Vhost seems to be a shortcut for this.
> > > > > > > > > >
> > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > > > > > >
> > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > > > > > umem.
> > > > > > > > >
> > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > > > > > part seems to be very expensive according to my test in the past.
> > > > > > > >
> > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > > > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > > > > > emulation.
> > > > > > >
> > > > > > > Just to make sure we're on the same page.
> > > > > > >
> > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > > > > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > > > > > using the Qemu memory core translations which need to take care about
> > > > > > > too much extra stuff. That's why I suggest using vhost in io threads
> > > > > > > which only cares about ram so the translation could be very fast.
> > > > > >
> > > > > > What does using "vhost in io threads" mean?
> > > > >
> > > > > It means a vhost userspace dataplane that is implemented via io threads.
> > > >
> > > > AFAIK this does not exist today. QEMU's built-in devices that use
> > > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > > > vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > > > built-in devices implement VirtioDeviceClass callbacks directly and
> > > > use AioContext APIs to run in IOThreads.
> > >
> > > Yes.
> > >
> > > >
> > > > Do you have an idea for using vhost code for built-in devices? Maybe
> > > > it's fastest if you explain your idea and its advantages instead of me
> > > > guessing.
> > >
> > > It's something like I'd proposed in [1]:
> > >
> > > 1) a vhost that is implemented via IOThreads
> > > 2) memory translation is done via vhost memory table/IOTLB
> > >
> > > The advantages are:
> > >
> > > 1) No 3rd application like DPDK application
> > > 2) Attack surface were reduced
> > > 3) Better understanding/interactions with device model for things like
> > > RSS and IOMMU
> > >
> > > There could be some dis-advantages but it's not obvious to me :)
> >
> > Why is QEMU's native device emulation API not the natural choice for
> > writing built-in devices? I don't understand why the vhost interface
> > is desirable for built-in devices.
>
> Unless the memory helpers (like address translations) were optimized
> fully to satisfy this 10M+ PPS.
>
> Not sure if this is too hard, but last time I benchmark, perf told me
> most of the time spent in the translation.
>
> Using a vhost is a workaround since its memory model is much more
> simpler so it can skip lots of memory sections like I/O and ROM etc.

I see, that sounds like a question of optimization.  Most DMA transfers
will be to/from guest RAM, and it seems like QEMU's memory API could be
optimized for that case.  PIO/MMIO dispatch could use a different API
from DMA transfers, if necessary.

I don't think there is a fundamental reason why QEMU's own device
emulation code cannot translate memory as fast as vhost devices can.

Stefan
Stefan Hajnoczi July 10, 2023, 3:21 p.m. UTC | #28
On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/10/23 05:51, Jason Wang wrote:
> > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 7/7/23 03:43, Jason Wang wrote:
> >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>
> >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>
> >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> >>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just rcu lock and
> >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching
> >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  And it shouldn't be
> >>>>>>>>>>>>>>>>>> too hard to implement.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating
> >>>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> >>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> >>>>>>>>>>>>>>>>>> scale well.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between
> >>>>>>>>>>>>>>>>> io_uring and AF_XDP:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1) both have similar memory model (user register)
> >>>>>>>>>>>>>>>>> 2) both use ring for communication
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can
> >>>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> >>>>>>>>>>>>>>>> virtual interfaces.  io_uring thread in the kernel will be able to
> >>>>>>>>>>>>>>>> perform transmission for us.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop
> >>>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation
> >>>>>>>>>>>>>>> cost.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> >>>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The
> >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so
> >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the
> >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to
> >>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
> >>>>>>>>>>>>> seems expensive.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Vhost seems to be a shortcut for this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring)
> >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to
> >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
> >>>>>>>>>>>> umem.
> >>>>>>>>>>>
> >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> >>>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this
> >>>>>>>>>>> part seems to be very expensive according to my test in the past.
> >>>>>>>>>>
> >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU
> >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
> >>>>>>>>>> emulation.
> >>>>>>>>>
> >>>>>>>>> Just to make sure we're on the same page.
> >>>>>>>>>
> >>>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> >>>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
> >>>>>>>>> using the Qemu memory core translations which need to take care about
> >>>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads
> >>>>>>>>> which only cares about ram so the translation could be very fast.
> >>>>>>>>
> >>>>>>>> What does using "vhost in io threads" mean?
> >>>>>>>
> >>>>>>> It means a vhost userspace dataplane that is implemented via io threads.
> >>>>>>
> >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
> >>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
> >>>>>> use AioContext APIs to run in IOThreads.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>>
> >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
> >>>>>> it's fastest if you explain your idea and its advantages instead of me
> >>>>>> guessing.
> >>>>>
> >>>>> It's something like I'd proposed in [1]:
> >>>>>
> >>>>> 1) a vhost that is implemented via IOThreads
> >>>>> 2) memory translation is done via vhost memory table/IOTLB
> >>>>>
> >>>>> The advantages are:
> >>>>>
> >>>>> 1) No 3rd application like DPDK application
> >>>>> 2) Attack surface were reduced
> >>>>> 3) Better understanding/interactions with device model for things like
> >>>>> RSS and IOMMU
> >>>>>
> >>>>> There could be some dis-advantages but it's not obvious to me :)
> >>>>
> >>>> Why is QEMU's native device emulation API not the natural choice for
> >>>> writing built-in devices? I don't understand why the vhost interface
> >>>> is desirable for built-in devices.
> >>>
> >>> Unless the memory helpers (like address translations) were optimized
> >>> fully to satisfy this 10M+ PPS.
> >>>
> >>> Not sure if this is too hard, but last time I benchmark, perf told me
> >>> most of the time spent in the translation.
> >>>
> >>> Using a vhost is a workaround since its memory model is much more
> >>> simpler so it can skip lots of memory sections like I/O and ROM etc.
> >>
> >> So, we can have a thread running as part of QEMU process that implements
> >> vhost functionality for a virtio-net device.  And this thread has an
> >> optimized way to access memory.  What prevents current virtio-net emulation
> >> code accessing memory in the same optimized way?
> >
> > Current emulation using memory core accessors which needs to take care
> > of a lot of stuff like MMIO or even P2P. Such kind of stuff is not
> > considered since day0 of vhost. You can do some experiment on this e.g
> > just dropping packets after fetching it from the TX ring.
>
> If I'm reading that right, virtio implementation is using address space
> caching by utilizing a memory listener and pre-translated addresses of
> interesting memory regions.  Then it's performing address_space_read_cached,
> which is bypassing all the memory address translation logic on a cache hit.
> That sounds pretty similar to how memory table is prepared for vhost.

Exactly, but only for the vring memory structures (avail, used, and
descriptor rings in the Split Virtqueue Layout).

The packet headers and payloads are still translated using the
uncached virtqueue_pop() -> dma_memory_map() -> address_space_map()
API.

Running a tx packet drop benchmark as Jason suggested and checking if
memory translation is a bottleneck seems worthwhile. Improving
dma_memory_map() performance would speed up all built-in QEMU devices.
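
Something like this rough sketch of the Tx drop experiment could work
(illustrative only, not an actual patch; it assumes the usual virtio-net
Tx handler context and leaves out the measurement plumbing):

/* Sketch (QEMU-internal code): drain the TX virtqueue and drop every
 * packet instead of passing it to the net backend, isolating the cost
 * of the guest memory mapping path. */
#include "qemu/osdep.h"
#include "qemu/iov.h"
#include "hw/virtio/virtio.h"

static size_t drop_all_tx(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtQueueElement *elem;
    size_t bytes = 0;

    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
        /* virtqueue_pop() already mapped the guest buffers (GPA -> HVA),
         * which is the cost we want to measure. */
        bytes += iov_size(elem->out_sg, elem->out_num);

        /* Complete the element without transmitting anything: this
         * unmaps the buffers and returns the descriptors to the guest. */
        virtqueue_push(vq, elem, 0);
        g_free(elem);
    }
    virtio_notify(vdev, vq);
    return bytes;
}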

Jason: When you noticed this bottleneck, were you using a normal
virtio-net-pci device without vIOMMU?

Stefan
Jason Wang July 11, 2023, 3 a.m. UTC | #29
On Mon, Jul 10, 2023 at 6:55 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/10/23 05:51, Jason Wang wrote:
> > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 7/7/23 03:43, Jason Wang wrote:
> >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>
> >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>
> >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> >>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just rcu lock and
> >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching
> >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  And it shouldn't be
> >>>>>>>>>>>>>>>>>> too hard to implement.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating
> >>>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> >>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> >>>>>>>>>>>>>>>>>> scale well.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between
> >>>>>>>>>>>>>>>>> io_uring and AF_XDP:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1) both have similar memory model (user register)
> >>>>>>>>>>>>>>>>> 2) both use ring for communication
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can
> >>>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> >>>>>>>>>>>>>>>> virtual interfaces.  io_uring thread in the kernel will be able to
> >>>>>>>>>>>>>>>> perform transmission for us.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop
> >>>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation
> >>>>>>>>>>>>>>> cost.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> >>>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The
> >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so
> >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the
> >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to
> >>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
> >>>>>>>>>>>>> seems expensive.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Vhost seems to be a shortcut for this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring)
> >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to
> >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
> >>>>>>>>>>>> umem.
> >>>>>>>>>>>
> >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> >>>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this
> >>>>>>>>>>> part seems to be very expensive according to my test in the past.
> >>>>>>>>>>
> >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU
> >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
> >>>>>>>>>> emulation.
> >>>>>>>>>
> >>>>>>>>> Just to make sure we're on the same page.
> >>>>>>>>>
> >>>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> >>>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
> >>>>>>>>> using the Qemu memory core translations which need to take care about
> >>>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads
> >>>>>>>>> which only cares about ram so the translation could be very fast.
> >>>>>>>>
> >>>>>>>> What does using "vhost in io threads" mean?
> >>>>>>>
> >>>>>>> It means a vhost userspace dataplane that is implemented via io threads.
> >>>>>>
> >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
> >>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
> >>>>>> use AioContext APIs to run in IOThreads.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>>
> >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
> >>>>>> it's fastest if you explain your idea and its advantages instead of me
> >>>>>> guessing.
> >>>>>
> >>>>> It's something like I'd proposed in [1]:
> >>>>>
> >>>>> 1) a vhost that is implemented via IOThreads
> >>>>> 2) memory translation is done via vhost memory table/IOTLB
> >>>>>
> >>>>> The advantages are:
> >>>>>
> >>>>> 1) No 3rd application like DPDK application
> >>>>> 2) Attack surface were reduced
> >>>>> 3) Better understanding/interactions with device model for things like
> >>>>> RSS and IOMMU
> >>>>>
> >>>>> There could be some dis-advantages but it's not obvious to me :)
> >>>>
> >>>> Why is QEMU's native device emulation API not the natural choice for
> >>>> writing built-in devices? I don't understand why the vhost interface
> >>>> is desirable for built-in devices.
> >>>
> >>> Unless the memory helpers (like address translations) were optimized
> >>> fully to satisfy this 10M+ PPS.
> >>>
> >>> Not sure if this is too hard, but last time I benchmark, perf told me
> >>> most of the time spent in the translation.
> >>>
> >>> Using a vhost is a workaround since its memory model is much more
> >>> simpler so it can skip lots of memory sections like I/O and ROM etc.
> >>
> >> So, we can have a thread running as part of QEMU process that implements
> >> vhost functionality for a virtio-net device.  And this thread has an
> >> optimized way to access memory.  What prevents current virtio-net emulation
> >> code accessing memory in the same optimized way?
> >
> > Current emulation using memory core accessors which needs to take care
> > of a lot of stuff like MMIO or even P2P. Such kind of stuff is not
> > considered since day0 of vhost. You can do some experiment on this e.g
> > just dropping packets after fetching it from the TX ring.
>
> If I'm reading that right, virtio implementation is using address space
> caching by utilizing a memory listener and pre-translated addresses of
> interesting memory regions.  Then it's performing address_space_read_cached,
> which is bypassing all the memory address translation logic on a cache hit.
> That sounds pretty similar to how memory table is prepared for vhost.

It's only done for the virtqueue metadata (the descriptor, driver and device
areas); we still need to do a DMA map for the packet buffers themselves.
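
For illustration only (this is neither QEMU nor vhost code), the "simpler
memory model" being discussed boils down to something like the standalone
sketch below: a flat table of RAM-only regions, so a GPA-to-HVA translation
is a short search with no MMIO, ROM or P2P handling.  All region addresses
in the sketch are made up.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

struct mem_region {
    uint64_t gpa_start;   /* guest physical start address */
    uint64_t size;        /* region size in bytes */
    uint64_t hva_start;   /* host virtual start address, as an integer */
};

/* Translate a GPA to an HVA, or return 0 if it is not backed by RAM. */
static uint64_t table_gpa_to_hva(const struct mem_region *tbl, size_t n,
                                 uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= tbl[i].gpa_start && gpa - tbl[i].gpa_start < tbl[i].size) {
            return tbl[i].hva_start + (gpa - tbl[i].gpa_start);
        }
    }
    return 0;   /* not RAM; a real implementation needs a slow path here */
}

int main(void)
{
    /* Two made-up RAM regions, e.g. below and above a PCI hole. */
    struct mem_region table[] = {
        { 0x0,            0x80000000, 0x7f0000000000ull },
        { 0x100000000ull, 0x80000000, 0x7f0080000000ull },
    };

    printf("GPA 0x1234 -> HVA 0x%llx\n",
           (unsigned long long)table_gpa_to_hva(table, 2, 0x1234));
    return 0;
}

A real vhost memory table (struct vhost_memory_region) carries essentially
the same three fields per region, which is why its lookup can stay this
simple compared to a full memory-core dispatch.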

Thanks

>
> >
> >> i.e. we likely don't
> >> actually need to implement the whole vhost-virtio communication protocol
> >> in order to have faster memory access from the device emulation code.
> >> I mean, if vhost can access device memory faster, why device itself can't?
> >
> > I'm not saying it can't but it would end up with something similar to
> > vhost. And that's why I'm saying using vhost is a shortcut (at least
> > for a POC).
> >
> > Thanks
> >
> >>
> >> With that we could probably split the "datapath" part of the virtio-net
> >> emulation into a separate thread driven by iothread loop.
> >>
> >> Then add batch API for communication with a network backend (af-xdp) to
> >> avoid per-packet calls.
> >>
> >> These are 3 more or less independent tasks that should allow the similar
> >> performance to a full fledged vhost control and dataplane implementation
> >> inside QEMU.
> >>
> >> Or am I missing something? (Probably)
> >>
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>>>
> >>>>> It's something like linking SPDK/DPDK to Qemu.
> >>>>
> >>>> Sergio Lopez tried loading vhost-user devices as shared libraries that
> >>>> run in the QEMU process. It worked as an experiment but wasn't pursued
> >>>> further.
> >>>>
> >>>> I think that might make sense in specific cases where there is an
> >>>> existing vhost-user codebase that needs to run as part of QEMU.
> >>>>
> >>>> In this case the AF_XDP code is new, so it's not a case of moving
> >>>> existing code into QEMU.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>>>> Regarding pinning - I wonder if that's something that can be refined
> >>>>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
> >>>>>>>>>> of umem. That way only rx and tx buffers that are currently in use
> >>>>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
> >>>>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't
> >>>>>>>>>> checked the kernel code.
> >>>>>>>>>
> >>>>>>>>> It requires the device to do page faults which is not commonly
> >>>>>>>>> supported nowadays.
> >>>>>>>>
> >>>>>>>> I don't understand this comment. AF_XDP processes each rx/tx
> >>>>>>>> descriptor. At that point it can getuserpages() or similar in order to
> >>>>>>>> pin the page. When the memory is no longer needed, it can put those
> >>>>>>>> pages. No fault mechanism is needed. What am I missing?
> >>>>>>>
> >>>>>>> Ok, I think I kind of get you, you mean doing pinning while processing
> >>>>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> >>>>>>> hit PPS for sure.
> >>>>>>
> >>>>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> >>>>>> supports unpinned guest RAM.
> >>>>>
> >>>>> Right, it's a balance between pin and PPS. PPS seems to be more
> >>>>> important in this case.
> >>>>>
> >>>>>>
> >>>>>> There are variations on this approach, like keeping a certain amount
> >>>>>> of pages pinned after they have been used so the cost of
> >>>>>> pinning/unpinning can be avoided when the same pages are reused in the
> >>>>>> future, but I don't know how effective that is in practice.
> >>>>>>
> >>>>>> Is there a more efficient approach without relying on hardware page
> >>>>>> fault support?
> >>>>>
> >>>>> I guess so, I see some slides that say device page fault is very slow.
> >>>>>
> >>>>>>
> >>>>>> My understanding is that hardware page fault support is not yet
> >>>>>> deployed. We'd be left with pinning guest RAM permanently or using a
> >>>>>> runtime pinning/unpinning approach like I've described.
> >>>>>
> >>>>> Probably.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>>
> >>>>>> Stefan
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
Jason Wang July 11, 2023, 3:02 a.m. UTC | #30
On Mon, Jul 10, 2023 at 11:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 7/10/23 05:51, Jason Wang wrote:
> > > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>
> > >> On 7/7/23 03:43, Jason Wang wrote:
> > >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >>>>
> > >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>
> > >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>>>
> > >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> > >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> > >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > >>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just rcu lock and
> > >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching
> > >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  And it shouldn't be
> > >>>>>>>>>>>>>>>>>> too hard to implement.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating
> > >>>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > >>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > >>>>>>>>>>>>>>>>>> scale well.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between
> > >>>>>>>>>>>>>>>>> io_uring and AF_XDP:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1) both have similar memory model (user register)
> > >>>>>>>>>>>>>>>>> 2) both use ring for communication
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > >>>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > >>>>>>>>>>>>>>>> virtual interfaces.  io_uring thread in the kernel will be able to
> > >>>>>>>>>>>>>>>> perform transmission for us.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop
> > >>>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation
> > >>>>>>>>>>>>>>> cost.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> > >>>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The
> > >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so
> > >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the
> > >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to
> > >>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
> > >>>>>>>>>>>>> seems expensive.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Vhost seems to be a shortcut for this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring)
> > >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to
> > >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
> > >>>>>>>>>>>> umem.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> > >>>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this
> > >>>>>>>>>>> part seems to be very expensive according to my test in the past.
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> > >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
> > >>>>>>>>>> emulation.
> > >>>>>>>>>
> > >>>>>>>>> Just to make sure we're on the same page.
> > >>>>>>>>>
> > >>>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > >>>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to
> > >>>>>>>>> using the Qemu memory core translations which need to take care about
> > >>>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads
> > >>>>>>>>> which only cares about ram so the translation could be very fast.
> > >>>>>>>>
> > >>>>>>>> What does using "vhost in io threads" mean?
> > >>>>>>>
> > >>>>>>> It means a vhost userspace dataplane that is implemented via io threads.
> > >>>>>>
> > >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
> > >>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
> > >>>>>> use AioContext APIs to run in IOThreads.
> > >>>>>
> > >>>>> Yes.
> > >>>>>
> > >>>>>>
> > >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
> > >>>>>> it's fastest if you explain your idea and its advantages instead of me
> > >>>>>> guessing.
> > >>>>>
> > >>>>> It's something like I'd proposed in [1]:
> > >>>>>
> > >>>>> 1) a vhost that is implemented via IOThreads
> > >>>>> 2) memory translation is done via vhost memory table/IOTLB
> > >>>>>
> > >>>>> The advantages are:
> > >>>>>
> > >>>>> 1) No 3rd application like DPDK application
> > >>>>> 2) Attack surface were reduced
> > >>>>> 3) Better understanding/interactions with device model for things like
> > >>>>> RSS and IOMMU
> > >>>>>
> > >>>>> There could be some dis-advantages but it's not obvious to me :)
> > >>>>
> > >>>> Why is QEMU's native device emulation API not the natural choice for
> > >>>> writing built-in devices? I don't understand why the vhost interface
> > >>>> is desirable for built-in devices.
> > >>>
> > >>> Unless the memory helpers (like address translations) were optimized
> > >>> fully to satisfy this 10M+ PPS.
> > >>>
> > >>> Not sure if this is too hard, but last time I benchmark, perf told me
> > >>> most of the time spent in the translation.
> > >>>
> > >>> Using a vhost is a workaround since its memory model is much more
> > >>> simpler so it can skip lots of memory sections like I/O and ROM etc.
> > >>
> > >> So, we can have a thread running as part of QEMU process that implements
> > >> vhost functionality for a virtio-net device.  And this thread has an
> > >> optimized way to access memory.  What prevents current virtio-net emulation
> > >> code accessing memory in the same optimized way?
> > >
> > > Current emulation using memory core accessors which needs to take care
> > > of a lot of stuff like MMIO or even P2P. Such kind of stuff is not
> > > considered since day0 of vhost. You can do some experiment on this e.g
> > > just dropping packets after fetching it from the TX ring.
> >
> > If I'm reading that right, virtio implementation is using address space
> > caching by utilizing a memory listener and pre-translated addresses of
> > interesting memory regions.  Then it's performing address_space_read_cached,
> > which is bypassing all the memory address translation logic on a cache hit.
> > That sounds pretty similar to how memory table is prepared for vhost.
>
> Exactly, but only for the vring memory structures (avail, used, and
> descriptor rings in the Split Virtqueue Layout).

Yes. It should speed things up to some extent.
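
As a rough standalone illustration (not the actual MemoryRegionCache code)
of what that metadata caching buys: the translation is resolved once for a
fixed window, and later accesses inside the window are only a bounds check
plus a copy.  Names and addresses below are made up.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* A pre-translated cache over one guest memory window (e.g. a vring). */
struct region_cache {
    uint64_t gpa_start;   /* guest physical base of the cached window */
    uint64_t len;         /* window length in bytes */
    uint8_t *hva;         /* host pointer obtained once at setup time */
};

/* Read inside the cached window; anything else needs the slow path. */
static int cached_read(const struct region_cache *c, uint64_t gpa,
                       void *buf, size_t len)
{
    if (gpa < c->gpa_start || gpa - c->gpa_start + len > c->len) {
        return -1;   /* outside the window: would need a full translation */
    }
    memcpy(buf, c->hva + (gpa - c->gpa_start), len);
    return 0;
}

int main(void)
{
    uint8_t backing[4096] = { [16] = 0xab };   /* stands in for guest RAM */
    struct region_cache vring = { 0x4000, sizeof backing, backing };
    uint8_t b = 0;

    if (cached_read(&vring, 0x4010, &b, 1) == 0) {
        printf("read 0x%02x via the cached window\n", b);
    }
    return 0;
}

Per-packet buffers do not fit such a fixed window, which is why the packet
headers and payloads still go through the regular translation path.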

>
> The packet headers and payloads are still translated using the
> uncached virtqueue_pop() -> dma_memory_map() -> address_space_map()
> API.
>
> Running a tx packet drop benchmark as Jason suggested and checking if
> memory translation is a bottleneck seems worthwhile. Improving
> dma_memory_map() performance would speed up all built-in QEMU devices.

+1

>
> Jason: When you noticed this bottleneck, were you using a normal
> virtio-net-pci device without vIOMMU?

Normal virtio-net-pci device without vIOMMU.

Thanks

>
> Stefan
>
Jason Wang July 11, 2023, 3:04 a.m. UTC | #31
On Mon, Jul 10, 2023 at 11:14 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, 6 Jul 2023 at 21:43, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > > > > > > >> So, that might be one case.  Taking into account that just rcu lock and
> > > > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > > > > > > >> on QEMU side should improve performance significantly.  And it shouldn't be
> > > > > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  Currently
> > > > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > > > > > > >> scale well.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > > > > > > > > > > perform transmission for us.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > > > > > > cost.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > > > > > > >
> > > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > > > > > > seems expensive.
> > > > > > > > > > > >
> > > > > > > > > > > > Vhost seems to be a shortcut for this.
> > > > > > > > > > >
> > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > > > > > > >
> > > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > > > > > > umem.
> > > > > > > > > >
> > > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > > > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > > > > > > part seems to be very expensive according to my test in the past.
> > > > > > > > >
> > > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > > > > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > > > > > > emulation.
> > > > > > > >
> > > > > > > > Just to make sure we're on the same page.
> > > > > > > >
> > > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > > > > > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > > > > > > using the Qemu memory core translations which need to take care about
> > > > > > > > too much extra stuff. That's why I suggest using vhost in io threads
> > > > > > > > which only cares about ram so the translation could be very fast.
> > > > > > >
> > > > > > > What does using "vhost in io threads" mean?
> > > > > >
> > > > > > It means a vhost userspace dataplane that is implemented via io threads.
> > > > >
> > > > > AFAIK this does not exist today. QEMU's built-in devices that use
> > > > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > > > > vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > > > > built-in devices implement VirtioDeviceClass callbacks directly and
> > > > > use AioContext APIs to run in IOThreads.
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > Do you have an idea for using vhost code for built-in devices? Maybe
> > > > > it's fastest if you explain your idea and its advantages instead of me
> > > > > guessing.
> > > >
> > > > It's something like I'd proposed in [1]:
> > > >
> > > > 1) a vhost that is implemented via IOThreads
> > > > 2) memory translation is done via vhost memory table/IOTLB
> > > >
> > > > The advantages are:
> > > >
> > > > 1) No 3rd application like DPDK application
> > > > 2) Attack surface were reduced
> > > > 3) Better understanding/interactions with device model for things like
> > > > RSS and IOMMU
> > > >
> > > > There could be some dis-advantages but it's not obvious to me :)
> > >
> > > Why is QEMU's native device emulation API not the natural choice for
> > > writing built-in devices? I don't understand why the vhost interface
> > > is desirable for built-in devices.
> >
> > Unless the memory helpers (like address translations) were optimized
> > fully to satisfy this 10M+ PPS.
> >
> > Not sure if this is too hard, but last time I benchmark, perf told me
> > most of the time spent in the translation.
> >
> > Using a vhost is a workaround since its memory model is much more
> > simpler so it can skip lots of memory sections like I/O and ROM etc.
>
> I see, that sounds like a question of optimization. Most DMA transfers
> will be to/from guest RAM and it seems like QEMU's memory API could be
> optimized for that case. PIO/MMIO dispatch could use a different API
> from DMA transfers, if necessary.

Probably.

>
> I don't think there is a fundamental reason why QEMU's own device
> emulation code cannot translate memory as fast as vhost devices can.

Yes, it can do what vhost can do. Starting from vhost may help us see how
far we can go in optimizing the memory core.

Thanks

>
> Stefan
>
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 7f323cd2eb..ca85422676 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2925,6 +2925,10 @@  W: http://info.iet.unipi.it/~luigi/netmap/
 S: Maintained
 F: net/netmap.c
 
+AF_XDP network backend
+R: Ilya Maximets <i.maximets@ovn.org>
+F: net/af-xdp.c
+
 Host Memory Backends
 M: David Hildenbrand <david@redhat.com>
 M: Igor Mammedov <imammedo@redhat.com>
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 2cbd0f77a0..af9ffe4681 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1295,7 +1295,7 @@  ERST
     {
         .name       = "netdev_add",
         .args_type  = "netdev:O",
-        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
+        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
 #ifdef CONFIG_VMNET
                       "|vmnet-host|vmnet-shared|vmnet-bridged"
 #endif
diff --git a/meson.build b/meson.build
index 6ef78ea278..d0abb658c5 100644
--- a/meson.build
+++ b/meson.build
@@ -1883,6 +1883,18 @@  if libbpf.found() and not cc.links('''
   endif
 endif
 
+# libxdp
+libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
+if libxdp.found() and \
+      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
+  libxdp = not_found
+  if get_option('af_xdp').enabled()
+    error('af-xdp support requires libbpf version >= 0.7')
+  else
+    warning('af-xdp support requires libbpf version >= 0.7, disabling')
+  endif
+endif
+
 # libdw
 libdw = not_found
 if not get_option('libdw').auto() or \
@@ -2106,6 +2118,7 @@  config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
 config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
 config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
 config_host_data.set('CONFIG_EBPF', libbpf.found())
+config_host_data.set('CONFIG_AF_XDP', libxdp.found())
 config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
 config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
 config_host_data.set('CONFIG_LIBNFS', libnfs.found())
@@ -4279,6 +4292,7 @@  summary_info += {'PVRDMA support':    have_pvrdma}
 summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support':       libbpf}
+summary_info += {'AF_XDP support':    libxdp}
 summary_info += {'rbd support':       rbd}
 summary_info += {'smartcard support': cacard}
 summary_info += {'U2F support':       u2f}
diff --git a/meson_options.txt b/meson_options.txt
index 90237389e2..31596d59f1 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -120,6 +120,8 @@  option('avx512bw', type: 'feature', value: 'auto',
 option('keyring', type: 'feature', value: 'auto',
        description: 'Linux keyring support')
 
+option('af_xdp', type : 'feature', value : 'auto',
+       description: 'AF_XDP network backend support')
 option('attr', type : 'feature', value : 'auto',
        description: 'attr/xattr support')
 option('auth_pam', type : 'feature', value : 'auto',
diff --git a/net/af-xdp.c b/net/af-xdp.c
new file mode 100644
index 0000000000..f78e7c9f96
--- /dev/null
+++ b/net/af-xdp.c
@@ -0,0 +1,501 @@ 
+/*
+ * AF_XDP network backend.
+ *
+ * Copyright (c) 2023 Red Hat, Inc.
+ *
+ * Authors:
+ *  Ilya Maximets <i.maximets@ovn.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+
+#include "qemu/osdep.h"
+#include <bpf/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <xdp/xsk.h>
+
+#include "clients.h"
+#include "monitor/monitor.h"
+#include "net/net.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/error-report.h"
+#include "qemu/iov.h"
+#include "qemu/main-loop.h"
+#include "qemu/memalign.h"
+
+
+typedef struct AFXDPState {
+    NetClientState       nc;
+
+    struct xsk_socket    *xsk;
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_ring_cons cq;
+    struct xsk_ring_prod fq;
+
+    char                 ifname[IFNAMSIZ];
+    int                  ifindex;
+    bool                 read_poll;
+    bool                 write_poll;
+    uint32_t             outstanding_tx;
+
+    uint64_t             *pool;
+    uint32_t             n_pool;
+    char                 *buffer;
+    struct xsk_umem      *umem;
+
+    uint32_t             n_queues;
+    uint32_t             xdp_flags;
+    bool                 inhibit;
+} AFXDPState;
+
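+/* Maximum number of descriptors processed per Rx poll iteration. */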
+#define AF_XDP_BATCH_SIZE 64
+
+static void af_xdp_send(void *opaque);
+static void af_xdp_writable(void *opaque);
+
+/* Set the event-loop handlers for the af-xdp backend. */
+static void af_xdp_update_fd_handler(AFXDPState *s)
+{
+    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
+                        s->read_poll ? af_xdp_send : NULL,
+                        s->write_poll ? af_xdp_writable : NULL,
+                        s);
+}
+
+/* Update the read handler. */
+static void af_xdp_read_poll(AFXDPState *s, bool enable)
+{
+    if (s->read_poll != enable) {
+        s->read_poll = enable;
+        af_xdp_update_fd_handler(s);
+    }
+}
+
+/* Update the write handler. */
+static void af_xdp_write_poll(AFXDPState *s, bool enable)
+{
+    if (s->write_poll != enable) {
+        s->write_poll = enable;
+        af_xdp_update_fd_handler(s);
+    }
+}
+
+static void af_xdp_poll(NetClientState *nc, bool enable)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+
+    if (s->read_poll != enable || s->write_poll != enable) {
+        s->write_poll = enable;
+        s->read_poll  = enable;
+        af_xdp_update_fd_handler(s);
+    }
+}
+
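+/* Reclaim buffers of completed transmissions from the completion ring. */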
+static void af_xdp_complete_tx(AFXDPState *s)
+{
+    uint32_t idx = 0;
+    uint32_t done, i;
+    uint64_t *addr;
+
+    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
+
+    for (i = 0; i < done; i++) {
+        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
+        s->pool[s->n_pool++] = *addr;
+        s->outstanding_tx--;
+    }
+
+    if (done) {
+        xsk_ring_cons__release(&s->cq, done);
+    }
+}
+
+/*
+ * The fd_write() callback, invoked if the fd is marked as writable
+ * after a poll.
+ */
+static void af_xdp_writable(void *opaque)
+{
+    AFXDPState *s = opaque;
+
+    /* Try to recover buffers that are already sent. */
+    af_xdp_complete_tx(s);
+
+    /*
+     * Unregister the handler, unless we still have packets to transmit
+     * and kernel needs a wake up.
+     */
+    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
+        af_xdp_write_poll(s, false);
+    }
+
+    /* Flush any buffered packets. */
+    qemu_flush_queued_packets(&s->nc);
+}
+
+static ssize_t af_xdp_receive(NetClientState *nc,
+                              const uint8_t *buf, size_t size)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+    struct xdp_desc *desc;
+    uint32_t idx;
+    void *data;
+
+    /* Try to recover buffers that are already sent. */
+    af_xdp_complete_tx(s);
+
+    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
+        /* Can't transmit a packet this large; report success to drop it. */
+        return size;
+    }
+
+    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
+        /*
+         * Out of buffers or space in tx ring.  Poll until we can write.
+         * This will also kick the Tx, if it was waiting on CQ.
+         */
+        af_xdp_write_poll(s, true);
+        return 0;
+    }
+
+    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
+    desc->addr = s->pool[--s->n_pool];
+    desc->len = size;
+
+    data = xsk_umem__get_data(s->buffer, desc->addr);
+    memcpy(data, buf, size);
+
+    xsk_ring_prod__submit(&s->tx, 1);
+    s->outstanding_tx++;
+
+    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
+        af_xdp_write_poll(s, true);
+    }
+
+    return size;
+}
+
+/*
+ * Complete a previous send (backend --> guest) and enable the
+ * fd_read callback.
+ */
+static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+
+    af_xdp_read_poll(s, true);
+}
+
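+/* Post up to 'n' free buffers from the pool to the fill ring for Rx. */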
+static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
+{
+    uint32_t i, idx = 0;
+
+    /* Leave one packet for Tx, just in case. */
+    if (s->n_pool < n + 1) {
+        n = s->n_pool;
+    }
+
+    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
+        return;
+    }
+
+    for (i = 0; i < n; i++) {
+        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
+    }
+    xsk_ring_prod__submit(&s->fq, n);
+
+    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
+        /* Receive was blocked by not having enough buffers.  Wake it up. */
+        af_xdp_read_poll(s, true);
+    }
+}
+
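+/* The fd_read() callback: pass packets from the Rx ring to the peer. */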
+static void af_xdp_send(void *opaque)
+{
+    uint32_t i, n_rx, idx = 0;
+    AFXDPState *s = opaque;
+
+    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
+    if (!n_rx) {
+        return;
+    }
+
+    for (i = 0; i < n_rx; i++) {
+        const struct xdp_desc *desc;
+        struct iovec iov;
+
+        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
+
+        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
+        iov.iov_len = desc->len;
+
+        s->pool[s->n_pool++] = desc->addr;
+
+        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
+                                     af_xdp_send_completed)) {
+            /*
+             * The peer does not receive anymore.  Packet is queued, stop
+             * reading from the backend until af_xdp_send_completed().
+             */
+            af_xdp_read_poll(s, false);
+
+            /* Re-peek the descriptors to not break the ring cache. */
+            xsk_ring_cons__cancel(&s->rx, n_rx);
+            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
+            g_assert(n_rx == i + 1);
+            break;
+        }
+    }
+
+    /* Release actually sent descriptors and try to re-fill. */
+    xsk_ring_cons__release(&s->rx, n_rx);
+    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
+}
+
+/* Flush and close. */
+static void af_xdp_cleanup(NetClientState *nc)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+
+    qemu_purge_queued_packets(nc);
+
+    af_xdp_poll(nc, false);
+
+    xsk_socket__delete(s->xsk);
+    s->xsk = NULL;
+    g_free(s->pool);
+    s->pool = NULL;
+    xsk_umem__delete(s->umem);
+    s->umem = NULL;
+    qemu_vfree(s->buffer);
+    s->buffer = NULL;
+
+    /* Remove the program if it's the last open queue. */
+    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
+        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
+        fprintf(stderr,
+                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
+                s->ifname, s->ifindex);
+    }
+}
+
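+/* Allocate and register the umem, pre-fill the pool and the fill ring. */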
+static int af_xdp_umem_create(AFXDPState *s, Error **errp)
+{
+    struct xsk_umem_config config = {
+        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+        .frame_headroom = 0,
+    };
+    uint64_t n_descs;
+    uint64_t size;
+    int64_t i;
+
+    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
+    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
+               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
+    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
+
+    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
+    memset(s->buffer, 0, size);
+
+    if (xsk_umem__create(&s->umem, s->buffer, size, &s->fq, &s->cq, &config)) {
+        qemu_vfree(s->buffer);
+        error_setg_errno(errp, errno,
+                         "failed to create umem for %s queue_index: %d",
+                         s->ifname, s->nc.queue_index);
+        return -1;
+    }
+
+    s->pool = g_new(uint64_t, n_descs);
+    /* Fill the pool in the opposite order, because it's a LIFO queue. */
+    for (i = n_descs - 1; i >= 0; i--) {
+        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
+    }
+    s->n_pool = n_descs;
+
+    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
+
+    return 0;
+}
+
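+/* Create an AF_XDP socket for one queue; libxdp will also load the default
+ * XDP program, unless inhibited. */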
+static int af_xdp_socket_create(AFXDPState *s,
+                                const NetdevAFXDPOptions *opts,
+                                int xsks_map_fd, Error **errp)
+{
+    struct xsk_socket_config cfg = {
+        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+        .libxdp_flags = 0,
+        .bind_flags = XDP_USE_NEED_WAKEUP,
+        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
+    };
+    int queue_id, error = 0;
+
+    s->inhibit = opts->has_inhibit && opts->inhibit;
+    if (s->inhibit) {
+        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
+    }
+
+    if (opts->has_force_copy && opts->force_copy) {
+        cfg.bind_flags |= XDP_COPY;
+    }
+
+    queue_id = s->nc.queue_index;
+    if (opts->has_start_queue && opts->start_queue > 0) {
+        queue_id += opts->start_queue;
+    }
+
+    if (opts->has_mode) {
+        /* Specific mode requested. */
+        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
+                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
+        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
+                               s->umem, &s->rx, &s->tx, &cfg)) {
+            error = errno;
+        }
+    } else {
+        /* No mode requested, try native first. */
+        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
+                               s->umem, &s->rx, &s->tx, &cfg)) {
+            /* Can't use native mode, try skb. */
+            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
+            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
+
+            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
+                                   s->umem, &s->rx, &s->tx, &cfg)) {
+                error = errno;
+            }
+        }
+    }
+
+    if (error) {
+        error_setg_errno(errp, error,
+                         "failed to create AF_XDP socket for %s queue_id: %d",
+                         s->ifname, queue_id);
+        return -1;
+    }
+
+    if (s->inhibit) {
+        int xsk_fd = xsk_socket__fd(s->xsk);
+
+        /* Need to update the map manually, libxdp skipped that step. */
+        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
+        if (error) {
+            error_setg_errno(errp, error,
+                             "failed to update xsks map for %s queue_id: %d",
+                             s->ifname, queue_id);
+            return -1;
+        }
+    }
+
+    s->xdp_flags = cfg.xdp_flags;
+
+    return 0;
+}
+
+/* NetClientInfo methods. */
+static NetClientInfo net_af_xdp_info = {
+    .type = NET_CLIENT_DRIVER_AF_XDP,
+    .size = sizeof(AFXDPState),
+    .receive = af_xdp_receive,
+    .poll = af_xdp_poll,
+    .cleanup = af_xdp_cleanup,
+};
+
+/*
+ * The exported init function.
+ *
+ * ... -net af-xdp,ifname="..."
+ */
+int net_init_af_xdp(const Netdev *netdev,
+                    const char *name, NetClientState *peer, Error **errp)
+{
+    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
+    NetClientState *nc, *nc0 = NULL;
+    unsigned int ifindex;
+    uint32_t prog_id = 0;
+    int xsks_map_fd = -1;
+    int64_t i, queues;
+    Error *err = NULL;
+    AFXDPState *s;
+
+    ifindex = if_nametoindex(opts->ifname);
+    if (!ifindex) {
+        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
+                         opts->ifname);
+        return -1;
+    }
+
+    queues = opts->has_queues ? opts->queues : 1;
+    if (queues < 1) {
+        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
+                   queues, opts->ifname);
+        return -1;
+    }
+
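+    /* 'inhibit=on' and 'xsks-map-fd' must be specified together. */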
+    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
+        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
+        return -1;
+    }
+
+    if (opts->xsks_map_fd) {
+        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
+        if (xsks_map_fd < 0) {
+            return -1;
+        }
+    }
+
+    for (i = 0; i < queues; i++) {
+        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
+        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
+        nc->queue_index = i;
+
+        if (!nc0) {
+            nc0 = nc;
+        }
+
+        s = DO_UPCAST(AFXDPState, nc, nc);
+
+        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
+        s->ifindex = ifindex;
+        s->n_queues = queues;
+
+        if (af_xdp_umem_create(s, errp)
+            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
+            /* Make sure the XDP program will be removed. */
+            s->n_queues = i;
+            error_propagate(errp, err);
+            goto err;
+        }
+    }
+
+    if (nc0) {
+        s = DO_UPCAST(AFXDPState, nc, nc0);
+        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
+            error_setg_errno(errp, errno,
+                             "no XDP program loaded on '%s', ifindex: %d",
+                             s->ifname, s->ifindex);
+            goto err;
+        }
+    }
+
+    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
+
+    return 0;
+
+err:
+    if (nc0) {
+        qemu_del_net_client(nc0);
+    }
+
+    return -1;
+}
diff --git a/net/clients.h b/net/clients.h
index ed8bdfff1e..be53794582 100644
--- a/net/clients.h
+++ b/net/clients.h
@@ -64,6 +64,11 @@  int net_init_netmap(const Netdev *netdev, const char *name,
                     NetClientState *peer, Error **errp);
 #endif
 
+#ifdef CONFIG_AF_XDP
+int net_init_af_xdp(const Netdev *netdev, const char *name,
+                    NetClientState *peer, Error **errp);
+#endif
+
 int net_init_vhost_user(const Netdev *netdev, const char *name,
                         NetClientState *peer, Error **errp);
 
diff --git a/net/meson.build b/net/meson.build
index bdf564a57b..61628d4684 100644
--- a/net/meson.build
+++ b/net/meson.build
@@ -36,6 +36,9 @@  system_ss.add(when: vde, if_true: files('vde.c'))
 if have_netmap
   system_ss.add(files('netmap.c'))
 endif
+
+system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
+
 if have_vhost_net_user
   system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
   system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
diff --git a/net/net.c b/net/net.c
index 6492ad530e..127f70932b 100644
--- a/net/net.c
+++ b/net/net.c
@@ -1082,6 +1082,9 @@  static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
 #ifdef CONFIG_NETMAP
         [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
 #endif
+#ifdef CONFIG_AF_XDP
+        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
+#endif
 #ifdef CONFIG_NET_BRIDGE
         [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
 #endif
@@ -1186,6 +1189,9 @@  void show_netdevs(void)
 #ifdef CONFIG_NETMAP
         "netmap",
 #endif
+#ifdef CONFIG_AF_XDP
+        "af-xdp",
+#endif
 #ifdef CONFIG_POSIX
         "vhost-user",
 #endif
diff --git a/qapi/net.json b/qapi/net.json
index db67501308..bb30a0d3c6 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -408,6 +408,56 @@ 
     'ifname':     'str',
     '*devname':    'str' } }
 
+##
+# @AFXDPMode:
+#
+# Attach mode for a default XDP program
+#
+# @skb: generic mode, no driver support necessary
+#
+# @native: DRV mode, program is attached to a driver, packets are passed to
+#     the socket without allocation of skb.
+#
+# Since: 8.1
+##
+{ 'enum': 'AFXDPMode',
+  'data': [ 'native', 'skb' ] }
+
+##
+# @NetdevAFXDPOptions:
+#
+# AF_XDP network backend
+#
+# @ifname: The name of an existing network interface.
+#
+# @mode: Attach mode for a default XDP program.  If not specified, then
+#     'native' will be tried first, then 'skb'.
+#
+# @inhibit: Don't load a default XDP program, use one already loaded to
+#     the interface (default: false).  Requires @xsks-map-fd.
+#
+# @xsks-map-fd: A file descriptor for an already open XDP socket map in
+#     the already loaded XDP program.  Requires @inhibit.
+#
+# @force-copy: Force XDP copy mode even if device supports zero-copy.
+#     (default: false)
+#
+# @queues: number of queues to be used for multiqueue interfaces (default: 1).
+#
+# @start-queue: Use @queues starting from this queue number (default: 0).
+#
+# Since: 8.1
+##
+{ 'struct': 'NetdevAFXDPOptions',
+  'data': {
+    'ifname':       'str',
+    '*mode':        'AFXDPMode',
+    '*inhibit':     'bool',
+    '*xsks-map-fd': 'str',
+    '*force-copy':  'bool',
+    '*queues':      'int',
+    '*start-queue': 'int' } }
+
 ##
 # @NetdevVhostUserOptions:
 #
@@ -642,13 +692,14 @@ 
 # @vmnet-bridged: since 7.1
 # @stream: since 7.2
 # @dgram: since 7.2
+# @af-xdp: since 8.1
 #
 # Since: 2.7
 ##
 { 'enum': 'NetClientDriver',
   'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
             'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
-            'vhost-vdpa',
+            'vhost-vdpa', 'af-xdp',
             { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
             { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
             { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
@@ -680,6 +731,7 @@ 
     'bridge':   'NetdevBridgeOptions',
     'hubport':  'NetdevHubPortOptions',
     'netmap':   'NetdevNetmapOptions',
+    'af-xdp':   'NetdevAFXDPOptions',
     'vhost-user': 'NetdevVhostUserOptions',
     'vhost-vdpa': 'NetdevVhostVDPAOptions',
     'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
diff --git a/qemu-options.hx b/qemu-options.hx
index b57489d7ca..7d0844b2be 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2856,6 +2856,17 @@  DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
     "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
     "                netmap device, defaults to '/dev/netmap')\n"
 #endif
+#ifdef CONFIG_AF_XDP
+    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
+    "         [,inhibit=on|off][,xsks-map-fd=k][,queues=n][,start-queue=m]\n"
+    "                attach to the existing network interface 'name' with AF_XDP socket\n"
+    "                use 'mode=MODE' to specify an XDP program attach mode\n"
+    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
+    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
+    "                use 'xsks-map-fd=k' to provide a file descriptor for xsks map with inhibit=on\n"
+    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
+    "                use 'start-queue=m' to specify the first queue that should be used\n"
+#endif
 #ifdef CONFIG_POSIX
     "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
     "                configure a vhost-user network, backed by a chardev 'dev'\n"
@@ -2901,6 +2912,9 @@  DEF("nic", HAS_ARG, QEMU_OPTION_nic,
 #ifdef CONFIG_NETMAP
     "netmap|"
 #endif
+#ifdef CONFIG_AF_XDP
+    "af-xdp|"
+#endif
 #ifdef CONFIG_POSIX
     "vhost-user|"
 #endif
@@ -2929,6 +2943,9 @@  DEF("net", HAS_ARG, QEMU_OPTION_net,
 #ifdef CONFIG_NETMAP
     "netmap|"
 #endif
+#ifdef CONFIG_AF_XDP
+    "af-xdp|"
+#endif
 #ifdef CONFIG_VMNET
     "vmnet-host|vmnet-shared|vmnet-bridged|"
 #endif
@@ -2936,7 +2953,7 @@  DEF("net", HAS_ARG, QEMU_OPTION_net,
     "                old way to initialize a host network interface\n"
     "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
 SRST
-``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
+``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
     This option is a shortcut for configuring both the on-board
     (default) guest NIC hardware and the host network backend in one go.
     The host backend options are the same as with the corresponding
@@ -3350,6 +3367,48 @@  SRST
         # launch QEMU instance
         |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
 
+``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,inhibit=on|off][,xsks-map-fd=k][,queues=n][,start-queue=m]``
+    Configure an AF_XDP backend to connect to the network interface 'name'
+    using an AF_XDP socket.  A specific attach mode for the default XDP
+    program can be forced with 'mode'; by default the likely most performant
+    mode is selected on a best-effort basis.  Alternatively, loading of the
+    program can be inhibited, in which case the XDP program should be
+    pre-loaded externally and 'xsks-map-fd' provided with a file descriptor
+    for an open XDP socket map of that program.  The number of queues 'n'
+    should generally match the number of queues in the interface and
+    defaults to 1.  Traffic arriving on device queues that are not
+    configured will not be delivered to the network backend.
+
+    .. parsed-literal::
+
+        # set the number of queues to 4
+        ethtool -L eth0 combined 4
+        # launch QEMU instance
+        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
+            -netdev af-xdp,id=n1,ifname=eth0,queues=4
+
+    The 'start-queue' option can be specified if a particular range of
+    queues [m, m + n - 1] should be in use.  For example, this is necessary
+    in order
+    to use MLX NICs in native mode.  The driver will create a separate set
+    of queues on top of regular ones, and only these queues can be used
+    for AF_XDP sockets.  MLX NICs will also require an additional traffic
+    redirection with ethtool to these queues.  E.g.:
+
+    .. parsed-literal::
+
+        # set the number of queues to 1
+        ethtool -L eth0 combined 1
+        # redirect all the traffic to the second queue (id: 1)
+        # note: mlx5 driver requires non-empty key/mask pair.
+        ethtool -N eth0 flow-type ether \\
+            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
+        ethtool -N eth0 flow-type ether \\
+            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
+        # launch QEMU instance
+        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
+            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
+
+
 ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
     Establish a vhost-user netdev, backed by a chardev id. The chardev
     should be a unix domain socket backed one. The vhost-user uses a
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index d02b09a4b9..7585c4c4ed 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -35,6 +35,7 @@ 
 --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
 --with-coroutine=ucontext \
 --tls-priority=@QEMU,SYSTEM \
+--disable-af-xdp \
 --disable-attr \
 --disable-auth-pam \
 --disable-avx2 \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 5714fd93d9..e1490fd4fe 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -75,6 +75,7 @@  meson_options_help() {
   printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
   printf "%s\n" '(unless built with --without-default-features):'
   printf "%s\n" ''
+  printf "%s\n" '  af-xdp          AF_XDP network backend support'
   printf "%s\n" '  alsa            ALSA sound support'
   printf "%s\n" '  attr            attr/xattr support'
   printf "%s\n" '  auth-pam        PAM access control'
@@ -208,6 +209,8 @@  meson_options_help() {
 }
 _meson_option_parse() {
   case $1 in
+    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
+    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
     --enable-alsa) printf "%s" -Dalsa=enabled ;;
     --disable-alsa) printf "%s" -Dalsa=disabled ;;
     --enable-attr) printf "%s" -Dattr=enabled ;;
diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
index e39871c7bb..207f7adfb9 100644
--- a/tests/docker/dockerfiles/debian-amd64.docker
+++ b/tests/docker/dockerfiles/debian-amd64.docker
@@ -97,6 +97,7 @@  RUN export DEBIAN_FRONTEND=noninteractive && \
                       libvirglrenderer-dev \
                       libvte-2.91-dev \
                       libxen-dev \
+                      libxdp-dev \
                       libzstd-dev \
                       llvm \
                       locales \