[ovs-dev,v8] netdev-dpdk: Add support for vHost dequeue zero copy (experimental)

Message ID 1515168820-31888-1-git-send-email-ciara.loftus@intel.com
State Superseded
Delegated to: Ian Stokes
Headers show
Series
  • [ovs-dev,v8] netdev-dpdk: Add support for vHost dequeue zero copy (experimental)
Related show

Commit Message

Ciara Loftus Jan. 5, 2018, 4:13 p.m.
Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
option to 'true' when configuring the Interface:

ovs-vsctl set Interface dpdkvhostuserclient0
options:vhost-server-path=/tmp/dpdkvhostuserclient0
options:dq-zero-copy=true

When packets from a vHost device with zero copy enabled are destined for
a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
must be set to a smaller value. 128 is recommended. This can be achieved
like so:

ovs-vsctl set Interface dpdkport options:n_txq_desc=128

Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
to should not exceed 128. Due to this requirement, the feature is
considered 'experimental'.

Testing of the patch showed a 15% improvement when switching 512B
packets between vHost devices on different VMs on the same host when
zero copy was enabled on the transmitting device.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
v8:
* Disallow configurability after vHost device has been registered &
update docs accordingly.
* Give performance datapoint in commit message.

 Documentation/intro/install/dpdk.rst     |  2 +
 Documentation/topics/dpdk/vhost-user.rst | 72 ++++++++++++++++++++++++++++++++
 NEWS                                     |  1 +
 lib/netdev-dpdk.c                        |  9 +++-
 vswitchd/vswitch.xml                     | 11 +++++
 5 files changed, 94 insertions(+), 1 deletion(-)

Comments

Stokes, Ian Jan. 16, 2018, 10 a.m. | #1
> -----Original Message-----
> From: Loftus, Ciara
> Sent: Friday, January 5, 2018 4:14 PM
> To: dev@openvswitch.org
> Cc: Loftus, Ciara <ciara.loftus@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>; jan.scheurich@ericsson.com; ktraynor@redhat.com;
> i.maximets@samsung.com
> Subject: [PATCH v8] netdev-dpdk: Add support for vHost dequeue zero copy
> (experimental)
> 

Hi Ciara,

This seems to have been simplified from the previous patchsets, I'll need to do some testing but some comments inline below that could be addressed for a v9.

Thanks
Ian

> Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> option to 'true' when configuring the Interface:
> 
> ovs-vsctl set Interface dpdkvhostuserclient0
> options:vhost-server-path=/tmp/dpdkvhostuserclient0
> options:dq-zero-copy=true
> 
> When packets from a vHost device with zero copy enabled are destined for a
> single 'dpdk' port, the number of tx descriptors on that 'dpdk' port must
> be set to a smaller value. 128 is recommended. This can be achieved like
> so:
> 
> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> 
> Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> to should not exceed 128. Due to this requirement, the feature is
> considered 'experimental'.
> 
> Testing of the patch showed a 15% improvement when switching 512B packets
> between vHost devices on different VMs on the same host when zero copy was
> enabled on the transmitting device.
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
> v8:
> * Disallow configurability after vHost device has been registered & update
> docs accordingly.
> * Give performance datapoint in commit message.
> 
>  Documentation/intro/install/dpdk.rst     |  2 +
>  Documentation/topics/dpdk/vhost-user.rst | 72
> ++++++++++++++++++++++++++++++++
>  NEWS                                     |  1 +
>  lib/netdev-dpdk.c                        |  9 +++-
>  vswitchd/vswitch.xml                     | 11 +++++
>  5 files changed, 94 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> index 3fecb5c..087eb88 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -518,6 +518,8 @@ The above command sets the number of rx queues for
> DPDK physical interface.
>  The rx queues are assigned to pmd threads on the same NUMA node in a
> round-robin fashion.
> 
> +.. _dpdk-queues-sizes:
> +
>  DPDK Physical Port Queue Sizes
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> index 8447e2d..1a6c6d0 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -458,3 +458,75 @@ Sample XML
>      </domain>
> 
>  .. _QEMU documentation: http://git.qemu-
> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> user.txt;h=7890d7169;hb=HEAD
> +
> +vhost-user Dequeue Zero Copy (experimental)
> +-------------------------------------------
> +
> +Normally when dequeuing a packet from a vHost User device, a memcpy
> +operation must be used to copy that packet from guest address space to
> +host address space. This memcpy can be removed by enabling dequeue zero-
> copy like so::
> +
> +    $ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
> +        dpdkvhostuserclient0 type=dpdkvhostuserclient \
> +        options:vhost-server-path=/tmp/dpdkvhostclient0 \
> +        options:dq-zero-copy=true
> +
> +With this feature enabled, a reference (pointer) to the packet is
> +passed to the host, instead of a copy of the packet. Removing this
> +memcpy can give a performance improvement for some use cases, for
> +example switching large packets between different VMs. However additional
> packet loss may be observed.
> +
> +Note that the feature is disabled by default and must be explicitly
> +enabled by setting the 'dq-zero-copy' option to 'true' while specifying
> +the 'vhost-server-path' option as above. If you wish to split out the
> +command into multiple commands as below, ensure 'dq-zero-copy' is set
> +before
> +'vhost-server-path'::
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
> copy=true
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 \
> +        options:vhost-server-path=/tmp/dpdkvhostclient0
> +
> +The feature is only available to dpdkvhostuserclient port types.
> +
> +A limitation exists whereby if packets from a vHost port with
> +dq-zero-copy=true are destined for a 'dpdk' type port, the number of tx
> +descriptors (n_txq_desc) for that port must be reduced to a smaller
> +number, 128 being the recommended value. This can be achieved by issuing
> the following command::
> +
> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> +
> +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will
> +send to should not exceed 128. For example, in case of a bond over two
> +physical ports in balance-tcp mode, one must divide 128 by the number of
> links in the bond.
> +
> +Refer to :ref:`<dpdk-queue-sizes>` for more information.

Compilation fails with the following from above during the Doc check stage:

WARNING: undefined label: <dpdk-queue-sizes> (if the link has no caption the label must precede a section header)

> +
> +The reason for this limitation is due to how the zero copy
> +functionality is implemented. The vHost device's 'tx used vring', a
> +virtio structure used for tracking used ie. sent descriptors, will only
> +be updated when the NIC frees the corresponding mbuf. If we don't free
> +the mbufs frequently enough, that vring will be starved and packets
> +will no longer be processed. One way to ensure we don't encounter this
> +scenario, is to configure n_txq_desc to a small enough number such that
> +the 'mbuf free threshold' for the NIC will be hit more often and thus
> +free mbufs more frequently. The value of 128 is suggested, but values
> +of 64 and 256 have been tested and verified to work too, with differing
> +performance characteristics. A value of 512 can be used too, if the
> +virtio queue size in the guest is increased to 1024 (available to
> configure in QEMU versions v2.10 and greater). This value can be set like
> so::
> +
> +    $ qemu-system-x86_64 ... -chardev
> socket,id=char1,path=<sockpath>,server
> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> +      tx_queue_size=1024
> +
> +Because of this limitation, this feature is condsidered 'experimental'.

Minor: typo above for 'considered'.

> +
> +The feature currently does not fully work with QEMU >= v2.7 due to a
> +bug in DPDK which will be addressed in an upcoming release. The patch
> +to fix this issue can be found on `Patchwork
> +<http://dpdk.org/dev/patchwork/patch/32198/>`__
> +
> +Further information can be found in the `DPDK documentation
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
> diff --git a/NEWS b/NEWS
> index 752e98f..1eaa32c 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -24,6 +24,7 @@ Post-v2.8.0
>     - DPDK:
>       * Add support for DPDK v17.11
>       * Add support for vHost IOMMU
> +     * Add support for vHost dequeue zero copy (experimental)
> 
>  v2.8.0 - 31 Aug 2017
>  --------------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 8f22264..d100b31
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1384,6 +1384,10 @@ netdev_dpdk_vhost_client_set_config(struct netdev
> *netdev,
>          path = smap_get(args, "vhost-server-path");
>          if (path && strcmp(path, dev->vhost_id)) {
>              strcpy(dev->vhost_id, path);
> +            /* check zero copy configuration */
> +            if (smap_get_bool(args, "dq-zero-copy", false)) {
> +                dev->vhost_driver_flags |=
> RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +            }
>              netdev_request_reconfigure(netdev);
>          }
>      }
> @@ -3278,7 +3282,7 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev
> *netdev)  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      int err;
> -    uint64_t vhost_flags = 0;
> +    uint64_t vhost_flags = dev->vhost_driver_flags;
> 
>      ovs_mutex_lock(&dev->mutex);
> 
> @@ -3307,6 +3311,9 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev
> *netdev)
>              VLOG_INFO("vHost User device '%s' created in 'client' mode, "
>                        "using client socket '%s'",
>                        dev->up.name, dev->vhost_id);
> +            if (dev->vhost_driver_flags &
> RTE_VHOST_USER_DEQUEUE_ZERO_COPY) {
> +                VLOG_INFO("Zero copy enabled for vHost port %s", dev-
> >up.name);
> +            }
>          }
> 
>          err = rte_vhost_driver_callback_register(dev->vhost_id,
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> 018d644..16b3c3e 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> type=patch options:peer=p1 \
>          </p>
>        </column>
> 
> +      <column name="options" key="dq-zero-copy"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          The value specifies whether or not to enable dequeue zero copy
> on
> +          the given interface.
> +          Must be set before vhost-server-path is specified.
> +          Only supported by dpdkvhostuserclient interfaces.
> +          The feature is considered experimental.
> +        </p>
> +      </column>
> +
>        <column name="options" key="n_rxq_desc"
>                type='{"type": "integer", "minInteger": 1, "maxInteger":
> 4096}'>
>          <p>
> --
> 2.7.5
Ilya Maximets Jan. 19, 2018, 11:37 a.m. | #2
On 05.01.2018 19:13, Ciara Loftus wrote:
> Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> option to 'true' when configuring the Interface:
> 
> ovs-vsctl set Interface dpdkvhostuserclient0
> options:vhost-server-path=/tmp/dpdkvhostuserclient0
> options:dq-zero-copy=true
> 
> When packets from a vHost device with zero copy enabled are destined for
> a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> must be set to a smaller value. 128 is recommended. This can be achieved
> like so:
> 
> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> 
> Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> to should not exceed 128. Due to this requirement, the feature is
> considered 'experimental'.
> 
> Testing of the patch showed a 15% improvement when switching 512B
> packets between vHost devices on different VMs on the same host when
> zero copy was enabled on the transmitting device.
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
> v8:
> * Disallow configurability after vHost device has been registered &
> update docs accordingly.
> * Give performance datapoint in commit message.
> 
>  Documentation/intro/install/dpdk.rst     |  2 +
>  Documentation/topics/dpdk/vhost-user.rst | 72 ++++++++++++++++++++++++++++++++
>  NEWS                                     |  1 +
>  lib/netdev-dpdk.c                        |  9 +++-
>  vswitchd/vswitch.xml                     | 11 +++++
>  5 files changed, 94 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
> index 3fecb5c..087eb88 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -518,6 +518,8 @@ The above command sets the number of rx queues for DPDK physical interface.
>  The rx queues are assigned to pmd threads on the same NUMA node in a
>  round-robin fashion.
>  
> +.. _dpdk-queues-sizes:
> +
>  DPDK Physical Port Queue Sizes
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
> index 8447e2d..1a6c6d0 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -458,3 +458,75 @@ Sample XML
>      </domain>
>  
>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
> +
> +vhost-user Dequeue Zero Copy (experimental)
> +-------------------------------------------
> +
> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
> +must be used to copy that packet from guest address space to host address
> +space. This memcpy can be removed by enabling dequeue zero-copy like so::
> +
> +    $ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
> +        dpdkvhostuserclient0 type=dpdkvhostuserclient \
> +        options:vhost-server-path=/tmp/dpdkvhostclient0 \
> +        options:dq-zero-copy=true
> +
> +With this feature enabled, a reference (pointer) to the packet is passed to
> +the host, instead of a copy of the packet. Removing this memcpy can give a
> +performance improvement for some use cases, for example switching large packets
> +between different VMs. However additional packet loss may be observed.
> +
> +Note that the feature is disabled by default and must be explicitly enabled
> +by setting the 'dq-zero-copy' option to 'true' while specifying the
> +'vhost-server-path' option as above. If you wish to split out the command into
> +multiple commands as below, ensure 'dq-zero-copy' is set before
> +'vhost-server-path'::
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 \
> +        options:vhost-server-path=/tmp/dpdkvhostclient0
> +
> +The feature is only available to dpdkvhostuserclient port types.
> +
> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
> +for that port must be reduced to a smaller number, 128 being the recommended
> +value. This can be achieved by issuing the following command::
> +
> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> +
> +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
> +should not exceed 128. For example, in case of a bond over two physical ports
> +in balance-tcp mode, one must divide 128 by the number of links in the bond.
> +
> +Refer to :ref:`<dpdk-queue-sizes>` for more information.
> +
> +The reason for this limitation is due to how the zero copy functionality is
> +implemented. The vHost device's 'tx used vring', a virtio structure used for
> +tracking used ie. sent descriptors, will only be updated when the NIC frees
> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
> +vring will be starved and packets will no longer be processed. One way to
> +ensure we don't encounter this scenario, is to configure n_txq_desc to a small
> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
> +often and thus free mbufs more frequently. The value of 128 is suggested, but
> +values of 64 and 256 have been tested and verified to work too, with differing
> +performance characteristics. A value of 512 can be used too, if the virtio
> +queue size in the guest is increased to 1024 (available to configure in QEMU
> +versions v2.10 and greater). This value can be set like so::
> +
> +    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> +      tx_queue_size=1024
> +
> +Because of this limitation, this feature is condsidered 'experimental'.
> +
> +The feature currently does not fully work with QEMU >= v2.7 due to a bug in
> +DPDK which will be addressed in an upcoming release. The patch to fix this
> +issue can be found on
> +`Patchwork
> +<http://dpdk.org/dev/patchwork/patch/32198/>`__
> +
> +Further information can be found in the
> +`DPDK documentation
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__

There are many keywords like 'vhost-server-path', 'dq-zero-copy', 'true' and
so on in above text. IMHO, they should be marked like ``something`` to be
highlighted.

> diff --git a/NEWS b/NEWS
> index 752e98f..1eaa32c 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -24,6 +24,7 @@ Post-v2.8.0
>     - DPDK:
>       * Add support for DPDK v17.11
>       * Add support for vHost IOMMU
> +     * Add support for vHost dequeue zero copy (experimental)
>  
>  v2.8.0 - 31 Aug 2017
>  --------------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 8f22264..d100b31 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1384,6 +1384,10 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>          path = smap_get(args, "vhost-server-path");
>          if (path && strcmp(path, dev->vhost_id)) {
>              strcpy(dev->vhost_id, path);
> +            /* check zero copy configuration */
> +            if (smap_get_bool(args, "dq-zero-copy", false)) {
> +                dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;

In case of rte_vhost_driver_register() failure while reconfiguration we'll not
be able to disable "dq-zero-copy" for that port, becose we're only anble to enable it.
For example, it could happen in case of typo in vhost_id.

> +            }
>              netdev_request_reconfigure(netdev);
>          }
>      }
> @@ -3278,7 +3282,7 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      int err;
> -    uint64_t vhost_flags = 0;
> +    uint64_t vhost_flags = dev->vhost_driver_flags;

This should be under the mutex.

>  
>      ovs_mutex_lock(&dev->mutex);
>  
> @@ -3307,6 +3311,9 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>              VLOG_INFO("vHost User device '%s' created in 'client' mode, "
>                        "using client socket '%s'",
>                        dev->up.name, dev->vhost_id);
> +            if (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY) {
> +                VLOG_INFO("Zero copy enabled for vHost port %s", dev->up.name);
> +            }
>          }
>  
>          err = rte_vhost_driver_callback_register(dev->vhost_id,
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 018d644..16b3c3e 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="options" key="dq-zero-copy"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          The value specifies whether or not to enable dequeue zero copy on
> +          the given interface.
> +          Must be set before vhost-server-path is specified.
> +          Only supported by dpdkvhostuserclient interfaces.
> +          The feature is considered experimental.
> +        </p>
> +      </column>
> +
>        <column name="options" key="n_rxq_desc"
>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>          <p>
>
Ciara Loftus Jan. 19, 2018, 3:31 p.m. | #3
> 
> On 05.01.2018 19:13, Ciara Loftus wrote:
> > Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> > option to 'true' when configuring the Interface:
> >
> > ovs-vsctl set Interface dpdkvhostuserclient0
> > options:vhost-server-path=/tmp/dpdkvhostuserclient0
> > options:dq-zero-copy=true
> >
> > When packets from a vHost device with zero copy enabled are destined for
> > a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> > must be set to a smaller value. 128 is recommended. This can be achieved
> > like so:
> >
> > ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >
> > Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> > to should not exceed 128. Due to this requirement, the feature is
> > considered 'experimental'.
> >
> > Testing of the patch showed a 15% improvement when switching 512B
> > packets between vHost devices on different VMs on the same host when
> > zero copy was enabled on the transmitting device.
> >
> > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > ---
> > v8:
> > * Disallow configurability after vHost device has been registered &
> > update docs accordingly.
> > * Give performance datapoint in commit message.
> >
> >  Documentation/intro/install/dpdk.rst     |  2 +
> >  Documentation/topics/dpdk/vhost-user.rst | 72
> ++++++++++++++++++++++++++++++++
> >  NEWS                                     |  1 +
> >  lib/netdev-dpdk.c                        |  9 +++-
> >  vswitchd/vswitch.xml                     | 11 +++++
> >  5 files changed, 94 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> > index 3fecb5c..087eb88 100644
> > --- a/Documentation/intro/install/dpdk.rst
> > +++ b/Documentation/intro/install/dpdk.rst
> > @@ -518,6 +518,8 @@ The above command sets the number of rx queues
> for DPDK physical interface.
> >  The rx queues are assigned to pmd threads on the same NUMA node in a
> >  round-robin fashion.
> >
> > +.. _dpdk-queues-sizes:
> > +
> >  DPDK Physical Port Queue Sizes
> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > diff --git a/Documentation/topics/dpdk/vhost-user.rst
> b/Documentation/topics/dpdk/vhost-user.rst
> > index 8447e2d..1a6c6d0 100644
> > --- a/Documentation/topics/dpdk/vhost-user.rst
> > +++ b/Documentation/topics/dpdk/vhost-user.rst
> > @@ -458,3 +458,75 @@ Sample XML
> >      </domain>
> >
> >  .. _QEMU documentation: http://git.qemu-
> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> user.txt;h=7890d7169;hb=HEAD
> > +
> > +vhost-user Dequeue Zero Copy (experimental)
> > +-------------------------------------------
> > +
> > +Normally when dequeuing a packet from a vHost User device, a memcpy
> operation
> > +must be used to copy that packet from guest address space to host
> address
> > +space. This memcpy can be removed by enabling dequeue zero-copy like
> so::
> > +
> > +    $ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
> > +        dpdkvhostuserclient0 type=dpdkvhostuserclient \
> > +        options:vhost-server-path=/tmp/dpdkvhostclient0 \
> > +        options:dq-zero-copy=true
> > +
> > +With this feature enabled, a reference (pointer) to the packet is passed to
> > +the host, instead of a copy of the packet. Removing this memcpy can give
> a
> > +performance improvement for some use cases, for example switching
> large packets
> > +between different VMs. However additional packet loss may be
> observed.
> > +
> > +Note that the feature is disabled by default and must be explicitly enabled
> > +by setting the 'dq-zero-copy' option to 'true' while specifying the
> > +'vhost-server-path' option as above. If you wish to split out the command
> into
> > +multiple commands as below, ensure 'dq-zero-copy' is set before
> > +'vhost-server-path'::
> > +
> > +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
> copy=true
> > +    $ ovs-vsctl set Interface dpdkvhostuserclient0 \
> > +        options:vhost-server-path=/tmp/dpdkvhostclient0
> > +
> > +The feature is only available to dpdkvhostuserclient port types.
> > +
> > +A limitation exists whereby if packets from a vHost port with dq-zero-
> copy=true
> > +are destined for a 'dpdk' type port, the number of tx descriptors
> (n_txq_desc)
> > +for that port must be reduced to a smaller number, 128 being the
> recommended
> > +value. This can be achieved by issuing the following command::
> > +
> > +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> > +
> > +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
> > +should not exceed 128. For example, in case of a bond over two physical
> ports
> > +in balance-tcp mode, one must divide 128 by the number of links in the
> bond.
> > +
> > +Refer to :ref:`<dpdk-queue-sizes>` for more information.
> > +
> > +The reason for this limitation is due to how the zero copy functionality is
> > +implemented. The vHost device's 'tx used vring', a virtio structure used for
> > +tracking used ie. sent descriptors, will only be updated when the NIC
> frees
> > +the corresponding mbuf. If we don't free the mbufs frequently enough,
> that
> > +vring will be starved and packets will no longer be processed. One way to
> > +ensure we don't encounter this scenario, is to configure n_txq_desc to a
> small
> > +enough number such that the 'mbuf free threshold' for the NIC will be hit
> more
> > +often and thus free mbufs more frequently. The value of 128 is
> suggested, but
> > +values of 64 and 256 have been tested and verified to work too, with
> differing
> > +performance characteristics. A value of 512 can be used too, if the virtio
> > +queue size in the guest is increased to 1024 (available to configure in
> QEMU
> > +versions v2.10 and greater). This value can be set like so::
> > +
> > +    $ qemu-system-x86_64 ... -chardev
> socket,id=char1,path=<sockpath>,server
> > +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> > +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> > +      tx_queue_size=1024
> > +
> > +Because of this limitation, this feature is condsidered 'experimental'.
> > +
> > +The feature currently does not fully work with QEMU >= v2.7 due to a bug
> in
> > +DPDK which will be addressed in an upcoming release. The patch to fix this
> > +issue can be found on
> > +`Patchwork
> > +<http://dpdk.org/dev/patchwork/patch/32198/>`__
> > +
> > +Further information can be found in the
> > +`DPDK documentation
> > +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
> 
> There are many keywords like 'vhost-server-path', 'dq-zero-copy', 'true' and
> so on in above text. IMHO, they should be marked like ``something`` to be
> highlighted.
> 
> > diff --git a/NEWS b/NEWS
> > index 752e98f..1eaa32c 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -24,6 +24,7 @@ Post-v2.8.0
> >     - DPDK:
> >       * Add support for DPDK v17.11
> >       * Add support for vHost IOMMU
> > +     * Add support for vHost dequeue zero copy (experimental)
> >
> >  v2.8.0 - 31 Aug 2017
> >  --------------------
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 8f22264..d100b31 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -1384,6 +1384,10 @@ netdev_dpdk_vhost_client_set_config(struct
> netdev *netdev,
> >          path = smap_get(args, "vhost-server-path");
> >          if (path && strcmp(path, dev->vhost_id)) {
> >              strcpy(dev->vhost_id, path);
> > +            /* check zero copy configuration */
> > +            if (smap_get_bool(args, "dq-zero-copy", false)) {
> > +                dev->vhost_driver_flags |=
> RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> 
> In case of rte_vhost_driver_register() failure while reconfiguration we'll not
> be able to disable "dq-zero-copy" for that port, becose we're only anble to
> enable it.
> For example, it could happen in case of typo in vhost_id.

Hi Ilya,

Can you clarify what you want here. Previously we had the ability to enable/disable zc but I removed it due to your comments.
Do you want to be able to enable/disable zc before successful driver register but not after? What is the benefit?

Thanks,
Ciara

> 
> > +            }
> >              netdev_request_reconfigure(netdev);
> >          }
> >      }
> > @@ -3278,7 +3282,7 @@ netdev_dpdk_vhost_client_reconfigure(struct
> netdev *netdev)
> >  {
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >      int err;
> > -    uint64_t vhost_flags = 0;
> > +    uint64_t vhost_flags = dev->vhost_driver_flags;
> 
> This should be under the mutex.
> 
> >
> >      ovs_mutex_lock(&dev->mutex);
> >
> > @@ -3307,6 +3311,9 @@ netdev_dpdk_vhost_client_reconfigure(struct
> netdev *netdev)
> >              VLOG_INFO("vHost User device '%s' created in 'client' mode, "
> >                        "using client socket '%s'",
> >                        dev->up.name, dev->vhost_id);
> > +            if (dev->vhost_driver_flags &
> RTE_VHOST_USER_DEQUEUE_ZERO_COPY) {
> > +                VLOG_INFO("Zero copy enabled for vHost port %s", dev-
> >up.name);
> > +            }
> >          }
> >
> >          err = rte_vhost_driver_callback_register(dev->vhost_id,
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 018d644..16b3c3e 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> type=patch options:peer=p1 \
> >          </p>
> >        </column>
> >
> > +      <column name="options" key="dq-zero-copy"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          The value specifies whether or not to enable dequeue zero copy on
> > +          the given interface.
> > +          Must be set before vhost-server-path is specified.
> > +          Only supported by dpdkvhostuserclient interfaces.
> > +          The feature is considered experimental.
> > +        </p>
> > +      </column>
> > +
> >        <column name="options" key="n_rxq_desc"
> >                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
> >          <p>
> >
Ilya Maximets Jan. 19, 2018, 3:56 p.m. | #4
On 19.01.2018 18:31, Loftus, Ciara wrote:
>>
>> On 05.01.2018 19:13, Ciara Loftus wrote:
>>> Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
>>> option to 'true' when configuring the Interface:
>>>
>>> ovs-vsctl set Interface dpdkvhostuserclient0
>>> options:vhost-server-path=/tmp/dpdkvhostuserclient0
>>> options:dq-zero-copy=true
>>>
>>> When packets from a vHost device with zero copy enabled are destined for
>>> a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
>>> must be set to a smaller value. 128 is recommended. This can be achieved
>>> like so:
>>>
>>> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>>
>>> Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
>>> to should not exceed 128. Due to this requirement, the feature is
>>> considered 'experimental'.
>>>
>>> Testing of the patch showed a 15% improvement when switching 512B
>>> packets between vHost devices on different VMs on the same host when
>>> zero copy was enabled on the transmitting device.
>>>
>>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>>> ---
>>> v8:
>>> * Disallow configurability after vHost device has been registered &
>>> update docs accordingly.
>>> * Give performance datapoint in commit message.
>>>
>>>  Documentation/intro/install/dpdk.rst     |  2 +
>>>  Documentation/topics/dpdk/vhost-user.rst | 72
>> ++++++++++++++++++++++++++++++++
>>>  NEWS                                     |  1 +
>>>  lib/netdev-dpdk.c                        |  9 +++-
>>>  vswitchd/vswitch.xml                     | 11 +++++
>>>  5 files changed, 94 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/intro/install/dpdk.rst
>> b/Documentation/intro/install/dpdk.rst
>>> index 3fecb5c..087eb88 100644
>>> --- a/Documentation/intro/install/dpdk.rst
>>> +++ b/Documentation/intro/install/dpdk.rst
>>> @@ -518,6 +518,8 @@ The above command sets the number of rx queues
>> for DPDK physical interface.
>>>  The rx queues are assigned to pmd threads on the same NUMA node in a
>>>  round-robin fashion.
>>>
>>> +.. _dpdk-queues-sizes:
>>> +
>>>  DPDK Physical Port Queue Sizes
>>>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> diff --git a/Documentation/topics/dpdk/vhost-user.rst
>> b/Documentation/topics/dpdk/vhost-user.rst
>>> index 8447e2d..1a6c6d0 100644
>>> --- a/Documentation/topics/dpdk/vhost-user.rst
>>> +++ b/Documentation/topics/dpdk/vhost-user.rst
>>> @@ -458,3 +458,75 @@ Sample XML
>>>      </domain>
>>>
>>>  .. _QEMU documentation: http://git.qemu-
>> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
>> user.txt;h=7890d7169;hb=HEAD
>>> +
>>> +vhost-user Dequeue Zero Copy (experimental)
>>> +-------------------------------------------
>>> +
>>> +Normally when dequeuing a packet from a vHost User device, a memcpy
>> operation
>>> +must be used to copy that packet from guest address space to host
>> address
>>> +space. This memcpy can be removed by enabling dequeue zero-copy like
>> so::
>>> +
>>> +    $ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
>>> +        dpdkvhostuserclient0 type=dpdkvhostuserclient \
>>> +        options:vhost-server-path=/tmp/dpdkvhostclient0 \
>>> +        options:dq-zero-copy=true
>>> +
>>> +With this feature enabled, a reference (pointer) to the packet is passed to
>>> +the host, instead of a copy of the packet. Removing this memcpy can give
>> a
>>> +performance improvement for some use cases, for example switching
>> large packets
>>> +between different VMs. However additional packet loss may be
>> observed.
>>> +
>>> +Note that the feature is disabled by default and must be explicitly enabled
>>> +by setting the 'dq-zero-copy' option to 'true' while specifying the
>>> +'vhost-server-path' option as above. If you wish to split out the command
>> into
>>> +multiple commands as below, ensure 'dq-zero-copy' is set before
>>> +'vhost-server-path'::
>>> +
>>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
>> copy=true
>>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 \
>>> +        options:vhost-server-path=/tmp/dpdkvhostclient0
>>> +
>>> +The feature is only available to dpdkvhostuserclient port types.
>>> +
>>> +A limitation exists whereby if packets from a vHost port with dq-zero-
>> copy=true
>>> +are destined for a 'dpdk' type port, the number of tx descriptors
>> (n_txq_desc)
>>> +for that port must be reduced to a smaller number, 128 being the
>> recommended
>>> +value. This can be achieved by issuing the following command::
>>> +
>>> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>> +
>>> +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
>>> +should not exceed 128. For example, in case of a bond over two physical
>> ports
>>> +in balance-tcp mode, one must divide 128 by the number of links in the
>> bond.
>>> +
>>> +Refer to :ref:`<dpdk-queue-sizes>` for more information.
>>> +
>>> +The reason for this limitation is due to how the zero copy functionality is
>>> +implemented. The vHost device's 'tx used vring', a virtio structure used for
>>> +tracking used ie. sent descriptors, will only be updated when the NIC
>> frees
>>> +the corresponding mbuf. If we don't free the mbufs frequently enough,
>> that
>>> +vring will be starved and packets will no longer be processed. One way to
>>> +ensure we don't encounter this scenario, is to configure n_txq_desc to a
>> small
>>> +enough number such that the 'mbuf free threshold' for the NIC will be hit
>> more
>>> +often and thus free mbufs more frequently. The value of 128 is
>> suggested, but
>>> +values of 64 and 256 have been tested and verified to work too, with
>> differing
>>> +performance characteristics. A value of 512 can be used too, if the virtio
>>> +queue size in the guest is increased to 1024 (available to configure in
>> QEMU
>>> +versions v2.10 and greater). This value can be set like so::
>>> +
>>> +    $ qemu-system-x86_64 ... -chardev
>> socket,id=char1,path=<sockpath>,server
>>> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
>>> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
>>> +      tx_queue_size=1024
>>> +
>>> +Because of this limitation, this feature is condsidered 'experimental'.
>>> +
>>> +The feature currently does not fully work with QEMU >= v2.7 due to a bug
>> in
>>> +DPDK which will be addressed in an upcoming release. The patch to fix this
>>> +issue can be found on
>>> +`Patchwork
>>> +<http://dpdk.org/dev/patchwork/patch/32198/>`__
>>> +
>>> +Further information can be found in the
>>> +`DPDK documentation
>>> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
>>
>> There are many keywords like 'vhost-server-path', 'dq-zero-copy', 'true' and
>> so on in above text. IMHO, they should be marked like ``something`` to be
>> highlighted.
>>
>>> diff --git a/NEWS b/NEWS
>>> index 752e98f..1eaa32c 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -24,6 +24,7 @@ Post-v2.8.0
>>>     - DPDK:
>>>       * Add support for DPDK v17.11
>>>       * Add support for vHost IOMMU
>>> +     * Add support for vHost dequeue zero copy (experimental)
>>>
>>>  v2.8.0 - 31 Aug 2017
>>>  --------------------
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 8f22264..d100b31 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -1384,6 +1384,10 @@ netdev_dpdk_vhost_client_set_config(struct
>> netdev *netdev,
>>>          path = smap_get(args, "vhost-server-path");
>>>          if (path && strcmp(path, dev->vhost_id)) {
>>>              strcpy(dev->vhost_id, path);
>>> +            /* check zero copy configuration */
>>> +            if (smap_get_bool(args, "dq-zero-copy", false)) {
>>> +                dev->vhost_driver_flags |=
>> RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>
>> In case of rte_vhost_driver_register() failure while reconfiguration we'll not
>> be able to disable "dq-zero-copy" for that port, becose we're only anble to
>> enable it.
>> For example, it could happen in case of typo in vhost_id.
> 
> Hi Ilya,
> 
> Can you clarify what you want here. Previously we had the ability to enable/disable zc but I removed it due to your comments.
> Do you want to be able to enable/disable zc before successful driver register but not after? What is the benefit?

Hello Ciara,
I'm thinking about following case:

Trying to register device with vhost-server-path=A and dq-zero-copy=true and
it fails for whatever reason. After that I can change vhost-server-path to
something else, but have no way to disable the zero-copy mode.

I'm not sure if there any profit from changing zero-copy mode after the
driver register failure, but it's not hard to fix and, probably, could
save someone's nerves.

I think, simple 'else' condition should work here:

    } else {
        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
    }

What do you think?

> 
> Thanks,
> Ciara
> 
>>
>>> +            }
>>>              netdev_request_reconfigure(netdev);
>>>          }
>>>      }
>>> @@ -3278,7 +3282,7 @@ netdev_dpdk_vhost_client_reconfigure(struct
>> netdev *netdev)
>>>  {
>>>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>      int err;
>>> -    uint64_t vhost_flags = 0;
>>> +    uint64_t vhost_flags = dev->vhost_driver_flags;
>>
>> This should be under the mutex.
>>
>>>
>>>      ovs_mutex_lock(&dev->mutex);
>>>
>>> @@ -3307,6 +3311,9 @@ netdev_dpdk_vhost_client_reconfigure(struct
>> netdev *netdev)
>>>              VLOG_INFO("vHost User device '%s' created in 'client' mode, "
>>>                        "using client socket '%s'",
>>>                        dev->up.name, dev->vhost_id);
>>> +            if (dev->vhost_driver_flags &
>> RTE_VHOST_USER_DEQUEUE_ZERO_COPY) {
>>> +                VLOG_INFO("Zero copy enabled for vHost port %s", dev-
>>> up.name);
>>> +            }
>>>          }
>>>
>>>          err = rte_vhost_driver_callback_register(dev->vhost_id,
>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>> index 018d644..16b3c3e 100644
>>> --- a/vswitchd/vswitch.xml
>>> +++ b/vswitchd/vswitch.xml
>>> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
>> type=patch options:peer=p1 \
>>>          </p>
>>>        </column>
>>>
>>> +      <column name="options" key="dq-zero-copy"
>>> +              type='{"type": "boolean"}'>
>>> +        <p>
>>> +          The value specifies whether or not to enable dequeue zero copy on
>>> +          the given interface.
>>> +          Must be set before vhost-server-path is specified.
>>> +          Only supported by dpdkvhostuserclient interfaces.
>>> +          The feature is considered experimental.
>>> +        </p>
>>> +      </column>
>>> +
>>>        <column name="options" key="n_rxq_desc"
>>>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>>>          <p>
>>>
Ciara Loftus Jan. 19, 2018, 4:10 p.m. | #5
> 
> On 19.01.2018 18:31, Loftus, Ciara wrote:
> >>
> >> On 05.01.2018 19:13, Ciara Loftus wrote:
> >>> Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
> >>> option to 'true' when configuring the Interface:
> >>>
> >>> ovs-vsctl set Interface dpdkvhostuserclient0
> >>> options:vhost-server-path=/tmp/dpdkvhostuserclient0
> >>> options:dq-zero-copy=true
> >>>
> >>> When packets from a vHost device with zero copy enabled are destined
> for
> >>> a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> >>> must be set to a smaller value. 128 is recommended. This can be
> achieved
> >>> like so:
> >>>
> >>> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >>>
> >>> Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> >>> to should not exceed 128. Due to this requirement, the feature is
> >>> considered 'experimental'.
> >>>
> >>> Testing of the patch showed a 15% improvement when switching 512B
> >>> packets between vHost devices on different VMs on the same host
> when
> >>> zero copy was enabled on the transmitting device.
> >>>
> >>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> >>> ---
> >>> v8:
> >>> * Disallow configurability after vHost device has been registered &
> >>> update docs accordingly.
> >>> * Give performance datapoint in commit message.
> >>>
> >>>  Documentation/intro/install/dpdk.rst     |  2 +
> >>>  Documentation/topics/dpdk/vhost-user.rst | 72
> >> ++++++++++++++++++++++++++++++++
> >>>  NEWS                                     |  1 +
> >>>  lib/netdev-dpdk.c                        |  9 +++-
> >>>  vswitchd/vswitch.xml                     | 11 +++++
> >>>  5 files changed, 94 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/Documentation/intro/install/dpdk.rst
> >> b/Documentation/intro/install/dpdk.rst
> >>> index 3fecb5c..087eb88 100644
> >>> --- a/Documentation/intro/install/dpdk.rst
> >>> +++ b/Documentation/intro/install/dpdk.rst
> >>> @@ -518,6 +518,8 @@ The above command sets the number of rx
> queues
> >> for DPDK physical interface.
> >>>  The rx queues are assigned to pmd threads on the same NUMA node in
> a
> >>>  round-robin fashion.
> >>>
> >>> +.. _dpdk-queues-sizes:
> >>> +
> >>>  DPDK Physical Port Queue Sizes
> >>>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>
> >>> diff --git a/Documentation/topics/dpdk/vhost-user.rst
> >> b/Documentation/topics/dpdk/vhost-user.rst
> >>> index 8447e2d..1a6c6d0 100644
> >>> --- a/Documentation/topics/dpdk/vhost-user.rst
> >>> +++ b/Documentation/topics/dpdk/vhost-user.rst
> >>> @@ -458,3 +458,75 @@ Sample XML
> >>>      </domain>
> >>>
> >>>  .. _QEMU documentation: http://git.qemu-
> >> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> >> user.txt;h=7890d7169;hb=HEAD
> >>> +
> >>> +vhost-user Dequeue Zero Copy (experimental)
> >>> +-------------------------------------------
> >>> +
> >>> +Normally when dequeuing a packet from a vHost User device, a
> memcpy
> >> operation
> >>> +must be used to copy that packet from guest address space to host
> >> address
> >>> +space. This memcpy can be removed by enabling dequeue zero-copy
> like
> >> so::
> >>> +
> >>> +    $ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
> >>> +        dpdkvhostuserclient0 type=dpdkvhostuserclient \
> >>> +        options:vhost-server-path=/tmp/dpdkvhostclient0 \
> >>> +        options:dq-zero-copy=true
> >>> +
> >>> +With this feature enabled, a reference (pointer) to the packet is passed
> to
> >>> +the host, instead of a copy of the packet. Removing this memcpy can
> give
> >> a
> >>> +performance improvement for some use cases, for example switching
> >> large packets
> >>> +between different VMs. However additional packet loss may be
> >> observed.
> >>> +
> >>> +Note that the feature is disabled by default and must be explicitly
> enabled
> >>> +by setting the 'dq-zero-copy' option to 'true' while specifying the
> >>> +'vhost-server-path' option as above. If you wish to split out the
> command
> >> into
> >>> +multiple commands as below, ensure 'dq-zero-copy' is set before
> >>> +'vhost-server-path'::
> >>> +
> >>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
> >> copy=true
> >>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 \
> >>> +        options:vhost-server-path=/tmp/dpdkvhostclient0
> >>> +
> >>> +The feature is only available to dpdkvhostuserclient port types.
> >>> +
> >>> +A limitation exists whereby if packets from a vHost port with dq-zero-
> >> copy=true
> >>> +are destined for a 'dpdk' type port, the number of tx descriptors
> >> (n_txq_desc)
> >>> +for that port must be reduced to a smaller number, 128 being the
> >> recommended
> >>> +value. This can be achieved by issuing the following command::
> >>> +
> >>> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> >>> +
> >>> +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> to
> >>> +should not exceed 128. For example, in case of a bond over two
> physical
> >> ports
> >>> +in balance-tcp mode, one must divide 128 by the number of links in the
> >> bond.
> >>> +
> >>> +Refer to :ref:`<dpdk-queue-sizes>` for more information.
> >>> +
> >>> +The reason for this limitation is due to how the zero copy functionality
> is
> >>> +implemented. The vHost device's 'tx used vring', a virtio structure used
> for
> >>> +tracking used ie. sent descriptors, will only be updated when the NIC
> >> frees
> >>> +the corresponding mbuf. If we don't free the mbufs frequently
> enough,
> >> that
> >>> +vring will be starved and packets will no longer be processed. One way
> to
> >>> +ensure we don't encounter this scenario, is to configure n_txq_desc to
> a
> >> small
> >>> +enough number such that the 'mbuf free threshold' for the NIC will be
> hit
> >> more
> >>> +often and thus free mbufs more frequently. The value of 128 is
> >> suggested, but
> >>> +values of 64 and 256 have been tested and verified to work too, with
> >> differing
> >>> +performance characteristics. A value of 512 can be used too, if the virtio
> >>> +queue size in the guest is increased to 1024 (available to configure in
> >> QEMU
> >>> +versions v2.10 and greater). This value can be set like so::
> >>> +
> >>> +    $ qemu-system-x86_64 ... -chardev
> >> socket,id=char1,path=<sockpath>,server
> >>> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> >>> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> >>> +      tx_queue_size=1024
> >>> +
> >>> +Because of this limitation, this feature is condsidered 'experimental'.
> >>> +
> >>> +The feature currently does not fully work with QEMU >= v2.7 due to a
> bug
> >> in
> >>> +DPDK which will be addressed in an upcoming release. The patch to fix
> this
> >>> +issue can be found on
> >>> +`Patchwork
> >>> +<http://dpdk.org/dev/patchwork/patch/32198/>`__
> >>> +
> >>> +Further information can be found in the
> >>> +`DPDK documentation
> >>>
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
> >>
> >> There are many keywords like 'vhost-server-path', 'dq-zero-copy', 'true'
> and
> >> so on in above text. IMHO, they should be marked like ``something`` to be
> >> highlighted.
> >>
> >>> diff --git a/NEWS b/NEWS
> >>> index 752e98f..1eaa32c 100644
> >>> --- a/NEWS
> >>> +++ b/NEWS
> >>> @@ -24,6 +24,7 @@ Post-v2.8.0
> >>>     - DPDK:
> >>>       * Add support for DPDK v17.11
> >>>       * Add support for vHost IOMMU
> >>> +     * Add support for vHost dequeue zero copy (experimental)
> >>>
> >>>  v2.8.0 - 31 Aug 2017
> >>>  --------------------
> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >>> index 8f22264..d100b31 100644
> >>> --- a/lib/netdev-dpdk.c
> >>> +++ b/lib/netdev-dpdk.c
> >>> @@ -1384,6 +1384,10 @@ netdev_dpdk_vhost_client_set_config(struct
> >> netdev *netdev,
> >>>          path = smap_get(args, "vhost-server-path");
> >>>          if (path && strcmp(path, dev->vhost_id)) {
> >>>              strcpy(dev->vhost_id, path);
> >>> +            /* check zero copy configuration */
> >>> +            if (smap_get_bool(args, "dq-zero-copy", false)) {
> >>> +                dev->vhost_driver_flags |=
> >> RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> >>
> >> In case of rte_vhost_driver_register() failure while reconfiguration we'll
> not
> >> be able to disable "dq-zero-copy" for that port, becose we're only anble
> to
> >> enable it.
> >> For example, it could happen in case of typo in vhost_id.
> >
> > Hi Ilya,
> >
> > Can you clarify what you want here. Previously we had the ability to
> enable/disable zc but I removed it due to your comments.
> > Do you want to be able to enable/disable zc before successful driver
> register but not after? What is the benefit?
> 
> Hello Ciara,
> I'm thinking about following case:
> 
> Trying to register device with vhost-server-path=A and dq-zero-copy=true
> and
> it fails for whatever reason. After that I can change vhost-server-path to
> something else, but have no way to disable the zero-copy mode.
> 
> I'm not sure if there any profit from changing zero-copy mode after the
> driver register failure, but it's not hard to fix and, probably, could
> save someone's nerves.
> 
> I think, simple 'else' condition should work here:
> 
>     } else {
>         dev->vhost_driver_flags &=
> ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>     }
> 
> What do you think?

Thanks for clarifying Ilya, that is reasonable.
Thanks for your feedback. Hope to get a v9 out soon..

Thanks,
Ciara

> 
> >
> > Thanks,
> > Ciara
> >
> >>
> >>> +            }
> >>>              netdev_request_reconfigure(netdev);
> >>>          }
> >>>      }
> >>> @@ -3278,7 +3282,7 @@ netdev_dpdk_vhost_client_reconfigure(struct
> >> netdev *netdev)
> >>>  {
> >>>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >>>      int err;
> >>> -    uint64_t vhost_flags = 0;
> >>> +    uint64_t vhost_flags = dev->vhost_driver_flags;
> >>
> >> This should be under the mutex.
> >>
> >>>
> >>>      ovs_mutex_lock(&dev->mutex);
> >>>
> >>> @@ -3307,6 +3311,9 @@ netdev_dpdk_vhost_client_reconfigure(struct
> >> netdev *netdev)
> >>>              VLOG_INFO("vHost User device '%s' created in 'client' mode, "
> >>>                        "using client socket '%s'",
> >>>                        dev->up.name, dev->vhost_id);
> >>> +            if (dev->vhost_driver_flags &
> >> RTE_VHOST_USER_DEQUEUE_ZERO_COPY) {
> >>> +                VLOG_INFO("Zero copy enabled for vHost port %s", dev-
> >>> up.name);
> >>> +            }
> >>>          }
> >>>
> >>>          err = rte_vhost_driver_callback_register(dev->vhost_id,
> >>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> >>> index 018d644..16b3c3e 100644
> >>> --- a/vswitchd/vswitch.xml
> >>> +++ b/vswitchd/vswitch.xml
> >>> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> >> type=patch options:peer=p1 \
> >>>          </p>
> >>>        </column>
> >>>
> >>> +      <column name="options" key="dq-zero-copy"
> >>> +              type='{"type": "boolean"}'>
> >>> +        <p>
> >>> +          The value specifies whether or not to enable dequeue zero copy
> on
> >>> +          the given interface.
> >>> +          Must be set before vhost-server-path is specified.
> >>> +          Only supported by dpdkvhostuserclient interfaces.
> >>> +          The feature is considered experimental.
> >>> +        </p>
> >>> +      </column>
> >>> +
> >>>        <column name="options" key="n_rxq_desc"
> >>>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
> >>>          <p>
> >>>

Patch

diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
index 3fecb5c..087eb88 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -518,6 +518,8 @@  The above command sets the number of rx queues for DPDK physical interface.
 The rx queues are assigned to pmd threads on the same NUMA node in a
 round-robin fashion.
 
+.. _dpdk-queues-sizes:
+
 DPDK Physical Port Queue Sizes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
index 8447e2d..1a6c6d0 100644
--- a/Documentation/topics/dpdk/vhost-user.rst
+++ b/Documentation/topics/dpdk/vhost-user.rst
@@ -458,3 +458,75 @@  Sample XML
     </domain>
 
 .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
+
+vhost-user Dequeue Zero Copy (experimental)
+-------------------------------------------
+
+Normally when dequeuing a packet from a vHost User device, a memcpy operation
+must be used to copy that packet from guest address space to host address
+space. This memcpy can be removed by enabling dequeue zero-copy like so::
+
+    $ ovs-vsctl add-port br0 dpdkvhostuserclient0 -- set Interface \
+        dpdkvhostuserclient0 type=dpdkvhostuserclient \
+        options:vhost-server-path=/tmp/dpdkvhostclient0 \
+        options:dq-zero-copy=true
+
+With this feature enabled, a reference (pointer) to the packet is passed to
+the host, instead of a copy of the packet. Removing this memcpy can give a
+performance improvement for some use cases, for example switching large packets
+between different VMs. However additional packet loss may be observed.
+
+Note that the feature is disabled by default and must be explicitly enabled
+by setting the 'dq-zero-copy' option to 'true' while specifying the
+'vhost-server-path' option as above. If you wish to split out the command into
+multiple commands as below, ensure 'dq-zero-copy' is set before
+'vhost-server-path'::
+
+    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
+    $ ovs-vsctl set Interface dpdkvhostuserclient0 \
+        options:vhost-server-path=/tmp/dpdkvhostclient0
+
+The feature is only available to dpdkvhostuserclient port types.
+
+A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
+are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
+for that port must be reduced to a smaller number, 128 being the recommended
+value. This can be achieved by issuing the following command::
+
+    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
+
+Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
+should not exceed 128. For example, in case of a bond over two physical ports
+in balance-tcp mode, one must divide 128 by the number of links in the bond.
+
+Refer to :ref:`<dpdk-queue-sizes>` for more information.
+
+The reason for this limitation is due to how the zero copy functionality is
+implemented. The vHost device's 'tx used vring', a virtio structure used for
+tracking used ie. sent descriptors, will only be updated when the NIC frees
+the corresponding mbuf. If we don't free the mbufs frequently enough, that
+vring will be starved and packets will no longer be processed. One way to
+ensure we don't encounter this scenario, is to configure n_txq_desc to a small
+enough number such that the 'mbuf free threshold' for the NIC will be hit more
+often and thus free mbufs more frequently. The value of 128 is suggested, but
+values of 64 and 256 have been tested and verified to work too, with differing
+performance characteristics. A value of 512 can be used too, if the virtio
+queue size in the guest is increased to 1024 (available to configure in QEMU
+versions v2.10 and greater). This value can be set like so::
+
+    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
+      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
+      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
+      tx_queue_size=1024
+
+Because of this limitation, this feature is condsidered 'experimental'.
+
+The feature currently does not fully work with QEMU >= v2.7 due to a bug in
+DPDK which will be addressed in an upcoming release. The patch to fix this
+issue can be found on
+`Patchwork
+<http://dpdk.org/dev/patchwork/patch/32198/>`__
+
+Further information can be found in the
+`DPDK documentation
+<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
diff --git a/NEWS b/NEWS
index 752e98f..1eaa32c 100644
--- a/NEWS
+++ b/NEWS
@@ -24,6 +24,7 @@  Post-v2.8.0
    - DPDK:
      * Add support for DPDK v17.11
      * Add support for vHost IOMMU
+     * Add support for vHost dequeue zero copy (experimental)
 
 v2.8.0 - 31 Aug 2017
 --------------------
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 8f22264..d100b31 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1384,6 +1384,10 @@  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
         path = smap_get(args, "vhost-server-path");
         if (path && strcmp(path, dev->vhost_id)) {
             strcpy(dev->vhost_id, path);
+            /* check zero copy configuration */
+            if (smap_get_bool(args, "dq-zero-copy", false)) {
+                dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+            }
             netdev_request_reconfigure(netdev);
         }
     }
@@ -3278,7 +3282,7 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     int err;
-    uint64_t vhost_flags = 0;
+    uint64_t vhost_flags = dev->vhost_driver_flags;
 
     ovs_mutex_lock(&dev->mutex);
 
@@ -3307,6 +3311,9 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
             VLOG_INFO("vHost User device '%s' created in 'client' mode, "
                       "using client socket '%s'",
                       dev->up.name, dev->vhost_id);
+            if (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY) {
+                VLOG_INFO("Zero copy enabled for vHost port %s", dev->up.name);
+            }
         }
 
         err = rte_vhost_driver_callback_register(dev->vhost_id,
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 018d644..16b3c3e 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -2669,6 +2669,17 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="options" key="dq-zero-copy"
+              type='{"type": "boolean"}'>
+        <p>
+          The value specifies whether or not to enable dequeue zero copy on
+          the given interface.
+          Must be set before vhost-server-path is specified.
+          Only supported by dpdkvhostuserclient interfaces.
+          The feature is considered experimental.
+        </p>
+      </column>
+
       <column name="options" key="n_rxq_desc"
               type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
         <p>