
[ovs-dev,v5,2/2] netdev-dpdk: Enable optional dequeue zero copy for vHost User

Message ID 1512746819-26061-3-git-send-email-ciara.loftus@intel.com
State Superseded
Delegated to: Ian Stokes
Series vHost Dequeue Zero Copy

Commit Message

Ciara Loftus Dec. 8, 2017, 3:26 p.m. UTC
Enabled per port like so:
ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true

The feature is disabled by default and can only be enabled/disabled when
a vHost port is down.

When packets from a vHost device with zero copy enabled are destined for
a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
set to a smaller value. 128 is recommended. This can be achieved like
so:

ovs-vsctl set Interface dpdkport options:n_txq_desc=128

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
v5:
* Rebase
* Update docs with warning of potential packet loss
* Update docs with info on NIC & Virtio descriptor values and working
combinations

 Documentation/howto/dpdk.rst             | 33 ++++++++++++
 Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++
 NEWS                                     |  2 +
 lib/netdev-dpdk.c                        | 89 +++++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml                     | 11 ++++
 5 files changed, 195 insertions(+), 1 deletion(-)

Comments

Ilya Maximets Dec. 14, 2017, 3:30 p.m. UTC | #1
Hello Ciara,
Thanks for the patches.

I did not review the code, but I have a few general concerns about the number
of tx descriptors in HW NICs, inline.

Best regards, Ilya Maximets.

On 08.12.2017 18:26, Ciara Loftus wrote:
> Enabled per port like so:
> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> 
> The feature is disabled by default and can only be enabled/disabled when
> a vHost port is down.
> 
> When packets from a vHost device with zero copy enabled are destined for
> a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
> set to a smaller value. 128 is recommended. This can be achieved like
> so:
> 
> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
> v5:
> * Rebase
> * Update docs with warning of potential packet loss
> * Update docs with info on NIC & Virtio descriptor values and working
> combinations
> 
>  Documentation/howto/dpdk.rst             | 33 ++++++++++++
>  Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++
>  NEWS                                     |  2 +
>  lib/netdev-dpdk.c                        | 89 +++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml                     | 11 ++++
>  5 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
> index d123819..3e1b8f8 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow the below steps:
>     Check traffic on multiple queues::
>  
>         $ cat /proc/interrupts | grep virtio
> +
> +PHY-VM-PHY (vHost Dequeue Zero Copy)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +vHost dequeue zero copy functionality can be validated using the
> +PHY-VM-PHY configuration. To begin, follow the steps described in
> +:ref:`dpdk-phy-phy` to create and initialize the database, start
> +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
> +and flows to bridge ``br0``. Once complete, follow the below steps:
> +
> +1. Enable dequeue zero copy on the vHost devices.
> +
> +       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
> +       $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
> +
> +The following log should be observed for each device:
> +
> +       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
> +
> +2. Reduce the number of txq descriptors on the phy ports.
> +
> +       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
> +       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
> +
> +3. Proceed with the test by launching the VM and configuring guest
> +forwarding, be it via the vHost loopback method or kernel forwarding
> +method, and sending traffic. The following log should be observed for
> +each device as it becomes active during VM boot:
> +
> +       VHOST_CONFIG: dequeue zero copy is enabled
> +
> +It is essential that step 1 is performed before booting the VM, otherwise
> +the feature will not be enabled.
> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
> index 8b1c671..8587370 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -441,3 +441,64 @@ Sample XML
>      </domain>
>  
>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
> +
> +vhost-user Dequeue Zero Copy
> +-------------------------------------
> +
> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
> +must be used to copy that packet from guest address space to host address
> +space. This memcpy can be removed by enabling dequeue zero-copy like so:
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> +
> +With this feature enabled, a reference (pointer) to the packet is passed to
> +the host, instead of a copy of the packet. Removing this memcpy can give a
> +performance improvement for some use cases, for example switching large packets
> +between different VMs. However, additional packet loss may be observed.
> +
> +Note that the feature is disabled by default and must be explicitly enabled
> +by using the command above.
> +
> +The feature cannot be enabled when the device is active (i.e. VM booted). If
> +you wish to enable the feature after the VM has booted, you must shut down
> +the VM and bring it back up.
> +
> +The same logic applies when disabling the feature - it must be disabled when
> +the device is inactive, for example before VM boot. To disable the feature:
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
> +
> +The feature is available to both dpdkvhostuser and dpdkvhostuserclient port
> +types.
> +
> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
> +for that port must be reduced to a smaller number, 128 being the recommended
> +value. This can be achieved by issuing the following command:
> +
> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> +
> +More information on the n_txq_desc option can be found in the "DPDK Physical
> +Port Queue Sizes" section of the `intro/install/dpdk.rst` guide.
> +
> +The reason for this limitation is due to how the zero copy functionality is
> +implemented. The vHost device's 'tx used vring', a virtio structure used for
> +tracking used, i.e. sent, descriptors, will only be updated when the NIC frees
> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
> +vring will be starved and packets will no longer be processed. One way to
> +ensure we don't encounter this scenario is to configure n_txq_desc to a small
> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
> +often and thus free mbufs more frequently. The value of 128 is suggested, but
> +values of 64 and 256 have been tested and verified to work too, with differing
> +performance characteristics. A value of 512 can be used too, if the virtio
> +queue size in the guest is increased to 1024 (available to configure in QEMU
> +versions v2.10 and greater). This value can be set like so:
> +
> +    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> +      tx_queue_size=1024

What if we have 2, 4, 10 HW NICs and a few different destinations for traffic from the VM?
What if we have 2, 4, 10 HW NICs added to a balanced OVS bond and just a few different packet flows?
What if we have HW NICs with a small number of TX queues (like VFs) and XPS in use?
We may actually get stuck with a non-working virtual interface in all of the above cases.

IMHO, this feature should be marked as experimental until we have a proper solution
for those stuck mbufs in HW rings.

Thoughts?

> +
> +Further information can be found in the
> +`DPDK documentation
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
> diff --git a/NEWS b/NEWS
> index d45904e..50630f8 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -18,6 +18,8 @@ Post-v2.8.0
>     - DPDK:
>       * Add support for DPDK v17.11
>       * Add support for vHost IOMMU
> +     * Optional dequeue zero copy feature for vHost ports can be enabled per
> +       port via the boolean 'dq-zero-copy' option.
>  
>  v2.8.0 - 31 Aug 2017
>  --------------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 879019e..03cc6cb 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -361,6 +361,9 @@ struct netdev_dpdk {
>          /* True if vHost device is 'up' and has been reconfigured at least once */
>          bool vhost_reconfigured;
>          /* 3 pad bytes here. */
> +
> +        /* True if dq-zero-copy feature has successfully been enabled */
> +        bool dq_zc_enabled;
>      );
>  
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -871,6 +874,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      ovsrcu_index_init(&dev->vid, -1);
>      dev->vhost_reconfigured = false;
> +    dev->dq_zc_enabled = false;
>      dev->attached = false;
>  
>      ovsrcu_init(&dev->qos_conf, NULL);
> @@ -1383,6 +1387,29 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
>      return 0;
>  }
>  
> +static void
> +dpdk_vhost_set_config_helper(struct netdev_dpdk *dev,
> +                             const struct smap *args)
> +{
> +    bool needs_reconfigure = false;
> +    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
> +
> +    if (zc_requested &&
> +            !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> +        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        needs_reconfigure = true;
> +    } else if (!zc_requested &&
> +            (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> +        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        needs_reconfigure = true;
> +    }
> +
> +    /* Only try to change ZC mode when device is down */
> +    if (needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1)) {
> +        netdev_request_reconfigure(&dev->up);
> +    }
> +}
> +
>  static int
>  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>                                      const struct smap *args,
> @@ -1399,6 +1426,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>              netdev_request_reconfigure(netdev);
>          }
>      }
> +
> +    dpdk_vhost_set_config_helper(dev, args);
> +
> +    ovs_mutex_unlock(&dev->mutex);
> +
> +    return 0;
> +}
> +
> +static int
> +netdev_dpdk_vhost_set_config(struct netdev *netdev,
> +                             const struct smap *args,
> +                             char **errp OVS_UNUSED)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    dpdk_vhost_set_config_helper(dev, args);
>      ovs_mutex_unlock(&dev->mutex);
>  
>      return 0;
> @@ -2723,6 +2767,46 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>      }
>  }
>  
> +static void
> +vhost_change_zero_copy_mode(struct netdev_dpdk *dev, bool client_mode,
> +                            bool enable)
> +{
> +    int err = rte_vhost_driver_unregister(dev->vhost_id);
> +
> +    if (err) {
> +        VLOG_ERR("Error unregistering vHost socket %s; can't change zero copy "
> +                "mode", dev->vhost_id);
> +    } else {
> +        err = dpdk_setup_vhost_device(dev, client_mode);
> +        if (err) {
> +            VLOG_ERR("Error changing zero copy mode for vHost socket %s",
> +                    dev->vhost_id);
> +        } else if (enable) {
> +            dev->dq_zc_enabled = true;
> +            VLOG_INFO("Zero copy enabled for vHost socket %s", dev->vhost_id);
> +        } else {
> +            dev->dq_zc_enabled = false;
> +            VLOG_INFO("Zero copy disabled for vHost socket %s", dev->vhost_id);
> +        }
> +    }
> +}
> +
> +static void
> +vhost_check_zero_copy_status(struct netdev_dpdk *dev)
> +{
> +    bool mode = dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT;
> +
> +    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
> +                && !dev->dq_zc_enabled) {
> +        /* ZC disabled but requested to be enabled, enable it. */
> +        vhost_change_zero_copy_mode(dev, mode, true);
> +    } else if (!(dev->vhost_driver_flags &
> +            RTE_VHOST_USER_DEQUEUE_ZERO_COPY) && dev->dq_zc_enabled) {
> +        /* ZC enabled but requested to be disabled, disable it. */
> +        vhost_change_zero_copy_mode(dev, mode, false);
> +    }
> +}
> +
>  /*
>   * Remove a virtio-net device from the specific vhost port.  Use dev->remove
>   * flag to stop any more packets from being sent or received to/from a VM and
> @@ -2768,6 +2852,7 @@ destroy_device(int vid)
>           */
>          ovsrcu_quiesce_start();
>          VLOG_INFO("vHost Device '%s' has been removed", ifname);
> +        netdev_request_reconfigure(&dev->up);
>      } else {
>          VLOG_INFO("vHost Device '%s' not found", ifname);
>      }
> @@ -3259,6 +3344,8 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>              /* Carrier status may need updating. */
>              netdev_change_seq_changed(&dev->up);
>          }
> +    } else {
> +        vhost_check_zero_copy_status(dev);
>      }
>  
>      return 0;
> @@ -3421,7 +3508,7 @@ static const struct netdev_class dpdk_vhost_class =
>          NULL,
>          netdev_dpdk_vhost_construct,
>          netdev_dpdk_vhost_destruct,
> -        NULL,
> +        netdev_dpdk_vhost_set_config,
>          NULL,
>          netdev_dpdk_vhost_send,
>          netdev_dpdk_vhost_get_carrier,
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4c317d0..e8409b4 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="options" key="dq-zero-copy"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          The value specifies whether or not to enable dequeue zero copy on
> +          the given interface.
> +          The port must be in an inactive state in order to enable or disable
> +          this feature.
> +          Only supported by dpdkvhostuserclient and dpdkvhostuser interfaces.
> +        </p>
> +      </column>
> +
>        <column name="options" key="n_rxq_desc"
>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>          <p>
>
Kevin Traynor Dec. 14, 2017, 4:26 p.m. UTC | #2
On 12/14/2017 03:30 PM, Ilya Maximets wrote:
> Hello Ciara,
> Thanks for the patches.
> 
> I did not review the code, but I have a few general concerns about the number
> of tx descriptors in HW NICs, inline.
> 
> Best regards, Ilya Maximets.
> 
> On 08.12.2017 18:26, Ciara Loftus wrote:
>> Enabled per port like so:
>> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>>
>> The feature is disabled by default and can only be enabled/disabled when
>> a vHost port is down.
>>
>> When packets from a vHost device with zero copy enabled are destined for
>> a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
>> set to a smaller value. 128 is recommended. This can be achieved like
>> so:
>>
>> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>
>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> ---
>> v5:
>> * Rebase
>> * Update docs with warning of potential packet loss
>> * Update docs with info on NIC & Virtio descriptor values and working
>> combinations
>>
>>  Documentation/howto/dpdk.rst             | 33 ++++++++++++
>>  Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++
>>  NEWS                                     |  2 +
>>  lib/netdev-dpdk.c                        | 89 +++++++++++++++++++++++++++++++-
>>  vswitchd/vswitch.xml                     | 11 ++++
>>  5 files changed, 195 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
>> index d123819..3e1b8f8 100644
>> --- a/Documentation/howto/dpdk.rst
>> +++ b/Documentation/howto/dpdk.rst
>> @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow the below steps:
>>     Check traffic on multiple queues::
>>  
>>         $ cat /proc/interrupts | grep virtio
>> +
>> +PHY-VM-PHY (vHost Dequeue Zero Copy)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +vHost dequeue zero copy functionality can be validated using the
>> +PHY-VM-PHY configuration. To begin, follow the steps described in
>> +:ref:`dpdk-phy-phy` to create and initialize the database, start
>> +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
>> +and flows to bridge ``br0``. Once complete, follow the below steps:
>> +
>> +1. Enable dequeue zero copy on the vHost devices.
>> +
>> +       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
>> +       $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
>> +
>> +The following log should be observed for each device:
>> +
>> +       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
>> +
>> +2. Reduce the number of txq descriptors on the phy ports.
>> +
>> +       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
>> +       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
>> +
>> +3. Proceed with the test by launching the VM and configuring guest
>> +forwarding, be it via the vHost loopback method or kernel forwarding
>> +method, and sending traffic. The following log should be observed for
>> +each device as it becomes active during VM boot:
>> +
>> +       VHOST_CONFIG: dequeue zero copy is enabled
>> +
>> +It is essential that step 1 is performed before booting the VM, otherwise
>> +the feature will not be enabled.
>> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
>> index 8b1c671..8587370 100644
>> --- a/Documentation/topics/dpdk/vhost-user.rst
>> +++ b/Documentation/topics/dpdk/vhost-user.rst
>> @@ -441,3 +441,64 @@ Sample XML
>>      </domain>
>>  
>>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
>> +
>> +vhost-user Dequeue Zero Copy
>> +-------------------------------------
>> +
>> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
>> +must be used to copy that packet from guest address space to host address
>> +space. This memcpy can be removed by enabling dequeue zero-copy like so:
>> +
>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>> +
>> +With this feature enabled, a reference (pointer) to the packet is passed to
>> +the host, instead of a copy of the packet. Removing this memcpy can give a
>> +performance improvement for some use cases, for example switching large packets
>> +between different VMs. However, additional packet loss may be observed.
>> +
>> +Note that the feature is disabled by default and must be explicitly enabled
>> +by using the command above.
>> +
>> +The feature cannot be enabled when the device is active (i.e. VM booted). If
>> +you wish to enable the feature after the VM has booted, you must shut down
>> +the VM and bring it back up.
>> +
>> +The same logic applies when disabling the feature - it must be disabled when
>> +the device is inactive, for example before VM boot. To disable the feature:
>> +
>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
>> +
>> +The feature is available to both dpdkvhostuser and dpdkvhostuserclient port
>> +types.
>> +
>> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
>> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
>> +for that port must be reduced to a smaller number, 128 being the recommended
>> +value. This can be achieved by issuing the following command:
>> +
>> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>> +
>> +More information on the n_txq_desc option can be found in the "DPDK Physical
>> +Port Queue Sizes" section of the `intro/install/dpdk.rst` guide.
>> +
>> +The reason for this limitation is due to how the zero copy functionality is
>> +implemented. The vHost device's 'tx used vring', a virtio structure used for
>> +tracking used, i.e. sent, descriptors, will only be updated when the NIC frees
>> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
>> +vring will be starved and packets will no longer be processed. One way to
>> +ensure we don't encounter this scenario is to configure n_txq_desc to a small
>> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
>> +often and thus free mbufs more frequently. The value of 128 is suggested, but
>> +values of 64 and 256 have been tested and verified to work too, with differing
>> +performance characteristics. A value of 512 can be used too, if the virtio
>> +queue size in the guest is increased to 1024 (available to configure in QEMU
>> +versions v2.10 and greater). This value can be set like so:
>> +
>> +    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
>> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
>> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
>> +      tx_queue_size=1024
> 
> What if we have 2, 4, 10 HW NICs and a few different destinations for traffic from the VM?
> What if we have 2, 4, 10 HW NICs added to a balanced OVS bond and just a few different packet flows?
> What if we have HW NICs with a small number of TX queues (like VFs) and XPS in use?
> We may actually get stuck with a non-working virtual interface in all of the above cases.
> 
> IMHO, this feature should be marked as experimental until we have a proper solution
> for those stuck mbufs in HW rings.
> 
> Thoughts?
> 

How about using rte_eth_tx_done_cleanup()? I guess a user would not
want to combine this feature with sw tx batching though.
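
For illustration, such a cleanup could look something like the sketch below
(untested; the helper name and where it would be called from are hypothetical):

    #include <rte_ethdev.h>

    /* Untested sketch: ask the PMD to free up to 64 transmitted but
     * not-yet-freed mbufs on a TX queue, so the vhost 'tx used vring'
     * can make progress even when the ring is mostly idle. */
    static void
    txq_drain_completed_mbufs(uint16_t port_id, uint16_t queue_id)
    {
        int cleaned = rte_eth_tx_done_cleanup(port_id, queue_id, 64);

        if (cleaned < 0) {
            /* e.g. -ENOTSUP: the PMD does not implement the callback,
             * so we can only rely on tx_free_thresh being crossed. */
        }
    }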

On a related note, I would like to ask Ciara/Jan about the use case in which
this feature would likely be enabled. The RFC2544 test is very opaque, but it
seems there is at least some increased packet loss, so would it be just
when the user knows it is vm-vm and large packets? (In which case you
could argue packets would not get stuck in the HW NIC anyway.)

Kevin.

>> +
>> +Further information can be found in the
>> +`DPDK documentation
>> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
>> diff --git a/NEWS b/NEWS
>> index d45904e..50630f8 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -18,6 +18,8 @@ Post-v2.8.0
>>     - DPDK:
>>       * Add support for DPDK v17.11
>>       * Add support for vHost IOMMU
>> +     * Optional dequeue zero copy feature for vHost ports can be enabled per
>> +       port via the boolean 'dq-zero-copy' option.
>>  
>>  v2.8.0 - 31 Aug 2017
>>  --------------------
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 879019e..03cc6cb 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -361,6 +361,9 @@ struct netdev_dpdk {
>>          /* True if vHost device is 'up' and has been reconfigured at least once */
>>          bool vhost_reconfigured;
>>          /* 3 pad bytes here. */
>> +
>> +        /* True if dq-zero-copy feature has successfully been enabled */
>> +        bool dq_zc_enabled;
>>      );
>>  
>>      PADDED_MEMBERS(CACHE_LINE_SIZE,
>> @@ -871,6 +874,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>      ovsrcu_index_init(&dev->vid, -1);
>>      dev->vhost_reconfigured = false;
>> +    dev->dq_zc_enabled = false;
>>      dev->attached = false;
>>  
>>      ovsrcu_init(&dev->qos_conf, NULL);
>> @@ -1383,6 +1387,29 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
>>      return 0;
>>  }
>>  
>> +static void
>> +dpdk_vhost_set_config_helper(struct netdev_dpdk *dev,
>> +                             const struct smap *args)
>> +{
>> +    bool needs_reconfigure = false;
>> +    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
>> +
>> +    if (zc_requested &&
>> +            !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
>> +        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>> +        needs_reconfigure = true;
>> +    } else if (!zc_requested &&
>> +            (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
>> +        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>> +        needs_reconfigure = true;
>> +    }
>> +
>> +    /* Only try to change ZC mode when device is down */
>> +    if (needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1)) {
>> +        netdev_request_reconfigure(&dev->up);
>> +    }
>> +}
>> +
>>  static int
>>  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>                                      const struct smap *args,
>> @@ -1399,6 +1426,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>              netdev_request_reconfigure(netdev);
>>          }
>>      }
>> +
>> +    dpdk_vhost_set_config_helper(dev, args);
>> +
>> +    ovs_mutex_unlock(&dev->mutex);
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>> +netdev_dpdk_vhost_set_config(struct netdev *netdev,
>> +                             const struct smap *args,
>> +                             char **errp OVS_UNUSED)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +    ovs_mutex_lock(&dev->mutex);
>> +    dpdk_vhost_set_config_helper(dev, args);
>>      ovs_mutex_unlock(&dev->mutex);
>>  
>>      return 0;
>> @@ -2723,6 +2767,46 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>>      }
>>  }
>>  
>> +static void
>> +vhost_change_zero_copy_mode(struct netdev_dpdk *dev, bool client_mode,
>> +                            bool enable)
>> +{
>> +    int err = rte_vhost_driver_unregister(dev->vhost_id);
>> +
>> +    if (err) {
>> +        VLOG_ERR("Error unregistering vHost socket %s; can't change zero copy "
>> +                "mode", dev->vhost_id);
>> +    } else {
>> +        err = dpdk_setup_vhost_device(dev, client_mode);
>> +        if (err) {
>> +            VLOG_ERR("Error changing zero copy mode for vHost socket %s",
>> +                    dev->vhost_id);
>> +        } else if (enable) {
>> +            dev->dq_zc_enabled = true;
>> +            VLOG_INFO("Zero copy enabled for vHost socket %s", dev->vhost_id);
>> +        } else {
>> +            dev->dq_zc_enabled = false;
>> +            VLOG_INFO("Zero copy disabled for vHost socket %s", dev->vhost_id);
>> +        }
>> +    }
>> +}
>> +
>> +static void
>> +vhost_check_zero_copy_status(struct netdev_dpdk *dev)
>> +{
>> +    bool mode = dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT;
>> +
>> +    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
>> +                && !dev->dq_zc_enabled) {
>> +        /* ZC disabled but requested to be enabled, enable it. */
>> +        vhost_change_zero_copy_mode(dev, mode, true);
>> +    } else if (!(dev->vhost_driver_flags &
>> +            RTE_VHOST_USER_DEQUEUE_ZERO_COPY) && dev->dq_zc_enabled) {
>> +        /* ZC enabled but requested to be disabled, disable it. */
>> +        vhost_change_zero_copy_mode(dev, mode, false);
>> +    }
>> +}
>> +
>>  /*
>>   * Remove a virtio-net device from the specific vhost port.  Use dev->remove
>>   * flag to stop any more packets from being sent or received to/from a VM and
>> @@ -2768,6 +2852,7 @@ destroy_device(int vid)
>>           */
>>          ovsrcu_quiesce_start();
>>          VLOG_INFO("vHost Device '%s' has been removed", ifname);
>> +        netdev_request_reconfigure(&dev->up);
>>      } else {
>>          VLOG_INFO("vHost Device '%s' not found", ifname);
>>      }
>> @@ -3259,6 +3344,8 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>              /* Carrier status may need updating. */
>>              netdev_change_seq_changed(&dev->up);
>>          }
>> +    } else {
>> +        vhost_check_zero_copy_status(dev);
>>      }
>>  
>>      return 0;
>> @@ -3421,7 +3508,7 @@ static const struct netdev_class dpdk_vhost_class =
>>          NULL,
>>          netdev_dpdk_vhost_construct,
>>          netdev_dpdk_vhost_destruct,
>> -        NULL,
>> +        netdev_dpdk_vhost_set_config,
>>          NULL,
>>          netdev_dpdk_vhost_send,
>>          netdev_dpdk_vhost_get_carrier,
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 4c317d0..e8409b4 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>>          </p>
>>        </column>
>>  
>> +      <column name="options" key="dq-zero-copy"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          The value specifies whether or not to enable dequeue zero copy on
>> +          the given interface.
>> +          The port must be in an inactive state in order to enable or disable
>> +          this feature.
>> +          Only supported by dpdkvhostuserclient and dpdkvhostuser interfaces.
>> +        </p>
>> +      </column>
>> +
>>        <column name="options" key="n_rxq_desc"
>>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>>          <p>
>>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Ilya Maximets Dec. 15, 2017, 9:39 a.m. UTC | #3
On 14.12.2017 19:26, Kevin Traynor wrote:
> On 12/14/2017 03:30 PM, Ilya Maximets wrote:
>> Hello Ciara,
>> Thanks for the patches.
>>
>> I did not review the code, but I have a few general concerns about the number
>> of tx descriptors in HW NICs, inline.
>>
>> Best regards, Ilya Maximets.
>>
>> On 08.12.2017 18:26, Ciara Loftus wrote:
>>> Enabled per port like so:
>>> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>>>
>>> The feature is disabled by default and can only be enabled/disabled when
>>> a vHost port is down.
>>>
>>> When packets from a vHost device with zero copy enabled are destined for
>>> a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
>>> set to a smaller value. 128 is recommended. This can be achieved like
>>> so:
>>>
>>> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>>
>>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>>> ---
>>> v5:
>>> * Rebase
>>> * Update docs with warning of potential packet loss
>>> * Update docs with info on NIC & Virtio descriptor values and working
>>> combinations
>>>
>>>  Documentation/howto/dpdk.rst             | 33 ++++++++++++
>>>  Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++
>>>  NEWS                                     |  2 +
>>>  lib/netdev-dpdk.c                        | 89 +++++++++++++++++++++++++++++++-
>>>  vswitchd/vswitch.xml                     | 11 ++++
>>>  5 files changed, 195 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
>>> index d123819..3e1b8f8 100644
>>> --- a/Documentation/howto/dpdk.rst
>>> +++ b/Documentation/howto/dpdk.rst
>>> @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow the below steps:
>>>     Check traffic on multiple queues::
>>>  
>>>         $ cat /proc/interrupts | grep virtio
>>> +
>>> +PHY-VM-PHY (vHost Dequeue Zero Copy)
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +vHost dequeue zero copy functionality can be validated using the
>>> +PHY-VM-PHY configuration. To begin, follow the steps described in
>>> +:ref:`dpdk-phy-phy` to create and initialize the database, start
>>> +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
>>> +and flows to bridge ``br0``. Once complete, follow the below steps:
>>> +
>>> +1. Enable dequeue zero copy on the vHost devices.
>>> +
>>> +       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
>>> +       $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
>>> +
>>> +The following log should be observed for each device:
>>> +
>>> +       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
>>> +
>>> +2. Reduce the number of txq descriptors on the phy ports.
>>> +
>>> +       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
>>> +       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
>>> +
>>> +3. Proceed with the test by launching the VM and configuring guest
>>> +forwarding, be it via the vHost loopback method or kernel forwarding
>>> +method, and sending traffic. The following log should be observed for
>>> +each device as it becomes active during VM boot:
>>> +
>>> +       VHOST_CONFIG: dequeue zero copy is enabled
>>> +
>>> +It is essential that step 1 is performed before booting the VM, otherwise
>>> +the feature will not be enabled.
>>> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
>>> index 8b1c671..8587370 100644
>>> --- a/Documentation/topics/dpdk/vhost-user.rst
>>> +++ b/Documentation/topics/dpdk/vhost-user.rst
>>> @@ -441,3 +441,64 @@ Sample XML
>>>      </domain>
>>>  
>>>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
>>> +
>>> +vhost-user Dequeue Zero Copy
>>> +-------------------------------------
>>> +
>>> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
>>> +must be used to copy that packet from guest address space to host address
>>> +space. This memcpy can be removed by enabling dequeue zero-copy like so:
>>> +
>>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>>> +
>>> +With this feature enabled, a reference (pointer) to the packet is passed to
>>> +the host, instead of a copy of the packet. Removing this memcpy can give a
>>> +performance improvement for some use cases, for example switching large packets
>>> +between different VMs. However, additional packet loss may be observed.
>>> +
>>> +Note that the feature is disabled by default and must be explicitly enabled
>>> +by using the command above.
>>> +
>>> +The feature cannot be enabled when the device is active (i.e. VM booted). If
>>> +you wish to enable the feature after the VM has booted, you must shut down
>>> +the VM and bring it back up.
>>> +
>>> +The same logic applies when disabling the feature - it must be disabled when
>>> +the device is inactive, for example before VM boot. To disable the feature:
>>> +
>>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
>>> +
>>> +The feature is available to both dpdkvhostuser and dpdkvhostuserclient port
>>> +types.
>>> +
>>> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
>>> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
>>> +for that port must be reduced to a smaller number, 128 being the recommended
>>> +value. This can be achieved by issuing the following command:
>>> +
>>> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>> +
>>> +More information on the n_txq_desc option can be found in the "DPDK Physical
>>> +Port Queue Sizes" section of the `intro/install/dpdk.rst` guide.
>>> +
>>> +The reason for this limitation is due to how the zero copy functionality is
>>> +implemented. The vHost device's 'tx used vring', a virtio structure used for
>>> +tracking used, i.e. sent, descriptors, will only be updated when the NIC frees
>>> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
>>> +vring will be starved and packets will no longer be processed. One way to
>>> +ensure we don't encounter this scenario is to configure n_txq_desc to a small
>>> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
>>> +often and thus free mbufs more frequently. The value of 128 is suggested, but
>>> +values of 64 and 256 have been tested and verified to work too, with differing
>>> +performance characteristics. A value of 512 can be used too, if the virtio
>>> +queue size in the guest is increased to 1024 (available to configure in QEMU
>>> +versions v2.10 and greater). This value can be set like so:
>>> +
>>> +    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
>>> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
>>> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
>>> +      tx_queue_size=1024
>>
>> What if we have 2, 4, 10 HW NICs and a few different destinations for traffic from the VM?
>> What if we have 2, 4, 10 HW NICs added to a balanced OVS bond and just a few different packet flows?
>> What if we have HW NICs with a small number of TX queues (like VFs) and XPS in use?
>> We may actually get stuck with a non-working virtual interface in all of the above cases.
>>
>> IMHO, this feature should be marked as experimental until we have a proper solution
>> for those stuck mbufs in HW rings.
>>
>> Thoughts?
>>
> 
> How about using rte_eth_tx_done_cleanup()?

This API is currently implemented only by the igb driver.
It will not be of much use.

> I guess a user would not
> want to combine this feature with sw tx batching though.
> 
> On a related note, I would like to ask Ciara/Jan about the use case in which
> this feature would likely be enabled. The RFC2544 test is very opaque, but it
> seems there is at least some increased packet loss, so would it be just
> when the user knows it is vm-vm and large packets? (In which case you
> could argue packets would not get stuck in the HW NIC anyway.)
> 
> Kevin.
> 
>>> +
>>> +Further information can be found in the
>>> +`DPDK documentation
>>> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
>>> diff --git a/NEWS b/NEWS
>>> index d45904e..50630f8 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -18,6 +18,8 @@ Post-v2.8.0
>>>     - DPDK:
>>>       * Add support for DPDK v17.11
>>>       * Add support for vHost IOMMU
>>> +     * Optional dequeue zero copy feature for vHost ports can be enabled per
>>> +       port via the boolean 'dq-zero-copy' option.
>>>  
>>>  v2.8.0 - 31 Aug 2017
>>>  --------------------
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 879019e..03cc6cb 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -361,6 +361,9 @@ struct netdev_dpdk {
>>>          /* True if vHost device is 'up' and has been reconfigured at least once */
>>>          bool vhost_reconfigured;
>>>          /* 3 pad bytes here. */
>>> +
>>> +        /* True if dq-zero-copy feature has successfully been enabled */
>>> +        bool dq_zc_enabled;
>>>      );
>>>  
>>>      PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> @@ -871,6 +874,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>>>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>      ovsrcu_index_init(&dev->vid, -1);
>>>      dev->vhost_reconfigured = false;
>>> +    dev->dq_zc_enabled = false;
>>>      dev->attached = false;
>>>  
>>>      ovsrcu_init(&dev->qos_conf, NULL);
>>> @@ -1383,6 +1387,29 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
>>>      return 0;
>>>  }
>>>  
>>> +static void
>>> +dpdk_vhost_set_config_helper(struct netdev_dpdk *dev,
>>> +                             const struct smap *args)
>>> +{
>>> +    bool needs_reconfigure = false;
>>> +    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
>>> +
>>> +    if (zc_requested &&
>>> +            !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
>>> +        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>> +        needs_reconfigure = true;
>>> +    } else if (!zc_requested &&
>>> +            (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
>>> +        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>> +        needs_reconfigure = true;
>>> +    }
>>> +
>>> +    /* Only try to change ZC mode when device is down */
>>> +    if (needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1)) {
>>> +        netdev_request_reconfigure(&dev->up);
>>> +    }
>>> +}
>>> +
>>>  static int
>>>  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>>                                      const struct smap *args,
>>> @@ -1399,6 +1426,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>>              netdev_request_reconfigure(netdev);
>>>          }
>>>      }
>>> +
>>> +    dpdk_vhost_set_config_helper(dev, args);
>>> +
>>> +    ovs_mutex_unlock(&dev->mutex);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int
>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev,
>>> +                             const struct smap *args,
>>> +                             char **errp OVS_UNUSED)
>>> +{
>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +    dpdk_vhost_set_config_helper(dev, args);
>>>      ovs_mutex_unlock(&dev->mutex);
>>>  
>>>      return 0;
>>> @@ -2723,6 +2767,46 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>>>      }
>>>  }
>>>  
>>> +static void
>>> +vhost_change_zero_copy_mode(struct netdev_dpdk *dev, bool client_mode,
>>> +                            bool enable)
>>> +{
>>> +    int err = rte_vhost_driver_unregister(dev->vhost_id);
>>> +
>>> +    if (err) {
>>> +        VLOG_ERR("Error unregistering vHost socket %s; can't change zero copy "
>>> +                "mode", dev->vhost_id);
>>> +    } else {
>>> +        err = dpdk_setup_vhost_device(dev, client_mode);
>>> +        if (err) {
>>> +            VLOG_ERR("Error changing zero copy mode for vHost socket %s",
>>> +                    dev->vhost_id);
>>> +        } else if (enable) {
>>> +            dev->dq_zc_enabled = true;
>>> +            VLOG_INFO("Zero copy enabled for vHost socket %s", dev->vhost_id);
>>> +        } else {
>>> +            dev->dq_zc_enabled = false;
>>> +            VLOG_INFO("Zero copy disabled for vHost socket %s", dev->vhost_id);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +vhost_check_zero_copy_status(struct netdev_dpdk *dev)
>>> +{
>>> +    bool mode = dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT;
>>> +
>>> +    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
>>> +                && !dev->dq_zc_enabled) {
>>> +        /* ZC disabled but requested to be enabled, enable it. */
>>> +        vhost_change_zero_copy_mode(dev, mode, true);
>>> +    } else if (!(dev->vhost_driver_flags &
>>> +            RTE_VHOST_USER_DEQUEUE_ZERO_COPY) && dev->dq_zc_enabled) {
>>> +        /* ZC enabled but requested to be disabled, disable it. */
>>> +        vhost_change_zero_copy_mode(dev, mode, false);
>>> +    }
>>> +}
>>> +
>>>  /*
>>>   * Remove a virtio-net device from the specific vhost port.  Use dev->remove
>>>   * flag to stop any more packets from being sent or received to/from a VM and
>>> @@ -2768,6 +2852,7 @@ destroy_device(int vid)
>>>           */
>>>          ovsrcu_quiesce_start();
>>>          VLOG_INFO("vHost Device '%s' has been removed", ifname);
>>> +        netdev_request_reconfigure(&dev->up);
>>>      } else {
>>>          VLOG_INFO("vHost Device '%s' not found", ifname);
>>>      }
>>> @@ -3259,6 +3344,8 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>>              /* Carrier status may need updating. */
>>>              netdev_change_seq_changed(&dev->up);
>>>          }
>>> +    } else {
>>> +        vhost_check_zero_copy_status(dev);
>>>      }
>>>  
>>>      return 0;
>>> @@ -3421,7 +3508,7 @@ static const struct netdev_class dpdk_vhost_class =
>>>          NULL,
>>>          netdev_dpdk_vhost_construct,
>>>          netdev_dpdk_vhost_destruct,
>>> -        NULL,
>>> +        netdev_dpdk_vhost_set_config,
>>>          NULL,
>>>          netdev_dpdk_vhost_send,
>>>          netdev_dpdk_vhost_get_carrier,
>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>> index 4c317d0..e8409b4 100644
>>> --- a/vswitchd/vswitch.xml
>>> +++ b/vswitchd/vswitch.xml
>>> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>>>          </p>
>>>        </column>
>>>  
>>> +      <column name="options" key="dq-zero-copy"
>>> +              type='{"type": "boolean"}'>
>>> +        <p>
>>> +          The value specifies whether or not to enable dequeue zero copy on
>>> +          the given interface.
>>> +          The port must be in an inactive state in order to enable or disable
>>> +          this feature.
>>> +          Only supported by dpdkvhostuserclient and dpdkvhostuser interfaces.
>>> +        </p>
>>> +      </column>
>>> +
>>>        <column name="options" key="n_rxq_desc"
>>>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>>>          <p>
>>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
> 
> 
> 
>
Stokes, Ian Dec. 15, 2017, 11:27 a.m. UTC | #4
> Hello Ciara,

> Thanks for the patches.

> 

> I did not review the code. But I have few general concerns about number of

> tx descriptors in HW NICs inline.

> 

> Best regards, Ilya Maximets.

> 

> On 08.12.2017 18:26, Ciara Loftus wrote:

> > Enabled per port like so:

> > ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true

> >

> > The feature is disabled by default and can only be enabled/disabled

> > when a vHost port is down.

> >

> > When packets from a vHost device with zero copy enabled are destined

> > for a 'dpdk' port, the number of tx descriptors on that 'dpdk' port

> > must be set to a smaller value. 128 is recommended. This can be

> > achieved like

> > so:

> >

> > ovs-vsctl set Interface dpdkport options:n_txq_desc=128

> >

> > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>

> > ---

> > v5:

> > * Rebase

> > * Update docs with warning of potential packet loss

> > * Update docs with info on NIC & Virtio descriptor values and working

> > combinations

> >

> >  Documentation/howto/dpdk.rst             | 33 ++++++++++++

> >  Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++

> >  NEWS                                     |  2 +

> >  lib/netdev-dpdk.c                        | 89

> +++++++++++++++++++++++++++++++-

> >  vswitchd/vswitch.xml                     | 11 ++++

> >  5 files changed, 195 insertions(+), 1 deletion(-)

> >

> > diff --git a/Documentation/howto/dpdk.rst

> > b/Documentation/howto/dpdk.rst index d123819..3e1b8f8 100644

> > --- a/Documentation/howto/dpdk.rst

> > +++ b/Documentation/howto/dpdk.rst

> > @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow

> the below steps:

> >     Check traffic on multiple queues::

> >

> >         $ cat /proc/interrupts | grep virtio

> > +

> > +PHY-VM-PHY (vHost Dequeue Zero Copy)

> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> > +

> > +vHost dequeue zero copy functionality can  be validated using the

> > +PHY-VM-PHY configuration. To begin, follow the steps described in

> > +:ref:`dpdk-phy-phy` to create and initialize the database, start

> > +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices

> > +and flows to bridge ``br0``. Once complete, follow the below steps:

> > +

> > +1. Enable dequeue zero copy on the vHost devices.

> > +

> > +       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-

> copy=true

> > +       $ ovs-vsctl set Interface dpdkvhostuser1

> > + options:dq-zero-copy=true

> > +

> > +The following log should be observed for each device:

> > +

> > +       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>

> > +

> > +2. Reduce the number of txq descriptors on the phy ports.

> > +

> > +       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128

> > +       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128

> > +

> > +3. Proceed with the test by launching the VM and configuring guest

> > +forwarding, be it via the vHost loopback method or kernel forwarding

> > +method, and sending traffic. The following log should be oberved for

> > +each device as it becomes active during VM boot:

> > +

> > +       VHOST_CONFIG: dequeue zero copy is enabled

> > +

> > +It is essential that step 1 is performed before booting the VM,

> > +otherwise the feature will not be enabled.

> > diff --git a/Documentation/topics/dpdk/vhost-user.rst

> > b/Documentation/topics/dpdk/vhost-user.rst

> > index 8b1c671..8587370 100644

> > --- a/Documentation/topics/dpdk/vhost-user.rst

> > +++ b/Documentation/topics/dpdk/vhost-user.rst

> > @@ -441,3 +441,64 @@ Sample XML

> >      </domain>

> >

> >  .. _QEMU documentation:

> > http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user

> > .txt;h=7890d7169;hb=HEAD

> > +

> > +vhost-user Dequeue Zero Copy

> > +-------------------------------------

> > +

> > +Normally when dequeuing a packet from a vHost User device, a memcpy

> > +operation must be used to copy that packet from guest address space

> > +to host address space. This memcpy can be removed by enabling dequeue

> zero-copy like so:

> > +

> > +    $ ovs-vsctl set Interface dpdkvhostuserclient0

> > + options:dq-zero-copy=true

> > +

> > +With this feature enabled, a reference (pointer) to the packet is

> > +passed to the host, instead of a copy of the packet. Removing this

> > +memcpy can give a performance improvement for some use cases, for

> > +example switching large packets between different VMs. However

> additional packet loss may be observed.

> > +

> > +Note that the feature is disabled by default and must be explicitly

> > +enabled by using the command above.

> > +

> > +The feature cannot be enabled when the device is active (ie. VM

> > +booted). If you wish to enable the feature after the VM has booted,

> > +you must shutdown the VM and bring it back up.

> > +

> > +The same logic applies for disabling the feature - it must be

> > +disabled when the device is inactive, for example before VM boot. To

> disable the feature:

> > +

> > +    $ ovs-vsctl set Interface dpdkvhostuserclient0

> > + options:dq-zero-copy=false

> > +

> > +The feature is available to both dpdkvhostuser and

> > +dpdkvhostuserclient port types.

> > +

> > +A limitation exists whereby if packets from a vHost port with

> > +dq-zero-copy=true are destined for a 'dpdk' type port, the number of

> > +tx descriptors (n_txq_desc) for that port must be reduced to a

> > +smaller number, 128 being the recommended value. This can be achieved

> by issuing the following command:

> > +

> > +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128

> > +

> > +More information on the n_txq_desc option can be found in the "DPDK

> > +Physical Port Queue Sizes" section of the  `intro/install/dpdk.rst`

> guide.

> > +

> > +The reason for this limitation is due to how the zero copy

> > +functionality is implemented. The vHost device's 'tx used vring', a

> > +virtio structure used for tracking used ie. sent descriptors, will

> > +only be updated when the NIC frees the corresponding mbuf. If we

> > +don't free the mbufs frequently enough, that vring will be starved

> > +and packets will no longer be processed. One way to ensure we don't

> > +encounter this scenario, is to configure n_txq_desc to a small enough

> > +number such that the 'mbuf free threshold' for the NIC will be hit

> > +more often and thus free mbufs more frequently. The value of 128 is

> > +suggested, but values of 64 and 256 have been tested and verified to

> > +work too, with differing performance characteristics. A value of 512

> > +can be used too, if the virtio queue size in the guest is increased to

> 1024 (available to configure in QEMU versions v2.10 and greater). This

> value can be set like so:

> > +

> > +    $ qemu-system-x86_64 ... -chardev

> socket,id=char1,path=<sockpath>,server

> > +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce

> > +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,

> > +      tx_queue_size=1024

> 

> What if we have 2, 4, 10 HW NICs and a few different destinations for

> traffic from VM?

> What if we have 2, 4, 10 HW NICs added to balanced OVS bonding and just a

> few different packet flows?

> What if we have HW NICs with small number of TX queues (like VFs) and XPS

> is working?

> We may actually stuck with non-working virtual interface in all above

> cases.

> 

> IMHO, this feature should be marked as experimental until we don't have

> proper solution for that stuck mbufs in HW rings.

> 

> Thoughts?


My 2 cents, I'd like to see this get in for 2.9.

I understand there are concerns with corner cases and config combinations but it has been available for review and testing for a while and for the most part no blocking issues were found.

But as there could be a fix required for the mbuff and rings and other corner cases as highlighted above as testing continues I wouldn’t be opposed to putting an experimental tag against the feature for the 2.9 release.

Ian

Jan Scheurich Dec. 15, 2017, 12:32 p.m. UTC | #5
Hi,

> >> What if we have 2, 4, 10 HW NICs and a few different destinations for traffic from VM?
> >> What if we have 2, 4, 10 HW NICs added to balanced OVS bonding and just a few different packet flows?
> >> What if we have HW NICs with small number of TX queues (like VFs) and XPS is working?
> >> We may actually get stuck with a non-working virtual interface in all above cases.
> >>
> >> IMHO, this feature should be marked as experimental until we have a proper solution
> >> for the stuck mbufs in HW rings.
> >>
> >> Thoughts?

I definitely think this feature, if merged, should be declared experimental. As far as I can see there is no safe way to prevent vhostuser ports from running out of descriptors, and the negative impact on packet drop likelihood is so significant that it may not justify the small increase in saturation load.

> >>
> >
> > How about using rte_eth_tx_done_cleanup()?
> 
> This API is currently implemented only by the igb driver.
> This will not be very useful.
> 
> > I guess a user would not
> > want to combine this feature with sw tx batching though.

Yes, tx batching with tx-flush-timer > 0 would not go well at all with zero-copy.
We need to document that conflict in the final version.
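
(For reference, the conflicting setting would be the flush interval knob from
the Tx batching series - assuming it keeps the name it has in that series, the
problematic configuration would look like:

    $ ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=50

so zero copy should only be combined with the default interval of 0.)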

> >
> > On a related note, I would like to ask Ciara/Jan the use case when this
> > feature would likely be enabled. The RFC2544 test is very opaque but it
> > seems there is at least some increased packet loss, so would it be just
> > when the user knows it is vm-vm and large packets? (In which case you
> > could argue, packets would not get stuck in HW NIC anyway)

Given the current limitations I don't think zero-copy vhostuser is a viable option for Cloud deployments, where the bulk of the traffic is of type PVP.

The VM-VM traffic case is unproblematic, but the issue I have is that at configuration time of the vhostuser port, the operator would have to make assumptions on whether the traffic remains local on the host or goes out to the physical network. But that depends on scheduling decisions by Nova, which the operator would typically not know in advance.

All in all, I have no objections against merging zero copy as an experimental feature, but from our perspective there is more value in the Tx batching patch series.

BR, Jan
Ilya Maximets Dec. 15, 2017, 2:02 p.m. UTC | #6
Not a full review.

General thoughts:

If the following conditions are true:

1. We don't need to add the new feature to the deprecated vhostuser port.

2. We actually don't need the ability to change the ZC config if vhost-server-path
   is already configured.

   Let me explain this condition:
   To change 'dq-zero-copy' when a VM is already started, we need to stop the VM,
   change 'dq-zero-copy' and start the VM back up. That's not much simpler than
   stopping the VM, re-adding the port with a different value for 'dq-zero-copy'
   and starting the VM back up. I'm assuming here that a VM restart is a much more
   complex and unlikely operation than removing a port from OVS and adding it back.
   This will require documenting, but we already have a bunch of docs about how to
   modify this config option in the current version.

we may drop patch #1 from the series and just use something like the following
code instead of the code changes in this patch (not tested, just for reference):

---------------
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2325f0e..c4cbf7c 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -379,6 +379,9 @@ struct netdev_dpdk {
         /* True if vHost device is 'up' and has been reconfigured at least once */
         bool vhost_reconfigured;
         /* 3 pad bytes here. */
+
+        /* True if dq-zero-copy feature requested for vhost device. */
+        bool vhost_dq_zc;
     );
 
     PADDED_MEMBERS(CACHE_LINE_SIZE,
@@ -889,6 +892,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
     dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
     ovsrcu_index_init(&dev->vid, -1);
     dev->vhost_reconfigured = false;
+    dev->vhost_dq_zc = false;
     dev->attached = false;
 
     ovsrcu_init(&dev->qos_conf, NULL);
@@ -1384,6 +1388,7 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
         path = smap_get(args, "vhost-server-path");
         if (path && strcmp(path, dev->vhost_id)) {
             strcpy(dev->vhost_id, path);
+            dev->vhost_dq_zc = smap_get_bool(args, "dq-zero-copy", false);
             netdev_request_reconfigure(netdev);
         }
     }
@@ -3288,6 +3293,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
         if (dpdk_vhost_iommu_enabled()) {
             vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
         }
+
+        /* Enable Zero Copy mode if configured. */
+        if (dev->vhost_dq_zc) {
+            vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+        }
         err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
         if (err) {
             VLOG_ERR("vhost-user device setup failure for device %s\n",

---------------
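
With the above, toggling zero copy would boil down to a del-port/add-port
cycle, e.g. (bridge, port and socket path names are only examples):

    $ ovs-vsctl del-port br0 vhostclient0
    $ ovs-vsctl add-port br0 vhostclient0 -- set Interface vhostclient0 \
          type=dpdkvhostuserclient \
          options:vhost-server-path=/tmp/vhostclient0.sock \
          options:dq-zero-copy=true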

Thoughts?


Best regards, Ilya Maximets.

P.S. A few comments about the patch itself are inline.

On 08.12.2017 18:26, Ciara Loftus wrote:
> Enabled per port like so:
> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> 
> The feature is disabled by default and can only be enabled/disabled when
> a vHost port is down.
> 
> When packets from a vHost device with zero copy enabled are destined for
> a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
> set to a smaller value. 128 is recommended. This can be achieved like
> so:
> 
> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
> v5:
> * Rebase
> * Update docs with warning of potential packet loss
> * Update docs with info on NIC & Virtio descriptor values and working
> combinations
> 
>  Documentation/howto/dpdk.rst             | 33 ++++++++++++
>  Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++
>  NEWS                                     |  2 +
>  lib/netdev-dpdk.c                        | 89 +++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml                     | 11 ++++
>  5 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
> index d123819..3e1b8f8 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow the below steps:
>     Check traffic on multiple queues::
>  
>         $ cat /proc/interrupts | grep virtio
> +
> +PHY-VM-PHY (vHost Dequeue Zero Copy)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +vHost dequeue zero copy functionality can  be validated using the
> +PHY-VM-PHY configuration. To begin, follow the steps described in
> +:ref:`dpdk-phy-phy` to create and initialize the database, start
> +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
> +and flows to bridge ``br0``. Once complete, follow the below steps:
> +
> +1. Enable dequeue zero copy on the vHost devices.
> +
> +       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
> +       $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
> +
> +The following log should be observed for each device:
> +
> +       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
> +
> +2. Reduce the number of txq descriptors on the phy ports.
> +
> +       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
> +       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
> +
> +3. Proceed with the test by launching the VM and configuring guest
> +forwarding, be it via the vHost loopback method or kernel forwarding
> +method, and sending traffic. The following log should be oberved for
> +each device as it becomes active during VM boot:
> +
> +       VHOST_CONFIG: dequeue zero copy is enabled
> +
> +It is essential that step 1 is performed before booting the VM, otherwise
> +the feature will not be enabled.
> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
> index 8b1c671..8587370 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -441,3 +441,64 @@ Sample XML
>      </domain>
>  
>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
> +
> +vhost-user Dequeue Zero Copy
> +-------------------------------------
> +
> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
> +must be used to copy that packet from guest address space to host address
> +space. This memcpy can be removed by enabling dequeue zero-copy like so:
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> +
> +With this feature enabled, a reference (pointer) to the packet is passed to
> +the host, instead of a copy of the packet. Removing this memcpy can give a
> +performance improvement for some use cases, for example switching large packets
> +between different VMs. However additional packet loss may be observed.
> +
> +Note that the feature is disabled by default and must be explicitly enabled
> +by using the command above.
> +
> +The feature cannot be enabled when the device is active (ie. VM booted). If
> +you wish to enable the feature after the VM has booted, you must shutdown
> +the VM and bring it back up.
> +
> +The same logic applies for disabling the feature - it must be disabled when
> +the device is inactive, for example before VM boot. To disable the feature:
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
> +
> +The feature is available to both dpdkvhostuser and dpdkvhostuserclient port
> +types.
> +
> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
> +for that port must be reduced to a smaller number, 128 being the recommended
> +value. This can be achieved by issuing the following command:
> +
> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> +
> +More information on the n_txq_desc option can be found in the "DPDK Physical
> +Port Queue Sizes" section of the  `intro/install/dpdk.rst` guide.

I think some :ref: link should be here.

> +
> +The reason for this limitation is due to how the zero copy functionality is
> +implemented. The vHost device's 'tx used vring', a virtio structure used for
> +tracking used ie. sent descriptors, will only be updated when the NIC frees
> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
> +vring will be starved and packets will no longer be processed. One way to
> +ensure we don't encounter this scenario, is to configure n_txq_desc to a small
> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
> +often and thus free mbufs more frequently. The value of 128 is suggested, but
> +values of 64 and 256 have been tested and verified to work too, with differing
> +performance characteristics. A value of 512 can be used too, if the virtio
> +queue size in the guest is increased to 1024 (available to configure in QEMU
> +versions v2.10 and greater). This value can be set like so:
> +
> +    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> +      tx_queue_size=1024
> +
> +Further information can be found in the
> +`DPDK documentation
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
> diff --git a/NEWS b/NEWS
> index d45904e..50630f8 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -18,6 +18,8 @@ Post-v2.8.0
>     - DPDK:
>       * Add support for DPDK v17.11
>       * Add support for vHost IOMMU
> +     * Optional dequeue zero copy feature for vHost ports can be enabled per
> +       port via the boolean 'dq-zero-copy' option.
>  
>  v2.8.0 - 31 Aug 2017
>  --------------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 879019e..03cc6cb 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -361,6 +361,9 @@ struct netdev_dpdk {
>          /* True if vHost device is 'up' and has been reconfigured at least once */
>          bool vhost_reconfigured;
>          /* 3 pad bytes here. */
> +
> +        /* True if dq-zero-copy feature has successfully been enabled */
> +        bool dq_zc_enabled;

"pad bytes" should be fixed above or removed entirely, or we may just revert padding
here too because it's incorrect just like in dp_netdev_pmd_thread. The only difference
is that struct netdev_dpdk aligned to cacheline.
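
Just to illustrate the comment fix: with the new bool the tail of this
PADDED_MEMBERS section would presumably read:

        /* True if vHost device is 'up' and has been reconfigured at least once */
        bool vhost_reconfigured;

        /* True if dq-zero-copy feature has successfully been enabled */
        bool dq_zc_enabled;
        /* 2 pad bytes here. */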

>      );
>  
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -871,6 +874,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      ovsrcu_index_init(&dev->vid, -1);
>      dev->vhost_reconfigured = false;
> +    dev->dq_zc_enabled = false;
>      dev->attached = false;
>  
>      ovsrcu_init(&dev->qos_conf, NULL);
> @@ -1383,6 +1387,29 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
>      return 0;
>  }
>  
> +static void
> +dpdk_vhost_set_config_helper(struct netdev_dpdk *dev,
> +                             const struct smap *args)
> +{
> +    bool needs_reconfigure = false;
> +    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
> +
> +    if (zc_requested &&
> +            !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> +        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        needs_reconfigure = true;
> +    } else if (!zc_requested &&
> +            (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> +        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        needs_reconfigure = true;
> +    }
> +
> +    /* Only try to change ZC mode when device is down */
> +    if (needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1)) {
> +        netdev_request_reconfigure(&dev->up);
> +    }
> +}
> +
>  static int
>  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>                                      const struct smap *args,
> @@ -1399,6 +1426,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>              netdev_request_reconfigure(netdev);
>          }
>      }
> +
> +    dpdk_vhost_set_config_helper(dev, args);
> +
> +    ovs_mutex_unlock(&dev->mutex);
> +
> +    return 0;
> +}
> +
> +static int
> +netdev_dpdk_vhost_set_config(struct netdev *netdev,
> +                             const struct smap *args,
> +                             char **errp OVS_UNUSED)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    dpdk_vhost_set_config_helper(dev, args);
>      ovs_mutex_unlock(&dev->mutex);
>  
>      return 0;
> @@ -2723,6 +2767,46 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>      }
>  }
>  
> +static void
> +vhost_change_zero_copy_mode(struct netdev_dpdk *dev, bool client_mode,
> +                            bool enable)
> +{
> +    int err = rte_vhost_driver_unregister(dev->vhost_id);
> +
> +    if (err) {
> +        VLOG_ERR("Error unregistering vHost socket %s; can't change zero copy "
> +                "mode", dev->vhost_id);
> +    } else {
> +        err = dpdk_setup_vhost_device(dev, client_mode);
> +        if (err) {
> +            VLOG_ERR("Error changing zero copy mode for vHost socket %s",
> +                    dev->vhost_id);
> +        } else if (enable) {
> +            dev->dq_zc_enabled = true;
> +            VLOG_INFO("Zero copy enabled for vHost socket %s", dev->vhost_id);
> +        } else {
> +            dev->dq_zc_enabled = false;
> +            VLOG_INFO("Zero copy disabled for vHost socket %s", dev->vhost_id);
> +        }
> +    }
> +}
> +
> +static void
> +vhost_check_zero_copy_status(struct netdev_dpdk *dev)
> +{
> +    bool mode = dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT;
> +
> +    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
> +                && !dev->dq_zc_enabled) {
> +        /* ZC disabled but requested to be enabled, enable it. */
> +        vhost_change_zero_copy_mode(dev, mode, true);
> +    } else if (!(dev->vhost_driver_flags &
> +            RTE_VHOST_USER_DEQUEUE_ZERO_COPY) && dev->dq_zc_enabled) {
> +        /* ZC enabled but requested to be disabled, disable it. */
> +        vhost_change_zero_copy_mode(dev, mode, false);
> +    }
> +}
> +
>  /*
>   * Remove a virtio-net device from the specific vhost port.  Use dev->remove
>   * flag to stop any more packets from being sent or received to/from a VM and
> @@ -2768,6 +2852,7 @@ destroy_device(int vid)
>           */
>          ovsrcu_quiesce_start();
>          VLOG_INFO("vHost Device '%s' has been removed", ifname);
> +        netdev_request_reconfigure(&dev->up);
>      } else {
>          VLOG_INFO("vHost Device '%s' not found", ifname);
>      }
> @@ -3259,6 +3344,8 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>              /* Carrier status may need updating. */
>              netdev_change_seq_changed(&dev->up);
>          }
> +    } else {
> +        vhost_check_zero_copy_status(dev);
>      }
>  
>      return 0;
> @@ -3421,7 +3508,7 @@ static const struct netdev_class dpdk_vhost_class =
>          NULL,
>          netdev_dpdk_vhost_construct,
>          netdev_dpdk_vhost_destruct,
> -        NULL,
> +        netdev_dpdk_vhost_set_config,
>          NULL,
>          netdev_dpdk_vhost_send,
>          netdev_dpdk_vhost_get_carrier,
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4c317d0..e8409b4 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="options" key="dq-zero-copy"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          The value specifies whether or not to enable dequeue zero copy on
> +          the given interface.
> +          The port must be in an inactive state in order to enable or disable
> +          this feature.
> +          Only supported by dpdkvhostuserclient and dpdkvhostuser interfaces.
> +        </p>
> +      </column>
> +
>        <column name="options" key="n_rxq_desc"
>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>          <p>
>
Kevin Traynor Dec. 15, 2017, 6:38 p.m. UTC | #7
On 12/15/2017 12:32 PM, Jan Scheurich wrote:
> Hi,
> 
>>>> What if we have 2, 4, 10 HW NICs and a few different destinations for traffic from VM?
>>>> What if we have 2, 4, 10 HW NICs added to balanced OVS bonding and just a few different packet flows?
>>>> What if we have HW NICs with small number of TX queues (like VFs) and XPS is working?
>>>> We may actually get stuck with a non-working virtual interface in all above cases.
>>>>
>>>> IMHO, this feature should be marked as experimental until we have a proper solution
>>>> for the stuck mbufs in HW rings.
>>>>
>>>> Thoughts?
> 
> I definitely think this feature, if merged, should be declared experimental. As far as I can see there is no safe way to prevent vhostuser ports from running out of descriptors, and the negative impact on packet drop likelihood is so significant that it may not justify the small increase in saturation load.
> 
>>>>
>>>
>>> How about using rte_eth_tx_done_cleanup()?
>>
>> This API is currently implemented only by the igb driver.
>> This will not be very useful.
>>
>>> I guess a user would not
>>> want to combine this feature with sw tx batching though.
> 
> Yes, tx batching with tx-flush-timer > 0 would not go well at all with zero-copy.
> We need to document that conflict in the final version.
> 
>>>
>>> On a related note, I would like to ask Ciara/Jan the use case when this
>>> feature would likely be enabled. The RFC2544 test is very opaque but it
>>> seems there is at least some increased packet loss, so would it be just
>>> when the user knows it is vm-vm and large packets? (In which case you
>>> could argue, packets would not get stuck in HW NIC anyway)
> 
> Given the current limitations I don't think zero-copy vhostuser is a viable option for Cloud deployments, where the bulk of the traffic is of type PVP.
> 
> The VM-VM traffic case is unproblematic, but the issue I have is that at configuration time of the vhostuser port, the operator would have to make assumptions on whether the traffic remains local on the host or goes out to the physical network. But that depends on scheduling decisions by Nova, which the operator would typically not know in advance.
> 
> All in all, I have no objections against merging zero copy as an experimental feature, but from our perspective there is more value in the Tx batching patch series.
> 

+1

Thanks Jan.

> BR, Jan
>
Kevin Traynor Dec. 15, 2017, 6:44 p.m. UTC | #8
On 12/15/2017 09:39 AM, Ilya Maximets wrote:
>>> What if we have 2, 4, 10 HW NICs and a few different destinations for traffic from VM?
>>> What if we have 2, 4, 10 HW NICs added to balanced OVS bonding and just a few different packet flows?
>>> What if we have HW NICs with small number of TX queues (like VFs) and XPS is working?
>>> We may actually get stuck with a non-working virtual interface in all above cases.
>>>
>>> IMHO, this feature should be marked as experimental until we have a proper solution
>>> for the stuck mbufs in HW rings.
>>>
>>> Thoughts?
>>>
>> How about using rte_eth_tx_done_cleanup()?
> This API is currently implemented only by the igb driver.
> This will not be very useful.
> 

Ah ok, pity - it seems like it could be used as part of a solution for
this. Maybe it will be more widely implemented in the future.
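
For reference, if the API did become widely implemented, the mitigation could
be as simple as something along these lines (untested sketch; the rate limiter
and the port/queue arguments are placeholders for whatever netdev-dpdk has at
hand):

    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

    /* Ask the PMD to free already-transmitted mbufs on a tx queue so that
     * zero-copy vhost vrings are not starved.  Returns the number of freed
     * descriptors or a negative errno; -ENOTSUP means the PMD (anything but
     * igb today) does not implement the callback. */
    static int
    dpdk_tx_done_cleanup(dpdk_port_t port_id, uint16_t qid)
    {
        int ret = rte_eth_tx_done_cleanup(port_id, qid, 0 /* free all */);

        if (ret < 0 && ret != -ENOTSUP) {
            VLOG_WARN_RL(&rl, "Tx done cleanup failed on port %d: %s",
                         (int) port_id, rte_strerror(-ret));
        }
        return ret;
    }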
Stokes, Ian Dec. 15, 2017, 6:52 p.m. UTC | #9
> On 12/15/2017 12:32 PM, Jan Scheurich wrote:
> > Hi,
> >
> >>>> What if we have 2, 4, 10 HW NICs and a few different destinations for
> >>>> traffic from VM?
> >>>> What if we have 2, 4, 10 HW NICs added to balanced OVS bonding and
> >>>> just a few different packet flows?
> >>>> What if we have HW NICs with small number of TX queues (like VFs) and
> >>>> XPS is working?
> >>>> We may actually get stuck with a non-working virtual interface in all
> >>>> above cases.
> >>>>
> >>>> IMHO, this feature should be marked as experimental until we have a
> >>>> proper solution for the stuck mbufs in HW rings.
> >>>>
> >>>> Thoughts?
> >
> > I definitely think this feature, if merged, should be declared
> > experimental. As far as I can see there is no safe way to prevent
> > vhostuser ports from running out of descriptors, and the negative impact
> > on packet drop likelihood is so significant that it may not justify the
> > small increase in saturation load.
> >
> >>>>
> >>>
> >>> How about using rte_eth_tx_done_cleanup()?
> >>
> >> This API is currently implemented only by the igb driver.
> >> This will not be very useful.
> >>
> >>> I guess a user would not
> >>> want to combine this feature with sw tx batching though.
> >
> > Yes, tx batching with tx-flush-timer > 0 would not go well at all with
> > zero-copy.
> > We need to document that conflict in the final version.
> >
> >>>
> >>> On a related note, I would like to ask Ciara/Jan the use case when
> >>> this feature would likely be enabled. The RFC2544 test is very
> >>> opaque but it seems there is at least some increased packet loss, so
> >>> would it be just when the user knows it is vm-vm and large packets?
> >>> (In which case you could argue, packets would not get stuck in HW
> >>> NIC anyway)
> >
> > Given the current limitations I don't think zero-copy vhostuser is a
> > viable option for Cloud deployments, where the bulk of the traffic is of
> > type PVP.
> >
> > The VM-VM traffic case is unproblematic, but the issue I have is that at
> > configuration time of the vhostuser port, the operator would have to make
> > assumptions on whether the traffic remains local on the host or goes out
> > to the physical network. But that depends on scheduling decisions by Nova,
> > which the operator would typically not know in advance.
> >
> > All in all, I have no objections against merging zero copy as an
> > experimental feature, but from our perspective there is more value in the
> > Tx batching patch series.
> >
> 
> +1

Fully agree. I'm almost finished reviewing the tx batching series; it's the priority feature and I expect it to be in the next pull request.

After that I would like to see this reviewed with the required changes and merged as experimental.

Ian
> 
> Thanks Jan.
> 
> > BR, Jan
> >
Ciara Loftus Dec. 18, 2017, 12:28 p.m. UTC | #10
> 
> Not a full review.

Thanks for your feedback.

> 
> General thoughts:
> 
> If the following conditions are true:
> 
> 1. We don't need to add the new feature to the deprecated vhostuser port.

Agree.

> 
> 2. We actually don't need the ability to change the ZC config if vhost-server-path
>    is already configured.
> 
>    Let me explain this condition:
>    To change 'dq-zero-copy' when a VM is already started, we need to stop the VM,
>    change 'dq-zero-copy' and start the VM back up. That's not much simpler than
>    stopping the VM, re-adding the port with a different value for 'dq-zero-copy'
>    and starting the VM back up. I'm assuming here that a VM restart is a much more
>    complex and unlikely operation than removing a port from OVS and adding it back.
>    This will require documenting, but we already have a bunch of docs about how to
>    modify this config option in the current version.

Don't fully agree. I can't say whether your assumption is correct.
Port deletion & re-addition has repercussions, e.g. loss of statistics from an interface. Retaining those stats might be important for some.
Why not leave it up to the user to choose their preferred method of enabling the feature, i.e. reboot the VM or delete & re-add the port?
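
(Those are the counters otherwise visible via, e.g.:

    $ ovs-vsctl get Interface dpdkvhostuserclient0 statistics

and they start from zero again after the Interface row is removed and
re-added.)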

> 
> we may drop patch #1 from the series and just use something like the following
> code instead of the code changes in this patch (not tested, just for reference):
> 
> ---------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 2325f0e..c4cbf7c 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -379,6 +379,9 @@ struct netdev_dpdk {
>          /* True if vHost device is 'up' and has been reconfigured at least once */
>          bool vhost_reconfigured;
>          /* 3 pad bytes here. */
> +
> +        /* True if dq-zero-copy feature requested for vhost device. */
> +        bool vhost_dq_zc;
>      );
> 
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      ovsrcu_index_init(&dev->vid, -1);
>      dev->vhost_reconfigured = false;
> +    dev->vhost_dq_zc = false;
>      dev->attached = false;
> 
>      ovsrcu_init(&dev->qos_conf, NULL);
> @@ -1384,6 +1388,7 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>          path = smap_get(args, "vhost-server-path");
>          if (path && strcmp(path, dev->vhost_id)) {
>              strcpy(dev->vhost_id, path);
> +            dev->vhost_dq_zc = smap_get_bool(args, "dq-zero-copy", false);
>              netdev_request_reconfigure(netdev);
>          }
>      }
> @@ -3288,6 +3293,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>          if (dpdk_vhost_iommu_enabled()) {
>              vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
>          }
> +
> +        /* Enable Zero Copy mode if configured. */
> +        if (dev->vhost_dq_zc) {
> +            vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        }
>          err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>          if (err) {
>              VLOG_ERR("vhost-user device setup failure for device %s\n",
> 
> ---------------
> 
> Thoughts?

The above is of course a lot shorter, which is nice, but I don't think we should sacrifice the ability to enable the feature post-boot for this.

Thanks,
Ciara

Ilya Maximets Dec. 18, 2017, 12:37 p.m. UTC | #11
On 18.12.2017 15:28, Loftus, Ciara wrote:
>>
>> Not a full review.
> 
> Thanks for your feedback.
> 
>>
>> General thoughts:
>>
>> If the following conditions are true:
>>
>> 1. We don't need to add the new feature to the deprecated vhostuser port.
> 
> Agree.
> 
>>
>> 2. We actually don't need the ability to change the ZC config if vhost-server-path
>>    is already configured.
>>
>>    Let me explain this condition:
>>    To change 'dq-zero-copy' when a VM is already started, we need to stop the VM,
>>    change 'dq-zero-copy' and start the VM back up. That's not much simpler than
>>    stopping the VM, re-adding the port with a different value for 'dq-zero-copy'
>>    and starting the VM back up. I'm assuming here that a VM restart is a much more
>>    complex and unlikely operation than removing a port from OVS and adding it back.
>>    This will require documenting, but we already have a bunch of docs about how to
>>    modify this config option in the current version.
> 
> Don't fully agree. I can't say whether your assumption is correct.
> Port deletion & re-addition has repercussions, e.g. loss of statistics from an interface. Retaining those stats might be important for some.
> Why not leave it up to the user to choose their preferred method of enabling the feature, i.e. reboot the VM or delete & re-add the port?
> 
>>
>> we may drop patch #1 from the series and just use something like the following
>> code instead of the code changes in this patch (not tested, just for reference):
>>
>> ---------------
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 2325f0e..c4cbf7c 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -379,6 +379,9 @@ struct netdev_dpdk {
>>          /* True if vHost device is 'up' and has been reconfigured at least once */
>>          bool vhost_reconfigured;
>>          /* 3 pad bytes here. */
>> +
>> +        /* True if dq-zero-copy feature requested for vhost device. */
>> +        bool vhost_dq_zc;
>>      );
>>
>>      PADDED_MEMBERS(CACHE_LINE_SIZE,
>> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>      ovsrcu_index_init(&dev->vid, -1);
>>      dev->vhost_reconfigured = false;
>> +    dev->vhost_dq_zc = false;
>>      dev->attached = false;
>>
>>      ovsrcu_init(&dev->qos_conf, NULL);
>> @@ -1384,6 +1388,7 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>          path = smap_get(args, "vhost-server-path");
>>          if (path && strcmp(path, dev->vhost_id)) {
>>              strcpy(dev->vhost_id, path);
>> +            dev->vhost_dq_zc = smap_get_bool(args, "dq-zero-copy", false);
>>              netdev_request_reconfigure(netdev);
>>          }
>>      }
>> @@ -3288,6 +3293,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>>          if (dpdk_vhost_iommu_enabled()) {
>>              vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
>>          }
>> +
>> +        /* Enable Zero Copy mode if configured. */
>> +        if (dev->vhost_dq_zc) {
>> +            vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>> +        }
>>          err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>>          if (err) {
>>              VLOG_ERR("vhost-user device setup failure for device %s\n",
>>
>> ---------------
>>
>> Thoughts?
> 
> The above is of course a lot shorter, which is nice, but I don't think we should sacrifice the ability to enable the feature post-boot for this.

But you wrote in the documentation below:

+The feature cannot be enabled when the device is active (ie. VM booted). If
+you wish to enable the feature after the VM has booted, you must shutdown
+the VM and bring it back up.

What do you mean by "sacrifice the ability to enable the feature post-boot"?
Am I missing something?

> 
> Thanks,
> Ciara
> 
>>
>>
>> Best regards, Ilya Maximets.
>>
>> P.S. Few comments about patch itself are inline.
>>
>> On 08.12.2017 18:26, Ciara Loftus wrote:
>>> Enabled per port like so:
>>> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>>>
>>> The feature is disabled by default and can only be enabled/disabled when
>>> a vHost port is down.
>>>
>>> When packets from a vHost device with zero copy enabled are destined for
>>> a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
>>> set to a smaller value. 128 is recommended. This can be achieved like
>>> so:
>>>
>>> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>>
>>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>>> ---
>>> v5:
>>> * Rebase
>>> * Update docs with warning of potential packet loss
>>> * Update docs with info on NIC & Virtio descriptor values and working
>>> combinations
>>>
>>>  Documentation/howto/dpdk.rst             | 33 ++++++++++++
>>>  Documentation/topics/dpdk/vhost-user.rst | 61
>> ++++++++++++++++++++++
>>>  NEWS                                     |  2 +
>>>  lib/netdev-dpdk.c                        | 89
>> +++++++++++++++++++++++++++++++-
>>>  vswitchd/vswitch.xml                     | 11 ++++
>>>  5 files changed, 195 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/howto/dpdk.rst
>> b/Documentation/howto/dpdk.rst
>>> index d123819..3e1b8f8 100644
>>> --- a/Documentation/howto/dpdk.rst
>>> +++ b/Documentation/howto/dpdk.rst
>>> @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow
>> the below steps:
>>>     Check traffic on multiple queues::
>>>
>>>         $ cat /proc/interrupts | grep virtio
>>> +
>>> +PHY-VM-PHY (vHost Dequeue Zero Copy)
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +vHost dequeue zero copy functionality can  be validated using the
>>> +PHY-VM-PHY configuration. To begin, follow the steps described in
>>> +:ref:`dpdk-phy-phy` to create and initialize the database, start
>>> +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
>>> +and flows to bridge ``br0``. Once complete, follow the below steps:
>>> +
>>> +1. Enable dequeue zero copy on the vHost devices.
>>> +
>>> +       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
>>> +       $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
>>> +
>>> +The following log should be observed for each device:
>>> +
>>> +       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
>>> +
>>> +2. Reduce the number of txq descriptors on the phy ports.
>>> +
>>> +       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
>>> +       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
>>> +
>>> +3. Proceed with the test by launching the VM and configuring guest
>>> +forwarding, be it via the vHost loopback method or kernel forwarding
>>> +method, and sending traffic. The following log should be oberved for
>>> +each device as it becomes active during VM boot:
>>> +
>>> +       VHOST_CONFIG: dequeue zero copy is enabled
>>> +
>>> +It is essential that step 1 is performed before booting the VM, otherwise
>>> +the feature will not be enabled.
>>> diff --git a/Documentation/topics/dpdk/vhost-user.rst
>> b/Documentation/topics/dpdk/vhost-user.rst
>>> index 8b1c671..8587370 100644
>>> --- a/Documentation/topics/dpdk/vhost-user.rst
>>> +++ b/Documentation/topics/dpdk/vhost-user.rst
>>> @@ -441,3 +441,64 @@ Sample XML
>>>      </domain>
>>>
>>>  .. _QEMU documentation: http://git.qemu-
>> project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
>> user.txt;h=7890d7169;hb=HEAD
>>> +
>>> +vhost-user Dequeue Zero Copy
>>> +-------------------------------------
>>> +
>>> +Normally when dequeuing a packet from a vHost User device, a memcpy
>> operation
>>> +must be used to copy that packet from guest address space to host
>> address
>>> +space. This memcpy can be removed by enabling dequeue zero-copy like
>> so:
>>> +
>>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
>> copy=true
>>> +
>>> +With this feature enabled, a reference (pointer) to the packet is passed to
>>> +the host, instead of a copy of the packet. Removing this memcpy can give
>> a
>>> +performance improvement for some use cases, for example switching
>> large packets
>>> +between different VMs. However additional packet loss may be
>> observed.
>>> +
>>> +Note that the feature is disabled by default and must be explicitly enabled
>>> +by using the command above.
>>> +
>>> +The feature cannot be enabled when the device is active (ie. VM booted).
>> If
>>> +you wish to enable the feature after the VM has booted, you must
>> shutdown
>>> +the VM and bring it back up.
>>> +
>>> +The same logic applies for disabling the feature - it must be disabled when
>>> +the device is inactive, for example before VM boot. To disable the
>> feature:
>>> +
>>> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-
>> copy=false
>>> +
>>> +The feature is available to both dpdkvhostuser and dpdkvhostuserclient
>> port
>>> +types.
>>> +
>>> +A limitation exists whereby if packets from a vHost port with dq-zero-
>> copy=true
>>> +are destined for a 'dpdk' type port, the number of tx descriptors
>> (n_txq_desc)
>>> +for that port must be reduced to a smaller number, 128 being the
>> recommended
>>> +value. This can be achieved by issuing the following command:
>>> +
>>> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>> +
>>> +More information on the n_txq_desc option can be found in the "DPDK
>> Physical
>>> +Port Queue Sizes" section of the  `intro/install/dpdk.rst` guide.
>>
>> I think some :ref: link should be here.
>>
>>> [...]
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> @@ -361,6 +361,9 @@ struct netdev_dpdk {
>>>          /* True if vHost device is 'up' and has been reconfigured at least once */
>>>          bool vhost_reconfigured;
>>>          /* 3 pad bytes here. */
>>> +
>>> +        /* True if dq-zero-copy feature has successfully been enabled */
>>> +        bool dq_zc_enabled;
>>
>> "pad bytes" should be fixed above or removed entirely, or we may just
>> revert padding
>> here too because it's incorrect just like in dp_netdev_pmd_thread. The only
>> difference
>> is that struct netdev_dpdk aligned to cacheline.
>>
Ciara Loftus Dec. 18, 2017, 1:20 p.m. UTC | #12
> 
> On 18.12.2017 15:28, Loftus, Ciara wrote:
> >>
> >> Not a full review.
> >
> > Thanks for your feedback.
> >
> >>
> >> General thoughts:
> >>
> >> If following conditions are true:
> >>
> >> 1. We don't need to add new feature to deprecated vhostuser port.
> >
> > Agree.
> >
> >>
> >> 2. We actually don't need to have ability to change ZC config if
> >>    vhost-server-path already configured.
> >>
> >>    Let me explain this condition:
> >>    To change 'dq-zero-copy' if VM already started we need to stop VM,
> >>    change the 'dq-zero-copy' and start the VM back. It's not much simpler
> >>    than stop VM, re-add port with different value for 'dq-zero-copy' and
> >>    start VM back. I'm assuming here that VM restart is a much more complex
> >>    and unlikely operation than port removing from OVS and adding back.
> >>    This will require documenting, but we already have a bunch of docs
> >>    about how to modify this config option in current version.
> >
> > Don't fully agree. I can't say whether your assumption is correct.
> > Port-deletion & re-addition has repercussions eg. loss of statistics
> > from an interface. Retaining those stats might be important for some.
> > Why not leave it up to the user to choose their preferred method of
> > enabling the feature ie. reboot the VM or delete & add the port.
> >
> >>
> >> we may drop patch #1 from the series and use just something like the
> >> following code instead of code changes in this patch (not tested, just
> >> for a reference):
> >>
> >> ---------------
> >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> index 2325f0e..c4cbf7c 100644
> >> --- a/lib/netdev-dpdk.c
> >> +++ b/lib/netdev-dpdk.c
> >> @@ -379,6 +379,9 @@ struct netdev_dpdk {
> >>          /* True if vHost device is 'up' and has been reconfigured at least once */
> >>          bool vhost_reconfigured;
> >>          /* 3 pad bytes here. */
> >> +
> >> +        /* True if dq-zero-copy feature requested for vhost device. */
> >> +        bool vhost_dq_zc;
> >>      );
> >>
> >>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> >> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
> >>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> >>      ovsrcu_index_init(&dev->vid, -1);
> >>      dev->vhost_reconfigured = false;
> >> +    dev->vhost_dq_zc = false;
> >>      dev->attached = false;
> >>
> >>      ovsrcu_init(&dev->qos_conf, NULL);
> >> @@ -1384,6 +1388,7 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
> >>          path = smap_get(args, "vhost-server-path");
> >>          if (path && strcmp(path, dev->vhost_id)) {
> >>              strcpy(dev->vhost_id, path);
> >> +            dev->vhost_dq_zc = smap_get_bool(args, "dq-zero-copy", false);
> >>              netdev_request_reconfigure(netdev);
> >>          }
> >>      }
> >> @@ -3288,6 +3293,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> >>          if (dpdk_vhost_iommu_enabled()) {
> >>              vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
> >>          }
> >> +
> >> +        /* Enable Zero Copy mode if configured. */
> >> +        if (dev->vhost_dq_zc) {
> >> +            vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> >> +        }
> >>          err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
> >>          if (err) {
> >>              VLOG_ERR("vhost-user device setup failure for device %s\n",
> >>
> >> ---------------
> >>
> >> Thoughts?
> >
> > The above is a lot shorter of course which is nice, but I don't think we
> > should sacrifice the ability to enable the feature post-boot for this.
> 
> But you wrote in documentation below:
> 
> +The feature cannot be enabled when the device is active (i.e. VM booted). If
> +you wish to enable the feature after the VM has booted, you must shut down
> +the VM and bring it back up.
> 
> What do you mean by "sacrifice the ability to enable the feature post-boot"?
> Have I missed something?

My understanding is that your suggestion means you sacrifice the ability to enable the feature after the *first* VM boot. Am I correct?
My suggestion (and how the patch is implemented now) allows you to enable the feature after the first VM boot, by means of a reboot.

Thanks,
Ciara

Ilya Maximets Dec. 18, 2017, 2:56 p.m. UTC | #13
On 18.12.2017 16:20, Loftus, Ciara wrote:
>>> [...]
>>>
>>> The above is a lot shorter of course which is nice, but I don't think we
>>> should sacrifice the ability to enable the feature post-boot for this.
>>
>> But you wrote in documentation below:
>>
>> +The feature cannot be enabled when the device is active (i.e. VM booted). If
>> +you wish to enable the feature after the VM has booted, you must shut down
>> +the VM and bring it back up.
>>
>> What do you mean by "sacrifice the ability to enable the feature post-boot"?
>> Have I missed something?
> 
> My understanding is that your suggestion means you sacrifice the ability to enable the feature after the *first* VM boot. Am I correct?
> My suggestion (and how the patch is implemented now) allows you to enable the feature after the first VM boot, by means of a reboot.

Your approach:
* ovs-vsctl \
         add-port br0 vhost0 \
      -- set interface vhost0 type=dpdkvhostuserclient options:vhost-server-path=./vhost0.sock
* ./boot_vm.sh
* ovs-vsctl set interface vhost0 options:dq-zero-copy=true
* ./stop_vm.sh (stop virtio-net driver)
* ./boot_vm.sh (start virtio-net driver)

My approach:
* ovs-vsctl \
         add-port br0 vhost0 \
      -- set interface vhost0 type=dpdkvhostuserclient options:vhost-server-path=./vhost0.sock
* ./boot_vm.sh
* ovs-vsctl del-port vhost0
* ovs-vsctl \
         add-port br0 vhost0 \
      -- set interface vhost0 type=dpdkvhostuserclient options:vhost-server-path=./vhost0.sock \
                                                       options:dq-zero-copy=true
* ./stop_vm.sh (stop virtio-net driver)
* ./boot_vm.sh (start virtio-net driver)

The only difference is, as you already mentioned, that statistics for that interface
will be dropped. But I'm not sure if someone really needs to change that parameter
while the VM is running. It sounds strange that someone would want to enable/disable
this feature a few times in a VM's lifetime. It looks more reasonable that this
feature will be enabled or disabled for the whole life of a single VM, especially
because it costs a VM/driver restart.
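
(For reference, the statistics in question are the per-interface counters that can be
read back with something like `ovs-vsctl get Interface vhost0 statistics`; after a
del-port/add-port cycle they start again from zero.)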

The most common scenario, IMHO, is passing all the required parameters at once in the
'add-port' command, as shown below. In this case there will be no difference between
our approaches from the user's point of view, but a big difference from the
developer's.
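
I.e. something along the lines of:

* ovs-vsctl \
         add-port br0 vhost0 \
      -- set interface vhost0 type=dpdkvhostuserclient options:vhost-server-path=./vhost0.sock \
                                                       options:dq-zero-copy=true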

------

Besides the above discussion:
I guess that with vhostuserclient ports we don't really need to reboot the VM. We only
need to reload the virtio driver in the guest. If the guest uses DPDK, this means that
we only need to restart the DPDK application inside the guest to enable/disable zero
copy. Am I right? If so, I think the documentation could describe driver reloading
instead of VM rebooting, as sketched below.
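
For a kernel guest that could be something like (untested):

* modprobe -r virtio_net && modprobe virtio_net

inside the guest; for a DPDK guest, simply stopping and starting the application
(e.g. testpmd) again should have the same effect, since that makes the device
re-negotiate with the vhost backend.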


------

One more general comment on the documentation:
it's not full rST now, it's half rST / half plain text. Please add '::' for command
blocks and ':ref:' for referencing other sections (Physical Port Queue Sizes), e.g.
as in the sketch below.
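
For example (just a sketch -- the exact :ref: target depends on the label that
`intro/install/dpdk.rst` actually defines; `dpdk-queues-sizes` below is
hypothetical):

    This can be achieved by issuing the following command::

        $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128

    More information on the n_txq_desc option can be found in
    :ref:`dpdk-queues-sizes`.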

Best regards, Ilya Maximets.



Patch

diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
index d123819..3e1b8f8 100644
--- a/Documentation/howto/dpdk.rst
+++ b/Documentation/howto/dpdk.rst
@@ -709,3 +709,36 @@  devices to bridge ``br0``. Once complete, follow the below steps:
    Check traffic on multiple queues::
 
        $ cat /proc/interrupts | grep virtio
+
+PHY-VM-PHY (vHost Dequeue Zero Copy)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+vHost dequeue zero copy functionality can be validated using the
+PHY-VM-PHY configuration. To begin, follow the steps described in
+:ref:`dpdk-phy-phy` to create and initialize the database, start
+ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
+and flows to bridge ``br0``. Once complete, follow the below steps:
+
+1. Enable dequeue zero copy on the vHost devices::
+
+       $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
+       $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
+
+The following log should be observed for each device::
+
+       netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
+
+2. Reduce the number of txq descriptors on the phy ports::
+
+       $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
+       $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
+
+3. Proceed with the test by launching the VM and configuring guest
+forwarding, be it via the vHost loopback method or kernel forwarding
+method, and sending traffic. The following log should be observed for
+each device as it becomes active during VM boot::
+
+       VHOST_CONFIG: dequeue zero copy is enabled
+
+It is essential that step 1 is performed before booting the VM; otherwise
+the feature will not be enabled.
diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
index 8b1c671..8587370 100644
--- a/Documentation/topics/dpdk/vhost-user.rst
+++ b/Documentation/topics/dpdk/vhost-user.rst
@@ -441,3 +441,64 @@  Sample XML
     </domain>
 
 .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
+
+vhost-user Dequeue Zero Copy
+----------------------------
+
+Normally when dequeuing a packet from a vHost User device, a memcpy operation
+must be used to copy that packet from guest address space to host address
+space. This memcpy can be removed by enabling dequeue zero-copy like so::
+
+    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
+
+With this feature enabled, a reference (pointer) to the packet is passed to
+the host, instead of a copy of the packet. Removing this memcpy can give a
+performance improvement for some use cases, for example switching large packets
+between different VMs. However, additional packet loss may be observed.
+
+Note that the feature is disabled by default and must be explicitly enabled
+by using the command above.
+
+The feature cannot be enabled when the device is active (i.e. VM booted). If
+you wish to enable the feature after the VM has booted, you must shut down
+the VM and bring it back up.
+
+The same logic applies for disabling the feature - it must be disabled when
+the device is inactive, for example before VM boot. To disable the feature::
+
+    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
+
+The feature is available to both dpdkvhostuser and dpdkvhostuserclient port
+types.
+
+A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
+are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
+for that port must be reduced to a smaller number, 128 being the recommended
+value. This can be achieved by issuing the following command::
+
+    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
+
+More information on the n_txq_desc option can be found in the "DPDK Physical
+Port Queue Sizes" section of the ``intro/install/dpdk.rst`` guide.
+
+The reason for this limitation is how the zero copy functionality is
+implemented. The vHost device's 'tx used vring', a virtio structure used for
+tracking used (i.e. sent) descriptors, will only be updated when the NIC frees
+the corresponding mbuf. If we don't free the mbufs frequently enough, that
+vring will be starved and packets will no longer be processed. One way to
+ensure we don't encounter this scenario is to configure n_txq_desc to a small
+enough number such that the 'mbuf free threshold' for the NIC will be hit more
+often and thus free mbufs more frequently. The value of 128 is suggested, but
+values of 64 and 256 have been tested and verified to work too, with differing
+performance characteristics. A value of 512 can be used too, if the virtio
+queue size in the guest is increased to 1024 (available to configure in QEMU
+versions v2.10 and greater). This value can be set like so::
+
+    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
+      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
+      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
+      tx_queue_size=1024
+
+Further information can be found in the
+`DPDK documentation
+<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__.
diff --git a/NEWS b/NEWS
index d45904e..50630f8 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,8 @@  Post-v2.8.0
    - DPDK:
      * Add support for DPDK v17.11
      * Add support for vHost IOMMU
+     * Optional dequeue zero copy feature for vHost ports can be enabled per
+       port via the boolean 'dq-zero-copy' option.
 
 v2.8.0 - 31 Aug 2017
 --------------------
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 879019e..03cc6cb 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -361,6 +361,9 @@  struct netdev_dpdk {
         /* True if vHost device is 'up' and has been reconfigured at least once */
         bool vhost_reconfigured;
         /* 3 pad bytes here. */
+
+        /* True if dq-zero-copy feature has successfully been enabled. */
+        bool dq_zc_enabled;
     );
 
     PADDED_MEMBERS(CACHE_LINE_SIZE,
@@ -871,6 +874,7 @@  common_construct(struct netdev *netdev, dpdk_port_t port_no,
     dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
     ovsrcu_index_init(&dev->vid, -1);
     dev->vhost_reconfigured = false;
+    dev->dq_zc_enabled = false;
     dev->attached = false;
 
     ovsrcu_init(&dev->qos_conf, NULL);
@@ -1383,6 +1387,29 @@  netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
     return 0;
 }
 
+static void
+dpdk_vhost_set_config_helper(struct netdev_dpdk *dev,
+                             const struct smap *args)
+{
+    bool needs_reconfigure = false;
+    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
+
+    if (zc_requested &&
+            !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
+        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+        needs_reconfigure = true;
+    } else if (!zc_requested &&
+            (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
+        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+        needs_reconfigure = true;
+    }
+
+    /* Only try to change ZC mode when device is down. */
+    if (needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1)) {
+        netdev_request_reconfigure(&dev->up);
+    }
+}
+
 static int
 netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
                                     const struct smap *args,
@@ -1399,6 +1426,23 @@  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
             netdev_request_reconfigure(netdev);
         }
     }
+
+    dpdk_vhost_set_config_helper(dev, args);
+
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
+netdev_dpdk_vhost_set_config(struct netdev *netdev,
+                             const struct smap *args,
+                             char **errp OVS_UNUSED)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    dpdk_vhost_set_config_helper(dev, args);
     ovs_mutex_unlock(&dev->mutex);
 
     return 0;
@@ -2723,6 +2767,46 @@  netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
     }
 }
 
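+/* The dequeue zero copy flag is passed to the vHost library when the socket
+ * is registered, so changing the mode requires unregistering the socket and
+ * setting the device up again with the updated flags. */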
+static void
+vhost_change_zero_copy_mode(struct netdev_dpdk *dev, bool client_mode,
+                            bool enable)
+{
+    int err = rte_vhost_driver_unregister(dev->vhost_id);
+
+    if (err) {
+        VLOG_ERR("Error unregistering vHost socket %s; can't change zero copy "
+                "mode", dev->vhost_id);
+    } else {
+        err = dpdk_setup_vhost_device(dev, client_mode);
+        if (err) {
+            VLOG_ERR("Error changing zero copy mode for vHost socket %s",
+                    dev->vhost_id);
+        } else if (enable) {
+            dev->dq_zc_enabled = true;
+            VLOG_INFO("Zero copy enabled for vHost socket %s", dev->vhost_id);
+        } else {
+            dev->dq_zc_enabled = false;
+            VLOG_INFO("Zero copy disabled for vHost socket %s", dev->vhost_id);
+        }
+    }
+}
+
+static void
+vhost_check_zero_copy_status(struct netdev_dpdk *dev)
+{
+    bool client_mode = dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT;
+
+    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
+        && !dev->dq_zc_enabled) {
+        /* ZC is disabled but has been requested; enable it. */
+        vhost_change_zero_copy_mode(dev, client_mode, true);
+    } else if (!(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
+               && dev->dq_zc_enabled) {
+        /* ZC is enabled but disabling has been requested; disable it. */
+        vhost_change_zero_copy_mode(dev, client_mode, false);
+    }
+}
+
 /*
  * Remove a virtio-net device from the specific vhost port.  Use dev->remove
  * flag to stop any more packets from being sent or received to/from a VM and
@@ -2768,6 +2852,7 @@  destroy_device(int vid)
          */
         ovsrcu_quiesce_start();
         VLOG_INFO("vHost Device '%s' has been removed", ifname);
+        netdev_request_reconfigure(&dev->up);
     } else {
         VLOG_INFO("vHost Device '%s' not found", ifname);
     }
@@ -3259,6 +3344,8 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
             /* Carrier status may need updating. */
             netdev_change_seq_changed(&dev->up);
         }
+    } else {
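+        /* Device is down, so a pending zero copy mode change can be
+         * applied now. */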
+        vhost_check_zero_copy_status(dev);
     }
 
     return 0;
@@ -3421,7 +3508,7 @@  static const struct netdev_class dpdk_vhost_class =
         NULL,
         netdev_dpdk_vhost_construct,
         netdev_dpdk_vhost_destruct,
-        NULL,
+        netdev_dpdk_vhost_set_config,
         NULL,
         netdev_dpdk_vhost_send,
         netdev_dpdk_vhost_get_carrier,
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 4c317d0..e8409b4 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -2669,6 +2669,17 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="options" key="dq-zero-copy"
+              type='{"type": "boolean"}'>
+        <p>
+          The value specifies whether dequeue zero copy is enabled on the
+          given interface. The feature is disabled by default, and the port
+          must be down in order to enable or disable it.
+          Only supported by dpdkvhostuserclient and dpdkvhostuser interfaces.
+        </p>
+      </column>
+
       <column name="options" key="n_rxq_desc"
               type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
         <p>