[ovs-dev,RFC,v3,8/8] netdev-dpdk: support multi-segment jumbo frames

Message ID 1511288957-68599-9-git-send-email-mark.b.kavanagh@intel.com
State Superseded
Delegated to: Ian Stokes
Series
  • netdev-dpdk: support multi-segment mbufs

Commit Message

Kavanagh, Mark B Nov. 21, 2017, 6:29 p.m.
Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.
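
To illustrate the chaining idea, here is a minimal sketch in plain C. It
does not use DPDK: 'struct seg', 'segs_needed()' and the 2048-byte segment
size are assumptions made for the example, not the rte_mbuf API or the
values OvS actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one mbuf segment; NOT DPDK's struct rte_mbuf. */
struct seg {
    struct seg *next;     /* Next segment in the chain, or NULL. */
    uint32_t data_len;    /* Bytes of frame data held by this segment. */
};

/* Number of segments of 'seg_room' bytes needed to hold 'frame_len' bytes.
 * An empty frame still occupies one segment. */
static uint32_t
segs_needed(uint32_t frame_len, uint32_t seg_room)
{
    assert(seg_room > 0);
    if (frame_len == 0) {
        return 1;
    }
    return frame_len / seg_room + (frame_len % seg_room != 0);
}

/* Frees every segment in the chain starting at 'head'. */
static void
chain_free(struct seg *head)
{
    while (head) {
        struct seg *next = head->next;
        free(head);
        head = next;
    }
}

/* Builds a chain describing how 'frame_len' bytes are spread over segments
 * of at most 'seg_room' bytes each. Returns NULL on allocation failure. */
static struct seg *
chain_alloc(uint32_t frame_len, uint32_t seg_room)
{
    uint32_t n = segs_needed(frame_len, seg_room);
    uint32_t remaining = frame_len;
    struct seg *head = NULL;
    struct seg **tail = &head;

    for (uint32_t i = 0; i < n; i++) {
        struct seg *s = calloc(1, sizeof *s);
        if (!s) {
            chain_free(head);
            return NULL;
        }
        s->data_len = remaining < seg_room ? remaining : seg_room;
        remaining -= s->data_len;
        *tail = s;
        tail = &s->next;
    }
    return head;
}

/* Counts the segments in a chain. */
static uint32_t
chain_count(const struct seg *head)
{
    uint32_t n = 0;
    for (; head; head = head->next) {
        n++;
    }
    return n;
}

/* Sums the data held by all segments, i.e. the total frame length. */
static uint32_t
chain_bytes(const struct seg *head)
{
    uint32_t total = 0;
    for (; head; head = head->next) {
        total += head->data_len;
    }
    return total;
}
```

Under these assumptions, a 9018-byte frame (a 9000-byte MTU plus 18 bytes
of Ethernet header and CRC) occupies five 2048-byte segments, whereas a
standard 1518-byte frame fits in one.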

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide which approach to adopt at init time. A new OVSDB
field, 'dpdk-multi-seg-mbufs', facilitates this. It is a global boolean
value that determines how jumbo frames are represented across all DPDK
ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs'
defaults to false, i.e. single-segment mbufs remain the default and
multi-segment mbufs must be explicitly enabled.
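
The "defaults to false unless explicitly set" behaviour can be sketched in
isolation as follows. This is a standalone illustration: 'struct kv',
'kv_get()' and 'kv_get_bool()' are hypothetical helpers written for this
example and are not the OvS smap API that the patch itself calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* Hypothetical key/value pair standing in for an other_config entry. */
struct kv {
    const char *key;
    const char *value;
};

/* Returns the value stored under 'key', or NULL if the key is absent. */
static const char *
kv_get(const struct kv *cfg, size_t n, const char *key)
{
    assert(cfg != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(cfg[i].key, key)) {
            return cfg[i].value;
        }
    }
    return NULL;
}

/* Interprets the value under 'key' as a boolean. Falls back to 'def' when
 * the key is absent or its value is neither "true" nor "false". */
static bool
kv_get_bool(const struct kv *cfg, size_t n, const char *key, bool def)
{
    const char *value = kv_get(cfg, n, key);

    if (!value) {
        return def;
    }
    if (!strcasecmp(value, "true")) {
        return true;
    }
    if (!strcasecmp(value, "false")) {
        return false;
    }
    return def;
}
```

With a default of false, a configuration that never mentions
'dpdk-multi-seg-mbufs' keeps single-segment mbufs; only an explicit "true"
switches modes.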

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
---
 NEWS                 |  1 +
 lib/dpdk.c           |  7 +++++++
 lib/netdev-dpdk.c    | 43 ++++++++++++++++++++++++++++++++++++++++---
 lib/netdev-dpdk.h    |  1 +
 vswitchd/vswitch.xml | 20 ++++++++++++++++++++
 5 files changed, 69 insertions(+), 3 deletions(-)

Comments

O Mahony, Billy Nov. 23, 2017, 11:23 a.m. | #1
Hi Mark,

Just one comment below.

/Billy

> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> bounces@openvswitch.org] On Behalf Of Mark Kavanagh
> Sent: Tuesday, November 21, 2017 6:29 PM
> To: dev@openvswitch.org; qiudayu@chinac.com
> Subject: [ovs-dev] [RFC PATCH v3 8/8] netdev-dpdk: support multi-segment
> jumbo frames
> 
> Currently, jumbo frame support for OvS-DPDK is implemented by increasing the
> size of mbufs within a mempool, such that each mbuf within the pool is large
> enough to contain an entire jumbo frame of a user-defined size. Typically, for
> each user-defined MTU, 'requested_mtu', a new mempool is created, containing
> mbufs of size ~requested_mtu.
> 
> With the multi-segment approach, a port uses a single mempool, (containing
> standard/default-sized mbufs of ~2k bytes), irrespective of the user-requested
> MTU value. To accommodate jumbo frames, mbufs are chained together, where
> each mbuf in the chain stores a portion of the jumbo frame. Each mbuf in the
> chain is termed a segment, hence the name.
> 
> == Enabling multi-segment mbufs ==
> Multi-segment and single-segment mbufs are mutually exclusive, and the user
> must decide on which approach to adopt on init. The introduction of a new
> OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This is a global boolean
> value, which determines how jumbo frames are represented across all DPDK
> ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' defaults
> to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment
> mbufs remain the default.
> 
[[BO'M]] Would it be more useful if multi-segment was enabled by default? Does enabling multi-segment mbufs result in much of a performance decrease when not using jumbo frames, either because jumbo frames are not coming in on the ingress port or because the MTU is set not to accept jumbo frames?

Obviously not a blocker to this patch-set. Maybe something to be looked at in the future. 

Kavanagh, Mark B Nov. 24, 2017, 1:02 p.m. | #2
>From: O Mahony, Billy
>Sent: Thursday, November 23, 2017 11:23 AM
>To: Kavanagh, Mark B <mark.b.kavanagh@intel.com>; dev@openvswitch.org;
>qiudayu@chinac.com
>Subject: RE: [ovs-dev] [RFC PATCH v3 8/8] netdev-dpdk: support multi-segment
>jumbo frames
>
>Hi Mark,
>
>Just one comment below.
>
>/Billy
>
>[[BO'M]] Would it be more useful if multi-segment was enabled by default?
>Does enabling multi-segment mbufs result in much of a performance decrease
>when not using jumbo frames, either because jumbo frames are not coming in on
>the ingress port or because the MTU is set not to accept jumbo frames?

Hey Billy,

I think that single-segment should remain the default.

Enabling multi-segment implicitly means that non-vectorized DPDK driver Rx and Tx functions must be used, which are, by nature, not as performant as their vectorized counterparts.

I don't have comparative figures to hand, but I'll note this in the cover letter of any subsequent versions of this patchset.

Thanks,
Mark
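
For readers following along, the Tx-side change amounts to clearing a "no
multi-segment" bit in the driver's default Tx queue flags. A rough
standalone sketch is below; the EX_* constants and both helper functions
are illustrative assumptions, not the DPDK definitions (the patch itself
uses ETH_TXQ_FLAGS_NOMULTSEGS from DPDK and passes NULL to keep the driver
defaults when multi-segment mbufs are disabled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real ETH_TXQ_FLAGS_* constants are
 * defined by DPDK and may differ. */
#define EX_TXQ_FLAG_NOMULTSEGS 0x0001u
#define EX_TXQ_FLAG_NOOFFLOADS 0x0f00u

/* Returns true if 'flags' permits transmitting chained (multi-segment)
 * buffers. */
static bool
txq_allows_multi_seg(uint32_t flags)
{
    return !(flags & EX_TXQ_FLAG_NOMULTSEGS);
}

/* Derives the Tx queue flags to use from a driver's defaults. When
 * 'multi_seg' is true, the "no multi-segment" bit is cleared and every
 * other default bit is left untouched; otherwise the defaults are
 * returned unchanged. */
static uint32_t
txq_flags_for_mode(uint32_t default_flags, bool multi_seg)
{
    uint32_t flags = default_flags;

    if (multi_seg) {
        flags &= ~EX_TXQ_FLAG_NOMULTSEGS;
    }
    assert(!multi_seg || txq_allows_multi_seg(flags));
    return flags;
}
```

Clearing that bit is what steers a driver away from its simple or
vectorized transmit path, which is where the performance cost mentioned
above comes from.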


Patch

diff --git a/NEWS b/NEWS
index c15dc24..657b598 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,7 @@  Post-v2.8.0
    - DPDK:
      * Add support for DPDK v17.11
      * Add support for vHost IOMMU feature
+     * Add support for multi-segment mbufs
 
 v2.8.0 - 31 Aug 2017
 --------------------
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 8da6c32..4c28bd0 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -450,6 +450,13 @@  dpdk_init__(const struct smap *ovs_other_config)
 
     /* Finally, register the dpdk classes */
     netdev_dpdk_register();
+
+    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
+            "dpdk-multi-seg-mbufs", false);
+    if (multi_seg_mbufs_enable) {
+        VLOG_INFO("DPDK multi-segment mbufs enabled");
+        netdev_dpdk_multi_segment_mbufs_enable();
+    }
 }
 
 void
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 36275bd..293edad 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -65,6 +65,7 @@  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 
 VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+static bool dpdk_multi_segment_mbufs = false;
 
 #define DPDK_PORT_WATCHDOG_INTERVAL 5
 
@@ -500,6 +501,7 @@  dpdk_mp_create(struct netdev_dpdk *dev, uint16_t frame_len)
               + dev->requested_n_txq * dev->requested_txq_size
               + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
               + MIN_NB_MBUF;
+    /* XXX (RFC) - should n_mbufs be increased if multi-seg mbufs are used? */
 
     ovs_mutex_lock(&dpdk_mp_mutex);
     do {
@@ -568,7 +570,13 @@  dpdk_mp_free(struct rte_mempool *mp)
 
 /* Tries to allocate a new mempool - or re-use an existing one where
  * appropriate - on requested_socket_id with a size determined by
- * requested_mtu and requested Rx/Tx queues.
+ * requested_mtu and requested Rx/Tx queues. Some properties of the mempool's
+ * elements are dependent on the value of 'dpdk_multi_segment_mbufs':
+ * - if 'true', then the mempool contains standard-sized mbufs that are chained
+ *   together to accommodate packets of size 'requested_mtu'.
+ * - if 'false', then the members of the allocated mempool are
+ *   non-standard-sized mbufs. Each mbuf in the mempool is large enough to fully
+ *   accommodate packets of size 'requested_mtu'.
  * On success - or when re-using an existing mempool - the new configuration
  * will be applied.
  * On error, device will be left unchanged. */
@@ -576,10 +584,18 @@  static int
 netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
     OVS_REQUIRES(dev->mutex)
 {
-    uint16_t buf_size = dpdk_buf_size(dev->requested_mtu);
+    uint16_t buf_size = 0;
     struct rte_mempool *mp;
     int ret = 0;
 
+    /* Contiguous mbufs in use - permit oversized mbufs */
+    if (!dpdk_multi_segment_mbufs) {
+        buf_size = dpdk_buf_size(dev->requested_mtu);
+    } else {
+        /* multi-segment mbufs - use standard mbuf size */
+        buf_size = dpdk_buf_size(ETHER_MTU);
+    }
+
     mp = dpdk_mp_create(dev, buf_size);
     if (!mp) {
         VLOG_ERR("Failed to create memory pool for netdev "
@@ -657,6 +673,7 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
     int diag = 0;
     int i;
     struct rte_eth_conf conf = port_conf;
+    struct rte_eth_txconf txconf;
 
     /* For some NICs (e.g. Niantic), scatter_rx mode needs to be explicitly
      * enabled. */
@@ -690,9 +707,23 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             break;
         }
 
+        /* DPDK PMDs typically attempt to use simple or vectorized
+         * transmit functions, neither of which is compatible with
+         * multi-segment mbufs. Ensure that these are disabled
+         * when multi-segment mbufs are enabled.
+         */
+        if (dpdk_multi_segment_mbufs) {
+            struct rte_eth_dev_info dev_info;
+            rte_eth_dev_info_get(dev->port_id, &dev_info);
+            txconf = dev_info.default_txconf;
+            txconf.txq_flags &= ~ETH_TXQ_FLAGS_NOMULTSEGS;
+        }
+
         for (i = 0; i < n_txq; i++) {
             diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
-                                          dev->socket_id, NULL);
+                                          dev->socket_id,
+                                          dpdk_multi_segment_mbufs ? &txconf
+                                                                   : NULL);
             if (diag) {
                 VLOG_INFO("Interface %s txq(%d) setup error: %s",
                           dev->up.name, i, rte_strerror(-diag));
@@ -3380,6 +3411,12 @@  unlock:
     return err;
 }
 
+void
+netdev_dpdk_multi_segment_mbufs_enable(void)
+{
+    dpdk_multi_segment_mbufs = true;
+}
+
 #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,    \
                           SET_CONFIG, SET_TX_MULTIQ, SEND,    \
                           GET_CARRIER, GET_STATS,             \
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
index b7d02a7..a3339fe 100644
--- a/lib/netdev-dpdk.h
+++ b/lib/netdev-dpdk.h
@@ -25,6 +25,7 @@  struct dp_packet;
 
 #ifdef DPDK_NETDEV
 
+void netdev_dpdk_multi_segment_mbufs_enable(void);
 void netdev_dpdk_register(void);
 void free_dpdk_buf(struct dp_packet *);
 
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index a633226..2b71c4a 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -331,6 +331,26 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="dpdk-multi-seg-mbufs"
+              type='{"type": "boolean"}'>
+        <p>
+          Specifies if DPDK uses multi-segment mbufs for handling jumbo frames.
+        </p>
+        <p>
+            If true, DPDK allocates a single mempool per port, irrespective
+            of the ports' requested MTU sizes. The elements of this mempool are
+            'standard'-sized mbufs (typically ~2 KB), which may be chained
+            together to accommodate jumbo frames. In this approach, each mbuf
+            typically stores a fragment of the overall jumbo frame.
+        </p>
+        <p>
+            If not specified, defaults to <code>false</code>, in which case, the size
+            of each mbuf within a DPDK port's mempool will be grown to accommodate
+            jumbo frames within a single mbuf.
+        </p>
+      </column>
+
+
       <column name="other_config" key="vhost-sock-dir"
               type='{"type": "string"}'>
         <p>