diff mbox

[ovs-dev,V3] netdev-dpdk: Add Jumbo Frame Support.

Message ID 1454321934-146694-1-git-send-email-mark.b.kavanagh@intel.com
State Superseded
Headers show

Commit Message

Mark Kavanagh Feb. 1, 2016, 10:18 a.m. UTC
Add support for Jumbo Frames to DPDK-enabled port types,
using single-segment-mbufs.

Using this approach, the amount of memory allocated for each mbuf
to store frame data is increased to a value greater than 1518B
(typical Ethernet maximum frame length). The increased space
available in the mbuf means that an entire Jumbo Frame can be carried
in a single mbuf, as opposed to partitioning it across multiple mbuf
segments.

The amount of space allocated to each mbuf to hold frame data is
defined dynamically by the user when adding a DPDK port to a bridge.
If an MTU value is not supplied, or the user-supplied value is invalid,
the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).

Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
---
 INSTALL.DPDK.md   |  59 +++++++++-
 NEWS              |   2 +
 lib/netdev-dpdk.c | 347 +++++++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 328 insertions(+), 80 deletions(-)

Comments

Zoltan Kiss Feb. 1, 2016, 3:57 p.m. UTC | #1
Hi,

I have a related question: is it a possible alternative with the current 
codebase to use KNI for configuring the MTU?
I never actually tried KNI, does it have a performance penalty to just 
have it?
The reason I'm asking these questions because OVS normally doesn't 
configure the MTU, but it is outside of its scope. If there is a 
reasonable alternative to do that for DPDK (like with KNI), we don't 
need this change.
Sorry for the noise if this idea have been discussed previously.

Regards,

Zoltan

On 01/02/16 10:18, Mark Kavanagh wrote:
> Add support for Jumbo Frames to DPDK-enabled port types,
> using single-segment-mbufs.
>
> Using this approach, the amount of memory allocated for each mbuf
> to store frame data is increased to a value greater than 1518B
> (typical Ethernet maximum frame length). The increased space
> available in the mbuf means that an entire Jumbo Frame can be carried
> in a single mbuf, as opposed to partitioning it across multiple mbuf
> segments.
>
> The amount of space allocated to each mbuf to hold frame data is
> defined dynamically by the user when adding a DPDK port to a bridge.
> If an MTU value is not supplied, or the user-supplied value is invalid,
> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>
> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> ---
>   INSTALL.DPDK.md   |  59 +++++++++-
>   NEWS              |   2 +
>   lib/netdev-dpdk.c | 347 +++++++++++++++++++++++++++++++++++++++++-------------
>   3 files changed, 328 insertions(+), 80 deletions(-)
>
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 96b686c..64ccd15 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -859,10 +859,61 @@ by adding the following string:
>   to <interface> sections of all network devices used by DPDK. Parameter 'N'
>   determines how many queues can be used by the guest.
>
> +
> +Jumbo Frames
> +------------
> +
> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
> +
> +To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the ovs-vsctl
> +'add-port' command-line, along with the required MTU for the port.
> +e.g.
> +
> +     ```
> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000
> +     ```
> +
> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
> +increased, such that a full Jumbo Frame may be accommodated inside a single
> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
> +
> +Jumbo frame support has been validated against 13312B frames, using the
> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as
> +this feature is pending deprecation.
> +
> +
> +vHost Ports and Jumbo Frames
> +----------------------------
> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional
> +configuration is needed to take advantage of this feature:
> +
> +  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
> +      the QEMU command line snippet below:
> +
> +      ```
> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
> +      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
> +      ```
> +
> +  2. Where virtio devices are bound to the Linux kernel driver in a guest
> +     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
> +     MTU of those logical network interfaces must also be increased. This
> +     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
> +     to the length of the IP packet only, and not that of the entire frame.
> +
> +     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2
> +     header and CRC lengths (i.e. 18B) from the max supported frame size.
> +     So, to set the MTU for a 13312B Jumbo Frame:
> +
> +      ```
> +      ifconfig eth1 mtu 13294
> +      ```
> +
> +
>   Restrictions:
>   -------------
>
> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>     - Currently DPDK port does not make use any offload functionality.
>     - DPDK-vHost support works with 1G huge pages.
>
> @@ -903,6 +954,12 @@ Restrictions:
>       the next release of DPDK (which includes the above patch) is available and
>       integrated into OVS.
>
> +  Jumbo Frames:
> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
> +     this issue is currently being investigated.
> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
> +
> +
>   Bug Reporting:
>   --------------
>
> diff --git a/NEWS b/NEWS
> index 5c18867..cd563e0 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>        abstractions, such as virtual L2 and L3 overlays and security groups.
>      - RHEL packaging:
>        * DPDK ports may now be created via network scripts (see README.RHEL).
> +   - netdev-dpdk:
> +     * Add Jumbo Frame Support
>
>
>   v2.4.0 - 20 Aug 2015
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index de7e488..76a5dcc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>   #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>   #define OVS_VPORT_DPDK "ovs_dpdk"
>
> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
> +
>   /*
>    * need to reserve tons of extra space in the mbufs so we can align the
>    * DMA addresses to 4KB.
>    * The minimum mbuf size is limited to avoid scatter behaviour and drop in
>    * performance for standard Ethernet MTU.
>    */
> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
> -                              + sizeof(struct dp_packet) \
> -                              + RTE_PKTMBUF_HEADROOM)
> -#define MBUF_SIZE_DRIVER     (2048                       \
> -                              + sizeof (struct rte_mbuf) \
> -                              + RTE_PKTMBUF_HEADROOM)
> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
> +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)
> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
> +                                    + sizeof(struct dp_packet)   \
> +                                    + RTE_PKTMBUF_HEADROOM)
> +
> +/* This value should be specified as a multiple of the DPDK NIC driver's
> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
> + */
> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>
>   /* Max and min number of packets in the mempool.  OVS tries to allocate a
>    * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>   #define MIN_NB_MBUF          (4096 * 4)
>   #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>
> +#define DPDK_VLAN_TAG_LEN    4
> +
>   /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>   BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>
> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>           .header_split   = 0, /* Header Split disabled */
>           .hw_ip_checksum = 0, /* IP checksum offload disabled */
>           .hw_vlan_filter = 0, /* VLAN filtering disabled */
> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>           .hw_strip_crc   = 0,
>       },
>       .rx_adv_conf = {
> @@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>       return class->construct == netdev_dpdk_construct;
>   }
>
> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
> + * If 'frame_len' is not a multiple of this value, insufficient buffers are
> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio
> + * driver needs to ensure that there is also sufficient space in the Rx buffer
> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
> + * small, then the driver enables scatter RX behaviour, which reduces
> + * performance. To prevent this, use a buffer size that is closest to
> + * 'frame_len', but which satisfies the aforementioned criteria.
> + */
> +static uint32_t
> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
> +{
> +    struct rte_eth_dev_info info;
> +    uint32_t buf_size;
> +    /* XXX: This is a workaround for DPDK v2.2, and needs to be refactored with a
> +     * future DPDK release. */
> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
> +
> +    if(netdev->type == DPDK_DEV_ETH) {
> +        rte_eth_dev_info_get(netdev->port_id, &info);
> +        buf_size = (info.min_rx_bufsize == 0) ?
> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
> +                   info.min_rx_bufsize;
> +    } else {
> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
> +    }
> +
> +    if(len % buf_size) {
> +        len = buf_size * ((len/buf_size) + 1);
> +    }
> +
> +    return len;
> +}
> +
>   /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>    * for all other segments data, bss and text. */
>
> @@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>   }
>
>   static void
> -__rte_pktmbuf_init(struct rte_mempool *mp,
> -                   void *opaque_arg OVS_UNUSED,
> -                   void *_m,
> -                   unsigned i OVS_UNUSED)
> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>   {
> -    struct rte_mbuf *m = _m;
> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
> +    struct rte_pktmbuf_pool_private default_mbp_priv;
> +    uint16_t roomsz;
>
>       RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>
> +    /* if no structure is provided, assume no mbuf private area */
> +
> +    user_mbp_priv = opaque_arg;
> +    if (user_mbp_priv == NULL) {
> +        default_mbp_priv.mbuf_priv_size = 0;
> +        if (mp->elt_size > sizeof(struct dp_packet)) {
> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
> +        } else {
> +            roomsz = 0;
> +        }
> +        default_mbp_priv.mbuf_data_room_size = roomsz;
> +        user_mbp_priv = &default_mbp_priv;
> +    }
> +
> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
> +        user_mbp_priv->mbuf_data_room_size +
> +        user_mbp_priv->mbuf_priv_size);
> +
> +    mbp_priv = rte_mempool_get_priv(mp);
> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
> +}
> +
> +/* Initialise some fields in the mbuf structure that are not modified by the
> + * user once created (origin pool, buffer start address, etc.*/
> +static void
> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
> +                       void *opaque_arg OVS_UNUSED,
> +                       void *_m,
> +                       unsigned i OVS_UNUSED)
> +{
> +    struct rte_mbuf *m = _m;
> +    uint32_t buf_size, buf_len, priv_size;
> +
> +    priv_size = rte_pktmbuf_priv_size(mp);
> +    buf_size = sizeof(struct dp_packet) + priv_size;
> +    buf_len = rte_pktmbuf_data_room_size(mp);
> +
> +    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
> +    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
> +    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
> +
>       memset(m, 0, mp->elt_size);
>
> -    /* start of buffer is just after mbuf structure */
> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
> -                    sizeof(struct dp_packet);
> +    /* start of buffer is after dp_packet structure and priv data */
> +    m->priv_size = priv_size;
> +    m->buf_addr = (char *)m + buf_size;
> +    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>       m->buf_len = (uint16_t)buf_len;
>
>       /* keep some headroom between start of buffer and data */
> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
> +    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>
>       /* init some constant fields */
>       m->pool = mp;
> @@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>   {
>       struct rte_mbuf *m = _m;
>
> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>
>       dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>   }
> @@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>       struct dpdk_mp *dmp = NULL;
>       char mp_name[RTE_MEMPOOL_NAMESIZE];
>       unsigned mp_size;
> +    struct rte_pktmbuf_pool_private mbp_priv;
>
>       LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>           if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
> @@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>       dmp->socket_id = socket_id;
>       dmp->mtu = mtu;
>       dmp->refcount = 1;
> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
> +    mbp_priv.mbuf_priv_size = 0;
>
>       mp_size = MAX_NB_MBUF;
>       do {
> @@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>               return NULL;
>           }
>
> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
> +        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
>                                        MP_CACHE_SZ,
>                                        sizeof(struct rte_pktmbuf_pool_private),
> -                                     rte_pktmbuf_pool_init, NULL,
> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
>                                        ovs_rte_pktmbuf_init, NULL,
>                                        socket_id, 0);
>       } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
> @@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>   {
>       int diag = 0;
>       int i;
> +    struct rte_eth_conf conf = port_conf;
>
>       /* A device may report more queues than it makes available (this has
>        * been observed for Intel xl710, which reserves some of them for
> @@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>               VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>           }
>
> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        }
> +
> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>           if (diag) {
>               break;
>           }
> @@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>       struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>       int sid;
>       int err = 0;
> +    uint32_t buf_size;
>
>       ovs_mutex_init(&netdev->mutex);
>       ovs_mutex_lock(&netdev->mutex);
> @@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>       netdev->port_id = port_no;
>       netdev->type = type;
>       netdev->flags = 0;
> +
> +    /* Initialize port's MTU and frame len to the default Ethernet values.
> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
> +     * netdev_dpdk_set_config.
> +     */
> +    netdev->max_packet_len = ETHER_MAX_LEN;
>       netdev->mtu = ETHER_MTU;
> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>
> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>       if (!netdev->dpdk_mp) {
>           err = ENOMEM;
>           goto unlock;
> @@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],
>       return 0;
>   }
>
> +static void
> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
> +{
> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
> +    char *end_ptr = NULL;
> +    int local_mtu;
> +
> +    if(mtu_str) {
> +        local_mtu = strtoul(mtu_str, &end_ptr, 0);
> +    }
> +    if(!mtu_str || local_mtu < ETHER_MTU ||
> +                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
> +                   *end_ptr != '\0') {
> +        local_mtu = ETHER_MTU;
> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",
> +                   local_mtu);
> +    }
> +
> +    *mtu = local_mtu;
> +}
> +
>   static int
>   vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>   {
> @@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args)
>       smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>       smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>       smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>       ovs_mutex_unlock(&dev->mutex);
>
>       return 0;
>   }
>
> +/* Set the mtu of DPDK_DEV_ETH ports */
> +static int
> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int old_mtu, err;
> +    uint32_t buf_size;
> +    int dpdk_mtu;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
> +
> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    rte_eth_dev_stop(dev->port_id);
> +
> +    old_mtu = dev->mtu;
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> +    err = dpdk_eth_dev_init(dev);
> +    if (err) {
> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "
> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
> +        dpdk_mp_put(mp);
> +        dev->mtu = old_mtu;
> +        dev->dpdk_mp = old_mp;
> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        dpdk_eth_dev_init(dev);
> +        goto out;
> +    } else {
> +        dpdk_mp_put(old_mp);
> +        netdev_change_seq_changed(netdev);
> +    }
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +
> +static int
> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
> +{
> +    int mtu;
> +
> +    dpdk_dev_parse_mtu(args, &mtu);
> +
> +    return netdev_dpdk_set_mtu(netdev_, mtu);
> +}
> +
>   static int
>   netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>   {
> @@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
>
>       return 0;
>   }
> -
> -static int
> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> -{
> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    int old_mtu, err;
> -    struct dpdk_mp *old_mp;
> -    struct dpdk_mp *mp;
> -
> -    ovs_mutex_lock(&dpdk_mutex);
> -    ovs_mutex_lock(&dev->mutex);
> -    if (dev->mtu == mtu) {
> -        err = 0;
> -        goto out;
> -    }
> -
> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> -    if (!mp) {
> -        err = ENOMEM;
> -        goto out;
> -    }
> -
> -    rte_eth_dev_stop(dev->port_id);
> -
> -    old_mtu = dev->mtu;
> -    old_mp = dev->dpdk_mp;
> -    dev->dpdk_mp = mp;
> -    dev->mtu = mtu;
> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> -
> -    err = dpdk_eth_dev_init(dev);
> -    if (err) {
> -        dpdk_mp_put(mp);
> -        dev->mtu = old_mtu;
> -        dev->dpdk_mp = old_mp;
> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> -        dpdk_eth_dev_init(dev);
> -        goto out;
> -    }
> -
> -    dpdk_mp_put(old_mp);
> -    netdev_change_seq_changed(netdev);
> -out:
> -    ovs_mutex_unlock(&dev->mutex);
> -    ovs_mutex_unlock(&dpdk_mutex);
> -    return err;
> -}
> -
>   static int
>   netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>
> @@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *args)
>       smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
>       smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>       smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>       smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>       smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>       smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
> @@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)
>       return 0;
>   }
>
> +/* Set the mtu of DPDK_DEV_VHOST ports */
> +static int
> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int err = 0;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    mp = dpdk_mp_get(dev->socket_id, mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> +    dpdk_mp_put(old_mp);
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +
> +static int
> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)
> +{
> +    int mtu;
> +
> +    dpdk_dev_parse_mtu(args, &mtu);
> +
> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
> +}
> +
>   static void
>   dpdk_common_init(void)
>   {
> @@ -2040,8 +2220,9 @@ unlock_dpdk:
>       return err;
>   }
>
> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
> +        GET_STATUS, RXQ_RECV)                                          \
>   {                                                             \
>       NAME,                                                     \
>       INIT,                       /* init */                    \
> @@ -2053,7 +2234,7 @@ unlock_dpdk:
>       DESTRUCT,                                                 \
>       netdev_dpdk_dealloc,                                      \
>       netdev_dpdk_get_config,                                   \
> -    NULL,                       /* netdev_dpdk_set_config */  \
> +    SET_CONFIG,                                               \
>       NULL,                       /* get_tunnel_config */       \
>       NULL,                       /* build header */            \
>       NULL,                       /* push header */             \
> @@ -2067,7 +2248,7 @@ unlock_dpdk:
>       netdev_dpdk_set_etheraddr,                                \
>       netdev_dpdk_get_etheraddr,                                \
>       netdev_dpdk_get_mtu,                                      \
> -    netdev_dpdk_set_mtu,                                      \
> +    SET_MTU,                                                  \
>       netdev_dpdk_get_ifindex,                                  \
>       GET_CARRIER,                                              \
>       netdev_dpdk_get_carrier_resets,                           \
> @@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>           NULL,
>           netdev_dpdk_construct,
>           netdev_dpdk_destruct,
> +        netdev_dpdk_set_config,
>           netdev_dpdk_set_multiq,
>           netdev_dpdk_eth_send,
> +        netdev_dpdk_set_mtu,
>           netdev_dpdk_get_carrier,
>           netdev_dpdk_get_stats,
>           netdev_dpdk_get_features,
> @@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class =
>           NULL,
>           netdev_dpdk_ring_construct,
>           netdev_dpdk_destruct,
> +        netdev_dpdk_set_config,
>           netdev_dpdk_set_multiq,
>           netdev_dpdk_ring_send,
> +        netdev_dpdk_set_mtu,
>           netdev_dpdk_get_carrier,
>           netdev_dpdk_get_stats,
>           netdev_dpdk_get_features,
> @@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =
>           dpdk_vhost_cuse_class_init,
>           netdev_dpdk_vhost_cuse_construct,
>           netdev_dpdk_vhost_destruct,
> +        NULL,
>           netdev_dpdk_vhost_set_multiq,
>           netdev_dpdk_vhost_send,
> +        NULL,
>           netdev_dpdk_vhost_get_carrier,
>           netdev_dpdk_vhost_get_stats,
>           NULL,
> @@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =
>           dpdk_vhost_user_class_init,
>           netdev_dpdk_vhost_user_construct,
>           netdev_dpdk_vhost_destruct,
> +        netdev_dpdk_vhost_set_config,
>           netdev_dpdk_vhost_set_multiq,
>           netdev_dpdk_vhost_send,
> +        netdev_dpdk_vhost_set_mtu,
>           netdev_dpdk_vhost_get_carrier,
>           netdev_dpdk_vhost_get_stats,
>           NULL,
>
Mark Kavanagh Feb. 1, 2016, 4:36 p.m. UTC | #2
Hi Zoltan. Thanks for your mail - responses inline below.

Cheers,
Mark

>

>Hi,

>

>I have a related question: is it a possible alternative with the current

>codebase to use KNI for configuring the MTU?


This would (as far as I'm aware) require the implementation of a new KNI netdev class, along with a handler/set of handlers in order to set the MTU (for that netdev class only), based on requests from ifconfig/ip link/etc.
It wouldn't make a difference as far as IVSHMEM/Phy ports are concerned, so I don't think this approach is workable. If I've missed something though, please do let me know.

>I never actually tried KNI, does it have a performance penalty to just

>have it?


I haven't used it in some time, but the fact that it passes packets to/accepts packets from the Linux Network Stack means that it's subject to all of the inherent performance bottlenecks that you'd expect.

>The reason I'm asking these questions because OVS normally doesn't

>configure the MTU, but it is outside of its scope. If there is a

>reasonable alternative to do that for DPDK (like with KNI), we don't

>need this change.

>Sorry for the noise if this idea have been discussed previously.

>

>Regards,

>

>Zoltan

>

>On 01/02/16 10:18, Mark Kavanagh wrote:

>> Add support for Jumbo Frames to DPDK-enabled port types,

>> using single-segment-mbufs.

>>

>> Using this approach, the amount of memory allocated for each mbuf

>> to store frame data is increased to a value greater than 1518B

>> (typical Ethernet maximum frame length). The increased space

>> available in the mbuf means that an entire Jumbo Frame can be carried

>> in a single mbuf, as opposed to partitioning it across multiple mbuf

>> segments.

>>

>> The amount of space allocated to each mbuf to hold frame data is

>> defined dynamically by the user when adding a DPDK port to a bridge.

>> If an MTU value is not supplied, or the user-supplied value is invalid,

>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).

>>

>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>

>> ---

>>   INSTALL.DPDK.md   |  59 +++++++++-

>>   NEWS              |   2 +

>>   lib/netdev-dpdk.c | 347 +++++++++++++++++++++++++++++++++++++++++-------------

>>   3 files changed, 328 insertions(+), 80 deletions(-)

>>

>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md

>> index 96b686c..64ccd15 100644

>> --- a/INSTALL.DPDK.md

>> +++ b/INSTALL.DPDK.md

>> @@ -859,10 +859,61 @@ by adding the following string:

>>   to <interface> sections of all network devices used by DPDK. Parameter 'N'

>>   determines how many queues can be used by the guest.

>>

>> +

>> +Jumbo Frames

>> +------------

>> +

>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.

>> +

>> +To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the ovs-vsctl

>> +'add-port' command-line, along with the required MTU for the port.

>> +e.g.

>> +

>> +     ```

>> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000

>> +     ```

>> +

>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are

>> +increased, such that a full Jumbo Frame may be accommodated inside a single

>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.

>> +

>> +Jumbo frame support has been validated against 13312B frames, using the

>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may

>> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as

>> +this feature is pending deprecation.

>> +

>> +

>> +vHost Ports and Jumbo Frames

>> +----------------------------

>> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional

>> +configuration is needed to take advantage of this feature:

>> +

>> +  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in

>> +      the QEMU command line snippet below:

>> +

>> +      ```

>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'

>> +      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'

>> +      ```

>> +

>> +  2. Where virtio devices are bound to the Linux kernel driver in a guest

>> +     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the

>> +     MTU of those logical network interfaces must also be increased. This

>> +     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers

>> +     to the length of the IP packet only, and not that of the entire frame.

>> +

>> +     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2

>> +     header and CRC lengths (i.e. 18B) from the max supported frame size.

>> +     So, to set the MTU for a 13312B Jumbo Frame:

>> +

>> +      ```

>> +      ifconfig eth1 mtu 13294

>> +      ```

>> +

>> +

>>   Restrictions:

>>   -------------

>>

>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.

>>     - Currently DPDK port does not make use any offload functionality.

>>     - DPDK-vHost support works with 1G huge pages.

>>

>> @@ -903,6 +954,12 @@ Restrictions:

>>       the next release of DPDK (which includes the above patch) is available and

>>       integrated into OVS.

>>

>> +  Jumbo Frames:

>> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of

>> +     this issue is currently being investigated.

>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.

>> +

>> +

>>   Bug Reporting:

>>   --------------

>>

>> diff --git a/NEWS b/NEWS

>> index 5c18867..cd563e0 100644

>> --- a/NEWS

>> +++ b/NEWS

>> @@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx

>>        abstractions, such as virtual L2 and L3 overlays and security groups.

>>      - RHEL packaging:

>>        * DPDK ports may now be created via network scripts (see README.RHEL).

>> +   - netdev-dpdk:

>> +     * Add Jumbo Frame Support

>>

>>

>>   v2.4.0 - 20 Aug 2015

>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c

>> index de7e488..76a5dcc 100644

>> --- a/lib/netdev-dpdk.c

>> +++ b/lib/netdev-dpdk.c

>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

>>   #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE

>>   #define OVS_VPORT_DPDK "ovs_dpdk"

>>

>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1

>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024

>> +

>>   /*

>>    * need to reserve tons of extra space in the mbufs so we can align the

>>    * DMA addresses to 4KB.

>>    * The minimum mbuf size is limited to avoid scatter behaviour and drop in

>>    * performance for standard Ethernet MTU.

>>    */

>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)

>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \

>> -                              + sizeof(struct dp_packet) \

>> -                              + RTE_PKTMBUF_HEADROOM)

>> -#define MBUF_SIZE_DRIVER     (2048                       \

>> -                              + sizeof (struct rte_mbuf) \

>> -                              + RTE_PKTMBUF_HEADROOM)

>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)

>> +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)

>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)

>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \

>> +                                    + sizeof(struct dp_packet)   \

>> +                                    + RTE_PKTMBUF_HEADROOM)

>> +

>> +/* This value should be specified as a multiple of the DPDK NIC driver's

>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').

>> + */

>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312

>>

>>   /* Max and min number of packets in the mempool.  OVS tries to allocate a

>>    * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have

>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

>>   #define MIN_NB_MBUF          (4096 * 4)

>>   #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE

>>

>> +#define DPDK_VLAN_TAG_LEN    4

>> +

>>   /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */

>>   BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);

>>

>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {

>>           .header_split   = 0, /* Header Split disabled */

>>           .hw_ip_checksum = 0, /* IP checksum offload disabled */

>>           .hw_vlan_filter = 0, /* VLAN filtering disabled */

>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */

>>           .hw_strip_crc   = 0,

>>       },

>>       .rx_adv_conf = {

>> @@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)

>>       return class->construct == netdev_dpdk_construct;

>>   }

>>

>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity

>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).

>> + * If 'frame_len' is not a multiple of this value, insufficient buffers are

>> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio

>> + * driver needs to ensure that there is also sufficient space in the Rx buffer

>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too

>> + * small, then the driver enables scatter RX behaviour, which reduces

>> + * performance. To prevent this, use a buffer size that is closest to

>> + * 'frame_len', but which satisfies the aforementioned criteria.

>> + */

>> +static uint32_t

>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)

>> +{

>> +    struct rte_eth_dev_info info;

>> +    uint32_t buf_size;

>> +    /* XXX: This is a workaround for DPDK v2.2, and needs to be refactored with a

>> +     * future DPDK release. */

>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);

>> +

>> +    if(netdev->type == DPDK_DEV_ETH) {

>> +        rte_eth_dev_info_get(netdev->port_id, &info);

>> +        buf_size = (info.min_rx_bufsize == 0) ?

>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :

>> +                   info.min_rx_bufsize;

>> +    } else {

>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;

>> +    }

>> +

>> +    if(len % buf_size) {

>> +        len = buf_size * ((len/buf_size) + 1);

>> +    }

>> +

>> +    return len;

>> +}

>> +

>>   /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used

>>    * for all other segments data, bss and text. */

>>

>> @@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)

>>   }

>>

>>   static void

>> -__rte_pktmbuf_init(struct rte_mempool *mp,

>> -                   void *opaque_arg OVS_UNUSED,

>> -                   void *_m,

>> -                   unsigned i OVS_UNUSED)

>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)

>>   {

>> -    struct rte_mbuf *m = _m;

>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);

>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;

>> +    struct rte_pktmbuf_pool_private default_mbp_priv;

>> +    uint16_t roomsz;

>>

>>       RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));

>>

>> +    /* if no structure is provided, assume no mbuf private area */

>> +

>> +    user_mbp_priv = opaque_arg;

>> +    if (user_mbp_priv == NULL) {

>> +        default_mbp_priv.mbuf_priv_size = 0;

>> +        if (mp->elt_size > sizeof(struct dp_packet)) {

>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);

>> +        } else {

>> +            roomsz = 0;

>> +        }

>> +        default_mbp_priv.mbuf_data_room_size = roomsz;

>> +        user_mbp_priv = &default_mbp_priv;

>> +    }

>> +

>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +

>> +        user_mbp_priv->mbuf_data_room_size +

>> +        user_mbp_priv->mbuf_priv_size);

>> +

>> +    mbp_priv = rte_mempool_get_priv(mp);

>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));

>> +}

>> +

>> +/* Initialise some fields in the mbuf structure that are not modified by the

>> + * user once created (origin pool, buffer start address, etc.*/

>> +static void

>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,

>> +                       void *opaque_arg OVS_UNUSED,

>> +                       void *_m,

>> +                       unsigned i OVS_UNUSED)

>> +{

>> +    struct rte_mbuf *m = _m;

>> +    uint32_t buf_size, buf_len, priv_size;

>> +

>> +    priv_size = rte_pktmbuf_priv_size(mp);

>> +    buf_size = sizeof(struct dp_packet) + priv_size;

>> +    buf_len = rte_pktmbuf_data_room_size(mp);

>> +

>> +    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);

>> +    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);

>> +    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);

>> +

>>       memset(m, 0, mp->elt_size);

>>

>> -    /* start of buffer is just after mbuf structure */

>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);

>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +

>> -                    sizeof(struct dp_packet);

>> +    /* start of buffer is after dp_packet structure and priv data */

>> +    m->priv_size = priv_size;

>> +    m->buf_addr = (char *)m + buf_size;

>> +    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;

>>       m->buf_len = (uint16_t)buf_len;

>>

>>       /* keep some headroom between start of buffer and data */

>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);

>> +    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);

>>

>>       /* init some constant fields */

>>       m->pool = mp;

>> @@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,

>>   {

>>       struct rte_mbuf *m = _m;

>>

>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);

>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);

>>

>>       dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);

>>   }

>> @@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>       struct dpdk_mp *dmp = NULL;

>>       char mp_name[RTE_MEMPOOL_NAMESIZE];

>>       unsigned mp_size;

>> +    struct rte_pktmbuf_pool_private mbp_priv;

>>

>>       LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {

>>           if (dmp->socket_id == socket_id && dmp->mtu == mtu) {

>> @@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>       dmp->socket_id = socket_id;

>>       dmp->mtu = mtu;

>>       dmp->refcount = 1;

>> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;

>> +    mbp_priv.mbuf_priv_size = 0;

>>

>>       mp_size = MAX_NB_MBUF;

>>       do {

>> @@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>               return NULL;

>>           }

>>

>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),

>> +        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),

>>                                        MP_CACHE_SZ,

>>                                        sizeof(struct rte_pktmbuf_pool_private),

>> -                                     rte_pktmbuf_pool_init, NULL,

>> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,

>>                                        ovs_rte_pktmbuf_init, NULL,

>>                                        socket_id, 0);

>>       } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);

>> @@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)

>>   {

>>       int diag = 0;

>>       int i;

>> +    struct rte_eth_conf conf = port_conf;

>>

>>       /* A device may report more queues than it makes available (this has

>>        * been observed for Intel xl710, which reserves some of them for

>> @@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int

>n_txq)

>>               VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);

>>           }

>>

>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);

>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {

>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;

>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);

>> +        }

>> +

>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);

>>           if (diag) {

>>               break;

>>           }

>> @@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,

>>       struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);

>>       int sid;

>>       int err = 0;

>> +    uint32_t buf_size;

>>

>>       ovs_mutex_init(&netdev->mutex);

>>       ovs_mutex_lock(&netdev->mutex);

>> @@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,

>>       netdev->port_id = port_no;

>>       netdev->type = type;

>>       netdev->flags = 0;

>> +

>> +    /* Initialize port's MTU and frame len to the default Ethernet values.

>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in

>> +     * netdev_dpdk_set_config.

>> +     */

>> +    netdev->max_packet_len = ETHER_MAX_LEN;

>>       netdev->mtu = ETHER_MTU;

>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);

>>

>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);

>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);

>> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));

>>       if (!netdev->dpdk_mp) {

>>           err = ENOMEM;

>>           goto unlock;

>> @@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],

>>       return 0;

>>   }

>>

>> +static void

>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)

>> +{

>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");

>> +    char *end_ptr = NULL;

>> +    int local_mtu;

>> +

>> +    if(mtu_str) {

>> +        local_mtu = strtoul(mtu_str, &end_ptr, 0);

>> +    }

>> +    if(!mtu_str || local_mtu < ETHER_MTU ||

>> +                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||

>> +                   *end_ptr != '\0') {

>> +        local_mtu = ETHER_MTU;

>> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",

>> +                   local_mtu);

>> +    }

>> +

>> +    *mtu = local_mtu;

>> +}

>> +

>>   static int

>>   vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)

>>   {

>> @@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap

>*args)

>>       smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);

>>       smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);

>>       smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);

>> +    smap_add_format(args, "mtu", "%d", dev->mtu);

>>       ovs_mutex_unlock(&dev->mutex);

>>

>>       return 0;

>>   }

>>

>> +/* Set the mtu of DPDK_DEV_ETH ports */

>> +static int

>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)

>> +{

>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>> +    int old_mtu, err;

>> +    uint32_t buf_size;

>> +    int dpdk_mtu;

>> +    struct dpdk_mp *old_mp;

>> +    struct dpdk_mp *mp;

>> +

>> +    ovs_mutex_lock(&dpdk_mutex);

>> +    ovs_mutex_lock(&dev->mutex);

>> +    if (dev->mtu == mtu) {

>> +        err = 0;

>> +        goto out;

>> +    }

>> +

>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));

>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);

>> +

>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);

>> +    if (!mp) {

>> +        err = ENOMEM;

>> +        goto out;

>> +    }

>> +

>> +    rte_eth_dev_stop(dev->port_id);

>> +

>> +    old_mtu = dev->mtu;

>> +    old_mp = dev->dpdk_mp;

>> +    dev->dpdk_mp = mp;

>> +    dev->mtu = mtu;

>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>> +

>> +    err = dpdk_eth_dev_init(dev);

>> +    if (err) {

>> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "

>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);

>> +        dpdk_mp_put(mp);

>> +        dev->mtu = old_mtu;

>> +        dev->dpdk_mp = old_mp;

>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>> +        dpdk_eth_dev_init(dev);

>> +        goto out;

>> +    } else {

>> +        dpdk_mp_put(old_mp);

>> +        netdev_change_seq_changed(netdev);

>> +    }

>> +out:

>> +    ovs_mutex_unlock(&dev->mutex);

>> +    ovs_mutex_unlock(&dpdk_mutex);

>> +    return err;

>> +}

>> +

>> +static int

>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)

>> +{

>> +    int mtu;

>> +

>> +    dpdk_dev_parse_mtu(args, &mtu);

>> +

>> +    return netdev_dpdk_set_mtu(netdev_, mtu);

>> +}

>> +

>>   static int

>>   netdev_dpdk_get_numa_id(const struct netdev *netdev_)

>>   {

>> @@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)

>>

>>       return 0;

>>   }

>> -

>> -static int

>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)

>> -{

>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>> -    int old_mtu, err;

>> -    struct dpdk_mp *old_mp;

>> -    struct dpdk_mp *mp;

>> -

>> -    ovs_mutex_lock(&dpdk_mutex);

>> -    ovs_mutex_lock(&dev->mutex);

>> -    if (dev->mtu == mtu) {

>> -        err = 0;

>> -        goto out;

>> -    }

>> -

>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);

>> -    if (!mp) {

>> -        err = ENOMEM;

>> -        goto out;

>> -    }

>> -

>> -    rte_eth_dev_stop(dev->port_id);

>> -

>> -    old_mtu = dev->mtu;

>> -    old_mp = dev->dpdk_mp;

>> -    dev->dpdk_mp = mp;

>> -    dev->mtu = mtu;

>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);

>> -

>> -    err = dpdk_eth_dev_init(dev);

>> -    if (err) {

>> -        dpdk_mp_put(mp);

>> -        dev->mtu = old_mtu;

>> -        dev->dpdk_mp = old_mp;

>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);

>> -        dpdk_eth_dev_init(dev);

>> -        goto out;

>> -    }

>> -

>> -    dpdk_mp_put(old_mp);

>> -    netdev_change_seq_changed(netdev);

>> -out:

>> -    ovs_mutex_unlock(&dev->mutex);

>> -    ovs_mutex_unlock(&dpdk_mutex);

>> -    return err;

>> -}

>> -

>>   static int

>>   netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);

>>

>> @@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap

>*args)

>>       smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));

>>       smap_add_format(args, "driver_name", "%s", dev_info.driver_name);

>>       smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);

>> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);

>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);

>>       smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);

>>       smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);

>>       smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);

>> @@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)

>>       return 0;

>>   }

>>

>> +/* Set the mtu of DPDK_DEV_VHOST ports */

>> +static int

>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)

>> +{

>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>> +    int err = 0;

>> +    struct dpdk_mp *old_mp;

>> +    struct dpdk_mp *mp;

>> +

>> +    ovs_mutex_lock(&dpdk_mutex);

>> +    ovs_mutex_lock(&dev->mutex);

>> +    if (dev->mtu == mtu) {

>> +        err = 0;

>> +        goto out;

>> +    }

>> +

>> +    mp = dpdk_mp_get(dev->socket_id, mtu);

>> +    if (!mp) {

>> +        err = ENOMEM;

>> +        goto out;

>> +    }

>> +

>> +    old_mp = dev->dpdk_mp;

>> +    dev->dpdk_mp = mp;

>> +    dev->mtu = mtu;

>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>> +

>> +    dpdk_mp_put(old_mp);

>> +    netdev_change_seq_changed(netdev);

>> +out:

>> +    ovs_mutex_unlock(&dev->mutex);

>> +    ovs_mutex_unlock(&dpdk_mutex);

>> +    return err;

>> +}

>> +

>> +static int

>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)

>> +{

>> +    int mtu;

>> +

>> +    dpdk_dev_parse_mtu(args, &mtu);

>> +

>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);

>> +}

>> +

>>   static void

>>   dpdk_common_init(void)

>>   {

>> @@ -2040,8 +2220,9 @@ unlock_dpdk:

>>       return err;

>>   }

>>

>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \

>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \

>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \

>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \

>> +        GET_STATUS, RXQ_RECV)                                          \

>>   {                                                             \

>>       NAME,                                                     \

>>       INIT,                       /* init */                    \

>> @@ -2053,7 +2234,7 @@ unlock_dpdk:

>>       DESTRUCT,                                                 \

>>       netdev_dpdk_dealloc,                                      \

>>       netdev_dpdk_get_config,                                   \

>> -    NULL,                       /* netdev_dpdk_set_config */  \

>> +    SET_CONFIG,                                               \

>>       NULL,                       /* get_tunnel_config */       \

>>       NULL,                       /* build header */            \

>>       NULL,                       /* push header */             \

>> @@ -2067,7 +2248,7 @@ unlock_dpdk:

>>       netdev_dpdk_set_etheraddr,                                \

>>       netdev_dpdk_get_etheraddr,                                \

>>       netdev_dpdk_get_mtu,                                      \

>> -    netdev_dpdk_set_mtu,                                      \

>> +    SET_MTU,                                                  \

>>       netdev_dpdk_get_ifindex,                                  \

>>       GET_CARRIER,                                              \

>>       netdev_dpdk_get_carrier_resets,                           \

>> @@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =

>>           NULL,

>>           netdev_dpdk_construct,

>>           netdev_dpdk_destruct,

>> +        netdev_dpdk_set_config,

>>           netdev_dpdk_set_multiq,

>>           netdev_dpdk_eth_send,

>> +        netdev_dpdk_set_mtu,

>>           netdev_dpdk_get_carrier,

>>           netdev_dpdk_get_stats,

>>           netdev_dpdk_get_features,

>> @@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class =

>>           NULL,

>>           netdev_dpdk_ring_construct,

>>           netdev_dpdk_destruct,

>> +        netdev_dpdk_set_config,

>>           netdev_dpdk_set_multiq,

>>           netdev_dpdk_ring_send,

>> +        netdev_dpdk_set_mtu,

>>           netdev_dpdk_get_carrier,

>>           netdev_dpdk_get_stats,

>>           netdev_dpdk_get_features,

>> @@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =

>>           dpdk_vhost_cuse_class_init,

>>           netdev_dpdk_vhost_cuse_construct,

>>           netdev_dpdk_vhost_destruct,

>> +        NULL,

>>           netdev_dpdk_vhost_set_multiq,

>>           netdev_dpdk_vhost_send,

>> +        NULL,

>>           netdev_dpdk_vhost_get_carrier,

>>           netdev_dpdk_vhost_get_stats,

>>           NULL,

>> @@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =

>>           dpdk_vhost_user_class_init,

>>           netdev_dpdk_vhost_user_construct,

>>           netdev_dpdk_vhost_destruct,

>> +        netdev_dpdk_vhost_set_config,

>>           netdev_dpdk_vhost_set_multiq,

>>           netdev_dpdk_vhost_send,

>> +        netdev_dpdk_vhost_set_mtu,

>>           netdev_dpdk_vhost_get_carrier,

>>           netdev_dpdk_vhost_get_stats,

>>           NULL,

>>
Flavio Leitner Feb. 4, 2016, 8:37 p.m. UTC | #3
Hi,

Some comments below.


On Mon,  1 Feb 2016 10:18:54 +0000
Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:

> Add support for Jumbo Frames to DPDK-enabled port types,
> using single-segment-mbufs.
> 
> Using this approach, the amount of memory allocated for each mbuf
> to store frame data is increased to a value greater than 1518B
> (typical Ethernet maximum frame length). The increased space
> available in the mbuf means that an entire Jumbo Frame can be carried
> in a single mbuf, as opposed to partitioning it across multiple mbuf
> segments.
> 
> The amount of space allocated to each mbuf to hold frame data is
> defined dynamically by the user when adding a DPDK port to a bridge.
> If an MTU value is not supplied, or the user-supplied value is
> invalid, the MTU for the port defaults to standard Ethernet MTU (i.e.
> 1500B).
> 
> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> ---
>  INSTALL.DPDK.md   |  59 +++++++++-
>  NEWS              |   2 +
>  lib/netdev-dpdk.c | 347
> +++++++++++++++++++++++++++++++++++++++++------------- 3 files
> changed, 328 insertions(+), 80 deletions(-)
> 
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 96b686c..64ccd15 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -859,10 +859,61 @@ by adding the following string:
>  to <interface> sections of all network devices used by DPDK.
> Parameter 'N' determines how many queues can be used by the guest.
>  
> +
> +Jumbo Frames
> +------------
> +
> +Support for Jumbo Frames may be enabled at run-time for DPDK-type
> ports. +
> +To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
> ovs-vsctl +'add-port' command-line, along with the required MTU for
> the port. +e.g.
> +
> +     ```
> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> options:dpdk-mtu=9000
> +     ```

Any particular reason to call it dpdk-mtu instead of mtu?



> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
> segments are +increased, such that a full Jumbo Frame may be
> accommodated inside a single +mbuf segment. Once set, the MTU for a
> DPDK port is immutable. +

I think we need some protection here because if someone tries to
change that, it will segfault PMD threads.


> +Jumbo frame support has been validated against 13312B frames, using
> the +DPDK `igb_uio` driver, but larger frames and other DPDK NIC
> drivers may +theoretically be supported. Supported port types
> excludes vHost-Cuse ports, as +this feature is pending deprecation.
> +
> +
> +vHost Ports and Jumbo Frames
> +----------------------------
> +Jumbo frame support is available for DPDK vHost-User ports only.
> Some additional +configuration is needed to take advantage of this
> feature: +
> +  1. `mergeable buffers` must be enabled for vHost ports, as
> demonstrated in
> +      the QEMU command line snippet below:
> +
> +      ```
> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
> +      '-device
> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
> +      ```
> +
> +  2. Where virtio devices are bound to the Linux kernel driver in a
> guest
> +     environment (i.e. interfaces are not bound to an in-guest DPDK
> driver), the
> +     MTU of those logical network interfaces must also be increased.
> This
> +     avoids segmentation of Jumbo Frames in the guest. Note that
> 'MTU' refers
> +     to the length of the IP packet only, and not that of the entire
> frame. +
> +     e.g. To calculate the exact MTU of a standard IPv4 frame,
> subtract the L2
> +     header and CRC lengths (i.e. 18B) from the max supported frame
> size.
> +     So, to set the MTU for a 13312B Jumbo Frame:
> +
> +      ```
> +      ifconfig eth1 mtu 13294
> +      ```
> +
> +
>  Restrictions:
>  -------------
>  
> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this
> issue.
>    - Currently DPDK port does not make use any offload functionality.
>    - DPDK-vHost support works with 1G huge pages.
>  
> @@ -903,6 +954,12 @@ Restrictions:
>      the next release of DPDK (which includes the above patch) is
> available and integrated into OVS.
>  
> +  Jumbo Frames:
> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
> source of
> +     this issue is currently being investigated.
> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
> ports. +
> +
>  Bug Reporting:
>  --------------
>  
> diff --git a/NEWS b/NEWS
> index 5c18867..cd563e0 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>       abstractions, such as virtual L2 and L3 overlays and security
> groups.
>     - RHEL packaging:
>       * DPDK ports may now be created via network scripts (see
> README.RHEL).
> +   - netdev-dpdk:
> +     * Add Jumbo Frame Support  
>  
>  v2.4.0 - 20 Aug 2015
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index de7e488..76a5dcc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20); #define OVS_CACHE_LINE_SIZE
> CACHE_LINE_SIZE #define OVS_VPORT_DPDK "ovs_dpdk"
>  
> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
> +
>  /*
>   * need to reserve tons of extra space in the mbufs so we can align
> the
>   * DMA addresses to 4KB.
>   * The minimum mbuf size is limited to avoid scatter behaviour and
> drop in
>   * performance for standard Ethernet MTU.
>   */
> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
> -                              + sizeof(struct dp_packet) \
> -                              + RTE_PKTMBUF_HEADROOM)
> -#define MBUF_SIZE_DRIVER     (2048                       \
> -                              + sizeof (struct rte_mbuf) \
> -                              + RTE_PKTMBUF_HEADROOM)
> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu),
> MBUF_SIZE_DRIVER) +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) +
> ETHER_HDR_LEN + ETHER_CRC_LEN) +#define FRAME_LEN_TO_MTU(frame_len)
> ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN) +#define
> MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
> +                                    + sizeof(struct dp_packet)   \
> +                                    + RTE_PKTMBUF_HEADROOM)
> +
> +/* This value should be specified as a multiple of the DPDK NIC
> driver's
> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
> + */
> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>  
>  /* Max and min number of packets in the mempool.  OVS tries to
> allocate a
>   * mempool with MAX_NB_MBUF: if this fails (because the system
> doesn't have @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20); #define MIN_NB_MBUF          (4096 * 4)
>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>  
> +#define DPDK_VLAN_TAG_LEN    4

Why not use VLAN_HEADER_LEN ?
We already include packets.h anyway.



>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>  BUILD_ASSERT_DECL(MAX_NB_MBUF %
> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0); 
> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>          .header_split   = 0, /* Header Split disabled */
>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>          .hw_strip_crc   = 0,
>      },
>      .rx_adv_conf = {
> @@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>      return class->construct == netdev_dpdk_construct;
>  }
>  
> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
> igb_uio).
> + * If 'frame_len' is not a multiple of this value, insufficient
> buffers are
> + * allocated to accomodate the packet in its entirety. Furthermore,
> the igb_uio
> + * driver needs to ensure that there is also sufficient space in the
> Rx buffer
> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer
> is too
> + * small, then the driver enables scatter RX behaviour, which reduces
> + * performance. To prevent this, use a buffer size that is closest to
> + * 'frame_len', but which satisfies the aforementioned criteria.
> + */
> +static uint32_t
> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
> +{
> +    struct rte_eth_dev_info info;
> +    uint32_t buf_size;
> +    /* XXX: This is a workaround for DPDK v2.2, and needs to be
> refactored with a
> +     * future DPDK release. */
> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
> +
> +    if(netdev->type == DPDK_DEV_ETH) {
> +        rte_eth_dev_info_get(netdev->port_id, &info);
> +        buf_size = (info.min_rx_bufsize == 0) ?
> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
> +                   info.min_rx_bufsize;
> +    } else {
> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
> +    }
> +
> +    if(len % buf_size) {
> +        len = buf_size * ((len/buf_size) + 1);
> +    }
> +
> +    return len;
> +}
> +
>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be
> used
>   * for all other segments data, bss and text. */
>  
> @@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>  }
>  
>  static void
> -__rte_pktmbuf_init(struct rte_mempool *mp,
> -                   void *opaque_arg OVS_UNUSED,
> -                   void *_m,
> -                   unsigned i OVS_UNUSED)
> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>  {
> -    struct rte_mbuf *m = _m;
> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
> +    struct rte_pktmbuf_pool_private default_mbp_priv;
> +    uint16_t roomsz;
>  
>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>  
> +    /* if no structure is provided, assume no mbuf private area */
> +
> +    user_mbp_priv = opaque_arg;
> +    if (user_mbp_priv == NULL) {
> +        default_mbp_priv.mbuf_priv_size = 0;
> +        if (mp->elt_size > sizeof(struct dp_packet)) {
> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
> +        } else {
> +            roomsz = 0;
> +        }
> +        default_mbp_priv.mbuf_data_room_size = roomsz;
> +        user_mbp_priv = &default_mbp_priv;
> +    }
> +
> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
> +        user_mbp_priv->mbuf_data_room_size +
> +        user_mbp_priv->mbuf_priv_size);
> +
> +    mbp_priv = rte_mempool_get_priv(mp);
> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
> +}
> +
> +/* Initialise some fields in the mbuf structure that are not
> modified by the
> + * user once created (origin pool, buffer start address, etc.*/
> +static void
> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
> +                       void *opaque_arg OVS_UNUSED,
> +                       void *_m,
> +                       unsigned i OVS_UNUSED)
> +{
> +    struct rte_mbuf *m = _m;
> +    uint32_t buf_size, buf_len, priv_size;
> +
> +    priv_size = rte_pktmbuf_priv_size(mp);
> +    buf_size = sizeof(struct dp_packet) + priv_size;
> +    buf_len = rte_pktmbuf_data_room_size(mp);
> +
> +    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
> priv_size);
> +    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
> +    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
> +
>      memset(m, 0, mp->elt_size);
>  
> -    /* start of buffer is just after mbuf structure */
> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
> -                    sizeof(struct dp_packet);
> +    /* start of buffer is after dp_packet structure and priv data */
> +    m->priv_size = priv_size;
> +    m->buf_addr = (char *)m + buf_size;
> +    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>      m->buf_len = (uint16_t)buf_len;
>  
>      /* keep some headroom between start of buffer and data */
> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
> +    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
> (uint16_t)m->buf_len); 
>      /* init some constant fields */
>      m->pool = mp;
> @@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>  {
>      struct rte_mbuf *m = _m;
>  
> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>  
>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>  }
> @@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex) struct dpdk_mp *dmp = NULL;
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      unsigned mp_size;
> +    struct rte_pktmbuf_pool_private mbp_priv;
>  
>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
> @@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex) dmp->socket_id = socket_id;
>      dmp->mtu = mtu;
>      dmp->refcount = 1;
> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
> RTE_PKTMBUF_HEADROOM;
> +    mbp_priv.mbuf_priv_size = 0;
>  
>      mp_size = MAX_NB_MBUF;
>      do {
> @@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex) return NULL;
>          }
>  
> -        dmp->mp = rte_mempool_create(mp_name, mp_size,
> MBUF_SIZE(mtu),
> +        dmp->mp = rte_mempool_create(mp_name, mp_size,
> MBUF_SEGMENT_SIZE(mtu), MP_CACHE_SZ,
>                                       sizeof(struct
> rte_pktmbuf_pool_private),
> -                                     rte_pktmbuf_pool_init, NULL,
> +                                     ovs_rte_pktmbuf_pool_init,
> &mbp_priv, ovs_rte_pktmbuf_init, NULL,
>                                       socket_id, 0);
>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
> MIN_NB_MBUF); @@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct
> netdev_dpdk *dev, int n_rxq, int n_txq) {
>      int diag = 0;
>      int i;
> +    struct rte_eth_conf conf = port_conf;
>  
>      /* A device may report more queues than it makes available (this
> has
>       * been observed for Intel xl710, which reserves some of them for
> @@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int n_txq) VLOG_INFO("Retrying setup with (rxq:%d
> txq:%d)", n_rxq, n_txq); }
>  
> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
> &port_conf);
> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        }
>

Can we add an "else" here?  Before port_conf was initializing
jumbo_frame=0 for instance, but now we are changing it so I wonder if I
set up a port with jumbo and another without, the last one would get
that bit set. 



> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
> &conf); if (diag) {
>              break;
>          }
> @@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
> int port_no, struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>      int sid;
>      int err = 0;
> +    uint32_t buf_size;
>  
>      ovs_mutex_init(&netdev->mutex);
>      ovs_mutex_lock(&netdev->mutex);
> @@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_,
> unsigned int port_no, netdev->port_id = port_no;
>      netdev->type = type;
>      netdev->flags = 0;
> +
> +    /* Initialize port's MTU and frame len to the default Ethernet
> values.
> +     * Larger, user-specified (jumbo) frame buffers are accommodated
> in
> +     * netdev_dpdk_set_config.
> +     */
> +    netdev->max_packet_len = ETHER_MAX_LEN;
>      netdev->mtu = ETHER_MTU;
> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>  
> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
> FRAME_LEN_TO_MTU(buf_size)); if (!netdev->dpdk_mp) {
>          err = ENOMEM;
>          goto unlock;
> @@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
> char prefix[], return 0;
>  }
>  
> +static void
> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
> +{
> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
> +    char *end_ptr = NULL;
> +    int local_mtu;
> +
> +    if(mtu_str) {
> +        local_mtu = strtoul(mtu_str, &end_ptr, 0);
> +    }
> +    if(!mtu_str || local_mtu < ETHER_MTU ||
> +                   local_mtu >
> FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
> +                   *end_ptr != '\0') {
> +        local_mtu = ETHER_MTU;
> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter -
> defaulting to %d.\n",
> +                   local_mtu);
> +    }
> +
> +    *mtu = local_mtu;
> +}
> +
>  static int
>  vhost_construct_helper(struct netdev *netdev_)
> OVS_REQUIRES(dpdk_mutex) {
> @@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
> *netdev_, struct smap *args) smap_add_format(args,
> "configured_rx_queues", "%d", netdev_->n_rxq); smap_add_format(args,
> "requested_tx_queues", "%d", netdev_->n_txq); smap_add_format(args,
> "configured_tx_queues", "%d", dev->real_n_txq);
> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>      ovs_mutex_unlock(&dev->mutex);
>  
>      return 0;
>  }
>  
> +/* Set the mtu of DPDK_DEV_ETH ports */
> +static int
> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int old_mtu, err;
> +    uint32_t buf_size;
> +    int dpdk_mtu;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
> +
> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    rte_eth_dev_stop(dev->port_id);
> +
> +    old_mtu = dev->mtu;
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> +    err = dpdk_eth_dev_init(dev);
> +    if (err) {
> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to
> last known "
> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
> +        dpdk_mp_put(mp);
> +        dev->mtu = old_mtu;
> +        dev->dpdk_mp = old_mp;
> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        dpdk_eth_dev_init(dev);
> +        goto out;
> +    } else {
> +        dpdk_mp_put(old_mp);
> +        netdev_change_seq_changed(netdev);
> +    }
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +
> +static int
> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap
> *args) +{
> +    int mtu;
> +
> +    dpdk_dev_parse_mtu(args, &mtu);
> +
> +    return netdev_dpdk_set_mtu(netdev_, mtu);
> +}
> +
>  static int
>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>  {
> @@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev
> *netdev, int *mtup) 
>      return 0;
>  }
> -
> -static int
> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> -{
> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    int old_mtu, err;
> -    struct dpdk_mp *old_mp;
> -    struct dpdk_mp *mp;
> -
> -    ovs_mutex_lock(&dpdk_mutex);
> -    ovs_mutex_lock(&dev->mutex);
> -    if (dev->mtu == mtu) {
> -        err = 0;
> -        goto out;
> -    }
> -
> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> -    if (!mp) {
> -        err = ENOMEM;
> -        goto out;
> -    }
> -
> -    rte_eth_dev_stop(dev->port_id);
> -
> -    old_mtu = dev->mtu;
> -    old_mp = dev->dpdk_mp;
> -    dev->dpdk_mp = mp;
> -    dev->mtu = mtu;
> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> -
> -    err = dpdk_eth_dev_init(dev);
> -    if (err) {
> -        dpdk_mp_put(mp);
> -        dev->mtu = old_mtu;
> -        dev->dpdk_mp = old_mp;
> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> -        dpdk_eth_dev_init(dev);
> -        goto out;
> -    }
> -
> -    dpdk_mp_put(old_mp);
> -    netdev_change_seq_changed(netdev);
> -out:
> -    ovs_mutex_unlock(&dev->mutex);
> -    ovs_mutex_unlock(&dpdk_mutex);
> -    return err;
> -}
> -
>  static int
>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>  
> @@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
> *netdev_, struct smap *args) smap_add_format(args, "numa_id", "%d",
> rte_eth_dev_socket_id(dev->port_id)); smap_add_format(args,
> "driver_name", "%s", dev_info.driver_name); smap_add_format(args,
> "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
> -    smap_add_format(args, "max_rx_pktlen", "%u",
> dev_info.max_rx_pktlen);
> +    smap_add_format(args, "max_rx_pktlen", "%u",
> dev->max_packet_len); smap_add_format(args, "max_rx_queues", "%u",
> dev_info.max_rx_queues); smap_add_format(args, "max_tx_queues", "%u",
> dev_info.max_tx_queues); smap_add_format(args, "max_mac_addrs", "%u",
> dev_info.max_mac_addrs); @@ -1904,6 +2039,51 @@
> dpdk_vhost_user_class_init(void) return 0;
>  }
>  
> +/* Set the mtu of DPDK_DEV_VHOST ports */
> +static int
> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int err = 0;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    mp = dpdk_mp_get(dev->socket_id, mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> +    dpdk_mp_put(old_mp);
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +
> +static int
> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct
> smap *args) +{
> +    int mtu;
> +
> +    dpdk_dev_parse_mtu(args, &mtu);
> +
> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
> +}
> +
>  static void
>  dpdk_common_init(void)
>  {
> @@ -2040,8 +2220,9 @@ unlock_dpdk:
>      return err;
>  }
>  
> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
> SEND, \
> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS,
> RXQ_RECV)          \ +#define NETDEV_DPDK_CLASS(NAME, INIT,
> CONSTRUCT, DESTRUCT, SET_CONFIG, \
> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS,
> GET_FEATURES,   \
> +        GET_STATUS,
> RXQ_RECV)                                          \
> {                                                             \
> NAME,                                                     \
> INIT,                       /* init */                    \ @@
> -2053,7 +2234,7 @@ unlock_dpdk:
> DESTRUCT,                                                 \
> netdev_dpdk_dealloc,                                      \
> netdev_dpdk_get_config,                                   \
> -    NULL,                       /* netdev_dpdk_set_config */  \
> +    SET_CONFIG,                                               \
>      NULL,                       /* get_tunnel_config */       \
>      NULL,                       /* build header */            \
>      NULL,                       /* push header */             \
> @@ -2067,7 +2248,7 @@ unlock_dpdk:
>      netdev_dpdk_set_etheraddr,                                \
>      netdev_dpdk_get_etheraddr,                                \
>      netdev_dpdk_get_mtu,                                      \
> -    netdev_dpdk_set_mtu,                                      \
> +    SET_MTU,                                                  \
>      netdev_dpdk_get_ifindex,                                  \
>      GET_CARRIER,                                              \
>      netdev_dpdk_get_carrier_resets,                           \
> @@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>          NULL,
>          netdev_dpdk_construct,
>          netdev_dpdk_destruct,
> +        netdev_dpdk_set_config,
>          netdev_dpdk_set_multiq,
>          netdev_dpdk_eth_send,
> +        netdev_dpdk_set_mtu,
>          netdev_dpdk_get_carrier,
>          netdev_dpdk_get_stats,
>          netdev_dpdk_get_features,
> @@ -2227,8 +2410,10 @@ static const struct netdev_class
> dpdk_ring_class = NULL,
>          netdev_dpdk_ring_construct,
>          netdev_dpdk_destruct,
> +        netdev_dpdk_set_config,
>          netdev_dpdk_set_multiq,
>          netdev_dpdk_ring_send,
> +        netdev_dpdk_set_mtu,
>          netdev_dpdk_get_carrier,
>          netdev_dpdk_get_stats,
>          netdev_dpdk_get_features,
> @@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
> dpdk_vhost_cuse_class = dpdk_vhost_cuse_class_init,
>          netdev_dpdk_vhost_cuse_construct,
>          netdev_dpdk_vhost_destruct,
> +        NULL,
>          netdev_dpdk_vhost_set_multiq,
>          netdev_dpdk_vhost_send,
> +        NULL,
>          netdev_dpdk_vhost_get_carrier,
>          netdev_dpdk_vhost_get_stats,
>          NULL,
> @@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
> dpdk_vhost_user_class = dpdk_vhost_user_class_init,
>          netdev_dpdk_vhost_user_construct,
>          netdev_dpdk_vhost_destruct,
> +        netdev_dpdk_vhost_set_config,
>          netdev_dpdk_vhost_set_multiq,
>          netdev_dpdk_vhost_send,
> +        netdev_dpdk_vhost_set_mtu,
>          netdev_dpdk_vhost_get_carrier,
>          netdev_dpdk_vhost_get_stats,
>          NULL,

Other parts look good to me, thanks Mark!

I tried this patch with small and jumbo (8000 bytes) frames from
outside to DPDK port then vhost-user and then to guest using a traffic
generator.  I could see the packets inside of the VM just fine.
I haven't tried with real traffic yet though.

Thanks,
Flavio Leitner Feb. 4, 2016, 8:51 p.m. UTC | #4
Some minor comments just in case you decide to spin another version:

On Mon,  1 Feb 2016 10:18:54 +0000
Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:


[...]
> +static uint32_t
> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
> +{
> +    struct rte_eth_dev_info info;
> +    uint32_t buf_size;
> +    /* XXX: This is a workaround for DPDK v2.2, and needs to be
> refactored with a
> +     * future DPDK release. */
> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
> +
> +    if(netdev->type == DPDK_DEV_ETH) {
         ^ space here


> +        rte_eth_dev_info_get(netdev->port_id, &info);
> +        buf_size = (info.min_rx_bufsize == 0) ?
> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
> +                   info.min_rx_bufsize;
> +    } else {
> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
> +    }
> +
> +    if(len % buf_size) {

also here

> +        len = buf_size * ((len/buf_size) + 1);
> +    }
> +
> +    return len;
> +}
> +
>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be
> used
>   * for all other segments data, bss and text. */
>  
> @@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>  }
>  
>  static void
> -__rte_pktmbuf_init(struct rte_mempool *mp,
> -                   void *opaque_arg OVS_UNUSED,
> -                   void *_m,
> -                   unsigned i OVS_UNUSED)
> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>  {
> -    struct rte_mbuf *m = _m;
> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
> +    struct rte_pktmbuf_pool_private default_mbp_priv;
> +    uint16_t roomsz;
>  
>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>  
> +    /* if no structure is provided, assume no mbuf private area */
> +
> +    user_mbp_priv = opaque_arg;
> +    if (user_mbp_priv == NULL) {
> +        default_mbp_priv.mbuf_priv_size = 0;
> +        if (mp->elt_size > sizeof(struct dp_packet)) {
> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
> +        } else {
> +            roomsz = 0;
> +        }
> +        default_mbp_priv.mbuf_data_room_size = roomsz;
> +        user_mbp_priv = &default_mbp_priv;
> +    }
> +
> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
> +        user_mbp_priv->mbuf_data_room_size +
> +        user_mbp_priv->mbuf_priv_size);
> +
> +    mbp_priv = rte_mempool_get_priv(mp);
> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
> +}
> +
> +/* Initialise some fields in the mbuf structure that are not
> modified by the
> + * user once created (origin pool, buffer start address, etc.*/
> +static void
> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
> +                       void *opaque_arg OVS_UNUSED,
> +                       void *_m,
> +                       unsigned i OVS_UNUSED)
> +{
> +    struct rte_mbuf *m = _m;
> +    uint32_t buf_size, buf_len, priv_size;
> +
> +    priv_size = rte_pktmbuf_priv_size(mp);
> +    buf_size = sizeof(struct dp_packet) + priv_size;
> +    buf_len = rte_pktmbuf_data_room_size(mp);
> +
> +    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
> priv_size);
> +    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
> +    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
> +
>      memset(m, 0, mp->elt_size);
>  
> -    /* start of buffer is just after mbuf structure */
> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
> -                    sizeof(struct dp_packet);
> +    /* start of buffer is after dp_packet structure and priv data */
> +    m->priv_size = priv_size;
> +    m->buf_addr = (char *)m + buf_size;
> +    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>      m->buf_len = (uint16_t)buf_len;
>  
>      /* keep some headroom between start of buffer and data */
> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
> +    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
> (uint16_t)m->buf_len); 
>      /* init some constant fields */
>      m->pool = mp;
> @@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>  {
>      struct rte_mbuf *m = _m;
>  
> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>  
>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>  }
> @@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex) struct dpdk_mp *dmp = NULL;
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      unsigned mp_size;
> +    struct rte_pktmbuf_pool_private mbp_priv;
>  
>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
> @@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex) dmp->socket_id = socket_id;
>      dmp->mtu = mtu;
>      dmp->refcount = 1;
> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
> RTE_PKTMBUF_HEADROOM;
> +    mbp_priv.mbuf_priv_size = 0;
>  
>      mp_size = MAX_NB_MBUF;
>      do {
> @@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex) return NULL;
>          }
>  
> -        dmp->mp = rte_mempool_create(mp_name, mp_size,
> MBUF_SIZE(mtu),
> +        dmp->mp = rte_mempool_create(mp_name, mp_size,
> MBUF_SEGMENT_SIZE(mtu), MP_CACHE_SZ,
>                                       sizeof(struct
> rte_pktmbuf_pool_private),
> -                                     rte_pktmbuf_pool_init, NULL,
> +                                     ovs_rte_pktmbuf_pool_init,
> &mbp_priv, ovs_rte_pktmbuf_init, NULL,
>                                       socket_id, 0);
>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
> MIN_NB_MBUF); @@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct
> netdev_dpdk *dev, int n_rxq, int n_txq) {
>      int diag = 0;
>      int i;
> +    struct rte_eth_conf conf = port_conf;
>  
>      /* A device may report more queues than it makes available (this
> has
>       * been observed for Intel xl710, which reserves some of them for
> @@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int n_txq) VLOG_INFO("Retrying setup with (rxq:%d
> txq:%d)", n_rxq, n_txq); }
>  
> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
> &port_conf);
> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {

space needed

> +            conf.rxmode.jumbo_frame =
> NETDEV_DPDK_JUMBO_FRAME_ENABLED;
> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        }
> +
> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
> &conf); if (diag) {
>              break;
>          }
> @@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
> int port_no, struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>      int sid;
>      int err = 0;
> +    uint32_t buf_size;
>  
>      ovs_mutex_init(&netdev->mutex);
>      ovs_mutex_lock(&netdev->mutex);
> @@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_,
> unsigned int port_no, netdev->port_id = port_no;
>      netdev->type = type;
>      netdev->flags = 0;
> +
> +    /* Initialize port's MTU and frame len to the default Ethernet
> values.
> +     * Larger, user-specified (jumbo) frame buffers are accommodated
> in
> +     * netdev_dpdk_set_config.
> +     */
> +    netdev->max_packet_len = ETHER_MAX_LEN;
>      netdev->mtu = ETHER_MTU;
> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>  
> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
> FRAME_LEN_TO_MTU(buf_size)); if (!netdev->dpdk_mp) {
>          err = ENOMEM;
>          goto unlock;
> @@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
> char prefix[], return 0;
>  }
>  
> +static void
> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
> +{
> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
> +    char *end_ptr = NULL;
> +    int local_mtu;
> +
> +    if(mtu_str) {
> +        local_mtu = strtoul(mtu_str, &end_ptr, 0);
> +    }
> +    if(!mtu_str || local_mtu < ETHER_MTU ||
> +                   local_mtu >
> FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
> +                   *end_ptr != '\0') {

This needs to align with !mtu_str

Thanks,
Daniele Di Proietto Feb. 4, 2016, 10:21 p.m. UTC | #5
Hi Mark,

This patch besides adding Jumbo Frame support also cleans up
the mbuf initialization (by changing the macros, adding
dpdk_buf_size, and rewriting __ovs_rte_pktmbuf_init), so thanks
for this.  I think it makes sense to split the patch in two:
one that does the clenup, and one that allows configuring the
MTU.

I agree with Flavio's comments as well, more inline

Thanks

On 01/02/2016 02:18, "Mark Kavanagh" <mark.b.kavanagh@intel.com> wrote:

>Add support for Jumbo Frames to DPDK-enabled port types,
>using single-segment-mbufs.
>
>Using this approach, the amount of memory allocated for each mbuf
>to store frame data is increased to a value greater than 1518B
>(typical Ethernet maximum frame length). The increased space
>available in the mbuf means that an entire Jumbo Frame can be carried
>in a single mbuf, as opposed to partitioning it across multiple mbuf
>segments.
>
>The amount of space allocated to each mbuf to hold frame data is
>defined dynamically by the user when adding a DPDK port to a bridge.
>If an MTU value is not supplied, or the user-supplied value is invalid,
>the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>
>Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>---
> INSTALL.DPDK.md   |  59 +++++++++-
> NEWS              |   2 +
> lib/netdev-dpdk.c | 347
>+++++++++++++++++++++++++++++++++++++++++-------------
> 3 files changed, 328 insertions(+), 80 deletions(-)
>
>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>index 96b686c..64ccd15 100644
>--- a/INSTALL.DPDK.md
>+++ b/INSTALL.DPDK.md
>@@ -859,10 +859,61 @@ by adding the following string:
> to <interface> sections of all network devices used by DPDK. Parameter
>'N'
> determines how many queues can be used by the guest.
> 
>+
>+Jumbo Frames
>+------------
>+
>+Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>+
>+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
>ovs-vsctl
>+'add-port' command-line, along with the required MTU for the port.
>+e.g.
>+
>+     ```
>+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>options:dpdk-mtu=9000
>+     ```
>+
>+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments
>are
>+increased, such that a full Jumbo Frame may be accommodated inside a
>single
>+mbuf segment. Once set, the MTU for a DPDK port is immutable.

Why is it immutable?

>+
>+Jumbo frame support has been validated against 13312B frames, using the
>+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>+theoretically be supported. Supported port types excludes vHost-Cuse
>ports, as
>+this feature is pending deprecation.
>+
>+
>+vHost Ports and Jumbo Frames
>+----------------------------
>+Jumbo frame support is available for DPDK vHost-User ports only. Some
>additional
>+configuration is needed to take advantage of this feature:
>+
>+  1. `mergeable buffers` must be enabled for vHost ports, as
>demonstrated in
>+      the QEMU command line snippet below:
>+
>+      ```
>+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>+      '-device 
>virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>+      ```
>+
>+  2. Where virtio devices are bound to the Linux kernel driver in a guest
>+     environment (i.e. interfaces are not bound to an in-guest DPDK
>driver), the
>+     MTU of those logical network interfaces must also be increased. This
>+     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU'
>refers
>+     to the length of the IP packet only, and not that of the entire
>frame.
>+
>+     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract
>the L2
>+     header and CRC lengths (i.e. 18B) from the max supported frame size.
>+     So, to set the MTU for a 13312B Jumbo Frame:
>+
>+      ```
>+      ifconfig eth1 mtu 13294
>+      ```
>+
>+
> Restrictions:
> -------------
> 
>-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>   - Currently DPDK port does not make use any offload functionality.
>   - DPDK-vHost support works with 1G huge pages.
> 
>@@ -903,6 +954,12 @@ Restrictions:
>     the next release of DPDK (which includes the above patch) is
>available and
>     integrated into OVS.
> 
>+  Jumbo Frames:
>+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
>source of
>+     this issue is currently being investigated.
>+  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
>ports.
>+
>+
> Bug Reporting:
> --------------
> 
>diff --git a/NEWS b/NEWS
>index 5c18867..cd563e0 100644
>--- a/NEWS
>+++ b/NEWS
>@@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>      abstractions, such as virtual L2 and L3 overlays and security
>groups.
>    - RHEL packaging:
>      * DPDK ports may now be created via network scripts (see
>README.RHEL).
>+   - netdev-dpdk:
>+     * Add Jumbo Frame Support

I don't think we should add this to 2.5

> 
> 
> v2.4.0 - 20 Aug 2015
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>index de7e488..76a5dcc 100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>VLOG_RATE_LIMIT_INIT(5, 20);
> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
> #define OVS_VPORT_DPDK "ovs_dpdk"
> 
>+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1

I don't see particular value in the above #define.

>+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>+
> /*
>  * need to reserve tons of extra space in the mbufs so we can align the
>  * DMA addresses to 4KB.
>  * The minimum mbuf size is limited to avoid scatter behaviour and drop
>in
>  * performance for standard Ethernet MTU.
>  */
>-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>-                              + sizeof(struct dp_packet) \
>-                              + RTE_PKTMBUF_HEADROOM)
>-#define MBUF_SIZE_DRIVER     (2048                       \
>-                              + sizeof (struct rte_mbuf) \
>-                              + RTE_PKTMBUF_HEADROOM)
>-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN +
>ETHER_CRC_LEN)
>+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN -
>ETHER_CRC_LEN)
>+#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>+                                    + sizeof(struct dp_packet)   \
>+                                    + RTE_PKTMBUF_HEADROOM)
>+
>+/* This value should be specified as a multiple of the DPDK NIC driver's
>+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>+ */
>+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
> 
> /* Max and min number of packets in the mempool.  OVS tries to allocate a
>  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't
>have
>@@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>VLOG_RATE_LIMIT_INIT(5, 20);
> #define MIN_NB_MBUF          (4096 * 4)
> #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
> 
>+#define DPDK_VLAN_TAG_LEN    4
>+
> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF)
>== 0);
> 
>@@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>         .header_split   = 0, /* Header Split disabled */
>         .hw_ip_checksum = 0, /* IP checksum offload disabled */
>         .hw_vlan_filter = 0, /* VLAN filtering disabled */
>-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>         .hw_strip_crc   = 0,
>     },
>     .rx_adv_conf = {
>@@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>     return class->construct == netdev_dpdk_construct;
> }
> 
>+/* DPDK NIC drivers allocate RX buffers at a particular granularity
>+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
>igb_uio).
>+ * If 'frame_len' is not a multiple of this value, insufficient buffers
>are
>+ * allocated to accomodate the packet in its entirety. Furthermore, the
>igb_uio
>+ * driver needs to ensure that there is also sufficient space in the Rx
>buffer
>+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is
>too
>+ * small, then the driver enables scatter RX behaviour, which reduces
>+ * performance. To prevent this, use a buffer size that is closest to
>+ * 'frame_len', but which satisfies the aforementioned criteria.
>+ */
>+static uint32_t
>+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>+{
>+    struct rte_eth_dev_info info;
>+    uint32_t buf_size;
>+    /* XXX: This is a workaround for DPDK v2.2, and needs to be
>refactored with a
>+     * future DPDK release. */

Could you elaborate on that?

>+    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>+
>+    if(netdev->type == DPDK_DEV_ETH) {
>+        rte_eth_dev_info_get(netdev->port_id, &info);
>+        buf_size = (info.min_rx_bufsize == 0) ?
>+                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>+                   info.min_rx_bufsize;
>+    } else {
>+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>+    }
>+
>+    if(len % buf_size) {
>+        len = buf_size * ((len/buf_size) + 1);
>+    }

I think this looks better with the ROUND_UP macro.

>+
>+    return len;
>+}
>+
> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>  * for all other segments data, bss and text. */
> 
>@@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
> }
> 
> static void
>-__rte_pktmbuf_init(struct rte_mempool *mp,
>-                   void *opaque_arg OVS_UNUSED,
>-                   void *_m,
>-                   unsigned i OVS_UNUSED)
>+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
> {
>-    struct rte_mbuf *m = _m;
>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>+    struct rte_pktmbuf_pool_private default_mbp_priv;
>+    uint16_t roomsz;
> 
>     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
> 
>+    /* if no structure is provided, assume no mbuf private area */
>+
>+    user_mbp_priv = opaque_arg;
>+    if (user_mbp_priv == NULL) {
>+        default_mbp_priv.mbuf_priv_size = 0;
>+        if (mp->elt_size > sizeof(struct dp_packet)) {
>+            roomsz = mp->elt_size - sizeof(struct dp_packet);
>+        } else {
>+            roomsz = 0;
>+        }
>+        default_mbp_priv.mbuf_data_room_size = roomsz;
>+        user_mbp_priv = &default_mbp_priv;
>+    }
>+
>+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>+        user_mbp_priv->mbuf_data_room_size +
>+        user_mbp_priv->mbuf_priv_size);
>+
>+    mbp_priv = rte_mempool_get_priv(mp);
>+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>+}
>+
>+/* Initialise some fields in the mbuf structure that are not modified by
>the
>+ * user once created (origin pool, buffer start address, etc.*/
>+static void
>+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>+                       void *opaque_arg OVS_UNUSED,
>+                       void *_m,
>+                       unsigned i OVS_UNUSED)
>+{
>+    struct rte_mbuf *m = _m;
>+    uint32_t buf_size, buf_len, priv_size;
>+
>+    priv_size = rte_pktmbuf_priv_size(mp);
>+    buf_size = sizeof(struct dp_packet) + priv_size;
>+    buf_len = rte_pktmbuf_data_room_size(mp);
>+
>+    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>priv_size);
>+    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>+    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>+
>     memset(m, 0, mp->elt_size);
> 
>-    /* start of buffer is just after mbuf structure */
>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>-                    sizeof(struct dp_packet);
>+    /* start of buffer is after dp_packet structure and priv data */
>+    m->priv_size = priv_size;
>+    m->buf_addr = (char *)m + buf_size;
>+    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>     m->buf_len = (uint16_t)buf_len;
> 
>     /* keep some headroom between start of buffer and data */
>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>+    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
> 
>     /* init some constant fields */
>     m->pool = mp;
>@@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
> {
>     struct rte_mbuf *m = _m;
> 
>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
> 
>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
> }
>@@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
>OVS_REQUIRES(dpdk_mutex)
>     struct dpdk_mp *dmp = NULL;
>     char mp_name[RTE_MEMPOOL_NAMESIZE];
>     unsigned mp_size;
>+    struct rte_pktmbuf_pool_private mbp_priv;
> 
>     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>@@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
>OVS_REQUIRES(dpdk_mutex)
>     dmp->socket_id = socket_id;
>     dmp->mtu = mtu;
>     dmp->refcount = 1;
>+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>RTE_PKTMBUF_HEADROOM;
>+    mbp_priv.mbuf_priv_size = 0;
> 
>     mp_size = MAX_NB_MBUF;
>     do {
>@@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
>OVS_REQUIRES(dpdk_mutex)
>             return NULL;
>         }
> 
>-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>+        dmp->mp = rte_mempool_create(mp_name, mp_size,
>MBUF_SEGMENT_SIZE(mtu),
>                                      MP_CACHE_SZ,
>                                      sizeof(struct
>rte_pktmbuf_pool_private),
>-                                     rte_pktmbuf_pool_init, NULL,
>+                                     ovs_rte_pktmbuf_pool_init,
>&mbp_priv,
>                                      ovs_rte_pktmbuf_init, NULL,
>                                      socket_id, 0);
>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>MIN_NB_MBUF);

Ok, this is quite intricated. I believe the reason that OVS used its own
__rte_pktmbuf_init() was that it wasn't possible to have custom metadata
in mbufs before DPDK 2.1.

If I'm not mistaken, with DPDK commit b507905ff407 there's a way to do that
without copying any code from DPDK with the following incremental (from
master)

---8<---

@@ -278,42 +272,12 @@ free_dpdk_buf(struct dp_packet *p)
 }
 
 static void
-__rte_pktmbuf_init(struct rte_mempool *mp,
-                   void *opaque_arg OVS_UNUSED,
-                   void *_m,
-                   unsigned i OVS_UNUSED)
-{
-    struct rte_mbuf *m = _m;
-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
-
-    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
-
-    memset(m, 0, mp->elt_size);
-
-    /* start of buffer is just after mbuf structure */
-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-                    sizeof(struct dp_packet);
-    m->buf_len = (uint16_t)buf_len;
-
-    /* keep some headroom between start of buffer and data */
-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
-
-    /* init some constant fields */
-    m->pool = mp;
-    m->nb_segs = 1;
-    m->port = 0xff;
-}
-
-static void
-ovs_rte_pktmbuf_init(struct rte_mempool *mp,
-                     void *opaque_arg OVS_UNUSED,
-                     void *_m,
-                     unsigned i OVS_UNUSED)
+ovs_rte_pktmbuf_init(struct rte_mempool *mp, void *opaque_arg,
+                     void *_m, unsigned i)
 {
     struct rte_mbuf *m = _m;
 
-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
+    rte_pktmbuf_init(mp, opaque_arg, _m, i);
 
     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
 }
@@ -339,15 +303,21 @@ dpdk_mp_get(int socket_id, int mtu)
OVS_REQUIRES(dpdk_mutex)
 
     mp_size = MAX_NB_MBUF;
     do {
+        struct rte_pktmbuf_pool_private mbuf_sizes;
+
         if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
                      dmp->mtu, dmp->socket_id, mp_size) < 0) {
             return NULL;
         }
 
+        mbuf_sizes.mbuf_priv_size = sizeof (struct dp_packet)
+                                    - sizeof (struct rte_mbuf);
+        mbuf_sizes.mbuf_data_room_size = MBUF_SIZE(mtu)
+                                         - sizeof(struct dp_packet);
         dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
                                      MP_CACHE_SZ,
                                      sizeof(struct
rte_pktmbuf_pool_private),
-                                     rte_pktmbuf_pool_init, NULL,
+                                     rte_pktmbuf_pool_init, &mbuf_sizes,
                                      ovs_rte_pktmbuf_init, NULL,
                                      socket_id, 0);
     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
MIN_NB_MBUF);

---8<---

I think this will make the patch cleaner. Do you think this will work with
the increased MTU as well?

>@@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int
>n_rxq, int n_txq)
> {
>     int diag = 0;
>     int i;
>+    struct rte_eth_conf conf = port_conf;
> 
>     /* A device may report more queues than it makes available (this has
>      * been observed for Intel xl710, which reserves some of them for
>@@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>int n_rxq, int n_txq)
>             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq,
>n_txq);
>         }
> 
>-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>&port_conf);
>+        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>+            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>+            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>+        }
>+
>+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>         if (diag) {
>             break;
>         }
>@@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int
>port_no,
>     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>     int sid;
>     int err = 0;
>+    uint32_t buf_size;
> 
>     ovs_mutex_init(&netdev->mutex);
>     ovs_mutex_lock(&netdev->mutex);
>@@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>int port_no,
>     netdev->port_id = port_no;
>     netdev->type = type;
>     netdev->flags = 0;
>+
>+    /* Initialize port's MTU and frame len to the default Ethernet
>values.
>+     * Larger, user-specified (jumbo) frame buffers are accommodated in
>+     * netdev_dpdk_set_config.
>+     */
>+    netdev->max_packet_len = ETHER_MAX_LEN;
>     netdev->mtu = ETHER_MTU;
>-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
> 
>-    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>+    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>FRAME_LEN_TO_MTU(buf_size));
>     if (!netdev->dpdk_mp) {
>         err = ENOMEM;
>         goto unlock;
>@@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
>char prefix[],
>     return 0;
> }
> 
>+static void
>+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>+{
>+    const char *mtu_str = smap_get(args, "dpdk-mtu");
>+    char *end_ptr = NULL;
>+    int local_mtu;
>+
>+    if(mtu_str) {
>+        local_mtu = strtoul(mtu_str, &end_ptr, 0);
>+    }
>+    if(!mtu_str || local_mtu < ETHER_MTU ||
>+                   local_mtu >
>FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
>+                   *end_ptr != '\0') {
>+        local_mtu = ETHER_MTU;
>+        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to
>%d.\n",
>+                   local_mtu);
>+    }
>+
>+    *mtu = local_mtu;
>+}
>+
> static int
> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
> {
>@@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
>*netdev_, struct smap *args)
>     smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>     smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>     smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
>+    smap_add_format(args, "mtu", "%d", dev->mtu);
>     ovs_mutex_unlock(&dev->mutex);
> 
>     return 0;
> }
> 
>+/* Set the mtu of DPDK_DEV_ETH ports */
>+static int
>+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>+{
>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>+    int old_mtu, err;
>+    uint32_t buf_size;
>+    int dpdk_mtu;
>+    struct dpdk_mp *old_mp;
>+    struct dpdk_mp *mp;
>+
>+    ovs_mutex_lock(&dpdk_mutex);
>+    ovs_mutex_lock(&dev->mutex);
>+    if (dev->mtu == mtu) {
>+        err = 0;
>+        goto out;
>+    }
>+
>+    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>+    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>+
>+    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>+    if (!mp) {
>+        err = ENOMEM;
>+        goto out;
>+    }
>+
>+    rte_eth_dev_stop(dev->port_id);
>+
>+    old_mtu = dev->mtu;
>+    old_mp = dev->dpdk_mp;
>+    dev->dpdk_mp = mp;
>+    dev->mtu = mtu;
>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>+
>+    err = dpdk_eth_dev_init(dev);
>+    if (err) {
>+        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>known "
>+                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>+        dpdk_mp_put(mp);
>+        dev->mtu = old_mtu;
>+        dev->dpdk_mp = old_mp;
>+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>+        dpdk_eth_dev_init(dev);
>+        goto out;
>+    } else {
>+        dpdk_mp_put(old_mp);
>+        netdev_change_seq_changed(netdev);
>+    }
>+out:
>+    ovs_mutex_unlock(&dev->mutex);
>+    ovs_mutex_unlock(&dpdk_mutex);
>+    return err;
>+}
>+
>+static int
>+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>+{
>+    int mtu;
>+
>+    dpdk_dev_parse_mtu(args, &mtu);
>+
>+    return netdev_dpdk_set_mtu(netdev_, mtu);
>+}
>+
> static int
> netdev_dpdk_get_numa_id(const struct netdev *netdev_)
> {
>@@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev,
>int *mtup)
> 
>     return 0;
> }
>-
>-static int
>-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>-{
>-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>-    int old_mtu, err;
>-    struct dpdk_mp *old_mp;
>-    struct dpdk_mp *mp;
>-
>-    ovs_mutex_lock(&dpdk_mutex);
>-    ovs_mutex_lock(&dev->mutex);
>-    if (dev->mtu == mtu) {
>-        err = 0;
>-        goto out;
>-    }
>-
>-    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>-    if (!mp) {
>-        err = ENOMEM;
>-        goto out;
>-    }
>-
>-    rte_eth_dev_stop(dev->port_id);
>-
>-    old_mtu = dev->mtu;
>-    old_mp = dev->dpdk_mp;
>-    dev->dpdk_mp = mp;
>-    dev->mtu = mtu;
>-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>-
>-    err = dpdk_eth_dev_init(dev);
>-    if (err) {
>-        dpdk_mp_put(mp);
>-        dev->mtu = old_mtu;
>-        dev->dpdk_mp = old_mp;
>-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>-        dpdk_eth_dev_init(dev);
>-        goto out;
>-    }
>-
>-    dpdk_mp_put(old_mp);
>-    netdev_change_seq_changed(netdev);
>-out:
>-    ovs_mutex_unlock(&dev->mutex);
>-    ovs_mutex_unlock(&dpdk_mutex);
>-    return err;
>-}
>-
> static int
> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
> 
>@@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
>*netdev_, struct smap *args)
>     smap_add_format(args, "numa_id", "%d",
>rte_eth_dev_socket_id(dev->port_id));
>     smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>     smap_add_format(args, "min_rx_bufsize", "%u",
>dev_info.min_rx_bufsize);
>-    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
>+    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>     smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>     smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>     smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
>@@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)
>     return 0;
> }
> 
>+/* Set the mtu of DPDK_DEV_VHOST ports */
>+static int
>+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>+{
>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>+    int err = 0;
>+    struct dpdk_mp *old_mp;
>+    struct dpdk_mp *mp;
>+
>+    ovs_mutex_lock(&dpdk_mutex);
>+    ovs_mutex_lock(&dev->mutex);
>+    if (dev->mtu == mtu) {
>+        err = 0;
>+        goto out;
>+    }
>+
>+    mp = dpdk_mp_get(dev->socket_id, mtu);
>+    if (!mp) {
>+        err = ENOMEM;
>+        goto out;
>+    }
>+
>+    old_mp = dev->dpdk_mp;
>+    dev->dpdk_mp = mp;
>+    dev->mtu = mtu;
>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>+
>+    dpdk_mp_put(old_mp);
>+    netdev_change_seq_changed(netdev);
>+out:
>+    ovs_mutex_unlock(&dev->mutex);
>+    ovs_mutex_unlock(&dpdk_mutex);
>+    return err;
>+}
>+
>+static int
>+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap
>*args)
>+{
>+    int mtu;
>+
>+    dpdk_dev_parse_mtu(args, &mtu);
>+
>+    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>+}
>+
> static void
> dpdk_common_init(void)
> {
>@@ -2040,8 +2220,9 @@ unlock_dpdk:
>     return err;
> }
> 
>-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND,
>\
>-    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
>\
>+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>+        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>+        GET_STATUS, RXQ_RECV)                                          \
> {                                                             \
>     NAME,                                                     \
>     INIT,                       /* init */                    \
>@@ -2053,7 +2234,7 @@ unlock_dpdk:
>     DESTRUCT,                                                 \
>     netdev_dpdk_dealloc,                                      \
>     netdev_dpdk_get_config,                                   \
>-    NULL,                       /* netdev_dpdk_set_config */  \
>+    SET_CONFIG,                                               \
>     NULL,                       /* get_tunnel_config */       \
>     NULL,                       /* build header */            \
>     NULL,                       /* push header */             \
>@@ -2067,7 +2248,7 @@ unlock_dpdk:
>     netdev_dpdk_set_etheraddr,                                \
>     netdev_dpdk_get_etheraddr,                                \
>     netdev_dpdk_get_mtu,                                      \
>-    netdev_dpdk_set_mtu,                                      \
>+    SET_MTU,                                                  \
>     netdev_dpdk_get_ifindex,                                  \
>     GET_CARRIER,                                              \
>     netdev_dpdk_get_carrier_resets,                           \
>@@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>         NULL,
>         netdev_dpdk_construct,
>         netdev_dpdk_destruct,
>+        netdev_dpdk_set_config,
>         netdev_dpdk_set_multiq,
>         netdev_dpdk_eth_send,
>+        netdev_dpdk_set_mtu,
>         netdev_dpdk_get_carrier,
>         netdev_dpdk_get_stats,
>         netdev_dpdk_get_features,
>@@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class =
>         NULL,
>         netdev_dpdk_ring_construct,
>         netdev_dpdk_destruct,
>+        netdev_dpdk_set_config,
>         netdev_dpdk_set_multiq,
>         netdev_dpdk_ring_send,
>+        netdev_dpdk_set_mtu,
>         netdev_dpdk_get_carrier,
>         netdev_dpdk_get_stats,
>         netdev_dpdk_get_features,
>@@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
>dpdk_vhost_cuse_class =
>         dpdk_vhost_cuse_class_init,
>         netdev_dpdk_vhost_cuse_construct,
>         netdev_dpdk_vhost_destruct,
>+        NULL,
>         netdev_dpdk_vhost_set_multiq,
>         netdev_dpdk_vhost_send,
>+        NULL,
>         netdev_dpdk_vhost_get_carrier,
>         netdev_dpdk_vhost_get_stats,
>         NULL,
>@@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
>dpdk_vhost_user_class =
>         dpdk_vhost_user_class_init,
>         netdev_dpdk_vhost_user_construct,
>         netdev_dpdk_vhost_destruct,
>+        netdev_dpdk_vhost_set_config,
>         netdev_dpdk_vhost_set_multiq,
>         netdev_dpdk_vhost_send,
>+        netdev_dpdk_vhost_set_mtu,
>         netdev_dpdk_vhost_get_carrier,
>         netdev_dpdk_vhost_get_stats,
>         NULL,
>-- 
>1.9.3
>
>_______________________________________________
>dev mailing list
>dev@openvswitch.org
>http://openvswitch.org/mailman/listinfo/dev
Mark Kavanagh Feb. 8, 2016, 10:54 a.m. UTC | #6
>
>Hi,
>
>Some comments below.
>

Hi Flavio,

Thanks for your review!

I've responded to your comments inline - I've also addressed the cosmetic changes that you suggested in V3 of the patch.

Thanks again,
Mark

>
>On Mon,  1 Feb 2016 10:18:54 +0000
>Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:
>
>> Add support for Jumbo Frames to DPDK-enabled port types,
>> using single-segment-mbufs.
>>
>> Using this approach, the amount of memory allocated for each mbuf
>> to store frame data is increased to a value greater than 1518B
>> (typical Ethernet maximum frame length). The increased space
>> available in the mbuf means that an entire Jumbo Frame can be carried
>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>> segments.
>>
>> The amount of space allocated to each mbuf to hold frame data is
>> defined dynamically by the user when adding a DPDK port to a bridge.
>> If an MTU value is not supplied, or the user-supplied value is
>> invalid, the MTU for the port defaults to standard Ethernet MTU (i.e.
>> 1500B).
>>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> ---
>>  INSTALL.DPDK.md   |  59 +++++++++-
>>  NEWS              |   2 +
>>  lib/netdev-dpdk.c | 347
>> +++++++++++++++++++++++++++++++++++++++++------------- 3 files
>> changed, 328 insertions(+), 80 deletions(-)
>>
>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> index 96b686c..64ccd15 100644
>> --- a/INSTALL.DPDK.md
>> +++ b/INSTALL.DPDK.md
>> @@ -859,10 +859,61 @@ by adding the following string:
>>  to <interface> sections of all network devices used by DPDK.
>> Parameter 'N' determines how many queues can be used by the guest.
>>
>> +
>> +Jumbo Frames
>> +------------
>> +
>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type
>> ports. +
>> +To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
>> ovs-vsctl +'add-port' command-line, along with the required MTU for
>> the port. +e.g.
>> +
>> +     ```
>> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>> options:dpdk-mtu=9000
>> +     ```
>
>Any particular reason to call it dpdk-mtu instead of mtu?

Mainly because there is already an 'MTU' column in the OVSDB Interfaces table, which is read-only; I wanted to make it clear that this parameter doesn't directly update that field.

>
>
>
>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>> segments are +increased, such that a full Jumbo Frame may be
>> accommodated inside a single +mbuf segment. Once set, the MTU for a
>> DPDK port is immutable. +
>
>I think we need some protection here because if someone tries to
>change that, it will segfault PMD threads.
>

Could you elaborate on this a bit more; do you mean that PMD threads would segfault if a user attempts to change the MTU of a running DPDK port? 
If so, how would that be possible (AFAIA, there's no way to programmatically modify the MTU of OVS ports, outside of standard Linux utilities such as ifconfig, which DPDK ports typically aren't subject to; there's also AFAIA no way to programmatically modify MTU of DPDK ports outside of modifying the source code).

>
>> +Jumbo frame support has been validated against 13312B frames, using
>> the +DPDK `igb_uio` driver, but larger frames and other DPDK NIC
>> drivers may +theoretically be supported. Supported port types
>> excludes vHost-Cuse ports, as +this feature is pending deprecation.
>> +
>> +
>> +vHost Ports and Jumbo Frames
>> +----------------------------
>> +Jumbo frame support is available for DPDK vHost-User ports only.
>> Some additional +configuration is needed to take advantage of this
>> feature: +
>> +  1. `mergeable buffers` must be enabled for vHost ports, as
>> demonstrated in
>> +      the QEMU command line snippet below:
>> +
>> +      ```
>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>> +      '-device
>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>> +      ```
>> +
>> +  2. Where virtio devices are bound to the Linux kernel driver in a
>> guest
>> +     environment (i.e. interfaces are not bound to an in-guest DPDK
>> driver), the
>> +     MTU of those logical network interfaces must also be increased.
>> This
>> +     avoids segmentation of Jumbo Frames in the guest. Note that
>> 'MTU' refers
>> +     to the length of the IP packet only, and not that of the entire
>> frame. +
>> +     e.g. To calculate the exact MTU of a standard IPv4 frame,
>> subtract the L2
>> +     header and CRC lengths (i.e. 18B) from the max supported frame
>> size.
>> +     So, to set the MTU for a 13312B Jumbo Frame:
>> +
>> +      ```
>> +      ifconfig eth1 mtu 13294
>> +      ```
>> +
>> +
>>  Restrictions:
>>  -------------
>>
>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this
>> issue.
>>    - Currently DPDK port does not make use any offload functionality.
>>    - DPDK-vHost support works with 1G huge pages.
>>
>> @@ -903,6 +954,12 @@ Restrictions:
>>      the next release of DPDK (which includes the above patch) is
>> available and integrated into OVS.
>>
>> +  Jumbo Frames:
>> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
>> source of
>> +     this issue is currently being investigated.
>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
>> ports. +
>> +
>>  Bug Reporting:
>>  --------------
>>
>> diff --git a/NEWS b/NEWS
>> index 5c18867..cd563e0 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>>       abstractions, such as virtual L2 and L3 overlays and security
>> groups.
>>     - RHEL packaging:
>>       * DPDK ports may now be created via network scripts (see
>> README.RHEL).
>> +   - netdev-dpdk:
>> +     * Add Jumbo Frame Support
>>
>>  v2.4.0 - 20 Aug 2015
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index de7e488..76a5dcc 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>> VLOG_RATE_LIMIT_INIT(5, 20); #define OVS_CACHE_LINE_SIZE
>> CACHE_LINE_SIZE #define OVS_VPORT_DPDK "ovs_dpdk"
>>
>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>> +
>>  /*
>>   * need to reserve tons of extra space in the mbufs so we can align
>> the
>>   * DMA addresses to 4KB.
>>   * The minimum mbuf size is limited to avoid scatter behaviour and
>> drop in
>>   * performance for standard Ethernet MTU.
>>   */
>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>> -                              + sizeof(struct dp_packet) \
>> -                              + RTE_PKTMBUF_HEADROOM)
>> -#define MBUF_SIZE_DRIVER     (2048                       \
>> -                              + sizeof (struct rte_mbuf) \
>> -                              + RTE_PKTMBUF_HEADROOM)
>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu),
>> MBUF_SIZE_DRIVER) +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) +
>> ETHER_HDR_LEN + ETHER_CRC_LEN) +#define FRAME_LEN_TO_MTU(frame_len)
>> ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN) +#define
>> MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>> +                                    + sizeof(struct dp_packet)   \
>> +                                    + RTE_PKTMBUF_HEADROOM)
>> +
>> +/* This value should be specified as a multiple of the DPDK NIC
>> driver's
>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>> + */
>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>
>>  /* Max and min number of packets in the mempool.  OVS tries to
>> allocate a
>>   * mempool with MAX_NB_MBUF: if this fails (because the system
>> doesn't have @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>> VLOG_RATE_LIMIT_INIT(5, 20); #define MIN_NB_MBUF          (4096 * 4)
>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>
>> +#define DPDK_VLAN_TAG_LEN    4
>
>Why not use VLAN_HEADER_LEN ?
>We already include packets.h anyway.

Done - thanks!

>
>
>
>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>  BUILD_ASSERT_DECL(MAX_NB_MBUF %
>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>          .header_split   = 0, /* Header Split disabled */
>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>          .hw_strip_crc   = 0,
>>      },
>>      .rx_adv_conf = {
>> @@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>>      return class->construct == netdev_dpdk_construct;
>>  }
>>
>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
>> igb_uio).
>> + * If 'frame_len' is not a multiple of this value, insufficient
>> buffers are
>> + * allocated to accomodate the packet in its entirety. Furthermore,
>> the igb_uio
>> + * driver needs to ensure that there is also sufficient space in the
>> Rx buffer
>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer
>> is too
>> + * small, then the driver enables scatter RX behaviour, which reduces
>> + * performance. To prevent this, use a buffer size that is closest to
>> + * 'frame_len', but which satisfies the aforementioned criteria.
>> + */
>> +static uint32_t
>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>> +{
>> +    struct rte_eth_dev_info info;
>> +    uint32_t buf_size;
>> +    /* XXX: This is a workaround for DPDK v2.2, and needs to be
>> refactored with a
>> +     * future DPDK release. */
>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>> +
>> +    if(netdev->type == DPDK_DEV_ETH) {
>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>> +        buf_size = (info.min_rx_bufsize == 0) ?
>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>> +                   info.min_rx_bufsize;
>> +    } else {
>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>> +    }
>> +
>> +    if(len % buf_size) {
>> +        len = buf_size * ((len/buf_size) + 1);
>> +    }
>> +
>> +    return len;
>> +}
>> +
>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be
>> used
>>   * for all other segments data, bss and text. */
>>
>> @@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>>  }
>>
>>  static void
>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>> -                   void *opaque_arg OVS_UNUSED,
>> -                   void *_m,
>> -                   unsigned i OVS_UNUSED)
>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>  {
>> -    struct rte_mbuf *m = _m;
>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>> +    uint16_t roomsz;
>>
>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>
>> +    /* if no structure is provided, assume no mbuf private area */
>> +
>> +    user_mbp_priv = opaque_arg;
>> +    if (user_mbp_priv == NULL) {
>> +        default_mbp_priv.mbuf_priv_size = 0;
>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>> +        } else {
>> +            roomsz = 0;
>> +        }
>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>> +        user_mbp_priv = &default_mbp_priv;
>> +    }
>> +
>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>> +        user_mbp_priv->mbuf_data_room_size +
>> +        user_mbp_priv->mbuf_priv_size);
>> +
>> +    mbp_priv = rte_mempool_get_priv(mp);
>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>> +}
>> +
>> +/* Initialise some fields in the mbuf structure that are not
>> modified by the
>> + * user once created (origin pool, buffer start address, etc.*/
>> +static void
>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>> +                       void *opaque_arg OVS_UNUSED,
>> +                       void *_m,
>> +                       unsigned i OVS_UNUSED)
>> +{
>> +    struct rte_mbuf *m = _m;
>> +    uint32_t buf_size, buf_len, priv_size;
>> +
>> +    priv_size = rte_pktmbuf_priv_size(mp);
>> +    buf_size = sizeof(struct dp_packet) + priv_size;
>> +    buf_len = rte_pktmbuf_data_room_size(mp);
>> +
>> +    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>> priv_size);
>> +    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>> +    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>> +
>>      memset(m, 0, mp->elt_size);
>>
>> -    /* start of buffer is just after mbuf structure */
>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>> -                    sizeof(struct dp_packet);
>> +    /* start of buffer is after dp_packet structure and priv data */
>> +    m->priv_size = priv_size;
>> +    m->buf_addr = (char *)m + buf_size;
>> +    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>      m->buf_len = (uint16_t)buf_len;
>>
>>      /* keep some headroom between start of buffer and data */
>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>> +    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
>> (uint16_t)m->buf_len);
>>      /* init some constant fields */
>>      m->pool = mp;
>> @@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>  {
>>      struct rte_mbuf *m = _m;
>>
>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>
>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>  }
>> @@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
>> OVS_REQUIRES(dpdk_mutex) struct dpdk_mp *dmp = NULL;
>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>      unsigned mp_size;
>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>
>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>> @@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
>> OVS_REQUIRES(dpdk_mutex) dmp->socket_id = socket_id;
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>> RTE_PKTMBUF_HEADROOM;
>> +    mbp_priv.mbuf_priv_size = 0;
>>
>>      mp_size = MAX_NB_MBUF;
>>      do {
>> @@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
>> OVS_REQUIRES(dpdk_mutex) return NULL;
>>          }
>>
>> -        dmp->mp = rte_mempool_create(mp_name, mp_size,
>> MBUF_SIZE(mtu),
>> +        dmp->mp = rte_mempool_create(mp_name, mp_size,
>> MBUF_SEGMENT_SIZE(mtu), MP_CACHE_SZ,
>>                                       sizeof(struct
>> rte_pktmbuf_pool_private),
>> -                                     rte_pktmbuf_pool_init, NULL,
>> +                                     ovs_rte_pktmbuf_pool_init,
>> &mbp_priv, ovs_rte_pktmbuf_init, NULL,
>>                                       socket_id, 0);
>>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>> MIN_NB_MBUF); @@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct
>> netdev_dpdk *dev, int n_rxq, int n_txq) {
>>      int diag = 0;
>>      int i;
>> +    struct rte_eth_conf conf = port_conf;
>>
>>      /* A device may report more queues than it makes available (this
>> has
>>       * been observed for Intel xl710, which reserves some of them for
>> @@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>> *dev, int n_rxq, int n_txq) VLOG_INFO("Retrying setup with (rxq:%d
>> txq:%d)", n_rxq, n_txq); }
>>
>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>> &port_conf);
>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +        }
>>
>
>Can we add an "else" here?  Before port_conf was initializing
>jumbo_frame=0 for instance, but now we are changing it so I wonder if I
>set up a port with jumbo and another without, the last one would get
>that bit set.

I've removed this code in the latest version of the patch, replacing it with a call to rte_eth_dev_set_mtu, so I no longer need to modify the port conf.

>
>
>
>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>> &conf); if (diag) {
>>              break;
>>          }
>> @@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>> int port_no, struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>      int sid;
>>      int err = 0;
>> +    uint32_t buf_size;
>>
>>      ovs_mutex_init(&netdev->mutex);
>>      ovs_mutex_lock(&netdev->mutex);
>> @@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>> unsigned int port_no, netdev->port_id = port_no;
>>      netdev->type = type;
>>      netdev->flags = 0;
>> +
>> +    /* Initialize port's MTU and frame len to the default Ethernet
>> values.
>> +     * Larger, user-specified (jumbo) frame buffers are accommodated
>> in
>> +     * netdev_dpdk_set_config.
>> +     */
>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>      netdev->mtu = ETHER_MTU;
>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>
>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>> FRAME_LEN_TO_MTU(buf_size)); if (!netdev->dpdk_mp) {
>>          err = ENOMEM;
>>          goto unlock;
>> @@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
>> char prefix[], return 0;
>>  }
>>
>> +static void
>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>> +{
>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>> +    char *end_ptr = NULL;
>> +    int local_mtu;
>> +
>> +    if(mtu_str) {
>> +        local_mtu = strtoul(mtu_str, &end_ptr, 0);
>> +    }
>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>> +                   local_mtu >
>> FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
>> +                   *end_ptr != '\0') {
>> +        local_mtu = ETHER_MTU;
>> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter -
>> defaulting to %d.\n",
>> +                   local_mtu);
>> +    }
>> +
>> +    *mtu = local_mtu;
>> +}
>> +
>>  static int
>>  vhost_construct_helper(struct netdev *netdev_)
>> OVS_REQUIRES(dpdk_mutex) {
>> @@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
>> *netdev_, struct smap *args) smap_add_format(args,
>> "configured_rx_queues", "%d", netdev_->n_rxq); smap_add_format(args,
>> "requested_tx_queues", "%d", netdev_->n_txq); smap_add_format(args,
>> "configured_tx_queues", "%d", dev->real_n_txq);
>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>      ovs_mutex_unlock(&dev->mutex);
>>
>>      return 0;
>>  }
>>
>> +/* Set the mtu of DPDK_DEV_ETH ports */
>> +static int
>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    int old_mtu, err;
>> +    uint32_t buf_size;
>> +    int dpdk_mtu;
>> +    struct dpdk_mp *old_mp;
>> +    struct dpdk_mp *mp;
>> +
>> +    ovs_mutex_lock(&dpdk_mutex);
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->mtu == mtu) {
>> +        err = 0;
>> +        goto out;
>> +    }
>> +
>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>> +
>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>> +    if (!mp) {
>> +        err = ENOMEM;
>> +        goto out;
>> +    }
>> +
>> +    rte_eth_dev_stop(dev->port_id);
>> +
>> +    old_mtu = dev->mtu;
>> +    old_mp = dev->dpdk_mp;
>> +    dev->dpdk_mp = mp;
>> +    dev->mtu = mtu;
>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +
>> +    err = dpdk_eth_dev_init(dev);
>> +    if (err) {
>> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to
>> last known "
>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>> +        dpdk_mp_put(mp);
>> +        dev->mtu = old_mtu;
>> +        dev->dpdk_mp = old_mp;
>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +        dpdk_eth_dev_init(dev);
>> +        goto out;
>> +    } else {
>> +        dpdk_mp_put(old_mp);
>> +        netdev_change_seq_changed(netdev);
>> +    }
>> +out:
>> +    ovs_mutex_unlock(&dev->mutex);
>> +    ovs_mutex_unlock(&dpdk_mutex);
>> +    return err;
>> +}
>> +
>> +static int
>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap
>> *args) +{
>> +    int mtu;
>> +
>> +    dpdk_dev_parse_mtu(args, &mtu);
>> +
>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>> +}
>> +
>>  static int
>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>  {
>> @@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev
>> *netdev, int *mtup)
>>      return 0;
>>  }
>> -
>> -static int
>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> -{
>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> -    int old_mtu, err;
>> -    struct dpdk_mp *old_mp;
>> -    struct dpdk_mp *mp;
>> -
>> -    ovs_mutex_lock(&dpdk_mutex);
>> -    ovs_mutex_lock(&dev->mutex);
>> -    if (dev->mtu == mtu) {
>> -        err = 0;
>> -        goto out;
>> -    }
>> -
>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> -    if (!mp) {
>> -        err = ENOMEM;
>> -        goto out;
>> -    }
>> -
>> -    rte_eth_dev_stop(dev->port_id);
>> -
>> -    old_mtu = dev->mtu;
>> -    old_mp = dev->dpdk_mp;
>> -    dev->dpdk_mp = mp;
>> -    dev->mtu = mtu;
>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>> -
>> -    err = dpdk_eth_dev_init(dev);
>> -    if (err) {
>> -        dpdk_mp_put(mp);
>> -        dev->mtu = old_mtu;
>> -        dev->dpdk_mp = old_mp;
>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>> -        dpdk_eth_dev_init(dev);
>> -        goto out;
>> -    }
>> -
>> -    dpdk_mp_put(old_mp);
>> -    netdev_change_seq_changed(netdev);
>> -out:
>> -    ovs_mutex_unlock(&dev->mutex);
>> -    ovs_mutex_unlock(&dpdk_mutex);
>> -    return err;
>> -}
>> -
>>  static int
>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>
>> @@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
>> *netdev_, struct smap *args) smap_add_format(args, "numa_id", "%d",
>> rte_eth_dev_socket_id(dev->port_id)); smap_add_format(args,
>> "driver_name", "%s", dev_info.driver_name); smap_add_format(args,
>> "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
>> -    smap_add_format(args, "max_rx_pktlen", "%u",
>> dev_info.max_rx_pktlen);
>> +    smap_add_format(args, "max_rx_pktlen", "%u",
>> dev->max_packet_len); smap_add_format(args, "max_rx_queues", "%u",
>> dev_info.max_rx_queues); smap_add_format(args, "max_tx_queues", "%u",
>> dev_info.max_tx_queues); smap_add_format(args, "max_mac_addrs", "%u",
>> dev_info.max_mac_addrs); @@ -1904,6 +2039,51 @@
>> dpdk_vhost_user_class_init(void) return 0;
>>  }
>>
>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>> +static int
>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    int err = 0;
>> +    struct dpdk_mp *old_mp;
>> +    struct dpdk_mp *mp;
>> +
>> +    ovs_mutex_lock(&dpdk_mutex);
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->mtu == mtu) {
>> +        err = 0;
>> +        goto out;
>> +    }
>> +
>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>> +    if (!mp) {
>> +        err = ENOMEM;
>> +        goto out;
>> +    }
>> +
>> +    old_mp = dev->dpdk_mp;
>> +    dev->dpdk_mp = mp;
>> +    dev->mtu = mtu;
>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +
>> +    dpdk_mp_put(old_mp);
>> +    netdev_change_seq_changed(netdev);
>> +out:
>> +    ovs_mutex_unlock(&dev->mutex);
>> +    ovs_mutex_unlock(&dpdk_mutex);
>> +    return err;
>> +}
>> +
>> +static int
>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct
>> smap *args) +{
>> +    int mtu;
>> +
>> +    dpdk_dev_parse_mtu(args, &mtu);
>> +
>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>> +}
>> +
>>  static void
>>  dpdk_common_init(void)
>>  {
>> @@ -2040,8 +2220,9 @@ unlock_dpdk:
>>      return err;
>>  }
>>
>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
>> SEND, \
>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS,
>> RXQ_RECV)          \ +#define NETDEV_DPDK_CLASS(NAME, INIT,
>> CONSTRUCT, DESTRUCT, SET_CONFIG, \
>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS,
>> GET_FEATURES,   \
>> +        GET_STATUS,
>> RXQ_RECV)                                          \
>> {                                                             \
>> NAME,                                                     \
>> INIT,                       /* init */                    \ @@
>> -2053,7 +2234,7 @@ unlock_dpdk:
>> DESTRUCT,                                                 \
>> netdev_dpdk_dealloc,                                      \
>> netdev_dpdk_get_config,                                   \
>> -    NULL,                       /* netdev_dpdk_set_config */  \
>> +    SET_CONFIG,                                               \
>>      NULL,                       /* get_tunnel_config */       \
>>      NULL,                       /* build header */            \
>>      NULL,                       /* push header */             \
>> @@ -2067,7 +2248,7 @@ unlock_dpdk:
>>      netdev_dpdk_set_etheraddr,                                \
>>      netdev_dpdk_get_etheraddr,                                \
>>      netdev_dpdk_get_mtu,                                      \
>> -    netdev_dpdk_set_mtu,                                      \
>> +    SET_MTU,                                                  \
>>      netdev_dpdk_get_ifindex,                                  \
>>      GET_CARRIER,                                              \
>>      netdev_dpdk_get_carrier_resets,                           \
>> @@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>>          NULL,
>>          netdev_dpdk_construct,
>>          netdev_dpdk_destruct,
>> +        netdev_dpdk_set_config,
>>          netdev_dpdk_set_multiq,
>>          netdev_dpdk_eth_send,
>> +        netdev_dpdk_set_mtu,
>>          netdev_dpdk_get_carrier,
>>          netdev_dpdk_get_stats,
>>          netdev_dpdk_get_features,
>> @@ -2227,8 +2410,10 @@ static const struct netdev_class
>> dpdk_ring_class = NULL,
>>          netdev_dpdk_ring_construct,
>>          netdev_dpdk_destruct,
>> +        netdev_dpdk_set_config,
>>          netdev_dpdk_set_multiq,
>>          netdev_dpdk_ring_send,
>> +        netdev_dpdk_set_mtu,
>>          netdev_dpdk_get_carrier,
>>          netdev_dpdk_get_stats,
>>          netdev_dpdk_get_features,
>> @@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
>> dpdk_vhost_cuse_class = dpdk_vhost_cuse_class_init,
>>          netdev_dpdk_vhost_cuse_construct,
>>          netdev_dpdk_vhost_destruct,
>> +        NULL,
>>          netdev_dpdk_vhost_set_multiq,
>>          netdev_dpdk_vhost_send,
>> +        NULL,
>>          netdev_dpdk_vhost_get_carrier,
>>          netdev_dpdk_vhost_get_stats,
>>          NULL,
>> @@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
>> dpdk_vhost_user_class = dpdk_vhost_user_class_init,
>>          netdev_dpdk_vhost_user_construct,
>>          netdev_dpdk_vhost_destruct,
>> +        netdev_dpdk_vhost_set_config,
>>          netdev_dpdk_vhost_set_multiq,
>>          netdev_dpdk_vhost_send,
>> +        netdev_dpdk_vhost_set_mtu,
>>          netdev_dpdk_vhost_get_carrier,
>>          netdev_dpdk_vhost_get_stats,
>>          NULL,
>
>Other parts look good to me, thanks Mark!
>
>I tried this patch with small and jumbo (8000 bytes) frames from
>outside to DPDK port then vhost-user and then to guest using a traffic
>generator.  I could see the packets inside of the VM just fine.
>I haven't tried with real traffic yet though.

Great!

>
>Thanks,
>--
>fbl
Mark Kavanagh Feb. 8, 2016, 11:23 a.m. UTC | #7
>
>Hi Mark,
>

Hi Daniele,

Thanks for the review! Responses inline below.

Cheers,
Mark

>This patch besides adding Jumbo Frame support also cleans up
>the mbuf initialization (by changing the macros, adding
>dpdk_buf_size, and rewriting __ovs_rte_pktmbuf_init), so thanks
>for this.  I think it makes sense to split the patch in two:
>one that does the clenup, and one that allows configuring the
>MTU.

Okay, sounds good - I'll spin another version, splitting into a two patch set.

>
>I agree with Flavio's comments as well, more inline
>
>Thanks
>
>On 01/02/2016 02:18, "Mark Kavanagh" <mark.b.kavanagh@intel.com> wrote:
>
>>Add support for Jumbo Frames to DPDK-enabled port types,
>>using single-segment-mbufs.
>>
>>Using this approach, the amount of memory allocated for each mbuf
>>to store frame data is increased to a value greater than 1518B
>>(typical Ethernet maximum frame length). The increased space
>>available in the mbuf means that an entire Jumbo Frame can be carried
>>in a single mbuf, as opposed to partitioning it across multiple mbuf
>>segments.
>>
>>The amount of space allocated to each mbuf to hold frame data is
>>defined dynamically by the user when adding a DPDK port to a bridge.
>>If an MTU value is not supplied, or the user-supplied value is invalid,
>>the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>
>>Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>---
>> INSTALL.DPDK.md   |  59 +++++++++-
>> NEWS              |   2 +
>> lib/netdev-dpdk.c | 347
>>+++++++++++++++++++++++++++++++++++++++++-------------
>> 3 files changed, 328 insertions(+), 80 deletions(-)
>>
>>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>index 96b686c..64ccd15 100644
>>--- a/INSTALL.DPDK.md
>>+++ b/INSTALL.DPDK.md
>>@@ -859,10 +859,61 @@ by adding the following string:
>> to <interface> sections of all network devices used by DPDK. Parameter
>>'N'
>> determines how many queues can be used by the guest.
>>
>>+
>>+Jumbo Frames
>>+------------
>>+
>>+Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>+
>>+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
>>ovs-vsctl
>>+'add-port' command-line, along with the required MTU for the port.
>>+e.g.
>>+
>>+     ```
>>+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>options:dpdk-mtu=9000
>>+     ```
>>+
>>+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments
>>are
>>+increased, such that a full Jumbo Frame may be accommodated inside a
>>single
>>+mbuf segment. Once set, the MTU for a DPDK port is immutable.
>
>Why is it immutable?

I guess my rationale here is that an MTU change can't be triggered via OVS command-line, nor can it be triggered programmatically via DPDK (apart from an explicit call to rte_eth_dev_set_mtu).
So, while technically it's possibly, from a user's point of view, there's no way to configure it, outside of modifying the code directly.
If I've missed something here, please let me know.

>
>>+
>>+Jumbo frame support has been validated against 13312B frames, using the
>>+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>+theoretically be supported. Supported port types excludes vHost-Cuse
>>ports, as
>>+this feature is pending deprecation.
>>+
>>+
>>+vHost Ports and Jumbo Frames
>>+----------------------------
>>+Jumbo frame support is available for DPDK vHost-User ports only. Some
>>additional
>>+configuration is needed to take advantage of this feature:
>>+
>>+  1. `mergeable buffers` must be enabled for vHost ports, as
>>demonstrated in
>>+      the QEMU command line snippet below:
>>+
>>+      ```
>>+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>+      '-device
>>virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>+      ```
>>+
>>+  2. Where virtio devices are bound to the Linux kernel driver in a guest
>>+     environment (i.e. interfaces are not bound to an in-guest DPDK
>>driver), the
>>+     MTU of those logical network interfaces must also be increased. This
>>+     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU'
>>refers
>>+     to the length of the IP packet only, and not that of the entire
>>frame.
>>+
>>+     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract
>>the L2
>>+     header and CRC lengths (i.e. 18B) from the max supported frame size.
>>+     So, to set the MTU for a 13312B Jumbo Frame:
>>+
>>+      ```
>>+      ifconfig eth1 mtu 13294
>>+      ```
>>+
>>+
>> Restrictions:
>> -------------
>>
>>-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>   - Currently DPDK port does not make use any offload functionality.
>>   - DPDK-vHost support works with 1G huge pages.
>>
>>@@ -903,6 +954,12 @@ Restrictions:
>>     the next release of DPDK (which includes the above patch) is
>>available and
>>     integrated into OVS.
>>
>>+  Jumbo Frames:
>>+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
>>source of
>>+     this issue is currently being investigated.
>>+  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
>>ports.
>>+
>>+
>> Bug Reporting:
>> --------------
>>
>>diff --git a/NEWS b/NEWS
>>index 5c18867..cd563e0 100644
>>--- a/NEWS
>>+++ b/NEWS
>>@@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>>      abstractions, such as virtual L2 and L3 overlays and security
>>groups.
>>    - RHEL packaging:
>>      * DPDK ports may now be created via network scripts (see
>>README.RHEL).
>>+   - netdev-dpdk:
>>+     * Add Jumbo Frame Support
>
>I don't think we should add this to 2.5

Out of curiosity, is this mainly due to potential instability?
In any event, I'll move it into the 'Post-v2.5.0' section.

>
>>
>>
>> v2.4.0 - 20 Aug 2015
>>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>index de7e488..76a5dcc 100644
>>--- a/lib/netdev-dpdk.c
>>+++ b/lib/netdev-dpdk.c
>>@@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>VLOG_RATE_LIMIT_INIT(5, 20);
>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>> #define OVS_VPORT_DPDK "ovs_dpdk"
>>
>>+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>
>I don't see particular value in the above #define.

I've removed this macro, and the section of code that used it in the latest version of the patch.

>
>>+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>+
>> /*
>>  * need to reserve tons of extra space in the mbufs so we can align the
>>  * DMA addresses to 4KB.
>>  * The minimum mbuf size is limited to avoid scatter behaviour and drop
>>in
>>  * performance for standard Ethernet MTU.
>>  */
>>-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>-                              + sizeof(struct dp_packet) \
>>-                              + RTE_PKTMBUF_HEADROOM)
>>-#define MBUF_SIZE_DRIVER     (2048                       \
>>-                              + sizeof (struct rte_mbuf) \
>>-                              + RTE_PKTMBUF_HEADROOM)
>>-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN +
>>ETHER_CRC_LEN)
>>+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN -
>>ETHER_CRC_LEN)
>>+#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>+                                    + sizeof(struct dp_packet)   \
>>+                                    + RTE_PKTMBUF_HEADROOM)
>>+
>>+/* This value should be specified as a multiple of the DPDK NIC driver's
>>+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>+ */
>>+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>
>> /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't
>>have
>>@@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>VLOG_RATE_LIMIT_INIT(5, 20);
>> #define MIN_NB_MBUF          (4096 * 4)
>> #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>
>>+#define DPDK_VLAN_TAG_LEN    4
>>+
>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>> BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF)
>>== 0);
>>
>>@@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>         .header_split   = 0, /* Header Split disabled */
>>         .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>         .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>         .hw_strip_crc   = 0,
>>     },
>>     .rx_adv_conf = {
>>@@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>>     return class->construct == netdev_dpdk_construct;
>> }
>>
>>+/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
>>igb_uio).
>>+ * If 'frame_len' is not a multiple of this value, insufficient buffers
>>are
>>+ * allocated to accomodate the packet in its entirety. Furthermore, the
>>igb_uio
>>+ * driver needs to ensure that there is also sufficient space in the Rx
>>buffer
>>+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is
>>too
>>+ * small, then the driver enables scatter RX behaviour, which reduces
>>+ * performance. To prevent this, use a buffer size that is closest to
>>+ * 'frame_len', but which satisfies the aforementioned criteria.
>>+ */
>>+static uint32_t
>>+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>+{
>>+    struct rte_eth_dev_info info;
>>+    uint32_t buf_size;
>>+    /* XXX: This is a workaround for DPDK v2.2, and needs to be
>>refactored with a
>>+     * future DPDK release. */
>
>Could you elaborate on that?

Due to changes pending in the latest version of the code, this is no longer relevant and will be removed.

>
>>+    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>+
>>+    if(netdev->type == DPDK_DEV_ETH) {
>>+        rte_eth_dev_info_get(netdev->port_id, &info);
>>+        buf_size = (info.min_rx_bufsize == 0) ?
>>+                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>+                   info.min_rx_bufsize;
>>+    } else {
>>+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>+    }
>>+
>>+    if(len % buf_size) {
>>+        len = buf_size * ((len/buf_size) + 1);
>>+    }
>
>I think this looks better with the ROUND_UP macro.

True - I'll update the code accordingly.

>
>>+
>>+    return len;
>>+}
>>+
>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>  * for all other segments data, bss and text. */
>>
>>@@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>> }
>>
>> static void
>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>-                   void *opaque_arg OVS_UNUSED,
>>-                   void *_m,
>>-                   unsigned i OVS_UNUSED)
>>+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>> {
>>-    struct rte_mbuf *m = _m;
>>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>+    struct rte_pktmbuf_pool_private default_mbp_priv;
>>+    uint16_t roomsz;
>>
>>     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>
>>+    /* if no structure is provided, assume no mbuf private area */
>>+
>>+    user_mbp_priv = opaque_arg;
>>+    if (user_mbp_priv == NULL) {
>>+        default_mbp_priv.mbuf_priv_size = 0;
>>+        if (mp->elt_size > sizeof(struct dp_packet)) {
>>+            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>+        } else {
>>+            roomsz = 0;
>>+        }
>>+        default_mbp_priv.mbuf_data_room_size = roomsz;
>>+        user_mbp_priv = &default_mbp_priv;
>>+    }
>>+
>>+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>+        user_mbp_priv->mbuf_data_room_size +
>>+        user_mbp_priv->mbuf_priv_size);
>>+
>>+    mbp_priv = rte_mempool_get_priv(mp);
>>+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>+}
>>+
>>+/* Initialise some fields in the mbuf structure that are not modified by
>>the
>>+ * user once created (origin pool, buffer start address, etc.*/
>>+static void
>>+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>+                       void *opaque_arg OVS_UNUSED,
>>+                       void *_m,
>>+                       unsigned i OVS_UNUSED)
>>+{
>>+    struct rte_mbuf *m = _m;
>>+    uint32_t buf_size, buf_len, priv_size;
>>+
>>+    priv_size = rte_pktmbuf_priv_size(mp);
>>+    buf_size = sizeof(struct dp_packet) + priv_size;
>>+    buf_len = rte_pktmbuf_data_room_size(mp);
>>+
>>+    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>>priv_size);
>>+    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>+    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>+
>>     memset(m, 0, mp->elt_size);
>>
>>-    /* start of buffer is just after mbuf structure */
>>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>-                    sizeof(struct dp_packet);
>>+    /* start of buffer is after dp_packet structure and priv data */
>>+    m->priv_size = priv_size;
>>+    m->buf_addr = (char *)m + buf_size;
>>+    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>     m->buf_len = (uint16_t)buf_len;
>>
>>     /* keep some headroom between start of buffer and data */
>>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>+    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>
>>     /* init some constant fields */
>>     m->pool = mp;
>>@@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>> {
>>     struct rte_mbuf *m = _m;
>>
>>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>
>>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>> }
>>@@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>OVS_REQUIRES(dpdk_mutex)
>>     struct dpdk_mp *dmp = NULL;
>>     char mp_name[RTE_MEMPOOL_NAMESIZE];
>>     unsigned mp_size;
>>+    struct rte_pktmbuf_pool_private mbp_priv;
>>
>>     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>@@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>OVS_REQUIRES(dpdk_mutex)
>>     dmp->socket_id = socket_id;
>>     dmp->mtu = mtu;
>>     dmp->refcount = 1;
>>+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>RTE_PKTMBUF_HEADROOM;
>>+    mbp_priv.mbuf_priv_size = 0;
>>
>>     mp_size = MAX_NB_MBUF;
>>     do {
>>@@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>OVS_REQUIRES(dpdk_mutex)
>>             return NULL;
>>         }
>>
>>-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>+        dmp->mp = rte_mempool_create(mp_name, mp_size,
>>MBUF_SEGMENT_SIZE(mtu),
>>                                      MP_CACHE_SZ,
>>                                      sizeof(struct
>>rte_pktmbuf_pool_private),
>>-                                     rte_pktmbuf_pool_init, NULL,
>>+                                     ovs_rte_pktmbuf_pool_init,
>>&mbp_priv,
>>                                      ovs_rte_pktmbuf_init, NULL,
>>                                      socket_id, 0);
>>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>MIN_NB_MBUF);
>
>Ok, this is quite intricated. I believe the reason that OVS used its own
>__rte_pktmbuf_init() was that it wasn't possible to have custom metadata
>in mbufs before DPDK 2.1.
>
>If I'm not mistaken, with DPDK commit b507905ff407 there's a way to do that
>without copying any code from DPDK with the following incremental (from
>master)
>
>---8<---
>
>@@ -278,42 +272,12 @@ free_dpdk_buf(struct dp_packet *p)
> }
>
> static void
>-__rte_pktmbuf_init(struct rte_mempool *mp,
>-                   void *opaque_arg OVS_UNUSED,
>-                   void *_m,
>-                   unsigned i OVS_UNUSED)
>-{
>-    struct rte_mbuf *m = _m;
>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>-
>-    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>-
>-    memset(m, 0, mp->elt_size);
>-
>-    /* start of buffer is just after mbuf structure */
>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>-                    sizeof(struct dp_packet);
>-    m->buf_len = (uint16_t)buf_len;
>-
>-    /* keep some headroom between start of buffer and data */
>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>-
>-    /* init some constant fields */
>-    m->pool = mp;
>-    m->nb_segs = 1;
>-    m->port = 0xff;
>-}
>-
>-static void
>-ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>-                     void *opaque_arg OVS_UNUSED,
>-                     void *_m,
>-                     unsigned i OVS_UNUSED)
>+ovs_rte_pktmbuf_init(struct rte_mempool *mp, void *opaque_arg,
>+                     void *_m, unsigned i)
> {
>     struct rte_mbuf *m = _m;
>
>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>+    rte_pktmbuf_init(mp, opaque_arg, _m, i);
>
>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
> }
>@@ -339,15 +303,21 @@ dpdk_mp_get(int socket_id, int mtu)
>OVS_REQUIRES(dpdk_mutex)
>
>     mp_size = MAX_NB_MBUF;
>     do {
>+        struct rte_pktmbuf_pool_private mbuf_sizes;
>+
>         if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>                      dmp->mtu, dmp->socket_id, mp_size) < 0) {
>             return NULL;
>         }
>
>+        mbuf_sizes.mbuf_priv_size = sizeof (struct dp_packet)
>+                                    - sizeof (struct rte_mbuf);
>+        mbuf_sizes.mbuf_data_room_size = MBUF_SIZE(mtu)
>+                                         - sizeof(struct dp_packet);
>         dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>                                      MP_CACHE_SZ,
>                                      sizeof(struct
>rte_pktmbuf_pool_private),
>-                                     rte_pktmbuf_pool_init, NULL,
>+                                     rte_pktmbuf_pool_init, &mbuf_sizes,
>                                      ovs_rte_pktmbuf_init, NULL,
>                                      socket_id, 0);
>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>MIN_NB_MBUF);
>
>---8<---
>
>I think this will make the patch cleaner. Do you think this will work with
>the increased MTU as well?

Thanks for this Daniele. When I implemented the code, I attempted to change as little of the legacy code as possible, but seeing as how I'm now doing a patchset, I can roll this change in.
Btw, I just tested with P2P 9k frames, and it works fine :)

>
>>@@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int
>>n_rxq, int n_txq)
>> {
>>     int diag = 0;
>>     int i;
>>+    struct rte_eth_conf conf = port_conf;
>>
>>     /* A device may report more queues than it makes available (this has
>>      * been observed for Intel xl710, which reserves some of them for
>>@@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>int n_rxq, int n_txq)
>>             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq,
>>n_txq);
>>         }
>>
>>-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>&port_conf);
>>+        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>+            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>+            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>+        }
>>+
>>+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>         if (diag) {
>>             break;
>>         }
>>@@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int
>>port_no,
>>     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>     int sid;
>>     int err = 0;
>>+    uint32_t buf_size;
>>
>>     ovs_mutex_init(&netdev->mutex);
>>     ovs_mutex_lock(&netdev->mutex);
>>@@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>int port_no,
>>     netdev->port_id = port_no;
>>     netdev->type = type;
>>     netdev->flags = 0;
>>+
>>+    /* Initialize port's MTU and frame len to the default Ethernet
>>values.
>>+     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>+     * netdev_dpdk_set_config.
>>+     */
>>+    netdev->max_packet_len = ETHER_MAX_LEN;
>>     netdev->mtu = ETHER_MTU;
>>-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>
>>-    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>+    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>FRAME_LEN_TO_MTU(buf_size));
>>     if (!netdev->dpdk_mp) {
>>         err = ENOMEM;
>>         goto unlock;
>>@@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
>>char prefix[],
>>     return 0;
>> }
>>
>>+static void
>>+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>+{
>>+    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>+    char *end_ptr = NULL;
>>+    int local_mtu;
>>+
>>+    if(mtu_str) {
>>+        local_mtu = strtoul(mtu_str, &end_ptr, 0);
>>+    }
>>+    if(!mtu_str || local_mtu < ETHER_MTU ||
>>+                   local_mtu >
>>FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
>>+                   *end_ptr != '\0') {
>>+        local_mtu = ETHER_MTU;
>>+        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to
>>%d.\n",
>>+                   local_mtu);
>>+    }
>>+
>>+    *mtu = local_mtu;
>>+}
>>+
>> static int
>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>> {
>>@@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
>>*netdev_, struct smap *args)
>>     smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>     smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>     smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
>>+    smap_add_format(args, "mtu", "%d", dev->mtu);
>>     ovs_mutex_unlock(&dev->mutex);
>>
>>     return 0;
>> }
>>
>>+/* Set the mtu of DPDK_DEV_ETH ports */
>>+static int
>>+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>+{
>>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>+    int old_mtu, err;
>>+    uint32_t buf_size;
>>+    int dpdk_mtu;
>>+    struct dpdk_mp *old_mp;
>>+    struct dpdk_mp *mp;
>>+
>>+    ovs_mutex_lock(&dpdk_mutex);
>>+    ovs_mutex_lock(&dev->mutex);
>>+    if (dev->mtu == mtu) {
>>+        err = 0;
>>+        goto out;
>>+    }
>>+
>>+    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>+    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>+
>>+    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>+    if (!mp) {
>>+        err = ENOMEM;
>>+        goto out;
>>+    }
>>+
>>+    rte_eth_dev_stop(dev->port_id);
>>+
>>+    old_mtu = dev->mtu;
>>+    old_mp = dev->dpdk_mp;
>>+    dev->dpdk_mp = mp;
>>+    dev->mtu = mtu;
>>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>+
>>+    err = dpdk_eth_dev_init(dev);
>>+    if (err) {
>>+        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>known "
>>+                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>+        dpdk_mp_put(mp);
>>+        dev->mtu = old_mtu;
>>+        dev->dpdk_mp = old_mp;
>>+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>+        dpdk_eth_dev_init(dev);
>>+        goto out;
>>+    } else {
>>+        dpdk_mp_put(old_mp);
>>+        netdev_change_seq_changed(netdev);
>>+    }
>>+out:
>>+    ovs_mutex_unlock(&dev->mutex);
>>+    ovs_mutex_unlock(&dpdk_mutex);
>>+    return err;
>>+}
>>+
>>+static int
>>+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>+{
>>+    int mtu;
>>+
>>+    dpdk_dev_parse_mtu(args, &mtu);
>>+
>>+    return netdev_dpdk_set_mtu(netdev_, mtu);
>>+}
>>+
>> static int
>> netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>> {
>>@@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev,
>>int *mtup)
>>
>>     return 0;
>> }
>>-
>>-static int
>>-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>-{
>>-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>-    int old_mtu, err;
>>-    struct dpdk_mp *old_mp;
>>-    struct dpdk_mp *mp;
>>-
>>-    ovs_mutex_lock(&dpdk_mutex);
>>-    ovs_mutex_lock(&dev->mutex);
>>-    if (dev->mtu == mtu) {
>>-        err = 0;
>>-        goto out;
>>-    }
>>-
>>-    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>-    if (!mp) {
>>-        err = ENOMEM;
>>-        goto out;
>>-    }
>>-
>>-    rte_eth_dev_stop(dev->port_id);
>>-
>>-    old_mtu = dev->mtu;
>>-    old_mp = dev->dpdk_mp;
>>-    dev->dpdk_mp = mp;
>>-    dev->mtu = mtu;
>>-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>-
>>-    err = dpdk_eth_dev_init(dev);
>>-    if (err) {
>>-        dpdk_mp_put(mp);
>>-        dev->mtu = old_mtu;
>>-        dev->dpdk_mp = old_mp;
>>-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>-        dpdk_eth_dev_init(dev);
>>-        goto out;
>>-    }
>>-
>>-    dpdk_mp_put(old_mp);
>>-    netdev_change_seq_changed(netdev);
>>-out:
>>-    ovs_mutex_unlock(&dev->mutex);
>>-    ovs_mutex_unlock(&dpdk_mutex);
>>-    return err;
>>-}
>>-
>> static int
>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>
>>@@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
>>*netdev_, struct smap *args)
>>     smap_add_format(args, "numa_id", "%d",
>>rte_eth_dev_socket_id(dev->port_id));
>>     smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>     smap_add_format(args, "min_rx_bufsize", "%u",
>>dev_info.min_rx_bufsize);
>>-    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
>>+    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>     smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>>     smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>>     smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
>>@@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)
>>     return 0;
>> }
>>
>>+/* Set the mtu of DPDK_DEV_VHOST ports */
>>+static int
>>+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>+{
>>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>+    int err = 0;
>>+    struct dpdk_mp *old_mp;
>>+    struct dpdk_mp *mp;
>>+
>>+    ovs_mutex_lock(&dpdk_mutex);
>>+    ovs_mutex_lock(&dev->mutex);
>>+    if (dev->mtu == mtu) {
>>+        err = 0;
>>+        goto out;
>>+    }
>>+
>>+    mp = dpdk_mp_get(dev->socket_id, mtu);
>>+    if (!mp) {
>>+        err = ENOMEM;
>>+        goto out;
>>+    }
>>+
>>+    old_mp = dev->dpdk_mp;
>>+    dev->dpdk_mp = mp;
>>+    dev->mtu = mtu;
>>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>+
>>+    dpdk_mp_put(old_mp);
>>+    netdev_change_seq_changed(netdev);
>>+out:
>>+    ovs_mutex_unlock(&dev->mutex);
>>+    ovs_mutex_unlock(&dpdk_mutex);
>>+    return err;
>>+}
>>+
>>+static int
>>+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap
>>*args)
>>+{
>>+    int mtu;
>>+
>>+    dpdk_dev_parse_mtu(args, &mtu);
>>+
>>+    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>+}
>>+
>> static void
>> dpdk_common_init(void)
>> {
>>@@ -2040,8 +2220,9 @@ unlock_dpdk:
>>     return err;
>> }
>>
>>-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND,
>>\
>>-    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
>>\
>>+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>+        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>>+        GET_STATUS, RXQ_RECV)                                          \
>> {                                                             \
>>     NAME,                                                     \
>>     INIT,                       /* init */                    \
>>@@ -2053,7 +2234,7 @@ unlock_dpdk:
>>     DESTRUCT,                                                 \
>>     netdev_dpdk_dealloc,                                      \
>>     netdev_dpdk_get_config,                                   \
>>-    NULL,                       /* netdev_dpdk_set_config */  \
>>+    SET_CONFIG,                                               \
>>     NULL,                       /* get_tunnel_config */       \
>>     NULL,                       /* build header */            \
>>     NULL,                       /* push header */             \
>>@@ -2067,7 +2248,7 @@ unlock_dpdk:
>>     netdev_dpdk_set_etheraddr,                                \
>>     netdev_dpdk_get_etheraddr,                                \
>>     netdev_dpdk_get_mtu,                                      \
>>-    netdev_dpdk_set_mtu,                                      \
>>+    SET_MTU,                                                  \
>>     netdev_dpdk_get_ifindex,                                  \
>>     GET_CARRIER,                                              \
>>     netdev_dpdk_get_carrier_resets,                           \
>>@@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>>         NULL,
>>         netdev_dpdk_construct,
>>         netdev_dpdk_destruct,
>>+        netdev_dpdk_set_config,
>>         netdev_dpdk_set_multiq,
>>         netdev_dpdk_eth_send,
>>+        netdev_dpdk_set_mtu,
>>         netdev_dpdk_get_carrier,
>>         netdev_dpdk_get_stats,
>>         netdev_dpdk_get_features,
>>@@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class =
>>         NULL,
>>         netdev_dpdk_ring_construct,
>>         netdev_dpdk_destruct,
>>+        netdev_dpdk_set_config,
>>         netdev_dpdk_set_multiq,
>>         netdev_dpdk_ring_send,
>>+        netdev_dpdk_set_mtu,
>>         netdev_dpdk_get_carrier,
>>         netdev_dpdk_get_stats,
>>         netdev_dpdk_get_features,
>>@@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
>>dpdk_vhost_cuse_class =
>>         dpdk_vhost_cuse_class_init,
>>         netdev_dpdk_vhost_cuse_construct,
>>         netdev_dpdk_vhost_destruct,
>>+        NULL,
>>         netdev_dpdk_vhost_set_multiq,
>>         netdev_dpdk_vhost_send,
>>+        NULL,
>>         netdev_dpdk_vhost_get_carrier,
>>         netdev_dpdk_vhost_get_stats,
>>         NULL,
>>@@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
>>dpdk_vhost_user_class =
>>         dpdk_vhost_user_class_init,
>>         netdev_dpdk_vhost_user_construct,
>>         netdev_dpdk_vhost_destruct,
>>+        netdev_dpdk_vhost_set_config,
>>         netdev_dpdk_vhost_set_multiq,
>>         netdev_dpdk_vhost_send,
>>+        netdev_dpdk_vhost_set_mtu,
>>         netdev_dpdk_vhost_get_carrier,
>>         netdev_dpdk_vhost_get_stats,
>>         NULL,
>>--
>>1.9.3
>>
>>_______________________________________________
>>dev mailing list
>>dev@openvswitch.org
>>http://openvswitch.org/mailman/listinfo/dev
Daniele Di Proietto Feb. 11, 2016, 2:55 a.m. UTC | #8
On 08/02/2016 03:23, "Kavanagh, Mark B" <mark.b.kavanagh@intel.com> wrote:

>>
>>Hi Mark,
>>
>
>Hi Daniele,
>
>Thanks for the review! Responses inline below.
>
>Cheers,
>Mark
>
>>This patch besides adding Jumbo Frame support also cleans up
>>the mbuf initialization (by changing the macros, adding
>>dpdk_buf_size, and rewriting __ovs_rte_pktmbuf_init), so thanks
>>for this.  I think it makes sense to split the patch in two:
>>one that does the clenup, and one that allows configuring the
>>MTU.
>
>Okay, sounds good - I'll spin another version, splitting into a two patch
>set.

Thanks!

>
>>
>>I agree with Flavio's comments as well, more inline
>>
>>Thanks
>>
>>On 01/02/2016 02:18, "Mark Kavanagh" <mark.b.kavanagh@intel.com> wrote:
>>
>>>Add support for Jumbo Frames to DPDK-enabled port types,
>>>using single-segment-mbufs.
>>>
>>>Using this approach, the amount of memory allocated for each mbuf
>>>to store frame data is increased to a value greater than 1518B
>>>(typical Ethernet maximum frame length). The increased space
>>>available in the mbuf means that an entire Jumbo Frame can be carried
>>>in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>segments.
>>>
>>>The amount of space allocated to each mbuf to hold frame data is
>>>defined dynamically by the user when adding a DPDK port to a bridge.
>>>If an MTU value is not supplied, or the user-supplied value is invalid,
>>>the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>
>>>Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>---
>>> INSTALL.DPDK.md   |  59 +++++++++-
>>> NEWS              |   2 +
>>> lib/netdev-dpdk.c | 347
>>>+++++++++++++++++++++++++++++++++++++++++-------------
>>> 3 files changed, 328 insertions(+), 80 deletions(-)
>>>
>>>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>index 96b686c..64ccd15 100644
>>>--- a/INSTALL.DPDK.md
>>>+++ b/INSTALL.DPDK.md
>>>@@ -859,10 +859,61 @@ by adding the following string:
>>> to <interface> sections of all network devices used by DPDK. Parameter
>>>'N'
>>> determines how many queues can be used by the guest.
>>>
>>>+
>>>+Jumbo Frames
>>>+------------
>>>+
>>>+Support for Jumbo Frames may be enabled at run-time for DPDK-type
>>>ports.
>>>+
>>>+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
>>>ovs-vsctl
>>>+'add-port' command-line, along with the required MTU for the port.
>>>+e.g.
>>>+
>>>+     ```
>>>+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>options:dpdk-mtu=9000
>>>+     ```
>>>+
>>>+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments
>>>are
>>>+increased, such that a full Jumbo Frame may be accommodated inside a
>>>single
>>>+mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>
>>Why is it immutable?
>
>I guess my rationale here is that an MTU change can't be triggered via
>OVS command-line, nor can it be triggered programmatically via DPDK
>(apart from an explicit call to rte_eth_dev_set_mtu).
>So, while technically it's possibly, from a user's point of view, there's
>no way to configure it, outside of modifying the code directly.
>If I've missed something here, please let me know.

With

ovs-vsctl set Interface options:dpdk-mtu=...

you could change it at runtime.  Have you considered that in this patch?
I guess it's ok to have it configurable only when the port is added for
now,
but we need to make sure that nothing bad happens if a user tries to change
it when it's running.

Also, after talking with other people, I think that a better name for the
parameter might be 'mtu_request', and we might want to use it also for
kernel
ports (but that doesn't need to be done now).


>
>>
>>>+
>>>+Jumbo frame support has been validated against 13312B frames, using the
>>>+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>+theoretically be supported. Supported port types excludes vHost-Cuse
>>>ports, as
>>>+this feature is pending deprecation.
>>>+
>>>+
>>>+vHost Ports and Jumbo Frames
>>>+----------------------------
>>>+Jumbo frame support is available for DPDK vHost-User ports only. Some
>>>additional
>>>+configuration is needed to take advantage of this feature:
>>>+
>>>+  1. `mergeable buffers` must be enabled for vHost ports, as
>>>demonstrated in
>>>+      the QEMU command line snippet below:
>>>+
>>>+      ```
>>>+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>+      '-device
>>>virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>+      ```
>>>+
>>>+  2. Where virtio devices are bound to the Linux kernel driver in a
>>>guest
>>>+     environment (i.e. interfaces are not bound to an in-guest DPDK
>>>driver), the
>>>+     MTU of those logical network interfaces must also be increased.
>>>This
>>>+     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU'
>>>refers
>>>+     to the length of the IP packet only, and not that of the entire
>>>frame.
>>>+
>>>+     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract
>>>the L2
>>>+     header and CRC lengths (i.e. 18B) from the max supported frame
>>>size.
>>>+     So, to set the MTU for a 13312B Jumbo Frame:
>>>+
>>>+      ```
>>>+      ifconfig eth1 mtu 13294
>>>+      ```
>>>+
>>>+
>>> Restrictions:
>>> -------------
>>>
>>>-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this
>>>issue.
>>>   - Currently DPDK port does not make use any offload functionality.
>>>   - DPDK-vHost support works with 1G huge pages.
>>>
>>>@@ -903,6 +954,12 @@ Restrictions:
>>>     the next release of DPDK (which includes the above patch) is
>>>available and
>>>     integrated into OVS.
>>>
>>>+  Jumbo Frames:
>>>+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
>>>source of
>>>+     this issue is currently being investigated.
>>>+  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
>>>ports.
>>>+
>>>+
>>> Bug Reporting:
>>> --------------
>>>
>>>diff --git a/NEWS b/NEWS
>>>index 5c18867..cd563e0 100644
>>>--- a/NEWS
>>>+++ b/NEWS
>>>@@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>>>      abstractions, such as virtual L2 and L3 overlays and security
>>>groups.
>>>    - RHEL packaging:
>>>      * DPDK ports may now be created via network scripts (see
>>>README.RHEL).
>>>+   - netdev-dpdk:
>>>+     * Add Jumbo Frame Support
>>
>>I don't think we should add this to 2.5
>
>Out of curiosity, is this mainly due to potential instability?
>In any event, I'll move it into the 'Post-v2.5.0' section.
>
>>
>>>
>>>
>>> v2.4.0 - 20 Aug 2015
>>>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>index de7e488..76a5dcc 100644
>>>--- a/lib/netdev-dpdk.c
>>>+++ b/lib/netdev-dpdk.c
>>>@@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>VLOG_RATE_LIMIT_INIT(5, 20);
>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>> #define OVS_VPORT_DPDK "ovs_dpdk"
>>>
>>>+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>
>>I don't see particular value in the above #define.
>
>I've removed this macro, and the section of code that used it in the
>latest version of the patch.
>
>>
>>>+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>+
>>> /*
>>>  * need to reserve tons of extra space in the mbufs so we can align the
>>>  * DMA addresses to 4KB.
>>>  * The minimum mbuf size is limited to avoid scatter behaviour and drop
>>>in
>>>  * performance for standard Ethernet MTU.
>>>  */
>>>-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>-                              + sizeof(struct dp_packet) \
>>>-                              + RTE_PKTMBUF_HEADROOM)
>>>-#define MBUF_SIZE_DRIVER     (2048                       \
>>>-                              + sizeof (struct rte_mbuf) \
>>>-                              + RTE_PKTMBUF_HEADROOM)
>>>-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN +
>>>ETHER_CRC_LEN)
>>>+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN -
>>>ETHER_CRC_LEN)
>>>+#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>+                                    + sizeof(struct dp_packet)   \
>>>+                                    + RTE_PKTMBUF_HEADROOM)
>>>+
>>>+/* This value should be specified as a multiple of the DPDK NIC
>>>driver's
>>>+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>+ */
>>>+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>
>>> /* Max and min number of packets in the mempool.  OVS tries to
>>>allocate a
>>>  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't
>>>have
>>>@@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>VLOG_RATE_LIMIT_INIT(5, 20);
>>> #define MIN_NB_MBUF          (4096 * 4)
>>> #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>
>>>+#define DPDK_VLAN_TAG_LEN    4
>>>+
>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF)
>>>== 0);
>>>
>>>@@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>         .header_split   = 0, /* Header Split disabled */
>>>         .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>         .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>         .hw_strip_crc   = 0,
>>>     },
>>>     .rx_adv_conf = {
>>>@@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>>>     return class->construct == netdev_dpdk_construct;
>>> }
>>>
>>>+/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
>>>igb_uio).
>>>+ * If 'frame_len' is not a multiple of this value, insufficient buffers
>>>are
>>>+ * allocated to accomodate the packet in its entirety. Furthermore, the
>>>igb_uio
>>>+ * driver needs to ensure that there is also sufficient space in the Rx
>>>buffer
>>>+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is
>>>too
>>>+ * small, then the driver enables scatter RX behaviour, which reduces
>>>+ * performance. To prevent this, use a buffer size that is closest to
>>>+ * 'frame_len', but which satisfies the aforementioned criteria.
>>>+ */
>>>+static uint32_t
>>>+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>+{
>>>+    struct rte_eth_dev_info info;
>>>+    uint32_t buf_size;
>>>+    /* XXX: This is a workaround for DPDK v2.2, and needs to be
>>>refactored with a
>>>+     * future DPDK release. */
>>
>>Could you elaborate on that?
>
>Due to changes pending in the latest version of the code, this is no
>longer relevant and will be removed.
>
>>
>>>+    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>+
>>>+    if(netdev->type == DPDK_DEV_ETH) {
>>>+        rte_eth_dev_info_get(netdev->port_id, &info);
>>>+        buf_size = (info.min_rx_bufsize == 0) ?
>>>+                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>+                   info.min_rx_bufsize;
>>>+    } else {
>>>+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>+    }
>>>+
>>>+    if(len % buf_size) {
>>>+        len = buf_size * ((len/buf_size) + 1);
>>>+    }
>>
>>I think this looks better with the ROUND_UP macro.
>
>True - I'll update the code accordingly.
>
>>
>>>+
>>>+    return len;
>>>+}
>>>+
>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be
>>>used
>>>  * for all other segments data, bss and text. */
>>>
>>>@@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>>> }
>>>
>>> static void
>>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>>-                   void *opaque_arg OVS_UNUSED,
>>>-                   void *_m,
>>>-                   unsigned i OVS_UNUSED)
>>>+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>> {
>>>-    struct rte_mbuf *m = _m;
>>>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>+    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>+    uint16_t roomsz;
>>>
>>>     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>
>>>+    /* if no structure is provided, assume no mbuf private area */
>>>+
>>>+    user_mbp_priv = opaque_arg;
>>>+    if (user_mbp_priv == NULL) {
>>>+        default_mbp_priv.mbuf_priv_size = 0;
>>>+        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>+            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>+        } else {
>>>+            roomsz = 0;
>>>+        }
>>>+        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>+        user_mbp_priv = &default_mbp_priv;
>>>+    }
>>>+
>>>+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>+        user_mbp_priv->mbuf_data_room_size +
>>>+        user_mbp_priv->mbuf_priv_size);
>>>+
>>>+    mbp_priv = rte_mempool_get_priv(mp);
>>>+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>+}
>>>+
>>>+/* Initialise some fields in the mbuf structure that are not modified
>>>by
>>>the
>>>+ * user once created (origin pool, buffer start address, etc.*/
>>>+static void
>>>+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>+                       void *opaque_arg OVS_UNUSED,
>>>+                       void *_m,
>>>+                       unsigned i OVS_UNUSED)
>>>+{
>>>+    struct rte_mbuf *m = _m;
>>>+    uint32_t buf_size, buf_len, priv_size;
>>>+
>>>+    priv_size = rte_pktmbuf_priv_size(mp);
>>>+    buf_size = sizeof(struct dp_packet) + priv_size;
>>>+    buf_len = rte_pktmbuf_data_room_size(mp);
>>>+
>>>+    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>>>priv_size);
>>>+    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>+    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>+
>>>     memset(m, 0, mp->elt_size);
>>>
>>>-    /* start of buffer is just after mbuf structure */
>>>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>-                    sizeof(struct dp_packet);
>>>+    /* start of buffer is after dp_packet structure and priv data */
>>>+    m->priv_size = priv_size;
>>>+    m->buf_addr = (char *)m + buf_size;
>>>+    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>     m->buf_len = (uint16_t)buf_len;
>>>
>>>     /* keep some headroom between start of buffer and data */
>>>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>+    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>>
>>>     /* init some constant fields */
>>>     m->pool = mp;
>>>@@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>> {
>>>     struct rte_mbuf *m = _m;
>>>
>>>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>
>>>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>> }
>>>@@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>OVS_REQUIRES(dpdk_mutex)
>>>     struct dpdk_mp *dmp = NULL;
>>>     char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>     unsigned mp_size;
>>>+    struct rte_pktmbuf_pool_private mbp_priv;
>>>
>>>     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>@@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>OVS_REQUIRES(dpdk_mutex)
>>>     dmp->socket_id = socket_id;
>>>     dmp->mtu = mtu;
>>>     dmp->refcount = 1;
>>>+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>RTE_PKTMBUF_HEADROOM;
>>>+    mbp_priv.mbuf_priv_size = 0;
>>>
>>>     mp_size = MAX_NB_MBUF;
>>>     do {
>>>@@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>OVS_REQUIRES(dpdk_mutex)
>>>             return NULL;
>>>         }
>>>
>>>-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>+        dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>MBUF_SEGMENT_SIZE(mtu),
>>>                                      MP_CACHE_SZ,
>>>                                      sizeof(struct
>>>rte_pktmbuf_pool_private),
>>>-                                     rte_pktmbuf_pool_init, NULL,
>>>+                                     ovs_rte_pktmbuf_pool_init,
>>>&mbp_priv,
>>>                                      ovs_rte_pktmbuf_init, NULL,
>>>                                      socket_id, 0);
>>>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>MIN_NB_MBUF);
>>
>>Ok, this is quite intricated. I believe the reason that OVS used its own
>>__rte_pktmbuf_init() was that it wasn't possible to have custom metadata
>>in mbufs before DPDK 2.1.
>>
>>If I'm not mistaken, with DPDK commit b507905ff407 there's a way to do
>>that
>>without copying any code from DPDK with the following incremental (from
>>master)
>>
>>---8<---
>>
>>@@ -278,42 +272,12 @@ free_dpdk_buf(struct dp_packet *p)
>> }
>>
>> static void
>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>-                   void *opaque_arg OVS_UNUSED,
>>-                   void *_m,
>>-                   unsigned i OVS_UNUSED)
>>-{
>>-    struct rte_mbuf *m = _m;
>>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>-
>>-    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>-
>>-    memset(m, 0, mp->elt_size);
>>-
>>-    /* start of buffer is just after mbuf structure */
>>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>-                    sizeof(struct dp_packet);
>>-    m->buf_len = (uint16_t)buf_len;
>>-
>>-    /* keep some headroom between start of buffer and data */
>>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>-
>>-    /* init some constant fields */
>>-    m->pool = mp;
>>-    m->nb_segs = 1;
>>-    m->port = 0xff;
>>-}
>>-
>>-static void
>>-ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>-                     void *opaque_arg OVS_UNUSED,
>>-                     void *_m,
>>-                     unsigned i OVS_UNUSED)
>>+ovs_rte_pktmbuf_init(struct rte_mempool *mp, void *opaque_arg,
>>+                     void *_m, unsigned i)
>> {
>>     struct rte_mbuf *m = _m;
>>
>>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>+    rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>
>>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>> }
>>@@ -339,15 +303,21 @@ dpdk_mp_get(int socket_id, int mtu)
>>OVS_REQUIRES(dpdk_mutex)
>>
>>     mp_size = MAX_NB_MBUF;
>>     do {
>>+        struct rte_pktmbuf_pool_private mbuf_sizes;
>>+
>>         if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>>                      dmp->mtu, dmp->socket_id, mp_size) < 0) {
>>             return NULL;
>>         }
>>
>>+        mbuf_sizes.mbuf_priv_size = sizeof (struct dp_packet)
>>+                                    - sizeof (struct rte_mbuf);
>>+        mbuf_sizes.mbuf_data_room_size = MBUF_SIZE(mtu)
>>+                                         - sizeof(struct dp_packet);
>>         dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>                                      MP_CACHE_SZ,
>>                                      sizeof(struct
>>rte_pktmbuf_pool_private),
>>-                                     rte_pktmbuf_pool_init, NULL,
>>+                                     rte_pktmbuf_pool_init, &mbuf_sizes,
>>                                      ovs_rte_pktmbuf_init, NULL,
>>                                      socket_id, 0);
>>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>MIN_NB_MBUF);
>>
>>---8<---
>>
>>I think this will make the patch cleaner. Do you think this will work
>>with
>>the increased MTU as well?
>
>Thanks for this Daniele. When I implemented the code, I attempted to
>change as little of the legacy code as possible, but seeing as how I'm
>now doing a patchset, I can roll this change in.
>Btw, I just tested with P2P 9k frames, and it works fine :)
>
>>
>>>@@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>>int
>>>n_rxq, int n_txq)
>>> {
>>>     int diag = 0;
>>>     int i;
>>>+    struct rte_eth_conf conf = port_conf;
>>>
>>>     /* A device may report more queues than it makes available (this
>>>has
>>>      * been observed for Intel xl710, which reserves some of them for
>>>@@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>>int n_rxq, int n_txq)
>>>             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq,
>>>n_txq);
>>>         }
>>>
>>>-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>&port_conf);
>>>+        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>+            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>+            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>+        }
>>>+
>>>+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>&conf);
>>>         if (diag) {
>>>             break;
>>>         }
>>>@@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>>int
>>>port_no,
>>>     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>     int sid;
>>>     int err = 0;
>>>+    uint32_t buf_size;
>>>
>>>     ovs_mutex_init(&netdev->mutex);
>>>     ovs_mutex_lock(&netdev->mutex);
>>>@@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>>int port_no,
>>>     netdev->port_id = port_no;
>>>     netdev->type = type;
>>>     netdev->flags = 0;
>>>+
>>>+    /* Initialize port's MTU and frame len to the default Ethernet
>>>values.
>>>+     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>+     * netdev_dpdk_set_config.
>>>+     */
>>>+    netdev->max_packet_len = ETHER_MAX_LEN;
>>>     netdev->mtu = ETHER_MTU;
>>>-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>
>>>-    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>+    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>FRAME_LEN_TO_MTU(buf_size));
>>>     if (!netdev->dpdk_mp) {
>>>         err = ENOMEM;
>>>         goto unlock;
>>>@@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
>>>char prefix[],
>>>     return 0;
>>> }
>>>
>>>+static void
>>>+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>+{
>>>+    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>+    char *end_ptr = NULL;
>>>+    int local_mtu;
>>>+
>>>+    if(mtu_str) {
>>>+        local_mtu = strtoul(mtu_str, &end_ptr, 0);
>>>+    }
>>>+    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>+                   local_mtu >
>>>FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
>>>+                   *end_ptr != '\0') {
>>>+        local_mtu = ETHER_MTU;
>>>+        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>to
>>>%d.\n",
>>>+                   local_mtu);
>>>+    }
>>>+
>>>+    *mtu = local_mtu;
>>>+}
>>>+
>>> static int
>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>> {
>>>@@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
>>>*netdev_, struct smap *args)
>>>     smap_add_format(args, "configured_rx_queues", "%d",
>>>netdev_->n_rxq);
>>>     smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>     smap_add_format(args, "configured_tx_queues", "%d",
>>>dev->real_n_txq);
>>>+    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>     ovs_mutex_unlock(&dev->mutex);
>>>
>>>     return 0;
>>> }
>>>
>>>+/* Set the mtu of DPDK_DEV_ETH ports */
>>>+static int
>>>+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>+{
>>>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>+    int old_mtu, err;
>>>+    uint32_t buf_size;
>>>+    int dpdk_mtu;
>>>+    struct dpdk_mp *old_mp;
>>>+    struct dpdk_mp *mp;
>>>+
>>>+    ovs_mutex_lock(&dpdk_mutex);
>>>+    ovs_mutex_lock(&dev->mutex);
>>>+    if (dev->mtu == mtu) {
>>>+        err = 0;
>>>+        goto out;
>>>+    }
>>>+
>>>+    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>+    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>+
>>>+    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>+    if (!mp) {
>>>+        err = ENOMEM;
>>>+        goto out;
>>>+    }
>>>+
>>>+    rte_eth_dev_stop(dev->port_id);
>>>+
>>>+    old_mtu = dev->mtu;
>>>+    old_mp = dev->dpdk_mp;
>>>+    dev->dpdk_mp = mp;
>>>+    dev->mtu = mtu;
>>>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>+
>>>+    err = dpdk_eth_dev_init(dev);
>>>+    if (err) {
>>>+        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>known "
>>>+                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>+        dpdk_mp_put(mp);
>>>+        dev->mtu = old_mtu;
>>>+        dev->dpdk_mp = old_mp;
>>>+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>+        dpdk_eth_dev_init(dev);
>>>+        goto out;
>>>+    } else {
>>>+        dpdk_mp_put(old_mp);
>>>+        netdev_change_seq_changed(netdev);
>>>+    }
>>>+out:
>>>+    ovs_mutex_unlock(&dev->mutex);
>>>+    ovs_mutex_unlock(&dpdk_mutex);
>>>+    return err;
>>>+}
>>>+
>>>+static int
>>>+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>+{
>>>+    int mtu;
>>>+
>>>+    dpdk_dev_parse_mtu(args, &mtu);
>>>+
>>>+    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>+}
>>>+
>>> static int
>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>> {
>>>@@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev,
>>>int *mtup)
>>>
>>>     return 0;
>>> }
>>>-
>>>-static int
>>>-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>-{
>>>-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>-    int old_mtu, err;
>>>-    struct dpdk_mp *old_mp;
>>>-    struct dpdk_mp *mp;
>>>-
>>>-    ovs_mutex_lock(&dpdk_mutex);
>>>-    ovs_mutex_lock(&dev->mutex);
>>>-    if (dev->mtu == mtu) {
>>>-        err = 0;
>>>-        goto out;
>>>-    }
>>>-
>>>-    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>-    if (!mp) {
>>>-        err = ENOMEM;
>>>-        goto out;
>>>-    }
>>>-
>>>-    rte_eth_dev_stop(dev->port_id);
>>>-
>>>-    old_mtu = dev->mtu;
>>>-    old_mp = dev->dpdk_mp;
>>>-    dev->dpdk_mp = mp;
>>>-    dev->mtu = mtu;
>>>-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>-
>>>-    err = dpdk_eth_dev_init(dev);
>>>-    if (err) {
>>>-        dpdk_mp_put(mp);
>>>-        dev->mtu = old_mtu;
>>>-        dev->dpdk_mp = old_mp;
>>>-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>-        dpdk_eth_dev_init(dev);
>>>-        goto out;
>>>-    }
>>>-
>>>-    dpdk_mp_put(old_mp);
>>>-    netdev_change_seq_changed(netdev);
>>>-out:
>>>-    ovs_mutex_unlock(&dev->mutex);
>>>-    ovs_mutex_unlock(&dpdk_mutex);
>>>-    return err;
>>>-}
>>>-
>>> static int
>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>
>>>@@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
>>>*netdev_, struct smap *args)
>>>     smap_add_format(args, "numa_id", "%d",
>>>rte_eth_dev_socket_id(dev->port_id));
>>>     smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>     smap_add_format(args, "min_rx_bufsize", "%u",
>>>dev_info.min_rx_bufsize);
>>>-    smap_add_format(args, "max_rx_pktlen", "%u",
>>>dev_info.max_rx_pktlen);
>>>+    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>     smap_add_format(args, "max_rx_queues", "%u",
>>>dev_info.max_rx_queues);
>>>     smap_add_format(args, "max_tx_queues", "%u",
>>>dev_info.max_tx_queues);
>>>     smap_add_format(args, "max_mac_addrs", "%u",
>>>dev_info.max_mac_addrs);
>>>@@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)
>>>     return 0;
>>> }
>>>
>>>+/* Set the mtu of DPDK_DEV_VHOST ports */
>>>+static int
>>>+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>+{
>>>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>+    int err = 0;
>>>+    struct dpdk_mp *old_mp;
>>>+    struct dpdk_mp *mp;
>>>+
>>>+    ovs_mutex_lock(&dpdk_mutex);
>>>+    ovs_mutex_lock(&dev->mutex);
>>>+    if (dev->mtu == mtu) {
>>>+        err = 0;
>>>+        goto out;
>>>+    }
>>>+
>>>+    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>+    if (!mp) {
>>>+        err = ENOMEM;
>>>+        goto out;
>>>+    }
>>>+
>>>+    old_mp = dev->dpdk_mp;
>>>+    dev->dpdk_mp = mp;
>>>+    dev->mtu = mtu;
>>>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>+
>>>+    dpdk_mp_put(old_mp);
>>>+    netdev_change_seq_changed(netdev);
>>>+out:
>>>+    ovs_mutex_unlock(&dev->mutex);
>>>+    ovs_mutex_unlock(&dpdk_mutex);
>>>+    return err;
>>>+}
>>>+
>>>+static int
>>>+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap
>>>*args)
>>>+{
>>>+    int mtu;
>>>+
>>>+    dpdk_dev_parse_mtu(args, &mtu);
>>>+
>>>+    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>+}
>>>+
>>> static void
>>> dpdk_common_init(void)
>>> {
>>>@@ -2040,8 +2220,9 @@ unlock_dpdk:
>>>     return err;
>>> }
>>>
>>>-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
>>>SEND,
>>>\
>>>-    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
>>>\
>>>+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG,
>>>\
>>>+        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,
>>>\
>>>+        GET_STATUS, RXQ_RECV)
>>>\
>>> {                                                             \
>>>     NAME,                                                     \
>>>     INIT,                       /* init */                    \
>>>@@ -2053,7 +2234,7 @@ unlock_dpdk:
>>>     DESTRUCT,                                                 \
>>>     netdev_dpdk_dealloc,                                      \
>>>     netdev_dpdk_get_config,                                   \
>>>-    NULL,                       /* netdev_dpdk_set_config */  \
>>>+    SET_CONFIG,                                               \
>>>     NULL,                       /* get_tunnel_config */       \
>>>     NULL,                       /* build header */            \
>>>     NULL,                       /* push header */             \
>>>@@ -2067,7 +2248,7 @@ unlock_dpdk:
>>>     netdev_dpdk_set_etheraddr,                                \
>>>     netdev_dpdk_get_etheraddr,                                \
>>>     netdev_dpdk_get_mtu,                                      \
>>>-    netdev_dpdk_set_mtu,                                      \
>>>+    SET_MTU,                                                  \
>>>     netdev_dpdk_get_ifindex,                                  \
>>>     GET_CARRIER,                                              \
>>>     netdev_dpdk_get_carrier_resets,                           \
>>>@@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>>>         NULL,
>>>         netdev_dpdk_construct,
>>>         netdev_dpdk_destruct,
>>>+        netdev_dpdk_set_config,
>>>         netdev_dpdk_set_multiq,
>>>         netdev_dpdk_eth_send,
>>>+        netdev_dpdk_set_mtu,
>>>         netdev_dpdk_get_carrier,
>>>         netdev_dpdk_get_stats,
>>>         netdev_dpdk_get_features,
>>>@@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class
>>>=
>>>         NULL,
>>>         netdev_dpdk_ring_construct,
>>>         netdev_dpdk_destruct,
>>>+        netdev_dpdk_set_config,
>>>         netdev_dpdk_set_multiq,
>>>         netdev_dpdk_ring_send,
>>>+        netdev_dpdk_set_mtu,
>>>         netdev_dpdk_get_carrier,
>>>         netdev_dpdk_get_stats,
>>>         netdev_dpdk_get_features,
>>>@@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
>>>dpdk_vhost_cuse_class =
>>>         dpdk_vhost_cuse_class_init,
>>>         netdev_dpdk_vhost_cuse_construct,
>>>         netdev_dpdk_vhost_destruct,
>>>+        NULL,
>>>         netdev_dpdk_vhost_set_multiq,
>>>         netdev_dpdk_vhost_send,
>>>+        NULL,
>>>         netdev_dpdk_vhost_get_carrier,
>>>         netdev_dpdk_vhost_get_stats,
>>>         NULL,
>>>@@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
>>>dpdk_vhost_user_class =
>>>         dpdk_vhost_user_class_init,
>>>         netdev_dpdk_vhost_user_construct,
>>>         netdev_dpdk_vhost_destruct,
>>>+        netdev_dpdk_vhost_set_config,
>>>         netdev_dpdk_vhost_set_multiq,
>>>         netdev_dpdk_vhost_send,
>>>+        netdev_dpdk_vhost_set_mtu,
>>>         netdev_dpdk_vhost_get_carrier,
>>>         netdev_dpdk_vhost_get_stats,
>>>         NULL,
>>>--
>>>1.9.3
>>>
>>>_______________________________________________
>>>dev mailing list
>>>dev@openvswitch.org
>>>http://openvswitch.org/mailman/listinfo/dev
>
Mark Kavanagh Feb. 11, 2016, 10:46 a.m. UTC | #9
>
>
>On 08/02/2016 03:23, "Kavanagh, Mark B" <mark.b.kavanagh@intel.com> wrote:
>
>>>
>>>Hi Mark,
>>>
>>
>>Hi Daniele,
>>
>>Thanks for the review! Responses inline below.
>>
>>Cheers,
>>Mark
>>
>>>This patch besides adding Jumbo Frame support also cleans up
>>>the mbuf initialization (by changing the macros, adding
>>>dpdk_buf_size, and rewriting __ovs_rte_pktmbuf_init), so thanks
>>>for this.  I think it makes sense to split the patch in two:
>>>one that does the clenup, and one that allows configuring the
>>>MTU.
>>
>>Okay, sounds good - I'll spin another version, splitting into a two patch
>>set.
>
>Thanks!
>
>>
>>>
>>>I agree with Flavio's comments as well, more inline
>>>
>>>Thanks
>>>
>>>On 01/02/2016 02:18, "Mark Kavanagh" <mark.b.kavanagh@intel.com> wrote:
>>>
>>>>Add support for Jumbo Frames to DPDK-enabled port types,
>>>>using single-segment-mbufs.
>>>>
>>>>Using this approach, the amount of memory allocated for each mbuf
>>>>to store frame data is increased to a value greater than 1518B
>>>>(typical Ethernet maximum frame length). The increased space
>>>>available in the mbuf means that an entire Jumbo Frame can be carried
>>>>in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>segments.
>>>>
>>>>The amount of space allocated to each mbuf to hold frame data is
>>>>defined dynamically by the user when adding a DPDK port to a bridge.
>>>>If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>
>>>>Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>---
>>>> INSTALL.DPDK.md   |  59 +++++++++-
>>>> NEWS              |   2 +
>>>> lib/netdev-dpdk.c | 347
>>>>+++++++++++++++++++++++++++++++++++++++++-------------
>>>> 3 files changed, 328 insertions(+), 80 deletions(-)
>>>>
>>>>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>index 96b686c..64ccd15 100644
>>>>--- a/INSTALL.DPDK.md
>>>>+++ b/INSTALL.DPDK.md
>>>>@@ -859,10 +859,61 @@ by adding the following string:
>>>> to <interface> sections of all network devices used by DPDK. Parameter
>>>>'N'
>>>> determines how many queues can be used by the guest.
>>>>
>>>>+
>>>>+Jumbo Frames
>>>>+------------
>>>>+
>>>>+Support for Jumbo Frames may be enabled at run-time for DPDK-type
>>>>ports.
>>>>+
>>>>+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
>>>>ovs-vsctl
>>>>+'add-port' command-line, along with the required MTU for the port.
>>>>+e.g.
>>>>+
>>>>+     ```
>>>>+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>>options:dpdk-mtu=9000
>>>>+     ```
>>>>+
>>>>+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments
>>>>are
>>>>+increased, such that a full Jumbo Frame may be accommodated inside a
>>>>single
>>>>+mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>
>>>Why is it immutable?
>>
>>I guess my rationale here is that an MTU change can't be triggered via
>>OVS command-line, nor can it be triggered programmatically via DPDK
>>(apart from an explicit call to rte_eth_dev_set_mtu).
>>So, while technically it's possibly, from a user's point of view, there's
>>no way to configure it, outside of modifying the code directly.
>>If I've missed something here, please let me know.
>
>With
>
>ovs-vsctl set Interface options:dpdk-mtu=...
>
>you could change it at runtime.  Have you considered that in this patch?
>I guess it's ok to have it configurable only when the port is added for
>now,

I'd actually only considered 'ovs-vsctl set Interface dpdk0 mtu XYZ'.
Since the MTU field in the Interface table is a status field (i.e. essentially read-only), updating the MTU in this way wouldn't have been possible.

Unfortunately, the same can't be said for the 'options' field - I'll need to look at this before I push a new version of the patch.

Thanks for the catch!

>but we need to make sure that nothing bad happens if a user tries to change
>it when it's running.
>
>Also, after talking with other people, I think that a better name for the
>parameter might be 'mtu_request', 

No problem, I'll rename it.

and we might want to use it also for
>kernel
>ports (but that doesn't need to be done now).
>
>
>>
>>>
>>>>+
>>>>+Jumbo frame support has been validated against 13312B frames, using the
>>>>+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>+theoretically be supported. Supported port types excludes vHost-Cuse
>>>>ports, as
>>>>+this feature is pending deprecation.
>>>>+
>>>>+
>>>>+vHost Ports and Jumbo Frames
>>>>+----------------------------
>>>>+Jumbo frame support is available for DPDK vHost-User ports only. Some
>>>>additional
>>>>+configuration is needed to take advantage of this feature:
>>>>+
>>>>+  1. `mergeable buffers` must be enabled for vHost ports, as
>>>>demonstrated in
>>>>+      the QEMU command line snippet below:
>>>>+
>>>>+      ```
>>>>+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>+      '-device
>>>>virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>+      ```
>>>>+
>>>>+  2. Where virtio devices are bound to the Linux kernel driver in a
>>>>guest
>>>>+     environment (i.e. interfaces are not bound to an in-guest DPDK
>>>>driver), the
>>>>+     MTU of those logical network interfaces must also be increased.
>>>>This
>>>>+     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU'
>>>>refers
>>>>+     to the length of the IP packet only, and not that of the entire
>>>>frame.
>>>>+
>>>>+     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract
>>>>the L2
>>>>+     header and CRC lengths (i.e. 18B) from the max supported frame
>>>>size.
>>>>+     So, to set the MTU for a 13312B Jumbo Frame:
>>>>+
>>>>+      ```
>>>>+      ifconfig eth1 mtu 13294
>>>>+      ```
>>>>+
>>>>+
>>>> Restrictions:
>>>> -------------
>>>>
>>>>-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this
>>>>issue.
>>>>   - Currently DPDK port does not make use any offload functionality.
>>>>   - DPDK-vHost support works with 1G huge pages.
>>>>
>>>>@@ -903,6 +954,12 @@ Restrictions:
>>>>     the next release of DPDK (which includes the above patch) is
>>>>available and
>>>>     integrated into OVS.
>>>>
>>>>+  Jumbo Frames:
>>>>+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
>>>>source of
>>>>+     this issue is currently being investigated.
>>>>+  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
>>>>ports.
>>>>+
>>>>+
>>>> Bug Reporting:
>>>> --------------
>>>>
>>>>diff --git a/NEWS b/NEWS
>>>>index 5c18867..cd563e0 100644
>>>>--- a/NEWS
>>>>+++ b/NEWS
>>>>@@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>>>>      abstractions, such as virtual L2 and L3 overlays and security
>>>>groups.
>>>>    - RHEL packaging:
>>>>      * DPDK ports may now be created via network scripts (see
>>>>README.RHEL).
>>>>+   - netdev-dpdk:
>>>>+     * Add Jumbo Frame Support
>>>
>>>I don't think we should add this to 2.5
>>
>>Out of curiosity, is this mainly due to potential instability?
>>In any event, I'll move it into the 'Post-v2.5.0' section.
>>
>>>
>>>>
>>>>
>>>> v2.4.0 - 20 Aug 2015
>>>>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>index de7e488..76a5dcc 100644
>>>>--- a/lib/netdev-dpdk.c
>>>>+++ b/lib/netdev-dpdk.c
>>>>@@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>>VLOG_RATE_LIMIT_INIT(5, 20);
>>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>> #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>
>>>>+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>
>>>I don't see particular value in the above #define.
>>
>>I've removed this macro, and the section of code that used it in the
>>latest version of the patch.
>>
>>>
>>>>+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>+
>>>> /*
>>>>  * need to reserve tons of extra space in the mbufs so we can align the
>>>>  * DMA addresses to 4KB.
>>>>  * The minimum mbuf size is limited to avoid scatter behaviour and drop
>>>>in
>>>>  * performance for standard Ethernet MTU.
>>>>  */
>>>>-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>-                              + sizeof(struct dp_packet) \
>>>>-                              + RTE_PKTMBUF_HEADROOM)
>>>>-#define MBUF_SIZE_DRIVER     (2048                       \
>>>>-                              + sizeof (struct rte_mbuf) \
>>>>-                              + RTE_PKTMBUF_HEADROOM)
>>>>-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN +
>>>>ETHER_CRC_LEN)
>>>>+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN -
>>>>ETHER_CRC_LEN)
>>>>+#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>+                                    + sizeof(struct dp_packet)   \
>>>>+                                    + RTE_PKTMBUF_HEADROOM)
>>>>+
>>>>+/* This value should be specified as a multiple of the DPDK NIC
>>>>driver's
>>>>+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>+ */
>>>>+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>
>>>> /* Max and min number of packets in the mempool.  OVS tries to
>>>>allocate a
>>>>  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't
>>>>have
>>>>@@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>>VLOG_RATE_LIMIT_INIT(5, 20);
>>>> #define MIN_NB_MBUF          (4096 * 4)
>>>> #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>
>>>>+#define DPDK_VLAN_TAG_LEN    4
>>>>+
>>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>>ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF)
>>>>== 0);
>>>>
>>>>@@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>         .header_split   = 0, /* Header Split disabled */
>>>>         .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>         .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>         .hw_strip_crc   = 0,
>>>>     },
>>>>     .rx_adv_conf = {
>>>>@@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>>>>     return class->construct == netdev_dpdk_construct;
>>>> }
>>>>
>>>>+/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
>>>>igb_uio).
>>>>+ * If 'frame_len' is not a multiple of this value, insufficient buffers
>>>>are
>>>>+ * allocated to accomodate the packet in its entirety. Furthermore, the
>>>>igb_uio
>>>>+ * driver needs to ensure that there is also sufficient space in the Rx
>>>>buffer
>>>>+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is
>>>>too
>>>>+ * small, then the driver enables scatter RX behaviour, which reduces
>>>>+ * performance. To prevent this, use a buffer size that is closest to
>>>>+ * 'frame_len', but which satisfies the aforementioned criteria.
>>>>+ */
>>>>+static uint32_t
>>>>+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>+{
>>>>+    struct rte_eth_dev_info info;
>>>>+    uint32_t buf_size;
>>>>+    /* XXX: This is a workaround for DPDK v2.2, and needs to be
>>>>refactored with a
>>>>+     * future DPDK release. */
>>>
>>>Could you elaborate on that?
>>
>>Due to changes pending in the latest version of the code, this is no
>>longer relevant and will be removed.
>>
>>>
>>>>+    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>+
>>>>+    if(netdev->type == DPDK_DEV_ETH) {
>>>>+        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>+        buf_size = (info.min_rx_bufsize == 0) ?
>>>>+                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>+                   info.min_rx_bufsize;
>>>>+    } else {
>>>>+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>+    }
>>>>+
>>>>+    if(len % buf_size) {
>>>>+        len = buf_size * ((len/buf_size) + 1);
>>>>+    }
>>>
>>>I think this looks better with the ROUND_UP macro.
>>
>>True - I'll update the code accordingly.
>>
>>>
>>>>+
>>>>+    return len;
>>>>+}
>>>>+
>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be
>>>>used
>>>>  * for all other segments data, bss and text. */
>>>>
>>>>@@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>>>> }
>>>>
>>>> static void
>>>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>-                   void *opaque_arg OVS_UNUSED,
>>>>-                   void *_m,
>>>>-                   unsigned i OVS_UNUSED)
>>>>+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>> {
>>>>-    struct rte_mbuf *m = _m;
>>>>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>+    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>+    uint16_t roomsz;
>>>>
>>>>     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>
>>>>+    /* if no structure is provided, assume no mbuf private area */
>>>>+
>>>>+    user_mbp_priv = opaque_arg;
>>>>+    if (user_mbp_priv == NULL) {
>>>>+        default_mbp_priv.mbuf_priv_size = 0;
>>>>+        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>+            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>+        } else {
>>>>+            roomsz = 0;
>>>>+        }
>>>>+        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>+        user_mbp_priv = &default_mbp_priv;
>>>>+    }
>>>>+
>>>>+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>+        user_mbp_priv->mbuf_data_room_size +
>>>>+        user_mbp_priv->mbuf_priv_size);
>>>>+
>>>>+    mbp_priv = rte_mempool_get_priv(mp);
>>>>+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>+}
>>>>+
>>>>+/* Initialise some fields in the mbuf structure that are not modified
>>>>by
>>>>the
>>>>+ * user once created (origin pool, buffer start address, etc.*/
>>>>+static void
>>>>+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>+                       void *opaque_arg OVS_UNUSED,
>>>>+                       void *_m,
>>>>+                       unsigned i OVS_UNUSED)
>>>>+{
>>>>+    struct rte_mbuf *m = _m;
>>>>+    uint32_t buf_size, buf_len, priv_size;
>>>>+
>>>>+    priv_size = rte_pktmbuf_priv_size(mp);
>>>>+    buf_size = sizeof(struct dp_packet) + priv_size;
>>>>+    buf_len = rte_pktmbuf_data_room_size(mp);
>>>>+
>>>>+    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>>>>priv_size);
>>>>+    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>+    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>+
>>>>     memset(m, 0, mp->elt_size);
>>>>
>>>>-    /* start of buffer is just after mbuf structure */
>>>>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>-                    sizeof(struct dp_packet);
>>>>+    /* start of buffer is after dp_packet structure and priv data */
>>>>+    m->priv_size = priv_size;
>>>>+    m->buf_addr = (char *)m + buf_size;
>>>>+    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>     m->buf_len = (uint16_t)buf_len;
>>>>
>>>>     /* keep some headroom between start of buffer and data */
>>>>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>+    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>>>
>>>>     /* init some constant fields */
>>>>     m->pool = mp;
>>>>@@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>> {
>>>>     struct rte_mbuf *m = _m;
>>>>
>>>>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>
>>>>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>> }
>>>>@@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>OVS_REQUIRES(dpdk_mutex)
>>>>     struct dpdk_mp *dmp = NULL;
>>>>     char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>     unsigned mp_size;
>>>>+    struct rte_pktmbuf_pool_private mbp_priv;
>>>>
>>>>     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>@@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>OVS_REQUIRES(dpdk_mutex)
>>>>     dmp->socket_id = socket_id;
>>>>     dmp->mtu = mtu;
>>>>     dmp->refcount = 1;
>>>>+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>>RTE_PKTMBUF_HEADROOM;
>>>>+    mbp_priv.mbuf_priv_size = 0;
>>>>
>>>>     mp_size = MAX_NB_MBUF;
>>>>     do {
>>>>@@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>OVS_REQUIRES(dpdk_mutex)
>>>>             return NULL;
>>>>         }
>>>>
>>>>-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>+        dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>>MBUF_SEGMENT_SIZE(mtu),
>>>>                                      MP_CACHE_SZ,
>>>>                                      sizeof(struct
>>>>rte_pktmbuf_pool_private),
>>>>-                                     rte_pktmbuf_pool_init, NULL,
>>>>+                                     ovs_rte_pktmbuf_pool_init,
>>>>&mbp_priv,
>>>>                                      ovs_rte_pktmbuf_init, NULL,
>>>>                                      socket_id, 0);
>>>>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>>MIN_NB_MBUF);
>>>
>>>Ok, this is quite intricated. I believe the reason that OVS used its own
>>>__rte_pktmbuf_init() was that it wasn't possible to have custom metadata
>>>in mbufs before DPDK 2.1.
>>>
>>>If I'm not mistaken, with DPDK commit b507905ff407 there's a way to do
>>>that
>>>without copying any code from DPDK with the following incremental (from
>>>master)
>>>
>>>---8<---
>>>
>>>@@ -278,42 +272,12 @@ free_dpdk_buf(struct dp_packet *p)
>>> }
>>>
>>> static void
>>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>>-                   void *opaque_arg OVS_UNUSED,
>>>-                   void *_m,
>>>-                   unsigned i OVS_UNUSED)
>>>-{
>>>-    struct rte_mbuf *m = _m;
>>>-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>-
>>>-    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>-
>>>-    memset(m, 0, mp->elt_size);
>>>-
>>>-    /* start of buffer is just after mbuf structure */
>>>-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>-                    sizeof(struct dp_packet);
>>>-    m->buf_len = (uint16_t)buf_len;
>>>-
>>>-    /* keep some headroom between start of buffer and data */
>>>-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>-
>>>-    /* init some constant fields */
>>>-    m->pool = mp;
>>>-    m->nb_segs = 1;
>>>-    m->port = 0xff;
>>>-}
>>>-
>>>-static void
>>>-ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>-                     void *opaque_arg OVS_UNUSED,
>>>-                     void *_m,
>>>-                     unsigned i OVS_UNUSED)
>>>+ovs_rte_pktmbuf_init(struct rte_mempool *mp, void *opaque_arg,
>>>+                     void *_m, unsigned i)
>>> {
>>>     struct rte_mbuf *m = _m;
>>>
>>>-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>+    rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>
>>>     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>> }
>>>@@ -339,15 +303,21 @@ dpdk_mp_get(int socket_id, int mtu)
>>>OVS_REQUIRES(dpdk_mutex)
>>>
>>>     mp_size = MAX_NB_MBUF;
>>>     do {
>>>+        struct rte_pktmbuf_pool_private mbuf_sizes;
>>>+
>>>         if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>>>                      dmp->mtu, dmp->socket_id, mp_size) < 0) {
>>>             return NULL;
>>>         }
>>>
>>>+        mbuf_sizes.mbuf_priv_size = sizeof (struct dp_packet)
>>>+                                    - sizeof (struct rte_mbuf);
>>>+        mbuf_sizes.mbuf_data_room_size = MBUF_SIZE(mtu)
>>>+                                         - sizeof(struct dp_packet);
>>>         dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>                                      MP_CACHE_SZ,
>>>                                      sizeof(struct
>>>rte_pktmbuf_pool_private),
>>>-                                     rte_pktmbuf_pool_init, NULL,
>>>+                                     rte_pktmbuf_pool_init, &mbuf_sizes,
>>>                                      ovs_rte_pktmbuf_init, NULL,
>>>                                      socket_id, 0);
>>>     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>MIN_NB_MBUF);
>>>
>>>---8<---
>>>
>>>I think this will make the patch cleaner. Do you think this will work
>>>with
>>>the increased MTU as well?
>>
>>Thanks for this Daniele. When I implemented the code, I attempted to
>>change as little of the legacy code as possible, but seeing as how I'm
>>now doing a patchset, I can roll this change in.
>>Btw, I just tested with P2P 9k frames, and it works fine :)
>>
>>>
>>>>@@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>>>int
>>>>n_rxq, int n_txq)
>>>> {
>>>>     int diag = 0;
>>>>     int i;
>>>>+    struct rte_eth_conf conf = port_conf;
>>>>
>>>>     /* A device may report more queues than it makes available (this
>>>>has
>>>>      * been observed for Intel xl710, which reserves some of them for
>>>>@@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>>>int n_rxq, int n_txq)
>>>>             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq,
>>>>n_txq);
>>>>         }
>>>>
>>>>-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>>&port_conf);
>>>>+        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>+            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>+            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+        }
>>>>+
>>>>+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>>&conf);
>>>>         if (diag) {
>>>>             break;
>>>>         }
>>>>@@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>>>int
>>>>port_no,
>>>>     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>     int sid;
>>>>     int err = 0;
>>>>+    uint32_t buf_size;
>>>>
>>>>     ovs_mutex_init(&netdev->mutex);
>>>>     ovs_mutex_lock(&netdev->mutex);
>>>>@@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>>>int port_no,
>>>>     netdev->port_id = port_no;
>>>>     netdev->type = type;
>>>>     netdev->flags = 0;
>>>>+
>>>>+    /* Initialize port's MTU and frame len to the default Ethernet
>>>>values.
>>>>+     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>+     * netdev_dpdk_set_config.
>>>>+     */
>>>>+    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>     netdev->mtu = ETHER_MTU;
>>>>-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>
>>>>-    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>+    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>>FRAME_LEN_TO_MTU(buf_size));
>>>>     if (!netdev->dpdk_mp) {
>>>>         err = ENOMEM;
>>>>         goto unlock;
>>>>@@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
>>>>char prefix[],
>>>>     return 0;
>>>> }
>>>>
>>>>+static void
>>>>+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>+{
>>>>+    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>+    char *end_ptr = NULL;
>>>>+    int local_mtu;
>>>>+
>>>>+    if(mtu_str) {
>>>>+        local_mtu = strtoul(mtu_str, &end_ptr, 0);
>>>>+    }
>>>>+    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>+                   local_mtu >
>>>>FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
>>>>+                   *end_ptr != '\0') {
>>>>+        local_mtu = ETHER_MTU;
>>>>+        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>>to
>>>>%d.\n",
>>>>+                   local_mtu);
>>>>+    }
>>>>+
>>>>+    *mtu = local_mtu;
>>>>+}
>>>>+
>>>> static int
>>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>> {
>>>>@@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
>>>>*netdev_, struct smap *args)
>>>>     smap_add_format(args, "configured_rx_queues", "%d",
>>>>netdev_->n_rxq);
>>>>     smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>>     smap_add_format(args, "configured_tx_queues", "%d",
>>>>dev->real_n_txq);
>>>>+    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>     ovs_mutex_unlock(&dev->mutex);
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>>+/* Set the mtu of DPDK_DEV_ETH ports */
>>>>+static int
>>>>+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>+{
>>>>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>+    int old_mtu, err;
>>>>+    uint32_t buf_size;
>>>>+    int dpdk_mtu;
>>>>+    struct dpdk_mp *old_mp;
>>>>+    struct dpdk_mp *mp;
>>>>+
>>>>+    ovs_mutex_lock(&dpdk_mutex);
>>>>+    ovs_mutex_lock(&dev->mutex);
>>>>+    if (dev->mtu == mtu) {
>>>>+        err = 0;
>>>>+        goto out;
>>>>+    }
>>>>+
>>>>+    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>+    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>+
>>>>+    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>+    if (!mp) {
>>>>+        err = ENOMEM;
>>>>+        goto out;
>>>>+    }
>>>>+
>>>>+    rte_eth_dev_stop(dev->port_id);
>>>>+
>>>>+    old_mtu = dev->mtu;
>>>>+    old_mp = dev->dpdk_mp;
>>>>+    dev->dpdk_mp = mp;
>>>>+    dev->mtu = mtu;
>>>>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+
>>>>+    err = dpdk_eth_dev_init(dev);
>>>>+    if (err) {
>>>>+        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>>known "
>>>>+                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>+        dpdk_mp_put(mp);
>>>>+        dev->mtu = old_mtu;
>>>>+        dev->dpdk_mp = old_mp;
>>>>+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+        dpdk_eth_dev_init(dev);
>>>>+        goto out;
>>>>+    } else {
>>>>+        dpdk_mp_put(old_mp);
>>>>+        netdev_change_seq_changed(netdev);
>>>>+    }
>>>>+out:
>>>>+    ovs_mutex_unlock(&dev->mutex);
>>>>+    ovs_mutex_unlock(&dpdk_mutex);
>>>>+    return err;
>>>>+}
>>>>+
>>>>+static int
>>>>+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>+{
>>>>+    int mtu;
>>>>+
>>>>+    dpdk_dev_parse_mtu(args, &mtu);
>>>>+
>>>>+    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>+}
>>>>+
>>>> static int
>>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>> {
>>>>@@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev,
>>>>int *mtup)
>>>>
>>>>     return 0;
>>>> }
>>>>-
>>>>-static int
>>>>-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>-{
>>>>-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>-    int old_mtu, err;
>>>>-    struct dpdk_mp *old_mp;
>>>>-    struct dpdk_mp *mp;
>>>>-
>>>>-    ovs_mutex_lock(&dpdk_mutex);
>>>>-    ovs_mutex_lock(&dev->mutex);
>>>>-    if (dev->mtu == mtu) {
>>>>-        err = 0;
>>>>-        goto out;
>>>>-    }
>>>>-
>>>>-    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>-    if (!mp) {
>>>>-        err = ENOMEM;
>>>>-        goto out;
>>>>-    }
>>>>-
>>>>-    rte_eth_dev_stop(dev->port_id);
>>>>-
>>>>-    old_mtu = dev->mtu;
>>>>-    old_mp = dev->dpdk_mp;
>>>>-    dev->dpdk_mp = mp;
>>>>-    dev->mtu = mtu;
>>>>-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>-
>>>>-    err = dpdk_eth_dev_init(dev);
>>>>-    if (err) {
>>>>-        dpdk_mp_put(mp);
>>>>-        dev->mtu = old_mtu;
>>>>-        dev->dpdk_mp = old_mp;
>>>>-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>-        dpdk_eth_dev_init(dev);
>>>>-        goto out;
>>>>-    }
>>>>-
>>>>-    dpdk_mp_put(old_mp);
>>>>-    netdev_change_seq_changed(netdev);
>>>>-out:
>>>>-    ovs_mutex_unlock(&dev->mutex);
>>>>-    ovs_mutex_unlock(&dpdk_mutex);
>>>>-    return err;
>>>>-}
>>>>-
>>>> static int
>>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>
>>>>@@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
>>>>*netdev_, struct smap *args)
>>>>     smap_add_format(args, "numa_id", "%d",
>>>>rte_eth_dev_socket_id(dev->port_id));
>>>>     smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>     smap_add_format(args, "min_rx_bufsize", "%u",
>>>>dev_info.min_rx_bufsize);
>>>>-    smap_add_format(args, "max_rx_pktlen", "%u",
>>>>dev_info.max_rx_pktlen);
>>>>+    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>     smap_add_format(args, "max_rx_queues", "%u",
>>>>dev_info.max_rx_queues);
>>>>     smap_add_format(args, "max_tx_queues", "%u",
>>>>dev_info.max_tx_queues);
>>>>     smap_add_format(args, "max_mac_addrs", "%u",
>>>>dev_info.max_mac_addrs);
>>>>@@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)
>>>>     return 0;
>>>> }
>>>>
>>>>+/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>+static int
>>>>+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>+{
>>>>+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>+    int err = 0;
>>>>+    struct dpdk_mp *old_mp;
>>>>+    struct dpdk_mp *mp;
>>>>+
>>>>+    ovs_mutex_lock(&dpdk_mutex);
>>>>+    ovs_mutex_lock(&dev->mutex);
>>>>+    if (dev->mtu == mtu) {
>>>>+        err = 0;
>>>>+        goto out;
>>>>+    }
>>>>+
>>>>+    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>+    if (!mp) {
>>>>+        err = ENOMEM;
>>>>+        goto out;
>>>>+    }
>>>>+
>>>>+    old_mp = dev->dpdk_mp;
>>>>+    dev->dpdk_mp = mp;
>>>>+    dev->mtu = mtu;
>>>>+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+
>>>>+    dpdk_mp_put(old_mp);
>>>>+    netdev_change_seq_changed(netdev);
>>>>+out:
>>>>+    ovs_mutex_unlock(&dev->mutex);
>>>>+    ovs_mutex_unlock(&dpdk_mutex);
>>>>+    return err;
>>>>+}
>>>>+
>>>>+static int
>>>>+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap
>>>>*args)
>>>>+{
>>>>+    int mtu;
>>>>+
>>>>+    dpdk_dev_parse_mtu(args, &mtu);
>>>>+
>>>>+    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>+}
>>>>+
>>>> static void
>>>> dpdk_common_init(void)
>>>> {
>>>>@@ -2040,8 +2220,9 @@ unlock_dpdk:
>>>>     return err;
>>>> }
>>>>
>>>>-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
>>>>SEND,
>>>>\
>>>>-    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
>>>>\
>>>>+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG,
>>>>\
>>>>+        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,
>>>>\
>>>>+        GET_STATUS, RXQ_RECV)
>>>>\
>>>> {                                                             \
>>>>     NAME,                                                     \
>>>>     INIT,                       /* init */                    \
>>>>@@ -2053,7 +2234,7 @@ unlock_dpdk:
>>>>     DESTRUCT,                                                 \
>>>>     netdev_dpdk_dealloc,                                      \
>>>>     netdev_dpdk_get_config,                                   \
>>>>-    NULL,                       /* netdev_dpdk_set_config */  \
>>>>+    SET_CONFIG,                                               \
>>>>     NULL,                       /* get_tunnel_config */       \
>>>>     NULL,                       /* build header */            \
>>>>     NULL,                       /* push header */             \
>>>>@@ -2067,7 +2248,7 @@ unlock_dpdk:
>>>>     netdev_dpdk_set_etheraddr,                                \
>>>>     netdev_dpdk_get_etheraddr,                                \
>>>>     netdev_dpdk_get_mtu,                                      \
>>>>-    netdev_dpdk_set_mtu,                                      \
>>>>+    SET_MTU,                                                  \
>>>>     netdev_dpdk_get_ifindex,                                  \
>>>>     GET_CARRIER,                                              \
>>>>     netdev_dpdk_get_carrier_resets,                           \
>>>>@@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>>>>         NULL,
>>>>         netdev_dpdk_construct,
>>>>         netdev_dpdk_destruct,
>>>>+        netdev_dpdk_set_config,
>>>>         netdev_dpdk_set_multiq,
>>>>         netdev_dpdk_eth_send,
>>>>+        netdev_dpdk_set_mtu,
>>>>         netdev_dpdk_get_carrier,
>>>>         netdev_dpdk_get_stats,
>>>>         netdev_dpdk_get_features,
>>>>@@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class
>>>>=
>>>>         NULL,
>>>>         netdev_dpdk_ring_construct,
>>>>         netdev_dpdk_destruct,
>>>>+        netdev_dpdk_set_config,
>>>>         netdev_dpdk_set_multiq,
>>>>         netdev_dpdk_ring_send,
>>>>+        netdev_dpdk_set_mtu,
>>>>         netdev_dpdk_get_carrier,
>>>>         netdev_dpdk_get_stats,
>>>>         netdev_dpdk_get_features,
>>>>@@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
>>>>dpdk_vhost_cuse_class =
>>>>         dpdk_vhost_cuse_class_init,
>>>>         netdev_dpdk_vhost_cuse_construct,
>>>>         netdev_dpdk_vhost_destruct,
>>>>+        NULL,
>>>>         netdev_dpdk_vhost_set_multiq,
>>>>         netdev_dpdk_vhost_send,
>>>>+        NULL,
>>>>         netdev_dpdk_vhost_get_carrier,
>>>>         netdev_dpdk_vhost_get_stats,
>>>>         NULL,
>>>>@@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
>>>>dpdk_vhost_user_class =
>>>>         dpdk_vhost_user_class_init,
>>>>         netdev_dpdk_vhost_user_construct,
>>>>         netdev_dpdk_vhost_destruct,
>>>>+        netdev_dpdk_vhost_set_config,
>>>>         netdev_dpdk_vhost_set_multiq,
>>>>         netdev_dpdk_vhost_send,
>>>>+        netdev_dpdk_vhost_set_mtu,
>>>>         netdev_dpdk_vhost_get_carrier,
>>>>         netdev_dpdk_vhost_get_stats,
>>>>         NULL,
>>>>--
>>>>1.9.3
>>>>
>>>>_______________________________________________
>>>>dev mailing list
>>>>dev@openvswitch.org
>>>>http://openvswitch.org/mailman/listinfo/dev
>>
Flavio Leitner Feb. 12, 2016, 1:20 p.m. UTC | #10
On Thu, 11 Feb 2016 02:55:43 +0000
Daniele Di Proietto <diproiettod@vmware.com> wrote:

> On 08/02/2016 03:23, "Kavanagh, Mark B" <mark.b.kavanagh@intel.com> wrote:
> 
> >>
> >>Hi Mark,
> >>  
> >
> >Hi Daniele,
> >
> >Thanks for the review! Responses inline below.
> >
> >Cheers,
> >Mark
> >  
> >>This patch besides adding Jumbo Frame support also cleans up
> >>the mbuf initialization (by changing the macros, adding
> >>dpdk_buf_size, and rewriting __ovs_rte_pktmbuf_init), so thanks
> >>for this.  I think it makes sense to split the patch in two:
> >>one that does the clenup, and one that allows configuring the
> >>MTU.  
> >
> >Okay, sounds good - I'll spin another version, splitting into a two patch
> >set.  
> 
> Thanks!
> 
> >  
> >>
> >>I agree with Flavio's comments as well, more inline
> >>
> >>Thanks
> >>
> >>On 01/02/2016 02:18, "Mark Kavanagh" <mark.b.kavanagh@intel.com> wrote:
> >>  
> >>>Add support for Jumbo Frames to DPDK-enabled port types,
> >>>using single-segment-mbufs.
> >>>
> >>>Using this approach, the amount of memory allocated for each mbuf
> >>>to store frame data is increased to a value greater than 1518B
> >>>(typical Ethernet maximum frame length). The increased space
> >>>available in the mbuf means that an entire Jumbo Frame can be carried
> >>>in a single mbuf, as opposed to partitioning it across multiple mbuf
> >>>segments.
> >>>
> >>>The amount of space allocated to each mbuf to hold frame data is
> >>>defined dynamically by the user when adding a DPDK port to a bridge.
> >>>If an MTU value is not supplied, or the user-supplied value is invalid,
> >>>the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
> >>>
> >>>Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> >>>---
> >>> INSTALL.DPDK.md   |  59 +++++++++-
> >>> NEWS              |   2 +
> >>> lib/netdev-dpdk.c | 347
> >>>+++++++++++++++++++++++++++++++++++++++++-------------
> >>> 3 files changed, 328 insertions(+), 80 deletions(-)
> >>>
> >>>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> >>>index 96b686c..64ccd15 100644
> >>>--- a/INSTALL.DPDK.md
> >>>+++ b/INSTALL.DPDK.md
> >>>@@ -859,10 +859,61 @@ by adding the following string:
> >>> to <interface> sections of all network devices used by DPDK. Parameter
> >>>'N'
> >>> determines how many queues can be used by the guest.
> >>>
> >>>+
> >>>+Jumbo Frames
> >>>+------------
> >>>+
> >>>+Support for Jumbo Frames may be enabled at run-time for DPDK-type
> >>>ports.
> >>>+
> >>>+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
> >>>ovs-vsctl
> >>>+'add-port' command-line, along with the required MTU for the port.
> >>>+e.g.
> >>>+
> >>>+     ```
> >>>+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> >>>options:dpdk-mtu=9000
> >>>+     ```
> >>>+
> >>>+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments
> >>>are
> >>>+increased, such that a full Jumbo Frame may be accommodated inside a
> >>>single
> >>>+mbuf segment. Once set, the MTU for a DPDK port is immutable.  
> >>
> >>Why is it immutable?  
> >
> >I guess my rationale here is that an MTU change can't be triggered via
> >OVS command-line, nor can it be triggered programmatically via DPDK
> >(apart from an explicit call to rte_eth_dev_set_mtu).
> >So, while technically it's possibly, from a user's point of view, there's
> >no way to configure it, outside of modifying the code directly.
> >If I've missed something here, please let me know.  
> 
> With
> 
> ovs-vsctl set Interface options:dpdk-mtu=...
> 
> you could change it at runtime.  Have you considered that in this patch?
> I guess it's ok to have it configurable only when the port is added for
> now,
> but we need to make sure that nothing bad happens if a user tries to change
> it when it's running.
> 
> Also, after talking with other people, I think that a better name for the
> parameter might be 'mtu_request', and we might want to use it also for
> kernel
> ports (but that doesn't need to be done now).
>

I am not sure about that idea because if you think in other scenarios, we
use the same way to set and to verify the MTU. So, having mtu-request and
mtu seems to be counter intuitive for me, though it is better than dpdk-mtu.
Flavio Leitner Feb. 12, 2016, 2:09 p.m. UTC | #11
On Mon, 8 Feb 2016 10:54:08 +0000
"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> wrote:

> >
> >Hi,
> >
> >Some comments below.
> >  
> 
> Hi Flavio,
> 
> Thanks for your review!
> 
> I've responded to your comments inline - I've also addressed the
> cosmetic changes that you suggested in V3 of the patch.
> 
> Thanks again,
> Mark
> 
> >
> >On Mon,  1 Feb 2016 10:18:54 +0000
> >Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:
> >  
> >> Add support for Jumbo Frames to DPDK-enabled port types,
> >> using single-segment-mbufs.
> >>
> >> Using this approach, the amount of memory allocated for each mbuf
> >> to store frame data is increased to a value greater than 1518B
> >> (typical Ethernet maximum frame length). The increased space
> >> available in the mbuf means that an entire Jumbo Frame can be
> >> carried in a single mbuf, as opposed to partitioning it across
> >> multiple mbuf segments.
> >>
> >> The amount of space allocated to each mbuf to hold frame data is
> >> defined dynamically by the user when adding a DPDK port to a
> >> bridge. If an MTU value is not supplied, or the user-supplied
> >> value is invalid, the MTU for the port defaults to standard
> >> Ethernet MTU (i.e. 1500B).
> >>
> >> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> >> ---
> >>  INSTALL.DPDK.md   |  59 +++++++++-
> >>  NEWS              |   2 +
> >>  lib/netdev-dpdk.c | 347
> >> +++++++++++++++++++++++++++++++++++++++++------------- 3 files
> >> changed, 328 insertions(+), 80 deletions(-)
[...]

> >> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
> >> segments are +increased, such that a full Jumbo Frame may be
> >> accommodated inside a single +mbuf segment. Once set, the MTU for a
> >> DPDK port is immutable. +  
> >
> >I think we need some protection here because if someone tries to
> >change that, it will segfault PMD threads.
> >  
> 
> Could you elaborate on this a bit more; do you mean that PMD threads
> would segfault if a user attempts to change the MTU of a running DPDK
> port? If so, how would that be possible (AFAIA, there's no way to
> programmatically modify the MTU of OVS ports, outside of standard
> Linux utilities such as ifconfig, which DPDK ports typically aren't
> subject to; there's also AFAIA no way to programmatically modify MTU
> of DPDK ports outside of modifying the source code).

I have a ovs bridge with two dpdk ports created with these commands:

# ovs-vsctl add-port ovsbr0 dpdk0 \
        -- set interface dpdk0 type=dpdk options:dpdk-mtu=9000 ofport_request=10
# ovs-vsctl add-port ovsbr0 dpdk1 \
        -- set interface dpdk1 type=dpdk options:dpdk-mtu=9000 ofport_request=11

Now I am pushing traffic (jumbo frames or small frames) and then
I change dpdk-mtu to 1500.

# ovs-vsctl set interface dpdk1 options:dpdk-mtu=1500

Then I see this:
2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes: 
pid 78009 died, killed (Segmentation fault), core dumped, restarting
...

This is the backtrace:
#0  0x00007f1909cf4423 in _recv_raw_pkts_vec () from /lib64/libdpdk.so
#1  0x00000000004fd595 in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f18c37fd970, queue_id=2, 
    port_id=1 '\001') at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2510
#2  netdev_dpdk_rxq_recv (rxq_=<optimized out>, packets=0x7f18c37fd970, c=0x7f18c37fd96c)
    at ../lib/netdev-dpdk.c:1216
#3  0x000000000047c3c1 in netdev_rxq_recv (rx=<optimized out>, buffers=buffers@entry=0x7f18c37fd970, 
    cnt=cnt@entry=0x7f18c37fd96c) at ../lib/netdev.c:654
#4  0x000000000045e566 in dp_netdev_process_rxq_port (pmd=pmd@entry=0x1e1cd90, rxq=<optimized out>, 
    port=<optimized out>, port=<optimized out>) at ../lib/dpif-netdev.c:2555
#5  0x000000000045e9f9 in pmd_thread_main (f_=0x1e1cd90) at ../lib/dpif-netdev.c:2691
#6  0x00000000004bab64 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:340
#7  0x00007f1909583555 in start_thread (arg=0x7f18c37fe700) at pthread_create.c:333
#8  0x00007f1908daeb9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109


Dump of assembler code for function _recv_raw_pkts_vec:
   0x00007fa8d51ccf30 <+0>:     push   %r15
   0x00007fa8d51ccf32 <+2>:     mov    %edx,%r15d
   0x00007fa8d51ccf35 <+5>:     push   %r14
   0x00007fa8d51ccf37 <+7>:     push   %r13
   0x00007fa8d51ccf39 <+9>:     push   %r12
   0x00007fa8d51ccf3b <+11>:    push   %rbp
   0x00007fa8d51ccf3c <+12>:    push   %rbx
   0x00007fa8d51ccf3d <+13>:    sub    $0x38,%rsp
=> 0x00007fa8d51ccf41 <+17>:    movzbl 0x69(%rdi),%eax

rdi            0x0      0

Looks like dev->data->rx_queues[queue_id] is NULL.
Mark Kavanagh Feb. 12, 2016, 2:36 p.m. UTC | #12
>> >
>> >Hi,
>> >
>> >Some comments below.
>> >
>>
>> Hi Flavio,
>>
>> Thanks for your review!
>>
>> I've responded to your comments inline - I've also addressed the
>> cosmetic changes that you suggested in V3 of the patch.
>>
>> Thanks again,
>> Mark
>>
>> >
>> >On Mon,  1 Feb 2016 10:18:54 +0000
>> >Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:
>> >
>> >> Add support for Jumbo Frames to DPDK-enabled port types,
>> >> using single-segment-mbufs.
>> >>
>> >> Using this approach, the amount of memory allocated for each mbuf
>> >> to store frame data is increased to a value greater than 1518B
>> >> (typical Ethernet maximum frame length). The increased space
>> >> available in the mbuf means that an entire Jumbo Frame can be
>> >> carried in a single mbuf, as opposed to partitioning it across
>> >> multiple mbuf segments.
>> >>
>> >> The amount of space allocated to each mbuf to hold frame data is
>> >> defined dynamically by the user when adding a DPDK port to a
>> >> bridge. If an MTU value is not supplied, or the user-supplied
>> >> value is invalid, the MTU for the port defaults to standard
>> >> Ethernet MTU (i.e. 1500B).
>> >>
>> >> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> >> ---
>> >>  INSTALL.DPDK.md   |  59 +++++++++-
>> >>  NEWS              |   2 +
>> >>  lib/netdev-dpdk.c | 347
>> >> +++++++++++++++++++++++++++++++++++++++++------------- 3 files
>> >> changed, 328 insertions(+), 80 deletions(-)
>[...]
>
>> >> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>> >> segments are +increased, such that a full Jumbo Frame may be
>> >> accommodated inside a single +mbuf segment. Once set, the MTU for a
>> >> DPDK port is immutable. +
>> >
>> >I think we need some protection here because if someone tries to
>> >change that, it will segfault PMD threads.
>> >
>>
>> Could you elaborate on this a bit more; do you mean that PMD threads
>> would segfault if a user attempts to change the MTU of a running DPDK
>> port? If so, how would that be possible (AFAIA, there's no way to
>> programmatically modify the MTU of OVS ports, outside of standard
>> Linux utilities such as ifconfig, which DPDK ports typically aren't
>> subject to; there's also AFAIA no way to programmatically modify MTU
>> of DPDK ports outside of modifying the source code).
>
>I have a ovs bridge with two dpdk ports created with these commands:
>
># ovs-vsctl add-port ovsbr0 dpdk0 \
>        -- set interface dpdk0 type=dpdk options:dpdk-mtu=9000 ofport_request=10
># ovs-vsctl add-port ovsbr0 dpdk1 \
>        -- set interface dpdk1 type=dpdk options:dpdk-mtu=9000 ofport_request=11
>
>Now I am pushing traffic (jumbo frames or small frames) and then
>I change dpdk-mtu to 1500.
>
># ovs-vsctl set interface dpdk1 options:dpdk-mtu=1500
>
>Then I see this:
>2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes:
>pid 78009 died, killed (Segmentation fault), core dumped, restarting
>...
>
>This is the backtrace:
>#0  0x00007f1909cf4423 in _recv_raw_pkts_vec () from /lib64/libdpdk.so
>#1  0x00000000004fd595 in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f18c37fd970, queue_id=2,
>    port_id=1 '\001') at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2510
>#2  netdev_dpdk_rxq_recv (rxq_=<optimized out>, packets=0x7f18c37fd970, c=0x7f18c37fd96c)
>    at ../lib/netdev-dpdk.c:1216
>#3  0x000000000047c3c1 in netdev_rxq_recv (rx=<optimized out>,
>buffers=buffers@entry=0x7f18c37fd970,
>    cnt=cnt@entry=0x7f18c37fd96c) at ../lib/netdev.c:654
>#4  0x000000000045e566 in dp_netdev_process_rxq_port (pmd=pmd@entry=0x1e1cd90, rxq=<optimized
>out>,
>    port=<optimized out>, port=<optimized out>) at ../lib/dpif-netdev.c:2555
>#5  0x000000000045e9f9 in pmd_thread_main (f_=0x1e1cd90) at ../lib/dpif-netdev.c:2691
>#6  0x00000000004bab64 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:340
>#7  0x00007f1909583555 in start_thread (arg=0x7f18c37fe700) at pthread_create.c:333
>#8  0x00007f1908daeb9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
>
>Dump of assembler code for function _recv_raw_pkts_vec:
>   0x00007fa8d51ccf30 <+0>:     push   %r15
>   0x00007fa8d51ccf32 <+2>:     mov    %edx,%r15d
>   0x00007fa8d51ccf35 <+5>:     push   %r14
>   0x00007fa8d51ccf37 <+7>:     push   %r13
>   0x00007fa8d51ccf39 <+9>:     push   %r12
>   0x00007fa8d51ccf3b <+11>:    push   %rbp
>   0x00007fa8d51ccf3c <+12>:    push   %rbx
>   0x00007fa8d51ccf3d <+13>:    sub    $0x38,%rsp
>=> 0x00007fa8d51ccf41 <+17>:    movzbl 0x69(%rdi),%eax
>
>rdi            0x0      0
>
>Looks like dev->data->rx_queues[queue_id] is NULL.
>
>--
>fbl

Hi Flavio,

I saw this myself yesterday, when testing out that same scenario which was pointed out to me by Daniele.

I'm currently investigating how to fix the issue, and will hopefully have a solution in the next patchset.

Cheers,
Mark
Flavio Leitner Feb. 12, 2016, 4:24 p.m. UTC | #13
On Fri, 12 Feb 2016 14:36:14 +0000
"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> wrote:

> >> >
> >> >Hi,
> >> >
> >> >Some comments below.
> >> >
> >>
> >> Hi Flavio,
> >>
> >> Thanks for your review!
> >>
> >> I've responded to your comments inline - I've also addressed the
> >> cosmetic changes that you suggested in V3 of the patch.
> >>
> >> Thanks again,
> >> Mark
> >>
> >> >
> >> >On Mon,  1 Feb 2016 10:18:54 +0000
> >> >Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:
> >> >
> >> >> Add support for Jumbo Frames to DPDK-enabled port types,
> >> >> using single-segment-mbufs.
> >> >>
> >> >> Using this approach, the amount of memory allocated for each mbuf
> >> >> to store frame data is increased to a value greater than 1518B
> >> >> (typical Ethernet maximum frame length). The increased space
> >> >> available in the mbuf means that an entire Jumbo Frame can be
> >> >> carried in a single mbuf, as opposed to partitioning it across
> >> >> multiple mbuf segments.
> >> >>
> >> >> The amount of space allocated to each mbuf to hold frame data is
> >> >> defined dynamically by the user when adding a DPDK port to a
> >> >> bridge. If an MTU value is not supplied, or the user-supplied
> >> >> value is invalid, the MTU for the port defaults to standard
> >> >> Ethernet MTU (i.e. 1500B).
> >> >>
> >> >> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> >> >> ---
> >> >>  INSTALL.DPDK.md   |  59 +++++++++-
> >> >>  NEWS              |   2 +
> >> >>  lib/netdev-dpdk.c | 347
> >> >> +++++++++++++++++++++++++++++++++++++++++------------- 3 files
> >> >> changed, 328 insertions(+), 80 deletions(-)
> >[...]
> >
> >> >> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
> >> >> segments are +increased, such that a full Jumbo Frame may be
> >> >> accommodated inside a single +mbuf segment. Once set, the MTU for a
> >> >> DPDK port is immutable. +
> >> >
> >> >I think we need some protection here because if someone tries to
> >> >change that, it will segfault PMD threads.
> >> >
> >>
> >> Could you elaborate on this a bit more; do you mean that PMD threads
> >> would segfault if a user attempts to change the MTU of a running DPDK
> >> port? If so, how would that be possible (AFAIA, there's no way to
> >> programmatically modify the MTU of OVS ports, outside of standard
> >> Linux utilities such as ifconfig, which DPDK ports typically aren't
> >> subject to; there's also AFAIA no way to programmatically modify MTU
> >> of DPDK ports outside of modifying the source code).
> >
> >I have a ovs bridge with two dpdk ports created with these commands:
> >
> ># ovs-vsctl add-port ovsbr0 dpdk0 \
> >        -- set interface dpdk0 type=dpdk options:dpdk-mtu=9000 ofport_request=10
> ># ovs-vsctl add-port ovsbr0 dpdk1 \
> >        -- set interface dpdk1 type=dpdk options:dpdk-mtu=9000 ofport_request=11
> >
> >Now I am pushing traffic (jumbo frames or small frames) and then
> >I change dpdk-mtu to 1500.
> >
> ># ovs-vsctl set interface dpdk1 options:dpdk-mtu=1500
> >
> >Then I see this:
> >2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes:
> >pid 78009 died, killed (Segmentation fault), core dumped, restarting
> >...
> >
> >This is the backtrace:
> >#0  0x00007f1909cf4423 in _recv_raw_pkts_vec () from /lib64/libdpdk.so
> >#1  0x00000000004fd595 in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f18c37fd970, queue_id=2,
> >    port_id=1 '\001') at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2510
> >#2  netdev_dpdk_rxq_recv (rxq_=<optimized out>, packets=0x7f18c37fd970, c=0x7f18c37fd96c)
> >    at ../lib/netdev-dpdk.c:1216
> >#3  0x000000000047c3c1 in netdev_rxq_recv (rx=<optimized out>,
> >buffers=buffers@entry=0x7f18c37fd970,
> >    cnt=cnt@entry=0x7f18c37fd96c) at ../lib/netdev.c:654
> >#4  0x000000000045e566 in dp_netdev_process_rxq_port (pmd=pmd@entry=0x1e1cd90, rxq=<optimized
> >out>,
> >    port=<optimized out>, port=<optimized out>) at ../lib/dpif-netdev.c:2555
> >#5  0x000000000045e9f9 in pmd_thread_main (f_=0x1e1cd90) at ../lib/dpif-netdev.c:2691
> >#6  0x00000000004bab64 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:340
> >#7  0x00007f1909583555 in start_thread (arg=0x7f18c37fe700) at pthread_create.c:333
> >#8  0x00007f1908daeb9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> >
> >
> >Dump of assembler code for function _recv_raw_pkts_vec:
> >   0x00007fa8d51ccf30 <+0>:     push   %r15
> >   0x00007fa8d51ccf32 <+2>:     mov    %edx,%r15d
> >   0x00007fa8d51ccf35 <+5>:     push   %r14
> >   0x00007fa8d51ccf37 <+7>:     push   %r13
> >   0x00007fa8d51ccf39 <+9>:     push   %r12
> >   0x00007fa8d51ccf3b <+11>:    push   %rbp
> >   0x00007fa8d51ccf3c <+12>:    push   %rbx
> >   0x00007fa8d51ccf3d <+13>:    sub    $0x38,%rsp
> >=> 0x00007fa8d51ccf41 <+17>:    movzbl 0x69(%rdi),%eax
> >
> >rdi            0x0      0
> >
> >Looks like dev->data->rx_queues[queue_id] is NULL.
> >
> >--
> >fbl
> 
> Hi Flavio,
> 
> I saw this myself yesterday, when testing out that same scenario which was pointed out to me by Daniele.
> 
> I'm currently investigating how to fix the issue, and will hopefully have a solution in the next patchset.

OK, no worries.  You asked to me to elaborate on the segfault issue and
it took some time to get my testbed ready to test it again.
Mark Kavanagh Feb. 12, 2016, 4:27 p.m. UTC | #14
>
>> >> >
>> >> >Hi,
>> >> >
>> >> >Some comments below.
>> >> >
>> >>
>> >> Hi Flavio,
>> >>
>> >> Thanks for your review!
>> >>
>> >> I've responded to your comments inline - I've also addressed the
>> >> cosmetic changes that you suggested in V3 of the patch.
>> >>
>> >> Thanks again,
>> >> Mark
>> >>
>> >> >
>> >> >On Mon,  1 Feb 2016 10:18:54 +0000
>> >> >Mark Kavanagh <mark.b.kavanagh@intel.com> wrote:
>> >> >
>> >> >> Add support for Jumbo Frames to DPDK-enabled port types,
>> >> >> using single-segment-mbufs.
>> >> >>
>> >> >> Using this approach, the amount of memory allocated for each mbuf
>> >> >> to store frame data is increased to a value greater than 1518B
>> >> >> (typical Ethernet maximum frame length). The increased space
>> >> >> available in the mbuf means that an entire Jumbo Frame can be
>> >> >> carried in a single mbuf, as opposed to partitioning it across
>> >> >> multiple mbuf segments.
>> >> >>
>> >> >> The amount of space allocated to each mbuf to hold frame data is
>> >> >> defined dynamically by the user when adding a DPDK port to a
>> >> >> bridge. If an MTU value is not supplied, or the user-supplied
>> >> >> value is invalid, the MTU for the port defaults to standard
>> >> >> Ethernet MTU (i.e. 1500B).
>> >> >>
>> >> >> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> >> >> ---
>> >> >>  INSTALL.DPDK.md   |  59 +++++++++-
>> >> >>  NEWS              |   2 +
>> >> >>  lib/netdev-dpdk.c | 347
>> >> >> +++++++++++++++++++++++++++++++++++++++++------------- 3 files
>> >> >> changed, 328 insertions(+), 80 deletions(-)
>> >[...]
>> >
>> >> >> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>> >> >> segments are +increased, such that a full Jumbo Frame may be
>> >> >> accommodated inside a single +mbuf segment. Once set, the MTU for a
>> >> >> DPDK port is immutable. +
>> >> >
>> >> >I think we need some protection here because if someone tries to
>> >> >change that, it will segfault PMD threads.
>> >> >
>> >>
>> >> Could you elaborate on this a bit more; do you mean that PMD threads
>> >> would segfault if a user attempts to change the MTU of a running DPDK
>> >> port? If so, how would that be possible (AFAIA, there's no way to
>> >> programmatically modify the MTU of OVS ports, outside of standard
>> >> Linux utilities such as ifconfig, which DPDK ports typically aren't
>> >> subject to; there's also AFAIA no way to programmatically modify MTU
>> >> of DPDK ports outside of modifying the source code).
>> >
>> >I have a ovs bridge with two dpdk ports created with these commands:
>> >
>> ># ovs-vsctl add-port ovsbr0 dpdk0 \
>> >        -- set interface dpdk0 type=dpdk options:dpdk-mtu=9000 ofport_request=10
>> ># ovs-vsctl add-port ovsbr0 dpdk1 \
>> >        -- set interface dpdk1 type=dpdk options:dpdk-mtu=9000 ofport_request=11
>> >
>> >Now I am pushing traffic (jumbo frames or small frames) and then
>> >I change dpdk-mtu to 1500.
>> >
>> ># ovs-vsctl set interface dpdk1 options:dpdk-mtu=1500
>> >
>> >Then I see this:
>> >2016-02-12T13:19:20.142Z|00003|daemon_unix(monitor)|ERR|1 crashes:
>> >pid 78009 died, killed (Segmentation fault), core dumped, restarting
>> >...
>> >
>> >This is the backtrace:
>> >#0  0x00007f1909cf4423 in _recv_raw_pkts_vec () from /lib64/libdpdk.so
>> >#1  0x00000000004fd595 in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f18c37fd970,
>queue_id=2,
>> >    port_id=1 '\001') at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2510
>> >#2  netdev_dpdk_rxq_recv (rxq_=<optimized out>, packets=0x7f18c37fd970, c=0x7f18c37fd96c)
>> >    at ../lib/netdev-dpdk.c:1216
>> >#3  0x000000000047c3c1 in netdev_rxq_recv (rx=<optimized out>,
>> >buffers=buffers@entry=0x7f18c37fd970,
>> >    cnt=cnt@entry=0x7f18c37fd96c) at ../lib/netdev.c:654
>> >#4  0x000000000045e566 in dp_netdev_process_rxq_port (pmd=pmd@entry=0x1e1cd90,
>rxq=<optimized
>> >out>,
>> >    port=<optimized out>, port=<optimized out>) at ../lib/dpif-netdev.c:2555
>> >#5  0x000000000045e9f9 in pmd_thread_main (f_=0x1e1cd90) at ../lib/dpif-netdev.c:2691
>> >#6  0x00000000004bab64 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-
>thread.c:340
>> >#7  0x00007f1909583555 in start_thread (arg=0x7f18c37fe700) at pthread_create.c:333
>> >#8  0x00007f1908daeb9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>> >
>> >
>> >Dump of assembler code for function _recv_raw_pkts_vec:
>> >   0x00007fa8d51ccf30 <+0>:     push   %r15
>> >   0x00007fa8d51ccf32 <+2>:     mov    %edx,%r15d
>> >   0x00007fa8d51ccf35 <+5>:     push   %r14
>> >   0x00007fa8d51ccf37 <+7>:     push   %r13
>> >   0x00007fa8d51ccf39 <+9>:     push   %r12
>> >   0x00007fa8d51ccf3b <+11>:    push   %rbp
>> >   0x00007fa8d51ccf3c <+12>:    push   %rbx
>> >   0x00007fa8d51ccf3d <+13>:    sub    $0x38,%rsp
>> >=> 0x00007fa8d51ccf41 <+17>:    movzbl 0x69(%rdi),%eax
>> >
>> >rdi            0x0      0
>> >
>> >Looks like dev->data->rx_queues[queue_id] is NULL.
>> >
>> >--
>> >fbl
>>
>> Hi Flavio,
>>
>> I saw this myself yesterday, when testing out that same scenario which was pointed out to
>me by Daniele.
>>
>> I'm currently investigating how to fix the issue, and will hopefully have a solution in the
>next patchset.
>
>OK, no worries.  You asked to me to elaborate on the segfault issue and
>it took some time to get my testbed ready to test it again.

Thanks Flavio - I appreciate the effort!

>
>--
>fbl
diff mbox

Patch

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 96b686c..64ccd15 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -859,10 +859,61 @@  by adding the following string:
 to <interface> sections of all network devices used by DPDK. Parameter 'N'
 determines how many queues can be used by the guest.
 
+
+Jumbo Frames
+------------
+
+Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
+
+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the ovs-vsctl
+'add-port' command-line, along with the required MTU for the port.
+e.g.
+
+     ```
+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000
+     ```
+
+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
+increased, such that a full Jumbo Frame may be accommodated inside a single
+mbuf segment. Once set, the MTU for a DPDK port is immutable.
+
+Jumbo frame support has been validated against 13312B frames, using the
+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
+theoretically be supported. Supported port types excludes vHost-Cuse ports, as
+this feature is pending deprecation.
+
+
+vHost Ports and Jumbo Frames
+----------------------------
+Jumbo frame support is available for DPDK vHost-User ports only. Some additional
+configuration is needed to take advantage of this feature:
+
+  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
+      the QEMU command line snippet below:
+
+      ```
+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
+      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
+      ```
+
+  2. Where virtio devices are bound to the Linux kernel driver in a guest
+     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
+     MTU of those logical network interfaces must also be increased. This
+     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
+     to the length of the IP packet only, and not that of the entire frame.
+
+     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2
+     header and CRC lengths (i.e. 18B) from the max supported frame size.
+     So, to set the MTU for a 13312B Jumbo Frame:
+
+      ```
+      ifconfig eth1 mtu 13294
+      ```
+
+
 Restrictions:
 -------------
 
-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
   - Currently DPDK port does not make use any offload functionality.
   - DPDK-vHost support works with 1G huge pages.
 
@@ -903,6 +954,12 @@  Restrictions:
     the next release of DPDK (which includes the above patch) is available and
     integrated into OVS.
 
+  Jumbo Frames:
+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
+     this issue is currently being investigated.
+  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
+
+
 Bug Reporting:
 --------------
 
diff --git a/NEWS b/NEWS
index 5c18867..cd563e0 100644
--- a/NEWS
+++ b/NEWS
@@ -46,6 +46,8 @@  v2.5.0 - xx xxx xxxx
      abstractions, such as virtual L2 and L3 overlays and security groups.
    - RHEL packaging:
      * DPDK ports may now be created via network scripts (see README.RHEL).
+   - netdev-dpdk:
+     * Add Jumbo Frame Support
 
 
 v2.4.0 - 20 Aug 2015
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index de7e488..76a5dcc 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -62,20 +62,25 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"
 
+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
+
 /*
  * need to reserve tons of extra space in the mbufs so we can align the
  * DMA addresses to 4KB.
  * The minimum mbuf size is limited to avoid scatter behaviour and drop in
  * performance for standard Ethernet MTU.
  */
-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
-                              + sizeof(struct dp_packet) \
-                              + RTE_PKTMBUF_HEADROOM)
-#define MBUF_SIZE_DRIVER     (2048                       \
-                              + sizeof (struct rte_mbuf) \
-                              + RTE_PKTMBUF_HEADROOM)
-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)
+#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
+                                    + sizeof(struct dp_packet)   \
+                                    + RTE_PKTMBUF_HEADROOM)
+
+/* This value should be specified as a multiple of the DPDK NIC driver's
+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
+ */
+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
 
 /* Max and min number of packets in the mempool.  OVS tries to allocate a
  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
@@ -86,6 +91,8 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define MIN_NB_MBUF          (4096 * 4)
 #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
 
+#define DPDK_VLAN_TAG_LEN    4
+
 /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
 BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
 
@@ -114,7 +121,6 @@  static const struct rte_eth_conf port_conf = {
         .header_split   = 0, /* Header Split disabled */
         .hw_ip_checksum = 0, /* IP checksum offload disabled */
         .hw_vlan_filter = 0, /* VLAN filtering disabled */
-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
         .hw_strip_crc   = 0,
     },
     .rx_adv_conf = {
@@ -254,6 +260,41 @@  is_dpdk_class(const struct netdev_class *class)
     return class->construct == netdev_dpdk_construct;
 }
 
+/* DPDK NIC drivers allocate RX buffers at a particular granularity
+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
+ * If 'frame_len' is not a multiple of this value, insufficient buffers are
+ * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio
+ * driver needs to ensure that there is also sufficient space in the Rx buffer
+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
+ * small, then the driver enables scatter RX behaviour, which reduces
+ * performance. To prevent this, use a buffer size that is closest to
+ * 'frame_len', but which satisfies the aforementioned criteria.
+ */
+static uint32_t
+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
+{
+    struct rte_eth_dev_info info;
+    uint32_t buf_size;
+    /* XXX: This is a workaround for DPDK v2.2, and needs to be refactored with a
+     * future DPDK release. */
+    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
+
+    if(netdev->type == DPDK_DEV_ETH) {
+        rte_eth_dev_info_get(netdev->port_id, &info);
+        buf_size = (info.min_rx_bufsize == 0) ?
+                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
+                   info.min_rx_bufsize;
+    } else {
+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
+    }
+
+    if(len % buf_size) {
+        len = buf_size * ((len/buf_size) + 1);
+    }
+
+    return len;
+}
+
 /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
  * for all other segments data, bss and text. */
 
@@ -280,26 +321,65 @@  free_dpdk_buf(struct dp_packet *p)
 }
 
 static void
-__rte_pktmbuf_init(struct rte_mempool *mp,
-                   void *opaque_arg OVS_UNUSED,
-                   void *_m,
-                   unsigned i OVS_UNUSED)
+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
 {
-    struct rte_mbuf *m = _m;
-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
+    struct rte_pktmbuf_pool_private default_mbp_priv;
+    uint16_t roomsz;
 
     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
 
+    /* if no structure is provided, assume no mbuf private area */
+
+    user_mbp_priv = opaque_arg;
+    if (user_mbp_priv == NULL) {
+        default_mbp_priv.mbuf_priv_size = 0;
+        if (mp->elt_size > sizeof(struct dp_packet)) {
+            roomsz = mp->elt_size - sizeof(struct dp_packet);
+        } else {
+            roomsz = 0;
+        }
+        default_mbp_priv.mbuf_data_room_size = roomsz;
+        user_mbp_priv = &default_mbp_priv;
+    }
+
+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
+        user_mbp_priv->mbuf_data_room_size +
+        user_mbp_priv->mbuf_priv_size);
+
+    mbp_priv = rte_mempool_get_priv(mp);
+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
+}
+
+/* Initialise some fields in the mbuf structure that are not modified by the
+ * user once created (origin pool, buffer start address, etc.*/
+static void
+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
+                       void *opaque_arg OVS_UNUSED,
+                       void *_m,
+                       unsigned i OVS_UNUSED)
+{
+    struct rte_mbuf *m = _m;
+    uint32_t buf_size, buf_len, priv_size;
+
+    priv_size = rte_pktmbuf_priv_size(mp);
+    buf_size = sizeof(struct dp_packet) + priv_size;
+    buf_len = rte_pktmbuf_data_room_size(mp);
+
+    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
+    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
+    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
+
     memset(m, 0, mp->elt_size);
 
-    /* start of buffer is just after mbuf structure */
-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-                    sizeof(struct dp_packet);
+    /* start of buffer is after dp_packet structure and priv data */
+    m->priv_size = priv_size;
+    m->buf_addr = (char *)m + buf_size;
+    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
     m->buf_len = (uint16_t)buf_len;
 
     /* keep some headroom between start of buffer and data */
-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
 
     /* init some constant fields */
     m->pool = mp;
@@ -315,7 +395,7 @@  ovs_rte_pktmbuf_init(struct rte_mempool *mp,
 {
     struct rte_mbuf *m = _m;
 
-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
 
     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
 }
@@ -326,6 +406,7 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     struct dpdk_mp *dmp = NULL;
     char mp_name[RTE_MEMPOOL_NAMESIZE];
     unsigned mp_size;
+    struct rte_pktmbuf_pool_private mbp_priv;
 
     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
@@ -338,6 +419,8 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     dmp->socket_id = socket_id;
     dmp->mtu = mtu;
     dmp->refcount = 1;
+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
+    mbp_priv.mbuf_priv_size = 0;
 
     mp_size = MAX_NB_MBUF;
     do {
@@ -346,10 +429,10 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
             return NULL;
         }
 
-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
+        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
                                      MP_CACHE_SZ,
                                      sizeof(struct rte_pktmbuf_pool_private),
-                                     rte_pktmbuf_pool_init, NULL,
+                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
                                      ovs_rte_pktmbuf_init, NULL,
                                      socket_id, 0);
     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
@@ -433,6 +516,7 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 {
     int diag = 0;
     int i;
+    struct rte_eth_conf conf = port_conf;
 
     /* A device may report more queues than it makes available (this has
      * been observed for Intel xl710, which reserves some of them for
@@ -444,7 +528,12 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
         }
 
-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
+        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
+            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
+            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
+        }
+
+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
         if (diag) {
             break;
         }
@@ -586,6 +675,7 @@  netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
     int sid;
     int err = 0;
+    uint32_t buf_size;
 
     ovs_mutex_init(&netdev->mutex);
     ovs_mutex_lock(&netdev->mutex);
@@ -605,10 +695,16 @@  netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
     netdev->port_id = port_no;
     netdev->type = type;
     netdev->flags = 0;
+
+    /* Initialize port's MTU and frame len to the default Ethernet values.
+     * Larger, user-specified (jumbo) frame buffers are accommodated in
+     * netdev_dpdk_set_config.
+     */
+    netdev->max_packet_len = ETHER_MAX_LEN;
     netdev->mtu = ETHER_MTU;
-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
 
-    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
+    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));
     if (!netdev->dpdk_mp) {
         err = ENOMEM;
         goto unlock;
@@ -651,6 +747,27 @@  dpdk_dev_parse_name(const char dev_name[], const char prefix[],
     return 0;
 }
 
+static void
+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
+{
+    const char *mtu_str = smap_get(args, "dpdk-mtu");
+    char *end_ptr = NULL;
+    int local_mtu;
+
+    if(mtu_str) {
+        local_mtu = strtoul(mtu_str, &end_ptr, 0);
+    }
+    if(!mtu_str || local_mtu < ETHER_MTU ||
+                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
+                   *end_ptr != '\0') {
+        local_mtu = ETHER_MTU;
+        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",
+                   local_mtu);
+    }
+
+    *mtu = local_mtu;
+}
+
 static int
 vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
 {
@@ -777,11 +894,77 @@  netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args)
     smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
     smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
     smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
+    smap_add_format(args, "mtu", "%d", dev->mtu);
     ovs_mutex_unlock(&dev->mutex);
 
     return 0;
 }
 
+/* Set the mtu of DPDK_DEV_ETH ports */
+static int
+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int old_mtu, err;
+    uint32_t buf_size;
+    int dpdk_mtu;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
+    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
+
+    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    old_mtu = dev->mtu;
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+
+    err = dpdk_eth_dev_init(dev);
+    if (err) {
+        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "
+                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
+        dpdk_mp_put(mp);
+        dev->mtu = old_mtu;
+        dev->dpdk_mp = old_mp;
+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+        dpdk_eth_dev_init(dev);
+        goto out;
+    } else {
+        dpdk_mp_put(old_mp);
+        netdev_change_seq_changed(netdev);
+    }
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static int
+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
+{
+    int mtu;
+
+    dpdk_dev_parse_mtu(args, &mtu);
+
+    return netdev_dpdk_set_mtu(netdev_, mtu);
+}
+
 static int
 netdev_dpdk_get_numa_id(const struct netdev *netdev_)
 {
@@ -1358,54 +1541,6 @@  netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
 
     return 0;
 }
-
-static int
-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
-{
-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    int old_mtu, err;
-    struct dpdk_mp *old_mp;
-    struct dpdk_mp *mp;
-
-    ovs_mutex_lock(&dpdk_mutex);
-    ovs_mutex_lock(&dev->mutex);
-    if (dev->mtu == mtu) {
-        err = 0;
-        goto out;
-    }
-
-    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
-    if (!mp) {
-        err = ENOMEM;
-        goto out;
-    }
-
-    rte_eth_dev_stop(dev->port_id);
-
-    old_mtu = dev->mtu;
-    old_mp = dev->dpdk_mp;
-    dev->dpdk_mp = mp;
-    dev->mtu = mtu;
-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
-
-    err = dpdk_eth_dev_init(dev);
-    if (err) {
-        dpdk_mp_put(mp);
-        dev->mtu = old_mtu;
-        dev->dpdk_mp = old_mp;
-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
-        dpdk_eth_dev_init(dev);
-        goto out;
-    }
-
-    dpdk_mp_put(old_mp);
-    netdev_change_seq_changed(netdev);
-out:
-    ovs_mutex_unlock(&dev->mutex);
-    ovs_mutex_unlock(&dpdk_mutex);
-    return err;
-}
-
 static int
 netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
 
@@ -1682,7 +1817,7 @@  netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *args)
     smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
     smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
     smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
-    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
+    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
     smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
     smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
     smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
@@ -1904,6 +2039,51 @@  dpdk_vhost_user_class_init(void)
     return 0;
 }
 
+/* Set the mtu of DPDK_DEV_VHOST ports */
+static int
+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int err = 0;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    mp = dpdk_mp_get(dev->socket_id, mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+
+    dpdk_mp_put(old_mp);
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static int
+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)
+{
+    int mtu;
+
+    dpdk_dev_parse_mtu(args, &mtu);
+
+    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
+}
+
 static void
 dpdk_common_init(void)
 {
@@ -2040,8 +2220,9 @@  unlock_dpdk:
     return err;
 }
 
-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
-    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
+        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
+        GET_STATUS, RXQ_RECV)                                          \
 {                                                             \
     NAME,                                                     \
     INIT,                       /* init */                    \
@@ -2053,7 +2234,7 @@  unlock_dpdk:
     DESTRUCT,                                                 \
     netdev_dpdk_dealloc,                                      \
     netdev_dpdk_get_config,                                   \
-    NULL,                       /* netdev_dpdk_set_config */  \
+    SET_CONFIG,                                               \
     NULL,                       /* get_tunnel_config */       \
     NULL,                       /* build header */            \
     NULL,                       /* push header */             \
@@ -2067,7 +2248,7 @@  unlock_dpdk:
     netdev_dpdk_set_etheraddr,                                \
     netdev_dpdk_get_etheraddr,                                \
     netdev_dpdk_get_mtu,                                      \
-    netdev_dpdk_set_mtu,                                      \
+    SET_MTU,                                                  \
     netdev_dpdk_get_ifindex,                                  \
     GET_CARRIER,                                              \
     netdev_dpdk_get_carrier_resets,                           \
@@ -2213,8 +2394,10 @@  static const struct netdev_class dpdk_class =
         NULL,
         netdev_dpdk_construct,
         netdev_dpdk_destruct,
+        netdev_dpdk_set_config,
         netdev_dpdk_set_multiq,
         netdev_dpdk_eth_send,
+        netdev_dpdk_set_mtu,
         netdev_dpdk_get_carrier,
         netdev_dpdk_get_stats,
         netdev_dpdk_get_features,
@@ -2227,8 +2410,10 @@  static const struct netdev_class dpdk_ring_class =
         NULL,
         netdev_dpdk_ring_construct,
         netdev_dpdk_destruct,
+        netdev_dpdk_set_config,
         netdev_dpdk_set_multiq,
         netdev_dpdk_ring_send,
+        netdev_dpdk_set_mtu,
         netdev_dpdk_get_carrier,
         netdev_dpdk_get_stats,
         netdev_dpdk_get_features,
@@ -2241,8 +2426,10 @@  static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =
         dpdk_vhost_cuse_class_init,
         netdev_dpdk_vhost_cuse_construct,
         netdev_dpdk_vhost_destruct,
+        NULL,
         netdev_dpdk_vhost_set_multiq,
         netdev_dpdk_vhost_send,
+        NULL,
         netdev_dpdk_vhost_get_carrier,
         netdev_dpdk_vhost_get_stats,
         NULL,
@@ -2255,8 +2442,10 @@  static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =
         dpdk_vhost_user_class_init,
         netdev_dpdk_vhost_user_construct,
         netdev_dpdk_vhost_destruct,
+        netdev_dpdk_vhost_set_config,
         netdev_dpdk_vhost_set_multiq,
         netdev_dpdk_vhost_send,
+        netdev_dpdk_vhost_set_mtu,
         netdev_dpdk_vhost_get_carrier,
         netdev_dpdk_vhost_get_stats,
         NULL,