
[ovs-dev,7/7] netdev-dpdk: add support for Jumbo Frames

Message ID 1469841772-119013-7-git-send-email-diproiettod@vmware.com
State Superseded

Commit Message

Daniele Di Proietto July 30, 2016, 1:22 a.m. UTC
From: Mark Kavanagh <mark.b.kavanagh@intel.com>

Add support for Jumbo Frames to DPDK-enabled port types,
using single-segment-mbufs.

Using this approach, the amount of memory allocated to each mbuf
to store frame data is increased to a value greater than 1518B
(typical Ethernet maximum frame length). The increased space
available in the mbuf means that an entire Jumbo Frame of a specific
size can be carried in a single mbuf, as opposed to partitioning
it across multiple mbuf segments.

The amount of space allocated to each mbuf to hold frame data is
defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
parameter.

Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
[diproiettod@vmware.com rebased]
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
---
 INSTALL.DPDK-ADVANCED.md |  59 +++++++++++++++++-
 INSTALL.DPDK.md          |   1 -
 NEWS                     |   1 +
 lib/netdev-dpdk.c        | 151 +++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 185 insertions(+), 27 deletions(-)

Comments

Mark Kavanagh Aug. 3, 2016, 12:14 p.m. UTC | #1
>
> Hi Daniele. Thanks for posting this.

Hi Ilya,

I actually implemented this patch as part of Daniele's MTU patchset, based on my earlier patch - Daniele mainly rebased it to head of master :)

Thanks for your feedback - I've responded inline.

Cheers,
Mark

> I have almost same patch in my local branch.
>
> I didn't test this with physical DPDK NICs yet, but I have few
> high level comments:
>
> 1. Do you thought about renaming of 'mtu_request' inside netdev-dpdk
>    to 'requested_mtu'? I think, this would be more clear and
>    consistent with other configurable parameters (n_rxq, n_txq, ...).

'mtu_request' was the name suggested by Daniele, following a discussion with colleagues.
I don't have strong feelings either way, so I'll leave Daniele to comment.

> 2. I'd prefer not to fail reconfiguration if there is no enough memory
>    for new mempool. I think, it'll be common situation when we are
>    requesting more memory than we have. Failure leads to destruction
>    of the port and inability to reconnect to vhost-user port after
>    re-creation if vhost is in server mode. We can just keep old
>    mempool and inform user via VLOG_ERR.
>

Agreed - I'll modify V2 accordingly.

> 3. Minor issues inline.

Comments on these inline also.

>
> What do you think?
>
> Best regards, Ilya Maximets.
>
> On 30.07.2016 04:22, Daniele Di Proietto wrote:
>> From: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>
>> Add support for Jumbo Frames to DPDK-enabled port types,
>> using single-segment-mbufs.
>>
>> Using this approach, the amount of memory allocated to each mbuf
>> to store frame data is increased to a value greater than 1518B
>> (typical Ethernet maximum frame length). The increased space
>> available in the mbuf means that an entire Jumbo Frame of a specific
>> size can be carried in a single mbuf, as opposed to partitioning
>> it across multiple mbuf segments.
>>
>> The amount of space allocated to each mbuf to hold frame data is
>> defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
>> parameter.
>>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> [diproiettod@vmware.com rebased]
>> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
>> ---
>>  INSTALL.DPDK-ADVANCED.md |  59 +++++++++++++++++-
>>  INSTALL.DPDK.md          |   1 -
>>  NEWS                     |   1 +
>>  lib/netdev-dpdk.c        | 151 +++++++++++++++++++++++++++++++++++++++--------
>>  4 files changed, 185 insertions(+), 27 deletions(-)
>>
>> diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md
>> index 191e69e..5cd64bf 100755
>> --- a/INSTALL.DPDK-ADVANCED.md
>> +++ b/INSTALL.DPDK-ADVANCED.md
>> @@ -1,5 +1,5 @@
>>  OVS DPDK ADVANCED INSTALL GUIDE
>> -=================================
>> +===============================
>>
>>  ## Contents
>>
>> @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE
>>  7. [QOS](#qos)
>>  8. [Rate Limiting](#rl)
>>  9. [Flow Control](#fc)
>> -10. [Vsperf](#vsperf)
>> +10. [Jumbo Frames](#jumbo)
>> +11. [Vsperf](#vsperf)
>>
>>  ## <a name="overview"></a> 1. Overview
>>
>> @@ -862,7 +863,59 @@ respective parameter. To disable the flow control at tx side,
>>
>>  `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`
>>
>> -## <a name="vsperf"></a> 10. Vsperf
>> +## <a name="jumbo"></a> 10. Jumbo Frames
>> +
>> +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
>> +enable Jumbo Frames support for a DPDK port, change the Interface's `mtu_request`
>> +attribute to a sufficiently large value.
>> +
>> +e.g. Add a DPDK Phy port with MTU of 9000:
>> +
>> +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000`
>> +
>> +e.g. Change the MTU of an existing port to 6200:
>> +
>> +`ovs-vsctl set Interface dpdk0 mtu_request=6200`
>> +
>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>> +increased, such that a full Jumbo Frame of a specific size may be accommodated
>> +within a single mbuf segment.
>> +
>> +Jumbo frame support has been validated against 9728B frames (largest frame size
>> +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames
>> +(particularly in use cases involving East-West traffic only), and other DPDK NIC
>> +drivers may be supported.
>> +
>> +### 9.1 vHost Ports and Jumbo Frames
>> +
>> +Some additional configuration is needed to take advantage of jumbo frames with
>> +vhost ports:
>> +
>> +    1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
>> +        the QEMU command line snippet below:
>> +
>> +        ```
>> +        '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>> +        '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>> +        ```
>> +
>> +    2. Where virtio devices are bound to the Linux kernel driver in a guest
>> +       environment (i.e. interfaces are not bound to an in-guest DPDK driver),
>> +       the MTU of those logical network interfaces must also be increased to a
>> +       sufficiently large value. This avoids segmentation of Jumbo Frames
>> +       received in the guest. Note that 'MTU' refers to the length of the IP
>> +       packet only, and not that of the entire frame.
>> +
>> +       To calculate the exact MTU of a standard IPv4 frame, subtract the L2
>> +       header and CRC lengths (i.e. 18B) from the max supported frame size.
>> +       So, to set the MTU for a 9018B Jumbo Frame:
>> +
>> +       ```
>> +       ifconfig eth1 mtu 9000
>> +       ```
>> +>>>>>>> 5ec921d... netdev-dpdk: add support for Jumbo Frames
>
> Looks like rebasing artefact.

Yup - I'll remove in V2.

>
>> +
>> +## <a name="vsperf"></a> 11. Vsperf
>>
>>  Vsperf project goal is to develop vSwitch test framework that can be used to
>>  validate the suitability of different vSwitch implementations in a Telco deployment
>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> index 7609aa7..25c79de 100644
>> --- a/INSTALL.DPDK.md
>> +++ b/INSTALL.DPDK.md
>> @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough].
>>
>>  ## <a name="ovslimits"></a> 6. Limitations
>>
>> -  - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future OVS release.
>>    - Currently DPDK ports does not use HW offload functionality.
>>    - Network Interface Firmware requirements:
>>      Each release of DPDK is validated against a specific firmware version for
>> diff --git a/NEWS b/NEWS
>> index 0ff5616..c004e5f 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -68,6 +68,7 @@ Post-v2.5.0
>>         is enabled in DPDK.
>>       * Basic connection tracking for the userspace datapath (no ALG,
>>         fragmentation or NAT support yet)
>> +     * Jumbo frame support
>>     - Increase number of registers to 16.
>>     - ovs-benchmark: This utility has been removed due to lack of use and
>>       bitrot.
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 0b6e410..68639ae 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -82,6 +82,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>                                      + sizeof(struct dp_packet)    \
>>                                      + RTE_PKTMBUF_HEADROOM)
>>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>> +#define NETDEV_DPDK_MAX_PKT_LEN     9728
>>
>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>> @@ -336,6 +337,7 @@ struct netdev_dpdk {
>>      struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>>
>>      struct dpdk_mp *dpdk_mp;
>> +    int mtu_request;
>>      int mtu;
>>      int socket_id;
>>      int buf_size;
>> @@ -474,10 +476,19 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>>      mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet);
>> -    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) -
>> -                              sizeof (struct rte_mbuf);
>> +    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet)
>> +                              - sizeof (struct rte_mbuf);
>> +    /* XXX: this is a really rough method of provisioning memory.
>> +     * It's impossible to determine what the exact memory requirements are when
>> +     * the number of ports and rxqs that utilize a particular mempool can change
>> +     * dynamically at runtime. For the moment, use this rough heurisitic.
>> +     */
>> +    if (mtu >= ETHER_MTU) {
>> +        mp_size = MAX_NB_MBUF;
>> +    } else {
>> +        mp_size = MIN_NB_MBUF;
>> +    }
>>
>> -    mp_size = MAX_NB_MBUF;
>>      do {
>>          if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>>                       dmp->mtu, dmp->socket_id, mp_size) < 0) {
>> @@ -522,6 +533,32 @@ dpdk_mp_put(struct dpdk_mp *dmp)
>>  #endif
>>  }
>>
>> +static int
>> +dpdk_mp_configure(struct netdev_dpdk *dev)
>> +    OVS_REQUIRES(dpdk_mutex)
>> +    OVS_REQUIRES(dev->mutex)
>> +{
>> +    uint32_t buf_size = dpdk_buf_size(dev->mtu);
>> +    struct dpdk_mp *mp_old, *mp;
>> +
>> +    mp_old = dev->dpdk_mp;
>
> Do we need this variable? We can just put old mempool before
> assigning the new one.

Agreed - will remove in V2.

>
>> +
>> +    mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>> +    if (!mp) {
>> +        VLOG_ERR("Insufficient memory to create memory pool for netdev %s\n",
>> +                dev->up.name);
>
> +1 space character.

Sure - will fix.

>
>> +        return ENOMEM;
>> +    }
>> +
>> +    dev->dpdk_mp = mp;
>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +
>> +    dpdk_mp_put(mp_old);
>> +
>> +    return 0;
>> +}
>> +
>> +
>>  static void
>>  check_link_status(struct netdev_dpdk *dev)
>>  {
>> @@ -573,7 +610,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>  {
>>      int diag = 0;
>>      int i;
>> +    struct rte_eth_conf conf = port_conf;
>>
>> +    if (dev->mtu > ETHER_MTU) {
>> +        conf.rxmode.jumbo_frame = 1;
>> +        conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
>> +    } else {
>> +        conf.rxmode.jumbo_frame = 0;
>> +        conf.rxmode.max_rx_pkt_len = 0;
>
> I know, it was implemented this way in the original patch, but I'm
> not sure that all DPDK drivers will handle zero value of
> 'max_rx_pkt_len' in a right way.

Out of interest, can you point to any examples of DPDK drivers handling max_rx_pkt_len = 0 with unintended behaviour?
This field should only be of relevance if rxmode.jumbo_frame = 1, and 0 is its default value; if it is an issue though, I can just default instead to ETHER_MAX_LEN.

>
>> +    }
>>      /* A device may report more queues than it makes available (this has
>>       * been observed for Intel xl710, which reserves some of them for
>>       * SRIOV):  rte_eth_*_queue_setup will fail if a queue is not
>> @@ -584,8 +629,10 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>          }
>>
>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>          if (diag) {
>> +            VLOG_WARN("Interface %s eth_dev setup error %s\n",
>> +                      dev->up.name, rte_strerror(-diag));
>>              break;
>>          }
>>
>> @@ -738,7 +785,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
>>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>      int sid;
>>      int err = 0;
>> -    uint32_t buf_size;
>>
>>      ovs_mutex_init(&dev->mutex);
>>      ovs_mutex_lock(&dev->mutex);
>> @@ -759,13 +805,11 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
>>      dev->port_id = port_no;
>>      dev->type = type;
>>      dev->flags = 0;
>> -    dev->mtu = ETHER_MTU;
>> +    dev->mtu_request = dev->mtu = ETHER_MTU;
>>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>
>> -    buf_size = dpdk_buf_size(dev->mtu);
>> -    dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>> -    if (!dev->dpdk_mp) {
>> -        err = ENOMEM;
>> +    err = dpdk_mp_configure(dev);
>> +    if (err) {
>>          goto unlock;
>>      }
>>
>> @@ -986,6 +1030,7 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
>>      smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq);
>>      smap_add_format(args, "requested_tx_queues", "%d", dev->requested_n_txq);
>>      smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq);
>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>      ovs_mutex_unlock(&dev->mutex);
>>
>>      return 0;
>> @@ -1362,6 +1407,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>>      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
>>      unsigned int total_pkts = cnt;
>>      unsigned int qos_pkts = cnt;
>> +    unsigned int mtu_dropped = 0;
>>      int retries = 0;
>>
>>      qid = dev->tx_q[qid % netdev->n_txq].map;
>> @@ -1383,25 +1429,41 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>>      do {
>>          int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>>          unsigned int tx_pkts;
>> +        unsigned int try_tx_pkts = cnt;
>>
>> +        for (unsigned int i = 0; i < cnt; i++) {
>> +            if (cur_pkts[i]->pkt_len > dev->max_packet_len) {
>> +                try_tx_pkts = i;
>> +                break;
>> +            }
>> +        }
>> +        if (!try_tx_pkts) {
>> +            cur_pkts++;
>> +            mtu_dropped++;
>> +            cnt--;
>> +            continue;
>> +        }
>>          tx_pkts = rte_vhost_enqueue_burst(virtio_dev, vhost_qid,
>> -                                          cur_pkts, cnt);
>> +                                          cur_pkts, try_tx_pkts);
>>          if (OVS_LIKELY(tx_pkts)) {
>>              /* Packets have been sent.*/
>>              cnt -= tx_pkts;
>>              /* Prepare for possible retry.*/
>>              cur_pkts = &cur_pkts[tx_pkts];
>> +            if (tx_pkts != try_tx_pkts) {
>> +                retries++;
>> +            }
>>          } else {
>>              /* No packets sent - do not retry.*/
>>              break;
>>          }
>> -    } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
>> +    } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM));
>>
>>      rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>>
>>      rte_spinlock_lock(&dev->stats_lock);
>> -    cnt += qos_pkts;
>> -    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt);
>> +    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
>> +                                         cnt + mtu_dropped + qos_pkts);
>>      rte_spinlock_unlock(&dev->stats_lock);
>>
>>  out:
>> @@ -1635,6 +1697,26 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
>>  }
>>
>>  static int
>> +netdev_dpdk_set_mtu(struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +    if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN) {
>> +        VLOG_WARN("Unsupported MTU (%d)\n", mtu);
>> +        return EINVAL;
>> +    }
>> +
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->mtu_request != mtu) {
>> +        dev->mtu_request = mtu;
>> +        netdev_request_reconfigure(netdev);
>> +    }
>> +    ovs_mutex_unlock(&dev->mutex);
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>>  netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
>>
>>  static int
>> @@ -2787,7 +2869,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>      ovs_mutex_lock(&dev->mutex);
>>
>>      if (netdev->n_txq == dev->requested_n_txq
>> -        && netdev->n_rxq == dev->requested_n_rxq) {
>> +        && netdev->n_rxq == dev->requested_n_rxq
>> +        && dev->mtu == dev->mtu_request) {
>>          /* Reconfiguration is unnecessary */
>>
>>          goto out;
>> @@ -2795,6 +2878,14 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>
>>      rte_eth_dev_stop(dev->port_id);
>>
>> +    if (dev->mtu != dev->mtu_request) {
>> +        dev->mtu = dev->mtu_request;
>> +        err = dpdk_mp_configure(dev);
>> +        if (err) {
>> +            goto out;
>> +        }
>> +    }
>> +
>>      netdev->n_txq = dev->requested_n_txq;
>>      netdev->n_rxq = dev->requested_n_rxq;
>>
>> @@ -2802,6 +2893,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>      err = dpdk_eth_dev_init(dev);
>>      netdev_dpdk_alloc_txq(dev, netdev->n_txq);
>>
>> +    netdev_change_seq_changed(netdev);
>> +
>>  out:
>>
>>      ovs_mutex_unlock(&dev->mutex);
>> @@ -2830,20 +2923,23 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
>>
>>      netdev_dpdk_remap_txqs(dev);
>>
>> -    if (dev->requested_socket_id != dev->socket_id) {
>> +    if (dev->requested_socket_id != dev->socket_id
>> +        || dev->mtu_request != dev->mtu) {
>>          dev->socket_id = dev->requested_socket_id;
>> -        /* Change mempool to new NUMA Node */
>> -        dpdk_mp_put(dev->dpdk_mp);
>> -        dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> -        if (!dev->dpdk_mp) {
>> -            err = ENOMEM;
>> +        dev->mtu = dev->mtu_request;
>> +        /* Change mempool to new NUMA Node and to new MTU. */
>> +        err = dpdk_mp_configure(dev);
>> +        if (err) {
>> +            goto out;
>>          }
>> +        netdev_change_seq_changed(netdev);
>>      }
>>
>>      if (virtio_dev) {
>>          virtio_dev->flags |= VIRTIO_DEV_RUNNING;
>>      }
>>
>> +out:
>>      ovs_mutex_unlock(&dev->mutex);
>>      ovs_mutex_unlock(&dpdk_mutex);
>>
>> @@ -2854,6 +2950,7 @@ static int
>>  netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
>>  {
>>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    int err = 0;
>>
>>      ovs_mutex_lock(&dpdk_mutex);
>>      ovs_mutex_lock(&dev->mutex);
>> @@ -2861,10 +2958,18 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
>>      netdev->n_txq = dev->requested_n_txq;
>>      netdev->n_rxq = 1;
>>
>> +    if (dev->mtu_request != dev->mtu) {
>> +        /* Change mempool to new MTU. */
>> +        err = dpdk_mp_configure(dev);
>> +        if (!err) {
>> +            netdev_change_seq_changed(netdev);
>> +        }
>> +    }
>> +
>>      ovs_mutex_unlock(&dev->mutex);
>>      ovs_mutex_unlock(&dpdk_mutex);
>>
>> -    return 0;
>> +    return err;
>>  }
>>
>>  #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,    \
>> @@ -2898,7 +3003,7 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
>>      netdev_dpdk_set_etheraddr,                                \
>>      netdev_dpdk_get_etheraddr,                                \
>>      netdev_dpdk_get_mtu,                                      \
>> -    NULL,                       /* set_mtu */                 \
>> +    netdev_dpdk_set_mtu,                                      \
>>      netdev_dpdk_get_ifindex,                                  \
>>      GET_CARRIER,                                              \
>>      netdev_dpdk_get_carrier_resets,                           \
>>
Ilya Maximets Aug. 3, 2016, 12:34 p.m. UTC | #2
Hi, Mark.

On 03.08.2016 15:14, Kavanagh, Mark B wrote:
>>
>> Hi Daniele. Thanks for posting this.
> 
> Hi Ilya,
> 
> I actually implemented this patch as part of Daniele's MTU patchset, based on my earlier patch - Daniele mainly rebased it to head of master :)
> 
> Thanks for your feedback - I've responded inline.
> 
> Cheers,
> Mark
> 
>> I have almost same patch in my local branch.
>>
>> I didn't test this with physical DPDK NICs yet, but I have few
>> high level comments:
>>
>> 1. Do you thought about renaming of 'mtu_request' inside netdev-dpdk
>>   to 'requested_mtu'? I think, this would be more clear and
>>   consistent with other configurable parameters (n_rxq, n_txq, ...).
> 
> 'mtu_request' was the name suggested by Daniele, following a discussion with colleagues.
> I don't have strong feelings either way, so I'll leave Daniele to comment.

I meant only renaming of 'netdev_dpdk->mtu_request' to 'netdev_dpdk->requested_mtu'.
Database column should be 'mtu_request' as it is now.

>>
>> 2. I'd prefer not to fail reconfiguration if there is no enough memory
>>   for new mempool. I think, it'll be common situation when we are
>>   requesting more memory than we have. Failure leads to destruction
>>   of the port and inability to reconnect to vhost-user port after
>>   re-creation if vhost is in server mode. We can just keep old
>>   mempool and inform user via VLOG_ERR.
>>
> Agreed - I'll modify V2 accordingly.
> 
> 
>> 3. Minor issues inline.
> 
> Comments on these inline also.
> 
>>
>> What do you think?
>>
>> Best regards, Ilya Maximets.
>>
>> On 30.07.2016 04:22, Daniele Di Proietto wrote:
>>> From: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>
>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>> using single-segment-mbufs.
>>>
>>> Using this approach, the amount of memory allocated to each mbuf
>>> to store frame data is increased to a value greater than 1518B
>>> (typical Ethernet maximum frame length). The increased space
>>> available in the mbuf means that an entire Jumbo Frame of a specific
>>> size can be carried in a single mbuf, as opposed to partitioning
>>> it across multiple mbuf segments.
>>>
>>> The amount of space allocated to each mbuf to hold frame data is
>>> defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
>>> parameter.
>>>
>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>> [diproiettod@vmware.com rebased]
>>> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
>>> ---
>>>  INSTALL.DPDK-ADVANCED.md |  59 +++++++++++++++++-
>>>  INSTALL.DPDK.md          |   1 -
>>>  NEWS                     |   1 +
>>>  lib/netdev-dpdk.c        | 151 +++++++++++++++++++++++++++++++++++++++--------
>>>  4 files changed, 185 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md
>>> index 191e69e..5cd64bf 100755
>>> --- a/INSTALL.DPDK-ADVANCED.md
>>> +++ b/INSTALL.DPDK-ADVANCED.md
>>> @@ -1,5 +1,5 @@
>>>  OVS DPDK ADVANCED INSTALL GUIDE
>>> -=================================
>>> +===============================
>>>
>>>  ## Contents
>>>
>>> @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE
>>>  7. [QOS](#qos)
>>>  8. [Rate Limiting](#rl)
>>>  9. [Flow Control](#fc)
>>> -10. [Vsperf](#vsperf)
>>> +10. [Jumbo Frames](#jumbo)
>>> +11. [Vsperf](#vsperf)
>>>
>>>  ## <a name="overview"></a> 1. Overview
>>>
>>> @@ -862,7 +863,59 @@ respective parameter. To disable the flow control at tx side,
>>>
>>>  `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`
>>>
>>> -## <a name="vsperf"></a> 10. Vsperf
>>> +## <a name="jumbo"></a> 10. Jumbo Frames
>>> +
>>> +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
>>> +enable Jumbo Frames support for a DPDK port, change the Interface's `mtu_request`
>>> +attribute to a sufficiently large value.
>>> +
>>> +e.g. Add a DPDK Phy port with MTU of 9000:
>>> +
>>> +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0
>> mtu_request=9000`
>>> +
>>> +e.g. Change the MTU of an existing port to 6200:
>>> +
>>> +`ovs-vsctl set Interface dpdk0 mtu_request=6200`
>>> +
>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>>> +increased, such that a full Jumbo Frame of a specific size may be accommodated
>>> +within a single mbuf segment.
>>> +
>>> +Jumbo frame support has been validated against 9728B frames (largest frame size
>>> +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames
>>> +(particularly in use cases involving East-West traffic only), and other DPDK NIC
>>> +drivers may be supported.
>>> +
>>> +### 9.1 vHost Ports and Jumbo Frames
>>> +
>>> +Some additional configuration is needed to take advantage of jumbo frames with
>>> +vhost ports:
>>> +
>>> +    1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
>>> +        the QEMU command line snippet below:
>>> +
>>> +        ```
>>> +        '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>> +        '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>> +        ```
>>> +
>>> +    2. Where virtio devices are bound to the Linux kernel driver in a guest
>>> +       environment (i.e. interfaces are not bound to an in-guest DPDK driver),
>>> +       the MTU of those logical network interfaces must also be increased to a
>>> +       sufficiently large value. This avoids segmentation of Jumbo Frames
>>> +       received in the guest. Note that 'MTU' refers to the length of the IP
>>> +       packet only, and not that of the entire frame.
>>> +
>>> +       To calculate the exact MTU of a standard IPv4 frame, subtract the L2
>>> +       header and CRC lengths (i.e. 18B) from the max supported frame size.
>>> +       So, to set the MTU for a 9018B Jumbo Frame:
>>> +
>>> +       ```
>>> +       ifconfig eth1 mtu 9000
>>> +       ```
>>> +>>>>>>> 5ec921d... netdev-dpdk: add support for Jumbo Frames
>>
>> Looks like rebasing artefact.
> 
> Yup - I'll remove in V2.
> 
>>
>>> +
>>> +## <a name="vsperf"></a> 11. Vsperf
>>>
>>>  Vsperf project goal is to develop vSwitch test framework that can be used to
>>>  validate the suitability of different vSwitch implementations in a Telco deployment
>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>> index 7609aa7..25c79de 100644
>>> --- a/INSTALL.DPDK.md
>>> +++ b/INSTALL.DPDK.md
>>> @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough].
>>>
>>>  ## <a name="ovslimits"></a> 6. Limitations
>>>
>>> -  - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future OVS release.
>>>    - Currently DPDK ports does not use HW offload functionality.
>>>    - Network Interface Firmware requirements:
>>>      Each release of DPDK is validated against a specific firmware version for
>>> diff --git a/NEWS b/NEWS
>>> index 0ff5616..c004e5f 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -68,6 +68,7 @@ Post-v2.5.0
>>>         is enabled in DPDK.
>>>       * Basic connection tracking for the userspace datapath (no ALG,
>>>         fragmentation or NAT support yet)
>>> +     * Jumbo frame support
>>>     - Increase number of registers to 16.
>>>     - ovs-benchmark: This utility has been removed due to lack of use and
>>>       bitrot.
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 0b6e410..68639ae 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -82,6 +82,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>                                      + sizeof(struct dp_packet)    \
>>>                                      + RTE_PKTMBUF_HEADROOM)
>>>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>>> +#define NETDEV_DPDK_MAX_PKT_LEN     9728
>>>
>>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>>> @@ -336,6 +337,7 @@ struct netdev_dpdk {
>>>      struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>>>
>>>      struct dpdk_mp *dpdk_mp;
>>> +    int mtu_request;
>>>      int mtu;
>>>      int socket_id;
>>>      int buf_size;
>>> @@ -474,10 +476,19 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>>      dmp->mtu = mtu;
>>>      dmp->refcount = 1;
>>>      mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet);
>>> -    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) -
>>> -                              sizeof (struct rte_mbuf);
>>> +    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet)
>>> +                              - sizeof (struct rte_mbuf);
>>> +    /* XXX: this is a really rough method of provisioning memory.
>>> +     * It's impossible to determine what the exact memory requirements are when
>>> +     * the number of ports and rxqs that utilize a particular mempool can change
>>> +     * dynamically at runtime. For the moment, use this rough heurisitic.
>>> +     */
>>> +    if (mtu >= ETHER_MTU) {
>>> +        mp_size = MAX_NB_MBUF;
>>> +    } else {
>>> +        mp_size = MIN_NB_MBUF;
>>> +    }
>>>
>>> -    mp_size = MAX_NB_MBUF;
>>>      do {
>>>          if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>>>                       dmp->mtu, dmp->socket_id, mp_size) < 0) {
>>> @@ -522,6 +533,32 @@ dpdk_mp_put(struct dpdk_mp *dmp)
>>>  #endif
>>>  }
>>>
>>> +static int
>>> +dpdk_mp_configure(struct netdev_dpdk *dev)
>>> +    OVS_REQUIRES(dpdk_mutex)
>>> +    OVS_REQUIRES(dev->mutex)
>>> +{
>>> +    uint32_t buf_size = dpdk_buf_size(dev->mtu);
>>> +    struct dpdk_mp *mp_old, *mp;
>>> +
>>> +    mp_old = dev->dpdk_mp;
>>
>> Do we need this variable? We can just put old mempool before
>> assigning the new one.
> 
> Agreed - will remove in V2.
> 
>>
>>> +
>>> +    mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>>> +    if (!mp) {
>>> +        VLOG_ERR("Insufficient memory to create memory pool for netdev %s\n",
>>> +                dev->up.name);
>>
>> +1 space character.
> 
> Sure - will fix.
> 
>>
>>> +        return ENOMEM;
>>> +    }
>>> +
>>> +    dev->dpdk_mp = mp;
>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>> +
>>> +    dpdk_mp_put(mp_old);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +
>>>  static void
>>>  check_link_status(struct netdev_dpdk *dev)
>>>  {
>>> @@ -573,7 +610,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int
>> n_txq)
>>>  {
>>>      int diag = 0;
>>>      int i;
>>> +    struct rte_eth_conf conf = port_conf;
>>>
>>> +    if (dev->mtu > ETHER_MTU) {
>>> +        conf.rxmode.jumbo_frame = 1;
>>> +        conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
>>> +    } else {
>>> +        conf.rxmode.jumbo_frame = 0;
>>> +        conf.rxmode.max_rx_pkt_len = 0;
>>
>> I know, it was implemented this way in the original patch, but I'm
>> not sure that all DPDK drivers will handle zero value of
>> 'max_rx_pkt_len' in a right way.
> 
> Out of interest, can you point to any examples of DPDK drivers handling max_rx_pkt_len = 0 with unintended behaviour?
> This field should only be of relevance if rxmode.jumbo_frame = 1, and 0 is its default value; if it is an issue though,

I don't know exactly. One place in the DPDK i40e driver confuses me a little:
the function 'i40evf_rxq_init()' in drivers/net/i40e/i40e_ethdev_vf.c
checks the value of 'rxq->max_pkt_len' even if 'dev_conf.rxmode.jumbo_frame == 0'.

> I can just default instead to ETHER_MAX_LEN.

Will VLANs be handled properly in this case?

Patch

diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md
index 191e69e..5cd64bf 100755
--- a/INSTALL.DPDK-ADVANCED.md
+++ b/INSTALL.DPDK-ADVANCED.md
@@ -1,5 +1,5 @@ 
 OVS DPDK ADVANCED INSTALL GUIDE
-=================================
+===============================
 
 ## Contents
 
@@ -12,7 +12,8 @@  OVS DPDK ADVANCED INSTALL GUIDE
 7. [QOS](#qos)
 8. [Rate Limiting](#rl)
 9. [Flow Control](#fc)
-10. [Vsperf](#vsperf)
+10. [Jumbo Frames](#jumbo)
+11. [Vsperf](#vsperf)
 
 ## <a name="overview"></a> 1. Overview
 
@@ -862,7 +863,59 @@  respective parameter. To disable the flow control at tx side,
 
 `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`
 
-## <a name="vsperf"></a> 10. Vsperf
+## <a name="jumbo"></a> 10. Jumbo Frames
+
+By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
+enable Jumbo Frame support for a DPDK port, change the Interface's `mtu_request`
+attribute to a sufficiently large value.
+
+e.g. Add a DPDK Phy port with MTU of 9000:
+
+`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000`
+
+e.g. Change the MTU of an existing port to 6200:
+
+`ovs-vsctl set Interface dpdk0 mtu_request=6200`
+
+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments is
+increased, such that a full Jumbo Frame of a specific size may be accommodated
+within a single mbuf segment.
+
+Jumbo frame support has been validated against 9728B frames (the largest frame
+size supported by the Fortville NIC) using the DPDK `i40e` driver, but larger
+frames and other DPDK NIC drivers may also work (particularly in use cases
+that involve East-West traffic only).
+
+### 10.1 vHost Ports and Jumbo Frames
+
+Some additional configuration is needed to take advantage of jumbo frames with
+vhost ports:
+
+    1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
+        the QEMU command line snippet below:
+
+        ```
+        '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
+        '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
+        ```
+
+    2. Where virtio devices are bound to the Linux kernel driver in a guest
+       environment (i.e. interfaces are not bound to an in-guest DPDK driver),
+       the MTU of those logical network interfaces must also be increased to a
+       sufficiently large value. This avoids segmentation of Jumbo Frames
+       received in the guest. Note that 'MTU' refers to the length of the IP
+       packet only, and not that of the entire frame.
+
+       To calculate the exact MTU of a standard IPv4 frame, subtract the L2
+       header and CRC lengths (i.e. 18B) from the max supported frame size.
+       So, to set the MTU for a 9018B Jumbo Frame:
+
+       ```
+       ifconfig eth1 mtu 9000
+       ```
+
+## <a name="vsperf"></a> 11. Vsperf
 
 Vsperf project goal is to develop vSwitch test framework that can be used to
 validate the suitability of different vSwitch implementations in a Telco deployment
diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 7609aa7..25c79de 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -590,7 +590,6 @@  can be found in [Vhost Walkthrough].
 
 ## <a name="ovslimits"></a> 6. Limitations
 
-  - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future OVS release.
   - Currently DPDK ports does not use HW offload functionality.
   - Network Interface Firmware requirements:
     Each release of DPDK is validated against a specific firmware version for
diff --git a/NEWS b/NEWS
index 0ff5616..c004e5f 100644
--- a/NEWS
+++ b/NEWS
@@ -68,6 +68,7 @@  Post-v2.5.0
        is enabled in DPDK.
      * Basic connection tracking for the userspace datapath (no ALG,
        fragmentation or NAT support yet)
+     * Jumbo frame support
    - Increase number of registers to 16.
    - ovs-benchmark: This utility has been removed due to lack of use and
      bitrot.
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 0b6e410..68639ae 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -82,6 +82,7 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
                                     + sizeof(struct dp_packet)    \
                                     + RTE_PKTMBUF_HEADROOM)
 #define NETDEV_DPDK_MBUF_ALIGN      1024
+#define NETDEV_DPDK_MAX_PKT_LEN     9728
 
 /* Max and min number of packets in the mempool.  OVS tries to allocate a
  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
@@ -336,6 +337,7 @@  struct netdev_dpdk {
     struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
 
     struct dpdk_mp *dpdk_mp;
+    int mtu_request;
     int mtu;
     int socket_id;
     int buf_size;
@@ -474,10 +476,19 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     dmp->mtu = mtu;
     dmp->refcount = 1;
     mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet);
-    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) -
-                              sizeof (struct rte_mbuf);
+    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet)
+                              - sizeof (struct rte_mbuf);
+    /* XXX: this is a really rough method of provisioning memory.
+     * It's impossible to determine what the exact memory requirements are when
+     * the number of ports and rxqs that utilize a particular mempool can change
+     * dynamically at runtime. For the moment, use this rough heuristic.
+     */
+    if (mtu >= ETHER_MTU) {
+        mp_size = MAX_NB_MBUF;
+    } else {
+        mp_size = MIN_NB_MBUF;
+    }
 
-    mp_size = MAX_NB_MBUF;
     do {
         if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
                      dmp->mtu, dmp->socket_id, mp_size) < 0) {
@@ -522,6 +533,32 @@  dpdk_mp_put(struct dpdk_mp *dmp)
 #endif
 }
 
+static int
+dpdk_mp_configure(struct netdev_dpdk *dev)
+    OVS_REQUIRES(dpdk_mutex)
+    OVS_REQUIRES(dev->mutex)
+{
+    uint32_t buf_size = dpdk_buf_size(dev->mtu);
+    struct dpdk_mp *mp_old, *mp;
+
+    mp_old = dev->dpdk_mp;
+
+    mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
+    if (!mp) {
+        VLOG_ERR("Insufficient memory to create memory pool for netdev %s\n",
+                dev->up.name);
+        return ENOMEM;
+    }
+
+    dev->dpdk_mp = mp;
+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+
+    dpdk_mp_put(mp_old);
+
+    return 0;
+}
+
+
 static void
 check_link_status(struct netdev_dpdk *dev)
 {
@@ -573,7 +610,15 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 {
     int diag = 0;
     int i;
+    struct rte_eth_conf conf = port_conf;
 
+    if (dev->mtu > ETHER_MTU) {
+        conf.rxmode.jumbo_frame = 1;
+        conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
+    } else {
+        conf.rxmode.jumbo_frame = 0;
+        conf.rxmode.max_rx_pkt_len = 0;
+    }
     /* A device may report more queues than it makes available (this has
      * been observed for Intel xl710, which reserves some of them for
      * SRIOV):  rte_eth_*_queue_setup will fail if a queue is not
@@ -584,8 +629,10 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
         }
 
-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
         if (diag) {
+            VLOG_WARN("Interface %s eth_dev setup error %s\n",
+                      dev->up.name, rte_strerror(-diag));
             break;
         }
 
@@ -738,7 +785,6 @@  netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     int sid;
     int err = 0;
-    uint32_t buf_size;
 
     ovs_mutex_init(&dev->mutex);
     ovs_mutex_lock(&dev->mutex);
@@ -759,13 +805,11 @@  netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
     dev->port_id = port_no;
     dev->type = type;
     dev->flags = 0;
-    dev->mtu = ETHER_MTU;
+    dev->mtu_request = dev->mtu = ETHER_MTU;
     dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
 
-    buf_size = dpdk_buf_size(dev->mtu);
-    dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
-    if (!dev->dpdk_mp) {
-        err = ENOMEM;
+    err = dpdk_mp_configure(dev);
+    if (err) {
         goto unlock;
     }
 
@@ -986,6 +1030,7 @@  netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
     smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq);
     smap_add_format(args, "requested_tx_queues", "%d", dev->requested_n_txq);
     smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq);
+    smap_add_format(args, "mtu", "%d", dev->mtu);
     ovs_mutex_unlock(&dev->mutex);
 
     return 0;
@@ -1362,6 +1407,7 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
     struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
     unsigned int total_pkts = cnt;
     unsigned int qos_pkts = cnt;
+    unsigned int mtu_dropped = 0;
     int retries = 0;
 
     qid = dev->tx_q[qid % netdev->n_txq].map;
@@ -1383,25 +1429,41 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
     do {
         int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
         unsigned int tx_pkts;
+        unsigned int try_tx_pkts = cnt;
 
+        for (unsigned int i = 0; i < cnt; i++) {
+            if (cur_pkts[i]->pkt_len > dev->max_packet_len) {
+                try_tx_pkts = i;
+                break;
+            }
+        }
+        if (!try_tx_pkts) {
+            cur_pkts++;
+            mtu_dropped++;
+            cnt--;
+            continue;
+        }
         tx_pkts = rte_vhost_enqueue_burst(virtio_dev, vhost_qid,
-                                          cur_pkts, cnt);
+                                          cur_pkts, try_tx_pkts);
         if (OVS_LIKELY(tx_pkts)) {
             /* Packets have been sent.*/
             cnt -= tx_pkts;
             /* Prepare for possible retry.*/
             cur_pkts = &cur_pkts[tx_pkts];
+            if (tx_pkts != try_tx_pkts) {
+                retries++;
+            }
         } else {
             /* No packets sent - do not retry.*/
             break;
         }
-    } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
+    } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM));
 
     rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
 
     rte_spinlock_lock(&dev->stats_lock);
-    cnt += qos_pkts;
-    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt);
+    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
+                                         cnt + mtu_dropped + qos_pkts);
     rte_spinlock_unlock(&dev->stats_lock);
 
 out:
@@ -1635,6 +1697,26 @@  netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
 }
 
 static int
+netdev_dpdk_set_mtu(struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+    if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN) {
+        VLOG_WARN("Unsupported MTU (%d)\n", mtu);
+        return EINVAL;
+    }
+
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu_request != mtu) {
+        dev->mtu_request = mtu;
+        netdev_request_reconfigure(netdev);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return 0;
+}
+
+static int
 netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
 
 static int
@@ -2787,7 +2869,8 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
     ovs_mutex_lock(&dev->mutex);
 
     if (netdev->n_txq == dev->requested_n_txq
-        && netdev->n_rxq == dev->requested_n_rxq) {
+        && netdev->n_rxq == dev->requested_n_rxq
+        && dev->mtu == dev->mtu_request) {
         /* Reconfiguration is unnecessary */
 
         goto out;
@@ -2795,6 +2878,14 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
 
     rte_eth_dev_stop(dev->port_id);
 
+    if (dev->mtu != dev->mtu_request) {
+        dev->mtu = dev->mtu_request;
+        err = dpdk_mp_configure(dev);
+        if (err) {
+            goto out;
+        }
+    }
+
     netdev->n_txq = dev->requested_n_txq;
     netdev->n_rxq = dev->requested_n_rxq;
 
@@ -2802,6 +2893,8 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
     err = dpdk_eth_dev_init(dev);
     netdev_dpdk_alloc_txq(dev, netdev->n_txq);
 
+    netdev_change_seq_changed(netdev);
+
 out:
 
     ovs_mutex_unlock(&dev->mutex);
@@ -2830,20 +2923,23 @@  netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
 
     netdev_dpdk_remap_txqs(dev);
 
-    if (dev->requested_socket_id != dev->socket_id) {
+    if (dev->requested_socket_id != dev->socket_id
+        || dev->mtu_request != dev->mtu) {
         dev->socket_id = dev->requested_socket_id;
-        /* Change mempool to new NUMA Node */
-        dpdk_mp_put(dev->dpdk_mp);
-        dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu);
-        if (!dev->dpdk_mp) {
-            err = ENOMEM;
+        dev->mtu = dev->mtu_request;
+        /* Change mempool to new NUMA Node and to new MTU. */
+        err = dpdk_mp_configure(dev);
+        if (err) {
+            goto out;
         }
+        netdev_change_seq_changed(netdev);
     }
 
     if (virtio_dev) {
         virtio_dev->flags |= VIRTIO_DEV_RUNNING;
     }
 
+out:
     ovs_mutex_unlock(&dev->mutex);
     ovs_mutex_unlock(&dpdk_mutex);
 
@@ -2854,6 +2950,7 @@  static int
 netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int err = 0;
 
     ovs_mutex_lock(&dpdk_mutex);
     ovs_mutex_lock(&dev->mutex);
@@ -2861,10 +2958,18 @@  netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
     netdev->n_txq = dev->requested_n_txq;
     netdev->n_rxq = 1;
 
+    if (dev->mtu_request != dev->mtu) {
+        /* Change mempool to new MTU. */
+        err = dpdk_mp_configure(dev);
+        if (!err) {
+            netdev_change_seq_changed(netdev);
+        }
+    }
+
     ovs_mutex_unlock(&dev->mutex);
     ovs_mutex_unlock(&dpdk_mutex);
 
-    return 0;
+    return err;
 }
 
 #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,    \
@@ -2898,7 +3003,7 @@  netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
     netdev_dpdk_set_etheraddr,                                \
     netdev_dpdk_get_etheraddr,                                \
     netdev_dpdk_get_mtu,                                      \
-    NULL,                       /* set_mtu */                 \
+    netdev_dpdk_set_mtu,                                      \
     netdev_dpdk_get_ifindex,                                  \
     GET_CARRIER,                                              \
     netdev_dpdk_get_carrier_resets,                           \