[ovs-dev,v11,01/14] netdev-dpdk: fix mbuf sizing

Message ID 1539188552-129083-2-git-send-email-tiago.lam@intel.com
State New
Delegated to: Ian Stokes
Headers show
Series
  • Support multi-segment mbufs
Related show

Commit Message

Lam, Tiago Oct. 10, 2018, 4:22 p.m.
From: Mark Kavanagh <mark.b.kavanagh@intel.com>

There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance With Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp packet, which includes a struct rte mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

    mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
assumption is assumed to be true for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
---
 lib/netdev-dpdk.c | 56 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 18 deletions(-)

Comments

Stokes, Ian Oct. 16, 2018, 1:46 p.m. | #1
> From: Mark Kavanagh <mark.b.kavanagh@intel.com>
> 
> There are numerous factors that must be considered when calculating the
> size of an mbuf:
> - the data portion of the mbuf must be sized in accordance With Rx
>   buffer alignment (typically 1024B). So, for example, in order to
>   successfully receive and capture a 1500B packet, mbufs with a
>   data portion of size 2048B must be used.
> - in OvS, the elements that comprise an mbuf are:
>   * the dp packet, which includes a struct rte mbuf (704B)
>   * RTE_PKTMBUF_HEADROOM (128B)
>   * packet data (aligned to 1k, as previously described)
>   * RTE_PKTMBUF_TAILROOM (typically 0)
> 
> Some PMDs require that the total mbuf size (i.e. the total sum of all of
> the above-listed components' lengths) is cache-aligned. To satisfy this
> requirement, it may be necessary to round up the total mbuf size with
> respect to cacheline size. In doing so, it's possible that the dp_packet's
> data portion is inadvertently increased in size, such that it no longer
> adheres to Rx buffer alignment. Consequently, the following property of
> the mbuf no longer holds true:
> 
>     mbuf.data_len == mbuf.buf_len - mbuf.data_off
> 
> This creates a problem in the case of multi-segment mbufs, where that
> assumption is assumed to be true for all but the final segment in an mbuf
> chain. Resolve this issue by adjusting the size of the mbuf's private data
> portion, as opposed to the packet data portion when aligning mbuf size to
> cachelines.
> 
> Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
> Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>
> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
> Acked-by: Eelco Chaudron <echaudro@redhat.com>

Thanks for this Tiago, I think this would make sense to apply independently of the multi seg series. Thoughts? It could also be backported to 2.10 and 2.9 at least.

One other issue is that I'd like to see the documentation regarding the memory model examples updated as part of this patch.

For example prior to this patch the memory requested for an MTU of 1500 would be 3008 Bytes per mbuf (Header size + element size + trailer size). After this patch however it would be 2880. You could be updating the examples in a later path in the series but I would prefer to keep the docs in sync with the code changes in the same patch.

Thanks
Ian

> ---
>  lib/netdev-dpdk.c | 56 +++++++++++++++++++++++++++++++++++++-------------
> -----
>  1 file changed, 38 insertions(+), 18 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index f91aa27..11659eb
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -88,10 +88,6 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20);
>  #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
>  #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)                    \
>                                       - ETHER_HDR_LEN - ETHER_CRC_LEN)
> -#define MBUF_SIZE(mtu)              ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
> -                                             + sizeof(struct dp_packet) \
> -                                             + RTE_PKTMBUF_HEADROOM),   \
> -                                             RTE_CACHE_LINE_SIZE)
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
> 
> @@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool
> per_port_mp)
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      const char *netdev_name = netdev_get_name(&dev->up);
>      int socket_id = dev->requested_socket_id;
> -    uint32_t n_mbufs;
> +    uint32_t n_mbufs = 0;
> +    uint32_t mbuf_size = 0;
> +    uint32_t aligned_mbuf_size = 0;
> +    uint32_t mbuf_priv_data_len = 0;
> +    uint32_t pkt_size = 0;
>      uint32_t hash = hash_string(netdev_name, 0);
>      struct dpdk_mp *dmp = NULL;
>      int ret;
> @@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool
> per_port_mp)
>      dmp->mtu = mtu;
>      dmp->refcount = 1;
> 
> +    /* Get the size of each mbuf, based on the MTU */
> +    mbuf_size = dpdk_buf_size(dev->requested_mtu);
> +
>      n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
> 
>      do {
> @@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool
> per_port_mp)
>           * so this is not an issue for tasks such as debugging.
>           */
>          ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> -                           "ovs%08x%02d%05d%07u",
> -                           hash, socket_id, mtu, n_mbufs);
> +                       "ovs%08x%02d%05d%07u",
> +                        hash, socket_id, mtu, n_mbufs);
>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>              VLOG_DBG("snprintf returned %d. "
>                       "Failed to generate a mempool name for \"%s\". "
> @@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu,
> bool per_port_mp)
>              break;
>          }
> 
> -        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
> -                  "on socket %d for %d Rx and %d Tx queues.",
> -                  netdev_name, n_mbufs, socket_id,
> -                  dev->requested_n_rxq, dev->requested_n_txq);
> -
> -        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> -                                          MP_CACHE_SZ,
> -                                          sizeof (struct dp_packet)
> -                                          - sizeof (struct rte_mbuf),
> -                                          MBUF_SIZE(mtu)
> -                                          - sizeof(struct dp_packet),
> +        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
> +                  "on socket %d for %d Rx and %d Tx queues, "
> +                  "cache line size of %u",
> +                  netdev_name, n_mbufs, mbuf_size, socket_id,
> +                  dev->requested_n_rxq, dev->requested_n_txq,
> +                  RTE_CACHE_LINE_SIZE);
> +
> +        mbuf_priv_data_len = sizeof(struct dp_packet) -
> +                                 sizeof(struct rte_mbuf);
> +        /* The size of the entire dp_packet. */
> +        pkt_size = sizeof (struct dp_packet) +
> +                                mbuf_size + RTE_PKTMBUF_HEADROOM;
> +        /* mbuf size, rounded up to cacheline size. */
> +        aligned_mbuf_size = ROUND_UP(pkt_size, RTE_CACHE_LINE_SIZE);
> +        /* If there is a size discrepancy, add padding to
> mbuf_priv_data_len.
> +         * This maintains mbuf size cache alignment, while also honoring
> RX
> +         * buffer alignment in the data portion of the mbuf. If this
> adjustment
> +         * is not made, there is a possiblity later on that for an
> element of
> +         * the mempool, buf, buf->data_len < (buf->buf_len - buf-
> >data_off).
> +         * This is problematic in the case of multi-segment mbufs,
> particularly
> +         * when an mbuf segment needs to be resized (when [push|popp]ing
> a VLAN
> +         * header, for example.
> +         */
> +        mbuf_priv_data_len += (aligned_mbuf_size - pkt_size);
> +
> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> +                                          mbuf_priv_data_len,
> +                                          mbuf_size,
>                                            socket_id);
> 
>          if (dmp->mp) {
> --
> 2.7.4
Lam, Tiago Oct. 16, 2018, 4:50 p.m. | #2
On 16/10/2018 14:46, Stokes, Ian wrote:
>> From: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>
>> There are numerous factors that must be considered when calculating the
>> size of an mbuf:
>> - the data portion of the mbuf must be sized in accordance With Rx
>>   buffer alignment (typically 1024B). So, for example, in order to
>>   successfully receive and capture a 1500B packet, mbufs with a
>>   data portion of size 2048B must be used.
>> - in OvS, the elements that comprise an mbuf are:
>>   * the dp packet, which includes a struct rte mbuf (704B)
>>   * RTE_PKTMBUF_HEADROOM (128B)
>>   * packet data (aligned to 1k, as previously described)
>>   * RTE_PKTMBUF_TAILROOM (typically 0)
>>
>> Some PMDs require that the total mbuf size (i.e. the total sum of all of
>> the above-listed components' lengths) is cache-aligned. To satisfy this
>> requirement, it may be necessary to round up the total mbuf size with
>> respect to cacheline size. In doing so, it's possible that the dp_packet's
>> data portion is inadvertently increased in size, such that it no longer
>> adheres to Rx buffer alignment. Consequently, the following property of
>> the mbuf no longer holds true:
>>
>>     mbuf.data_len == mbuf.buf_len - mbuf.data_off
>>
>> This creates a problem in the case of multi-segment mbufs, where that
>> assumption is assumed to be true for all but the final segment in an mbuf
>> chain. Resolve this issue by adjusting the size of the mbuf's private data
>> portion, as opposed to the packet data portion when aligning mbuf size to
>> cachelines.
>>
>> Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
>> Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
>> Acked-by: Eelco Chaudron <echaudro@redhat.com>
> 
> Thanks for this Tiago, I think this would make sense to apply independently of the multi seg series. Thoughts? It could also be backported to 2.10 and 2.9 at least.
> 
> One other issue is that I'd like to see the documentation regarding the memory model examples updated as part of this patch.
> 
> For example prior to this patch the memory requested for an MTU of 1500 would be 3008 Bytes per mbuf (Header size + element size + trailer size). After this patch however it would be 2880. You could be updating the examples in a later path in the series but I would prefer to keep the docs in sync with the code changes in the same patch.

Thanks for looking into this, Ian.

This makes sense to me as there's nothing in these patches specific to
multi-segment mbufs. Just some cleanups and fixes.

I'll add the documentation fixes and your comments to 02/14 as part of
the next iteration.

Tiago.
Ilya Maximets Oct. 17, 2018, 4:30 p.m. | #3
On 10.10.2018 19:22, Tiago Lam wrote:
> From: Mark Kavanagh <mark.b.kavanagh@intel.com>
> 
> There are numerous factors that must be considered when calculating
> the size of an mbuf:
> - the data portion of the mbuf must be sized in accordance With Rx
>   buffer alignment (typically 1024B). So, for example, in order to
>   successfully receive and capture a 1500B packet, mbufs with a
>   data portion of size 2048B must be used.
> - in OvS, the elements that comprise an mbuf are:
>   * the dp packet, which includes a struct rte mbuf (704B)
>   * RTE_PKTMBUF_HEADROOM (128B)
>   * packet data (aligned to 1k, as previously described)
>   * RTE_PKTMBUF_TAILROOM (typically 0)
> 
> Some PMDs require that the total mbuf size (i.e. the total sum of all
> of the above-listed components' lengths) is cache-aligned. To satisfy
> this requirement, it may be necessary to round up the total mbuf size
> with respect to cacheline size. In doing so, it's possible that the
> dp_packet's data portion is inadvertently increased in size, such that
> it no longer adheres to Rx buffer alignment. Consequently, the
> following property of the mbuf no longer holds true:
> 
>     mbuf.data_len == mbuf.buf_len - mbuf.data_off
> 
> This creates a problem in the case of multi-segment mbufs, where that
> assumption is assumed to be true for all but the final segment in an
> mbuf chain. Resolve this issue by adjusting the size of the mbuf's
> private data portion, as opposed to the packet data portion when
> aligning mbuf size to cachelines.
> 
> Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
> Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>
> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
> Acked-by: Eelco Chaudron <echaudro@redhat.com>
> ---
>  lib/netdev-dpdk.c | 56 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 38 insertions(+), 18 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index f91aa27..11659eb 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
>  #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)                    \
>                                       - ETHER_HDR_LEN - ETHER_CRC_LEN)
> -#define MBUF_SIZE(mtu)              ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
> -                                             + sizeof(struct dp_packet) \
> -                                             + RTE_PKTMBUF_HEADROOM),   \
> -                                             RTE_CACHE_LINE_SIZE)
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>  
> @@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      const char *netdev_name = netdev_get_name(&dev->up);
>      int socket_id = dev->requested_socket_id;
> -    uint32_t n_mbufs;
> +    uint32_t n_mbufs = 0;
> +    uint32_t mbuf_size = 0;
> +    uint32_t aligned_mbuf_size = 0;
> +    uint32_t mbuf_priv_data_len = 0;
> +    uint32_t pkt_size = 0;
>      uint32_t hash = hash_string(netdev_name, 0);
>      struct dpdk_mp *dmp = NULL;
>      int ret;
> @@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>      dmp->mtu = mtu;
>      dmp->refcount = 1;
>  
> +    /* Get the size of each mbuf, based on the MTU */
> +    mbuf_size = dpdk_buf_size(dev->requested_mtu);

1. You're using 'requested_mtu' here directly, which is different to
   'mtu' passed to this function. This looks strange and misleading
   for reader. Could you unify this?

2  Almost same as previous concern from Flavio.
   'mbuf_size' accounts RTE_PKTMBUF_HEADROOM here and rounded up.

> +
>      n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>  
>      do {
> @@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>           * so this is not an issue for tasks such as debugging.
>           */
>          ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> -                           "ovs%08x%02d%05d%07u",
> -                           hash, socket_id, mtu, n_mbufs);
> +                       "ovs%08x%02d%05d%07u",
> +                        hash, socket_id, mtu, n_mbufs);
>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>              VLOG_DBG("snprintf returned %d. "
>                       "Failed to generate a mempool name for \"%s\". "
> @@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>              break;
>          }
>  
> -        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
> -                  "on socket %d for %d Rx and %d Tx queues.",
> -                  netdev_name, n_mbufs, socket_id,
> -                  dev->requested_n_rxq, dev->requested_n_txq);
> -
> -        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> -                                          MP_CACHE_SZ,
> -                                          sizeof (struct dp_packet)
> -                                          - sizeof (struct rte_mbuf),
> -                                          MBUF_SIZE(mtu)
> -                                          - sizeof(struct dp_packet),
> +        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
> +                  "on socket %d for %d Rx and %d Tx queues, "
> +                  "cache line size of %u",
> +                  netdev_name, n_mbufs, mbuf_size, socket_id,
> +                  dev->requested_n_rxq, dev->requested_n_txq,
> +                  RTE_CACHE_LINE_SIZE);
> +
> +        mbuf_priv_data_len = sizeof(struct dp_packet) -
> +                                 sizeof(struct rte_mbuf);
> +        /* The size of the entire dp_packet. */
> +        pkt_size = sizeof (struct dp_packet) +
> +                                mbuf_size + RTE_PKTMBUF_HEADROOM;

2* Here RTE_PKTMBUF_HEADROOM accounted again. That is unclear.

> +        /* mbuf size, rounded up to cacheline size. */
> +        aligned_mbuf_size = ROUND_UP(pkt_size, RTE_CACHE_LINE_SIZE);
> +        /* If there is a size discrepancy, add padding to mbuf_priv_data_len.
> +         * This maintains mbuf size cache alignment, while also honoring RX
> +         * buffer alignment in the data portion of the mbuf. If this adjustment
> +         * is not made, there is a possiblity later on that for an element of
> +         * the mempool, buf, buf->data_len < (buf->buf_len - buf->data_off).
> +         * This is problematic in the case of multi-segment mbufs, particularly
> +         * when an mbuf segment needs to be resized (when [push|popp]ing a VLAN
> +         * header, for example.
> +         */
> +        mbuf_priv_data_len += (aligned_mbuf_size - pkt_size);
> +
> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> +                                          mbuf_priv_data_len,
> +                                          mbuf_size,
>                                            socket_id);
>  
>          if (dmp->mp) {
>
Lam, Tiago Oct. 18, 2018, 7:41 a.m. | #4
Hi Ilya,

Thanks for the comments, my replies are in-line.

On 17/10/2018 17:30, Ilya Maximets wrote:
> On 10.10.2018 19:22, Tiago Lam wrote:
>> From: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>
>> There are numerous factors that must be considered when calculating
>> the size of an mbuf:
>> - the data portion of the mbuf must be sized in accordance With Rx
>>   buffer alignment (typically 1024B). So, for example, in order to
>>   successfully receive and capture a 1500B packet, mbufs with a
>>   data portion of size 2048B must be used.
>> - in OvS, the elements that comprise an mbuf are:
>>   * the dp packet, which includes a struct rte mbuf (704B)
>>   * RTE_PKTMBUF_HEADROOM (128B)
>>   * packet data (aligned to 1k, as previously described)
>>   * RTE_PKTMBUF_TAILROOM (typically 0)
>>
>> Some PMDs require that the total mbuf size (i.e. the total sum of all
>> of the above-listed components' lengths) is cache-aligned. To satisfy
>> this requirement, it may be necessary to round up the total mbuf size
>> with respect to cacheline size. In doing so, it's possible that the
>> dp_packet's data portion is inadvertently increased in size, such that
>> it no longer adheres to Rx buffer alignment. Consequently, the
>> following property of the mbuf no longer holds true:
>>
>>     mbuf.data_len == mbuf.buf_len - mbuf.data_off
>>
>> This creates a problem in the case of multi-segment mbufs, where that
>> assumption is assumed to be true for all but the final segment in an
>> mbuf chain. Resolve this issue by adjusting the size of the mbuf's
>> private data portion, as opposed to the packet data portion when
>> aligning mbuf size to cachelines.
>>
>> Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
>> Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
>> Acked-by: Eelco Chaudron <echaudro@redhat.com>
>> ---
>>  lib/netdev-dpdk.c | 56 +++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 38 insertions(+), 18 deletions(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index f91aa27..11659eb 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>  #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
>>  #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)                    \
>>                                       - ETHER_HDR_LEN - ETHER_CRC_LEN)
>> -#define MBUF_SIZE(mtu)              ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
>> -                                             + sizeof(struct dp_packet) \
>> -                                             + RTE_PKTMBUF_HEADROOM),   \
>> -                                             RTE_CACHE_LINE_SIZE)
>>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>  
>> @@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>      const char *netdev_name = netdev_get_name(&dev->up);
>>      int socket_id = dev->requested_socket_id;
>> -    uint32_t n_mbufs;
>> +    uint32_t n_mbufs = 0;
>> +    uint32_t mbuf_size = 0;
>> +    uint32_t aligned_mbuf_size = 0;
>> +    uint32_t mbuf_priv_data_len = 0;
>> +    uint32_t pkt_size = 0;
>>      uint32_t hash = hash_string(netdev_name, 0);
>>      struct dpdk_mp *dmp = NULL;
>>      int ret;
>> @@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>>  
>> +    /* Get the size of each mbuf, based on the MTU */
>> +    mbuf_size = dpdk_buf_size(dev->requested_mtu);
> 
> 1. You're using 'requested_mtu' here directly, which is different to
>    'mtu' passed to this function. This looks strange and misleading
>    for reader. Could you unify this?

I see your point. The `mtu` passed to dpdk_mp_create() is a result of
using the FRAME_LEN_TO_MTU macro on the buf_size already (in
netdev_dpdk_mempool_configure()). So, I can use the reverse,
MTU_TO_FRAME_LEN, to get bcak the mbuf_size. How does that sound?

> 
> 2  Almost same as previous concern from Flavio.
>    'mbuf_size' accounts RTE_PKTMBUF_HEADROOM here and rounded up.

Thanks, I've replied below.

> 
>> +
>>      n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>  
>>      do {
>> @@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>           * so this is not an issue for tasks such as debugging.
>>           */
>>          ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>> -                           "ovs%08x%02d%05d%07u",
>> -                           hash, socket_id, mtu, n_mbufs);
>> +                       "ovs%08x%02d%05d%07u",
>> +                        hash, socket_id, mtu, n_mbufs);
>>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>>              VLOG_DBG("snprintf returned %d. "
>>                       "Failed to generate a mempool name for \"%s\". "
>> @@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>              break;
>>          }
>>  
>> -        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
>> -                  "on socket %d for %d Rx and %d Tx queues.",
>> -                  netdev_name, n_mbufs, socket_id,
>> -                  dev->requested_n_rxq, dev->requested_n_txq);
>> -
>> -        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>> -                                          MP_CACHE_SZ,
>> -                                          sizeof (struct dp_packet)
>> -                                          - sizeof (struct rte_mbuf),
>> -                                          MBUF_SIZE(mtu)
>> -                                          - sizeof(struct dp_packet),
>> +        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
>> +                  "on socket %d for %d Rx and %d Tx queues, "
>> +                  "cache line size of %u",
>> +                  netdev_name, n_mbufs, mbuf_size, socket_id,
>> +                  dev->requested_n_rxq, dev->requested_n_txq,
>> +                  RTE_CACHE_LINE_SIZE);
>> +
>> +        mbuf_priv_data_len = sizeof(struct dp_packet) -
>> +                                 sizeof(struct rte_mbuf);
>> +        /* The size of the entire dp_packet. */
>> +        pkt_size = sizeof (struct dp_packet) +
>> +                                mbuf_size + RTE_PKTMBUF_HEADROOM;
> 
> 2* Here RTE_PKTMBUF_HEADROOM accounted again. That is unclear.

This had been removed already locally, only noticed after sending the
patchset.

Tiago.

> 
>> +        /* mbuf size, rounded up to cacheline size. */
>> +        aligned_mbuf_size = ROUND_UP(pkt_size, RTE_CACHE_LINE_SIZE);
>> +        /* If there is a size discrepancy, add padding to mbuf_priv_data_len.
>> +         * This maintains mbuf size cache alignment, while also honoring RX
>> +         * buffer alignment in the data portion of the mbuf. If this adjustment
>> +         * is not made, there is a possiblity later on that for an element of
>> +         * the mempool, buf, buf->data_len < (buf->buf_len - buf->data_off).
>> +         * This is problematic in the case of multi-segment mbufs, particularly
>> +         * when an mbuf segment needs to be resized (when [push|popp]ing a VLAN
>> +         * header, for example.
>> +         */
>> +        mbuf_priv_data_len += (aligned_mbuf_size - pkt_size);
>> +
>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>> +                                          mbuf_priv_data_len,
>> +                                          mbuf_size,
>>                                            socket_id);
>>  
>>          if (dmp->mp) {
>>

Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f91aa27..11659eb 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -88,10 +88,6 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
 #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)                    \
                                      - ETHER_HDR_LEN - ETHER_CRC_LEN)
-#define MBUF_SIZE(mtu)              ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
-                                             + sizeof(struct dp_packet) \
-                                             + RTE_PKTMBUF_HEADROOM),   \
-                                             RTE_CACHE_LINE_SIZE)
 #define NETDEV_DPDK_MBUF_ALIGN      1024
 #define NETDEV_DPDK_MAX_PKT_LEN     9728
 
@@ -637,7 +633,11 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
     char mp_name[RTE_MEMPOOL_NAMESIZE];
     const char *netdev_name = netdev_get_name(&dev->up);
     int socket_id = dev->requested_socket_id;
-    uint32_t n_mbufs;
+    uint32_t n_mbufs = 0;
+    uint32_t mbuf_size = 0;
+    uint32_t aligned_mbuf_size = 0;
+    uint32_t mbuf_priv_data_len = 0;
+    uint32_t pkt_size = 0;
     uint32_t hash = hash_string(netdev_name, 0);
     struct dpdk_mp *dmp = NULL;
     int ret;
@@ -650,6 +650,9 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
     dmp->mtu = mtu;
     dmp->refcount = 1;
 
+    /* Get the size of each mbuf, based on the MTU */
+    mbuf_size = dpdk_buf_size(dev->requested_mtu);
+
     n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
 
     do {
@@ -661,8 +664,8 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
          * so this is not an issue for tasks such as debugging.
          */
         ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
-                           "ovs%08x%02d%05d%07u",
-                           hash, socket_id, mtu, n_mbufs);
+                       "ovs%08x%02d%05d%07u",
+                        hash, socket_id, mtu, n_mbufs);
         if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
             VLOG_DBG("snprintf returned %d. "
                      "Failed to generate a mempool name for \"%s\". "
@@ -671,17 +674,34 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
             break;
         }
 
-        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
-                  "on socket %d for %d Rx and %d Tx queues.",
-                  netdev_name, n_mbufs, socket_id,
-                  dev->requested_n_rxq, dev->requested_n_txq);
-
-        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
-                                          MP_CACHE_SZ,
-                                          sizeof (struct dp_packet)
-                                          - sizeof (struct rte_mbuf),
-                                          MBUF_SIZE(mtu)
-                                          - sizeof(struct dp_packet),
+        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
+                  "on socket %d for %d Rx and %d Tx queues, "
+                  "cache line size of %u",
+                  netdev_name, n_mbufs, mbuf_size, socket_id,
+                  dev->requested_n_rxq, dev->requested_n_txq,
+                  RTE_CACHE_LINE_SIZE);
+
+        mbuf_priv_data_len = sizeof(struct dp_packet) -
+                                 sizeof(struct rte_mbuf);
+        /* The size of the entire dp_packet. */
+        pkt_size = sizeof (struct dp_packet) +
+                                mbuf_size + RTE_PKTMBUF_HEADROOM;
+        /* mbuf size, rounded up to cacheline size. */
+        aligned_mbuf_size = ROUND_UP(pkt_size, RTE_CACHE_LINE_SIZE);
+        /* If there is a size discrepancy, add padding to mbuf_priv_data_len.
+         * This maintains mbuf size cache alignment, while also honoring RX
+         * buffer alignment in the data portion of the mbuf. If this adjustment
+         * is not made, there is a possiblity later on that for an element of
+         * the mempool, buf, buf->data_len < (buf->buf_len - buf->data_off).
+         * This is problematic in the case of multi-segment mbufs, particularly
+         * when an mbuf segment needs to be resized (when [push|popp]ing a VLAN
+         * header, for example.
+         */
+        mbuf_priv_data_len += (aligned_mbuf_size - pkt_size);
+
+        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
+                                          mbuf_priv_data_len,
+                                          mbuf_size,
                                           socket_id);
 
         if (dmp->mp) {