diff mbox series

[ovs-dev,v3] netdev-dpdk: Add mbuf HEADROOM after alignment.

Message ID 1543337663-160716-1-git-send-email-tiago.lam@intel.com
State Accepted
Delegated to: Ian Stokes
Headers show
Series [ovs-dev,v3] netdev-dpdk: Add mbuf HEADROOM after alignment. | expand

Commit Message

Lam, Tiago Nov. 27, 2018, 4:54 p.m. UTC
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.

Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048

After aligment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176

This might seem insignificant, however, it might have performance
implications in DPDK, where each mbuf is expected to have 2k +
RTE_PKTMBUF_HEADROOM of available space. This is because not only some
NICs have course grained alignments of 1k, they will also take
RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
when setting up their Rx requirements. Thus, only the "After alignment"
case above would guarantee a 2k of available room, as the "Before
alignment" would report only 1920B.

Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html

Note: This has been found by Ian Stokes while going through some
af_packet checks.

Reported-by: Ian Stokes <ian.stokes@intel.com>
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam <tiago.lam@intel.com>
---
v3:
   - Take trailer_size into account when calculating mbuf size - Ian.

v2:
   - Rebase to master 85706c3 ("ovn: Avoid tunneling for VLAN packets
     redirected to a gateway chassis").

   - Fix mbuf size calculations under Documentation/topics/dpdk/memory.rst
     to take into account the header_size added to each mepool element (64
     bytes) - Ian.
---

 Documentation/topics/dpdk/memory.rst | 28 ++++++++++++++--------------
 lib/netdev-dpdk.c                    |  6 ++++--
 2 files changed, 18 insertions(+), 16 deletions(-)

Comments

Stokes, Ian Nov. 28, 2018, 3:56 p.m. UTC | #1
> Commit dfaf00e started using the result of dpdk_buf_size() to calculate
> the available size on each mbuf, as opposed to using the previous
> MBUF_SIZE macro. However, this was calculating the mbuf size by adding up
> the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
> NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
> RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.
> 
> Before alignment:
> ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048
> 
> After aligment:
> ROUNDUP(MTU(1500), 1024) + 128 = 2176
> 
> This might seem insignificant, however, it might have performance
> implications in DPDK, where each mbuf is expected to have 2k +
> RTE_PKTMBUF_HEADROOM of available space. This is because not only some
> NICs have course grained alignments of 1k, they will also take
> RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
> when setting up their Rx requirements. Thus, only the "After alignment"
> case above would guarantee a 2k of available room, as the "Before
> alignment" would report only 1920B.
> 
> Some extra information can be found at:
> https://mails.dpdk.org/archives/dev/2018-November/119219.html
> 
> Note: This has been found by Ian Stokes while going through some af_packet
> checks.
> 
> Reported-by: Ian Stokes <ian.stokes@intel.com>
> Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
> Signed-off-by: Tiago Lam <tiago.lam@intel.com>

Thanks Tiago, I've applied this to master. If I'm not mistaken I believe it should be back ported also, previous releases were using the incorrect calculation for dpdk_buf_size Thoughts?

Ian
Lam, Tiago Nov. 29, 2018, 1:04 p.m. UTC | #2
On 28/11/2018 15:56, Stokes, Ian wrote:
>> Commit dfaf00e started using the result of dpdk_buf_size() to calculate
>> the available size on each mbuf, as opposed to using the previous
>> MBUF_SIZE macro. However, this was calculating the mbuf size by adding up
>> the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
>> NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
>> RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.
>>
>> Before alignment:
>> ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048
>>
>> After aligment:
>> ROUNDUP(MTU(1500), 1024) + 128 = 2176
>>
>> This might seem insignificant, however, it might have performance
>> implications in DPDK, where each mbuf is expected to have 2k +
>> RTE_PKTMBUF_HEADROOM of available space. This is because not only some
>> NICs have course grained alignments of 1k, they will also take
>> RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
>> when setting up their Rx requirements. Thus, only the "After alignment"
>> case above would guarantee a 2k of available room, as the "Before
>> alignment" would report only 1920B.
>>
>> Some extra information can be found at:
>> https://mails.dpdk.org/archives/dev/2018-November/119219.html
>>
>> Note: This has been found by Ian Stokes while going through some af_packet
>> checks.
>>
>> Reported-by: Ian Stokes <ian.stokes@intel.com>
>> Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
>> Signed-off-by: Tiago Lam <tiago.lam@intel.com>
> 
> Thanks Tiago, I've applied this to master. If I'm not mistaken I believe it should be back ported also, previous releases were using the incorrect calculation for dpdk_buf_size Thoughts?

Thanks for that, Ian.

On back porting, that's a good point. I'm not too sure on this one here,
though.

From what I can gather, on previous versions until 2.6.x (so, 2.6.x,
2.7.x, 2.8.x, 2.9.x and 2.10.x) we won't have the same problem. This is
because we are over allocating the size of each mbuf and the size being
passed to `rte_pktmbuf_pool_create()` is adding RTE_PKTMBUF_HEADROOM for
the second time, since it had already been added in dpdk_buf_size() and
will be added again by the MBUF_SIZE macro. So, here we can potentially
save some space by back porting, but I don't think we ever pass less
size than we should into DPDK.

So, I think the TL;DR here is that this incorrect behavior was
introduced in commit dfaf00e ("netdev-dpdk: fix mbuf sizing") only,
where the MBUF_SIZE header was removed. As a result, we started relying
on dpdk_buf_size() result only, which adds RTE_PKTMBUF_HEADROOM first
and aligns to MBUF_ALIGN afterwards, meaning we can end up passing only
2k when calling DPDK's rte_pktmbuf_pool_create(), which may not be enough.

Back porting could make this up a bit, though. Let me know if you think
it would still make sense.

Regards,
Tiago.

(As an aside, from  2.7.x below, OvS-DPDK is not using
`rte_pktmbuf_pool_create` and is using `rte_mempool_create()` directly
(likely because the former wasn't available in DPDK yet). And there I
think passing RTE_PKTMBUF_HEADROOM twice is correct behavior since the
mempool API doesn't add `sizeof(struct rte_mbuf)` as
`rte_pktmbuf_pool_create()` does. So, this may very well be a leftover.)
diff mbox series

Patch

diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
index c9b739f..9ebfd11 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -107,8 +107,8 @@  Example 1
 
  MTU = 1500 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 3008 Bytes
+ Memory required = 262144 * 3008 = 788 MB
 
 Example 2
 +++++++++
@@ -116,8 +116,8 @@  Example 2
 
  MTU = 1800 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 3008 Bytes
+ Memory required = 262144 * 3008 = 788 MB
 
 .. note::
 
@@ -130,8 +130,8 @@  Example 3
 
  MTU = 6000 Bytes
  Number of mbufs = 262144
- Mbuf size = 8000 Bytes
- Memory required = 262144 * 8000 = 2097 MB
+ Mbuf size = 7104 Bytes
+ Memory required = 262144 * 7104 = 1862 MB
 
 Example 4
 +++++++++
@@ -139,8 +139,8 @@  Example 4
 
  MTU = 9000 Bytes
  Number of mbufs = 262144
- Mbuf size = 10048 Bytes
- Memory required = 262144 * 10048 = 2634 MB
+ Mbuf size = 10176 Bytes
+ Memory required = 262144 * 10176 = 2667 MB
 
 Per Port Memory Calculations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -194,8 +194,8 @@  Example 1: (1 rxq, 1 PMD, 1500 MTU)
 
  MTU = 1500
  Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
- Mbuf size = 2752 Bytes
- Memory required = 22560 * 2752 = 62 MB
+ Mbuf size = 3008 Bytes
+ Memory required = 22560 * 3008 = 67 MB
 
 Example 2: (1 rxq, 2 PMD, 6000 MTU)
 +++++++++++++++++++++++++++++++++++
@@ -203,8 +203,8 @@  Example 2: (1 rxq, 2 PMD, 6000 MTU)
 
  MTU = 6000
  Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
- Mbuf size = 8000 Bytes
- Memory required = 24608 * 8000 = 196 MB
+ Mbuf size = 7104 Bytes
+ Memory required = 24608 * 7104 = 175 MB
 
 Example 3: (2 rxq, 2 PMD, 9000 MTU)
 +++++++++++++++++++++++++++++++++++
@@ -212,5 +212,5 @@  Example 3: (2 rxq, 2 PMD, 9000 MTU)
 
  MTU = 9000
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
- Mbuf size = 10048 Bytes
- Memory required = 26656 * 10048 = 267 MB
+ Mbuf size = 10176 Bytes
+ Memory required = 26656 * 10176 = 271 MB
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index e8618a6..a871743 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -521,8 +521,8 @@  is_dpdk_class(const struct netdev_class *class)
 static uint32_t
 dpdk_buf_size(int mtu)
 {
-    return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM),
-                     NETDEV_DPDK_MBUF_ALIGN);
+    return ROUND_UP(MTU_TO_MAX_FRAME_LEN(mtu), NETDEV_DPDK_MBUF_ALIGN)
+            + RTE_PKTMBUF_HEADROOM;
 }
 
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
@@ -681,6 +681,8 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
                   dev->requested_n_rxq, dev->requested_n_txq,
                   RTE_CACHE_LINE_SIZE);
 
+        /* The size of the mbuf's private area (i.e. area that holds OvS'
+         * dp_packet data)*/
         mbuf_priv_data_len = sizeof(struct dp_packet) -
                                  sizeof(struct rte_mbuf);
         /* The size of the entire dp_packet. */