[ovs-dev,v8,00/13] Support multi-segment mbufs

Message ID 1528734090-220990-1-git-send-email-tiago.lam@intel.com

Message

Lam, Tiago June 11, 2018, 4:21 p.m. UTC
Overview
========
This patchset introduces support for multi-segment mbufs to OvS-DPDK.
Multi-segment mbufs are typically used when the size of an mbuf is
insufficient to contain the entirety of a packet's data. Instead, the
data is split across numerous mbufs, each carrying a portion, or
'segment', of the packet data. mbufs are chained via their 'next'
attribute (an mbuf pointer).
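
For illustration, a chain can be walked with the standard DPDK mbuf API;
the following is a minimal sketch (not code from this series):

    #include <rte_mbuf.h>

    /* Sum the data carried by every segment of a (possibly multi-segment)
     * mbuf.  For a well-formed chain this equals the total packet length
     * stored in the first segment (m->pkt_len). */
    static uint32_t
    mbuf_chain_data_len(const struct rte_mbuf *m)
    {
        uint32_t len = 0;

        for (; m != NULL; m = m->next) {
            len += m->data_len;    /* Bytes held by this segment only. */
        }

        return len;
    }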

Use Cases
=========
i.  Handling oversized (guest-originated) frames, which are marked
    for hardware acceleration/offload (TSO, for example).

    Packets which originate from a non-DPDK source may be marked for
    offload; as such, they may be larger than the permitted ingress
    interface's MTU, and may be stored in an oversized dp-packet. In
    order to transmit such packets over a DPDK port, their contents
    must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
    its current implementation, that function only copies data into
    a single mbuf; if the mbuf's available space is exhausted before
    all of the packet data has been copied, the remaining data is lost
    (a rough sketch of such a copy loop follows these use cases).
    Similarly, when cloning a DPDK mbuf, it must be considered
    whether that mbuf contains multiple segments. Both issues are
    resolved within this patchset.

ii. Handling jumbo frames.

    While OvS already supports jumbo frames, it does so by increasing
    mbuf size, such that the entirety of a jumbo frame may be handled
    in a single mbuf. This is certainly the preferred, and most
    performant, approach (and remains the default). However, it places
    high demands on system memory; multi-segment mbufs may be
    preferable for systems which are memory-constrained.
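
As noted in use case i. above, the copy path must be able to spill a
large packet across several mbufs. The following is a rough sketch of
that kind of copy loop, built only from the standard rte_pktmbuf
helpers; the function name and structure are illustrative and are not
the series' actual dpdk_do_tx_copy() implementation:

    #include <string.h>
    #include <rte_mbuf.h>

    /* Copy 'size' bytes from 'src' into a chain of mbufs allocated from
     * 'mp', growing the chain one segment at a time.  Returns the head of
     * the chain, or NULL (freeing any partial chain) on failure. */
    static struct rte_mbuf *
    copy_to_mbuf_chain(struct rte_mempool *mp, const void *src, uint32_t size)
    {
        struct rte_mbuf *head = NULL;
        const char *from = src;

        while (size > 0) {
            struct rte_mbuf *seg = rte_pktmbuf_alloc(mp);
            uint32_t chunk;

            if (!seg || (head && rte_pktmbuf_chain(head, seg) < 0)) {
                rte_pktmbuf_free(seg);    /* Both calls are NULL-safe. */
                rte_pktmbuf_free(head);
                return NULL;
            }
            if (!head) {
                head = seg;
            }

            chunk = RTE_MIN(size, (uint32_t) rte_pktmbuf_tailroom(seg));
            /* rte_pktmbuf_append() extends the chain's last segment, i.e. 'seg'. */
            memcpy(rte_pktmbuf_append(head, chunk), from, chunk);

            from += chunk;
            size -= chunk;
        }

        return head;
    }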

Enabling multi-segment mbufs
============================
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide which approach to adopt at init. The introduction of a
new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.

This is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
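
For reference, the new knob would be read from the other_config column
at DPDK init time, much like the options above; the snippet below is a
paraphrased sketch of what that looks like in lib/dpdk.c, not the exact
code in this series:

    #include <stdbool.h>
    #include "smap.h"

    static bool dpdk_multi_seg_mbufs = false;

    static void
    dpdk_init__(const struct smap *ovs_other_config)
    {
        /* ... existing option parsing (lcore mask, socket-mem, ...) ... */

        /* Multi-segment mbufs are opt-in; single-segment mbufs remain the
         * default when the key is absent. */
        dpdk_multi_seg_mbufs = smap_get_bool(ovs_other_config,
                                             "dpdk-multi-seg-mbufs", false);
    }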

Performance notes
=================
In order to test for regressions in performance, tests were run on top
of master 88125d6 and on v8 of this patchset, the latter with the
multi-segment mbufs option both enabled and disabled.

VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.

Test | Size | Master | Multi-seg disabled | Multi-seg enabled
-------------------------------------------------------------
p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36

Packet sizes are in bytes, while all packet rates are reported in Mpps
(millions of packets per second, aggregated).

No noticeable regression has been observed (certainly everything is
within the ±5% margin of existing performance), aside from the 64B
packet-size case when multi-segment mbufs are enabled. This is
expected, however, because vectorized Tx functions are incompatible
with multi-segment mbufs on some PMDs. The PMD used during these
tests was i40e (on an Intel X710 NIC), which indeed doesn't support
vectorized Tx functions with multi-segment mbufs.
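
For context, whether a PMD can transmit multi-segment mbufs at all is
advertised through its Tx offload capabilities. A hedged sketch of such
a check (not code from this series) would be:

    #include <stdbool.h>
    #include <rte_ethdev.h>

    /* Returns true if the port's PMD advertises multi-segment Tx support.
     * Even PMDs which do (such as i40e) may fall back from their vectorized
     * Tx path once multi-segment mbufs are in use, which accounts for the
     * 64B drop above. */
    static bool
    port_supports_multi_seg_tx(uint16_t port_id)
    {
        struct rte_eth_dev_info dev_info;

        rte_eth_dev_info_get(port_id, &dev_info);
        return (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MULTI_SEGS) != 0;
    }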

---
v8 (non-RFC): 
    - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
      temporary directory (also for RHEL)");
    - Address Ciara's and Ilya's comments:
        - Drop the dp_packet_mbuf_tail() function and use only the
          already existing dp_packet_tail();
        - Fix bug in dpdk_do_tx_copy() where the error return from
          dpdk_prep_tx_buf() was being wrongly checked;
        - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
          rte_pktmbuf_alloc() and rte_pktmbuf_free();
        - Fix some other code style and duplication issues pointed out.
    - Refactored dp_packet_shift(), dp_packet_resize__() and
      dp_packet_put*() functions to work within the bounds of existing
      mbufs only;
    - Fix dp_packet_clear() which wasn't correctly clearing / freeing
      other mbufs in the chain for chains with more than a single mbuf;
    - dp_packet layer functions (such as dp_packet_l3()) now check if
      the header is within the first mbuf, when using mbufs;
    - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
      was refactored to use free_dpdk_buf();
    - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
      leading to memory corruption; 
    - Modified the added tests to account for some of the above changes;
    - Run performance tests, compiling results and adding them to the
      cover letter;
    - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
      together with a "Performance notes" sub-section reflecting the
      findings mentioned above in the cover letter.
 
v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
     - Address Ilya's comments:
        - Fix non-DPDK build;
        - Serialise access by non-PMD threads to mbuf allocation and free
          by using a newly introduced mutex.
     - Add a new set of tests that integrates with the recently added DPDK
       testsuite. These focus on allocating dp_packets, with a single or
       multiple mbufs, from an instantiated mempool and performing several
       operations on those, verifying that the data at the end matches
       what's expected;
     - Fix bugs found by the new tests:
        - dp_packet_shift() wasn't taking into account shift lefts;
        - dp_packet_resize__() was misusing and miscalculating the tailrooms
          and headrooms, ending up calculating the wrong number of segments
          that needed allocation;
        - An mbuf's end was being miscalculated in dp_packet_tail(),
          dp_packet_mbuf_tail() and dp_packet_end();
        - dp_packet_set_size() was not updating the number of chained segments
          'nb_segs';
     - Add support for multi-seg mbufs in dp_packet_clear().

v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
       reuse.");
     - Further improve dp_packet_put_uninit() and dp_packet_shift() to
       support multi-seg mbufs;
     - Add support for multi-seg mbufs in dp_packet_l4_size() and
       improve other helper funcs, such as dp_packet_tail() and
       dp_packet_tailroom();
     - Add support for multi-seg mbufs in dp_packet_put() and
       dp_packet_put_zeros(), as well as dp_packet_resize__() -
       allocating new mbufs and linking them together;
     Restructured patchset:
     - Squash patch 5 into patch 6, since they were both related to
       copying data while handling multi-seg mbufs;
     - Split patch 4 into two separate patches - one that introduces the
       changes in helper functions to deal with multi-seg mbufs and
       two others for the shift() and put_uninit() functionality;
     - Move patch 4 to before patch 3, so that helper functions come
       before the functionality improvements that rely on them.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
      from getting redundantly incremented");
    - Sugesh's comments have been addressed:
      - Changed dp_packet_set_data() and dp_packet_set_size() logic to
        make them independent of each other;
      - Dropped patch 3 now that dp_packet_set_data() and
        dp_packet_set_size() are independent;
      - dp_packet_clone_with_headroom() now has split functions for
        handling DPDK sourced packets and non-DPDK packets;
    - Modified various functions in dp-packet.h to account for multi-seg
      mbufs - dp_packet_put_uninit(), dp_packet_tail() and
      dp_packet_at();
    - Added support for shifting packet data in multi-seg mbufs, using
      dp_packet_shift();
    - Fixed some minor inconsistencies.

    Note that some of the changes in v5 have been contributed by Mark
    Kavanagh as well.

v4: - restructure patchset
    - account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (4):
  netdev-dpdk: fix mbuf sizing
  dp-packet: Init specific mbuf fields.
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (8):
  dp-packet: Fix allocated size on DPDK init.
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mbufs in put_uninit().
  dp-packet: Handle multi-seg mubfs in shift() func.
  dp-packet: Handle multi-seg mbufs in resize__().
  dpdk-tests: Add test coverage for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  41 +++
 NEWS                                       |   1 +
 lib/dp-packet.c                            | 243 +++++++++++++-
 lib/dp-packet.h                            | 319 ++++++++++++++----
 lib/dpdk.c                                 |   8 +
 lib/netdev-dpdk.c                          | 213 ++++++++++--
 lib/netdev-dpdk.h                          |   2 +
 tests/automake.mk                          |  10 +-
 tests/dpdk-packet-mbufs.at                 |   7 +
 tests/system-dpdk-testsuite.at             |   1 +
 tests/test-dpdk-mbufs.c                    | 518 +++++++++++++++++++++++++++++
 vswitchd/vswitch.xml                       |  20 ++
 12 files changed, 1273 insertions(+), 110 deletions(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

Comments

Eelco Chaudron June 18, 2018, 11:18 a.m. UTC | #1
On 11 Jun 2018, at 18:21, Tiago Lam wrote:

[snip]

I did some testing using my PVP framework, 
https://github.com/chaudron/ovs_perf, and I see the same as above. I 
also used an XL710 for these tests.

I reviewed the complete patch-set and will reply on the individual 
patches.

//Eelco


Lam, Tiago June 22, 2018, 7:02 p.m. UTC | #2
Hi Eelco,

On 18/06/2018 12:18, Eelco Chaudron wrote:
> 
> 
> On 11 Jun 2018, at 18:21, Tiago Lam wrote:
> 

[snip]

> I did some testing using my PVP framework, 
> https://github.com/chaudron/ovs_perf, and I see the same as above. I 
> also used an XL710 for these tests.

Thanks for the review! And for confirming the results, that gives some
assurance.

> 
> I reviewed the complete patch-set and will reply on the individual 
> patches.
> 
> //Eelco
> 
>
I thought it would be worth giving a more generic explanation here since
some of the comments are around the same point.

This series takes the approach of not allocating new mbufs in
`DPBUF_DPDK` dp_packets. This is to avoid the case where a
dp_packet_put() call, for example, allocates two mbufs to create enough
space, and returns a pointer to data which isn't contiguous (because it
is spread across two mbufs). Most of the code in OvS that uses the
dp_packet API relies on the assumption that the returned memory is
contiguous.

To enforce this, dp_packet_resize__() doesn't allocate any new mbufs,
and only tries to make enough room from the space it already has
available (both in the tailroom and headroom). dp_packet_shift()
likewise doesn't allow the head to move past the first mbuf (when
shifting right), or the tail to move past the last mbuf (when shifting
left).

This is also why some assumptions can be made in the implementation,
such as that the tailroom of a dp_packet is the tailroom available in
its last mbuf, since it shouldn't be possible to have a whole free mbuf
after the tail.
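
For illustration, under those constraints the tailroom computation boils
down to something like the following (a sketch using the standard mbuf
helpers, not necessarily the exact code in the series):

    #include <rte_mbuf.h>

    /* With the constraints above, the usable tailroom of a DPBUF_DPDK
     * dp_packet is simply whatever is free at the end of the last segment;
     * a fully unused mbuf can never sit beyond the tail. */
    static uint16_t
    chain_tailroom(struct rte_mbuf *head)
    {
        struct rte_mbuf *last = rte_pktmbuf_lastseg(head);

        return rte_pktmbuf_tailroom(last);
    }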

This is in contrast to a `DPBUF_MALLOC` packet, where, when
`DPDK_NETDEV` is defined, the mbuf's memory is allocated from the
system. In many cases, though, this is the type of packet used
internally by OvS when doing manipulations, since any clone of a
received `DPBUF_DPDK` dp_packet will be allocated as a `DPBUF_MALLOC`
(done by dpdk_do_tx_copy()).

I'll reply to the other emails individually and refer to this
explanation when useful.

Tiago.
Eelco Chaudron June 26, 2018, 1:23 p.m. UTC | #3
On 22 Jun 2018, at 21:02, Lam, Tiago wrote:

> Hi Eelco,
>
> On 18/06/2018 12:18, Eelco Chaudron wrote:
>>
>>
>> On 11 Jun 2018, at 18:21, Tiago Lam wrote:
>>
>
> [snip]
>
> I thought it would be worth giving a more generic explanation here 
> since
> some of the comments are around the same point.
>
> This series takes the approach of not allocating new mbufs in
> `DPBUF_DPDK` dp_packets. This is to avoid the case where a
> dp_packet_put() call, for example, allocates two mbufs to create 
> enough
> space, and returns a pointer to the data which isn't contiguous in
> memory (because it is spread across two mbufs). Most of the code in 
> OvS
> that uses the dp_packet API relies on the assumption that memory
> returned is contiguous in memory.
>
> To enforce this, dp_packet_resize__() doesn't allocate any new mbufs,
> and only tries to make enough room from the space it already has
> available (both in the tailroom and headroom). dp_packet_shift() also
> doesn't allow the possibility for the head to move past the first mbuf
> (when shifting right), or the tail to move past the last mbuf (if
> shifting left).
>
> This is also why some assumptions can be made in the implementation,
> such that the tailroom of a dp_packet will be the tailroom available 
> in
> the last mbuf, since it shouldn't be possible to have a whole free 
> mbuf
> after the tail.

Thanks for explaining the details; maybe it's good to explain some of
these details in dp_packet_resize__().

However, for the general function, I think you should not assume that
tailroom is only available in the last mbuf. I think you should stick as
much as possible to the existing mbuf APIs to avoid problems in the
future if people decide to add “full” mbuf support. Most of these
shortcuts do not give any real performance advantages.

> This is opposed to a `DPBUF_MALLOC` packet, which when `DPDK_NETDEV` 
> is
> defined, an mbuf's allocated memory comes from the system. In many 
> cases
> though, this is the type of packet used internally by OvS when doing
> manipulations, since any clone of the received `DPBUF_DPDK` dp_packet
> will be allocated in a `DPBUF_MALLOC` (done by dpdk_do_tx_copy()).
>
> I'll reply to the other emails individually and refer to this
> explanation when useful.

Thanks, I’ve replied to all the comments in the individual mails.

I’ll allocate some time to review your new revision once it’s 
available.

Cheers,

Eelco
Lam, Tiago June 27, 2018, 10:09 a.m. UTC | #4
On 26/06/2018 14:23, Eelco Chaudron wrote:
>
> [snip]
>
> Thanks for explaining the details; maybe it's good to explain some of
> these details in dp_packet_resize__().
>
> However, for the general function, I think you should not assume that
> tailroom is only available in the last mbuf. I think you should stick as
> much as possible to the existing mbuf APIs to avoid problems in the
> future if people decide to add “full” mbuf support. Most of these
> shortcuts do not give any real performance advantages.
>
> [snip]
>
> Thanks, I’ve replied to all the comments in the individual mails.
> 
> I’ll allocate some time to review your new revision once it’s 
> available.
> 

Thanks a bunch for the reviews, Eelco. This is much appreciated. I
should be able to send the next iteration soon.

To avoid extra noise, I've replied only to the more relevant comments in
the series; I agree with the rest and will work on them.

Tiago.
