mbox series

[ovs-dev,v7,00/13] Support multi-segment mbufs

Message ID 1532528359-96999-1-git-send-email-tiago.lam@intel.com
Headers show
Series Support multi-segment mbufs | expand

Message

Lam, Tiago July 25, 2018, 2:19 p.m. UTC
Overview
========
This patchset introduces support for multi-segment mbufs to OvS-DPDK.
Multi-segment mbufs are typically used when the size of an mbuf is
insufficient to contain the entirety of a packet's data. Instead, the
data is split across numerous mbufs, each carrying a portion, or
'segment', of the packet data. Mbufs are chained via their 'next'
attribute (an mbuf pointer).

The main motivation behind the support for multi-segment mbufs is to
later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
planned to be introduced after this series.

Use Cases
=========
i.  Handling oversized (guest-originated) frames, which are marked
    for hardware accelration/offload (TSO, for example).

    Packets which originate from a non-DPDK source may be marked for
    offload; as such, they may be larger than the permitted ingress
    interface's MTU, and may be stored in an oversized dp-packet. In
    order to transmit such packets over a DPDK port, their contents
    must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
    its current implementation, that function only copies data into
    a single mbuf; if the space available in the mbuf is exhausted,
    but not all packet data has been copied, then it is lost.
    Similarly, when cloning a DPDK mbuf, it must be considered
    whether that mbuf contains multiple segments. Both issues are
    resolved within this patchset.

ii. Handling jumbo frames.

    While OvS already supports jumbo frames, it does so by increasing
    mbuf size, such that the entirety of a jumbo frame may be handled
    in a single mbuf. This is certainly the preferred, and most
    performant approach (and remains the default).

Enabling multi-segment mbufs
============================
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.

This is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Performance notes (based on v8)
=================
In order to test for regressions in performance, tests were run on top
of master 88125d6 and v8 of this patchset, both with the multi-segment
mbufs option enabled and disabled.

VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.

Test | Size | Master | Multi-seg disabled | Multi-seg enabled
-------------------------------------------------------------
p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36

Packet size is in bytes, while all packet rates are reported in mpps
(aggregated).

No noticeable regression has been observed (certainly everything is
within the ± 5% margin of existing performance), aside from the 64B
packet size case when multi-segment mbuf is enabled. This is
expected, however, because of how Tx vectoriszed functions are
incompatible with multi-segment mbufs on some PMDs. The PMD under
use during these tests was the i40e (on a Intel X710 NIC), which
indeed doesn't support vectorized Tx functions with multi-segment
mbufs.

---
v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
    - Add Ben's proposed fix for automake's warning;
    - Add a note to cover letter to explain this is preperatory work for TSO /
      GRO.

v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
      error paths.");
    - Address Darrell's comments:
      - The changes in dp_packet_resize__() were trying to alleviate the call
        to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
        tailroom space when no more headroom space is available, and vice-versa.
        However, this was breaking the API for the dp_packet_resize__()
        function (only called in dp_packet_prealloc_tailroom() and
        dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
        for DPDK packets.
        Instead, and because this is isolate funtionality, revert to the
        previous state where dp_packet_resize__() is not supported for DPDK
        packets. Hence, then patch 08/14 has been dropped.
    - Additionally, fix the tests that were relying on the removed
      functionality.

v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
      after free.");
    - Address Eelco's comments:
      - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
        leftover from rebase. Only call should be in dpdk_mp_get();
      - Remove NEWS line added by mistake during rebase (about adding
        experimental vhost zero copy support).
    - Address Ian's comments:
      - Drop patch 01 from previous series entirely;
      - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
        dpdk_mp_create() to get the correct "mbuf_size" to be used;
      - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
        mbufs is enabled, in which case it calculates the new "mbuf_size" to be
        used;
      - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
        conditionally.
    - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
      DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
      the shared mempool model (runs out of memory);
    - Add new examples for when multi-segment mbufs are enabled in
      topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
      (now patch 11/14).

v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
      command."):
      - A new patch (now 01/15) has been introduced to differentiate between
        MTU and mbuf size when creating the mempool. This is because as part of
        the new support for both per port and shared mempools, mempools were
        being reused based on the mbuf size, and for multi-segment mbufs the
        mbuf size can end up being the same for all mbufs;
      - A couple of other patches (now 02/15 and 12/15) have been modified as
        part of the rebase, but only to adapt to the code changes to "Support
        both shared and per port mempools.", no functionality should have been
        changed.

v3:
    - Address Eelco's comment:
        - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
          wasn't correctly asserting that the passed 'v' was smaller than the
          first mbuf's buf_len.

v2:
    - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
      error.");
    - Address Eelco's comments:
        - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
          casting;
        - Remove unneeded variable in dp_packet_set_size(), pointing to the
          first mbuf in the chain;
        - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
          always true for DPBUF_DPDK packets;
        - Assert in __packet_set_data() to enforce that data_off never goes
          beyond the first mbuf in the chain.

v1:
    - v8 should have been sent as v1 really, as that's usually the approach
      followed in OvS. That clearly didn't happen, so restarting the series
      now. This also helps making it clear it is no longer an RFC series;
    - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
      output.");
    - Address Eelco's comments:
        - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
          `pkt_len` and `data_len` can't be set to values bigger than the
          available space. Also fix assigment to `data_len` which was
          incorrectly being set to just`pkt_len`;
        - Improve `nonpmd_mp_mutex` comment with a better explanation as to
          why the mutex is needed;
        - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
          `DPBUF_DPDK` packets;
        - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
        - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
          copying `DPBUF_DPDK` packets' data. Also, change it to return
          appropriate and meaningful errors, instead of just "0" or "1";
        - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
          and reuse dp_packet_mbuf_write() instead of manual copy;
        - Add note vswitchd/vswitch.xml to make it clear the enabling of
          dpdk-multi-seg-mbufs requires a restart;
    - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
      mbufs approach;
    - Increase test coverage by adding "end-to-end" tests that verify that
      "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
      successfully sent out;
    - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
      back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
      minimise changes;
    - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
      after further testing;
    - Structure changes:
      - Drop patch 07/13 which is now unneeded;
      - Two more patches added for extra test coverage. This is what accounts
        for the increase in size (+1 patch) in the series.

v8 (non-RFC): 
    - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
      temporary directory (also for RHEL)");
    - Address Ciara's and Ilya's comments:
        - Drop the dp_packet_mbuf_tail() function and use only the
          already existing dp_packet_tail();
        - Fix bug in dpdk_do_tx_copy() where the error return from
          dpdk_prep_tx_buf() was being wrongly checked;
        - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
          rte_pktmbuf_alloc() and rte_pktmbuf_free();
        - Fix some other code style and duplication issues pointed out.
    - Refactored dp_packet_shift(), dp_packet_resize__() and
      dp_packet_put*() functions to work within the bounds of existing
      mbufs only;
    - Fix dp_packet_clear() which wasn't correctly clearing / freeing
      other mbufs in the chain for chains with more than a single mbuf;
    - dp_packet layer functions (such as dp_packet_l3()) now check if
      the header is within the first mbuf, when using mbufs;
    - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
      was refactored to use free_dpdk_buf();
    - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
      leading to memory corruption; 
    - Modified the added tests to account for some of the above changes;
    - Run performance tests, compiling results and adding them to the
      cover letter;
    - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
      together with a "Performance notes" sub-section reflecting the
      findings mentioned above in the cover letter.

v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
     - Address Ilya comments;
        - Fix non-DPDK build;
        - Serialise the access of non pmds to allocation and free of mbufs by
          using a newly introduced mutex.
     - Add a new set of tests that integrates with the recently added DPDK
       testsuite. These focus on allocating dp_packets, with a single or
       multiple mbufs, from an instantiated mempool and performing several
       operations on those, verifying if the data at the end matches what's
       expected;
     - Fix bugs found by the new tests:
        - dp_packet_shift() wasn't taking into account shift lefts;
        - dp_packet_resize__() was misusing and miscalculating the tailrooms
          and headrooms, ending up calculating the wrong number of segments
          that needed allocation;
        - An mbuf's end was being miscalculated both in dp_packet_tail,
          dp_packet_mbuf_tail() and dp_packet_end();
        - dp_packet_set_size() was not updating the number of chained segments
          'nb_segs';
     - Add support for multi-seg mbufs in dp_packet_clear().

v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
       reuse.");
     - Further improve dp_packet_put_uninit() and dp_packet_shift() to
       support multi-seg mbufs;
     - Add support for multi-seg mbufs in dp_packet_l4_size() and
       improve other helper funcs, such as dp_packet_tail() and dp_
       packet_tailroom().
     - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
       put_zeros(), as well as dp_packet_resize__() - allocating new
       mbufs and linking them together;
     Restructured patchset:
     - Squash patch 5 into patch 6, since they were both related to
       copying data while handling multi-seg mbufs;
     - Split patch 4 into two separate patches - one that introduces the
       changes in helper functions to deal with multi-seg mbufs and
       two others for the shift() and put_uninit() functionality;
     - Move patch 4 to before patch 3, so that ihelper functions come
       before functionality improvement that rely on those helpers.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
      from getting redundantly incremented");
    - Sugesh's comments have been addressed:
      - Changed dp_packet_set_data() and dp_packet_set_size() logic to
        make them independent of each other;
      - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
        size() are independent;
      - dp_packet_clone_with_headroom() now has split functions for
        handling DPDK sourced packets and non-DPDK packets;
    - Modified various functions in dp-packet.h to account for multi-seg
      mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
      and dp_packet_at();
    - Added support for shifting packet data in multi-seg mbufs, using
      dp_packet_shift();
    - Fixed some minor inconsistencies.

    Note that some of the changes in v5 have been contributed by Mark
    Kavanagh as well.

v4: - restructure patchset
    - account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (4):
  netdev-dpdk: fix mbuf sizing
  dp-packet: Init specific mbuf fields.
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (8):
  dp-packet: Fix allocated size on DPDK init.
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mubfs in shift() func.
  dpdk-tests: Add uni-tests for multi-seg mbufs.
  dpdk-tests: Accept other configs in OVS_DPDK_START
  dpdk-tests: End-to-end tests for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
 Documentation/topics/dpdk/memory.rst       |  36 ++
 NEWS                                       |   1 +
 lib/dp-packet.c                            | 173 +++++++++-
 lib/dp-packet.h                            | 214 ++++++++++--
 lib/dpdk.c                                 |   8 +
 lib/netdev-dpdk.c                          | 244 +++++++++++---
 lib/netdev-dpdk.h                          |   2 +
 tests/automake.mk                          |  10 +-
 tests/dpdk-packet-mbufs.at                 |   7 +
 tests/system-dpdk-macros.at                |   6 +-
 tests/system-dpdk-testsuite.at             |   1 +
 tests/system-dpdk.at                       |  65 ++++
 tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
 vswitchd/vswitch.xml                       |  22 ++
 15 files changed, 1277 insertions(+), 77 deletions(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

Comments

Ilya Maximets Aug. 9, 2018, 8:27 a.m. UTC | #1
Hmm. I found that this series modifies only dpdk related components
and doesn't pay any attention to others.
dp_packet API was modified to respect the segmented packets, but
there are many places, where code uses packet data directly using
only dp_packet_data() pointer and dp_packet_size().
For example, at least following functions will read the wrong memory
and probably generate segfault:

  * All the netdev-{linux, dummy, bsd} send functions, because
    they are using code like this:

    write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));

  * Tunnel extracting functions like 'udp_extract_tnl_md', because
    they're trying to calculate packet checksums using raw calls to
    'csum_continue'.

  * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.

  * Possibly, much more functions.

If It's true, when I'm afraid that this patch set is far from being
ready for merge.

On 25.07.2018 17:19, Tiago Lam wrote:
> Overview
> ========
> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
> Multi-segment mbufs are typically used when the size of an mbuf is
> insufficient to contain the entirety of a packet's data. Instead, the
> data is split across numerous mbufs, each carrying a portion, or
> 'segment', of the packet data. Mbufs are chained via their 'next'
> attribute (an mbuf pointer).
> 
> The main motivation behind the support for multi-segment mbufs is to
> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
> planned to be introduced after this series.
> 
> Use Cases
> =========
> i.  Handling oversized (guest-originated) frames, which are marked
>     for hardware accelration/offload (TSO, for example).
> 
>     Packets which originate from a non-DPDK source may be marked for
>     offload; as such, they may be larger than the permitted ingress
>     interface's MTU, and may be stored in an oversized dp-packet. In
>     order to transmit such packets over a DPDK port, their contents
>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>     its current implementation, that function only copies data into
>     a single mbuf; if the space available in the mbuf is exhausted,
>     but not all packet data has been copied, then it is lost.
>     Similarly, when cloning a DPDK mbuf, it must be considered
>     whether that mbuf contains multiple segments. Both issues are
>     resolved within this patchset.
> 
> ii. Handling jumbo frames.
> 
>     While OvS already supports jumbo frames, it does so by increasing
>     mbuf size, such that the entirety of a jumbo frame may be handled
>     in a single mbuf. This is certainly the preferred, and most
>     performant approach (and remains the default).
> 
> Enabling multi-segment mbufs
> ============================
> Multi-segment and single-segment mbufs are mutually exclusive, and the
> user must decide on which approach to adopt on init. The introduction
> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
> 
> This is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
> 
> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
> 
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> Performance notes (based on v8)
> =================
> In order to test for regressions in performance, tests were run on top
> of master 88125d6 and v8 of this patchset, both with the multi-segment
> mbufs option enabled and disabled.
> 
> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
> 
> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
> -------------------------------------------------------------
> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
> 
> Packet size is in bytes, while all packet rates are reported in mpps
> (aggregated).
> 
> No noticeable regression has been observed (certainly everything is
> within the ± 5% margin of existing performance), aside from the 64B
> packet size case when multi-segment mbuf is enabled. This is
> expected, however, because of how Tx vectoriszed functions are
> incompatible with multi-segment mbufs on some PMDs. The PMD under
> use during these tests was the i40e (on a Intel X710 NIC), which
> indeed doesn't support vectorized Tx functions with multi-segment
> mbufs.
> 
> ---
> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>     - Add Ben's proposed fix for automake's warning;
>     - Add a note to cover letter to explain this is preperatory work for TSO /
>       GRO.
> 
> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>       error paths.");
>     - Address Darrell's comments:
>       - The changes in dp_packet_resize__() were trying to alleviate the call
>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>         tailroom space when no more headroom space is available, and vice-versa.
>         However, this was breaking the API for the dp_packet_resize__()
>         function (only called in dp_packet_prealloc_tailroom() and
>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>         for DPDK packets.
>         Instead, and because this is isolate funtionality, revert to the
>         previous state where dp_packet_resize__() is not supported for DPDK
>         packets. Hence, then patch 08/14 has been dropped.
>     - Additionally, fix the tests that were relying on the removed
>       functionality.
> 
> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>       after free.");
>     - Address Eelco's comments:
>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>         leftover from rebase. Only call should be in dpdk_mp_get();
>       - Remove NEWS line added by mistake during rebase (about adding
>         experimental vhost zero copy support).
>     - Address Ian's comments:
>       - Drop patch 01 from previous series entirely;
>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>         used;
>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>         conditionally.
>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>       the shared mempool model (runs out of memory);
>     - Add new examples for when multi-segment mbufs are enabled in
>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>       (now patch 11/14).
> 
> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>       command."):
>       - A new patch (now 01/15) has been introduced to differentiate between
>         MTU and mbuf size when creating the mempool. This is because as part of
>         the new support for both per port and shared mempools, mempools were
>         being reused based on the mbuf size, and for multi-segment mbufs the
>         mbuf size can end up being the same for all mbufs;
>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>         part of the rebase, but only to adapt to the code changes to "Support
>         both shared and per port mempools.", no functionality should have been
>         changed.
> 
> v3:
>     - Address Eelco's comment:
>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>           wasn't correctly asserting that the passed 'v' was smaller than the
>           first mbuf's buf_len.
> 
> v2:
>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>       error.");
>     - Address Eelco's comments:
>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>           casting;
>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>           first mbuf in the chain;
>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>           always true for DPBUF_DPDK packets;
>         - Assert in __packet_set_data() to enforce that data_off never goes
>           beyond the first mbuf in the chain.
> 
> v1:
>     - v8 should have been sent as v1 really, as that's usually the approach
>       followed in OvS. That clearly didn't happen, so restarting the series
>       now. This also helps making it clear it is no longer an RFC series;
>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>       output.");
>     - Address Eelco's comments:
>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>           `pkt_len` and `data_len` can't be set to values bigger than the
>           available space. Also fix assigment to `data_len` which was
>           incorrectly being set to just`pkt_len`;
>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>           why the mutex is needed;
>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>           `DPBUF_DPDK` packets;
>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>           appropriate and meaningful errors, instead of just "0" or "1";
>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>           and reuse dp_packet_mbuf_write() instead of manual copy;
>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>           dpdk-multi-seg-mbufs requires a restart;
>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>       mbufs approach;
>     - Increase test coverage by adding "end-to-end" tests that verify that
>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>       successfully sent out;
>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>       minimise changes;
>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>       after further testing;
>     - Structure changes:
>       - Drop patch 07/13 which is now unneeded;
>       - Two more patches added for extra test coverage. This is what accounts
>         for the increase in size (+1 patch) in the series.
> 
> v8 (non-RFC): 
>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>       temporary directory (also for RHEL)");
>     - Address Ciara's and Ilya's comments:
>         - Drop the dp_packet_mbuf_tail() function and use only the
>           already existing dp_packet_tail();
>         - Fix bug in dpdk_do_tx_copy() where the error return from
>           dpdk_prep_tx_buf() was being wrongly checked;
>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>         - Fix some other code style and duplication issues pointed out.
>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>       dp_packet_put*() functions to work within the bounds of existing
>       mbufs only;
>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>       other mbufs in the chain for chains with more than a single mbuf;
>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>       the header is within the first mbuf, when using mbufs;
>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>       was refactored to use free_dpdk_buf();
>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>       leading to memory corruption; 
>     - Modified the added tests to account for some of the above changes;
>     - Run performance tests, compiling results and adding them to the
>       cover letter;
>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>       together with a "Performance notes" sub-section reflecting the
>       findings mentioned above in the cover letter.
> 
> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>      - Address Ilya comments;
>         - Fix non-DPDK build;
>         - Serialise the access of non pmds to allocation and free of mbufs by
>           using a newly introduced mutex.
>      - Add a new set of tests that integrates with the recently added DPDK
>        testsuite. These focus on allocating dp_packets, with a single or
>        multiple mbufs, from an instantiated mempool and performing several
>        operations on those, verifying if the data at the end matches what's
>        expected;
>      - Fix bugs found by the new tests:
>         - dp_packet_shift() wasn't taking into account shift lefts;
>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>           and headrooms, ending up calculating the wrong number of segments
>           that needed allocation;
>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>           dp_packet_mbuf_tail() and dp_packet_end();
>         - dp_packet_set_size() was not updating the number of chained segments
>           'nb_segs';
>      - Add support for multi-seg mbufs in dp_packet_clear().
> 
> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>        reuse.");
>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>        support multi-seg mbufs;
>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>        improve other helper funcs, such as dp_packet_tail() and dp_
>        packet_tailroom().
>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>        put_zeros(), as well as dp_packet_resize__() - allocating new
>        mbufs and linking them together;
>      Restructured patchset:
>      - Squash patch 5 into patch 6, since they were both related to
>        copying data while handling multi-seg mbufs;
>      - Split patch 4 into two separate patches - one that introduces the
>        changes in helper functions to deal with multi-seg mbufs and
>        two others for the shift() and put_uninit() functionality;
>      - Move patch 4 to before patch 3, so that ihelper functions come
>        before functionality improvement that rely on those helpers.
> 
> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>       from getting redundantly incremented");
>     - Sugesh's comments have been addressed:
>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>         make them independent of each other;
>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>         size() are independent;
>       - dp_packet_clone_with_headroom() now has split functions for
>         handling DPDK sourced packets and non-DPDK packets;
>     - Modified various functions in dp-packet.h to account for multi-seg
>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>       and dp_packet_at();
>     - Added support for shifting packet data in multi-seg mbufs, using
>       dp_packet_shift();
>     - Fixed some minor inconsistencies.
> 
>     Note that some of the changes in v5 have been contributed by Mark
>     Kavanagh as well.
> 
> v4: - restructure patchset
>     - account for 128B ARM cacheline when sizing mbufs
> 
> Mark Kavanagh (4):
>   netdev-dpdk: fix mbuf sizing
>   dp-packet: Init specific mbuf fields.
>   netdev-dpdk: copy large packet to multi-seg. mbufs
>   netdev-dpdk: support multi-segment jumbo frames
> 
> Michael Qiu (1):
>   dp-packet: copy data from multi-seg. DPDK mbuf
> 
> Tiago Lam (8):
>   dp-packet: Fix allocated size on DPDK init.
>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>   dp-packet: Fix data_len handling multi-seg mbufs.
>   dp-packet: Handle multi-seg mbufs in helper funcs.
>   dp-packet: Handle multi-seg mubfs in shift() func.
>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>   dpdk-tests: Accept other configs in OVS_DPDK_START
>   dpdk-tests: End-to-end tests for multi-seg mbufs.
> 
>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>  Documentation/topics/dpdk/memory.rst       |  36 ++
>  NEWS                                       |   1 +
>  lib/dp-packet.c                            | 173 +++++++++-
>  lib/dp-packet.h                            | 214 ++++++++++--
>  lib/dpdk.c                                 |   8 +
>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>  lib/netdev-dpdk.h                          |   2 +
>  tests/automake.mk                          |  10 +-
>  tests/dpdk-packet-mbufs.at                 |   7 +
>  tests/system-dpdk-macros.at                |   6 +-
>  tests/system-dpdk-testsuite.at             |   1 +
>  tests/system-dpdk.at                       |  65 ++++
>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>  vswitchd/vswitch.xml                       |  22 ++
>  15 files changed, 1277 insertions(+), 77 deletions(-)
>  create mode 100644 tests/dpdk-packet-mbufs.at
>  create mode 100644 tests/test-dpdk-mbufs.c
>
Lam, Tiago Aug. 9, 2018, 8:38 a.m. UTC | #2
Hi Ilya,

On 09/08/2018 09:27, Ilya Maximets wrote:
> Hmm. I found that this series modifies only dpdk related components
> and doesn't pay any attention to others.
> dp_packet API was modified to respect the segmented packets, but
> there are many places, where code uses packet data directly using
> only dp_packet_data() pointer and dp_packet_size().
> For example, at least following functions will read the wrong memory
> and probably generate segfault:
> 

No, that's not correct. Actually, this has been explained back in June
[1], where Eelco raised the same concerns. There, I replied with:

"This series takes the approach of not allocating new mbufs in
`DPBUF_DPDK` dp_packets. This is to avoid the case where a
dp_packet_put() call, for example, allocates two mbufs to create enough
space, and returns a pointer to the data which isn't contiguous in
memory (because it is spread across two mbufs). Most of the code in OvS
that uses the dp_packet API relies on the assumption that memory
returned is contiguous in memory."

But please do refer to the link for a full explanation. In fact, if that
was the case most of the tests in system-traffic would fail, and that
would show a flaw in the design.

Now, obviously this doesn't mean that in the odd case something may have
slip through the cracks, but all the tests that have been run so far
have given enough confidence.

Tiago.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348502.html

>   * All the netdev-{linux, dummy, bsd} send functions, because
>     they are using code like this:
> 
>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
> 
>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>     they're trying to calculate packet checksums using raw calls to
>     'csum_continue'.
> 
>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
> 
>   * Possibly, much more functions.
> 
> If It's true, when I'm afraid that this patch set is far from being
> ready for merge.
> 
> On 25.07.2018 17:19, Tiago Lam wrote:
>> Overview
>> ========
>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>> Multi-segment mbufs are typically used when the size of an mbuf is
>> insufficient to contain the entirety of a packet's data. Instead, the
>> data is split across numerous mbufs, each carrying a portion, or
>> 'segment', of the packet data. Mbufs are chained via their 'next'
>> attribute (an mbuf pointer).
>>
>> The main motivation behind the support for multi-segment mbufs is to
>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>> planned to be introduced after this series.
>>
>> Use Cases
>> =========
>> i.  Handling oversized (guest-originated) frames, which are marked
>>     for hardware accelration/offload (TSO, for example).
>>
>>     Packets which originate from a non-DPDK source may be marked for
>>     offload; as such, they may be larger than the permitted ingress
>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>     order to transmit such packets over a DPDK port, their contents
>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>     its current implementation, that function only copies data into
>>     a single mbuf; if the space available in the mbuf is exhausted,
>>     but not all packet data has been copied, then it is lost.
>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>     whether that mbuf contains multiple segments. Both issues are
>>     resolved within this patchset.
>>
>> ii. Handling jumbo frames.
>>
>>     While OvS already supports jumbo frames, it does so by increasing
>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>     in a single mbuf. This is certainly the preferred, and most
>>     performant approach (and remains the default).
>>
>> Enabling multi-segment mbufs
>> ============================
>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>> user must decide on which approach to adopt on init. The introduction
>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>
>> This is a global boolean value, which determines how jumbo frames are
>> represented across all DPDK ports. In the absence of a user-supplied
>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>> mbufs must be explicitly enabled / single-segment mbufs remain the
>> default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB
>> fields:
>>
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Performance notes (based on v8)
>> =================
>> In order to test for regressions in performance, tests were run on top
>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>> mbufs option enabled and disabled.
>>
>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>
>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>> -------------------------------------------------------------
>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>
>> Packet size is in bytes, while all packet rates are reported in mpps
>> (aggregated).
>>
>> No noticeable regression has been observed (certainly everything is
>> within the ± 5% margin of existing performance), aside from the 64B
>> packet size case when multi-segment mbuf is enabled. This is
>> expected, however, because of how Tx vectoriszed functions are
>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>> use during these tests was the i40e (on a Intel X710 NIC), which
>> indeed doesn't support vectorized Tx functions with multi-segment
>> mbufs.
>>
>> ---
>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>     - Add Ben's proposed fix for automake's warning;
>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>       GRO.
>>
>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>       error paths.");
>>     - Address Darrell's comments:
>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>         tailroom space when no more headroom space is available, and vice-versa.
>>         However, this was breaking the API for the dp_packet_resize__()
>>         function (only called in dp_packet_prealloc_tailroom() and
>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>         for DPDK packets.
>>         Instead, and because this is isolate funtionality, revert to the
>>         previous state where dp_packet_resize__() is not supported for DPDK
>>         packets. Hence, then patch 08/14 has been dropped.
>>     - Additionally, fix the tests that were relying on the removed
>>       functionality.
>>
>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>       after free.");
>>     - Address Eelco's comments:
>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>       - Remove NEWS line added by mistake during rebase (about adding
>>         experimental vhost zero copy support).
>>     - Address Ian's comments:
>>       - Drop patch 01 from previous series entirely;
>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>         used;
>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>         conditionally.
>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>       the shared mempool model (runs out of memory);
>>     - Add new examples for when multi-segment mbufs are enabled in
>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>       (now patch 11/14).
>>
>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>       command."):
>>       - A new patch (now 01/15) has been introduced to differentiate between
>>         MTU and mbuf size when creating the mempool. This is because as part of
>>         the new support for both per port and shared mempools, mempools were
>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>         mbuf size can end up being the same for all mbufs;
>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>         part of the rebase, but only to adapt to the code changes to "Support
>>         both shared and per port mempools.", no functionality should have been
>>         changed.
>>
>> v3:
>>     - Address Eelco's comment:
>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>           first mbuf's buf_len.
>>
>> v2:
>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>       error.");
>>     - Address Eelco's comments:
>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>           casting;
>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>           first mbuf in the chain;
>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>           always true for DPBUF_DPDK packets;
>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>           beyond the first mbuf in the chain.
>>
>> v1:
>>     - v8 should have been sent as v1 really, as that's usually the approach
>>       followed in OvS. That clearly didn't happen, so restarting the series
>>       now. This also helps making it clear it is no longer an RFC series;
>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>       output.");
>>     - Address Eelco's comments:
>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>           available space. Also fix assigment to `data_len` which was
>>           incorrectly being set to just`pkt_len`;
>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>           why the mutex is needed;
>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>           `DPBUF_DPDK` packets;
>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>           appropriate and meaningful errors, instead of just "0" or "1";
>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>           dpdk-multi-seg-mbufs requires a restart;
>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>       mbufs approach;
>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>       successfully sent out;
>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>       minimise changes;
>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>       after further testing;
>>     - Structure changes:
>>       - Drop patch 07/13 which is now unneeded;
>>       - Two more patches added for extra test coverage. This is what accounts
>>         for the increase in size (+1 patch) in the series.
>>
>> v8 (non-RFC): 
>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>       temporary directory (also for RHEL)");
>>     - Address Ciara's and Ilya's comments:
>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>           already existing dp_packet_tail();
>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>           dpdk_prep_tx_buf() was being wrongly checked;
>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>         - Fix some other code style and duplication issues pointed out.
>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>       dp_packet_put*() functions to work within the bounds of existing
>>       mbufs only;
>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>       other mbufs in the chain for chains with more than a single mbuf;
>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>       the header is within the first mbuf, when using mbufs;
>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>       was refactored to use free_dpdk_buf();
>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>       leading to memory corruption; 
>>     - Modified the added tests to account for some of the above changes;
>>     - Run performance tests, compiling results and adding them to the
>>       cover letter;
>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>       together with a "Performance notes" sub-section reflecting the
>>       findings mentioned above in the cover letter.
>>
>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>      - Address Ilya comments;
>>         - Fix non-DPDK build;
>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>           using a newly introduced mutex.
>>      - Add a new set of tests that integrates with the recently added DPDK
>>        testsuite. These focus on allocating dp_packets, with a single or
>>        multiple mbufs, from an instantiated mempool and performing several
>>        operations on those, verifying if the data at the end matches what's
>>        expected;
>>      - Fix bugs found by the new tests:
>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>           and headrooms, ending up calculating the wrong number of segments
>>           that needed allocation;
>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>           dp_packet_mbuf_tail() and dp_packet_end();
>>         - dp_packet_set_size() was not updating the number of chained segments
>>           'nb_segs';
>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>
>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>        reuse.");
>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>        support multi-seg mbufs;
>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>        packet_tailroom().
>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>        mbufs and linking them together;
>>      Restructured patchset:
>>      - Squash patch 5 into patch 6, since they were both related to
>>        copying data while handling multi-seg mbufs;
>>      - Split patch 4 into two separate patches - one that introduces the
>>        changes in helper functions to deal with multi-seg mbufs and
>>        two others for the shift() and put_uninit() functionality;
>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>        before functionality improvement that rely on those helpers.
>>
>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>       from getting redundantly incremented");
>>     - Sugesh's comments have been addressed:
>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>         make them independent of each other;
>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>         size() are independent;
>>       - dp_packet_clone_with_headroom() now has split functions for
>>         handling DPDK sourced packets and non-DPDK packets;
>>     - Modified various functions in dp-packet.h to account for multi-seg
>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>       and dp_packet_at();
>>     - Added support for shifting packet data in multi-seg mbufs, using
>>       dp_packet_shift();
>>     - Fixed some minor inconsistencies.
>>
>>     Note that some of the changes in v5 have been contributed by Mark
>>     Kavanagh as well.
>>
>> v4: - restructure patchset
>>     - account for 128B ARM cacheline when sizing mbufs
>>
>> Mark Kavanagh (4):
>>   netdev-dpdk: fix mbuf sizing
>>   dp-packet: Init specific mbuf fields.
>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>   netdev-dpdk: support multi-segment jumbo frames
>>
>> Michael Qiu (1):
>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>
>> Tiago Lam (8):
>>   dp-packet: Fix allocated size on DPDK init.
>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>
>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>  NEWS                                       |   1 +
>>  lib/dp-packet.c                            | 173 +++++++++-
>>  lib/dp-packet.h                            | 214 ++++++++++--
>>  lib/dpdk.c                                 |   8 +
>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>  lib/netdev-dpdk.h                          |   2 +
>>  tests/automake.mk                          |  10 +-
>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>  tests/system-dpdk-macros.at                |   6 +-
>>  tests/system-dpdk-testsuite.at             |   1 +
>>  tests/system-dpdk.at                       |  65 ++++
>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>  vswitchd/vswitch.xml                       |  22 ++
>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>  create mode 100644 tests/test-dpdk-mbufs.c
>>
Ilya Maximets Aug. 9, 2018, 11:44 a.m. UTC | #3
On 09.08.2018 11:38, Lam, Tiago wrote:
> Hi Ilya,
> 
> On 09/08/2018 09:27, Ilya Maximets wrote:
>> Hmm. I found that this series modifies only dpdk related components
>> and doesn't pay any attention to others.
>> dp_packet API was modified to respect the segmented packets, but
>> there are many places, where code uses packet data directly using
>> only dp_packet_data() pointer and dp_packet_size().
>> For example, at least following functions will read the wrong memory
>> and probably generate segfault:
>>
> 
> No, that's not correct. Actually, this has been explained back in June
> [1], where Eelco raised the same concerns. There, I replied with:
> 
> "This series takes the approach of not allocating new mbufs in
> `DPBUF_DPDK` dp_packets. This is to avoid the case where a
> dp_packet_put() call, for example, allocates two mbufs to create enough
> space, and returns a pointer to the data which isn't contiguous in
> memory (because it is spread across two mbufs). Most of the code in OvS
> that uses the dp_packet API relies on the assumption that memory
> returned is contiguous in memory."
> 
> But please do refer to the link for a full explanation. In fact, if that
> was the case most of the tests in system-traffic would fail, and that
> would show a flaw in the design.
> 
> Now, obviously this doesn't mean that in the odd case something may have
> slip through the cracks, but all the tests that have been run so far
> have given enough confidence.
> 
> Tiago.
> 
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348502.html

I don't understand what you're talking about.
Just look at the following case:

1. Applying following patch, just to be sure:
---
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 0c42268..bc995bc 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1433,6 +1433,11 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
         ssize_t retval;
         int error;
 
+        if (packet->mbuf.nb_segs > 1) {
+            VLOG_ERR_RL(&rl,
+                        "Sending multiseg mbuf as a plain data. nb_segs: %d",
+                        packet->mbuf.nb_segs);
+        }
         do {
             retval = write(netdev->tap_fd, dp_packet_data(packet), size);
             error = retval < 0 ? errno : 0;
---

2. Configure OVS:
./bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

3. Create one bridge with datapath_type "netdev", add one vhostuserclient
   port "vhost1" like this:

# ovs-vsctl show
    Bridge "ovsdpdk_br0"
        Port "vhost1"
            Interface "vhost1"
                type: dpdkvhostuserclient
                options: {vhost-server-path="/vhost1.sock"}
        Port "ovsdpdk_br0"
            Interface "ovsdpdk_br0"
                type: internal

4. Start VM.

5. Set up on host:

# ovs-vsctl set interface vhost1 mtu_request=9000
# ovs-vsctl set interface ovsdpdk_br0 mtu_request=9000
# ifconfig ovsdpdk_br0 192.168.114.20 netmask 255.255.255.0

6. Set up inside VM:

# ifconfig eth0 192.168.114.10 netmask 255.255.255.0 mtu 9000

7. Try to scp large file from VM to Host:

# scp file root@192.168.114.20:./
file                             13% 2112KB 434.8KB/s - *stalled*

At the same time in tcpdump on the host side:

14:17:51.162168 IP (tos 0x8, ttl 64, id 30567, offset 0, flags [DF], proto TCP (6), length 9000)
    192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc831 (incorrect -> 0x1be9), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216598 ecr 1685821578], length 8948
14:17:51.582171 IP (tos 0x8, ttl 64, id 30568, offset 0, flags [DF], proto TCP (6), length 9000)
    192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc807 (incorrect -> 0x980f), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216640 ecr 1685821578], length 8948
14:17:52.422175 IP (tos 0x8, ttl 64, id 30569, offset 0, flags [DF], proto TCP (6), length 9000)
    192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc7b3 (incorrect -> 0x1b6b), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216724 ecr 1685821578], length 8948
14:17:54.115128 IP (tos 0x8, ttl 64, id 30570, offset 0, flags [DF], proto TCP (6), length 9000)
    192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc70b (incorrect -> 0x9713), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216892 ecr 1685821578], length 8948

And in OVS log:

2018-08-09T11:17:50Z|00002|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 4
2018-08-09T11:17:50Z|00003|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
2018-08-09T11:17:50Z|00004|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 4
2018-08-09T11:17:50Z|00005|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
2018-08-09T11:17:50Z|00006|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
2018-08-09T11:17:51Z|00007|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
2018-08-09T11:17:51Z|00008|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5

SCP never finished because of corrupted data.
Same test works fine if other_config:dpdk-multi-seg-mbufs=false.

Best regards, Ilya Maximets.

> 
>>   * All the netdev-{linux, dummy, bsd} send functions, because
>>     they are using code like this:
>>
>>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
>>
>>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>>     they're trying to calculate packet checksums using raw calls to
>>     'csum_continue'.
>>
>>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
>>
>>   * Possibly, much more functions.
>>
>> If It's true, when I'm afraid that this patch set is far from being
>> ready for merge.
>>
>> On 25.07.2018 17:19, Tiago Lam wrote:
>>> Overview
>>> ========
>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>>> Multi-segment mbufs are typically used when the size of an mbuf is
>>> insufficient to contain the entirety of a packet's data. Instead, the
>>> data is split across numerous mbufs, each carrying a portion, or
>>> 'segment', of the packet data. Mbufs are chained via their 'next'
>>> attribute (an mbuf pointer).
>>>
>>> The main motivation behind the support for multi-segment mbufs is to
>>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>>> planned to be introduced after this series.
>>>
>>> Use Cases
>>> =========
>>> i.  Handling oversized (guest-originated) frames, which are marked
>>>     for hardware accelration/offload (TSO, for example).
>>>
>>>     Packets which originate from a non-DPDK source may be marked for
>>>     offload; as such, they may be larger than the permitted ingress
>>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>>     order to transmit such packets over a DPDK port, their contents
>>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>>     its current implementation, that function only copies data into
>>>     a single mbuf; if the space available in the mbuf is exhausted,
>>>     but not all packet data has been copied, then it is lost.
>>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>>     whether that mbuf contains multiple segments. Both issues are
>>>     resolved within this patchset.
>>>
>>> ii. Handling jumbo frames.
>>>
>>>     While OvS already supports jumbo frames, it does so by increasing
>>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>>     in a single mbuf. This is certainly the preferred, and most
>>>     performant approach (and remains the default).
>>>
>>> Enabling multi-segment mbufs
>>> ============================
>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>> user must decide on which approach to adopt on init. The introduction
>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>>
>>> This is a global boolean value, which determines how jumbo frames are
>>> represented across all DPDK ports. In the absence of a user-supplied
>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>> default.
>>>
>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>> fields:
>>>
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>
>>> Performance notes (based on v8)
>>> =================
>>> In order to test for regressions in performance, tests were run on top
>>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>>> mbufs option enabled and disabled.
>>>
>>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>>
>>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>>> -------------------------------------------------------------
>>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>>
>>> Packet size is in bytes, while all packet rates are reported in mpps
>>> (aggregated).
>>>
>>> No noticeable regression has been observed (certainly everything is
>>> within the ± 5% margin of existing performance), aside from the 64B
>>> packet size case when multi-segment mbuf is enabled. This is
>>> expected, however, because of how Tx vectoriszed functions are
>>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>>> use during these tests was the i40e (on a Intel X710 NIC), which
>>> indeed doesn't support vectorized Tx functions with multi-segment
>>> mbufs.
>>>
>>> ---
>>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>>     - Add Ben's proposed fix for automake's warning;
>>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>>       GRO.
>>>
>>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>>       error paths.");
>>>     - Address Darrell's comments:
>>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>>         tailroom space when no more headroom space is available, and vice-versa.
>>>         However, this was breaking the API for the dp_packet_resize__()
>>>         function (only called in dp_packet_prealloc_tailroom() and
>>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>>         for DPDK packets.
>>>         Instead, and because this is isolate funtionality, revert to the
>>>         previous state where dp_packet_resize__() is not supported for DPDK
>>>         packets. Hence, then patch 08/14 has been dropped.
>>>     - Additionally, fix the tests that were relying on the removed
>>>       functionality.
>>>
>>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>>       after free.");
>>>     - Address Eelco's comments:
>>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>>       - Remove NEWS line added by mistake during rebase (about adding
>>>         experimental vhost zero copy support).
>>>     - Address Ian's comments:
>>>       - Drop patch 01 from previous series entirely;
>>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>>         used;
>>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>>         conditionally.
>>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>>       the shared mempool model (runs out of memory);
>>>     - Add new examples for when multi-segment mbufs are enabled in
>>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>>       (now patch 11/14).
>>>
>>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>>       command."):
>>>       - A new patch (now 01/15) has been introduced to differentiate between
>>>         MTU and mbuf size when creating the mempool. This is because as part of
>>>         the new support for both per port and shared mempools, mempools were
>>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>>         mbuf size can end up being the same for all mbufs;
>>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>>         part of the rebase, but only to adapt to the code changes to "Support
>>>         both shared and per port mempools.", no functionality should have been
>>>         changed.
>>>
>>> v3:
>>>     - Address Eelco's comment:
>>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>>           first mbuf's buf_len.
>>>
>>> v2:
>>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>>       error.");
>>>     - Address Eelco's comments:
>>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>>           casting;
>>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>>           first mbuf in the chain;
>>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>>           always true for DPBUF_DPDK packets;
>>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>>           beyond the first mbuf in the chain.
>>>
>>> v1:
>>>     - v8 should have been sent as v1 really, as that's usually the approach
>>>       followed in OvS. That clearly didn't happen, so restarting the series
>>>       now. This also helps making it clear it is no longer an RFC series;
>>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>>       output.");
>>>     - Address Eelco's comments:
>>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>>           available space. Also fix assigment to `data_len` which was
>>>           incorrectly being set to just`pkt_len`;
>>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>>           why the mutex is needed;
>>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>>           `DPBUF_DPDK` packets;
>>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>>           appropriate and meaningful errors, instead of just "0" or "1";
>>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>>           dpdk-multi-seg-mbufs requires a restart;
>>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>>       mbufs approach;
>>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>>       successfully sent out;
>>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>>       minimise changes;
>>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>>       after further testing;
>>>     - Structure changes:
>>>       - Drop patch 07/13 which is now unneeded;
>>>       - Two more patches added for extra test coverage. This is what accounts
>>>         for the increase in size (+1 patch) in the series.
>>>
>>> v8 (non-RFC): 
>>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>>       temporary directory (also for RHEL)");
>>>     - Address Ciara's and Ilya's comments:
>>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>>           already existing dp_packet_tail();
>>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>>           dpdk_prep_tx_buf() was being wrongly checked;
>>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>>         - Fix some other code style and duplication issues pointed out.
>>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>>       dp_packet_put*() functions to work within the bounds of existing
>>>       mbufs only;
>>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>>       other mbufs in the chain for chains with more than a single mbuf;
>>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>>       the header is within the first mbuf, when using mbufs;
>>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>>       was refactored to use free_dpdk_buf();
>>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>>       leading to memory corruption; 
>>>     - Modified the added tests to account for some of the above changes;
>>>     - Run performance tests, compiling results and adding them to the
>>>       cover letter;
>>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>>       together with a "Performance notes" sub-section reflecting the
>>>       findings mentioned above in the cover letter.
>>>
>>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>>      - Address Ilya comments;
>>>         - Fix non-DPDK build;
>>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>>           using a newly introduced mutex.
>>>      - Add a new set of tests that integrates with the recently added DPDK
>>>        testsuite. These focus on allocating dp_packets, with a single or
>>>        multiple mbufs, from an instantiated mempool and performing several
>>>        operations on those, verifying if the data at the end matches what's
>>>        expected;
>>>      - Fix bugs found by the new tests:
>>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>>           and headrooms, ending up calculating the wrong number of segments
>>>           that needed allocation;
>>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>>           dp_packet_mbuf_tail() and dp_packet_end();
>>>         - dp_packet_set_size() was not updating the number of chained segments
>>>           'nb_segs';
>>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>>
>>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>>        reuse.");
>>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>>        support multi-seg mbufs;
>>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>>        packet_tailroom().
>>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>>        mbufs and linking them together;
>>>      Restructured patchset:
>>>      - Squash patch 5 into patch 6, since they were both related to
>>>        copying data while handling multi-seg mbufs;
>>>      - Split patch 4 into two separate patches - one that introduces the
>>>        changes in helper functions to deal with multi-seg mbufs and
>>>        two others for the shift() and put_uninit() functionality;
>>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>>        before functionality improvement that rely on those helpers.
>>>
>>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>>       from getting redundantly incremented");
>>>     - Sugesh's comments have been addressed:
>>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>>         make them independent of each other;
>>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>>         size() are independent;
>>>       - dp_packet_clone_with_headroom() now has split functions for
>>>         handling DPDK sourced packets and non-DPDK packets;
>>>     - Modified various functions in dp-packet.h to account for multi-seg
>>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>>       and dp_packet_at();
>>>     - Added support for shifting packet data in multi-seg mbufs, using
>>>       dp_packet_shift();
>>>     - Fixed some minor inconsistencies.
>>>
>>>     Note that some of the changes in v5 have been contributed by Mark
>>>     Kavanagh as well.
>>>
>>> v4: - restructure patchset
>>>     - account for 128B ARM cacheline when sizing mbufs
>>>
>>> Mark Kavanagh (4):
>>>   netdev-dpdk: fix mbuf sizing
>>>   dp-packet: Init specific mbuf fields.
>>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>>   netdev-dpdk: support multi-segment jumbo frames
>>>
>>> Michael Qiu (1):
>>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>>
>>> Tiago Lam (8):
>>>   dp-packet: Fix allocated size on DPDK init.
>>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>>
>>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>>  NEWS                                       |   1 +
>>>  lib/dp-packet.c                            | 173 +++++++++-
>>>  lib/dp-packet.h                            | 214 ++++++++++--
>>>  lib/dpdk.c                                 |   8 +
>>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>>  lib/netdev-dpdk.h                          |   2 +
>>>  tests/automake.mk                          |  10 +-
>>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>>  tests/system-dpdk-macros.at                |   6 +-
>>>  tests/system-dpdk-testsuite.at             |   1 +
>>>  tests/system-dpdk.at                       |  65 ++++
>>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>>  vswitchd/vswitch.xml                       |  22 ++
>>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>>  create mode 100644 tests/test-dpdk-mbufs.c
>>>
> 
>
Stokes, Ian Aug. 9, 2018, 2:53 p.m. UTC | #4
On 8/9/2018 12:44 PM, Ilya Maximets wrote:
> On 09.08.2018 11:38, Lam, Tiago wrote:
>> Hi Ilya,
>>
>> On 09/08/2018 09:27, Ilya Maximets wrote:
>>> Hmm. I found that this series modifies only dpdk related components
>>> and doesn't pay any attention to others.
>>> dp_packet API was modified to respect the segmented packets, but
>>> there are many places, where code uses packet data directly using
>>> only dp_packet_data() pointer and dp_packet_size().
>>> For example, at least following functions will read the wrong memory
>>> and probably generate segfault:
>>>
>>
>> No, that's not correct. Actually, this has been explained back in June
>> [1], where Eelco raised the same concerns. There, I replied with:
>>
>> "This series takes the approach of not allocating new mbufs in
>> `DPBUF_DPDK` dp_packets. This is to avoid the case where a
>> dp_packet_put() call, for example, allocates two mbufs to create enough
>> space, and returns a pointer to the data which isn't contiguous in
>> memory (because it is spread across two mbufs). Most of the code in OvS
>> that uses the dp_packet API relies on the assumption that memory
>> returned is contiguous in memory."
>>
>> But please do refer to the link for a full explanation. In fact, if that
>> was the case most of the tests in system-traffic would fail, and that
>> would show a flaw in the design.
>>
>> Now, obviously this doesn't mean that in the odd case something may have
>> slip through the cracks, but all the tests that have been run so far
>> have given enough confidence.
>>
>> Tiago.
>>
>> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348502.html
> 
> I don't understand what you're talking about.
> Just look at the following case:
> 
> 1. Applying following patch, just to be sure:
> ---
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 0c42268..bc995bc 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -1433,6 +1433,11 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
>           ssize_t retval;
>           int error;
>   
> +        if (packet->mbuf.nb_segs > 1) {
> +            VLOG_ERR_RL(&rl,
> +                        "Sending multiseg mbuf as a plain data. nb_segs: %d",
> +                        packet->mbuf.nb_segs);
> +        }
>           do {
>               retval = write(netdev->tap_fd, dp_packet_data(packet), size);
>               error = retval < 0 ? errno : 0;
> ---
> 
> 2. Configure OVS:
> ./bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> 3. Create one bridge with datapath_type "netdev", add one vhostuserclient
>     port "vhost1" like this:
> 
> # ovs-vsctl show
>      Bridge "ovsdpdk_br0"
>          Port "vhost1"
>              Interface "vhost1"
>                  type: dpdkvhostuserclient
>                  options: {vhost-server-path="/vhost1.sock"}
>          Port "ovsdpdk_br0"
>              Interface "ovsdpdk_br0"
>                  type: internal
> 
> 4. Start VM.
> 
> 5. Set up on host:
> 
> # ovs-vsctl set interface vhost1 mtu_request=9000
> # ovs-vsctl set interface ovsdpdk_br0 mtu_request=9000
> # ifconfig ovsdpdk_br0 192.168.114.20 netmask 255.255.255.0
> 
> 6. Set up inside VM:
> 
> # ifconfig eth0 192.168.114.10 netmask 255.255.255.0 mtu 9000
> 
> 7. Try to scp large file from VM to Host:
> 
> # scp file root@192.168.114.20:./
> file                             13% 2112KB 434.8KB/s - *stalled*
> 
> At the same time in tcpdump on the host side:
> 
> 14:17:51.162168 IP (tos 0x8, ttl 64, id 30567, offset 0, flags [DF], proto TCP (6), length 9000)
>      192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc831 (incorrect -> 0x1be9), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216598 ecr 1685821578], length 8948
> 14:17:51.582171 IP (tos 0x8, ttl 64, id 30568, offset 0, flags [DF], proto TCP (6), length 9000)
>      192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc807 (incorrect -> 0x980f), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216640 ecr 1685821578], length 8948
> 14:17:52.422175 IP (tos 0x8, ttl 64, id 30569, offset 0, flags [DF], proto TCP (6), length 9000)
>      192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc7b3 (incorrect -> 0x1b6b), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216724 ecr 1685821578], length 8948
> 14:17:54.115128 IP (tos 0x8, ttl 64, id 30570, offset 0, flags [DF], proto TCP (6), length 9000)
>      192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc70b (incorrect -> 0x9713), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216892 ecr 1685821578], length 8948
> 
> And in OVS log:
> 
> 2018-08-09T11:17:50Z|00002|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 4
> 2018-08-09T11:17:50Z|00003|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:50Z|00004|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 4
> 2018-08-09T11:17:50Z|00005|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:50Z|00006|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:51Z|00007|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:51Z|00008|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 
> SCP never finished because of corrupted data.
> Same test works fine if other_config:dpdk-multi-seg-mbufs=false.
> 
> Best regards, Ilya Maximets.
> 


I'm holding off merging this just yet, the questions around issues 
flagged here should be resolved first, there are also a number of sparse 
build issues with the series to be resolved so another revision at least 
is required to address these problems.

Ian
Lam, Tiago Aug. 9, 2018, 4:53 p.m. UTC | #5
On 09/08/2018 12:44, Ilya Maximets wrote:
> On 09.08.2018 11:38, Lam, Tiago wrote:
>> Hi Ilya,
>>
>> On 09/08/2018 09:27, Ilya Maximets wrote:
>>> Hmm. I found that this series modifies only dpdk related components
>>> and doesn't pay any attention to others.
>>> dp_packet API was modified to respect the segmented packets, but
>>> there are many places, where code uses packet data directly using
>>> only dp_packet_data() pointer and dp_packet_size().
>>> For example, at least following functions will read the wrong memory
>>> and probably generate segfault:
>>>
>>
>> No, that's not correct. Actually, this has been explained back in June
>> [1], where Eelco raised the same concerns. There, I replied with:
>>
>> "This series takes the approach of not allocating new mbufs in
>> `DPBUF_DPDK` dp_packets. This is to avoid the case where a
>> dp_packet_put() call, for example, allocates two mbufs to create enough
>> space, and returns a pointer to the data which isn't contiguous in
>> memory (because it is spread across two mbufs). Most of the code in OvS
>> that uses the dp_packet API relies on the assumption that memory
>> returned is contiguous in memory."
>>
>> But please do refer to the link for a full explanation. In fact, if that
>> was the case most of the tests in system-traffic would fail, and that
>> would show a flaw in the design.
>>
>> Now, obviously this doesn't mean that in the odd case something may have
>> slip through the cracks, but all the tests that have been run so far
>> have given enough confidence.
>>
>> Tiago.
>>
>> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348502.html
> 
> I don't understand what you're talking about.
> Just look at the following case:
> 

The broad sentence above kind of threw me off to think this was about
the allocation of mbufs where data becomes uncontiguous, but the example
below shows well your point. I've similar tests, but they are missing
the "big file" part. Thanks for the detailed example.

I'll work to improve on this.

Tiago.

> 1. Applying following patch, just to be sure:
> ---
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 0c42268..bc995bc 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -1433,6 +1433,11 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
>          ssize_t retval;
>          int error;
>  
> +        if (packet->mbuf.nb_segs > 1) {
> +            VLOG_ERR_RL(&rl,
> +                        "Sending multiseg mbuf as a plain data. nb_segs: %d",
> +                        packet->mbuf.nb_segs);
> +        }
>          do {
>              retval = write(netdev->tap_fd, dp_packet_data(packet), size);
>              error = retval < 0 ? errno : 0;
> ---
> 
> 2. Configure OVS:
> ./bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> 3. Create one bridge with datapath_type "netdev", add one vhostuserclient
>    port "vhost1" like this:
> 
> # ovs-vsctl show
>     Bridge "ovsdpdk_br0"
>         Port "vhost1"
>             Interface "vhost1"
>                 type: dpdkvhostuserclient
>                 options: {vhost-server-path="/vhost1.sock"}
>         Port "ovsdpdk_br0"
>             Interface "ovsdpdk_br0"
>                 type: internal
> 
> 4. Start VM.
> 
> 5. Set up on host:
> 
> # ovs-vsctl set interface vhost1 mtu_request=9000
> # ovs-vsctl set interface ovsdpdk_br0 mtu_request=9000
> # ifconfig ovsdpdk_br0 192.168.114.20 netmask 255.255.255.0
> 
> 6. Set up inside VM:
> 
> # ifconfig eth0 192.168.114.10 netmask 255.255.255.0 mtu 9000
> 
> 7. Try to scp large file from VM to Host:
> 
> # scp file root@192.168.114.20:./
> file                             13% 2112KB 434.8KB/s - *stalled*
> 
> At the same time in tcpdump on the host side:
> 
> 14:17:51.162168 IP (tos 0x8, ttl 64, id 30567, offset 0, flags [DF], proto TCP (6), length 9000)
>     192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc831 (incorrect -> 0x1be9), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216598 ecr 1685821578], length 8948
> 14:17:51.582171 IP (tos 0x8, ttl 64, id 30568, offset 0, flags [DF], proto TCP (6), length 9000)
>     192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc807 (incorrect -> 0x980f), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216640 ecr 1685821578], length 8948
> 14:17:52.422175 IP (tos 0x8, ttl 64, id 30569, offset 0, flags [DF], proto TCP (6), length 9000)
>     192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc7b3 (incorrect -> 0x1b6b), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216724 ecr 1685821578], length 8948
> 14:17:54.115128 IP (tos 0x8, ttl 64, id 30570, offset 0, flags [DF], proto TCP (6), length 9000)
>     192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc70b (incorrect -> 0x9713), seq 480:9428, ack 761, win 256, options [nop,nop,TS val 216892 ecr 1685821578], length 8948
> 
> And in OVS log:
> 
> 2018-08-09T11:17:50Z|00002|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 4
> 2018-08-09T11:17:50Z|00003|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:50Z|00004|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 4
> 2018-08-09T11:17:50Z|00005|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:50Z|00006|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:51Z|00007|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 2018-08-09T11:17:51Z|00008|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a plain data. nb_segs: 5
> 
> SCP never finished because of corrupted data.
> Same test works fine if other_config:dpdk-multi-seg-mbufs=false.
> 
> Best regards, Ilya Maximets.
> 
>>
>>>   * All the netdev-{linux, dummy, bsd} send functions, because
>>>     they are using code like this:
>>>
>>>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
>>>
>>>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>>>     they're trying to calculate packet checksums using raw calls to
>>>     'csum_continue'.
>>>
>>>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
>>>
>>>   * Possibly, much more functions.
>>>
>>> If It's true, when I'm afraid that this patch set is far from being
>>> ready for merge.
>>>
>>> On 25.07.2018 17:19, Tiago Lam wrote:
>>>> Overview
>>>> ========
>>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>>>> Multi-segment mbufs are typically used when the size of an mbuf is
>>>> insufficient to contain the entirety of a packet's data. Instead, the
>>>> data is split across numerous mbufs, each carrying a portion, or
>>>> 'segment', of the packet data. Mbufs are chained via their 'next'
>>>> attribute (an mbuf pointer).
>>>>
>>>> The main motivation behind the support for multi-segment mbufs is to
>>>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>>>> planned to be introduced after this series.
>>>>
>>>> Use Cases
>>>> =========
>>>> i.  Handling oversized (guest-originated) frames, which are marked
>>>>     for hardware accelration/offload (TSO, for example).
>>>>
>>>>     Packets which originate from a non-DPDK source may be marked for
>>>>     offload; as such, they may be larger than the permitted ingress
>>>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>>>     order to transmit such packets over a DPDK port, their contents
>>>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>>>     its current implementation, that function only copies data into
>>>>     a single mbuf; if the space available in the mbuf is exhausted,
>>>>     but not all packet data has been copied, then it is lost.
>>>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>>>     whether that mbuf contains multiple segments. Both issues are
>>>>     resolved within this patchset.
>>>>
>>>> ii. Handling jumbo frames.
>>>>
>>>>     While OvS already supports jumbo frames, it does so by increasing
>>>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>>>     in a single mbuf. This is certainly the preferred, and most
>>>>     performant approach (and remains the default).
>>>>
>>>> Enabling multi-segment mbufs
>>>> ============================
>>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>>> user must decide on which approach to adopt on init. The introduction
>>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>>>
>>>> This is a global boolean value, which determines how jumbo frames are
>>>> represented across all DPDK ports. In the absence of a user-supplied
>>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>>> default.
>>>>
>>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>>> fields:
>>>>
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>>
>>>> Performance notes (based on v8)
>>>> =================
>>>> In order to test for regressions in performance, tests were run on top
>>>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>>>> mbufs option enabled and disabled.
>>>>
>>>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>>>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>>>
>>>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>>>> -------------------------------------------------------------
>>>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>>>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>>>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>>>
>>>> Packet size is in bytes, while all packet rates are reported in mpps
>>>> (aggregated).
>>>>
>>>> No noticeable regression has been observed (certainly everything is
>>>> within the ± 5% margin of existing performance), aside from the 64B
>>>> packet size case when multi-segment mbuf is enabled. This is
>>>> expected, however, because of how Tx vectoriszed functions are
>>>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>>>> use during these tests was the i40e (on a Intel X710 NIC), which
>>>> indeed doesn't support vectorized Tx functions with multi-segment
>>>> mbufs.
>>>>
>>>> ---
>>>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>>>     - Add Ben's proposed fix for automake's warning;
>>>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>>>       GRO.
>>>>
>>>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>>>       error paths.");
>>>>     - Address Darrell's comments:
>>>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>>>         tailroom space when no more headroom space is available, and vice-versa.
>>>>         However, this was breaking the API for the dp_packet_resize__()
>>>>         function (only called in dp_packet_prealloc_tailroom() and
>>>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>>>         for DPDK packets.
>>>>         Instead, and because this is isolate funtionality, revert to the
>>>>         previous state where dp_packet_resize__() is not supported for DPDK
>>>>         packets. Hence, then patch 08/14 has been dropped.
>>>>     - Additionally, fix the tests that were relying on the removed
>>>>       functionality.
>>>>
>>>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>>>       after free.");
>>>>     - Address Eelco's comments:
>>>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>>>       - Remove NEWS line added by mistake during rebase (about adding
>>>>         experimental vhost zero copy support).
>>>>     - Address Ian's comments:
>>>>       - Drop patch 01 from previous series entirely;
>>>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>>>         used;
>>>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>>>         conditionally.
>>>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>>>       the shared mempool model (runs out of memory);
>>>>     - Add new examples for when multi-segment mbufs are enabled in
>>>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>>>       (now patch 11/14).
>>>>
>>>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>>>       command."):
>>>>       - A new patch (now 01/15) has been introduced to differentiate between
>>>>         MTU and mbuf size when creating the mempool. This is because as part of
>>>>         the new support for both per port and shared mempools, mempools were
>>>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>>>         mbuf size can end up being the same for all mbufs;
>>>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>>>         part of the rebase, but only to adapt to the code changes to "Support
>>>>         both shared and per port mempools.", no functionality should have been
>>>>         changed.
>>>>
>>>> v3:
>>>>     - Address Eelco's comment:
>>>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>>>           first mbuf's buf_len.
>>>>
>>>> v2:
>>>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>>>       error.");
>>>>     - Address Eelco's comments:
>>>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>>>           casting;
>>>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>>>           first mbuf in the chain;
>>>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>>>           always true for DPBUF_DPDK packets;
>>>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>>>           beyond the first mbuf in the chain.
>>>>
>>>> v1:
>>>>     - v8 should have been sent as v1 really, as that's usually the approach
>>>>       followed in OvS. That clearly didn't happen, so restarting the series
>>>>       now. This also helps making it clear it is no longer an RFC series;
>>>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>>>       output.");
>>>>     - Address Eelco's comments:
>>>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>>>           available space. Also fix assigment to `data_len` which was
>>>>           incorrectly being set to just`pkt_len`;
>>>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>>>           why the mutex is needed;
>>>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>>>           `DPBUF_DPDK` packets;
>>>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>>>           appropriate and meaningful errors, instead of just "0" or "1";
>>>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>>>           dpdk-multi-seg-mbufs requires a restart;
>>>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>>>       mbufs approach;
>>>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>>>       successfully sent out;
>>>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>>>       minimise changes;
>>>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>>>       after further testing;
>>>>     - Structure changes:
>>>>       - Drop patch 07/13 which is now unneeded;
>>>>       - Two more patches added for extra test coverage. This is what accounts
>>>>         for the increase in size (+1 patch) in the series.
>>>>
>>>> v8 (non-RFC): 
>>>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>>>       temporary directory (also for RHEL)");
>>>>     - Address Ciara's and Ilya's comments:
>>>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>>>           already existing dp_packet_tail();
>>>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>>>           dpdk_prep_tx_buf() was being wrongly checked;
>>>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>>>         - Fix some other code style and duplication issues pointed out.
>>>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>>>       dp_packet_put*() functions to work within the bounds of existing
>>>>       mbufs only;
>>>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>>>       other mbufs in the chain for chains with more than a single mbuf;
>>>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>>>       the header is within the first mbuf, when using mbufs;
>>>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>>>       was refactored to use free_dpdk_buf();
>>>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>>>       leading to memory corruption; 
>>>>     - Modified the added tests to account for some of the above changes;
>>>>     - Run performance tests, compiling results and adding them to the
>>>>       cover letter;
>>>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>>>       together with a "Performance notes" sub-section reflecting the
>>>>       findings mentioned above in the cover letter.
>>>>
>>>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>>>      - Address Ilya comments;
>>>>         - Fix non-DPDK build;
>>>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>>>           using a newly introduced mutex.
>>>>      - Add a new set of tests that integrates with the recently added DPDK
>>>>        testsuite. These focus on allocating dp_packets, with a single or
>>>>        multiple mbufs, from an instantiated mempool and performing several
>>>>        operations on those, verifying if the data at the end matches what's
>>>>        expected;
>>>>      - Fix bugs found by the new tests:
>>>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>>>           and headrooms, ending up calculating the wrong number of segments
>>>>           that needed allocation;
>>>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>>>           dp_packet_mbuf_tail() and dp_packet_end();
>>>>         - dp_packet_set_size() was not updating the number of chained segments
>>>>           'nb_segs';
>>>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>>>
>>>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>>>        reuse.");
>>>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>>>        support multi-seg mbufs;
>>>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>>>        packet_tailroom().
>>>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>>>        mbufs and linking them together;
>>>>      Restructured patchset:
>>>>      - Squash patch 5 into patch 6, since they were both related to
>>>>        copying data while handling multi-seg mbufs;
>>>>      - Split patch 4 into two separate patches - one that introduces the
>>>>        changes in helper functions to deal with multi-seg mbufs and
>>>>        two others for the shift() and put_uninit() functionality;
>>>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>>>        before functionality improvement that rely on those helpers.
>>>>
>>>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>>>       from getting redundantly incremented");
>>>>     - Sugesh's comments have been addressed:
>>>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>>>         make them independent of each other;
>>>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>>>         size() are independent;
>>>>       - dp_packet_clone_with_headroom() now has split functions for
>>>>         handling DPDK sourced packets and non-DPDK packets;
>>>>     - Modified various functions in dp-packet.h to account for multi-seg
>>>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>>>       and dp_packet_at();
>>>>     - Added support for shifting packet data in multi-seg mbufs, using
>>>>       dp_packet_shift();
>>>>     - Fixed some minor inconsistencies.
>>>>
>>>>     Note that some of the changes in v5 have been contributed by Mark
>>>>     Kavanagh as well.
>>>>
>>>> v4: - restructure patchset
>>>>     - account for 128B ARM cacheline when sizing mbufs
>>>>
>>>> Mark Kavanagh (4):
>>>>   netdev-dpdk: fix mbuf sizing
>>>>   dp-packet: Init specific mbuf fields.
>>>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>>>   netdev-dpdk: support multi-segment jumbo frames
>>>>
>>>> Michael Qiu (1):
>>>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>>>
>>>> Tiago Lam (8):
>>>>   dp-packet: Fix allocated size on DPDK init.
>>>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>>>
>>>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>>>  NEWS                                       |   1 +
>>>>  lib/dp-packet.c                            | 173 +++++++++-
>>>>  lib/dp-packet.h                            | 214 ++++++++++--
>>>>  lib/dpdk.c                                 |   8 +
>>>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>>>  lib/netdev-dpdk.h                          |   2 +
>>>>  tests/automake.mk                          |  10 +-
>>>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>>>  tests/system-dpdk-macros.at                |   6 +-
>>>>  tests/system-dpdk-testsuite.at             |   1 +
>>>>  tests/system-dpdk.at                       |  65 ++++
>>>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>>>  vswitchd/vswitch.xml                       |  22 ++
>>>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>>>  create mode 100644 tests/test-dpdk-mbufs.c
>>>>
>>
>>
Lam, Tiago Aug. 14, 2018, 9:45 a.m. UTC | #6
On 09/08/2018 09:27, Ilya Maximets wrote:
> Hmm. I found that this series modifies only dpdk related components
> and doesn't pay any attention to others.
> dp_packet API was modified to respect the segmented packets, but
> there are many places, where code uses packet data directly using
> only dp_packet_data() pointer and dp_packet_size().
> For example, at least following functions will read the wrong memory
> and probably generate segfault:
> 
>   * All the netdev-{linux, dummy, bsd} send functions, because
>     they are using code like this:
> 
>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
> 
>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>     they're trying to calculate packet checksums using raw calls to
>     'csum_continue'.
> 
>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
> 
>   * Possibly, much more functions.
> 

Hi Ilya (and All),

I went through the uses that call dp_packet_data(), and some of the
cases were covered already, like the "udp_extract_tnl_md" case you
mention - because the dp_packet API is changed to not let callers pull
headers after the first segment.

The other cases I consider all to be of the same category: they all need
the data contiguous in memory, such as the write() syscall you mention
above. Do you see any other corner cases that I may be missing?

So, to cover this category, where data is expected to be contiguous in
memory, we'd need to copy it to a someplace contigous (most likely
system's memory). The approach I have in mind is to extend the dp_packet
API in order to add a new function that "linearizes" the multi-segmented
data. This new function would be called from places such as the write()
call in netdev-linux you mention above, instead of the call to
dp_packet_data() in place now, and would convert between a DPBUF_DPDK
packet into a DPBUF_MALLOC packet. The packet would then continue to be
used on as a regular DPBUF_MALLOC packet (instead of a DPBUF_DPDK
packet) which would be properly free'ed and returned back to the DPDK
mempool's upon dp_packet_delete() / dp_packet_uninit().

An upside of this approach is that it would also enable us to follow the
ideia proposed on [1] (by yourself), where we could "easily" tranform a
DPDK packet into a system packet if no more space exists when calling
dp_packet_reuse__().

This would enable us to use multi-segment mbufs, and later on TSO,
between DPDK ports, while still enabling the use cases between DPDK
ports and non-DPDK ports. For this latter case, there'd be a one-off
(expensive) operation where system's memory needs to be allocated and
copied to, but documentation would be further improved to explain this,
and I believe warning the user with a WARN log message would also be
helpful here as well.

What do you think?

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-July/350009.html

> If It's true, when I'm afraid that this patch set is far from being
> ready for merge.
> 
> On 25.07.2018 17:19, Tiago Lam wrote:
>> Overview
>> ========
>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>> Multi-segment mbufs are typically used when the size of an mbuf is
>> insufficient to contain the entirety of a packet's data. Instead, the
>> data is split across numerous mbufs, each carrying a portion, or
>> 'segment', of the packet data. Mbufs are chained via their 'next'
>> attribute (an mbuf pointer).
>>
>> The main motivation behind the support for multi-segment mbufs is to
>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>> planned to be introduced after this series.
>>
>> Use Cases
>> =========
>> i.  Handling oversized (guest-originated) frames, which are marked
>>     for hardware accelration/offload (TSO, for example).
>>
>>     Packets which originate from a non-DPDK source may be marked for
>>     offload; as such, they may be larger than the permitted ingress
>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>     order to transmit such packets over a DPDK port, their contents
>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>     its current implementation, that function only copies data into
>>     a single mbuf; if the space available in the mbuf is exhausted,
>>     but not all packet data has been copied, then it is lost.
>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>     whether that mbuf contains multiple segments. Both issues are
>>     resolved within this patchset.
>>
>> ii. Handling jumbo frames.
>>
>>     While OvS already supports jumbo frames, it does so by increasing
>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>     in a single mbuf. This is certainly the preferred, and most
>>     performant approach (and remains the default).
>>
>> Enabling multi-segment mbufs
>> ============================
>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>> user must decide on which approach to adopt on init. The introduction
>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>
>> This is a global boolean value, which determines how jumbo frames are
>> represented across all DPDK ports. In the absence of a user-supplied
>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>> mbufs must be explicitly enabled / single-segment mbufs remain the
>> default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB
>> fields:
>>
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Performance notes (based on v8)
>> =================
>> In order to test for regressions in performance, tests were run on top
>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>> mbufs option enabled and disabled.
>>
>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>
>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>> -------------------------------------------------------------
>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>
>> Packet size is in bytes, while all packet rates are reported in mpps
>> (aggregated).
>>
>> No noticeable regression has been observed (certainly everything is
>> within the ± 5% margin of existing performance), aside from the 64B
>> packet size case when multi-segment mbuf is enabled. This is
>> expected, however, because of how Tx vectoriszed functions are
>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>> use during these tests was the i40e (on a Intel X710 NIC), which
>> indeed doesn't support vectorized Tx functions with multi-segment
>> mbufs.
>>
>> ---
>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>     - Add Ben's proposed fix for automake's warning;
>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>       GRO.
>>
>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>       error paths.");
>>     - Address Darrell's comments:
>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>         tailroom space when no more headroom space is available, and vice-versa.
>>         However, this was breaking the API for the dp_packet_resize__()
>>         function (only called in dp_packet_prealloc_tailroom() and
>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>         for DPDK packets.
>>         Instead, and because this is isolate funtionality, revert to the
>>         previous state where dp_packet_resize__() is not supported for DPDK
>>         packets. Hence, then patch 08/14 has been dropped.
>>     - Additionally, fix the tests that were relying on the removed
>>       functionality.
>>
>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>       after free.");
>>     - Address Eelco's comments:
>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>       - Remove NEWS line added by mistake during rebase (about adding
>>         experimental vhost zero copy support).
>>     - Address Ian's comments:
>>       - Drop patch 01 from previous series entirely;
>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>         used;
>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>         conditionally.
>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>       the shared mempool model (runs out of memory);
>>     - Add new examples for when multi-segment mbufs are enabled in
>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>       (now patch 11/14).
>>
>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>       command."):
>>       - A new patch (now 01/15) has been introduced to differentiate between
>>         MTU and mbuf size when creating the mempool. This is because as part of
>>         the new support for both per port and shared mempools, mempools were
>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>         mbuf size can end up being the same for all mbufs;
>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>         part of the rebase, but only to adapt to the code changes to "Support
>>         both shared and per port mempools.", no functionality should have been
>>         changed.
>>
>> v3:
>>     - Address Eelco's comment:
>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>           first mbuf's buf_len.
>>
>> v2:
>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>       error.");
>>     - Address Eelco's comments:
>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>           casting;
>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>           first mbuf in the chain;
>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>           always true for DPBUF_DPDK packets;
>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>           beyond the first mbuf in the chain.
>>
>> v1:
>>     - v8 should have been sent as v1 really, as that's usually the approach
>>       followed in OvS. That clearly didn't happen, so restarting the series
>>       now. This also helps making it clear it is no longer an RFC series;
>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>       output.");
>>     - Address Eelco's comments:
>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>           available space. Also fix assigment to `data_len` which was
>>           incorrectly being set to just`pkt_len`;
>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>           why the mutex is needed;
>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>           `DPBUF_DPDK` packets;
>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>           appropriate and meaningful errors, instead of just "0" or "1";
>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>           dpdk-multi-seg-mbufs requires a restart;
>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>       mbufs approach;
>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>       successfully sent out;
>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>       minimise changes;
>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>       after further testing;
>>     - Structure changes:
>>       - Drop patch 07/13 which is now unneeded;
>>       - Two more patches added for extra test coverage. This is what accounts
>>         for the increase in size (+1 patch) in the series.
>>
>> v8 (non-RFC): 
>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>       temporary directory (also for RHEL)");
>>     - Address Ciara's and Ilya's comments:
>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>           already existing dp_packet_tail();
>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>           dpdk_prep_tx_buf() was being wrongly checked;
>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>         - Fix some other code style and duplication issues pointed out.
>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>       dp_packet_put*() functions to work within the bounds of existing
>>       mbufs only;
>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>       other mbufs in the chain for chains with more than a single mbuf;
>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>       the header is within the first mbuf, when using mbufs;
>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>       was refactored to use free_dpdk_buf();
>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>       leading to memory corruption; 
>>     - Modified the added tests to account for some of the above changes;
>>     - Run performance tests, compiling results and adding them to the
>>       cover letter;
>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>       together with a "Performance notes" sub-section reflecting the
>>       findings mentioned above in the cover letter.
>>
>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>      - Address Ilya comments;
>>         - Fix non-DPDK build;
>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>           using a newly introduced mutex.
>>      - Add a new set of tests that integrates with the recently added DPDK
>>        testsuite. These focus on allocating dp_packets, with a single or
>>        multiple mbufs, from an instantiated mempool and performing several
>>        operations on those, verifying if the data at the end matches what's
>>        expected;
>>      - Fix bugs found by the new tests:
>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>           and headrooms, ending up calculating the wrong number of segments
>>           that needed allocation;
>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>           dp_packet_mbuf_tail() and dp_packet_end();
>>         - dp_packet_set_size() was not updating the number of chained segments
>>           'nb_segs';
>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>
>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>        reuse.");
>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>        support multi-seg mbufs;
>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>        packet_tailroom().
>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>        mbufs and linking them together;
>>      Restructured patchset:
>>      - Squash patch 5 into patch 6, since they were both related to
>>        copying data while handling multi-seg mbufs;
>>      - Split patch 4 into two separate patches - one that introduces the
>>        changes in helper functions to deal with multi-seg mbufs and
>>        two others for the shift() and put_uninit() functionality;
>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>        before functionality improvement that rely on those helpers.
>>
>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>       from getting redundantly incremented");
>>     - Sugesh's comments have been addressed:
>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>         make them independent of each other;
>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>         size() are independent;
>>       - dp_packet_clone_with_headroom() now has split functions for
>>         handling DPDK sourced packets and non-DPDK packets;
>>     - Modified various functions in dp-packet.h to account for multi-seg
>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>       and dp_packet_at();
>>     - Added support for shifting packet data in multi-seg mbufs, using
>>       dp_packet_shift();
>>     - Fixed some minor inconsistencies.
>>
>>     Note that some of the changes in v5 have been contributed by Mark
>>     Kavanagh as well.
>>
>> v4: - restructure patchset
>>     - account for 128B ARM cacheline when sizing mbufs
>>
>> Mark Kavanagh (4):
>>   netdev-dpdk: fix mbuf sizing
>>   dp-packet: Init specific mbuf fields.
>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>   netdev-dpdk: support multi-segment jumbo frames
>>
>> Michael Qiu (1):
>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>
>> Tiago Lam (8):
>>   dp-packet: Fix allocated size on DPDK init.
>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>
>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>  NEWS                                       |   1 +
>>  lib/dp-packet.c                            | 173 +++++++++-
>>  lib/dp-packet.h                            | 214 ++++++++++--
>>  lib/dpdk.c                                 |   8 +
>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>  lib/netdev-dpdk.h                          |   2 +
>>  tests/automake.mk                          |  10 +-
>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>  tests/system-dpdk-macros.at                |   6 +-
>>  tests/system-dpdk-testsuite.at             |   1 +
>>  tests/system-dpdk.at                       |  65 ++++
>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>  vswitchd/vswitch.xml                       |  22 ++
>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>  create mode 100644 tests/test-dpdk-mbufs.c
>>
Ilya Maximets Aug. 14, 2018, 10:17 a.m. UTC | #7
On 14.08.2018 12:45, Lam, Tiago wrote:
> On 09/08/2018 09:27, Ilya Maximets wrote:
>> Hmm. I found that this series modifies only dpdk related components
>> and doesn't pay any attention to others.
>> dp_packet API was modified to respect the segmented packets, but
>> there are many places, where code uses packet data directly using
>> only dp_packet_data() pointer and dp_packet_size().
>> For example, at least following functions will read the wrong memory
>> and probably generate segfault:
>>
>>   * All the netdev-{linux, dummy, bsd} send functions, because
>>     they are using code like this:
>>
>>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
>>
>>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>>     they're trying to calculate packet checksums using raw calls to
>>     'csum_continue'.
>>
>>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
>>
>>   * Possibly, much more functions.
>>
> 
> Hi Ilya (and All),
> 
> I went through the uses that call dp_packet_data(), and some of the
> cases were covered already, like the "udp_extract_tnl_md" case you
> mention - because the dp_packet API is changed to not let callers pull
> headers after the first segment.

Even if headers are in the first segment, L4 checksums includes checksum
for the *whole packet data*, not only headers/pseudo-headers.
Look at the calls to "csum_continue".

> 
> The other cases I consider all to be of the same category: they all need
> the data contiguous in memory, such as the write() syscall you mention
> above. Do you see any other corner cases that I may be missing?

I made just a quick git grep. There are, possibly, other cases. Need to
make deeper research.

> 
> So, to cover this category, where data is expected to be contiguous in
> memory, we'd need to copy it to a someplace contigous (most likely
> system's memory). The approach I have in mind is to extend the dp_packet
> API in order to add a new function that "linearizes" the multi-segmented
> data. This new function would be called from places such as the write()
> call in netdev-linux you mention above, instead of the call to
> dp_packet_data() in place now, and would convert between a DPBUF_DPDK
> packet into a DPBUF_MALLOC packet. The packet would then continue to be
> used on as a regular DPBUF_MALLOC packet (instead of a DPBUF_DPDK
> packet) which would be properly free'ed and returned back to the DPDK
> mempool's upon dp_packet_delete() / dp_packet_uninit().
> 
> An upside of this approach is that it would also enable us to follow the
> ideia proposed on [1] (by yourself), where we could "easily" tranform a
> DPDK packet into a system packet if no more space exists when calling
> dp_packet_reuse__().

That is OK. One thing is it'll be good to not copy the whole packet, 
if it's not segmented to avoid perfromance drop for the usual cases.
This could be hard to cover with generic API.

> 
> This would enable us to use multi-segment mbufs, and later on TSO,
> between DPDK ports, while still enabling the use cases between DPDK
> ports and non-DPDK ports. For this latter case, there'd be a one-off
> (expensive) operation where system's memory needs to be allocated and
> copied to, but documentation would be further improved to explain this,
> and I believe warning the user with a WARN log message would also be
> helpful here as well.

About TSO, there are a lot more questions. At leas you will have to
implement not only re-assembling, but also splitting the packet in
parts in case of low MTU on ports that doesn't support TSO.
Checksum recalculations also should be covered. I'll add some
comments to you TSO series.

> 
> What do you think?
> 
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-July/350009.html
> 
>> If It's true, when I'm afraid that this patch set is far from being
>> ready for merge.
>>
>> On 25.07.2018 17:19, Tiago Lam wrote:
>>> Overview
>>> ========
>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>>> Multi-segment mbufs are typically used when the size of an mbuf is
>>> insufficient to contain the entirety of a packet's data. Instead, the
>>> data is split across numerous mbufs, each carrying a portion, or
>>> 'segment', of the packet data. Mbufs are chained via their 'next'
>>> attribute (an mbuf pointer).
>>>
>>> The main motivation behind the support for multi-segment mbufs is to
>>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>>> planned to be introduced after this series.
>>>
>>> Use Cases
>>> =========
>>> i.  Handling oversized (guest-originated) frames, which are marked
>>>     for hardware accelration/offload (TSO, for example).
>>>
>>>     Packets which originate from a non-DPDK source may be marked for
>>>     offload; as such, they may be larger than the permitted ingress
>>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>>     order to transmit such packets over a DPDK port, their contents
>>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>>     its current implementation, that function only copies data into
>>>     a single mbuf; if the space available in the mbuf is exhausted,
>>>     but not all packet data has been copied, then it is lost.
>>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>>     whether that mbuf contains multiple segments. Both issues are
>>>     resolved within this patchset.
>>>
>>> ii. Handling jumbo frames.
>>>
>>>     While OvS already supports jumbo frames, it does so by increasing
>>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>>     in a single mbuf. This is certainly the preferred, and most
>>>     performant approach (and remains the default).
>>>
>>> Enabling multi-segment mbufs
>>> ============================
>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>> user must decide on which approach to adopt on init. The introduction
>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>>
>>> This is a global boolean value, which determines how jumbo frames are
>>> represented across all DPDK ports. In the absence of a user-supplied
>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>> default.
>>>
>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>> fields:
>>>
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>
>>> Performance notes (based on v8)
>>> =================
>>> In order to test for regressions in performance, tests were run on top
>>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>>> mbufs option enabled and disabled.
>>>
>>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>>
>>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>>> -------------------------------------------------------------
>>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>>
>>> Packet size is in bytes, while all packet rates are reported in mpps
>>> (aggregated).
>>>
>>> No noticeable regression has been observed (certainly everything is
>>> within the ± 5% margin of existing performance), aside from the 64B
>>> packet size case when multi-segment mbuf is enabled. This is
>>> expected, however, because of how Tx vectoriszed functions are
>>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>>> use during these tests was the i40e (on a Intel X710 NIC), which
>>> indeed doesn't support vectorized Tx functions with multi-segment
>>> mbufs.
>>>
>>> ---
>>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>>     - Add Ben's proposed fix for automake's warning;
>>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>>       GRO.
>>>
>>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>>       error paths.");
>>>     - Address Darrell's comments:
>>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>>         tailroom space when no more headroom space is available, and vice-versa.
>>>         However, this was breaking the API for the dp_packet_resize__()
>>>         function (only called in dp_packet_prealloc_tailroom() and
>>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>>         for DPDK packets.
>>>         Instead, and because this is isolate funtionality, revert to the
>>>         previous state where dp_packet_resize__() is not supported for DPDK
>>>         packets. Hence, then patch 08/14 has been dropped.
>>>     - Additionally, fix the tests that were relying on the removed
>>>       functionality.
>>>
>>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>>       after free.");
>>>     - Address Eelco's comments:
>>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>>       - Remove NEWS line added by mistake during rebase (about adding
>>>         experimental vhost zero copy support).
>>>     - Address Ian's comments:
>>>       - Drop patch 01 from previous series entirely;
>>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>>         used;
>>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>>         conditionally.
>>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>>       the shared mempool model (runs out of memory);
>>>     - Add new examples for when multi-segment mbufs are enabled in
>>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>>       (now patch 11/14).
>>>
>>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>>       command."):
>>>       - A new patch (now 01/15) has been introduced to differentiate between
>>>         MTU and mbuf size when creating the mempool. This is because as part of
>>>         the new support for both per port and shared mempools, mempools were
>>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>>         mbuf size can end up being the same for all mbufs;
>>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>>         part of the rebase, but only to adapt to the code changes to "Support
>>>         both shared and per port mempools.", no functionality should have been
>>>         changed.
>>>
>>> v3:
>>>     - Address Eelco's comment:
>>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>>           first mbuf's buf_len.
>>>
>>> v2:
>>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>>       error.");
>>>     - Address Eelco's comments:
>>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>>           casting;
>>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>>           first mbuf in the chain;
>>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>>           always true for DPBUF_DPDK packets;
>>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>>           beyond the first mbuf in the chain.
>>>
>>> v1:
>>>     - v8 should have been sent as v1 really, as that's usually the approach
>>>       followed in OvS. That clearly didn't happen, so restarting the series
>>>       now. This also helps making it clear it is no longer an RFC series;
>>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>>       output.");
>>>     - Address Eelco's comments:
>>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>>           available space. Also fix assigment to `data_len` which was
>>>           incorrectly being set to just`pkt_len`;
>>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>>           why the mutex is needed;
>>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>>           `DPBUF_DPDK` packets;
>>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>>           appropriate and meaningful errors, instead of just "0" or "1";
>>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>>           dpdk-multi-seg-mbufs requires a restart;
>>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>>       mbufs approach;
>>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>>       successfully sent out;
>>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>>       minimise changes;
>>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>>       after further testing;
>>>     - Structure changes:
>>>       - Drop patch 07/13 which is now unneeded;
>>>       - Two more patches added for extra test coverage. This is what accounts
>>>         for the increase in size (+1 patch) in the series.
>>>
>>> v8 (non-RFC): 
>>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>>       temporary directory (also for RHEL)");
>>>     - Address Ciara's and Ilya's comments:
>>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>>           already existing dp_packet_tail();
>>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>>           dpdk_prep_tx_buf() was being wrongly checked;
>>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>>         - Fix some other code style and duplication issues pointed out.
>>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>>       dp_packet_put*() functions to work within the bounds of existing
>>>       mbufs only;
>>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>>       other mbufs in the chain for chains with more than a single mbuf;
>>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>>       the header is within the first mbuf, when using mbufs;
>>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>>       was refactored to use free_dpdk_buf();
>>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>>       leading to memory corruption; 
>>>     - Modified the added tests to account for some of the above changes;
>>>     - Run performance tests, compiling results and adding them to the
>>>       cover letter;
>>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>>       together with a "Performance notes" sub-section reflecting the
>>>       findings mentioned above in the cover letter.
>>>
>>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>>      - Address Ilya comments;
>>>         - Fix non-DPDK build;
>>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>>           using a newly introduced mutex.
>>>      - Add a new set of tests that integrates with the recently added DPDK
>>>        testsuite. These focus on allocating dp_packets, with a single or
>>>        multiple mbufs, from an instantiated mempool and performing several
>>>        operations on those, verifying if the data at the end matches what's
>>>        expected;
>>>      - Fix bugs found by the new tests:
>>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>>           and headrooms, ending up calculating the wrong number of segments
>>>           that needed allocation;
>>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>>           dp_packet_mbuf_tail() and dp_packet_end();
>>>         - dp_packet_set_size() was not updating the number of chained segments
>>>           'nb_segs';
>>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>>
>>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>>        reuse.");
>>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>>        support multi-seg mbufs;
>>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>>        packet_tailroom().
>>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>>        mbufs and linking them together;
>>>      Restructured patchset:
>>>      - Squash patch 5 into patch 6, since they were both related to
>>>        copying data while handling multi-seg mbufs;
>>>      - Split patch 4 into two separate patches - one that introduces the
>>>        changes in helper functions to deal with multi-seg mbufs and
>>>        two others for the shift() and put_uninit() functionality;
>>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>>        before functionality improvement that rely on those helpers.
>>>
>>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>>       from getting redundantly incremented");
>>>     - Sugesh's comments have been addressed:
>>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>>         make them independent of each other;
>>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>>         size() are independent;
>>>       - dp_packet_clone_with_headroom() now has split functions for
>>>         handling DPDK sourced packets and non-DPDK packets;
>>>     - Modified various functions in dp-packet.h to account for multi-seg
>>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>>       and dp_packet_at();
>>>     - Added support for shifting packet data in multi-seg mbufs, using
>>>       dp_packet_shift();
>>>     - Fixed some minor inconsistencies.
>>>
>>>     Note that some of the changes in v5 have been contributed by Mark
>>>     Kavanagh as well.
>>>
>>> v4: - restructure patchset
>>>     - account for 128B ARM cacheline when sizing mbufs
>>>
>>> Mark Kavanagh (4):
>>>   netdev-dpdk: fix mbuf sizing
>>>   dp-packet: Init specific mbuf fields.
>>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>>   netdev-dpdk: support multi-segment jumbo frames
>>>
>>> Michael Qiu (1):
>>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>>
>>> Tiago Lam (8):
>>>   dp-packet: Fix allocated size on DPDK init.
>>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>>
>>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>>  NEWS                                       |   1 +
>>>  lib/dp-packet.c                            | 173 +++++++++-
>>>  lib/dp-packet.h                            | 214 ++++++++++--
>>>  lib/dpdk.c                                 |   8 +
>>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>>  lib/netdev-dpdk.h                          |   2 +
>>>  tests/automake.mk                          |  10 +-
>>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>>  tests/system-dpdk-macros.at                |   6 +-
>>>  tests/system-dpdk-testsuite.at             |   1 +
>>>  tests/system-dpdk.at                       |  65 ++++
>>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>>  vswitchd/vswitch.xml                       |  22 ++
>>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>>  create mode 100644 tests/test-dpdk-mbufs.c
>>>
> 
>
Lam, Tiago Aug. 14, 2018, 10:46 a.m. UTC | #8
On 14/08/2018 11:17, Ilya Maximets wrote:
> On 14.08.2018 12:45, Lam, Tiago wrote:
>> On 09/08/2018 09:27, Ilya Maximets wrote:
>>> Hmm. I found that this series modifies only dpdk related components
>>> and doesn't pay any attention to others.
>>> dp_packet API was modified to respect the segmented packets, but
>>> there are many places, where code uses packet data directly using
>>> only dp_packet_data() pointer and dp_packet_size().
>>> For example, at least following functions will read the wrong memory
>>> and probably generate segfault:
>>>
>>>   * All the netdev-{linux, dummy, bsd} send functions, because
>>>     they are using code like this:
>>>
>>>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
>>>
>>>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>>>     they're trying to calculate packet checksums using raw calls to
>>>     'csum_continue'.
>>>
>>>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
>>>
>>>   * Possibly, much more functions.
>>>
>>
>> Hi Ilya (and All),
>>
>> I went through the uses that call dp_packet_data(), and some of the
>> cases were covered already, like the "udp_extract_tnl_md" case you
>> mention - because the dp_packet API is changed to not let callers pull
>> headers after the first segment.
> 
> Even if headers are in the first segment, L4 checksums includes checksum
> for the *whole packet data*, not only headers/pseudo-headers.
> Look at the calls to "csum_continue".
> 

I was referring to udp_extract_tnl_md()'s use of
netdev_tnl_ip_extract_tnl_md() to fetch the UDP header, the use of
csum_continue() I kind of include it in the below category where the
whole data is needed contiguously. But you're correct that
csum_continue() is used in udp_extract_tnl_md().

>>
>> The other cases I consider all to be of the same category: they all need
>> the data contiguous in memory, such as the write() syscall you mention
>> above. Do you see any other corner cases that I may be missing?
> 
> I made just a quick git grep. There are, possibly, other cases. Need to
> make deeper research.
>

Thanks, Ilya. I've been doing the same thing now to find this usage of
dp_packet_data(). I'll confirm my findings again and send v8 once I've
the below approach in place.

>>
>> So, to cover this category, where data is expected to be contiguous in
>> memory, we'd need to copy it to a someplace contigous (most likely
>> system's memory). The approach I have in mind is to extend the dp_packet
>> API in order to add a new function that "linearizes" the multi-segmented
>> data. This new function would be called from places such as the write()
>> call in netdev-linux you mention above, instead of the call to
>> dp_packet_data() in place now, and would convert between a DPBUF_DPDK
>> packet into a DPBUF_MALLOC packet. The packet would then continue to be
>> used on as a regular DPBUF_MALLOC packet (instead of a DPBUF_DPDK
>> packet) which would be properly free'ed and returned back to the DPDK
>> mempool's upon dp_packet_delete() / dp_packet_uninit().
>>
>> An upside of this approach is that it would also enable us to follow the
>> ideia proposed on [1] (by yourself), where we could "easily" tranform a
>> DPDK packet into a system packet if no more space exists when calling
>> dp_packet_reuse__().
> 
> That is OK. One thing is it'll be good to not copy the whole packet, 
> if it's not segmented to avoid perfromance drop for the usual cases.
> This could be hard to cover with generic API.
> 

Agreed here. This operation would only be done if the packet is
non-contiguous, otherwise it will remain untouched.

>>
>> This would enable us to use multi-segment mbufs, and later on TSO,
>> between DPDK ports, while still enabling the use cases between DPDK
>> ports and non-DPDK ports. For this latter case, there'd be a one-off
>> (expensive) operation where system's memory needs to be allocated and
>> copied to, but documentation would be further improved to explain this,
>> and I believe warning the user with a WARN log message would also be
>> helpful here as well.
> 
> About TSO, there are a lot more questions. At leas you will have to
> implement not only re-assembling, but also splitting the packet in
> parts in case of low MTU on ports that doesn't support TSO.
> Checksum recalculations also should be covered. I'll add some
> comments to you TSO series.
> 

Thanks, Ilya. I'll look forward to it.

Tiago.

>>
>> What do you think?
>>
>> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-July/350009.html
>>
>>> If It's true, when I'm afraid that this patch set is far from being
>>> ready for merge.
>>>
>>> On 25.07.2018 17:19, Tiago Lam wrote:
>>>> Overview
>>>> ========
>>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>>>> Multi-segment mbufs are typically used when the size of an mbuf is
>>>> insufficient to contain the entirety of a packet's data. Instead, the
>>>> data is split across numerous mbufs, each carrying a portion, or
>>>> 'segment', of the packet data. Mbufs are chained via their 'next'
>>>> attribute (an mbuf pointer).
>>>>
>>>> The main motivation behind the support for multi-segment mbufs is to
>>>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>>>> planned to be introduced after this series.
>>>>
>>>> Use Cases
>>>> =========
>>>> i.  Handling oversized (guest-originated) frames, which are marked
>>>>     for hardware accelration/offload (TSO, for example).
>>>>
>>>>     Packets which originate from a non-DPDK source may be marked for
>>>>     offload; as such, they may be larger than the permitted ingress
>>>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>>>     order to transmit such packets over a DPDK port, their contents
>>>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>>>     its current implementation, that function only copies data into
>>>>     a single mbuf; if the space available in the mbuf is exhausted,
>>>>     but not all packet data has been copied, then it is lost.
>>>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>>>     whether that mbuf contains multiple segments. Both issues are
>>>>     resolved within this patchset.
>>>>
>>>> ii. Handling jumbo frames.
>>>>
>>>>     While OvS already supports jumbo frames, it does so by increasing
>>>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>>>     in a single mbuf. This is certainly the preferred, and most
>>>>     performant approach (and remains the default).
>>>>
>>>> Enabling multi-segment mbufs
>>>> ============================
>>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>>> user must decide on which approach to adopt on init. The introduction
>>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>>>
>>>> This is a global boolean value, which determines how jumbo frames are
>>>> represented across all DPDK ports. In the absence of a user-supplied
>>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>>> default.
>>>>
>>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>>> fields:
>>>>
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>>
>>>> Performance notes (based on v8)
>>>> =================
>>>> In order to test for regressions in performance, tests were run on top
>>>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>>>> mbufs option enabled and disabled.
>>>>
>>>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>>>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>>>
>>>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>>>> -------------------------------------------------------------
>>>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>>>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>>>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>>>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>>>
>>>> Packet size is in bytes, while all packet rates are reported in mpps
>>>> (aggregated).
>>>>
>>>> No noticeable regression has been observed (certainly everything is
>>>> within the ± 5% margin of existing performance), aside from the 64B
>>>> packet size case when multi-segment mbuf is enabled. This is
>>>> expected, however, because of how Tx vectoriszed functions are
>>>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>>>> use during these tests was the i40e (on a Intel X710 NIC), which
>>>> indeed doesn't support vectorized Tx functions with multi-segment
>>>> mbufs.
>>>>
>>>> ---
>>>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>>>     - Add Ben's proposed fix for automake's warning;
>>>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>>>       GRO.
>>>>
>>>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>>>       error paths.");
>>>>     - Address Darrell's comments:
>>>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>>>         tailroom space when no more headroom space is available, and vice-versa.
>>>>         However, this was breaking the API for the dp_packet_resize__()
>>>>         function (only called in dp_packet_prealloc_tailroom() and
>>>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>>>         for DPDK packets.
>>>>         Instead, and because this is isolate funtionality, revert to the
>>>>         previous state where dp_packet_resize__() is not supported for DPDK
>>>>         packets. Hence, then patch 08/14 has been dropped.
>>>>     - Additionally, fix the tests that were relying on the removed
>>>>       functionality.
>>>>
>>>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>>>       after free.");
>>>>     - Address Eelco's comments:
>>>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>>>       - Remove NEWS line added by mistake during rebase (about adding
>>>>         experimental vhost zero copy support).
>>>>     - Address Ian's comments:
>>>>       - Drop patch 01 from previous series entirely;
>>>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>>>         used;
>>>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>>>         conditionally.
>>>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>>>       the shared mempool model (runs out of memory);
>>>>     - Add new examples for when multi-segment mbufs are enabled in
>>>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>>>       (now patch 11/14).
>>>>
>>>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>>>       command."):
>>>>       - A new patch (now 01/15) has been introduced to differentiate between
>>>>         MTU and mbuf size when creating the mempool. This is because as part of
>>>>         the new support for both per port and shared mempools, mempools were
>>>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>>>         mbuf size can end up being the same for all mbufs;
>>>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>>>         part of the rebase, but only to adapt to the code changes to "Support
>>>>         both shared and per port mempools.", no functionality should have been
>>>>         changed.
>>>>
>>>> v3:
>>>>     - Address Eelco's comment:
>>>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>>>           first mbuf's buf_len.
>>>>
>>>> v2:
>>>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>>>       error.");
>>>>     - Address Eelco's comments:
>>>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>>>           casting;
>>>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>>>           first mbuf in the chain;
>>>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>>>           always true for DPBUF_DPDK packets;
>>>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>>>           beyond the first mbuf in the chain.
>>>>
>>>> v1:
>>>>     - v8 should have been sent as v1 really, as that's usually the approach
>>>>       followed in OvS. That clearly didn't happen, so restarting the series
>>>>       now. This also helps making it clear it is no longer an RFC series;
>>>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>>>       output.");
>>>>     - Address Eelco's comments:
>>>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>>>           available space. Also fix assigment to `data_len` which was
>>>>           incorrectly being set to just`pkt_len`;
>>>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>>>           why the mutex is needed;
>>>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>>>           `DPBUF_DPDK` packets;
>>>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>>>           appropriate and meaningful errors, instead of just "0" or "1";
>>>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>>>           dpdk-multi-seg-mbufs requires a restart;
>>>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>>>       mbufs approach;
>>>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>>>       successfully sent out;
>>>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>>>       minimise changes;
>>>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>>>       after further testing;
>>>>     - Structure changes:
>>>>       - Drop patch 07/13 which is now unneeded;
>>>>       - Two more patches added for extra test coverage. This is what accounts
>>>>         for the increase in size (+1 patch) in the series.
>>>>
>>>> v8 (non-RFC): 
>>>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>>>       temporary directory (also for RHEL)");
>>>>     - Address Ciara's and Ilya's comments:
>>>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>>>           already existing dp_packet_tail();
>>>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>>>           dpdk_prep_tx_buf() was being wrongly checked;
>>>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>>>         - Fix some other code style and duplication issues pointed out.
>>>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>>>       dp_packet_put*() functions to work within the bounds of existing
>>>>       mbufs only;
>>>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>>>       other mbufs in the chain for chains with more than a single mbuf;
>>>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>>>       the header is within the first mbuf, when using mbufs;
>>>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>>>       was refactored to use free_dpdk_buf();
>>>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>>>       leading to memory corruption; 
>>>>     - Modified the added tests to account for some of the above changes;
>>>>     - Run performance tests, compiling results and adding them to the
>>>>       cover letter;
>>>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>>>       together with a "Performance notes" sub-section reflecting the
>>>>       findings mentioned above in the cover letter.
>>>>
>>>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>>>      - Address Ilya comments;
>>>>         - Fix non-DPDK build;
>>>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>>>           using a newly introduced mutex.
>>>>      - Add a new set of tests that integrates with the recently added DPDK
>>>>        testsuite. These focus on allocating dp_packets, with a single or
>>>>        multiple mbufs, from an instantiated mempool and performing several
>>>>        operations on those, verifying if the data at the end matches what's
>>>>        expected;
>>>>      - Fix bugs found by the new tests:
>>>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>>>           and headrooms, ending up calculating the wrong number of segments
>>>>           that needed allocation;
>>>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>>>           dp_packet_mbuf_tail() and dp_packet_end();
>>>>         - dp_packet_set_size() was not updating the number of chained segments
>>>>           'nb_segs';
>>>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>>>
>>>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>>>        reuse.");
>>>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>>>        support multi-seg mbufs;
>>>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>>>        packet_tailroom().
>>>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>>>        mbufs and linking them together;
>>>>      Restructured patchset:
>>>>      - Squash patch 5 into patch 6, since they were both related to
>>>>        copying data while handling multi-seg mbufs;
>>>>      - Split patch 4 into two separate patches - one that introduces the
>>>>        changes in helper functions to deal with multi-seg mbufs and
>>>>        two others for the shift() and put_uninit() functionality;
>>>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>>>        before functionality improvement that rely on those helpers.
>>>>
>>>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>>>       from getting redundantly incremented");
>>>>     - Sugesh's comments have been addressed:
>>>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>>>         make them independent of each other;
>>>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>>>         size() are independent;
>>>>       - dp_packet_clone_with_headroom() now has split functions for
>>>>         handling DPDK sourced packets and non-DPDK packets;
>>>>     - Modified various functions in dp-packet.h to account for multi-seg
>>>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>>>       and dp_packet_at();
>>>>     - Added support for shifting packet data in multi-seg mbufs, using
>>>>       dp_packet_shift();
>>>>     - Fixed some minor inconsistencies.
>>>>
>>>>     Note that some of the changes in v5 have been contributed by Mark
>>>>     Kavanagh as well.
>>>>
>>>> v4: - restructure patchset
>>>>     - account for 128B ARM cacheline when sizing mbufs
>>>>
>>>> Mark Kavanagh (4):
>>>>   netdev-dpdk: fix mbuf sizing
>>>>   dp-packet: Init specific mbuf fields.
>>>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>>>   netdev-dpdk: support multi-segment jumbo frames
>>>>
>>>> Michael Qiu (1):
>>>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>>>
>>>> Tiago Lam (8):
>>>>   dp-packet: Fix allocated size on DPDK init.
>>>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>>>
>>>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>>>  NEWS                                       |   1 +
>>>>  lib/dp-packet.c                            | 173 +++++++++-
>>>>  lib/dp-packet.h                            | 214 ++++++++++--
>>>>  lib/dpdk.c                                 |   8 +
>>>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>>>  lib/netdev-dpdk.h                          |   2 +
>>>>  tests/automake.mk                          |  10 +-
>>>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>>>  tests/system-dpdk-macros.at                |   6 +-
>>>>  tests/system-dpdk-testsuite.at             |   1 +
>>>>  tests/system-dpdk.at                       |  65 ++++
>>>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>>>  vswitchd/vswitch.xml                       |  22 ++
>>>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>>>  create mode 100644 tests/test-dpdk-mbufs.c
>>>>
>>
>>