Message ID | 1531333421-235225-1-git-send-email-tiago.lam@intel.com |
---|---|
Headers | show |
Series | Support multi-segment mbufs | expand |
On 7/11/2018 7:23 PM, Tiago Lam wrote: > Overview > ======== > This patchset introduces support for multi-segment mbufs to OvS-DPDK. > Multi-segment mbufs are typically used when the size of an mbuf is > insufficient to contain the entirety of a packet's data. Instead, the > data is split across numerous mbufs, each carrying a portion, or > 'segment', of the packet data. mbufs are chained via their 'next' > attribute (an mbuf pointer). > Thanks to all for the work on this series. I've pushed this to dpdk_merge and it will be part of the pull request this week. Ian > Use Cases > ========= > i. Handling oversized (guest-originated) frames, which are marked > for hardware accelration/offload (TSO, for example). > > Packets which originate from a non-DPDK source may be marked for > offload; as such, they may be larger than the permitted ingress > interface's MTU, and may be stored in an oversized dp-packet. In > order to transmit such packets over a DPDK port, their contents > must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in > its current implementation, that function only copies data into > a single mbuf; if the space available in the mbuf is exhausted, > but not all packet data has been copied, then it is lost. > Similarly, when cloning a DPDK mbuf, it must be considered > whether that mbuf contains multiple segments. Both issues are > resolved within this patchset. > > ii. Handling jumbo frames. > > While OvS already supports jumbo frames, it does so by increasing > mbuf size, such that the entirety of a jumbo frame may be handled > in a single mbuf. This is certainly the preferred, and most > performant approach (and remains the default). > > Enabling multi-segment mbufs > ============================ > Multi-segment and single-segment mbufs are mutually exclusive, and the > user must decide on which approach to adopt on init. The introduction > of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. > > This is a global boolean value, which determines how jumbo frames are > represented across all DPDK ports. In the absence of a user-supplied > value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment > mbufs must be explicitly enabled / single-segment mbufs remain the > default. > > Setting the field is identical to setting existing DPDK-specific OVSDB > fields: > > ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true > ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10 > ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0 > ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true > > Performance notes (based on v8) > ================= > In order to test for regressions in performance, tests were run on top > of master 88125d6 and v8 of this patchset, both with the multi-segment > mbufs option enabled and disabled. > > VSperf was used to run the phy2phy_cont and pvp_cont tests with varying > packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface. > > Test | Size | Master | Multi-seg disabled | Multi-seg enabled > ------------------------------------------------------------- > p2p | 64 | ~22.7 | ~22.65 | ~18.3 > p2p | 1500 | ~1.6 | ~1.6 | ~1.6 > p2p | 7000 | ~0.36 | ~0.36 | ~0.36 > pvp | 64 | ~6.7 | ~6.7 | ~6.3 > pvp | 1500 | ~1.6 | ~1.6 | ~1.6 > pvp | 7000 | ~0.36 | ~0.36 | ~0.36 > > Packet size is in bytes, while all packet rates are reported in mpps > (aggregated). > > No noticeable regression has been observed (certainly everything is > within the ± 5% margin of existing performance), aside from the 64B > packet size case when multi-segment mbuf is enabled. This is > expected, however, because of how Tx vectoriszed functions are > incompatible with multi-segment mbufs on some PMDs. The PMD under > use during these tests was the i40e (on a Intel X710 NIC), which > indeed doesn't support vectorized Tx functions with multi-segment > mbufs. > > --- > v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use > after free."); > - Address Eelco's comments: > - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a > leftover from rebase. Only call should be in dpdk_mp_get(); > - Remove NEWS line added by mistake during rebase (about adding > experimental vhost zero copy support). > - Address Ian's comments: > - Drop patch 01 from previous series entirely; > - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside > dpdk_mp_create() to get the correct "mbuf_size" to be used; > - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment > mbufs is enabled, in which case it calculates the new "mbuf_size" to be > used; > - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock > conditionally. > - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current > DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using > the shared mempool model (runs out of memory); > - Add new examples for when multi-segment mbufs are enabled in > topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst > (now patch 11/14). > > v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add > command."): > - A new patch (now 01/15) has been introduced to differentiate between > MTU and mbuf size when creating the mempool. This is because as part of > the new support for both per port and shared mempools, mempools were > being reused based on the mbuf size, and for multi-segment mbufs the > mbuf size can end up being the same for all mbufs; > - A couple of other patches (now 02/15 and 12/15) have been modified as > part of the rebase, but only to adapt to the code changes to "Support > both shared and per port mempools.", no functionality should have been > changed. > > v3: > - Address Eelco's comment: > - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which > wasn't correctly asserting that the passed 'v' was smaller than the > first mbuf's buf_len. > > v2: > - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on > error."); > - Address Eelco's comments: > - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of > casting; > - Remove unneeded variable in dp_packet_set_size(), pointing to the > first mbuf in the chain; > - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is > always true for DPBUF_DPDK packets; > - Assert in __packet_set_data() to enforce that data_off never goes > beyond the first mbuf in the chain. > > v1: > - v8 should have been sent as v1 really, as that's usually the approach > followed in OvS. That clearly didn't happen, so restarting the series > now. This also helps making it clear it is no longer an RFC series; > - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags() > output."); > - Address Eelco's comments: > - Change dp_packet_size() so that for `DPBUF_DPDK` packets their > `pkt_len` and `data_len` can't be set to values bigger than the > available space. Also fix assigment to `data_len` which was > incorrectly being set to just`pkt_len`; > - Improve `nonpmd_mp_mutex` comment with a better explanation as to > why the mutex is needed; > - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non > `DPBUF_DPDK` packets; > - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`; > - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for > copying `DPBUF_DPDK` packets' data. Also, change it to return > appropriate and meaningful errors, instead of just "0" or "1"; > - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(), > and reuse dp_packet_mbuf_write() instead of manual copy; > - Add note vswitchd/vswitch.xml to make it clear the enabling of > dpdk-multi-seg-mbufs requires a restart; > - Change dpdk_mp_create() to increase # mbufs used under the multi-segment > mbufs approach; > - Increase test coverage by adding "end-to-end" tests that verify that > "dpdk-multi-seg-mbufs" is disabled by default and that a packet is > successfully sent out; > - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved > back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to > minimise changes; > - Add some extra notes to "Performance notes" in jumbo-frames.rst doc, > after further testing; > - Structure changes: > - Drop patch 07/13 which is now unneeded; > - Two more patches added for extra test coverage. This is what accounts > for the increase in size (+1 patch) in the series. > > v8 (non-RFC): > - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from > temporary directory (also for RHEL)"); > - Address Ciara's and Ilya's comments: > - Drop the dp_packet_mbuf_tail() function and use only the > already existing dp_packet_tail(); > - Fix bug in dpdk_do_tx_copy() where the error return from > dpdk_prep_tx_buf() was being wrongly checked; > - Use dpdk_buf_alloc() and free_dpdk_buf() instead of > rte_pktmbuf_alloc() and rte_pktmbuf_free(); > - Fix some other code style and duplication issues pointed out. > - Refactored dp_packet_shift(), dp_packet_resize__() and > dp_packet_put*() functions to work within the bounds of existing > mbufs only; > - Fix dp_packet_clear() which wasn't correctly clearing / freeing > other mbufs in the chain for chains with more than a single mbuf; > - dp_packet layer functions (such as dp_packet_l3()) now check if > the header is within the first mbuf, when using mbufs; > - Move patch 08/13 to before patch 04/13, since dp_packet_set_size() > was refactored to use free_dpdk_buf(); > - Fix wrong rte_memcpy() when performing dp_packet_clone() which was > leading to memory corruption; > - Modified the added tests to account for some of the above changes; > - Run performance tests, compiling results and adding them to the > cover letter; > - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc, > together with a "Performance notes" sub-section reflecting the > findings mentioned above in the cover letter. > > v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version."); > - Address Ilya comments; > - Fix non-DPDK build; > - Serialise the access of non pmds to allocation and free of mbufs by > using a newly introduced mutex. > - Add a new set of tests that integrates with the recently added DPDK > testsuite. These focus on allocating dp_packets, with a single or > multiple mbufs, from an instantiated mempool and performing several > operations on those, verifying if the data at the end matches what's > expected; > - Fix bugs found by the new tests: > - dp_packet_shift() wasn't taking into account shift lefts; > - dp_packet_resize__() was misusing and miscalculating the tailrooms > and headrooms, ending up calculating the wrong number of segments > that needed allocation; > - An mbuf's end was being miscalculated both in dp_packet_tail, > dp_packet_mbuf_tail() and dp_packet_end(); > - dp_packet_set_size() was not updating the number of chained segments > 'nb_segs'; > - Add support for multi-seg mbufs in dp_packet_clear(). > > v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session > reuse."); > - Further improve dp_packet_put_uninit() and dp_packet_shift() to > support multi-seg mbufs; > - Add support for multi-seg mbufs in dp_packet_l4_size() and > improve other helper funcs, such as dp_packet_tail() and dp_ > packet_tailroom(). > - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_ > put_zeros(), as well as dp_packet_resize__() - allocating new > mbufs and linking them together; > Restructured patchset: > - Squash patch 5 into patch 6, since they were both related to > copying data while handling multi-seg mbufs; > - Split patch 4 into two separate patches - one that introduces the > changes in helper functions to deal with multi-seg mbufs and > two others for the shift() and put_uninit() functionality; > - Move patch 4 to before patch 3, so that ihelper functions come > before functionality improvement that rely on those helpers. > > v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters > from getting redundantly incremented"); > - Sugesh's comments have been addressed: > - Changed dp_packet_set_data() and dp_packet_set_size() logic to > make them independent of each other; > - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_ > size() are independent; > - dp_packet_clone_with_headroom() now has split functions for > handling DPDK sourced packets and non-DPDK packets; > - Modified various functions in dp-packet.h to account for multi-seg > mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail() > and dp_packet_at(); > - Added support for shifting packet data in multi-seg mbufs, using > dp_packet_shift(); > - Fixed some minor inconsistencies. > > Note that some of the changes in v5 have been contributed by Mark > Kavanagh as well. > > v4: - restructure patchset > - account for 128B ARM cacheline when sizing mbufs > > Mark Kavanagh (4): > netdev-dpdk: fix mbuf sizing > dp-packet: Init specific mbuf fields. > netdev-dpdk: copy large packet to multi-seg. mbufs > netdev-dpdk: support multi-segment jumbo frames > > Michael Qiu (1): > dp-packet: copy data from multi-seg. DPDK mbuf > > Tiago Lam (9): > dp-packet: Fix allocated size on DPDK init. > netdev-dpdk: Serialise non-pmds mbufs' alloc/free. > dp-packet: Fix data_len handling multi-seg mbufs. > dp-packet: Handle multi-seg mbufs in helper funcs. > dp-packet: Handle multi-seg mubfs in shift() func. > dp-packet: Handle multi-seg mbufs in resize__(). > dpdk-tests: Add uni-tests for multi-seg mbufs. > dpdk-tests: Accept other configs in OVS_DPDK_START > dpdk-tests: End-to-end tests for multi-seg mbufs. > > Documentation/topics/dpdk/jumbo-frames.rst | 52 +++ > Documentation/topics/dpdk/memory.rst | 36 ++ > NEWS | 1 + > lib/dp-packet.c | 221 +++++++++++- > lib/dp-packet.h | 214 ++++++++++-- > lib/dpdk.c | 8 + > lib/netdev-dpdk.c | 244 +++++++++++--- > lib/netdev-dpdk.h | 2 + > tests/automake.mk | 10 +- > tests/dpdk-packet-mbufs.at | 7 + > tests/system-dpdk-macros.at | 6 +- > tests/system-dpdk-testsuite.at | 1 + > tests/system-dpdk.at | 65 ++++ > tests/test-dpdk-mbufs.c | 518 +++++++++++++++++++++++++++++ > vswitchd/vswitch.xml | 22 ++ > 15 files changed, 1328 insertions(+), 79 deletions(-) > create mode 100644 tests/dpdk-packet-mbufs.at > create mode 100644 tests/test-dpdk-mbufs.c >