Message ID | 1528734090-220990-1-git-send-email-tiago.lam@intel.com
Series     | Support multi-segment mbufs
On 11 Jun 2018, at 18:21, Tiago Lam wrote:

> Overview
> ========
> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
> Multi-segment mbufs are typically used when the size of an mbuf is
> insufficient to contain the entirety of a packet's data. Instead, the
> data is split across numerous mbufs, each carrying a portion, or
> 'segment', of the packet data. mbufs are chained via their 'next'
> attribute (an mbuf pointer).
>
> Use Cases
> =========
> i.  Handling oversized (guest-originated) frames, which are marked
>     for hardware acceleration/offload (TSO, for example).
>
>     Packets which originate from a non-DPDK source may be marked for
>     offload; as such, they may be larger than the ingress interface's
>     permitted MTU, and may be stored in an oversized dp-packet. In
>     order to transmit such packets over a DPDK port, their contents
>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>     its current implementation, that function only copies data into a
>     single mbuf; if the space available in the mbuf is exhausted, but
>     not all packet data has been copied, then it is lost. Similarly,
>     when cloning a DPDK mbuf, it must be considered whether that mbuf
>     contains multiple segments. Both issues are resolved within this
>     patchset.
>
> ii. Handling jumbo frames.
>
>     While OvS already supports jumbo frames, it does so by increasing
>     mbuf size, such that the entirety of a jumbo frame may be handled
>     in a single mbuf. This is certainly the preferred, and most
>     performant, approach (and remains the default). However, it
>     places high demands on system memory; multi-segment mbufs may be
>     preferable for systems which are memory-constrained.
>
> Enabling multi-segment mbufs
> ============================
> Multi-segment and single-segment mbufs are mutually exclusive, and
> the user must decide on which approach to adopt on init. The
> introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs',
> facilitates this.
> This is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
>
> Setting the field is identical to setting existing DPDK-specific
> OVSDB fields:
>
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>
> Performance notes
> =================
> In order to test for regressions in performance, tests were run on
> top of master 88125d6 and v8 of this patchset, both with the
> multi-segment mbufs option enabled and disabled.
>
> VSperf was used to run the phy2phy_cont and pvp_cont tests with
> varying packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>
> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
> -------------------------------------------------------------
> p2p  |   64 | ~22.7  | ~22.65             | ~18.3
> p2p  | 1500 | ~1.6   | ~1.6               | ~1.6
> p2p  | 7000 | ~0.36  | ~0.36              | ~0.36
> pvp  |   64 | ~6.7   | ~6.7               | ~6.3
> pvp  | 1500 | ~1.6   | ~1.6               | ~1.6
> pvp  | 7000 | ~0.36  | ~0.36              | ~0.36
>
> Packet size is in bytes, while all packet rates are reported in mpps
> (aggregated).
>
> No noticeable regression has been observed (certainly everything is
> within the ± 5% margin of existing performance), aside from the 64B
> packet size case when multi-segment mbufs are enabled. This is
> expected, however, because vectorized Tx functions are incompatible
> with multi-segment mbufs on some PMDs. The PMD in use during these
> tests was the i40e (on an Intel X710 NIC), which indeed doesn't
> support vectorized Tx functions with multi-segment mbufs.
I did some testing using my PVP framework,
https://github.com/chaudron/ovs_perf, and I see the same as above. I
also used an XL710 for these tests.

I reviewed the complete patch-set and will reply on the individual
patches.

//Eelco

> ---
> v8 (non-RFC):
> - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>   temporary directory (also for RHEL)");
> - Address Ciara's and Ilya's comments:
>   - Drop the dp_packet_mbuf_tail() function and use only the already
>     existing dp_packet_tail();
>   - Fix bug in dpdk_do_tx_copy() where the error return from
>     dpdk_prep_tx_buf() was being wrongly checked;
>   - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>     rte_pktmbuf_alloc() and rte_pktmbuf_free();
>   - Fix some other code style and duplication issues pointed out.
> - Refactored dp_packet_shift(), dp_packet_resize__() and
>   dp_packet_put*() functions to work within the bounds of existing
>   mbufs only;
> - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>   other mbufs in the chain for chains with more than a single mbuf;
> - dp_packet layer functions (such as dp_packet_l3()) now check if
>   the header is within the first mbuf, when using mbufs;
> - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>   was refactored to use free_dpdk_buf();
> - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>   leading to memory corruption;
> - Modified the added tests to account for some of the above changes;
> - Run performance tests, compiling results and adding them to the
>   cover letter;
> - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>   together with a "Performance notes" sub-section reflecting the
>   findings mentioned above in the cover letter.
>
> v7:
> - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
> - Address Ilya's comments;
> - Fix non-DPDK build;
> - Serialise the access of non-pmds to allocation and freeing of
>   mbufs by using a newly introduced mutex.
> - Add a new set of tests that integrates with the recently added
>   DPDK testsuite. These focus on allocating dp_packets, with a
>   single or multiple mbufs, from an instantiated mempool and
>   performing several operations on those, verifying if the data at
>   the end matches what's expected;
> - Fix bugs found by the new tests:
>   - dp_packet_shift() wasn't taking into account shift lefts;
>   - dp_packet_resize__() was misusing and miscalculating the
>     tailrooms and headrooms, ending up calculating the wrong number
>     of segments that needed allocation;
>   - An mbuf's end was being miscalculated in dp_packet_tail(),
>     dp_packet_mbuf_tail() and dp_packet_end();
>   - dp_packet_set_size() was not updating the number of chained
>     segments 'nb_segs';
> - Add support for multi-seg mbufs in dp_packet_clear().
>
> v6:
> - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>   reuse.");
> - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>   support multi-seg mbufs;
> - Add support for multi-seg mbufs in dp_packet_l4_size() and improve
>   other helper funcs, such as dp_packet_tail() and
>   dp_packet_tailroom();
> - Add support for multi-seg mbufs in dp_packet_put(),
>   dp_packet_put_zeros(), as well as dp_packet_resize__(), allocating
>   new mbufs and linking them together;
> - Restructured patchset:
>   - Squash patch 5 into patch 6, since they were both related to
>     copying data while handling multi-seg mbufs;
>   - Split patch 4 into two separate patches: one that introduces the
>     changes in helper functions to deal with multi-seg mbufs and two
>     others for the shift() and put_uninit() functionality;
>   - Move patch 4 to before patch 3, so that helper functions come
>     before functionality improvements that rely on those helpers.
> v5:
> - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>   from getting redundantly incremented");
> - Sugesh's comments have been addressed:
>   - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>     make them independent of each other;
>   - Dropped patch 3 now that dp_packet_set_data() and
>     dp_packet_set_size() are independent;
>   - dp_packet_clone_with_headroom() now has split functions for
>     handling DPDK sourced packets and non-DPDK packets;
> - Modified various functions in dp-packet.h to account for multi-seg
>   mbufs: dp_packet_put_uninit(), dp_packet_tail() and
>   dp_packet_at();
> - Added support for shifting packet data in multi-seg mbufs, using
>   dp_packet_shift();
> - Fixed some minor inconsistencies.
>
> Note that some of the changes in v5 have been contributed by Mark
> Kavanagh as well.
>
> v4:
> - restructure patchset
> - account for 128B ARM cacheline when sizing mbufs
>
> Mark Kavanagh (4):
>   netdev-dpdk: fix mbuf sizing
>   dp-packet: Init specific mbuf fields.
>   netdev-dpdk: copy large packet to multi-seg. mbufs
>   netdev-dpdk: support multi-segment jumbo frames
>
> Michael Qiu (1):
>   dp-packet: copy data from multi-seg. DPDK mbuf
>
> Tiago Lam (8):
>   dp-packet: Fix allocated size on DPDK init.
>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>   dp-packet: Fix data_len handling multi-seg mbufs.
>   dp-packet: Handle multi-seg mbufs in helper funcs.
>   dp-packet: Handle multi-seg mbufs in put_uninit().
>   dp-packet: Handle multi-seg mubfs in shift() func.
>   dp-packet: Handle multi-seg mbufs in resize__().
>   dpdk-tests: Add test coverage for multi-seg mbufs.
> Documentation/topics/dpdk/jumbo-frames.rst |  41 +++
> NEWS                                       |   1 +
> lib/dp-packet.c                            | 243 +++++++++++++-
> lib/dp-packet.h                            | 319 ++++++++++++++----
> lib/dpdk.c                                 |   8 +
> lib/netdev-dpdk.c                          | 213 ++++++++++--
> lib/netdev-dpdk.h                          |   2 +
> tests/automake.mk                          |  10 +-
> tests/dpdk-packet-mbufs.at                 |   7 +
> tests/system-dpdk-testsuite.at             |   1 +
> tests/test-dpdk-mbufs.c                    | 518 +++++++++++++++++++++++++++++
> vswitchd/vswitch.xml                       |  20 ++
> 12 files changed, 1273 insertions(+), 110 deletions(-)
> create mode 100644 tests/dpdk-packet-mbufs.at
> create mode 100644 tests/test-dpdk-mbufs.c
>
> --
> 2.7.4
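The 'next'-pointer chaining the cover letter describes can be sketched with a simplified stand-in for DPDK's struct rte_mbuf. The `seg` type and `flatten()` helper below are illustrative only, not the real API (actual code would use the rte_pktmbuf_* accessors); they show the kind of segment-walking copy that a function like dpdk_do_tx_copy() must perform when a packet spans segments:

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct rte_mbuf: each segment holds a
 * portion of the packet's data and points to the next segment. */
struct seg {
    char data[8];        /* per-segment buffer (tiny, for illustration) */
    size_t data_len;     /* bytes of packet data in this segment */
    struct seg *next;    /* next segment in the chain, or NULL */
};

/* Copy a chained packet into one contiguous buffer by walking the
 * 'next' pointers.  Returns the total number of bytes copied; data
 * that doesn't fit in 'dst' is dropped, mirroring the loss the cover
 * letter describes for a single-mbuf copy. */
size_t
flatten(const struct seg *head, char *dst, size_t dst_len)
{
    size_t off = 0;
    for (const struct seg *s = head; s; s = s->next) {
        if (off + s->data_len > dst_len) {
            break;  /* destination exhausted */
        }
        memcpy(dst + off, s->data, s->data_len);
        off += s->data_len;
    }
    return off;
}
```

With two segments holding "jumbo " and "frame", flatten() produces the contiguous string "jumbo frame"; the same walk, done per-segment on real mbufs, is what lets oversized packets survive the copy.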
Hi Eelco,

On 18/06/2018 12:18, Eelco Chaudron wrote:
> On 11 Jun 2018, at 18:21, Tiago Lam wrote:
>
> [snip]
>
>> Performance notes
>> =================
>> [snip]
>
> I did some testing using my PVP framework,
> https://github.com/chaudron/ovs_perf, and I see the same as above. I
> also used an XL710 for these tests.

Thanks for the review! And for confirming the results, that gives some
assurance.

> I reviewed the complete patch-set and will reply on the individual
> patches.
>
> //Eelco

I thought it would be worth giving a more generic explanation here
since some of the comments are around the same point.

This series takes the approach of not allocating new mbufs in
`DPBUF_DPDK` dp_packets.
This is to avoid the case where a dp_packet_put() call, for example,
allocates two mbufs to create enough space, and returns a pointer to
data which isn't contiguous (because it is spread across two mbufs).
Most of the code in OvS that uses the dp_packet API relies on the
assumption that the memory returned is contiguous.

To enforce this, dp_packet_resize__() doesn't allocate any new mbufs,
and only tries to make enough room from the space it already has
available (both in the tailroom and headroom). dp_packet_shift() also
doesn't allow the head to move past the first mbuf (when shifting
right), or the tail to move past the last mbuf (when shifting left).

This is also why some assumptions can be made in the implementation,
such as that the tailroom of a dp_packet will be the tailroom
available in the last mbuf, since it shouldn't be possible to have a
whole free mbuf after the tail.

This is opposed to a `DPBUF_MALLOC` packet, whose memory, when
`DPDK_NETDEV` is defined, is allocated from the system. In many
cases, though, this is the type of packet used internally by OvS when
doing manipulations, since any clone of a received `DPBUF_DPDK`
dp_packet will be allocated in a `DPBUF_MALLOC` (done by
dpdk_do_tx_copy()).

I'll reply to the other emails individually and refer to this
explanation when useful.

Tiago.
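The resize policy described above, growing only within the head/tailroom a packet already owns and failing rather than chaining a new mbuf, can be sketched as follows. This is a simplified single-buffer model with invented names (`buf`, `resize_within`), not the actual dp_packet_resize__() code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a packet buffer with room on either side. */
struct buf {
    size_t headroom;   /* free bytes before the packet data */
    size_t size;       /* bytes of packet data */
    size_t tailroom;   /* free bytes after the packet data */
};

/* Grow the data area by 'need' bytes at the tail, using only the room
 * the buffer already has: tailroom first, then reclaiming headroom by
 * (conceptually) shifting the data toward the head.  Never allocates,
 * so callers are always handed contiguous memory; if the existing
 * room is insufficient, the resize simply fails. */
bool
resize_within(struct buf *b, size_t need)
{
    if (need <= b->tailroom) {
        b->tailroom -= need;
        b->size += need;
        return true;
    }
    if (need <= b->tailroom + b->headroom) {
        /* Shift data left, converting headroom into the missing
         * tailroom; all remaining tail space is then consumed. */
        b->headroom -= need - b->tailroom;
        b->tailroom = 0;
        b->size += need;
        return true;
    }
    return false;   /* not enough existing room; no new mbuf is chained */
}
```

The failing branch is the crux: where a "full" multi-segment implementation would allocate and chain another mbuf (and hand back non-contiguous memory), this policy refuses instead.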
On 22 Jun 2018, at 21:02, Lam, Tiago wrote:

> Hi Eelco,
>
> [snip]
>
> I thought it would be worth giving a more generic explanation here
> since some of the comments are around the same point.
>
> This series takes the approach of not allocating new mbufs in
> `DPBUF_DPDK` dp_packets. This is to avoid the case where a
> dp_packet_put() call, for example, allocates two mbufs to create
> enough space, and returns a pointer to data which isn't contiguous
> (because it is spread across two mbufs). Most of the code in OvS
> that uses the dp_packet API relies on the assumption that the memory
> returned is contiguous.
>
> To enforce this, dp_packet_resize__() doesn't allocate any new
> mbufs, and only tries to make enough room from the space it already
> has available (both in the tailroom and headroom). dp_packet_shift()
> also doesn't allow the head to move past the first mbuf (when
> shifting right), or the tail to move past the last mbuf (when
> shifting left).
>
> This is also why some assumptions can be made in the implementation,
> such as that the tailroom of a dp_packet will be the tailroom
> available in the last mbuf, since it shouldn't be possible to have a
> whole free mbuf after the tail.

Thanks for explaining the details; maybe it's good to explain some of
these details in dp_packet_resize__().

However, for the general function, I think you should not assume that
tailroom is only available in the last mbuf. I think you should stick
as much as possible to the existing mbuf APIs to avoid problems in
the future if people decide to add "full" mbuf support. Most of these
shortcuts do not give any real performance advantages.

> This is opposed to a `DPBUF_MALLOC` packet, whose memory, when
> `DPDK_NETDEV` is defined, is allocated from the system. In many
> cases, though, this is the type of packet used internally by OvS
> when doing manipulations, since any clone of a received `DPBUF_DPDK`
> dp_packet will be allocated in a `DPBUF_MALLOC` (done by
> dpdk_do_tx_copy()).
>
> I'll reply to the other emails individually and refer to this
> explanation when useful.
Thanks, I’ve replied to all the comments in the individual mails. I’ll allocate some time to review your new revision once it’s available. Cheers, Eelco
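The tailroom point raised above, deriving free space from the chain itself rather than assuming where it lives, can be illustrated with a short sketch. In real DPDK code, rte_pktmbuf_lastseg() and rte_pktmbuf_tailroom() serve this role; the `seg` type and helpers below are simplified stand-ins, not the actual API:

```c
#include <stddef.h>

/* Simplified segment: a fixed-size buffer partially filled with data. */
struct seg {
    size_t buf_len;     /* capacity of this segment's buffer */
    size_t data_len;    /* bytes of packet data stored in it */
    struct seg *next;   /* next segment, or NULL for the last one */
};

/* Walk to the last segment, as rte_pktmbuf_lastseg() does for real
 * mbufs, rather than trusting a cached notion of where the tail is. */
static struct seg *
last_seg(struct seg *s)
{
    while (s->next) {
        s = s->next;
    }
    return s;
}

/* Tailroom of the chain: the free space at the end of its last
 * segment, computed from the chain itself. */
size_t
chain_tailroom(struct seg *head)
{
    struct seg *last = last_seg(head);
    return last->buf_len - last->data_len;
}
```

Going through the chain like this costs a short pointer walk, but it stays correct even if a future "full" multi-segment implementation starts leaving room in places the shortcut doesn't expect.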
On 26/06/2018 14:23, Eelco Chaudron wrote:
> On 22 Jun 2018, at 21:02, Lam, Tiago wrote:
>
> [snip]
>
> Thanks for explaining the details; maybe it's good to explain some
> of these details in dp_packet_resize__().
>
> However, for the general function, I think you should not assume
> that tailroom is only available in the last mbuf. I think you should
> stick as much as possible to the existing mbuf APIs to avoid
> problems in the future if people decide to add "full" mbuf support.
> Most of these shortcuts do not give any real performance advantages.
>
> [snip]
>
> Thanks, I've replied to all the comments in the individual mails.
>
> I'll allocate some time to review your new revision once it's
> available.

Thanks a bunch for the reviews, Eelco. This is much appreciated. I
should be able to send the next iteration soon.

To avoid the extra noise I've replied only to the comments which were
more relevant in the series, and I agree with and will work on the
rest.

Tiago.

> Cheers,
>
> Eelco