Message ID: 1532442358-13122-1-git-send-email-tiago.lam@intel.com
Series: Support multi-segment mbufs
Hi. Just wanted to add some comments on the use cases and testing methodology. See inline.

And I'm actually not sure there is any profit from this patch set. It looks like an internal mbuf handling rework that only degrades performance and complicates the code.

Please don't consider me a merge blocker. I just want to understand why you think we need these 1200 LOC.

---
About the 'resize()'-related discussion: maybe it's worth allowing dp_packet APIs to return a different dp_packet. In that case we'd be able to just clone the packet into malloc'ed memory and resize it when not enough headroom is available. Like:

    packet = eth_push_vlan(packet, vlan->vlan_tpid, vlan->vlan_tci);
or
    eth_push_vlan(&packet, vlan->vlan_tpid, vlan->vlan_tci);

This would have a small performance penalty compared with shifting data inside the mbuf, but it would be much more elegant and would allow eliminating all the OVS_NOT_REACHED cases.

Best regards, Ilya Maximets.

On 24.07.2018 17:25, Tiago Lam wrote:
> Overview
> ========
> This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. mbufs are chained via their 'next' attribute (an mbuf pointer).
>
> Use Cases
> =========
> i. Handling oversized (guest-originated) frames, which are marked for hardware acceleration/offload (TSO, for example).

This is not a real use case, as vhost doesn't support TSO or other offloads that require handling segmented mbufs, so guests are not allowed to send packets larger than the MTU. netdev-linux interfaces are also not allowed to receive packets larger than the MTU.

TSO and other offload support will require much more work, and, in fact, the benefit from it is debatable.
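[As a side note for readers unfamiliar with mbuf chaining: the 'next'-pointer layout described in the cover letter's Overview can be sketched minimally as below. `struct seg` and `chain_read()` are simplified stand-ins, not the real DPDK `rte_mbuf` / `rte_pktmbuf_read()`.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a chained mbuf: each segment holds a slice of
 * the packet data and points at the next segment (NULL-terminated). */
struct seg {
    char data[64];        /* per-segment buffer (real mbufs are larger) */
    size_t data_len;      /* bytes of packet data in this segment */
    struct seg *next;     /* next segment in the chain, or NULL */
};

/* Copy 'len' bytes starting at packet offset 'off' out of a segment
 * chain into 'dst'.  Returns the number of bytes actually copied,
 * which is short if the chain ends early (similar in spirit to
 * rte_pktmbuf_read()). */
size_t
chain_read(const struct seg *s, size_t off, size_t len, char *dst)
{
    size_t copied = 0;

    /* Skip whole segments until 'off' falls inside one. */
    while (s && off >= s->data_len) {
        off -= s->data_len;
        s = s->next;
    }
    while (s && copied < len) {
        size_t avail = s->data_len - off;
        size_t n = avail < len - copied ? avail : len - copied;

        memcpy(dst + copied, s->data + off, n);
        copied += n;
        off = 0;          /* subsequent segments are read from the start */
        s = s->next;
    }
    return copied;
}
```

With two chained segments holding "hello " and "world", a read at offset 6 crosses into the second segment, which is exactly the cross-segment access the series has to get right in helpers like dp_packet_at().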
>
> Packets which originate from a non-DPDK source may be marked for offload; as such, they may be larger than the permitted ingress interface's MTU, and may be stored in an oversized dp-packet. In order to transmit such packets over a DPDK port, their contents must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in its current implementation, that function only copies data into a single mbuf; if the space available in the mbuf is exhausted, but not all packet data has been copied, then it is lost. Similarly, when cloning a DPDK mbuf, it must be considered whether that mbuf contains multiple segments. Both issues are resolved within this patchset.
>
> ii. Handling jumbo frames.

Different internal representation of big packets. Why do we need this?

> While OvS already supports jumbo frames, it does so by increasing the mbuf size, such that the entirety of a jumbo frame may be handled in a single mbuf. This is certainly the preferred, and most performant, approach (and remains the default).
>
> Enabling multi-segment mbufs
> ============================
> Multi-segment and single-segment mbufs are mutually exclusive, and the user must decide on which approach to adopt on init. The introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>
> This is a global boolean value, which determines how jumbo frames are represented across all DPDK ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment mbufs remain the default.
>
> Setting the field is identical to setting existing DPDK-specific OVSDB fields:
>
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>
> Performance notes (based on v8)
> ===============================
> In order to test for regressions in performance, tests were run on top of master 88125d6 and v8 of this patchset, both with the multi-segment mbufs option enabled and disabled.
>
> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.

I'm sorry, but these performance tests and the analysis below are mostly useless. It makes no sense to test 1.5K or 7K byte packets with a 10G interface. OVS is capable of handling this traffic without any drops in almost any configuration. My PVP test with bonded phy easily sustains almost 2x higher traffic rates using 2x10G interfaces in balanced bonding with disabled caches. The table below just shows that you're reaching the limit of the 10G physical NICs.

Just wanted to point that out. Next time, please use better test scenarios.

I expect that enabling multi-segment mbufs will reduce performance in all scenarios by a few percent.

> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
> -------------------------------------------------------------
> p2p  |   64 | ~22.7  | ~22.65             | ~18.3
> p2p  | 1500 |  ~1.6  |  ~1.6              |  ~1.6
> p2p  | 7000 | ~0.36  | ~0.36              | ~0.36
> pvp  |   64 |  ~6.7  |  ~6.7              |  ~6.3
> pvp  | 1500 |  ~1.6  |  ~1.6              |  ~1.6
> pvp  | 7000 | ~0.36  | ~0.36              | ~0.36
>
> Packet size is in bytes, while all packet rates are reported in mpps (aggregated).
>
> No noticeable regression has been observed (certainly everything is within the ±5% margin of existing performance), aside from the 64B packet size case when multi-segment mbufs are enabled. This is expected, however, because of how vectorized Tx functions are incompatible with multi-segment mbufs on some PMDs. The PMD in use during these tests was the i40e (on an Intel X710 NIC), which indeed doesn't support vectorized Tx functions with multi-segment mbufs.
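[For readers following the dpdk_do_tx_copy discussion in the cover letter: the fix — splitting a flat buffer across as many segments as needed instead of truncating after one — can be pictured with this minimal sketch. `struct seg`, `chain_write()` and the 64-byte `SEG_CAP` are illustrative toys, not OVS or DPDK code.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { SEG_CAP = 64 };          /* toy per-segment capacity */

/* Simplified fixed-capacity segment, chained via 'next'. */
struct seg {
    char data[SEG_CAP];
    size_t data_len;            /* bytes of packet data in this segment */
    struct seg *next;
};

/* Build a chain of segments holding 'len' bytes from 'src', splitting
 * across as many segments as needed.  A single-mbuf copy would stop
 * after SEG_CAP bytes and silently drop the rest -- the failure mode
 * the cover letter describes.  Returns the head, or NULL on failure
 * (cleanup of partially built chains is omitted in this sketch). */
struct seg *
chain_write(const char *src, size_t len)
{
    struct seg *head = NULL;
    struct seg **tail = &head;

    do {
        struct seg *s = calloc(1, sizeof *s);

        if (!s) {
            return NULL;
        }
        s->data_len = len < SEG_CAP ? len : SEG_CAP;
        memcpy(s->data, src, s->data_len);
        src += s->data_len;
        len -= s->data_len;
        *tail = s;
        tail = &s->next;
    } while (len > 0);
    return head;
}

/* Number of segments in a chain (analogous to an mbuf's 'nb_segs'). */
size_t
chain_nsegs(const struct seg *s)
{
    size_t n = 0;

    for (; s; s = s->next) {
        n++;
    }
    return n;
}
```

A 150-byte payload lands in three segments (64 + 64 + 22), which also hints at why the series has to keep `nb_segs` and per-segment `data_len` consistent in dp_packet_set_size().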
>
> ---
> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser error paths.");
>     - Address Darrell's comments:
>       - The changes in dp_packet_resize__() were trying to alleviate the call to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available tailroom space when no more headroom space is available, and vice-versa. However, this was breaking the API for the dp_packet_resize__() function (only called in dp_packet_prealloc_tailroom() and dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes for DPDK packets. Instead, and because this is isolated functionality, revert to the previous state where dp_packet_resize__() is not supported for DPDK packets. Hence, patch 08/14 has been dropped.
>       - Additionally, fix the tests that were relying on the removed functionality.
>
> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use after free.");
>     - Address Eelco's comments:
>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a leftover from rebase. The only call should be in dpdk_mp_get();
>       - Remove NEWS line added by mistake during rebase (about adding experimental vhost zero copy support).
>     - Address Ian's comments:
>       - Drop patch 01 from the previous series entirely;
>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside dpdk_mp_create() to get the correct "mbuf_size" to be used;
>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment mbufs is enabled, in which case it calculates the new "mbuf_size" to be used;
>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock conditionally.
>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx", as the current DPDK setup in system-dpdk-testsuite can't handle higher MTU sizes using the shared mempool model (it runs out of memory);
>     - Add new examples for when multi-segment mbufs are enabled in topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst (now patch 11/14).
>
> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add command."):
>     - A new patch (now 01/15) has been introduced to differentiate between MTU and mbuf size when creating the mempool. This is because, as part of the new support for both per-port and shared mempools, mempools were being reused based on the mbuf size, and for multi-segment mbufs the mbuf size can end up being the same for all mbufs;
>     - A couple of other patches (now 02/15 and 12/15) have been modified as part of the rebase, but only to adapt to the code changes of "Support both shared and per port mempools."; no functionality should have been changed.
>
> v3:
>     - Address Eelco's comment:
>       - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which wasn't correctly asserting that the passed 'v' was smaller than the first mbuf's buf_len.
>
> v2:
>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on error.");
>     - Address Eelco's comments:
>       - In free_dpdk_buf(), use the mbuf's struct address in dp_packet instead of casting;
>       - Remove unneeded variable in dp_packet_set_size(), pointing to the first mbuf in the chain;
>       - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is always true for DPBUF_DPDK packets;
>       - Assert in __packet_set_data() to enforce that data_off never goes beyond the first mbuf in the chain.
>
> v1:
>     - v8 should have been sent as v1 really, as that's usually the approach followed in OvS. That clearly didn't happen, so restarting the series now. This also helps make it clear it is no longer an RFC series;
>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags() output.");
>     - Address Eelco's comments:
>       - Change dp_packet_size() so that for `DPBUF_DPDK` packets their `pkt_len` and `data_len` can't be set to values bigger than the available space. Also fix the assignment to `data_len`, which was incorrectly being set to just `pkt_len`;
>       - Improve the `nonpmd_mp_mutex` comment with a better explanation as to why the mutex is needed;
>       - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non `DPBUF_DPDK` packets;
>       - Dropped the `if` clause in dp_packet_l4_size(), keeping just the `else`;
>       - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for copying `DPBUF_DPDK` packets' data. Also, change it to return appropriate and meaningful errors, instead of just "0" or "1";
>       - Rename dpdk_prep_tx_buf() to dpdk_clone_dp_packet_to_mbuf(), and reuse dp_packet_mbuf_write() instead of a manual copy;
>       - Add a note to vswitchd/vswitch.xml to make it clear that enabling dpdk-multi-seg-mbufs requires a restart;
>       - Change dpdk_mp_create() to increase the number of mbufs used under the multi-segment mbufs approach;
>       - Increase test coverage by adding "end-to-end" tests that verify that "dpdk-multi-seg-mbufs" is disabled by default and that a packet is successfully sent out;
>       - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to minimise changes;
>       - Add some extra notes to "Performance notes" in the jumbo-frames.rst doc, after further testing;
>     - Structure changes:
>       - Drop patch 07/13, which is now unneeded;
>       - Two more patches added for extra test coverage. This is what accounts for the increase in size (+1 patch) in the series.
>
> v8 (non-RFC):
>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from temporary directory (also for RHEL)");
>     - Address Ciara's and Ilya's comments:
>       - Drop the dp_packet_mbuf_tail() function and use only the already existing dp_packet_tail();
>       - Fix bug in dpdk_do_tx_copy() where the error return from dpdk_prep_tx_buf() was being wrongly checked;
>       - Use dpdk_buf_alloc() and free_dpdk_buf() instead of rte_pktmbuf_alloc() and rte_pktmbuf_free();
>       - Fix some other code style and duplication issues pointed out.
>     - Refactored dp_packet_shift(), dp_packet_resize__() and dp_packet_put*() functions to work within the bounds of existing mbufs only;
>     - Fix dp_packet_clear(), which wasn't correctly clearing / freeing other mbufs in the chain for chains with more than a single mbuf;
>     - dp_packet layer functions (such as dp_packet_l3()) now check if the header is within the first mbuf, when using mbufs;
>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size() was refactored to use free_dpdk_buf();
>     - Fix wrong rte_memcpy() when performing dp_packet_clone(), which was leading to memory corruption;
>     - Modified the added tests to account for some of the above changes;
>     - Run performance tests, compiling the results and adding them to the cover letter;
>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc, together with a "Performance notes" sub-section reflecting the findings mentioned above in the cover letter.
>
> v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>     - Address Ilya's comments;
>     - Fix non-DPDK build;
>     - Serialise non-pmd threads' access to mbuf allocation and free by using a newly introduced mutex.
>     - Add a new set of tests that integrates with the recently added DPDK testsuite. These focus on allocating dp_packets, with a single or multiple mbufs, from an instantiated mempool and performing several operations on those, verifying that the data at the end matches what's expected;
>     - Fix bugs found by the new tests:
>       - dp_packet_shift() wasn't taking left shifts into account;
>       - dp_packet_resize__() was misusing and miscalculating the tailrooms and headrooms, ending up calculating the wrong number of segments that needed allocation;
>       - An mbuf's end was being miscalculated in dp_packet_tail(), dp_packet_mbuf_tail() and dp_packet_end();
>       - dp_packet_set_size() was not updating the number of chained segments, 'nb_segs';
>     - Add support for multi-seg mbufs in dp_packet_clear().
>
> v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session reuse.");
>     - Further improve dp_packet_put_uninit() and dp_packet_shift() to support multi-seg mbufs;
>     - Add support for multi-seg mbufs in dp_packet_l4_size() and improve other helper funcs, such as dp_packet_tail() and dp_packet_tailroom().
>     - Add support for multi-seg mbufs in dp_packet_put() and dp_packet_put_zeros(), as well as dp_packet_resize__() - allocating new mbufs and linking them together;
>     - Restructured patchset:
>       - Squash patch 5 into patch 6, since they were both related to copying data while handling multi-seg mbufs;
>       - Split patch 4 into two separate patches - one that introduces the changes in helper functions to deal with multi-seg mbufs and two others for the shift() and put_uninit() functionality;
>       - Move patch 4 to before patch 3, so that helper functions come before the functionality improvements that rely on those helpers.
>
> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters from getting redundantly incremented");
>     - Sugesh's comments have been addressed:
>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to make them independent of each other;
>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_size() are independent;
>       - dp_packet_clone_with_headroom() now has split functions for handling DPDK sourced packets and non-DPDK packets;
>     - Modified various functions in dp-packet.h to account for multi-seg mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail() and dp_packet_at();
>     - Added support for shifting packet data in multi-seg mbufs, using dp_packet_shift();
>     - Fixed some minor inconsistencies.
>
> Note that some of the changes in v5 have been contributed by Mark Kavanagh as well.
>
> v4: - restructure patchset
>     - account for 128B ARM cacheline when sizing mbufs
>
> Mark Kavanagh (4):
>   netdev-dpdk: fix mbuf sizing
>   dp-packet: Init specific mbuf fields.
>   netdev-dpdk: copy large packet to multi-seg. mbufs
>   netdev-dpdk: support multi-segment jumbo frames
>
> Michael Qiu (1):
>   dp-packet: copy data from multi-seg. DPDK mbuf
>
> Tiago Lam (8):
>   dp-packet: Fix allocated size on DPDK init.
>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>   dp-packet: Fix data_len handling multi-seg mbufs.
>   dp-packet: Handle multi-seg mbufs in helper funcs.
>   dp-packet: Handle multi-seg mubfs in shift() func.
>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>   dpdk-tests: Accept other configs in OVS_DPDK_START
>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>
>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>  Documentation/topics/dpdk/memory.rst       |  36 ++
>  NEWS                                       |   1 +
>  lib/dp-packet.c                            | 173 +++++++++-
>  lib/dp-packet.h                            | 214 ++++++++++--
>  lib/dpdk.c                                 |   8 +
>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>  lib/netdev-dpdk.h                          |   2 +
>  tests/automake.mk                          |  10 +-
>  tests/dpdk-packet-mbufs.at                 |   7 +
>  tests/system-dpdk-macros.at                |   6 +-
>  tests/system-dpdk-testsuite.at             |   1 +
>  tests/system-dpdk.at                       |  65 ++++
>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>  vswitchd/vswitch.xml                       |  22 ++
>  15 files changed, 1277 insertions(+), 77 deletions(-)
>  create mode 100644 tests/dpdk-packet-mbufs.at
>  create mode 100644 tests/test-dpdk-mbufs.c
>
On Tue, Jul 24, 2018 at 07:55:39PM +0300, Ilya Maximets wrote:
> Hi. Just wanted to add some comments on the use cases and testing methodology. See inline.
>
> And I'm actually not sure there is any profit from this patch set. It looks like an internal mbuf handling rework that only degrades performance and complicates the code.
>
> Please don't consider me a merge blocker. I just want to understand why you think we need these 1200 LOC.
>
> ---
> About the 'resize()'-related discussion: maybe it's worth allowing dp_packet APIs to return a different dp_packet. In that case we'd be able to just clone the packet into malloc'ed memory and resize it when not enough headroom is available. Like:
>
>     packet = eth_push_vlan(packet, vlan->vlan_tpid, vlan->vlan_tci);
> or
>     eth_push_vlan(&packet, vlan->vlan_tpid, vlan->vlan_tci);
>
> This would have a small performance penalty compared with shifting data inside the mbuf, but it would be much more elegant and would allow eliminating all the OVS_NOT_REACHED cases.
>
> Best regards, Ilya Maximets.

Hmm, this is the second person from whom I've heard serious misgivings about this patch series. Tiago, Ian, would you like to respond? I'm a little nervous about merging this patch series, especially relatively late before branching, given that some people have technical objections to it.

Thanks,

Ben.
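[For context on the 'resize()' discussion quoted above: the proposed calling convention — the caller's packet pointer may be replaced when there isn't enough headroom, instead of hitting OVS_NOT_REACHED() — could be sketched as below. `struct pkt` and `pkt_push()` are hypothetical illustrations, not the actual OVS dp_packet API.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy packet: a flat buffer with headroom in front of the data. */
struct pkt {
    char *base;       /* start of the allocated buffer */
    char *data;       /* start of packet data (headroom = data - base) */
    size_t len;       /* bytes of packet data */
    size_t alloc;     /* total buffer size */
};

/* Prepend 'n' bytes from 'hdr'.  If there is not enough headroom, clone
 * the packet into a larger malloc'ed buffer and swap the caller's
 * pointer -- the eth_push_vlan(&packet, ...) shape Ilya suggests.
 * Returns 0 on success, -1 on allocation failure. */
int
pkt_push(struct pkt **pp, const void *hdr, size_t n)
{
    struct pkt *p = *pp;

    if ((size_t) (p->data - p->base) < n) {
        /* Not enough headroom: clone into a bigger buffer that leaves
         * exactly 'n' bytes of fresh headroom, then replace *pp. */
        struct pkt *q = malloc(sizeof *q);

        if (!q) {
            return -1;
        }
        q->alloc = p->len + n;
        q->base = malloc(q->alloc);
        if (!q->base) {
            free(q);
            return -1;
        }
        q->data = q->base + n;
        q->len = p->len;
        memcpy(q->data, p->data, p->len);
        free(p->base);
        free(p);
        *pp = q;
        p = q;
    }
    p->data -= n;
    p->len += n;
    memcpy(p->data, hdr, n);
    return 0;
}
```

The trade-off matches the quoted argument: the clone path costs an allocation and a copy, but the caller never has to handle an "impossible" no-headroom case.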
Hi Ilya,

Given the timeline, I'm replying in-line to the critical parts of the email (the ones that question the feature itself). I'll come back and reply to the rest later.

On 24/07/2018 17:55, Ilya Maximets wrote:
> Hi. Just wanted to add some comments on the use cases and testing methodology. See inline.
>
> And I'm actually not sure there is any profit from this patch set. It looks like an internal mbuf handling rework that only degrades performance and complicates the code.
>
> Please don't consider me a merge blocker. I just want to understand why you think we need these 1200 LOC.
>

Of which about half are related to tests. Granted, it is still LOC, but it doesn't directly impact the dp-packet/netdev-dpdk modules. I believe that distinction is important.

> ---
> About the 'resize()'-related discussion: maybe it's worth allowing dp_packet APIs to return a different dp_packet. In that case we'd be able to just clone the packet into malloc'ed memory and resize it when not enough headroom is available. Like:
>
>     packet = eth_push_vlan(packet, vlan->vlan_tpid, vlan->vlan_tci);
> or
>     eth_push_vlan(&packet, vlan->vlan_tpid, vlan->vlan_tci);
>
> This would have a small performance penalty compared with shifting data inside the mbuf, but it would be much more elegant and would allow eliminating all the OVS_NOT_REACHED cases.
>
> Best regards, Ilya Maximets.
>
> On 24.07.2018 17:25, Tiago Lam wrote:
>> Overview
>> ========
>> This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. mbufs are chained via their 'next' attribute (an mbuf pointer).
>>
>> Use Cases
>> =========
>> i. Handling oversized (guest-originated) frames, which are marked for hardware acceleration/offload (TSO, for example).
>
> This is not a real use case, as vhost doesn't support TSO or other offloads that require handling segmented mbufs, so guests are not allowed to send packets larger than the MTU. netdev-linux interfaces are also not allowed to receive packets larger than the MTU.
>
> TSO and other offload support will require much more work, and, in fact, the benefit from it is debatable.
>

There might be some context missing here. As you mentioned, TSO is not possible at the moment with OvS-DPDK. However, we are working on bringing TSO / GRO to OvS-DPDK, and this serves as preparatory work to enable those features. These are features some customers are asking for, and they close a feature gap between OvS and OvS-DPDK.

>>
>> Packets which originate from a non-DPDK source may be marked for offload; as such, they may be larger than the permitted ingress interface's MTU, and may be stored in an oversized dp-packet. In order to transmit such packets over a DPDK port, their contents must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in its current implementation, that function only copies data into a single mbuf; if the space available in the mbuf is exhausted, but not all packet data has been copied, then it is lost. Similarly, when cloning a DPDK mbuf, it must be considered whether that mbuf contains multiple segments. Both issues are resolved within this patchset.
>>
>> ii. Handling jumbo frames.
>
> Different internal representation of big packets. Why do we need this?
>

I believe the above answer should cover this as well.

>>
>> While OvS already supports jumbo frames, it does so by increasing the mbuf size, such that the entirety of a jumbo frame may be handled in a single mbuf. This is certainly the preferred, and most performant, approach (and remains the default).
>>
>> Enabling multi-segment mbufs
>> ============================
>> Multi-segment and single-segment mbufs are mutually exclusive, and the user must decide on which approach to adopt on init. The introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>
>> This is a global boolean value, which determines how jumbo frames are represented across all DPDK ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment mbufs remain the default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB fields:
>>
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Performance notes (based on v8)
>> ===============================
>> In order to test for regressions in performance, tests were run on top of master 88125d6 and v8 of this patchset, both with the multi-segment mbufs option enabled and disabled.
>>
>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>
> I'm sorry, but these performance tests and the analysis below are mostly useless. It makes no sense to test 1.5K or 7K byte packets with a 10G interface. OVS is capable of handling this traffic without any drops in almost any configuration. My PVP test with bonded phy easily sustains almost 2x higher traffic rates using 2x10G interfaces in balanced bonding with disabled caches. The table below just shows that you're reaching the limit of the 10G physical NICs.
>
> Just wanted to point that out. Next time, please use better test scenarios.
>
> I expect that enabling multi-segment mbufs will reduce performance in all scenarios by a few percent.
> >> Test | Size | Master | Multi-seg disabled | Multi-seg enabled >> ------------------------------------------------------------- >> p2p | 64 | ~22.7 | ~22.65 | ~18.3 >> p2p | 1500 | ~1.6 | ~1.6 | ~1.6 >> p2p | 7000 | ~0.36 | ~0.36 | ~0.36 >> pvp | 64 | ~6.7 | ~6.7 | ~6.3 >> pvp | 1500 | ~1.6 | ~1.6 | ~1.6 >> pvp | 7000 | ~0.36 | ~0.36 | ~0.36 >> >> Packet size is in bytes, while all packet rates are reported in mpps >> (aggregated). >> >> No noticeable regression has been observed (certainly everything is >> within the ± 5% margin of existing performance), aside from the 64B >> packet size case when multi-segment mbuf is enabled. This is >> expected, however, because of how Tx vectoriszed functions are >> incompatible with multi-segment mbufs on some PMDs. The PMD under >> use during these tests was the i40e (on a Intel X710 NIC), which >> indeed doesn't support vectorized Tx functions with multi-segment >> mbufs. >> >> --- >> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser >> error paths."); >> - Address Darrell's comments: >> - The changes in dp_packet_resize__() were trying to alleviate the call >> to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available >> tailroom space when no more headroom space is available, and vice-versa. >> However, this was breaking the API for the dp_packet_resize__() >> function (only called in dp_packet_prealloc_tailroom() and >> dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes >> for DPDK packets. >> Instead, and because this is isolate funtionality, revert to the >> previous state where dp_packet_resize__() is not supported for DPDK >> packets. Hence, then patch 08/14 has been dropped. >> - Additionally, fix the tests that were relying on the removed >> functionality. 
>> >> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use >> after free."); >> - Address Eelco's comments: >> - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a >> leftover from rebase. Only call should be in dpdk_mp_get(); >> - Remove NEWS line added by mistake during rebase (about adding >> experimental vhost zero copy support). >> - Address Ian's comments: >> - Drop patch 01 from previous series entirely; >> - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside >> dpdk_mp_create() to get the correct "mbuf_size" to be used; >> - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment >> mbufs is enabled, in which case it calculates the new "mbuf_size" to be >> used; >> - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock >> conditionally. >> - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current >> DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using >> the shared mempool model (runs out of memory); >> - Add new examples for when multi-segment mbufs are enabled in >> topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst >> (now patch 11/14). >> >> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add >> command."): >> - A new patch (now 01/15) has been introduced to differentiate between >> MTU and mbuf size when creating the mempool. This is because as part of >> the new support for both per port and shared mempools, mempools were >> being reused based on the mbuf size, and for multi-segment mbufs the >> mbuf size can end up being the same for all mbufs; >> - A couple of other patches (now 02/15 and 12/15) have been modified as >> part of the rebase, but only to adapt to the code changes to "Support >> both shared and per port mempools.", no functionality should have been >> changed. 
>> >> v3: >> - Address Eelco's comment: >> - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which >> wasn't correctly asserting that the passed 'v' was smaller than the >> first mbuf's buf_len. >> >> v2: >> - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on >> error."); >> - Address Eelco's comments: >> - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of >> casting; >> - Remove unneeded variable in dp_packet_set_size(), pointing to the >> first mbuf in the chain; >> - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is >> always true for DPBUF_DPDK packets; >> - Assert in __packet_set_data() to enforce that data_off never goes >> beyond the first mbuf in the chain. >> >> v1: >> - v8 should have been sent as v1 really, as that's usually the approach >> followed in OvS. That clearly didn't happen, so restarting the series >> now. This also helps making it clear it is no longer an RFC series; >> - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags() >> output."); >> - Address Eelco's comments: >> - Change dp_packet_size() so that for `DPBUF_DPDK` packets their >> `pkt_len` and `data_len` can't be set to values bigger than the >> available space. Also fix assigment to `data_len` which was >> incorrectly being set to just`pkt_len`; >> - Improve `nonpmd_mp_mutex` comment with a better explanation as to >> why the mutex is needed; >> - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non >> `DPBUF_DPDK` packets; >> - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`; >> - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for >> copying `DPBUF_DPDK` packets' data. 
Also, change it to return >> appropriate and meaningful errors, instead of just "0" or "1"; >> - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(), >> and reuse dp_packet_mbuf_write() instead of manual copy; >> - Add note vswitchd/vswitch.xml to make it clear the enabling of >> dpdk-multi-seg-mbufs requires a restart; >> - Change dpdk_mp_create() to increase # mbufs used under the multi-segment >> mbufs approach; >> - Increase test coverage by adding "end-to-end" tests that verify that >> "dpdk-multi-seg-mbufs" is disabled by default and that a packet is >> successfully sent out; >> - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved >> back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to >> minimise changes; >> - Add some extra notes to "Performance notes" in jumbo-frames.rst doc, >> after further testing; >> - Structure changes: >> - Drop patch 07/13 which is now unneeded; >> - Two more patches added for extra test coverage. This is what accounts >> for the increase in size (+1 patch) in the series. >> >> v8 (non-RFC): >> - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from >> temporary directory (also for RHEL)"); >> - Address Ciara's and Ilya's comments: >> - Drop the dp_packet_mbuf_tail() function and use only the >> already existing dp_packet_tail(); >> - Fix bug in dpdk_do_tx_copy() where the error return from >> dpdk_prep_tx_buf() was being wrongly checked; >> - Use dpdk_buf_alloc() and free_dpdk_buf() instead of >> rte_pktmbuf_alloc() and rte_pktmbuf_free(); >> - Fix some other code style and duplication issues pointed out. 
>> - Refactored dp_packet_shift(), dp_packet_resize__() and >> dp_packet_put*() functions to work within the bounds of existing >> mbufs only; >> - Fix dp_packet_clear() which wasn't correctly clearing / freeing >> other mbufs in the chain for chains with more than a single mbuf; >> - dp_packet layer functions (such as dp_packet_l3()) now check if >> the header is within the first mbuf, when using mbufs; >> - Move patch 08/13 to before patch 04/13, since dp_packet_set_size() >> was refactored to use free_dpdk_buf(); >> - Fix wrong rte_memcpy() when performing dp_packet_clone() which was >> leading to memory corruption; >> - Modified the added tests to account for some of the above changes; >> - Run performance tests, compiling results and adding them to the >> cover letter; >> - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc, >> together with a "Performance notes" sub-section reflecting the >> findings mentioned above in the cover letter. >> >> v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version."); >> - Address Ilya comments; >> - Fix non-DPDK build; >> - Serialise the access of non pmds to allocation and free of mbufs by >> using a newly introduced mutex. >> - Add a new set of tests that integrates with the recently added DPDK >> testsuite. 
>> These focus on allocating dp_packets, with a single or
>>   multiple mbufs, from an instantiated mempool and performing several
>>   operations on those, verifying that the data at the end matches
>>   what's expected;
>> - Fix bugs found by the new tests:
>>   - dp_packet_shift() wasn't taking into account shift lefts;
>>   - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>     and headrooms, ending up calculating the wrong number of segments
>>     that needed allocation;
>>   - An mbuf's end was being miscalculated in dp_packet_tail(),
>>     dp_packet_mbuf_tail() and dp_packet_end();
>>   - dp_packet_set_size() was not updating the number of chained
>>     segments, 'nb_segs';
>> - Add support for multi-seg mbufs in dp_packet_clear().
>>
>> v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>   reuse.");
>> - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>   support multi-seg mbufs;
>> - Add support for multi-seg mbufs in dp_packet_l4_size() and improve
>>   other helper funcs, such as dp_packet_tail() and
>>   dp_packet_tailroom();
>> - Add support for multi-seg mbufs in dp_packet_put() and
>>   dp_packet_put_zeros(), as well as dp_packet_resize__() - allocating
>>   new mbufs and linking them together;
>> Restructured patchset:
>> - Squash patch 5 into patch 6, since they were both related to copying
>>   data while handling multi-seg mbufs;
>> - Split patch 4 into two separate patches - one that introduces the
>>   changes in helper functions to deal with multi-seg mbufs and two
>>   others for the shift() and put_uninit() functionality;
>> - Move patch 4 to before patch 3, so that helper functions come before
>>   the functionality improvements that rely on those helpers.
>>
>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>   from getting redundantly incremented");
>> - Sugesh's comments have been addressed:
>>   - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>     make them independent of each other;
>>   - Dropped patch 3 now that dp_packet_set_data() and
>>     dp_packet_set_size() are independent;
>>   - dp_packet_clone_with_headroom() now has split functions for
>>     handling DPDK-sourced packets and non-DPDK packets;
>>   - Modified various functions in dp-packet.h to account for multi-seg
>>     mbufs - dp_packet_put_uninit(), dp_packet_tail() and
>>     dp_packet_at();
>>   - Added support for shifting packet data in multi-seg mbufs, using
>>     dp_packet_shift();
>>   - Fixed some minor inconsistencies.
>>
>> Note that some of the changes in v5 have been contributed by Mark
>> Kavanagh as well.
>>
>> v4: - restructure patchset
>>     - account for 128B ARM cacheline when sizing mbufs
>>
>> Mark Kavanagh (4):
>>   netdev-dpdk: fix mbuf sizing
>>   dp-packet: Init specific mbuf fields.
>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>   netdev-dpdk: support multi-segment jumbo frames
>>
>> Michael Qiu (1):
>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>
>> Tiago Lam (8):
>>   dp-packet: Fix allocated size on DPDK init.
>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>
>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>  NEWS                                       |   1 +
>>  lib/dp-packet.c                            | 173 +++++++++-
>>  lib/dp-packet.h                            | 214 ++++++++++--
>>  lib/dpdk.c                                 |   8 +
>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>  lib/netdev-dpdk.h                          |   2 +
>>  tests/automake.mk                          |  10 +-
>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>  tests/system-dpdk-macros.at                |   6 +-
>>  tests/system-dpdk-testsuite.at             |   1 +
>>  tests/system-dpdk.at                       |  65 ++++
>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>  vswitchd/vswitch.xml                       |  22 ++
>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>  create mode 100644 tests/test-dpdk-mbufs.c
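As background to the series above: the cover letter describes packet data split across mbufs chained via their 'next' attribute, with the head segment tracking the chain-wide totals. The sketch below is a minimal, illustration-only mock of that layout - the field names (next, data_len, pkt_len, nb_segs) mirror DPDK's struct rte_mbuf, but this is not the real DPDK structure or the OVS code.

```c
/* Illustration-only mock of DPDK's segmented-mbuf layout.  Field names
 * mirror struct rte_mbuf, but this is NOT the real DPDK structure. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_mbuf {
    struct mock_mbuf *next; /* Next segment in the chain, NULL on the last. */
    uint16_t data_len;      /* Bytes of packet data in this segment. */
    uint32_t pkt_len;       /* Head segment only: total across the chain. */
    uint16_t nb_segs;       /* Head segment only: number of segments. */
};

/* Walk the chain and recompute the totals the head segment should carry. */
static void
mock_chain_totals(const struct mock_mbuf *seg,
                  uint32_t *pkt_len, uint16_t *nb_segs)
{
    *pkt_len = 0;
    *nb_segs = 0;
    for (; seg; seg = seg->next) {
        *pkt_len += seg->data_len;
        (*nb_segs)++;
    }
}
```

The dp_packet_set_size() bug called out in the v8 notes ("was not updating the number of chained segments 'nb_segs'") is exactly a desync between these head-segment totals and the actual chain.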
> On Tue, Jul 24, 2018 at 07:55:39PM +0300, Ilya Maximets wrote:
> > Hi.
> > Just wanted to add some comments for the use-cases and testing
> > methodology. See inline.
> >
> > And I'm actually not sure if there is any profit from this patch set?
> > It looks like an internal mbuf handling rework that only degrades the
> > performance and complicates the code.
> >
> > Please, don't consider me a merge blocker. I just want to understand
> > why you think we need these 1200 LOCs?
> >
> > ---
> > About the 'resize()'-related discussion:
> > Maybe it's worth allowing dp_packet APIs to return a different
> > dp_packet. In this case we'll be able to just clone the packet to
> > malloced memory and resize it in cases where not enough headroom is
> > available. Like:
> >
> >     packet = eth_push_vlan(packet, vlan->vlan_tpid, vlan->vlan_tci);
> > or
> >     eth_push_vlan(&packet, vlan->vlan_tpid, vlan->vlan_tci);
> >
> > This will have a small performance penalty compared with shifting data
> > inside the mbuf, but will be much more elegant, and will allow us to
> > eliminate all the OVS_NOT_REACHED cases.
> >
> > Best regards, Ilya Maximets.
>
> Hmm, this is the second person from whom I've heard serious misgivings
> about this patch series. Tiago, Ian, would you like to respond? I'm a
> little nervous about merging this patch series, especially relatively
> late before branching, given that some people have technical objections
> to it.

I think Tiago is currently responding to the queries Ilya has raised, but I'd like to respond to this being included in the 2.10 release specifically.

The feature has been in development for a while and, up until Tiago took over the work two months ago, had received little feedback. That being said, we have had a number of people, familiar with OVS DPDK, testing the feature over the last month+ who have since signed off on it, as it enables a common customer requirement for TSO/GRO offloads.
This is just a stepping stone to TSO and GRO enablement for OVS DPDK, which is called out as a feature gap between kernel OVS and userspace OVS. The work will be continued from our side, with the aim of landing any required refinements in OVS 2.11.

From a validation POV it's not breaking anything existing for OVS DPDK that I've come across, so the risk on that side is low.

Would it be preferred to hold off merging this until the TSO/GRO aspects have been implemented as well? I guess my view here was that I would like to enable users to begin using the multiseg feature with the 2.10 release, and we can refine our approach if needed based on community feedback.

There are some interesting discussions, such as the changes to the dp_packet API above, but it's a pity they come at a critical time window. I wonder, are the API changes separate from this work? The impact would be felt outside of multiseg as well, I would think, as they are APIs. If there are changes required to the APIs, I would envision updating the multiseg feature to work with those changes as they are introduced in the future.

If you feel this is too much of a risk, we can remove it from 2.10. I can respin a new pull request with the remaining patches.

Thanks
Ian

> Thanks,
>
> Ben.
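Ilya's eth_push_vlan() proposal quoted above - a push operation that may hand back a *different* packet when headroom runs out, instead of hitting an OVS_NOT_REACHED path - can be sketched roughly as below. This is a toy model, not OVS's dp_packet API: 'toy_packet' and 'toy_push' are hypothetical names, and a real implementation would deal with mbufs and reference ownership rather than plain malloc.

```c
/* Toy sketch of a push API that may return a different packet when
 * headroom is insufficient.  Hypothetical names; NOT the OVS API. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_packet {
    unsigned char *base;   /* Start of the allocated buffer. */
    size_t headroom;       /* Free bytes before the packet data. */
    size_t size;           /* Bytes of packet data. */
};

/* Prepend 'hdr_len' bytes of header.  If headroom is insufficient, clone
 * into a freshly malloc'd buffer with enough headroom and free the old
 * packet -- the caller must continue with the returned pointer. */
static struct toy_packet *
toy_push(struct toy_packet *p, const void *hdr, size_t hdr_len)
{
    if (p->headroom < hdr_len) {
        /* Clone with extra headroom rather than shifting data in place. */
        struct toy_packet *clone = malloc(sizeof *clone);
        clone->headroom = hdr_len;
        clone->size = p->size;
        clone->base = malloc(hdr_len + p->size);
        memcpy(clone->base + hdr_len, p->base + p->headroom, p->size);
        free(p->base);
        free(p);
        p = clone;
    }
    p->headroom -= hdr_len;
    p->size += hdr_len;
    memcpy(p->base + p->headroom, hdr, hdr_len);
    return p;
}
```

The caller-side shape is then `packet = toy_push(packet, hdr, len);` - the slow path costs one copy, but every call site stays reachable, which is the elegance/performance trade-off being debated.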
On Tue, Jul 24, 2018 at 06:20:04PM +0000, Stokes, Ian wrote:
> [snip - Ilya's comments and Ian's reply, quoted in full above]
> If you feel this is too much of a risk, we can remove it from 2.10. I
> can respin a new pull request with the remaining patches.

After some consideration, I think we should leave this out of 2.10. I'm happy to merge it into master post-branch. That way it'll have about 6 months to cook on master and possibly acquire these additional benefits you mention.

Would you please respin the PR to leave this out?

Thanks,

Ben.
> On Tue, Jul 24, 2018 at 06:20:04PM +0000, Stokes, Ian wrote:
> > [snip - earlier discussion, quoted in full above]
> After some consideration, I think we should leave this out of 2.10. I'm
> happy to merge it into master post-branch. That way it'll have about 6
> months to cook on master and possibly acquire these additional benefits
> you mention.
>
> Would you please respin the PR to leave this out?

No problem, I'll respin a new pull request now.

Thanks
Ian

> Thanks,
>
> Ben.
On 24/07/2018 17:55, Ilya Maximets wrote:
> Hi.
> Just wanted to add some comments for the use-cases and testing
> methodology. See inline.
>
> [snip]
>
> About the 'resize()'-related discussion:
> Maybe it's worth allowing dp_packet APIs to return a different
> dp_packet. In this case we'll be able to just clone the packet to
> malloced memory and resize it in cases where not enough headroom is
> available. Like:
>
>     packet = eth_push_vlan(packet, vlan->vlan_tpid, vlan->vlan_tci);
> or
>     eth_push_vlan(&packet, vlan->vlan_tpid, vlan->vlan_tci);
>
> This will have a small performance penalty compared with shifting data
> inside the mbuf, but will be much more elegant, and will allow us to
> eliminate all the OVS_NOT_REACHED cases.

Thanks for picking up the discussion.

To me, it comes down to: is 128B not enough space for headers? Given the VXLAN example, it would be enough to push two headers. Is this too much of a constraint? What would be a good constraint?

If we plan to support as many headers as the system's memory allows, then we won't have much choice, and this option of malloc'ing from the system and copying into it might be the only one. My concerns would be about the performance penalty (as you mention), but also about users seeing weird performance differences when pushing 2 VXLAN headers vs 3 (for example). It would be yet another question to ask when debugging a system (how many headers are you pushing?).

I see benefits in having a hard limit as well, and not allowing anything else to be pushed above those 128B. But since the 128B are set in DPDK at compile time, this seems to be calling for trouble in the future: if that value changes between versions, we might start supporting fewer headers.
In this case, both this approach and the mbuf shifting would provide more resilience against future changes to DPDK's RTE_PKTMBUF_HEADROOM.

Also, this is a bit of an aside, but ideally I would like to keep this discussion out of this patch series, as I think it is creating confusion - this issue is separate from the multi-segment work and already exists for single-mbuf DPDK packets. There's an RFC, "Don't resize DPBUF_DPDK packets.", which would be the ideal place for it.
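The "two headers fit, three don't" point above can be checked with standard header sizes. The arithmetic below assumes one VXLAN encapsulation level means outer Ethernet + IPv4 (no options) + UDP + VXLAN, and that the default RTE_PKTMBUF_HEADROOM is 128 bytes, as discussed in the thread.

```c
/* Back-of-the-envelope check: how many VXLAN encapsulation levels fit
 * in DPDK's default 128B mbuf headroom?  Sizes are the standard outer
 * headers for one level; IPv4 is assumed without options. */
#include <assert.h>

enum {
    ETH_LEN = 14,
    IPV4_LEN = 20,
    UDP_LEN = 8,
    VXLAN_LEN = 8,
    /* One encapsulation level: 14 + 20 + 8 + 8 = 50 bytes. */
    LEVEL_LEN = ETH_LEN + IPV4_LEN + UDP_LEN + VXLAN_LEN,
    HEADROOM = 128, /* Default RTE_PKTMBUF_HEADROOM assumed here. */
};

/* Number of full encapsulation levels the default headroom can hold. */
static int
vxlan_levels_in_headroom(void)
{
    return HEADROOM / LEVEL_LEN;
}
```

Two levels consume 100 of the 128 bytes, so a third level would overflow the headroom and force either the in-mbuf data shift or the clone-and-resize path being debated.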