Message ID | 20210110030505.325722-1-fbl@sysclose.org |
---|---|
State | Superseded |
Headers | show |
Series | [ovs-dev,v2] netdev-dpdk: Refactor the DPDK transmit path. | expand |
Hi Flavio, > -----Original Message----- > From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner > Sent: Sunday, January 10, 2021 8:35 AM > To: dev@openvswitch.org > Cc: David Marchand <david.marchand@redhat.com>; Flavio Leitner > <fbl@sysclose.org> > Subject: [ovs-dev] [PATCH v2] netdev-dpdk: Refactor the DPDK transmit > path. > > This patch split out the common code between vhost and dpdk transmit > paths to shared functions to simplify the code and fix an issue. > > The issue is that the packet coming from non-DPDK device and egressing on a > DPDK device currently skips the hwol preparation. > > This also have the side effect of leaving only the dpdk transmit code under > the txq lock. > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > --- > > V2: > - mentioned the tx lock change in the commit message. > - fixed packet leak when copy fails. > - moved pkt_cnt = cnt two lines up. > > I tested the following scenarios with iperf and iperf -R and noticed no > performance regression: > IPERF VM-Bridge 1.02x > IPERF VM-VM 1.04x > IPERF VM-ExtHost 1.00x > IPERF VM-NS 1.01x > IPERF VM-VLAN-Bridge 1.03x > IPERF VM-VLAN-VM 1.03x > IPERF VM-VLAN-ExtHost 1.01x > IPERF VM-VLAN-NS 1.02x > IPERF VM-V6-VM 1.03x > IPERF VM-V6-ExtHost 1.00x > IPERF VM-V6-NS 1.01x > IPERF-R VM-Bridge 1.01x > IPERF-R VM-VM 1.04x > IPERF-R VM-ExtHost 1.10x > IPERF-R VM-NS 1.01x > IPERF-R VM-VLAN-Bridge 1.03x > IPERF-R VM-VLAN-VM 1.02x > IPERF-R VM-VLAN-ExtHost 1.08x > IPERF-R VM-VLAN-NS 1.02x > IPERF-R VM-V6-VM 1.00x > IPERF-R VM-V6-ExtHost 1.11x > IPERF-R VM-V6-NS 1.00x > > Now using trex, 64byte packet, PVP: > Original: 3.6Mpps > avg. packets per output batch: 32.00 > idle cycles: 0 (0.00%) > avg cycles per packet: 304.92 (8331383020/27323150) > avg processing cycles per packet: 304.92 (8331383020/27323150) > > Patched: 3.6Mpps > avg. packets per output batch: 32.00 > idle cycles: 0 (0.00%) > avg cycles per packet: 304.08 (21875784116/71941516) > avg processing cycles per packet: 304.08 (21875784116/71941516) Ran the following tests in VSPERF for QoS and they all pass: ovsdpdk_qos_create_phy_port ovsdpdk_qos_delete_phy_port ovsdpdk_qos_create_vport ovsdpdk_qos_delete_vport ovsdpdk_qos_create_no_cir ovsdpdk_qos_create_no_cbs For performance tests: We see a slight dip in performance for physical ports ~1% in phy2phy_tput test in VSPERF with the patch(v1) for 64byte packet and PVP test results are in line with your observations. NIC: Fortville 10G X710 Performance with different traffic profiles when deployed with 32VM's +1M flows + vxlan enabled : With burst mode : ~1% increase in performance as compared to scatter mode: ~4% decrease in performance. [snipped] Thanks , Sunil Intel > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Hi Sunil, On Mon, Jan 11, 2021 at 03:49:56PM +0000, Pai G, Sunil wrote: > Hi Flavio, > > > > -----Original Message----- > > From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner > > Sent: Sunday, January 10, 2021 8:35 AM > > To: dev@openvswitch.org > > Cc: David Marchand <david.marchand@redhat.com>; Flavio Leitner > > <fbl@sysclose.org> > > Subject: [ovs-dev] [PATCH v2] netdev-dpdk: Refactor the DPDK transmit > > path. > > > > This patch split out the common code between vhost and dpdk transmit > > paths to shared functions to simplify the code and fix an issue. > > > > The issue is that the packet coming from non-DPDK device and egressing on a > > DPDK device currently skips the hwol preparation. > > > > This also have the side effect of leaving only the dpdk transmit code under > > the txq lock. > > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > > --- > > > > V2: > > - mentioned the tx lock change in the commit message. > > - fixed packet leak when copy fails. > > - moved pkt_cnt = cnt two lines up. > > > > I tested the following scenarios with iperf and iperf -R and noticed no > > performance regression: > > IPERF VM-Bridge 1.02x > > IPERF VM-VM 1.04x > > IPERF VM-ExtHost 1.00x > > IPERF VM-NS 1.01x > > IPERF VM-VLAN-Bridge 1.03x > > IPERF VM-VLAN-VM 1.03x > > IPERF VM-VLAN-ExtHost 1.01x > > IPERF VM-VLAN-NS 1.02x > > IPERF VM-V6-VM 1.03x > > IPERF VM-V6-ExtHost 1.00x > > IPERF VM-V6-NS 1.01x > > IPERF-R VM-Bridge 1.01x > > IPERF-R VM-VM 1.04x > > IPERF-R VM-ExtHost 1.10x > > IPERF-R VM-NS 1.01x > > IPERF-R VM-VLAN-Bridge 1.03x > > IPERF-R VM-VLAN-VM 1.02x > > IPERF-R VM-VLAN-ExtHost 1.08x > > IPERF-R VM-VLAN-NS 1.02x > > IPERF-R VM-V6-VM 1.00x > > IPERF-R VM-V6-ExtHost 1.11x > > IPERF-R VM-V6-NS 1.00x > > > > Now using trex, 64byte packet, PVP: > > Original: 3.6Mpps > > avg. packets per output batch: 32.00 > > idle cycles: 0 (0.00%) > > avg cycles per packet: 304.92 (8331383020/27323150) > > avg processing cycles per packet: 304.92 (8331383020/27323150) > > > > Patched: 3.6Mpps > > avg. packets per output batch: 32.00 > > idle cycles: 0 (0.00%) > > avg cycles per packet: 304.08 (21875784116/71941516) > > avg processing cycles per packet: 304.08 (21875784116/71941516) > > > Ran the following tests in VSPERF for QoS and they all pass: Great, thanks for testing the patch. > ovsdpdk_qos_create_phy_port > ovsdpdk_qos_delete_phy_port > ovsdpdk_qos_create_vport > ovsdpdk_qos_delete_vport > ovsdpdk_qos_create_no_cir > ovsdpdk_qos_create_no_cbs > > For performance tests: > We see a slight dip in performance for physical ports ~1% in phy2phy_tput test in VSPERF with the patch(v1) for 64byte packet I tried P2P here (v2) with trex, single flow, single queue, single PMD and I got this: original: 12.88Mpps/6.6Gbps idle cycles: 0 (0.00%) processing cycles: 14466174204 (100.00%) avg cycles per packet: 170.73 (14466174204/84730796) avg processing cycles per packet: 170.73 (14466174204/84730796) patched: 13.00Mpps/6.65Mpps idle cycles: 0 (0.00%) processing cycles: 3783269716 (100.00%) avg cycles per packet: 169.49 (3783269716/22321968) avg processing cycles per packet: 169.49 (3783269716/22321968) So, I see a small performance gain. Can you provide a more detailed test description? I wonder if we are testing the same thing. > and PVP test results are in line with your observations. > NIC: Fortville 10G X710 Great. > Performance with different traffic profiles when deployed with 32VM's +1M flows + vxlan enabled : > With burst mode : ~1% increase in performance That is aligned with above. > as compared to > scatter mode: ~4% decrease in performance. Hm, that is unexpected. Is this reproducible? How are you switching between burst and scatter mode? Thanks, fbl
On Sun, Jan 10, 2021 at 4:05 AM Flavio Leitner <fbl@sysclose.org> wrote: > > This patch split out the common code between vhost and > dpdk transmit paths to shared functions to simplify the > code and fix an issue. > > The issue is that the packet coming from non-DPDK device > and egressing on a DPDK device currently skips the hwol > preparation. > > This also have the side effect of leaving only the dpdk nit: has* > transmit code under the txq lock. > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> Reviewed-by: David Marchand <david.marchand@redhat.com> Thanks.
Hi Flavio, > So, I see a small performance gain. Can you provide a more detailed > test description? I wonder if we are testing the same thing. VSPERF phy2phy_tput uses 2 dpdk pmds and performs zero packet loss test(RFC2544) with bidirectional traffic. > > > and PVP test results are in line with your observations. > > NIC: Fortville 10G X710 > > Great. > > > Performance with different traffic profiles when deployed with 32VM's > +1M flows + vxlan enabled : > > With burst mode : ~1% increase in performance > > That is aligned with above. > > > as compared to > > scatter mode: ~4% decrease in performance. > > Hm, that is unexpected. Is this reproducible? If you mean are the results reproducible then yes, I done a re-run of v1 and v2 again I ran my test against these commits to get baseline results and then ran v1 and v2 and compared results DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0 OVS: 7f79ae2fb (HEAD -> master, origin/master, origin/HEAD) Documentation: Simplify the website main page. V1: DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0 OVS: 957132b3a (HEAD -> master) netdev-dpdk: Refactor the DPDK transmit path. 7f79ae2fb (origin/master, origin/HEAD) Documentation: Simplify the website main page. V1: Burst= 1% V1: Scatter -4% As sunil reported already shows the same results again As for V2 it didn't show any increase or decrease for both burst and scatter, And even between the v1 and v2 results from my second run it didn't show much difference on my test anyway. > How are you switching between burst and scatter mode? As for how were switching between burst and scatter, I just use two different traffic profiles & gather results. So I run the test using burst profile restart the test then use scatter. Traffic @ Phy NIC Rx: Ether()/IP()/UDP()/VXLAN()/Ether()/IP() Burst: on the outer ip we do a burst of 32 packets with same ip then switch for next 32 and so on Scatter: for scatter we do incrementally for 32 And we do this for 1M flows I hope this answers your question > > Thanks, > fbl Thanks, Marko K
Hi Marko, On Wed, Jan 13, 2021 at 10:51:03AM +0000, Kovacevic, Marko wrote: > Hi Flavio, > > > So, I see a small performance gain. Can you provide a more detailed > > test description? I wonder if we are testing the same thing. > > VSPERF phy2phy_tput uses 2 dpdk pmds and performs zero packet loss test(RFC2544) with bidirectional traffic. Ok. > > > and PVP test results are in line with your observations. > > > NIC: Fortville 10G X710 > > > > Great. > > > > > Performance with different traffic profiles when deployed with 32VM's > > +1M flows + vxlan enabled : > > > With burst mode : ~1% increase in performance > > > > That is aligned with above. > > > > > as compared to > > > scatter mode: ~4% decrease in performance. > > > > Hm, that is unexpected. Is this reproducible? > > If you mean are the results reproducible then yes, > I done a re-run of v1 and v2 again Thanks for checking. > I ran my test against these commits to get baseline results and then ran v1 and v2 and compared results > DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0 > OVS: 7f79ae2fb (HEAD -> master, origin/master, origin/HEAD) Documentation: Simplify the website main page. > > V1: > DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0 > OVS: 957132b3a (HEAD -> master) netdev-dpdk: Refactor the DPDK transmit path. > 7f79ae2fb (origin/master, origin/HEAD) Documentation: Simplify the website main page. > > V1: Burst= 1% > V1: Scatter -4% > As sunil reported already shows the same results again Alright. I will try to reproduce in my lab to understand the root cause. > As for V2 it didn't show any increase or decrease for both burst and scatter, > And even between the v1 and v2 results from my second run it didn't show much difference on my test anyway. > > > How are you switching between burst and scatter mode? > > As for how were switching between burst and scatter, I just use two different traffic profiles & gather results. > So I run the test using burst profile restart the test then use scatter. > > Traffic @ Phy NIC Rx: > Ether()/IP()/UDP()/VXLAN()/Ether()/IP() > > Burst: on the outer ip we do a burst of 32 packets with same ip then switch for next 32 and so on > Scatter: for scatter we do incrementally for 32 Glad that I asked because I thought you were talking about the NIC mode with scatter enabled or not. > And we do this for 1M flows > > I hope this answers your question They are helpful. I will try to reproduce the -4% results in my lab as a next step. It may take some time. Thanks for testing and providing feedbacks!
Hello Flavio, Great patch, I think you really did a lot to improve the code here and I think that's borne out by the consistent performance improvements across multiple tests. Regarding the 4% regression that Intel detected, I found the following white paper to describe the "scatter" test: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf This document calls out the following key points: The original test was summarized as: - 32 VMs with one million flows. - Test runs on four physical cores for OVS and 10 hyper-threaded cores for TestPMD - An Ixia pitches traffic at a sub 0.1% loss rate - The server catches traffic with a E810-C 100G - The traffic's profile is: Ether()/IP()/UDP()/VXLAN()/Ether()/IP() - On the outer IP, the source address changes incrementally across the 32 instances - The destination address remains the same on the outer IP. - The inner source IP remains - The inner destination address increments to create the one million flows for the test - EMC and SMC were disabled I could not reproduce this test exactly because I don't have access to the same hardware - notably the Intel NIC and an Ixia - and I didn't want to create an environment that wouldn't be reproduced in real world scenarios. I did pin VM and TXQs/RXQs threads to cores, but I didn't optimize the setup nearly to the extant that the white paper described. My test setup consisted of two Fedora 35 servers directly connected across Mellanox5E cards with Trex pitching traffic and TestPMD reflecting it. In my test I was still able to reproduce a similar performance penalty. I found that the key factors was the combination of VXLAN and a large number of flows. So once I had a setup that could reproduce close to the 4% penalty I stopped modifying my test framework and started searching for the slow code. I didn't see any obvious issues in the code that should cause a significant slowdown, in fact, most of the code is identical or slightly improved. So to help my analysis, I created several variations of your patch reverting small aspects of the change and benchmarked each variation. Because the difference in performance across each variation was so minor, I took a lot of samples. I pitched traffic over one million flows for 240 seconds and averaged out the throughput, I then repeated this process a total of five times for each patch. Finally, I repeated the whole process three times to produce 15 data points per patch. The best results came from the patch enclosed below, with the code from netdev_dpdk_common_send() protected by the splinlock, as it is in the pre-patch code. This yielded a 2.7% +/- 0.64 performance boost over the master branch. Cheers, Michael diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index bc1633663..5db5d7e2a 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -2777,13 +2777,13 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid, return 0; } - cnt = netdev_dpdk_common_send(netdev, batch, &stats); - if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) { COVERAGE_INC(vhost_tx_contention); rte_spinlock_lock(&dev->tx_q[qid].tx_lock); } + cnt = netdev_dpdk_common_send(netdev, batch, &stats); + pkts = (struct rte_mbuf **) batch->packets; vhost_batch_cnt = cnt; retries = 0; @@ -2843,13 +2843,15 @@ netdev_dpdk_eth_send(struct netdev *netdev, int qid, return 0; } - cnt = netdev_dpdk_common_send(netdev, batch, &stats); - dropped = batch_cnt - cnt; if (OVS_UNLIKELY(concurrent_txq)) { qid = qid % dev->up.n_txq; rte_spinlock_lock(&dev->tx_q[qid].tx_lock); } + cnt = netdev_dpdk_common_send(netdev, batch, &stats); + + dropped = batch_cnt - cnt; + dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt); if (OVS_UNLIKELY(dropped)) { struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; On Sun, 2021-01-10 at 00:05 -0300, Flavio Leitner wrote: > This patch split out the common code between vhost and > dpdk transmit paths to shared functions to simplify the > code and fix an issue. > > The issue is that the packet coming from non-DPDK device > and egressing on a DPDK device currently skips the hwol > preparation. > > This also have the side effect of leaving only the dpdk > transmit code under the txq lock. > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > Reviewed-by: David Marchand <david.marchand@redhat.com> > --- > > V2: > - mentioned the tx lock change in the commit message. > - fixed packet leak when copy fails. > - moved pkt_cnt = cnt two lines up. > > I tested the following scenarios with iperf and iperf -R > and noticed no performance regression: > IPERF VM-Bridge 1.02x > IPERF VM-VM 1.04x > IPERF VM-ExtHost 1.00x > IPERF VM-NS 1.01x > IPERF VM-VLAN-Bridge 1.03x > IPERF VM-VLAN-VM 1.03x > IPERF VM-VLAN-ExtHost 1.01x > IPERF VM-VLAN-NS 1.02x > IPERF VM-V6-VM 1.03x > IPERF VM-V6-ExtHost 1.00x > IPERF VM-V6-NS 1.01x > IPERF-R VM-Bridge 1.01x > IPERF-R VM-VM 1.04x > IPERF-R VM-ExtHost 1.10x > IPERF-R VM-NS 1.01x > IPERF-R VM-VLAN-Bridge 1.03x > IPERF-R VM-VLAN-VM 1.02x > IPERF-R VM-VLAN-ExtHost 1.08x > IPERF-R VM-VLAN-NS 1.02x > IPERF-R VM-V6-VM 1.00x > IPERF-R VM-V6-ExtHost 1.11x > IPERF-R VM-V6-NS 1.00x > > Now using trex, 64byte packet, PVP: > Original: 3.6Mpps > avg. packets per output batch: 32.00 > idle cycles: 0 (0.00%) > avg cycles per packet: 304.92 (8331383020/27323150) > avg processing cycles per packet: 304.92 (8331383020/27323150) > > Patched: 3.6Mpps > avg. packets per output batch: 32.00 > idle cycles: 0 (0.00%) > avg cycles per packet: 304.08 (21875784116/71941516) > avg processing cycles per packet: 304.08 (21875784116/71941516) > > lib/netdev-dpdk.c | 335 +++++++++++++++++++------------------------- > -- > 1 file changed, 139 insertions(+), 196 deletions(-) > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > index 2640a421a..a1437db4d 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -2585,90 +2585,6 @@ netdev_dpdk_vhost_update_tx_counters(struct > netdev_dpdk *dev, > } > } > > -static void > -__netdev_dpdk_vhost_send(struct netdev *netdev, int qid, > - struct dp_packet **pkts, int cnt) > -{ > - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > - struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; > - struct netdev_dpdk_sw_stats sw_stats_add; > - unsigned int n_packets_to_free = cnt; > - unsigned int total_packets = cnt; > - int i, retries = 0; > - int max_retries = VHOST_ENQ_RETRY_MIN; > - int vid = netdev_dpdk_get_vid(dev); > - > - qid = dev->tx_q[qid % netdev->n_txq].map; > - > - if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0 > - || !(dev->flags & NETDEV_UP))) { > - rte_spinlock_lock(&dev->stats_lock); > - dev->stats.tx_dropped+= cnt; > - rte_spinlock_unlock(&dev->stats_lock); > - goto out; > - } > - > - if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev- > >tx_q[qid].tx_lock))) { > - COVERAGE_INC(vhost_tx_contention); > - rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > - } > - > - sw_stats_add.tx_invalid_hwol_drops = cnt; > - if (userspace_tso_enabled()) { > - cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt); > - } > - > - sw_stats_add.tx_invalid_hwol_drops -= cnt; > - sw_stats_add.tx_mtu_exceeded_drops = cnt; > - cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt); > - sw_stats_add.tx_mtu_exceeded_drops -= cnt; > - > - /* Check has QoS has been configured for the netdev */ > - sw_stats_add.tx_qos_drops = cnt; > - cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true); > - sw_stats_add.tx_qos_drops -= cnt; > - > - n_packets_to_free = cnt; > - > - do { > - int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; > - unsigned int tx_pkts; > - > - tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts, > cnt); > - if (OVS_LIKELY(tx_pkts)) { > - /* Packets have been sent.*/ > - cnt -= tx_pkts; > - /* Prepare for possible retry.*/ > - cur_pkts = &cur_pkts[tx_pkts]; > - if (OVS_UNLIKELY(cnt && !retries)) { > - /* > - * Read max retries as there are packets not sent > - * and no retries have already occurred. > - */ > - atomic_read_relaxed(&dev->vhost_tx_retries_max, > &max_retries); > - } > - } else { > - /* No packets sent - do not retry.*/ > - break; > - } > - } while (cnt && (retries++ < max_retries)); > - > - rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > - > - sw_stats_add.tx_failure_drops = cnt; > - sw_stats_add.tx_retries = MIN(retries, max_retries); > - > - rte_spinlock_lock(&dev->stats_lock); > - netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets, > - &sw_stats_add); > - rte_spinlock_unlock(&dev->stats_lock); > - > -out: > - for (i = 0; i < n_packets_to_free; i++) { > - dp_packet_delete(pkts[i]); > - } > -} > - > static void > netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque) > { > @@ -2783,76 +2699,70 @@ dpdk_copy_dp_packet_to_mbuf(struct > rte_mempool *mp, struct dp_packet *pkt_orig) > return pkt_dest; > } > > -/* Tx function. Transmit packets indefinitely */ > -static void > -dpdk_do_tx_copy(struct netdev *netdev, int qid, struct > dp_packet_batch *batch) > +/* Replace packets in a 'batch' with their corresponding copies > using > + * DPDK memory. > + * > + * Returns the number of good packets in the batch. */ > +static size_t > +dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct > dp_packet_batch *batch) > OVS_NO_THREAD_SAFETY_ANALYSIS > { > - const size_t batch_cnt = dp_packet_batch_size(batch); > -#if !defined(__CHECKER__) && !defined(_WIN32) > - const size_t PKT_ARRAY_SIZE = batch_cnt; > -#else > - /* Sparse or MSVC doesn't like variable length array. */ > - enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST }; > -#endif > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > - struct dp_packet *pkts[PKT_ARRAY_SIZE]; > - struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > - uint32_t cnt = batch_cnt; > - uint32_t dropped = 0; > - uint32_t tx_failure = 0; > - uint32_t mtu_drops = 0; > - uint32_t qos_drops = 0; > - > - if (dev->type != DPDK_DEV_VHOST) { > - /* Check if QoS has been configured for this netdev. */ > - cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch- > >packets, > - batch_cnt, false); > - qos_drops = batch_cnt - cnt; > - } > - > - uint32_t txcnt = 0; > - > - for (uint32_t i = 0; i < cnt; i++) { > - struct dp_packet *packet = batch->packets[i]; > - uint32_t size = dp_packet_size(packet); > - > - if (size > dev->max_packet_len > - && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) { > - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", > size, > - dev->max_packet_len); > - mtu_drops++; > - continue; > - } > + size_t i, size = dp_packet_batch_size(batch); > + struct dp_packet *packet; > > - pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, > packet); > - if (OVS_UNLIKELY(!pkts[txcnt])) { > - dropped = cnt - i; > - break; > - } > + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { > + if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) { > + dp_packet_batch_refill(batch, packet, i); > + } else { > + struct dp_packet *pktcopy; > > - txcnt++; > - } > + pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, > packet); > + if (pktcopy) { > + dp_packet_batch_refill(batch, pktcopy, i); > + } > > - if (OVS_LIKELY(txcnt)) { > - if (dev->type == DPDK_DEV_VHOST) { > - __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt); > - } else { > - tx_failure += netdev_dpdk_eth_tx_burst(dev, qid, > - (struct rte_mbuf > **)pkts, > - txcnt); > + dp_packet_delete(packet); > } > } > > - dropped += qos_drops + mtu_drops + tx_failure; > - if (OVS_UNLIKELY(dropped)) { > - rte_spinlock_lock(&dev->stats_lock); > - dev->stats.tx_dropped += dropped; > - sw_stats->tx_failure_drops += tx_failure; > - sw_stats->tx_mtu_exceeded_drops += mtu_drops; > - sw_stats->tx_qos_drops += qos_drops; > - rte_spinlock_unlock(&dev->stats_lock); > + return dp_packet_batch_size(batch); > +} > + > +static size_t > +netdev_dpdk_common_send(struct netdev *netdev, struct > dp_packet_batch *batch, > + struct netdev_dpdk_sw_stats *stats) > +{ > + struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > + size_t cnt, pkt_cnt = dp_packet_batch_size(batch); > + > + memset(stats, 0, sizeof *stats); > + > + /* Copy dp-packets to mbufs. */ > + if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { > + cnt = dpdk_copy_batch_to_mbuf(netdev, batch); > + stats->tx_failure_drops += pkt_cnt - cnt; > + pkt_cnt = cnt; > } > + > + /* Drop oversized packets. */ > + cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt); > + stats->tx_mtu_exceeded_drops += pkt_cnt - cnt; > + pkt_cnt = cnt; > + > + /* Prepare each mbuf for hardware offloading. */ > + if (userspace_tso_enabled()) { > + cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt); > + stats->tx_invalid_hwol_drops += pkt_cnt - cnt; > + pkt_cnt = cnt; > + } > + > + /* Apply Quality of Service policy. */ > + cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true); > + stats->tx_qos_drops += pkt_cnt - cnt; > + > + return cnt; > } > > static int > @@ -2860,82 +2770,115 @@ netdev_dpdk_vhost_send(struct netdev > *netdev, int qid, > struct dp_packet_batch *batch, > bool concurrent_txq OVS_UNUSED) > { > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > + int max_retries = VHOST_ENQ_RETRY_MIN; > + int cnt, batch_cnt, vhost_batch_cnt; > + int vid = netdev_dpdk_get_vid(dev); > + struct netdev_dpdk_sw_stats stats; > + struct rte_mbuf **pkts; > + int retries; > > - if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { > - dpdk_do_tx_copy(netdev, qid, batch); > + batch_cnt = cnt = dp_packet_batch_size(batch); > + qid = dev->tx_q[qid % netdev->n_txq].map; > + if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0 > + || !(dev->flags & NETDEV_UP))) { > + rte_spinlock_lock(&dev->stats_lock); > + dev->stats.tx_dropped += cnt; > + rte_spinlock_unlock(&dev->stats_lock); > dp_packet_delete_batch(batch, true); > - } else { > - __netdev_dpdk_vhost_send(netdev, qid, batch->packets, > - dp_packet_batch_size(batch)); > + return 0; > + } > + > + cnt = netdev_dpdk_common_send(netdev, batch, &stats); > + > + if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev- > >tx_q[qid].tx_lock))) { > + COVERAGE_INC(vhost_tx_contention); > + rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > } > + > + pkts = (struct rte_mbuf **) batch->packets; > + vhost_batch_cnt = cnt; > + retries = 0; > + do { > + int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; > + int tx_pkts; > + > + tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, pkts, > cnt); > + if (OVS_LIKELY(tx_pkts)) { > + /* Packets have been sent.*/ > + cnt -= tx_pkts; > + /* Prepare for possible retry.*/ > + pkts = &pkts[tx_pkts]; > + if (OVS_UNLIKELY(cnt && !retries)) { > + /* > + * Read max retries as there are packets not sent > + * and no retries have already occurred. > + */ > + atomic_read_relaxed(&dev->vhost_tx_retries_max, > &max_retries); > + } > + } else { > + /* No packets sent - do not retry.*/ > + break; > + } > + } while (cnt && (retries++ < max_retries)); > + > + rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > + > + stats.tx_failure_drops += cnt; > + stats.tx_retries = MIN(retries, max_retries); > + > + rte_spinlock_lock(&dev->stats_lock); > + netdev_dpdk_vhost_update_tx_counters(dev, batch->packets, > batch_cnt, > + &stats); > + rte_spinlock_unlock(&dev->stats_lock); > + > + pkts = (struct rte_mbuf **) batch->packets; > + for (int i = 0; i < vhost_batch_cnt; i++) { > + rte_pktmbuf_free(pkts[i]); > + } > + > return 0; > } > > -static inline void > -netdev_dpdk_send__(struct netdev_dpdk *dev, int qid, > - struct dp_packet_batch *batch, > - bool concurrent_txq) > +static int > +netdev_dpdk_eth_send(struct netdev *netdev, int qid, > + struct dp_packet_batch *batch, bool > concurrent_txq) > { > + struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > + int batch_cnt = dp_packet_batch_size(batch); > + struct netdev_dpdk_sw_stats stats; > + int cnt, dropped; > + > if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) { > dp_packet_delete_batch(batch, true); > - return; > + return 0; > } > > + cnt = netdev_dpdk_common_send(netdev, batch, &stats); > + dropped = batch_cnt - cnt; > if (OVS_UNLIKELY(concurrent_txq)) { > qid = qid % dev->up.n_txq; > rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > } > > - if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { > - struct netdev *netdev = &dev->up; > - > - dpdk_do_tx_copy(netdev, qid, batch); > - dp_packet_delete_batch(batch, true); > - } else { > + dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt); > + if (OVS_UNLIKELY(dropped)) { > struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > - int dropped; > - int tx_failure, mtu_drops, qos_drops, hwol_drops; > - int batch_cnt = dp_packet_batch_size(batch); > - struct rte_mbuf **pkts = (struct rte_mbuf **) batch- > >packets; > > - hwol_drops = batch_cnt; > - if (userspace_tso_enabled()) { > - batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, > batch_cnt); > - } > - hwol_drops -= batch_cnt; > - mtu_drops = batch_cnt; > - batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, > batch_cnt); > - mtu_drops -= batch_cnt; > - qos_drops = batch_cnt; > - batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true); > - qos_drops -= batch_cnt; > - > - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, > batch_cnt); > - > - dropped = tx_failure + mtu_drops + qos_drops + hwol_drops; > - if (OVS_UNLIKELY(dropped)) { > - rte_spinlock_lock(&dev->stats_lock); > - dev->stats.tx_dropped += dropped; > - sw_stats->tx_failure_drops += tx_failure; > - sw_stats->tx_mtu_exceeded_drops += mtu_drops; > - sw_stats->tx_qos_drops += qos_drops; > - sw_stats->tx_invalid_hwol_drops += hwol_drops; > - rte_spinlock_unlock(&dev->stats_lock); > - } > + rte_spinlock_lock(&dev->stats_lock); > + dev->stats.tx_dropped += dropped; > + sw_stats->tx_failure_drops += stats.tx_failure_drops; > + sw_stats->tx_mtu_exceeded_drops += > stats.tx_mtu_exceeded_drops; > + sw_stats->tx_qos_drops += stats.tx_qos_drops; > + sw_stats->tx_invalid_hwol_drops += > stats.tx_invalid_hwol_drops; > + rte_spinlock_unlock(&dev->stats_lock); > } > > if (OVS_UNLIKELY(concurrent_txq)) { > rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > } > -} > - > -static int > -netdev_dpdk_eth_send(struct netdev *netdev, int qid, > - struct dp_packet_batch *batch, bool > concurrent_txq) > -{ > - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > - netdev_dpdk_send__(dev, qid, batch, concurrent_txq); > return 0; > } >
Hello Sunil, Marko and Ian. Mike worked to identify the reason for the performance issue reported by you a while ago. He summarized below. I wonder if you can give a try on his patch too and tell us if we are on the right track. Thanks, fbl On Wed, Jan 05, 2022 at 03:01:47PM -0500, Mike Pattrick wrote: > Hello Flavio, > > Great patch, I think you really did a lot to improve the code here and > I think that's borne out by the consistent performance improvements > across multiple tests. > > Regarding the 4% regression that Intel detected, I found the following > white paper to describe the "scatter" test: > > https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf > > This document calls out the following key points: > > The original test was summarized as: > - 32 VMs with one million flows. > - Test runs on four physical cores for OVS and 10 hyper-threaded cores > for TestPMD > - An Ixia pitches traffic at a sub 0.1% loss rate > - The server catches traffic with a E810-C 100G > - The traffic's profile is: Ether()/IP()/UDP()/VXLAN()/Ether()/IP() > - On the outer IP, the source address changes incrementally across the > 32 instances > - The destination address remains the same on the outer IP. > - The inner source IP remains > - The inner destination address increments to create the one million > flows for the test > - EMC and SMC were disabled > > I could not reproduce this test exactly because I don't have access to > the same hardware - notably the Intel NIC and an Ixia - and I didn't > want to create an environment that wouldn't be reproduced in real world > scenarios. I did pin VM and TXQs/RXQs threads to cores, but I didn't > optimize the setup nearly to the extant that the white paper described. > My test setup consisted of two Fedora 35 servers directly connected > across Mellanox5E cards with Trex pitching traffic and TestPMD > reflecting it. > > In my test I was still able to reproduce a similar performance penalty. > I found that the key factors was the combination of VXLAN and a large > number of flows. So once I had a setup that could reproduce close to > the 4% penalty I stopped modifying my test framework and started > searching for the slow code. > > I didn't see any obvious issues in the code that should cause a > significant slowdown, in fact, most of the code is identical or > slightly improved. So to help my analysis, I created several variations > of your patch reverting small aspects of the change and benchmarked > each variation. > > Because the difference in performance across each variation was so > minor, I took a lot of samples. I pitched traffic over one million > flows for 240 seconds and averaged out the throughput, I then repeated > this process a total of five times for each patch. Finally, I repeated > the whole process three times to produce 15 data points per patch. > > The best results came from the patch enclosed below, with the code from > netdev_dpdk_common_send() protected by the splinlock, as it is in the > pre-patch code. This yielded a 2.7% +/- 0.64 performance boost over the > master branch. > > > Cheers, > Michael > > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > index bc1633663..5db5d7e2a 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -2777,13 +2777,13 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid, > return 0; > } > > - cnt = netdev_dpdk_common_send(netdev, batch, &stats); > - > if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) { > COVERAGE_INC(vhost_tx_contention); > rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > } > > + cnt = netdev_dpdk_common_send(netdev, batch, &stats); > + > pkts = (struct rte_mbuf **) batch->packets; > vhost_batch_cnt = cnt; > retries = 0; > @@ -2843,13 +2843,15 @@ netdev_dpdk_eth_send(struct netdev *netdev, int qid, > return 0; > } > > - cnt = netdev_dpdk_common_send(netdev, batch, &stats); > - dropped = batch_cnt - cnt; > if (OVS_UNLIKELY(concurrent_txq)) { > qid = qid % dev->up.n_txq; > rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > } > > + cnt = netdev_dpdk_common_send(netdev, batch, &stats); > + > + dropped = batch_cnt - cnt; > + > dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt); > if (OVS_UNLIKELY(dropped)) { > struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > > > On Sun, 2021-01-10 at 00:05 -0300, Flavio Leitner wrote: > > This patch split out the common code between vhost and > > dpdk transmit paths to shared functions to simplify the > > code and fix an issue. > > > > The issue is that the packet coming from non-DPDK device > > and egressing on a DPDK device currently skips the hwol > > preparation. > > > > This also have the side effect of leaving only the dpdk > > transmit code under the txq lock. > > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > > Reviewed-by: David Marchand <david.marchand@redhat.com> > > --- > > > > V2: > > - mentioned the tx lock change in the commit message. > > - fixed packet leak when copy fails. > > - moved pkt_cnt = cnt two lines up. > > > > I tested the following scenarios with iperf and iperf -R > > and noticed no performance regression: > > IPERF VM-Bridge 1.02x > > IPERF VM-VM 1.04x > > IPERF VM-ExtHost 1.00x > > IPERF VM-NS 1.01x > > IPERF VM-VLAN-Bridge 1.03x > > IPERF VM-VLAN-VM 1.03x > > IPERF VM-VLAN-ExtHost 1.01x > > IPERF VM-VLAN-NS 1.02x > > IPERF VM-V6-VM 1.03x > > IPERF VM-V6-ExtHost 1.00x > > IPERF VM-V6-NS 1.01x > > IPERF-R VM-Bridge 1.01x > > IPERF-R VM-VM 1.04x > > IPERF-R VM-ExtHost 1.10x > > IPERF-R VM-NS 1.01x > > IPERF-R VM-VLAN-Bridge 1.03x > > IPERF-R VM-VLAN-VM 1.02x > > IPERF-R VM-VLAN-ExtHost 1.08x > > IPERF-R VM-VLAN-NS 1.02x > > IPERF-R VM-V6-VM 1.00x > > IPERF-R VM-V6-ExtHost 1.11x > > IPERF-R VM-V6-NS 1.00x > > > > Now using trex, 64byte packet, PVP: > > Original: 3.6Mpps > > avg. packets per output batch: 32.00 > > idle cycles: 0 (0.00%) > > avg cycles per packet: 304.92 (8331383020/27323150) > > avg processing cycles per packet: 304.92 (8331383020/27323150) > > > > Patched: 3.6Mpps > > avg. packets per output batch: 32.00 > > idle cycles: 0 (0.00%) > > avg cycles per packet: 304.08 (21875784116/71941516) > > avg processing cycles per packet: 304.08 (21875784116/71941516) > > > > lib/netdev-dpdk.c | 335 +++++++++++++++++++------------------------- > > -- > > 1 file changed, 139 insertions(+), 196 deletions(-) > > > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > > index 2640a421a..a1437db4d 100644 > > --- a/lib/netdev-dpdk.c > > +++ b/lib/netdev-dpdk.c > > @@ -2585,90 +2585,6 @@ netdev_dpdk_vhost_update_tx_counters(struct > > netdev_dpdk *dev, > > } > > } > > > > -static void > > -__netdev_dpdk_vhost_send(struct netdev *netdev, int qid, > > - struct dp_packet **pkts, int cnt) > > -{ > > - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > - struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; > > - struct netdev_dpdk_sw_stats sw_stats_add; > > - unsigned int n_packets_to_free = cnt; > > - unsigned int total_packets = cnt; > > - int i, retries = 0; > > - int max_retries = VHOST_ENQ_RETRY_MIN; > > - int vid = netdev_dpdk_get_vid(dev); > > - > > - qid = dev->tx_q[qid % netdev->n_txq].map; > > - > > - if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0 > > - || !(dev->flags & NETDEV_UP))) { > > - rte_spinlock_lock(&dev->stats_lock); > > - dev->stats.tx_dropped+= cnt; > > - rte_spinlock_unlock(&dev->stats_lock); > > - goto out; > > - } > > - > > - if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev- > > >tx_q[qid].tx_lock))) { > > - COVERAGE_INC(vhost_tx_contention); > > - rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > > - } > > - > > - sw_stats_add.tx_invalid_hwol_drops = cnt; > > - if (userspace_tso_enabled()) { > > - cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt); > > - } > > - > > - sw_stats_add.tx_invalid_hwol_drops -= cnt; > > - sw_stats_add.tx_mtu_exceeded_drops = cnt; > > - cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt); > > - sw_stats_add.tx_mtu_exceeded_drops -= cnt; > > - > > - /* Check has QoS has been configured for the netdev */ > > - sw_stats_add.tx_qos_drops = cnt; > > - cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true); > > - sw_stats_add.tx_qos_drops -= cnt; > > - > > - n_packets_to_free = cnt; > > - > > - do { > > - int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; > > - unsigned int tx_pkts; > > - > > - tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts, > > cnt); > > - if (OVS_LIKELY(tx_pkts)) { > > - /* Packets have been sent.*/ > > - cnt -= tx_pkts; > > - /* Prepare for possible retry.*/ > > - cur_pkts = &cur_pkts[tx_pkts]; > > - if (OVS_UNLIKELY(cnt && !retries)) { > > - /* > > - * Read max retries as there are packets not sent > > - * and no retries have already occurred. > > - */ > > - atomic_read_relaxed(&dev->vhost_tx_retries_max, > > &max_retries); > > - } > > - } else { > > - /* No packets sent - do not retry.*/ > > - break; > > - } > > - } while (cnt && (retries++ < max_retries)); > > - > > - rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > > - > > - sw_stats_add.tx_failure_drops = cnt; > > - sw_stats_add.tx_retries = MIN(retries, max_retries); > > - > > - rte_spinlock_lock(&dev->stats_lock); > > - netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets, > > - &sw_stats_add); > > - rte_spinlock_unlock(&dev->stats_lock); > > - > > -out: > > - for (i = 0; i < n_packets_to_free; i++) { > > - dp_packet_delete(pkts[i]); > > - } > > -} > > - > > static void > > netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque) > > { > > @@ -2783,76 +2699,70 @@ dpdk_copy_dp_packet_to_mbuf(struct > > rte_mempool *mp, struct dp_packet *pkt_orig) > > return pkt_dest; > > } > > > > -/* Tx function. Transmit packets indefinitely */ > > -static void > > -dpdk_do_tx_copy(struct netdev *netdev, int qid, struct > > dp_packet_batch *batch) > > +/* Replace packets in a 'batch' with their corresponding copies > > using > > + * DPDK memory. > > + * > > + * Returns the number of good packets in the batch. */ > > +static size_t > > +dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct > > dp_packet_batch *batch) > > OVS_NO_THREAD_SAFETY_ANALYSIS > > { > > - const size_t batch_cnt = dp_packet_batch_size(batch); > > -#if !defined(__CHECKER__) && !defined(_WIN32) > > - const size_t PKT_ARRAY_SIZE = batch_cnt; > > -#else > > - /* Sparse or MSVC doesn't like variable length array. */ > > - enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST }; > > -#endif > > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > - struct dp_packet *pkts[PKT_ARRAY_SIZE]; > > - struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > > - uint32_t cnt = batch_cnt; > > - uint32_t dropped = 0; > > - uint32_t tx_failure = 0; > > - uint32_t mtu_drops = 0; > > - uint32_t qos_drops = 0; > > - > > - if (dev->type != DPDK_DEV_VHOST) { > > - /* Check if QoS has been configured for this netdev. */ > > - cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch- > > >packets, > > - batch_cnt, false); > > - qos_drops = batch_cnt - cnt; > > - } > > - > > - uint32_t txcnt = 0; > > - > > - for (uint32_t i = 0; i < cnt; i++) { > > - struct dp_packet *packet = batch->packets[i]; > > - uint32_t size = dp_packet_size(packet); > > - > > - if (size > dev->max_packet_len > > - && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) { > > - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", > > size, > > - dev->max_packet_len); > > - mtu_drops++; > > - continue; > > - } > > + size_t i, size = dp_packet_batch_size(batch); > > + struct dp_packet *packet; > > > > - pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, > > packet); > > - if (OVS_UNLIKELY(!pkts[txcnt])) { > > - dropped = cnt - i; > > - break; > > - } > > + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { > > + if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) { > > + dp_packet_batch_refill(batch, packet, i); > > + } else { > > + struct dp_packet *pktcopy; > > > > - txcnt++; > > - } > > + pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, > > packet); > > + if (pktcopy) { > > + dp_packet_batch_refill(batch, pktcopy, i); > > + } > > > > - if (OVS_LIKELY(txcnt)) { > > - if (dev->type == DPDK_DEV_VHOST) { > > - __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt); > > - } else { > > - tx_failure += netdev_dpdk_eth_tx_burst(dev, qid, > > - (struct rte_mbuf > > **)pkts, > > - txcnt); > > + dp_packet_delete(packet); > > } > > } > > > > - dropped += qos_drops + mtu_drops + tx_failure; > > - if (OVS_UNLIKELY(dropped)) { > > - rte_spinlock_lock(&dev->stats_lock); > > - dev->stats.tx_dropped += dropped; > > - sw_stats->tx_failure_drops += tx_failure; > > - sw_stats->tx_mtu_exceeded_drops += mtu_drops; > > - sw_stats->tx_qos_drops += qos_drops; > > - rte_spinlock_unlock(&dev->stats_lock); > > + return dp_packet_batch_size(batch); > > +} > > + > > +static size_t > > +netdev_dpdk_common_send(struct netdev *netdev, struct > > dp_packet_batch *batch, > > + struct netdev_dpdk_sw_stats *stats) > > +{ > > + struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; > > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > + size_t cnt, pkt_cnt = dp_packet_batch_size(batch); > > + > > + memset(stats, 0, sizeof *stats); > > + > > + /* Copy dp-packets to mbufs. */ > > + if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { > > + cnt = dpdk_copy_batch_to_mbuf(netdev, batch); > > + stats->tx_failure_drops += pkt_cnt - cnt; > > + pkt_cnt = cnt; > > } > > + > > + /* Drop oversized packets. */ > > + cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt); > > + stats->tx_mtu_exceeded_drops += pkt_cnt - cnt; > > + pkt_cnt = cnt; > > + > > + /* Prepare each mbuf for hardware offloading. */ > > + if (userspace_tso_enabled()) { > > + cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt); > > + stats->tx_invalid_hwol_drops += pkt_cnt - cnt; > > + pkt_cnt = cnt; > > + } > > + > > + /* Apply Quality of Service policy. */ > > + cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true); > > + stats->tx_qos_drops += pkt_cnt - cnt; > > + > > + return cnt; > > } > > > > static int > > @@ -2860,82 +2770,115 @@ netdev_dpdk_vhost_send(struct netdev > > *netdev, int qid, > > struct dp_packet_batch *batch, > > bool concurrent_txq OVS_UNUSED) > > { > > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > + int max_retries = VHOST_ENQ_RETRY_MIN; > > + int cnt, batch_cnt, vhost_batch_cnt; > > + int vid = netdev_dpdk_get_vid(dev); > > + struct netdev_dpdk_sw_stats stats; > > + struct rte_mbuf **pkts; > > + int retries; > > > > - if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { > > - dpdk_do_tx_copy(netdev, qid, batch); > > + batch_cnt = cnt = dp_packet_batch_size(batch); > > + qid = dev->tx_q[qid % netdev->n_txq].map; > > + if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0 > > + || !(dev->flags & NETDEV_UP))) { > > + rte_spinlock_lock(&dev->stats_lock); > > + dev->stats.tx_dropped += cnt; > > + rte_spinlock_unlock(&dev->stats_lock); > > dp_packet_delete_batch(batch, true); > > - } else { > > - __netdev_dpdk_vhost_send(netdev, qid, batch->packets, > > - dp_packet_batch_size(batch)); > > + return 0; > > + } > > + > > + cnt = netdev_dpdk_common_send(netdev, batch, &stats); > > + > > + if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev- > > >tx_q[qid].tx_lock))) { > > + COVERAGE_INC(vhost_tx_contention); > > + rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > > } > > + > > + pkts = (struct rte_mbuf **) batch->packets; > > + vhost_batch_cnt = cnt; > > + retries = 0; > > + do { > > + int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; > > + int tx_pkts; > > + > > + tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, pkts, > > cnt); > > + if (OVS_LIKELY(tx_pkts)) { > > + /* Packets have been sent.*/ > > + cnt -= tx_pkts; > > + /* Prepare for possible retry.*/ > > + pkts = &pkts[tx_pkts]; > > + if (OVS_UNLIKELY(cnt && !retries)) { > > + /* > > + * Read max retries as there are packets not sent > > + * and no retries have already occurred. > > + */ > > + atomic_read_relaxed(&dev->vhost_tx_retries_max, > > &max_retries); > > + } > > + } else { > > + /* No packets sent - do not retry.*/ > > + break; > > + } > > + } while (cnt && (retries++ < max_retries)); > > + > > + rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > > + > > + stats.tx_failure_drops += cnt; > > + stats.tx_retries = MIN(retries, max_retries); > > + > > + rte_spinlock_lock(&dev->stats_lock); > > + netdev_dpdk_vhost_update_tx_counters(dev, batch->packets, > > batch_cnt, > > + &stats); > > + rte_spinlock_unlock(&dev->stats_lock); > > + > > + pkts = (struct rte_mbuf **) batch->packets; > > + for (int i = 0; i < vhost_batch_cnt; i++) { > > + rte_pktmbuf_free(pkts[i]); > > + } > > + > > return 0; > > } > > > > -static inline void > > -netdev_dpdk_send__(struct netdev_dpdk *dev, int qid, > > - struct dp_packet_batch *batch, > > - bool concurrent_txq) > > +static int > > +netdev_dpdk_eth_send(struct netdev *netdev, int qid, > > + struct dp_packet_batch *batch, bool > > concurrent_txq) > > { > > + struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; > > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > + int batch_cnt = dp_packet_batch_size(batch); > > + struct netdev_dpdk_sw_stats stats; > > + int cnt, dropped; > > + > > if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) { > > dp_packet_delete_batch(batch, true); > > - return; > > + return 0; > > } > > > > + cnt = netdev_dpdk_common_send(netdev, batch, &stats); > > + dropped = batch_cnt - cnt; > > if (OVS_UNLIKELY(concurrent_txq)) { > > qid = qid % dev->up.n_txq; > > rte_spinlock_lock(&dev->tx_q[qid].tx_lock); > > } > > > > - if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { > > - struct netdev *netdev = &dev->up; > > - > > - dpdk_do_tx_copy(netdev, qid, batch); > > - dp_packet_delete_batch(batch, true); > > - } else { > > + dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt); > > + if (OVS_UNLIKELY(dropped)) { > > struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; > > - int dropped; > > - int tx_failure, mtu_drops, qos_drops, hwol_drops; > > - int batch_cnt = dp_packet_batch_size(batch); > > - struct rte_mbuf **pkts = (struct rte_mbuf **) batch- > > >packets; > > > > - hwol_drops = batch_cnt; > > - if (userspace_tso_enabled()) { > > - batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, > > batch_cnt); > > - } > > - hwol_drops -= batch_cnt; > > - mtu_drops = batch_cnt; > > - batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, > > batch_cnt); > > - mtu_drops -= batch_cnt; > > - qos_drops = batch_cnt; > > - batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true); > > - qos_drops -= batch_cnt; > > - > > - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, > > batch_cnt); > > - > > - dropped = tx_failure + mtu_drops + qos_drops + hwol_drops; > > - if (OVS_UNLIKELY(dropped)) { > > - rte_spinlock_lock(&dev->stats_lock); > > - dev->stats.tx_dropped += dropped; > > - sw_stats->tx_failure_drops += tx_failure; > > - sw_stats->tx_mtu_exceeded_drops += mtu_drops; > > - sw_stats->tx_qos_drops += qos_drops; > > - sw_stats->tx_invalid_hwol_drops += hwol_drops; > > - rte_spinlock_unlock(&dev->stats_lock); > > - } > > + rte_spinlock_lock(&dev->stats_lock); > > + dev->stats.tx_dropped += dropped; > > + sw_stats->tx_failure_drops += stats.tx_failure_drops; > > + sw_stats->tx_mtu_exceeded_drops += > > stats.tx_mtu_exceeded_drops; > > + sw_stats->tx_qos_drops += stats.tx_qos_drops; > > + sw_stats->tx_invalid_hwol_drops += > > stats.tx_invalid_hwol_drops; > > + rte_spinlock_unlock(&dev->stats_lock); > > } > > > > if (OVS_UNLIKELY(concurrent_txq)) { > > rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > > } > > -} > > - > > -static int > > -netdev_dpdk_eth_send(struct netdev *netdev, int qid, > > - struct dp_packet_batch *batch, bool > > concurrent_txq) > > -{ > > - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > > > > - netdev_dpdk_send__(dev, qid, batch, concurrent_txq); > > return 0; > > } > > > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 2640a421a..a1437db4d 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -2585,90 +2585,6 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev, } } -static void -__netdev_dpdk_vhost_send(struct netdev *netdev, int qid, - struct dp_packet **pkts, int cnt) -{ - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); - struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; - struct netdev_dpdk_sw_stats sw_stats_add; - unsigned int n_packets_to_free = cnt; - unsigned int total_packets = cnt; - int i, retries = 0; - int max_retries = VHOST_ENQ_RETRY_MIN; - int vid = netdev_dpdk_get_vid(dev); - - qid = dev->tx_q[qid % netdev->n_txq].map; - - if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0 - || !(dev->flags & NETDEV_UP))) { - rte_spinlock_lock(&dev->stats_lock); - dev->stats.tx_dropped+= cnt; - rte_spinlock_unlock(&dev->stats_lock); - goto out; - } - - if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) { - COVERAGE_INC(vhost_tx_contention); - rte_spinlock_lock(&dev->tx_q[qid].tx_lock); - } - - sw_stats_add.tx_invalid_hwol_drops = cnt; - if (userspace_tso_enabled()) { - cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt); - } - - sw_stats_add.tx_invalid_hwol_drops -= cnt; - sw_stats_add.tx_mtu_exceeded_drops = cnt; - cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt); - sw_stats_add.tx_mtu_exceeded_drops -= cnt; - - /* Check has QoS has been configured for the netdev */ - sw_stats_add.tx_qos_drops = cnt; - cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true); - sw_stats_add.tx_qos_drops -= cnt; - - n_packets_to_free = cnt; - - do { - int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; - unsigned int tx_pkts; - - tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts, cnt); - if (OVS_LIKELY(tx_pkts)) { - /* Packets have been sent.*/ - cnt -= tx_pkts; - /* Prepare for possible retry.*/ - cur_pkts = &cur_pkts[tx_pkts]; - if (OVS_UNLIKELY(cnt && !retries)) { - /* - * Read max retries as there are packets not sent - * and no retries have already occurred. - */ - atomic_read_relaxed(&dev->vhost_tx_retries_max, &max_retries); - } - } else { - /* No packets sent - do not retry.*/ - break; - } - } while (cnt && (retries++ < max_retries)); - - rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); - - sw_stats_add.tx_failure_drops = cnt; - sw_stats_add.tx_retries = MIN(retries, max_retries); - - rte_spinlock_lock(&dev->stats_lock); - netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets, - &sw_stats_add); - rte_spinlock_unlock(&dev->stats_lock); - -out: - for (i = 0; i < n_packets_to_free; i++) { - dp_packet_delete(pkts[i]); - } -} - static void netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque) { @@ -2783,76 +2699,70 @@ dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig) return pkt_dest; } -/* Tx function. Transmit packets indefinitely */ -static void -dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch) +/* Replace packets in a 'batch' with their corresponding copies using + * DPDK memory. + * + * Returns the number of good packets in the batch. */ +static size_t +dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct dp_packet_batch *batch) OVS_NO_THREAD_SAFETY_ANALYSIS { - const size_t batch_cnt = dp_packet_batch_size(batch); -#if !defined(__CHECKER__) && !defined(_WIN32) - const size_t PKT_ARRAY_SIZE = batch_cnt; -#else - /* Sparse or MSVC doesn't like variable length array. */ - enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST }; -#endif struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); - struct dp_packet *pkts[PKT_ARRAY_SIZE]; - struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; - uint32_t cnt = batch_cnt; - uint32_t dropped = 0; - uint32_t tx_failure = 0; - uint32_t mtu_drops = 0; - uint32_t qos_drops = 0; - - if (dev->type != DPDK_DEV_VHOST) { - /* Check if QoS has been configured for this netdev. */ - cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch->packets, - batch_cnt, false); - qos_drops = batch_cnt - cnt; - } - - uint32_t txcnt = 0; - - for (uint32_t i = 0; i < cnt; i++) { - struct dp_packet *packet = batch->packets[i]; - uint32_t size = dp_packet_size(packet); - - if (size > dev->max_packet_len - && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) { - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size, - dev->max_packet_len); - mtu_drops++; - continue; - } + size_t i, size = dp_packet_batch_size(batch); + struct dp_packet *packet; - pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet); - if (OVS_UNLIKELY(!pkts[txcnt])) { - dropped = cnt - i; - break; - } + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) { + if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) { + dp_packet_batch_refill(batch, packet, i); + } else { + struct dp_packet *pktcopy; - txcnt++; - } + pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet); + if (pktcopy) { + dp_packet_batch_refill(batch, pktcopy, i); + } - if (OVS_LIKELY(txcnt)) { - if (dev->type == DPDK_DEV_VHOST) { - __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt); - } else { - tx_failure += netdev_dpdk_eth_tx_burst(dev, qid, - (struct rte_mbuf **)pkts, - txcnt); + dp_packet_delete(packet); } } - dropped += qos_drops + mtu_drops + tx_failure; - if (OVS_UNLIKELY(dropped)) { - rte_spinlock_lock(&dev->stats_lock); - dev->stats.tx_dropped += dropped; - sw_stats->tx_failure_drops += tx_failure; - sw_stats->tx_mtu_exceeded_drops += mtu_drops; - sw_stats->tx_qos_drops += qos_drops; - rte_spinlock_unlock(&dev->stats_lock); + return dp_packet_batch_size(batch); +} + +static size_t +netdev_dpdk_common_send(struct netdev *netdev, struct dp_packet_batch *batch, + struct netdev_dpdk_sw_stats *stats) +{ + struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + size_t cnt, pkt_cnt = dp_packet_batch_size(batch); + + memset(stats, 0, sizeof *stats); + + /* Copy dp-packets to mbufs. */ + if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { + cnt = dpdk_copy_batch_to_mbuf(netdev, batch); + stats->tx_failure_drops += pkt_cnt - cnt; + pkt_cnt = cnt; } + + /* Drop oversized packets. */ + cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt); + stats->tx_mtu_exceeded_drops += pkt_cnt - cnt; + pkt_cnt = cnt; + + /* Prepare each mbuf for hardware offloading. */ + if (userspace_tso_enabled()) { + cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt); + stats->tx_invalid_hwol_drops += pkt_cnt - cnt; + pkt_cnt = cnt; + } + + /* Apply Quality of Service policy. */ + cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true); + stats->tx_qos_drops += pkt_cnt - cnt; + + return cnt; } static int @@ -2860,82 +2770,115 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch, bool concurrent_txq OVS_UNUSED) { + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + int max_retries = VHOST_ENQ_RETRY_MIN; + int cnt, batch_cnt, vhost_batch_cnt; + int vid = netdev_dpdk_get_vid(dev); + struct netdev_dpdk_sw_stats stats; + struct rte_mbuf **pkts; + int retries; - if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { - dpdk_do_tx_copy(netdev, qid, batch); + batch_cnt = cnt = dp_packet_batch_size(batch); + qid = dev->tx_q[qid % netdev->n_txq].map; + if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0 + || !(dev->flags & NETDEV_UP))) { + rte_spinlock_lock(&dev->stats_lock); + dev->stats.tx_dropped += cnt; + rte_spinlock_unlock(&dev->stats_lock); dp_packet_delete_batch(batch, true); - } else { - __netdev_dpdk_vhost_send(netdev, qid, batch->packets, - dp_packet_batch_size(batch)); + return 0; + } + + cnt = netdev_dpdk_common_send(netdev, batch, &stats); + + if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) { + COVERAGE_INC(vhost_tx_contention); + rte_spinlock_lock(&dev->tx_q[qid].tx_lock); } + + pkts = (struct rte_mbuf **) batch->packets; + vhost_batch_cnt = cnt; + retries = 0; + do { + int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; + int tx_pkts; + + tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, pkts, cnt); + if (OVS_LIKELY(tx_pkts)) { + /* Packets have been sent.*/ + cnt -= tx_pkts; + /* Prepare for possible retry.*/ + pkts = &pkts[tx_pkts]; + if (OVS_UNLIKELY(cnt && !retries)) { + /* + * Read max retries as there are packets not sent + * and no retries have already occurred. + */ + atomic_read_relaxed(&dev->vhost_tx_retries_max, &max_retries); + } + } else { + /* No packets sent - do not retry.*/ + break; + } + } while (cnt && (retries++ < max_retries)); + + rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); + + stats.tx_failure_drops += cnt; + stats.tx_retries = MIN(retries, max_retries); + + rte_spinlock_lock(&dev->stats_lock); + netdev_dpdk_vhost_update_tx_counters(dev, batch->packets, batch_cnt, + &stats); + rte_spinlock_unlock(&dev->stats_lock); + + pkts = (struct rte_mbuf **) batch->packets; + for (int i = 0; i < vhost_batch_cnt; i++) { + rte_pktmbuf_free(pkts[i]); + } + return 0; } -static inline void -netdev_dpdk_send__(struct netdev_dpdk *dev, int qid, - struct dp_packet_batch *batch, - bool concurrent_txq) +static int +netdev_dpdk_eth_send(struct netdev *netdev, int qid, + struct dp_packet_batch *batch, bool concurrent_txq) { + struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); + int batch_cnt = dp_packet_batch_size(batch); + struct netdev_dpdk_sw_stats stats; + int cnt, dropped; + if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) { dp_packet_delete_batch(batch, true); - return; + return 0; } + cnt = netdev_dpdk_common_send(netdev, batch, &stats); + dropped = batch_cnt - cnt; if (OVS_UNLIKELY(concurrent_txq)) { qid = qid % dev->up.n_txq; rte_spinlock_lock(&dev->tx_q[qid].tx_lock); } - if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) { - struct netdev *netdev = &dev->up; - - dpdk_do_tx_copy(netdev, qid, batch); - dp_packet_delete_batch(batch, true); - } else { + dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt); + if (OVS_UNLIKELY(dropped)) { struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats; - int dropped; - int tx_failure, mtu_drops, qos_drops, hwol_drops; - int batch_cnt = dp_packet_batch_size(batch); - struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets; - hwol_drops = batch_cnt; - if (userspace_tso_enabled()) { - batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt); - } - hwol_drops -= batch_cnt; - mtu_drops = batch_cnt; - batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt); - mtu_drops -= batch_cnt; - qos_drops = batch_cnt; - batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true); - qos_drops -= batch_cnt; - - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt); - - dropped = tx_failure + mtu_drops + qos_drops + hwol_drops; - if (OVS_UNLIKELY(dropped)) { - rte_spinlock_lock(&dev->stats_lock); - dev->stats.tx_dropped += dropped; - sw_stats->tx_failure_drops += tx_failure; - sw_stats->tx_mtu_exceeded_drops += mtu_drops; - sw_stats->tx_qos_drops += qos_drops; - sw_stats->tx_invalid_hwol_drops += hwol_drops; - rte_spinlock_unlock(&dev->stats_lock); - } + rte_spinlock_lock(&dev->stats_lock); + dev->stats.tx_dropped += dropped; + sw_stats->tx_failure_drops += stats.tx_failure_drops; + sw_stats->tx_mtu_exceeded_drops += stats.tx_mtu_exceeded_drops; + sw_stats->tx_qos_drops += stats.tx_qos_drops; + sw_stats->tx_invalid_hwol_drops += stats.tx_invalid_hwol_drops; + rte_spinlock_unlock(&dev->stats_lock); } if (OVS_UNLIKELY(concurrent_txq)) { rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); } -} - -static int -netdev_dpdk_eth_send(struct netdev *netdev, int qid, - struct dp_packet_batch *batch, bool concurrent_txq) -{ - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); - netdev_dpdk_send__(dev, qid, batch, concurrent_txq); return 0; }
This patch split out the common code between vhost and dpdk transmit paths to shared functions to simplify the code and fix an issue. The issue is that the packet coming from non-DPDK device and egressing on a DPDK device currently skips the hwol preparation. This also have the side effect of leaving only the dpdk transmit code under the txq lock. Signed-off-by: Flavio Leitner <fbl@sysclose.org> --- V2: - mentioned the tx lock change in the commit message. - fixed packet leak when copy fails. - moved pkt_cnt = cnt two lines up. I tested the following scenarios with iperf and iperf -R and noticed no performance regression: IPERF VM-Bridge 1.02x IPERF VM-VM 1.04x IPERF VM-ExtHost 1.00x IPERF VM-NS 1.01x IPERF VM-VLAN-Bridge 1.03x IPERF VM-VLAN-VM 1.03x IPERF VM-VLAN-ExtHost 1.01x IPERF VM-VLAN-NS 1.02x IPERF VM-V6-VM 1.03x IPERF VM-V6-ExtHost 1.00x IPERF VM-V6-NS 1.01x IPERF-R VM-Bridge 1.01x IPERF-R VM-VM 1.04x IPERF-R VM-ExtHost 1.10x IPERF-R VM-NS 1.01x IPERF-R VM-VLAN-Bridge 1.03x IPERF-R VM-VLAN-VM 1.02x IPERF-R VM-VLAN-ExtHost 1.08x IPERF-R VM-VLAN-NS 1.02x IPERF-R VM-V6-VM 1.00x IPERF-R VM-V6-ExtHost 1.11x IPERF-R VM-V6-NS 1.00x Now using trex, 64byte packet, PVP: Original: 3.6Mpps avg. packets per output batch: 32.00 idle cycles: 0 (0.00%) avg cycles per packet: 304.92 (8331383020/27323150) avg processing cycles per packet: 304.92 (8331383020/27323150) Patched: 3.6Mpps avg. packets per output batch: 32.00 idle cycles: 0 (0.00%) avg cycles per packet: 304.08 (21875784116/71941516) avg processing cycles per packet: 304.08 (21875784116/71941516) lib/netdev-dpdk.c | 335 +++++++++++++++++++--------------------------- 1 file changed, 139 insertions(+), 196 deletions(-)