[ovs-dev,v4,3/7] netdev-dpdk: Enable TSO when using multi-seg mbufs

Message ID 20190911081127.2140-4-michalx.obrembski@intel.com
State Superseded
Series: dpdk: Add support for TSO

Commit Message

Michal Obrembski Sept. 11, 2019, 8:11 a.m. UTC
From: Tiago Lam <tiago.lam@intel.com>

TCP Segmentation Offload (TSO) is a feature which enables the TCP/IP
network stack to delegate segmentation of a large TCP segment to the
hardware NIC, thus saving compute resources. This may significantly
improve performance for TCP workloads in virtualized environments.
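
As a rough, illustrative sketch of the work TSO moves off the CPU, the C code below computes how many frames a NIC would emit when segmenting one oversized TCP segment. It assumes IPv4 and TCP headers without options (20 bytes each); the helper names are hypothetical and are not part of OvS or DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: IPv4 and TCP headers without options. */
#define IPV4_HDR_LEN 20u
#define TCP_HDR_LEN  20u

/* Hypothetical helper: maximum TCP payload per frame (the MSS) for a
 * given L3 MTU. */
static uint32_t
tso_mss_for_mtu(uint32_t mtu)
{
    assert(mtu > IPV4_HDR_LEN + TCP_HDR_LEN);
    return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}

/* Hypothetical helper: number of frames produced when 'payload_len' bytes
 * of TCP payload are cut into MSS-sized chunks, i.e.
 * ceil(payload_len / mss).  TSO payloads are at most 64 KiB, so the
 * addition below cannot overflow. */
static uint32_t
tso_num_segments(uint32_t payload_len, uint32_t mss)
{
    assert(mss > 0);
    return (payload_len + mss - 1) / mss;
}
```

For example, with a 1500-byte MTU the MSS is 1460 bytes, so a 64000-byte payload handed to the NIC in one piece leaves the wire as 44 frames; without TSO, the software stack would have to build those 44 frames itself.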

While a previous commit already added the necessary logic to netdev-dpdk
to deal with packets marked for TSO, this set of changes enables TSO by
default when multi-segment mbufs are in use (on physical ports, only if
the NIC supports it).
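
The enablement rules can be summarized in a small, self-contained sketch that mirrors the checks this patch adds to dpdk_eth_dev_init() and dpdk_vhost_reconfigure_helper(). The TX_OFFLOAD_* bit values are placeholders standing in for DPDK's DEV_TX_OFFLOAD_* flags, and tso_should_enable() is a hypothetical helper; it assumes that a physical port must support all three Tx offloads which the patch requests together (TSO, TCP checksum and IPv4 checksum).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits standing in for DPDK's DEV_TX_OFFLOAD_* flags; the
 * real values come from rte_ethdev.h. */
#define TX_OFFLOAD_IPV4_CKSUM (UINT64_C(1) << 1)
#define TX_OFFLOAD_TCP_CKSUM  (UINT64_C(1) << 3)
#define TX_OFFLOAD_TCP_TSO    (UINT64_C(1) << 5)

enum port_kind { PORT_PHY, PORT_VHOST };

/* Hypothetical helper: returns true if TSO should be enabled on a port. */
static bool
tso_should_enable(enum port_kind kind, bool multi_seg_mbufs,
                  uint64_t tx_offload_capa)
{
    const uint64_t required = TX_OFFLOAD_TCP_TSO | TX_OFFLOAD_TCP_CKSUM
                              | TX_OFFLOAD_IPV4_CKSUM;

    if (!multi_seg_mbufs) {
        return false;           /* TSO is built on multi-segment mbufs. */
    }
    if (kind == PORT_VHOST) {
        return true;            /* No NIC capability check for vhost. */
    }
    /* Physical port: every required Tx offload must be advertised. */
    return (tx_offload_capa & required) == required;
}
```

Note that the physical-port check compares against the full mask; a plain `tx_offload_capa & required` test would also accept a NIC that only offers, say, IPv4 checksum offload.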

Thus, to enable TSO on the physical DPDK interfaces, only the following
command needs to be issued before starting OvS:
    ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Mark Kavanagh <mark.b.kavanagh@intel.com>

Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Signed-off-by: Tiago Lam <tiago.lam@intel.com>
Signed-off-by: Michal Obrembski <michalx.obrembski@intel.com>
---
 Documentation/automake.mk           |  1 +
 Documentation/topics/dpdk/index.rst |  1 +
 Documentation/topics/dpdk/tso.rst   | 99 +++++++++++++++++++++++++++++++++++++
 NEWS                                |  1 +
 lib/netdev-dpdk.c                   | 70 ++++++++++++++++++++++++--
 5 files changed, 167 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 2a3214a..5955dd7 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -40,6 +40,7 @@  DOC_SOURCE = \
 	Documentation/topics/dpdk/index.rst \
 	Documentation/topics/dpdk/bridge.rst \
 	Documentation/topics/dpdk/jumbo-frames.rst \
+	Documentation/topics/dpdk/tso.rst \
 	Documentation/topics/dpdk/memory.rst \
 	Documentation/topics/dpdk/pdump.rst \
 	Documentation/topics/dpdk/phy.rst \
diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
index cf24a7b..eb2a04d 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,4 +40,5 @@  The DPDK Datapath
    /topics/dpdk/qos
    /topics/dpdk/pdump
    /topics/dpdk/jumbo-frames
+   /topics/dpdk/tso
    /topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
new file mode 100644
index 0000000..14f8c39
--- /dev/null
+++ b/Documentation/topics/dpdk/tso.rst
@@ -0,0 +1,99 @@ 
+..
+      Copyright 2018, Red Hat, Inc.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+===
+TSO
+===
+
+**Note:** This feature is considered experimental.
+
+TCP Segmentation Offload (TSO) is a mechanism which allows a TCP/IP stack to
+offload TCP segmentation to hardware, saving the CPU cycles that would
+otherwise be spent performing the same segmentation in software.
+
+With TSO, the network stack hands an oversized TCP segment to the underlying
+physical NIC, which splits it into MSS-sized segments, each with its own copy
+of the headers and updated sequence numbers and checksums. This frees up CPU
+cycles in the core for more useful work.
+
+A common use case for TSO is virtualization, where traffic coming in from a
+VM can have its TCP segmentation offloaded, thus avoiding segmentation in
+software. Additionally, if the traffic is headed to a VM within the same host,
+further optimization can be expected: as the traffic never leaves the
+machine, no MTU needs to be accounted for, and thus no segmentation and
+checksum calculations are required, which saves yet more cycles. Segmentation
+only needs to happen when the traffic actually leaves the host, in which case
+it is performed by the egress NIC.
+
+When using TSO with DPDK, the implementation relies on the multi-segment mbufs
+feature, described in :doc:`/topics/dpdk/jumbo-frames`, where each mbuf
+contains ~2KiB of the entire packet's data and is linked to the next mbuf that
+contains the next portion of data.
+
+Enabling TSO
+~~~~~~~~~~~~
+.. Important::
+
+    Once multi-segment mbufs are enabled, TSO is enabled by default on each
+    physical NIC attached to OvS-DPDK that supports it, as well as on vHost
+    User ports.
+
+When using :doc:`vHost User ports <vhost-user>`, TSO is enabled in OvS by the
+DPDK vHost User backend: when a new guest connection is established, TSO is
+advertised to the guest as an available feature.
+
+Once advertised, TSO may be enabled for the guest in one of the following two
+ways:
+
+1. QEMU Command Line Parameter::
+
+    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
+    ...
+    -device \
+    virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on \
+    ...
+
+2. Ethtool. If the guest's OS supports TSO, enable it from inside the guest::
+
+    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
+    $ ethtool -K eth0 tso on
+    $ ethtool -k eth0           # verify the resulting offload settings
+
+To use TSO in a guest, the underlying physical NIC must first support TSO;
+consult your controller's datasheet for compatibility. Secondly, the NIC must
+have an associated DPDK Poll Mode Driver (PMD) which supports TSO.
+
+Limitations
+~~~~~~~~~~~
+
+The current OvS TSO implementation supports flat and VLAN networks only (i.e.
+no support for TSO over tunneled connections, such as VXLAN, GRE or IP-in-IP).
+
+Also, as TSO is built on top of multi-segment mbufs, the constraints pointed
+out in :doc:`/topics/dpdk/jumbo-frames` also apply to TSO. Thus, some
+performance impact may be noticed when using certain features, such as the
+userspace connection tracker. As mentioned in the same document, it is also
+paramount that a packet's headers are contained within the first mbuf (~2KiB
+in size).
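
To make the multi-segment constraints above concrete, here is a minimal model of an mbuf chain. The struct seg type is a hypothetical stand-in for struct rte_mbuf that keeps only a data length and a next pointer, and the 2048-byte data room approximates the ~2KiB per mbuf mentioned in the documentation; neither is the real DPDK layout. The sketch counts the segments a packet needs, walks a chain to recover the packet length, and checks that the headers fit in the first segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed data room per mbuf (~2 KiB). */
#define SEG_DATA_ROOM 2048u

/* Hypothetical stand-in for struct rte_mbuf: one link of a chain. */
struct seg {
    uint16_t data_len;      /* Bytes of packet data held in this segment. */
    struct seg *next;       /* Next segment, or NULL for the last one. */
};

/* Number of segments needed to hold 'pkt_len' bytes of packet data. */
static uint32_t
segs_needed(uint32_t pkt_len)
{
    return (pkt_len + SEG_DATA_ROOM - 1) / SEG_DATA_ROOM;
}

/* Total packet length, computed by walking the chain. */
static uint32_t
chain_pkt_len(const struct seg *s)
{
    uint32_t len = 0;

    for (; s != NULL; s = s->next) {
        len += s->data_len;
    }
    return len;
}

/* True if 'hdr_len' bytes of headers (L2 + L3 + L4) fit entirely in the
 * first segment of the chain. */
static bool
headers_in_first_seg(const struct seg *first, uint32_t hdr_len)
{
    assert(first != NULL);
    return hdr_len <= first->data_len;
}
```

Under these assumptions a maximum-size 64 KiB TSO packet spans 32 segments, while its Ethernet, IPv4 and TCP headers (54 bytes without options) easily fit in the first one.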
diff --git a/NEWS b/NEWS
index 1278ada..e219822 100644
--- a/NEWS
+++ b/NEWS
@@ -44,6 +44,7 @@  v2.12.0 - xx xxx xxxx
        specific subtables based on the miniflow attributes, enhancing the
        performance of the subtable search.
      * Add Linux AF_XDP support through a new experimental netdev type "afxdp".
+     * Add support for TSO (experimental, between DPDK interfaces only).
    - OVSDB:
      * OVSDB clients can now resynchronize with clustered servers much more
        quickly after a brief disconnection, saving bandwidth and CPU time.
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2304f28..7552caa 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -345,7 +345,8 @@  struct ingress_policer {
 enum dpdk_hw_ol_features {
     NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
     NETDEV_RX_HW_CRC_STRIP = 1 << 1,
-    NETDEV_RX_HW_SCATTER = 1 << 2
+    NETDEV_RX_HW_SCATTER = 1 << 2,
+    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
 };
 
 /*
@@ -996,8 +997,18 @@  dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             return -ENOTSUP;
         }
 
+        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
+        }
+
         txconf = info.default_txconf;
         txconf.offloads = conf.txmode.offloads;
+    } else if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+        VLOG_WARN("Failed to set Tx TSO offload in %s. Requires option "
+                  "`dpdk-multi-seg-mbufs` to be enabled.", dev->up.name);
     }
 
     conf.intr_conf.lsc = dev->lsc_interrupt_mode;
@@ -1114,6 +1125,9 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
     uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
                                      DEV_RX_OFFLOAD_TCP_CKSUM |
                                      DEV_RX_OFFLOAD_IPV4_CKSUM;
+    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
+                                   DEV_TX_OFFLOAD_TCP_CKSUM |
+                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
 
     rte_eth_dev_info_get(dev->port_id, &info);
 
@@ -1140,6 +1154,18 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
         dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
     }
 
+    dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+    if (dpdk_multi_segment_mbufs) {
+        /* TSO, TCP and IPv4 Tx checksum offloads must all be supported. */
+        if ((info.tx_offload_capa & tx_tso_offload_capa)
+            == tx_tso_offload_capa) {
+            dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+        } else {
+            VLOG_WARN("Tx TSO offload is not supported on port "
+                      DPDK_PORT_ID_FMT, dev->port_id);
+        }
+    }
+
     n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
     n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
 
@@ -1727,6 +1753,11 @@  netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
         } else {
             smap_add(args, "rx_csum_offload", "false");
         }
+        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+            smap_add(args, "tx_tso_offload", "true");
+        } else {
+            smap_add(args, "tx_tso_offload", "false");
+        }
         smap_add(args, "lsc_interrupt_mode",
                  dev->lsc_interrupt_mode ? "true" : "false");
     }
@@ -2445,9 +2476,21 @@  netdev_dpdk_qos_run(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
     return cnt;
 }
 
+/* Filters DPDK packets by the following criteria:
+ * - A packet is marked for TSO, but the egress dev doesn't
+ *   support TSO;
+ * - A packet's pkt_len is bigger than the pre-defined
+ *   max_packet_len, and the packet isn't marked for TSO.
+ *
+ * If either of the above cases applies, the packet is freed
+ * and removed from 'pkts'. Otherwise the packet is kept in
+ * 'pkts' untouched.
+ *
+ * Returns the number of unfiltered packets left in 'pkts'.
+ */
 static int
-netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
-                              int pkt_cnt)
+netdev_dpdk_filter_packet(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
+                          int pkt_cnt)
 {
     int i = 0;
     int cnt = 0;
@@ -2457,6 +2500,15 @@  netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
     for (i = 0; i < pkt_cnt; i++) {
         pkt = pkts[i];
 
+        /* Drop TSO packet if there's no TSO support on egress port. */
+        if ((pkt->ol_flags & PKT_TX_TCP_SEG) &&
+            !(dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD)) {
+            VLOG_WARN_RL(&rl, "%s: TSO disabled on port, dropping TSO packet "
+                         "of %" PRIu32 " bytes", dev->up.name, pkt->pkt_len);
+            rte_pktmbuf_free(pkt);
+            continue;
+        }
+
         if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
             if (!(pkt->ol_flags & PKT_TX_TCP_SEG)) {
                 VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
@@ -2528,7 +2580,7 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
 
     rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
 
-    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
+    cnt = netdev_dpdk_filter_packet(dev, cur_pkts, cnt);
     /* Check has QoS has been configured for the netdev */
     cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true);
     dropped = total_pkts - cnt;
@@ -2747,7 +2799,7 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
         int batch_cnt = dp_packet_batch_size(batch);
         struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
 
-        tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
+        tx_cnt = netdev_dpdk_filter_packet(dev, pkts, batch_cnt);
         tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
         dropped = batch_cnt - tx_cnt;
 
@@ -4445,6 +4497,14 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
         dev->tx_q[0].map = 0;
     }
 
+    if (dpdk_multi_segment_mbufs) {
+        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+
+        VLOG_DBG("%s: TSO enabled on vhost port", dev->up.name);
+    } else {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+    }
+
     netdev_dpdk_remap_txqs(dev);
 
     err = netdev_dpdk_mempool_configure(dev);
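
The filtering rules of netdev_dpdk_filter_packet() can be illustrated with a self-contained sketch. It is not the OvS code: struct pkt is a hypothetical stand-in for struct rte_mbuf (with a boolean in place of the PKT_TX_TCP_SEG flag in ol_flags), dropped packets are only counted rather than passed to rte_pktmbuf_free(), and it assumes that surviving packets are compacted to the front of the array, as the function's comment implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct rte_mbuf fields used here. */
struct pkt {
    uint32_t pkt_len;
    bool tso;               /* Models PKT_TX_TCP_SEG in ol_flags. */
};

/* Keeps the packets that may be sent on a port, compacting them to the
 * front of 'pkts', and returns how many were kept.  Dropped packets are
 * only counted in '*n_dropped'; the real code frees them instead. */
static int
filter_packets(struct pkt **pkts, int pkt_cnt, bool port_tso,
               uint32_t max_packet_len, int *n_dropped)
{
    int cnt = 0;

    *n_dropped = 0;
    for (int i = 0; i < pkt_cnt; i++) {
        struct pkt *p = pkts[i];

        /* Marked for TSO, but the egress port cannot segment it. */
        if (p->tso && !port_tso) {
            (*n_dropped)++;
            continue;
        }
        /* Oversized, and not marked for TSO. */
        if (p->pkt_len > max_packet_len && !p->tso) {
            (*n_dropped)++;
            continue;
        }
        pkts[cnt++] = p;
    }
    return cnt;
}
```

The order of the two checks matters only for logging: a TSO packet larger than max_packet_len is kept when the port supports TSO, since the NIC will segment it, and dropped otherwise.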