
[ovs-dev,v5] userspace: Add TCP Segmentation Offload support

Message ID 20200117214755.558977-1-fbl@sysclose.org
State Accepted
Series [ovs-dev,v5] userspace: Add TCP Segmentation Offload support

Commit Message

Flavio Leitner Jan. 17, 2020, 9:47 p.m. UTC
Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC, reducing
the per-packet CPU overhead.

A guest using a vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside the virtual switch.

If the destination of the packet is another guest on the same host, then
the same big packet can be sent through a vhostuser interface, skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check whether the NIC hardware supports TSO before
enabling the feature, which is off by default. For additional information
please check the userspace-tso.rst document.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 Documentation/automake.mk              |   1 +
 Documentation/topics/index.rst         |   1 +
 Documentation/topics/userspace-tso.rst |  98 +++++++
 NEWS                                   |   1 +
 lib/automake.mk                        |   2 +
 lib/conntrack.c                        |  29 +-
 lib/dp-packet.h                        | 176 ++++++++++-
 lib/ipf.c                              |  32 +-
 lib/netdev-dpdk.c                      | 348 +++++++++++++++++++---
 lib/netdev-linux-private.h             |   5 +
 lib/netdev-linux.c                     | 386 ++++++++++++++++++++++---
 lib/netdev-provider.h                  |   9 +
 lib/netdev.c                           |  78 ++++-
 lib/userspace-tso.c                    |  53 ++++
 lib/userspace-tso.h                    |  23 ++
 vswitchd/bridge.c                      |   2 +
 vswitchd/vswitch.xml                   |  20 ++
 17 files changed, 1140 insertions(+), 124 deletions(-)
 create mode 100644 Documentation/topics/userspace-tso.rst
 create mode 100644 lib/userspace-tso.c
 create mode 100644 lib/userspace-tso.h

Testing:
 - Travis, Cirrus, AppVeyor, testsuite passed OK.
 - noticed no changes since v4 with regard to performance.

Changelog:
- v5
 * rebased on top of master (NEWS conflict)
 * added missing periods at the end of comments
 * mentioned the DPDK requirement in vswitch.xml
 * restricted the TSO feature to OvS built with DPDK
 * headers in alphabetical order
 * removed unneeded call to initialize pkt
 * used OVS_UNLIKELY instead of unlikely
 * removed parentheses from sizeof()
 * removed blank line at dp_packet_hwol_tx_l4_checksum()
 * removed redundant dp_packet_hwol_tx_ipv4_checksum()
 * updated function comments as suggested

- v4
 * rebased on top of master (recvmmsg)
 * fixed URL in doc to point to 19.11
 * renamed tso to userspace-tso
 * renamed the option to userspace-tso-enable
 * removed a prototype left over from v2
 * fixed function declaration style
 * renamed dp_packet_hwol_tx_ip_checksum to dp_packet_hwol_tx_ipv4_checksum
 * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
 * accounted for drops while prepping the batch for TX.
 * don't prep the batch for TX if TSO is disabled.
 * simplified setsockopt error checking
 * fixed af_packet_sock error checking to not call setsockopt on
      closed sockets.
 * fixed ol_flags comment.
 * used VLOG_ERR_BUF() to pass error messages.
 * fixed packet leak at netdev_send_prepare_batch()
 * added a coverage counter to account drops while preparing a batch
   at netdev.c
 * fixed netdev_send() to not call ->send() if the batch is empty.
 * fixed packet leak at netdev_push_header and account for the drops.
 * removed DPDK requirement to enable userspace TSO support.
 * fixed parameter documentation in vswitch.xml.
 * renamed tso.rst to userspace-tso.rst and moved to topics/
 * added comments documenting the functions in dp-packet.h
 * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG

- v3
 * Improved the documentation.
 * Updated copyright year to 2020.
 * TSO offloaded msg now includes the netdev's name.
 * Added period at the end of all code comments.
 * Warn and drop encapsulation of TSO packets.
 * Fixed travis issue with restricted virtio types.
 * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
   which caused packet corruption.
 * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
   PKT_TX_IP_CKSUM only for IPv4 packets.

Comments

Stokes, Ian Jan. 17, 2020, 9:54 p.m. UTC | #1
On 1/17/2020 9:47 PM, Flavio Leitner wrote:
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
> 
> A guest using vhostuser interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
> 
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
> 
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhostuser interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
> 
> It is recommended to check if NIC hardware supports TSO before enabling
> the feature, which is off by default. For additional information please
> check the tso.rst document.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Fantastic work here Flavio, quick turnaround when needed.

Acked

BR
Ian
> ---
>   Documentation/automake.mk              |   1 +
>   Documentation/topics/index.rst         |   1 +
>   Documentation/topics/userspace-tso.rst |  98 +++++++
>   NEWS                                   |   1 +
>   lib/automake.mk                        |   2 +
>   lib/conntrack.c                        |  29 +-
>   lib/dp-packet.h                        | 176 ++++++++++-
>   lib/ipf.c                              |  32 +-
>   lib/netdev-dpdk.c                      | 348 +++++++++++++++++++---
>   lib/netdev-linux-private.h             |   5 +
>   lib/netdev-linux.c                     | 386 ++++++++++++++++++++++---
>   lib/netdev-provider.h                  |   9 +
>   lib/netdev.c                           |  78 ++++-
>   lib/userspace-tso.c                    |  53 ++++
>   lib/userspace-tso.h                    |  23 ++
>   vswitchd/bridge.c                      |   2 +
>   vswitchd/vswitch.xml                   |  20 ++
>   17 files changed, 1140 insertions(+), 124 deletions(-)
>   create mode 100644 Documentation/topics/userspace-tso.rst
>   create mode 100644 lib/userspace-tso.c
>   create mode 100644 lib/userspace-tso.h
> 
> Testing:
>   - Travis, Cirrus, AppVeyor, testsuite passed OK.
>   - notice no changes since v4 with regards to performance.
> 
> Changelog:
> - v5
>   * rebased on top of master (NEWS conflict)
>   * added missing periods at the end of comments
>   * mention DPDK requirement at vswitch.xml
>   * restricted tso feature to OvS built with dpdk
>   * headers in alphabetical order
>   * removed unneeded call to initialize pkt
>   * used OVS_UNLIKELY instead of unlikely
>   * removed parenthesis from sizeof()
>   * removed blank line at dp_packet_hwol_tx_l4_checksum()
>   * removed redundant dp_packet_hwol_tx_ipv4_checksum()
>   * updated function comments as suggested
> 
> - v4
>   * rebased on top of master (recvmmsg)
>   * fixed URL in doc to point to 19.11
>   * renamed tso to userspace-tso
>   * renamed the option to userspace-tso-enable
>   * removed prototype that left over from v2
>   * fixed function style declaration
>   * renamed dp_packet_hwol_tx_ip_checksum to dp_packet_hwol_tx_ipv4_checksum
>   * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
>   * account for drops while preping the batch for TX.
>   * don't prep the batch for TX if TSO is disabled.
>   * simplified setsockopt error checking
>   * fixed af_packet_sock error checking to not call setsockopt on
>        closed sockets.
>   * fixed ol_flags comment.
>   * used VLOG_ERR_BUF() to pass error messages.
>   * fixed packet leak at netdev_send_prepare_batch()
>   * added a coverage counter to account drops while preparing a batch
>     at netdev.c
>   * fixed netdev_send() to not call ->send() if the batch is empty.
>   * fixed packet leak at netdev_push_header and account for the drops.
>   * removed DPDK requirement to enable userspace TSO support.
>   * fixed parameter documentation in vswitch.xml.
>   * renamed tso.rst to userspace-tso.rst and moved to topics/
>   * added comments documeting the functions in dp-packet.h
>   * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG
> 
> - v3
>   * Improved the documentation.
>   * Updated copyright year to 2020.
>   * TSO offloaded msg now includes the netdev's name.
>   * Added period at the end of all code comments.
>   * Warn and drop encapsulation of TSO packets.
>   * Fixed travis issue with restricted virtio types.
>   * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
>     which caused packet corruption.
>   * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
>     PKT_TX_IP_CKSUM only for IPv4 packets.
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index f2ca17bad..22976a3cd 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -57,6 +57,7 @@ DOC_SOURCE = \
>   	Documentation/topics/ovsdb-replication.rst \
>   	Documentation/topics/porting.rst \
>   	Documentation/topics/tracing.rst \
> +	Documentation/topics/userspace-tso.rst \
>   	Documentation/topics/windows.rst \
>   	Documentation/howto/index.rst \
>   	Documentation/howto/dpdk.rst \
> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
> index 34c4b10e0..08af3a24d 100644
> --- a/Documentation/topics/index.rst
> +++ b/Documentation/topics/index.rst
> @@ -50,5 +50,6 @@ OVS
>      language-bindings
>      testing
>      tracing
> +   userspace-tso
>      idl-compound-indexes
>      ovs-extensions
> diff --git a/Documentation/topics/userspace-tso.rst b/Documentation/topics/userspace-tso.rst
> new file mode 100644
> index 000000000..893c64839
> --- /dev/null
> +++ b/Documentation/topics/userspace-tso.rst
> @@ -0,0 +1,98 @@
> +..
> +      Copyright 2020, Red Hat, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +========================
> +Userspace Datapath - TSO
> +========================
> +
> +**Note:** This feature is considered experimental.
> +
> +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> +segmentation achieves computational savings in the core, freeing up CPU cycles
> +for more useful work.
> +
> +A common use case for TSO is when using virtualization, where traffic that's
> +coming in from a VM can offload the TCP segmentation, thus avoiding the
> +fragmentation in software. Additionally, if the traffic is headed to a VM
> +within the same host further optimization can be expected. As the traffic never
> +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> +and checksum calculations are required, which saves yet more cycles. Only when
> +the traffic actually leaves the host the segmentation needs to happen, in which
> +case it will be performed by the egress NIC. Consult your controller's
> +datasheet for compatibility. Secondly, the NIC must have an associated DPDK
> +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per PMD,
> +refer to the `DPDK documentation`__.
> +
> +__ https://doc.dpdk.org/guides-19.11/nics/overview.html
> +
> +Enabling TSO
> +~~~~~~~~~~~~
> +
> +The TSO support may be enabled via a global config value
> +``userspace-tso-enable``.  Setting this to ``true`` enables TSO support for
> +all ports.
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
> +
> +The default value is ``false``.
> +
> +Changing ``userspace-tso-enable`` requires restarting the daemon.
> +
> +When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled
> +as follows.
> +
> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> +connection is established, `TSO` is thus advertised to the guest as an
> +available feature:
> +
> +QEMU Command Line Parameter::
> +
> +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> +    ...
> +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> +    ...
> +
> +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> +used to enable same::
> +
> +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> +    $ ethtool -K eth0 tso on
> +    $ ethtool -k eth0
> +
> +~~~~~~~~~~~
> +Limitations
> +~~~~~~~~~~~
> +
> +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP,
> +etc.]).
> +
> +There is no software implementation of TSO, so all ports attached to the
> +datapath must support TSO or packets using that feature will be dropped
> +on ports without TSO support.  That also means guests using vhost-user
> +in client mode will receive TSO packet regardless of TSO being enabled
> +or disabled within the guest.
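
The knob documented above is read once at startup and then consulted throughout
the datapath via userspace_tso_enabled(). The new lib/userspace-tso.c is not
part of this excerpt, so the following is only a minimal sketch of how such a
one-shot gate could look; the initializer name and signature here are
assumptions, not the patch's actual API:

    /* Sketch only: a boolean taken from other_config at startup.  The real
     * lib/userspace-tso.c is not shown in this excerpt. */
    #include <stdbool.h>

    static bool tso_enabled = false;      /* Feature is off by default. */

    /* Hypothetical one-shot initializer, called during bridge
     * reconfiguration with other_config:userspace-tso-enable. */
    void
    userspace_tso_init_sketch(bool enable)
    {
        tso_enabled = enable;   /* Changing it later requires a restart. */
    }

    bool
    userspace_tso_enabled(void)
    {
        return tso_enabled;
    }
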
> diff --git a/NEWS b/NEWS
> index 579e91c89..c6d3b6053 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -30,6 +30,7 @@ Post-v2.12.0
>        * Add support for DPDK 19.11.
>        * Add hardware offload support for output, drop, set of MAC, IPv4 and
>          TCP/UDP ports actions (experimental).
> +     * Add experimental support for TSO.
>      - RSTP:
>        * The rstp_statistics column in Port table will only be updated every
>          stats-update-interval configured in Open_vSwitch table.
> diff --git a/lib/automake.mk b/lib/automake.mk
> index ebf714501..95925b57c 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -314,6 +314,8 @@ lib_libopenvswitch_la_SOURCES = \
>   	lib/unicode.h \
>   	lib/unixctl.c \
>   	lib/unixctl.h \
> +	lib/userspace-tso.c \
> +	lib/userspace-tso.h \
>   	lib/util.c \
>   	lib/util.h \
>   	lib/uuid.c \
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index b80080e72..60222ca53 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
>           if (hwol_bad_l3_csum) {
>               ok = false;
>           } else {
> -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> +                                     || dp_packet_hwol_is_ipv4(pkt);
>               /* Validate the checksum only when hwol is not supported. */
>               ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
>                                    !hwol_good_l3_csum);
> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
>       if (ok) {
>           bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
>           if (!hwol_bad_l4_csum) {
> -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
>               /* Validate the checksum only when hwol is not supported. */
>               if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
>                              &ctx->icmp_related, l3, !hwol_good_l4_csum,
> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>                   }
>                   if (seq_skew) {
>                       ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> -                                          l3_hdr->ip_tot_len, htons(ip_len));
> +                    if (!dp_packet_hwol_is_ipv4(pkt)) {
> +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> +                                                        l3_hdr->ip_tot_len,
> +                                                        htons(ip_len));
> +                    }
>                       l3_hdr->ip_tot_len = htons(ip_len);
>                   }
>               }
> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>       }
>   
>       th->tcp_csum = 0;
> -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> -                           dp_packet_l4_size(pkt));
> -    } else {
> -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> -        th->tcp_csum = csum_finish(
> -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> +                               dp_packet_l4_size(pkt));
> +        } else {
> +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> +            th->tcp_csum = csum_finish(
> +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> +        }
>       }
>   
>       if (seq_skew) {
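
All of the conntrack hunks above follow one pattern: when a packet carries a
hardware-offload mark, its checksum fields are not authoritative, so software
neither validates them nor patches them incrementally; only the header fields
themselves are updated and the egress NIC (or virtio backend) recomputes the
sums on transmit. A condensed, illustrative sketch of that decision using the
helpers introduced in dp-packet.h below (the wrapper function name here is
hypothetical):

    /* Illustrative only: update ip_tot_len and keep the checksum coherent
     * when no IP checksum offload was requested for 'pkt'.  Assumes OVS's
     * dp-packet.h, packets.h and csum.h are included. */
    static void
    set_ip_tot_len(struct dp_packet *pkt, struct ip_header *l3,
                   uint16_t new_len)
    {
        if (!dp_packet_hwol_is_ipv4(pkt)) {
            /* Software path: adjust the existing checksum incrementally. */
            l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len,
                                        htons(new_len));
        }
        /* Offload path: the NIC recomputes the checksum, so only the
         * length field needs to change. */
        l3->ip_tot_len = htons(new_len);
    }
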
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 133942155..69ae5dfac 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -456,7 +456,7 @@ dp_packet_init_specific(struct dp_packet *p)
>   {
>       /* This initialization is needed for packets that do not come from DPDK
>        * interfaces, when vswitchd is built with --with-dpdk. */
> -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>       p->mbuf.nb_segs = 1;
>       p->mbuf.next = NULL;
>   }
> @@ -519,6 +519,95 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
>       b->mbuf.buf_len = s;
>   }
>   
> +/* Returns 'true' if packet 'b' is marked for TCP segmentation offloading. */
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b)
> +{
> +    return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG);
> +}
> +
> +/* Returns 'true' if packet 'b' is marked for IPv4 checksum offloading. */
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> +{
> +    return !!(b->mbuf.ol_flags & PKT_TX_IPV4);
> +}
> +
> +/* Returns the L4 cksum offload bitmask. */
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> +{
> +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> +}
> +
> +/* Returns 'true' if packet 'b' is marked for TCP checksum offloading. */
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM;
> +}
> +
> +/* Returns 'true' if packet 'b' is marked for UDP checksum offloading. */
> +static inline bool
> +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM;
> +}
> +
> +/* Returns 'true' if packet 'b' is marked for SCTP checksum offloading. */
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM;
> +}
> +
> +/* Mark packet 'b' for IPv4 checksum offloading. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b)
> +{
> +    b->mbuf.ol_flags |= PKT_TX_IPV4;
> +}
> +
> +/* Mark packet 'b' for IPv6 checksum offloading. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b)
> +{
> +    b->mbuf.ol_flags |= PKT_TX_IPV6;
> +}
> +
> +/* Mark packet 'b' for TCP checksum offloading.  It implies that either
> + * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b)
> +{
> +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
> +}
> +
> +/* Mark packet 'b' for UDP checksum offloading.  It implies that either
> + * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b)
> +{
> +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
> +}
> +
> +/* Mark packet 'b' for SCTP checksum offloading.  It implies that either
> + * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b)
> +{
> +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
> +}
> +
> +/* Mark packet 'b' for TCP segmentation offloading.  It implies that
> + * either the packet 'b' is marked for IPv4 or IPv6 checksum offloading
> + * and also for TCP checksum offloading. */
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b)
> +{
> +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
> +}
> +
>   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>   static inline uint32_t
> @@ -648,6 +737,84 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
>       b->allocated_ = s;
>   }
>   
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return 0;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There are no implementation when not DPDK enabled datapath. */
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
>   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>   static inline uint32_t
> @@ -939,6 +1106,13 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
>       }
>   }
>   
> +/* Return true if the packet 'b' requested L4 checksum offload. */
> +static inline bool
> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
> +{
> +    return !!dp_packet_hwol_l4_mask(b);
> +}
> +
>   #ifdef  __cplusplus
>   }
>   #endif
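
Together these wrappers let generic code tag and test offload state without
touching mbuf fields directly, and on non-DPDK builds they compile to no-ops
or constant false/0, so callers need no #ifdefs. A hedged usage sketch; the
enclosing functions are hypothetical, the helpers are the ones defined above:

    /* Hypothetical sender-side code: ask the egress device to segment and
     * checksum an IPv4/TCP packet instead of doing it in software. */
    static void
    request_tso(struct dp_packet *pkt)
    {
        dp_packet_hwol_set_tx_ipv4(pkt);     /* PKT_TX_IPV4. */
        dp_packet_hwol_set_csum_tcp(pkt);    /* PKT_TX_TCP_CKSUM. */
        dp_packet_hwol_set_tcp_seg(pkt);     /* PKT_TX_TCP_SEG. */
    }

    /* Consumers branch on the same state, e.g. to skip software work. */
    static bool
    needs_software_l4_csum(const struct dp_packet *pkt)
    {
        return !dp_packet_hwol_tx_l4_checksum(pkt);
    }
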
> diff --git a/lib/ipf.c b/lib/ipf.c
> index 45c489122..446e89d13 100644
> --- a/lib/ipf.c
> +++ b/lib/ipf.c
> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
>       len += rest_len;
>       l3 = dp_packet_l3(pkt);
>       ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> -                                new_ip_frag_off);
> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    if (!dp_packet_hwol_is_ipv4(pkt)) {
> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> +                                    new_ip_frag_off);
> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    }
>       l3->ip_tot_len = htons(len);
>       l3->ip_frag_off = new_ip_frag_off;
>       dp_packet_set_l2_pad_size(pkt, 0);
> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
>       }
>   
>       if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> +                     && !dp_packet_hwol_is_ipv4(pkt)
>                        && csum(l3, ip_hdr_len) != 0)) {
>           goto invalid_pkt;
>       }
> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
>                   } else {
>                       struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
>                       struct ip_header *l3_reass = dp_packet_l3(pkt);
> -                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> -                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> -                                                     frag_ip, reass_ip);
> -                    l3_frag->ip_src = l3_reass->ip_src;
> +                    if (!dp_packet_hwol_is_ipv4(frag_0->pkt)) {
> +                        ovs_be32 reass_ip =
> +                            get_16aligned_be32(&l3_reass->ip_src);
> +                        ovs_be32 frag_ip =
> +                            get_16aligned_be32(&l3_frag->ip_src);
> +
> +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                         frag_ip, reass_ip);
> +                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                         frag_ip, reass_ip);
> +                    }
>   
> -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> -                                                     frag_ip, reass_ip);
> +                    l3_frag->ip_src = l3_reass->ip_src;
>                       l3_frag->ip_dst = l3_reass->ip_dst;
>                   }
>   
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index d1469f6f2..b108cbd6b 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -72,6 +72,7 @@
>   #include "timeval.h"
>   #include "unaligned.h"
>   #include "unixctl.h"
> +#include "userspace-tso.h"
>   #include "util.h"
>   #include "uuid.h"
>   
> @@ -201,6 +202,8 @@ struct netdev_dpdk_sw_stats {
>       uint64_t tx_qos_drops;
>       /* Packet drops in ingress policer processing. */
>       uint64_t rx_qos_drops;
> +    /* Packet drops in HWOL processing. */
> +    uint64_t tx_invalid_hwol_drops;
>   };
>   
>   enum { DPDK_RING_SIZE = 256 };
> @@ -410,7 +413,8 @@ struct ingress_policer {
>   enum dpdk_hw_ol_features {
>       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
>       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
> -    NETDEV_RX_HW_SCATTER = 1 << 2
> +    NETDEV_RX_HW_SCATTER = 1 << 2,
> +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
>   };
>   
>   /*
> @@ -992,6 +996,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>           conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
>       }
>   
> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>       /* Limit configured rss hash functions to only those supported
>        * by the eth device. */
>       conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
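
Note that TSO is requested together with TCP and IPv4 checksum offload: every
segment the NIC produces needs freshly computed IP and TCP checksums, so a
device that can segment but not checksum would not help here. The same set of
capabilities is what dpdk_eth_dev_init() below probes before setting
NETDEV_TX_TSO_OFFLOAD.
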
> @@ -1093,6 +1103,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
>                                        DEV_RX_OFFLOAD_TCP_CKSUM |
>                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
> +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
> +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
> +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
>   
>       rte_eth_dev_info_get(dev->port_id, &info);
>   
> @@ -1119,6 +1132,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
>       }
>   
> +    if (info.tx_offload_capa & tx_tso_offload_capa) {
> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> +    } else {
> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
> +        VLOG_WARN("Tx TSO offload is not supported on %s port "
> +                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
> +    }
> +
>       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>   
> @@ -1369,14 +1390,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
>           goto out;
>       }
>   
> -    err = rte_vhost_driver_disable_features(dev->vhost_id,
> -                                1ULL << VIRTIO_NET_F_HOST_TSO4
> -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
> -                                | 1ULL << VIRTIO_NET_F_CSUM);
> -    if (err) {
> -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> -                 "port: %s\n", name);
> -        goto out;
> +    if (!userspace_tso_enabled()) {
> +        err = rte_vhost_driver_disable_features(dev->vhost_id,
> +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> +                                    | 1ULL << VIRTIO_NET_F_CSUM);
> +        if (err) {
> +            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> +                     "port: %s\n", name);
> +            goto out;
> +        }
>       }
>   
>       err = rte_vhost_driver_start(dev->vhost_id);
> @@ -1711,6 +1734,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
>           } else {
>               smap_add(args, "rx_csum_offload", "false");
>           }
> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +            smap_add(args, "tx_tso_offload", "true");
> +        } else {
> +            smap_add(args, "tx_tso_offload", "false");
> +        }
>           smap_add(args, "lsc_interrupt_mode",
>                    dev->lsc_interrupt_mode ? "true" : "false");
>       }
> @@ -2138,6 +2166,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
>       rte_free(rx);
>   }
>   
> +/* Prepare the packet for HWOL.
> + * Return True if the packet is OK to continue. */
> +static bool
> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
> +{
> +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> +
> +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
> +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
> +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
> +        mbuf->outer_l2_len = 0;
> +        mbuf->outer_l3_len = 0;
> +    }
> +
> +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
> +        struct tcp_header *th = dp_packet_l4(pkt);
> +
> +        if (!th) {
> +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
> +                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
> +            return false;
> +        }
> +
> +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
> +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
> +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
> +
> +        if (mbuf->ol_flags & PKT_TX_IPV4) {
> +            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
> +        }
> +    }
> +    return true;
> +}
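
The metadata filled in here is exactly what a PMD needs to resegment the
packet: l2_len/l3_len locate the inner headers, l4_len comes from the TCP data
offset, and tso_segsz caps the payload per generated segment. For a plain
Ethernet/IPv4/TCP packet with a 1500-byte MTU, a 20-byte IP header and a
20-byte TCP header, tso_segsz works out to 1500 - 20 - 20 = 1460 bytes, i.e.
the familiar TCP MSS.
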
> +
> +/* Prepare a batch for HWOL.
> + * Return the number of good packets in the batch. */
> +static int
> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> +                            int pkt_cnt)
> +{
> +    int i = 0;
> +    int cnt = 0;
> +    struct rte_mbuf *pkt;
> +
> +    /* Prepare and filter bad HWOL packets. */
> +    for (i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
> +            rte_pktmbuf_free(pkt);
> +            continue;
> +        }
> +
> +        if (OVS_UNLIKELY(i != cnt)) {
> +            pkts[cnt] = pkt;
> +        }
> +        cnt++;
> +    }
> +
> +    return cnt;
> +}
> +
>   /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
>    * 'pkts', even in case of failure.
>    *
> @@ -2147,11 +2236,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
>                            struct rte_mbuf **pkts, int cnt)
>   {
>       uint32_t nb_tx = 0;
> +    uint16_t nb_tx_prep = cnt;
> +
> +    if (userspace_tso_enabled()) {
> +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> +        if (nb_tx_prep != cnt) {
> +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> +                         cnt, rte_strerror(rte_errno));
> +        }
> +    }
>   
> -    while (nb_tx != cnt) {
> +    while (nb_tx != nb_tx_prep) {
>           uint32_t ret;
>   
> -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> +                               nb_tx_prep - nb_tx);
>           if (!ret) {
>               break;
>           }
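
rte_eth_tx_prepare() gives the PMD a chance to validate, and for some drivers
fix up, the offload metadata (for instance computing the TCP pseudo-header
checksum the hardware expects) before the burst. It returns the number of
leading packets that are safe to send and reports the first failure via
rte_errno, which is why only nb_tx_prep packets are handed to
rte_eth_tx_burst() above.
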
> @@ -2437,11 +2537,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
>       int cnt = 0;
>       struct rte_mbuf *pkt;
>   
> +    /* Filter oversized packets, unless are marked for TSO. */
>       for (i = 0; i < pkt_cnt; i++) {
>           pkt = pkts[i];
> -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> +                         dev->max_packet_len);
>               rte_pktmbuf_free(pkt);
>               continue;
>           }
> @@ -2463,7 +2566,8 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
>   {
>       int dropped = sw_stats_add->tx_mtu_exceeded_drops +
>                     sw_stats_add->tx_qos_drops +
> -                  sw_stats_add->tx_failure_drops;
> +                  sw_stats_add->tx_failure_drops +
> +                  sw_stats_add->tx_invalid_hwol_drops;
>       struct netdev_stats *stats = &dev->stats;
>       int sent = attempted - dropped;
>       int i;
> @@ -2482,6 +2586,7 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
>           sw_stats->tx_failure_drops      += sw_stats_add->tx_failure_drops;
>           sw_stats->tx_mtu_exceeded_drops += sw_stats_add->tx_mtu_exceeded_drops;
>           sw_stats->tx_qos_drops          += sw_stats_add->tx_qos_drops;
> +        sw_stats->tx_invalid_hwol_drops += sw_stats_add->tx_invalid_hwol_drops;
>       }
>   }
>   
> @@ -2513,8 +2618,15 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>           rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>       }
>   
> +    sw_stats_add.tx_invalid_hwol_drops = cnt;
> +    if (userspace_tso_enabled()) {
> +        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> +    }
> +
> +    sw_stats_add.tx_invalid_hwol_drops -= cnt;
> +    sw_stats_add.tx_mtu_exceeded_drops = cnt;
>       cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> -    sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
> +    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
>   
>       /* Check has QoS has been configured for the netdev */
>       sw_stats_add.tx_qos_drops = cnt;
> @@ -2562,6 +2674,120 @@ out:
>       }
>   }
>   
> +static void
> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> +{
> +    rte_free(opaque);
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> +{
> +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> +    uint16_t buf_len;
> +    void *buf;
> +
> +    if (rte_pktmbuf_tailroom(pkt) >= sizeof *shinfo) {
> +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> +    } else {
> +        total_len += sizeof *shinfo + sizeof(uintptr_t);
> +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> +    }
> +
> +    if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
> +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> +        return NULL;
> +    }
> +
> +    buf_len = total_len;
> +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> +    if (OVS_UNLIKELY(buf == NULL)) {
> +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> +        return NULL;
> +    }
> +
> +    /* Initialize shinfo. */
> +    if (shinfo) {
> +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> +        shinfo->fcb_opaque = buf;
> +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> +    } else {
> +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> +                                                    netdev_dpdk_extbuf_free,
> +                                                    buf);
> +        if (OVS_UNLIKELY(shinfo == NULL)) {
> +            rte_free(buf);
> +            VLOG_ERR("Failed to initialize shared info for mbuf while "
> +                     "attempting to attach an external buffer.");
> +            return NULL;
> +        }
> +    }
> +
> +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> +                              shinfo);
> +    rte_pktmbuf_reset_headroom(pkt);
> +
> +    return pkt;
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> +{
> +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> +
> +    if (OVS_UNLIKELY(!pkt)) {
> +        return NULL;
> +    }
> +
> +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> +        return pkt;
> +    }
> +
> +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> +        return pkt;
> +    }
> +
> +    rte_pktmbuf_free(pkt);
> +
> +    return NULL;
> +}
> +
> +static struct dp_packet *
> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> +{
> +    struct rte_mbuf *mbuf_dest;
> +    struct dp_packet *pkt_dest;
> +    uint32_t pkt_len;
> +
> +    pkt_len = dp_packet_size(pkt_orig);
> +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> +            return NULL;
> +    }
> +
> +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> +    dp_packet_set_size(pkt_dest, pkt_len);
> +
> +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> +
> +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> +
> +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> +                                - (char *)dp_packet_eth(pkt_dest);
> +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> +                                - (char *) dp_packet_l3(pkt_dest);
> +    }
> +
> +    return pkt_dest;
> +}
> +
>   /* Tx function. Transmit packets indefinitely */
>   static void
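
The helpers above exist because a TSO packet arriving from a guest can be close
to 64 KB, far larger than the data room of the MTU-sized mbufs in the device
mempool. When a copy is unavoidable (the dpdk_do_tx_copy() path below),
dpdk_pktmbuf_alloc() first tries the mempool mbuf's own tailroom and, only if
that is too small, rte_malloc()s a dedicated buffer and attaches it with
rte_pktmbuf_attach_extbuf(), keeping the packet in a single contiguous segment.
The copy also carries over tx_offload and the offload-relevant ol_flags so the
segmentation request survives the copy.
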
>   dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> @@ -2575,7 +2801,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>       enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
>   #endif
>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
>       struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>       uint32_t cnt = batch_cnt;
>       uint32_t dropped = 0;
> @@ -2596,34 +2822,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>           struct dp_packet *packet = batch->packets[i];
>           uint32_t size = dp_packet_size(packet);
>   
> -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> -                         size, dev->max_packet_len);
> -
> +        if (size > dev->max_packet_len
> +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> +                         dev->max_packet_len);
>               mtu_drops++;
>               continue;
>           }
>   
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>               dropped = cnt - i;
>               break;
>           }
>   
> -        /* We have to do a copy for now */
> -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> -               dp_packet_data(packet), size);
> -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> -
>           txcnt++;
>       }
>   
>       if (OVS_LIKELY(txcnt)) {
>           if (dev->type == DPDK_DEV_VHOST) {
> -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> -                                     txcnt);
> +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
>           } else {
> -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> +                                                   (struct rte_mbuf **)pkts,
> +                                                   txcnt);
>           }
>       }
>   
> @@ -2676,26 +2898,33 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>           dp_packet_delete_batch(batch, true);
>       } else {
>           struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> -        int tx_cnt, dropped;
> -        int tx_failure, mtu_drops, qos_drops;
> +        int dropped;
> +        int tx_failure, mtu_drops, qos_drops, hwol_drops;
>           int batch_cnt = dp_packet_batch_size(batch);
>           struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>   
> -        tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> -        mtu_drops = batch_cnt - tx_cnt;
> -        qos_drops = tx_cnt;
> -        tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
> -        qos_drops -= tx_cnt;
> +        hwol_drops = batch_cnt;
> +        if (userspace_tso_enabled()) {
> +            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
> +        }
> +        hwol_drops -= batch_cnt;
> +        mtu_drops = batch_cnt;
> +        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> +        mtu_drops -= batch_cnt;
> +        qos_drops = batch_cnt;
> +        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
> +        qos_drops -= batch_cnt;
>   
> -        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt);
> +        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt);
>   
> -        dropped = tx_failure + mtu_drops + qos_drops;
> +        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
>           if (OVS_UNLIKELY(dropped)) {
>               rte_spinlock_lock(&dev->stats_lock);
>               dev->stats.tx_dropped += dropped;
>               sw_stats->tx_failure_drops += tx_failure;
>               sw_stats->tx_mtu_exceeded_drops += mtu_drops;
>               sw_stats->tx_qos_drops += qos_drops;
> +            sw_stats->tx_invalid_hwol_drops += hwol_drops;
>               rte_spinlock_unlock(&dev->stats_lock);
>           }
>       }
> @@ -3011,7 +3240,8 @@ netdev_dpdk_get_sw_custom_stats(const struct netdev *netdev,
>       SW_CSTAT(tx_failure_drops)       \
>       SW_CSTAT(tx_mtu_exceeded_drops)  \
>       SW_CSTAT(tx_qos_drops)           \
> -    SW_CSTAT(rx_qos_drops)
> +    SW_CSTAT(rx_qos_drops)           \
> +    SW_CSTAT(tx_invalid_hwol_drops)
>   
>   #define SW_CSTAT(NAME) + 1
>       custom_stats->size = SW_CSTATS;
> @@ -4874,6 +5104,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>   
>       rte_free(dev->tx_q);
>       err = dpdk_eth_dev_init(dev);
> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>       if (!dev->tx_q) {
>           err = ENOMEM;
> @@ -4903,6 +5139,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>           dev->tx_q[0].map = 0;
>       }
>   
> +    if (userspace_tso_enabled()) {
> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> +        VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
> +    }
> +
>       netdev_dpdk_remap_txqs(dev);
>   
>       err = netdev_dpdk_mempool_configure(dev);
> @@ -4975,6 +5216,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>               vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>           }
>   
> +        /* Enable External Buffers if TCP Segmentation Offload is enabled. */
> +        if (userspace_tso_enabled()) {
> +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
> +        }
> +
>           err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>           if (err) {
>               VLOG_ERR("vhost-user device setup failure for device %s\n",
> @@ -4999,14 +5245,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>               goto unlock;
>           }
>   
> -        err = rte_vhost_driver_disable_features(dev->vhost_id,
> -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> -                                    | 1ULL << VIRTIO_NET_F_CSUM);
> -        if (err) {
> -            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> -                     "client port: %s\n", dev->up.name);
> -            goto unlock;
> +        if (userspace_tso_enabled()) {
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +        } else {
> +            err = rte_vhost_driver_disable_features(dev->vhost_id,
> +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
> +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
> +                                        | 1ULL << VIRTIO_NET_F_CSUM);
> +            if (err) {
> +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
> +                         "vhost user client port: %s\n", dev->up.name);
> +                goto unlock;
> +            }
>           }
>   
>           err = rte_vhost_driver_start(dev->vhost_id);
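
The vhost-user handling is symmetric: with userspace TSO disabled, OvS keeps
masking VIRTIO_NET_F_HOST_TSO4/TSO6 and VIRTIO_NET_F_CSUM so a guest never
hands over oversized or partially checksummed frames; with it enabled, those
features stay advertised, the netdev announces the matching NETDEV_TX_OFFLOAD_*
flags, and RTE_VHOST_USER_EXTBUF_SUPPORT is requested so the vhost library can
deliver packets that do not fit a single mempool mbuf.
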
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> index f08159aa7..9dbc67658 100644
> --- a/lib/netdev-linux-private.h
> +++ b/lib/netdev-linux-private.h
> @@ -27,6 +27,7 @@
>   #include <stdint.h>
>   #include <stdbool.h>
>   
> +#include "dp-packet.h"
>   #include "netdev-afxdp.h"
>   #include "netdev-afxdp-pool.h"
>   #include "netdev-provider.h"
> @@ -37,10 +38,13 @@
>   
>   struct netdev;
>   
> +#define LINUX_RXQ_TSO_MAX_LEN 65536
> +
>   struct netdev_rxq_linux {
>       struct netdev_rxq up;
>       bool is_tap;
>       int fd;
> +    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
>   };
>   
>   int netdev_linux_construct(struct netdev *);
> @@ -92,6 +96,7 @@ struct netdev_linux {
>       int tap_fd;
>       bool present;               /* If the device is present in the namespace */
>       uint64_t tx_dropped;        /* tap device can drop if the iface is down */
> +    uint64_t rx_dropped;        /* Packets dropped while recv from kernel. */
>   
>       /* LAG information. */
>       bool is_lag_master;         /* True if the netdev is a LAG master. */
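
Each Linux rxq preallocates one LINUX_RXQ_TSO_MAX_LEN (64 KiB) auxiliary buffer
per slot in the receive burst when TSO is enabled; with NETDEV_MAX_BURST at 32
that is roughly 2 MiB of extra memory per receive queue, traded for being able
to pull a full-sized TSO frame out of the kernel in a single recvmmsg() call.
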
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 41d1e9273..a4a666657 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -29,16 +29,18 @@
>   #include <linux/filter.h>
>   #include <linux/gen_stats.h>
>   #include <linux/if_ether.h>
> +#include <linux/if_packet.h>
>   #include <linux/if_tun.h>
>   #include <linux/types.h>
>   #include <linux/ethtool.h>
>   #include <linux/mii.h>
>   #include <linux/rtnetlink.h>
>   #include <linux/sockios.h>
> +#include <linux/virtio_net.h>
>   #include <sys/ioctl.h>
>   #include <sys/socket.h>
> +#include <sys/uio.h>
>   #include <sys/utsname.h>
> -#include <netpacket/packet.h>
>   #include <net/if.h>
>   #include <net/if_arp.h>
>   #include <net/route.h>
> @@ -75,6 +77,7 @@
>   #include "timer.h"
>   #include "unaligned.h"
>   #include "openvswitch/vlog.h"
> +#include "userspace-tso.h"
>   #include "util.h"
>   
>   VLOG_DEFINE_THIS_MODULE(netdev_linux);
> @@ -237,6 +240,16 @@ enum {
>       VALID_DRVINFO           = 1 << 6,
>       VALID_FEATURES          = 1 << 7,
>   };
> +
> +/* Use one for the packet buffer and another for the aux buffer to receive
> + * TSO packets. */
> +#define IOV_STD_SIZE 1
> +#define IOV_TSO_SIZE 2
> +
> +enum {
> +    IOV_PACKET = 0,
> +    IOV_AUXBUF = 1,
> +};
>   
>   struct linux_lag_slave {
>      uint32_t block_id;
> @@ -501,6 +514,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>    * changes in the device miimon status, so we can use atomic_count. */
>   static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>   
> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
>   static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>                                      int cmd, const char *cmd_name);
>   static int get_flags(const struct netdev *, unsigned int *flags);
> @@ -902,6 +917,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
>       /* The device could be in the same network namespace or in another one. */
>       netnsid_unset(&netdev->netnsid);
>       ovs_mutex_init(&netdev->mutex);
> +
> +    if (userspace_tso_enabled()) {
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>       return 0;
>   }
>   
> @@ -961,6 +983,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
>       /* Create tap device. */
>       get_flags(&netdev->up, &netdev->ifi_flags);
>       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> +    if (userspace_tso_enabled()) {
> +        ifr.ifr_flags |= IFF_VNET_HDR;
> +    }
> +
>       ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
>       if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
>           VLOG_WARN("%s: creating tap device failed: %s", name,
> @@ -1024,6 +1050,15 @@ static struct netdev_rxq *
>   netdev_linux_rxq_alloc(void)
>   {
>       struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> +    if (userspace_tso_enabled()) {
> +        int i;
> +
> +        /* Allocate auxiliay buffers to receive TSO packets. */
> +        for (i = 0; i < NETDEV_MAX_BURST; i++) {
> +            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> +        }
> +    }
> +
>       return &rx->up;
>   }
>   
> @@ -1069,6 +1104,15 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>               goto error;
>           }
>   
> +        if (userspace_tso_enabled()
> +            && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> +                          sizeof val)) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s",
> +                     netdev_get_name(netdev_), ovs_strerror(errno));
> +            goto error;
> +        }
> +
>           /* Set non-blocking mode. */
>           error = set_nonblocking(rx->fd);
>           if (error) {
> @@ -1119,10 +1163,15 @@ static void
>   netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>   {
>       struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    int i;
>   
>       if (!rx->is_tap) {
>           close(rx->fd);
>       }
> +
> +    for (i = 0; i < NETDEV_MAX_BURST; i++) {
> +        free(rx->aux_bufs[i]);
> +    }
>   }
>   
>   static void
> @@ -1159,12 +1208,14 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
>    * It also used recvmmsg to reduce multiple syscalls overhead;
>    */
>   static int
> -netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
> +netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>                                    struct dp_packet_batch *batch)
>   {
> -    size_t size;
> +    int iovlen;
> +    size_t std_len;
>       ssize_t retval;
> -    struct iovec iovs[NETDEV_MAX_BURST];
> +    int virtio_net_hdr_size;
> +    struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE];
>       struct cmsghdr *cmsg;
>       union {
>           struct cmsghdr cmsg;
> @@ -1174,41 +1225,87 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>       struct dp_packet *buffers[NETDEV_MAX_BURST];
>       int i;
>   
> +    if (userspace_tso_enabled()) {
> +        /* Use the buffer from the allocated packet below to receive MTU
> +         * sized packets and an aux_buf for extra TSO data. */
> +        iovlen = IOV_TSO_SIZE;
> +        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
> +    } else {
> +        /* Use only the buffer from the allocated packet. */
> +        iovlen = IOV_STD_SIZE;
> +        virtio_net_hdr_size = 0;
> +    }
> +
> +    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
>       for (i = 0; i < NETDEV_MAX_BURST; i++) {
> -         buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> -                                                  DP_NETDEV_HEADROOM);
> -         /* Reserve headroom for a single VLAN tag */
> -         dp_packet_reserve(buffers[i], VLAN_HEADER_LEN);
> -         size = dp_packet_tailroom(buffers[i]);
> -         iovs[i].iov_base = dp_packet_data(buffers[i]);
> -         iovs[i].iov_len = size;
> +         buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> +         iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
> +         iovs[i][IOV_PACKET].iov_len = std_len;
> +         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> +         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
>            mmsgs[i].msg_hdr.msg_name = NULL;
>            mmsgs[i].msg_hdr.msg_namelen = 0;
> -         mmsgs[i].msg_hdr.msg_iov = &iovs[i];
> -         mmsgs[i].msg_hdr.msg_iovlen = 1;
> +         mmsgs[i].msg_hdr.msg_iov = iovs[i];
> +         mmsgs[i].msg_hdr.msg_iovlen = iovlen;
>            mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i];
>            mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
>            mmsgs[i].msg_hdr.msg_flags = 0;
>       }
>   
>       do {
> -        retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
> +        retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
>       } while (retval < 0 && errno == EINTR);
>   
>       if (retval < 0) {
> -        /* Save -errno to retval temporarily */
> -        retval = -errno;
> -        i = 0;
> -        goto free_buffers;
> +        retval = errno;
> +        for (i = 0; i < NETDEV_MAX_BURST; i++) {
> +            dp_packet_delete(buffers[i]);
> +        }
> +
> +        return retval;
>       }
>   
>       for (i = 0; i < retval; i++) {
>           if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
> -            break;
> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> +            dp_packet_delete(buffers[i]);
> +            netdev->rx_dropped += 1;
> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether hdr size",
> +                         netdev_get_name(netdev_));
> +            continue;
> +        }
> +
> +        if (mmsgs[i].msg_len > std_len) {
> +            /* Build a single linear TSO packet by expanding the current packet
> +             * to append the data received in the aux_buf. */
> +            size_t extra_len = mmsgs[i].msg_len - std_len;
> +
> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> +                               + std_len);
> +            dp_packet_prealloc_tailroom(buffers[i], extra_len);
> +            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> +                               + extra_len);
> +        } else {
> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> +                               + mmsgs[i].msg_len);
>           }
>   
> -        dp_packet_set_size(buffers[i],
> -                           dp_packet_size(buffers[i]) + mmsgs[i].msg_len);
> +        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) {
> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> +            /* Unexpected error situation: the virtio header is not present
> +             * or corrupted. Drop the packet but continue in case next ones
> +             * are correct. */
> +            dp_packet_delete(buffers[i]);
> +            netdev->rx_dropped += 1;
> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
> +                         netdev_get_name(netdev_));
> +            continue;
> +        }
>   
>           for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg;
>                    cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) {
> @@ -1238,22 +1335,11 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>           dp_packet_batch_add(batch, buffers[i]);
>       }
>   
> -free_buffers:
> -    /* Free unused buffers, including buffers whose size is less than
> -     * ETH_HEADER_LEN.
> -     *
> -     * Note: i has been set correctly by the above for loop, so don't
> -     * try to re-initialize it.
> -     */
> +    /* Delete unused buffers. */
>       for (; i < NETDEV_MAX_BURST; i++) {
>           dp_packet_delete(buffers[i]);
>       }
>   
> -    /* netdev_linux_rxq_recv needs it to return 0 or positive errno */
> -    if (retval < 0) {
> -        return -retval;
> -    }
> -
>       return 0;
>   }
>   
> @@ -1263,20 +1349,40 @@ free_buffers:
>    * packets are added into *batch. The return value is 0 or errno.
>    */
>   static int
> -netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
> +netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> +                                struct dp_packet_batch *batch)
>   {
>       struct dp_packet *buffer;
> +    int virtio_net_hdr_size;
>       ssize_t retval;
> -    size_t size;
> +    size_t std_len;
> +    int iovlen;
>       int i;
>   
> +    if (userspace_tso_enabled()) {
> +        /* Use the buffer from the allocated packet below to receive MTU
> +         * sized packets and an aux_buf for extra TSO data. */
> +        iovlen = IOV_TSO_SIZE;
> +        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
> +    } else {
> +        /* Use only the buffer from the allocated packet. */
> +        iovlen = IOV_STD_SIZE;
> +        virtio_net_hdr_size = 0;
> +    }
> +
> +    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
>       for (i = 0; i < NETDEV_MAX_BURST; i++) {
> +        struct iovec iov[IOV_TSO_SIZE];
> +
>           /* Assume Ethernet port. No need to set packet_type. */
> -        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> -                                             DP_NETDEV_HEADROOM);
> -        size = dp_packet_tailroom(buffer);
> +        buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> +        iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
> +        iov[IOV_PACKET].iov_len = std_len;
> +        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> +        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> +
>           do {
> -            retval = read(fd, dp_packet_data(buffer), size);
> +            retval = readv(rx->fd, iov, iovlen);
>           } while (retval < 0 && errno == EINTR);
>   
>           if (retval < 0) {
> @@ -1284,7 +1390,33 @@ netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
>               break;
>           }
>   
> -        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +        if (retval > std_len) {
> +            /* Build a single linear TSO packet by expanding the current packet
> +             * to append the data received in the aux_buf. */
> +            size_t extra_len = retval - std_len;
> +
> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> +            dp_packet_prealloc_tailroom(buffer, extra_len);
> +            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> +        } else {
> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +        }
> +
> +        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> +            /* Unexpected error situation: the virtio header is not present
> +             * or corrupted. Drop the packet but continue in case next ones
> +             * are correct. */
> +            dp_packet_delete(buffer);
> +            netdev->rx_dropped += 1;
> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
> +                         netdev_get_name(netdev_));
> +            continue;
> +        }
> +
>           dp_packet_batch_add(batch, buffer);
>       }
>   
> @@ -1310,8 +1442,8 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>   
>       dp_packet_batch_init(batch);
>       retval = (rx->is_tap
> -              ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch)
> -              : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch));
> +              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> +              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
>   
>       if (retval) {
>           if (retval != EAGAIN && retval != EMSGSIZE) {
> @@ -1353,7 +1485,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
>   }
>   
>   static int
> -netdev_linux_sock_batch_send(int sock, int ifindex,
> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>                                struct dp_packet_batch *batch)
>   {
>       const size_t size = dp_packet_batch_size(batch);
> @@ -1367,6 +1499,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>   
>       struct dp_packet *packet;
>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (tso) {
> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> +        }
> +
>           iov[i].iov_base = dp_packet_data(packet);
>           iov[i].iov_len = dp_packet_size(packet);
>           mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> @@ -1399,7 +1535,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>    * on other interface types because we attach a socket filter to the rx
>    * socket. */
>   static int
> -netdev_linux_tap_batch_send(struct netdev *netdev_,
> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
>                               struct dp_packet_batch *batch)
>   {
>       struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> @@ -1416,10 +1552,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
>       }
>   
>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> -        size_t size = dp_packet_size(packet);
> +        size_t size;
>           ssize_t retval;
>           int error;
>   
> +        if (tso) {
> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> +        }
> +
> +        size = dp_packet_size(packet);
>           do {
>               retval = write(netdev->tap_fd, dp_packet_data(packet), size);
>               error = retval < 0 ? errno : 0;
> @@ -1454,9 +1595,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>                     struct dp_packet_batch *batch,
>                     bool concurrent_txq OVS_UNUSED)
>   {
> +    bool tso = userspace_tso_enabled();
> +    int mtu = ETH_PAYLOAD_MAX;
>       int error = 0;
>       int sock = 0;
>   
> +    if (tso) {
> +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
> +    }
> +
>       if (!is_tap_netdev(netdev_)) {
>           if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>               error = EOPNOTSUPP;
> @@ -1475,9 +1622,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>               goto free_batch;
>           }
>   
> -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
>       } else {
> -        error = netdev_linux_tap_batch_send(netdev_, batch);
> +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
>       }
>       if (error) {
>           if (error == ENOBUFS) {
> @@ -2045,6 +2192,7 @@ netdev_tap_get_stats(const struct netdev *netdev_, struct netdev_stats *stats)
>           stats->collisions          += dev_stats.collisions;
>       }
>       stats->tx_dropped += netdev->tx_dropped;
> +    stats->rx_dropped += netdev->rx_dropped;
>       ovs_mutex_unlock(&netdev->mutex);
>   
>       return error;
> @@ -6223,6 +6371,17 @@ af_packet_sock(void)
>               if (error) {
>                   close(sock);
>                   sock = -error;
> +            } else if (userspace_tso_enabled()) {
> +                int val = 1;
> +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> +                                   sizeof val);
> +                if (error) {
> +                    error = errno;
> +                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
> +                             ovs_strerror(errno));
> +                    close(sock);
> +                    sock = -error;
> +                }
>               }
>           } else {
>               sock = -errno;
> @@ -6234,3 +6393,136 @@ af_packet_sock(void)
>   
>       return sock;
>   }
> +
> +static int
> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
> +{
> +    struct eth_header *eth_hdr;
> +    ovs_be16 eth_type;
> +    int l2_len;
> +
> +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
> +    if (!eth_hdr) {
> +        return -EINVAL;
> +    }
> +
> +    l2_len = ETH_HEADER_LEN;
> +    eth_type = eth_hdr->eth_type;
> +    if (eth_type_vlan(eth_type)) {
> +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
> +
> +        if (!vlan) {
> +            return -EINVAL;
> +        }
> +
> +        eth_type = vlan->vlan_next_type;
> +        l2_len += VLAN_HEADER_LEN;
> +    }
> +
> +    if (eth_type == htons(ETH_TYPE_IP)) {
> +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
> +
> +        if (!ip_hdr) {
> +            return -EINVAL;
> +        }
> +
> +        *l4proto = ip_hdr->ip_proto;
> +        dp_packet_hwol_set_tx_ipv4(b);
> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> +        struct ovs_16aligned_ip6_hdr *nh6;
> +
> +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
> +        if (!nh6) {
> +            return -EINVAL;
> +        }
> +
> +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
> +        dp_packet_hwol_set_tx_ipv6(b);
> +    }
> +
> +    return 0;
> +}
> +
> +static int
> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
> +{
> +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
> +    uint16_t l4proto = 0;
> +
> +    if (OVS_UNLIKELY(!vnet)) {
> +        return -EINVAL;
> +    }
> +
> +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
> +        return 0;
> +    }
> +
> +    if (netdev_linux_parse_l2(b, &l4proto)) {
> +        return -EINVAL;
> +    }
> +
> +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> +        if (l4proto == IPPROTO_TCP) {
> +            dp_packet_hwol_set_csum_tcp(b);
> +        } else if (l4proto == IPPROTO_UDP) {
> +            dp_packet_hwol_set_csum_udp(b);
> +        } else if (l4proto == IPPROTO_SCTP) {
> +            dp_packet_hwol_set_csum_sctp(b);
> +        }
> +    }
> +
> +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
> +                                | VIRTIO_NET_HDR_GSO_TCPV6
> +                                | VIRTIO_NET_HDR_GSO_UDP;
> +        uint8_t type = vnet->gso_type & allowed_mask;
> +
> +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
> +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
> +            dp_packet_hwol_set_tcp_seg(b);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static void
> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
> +{
> +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
> +
> +    if (dp_packet_hwol_is_tso(b)) {
> +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
> +                            + TCP_HEADER_LEN;
> +
> +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
> +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
> +        if (dp_packet_hwol_is_ipv4(b)) {
> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +        } else {
> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +        }
> +
> +    } else {
> +        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
> +    }
> +
> +    if (dp_packet_hwol_l4_mask(b)) {
> +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
> +                                                  - (char *)dp_packet_eth(b));
> +
> +        if (dp_packet_hwol_l4_is_tcp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct tcp_header, tcp_csum);
> +        } else if (dp_packet_hwol_l4_is_udp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct udp_header, udp_csum);
> +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct sctp_header, sctp_csum);
> +        } else {
> +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
> +        }
> +    }
> +}
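
For reference, here is a minimal, illustrative sketch (not part of the patch) of the header that netdev_linux_prepend_vnet_hdr() above ends up writing for a plain Ethernet/IPv4/TCP TSO packet that also requests TCP checksum offload, assuming a 1500-byte MTU, no VLAN tag and a 20-byte TCP header without options:

    #include <linux/virtio_net.h>

    /* Values follow the arithmetic in netdev_linux_prepend_vnet_hdr():
     * hdr_len = (l4 - eth) + TCP_HEADER_LEN and gso_size = mtu - hdr_len. */
    static void
    example_vnet_hdr_for_tso(struct virtio_net_hdr *vnet)
    {
        vnet->hdr_len = 34 + 20;          /* ETH(14) + IPv4(20) + TCP(20). */
        vnet->gso_size = 1500 - 54;       /* 1446 payload bytes per segment. */
        vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
        vnet->csum_start = 34;            /* Offset of the TCP header. */
        vnet->csum_offset = 16;           /* Offset of tcp_csum within it. */
    }
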
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index f109c4e66..22f4cde33 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -37,6 +37,12 @@ extern "C" {
>   struct netdev_tnl_build_header_params;
>   #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>   
> +enum netdev_ol_flags {
> +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
> +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
> +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
> +};
> +
>   /* A network device (e.g. an Ethernet device).
>    *
>    * Network device implementations may read these members but should not modify
> @@ -51,6 +57,9 @@ struct netdev {
>        * opening this device, and therefore got assigned to the "system" class */
>       bool auto_classified;
>   
> +    /* This bitmask of the offloading features enabled by the netdev. */
> +    uint64_t ol_flags;
> +
>       /* If this is 'true', the user explicitly specified an MTU for this
>        * netdev.  Otherwise, Open vSwitch is allowed to override it. */
>       bool mtu_user_config;
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 405c98c68..f95b19af4 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -66,6 +66,8 @@ COVERAGE_DEFINE(netdev_received);
>   COVERAGE_DEFINE(netdev_sent);
>   COVERAGE_DEFINE(netdev_add_router);
>   COVERAGE_DEFINE(netdev_get_stats);
> +COVERAGE_DEFINE(netdev_send_prepare_drops);
> +COVERAGE_DEFINE(netdev_push_header_drops);
>   
>   struct netdev_saved_flags {
>       struct netdev *netdev;
> @@ -782,6 +784,54 @@ netdev_get_pt_mode(const struct netdev *netdev)
>               : NETDEV_PT_LEGACY_L2);
>   }
>   
> +/* Check if a 'packet' is compatible with 'netdev_flags'.
> + * If a packet is incompatible, return 'false' with the 'errormsg'
> + * pointing to a reason. */
> +static bool
> +netdev_send_prepare_packet(const uint64_t netdev_flags,
> +                           struct dp_packet *packet, char **errormsg)
> +{
> +    if (dp_packet_hwol_is_tso(packet)
> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
> +            /* Fall back to GSO in software. */
> +            VLOG_ERR_BUF(errormsg, "No TSO support");
> +            return false;
> +    }
> +
> +    if (dp_packet_hwol_l4_mask(packet)
> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
> +            /* Fall back to L4 csum in software. */
> +            VLOG_ERR_BUF(errormsg, "No L4 checksum support");
> +            return false;
> +    }
> +
> +    return true;
> +}
> +
> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
> + * otherwise either fall back to software implementation or drop it. */
> +static void
> +netdev_send_prepare_batch(const struct netdev *netdev,
> +                          struct dp_packet_batch *batch)
> +{
> +    struct dp_packet *packet;
> +    size_t i, size = dp_packet_batch_size(batch);
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        char *errormsg = NULL;
> +
> +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> +            dp_packet_batch_refill(batch, packet, i);
> +        } else {
> +            dp_packet_delete(packet);
> +            COVERAGE_INC(netdev_send_prepare_drops);
> +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> +                         netdev_get_name(netdev), errormsg);
> +            free(errormsg);
> +        }
> +    }
> +}
> +
>   /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
>    * otherwise a positive errno value.  Returns EAGAIN without blocking if
>    * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> @@ -811,8 +861,14 @@ int
>   netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
>               bool concurrent_txq)
>   {
> -    int error = netdev->netdev_class->send(netdev, qid, batch,
> -                                           concurrent_txq);
> +    int error;
> +
> +    netdev_send_prepare_batch(netdev, batch);
> +    if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
> +        return 0;
> +    }
> +
> +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
>       if (!error) {
>           COVERAGE_INC(netdev_sent);
>       }
> @@ -878,9 +934,21 @@ netdev_push_header(const struct netdev *netdev,
>                      const struct ovs_action_push_tnl *data)
>   {
>       struct dp_packet *packet;
> -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> -        netdev->netdev_class->push_header(netdev, packet, data);
> -        pkt_metadata_init(&packet->md, data->out_port);
> +    size_t i, size = dp_packet_batch_size(batch);
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
> +                         || dp_packet_hwol_l4_mask(packet))) {
> +            COVERAGE_INC(netdev_push_header_drops);
> +            dp_packet_delete(packet);
> +            VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload flags is "
> +                         "not supported: packet dropped",
> +                         netdev_get_name(netdev));
> +        } else {
> +            netdev->netdev_class->push_header(netdev, packet, data);
> +            pkt_metadata_init(&packet->md, data->out_port);
> +            dp_packet_batch_refill(batch, packet, i);
> +        }
>       }
>   
>       return 0;
> diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c
> new file mode 100644
> index 000000000..6a4a0149b
> --- /dev/null
> +++ b/lib/userspace-tso.c
> @@ -0,0 +1,53 @@
> +/*
> + * Copyright (c) 2020 Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "smap.h"
> +#include "ovs-thread.h"
> +#include "openvswitch/vlog.h"
> +#include "dpdk.h"
> +#include "userspace-tso.h"
> +#include "vswitch-idl.h"
> +
> +VLOG_DEFINE_THIS_MODULE(userspace_tso);
> +
> +static bool userspace_tso = false;
> +
> +void
> +userspace_tso_init(const struct smap *ovs_other_config)
> +{
> +    if (smap_get_bool(ovs_other_config, "userspace-tso-enable", false)) {
> +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +
> +        if (ovsthread_once_start(&once)) {
> +#ifdef DPDK_NETDEV
> +            VLOG_INFO("Userspace TCP Segmentation Offloading support enabled");
> +            userspace_tso = true;
> +#else
> +            VLOG_WARN("Userspace TCP Segmentation Offloading can not be enabled"
> +                      "since OVS is built without DPDK support.");
> +#endif
> +            ovsthread_once_done(&once);
> +        }
> +    }
> +}
> +
> +bool
> +userspace_tso_enabled(void)
> +{
> +    return userspace_tso;
> +}
> diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h
> new file mode 100644
> index 000000000..0758274c0
> --- /dev/null
> +++ b/lib/userspace-tso.h
> @@ -0,0 +1,23 @@
> +/*
> + * Copyright (c) 2020 Red Hat Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef USERSPACE_TSO_H
> +#define USERSPACE_TSO_H 1
> +
> +void userspace_tso_init(const struct smap *ovs_other_config);
> +bool userspace_tso_enabled(void);
> +
> +#endif /* userspace-tso.h */
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 86c7b10a9..e591c26a6 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -65,6 +65,7 @@
>   #include "system-stats.h"
>   #include "timeval.h"
>   #include "tnl-ports.h"
> +#include "userspace-tso.h"
>   #include "util.h"
>   #include "unixctl.h"
>   #include "lib/vswitch-idl.h"
> @@ -3285,6 +3286,7 @@ bridge_run(void)
>       if (cfg) {
>           netdev_set_flow_api_enabled(&cfg->other_config);
>           dpdk_init(&cfg->other_config);
> +        userspace_tso_init(&cfg->other_config);
>       }
>   
>       /* Initialize the ofproto library.  This only needs to run once, but
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index c43cb1aa4..3ddaaefda 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -690,6 +690,26 @@
>            once in few hours or a day or a week.
>           </p>
>         </column>
> +      <column name="other_config" key="userspace-tso-enable"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Set this value to <code>true</code> to enable userspace support for
> +          TCP Segmentation Offloading (TSO). When it is enabled, the interfaces
> +          can provide an oversized TCP segment to the datapath and the datapath
> +          will offload the TCP segmentation and checksum calculation to the
> +          interfaces when necessary.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>. Changing this value requires
> +          restarting the daemon.
> +        </p>
> +        <p>
> +          The feature only works if Open vSwitch is built with DPDK support.
> +        </p>
> +        <p>
> +          The feature is considered experimental.
> +        </p>
> +      </column>
>       </group>
>       <group title="Status">
>         <column name="next_cfg">
>
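
To summarize the provider-side contract introduced above: netdev_send_prepare_packet() drops packets marked for TSO or L4 checksum offload unless the egress netdev advertises the matching NETDEV_TX_OFFLOAD_* bits. A minimal, hypothetical sketch of what an implementation needs to do (function name invented; it simply mirrors the netdev-linux and netdev-dpdk hunks above):

    #include "netdev-provider.h"
    #include "userspace-tso.h"

    /* Hypothetical construct hook: advertise TX offloads so that packets
     * carrying TSO or L4 checksum offload flags are not dropped on this
     * port by netdev_send_prepare_batch(). */
    static int
    my_netdev_construct(struct netdev *netdev)
    {
        if (userspace_tso_enabled()) {
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO
                                | NETDEV_TX_OFFLOAD_TCP_CKSUM
                                | NETDEV_TX_OFFLOAD_IPV4_CKSUM;
        }
        return 0;
    }
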
0-day Robot Jan. 17, 2020, 9:59 p.m. UTC | #2
Bleep bloop.  Greetings Flavio Leitner, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 80 characters long (recommended limit is 79)
#1901 FILE: lib/userspace-tso.c:41:
            VLOG_WARN("Userspace TCP Segmentation Offloading can not be enabled"

Lines checked: 1996, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email aconole@redhat.com

Thanks,
0-day Robot
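
The flagged line is the VLOG_WARN() added in userspace_tso_init() (lib/userspace-tso.c); note that, as posted, its two string literals also concatenate without a space between "enabled" and "since". One possible re-wrap that stays within 79 columns and restores the space (illustrative only, not the posted patch):

            VLOG_WARN("Userspace TCP Segmentation Offloading can not be "
                      "enabled since OVS is built without DPDK support.");
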
Stokes, Ian Jan. 17, 2020, 10:55 p.m. UTC | #3
On 1/17/2020 9:54 PM, Stokes, Ian wrote:
> 
> 
> On 1/17/2020 9:47 PM, Flavio Leitner wrote:
>> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
>> the network stack to delegate the TCP segmentation to the NIC reducing
>> the per packet CPU overhead.
>>
>> A guest using vhostuser interface with TSO enabled can send TCP packets
>> much bigger than the MTU, which saves CPU cycles normally used to break
>> the packets down to MTU size and to calculate checksums.
>>
>> It also saves CPU cycles used to parse multiple packets/headers during
>> the packet processing inside virtual switch.
>>
>> If the destination of the packet is another guest in the same host, then
>> the same big packet can be sent through a vhostuser interface skipping
>> the segmentation completely. However, if the destination is not local,
>> the NIC hardware is instructed to do the TCP segmentation and checksum
>> calculation.
>>
>> It is recommended to check if NIC hardware supports TSO before enabling
>> the feature, which is off by default. For additional information please
>> check the tso.rst document.
>>
>> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> 
> Fantastic work here Flavio, quick turnaround when needed.
> 
> Acked

Are there any objections to merging this?

There's been nothing so far.

If there are no further objections, I will merge this at the end of the hour.

BR
Ian
> 
> BR
> Ian
>> ---
>>   Documentation/automake.mk              |   1 +
>>   Documentation/topics/index.rst         |   1 +
>>   Documentation/topics/userspace-tso.rst |  98 +++++++
>>   NEWS                                   |   1 +
>>   lib/automake.mk                        |   2 +
>>   lib/conntrack.c                        |  29 +-
>>   lib/dp-packet.h                        | 176 ++++++++++-
>>   lib/ipf.c                              |  32 +-
>>   lib/netdev-dpdk.c                      | 348 +++++++++++++++++++---
>>   lib/netdev-linux-private.h             |   5 +
>>   lib/netdev-linux.c                     | 386 ++++++++++++++++++++++---
>>   lib/netdev-provider.h                  |   9 +
>>   lib/netdev.c                           |  78 ++++-
>>   lib/userspace-tso.c                    |  53 ++++
>>   lib/userspace-tso.h                    |  23 ++
>>   vswitchd/bridge.c                      |   2 +
>>   vswitchd/vswitch.xml                   |  20 ++
>>   17 files changed, 1140 insertions(+), 124 deletions(-)
>>   create mode 100644 Documentation/topics/userspace-tso.rst
>>   create mode 100644 lib/userspace-tso.c
>>   create mode 100644 lib/userspace-tso.h
>>
>> Testing:
>>   - Travis, Cirrus, AppVeyor, testsuite passed OK.
>>   - notice no changes since v4 with regards to performance.
>>
>> Changelog:
>> - v5
>>   * rebased on top of master (NEWS conflict)
>>   * added missing periods at the end of comments
>>   * mention DPDK requirement at vswitch.xml
>>   * restricted tso feature to OvS built with dpdk
>>   * headers in alphabetical order
>>   * removed unneeded call to initialize pkt
>>   * used OVS_UNLIKELY instead of unlikely
>>   * removed parenthesis from sizeof()
>>   * removed blank line at dp_packet_hwol_tx_l4_checksum()
>>   * removed redundant dp_packet_hwol_tx_ipv4_checksum()
>>   * updated function comments as suggested
>>
>> - v4
>>   * rebased on top of master (recvmmsg)
>>   * fixed URL in doc to point to 19.11
>>   * renamed tso to userspace-tso
>>   * renamed the option to userspace-tso-enable
>>   * removed prototype that left over from v2
>>   * fixed function style declaration
>>   * renamed dp_packet_hwol_tx_ip_checksum to 
>> dp_packet_hwol_tx_ipv4_checksum
>>   * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
>>   * account for drops while preping the batch for TX.
>>   * don't prep the batch for TX if TSO is disabled.
>>   * simplified setsockopt error checking
>>   * fixed af_packet_sock error checking to not call setsockopt on
>>        closed sockets.
>>   * fixed ol_flags comment.
>>   * used VLOG_ERR_BUF() to pass error messages.
>>   * fixed packet leak at netdev_send_prepare_batch()
>>   * added a coverage counter to account drops while preparing a batch
>>     at netdev.c
>>   * fixed netdev_send() to not call ->send() if the batch is empty.
>>   * fixed packet leak at netdev_push_header and account for the drops.
>>   * removed DPDK requirement to enable userspace TSO support.
>>   * fixed parameter documentation in vswitch.xml.
>>   * renamed tso.rst to userspace-tso.rst and moved to topics/
>>   * added comments documeting the functions in dp-packet.h
>>   * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG
>>
>> - v3
>>   * Improved the documentation.
>>   * Updated copyright year to 2020.
>>   * TSO offloaded msg now includes the netdev's name.
>>   * Added period at the end of all code comments.
>>   * Warn and drop encapsulation of TSO packets.
>>   * Fixed travis issue with restricted virtio types.
>>   * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
>>     which caused packet corruption.
>>   * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
>>     PKT_TX_IP_CKSUM only for IPv4 packets.
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index f2ca17bad..22976a3cd 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -57,6 +57,7 @@ DOC_SOURCE = \
>>       Documentation/topics/ovsdb-replication.rst \
>>       Documentation/topics/porting.rst \
>>       Documentation/topics/tracing.rst \
>> +    Documentation/topics/userspace-tso.rst \
>>       Documentation/topics/windows.rst \
>>       Documentation/howto/index.rst \
>>       Documentation/howto/dpdk.rst \
>> diff --git a/Documentation/topics/index.rst 
>> b/Documentation/topics/index.rst
>> index 34c4b10e0..08af3a24d 100644
>> --- a/Documentation/topics/index.rst
>> +++ b/Documentation/topics/index.rst
>> @@ -50,5 +50,6 @@ OVS
>>      language-bindings
>>      testing
>>      tracing
>> +   userspace-tso
>>      idl-compound-indexes
>>      ovs-extensions
>> diff --git a/Documentation/topics/userspace-tso.rst 
>> b/Documentation/topics/userspace-tso.rst
>> new file mode 100644
>> index 000000000..893c64839
>> --- /dev/null
>> +++ b/Documentation/topics/userspace-tso.rst
>> @@ -0,0 +1,98 @@
>> +..
>> +      Copyright 2020, Red Hat, Inc.
>> +
>> +      Licensed under the Apache License, Version 2.0 (the "License"); 
>> you may
>> +      not use this file except in compliance with the License. You 
>> may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, 
>> software
>> +      distributed under the License is distributed on an "AS IS" 
>> BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>> implied. See the
>> +      License for the specific language governing permissions and 
>> limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +========================
>> +Userspace Datapath - TSO
>> +========================
>> +
>> +**Note:** This feature is considered experimental.
>> +
>> +TCP Segmentation Offload (TSO) enables a network stack to delegate 
>> segmentation
>> +of an oversized TCP segment to the underlying physical NIC. Offload 
>> of frame
>> +segmentation achieves computational savings in the core, freeing up 
>> CPU cycles
>> +for more useful work.
>> +
>> +A common use case for TSO is when using virtualization, where traffic 
>> that's
>> +coming in from a VM can offload the TCP segmentation, thus avoiding the
>> +fragmentation in software. Additionally, if the traffic is headed to 
>> a VM
>> +within the same host further optimization can be expected. As the 
>> traffic never
>> +leaves the machine, no MTU needs to be accounted for, and thus no 
>> segmentation
>> +and checksum calculations are required, which saves yet more cycles. 
>> Only when
>> +the traffic actually leaves the host the segmentation needs to 
>> happen, in which
>> +case it will be performed by the egress NIC. Consult your controller's
>> +datasheet for compatibility. Secondly, the NIC must have an 
>> associated DPDK
>> +Poll Mode Driver (PMD) which supports `TSO`. For a list of features 
>> per PMD,
>> +refer to the `DPDK documentation`__.
>> +
>> +__ https://doc.dpdk.org/guides-19.11/nics/overview.html
>> +
>> +Enabling TSO
>> +~~~~~~~~~~~~
>> +
>> +The TSO support may be enabled via a global config value
>> +``userspace-tso-enable``.  Setting this to ``true`` enables TSO 
>> support for
>> +all ports.
>> +
>> +    $ ovs-vsctl set Open_vSwitch . 
>> other_config:userspace-tso-enable=true
>> +
>> +The default value is ``false``.
>> +
>> +Changing ``userspace-tso-enable`` requires restarting the daemon.
>> +
>> +When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled
>> +as follows.
>> +
>> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
>> +connection is established, `TSO` is thus advertised to the guest as an
>> +available feature:
>> +
>> +QEMU Command Line Parameter::
>> +
>> +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
>> +    ...
>> +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
>> +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
>> +    ...
>> +
>> +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool 
>> can be
>> +used to enable same::
>> +
>> +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite 
>> for TSO
>> +    $ ethtool -K eth0 tso on
>> +    $ ethtool -k eth0
>> +
>> +~~~~~~~~~~~
>> +Limitations
>> +~~~~~~~~~~~
>> +
>> +The current OvS userspace `TSO` implementation supports flat and VLAN 
>> networks
>> +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, 
>> IPinIP,
>> +etc.]).
>> +
>> +There is no software implementation of TSO, so all ports attached to the
>> +datapath must support TSO or packets using that feature will be dropped
>> +on ports without TSO support.  That also means guests using vhost-user
>> +in client mode will receive TSO packet regardless of TSO being enabled
>> +or disabled within the guest.
>> diff --git a/NEWS b/NEWS
>> index 579e91c89..c6d3b6053 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -30,6 +30,7 @@ Post-v2.12.0
>>        * Add support for DPDK 19.11.
>>        * Add hardware offload support for output, drop, set of MAC, 
>> IPv4 and
>>          TCP/UDP ports actions (experimental).
>> +     * Add experimental support for TSO.
>>      - RSTP:
>>        * The rstp_statistics column in Port table will only be updated 
>> every
>>          stats-update-interval configured in Open_vSwitch table.
>> diff --git a/lib/automake.mk b/lib/automake.mk
>> index ebf714501..95925b57c 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -314,6 +314,8 @@ lib_libopenvswitch_la_SOURCES = \
>>       lib/unicode.h \
>>       lib/unixctl.c \
>>       lib/unixctl.h \
>> +    lib/userspace-tso.c \
>> +    lib/userspace-tso.h \
>>       lib/util.c \
>>       lib/util.h \
>>       lib/uuid.c \
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> index b80080e72..60222ca53 100644
>> --- a/lib/conntrack.c
>> +++ b/lib/conntrack.c
>> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct 
>> dp_packet *pkt, ovs_be16 dl_type,
>>           if (hwol_bad_l3_csum) {
>>               ok = false;
>>           } else {
>> -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
>> +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
>> +                                     || dp_packet_hwol_is_ipv4(pkt);
>>               /* Validate the checksum only when hwol is not 
>> supported. */
>>               ok = extract_l3_ipv4(&ctx->key, l3, 
>> dp_packet_l3_size(pkt), NULL,
>>                                    !hwol_good_l3_csum);
>> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct 
>> dp_packet *pkt, ovs_be16 dl_type,
>>       if (ok) {
>>           bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
>>           if (!hwol_bad_l4_csum) {
>> -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
>> +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
>> +                                      || 
>> dp_packet_hwol_tx_l4_checksum(pkt);
>>               /* Validate the checksum only when hwol is not 
>> supported. */
>>               if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
>>                              &ctx->icmp_related, l3, !hwol_good_l4_csum,
>> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const 
>> struct conn_lookup_ctx *ctx,
>>                   }
>>                   if (seq_skew) {
>>                       ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
>> -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
>> -                                          l3_hdr->ip_tot_len, 
>> htons(ip_len));
>> +                    if (!dp_packet_hwol_is_ipv4(pkt)) {
>> +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
>> +                                                        
>> l3_hdr->ip_tot_len,
>> +                                                        htons(ip_len));
>> +                    }
>>                       l3_hdr->ip_tot_len = htons(ip_len);
>>                   }
>>               }
>> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const 
>> struct conn_lookup_ctx *ctx,
>>       }
>>       th->tcp_csum = 0;
>> -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
>> -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, 
>> ctx->key.nw_proto,
>> -                           dp_packet_l4_size(pkt));
>> -    } else {
>> -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
>> -        th->tcp_csum = csum_finish(
>> -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
>> +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
>> +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
>> +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, 
>> ctx->key.nw_proto,
>> +                               dp_packet_l4_size(pkt));
>> +        } else {
>> +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
>> +            th->tcp_csum = csum_finish(
>> +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
>> +        }
>>       }
>>       if (seq_skew) {
>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> index 133942155..69ae5dfac 100644
>> --- a/lib/dp-packet.h
>> +++ b/lib/dp-packet.h
>> @@ -456,7 +456,7 @@ dp_packet_init_specific(struct dp_packet *p)
>>   {
>>       /* This initialization is needed for packets that do not come 
>> from DPDK
>>        * interfaces, when vswitchd is built with --with-dpdk. */
>> -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>> +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>>       p->mbuf.nb_segs = 1;
>>       p->mbuf.next = NULL;
>>   }
>> @@ -519,6 +519,95 @@ dp_packet_set_allocated(struct dp_packet *b, 
>> uint16_t s)
>>       b->mbuf.buf_len = s;
>>   }
>> +/* Returns 'true' if packet 'b' is marked for TCP segmentation 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_is_tso(const struct dp_packet *b)
>> +{
>> +    return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG);
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for IPv4 checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
>> +{
>> +    return !!(b->mbuf.ol_flags & PKT_TX_IPV4);
>> +}
>> +
>> +/* Returns the L4 cksum offload bitmask. */
>> +static inline uint64_t
>> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
>> +{
>> +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for TCP checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
>> +{
>> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM;
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for UDP checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
>> +{
>> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM;
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for SCTP checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
>> +{
>> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for IPv4 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_IPV4;
>> +}
>> +
>> +/* Mark packet 'b' for IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_IPV6;
>> +}
>> +
>> +/* Mark packet 'b' for TCP checksum offloading.  It implies that either
>> + * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for UDP checksum offloading.  It implies that either
>> + * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_csum_udp(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for SCTP checksum offloading.  It implies that either
>> + * the packet 'b' is marked for IPv4 or IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for TCP segmentation offloading.  It implies that
>> + * either the packet 'b' is marked for IPv4 or IPv6 checksum offloading
>> + * and also for TCP checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
>> +}
>> +
>>   /* Returns the RSS hash of the packet 'p'.  Note that the returned 
>> value is
>>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>>   static inline uint32_t
>> @@ -648,6 +737,84 @@ dp_packet_set_allocated(struct dp_packet *b, 
>> uint16_t s)
>>       b->allocated_ = s;
>>   }
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline bool
>> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline bool
>> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline uint64_t
>> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return 0;
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline void
>> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline void
>> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline void
>> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There are no implementation when not DPDK enabled datapath. */
>> +static inline void
>> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>>   /* Returns the RSS hash of the packet 'p'.  Note that the returned 
>> value is
>>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>>   static inline uint32_t
>> @@ -939,6 +1106,13 @@ dp_packet_batch_reset_cutlen(struct 
>> dp_packet_batch *batch)
>>       }
>>   }
>> +/* Return true if the packet 'b' requested L4 checksum offload. */
>> +static inline bool
>> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
>> +{
>> +    return !!dp_packet_hwol_l4_mask(b);
>> +}
>> +
>>   #ifdef  __cplusplus
>>   }
>>   #endif
>> diff --git a/lib/ipf.c b/lib/ipf.c
>> index 45c489122..446e89d13 100644
>> --- a/lib/ipf.c
>> +++ b/lib/ipf.c
>> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
>>       len += rest_len;
>>       l3 = dp_packet_l3(pkt);
>>       ovs_be16 new_ip_frag_off = l3->ip_frag_off & 
>> ~htons(IP_MORE_FRAGMENTS);
>> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
>> -                                new_ip_frag_off);
>> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, 
>> htons(len));
>> +    if (!dp_packet_hwol_is_ipv4(pkt)) {
>> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
>> +                                    new_ip_frag_off);
>> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, 
>> htons(len));
>> +    }
>>       l3->ip_tot_len = htons(len);
>>       l3->ip_frag_off = new_ip_frag_off;
>>       dp_packet_set_l2_pad_size(pkt, 0);
>> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct 
>> dp_packet *pkt)
>>       }
>>       if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
>> +                     && !dp_packet_hwol_is_ipv4(pkt)
>>                        && csum(l3, ip_hdr_len) != 0)) {
>>           goto invalid_pkt;
>>       }
>> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
>>                   } else {
>>                       struct ip_header *l3_frag = 
>> dp_packet_l3(frag_0->pkt);
>>                       struct ip_header *l3_reass = dp_packet_l3(pkt);
>> -                    ovs_be32 reass_ip = 
>> get_16aligned_be32(&l3_reass->ip_src);
>> -                    ovs_be32 frag_ip = 
>> get_16aligned_be32(&l3_frag->ip_src);
>> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
>> -                                                     frag_ip, reass_ip);
>> -                    l3_frag->ip_src = l3_reass->ip_src;
>> +                    if (!dp_packet_hwol_is_ipv4(frag_0->pkt)) {
>> +                        ovs_be32 reass_ip =
>> +                            get_16aligned_be32(&l3_reass->ip_src);
>> +                        ovs_be32 frag_ip =
>> +                            get_16aligned_be32(&l3_frag->ip_src);
>> +
>> +                        l3_frag->ip_csum = 
>> recalc_csum32(l3_frag->ip_csum,
>> +                                                         frag_ip, 
>> reass_ip);
>> +                        reass_ip = 
>> get_16aligned_be32(&l3_reass->ip_dst);
>> +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
>> +                        l3_frag->ip_csum = 
>> recalc_csum32(l3_frag->ip_csum,
>> +                                                         frag_ip, 
>> reass_ip);
>> +                    }
>> -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
>> -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
>> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
>> -                                                     frag_ip, reass_ip);
>> +                    l3_frag->ip_src = l3_reass->ip_src;
>>                       l3_frag->ip_dst = l3_reass->ip_dst;
>>                   }
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index d1469f6f2..b108cbd6b 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -72,6 +72,7 @@
>>   #include "timeval.h"
>>   #include "unaligned.h"
>>   #include "unixctl.h"
>> +#include "userspace-tso.h"
>>   #include "util.h"
>>   #include "uuid.h"
>> @@ -201,6 +202,8 @@ struct netdev_dpdk_sw_stats {
>>       uint64_t tx_qos_drops;
>>       /* Packet drops in ingress policer processing. */
>>       uint64_t rx_qos_drops;
>> +    /* Packet drops in HWOL processing. */
>> +    uint64_t tx_invalid_hwol_drops;
>>   };
>>   enum { DPDK_RING_SIZE = 256 };
>> @@ -410,7 +413,8 @@ struct ingress_policer {
>>   enum dpdk_hw_ol_features {
>>       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
>>       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
>> -    NETDEV_RX_HW_SCATTER = 1 << 2
>> +    NETDEV_RX_HW_SCATTER = 1 << 2,
>> +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
>>   };
>>   /*
>> @@ -992,6 +996,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, 
>> int n_rxq, int n_txq)
>>           conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
>>       }
>> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
>> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
>> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
>> +    }
>> +
>>       /* Limit configured rss hash functions to only those supported
>>        * by the eth device. */
>>       conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
>> @@ -1093,6 +1103,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
>>                                        DEV_RX_OFFLOAD_TCP_CKSUM |
>>                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
>> +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
>> +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
>> +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
>>       rte_eth_dev_info_get(dev->port_id, &info);
>> @@ -1119,6 +1132,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
>>       }
>> +    if (info.tx_offload_capa & tx_tso_offload_capa) {
>> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
>> +    } else {
>> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
>> +        VLOG_WARN("Tx TSO offload is not supported on %s port "
>> +                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), 
>> dev->port_id);
>> +    }
>> +
>>       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>>       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>> @@ -1369,14 +1390,16 @@ netdev_dpdk_vhost_construct(struct netdev 
>> *netdev)
>>           goto out;
>>       }
>> -    err = rte_vhost_driver_disable_features(dev->vhost_id,
>> -                                1ULL << VIRTIO_NET_F_HOST_TSO4
>> -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> -                                | 1ULL << VIRTIO_NET_F_CSUM);
>> -    if (err) {
>> -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost 
>> user "
>> -                 "port: %s\n", name);
>> -        goto out;
>> +    if (!userspace_tso_enabled()) {
>> +        err = rte_vhost_driver_disable_features(dev->vhost_id,
>> +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
>> +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> +                                    | 1ULL << VIRTIO_NET_F_CSUM);
>> +        if (err) {
>> +            VLOG_ERR("rte_vhost_driver_disable_features failed for 
>> vhost user "
>> +                     "port: %s\n", name);
>> +            goto out;
>> +        }
>>       }
>>       err = rte_vhost_driver_start(dev->vhost_id);
>> @@ -1711,6 +1734,11 @@ netdev_dpdk_get_config(const struct netdev 
>> *netdev, struct smap *args)
>>           } else {
>>               smap_add(args, "rx_csum_offload", "false");
>>           }
>> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +            smap_add(args, "tx_tso_offload", "true");
>> +        } else {
>> +            smap_add(args, "tx_tso_offload", "false");
>> +        }
>>           smap_add(args, "lsc_interrupt_mode",
>>                    dev->lsc_interrupt_mode ? "true" : "false");
>>       }
>> @@ -2138,6 +2166,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
>>       rte_free(rx);
>>   }
>> +/* Prepare the packet for HWOL.
>> + * Return True if the packet is OK to continue. */
>> +static bool
>> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf 
>> *mbuf)
>> +{
>> +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
>> +
>> +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
>> +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char 
>> *)dp_packet_eth(pkt);
>> +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char 
>> *)dp_packet_l3(pkt);
>> +        mbuf->outer_l2_len = 0;
>> +        mbuf->outer_l3_len = 0;
>> +    }
>> +
>> +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
>> +        struct tcp_header *th = dp_packet_l4(pkt);
>> +
>> +        if (!th) {
>> +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
>> +                         " pkt len: %"PRIu32"", dev->up.name, 
>> mbuf->pkt_len);
>> +            return false;
>> +        }
>> +
>> +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
>> +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
>> +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
>> +
>> +        if (mbuf->ol_flags & PKT_TX_IPV4) {
>> +            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
>> +/* Prepare a batch for HWOL.
>> + * Return the number of good packets in the batch. */
>> +static int
>> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf 
>> **pkts,
>> +                            int pkt_cnt)
>> +{
>> +    int i = 0;
>> +    int cnt = 0;
>> +    struct rte_mbuf *pkt;
>> +
>> +    /* Prepare and filter bad HWOL packets. */
>> +    for (i = 0; i < pkt_cnt; i++) {
>> +        pkt = pkts[i];
>> +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
>> +            rte_pktmbuf_free(pkt);
>> +            continue;
>> +        }
>> +
>> +        if (OVS_UNLIKELY(i != cnt)) {
>> +            pkts[cnt] = pkt;
>> +        }
>> +        cnt++;
>> +    }
>> +
>> +    return cnt;
>> +}
>> +
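To make the offload metadata concrete, here is a rough worked example of what
netdev_dpdk_prep_hwol_packet() above ends up setting for a plain
Ethernet/IPv4/TCP packet marked PKT_TX_TCP_SEG on a port with MTU 1500 and no
IP/TCP options (my numbers, not part of the patch):

    mbuf->l2_len    = 14;               /* Ethernet header. */
    mbuf->l3_len    = 20;               /* IPv4 header without options. */
    mbuf->l4_len    = 20;               /* TCP data offset of 5 words. */
    mbuf->tso_segsz = 1500 - 20 - 20;   /* 1460 bytes of payload per segment. */
    mbuf->ol_flags |= PKT_TX_TCP_CKSUM | PKT_TX_IP_CKSUM;

PKT_TX_IP_CKSUM is added only because the packet also carries PKT_TX_IPV4; an
IPv6 packet would get just the TCP checksum request.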
>>   /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes 
>> ownership of
>>    * 'pkts', even in case of failure.
>>    *
>> @@ -2147,11 +2236,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk 
>> *dev, int qid,
>>                            struct rte_mbuf **pkts, int cnt)
>>   {
>>       uint32_t nb_tx = 0;
>> +    uint16_t nb_tx_prep = cnt;
>> +
>> +    if (userspace_tso_enabled()) {
>> +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
>> +        if (nb_tx_prep != cnt) {
>> +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid 
>> packets. "
>> +                         "Only %u/%u are valid: %s", dev->up.name, 
>> nb_tx_prep,
>> +                         cnt, rte_strerror(rte_errno));
>> +        }
>> +    }
>> -    while (nb_tx != cnt) {
>> +    while (nb_tx != nb_tx_prep) {
>>           uint32_t ret;
>> -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - 
>> nb_tx);
>> +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
>> +                               nb_tx_prep - nb_tx);
>>           if (!ret) {
>>               break;
>>           }
>> @@ -2437,11 +2537,14 @@ netdev_dpdk_filter_packet_len(struct 
>> netdev_dpdk *dev, struct rte_mbuf **pkts,
>>       int cnt = 0;
>>       struct rte_mbuf *pkt;
>> +    /* Filter oversized packets, unless they are marked for TSO. */
>>       for (i = 0; i < pkt_cnt; i++) {
>>           pkt = pkts[i];
>> -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
>> -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " 
>> max_packet_len %d",
>> -                         dev->up.name, pkt->pkt_len, 
>> dev->max_packet_len);
>> +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
>> +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
>> +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
>> +                         "max_packet_len %d", dev->up.name, 
>> pkt->pkt_len,
>> +                         dev->max_packet_len);
>>               rte_pktmbuf_free(pkt);
>>               continue;
>>           }
>> @@ -2463,7 +2566,8 @@ netdev_dpdk_vhost_update_tx_counters(struct 
>> netdev_dpdk *dev,
>>   {
>>       int dropped = sw_stats_add->tx_mtu_exceeded_drops +
>>                     sw_stats_add->tx_qos_drops +
>> -                  sw_stats_add->tx_failure_drops;
>> +                  sw_stats_add->tx_failure_drops +
>> +                  sw_stats_add->tx_invalid_hwol_drops;
>>       struct netdev_stats *stats = &dev->stats;
>>       int sent = attempted - dropped;
>>       int i;
>> @@ -2482,6 +2586,7 @@ netdev_dpdk_vhost_update_tx_counters(struct 
>> netdev_dpdk *dev,
>>           sw_stats->tx_failure_drops      += 
>> sw_stats_add->tx_failure_drops;
>>           sw_stats->tx_mtu_exceeded_drops += 
>> sw_stats_add->tx_mtu_exceeded_drops;
>>           sw_stats->tx_qos_drops          += sw_stats_add->tx_qos_drops;
>> +        sw_stats->tx_invalid_hwol_drops += 
>> sw_stats_add->tx_invalid_hwol_drops;
>>       }
>>   }
>> @@ -2513,8 +2618,15 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, 
>> int qid,
>>           rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>>       }
>> +    sw_stats_add.tx_invalid_hwol_drops = cnt;
>> +    if (userspace_tso_enabled()) {
>> +        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
>> +    }
>> +
>> +    sw_stats_add.tx_invalid_hwol_drops -= cnt;
>> +    sw_stats_add.tx_mtu_exceeded_drops = cnt;
>>       cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
>> -    sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
>> +    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
>>       /* Check if QoS has been configured for the netdev */
>>       sw_stats_add.tx_qos_drops = cnt;
>> @@ -2562,6 +2674,120 @@ out:
>>       }
>>   }
>> +static void
>> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
>> +{
>> +    rte_free(opaque);
>> +}
>> +
>> +static struct rte_mbuf *
>> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
>> +{
>> +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
>> +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
>> +    uint16_t buf_len;
>> +    void *buf;
>> +
>> +    if (rte_pktmbuf_tailroom(pkt) >= sizeof *shinfo) {
>> +        shinfo = rte_pktmbuf_mtod(pkt, struct 
>> rte_mbuf_ext_shared_info *);
>> +    } else {
>> +        total_len += sizeof *shinfo + sizeof(uintptr_t);
>> +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> +    }
>> +
>> +    if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
>> +        VLOG_ERR("Can't copy packet: too big %u", total_len);
>> +        return NULL;
>> +    }
>> +
>> +    buf_len = total_len;
>> +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> +    if (OVS_UNLIKELY(buf == NULL)) {
>> +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", 
>> buf_len);
>> +        return NULL;
>> +    }
>> +
>> +    /* Initialize shinfo. */
>> +    if (shinfo) {
>> +        shinfo->free_cb = netdev_dpdk_extbuf_free;
>> +        shinfo->fcb_opaque = buf;
>> +        rte_mbuf_ext_refcnt_set(shinfo, 1);
>> +    } else {
>> +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> +                                                    
>> netdev_dpdk_extbuf_free,
>> +                                                    buf);
>> +        if (OVS_UNLIKELY(shinfo == NULL)) {
>> +            rte_free(buf);
>> +            VLOG_ERR("Failed to initialize shared info for mbuf while "
>> +                     "attempting to attach an external buffer.");
>> +            return NULL;
>> +        }
>> +    }
>> +
>> +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), 
>> buf_len,
>> +                              shinfo);
>> +    rte_pktmbuf_reset_headroom(pkt);
>> +
>> +    return pkt;
>> +}
>> +
>> +static struct rte_mbuf *
>> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
>> +{
>> +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>> +
>> +    if (OVS_UNLIKELY(!pkt)) {
>> +        return NULL;
>> +    }
>> +
>> +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
>> +        return pkt;
>> +    }
>> +
>> +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
>> +        return pkt;
>> +    }
>> +
>> +    rte_pktmbuf_free(pkt);
>> +
>> +    return NULL;
>> +}
>> +
>> +static struct dp_packet *
>> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet 
>> *pkt_orig)
>> +{
>> +    struct rte_mbuf *mbuf_dest;
>> +    struct dp_packet *pkt_dest;
>> +    uint32_t pkt_len;
>> +
>> +    pkt_len = dp_packet_size(pkt_orig);
>> +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
>> +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
>> +        return NULL;
>> +    }
>> +
>> +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
>> +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
>> +    dp_packet_set_size(pkt_dest, pkt_len);
>> +
>> +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
>> +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
>> +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
>> +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
>> +
>> +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
>> +           sizeof(struct dp_packet) - offsetof(struct dp_packet, 
>> l2_pad_size));
>> +
>> +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
>> +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
>> +                                - (char *)dp_packet_eth(pkt_dest);
>> +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
>> +                                - (char *) dp_packet_l3(pkt_dest);
>> +    }
>> +
>> +    return pkt_dest;
>> +}
>> +
>>   /* Tx function. Transmit packets indefinitely */
>>   static void
>>   dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
>> dp_packet_batch *batch)
>> @@ -2575,7 +2801,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, 
>> struct dp_packet_batch *batch)
>>       enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
>>   #endif
>>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
>> +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
>>       struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>>       uint32_t cnt = batch_cnt;
>>       uint32_t dropped = 0;
>> @@ -2596,34 +2822,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int 
>> qid, struct dp_packet_batch *batch)
>>           struct dp_packet *packet = batch->packets[i];
>>           uint32_t size = dp_packet_size(packet);
>> -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
>> -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
>> -                         size, dev->max_packet_len);
>> -
>> +        if (size > dev->max_packet_len
>> +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
>> +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
>> +                         dev->max_packet_len);
>>               mtu_drops++;
>>               continue;
>>           }
>> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>> +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, 
>> packet);
>>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>>               dropped = cnt - i;
>>               break;
>>           }
>> -        /* We have to do a copy for now */
>> -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
>> -               dp_packet_data(packet), size);
>> -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
>> -
>>           txcnt++;
>>       }
>>       if (OVS_LIKELY(txcnt)) {
>>           if (dev->type == DPDK_DEV_VHOST) {
>> -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet 
>> **) pkts,
>> -                                     txcnt);
>> +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
>>           } else {
>> -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, 
>> txcnt);
>> +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
>> +                                                   (struct rte_mbuf 
>> **)pkts,
>> +                                                   txcnt);
>>           }
>>       }
>> @@ -2676,26 +2898,33 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, 
>> int qid,
>>           dp_packet_delete_batch(batch, true);
>>       } else {
>>           struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>> -        int tx_cnt, dropped;
>> -        int tx_failure, mtu_drops, qos_drops;
>> +        int dropped;
>> +        int tx_failure, mtu_drops, qos_drops, hwol_drops;
>>           int batch_cnt = dp_packet_batch_size(batch);
>>           struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>> -        tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
>> -        mtu_drops = batch_cnt - tx_cnt;
>> -        qos_drops = tx_cnt;
>> -        tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
>> -        qos_drops -= tx_cnt;
>> +        hwol_drops = batch_cnt;
>> +        if (userspace_tso_enabled()) {
>> +            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, 
>> batch_cnt);
>> +        }
>> +        hwol_drops -= batch_cnt;
>> +        mtu_drops = batch_cnt;
>> +        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
>> +        mtu_drops -= batch_cnt;
>> +        qos_drops = batch_cnt;
>> +        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
>> +        qos_drops -= batch_cnt;
>> -        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt);
>> +        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, 
>> batch_cnt);
>> -        dropped = tx_failure + mtu_drops + qos_drops;
>> +        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
>>           if (OVS_UNLIKELY(dropped)) {
>>               rte_spinlock_lock(&dev->stats_lock);
>>               dev->stats.tx_dropped += dropped;
>>               sw_stats->tx_failure_drops += tx_failure;
>>               sw_stats->tx_mtu_exceeded_drops += mtu_drops;
>>               sw_stats->tx_qos_drops += qos_drops;
>> +            sw_stats->tx_invalid_hwol_drops += hwol_drops;
>>               rte_spinlock_unlock(&dev->stats_lock);
>>           }
>>       }
>> @@ -3011,7 +3240,8 @@ netdev_dpdk_get_sw_custom_stats(const struct 
>> netdev *netdev,
>>       SW_CSTAT(tx_failure_drops)       \
>>       SW_CSTAT(tx_mtu_exceeded_drops)  \
>>       SW_CSTAT(tx_qos_drops)           \
>> -    SW_CSTAT(rx_qos_drops)
>> +    SW_CSTAT(rx_qos_drops)           \
>> +    SW_CSTAT(tx_invalid_hwol_drops)
>>   #define SW_CSTAT(NAME) + 1
>>       custom_stats->size = SW_CSTATS;
>> @@ -4874,6 +5104,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>       rte_free(dev->tx_q);
>>       err = dpdk_eth_dev_init(dev);
>> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
>> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
>> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
>> +    }
>> +
>>       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>>       if (!dev->tx_q) {
>>           err = ENOMEM;
>> @@ -4903,6 +5139,11 @@ dpdk_vhost_reconfigure_helper(struct 
>> netdev_dpdk *dev)
>>           dev->tx_q[0].map = 0;
>>       }
>> +    if (userspace_tso_enabled()) {
>> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
>> +        VLOG_DBG("%s: TSO enabled on vhost port", 
>> netdev_get_name(&dev->up));
>> +    }
>> +
>>       netdev_dpdk_remap_txqs(dev);
>>       err = netdev_dpdk_mempool_configure(dev);
>> @@ -4975,6 +5216,11 @@ netdev_dpdk_vhost_client_reconfigure(struct 
>> netdev *netdev)
>>               vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>           }
>> +        /* Enable External Buffers if TCP Segmentation Offload is 
>> enabled. */
>> +        if (userspace_tso_enabled()) {
>> +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
>> +        }
>> +
>>           err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>>           if (err) {
>>               VLOG_ERR("vhost-user device setup failure for device %s\n",
>> @@ -4999,14 +5245,20 @@ netdev_dpdk_vhost_client_reconfigure(struct 
>> netdev *netdev)
>>               goto unlock;
>>           }
>> -        err = rte_vhost_driver_disable_features(dev->vhost_id,
>> -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
>> -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> -                                    | 1ULL << VIRTIO_NET_F_CSUM);
>> -        if (err) {
>> -            VLOG_ERR("rte_vhost_driver_disable_features failed for 
>> vhost user "
>> -                     "client port: %s\n", dev->up.name);
>> -            goto unlock;
>> +        if (userspace_tso_enabled()) {
>> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
>> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
>> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
>> +        } else {
>> +            err = rte_vhost_driver_disable_features(dev->vhost_id,
>> +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
>> +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> +                                        | 1ULL << VIRTIO_NET_F_CSUM);
>> +            if (err) {
>> +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
>> +                         "vhost user client port: %s\n", dev->up.name);
>> +                goto unlock;
>> +            }
>>           }
>>           err = rte_vhost_driver_start(dev->vhost_id);
>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>> index f08159aa7..9dbc67658 100644
>> --- a/lib/netdev-linux-private.h
>> +++ b/lib/netdev-linux-private.h
>> @@ -27,6 +27,7 @@
>>   #include <stdint.h>
>>   #include <stdbool.h>
>> +#include "dp-packet.h"
>>   #include "netdev-afxdp.h"
>>   #include "netdev-afxdp-pool.h"
>>   #include "netdev-provider.h"
>> @@ -37,10 +38,13 @@
>>   struct netdev;
>> +#define LINUX_RXQ_TSO_MAX_LEN 65536
>> +
>>   struct netdev_rxq_linux {
>>       struct netdev_rxq up;
>>       bool is_tap;
>>       int fd;
>> +    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO 
>> buffers. */
>>   };
>>   int netdev_linux_construct(struct netdev *);
>> @@ -92,6 +96,7 @@ struct netdev_linux {
>>       int tap_fd;
>>       bool present;               /* If the device is present in the 
>> namespace */
>>       uint64_t tx_dropped;        /* tap device can drop if the iface 
>> is down */
>> +    uint64_t rx_dropped;        /* Packets dropped while receiving from the kernel. */
>>       /* LAG information. */
>>       bool is_lag_master;         /* True if the netdev is a LAG 
>> master. */
>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>> index 41d1e9273..a4a666657 100644
>> --- a/lib/netdev-linux.c
>> +++ b/lib/netdev-linux.c
>> @@ -29,16 +29,18 @@
>>   #include <linux/filter.h>
>>   #include <linux/gen_stats.h>
>>   #include <linux/if_ether.h>
>> +#include <linux/if_packet.h>
>>   #include <linux/if_tun.h>
>>   #include <linux/types.h>
>>   #include <linux/ethtool.h>
>>   #include <linux/mii.h>
>>   #include <linux/rtnetlink.h>
>>   #include <linux/sockios.h>
>> +#include <linux/virtio_net.h>
>>   #include <sys/ioctl.h>
>>   #include <sys/socket.h>
>> +#include <sys/uio.h>
>>   #include <sys/utsname.h>
>> -#include <netpacket/packet.h>
>>   #include <net/if.h>
>>   #include <net/if_arp.h>
>>   #include <net/route.h>
>> @@ -75,6 +77,7 @@
>>   #include "timer.h"
>>   #include "unaligned.h"
>>   #include "openvswitch/vlog.h"
>> +#include "userspace-tso.h"
>>   #include "util.h"
>>   VLOG_DEFINE_THIS_MODULE(netdev_linux);
>> @@ -237,6 +240,16 @@ enum {
>>       VALID_DRVINFO           = 1 << 6,
>>       VALID_FEATURES          = 1 << 7,
>>   };
>> +
>> +/* Use one for the packet buffer and another for the aux buffer to 
>> receive
>> + * TSO packets. */
>> +#define IOV_STD_SIZE 1
>> +#define IOV_TSO_SIZE 2
>> +
>> +enum {
>> +    IOV_PACKET = 0,
>> +    IOV_AUXBUF = 1,
>> +};
>>   
>>   struct linux_lag_slave {
>>      uint32_t block_id;
>> @@ -501,6 +514,8 @@ static struct vlog_rate_limit rl = 
>> VLOG_RATE_LIMIT_INIT(5, 20);
>>    * changes in the device miimon status, so we can use atomic_count. */
>>   static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
>> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
>>   static int netdev_linux_do_ethtool(const char *name, struct 
>> ethtool_cmd *,
>>                                      int cmd, const char *cmd_name);
>>   static int get_flags(const struct netdev *, unsigned int *flags);
>> @@ -902,6 +917,13 @@ netdev_linux_common_construct(struct netdev 
>> *netdev_)
>>       /* The device could be in the same network namespace or in 
>> another one. */
>>       netnsid_unset(&netdev->netnsid);
>>       ovs_mutex_init(&netdev->mutex);
>> +
>> +    if (userspace_tso_enabled()) {
>> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
>> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
>> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
>> +    }
>> +
>>       return 0;
>>   }
>> @@ -961,6 +983,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
>>       /* Create tap device. */
>>       get_flags(&netdev->up, &netdev->ifi_flags);
>>       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
>> +    if (userspace_tso_enabled()) {
>> +        ifr.ifr_flags |= IFF_VNET_HDR;
>> +    }
>> +
>>       ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
>>       if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
>>           VLOG_WARN("%s: creating tap device failed: %s", name,
>> @@ -1024,6 +1050,15 @@ static struct netdev_rxq *
>>   netdev_linux_rxq_alloc(void)
>>   {
>>       struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
>> +    if (userspace_tso_enabled()) {
>> +        int i;
>> +
>> +        /* Allocate auxiliary buffers to receive TSO packets. */
>> +        for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
>> +        }
>> +    }
>> +
>>       return &rx->up;
>>   }
>> @@ -1069,6 +1104,15 @@ netdev_linux_rxq_construct(struct netdev_rxq 
>> *rxq_)
>>               goto error;
>>           }
>> +        if (userspace_tso_enabled()
>> +            && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
>> +                          sizeof val)) {
>> +            error = errno;
>> +            VLOG_ERR("%s: failed to enable vnet hdr in txq raw 
>> socket: %s",
>> +                     netdev_get_name(netdev_), ovs_strerror(errno));
>> +            goto error;
>> +        }
>> +
>>           /* Set non-blocking mode. */
>>           error = set_nonblocking(rx->fd);
>>           if (error) {
>> @@ -1119,10 +1163,15 @@ static void
>>   netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>>   {
>>       struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>> +    int i;
>>       if (!rx->is_tap) {
>>           close(rx->fd);
>>       }
>> +
>> +    for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +        free(rx->aux_bufs[i]);
>> +    }
>>   }
>>   static void
>> @@ -1159,12 +1208,14 @@ auxdata_has_vlan_tci(const struct 
>> tpacket_auxdata *aux)
>>    * It also used recvmmsg to reduce multiple syscalls overhead;
>>    */
>>   static int
>> -netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>> +netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>>                                    struct dp_packet_batch *batch)
>>   {
>> -    size_t size;
>> +    int iovlen;
>> +    size_t std_len;
>>       ssize_t retval;
>> -    struct iovec iovs[NETDEV_MAX_BURST];
>> +    int virtio_net_hdr_size;
>> +    struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE];
>>       struct cmsghdr *cmsg;
>>       union {
>>           struct cmsghdr cmsg;
>> @@ -1174,41 +1225,87 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>>       struct dp_packet *buffers[NETDEV_MAX_BURST];
>>       int i;
>> +    if (userspace_tso_enabled()) {
>> +        /* Use the buffer from the allocated packet below to receive MTU
>> +         * sized packets and an aux_buf for extra TSO data. */
>> +        iovlen = IOV_TSO_SIZE;
>> +        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
>> +    } else {
>> +        /* Use only the buffer from the allocated packet. */
>> +        iovlen = IOV_STD_SIZE;
>> +        virtio_net_hdr_size = 0;
>> +    }
>> +
>> +    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
>>       for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> -         buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN 
>> + mtu,
>> -                                                  DP_NETDEV_HEADROOM);
>> -         /* Reserve headroom for a single VLAN tag */
>> -         dp_packet_reserve(buffers[i], VLAN_HEADER_LEN);
>> -         size = dp_packet_tailroom(buffers[i]);
>> -         iovs[i].iov_base = dp_packet_data(buffers[i]);
>> -         iovs[i].iov_len = size;
>> +         buffers[i] = dp_packet_new_with_headroom(std_len, 
>> DP_NETDEV_HEADROOM);
>> +         iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
>> +         iovs[i][IOV_PACKET].iov_len = std_len;
>> +         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
>> +         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
>>            mmsgs[i].msg_hdr.msg_name = NULL;
>>            mmsgs[i].msg_hdr.msg_namelen = 0;
>> -         mmsgs[i].msg_hdr.msg_iov = &iovs[i];
>> -         mmsgs[i].msg_hdr.msg_iovlen = 1;
>> +         mmsgs[i].msg_hdr.msg_iov = iovs[i];
>> +         mmsgs[i].msg_hdr.msg_iovlen = iovlen;
>>            mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i];
>>            mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
>>            mmsgs[i].msg_hdr.msg_flags = 0;
>>       }
>>       do {
>> -        retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
>> +        retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, 
>> NULL);
>>       } while (retval < 0 && errno == EINTR);
>>       if (retval < 0) {
>> -        /* Save -errno to retval temporarily */
>> -        retval = -errno;
>> -        i = 0;
>> -        goto free_buffers;
>> +        retval = errno;
>> +        for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +            dp_packet_delete(buffers[i]);
>> +        }
>> +
>> +        return retval;
>>       }
>>       for (i = 0; i < retval; i++) {
>>           if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
>> -            break;
>> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> +
>> +            dp_packet_delete(buffers[i]);
>> +            netdev->rx_dropped += 1;
>> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether 
>> hdr size",
>> +                         netdev_get_name(netdev_));
>> +            continue;
>> +        }
>> +
>> +        if (mmsgs[i].msg_len > std_len) {
>> +            /* Build a single linear TSO packet by expanding the 
>> current packet
>> +             * to append the data received in the aux_buf. */
>> +            size_t extra_len = mmsgs[i].msg_len - std_len;
>> +
>> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
>> +                               + std_len);
>> +            dp_packet_prealloc_tailroom(buffers[i], extra_len);
>> +            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], 
>> extra_len);
>> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
>> +                               + extra_len);
>> +        } else {
>> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
>> +                               + mmsgs[i].msg_len);
>>           }
>> -        dp_packet_set_size(buffers[i],
>> -                           dp_packet_size(buffers[i]) + 
>> mmsgs[i].msg_len);
>> +        if (virtio_net_hdr_size && 
>> netdev_linux_parse_vnet_hdr(buffers[i])) {
>> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> +
>> +            /* Unexpected error situation: the virtio header is not 
>> present
>> +             * or corrupted. Drop the packet but continue in case 
>> next ones
>> +             * are correct. */
>> +            dp_packet_delete(buffers[i]);
>> +            netdev->rx_dropped += 1;
>> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net 
>> header",
>> +                         netdev_get_name(netdev_));
>> +            continue;
>> +        }
>>           for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg;
>>                    cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) {
>> @@ -1238,22 +1335,11 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>>           dp_packet_batch_add(batch, buffers[i]);
>>       }
>> -free_buffers:
>> -    /* Free unused buffers, including buffers whose size is less than
>> -     * ETH_HEADER_LEN.
>> -     *
>> -     * Note: i has been set correctly by the above for loop, so don't
>> -     * try to re-initialize it.
>> -     */
>> +    /* Delete unused buffers. */
>>       for (; i < NETDEV_MAX_BURST; i++) {
>>           dp_packet_delete(buffers[i]);
>>       }
>> -    /* netdev_linux_rxq_recv needs it to return 0 or positive errno */
>> -    if (retval < 0) {
>> -        return -retval;
>> -    }
>> -
>>       return 0;
>>   }
>> @@ -1263,20 +1349,40 @@ free_buffers:
>>    * packets are added into *batch. The return value is 0 or errno.
>>    */
>>   static int
>> -netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct 
>> dp_packet_batch *batch)
>> +netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
>> +                                struct dp_packet_batch *batch)
>>   {
>>       struct dp_packet *buffer;
>> +    int virtio_net_hdr_size;
>>       ssize_t retval;
>> -    size_t size;
>> +    size_t std_len;
>> +    int iovlen;
>>       int i;
>> +    if (userspace_tso_enabled()) {
>> +        /* Use the buffer from the allocated packet below to receive MTU
>> +         * sized packets and an aux_buf for extra TSO data. */
>> +        iovlen = IOV_TSO_SIZE;
>> +        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
>> +    } else {
>> +        /* Use only the buffer from the allocated packet. */
>> +        iovlen = IOV_STD_SIZE;
>> +        virtio_net_hdr_size = 0;
>> +    }
>> +
>> +    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
>>       for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +        struct iovec iov[IOV_TSO_SIZE];
>> +
>>           /* Assume Ethernet port. No need to set packet_type. */
>> -        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>> -                                             DP_NETDEV_HEADROOM);
>> -        size = dp_packet_tailroom(buffer);
>> +        buffer = dp_packet_new_with_headroom(std_len, 
>> DP_NETDEV_HEADROOM);
>> +        iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
>> +        iov[IOV_PACKET].iov_len = std_len;
>> +        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
>> +        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
>> +
>>           do {
>> -            retval = read(fd, dp_packet_data(buffer), size);
>> +            retval = readv(rx->fd, iov, iovlen);
>>           } while (retval < 0 && errno == EINTR);
>>           if (retval < 0) {
>> @@ -1284,7 +1390,33 @@ netdev_linux_batch_rxq_recv_tap(int fd, int 
>> mtu, struct dp_packet_batch *batch)
>>               break;
>>           }
>> -        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
>> +        if (retval > std_len) {
>> +            /* Build a single linear TSO packet by expanding the 
>> current packet
>> +             * to append the data received in the aux_buf. */
>> +            size_t extra_len = retval - std_len;
>> +
>> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + 
>> std_len);
>> +            dp_packet_prealloc_tailroom(buffer, extra_len);
>> +            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
>> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + 
>> extra_len);
>> +        } else {
>> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
>> +        }
>> +
>> +        if (virtio_net_hdr_size && 
>> netdev_linux_parse_vnet_hdr(buffer)) {
>> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> +
>> +            /* Unexpected error situation: the virtio header is not 
>> present
>> +             * or corrupted. Drop the packet but continue in case 
>> next ones
>> +             * are correct. */
>> +            dp_packet_delete(buffer);
>> +            netdev->rx_dropped += 1;
>> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net 
>> header",
>> +                         netdev_get_name(netdev_));
>> +            continue;
>> +        }
>> +
>>           dp_packet_batch_add(batch, buffer);
>>       }
>> @@ -1310,8 +1442,8 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, 
>> struct dp_packet_batch *batch,
>>       dp_packet_batch_init(batch);
>>       retval = (rx->is_tap
>> -              ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch)
>> -              : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch));
>> +              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
>> +              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
>>       if (retval) {
>>           if (retval != EAGAIN && retval != EMSGSIZE) {
>> @@ -1353,7 +1485,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
>>   }
>>   static int
>> -netdev_linux_sock_batch_send(int sock, int ifindex,
>> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>>                                struct dp_packet_batch *batch)
>>   {
>>       const size_t size = dp_packet_batch_size(batch);
>> @@ -1367,6 +1499,10 @@ netdev_linux_sock_batch_send(int sock, int 
>> ifindex,
>>       struct dp_packet *packet;
>>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> +        if (tso) {
>> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
>> +        }
>> +
>>           iov[i].iov_base = dp_packet_data(packet);
>>           iov[i].iov_len = dp_packet_size(packet);
>>           mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
>> @@ -1399,7 +1535,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>>    * on other interface types because we attach a socket filter to the rx
>>    * socket. */
>>   static int
>> -netdev_linux_tap_batch_send(struct netdev *netdev_,
>> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
>>                               struct dp_packet_batch *batch)
>>   {
>>       struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> @@ -1416,10 +1552,15 @@ netdev_linux_tap_batch_send(struct netdev 
>> *netdev_,
>>       }
>>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> -        size_t size = dp_packet_size(packet);
>> +        size_t size;
>>           ssize_t retval;
>>           int error;
>> +        if (tso) {
>> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
>> +        }
>> +
>> +        size = dp_packet_size(packet);
>>           do {
>>               retval = write(netdev->tap_fd, dp_packet_data(packet), 
>> size);
>>               error = retval < 0 ? errno : 0;
>> @@ -1454,9 +1595,15 @@ netdev_linux_send(struct netdev *netdev_, int 
>> qid OVS_UNUSED,
>>                     struct dp_packet_batch *batch,
>>                     bool concurrent_txq OVS_UNUSED)
>>   {
>> +    bool tso = userspace_tso_enabled();
>> +    int mtu = ETH_PAYLOAD_MAX;
>>       int error = 0;
>>       int sock = 0;
>> +    if (tso) {
>> +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
>> +    }
>> +
>>       if (!is_tap_netdev(netdev_)) {
>>           if 
>> (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>>               error = EOPNOTSUPP;
>> @@ -1475,9 +1622,9 @@ netdev_linux_send(struct netdev *netdev_, int 
>> qid OVS_UNUSED,
>>               goto free_batch;
>>           }
>> -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
>> +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, 
>> batch);
>>       } else {
>> -        error = netdev_linux_tap_batch_send(netdev_, batch);
>> +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
>>       }
>>       if (error) {
>>           if (error == ENOBUFS) {
>> @@ -2045,6 +2192,7 @@ netdev_tap_get_stats(const struct netdev 
>> *netdev_, struct netdev_stats *stats)
>>           stats->collisions          += dev_stats.collisions;
>>       }
>>       stats->tx_dropped += netdev->tx_dropped;
>> +    stats->rx_dropped += netdev->rx_dropped;
>>       ovs_mutex_unlock(&netdev->mutex);
>>       return error;
>> @@ -6223,6 +6371,17 @@ af_packet_sock(void)
>>               if (error) {
>>                   close(sock);
>>                   sock = -error;
>> +            } else if (userspace_tso_enabled()) {
>> +                int val = 1;
>> +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, 
>> &val,
>> +                                   sizeof val);
>> +                if (error) {
>> +                    error = errno;
>> +                    VLOG_ERR("failed to enable vnet hdr in raw 
>> socket: %s",
>> +                             ovs_strerror(errno));
>> +                    close(sock);
>> +                    sock = -error;
>> +                }
>>               }
>>           } else {
>>               sock = -errno;
>> @@ -6234,3 +6393,136 @@ af_packet_sock(void)
>>       return sock;
>>   }
>> +
>> +static int
>> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
>> +{
>> +    struct eth_header *eth_hdr;
>> +    ovs_be16 eth_type;
>> +    int l2_len;
>> +
>> +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
>> +    if (!eth_hdr) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    l2_len = ETH_HEADER_LEN;
>> +    eth_type = eth_hdr->eth_type;
>> +    if (eth_type_vlan(eth_type)) {
>> +        struct vlan_header *vlan = dp_packet_at(b, l2_len, 
>> VLAN_HEADER_LEN);
>> +
>> +        if (!vlan) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        eth_type = vlan->vlan_next_type;
>> +        l2_len += VLAN_HEADER_LEN;
>> +    }
>> +
>> +    if (eth_type == htons(ETH_TYPE_IP)) {
>> +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, 
>> IP_HEADER_LEN);
>> +
>> +        if (!ip_hdr) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        *l4proto = ip_hdr->ip_proto;
>> +        dp_packet_hwol_set_tx_ipv4(b);
>> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
>> +        struct ovs_16aligned_ip6_hdr *nh6;
>> +
>> +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
>> +        if (!nh6) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
>> +        dp_packet_hwol_set_tx_ipv6(b);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
>> +{
>> +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
>> +    uint16_t l4proto = 0;
>> +
>> +    if (OVS_UNLIKELY(!vnet)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
>> +        return 0;
>> +    }
>> +
>> +    if (netdev_linux_parse_l2(b, &l4proto)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>> +        if (l4proto == IPPROTO_TCP) {
>> +            dp_packet_hwol_set_csum_tcp(b);
>> +        } else if (l4proto == IPPROTO_UDP) {
>> +            dp_packet_hwol_set_csum_udp(b);
>> +        } else if (l4proto == IPPROTO_SCTP) {
>> +            dp_packet_hwol_set_csum_sctp(b);
>> +        }
>> +    }
>> +
>> +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
>> +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
>> +                                | VIRTIO_NET_HDR_GSO_TCPV6
>> +                                | VIRTIO_NET_HDR_GSO_UDP;
>> +        uint8_t type = vnet->gso_type & allowed_mask;
>> +
>> +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
>> +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
>> +            dp_packet_hwol_set_tcp_seg(b);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
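For the receive side, the mapping done above is (my summary, not in the
patch): a packet handed up by the kernel with

    vnet->flags    == VIRTIO_NET_HDR_F_NEEDS_CSUM  /* and l4proto == IPPROTO_TCP */
    vnet->gso_type == VIRTIO_NET_HDR_GSO_TCPV4

gets dp_packet_hwol_set_tx_ipv4(), dp_packet_hwol_set_csum_tcp() and
dp_packet_hwol_set_tcp_seg() called on it, which is the state that
netdev_dpdk_prep_hwol_packet() and the vnet-hdr prepend below then act on.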
>> +static void
>> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
>> +{
>> +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
>> +
>> +    if (dp_packet_hwol_is_tso(b)) {
>> +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char 
>> *)dp_packet_eth(b))
>> +                            + TCP_HEADER_LEN;
>> +
>> +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
>> +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
>> +        if (dp_packet_hwol_is_ipv4(b)) {
>> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +        } else {
>> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>> +        }
>> +
>> +    } else {
>> +        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
>> +    }
>> +
>> +    if (dp_packet_hwol_l4_mask(b)) {
>> +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +        vnet->csum_start = (OVS_FORCE __virtio16)((char 
>> *)dp_packet_l4(b)
>> +                                                  - (char 
>> *)dp_packet_eth(b));
>> +
>> +        if (dp_packet_hwol_l4_is_tcp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) 
>> __builtin_offsetof(
>> +                                    struct tcp_header, tcp_csum);
>> +        } else if (dp_packet_hwol_l4_is_udp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) 
>> __builtin_offsetof(
>> +                                    struct udp_header, udp_csum);
>> +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) 
>> __builtin_offsetof(
>> +                                    struct sctp_header, sctp_csum);
>> +        } else {
>> +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
>> +        }
>> +    }
>> +}
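And a rough worked example for the transmit side (again my numbers, assuming
an Ethernet/IPv4/TCP TSO packet with no options sent out of a device with MTU
1500): the function above would fill the prepended header roughly as

    vnet->hdr_len     = 54;     /* 14 (eth) + 20 (ip) + 20 (tcp). */
    vnet->gso_size    = 1446;   /* mtu (1500) - hdr_len (54). */
    vnet->gso_type    = VIRTIO_NET_HDR_GSO_TCPV4;
    vnet->flags       = VIRTIO_NET_HDR_F_NEEDS_CSUM;
    vnet->csum_start  = 34;     /* Start of the TCP header. */
    vnet->csum_offset = 16;     /* offsetof(struct tcp_header, tcp_csum). */

(the real code stores these through the OVS_FORCE __virtio16 casts; plain
values are shown here just to illustrate what the kernel sees).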
>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>> index f109c4e66..22f4cde33 100644
>> --- a/lib/netdev-provider.h
>> +++ b/lib/netdev-provider.h
>> @@ -37,6 +37,12 @@ extern "C" {
>>   struct netdev_tnl_build_header_params;
>>   #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>> +enum netdev_ol_flags {
>> +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
>> +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
>> +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
>> +};
>> +
>>   /* A network device (e.g. an Ethernet device).
>>    *
>>    * Network device implementations may read these members but should 
>> not modify
>> @@ -51,6 +57,9 @@ struct netdev {
>>        * opening this device, and therefore got assigned to the 
>> "system" class */
>>       bool auto_classified;
>> +    /* Bitmask of the offloading features enabled by the netdev. */
>> +    uint64_t ol_flags;
>> +
>>       /* If this is 'true', the user explicitly specified an MTU for this
>>        * netdev.  Otherwise, Open vSwitch is allowed to override it. */
>>       bool mtu_user_config;
>> diff --git a/lib/netdev.c b/lib/netdev.c
>> index 405c98c68..f95b19af4 100644
>> --- a/lib/netdev.c
>> +++ b/lib/netdev.c
>> @@ -66,6 +66,8 @@ COVERAGE_DEFINE(netdev_received);
>>   COVERAGE_DEFINE(netdev_sent);
>>   COVERAGE_DEFINE(netdev_add_router);
>>   COVERAGE_DEFINE(netdev_get_stats);
>> +COVERAGE_DEFINE(netdev_send_prepare_drops);
>> +COVERAGE_DEFINE(netdev_push_header_drops);
>>   struct netdev_saved_flags {
>>       struct netdev *netdev;
>> @@ -782,6 +784,54 @@ netdev_get_pt_mode(const struct netdev *netdev)
>>               : NETDEV_PT_LEGACY_L2);
>>   }
>> +/* Check if a 'packet' is compatible with 'netdev_flags'.
>> + * If a packet is incompatible, return 'false' with the 'errormsg'
>> + * pointing to a reason. */
>> +static bool
>> +netdev_send_prepare_packet(const uint64_t netdev_flags,
>> +                           struct dp_packet *packet, char **errormsg)
>> +{
>> +    if (dp_packet_hwol_is_tso(packet)
>> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
>> +            /* Fall back to GSO in software. */
>> +            VLOG_ERR_BUF(errormsg, "No TSO support");
>> +            return false;
>> +    }
>> +
>> +    if (dp_packet_hwol_l4_mask(packet)
>> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
>> +            /* Fall back to L4 csum in software. */
>> +            VLOG_ERR_BUF(errormsg, "No L4 checksum support");
>> +            return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
>> + * otherwise either fall back to software implementation or drop it. */
>> +static void
>> +netdev_send_prepare_batch(const struct netdev *netdev,
>> +                          struct dp_packet_batch *batch)
>> +{
>> +    struct dp_packet *packet;
>> +    size_t i, size = dp_packet_batch_size(batch);
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
>> +        char *errormsg = NULL;
>> +
>> +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, 
>> &errormsg)) {
>> +            dp_packet_batch_refill(batch, packet, i);
>> +        } else {
>> +            dp_packet_delete(packet);
>> +            COVERAGE_INC(netdev_send_prepare_drops);
>> +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
>> +                         netdev_get_name(netdev), errormsg);
>> +            free(errormsg);
>> +        }
>> +    }
>> +}
>> +
>>   /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every 
>> packet),
>>    * otherwise a positive errno value.  Returns EAGAIN without 
>> blocking if
>>    * at least one the packets cannot be queued immediately.  Returns 
>> EMSGSIZE
>> @@ -811,8 +861,14 @@ int
>>   netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch 
>> *batch,
>>               bool concurrent_txq)
>>   {
>> -    int error = netdev->netdev_class->send(netdev, qid, batch,
>> -                                           concurrent_txq);
>> +    int error;
>> +
>> +    netdev_send_prepare_batch(netdev, batch);
>> +    if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
>> +        return 0;
>> +    }
>> +
>> +    error = netdev->netdev_class->send(netdev, qid, batch, 
>> concurrent_txq);
>>       if (!error) {
>>           COVERAGE_INC(netdev_sent);
>>       }
>> @@ -878,9 +934,21 @@ netdev_push_header(const struct netdev *netdev,
>>                      const struct ovs_action_push_tnl *data)
>>   {
>>       struct dp_packet *packet;
>> -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> -        netdev->netdev_class->push_header(netdev, packet, data);
>> -        pkt_metadata_init(&packet->md, data->out_port);
>> +    size_t i, size = dp_packet_batch_size(batch);
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
>> +        if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
>> +                         || dp_packet_hwol_l4_mask(packet))) {
>> +            COVERAGE_INC(netdev_push_header_drops);
>> +            dp_packet_delete(packet);
>> +            VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload 
>> flags is "
>> +                         "not supported: packet dropped",
>> +                         netdev_get_name(netdev));
>> +        } else {
>> +            netdev->netdev_class->push_header(netdev, packet, data);
>> +            pkt_metadata_init(&packet->md, data->out_port);
>> +            dp_packet_batch_refill(batch, packet, i);
>> +        }
>>       }
>>       return 0;
>> diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c
>> new file mode 100644
>> index 000000000..6a4a0149b
>> --- /dev/null
>> +++ b/lib/userspace-tso.c
>> @@ -0,0 +1,53 @@
>> +/*
>> + * Copyright (c) 2020 Red Hat, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>> implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "smap.h"
>> +#include "ovs-thread.h"
>> +#include "openvswitch/vlog.h"
>> +#include "dpdk.h"
>> +#include "userspace-tso.h"
>> +#include "vswitch-idl.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(userspace_tso);
>> +
>> +static bool userspace_tso = false;
>> +
>> +void
>> +userspace_tso_init(const struct smap *ovs_other_config)
>> +{
>> +    if (smap_get_bool(ovs_other_config, "userspace-tso-enable", 
>> false)) {
>> +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +        if (ovsthread_once_start(&once)) {
>> +#ifdef DPDK_NETDEV
>> +            VLOG_INFO("Userspace TCP Segmentation Offloading support 
>> enabled");
>> +            userspace_tso = true;
>> +#else
>> +            VLOG_WARN("Userspace TCP Segmentation Offloading can not 
>> be enabled"
>> +                      " since OVS is built without DPDK support.");
>> +#endif
>> +            ovsthread_once_done(&once);
>> +        }
>> +    }
>> +}
>> +
>> +bool
>> +userspace_tso_enabled(void)
>> +{
>> +    return userspace_tso;
>> +}
>> diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h
>> new file mode 100644
>> index 000000000..0758274c0
>> --- /dev/null
>> +++ b/lib/userspace-tso.h
>> @@ -0,0 +1,23 @@
>> +/*
>> + * Copyright (c) 2020 Red Hat Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>> implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef USERSPACE_TSO_H
>> +#define USERSPACE_TSO_H 1
>> +
>> +void userspace_tso_init(const struct smap *ovs_other_config);
>> +bool userspace_tso_enabled(void);
>> +
>> +#endif /* userspace-tso.h */
>> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
>> index 86c7b10a9..e591c26a6 100644
>> --- a/vswitchd/bridge.c
>> +++ b/vswitchd/bridge.c
>> @@ -65,6 +65,7 @@
>>   #include "system-stats.h"
>>   #include "timeval.h"
>>   #include "tnl-ports.h"
>> +#include "userspace-tso.h"
>>   #include "util.h"
>>   #include "unixctl.h"
>>   #include "lib/vswitch-idl.h"
>> @@ -3285,6 +3286,7 @@ bridge_run(void)
>>       if (cfg) {
>>           netdev_set_flow_api_enabled(&cfg->other_config);
>>           dpdk_init(&cfg->other_config);
>> +        userspace_tso_init(&cfg->other_config);
>>       }
>>       /* Initialize the ofproto library.  This only needs to run once, 
>> but
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index c43cb1aa4..3ddaaefda 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -690,6 +690,26 @@
>>            once in few hours or a day or a week.
>>           </p>
>>         </column>
>> +      <column name="other_config" key="userspace-tso-enable"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          Set this value to <code>true</code> to enable userspace 
>> support for
>> +          TCP Segmentation Offloading (TSO). When it is enabled, the 
>> interfaces
>> +          can provide an oversized TCP segment to the datapath and 
>> the datapath
>> +          will offload the TCP segmentation and checksum calculation 
>> to the
>> +          interfaces when necessary.
>> +        </p>
>> +        <p>
>> +          The default value is <code>false</code>. Changing this 
>> value requires
>> +          restarting the daemon.
>> +        </p>
>> +        <p>
>> +          The feature only works if Open vSwitch is built with DPDK 
>> support.
>> +        </p>
>> +        <p>
>> +          The feature is considered experimental.
>> +        </p>
>> +      </column>
>>       </group>
>>       <group title="Status">
>>         <column name="next_cfg">
>>
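One usage note to close the patch out: with this applied, turning the feature
on is just the new other_config knob plus a daemon restart, e.g. (assuming the
usual ovs-vsctl workflow):

    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true

followed by restarting ovs-vswitchd, since, as the vswitch.xml text above
says, changing the value requires restarting the daemon.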
Stokes, Ian Jan. 17, 2020, 11:03 p.m. UTC | #4
Thanks all for review/testing, pushed to master.

Regards
Ian

-----Original Message-----
From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Stokes, Ian
Sent: Friday, January 17, 2020 10:56 PM
To: Flavio Leitner <fbl@sysclose.org>; dev@openvswitch.org
Cc: Ilya Maximets <i.maximets@ovn.org>; txfh2007 <txfh2007@aliyun.com>
Subject: Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support



On 1/17/2020 9:54 PM, Stokes, Ian wrote:
> 
> 
> On 1/17/2020 9:47 PM, Flavio Leitner wrote:
>> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
>> the network stack to delegate the TCP segmentation to the NIC reducing
>> the per packet CPU overhead.
>>
>> A guest using vhostuser interface with TSO enabled can send TCP packets
>> much bigger than the MTU, which saves CPU cycles normally used to break
>> the packets down to MTU size and to calculate checksums.
>>
>> It also saves CPU cycles used to parse multiple packets/headers during
>> the packet processing inside virtual switch.
>>
>> If the destination of the packet is another guest in the same host, then
>> the same big packet can be sent through a vhostuser interface skipping
>> the segmentation completely. However, if the destination is not local,
>> the NIC hardware is instructed to do the TCP segmentation and checksum
>> calculation.
>>
>> It is recommended to check if NIC hardware supports TSO before enabling
>> the feature, which is off by default. For additional information please
>> check the tso.rst document.
>>
>> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> 
> Fantastic work here Flavio, quick turnaround when needed.
> 
> Acked

Are there any objections to merging this?

There's been nothing so far.

If there are no further objections, I will merge this at the end of the hour.

BR
Ian
> 
> BR
> Ian
>> ---
>>   Documentation/automake.mk              |   1 +
>>   Documentation/topics/index.rst         |   1 +
>>   Documentation/topics/userspace-tso.rst |  98 +++++++
>>   NEWS                                   |   1 +
>>   lib/automake.mk                        |   2 +
>>   lib/conntrack.c                        |  29 +-
>>   lib/dp-packet.h                        | 176 ++++++++++-
>>   lib/ipf.c                              |  32 +-
>>   lib/netdev-dpdk.c                      | 348 +++++++++++++++++++---
>>   lib/netdev-linux-private.h             |   5 +
>>   lib/netdev-linux.c                     | 386 ++++++++++++++++++++++---
>>   lib/netdev-provider.h                  |   9 +
>>   lib/netdev.c                           |  78 ++++-
>>   lib/userspace-tso.c                    |  53 ++++
>>   lib/userspace-tso.h                    |  23 ++
>>   vswitchd/bridge.c                      |   2 +
>>   vswitchd/vswitch.xml                   |  20 ++
>>   17 files changed, 1140 insertions(+), 124 deletions(-)
>>   create mode 100644 Documentation/topics/userspace-tso.rst
>>   create mode 100644 lib/userspace-tso.c
>>   create mode 100644 lib/userspace-tso.h
>>
>> Testing:
>>   - Travis, Cirrus, AppVeyor, testsuite passed OK.
>>   - notice no changes since v4 with regards to performance.
>>
>> Changelog:
>> - v5
>>   * rebased on top of master (NEWS conflict)
>>   * added missing periods at the end of comments
>>   * mention DPDK requirement at vswitch.xml
>>   * restricted tso feature to OvS built with dpdk
>>   * headers in alphabetical order
>>   * removed unneeded call to initialize pkt
>>   * used OVS_UNLIKELY instead of unlikely
>>   * removed parenthesis from sizeof()
>>   * removed blank line at dp_packet_hwol_tx_l4_checksum()
>>   * removed redundant dp_packet_hwol_tx_ipv4_checksum()
>>   * updated function comments as suggested
>>
>> - v4
>>   * rebased on top of master (recvmmsg)
>>   * fixed URL in doc to point to 19.11
>>   * renamed tso to userspace-tso
>>   * renamed the option to userspace-tso-enable
>>   * removed prototype that left over from v2
>>   * fixed function style declaration
>>   * renamed dp_packet_hwol_tx_ip_checksum to 
>> dp_packet_hwol_tx_ipv4_checksum
>>   * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
>>   * account for drops while preping the batch for TX.
>>   * don't prep the batch for TX if TSO is disabled.
>>   * simplified setsockopt error checking
>>   * fixed af_packet_sock error checking to not call setsockopt on
>>        closed sockets.
>>   * fixed ol_flags comment.
>>   * used VLOG_ERR_BUF() to pass error messages.
>>   * fixed packet leak at netdev_send_prepare_batch()
>>   * added a coverage counter to account drops while preparing a batch
>>     at netdev.c
>>   * fixed netdev_send() to not call ->send() if the batch is empty.
>>   * fixed packet leak at netdev_push_header and account for the drops.
>>   * removed DPDK requirement to enable userspace TSO support.
>>   * fixed parameter documentation in vswitch.xml.
>>   * renamed tso.rst to userspace-tso.rst and moved to topics/
>>   * added comments documenting the functions in dp-packet.h
>>   * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG
>>
>> - v3
>>   * Improved the documentation.
>>   * Updated copyright year to 2020.
>>   * TSO offloaded msg now includes the netdev's name.
>>   * Added period at the end of all code comments.
>>   * Warn and drop encapsulation of TSO packets.
>>   * Fixed travis issue with restricted virtio types.
>>   * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
>>     which caused packet corruption.
>>   * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
>>     PKT_TX_IP_CKSUM only for IPv4 packets.
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index f2ca17bad..22976a3cd 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -57,6 +57,7 @@ DOC_SOURCE = \
>>       Documentation/topics/ovsdb-replication.rst \
>>       Documentation/topics/porting.rst \
>>       Documentation/topics/tracing.rst \
>> +    Documentation/topics/userspace-tso.rst \
>>       Documentation/topics/windows.rst \
>>       Documentation/howto/index.rst \
>>       Documentation/howto/dpdk.rst \
>> diff --git a/Documentation/topics/index.rst 
>> b/Documentation/topics/index.rst
>> index 34c4b10e0..08af3a24d 100644
>> --- a/Documentation/topics/index.rst
>> +++ b/Documentation/topics/index.rst
>> @@ -50,5 +50,6 @@ OVS
>>      language-bindings
>>      testing
>>      tracing
>> +   userspace-tso
>>      idl-compound-indexes
>>      ovs-extensions
>> diff --git a/Documentation/topics/userspace-tso.rst 
>> b/Documentation/topics/userspace-tso.rst
>> new file mode 100644
>> index 000000000..893c64839
>> --- /dev/null
>> +++ b/Documentation/topics/userspace-tso.rst
>> @@ -0,0 +1,98 @@
>> +..
>> +      Copyright 2020, Red Hat, Inc.
>> +
>> +      Licensed under the Apache License, Version 2.0 (the "License"); 
>> you may
>> +      not use this file except in compliance with the License. You 
>> may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, 
>> software
>> +      distributed under the License is distributed on an "AS IS" 
>> BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>> implied. See the
>> +      License for the specific language governing permissions and 
>> limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +========================
>> +Userspace Datapath - TSO
>> +========================
>> +
>> +**Note:** This feature is considered experimental.
>> +
>> +TCP Segmentation Offload (TSO) enables a network stack to delegate
>> +segmentation of an oversized TCP segment to the underlying physical NIC.
>> +Offload of frame segmentation achieves computational savings in the core,
>> +freeing up CPU cycles for more useful work.
>> +
>> +A common use case for TSO is when using virtualization, where traffic that's
>> +coming in from a VM can offload the TCP segmentation, thus avoiding the
>> +fragmentation in software. Additionally, if the traffic is headed to a VM
>> +within the same host, further optimization can be expected. As the traffic
>> +never leaves the machine, no MTU needs to be accounted for, and thus no
>> +segmentation and checksum calculations are required, which saves yet more
>> +cycles. Only when the traffic actually leaves the host does the segmentation
>> +need to happen, in which case it will be performed by the egress NIC.
>> +Consult your controller's datasheet for TSO compatibility; the NIC must also
>> +have an associated DPDK Poll Mode Driver (PMD) which supports `TSO`. For a
>> +list of features per PMD, refer to the `DPDK documentation`__.
>> +
>> +__ https://doc.dpdk.org/guides-19.11/nics/overview.html
>> +
>> +Enabling TSO
>> +~~~~~~~~~~~~
>> +
>> +The TSO support may be enabled via a global config value
>> +``userspace-tso-enable``.  Setting this to ``true`` enables TSO support for
>> +all ports::
>> +
>> +    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
>> +
>> +The default value is ``false``.
>> +
>> +Changing ``userspace-tso-enable`` requires restarting the daemon.
>> +
>> +When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled
>> +as follows.
>> +
>> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
>> +connection is established, `TSO` is thus advertised to the guest as an
>> +available feature:
>> +
>> +1. QEMU Command Line Parameter::
>> +
>> +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
>> +    ...
>> +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
>> +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
>> +    ...
>> +
>> +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can
>> +   be used to enable the same::
>> +
>> +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
>> +    $ ethtool -K eth0 tso on
>> +    $ ethtool -k eth0
>> +
>> +Limitations
>> +~~~~~~~~~~~
>> +
>> +The current OvS userspace `TSO` implementation supports flat and VLAN
>> +networks only (i.e. no support for `TSO` over tunneled connections such as
>> +VxLAN, GRE, IPinIP, etc.).
>> +
>> +There is no software implementation of TSO, so all ports attached to the
>> +datapath must support TSO, or packets using that feature will be dropped
>> +on ports without TSO support.  That also means guests using vhost-user
>> +in client mode will receive TSO packets regardless of TSO being enabled
>> +or disabled within the guest.
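
Since the datapath has no software segmentation fallback, any code that
forwards a TSO-marked packet needs to check the egress netdev's advertised
offload capability first.  Below is a minimal sketch of that check, not part
of the patch itself; it relies only on the NETDEV_TX_OFFLOAD_TCP_TSO flag and
the dp_packet_hwol_is_tso() helper added by this series, while 'drop_packet'
is a hypothetical callback:

    static bool
    example_may_send(const struct netdev *netdev, struct dp_packet *pkt,
                     void (*drop_packet)(struct dp_packet *))
    {
        if (dp_packet_hwol_is_tso(pkt)
            && !(netdev->ol_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
            /* No software TSO exists, so this port cannot send the packet. */
            drop_packet(pkt);
            return false;
        }
        return true;
    }
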
>> diff --git a/NEWS b/NEWS
>> index 579e91c89..c6d3b6053 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -30,6 +30,7 @@ Post-v2.12.0
>>        * Add support for DPDK 19.11.
>>        * Add hardware offload support for output, drop, set of MAC, 
>> IPv4 and
>>          TCP/UDP ports actions (experimental).
>> +     * Add experimental support for TSO.
>>      - RSTP:
>>        * The rstp_statistics column in Port table will only be updated 
>> every
>>          stats-update-interval configured in Open_vSwitch table.
>> diff --git a/lib/automake.mk b/lib/automake.mk
>> index ebf714501..95925b57c 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -314,6 +314,8 @@ lib_libopenvswitch_la_SOURCES = \
>>       lib/unicode.h \
>>       lib/unixctl.c \
>>       lib/unixctl.h \
>> +    lib/userspace-tso.c \
>> +    lib/userspace-tso.h \
>>       lib/util.c \
>>       lib/util.h \
>>       lib/uuid.c \
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> index b80080e72..60222ca53 100644
>> --- a/lib/conntrack.c
>> +++ b/lib/conntrack.c
>> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct 
>> dp_packet *pkt, ovs_be16 dl_type,
>>           if (hwol_bad_l3_csum) {
>>               ok = false;
>>           } else {
>> -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
>> +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
>> +                                     || dp_packet_hwol_is_ipv4(pkt);
>>               /* Validate the checksum only when hwol is not 
>> supported. */
>>               ok = extract_l3_ipv4(&ctx->key, l3, 
>> dp_packet_l3_size(pkt), NULL,
>>                                    !hwol_good_l3_csum);
>> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct 
>> dp_packet *pkt, ovs_be16 dl_type,
>>       if (ok) {
>>           bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
>>           if (!hwol_bad_l4_csum) {
>> -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
>> +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
>> +                                      || 
>> dp_packet_hwol_tx_l4_checksum(pkt);
>>               /* Validate the checksum only when hwol is not 
>> supported. */
>>               if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
>>                              &ctx->icmp_related, l3, !hwol_good_l4_csum,
>> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const 
>> struct conn_lookup_ctx *ctx,
>>                   }
>>                   if (seq_skew) {
>>                       ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
>> -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
>> -                                          l3_hdr->ip_tot_len, 
>> htons(ip_len));
>> +                    if (!dp_packet_hwol_is_ipv4(pkt)) {
>> +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
>> +                                                        
>> l3_hdr->ip_tot_len,
>> +                                                        htons(ip_len));
>> +                    }
>>                       l3_hdr->ip_tot_len = htons(ip_len);
>>                   }
>>               }
>> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const 
>> struct conn_lookup_ctx *ctx,
>>       }
>>       th->tcp_csum = 0;
>> -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
>> -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, 
>> ctx->key.nw_proto,
>> -                           dp_packet_l4_size(pkt));
>> -    } else {
>> -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
>> -        th->tcp_csum = csum_finish(
>> -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
>> +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
>> +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
>> +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, 
>> ctx->key.nw_proto,
>> +                               dp_packet_l4_size(pkt));
>> +        } else {
>> +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
>> +            th->tcp_csum = csum_finish(
>> +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
>> +        }
>>       }
>>       if (seq_skew) {
>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> index 133942155..69ae5dfac 100644
>> --- a/lib/dp-packet.h
>> +++ b/lib/dp-packet.h
>> @@ -456,7 +456,7 @@ dp_packet_init_specific(struct dp_packet *p)
>>   {
>>       /* This initialization is needed for packets that do not come 
>> from DPDK
>>        * interfaces, when vswitchd is built with --with-dpdk. */
>> -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>> +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>>       p->mbuf.nb_segs = 1;
>>       p->mbuf.next = NULL;
>>   }
>> @@ -519,6 +519,95 @@ dp_packet_set_allocated(struct dp_packet *b, 
>> uint16_t s)
>>       b->mbuf.buf_len = s;
>>   }
>> +/* Returns 'true' if packet 'b' is marked for TCP segmentation 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_is_tso(const struct dp_packet *b)
>> +{
>> +    return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG);
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for IPv4 checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
>> +{
>> +    return !!(b->mbuf.ol_flags & PKT_TX_IPV4);
>> +}
>> +
>> +/* Returns the L4 cksum offload bitmask. */
>> +static inline uint64_t
>> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
>> +{
>> +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for TCP checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
>> +{
>> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM;
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for UDP checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
>> +{
>> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM;
>> +}
>> +
>> +/* Returns 'true' if packet 'b' is marked for SCTP checksum 
>> offloading. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
>> +{
>> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for IPv4 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_IPV4;
>> +}
>> +
>> +/* Mark packet 'b' for IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_IPV6;
>> +}
>> +
>> +/* Mark packet 'b' for TCP checksum offloading.  It implies that the packet
>> + * 'b' is also marked for either IPv4 or IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for UDP checksum offloading.  It implies that the packet
>> + * 'b' is also marked for either IPv4 or IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_csum_udp(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for SCTP checksum offloading.  It implies that the packet
>> + * 'b' is also marked for either IPv4 or IPv6 checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
>> +}
>> +
>> +/* Mark packet 'b' for TCP segmentation offloading.  It implies that the
>> + * packet 'b' is also marked for either IPv4 or IPv6 checksum offloading,
>> + * and for TCP checksum offloading. */
>> +static inline void
>> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b)
>> +{
>> +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
>> +}
>> +
>>   /* Returns the RSS hash of the packet 'p'.  Note that the returned 
>> value is
>>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>>   static inline uint32_t
>> @@ -648,6 +737,84 @@ dp_packet_set_allocated(struct dp_packet *b, 
>> uint16_t s)
>>       b->allocated_ = s;
>>   }
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline bool
>> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline bool
>> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline uint64_t
>> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return 0;
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline bool
>> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
>> +{
>> +    return false;
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline void
>> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline void
>> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline void
>> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline void
>> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>> +/* There is no implementation when the datapath is not DPDK enabled. */
>> +static inline void
>> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED)
>> +{
>> +}
>> +
>>   /* Returns the RSS hash of the packet 'p'.  Note that the returned 
>> value is
>>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>>   static inline uint32_t
>> @@ -939,6 +1106,13 @@ dp_packet_batch_reset_cutlen(struct 
>> dp_packet_batch *batch)
>>       }
>>   }
>> +/* Return true if the packet 'b' requested L4 checksum offload. */
>> +static inline bool
>> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
>> +{
>> +    return !!dp_packet_hwol_l4_mask(b);
>> +}
>> +
>>   #ifdef  __cplusplus
>>   }
>>   #endif
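
As a usage illustration of the new helpers, a transmit path that wants the
NIC to segment an IPv4/TCP packet would mark it roughly as below.  This is a
minimal sketch rather than code from the patch; it only calls the
dp_packet_hwol_* setters defined above, which compile to no-ops in a
non-DPDK build:

    static void
    example_mark_for_tso(struct dp_packet *pkt)
    {
        dp_packet_hwol_set_tx_ipv4(pkt);   /* NIC computes the IPv4 checksum. */
        dp_packet_hwol_set_csum_tcp(pkt);  /* NIC computes the TCP checksum.  */
        dp_packet_hwol_set_tcp_seg(pkt);   /* NIC segments the TCP payload.   */
    }
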
>> diff --git a/lib/ipf.c b/lib/ipf.c
>> index 45c489122..446e89d13 100644
>> --- a/lib/ipf.c
>> +++ b/lib/ipf.c
>> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
>>       len += rest_len;
>>       l3 = dp_packet_l3(pkt);
>>       ovs_be16 new_ip_frag_off = l3->ip_frag_off & 
>> ~htons(IP_MORE_FRAGMENTS);
>> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
>> -                                new_ip_frag_off);
>> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, 
>> htons(len));
>> +    if (!dp_packet_hwol_is_ipv4(pkt)) {
>> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
>> +                                    new_ip_frag_off);
>> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, 
>> htons(len));
>> +    }
>>       l3->ip_tot_len = htons(len);
>>       l3->ip_frag_off = new_ip_frag_off;
>>       dp_packet_set_l2_pad_size(pkt, 0);
>> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct 
>> dp_packet *pkt)
>>       }
>>       if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
>> +                     && !dp_packet_hwol_is_ipv4(pkt)
>>                        && csum(l3, ip_hdr_len) != 0)) {
>>           goto invalid_pkt;
>>       }
>> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
>>                   } else {
>>                       struct ip_header *l3_frag = 
>> dp_packet_l3(frag_0->pkt);
>>                       struct ip_header *l3_reass = dp_packet_l3(pkt);
>> -                    ovs_be32 reass_ip = 
>> get_16aligned_be32(&l3_reass->ip_src);
>> -                    ovs_be32 frag_ip = 
>> get_16aligned_be32(&l3_frag->ip_src);
>> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
>> -                                                     frag_ip, reass_ip);
>> -                    l3_frag->ip_src = l3_reass->ip_src;
>> +                    if (!dp_packet_hwol_is_ipv4(frag_0->pkt)) {
>> +                        ovs_be32 reass_ip =
>> +                            get_16aligned_be32(&l3_reass->ip_src);
>> +                        ovs_be32 frag_ip =
>> +                            get_16aligned_be32(&l3_frag->ip_src);
>> +
>> +                        l3_frag->ip_csum = 
>> recalc_csum32(l3_frag->ip_csum,
>> +                                                         frag_ip, 
>> reass_ip);
>> +                        reass_ip = 
>> get_16aligned_be32(&l3_reass->ip_dst);
>> +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
>> +                        l3_frag->ip_csum = 
>> recalc_csum32(l3_frag->ip_csum,
>> +                                                         frag_ip, 
>> reass_ip);
>> +                    }
>> -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
>> -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
>> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
>> -                                                     frag_ip, reass_ip);
>> +                    l3_frag->ip_src = l3_reass->ip_src;
>>                       l3_frag->ip_dst = l3_reass->ip_dst;
>>                   }
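
The conntrack.c and ipf.c hunks above follow a single pattern: software
checksum maintenance is skipped whenever the packet already carries an
offload request, because the NIC (or the vhost backend) recomputes the
checksums at transmit time.  Condensed into one fragment (a sketch, not
patch code; 'recompute_tcp_csum' is a hypothetical helper and the length
variables are placeholders):

    if (!dp_packet_hwol_is_ipv4(pkt)) {
        /* Only fix up the IPv4 checksum if the NIC will not recompute it. */
        l3->ip_csum = recalc_csum16(l3->ip_csum, old_tot_len, new_tot_len);
    }
    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
        /* Likewise, leave the L4 checksum alone when offload is requested. */
        th->tcp_csum = recompute_tcp_csum(pkt, th);
    }
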
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index d1469f6f2..b108cbd6b 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -72,6 +72,7 @@
>>   #include "timeval.h"
>>   #include "unaligned.h"
>>   #include "unixctl.h"
>> +#include "userspace-tso.h"
>>   #include "util.h"
>>   #include "uuid.h"
>> @@ -201,6 +202,8 @@ struct netdev_dpdk_sw_stats {
>>       uint64_t tx_qos_drops;
>>       /* Packet drops in ingress policer processing. */
>>       uint64_t rx_qos_drops;
>> +    /* Packet drops in HWOL processing. */
>> +    uint64_t tx_invalid_hwol_drops;
>>   };
>>   enum { DPDK_RING_SIZE = 256 };
>> @@ -410,7 +413,8 @@ struct ingress_policer {
>>   enum dpdk_hw_ol_features {
>>       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
>>       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
>> -    NETDEV_RX_HW_SCATTER = 1 << 2
>> +    NETDEV_RX_HW_SCATTER = 1 << 2,
>> +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
>>   };
>>   /*
>> @@ -992,6 +996,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, 
>> int n_rxq, int n_txq)
>>           conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
>>       }
>> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
>> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
>> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
>> +    }
>> +
>>       /* Limit configured rss hash functions to only those supported
>>        * by the eth device. */
>>       conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
>> @@ -1093,6 +1103,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
>>                                        DEV_RX_OFFLOAD_TCP_CKSUM |
>>                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
>> +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
>> +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
>> +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
>>       rte_eth_dev_info_get(dev->port_id, &info);
>> @@ -1119,6 +1132,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
>>       }
>> +    if (info.tx_offload_capa & tx_tso_offload_capa) {
>> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
>> +    } else {
>> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
>> +        VLOG_WARN("Tx TSO offload is not supported on %s port "
>> +                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), 
>> dev->port_id);
>> +    }
>> +
>>       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>>       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>> @@ -1369,14 +1390,16 @@ netdev_dpdk_vhost_construct(struct netdev 
>> *netdev)
>>           goto out;
>>       }
>> -    err = rte_vhost_driver_disable_features(dev->vhost_id,
>> -                                1ULL << VIRTIO_NET_F_HOST_TSO4
>> -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> -                                | 1ULL << VIRTIO_NET_F_CSUM);
>> -    if (err) {
>> -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost 
>> user "
>> -                 "port: %s\n", name);
>> -        goto out;
>> +    if (!userspace_tso_enabled()) {
>> +        err = rte_vhost_driver_disable_features(dev->vhost_id,
>> +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
>> +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> +                                    | 1ULL << VIRTIO_NET_F_CSUM);
>> +        if (err) {
>> +            VLOG_ERR("rte_vhost_driver_disable_features failed for 
>> vhost user "
>> +                     "port: %s\n", name);
>> +            goto out;
>> +        }
>>       }
>>       err = rte_vhost_driver_start(dev->vhost_id);
>> @@ -1711,6 +1734,11 @@ netdev_dpdk_get_config(const struct netdev 
>> *netdev, struct smap *args)
>>           } else {
>>               smap_add(args, "rx_csum_offload", "false");
>>           }
>> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +            smap_add(args, "tx_tso_offload", "true");
>> +        } else {
>> +            smap_add(args, "tx_tso_offload", "false");
>> +        }
>>           smap_add(args, "lsc_interrupt_mode",
>>                    dev->lsc_interrupt_mode ? "true" : "false");
>>       }
>> @@ -2138,6 +2166,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
>>       rte_free(rx);
>>   }
>> +/* Prepare the packet for HWOL.
>> + * Return True if the packet is OK to continue. */
>> +static bool
>> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf 
>> *mbuf)
>> +{
>> +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
>> +
>> +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
>> +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char 
>> *)dp_packet_eth(pkt);
>> +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char 
>> *)dp_packet_l3(pkt);
>> +        mbuf->outer_l2_len = 0;
>> +        mbuf->outer_l3_len = 0;
>> +    }
>> +
>> +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
>> +        struct tcp_header *th = dp_packet_l4(pkt);
>> +
>> +        if (!th) {
>> +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
>> +                         " pkt len: %"PRIu32"", dev->up.name, 
>> mbuf->pkt_len);
>> +            return false;
>> +        }
>> +
>> +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
>> +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
>> +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
>> +
>> +        if (mbuf->ol_flags & PKT_TX_IPV4) {
>> +            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
>> +/* Prepare a batch for HWOL.
>> + * Return the number of good packets in the batch. */
>> +static int
>> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf 
>> **pkts,
>> +                            int pkt_cnt)
>> +{
>> +    int i = 0;
>> +    int cnt = 0;
>> +    struct rte_mbuf *pkt;
>> +
>> +    /* Prepare and filter bad HWOL packets. */
>> +    for (i = 0; i < pkt_cnt; i++) {
>> +        pkt = pkts[i];
>> +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
>> +            rte_pktmbuf_free(pkt);
>> +            continue;
>> +        }
>> +
>> +        if (OVS_UNLIKELY(i != cnt)) {
>> +            pkts[cnt] = pkt;
>> +        }
>> +        cnt++;
>> +    }
>> +
>> +    return cnt;
>> +}
>> +
>>   /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes 
>> ownership of
>>    * 'pkts', even in case of failure.
>>    *
>> @@ -2147,11 +2236,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk 
>> *dev, int qid,
>>                            struct rte_mbuf **pkts, int cnt)
>>   {
>>       uint32_t nb_tx = 0;
>> +    uint16_t nb_tx_prep = cnt;
>> +
>> +    if (userspace_tso_enabled()) {
>> +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
>> +        if (nb_tx_prep != cnt) {
>> +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid 
>> packets. "
>> +                         "Only %u/%u are valid: %s", dev->up.name, 
>> nb_tx_prep,
>> +                         cnt, rte_strerror(rte_errno));
>> +        }
>> +    }
>> -    while (nb_tx != cnt) {
>> +    while (nb_tx != nb_tx_prep) {
>>           uint32_t ret;
>> -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - 
>> nb_tx);
>> +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
>> +                               nb_tx_prep - nb_tx);
>>           if (!ret) {
>>               break;
>>           }
>> @@ -2437,11 +2537,14 @@ netdev_dpdk_filter_packet_len(struct 
>> netdev_dpdk *dev, struct rte_mbuf **pkts,
>>       int cnt = 0;
>>       struct rte_mbuf *pkt;
>> +    /* Filter oversized packets, unless they are marked for TSO. */
>>       for (i = 0; i < pkt_cnt; i++) {
>>           pkt = pkts[i];
>> -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
>> -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " 
>> max_packet_len %d",
>> -                         dev->up.name, pkt->pkt_len, 
>> dev->max_packet_len);
>> +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
>> +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
>> +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
>> +                         "max_packet_len %d", dev->up.name, 
>> pkt->pkt_len,
>> +                         dev->max_packet_len);
>>               rte_pktmbuf_free(pkt);
>>               continue;
>>           }
>> @@ -2463,7 +2566,8 @@ netdev_dpdk_vhost_update_tx_counters(struct 
>> netdev_dpdk *dev,
>>   {
>>       int dropped = sw_stats_add->tx_mtu_exceeded_drops +
>>                     sw_stats_add->tx_qos_drops +
>> -                  sw_stats_add->tx_failure_drops;
>> +                  sw_stats_add->tx_failure_drops +
>> +                  sw_stats_add->tx_invalid_hwol_drops;
>>       struct netdev_stats *stats = &dev->stats;
>>       int sent = attempted - dropped;
>>       int i;
>> @@ -2482,6 +2586,7 @@ netdev_dpdk_vhost_update_tx_counters(struct 
>> netdev_dpdk *dev,
>>           sw_stats->tx_failure_drops      += 
>> sw_stats_add->tx_failure_drops;
>>           sw_stats->tx_mtu_exceeded_drops += 
>> sw_stats_add->tx_mtu_exceeded_drops;
>>           sw_stats->tx_qos_drops          += sw_stats_add->tx_qos_drops;
>> +        sw_stats->tx_invalid_hwol_drops += 
>> sw_stats_add->tx_invalid_hwol_drops;
>>       }
>>   }
>> @@ -2513,8 +2618,15 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, 
>> int qid,
>>           rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>>       }
>> +    sw_stats_add.tx_invalid_hwol_drops = cnt;
>> +    if (userspace_tso_enabled()) {
>> +        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
>> +    }
>> +
>> +    sw_stats_add.tx_invalid_hwol_drops -= cnt;
>> +    sw_stats_add.tx_mtu_exceeded_drops = cnt;
>>       cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
>> -    sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
>> +    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
>>       /* Check has QoS has been configured for the netdev */
>>       sw_stats_add.tx_qos_drops = cnt;
>> @@ -2562,6 +2674,120 @@ out:
>>       }
>>   }
>> +static void
>> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
>> +{
>> +    rte_free(opaque);
>> +}
>> +
>> +static struct rte_mbuf *
>> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
>> +{
>> +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
>> +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
>> +    uint16_t buf_len;
>> +    void *buf;
>> +
>> +    if (rte_pktmbuf_tailroom(pkt) >= sizeof *shinfo) {
>> +        shinfo = rte_pktmbuf_mtod(pkt, struct 
>> rte_mbuf_ext_shared_info *);
>> +    } else {
>> +        total_len += sizeof *shinfo + sizeof(uintptr_t);
>> +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> +    }
>> +
>> +    if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
>> +        VLOG_ERR("Can't copy packet: too big %u", total_len);
>> +        return NULL;
>> +    }
>> +
>> +    buf_len = total_len;
>> +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> +    if (OVS_UNLIKELY(buf == NULL)) {
>> +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", 
>> buf_len);
>> +        return NULL;
>> +    }
>> +
>> +    /* Initialize shinfo. */
>> +    if (shinfo) {
>> +        shinfo->free_cb = netdev_dpdk_extbuf_free;
>> +        shinfo->fcb_opaque = buf;
>> +        rte_mbuf_ext_refcnt_set(shinfo, 1);
>> +    } else {
>> +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> +                                                    
>> netdev_dpdk_extbuf_free,
>> +                                                    buf);
>> +        if (OVS_UNLIKELY(shinfo == NULL)) {
>> +            rte_free(buf);
>> +            VLOG_ERR("Failed to initialize shared info for mbuf while "
>> +                     "attempting to attach an external buffer.");
>> +            return NULL;
>> +        }
>> +    }
>> +
>> +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), 
>> buf_len,
>> +                              shinfo);
>> +    rte_pktmbuf_reset_headroom(pkt);
>> +
>> +    return pkt;
>> +}
>> +
>> +static struct rte_mbuf *
>> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
>> +{
>> +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>> +
>> +    if (OVS_UNLIKELY(!pkt)) {
>> +        return NULL;
>> +    }
>> +
>> +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
>> +        return pkt;
>> +    }
>> +
>> +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
>> +        return pkt;
>> +    }
>> +
>> +    rte_pktmbuf_free(pkt);
>> +
>> +    return NULL;
>> +}
>> +
>> +static struct dp_packet *
>> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet 
>> *pkt_orig)
>> +{
>> +    struct rte_mbuf *mbuf_dest;
>> +    struct dp_packet *pkt_dest;
>> +    uint32_t pkt_len;
>> +
>> +    pkt_len = dp_packet_size(pkt_orig);
>> +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
>> +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
>> +            return NULL;
>> +    }
>> +
>> +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
>> +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
>> +    dp_packet_set_size(pkt_dest, pkt_len);
>> +
>> +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
>> +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
>> +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
>> +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
>> +
>> +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
>> +           sizeof(struct dp_packet) - offsetof(struct dp_packet, 
>> l2_pad_size));
>> +
>> +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
>> +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
>> +                                - (char *)dp_packet_eth(pkt_dest);
>> +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
>> +                                - (char *) dp_packet_l3(pkt_dest);
>> +    }
>> +
>> +    return pkt_dest;
>> +}
>> +
>>   /* Tx function. Transmit packets indefinitely */
>>   static void
>>   dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
>> dp_packet_batch *batch)
>> @@ -2575,7 +2801,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, 
>> struct dp_packet_batch *batch)
>>       enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
>>   #endif
>>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
>> +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
>>       struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>>       uint32_t cnt = batch_cnt;
>>       uint32_t dropped = 0;
>> @@ -2596,34 +2822,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int 
>> qid, struct dp_packet_batch *batch)
>>           struct dp_packet *packet = batch->packets[i];
>>           uint32_t size = dp_packet_size(packet);
>> -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
>> -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
>> -                         size, dev->max_packet_len);
>> -
>> +        if (size > dev->max_packet_len
>> +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
>> +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
>> +                         dev->max_packet_len);
>>               mtu_drops++;
>>               continue;
>>           }
>> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>> +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, 
>> packet);
>>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>>               dropped = cnt - i;
>>               break;
>>           }
>> -        /* We have to do a copy for now */
>> -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
>> -               dp_packet_data(packet), size);
>> -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
>> -
>>           txcnt++;
>>       }
>>       if (OVS_LIKELY(txcnt)) {
>>           if (dev->type == DPDK_DEV_VHOST) {
>> -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet 
>> **) pkts,
>> -                                     txcnt);
>> +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
>>           } else {
>> -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, 
>> txcnt);
>> +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
>> +                                                   (struct rte_mbuf 
>> **)pkts,
>> +                                                   txcnt);
>>           }
>>       }
>> @@ -2676,26 +2898,33 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, 
>> int qid,
>>           dp_packet_delete_batch(batch, true);
>>       } else {
>>           struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>> -        int tx_cnt, dropped;
>> -        int tx_failure, mtu_drops, qos_drops;
>> +        int dropped;
>> +        int tx_failure, mtu_drops, qos_drops, hwol_drops;
>>           int batch_cnt = dp_packet_batch_size(batch);
>>           struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>> -        tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
>> -        mtu_drops = batch_cnt - tx_cnt;
>> -        qos_drops = tx_cnt;
>> -        tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
>> -        qos_drops -= tx_cnt;
>> +        hwol_drops = batch_cnt;
>> +        if (userspace_tso_enabled()) {
>> +            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, 
>> batch_cnt);
>> +        }
>> +        hwol_drops -= batch_cnt;
>> +        mtu_drops = batch_cnt;
>> +        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
>> +        mtu_drops -= batch_cnt;
>> +        qos_drops = batch_cnt;
>> +        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
>> +        qos_drops -= batch_cnt;
>> -        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt);
>> +        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, 
>> batch_cnt);
>> -        dropped = tx_failure + mtu_drops + qos_drops;
>> +        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
>>           if (OVS_UNLIKELY(dropped)) {
>>               rte_spinlock_lock(&dev->stats_lock);
>>               dev->stats.tx_dropped += dropped;
>>               sw_stats->tx_failure_drops += tx_failure;
>>               sw_stats->tx_mtu_exceeded_drops += mtu_drops;
>>               sw_stats->tx_qos_drops += qos_drops;
>> +            sw_stats->tx_invalid_hwol_drops += hwol_drops;
>>               rte_spinlock_unlock(&dev->stats_lock);
>>           }
>>       }
>> @@ -3011,7 +3240,8 @@ netdev_dpdk_get_sw_custom_stats(const struct 
>> netdev *netdev,
>>       SW_CSTAT(tx_failure_drops)       \
>>       SW_CSTAT(tx_mtu_exceeded_drops)  \
>>       SW_CSTAT(tx_qos_drops)           \
>> -    SW_CSTAT(rx_qos_drops)
>> +    SW_CSTAT(rx_qos_drops)           \
>> +    SW_CSTAT(tx_invalid_hwol_drops)
>>   #define SW_CSTAT(NAME) + 1
>>       custom_stats->size = SW_CSTATS;
>> @@ -4874,6 +5104,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>       rte_free(dev->tx_q);
>>       err = dpdk_eth_dev_init(dev);
>> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
>> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
>> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
>> +    }
>> +
>>       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>>       if (!dev->tx_q) {
>>           err = ENOMEM;
>> @@ -4903,6 +5139,11 @@ dpdk_vhost_reconfigure_helper(struct 
>> netdev_dpdk *dev)
>>           dev->tx_q[0].map = 0;
>>       }
>> +    if (userspace_tso_enabled()) {
>> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
>> +        VLOG_DBG("%s: TSO enabled on vhost port", 
>> netdev_get_name(&dev->up));
>> +    }
>> +
>>       netdev_dpdk_remap_txqs(dev);
>>       err = netdev_dpdk_mempool_configure(dev);
>> @@ -4975,6 +5216,11 @@ netdev_dpdk_vhost_client_reconfigure(struct 
>> netdev *netdev)
>>               vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>           }
>> +        /* Enable External Buffers if TCP Segmentation Offload is 
>> enabled. */
>> +        if (userspace_tso_enabled()) {
>> +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
>> +        }
>> +
>>           err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>>           if (err) {
>>               VLOG_ERR("vhost-user device setup failure for device %s\n",
>> @@ -4999,14 +5245,20 @@ netdev_dpdk_vhost_client_reconfigure(struct 
>> netdev *netdev)
>>               goto unlock;
>>           }
>> -        err = rte_vhost_driver_disable_features(dev->vhost_id,
>> -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
>> -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> -                                    | 1ULL << VIRTIO_NET_F_CSUM);
>> -        if (err) {
>> -            VLOG_ERR("rte_vhost_driver_disable_features failed for 
>> vhost user "
>> -                     "client port: %s\n", dev->up.name);
>> -            goto unlock;
>> +        if (userspace_tso_enabled()) {
>> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
>> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
>> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
>> +        } else {
>> +            err = rte_vhost_driver_disable_features(dev->vhost_id,
>> +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
>> +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> +                                        | 1ULL << VIRTIO_NET_F_CSUM);
>> +            if (err) {
>> +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
>> +                         "vhost user client port: %s\n", dev->up.name);
>> +                goto unlock;
>> +            }
>>           }
>>           err = rte_vhost_driver_start(dev->vhost_id);
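
One detail in netdev_dpdk_prep_hwol_packet() above that is easy to miss is
how the segment size is derived: tso_segsz is the device MTU minus the L3
and L4 header lengths, i.e. the TCP payload that fits into a single
MTU-sized frame.  A quick worked example, assuming an IPv4 header and a TCP
header without options (sketch only):

    /* MTU 1500, 20-byte IPv4 header, 20-byte TCP header. */
    mbuf->l3_len = 20;
    mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;                /* 20 here. */
    mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;  /* 1460.    */
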
>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>> index f08159aa7..9dbc67658 100644
>> --- a/lib/netdev-linux-private.h
>> +++ b/lib/netdev-linux-private.h
>> @@ -27,6 +27,7 @@
>>   #include <stdint.h>
>>   #include <stdbool.h>
>> +#include "dp-packet.h"
>>   #include "netdev-afxdp.h"
>>   #include "netdev-afxdp-pool.h"
>>   #include "netdev-provider.h"
>> @@ -37,10 +38,13 @@
>>   struct netdev;
>> +#define LINUX_RXQ_TSO_MAX_LEN 65536
>> +
>>   struct netdev_rxq_linux {
>>       struct netdev_rxq up;
>>       bool is_tap;
>>       int fd;
>> +    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO 
>> buffers. */
>>   };
>>   int netdev_linux_construct(struct netdev *);
>> @@ -92,6 +96,7 @@ struct netdev_linux {
>>       int tap_fd;
>>       bool present;               /* If the device is present in the 
>> namespace */
>>       uint64_t tx_dropped;        /* tap device can drop if the iface 
>> is down */
>> +    uint64_t rx_dropped;        /* Packets dropped while recv from 
>> kernel. */
>>       /* LAG information. */
>>       bool is_lag_master;         /* True if the netdev is a LAG 
>> master. */
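
The aux_bufs added above pair with the MTU-sized packet buffer in the
netdev-linux.c receive path that follows: each message is read into a
two-element iovec, so a regular packet lands entirely in the dp_packet while
a TSO packet spills its extra bytes into the 64 KB auxiliary buffer and is
linearized afterwards.  A minimal sketch of that layout, using the names
from the code below:

    struct iovec iov[IOV_TSO_SIZE];

    iov[IOV_PACKET].iov_base = dp_packet_data(buffer);  /* headers + MTU.  */
    iov[IOV_PACKET].iov_len  = std_len;
    iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];          /* TSO overflow.   */
    iov[IOV_AUXBUF].iov_len  = LINUX_RXQ_TSO_MAX_LEN;

    retval = readv(rx->fd, iov, IOV_TSO_SIZE);
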
>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>> index 41d1e9273..a4a666657 100644
>> --- a/lib/netdev-linux.c
>> +++ b/lib/netdev-linux.c
>> @@ -29,16 +29,18 @@
>>   #include <linux/filter.h>
>>   #include <linux/gen_stats.h>
>>   #include <linux/if_ether.h>
>> +#include <linux/if_packet.h>
>>   #include <linux/if_tun.h>
>>   #include <linux/types.h>
>>   #include <linux/ethtool.h>
>>   #include <linux/mii.h>
>>   #include <linux/rtnetlink.h>
>>   #include <linux/sockios.h>
>> +#include <linux/virtio_net.h>
>>   #include <sys/ioctl.h>
>>   #include <sys/socket.h>
>> +#include <sys/uio.h>
>>   #include <sys/utsname.h>
>> -#include <netpacket/packet.h>
>>   #include <net/if.h>
>>   #include <net/if_arp.h>
>>   #include <net/route.h>
>> @@ -75,6 +77,7 @@
>>   #include "timer.h"
>>   #include "unaligned.h"
>>   #include "openvswitch/vlog.h"
>> +#include "userspace-tso.h"
>>   #include "util.h"
>>   VLOG_DEFINE_THIS_MODULE(netdev_linux);
>> @@ -237,6 +240,16 @@ enum {
>>       VALID_DRVINFO           = 1 << 6,
>>       VALID_FEATURES          = 1 << 7,
>>   };
>> +
>> +/* Use one for the packet buffer and another for the aux buffer to 
>> receive
>> + * TSO packets. */
>> +#define IOV_STD_SIZE 1
>> +#define IOV_TSO_SIZE 2
>> +
>> +enum {
>> +    IOV_PACKET = 0,
>> +    IOV_AUXBUF = 1,
>> +};
>>   

>>   struct linux_lag_slave {
>>      uint32_t block_id;
>> @@ -501,6 +514,8 @@ static struct vlog_rate_limit rl = 
>> VLOG_RATE_LIMIT_INIT(5, 20);
>>    * changes in the device miimon status, so we can use atomic_count. */
>>   static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
>> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
>>   static int netdev_linux_do_ethtool(const char *name, struct 
>> ethtool_cmd *,
>>                                      int cmd, const char *cmd_name);
>>   static int get_flags(const struct netdev *, unsigned int *flags);
>> @@ -902,6 +917,13 @@ netdev_linux_common_construct(struct netdev 
>> *netdev_)
>>       /* The device could be in the same network namespace or in 
>> another one. */
>>       netnsid_unset(&netdev->netnsid);
>>       ovs_mutex_init(&netdev->mutex);
>> +
>> +    if (userspace_tso_enabled()) {
>> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
>> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
>> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
>> +    }
>> +
>>       return 0;
>>   }
>> @@ -961,6 +983,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
>>       /* Create tap device. */
>>       get_flags(&netdev->up, &netdev->ifi_flags);
>>       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
>> +    if (userspace_tso_enabled()) {
>> +        ifr.ifr_flags |= IFF_VNET_HDR;
>> +    }
>> +
>>       ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
>>       if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
>>           VLOG_WARN("%s: creating tap device failed: %s", name,
>> @@ -1024,6 +1050,15 @@ static struct netdev_rxq *
>>   netdev_linux_rxq_alloc(void)
>>   {
>>       struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
>> +    if (userspace_tso_enabled()) {
>> +        int i;
>> +
>> +        /* Allocate auxiliary buffers to receive TSO packets. */
>> +        for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
>> +        }
>> +    }
>> +
>>       return &rx->up;
>>   }
>> @@ -1069,6 +1104,15 @@ netdev_linux_rxq_construct(struct netdev_rxq 
>> *rxq_)
>>               goto error;
>>           }
>> +        if (userspace_tso_enabled()
>> +            && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
>> +                          sizeof val)) {
>> +            error = errno;
>> +            VLOG_ERR("%s: failed to enable vnet hdr in txq raw 
>> socket: %s",
>> +                     netdev_get_name(netdev_), ovs_strerror(errno));
>> +            goto error;
>> +        }
>> +
>>           /* Set non-blocking mode. */
>>           error = set_nonblocking(rx->fd);
>>           if (error) {
>> @@ -1119,10 +1163,15 @@ static void
>>   netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>>   {
>>       struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>> +    int i;
>>       if (!rx->is_tap) {
>>           close(rx->fd);
>>       }
>> +
>> +    for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +        free(rx->aux_bufs[i]);
>> +    }
>>   }
>>   static void
>> @@ -1159,12 +1208,14 @@ auxdata_has_vlan_tci(const struct 
>> tpacket_auxdata *aux)
>>    * It also used recvmmsg to reduce multiple syscalls overhead;
>>    */
>>   static int
>> -netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>> +netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>>                                    struct dp_packet_batch *batch)
>>   {
>> -    size_t size;
>> +    int iovlen;
>> +    size_t std_len;
>>       ssize_t retval;
>> -    struct iovec iovs[NETDEV_MAX_BURST];
>> +    int virtio_net_hdr_size;
>> +    struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE];
>>       struct cmsghdr *cmsg;
>>       union {
>>           struct cmsghdr cmsg;
>> @@ -1174,41 +1225,87 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>>       struct dp_packet *buffers[NETDEV_MAX_BURST];
>>       int i;
>> +    if (userspace_tso_enabled()) {
>> +        /* Use the buffer from the allocated packet below to receive MTU
>> +         * sized packets and an aux_buf for extra TSO data. */
>> +        iovlen = IOV_TSO_SIZE;
>> +        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
>> +    } else {
>> +        /* Use only the buffer from the allocated packet. */
>> +        iovlen = IOV_STD_SIZE;
>> +        virtio_net_hdr_size = 0;
>> +    }
>> +
>> +    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
>>       for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> -         buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN 
>> + mtu,
>> -                                                  DP_NETDEV_HEADROOM);
>> -         /* Reserve headroom for a single VLAN tag */
>> -         dp_packet_reserve(buffers[i], VLAN_HEADER_LEN);
>> -         size = dp_packet_tailroom(buffers[i]);
>> -         iovs[i].iov_base = dp_packet_data(buffers[i]);
>> -         iovs[i].iov_len = size;
>> +         buffers[i] = dp_packet_new_with_headroom(std_len, 
>> DP_NETDEV_HEADROOM);
>> +         iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
>> +         iovs[i][IOV_PACKET].iov_len = std_len;
>> +         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
>> +         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
>>            mmsgs[i].msg_hdr.msg_name = NULL;
>>            mmsgs[i].msg_hdr.msg_namelen = 0;
>> -         mmsgs[i].msg_hdr.msg_iov = &iovs[i];
>> -         mmsgs[i].msg_hdr.msg_iovlen = 1;
>> +         mmsgs[i].msg_hdr.msg_iov = iovs[i];
>> +         mmsgs[i].msg_hdr.msg_iovlen = iovlen;
>>            mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i];
>>            mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
>>            mmsgs[i].msg_hdr.msg_flags = 0;
>>       }
>>       do {
>> -        retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
>> +        retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, 
>> NULL);
>>       } while (retval < 0 && errno == EINTR);
>>       if (retval < 0) {
>> -        /* Save -errno to retval temporarily */
>> -        retval = -errno;
>> -        i = 0;
>> -        goto free_buffers;
>> +        retval = errno;
>> +        for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +            dp_packet_delete(buffers[i]);
>> +        }
>> +
>> +        return retval;
>>       }
>>       for (i = 0; i < retval; i++) {
>>           if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
>> -            break;
>> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> +
>> +            dp_packet_delete(buffers[i]);
>> +            netdev->rx_dropped += 1;
>> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether 
>> hdr size",
>> +                         netdev_get_name(netdev_));
>> +            continue;
>> +        }
>> +
>> +        if (mmsgs[i].msg_len > std_len) {
>> +            /* Build a single linear TSO packet by expanding the 
>> current packet
>> +             * to append the data received in the aux_buf. */
>> +            size_t extra_len = mmsgs[i].msg_len - std_len;
>> +
>> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
>> +                               + std_len);
>> +            dp_packet_prealloc_tailroom(buffers[i], extra_len);
>> +            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], 
>> extra_len);
>> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
>> +                               + extra_len);
>> +        } else {
>> +            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
>> +                               + mmsgs[i].msg_len);
>>           }
>> -        dp_packet_set_size(buffers[i],
>> -                           dp_packet_size(buffers[i]) + 
>> mmsgs[i].msg_len);
>> +        if (virtio_net_hdr_size && 
>> netdev_linux_parse_vnet_hdr(buffers[i])) {
>> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> +
>> +            /* Unexpected error situation: the virtio header is not 
>> present
>> +             * or corrupted. Drop the packet but continue in case 
>> next ones
>> +             * are correct. */
>> +            dp_packet_delete(buffers[i]);
>> +            netdev->rx_dropped += 1;
>> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net 
>> header",
>> +                         netdev_get_name(netdev_));
>> +            continue;
>> +        }
>>           for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg;
>>                    cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) {
>> @@ -1238,22 +1335,11 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
>>           dp_packet_batch_add(batch, buffers[i]);
>>       }
>> -free_buffers:
>> -    /* Free unused buffers, including buffers whose size is less than
>> -     * ETH_HEADER_LEN.
>> -     *
>> -     * Note: i has been set correctly by the above for loop, so don't
>> -     * try to re-initialize it.
>> -     */
>> +    /* Delete unused buffers. */
>>       for (; i < NETDEV_MAX_BURST; i++) {
>>           dp_packet_delete(buffers[i]);
>>       }
>> -    /* netdev_linux_rxq_recv needs it to return 0 or positive errno */
>> -    if (retval < 0) {
>> -        return -retval;
>> -    }
>> -
>>       return 0;
>>   }
>> @@ -1263,20 +1349,40 @@ free_buffers:
>>    * packets are added into *batch. The return value is 0 or errno.
>>    */
>>   static int
>> -netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct 
>> dp_packet_batch *batch)
>> +netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
>> +                                struct dp_packet_batch *batch)
>>   {
>>       struct dp_packet *buffer;
>> +    int virtio_net_hdr_size;
>>       ssize_t retval;
>> -    size_t size;
>> +    size_t std_len;
>> +    int iovlen;
>>       int i;
>> +    if (userspace_tso_enabled()) {
>> +        /* Use the buffer from the allocated packet below to receive MTU
>> +         * sized packets and an aux_buf for extra TSO data. */
>> +        iovlen = IOV_TSO_SIZE;
>> +        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
>> +    } else {
>> +        /* Use only the buffer from the allocated packet. */
>> +        iovlen = IOV_STD_SIZE;
>> +        virtio_net_hdr_size = 0;
>> +    }
>> +
>> +    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
>>       for (i = 0; i < NETDEV_MAX_BURST; i++) {
>> +        struct iovec iov[IOV_TSO_SIZE];
>> +
>>           /* Assume Ethernet port. No need to set packet_type. */
>> -        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>> -                                             DP_NETDEV_HEADROOM);
>> -        size = dp_packet_tailroom(buffer);
>> +        buffer = dp_packet_new_with_headroom(std_len, 
>> DP_NETDEV_HEADROOM);
>> +        iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
>> +        iov[IOV_PACKET].iov_len = std_len;
>> +        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
>> +        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
>> +
>>           do {
>> -            retval = read(fd, dp_packet_data(buffer), size);
>> +            retval = readv(rx->fd, iov, iovlen);
>>           } while (retval < 0 && errno == EINTR);
>>           if (retval < 0) {
>> @@ -1284,7 +1390,33 @@ netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
>>               break;
>>           }
>> -        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
>> +        if (retval > std_len) {
>> +            /* Build a single linear TSO packet by expanding the current packet
>> +             * to append the data received in the aux_buf. */
>> +            size_t extra_len = retval - std_len;
>> +
>> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
>> +            dp_packet_prealloc_tailroom(buffer, extra_len);
>> +            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
>> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
>> +        } else {
>> +            dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
>> +        }
>> +
>> +        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
>> +            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>> +            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> +
>> +            /* Unexpected error situation: the virtio header is not present
>> +             * or corrupted. Drop the packet but continue in case next ones
>> +             * are correct. */
>> +            dp_packet_delete(buffer);
>> +            netdev->rx_dropped += 1;
>> +            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
>> +                         netdev_get_name(netdev_));
>> +            continue;
>> +        }
>> +
>>           dp_packet_batch_add(batch, buffer);
>>       }
>> @@ -1310,8 +1442,8 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>       dp_packet_batch_init(batch);
>>       retval = (rx->is_tap
>> -              ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch)
>> -              : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch));
>> +              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
>> +              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
>>       if (retval) {
>>           if (retval != EAGAIN && retval != EMSGSIZE) {
>> @@ -1353,7 +1485,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
>>   }
>>   static int
>> -netdev_linux_sock_batch_send(int sock, int ifindex,
>> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>>                                struct dp_packet_batch *batch)
>>   {
>>       const size_t size = dp_packet_batch_size(batch);
>> @@ -1367,6 +1499,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>>       struct dp_packet *packet;
>>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> +        if (tso) {
>> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
>> +        }
>> +
>>           iov[i].iov_base = dp_packet_data(packet);
>>           iov[i].iov_len = dp_packet_size(packet);
>>           mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
>> @@ -1399,7 +1535,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>>    * on other interface types because we attach a socket filter to the rx
>>    * socket. */
>>   static int
>> -netdev_linux_tap_batch_send(struct netdev *netdev_,
>> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
>>                               struct dp_packet_batch *batch)
>>   {
>>       struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>> @@ -1416,10 +1552,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
>>       }
>>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> -        size_t size = dp_packet_size(packet);
>> +        size_t size;
>>           ssize_t retval;
>>           int error;
>> +        if (tso) {
>> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
>> +        }
>> +
>> +        size = dp_packet_size(packet);
>>           do {
>>               retval = write(netdev->tap_fd, dp_packet_data(packet), size);
>>               error = retval < 0 ? errno : 0;
>> @@ -1454,9 +1595,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>>                     struct dp_packet_batch *batch,
>>                     bool concurrent_txq OVS_UNUSED)
>>   {
>> +    bool tso = userspace_tso_enabled();
>> +    int mtu = ETH_PAYLOAD_MAX;
>>       int error = 0;
>>       int sock = 0;
>> +    if (tso) {
>> +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
>> +    }
>> +
>>       if (!is_tap_netdev(netdev_)) {
>>           if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>>               error = EOPNOTSUPP;
>> @@ -1475,9 +1622,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>>               goto free_batch;
>>           }
>> -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
>> +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
>>       } else {
>> -        error = netdev_linux_tap_batch_send(netdev_, batch);
>> +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
>>       }
>>       if (error) {
>>           if (error == ENOBUFS) {
>> @@ -2045,6 +2192,7 @@ netdev_tap_get_stats(const struct netdev *netdev_, struct netdev_stats *stats)
>>           stats->collisions          += dev_stats.collisions;
>>       }
>>       stats->tx_dropped += netdev->tx_dropped;
>> +    stats->rx_dropped += netdev->rx_dropped;
>>       ovs_mutex_unlock(&netdev->mutex);
>>       return error;
>> @@ -6223,6 +6371,17 @@ af_packet_sock(void)
>>               if (error) {
>>                   close(sock);
>>                   sock = -error;
>> +            } else if (userspace_tso_enabled()) {
>> +                int val = 1;
>> +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
>> +                                   sizeof val);
>> +                if (error) {
>> +                    error = errno;
>> +                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
>> +                             ovs_strerror(errno));
>> +                    close(sock);
>> +                    sock = -error;
>> +                }
>>               }
>>           } else {
>>               sock = -errno;
>> @@ -6234,3 +6393,136 @@ af_packet_sock(void)
>>       return sock;
>>   }
>> +
>> +static int
>> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
>> +{
>> +    struct eth_header *eth_hdr;
>> +    ovs_be16 eth_type;
>> +    int l2_len;
>> +
>> +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
>> +    if (!eth_hdr) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    l2_len = ETH_HEADER_LEN;
>> +    eth_type = eth_hdr->eth_type;
>> +    if (eth_type_vlan(eth_type)) {
>> +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
>> +
>> +        if (!vlan) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        eth_type = vlan->vlan_next_type;
>> +        l2_len += VLAN_HEADER_LEN;
>> +    }
>> +
>> +    if (eth_type == htons(ETH_TYPE_IP)) {
>> +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
>> +
>> +        if (!ip_hdr) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        *l4proto = ip_hdr->ip_proto;
>> +        dp_packet_hwol_set_tx_ipv4(b);
>> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
>> +        struct ovs_16aligned_ip6_hdr *nh6;
>> +
>> +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
>> +        if (!nh6) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
>> +        dp_packet_hwol_set_tx_ipv6(b);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
>> +{
>> +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
>> +    uint16_t l4proto = 0;
>> +
>> +    if (OVS_UNLIKELY(!vnet)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
>> +        return 0;
>> +    }
>> +
>> +    if (netdev_linux_parse_l2(b, &l4proto)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>> +        if (l4proto == IPPROTO_TCP) {
>> +            dp_packet_hwol_set_csum_tcp(b);
>> +        } else if (l4proto == IPPROTO_UDP) {
>> +            dp_packet_hwol_set_csum_udp(b);
>> +        } else if (l4proto == IPPROTO_SCTP) {
>> +            dp_packet_hwol_set_csum_sctp(b);
>> +        }
>> +    }
>> +
>> +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
>> +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
>> +                                | VIRTIO_NET_HDR_GSO_TCPV6
>> +                                | VIRTIO_NET_HDR_GSO_UDP;
>> +        uint8_t type = vnet->gso_type & allowed_mask;
>> +
>> +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
>> +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
>> +            dp_packet_hwol_set_tcp_seg(b);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void
>> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
>> +{
>> +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
>> +
>> +    if (dp_packet_hwol_is_tso(b)) {
>> +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
>> +                            + TCP_HEADER_LEN;
>> +
>> +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
>> +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
>> +        if (dp_packet_hwol_is_ipv4(b)) {
>> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +        } else {
>> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>> +        }
>> +
>> +    } else {
>> +        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
>> +    }
>> +
>> +    if (dp_packet_hwol_l4_mask(b)) {
>> +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
>> +                                                  - (char *)dp_packet_eth(b));
>> +
>> +        if (dp_packet_hwol_l4_is_tcp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>> +                                    struct tcp_header, tcp_csum);
>> +        } else if (dp_packet_hwol_l4_is_udp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>> +                                    struct udp_header, udp_csum);
>> +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
>> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
>> +                                    struct sctp_header, sctp_csum);
>> +        } else {
>> +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
>> +        }
>> +    }
>> +}
>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>> index f109c4e66..22f4cde33 100644
>> --- a/lib/netdev-provider.h
>> +++ b/lib/netdev-provider.h
>> @@ -37,6 +37,12 @@ extern "C" {
>>   struct netdev_tnl_build_header_params;
>>   #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>> +enum netdev_ol_flags {
>> +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
>> +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
>> +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
>> +};
>> +
>>   /* A network device (e.g. an Ethernet device).
>>    *
>>    * Network device implementations may read these members but should not modify
>> @@ -51,6 +57,9 @@ struct netdev {
>>        * opening this device, and therefore got assigned to the "system" class */
>>       bool auto_classified;
>> +    /* This bitmask of the offloading features enabled by the netdev. */
>> +    uint64_t ol_flags;
>> +
>>       /* If this is 'true', the user explicitly specified an MTU for this
>>        * netdev.  Otherwise, Open vSwitch is allowed to override it. */
>>       bool mtu_user_config;
>> diff --git a/lib/netdev.c b/lib/netdev.c
>> index 405c98c68..f95b19af4 100644
>> --- a/lib/netdev.c
>> +++ b/lib/netdev.c
>> @@ -66,6 +66,8 @@ COVERAGE_DEFINE(netdev_received);
>>   COVERAGE_DEFINE(netdev_sent);
>>   COVERAGE_DEFINE(netdev_add_router);
>>   COVERAGE_DEFINE(netdev_get_stats);
>> +COVERAGE_DEFINE(netdev_send_prepare_drops);
>> +COVERAGE_DEFINE(netdev_push_header_drops);
>>   struct netdev_saved_flags {
>>       struct netdev *netdev;
>> @@ -782,6 +784,54 @@ netdev_get_pt_mode(const struct netdev *netdev)
>>               : NETDEV_PT_LEGACY_L2);
>>   }
>> +/* Check if a 'packet' is compatible with 'netdev_flags'.
>> + * If a packet is incompatible, return 'false' with the 'errormsg'
>> + * pointing to a reason. */
>> +static bool
>> +netdev_send_prepare_packet(const uint64_t netdev_flags,
>> +                           struct dp_packet *packet, char **errormsg)
>> +{
>> +    if (dp_packet_hwol_is_tso(packet)
>> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
>> +            /* Fall back to GSO in software. */
>> +            VLOG_ERR_BUF(errormsg, "No TSO support");
>> +            return false;
>> +    }
>> +
>> +    if (dp_packet_hwol_l4_mask(packet)
>> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
>> +            /* Fall back to L4 csum in software. */
>> +            VLOG_ERR_BUF(errormsg, "No L4 checksum support");
>> +            return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
>> + * otherwise either fall back to software implementation or drop it. */
>> +static void
>> +netdev_send_prepare_batch(const struct netdev *netdev,
>> +                          struct dp_packet_batch *batch)
>> +{
>> +    struct dp_packet *packet;
>> +    size_t i, size = dp_packet_batch_size(batch);
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
>> +        char *errormsg = NULL;
>> +
>> +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
>> +            dp_packet_batch_refill(batch, packet, i);
>> +        } else {
>> +            dp_packet_delete(packet);
>> +            COVERAGE_INC(netdev_send_prepare_drops);
>> +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
>> +                         netdev_get_name(netdev), errormsg);
>> +            free(errormsg);
>> +        }
>> +    }
>> +}
>> +
>>   /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
>>    * otherwise a positive errno value.  Returns EAGAIN without blocking if
>>    * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
>> @@ -811,8 +861,14 @@ int
>>   netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
>>               bool concurrent_txq)
>>   {
>> -    int error = netdev->netdev_class->send(netdev, qid, batch,
>> -                                           concurrent_txq);
>> +    int error;
>> +
>> +    netdev_send_prepare_batch(netdev, batch);
>> +    if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
>> +        return 0;
>> +    }
>> +
>> +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
>>       if (!error) {
>>           COVERAGE_INC(netdev_sent);
>>       }
>> @@ -878,9 +934,21 @@ netdev_push_header(const struct netdev *netdev,
>>                      const struct ovs_action_push_tnl *data)
>>   {
>>       struct dp_packet *packet;
>> -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>> -        netdev->netdev_class->push_header(netdev, packet, data);
>> -        pkt_metadata_init(&packet->md, data->out_port);
>> +    size_t i, size = dp_packet_batch_size(batch);
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
>> +        if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
>> +                         || dp_packet_hwol_l4_mask(packet))) {
>> +            COVERAGE_INC(netdev_push_header_drops);
>> +            dp_packet_delete(packet);
>> +            VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload 
>> flags is "
>> +                         "not supported: packet dropped",
>> +                         netdev_get_name(netdev));
>> +        } else {
>> +            netdev->netdev_class->push_header(netdev, packet, data);
>> +            pkt_metadata_init(&packet->md, data->out_port);
>> +            dp_packet_batch_refill(batch, packet, i);
>> +        }
>>       }
>>       return 0;
>> diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c
>> new file mode 100644
>> index 000000000..6a4a0149b
>> --- /dev/null
>> +++ b/lib/userspace-tso.c
>> @@ -0,0 +1,53 @@
>> +/*
>> + * Copyright (c) 2020 Red Hat, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "smap.h"
>> +#include "ovs-thread.h"
>> +#include "openvswitch/vlog.h"
>> +#include "dpdk.h"
>> +#include "userspace-tso.h"
>> +#include "vswitch-idl.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(userspace_tso);
>> +
>> +static bool userspace_tso = false;
>> +
>> +void
>> +userspace_tso_init(const struct smap *ovs_other_config)
>> +{
>> +    if (smap_get_bool(ovs_other_config, "userspace-tso-enable", false)) {
>> +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +        if (ovsthread_once_start(&once)) {
>> +#ifdef DPDK_NETDEV
>> +            VLOG_INFO("Userspace TCP Segmentation Offloading support 
>> enabled");
>> +            userspace_tso = true;
>> +#else
>> +            VLOG_WARN("Userspace TCP Segmentation Offloading can not 
>> be enabled"
>> +                      "since OVS is built without DPDK support.");
>> +#endif
>> +            ovsthread_once_done(&once);
>> +        }
>> +    }
>> +}
>> +
>> +bool
>> +userspace_tso_enabled(void)
>> +{
>> +    return userspace_tso;
>> +}
>> diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h
>> new file mode 100644
>> index 000000000..0758274c0
>> --- /dev/null
>> +++ b/lib/userspace-tso.h
>> @@ -0,0 +1,23 @@
>> +/*
>> + * Copyright (c) 2020 Red Hat Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef USERSPACE_TSO_H
>> +#define USERSPACE_TSO_H 1
>> +
>> +void userspace_tso_init(const struct smap *ovs_other_config);
>> +bool userspace_tso_enabled(void);
>> +
>> +#endif /* userspace-tso.h */
>> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
>> index 86c7b10a9..e591c26a6 100644
>> --- a/vswitchd/bridge.c
>> +++ b/vswitchd/bridge.c
>> @@ -65,6 +65,7 @@
>>   #include "system-stats.h"
>>   #include "timeval.h"
>>   #include "tnl-ports.h"
>> +#include "userspace-tso.h"
>>   #include "util.h"
>>   #include "unixctl.h"
>>   #include "lib/vswitch-idl.h"
>> @@ -3285,6 +3286,7 @@ bridge_run(void)
>>       if (cfg) {
>>           netdev_set_flow_api_enabled(&cfg->other_config);
>>           dpdk_init(&cfg->other_config);
>> +        userspace_tso_init(&cfg->other_config);
>>       }
>>       /* Initialize the ofproto library.  This only needs to run once, but
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index c43cb1aa4..3ddaaefda 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -690,6 +690,26 @@
>>            once in few hours or a day or a week.
>>           </p>
>>         </column>
>> +      <column name="other_config" key="userspace-tso-enable"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          Set this value to <code>true</code> to enable userspace support for
>> +          TCP Segmentation Offloading (TSO). When it is enabled, the interfaces
>> +          can provide an oversized TCP segment to the datapath and the datapath
>> +          will offload the TCP segmentation and checksum calculation to the
>> +          interfaces when necessary.
>> +        </p>
>> +        <p>
>> +          The default value is <code>false</code>. Changing this value requires
>> +          restarting the daemon.
>> +        </p>
>> +        <p>
>> +          The feature only works if Open vSwitch is built with DPDK support.
>> +        </p>
>> +        <p>
>> +          The feature is considered experimental.
>> +        </p>
>> +      </column>
>>       </group>
>>       <group title="Status">
>>         <column name="next_cfg">
>>
Ilya Maximets Jan. 17, 2020, 11:08 p.m. UTC | #5
On 18.01.2020 00:03, Stokes, Ian wrote:
> Thanks all for review/testing, pushed to master.

OK, thanks Ian.

@Ben, even though this patch already merged, I'd ask you to take a look
at the code in case you'll spot some issues especially in non-DPDK related
parts.

Thanks.

Best regards, Ilya Maximets.


> 
> Regards
> Ian
> 
> -----Original Message-----
> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Stokes, Ian
> Sent: Friday, January 17, 2020 10:56 PM
> To: Flavio Leitner <fbl@sysclose.org>; dev@openvswitch.org
> Cc: Ilya Maximets <i.maximets@ovn.org>; txfh2007 <txfh2007@aliyun.com>
> Subject: Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support
> 
> 
> 
> On 1/17/2020 9:54 PM, Stokes, Ian wrote:
>>
>>
>> On 1/17/2020 9:47 PM, Flavio Leitner wrote:
>>> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
>>> the network stack to delegate the TCP segmentation to the NIC reducing
>>> the per packet CPU overhead.
>>>
>>> A guest using vhostuser interface with TSO enabled can send TCP packets
>>> much bigger than the MTU, which saves CPU cycles normally used to break
>>> the packets down to MTU size and to calculate checksums.
>>>
>>> It also saves CPU cycles used to parse multiple packets/headers during
>>> the packet processing inside virtual switch.
>>>
>>> If the destination of the packet is another guest in the same host, then
>>> the same big packet can be sent through a vhostuser interface skipping
>>> the segmentation completely. However, if the destination is not local,
>>> the NIC hardware is instructed to do the TCP segmentation and checksum
>>> calculation.
>>>
>>> It is recommended to check if NIC hardware supports TSO before enabling
>>> the feature, which is off by default. For additional information please
>>> check the tso.rst document.
>>>
>>> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
>>
>> Fantastic work here Flavio, quick turnaround when needed.
>>
>> Acked
> 
> Are there any objections to merging this?
> 
> There's been nothing so far.
> 
> If there are no further objections, I will merge this at the end of the hour.
> 
> BR
> Ian
>>
>> BR
>> Ian
Ben Pfaff Jan. 21, 2020, 9:35 p.m. UTC | #6
On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> On 18.01.2020 00:03, Stokes, Ian wrote:
> > Thanks all for review/testing, pushed to master.
> 
> OK, thanks Ian.
> 
> @Ben, even though this patch already merged, I'd ask you to take a look
> at the code in case you'll spot some issues especially in non-DPDK related
> parts.

I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
name suggested to me "test whether the packet is IPv4" not "test whether
the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
much more, it makes it obvious at a glance that it's a
checksum-offloading check.

In the case where we actually receive a 64 kB packet, I think that this
code is going to be relatively inefficient.  If I'm reading the code
correctly (I did it quickly), then this is what happens:

        - The first 1500 bytes of the packet land in the first
          dp_packet.

        - The remaining 64000ish bytes land in the second dp_packet.

        - Then we expand the first dp_packet to the needed size and copy
          the remaining 64000 bytes into it.

An alternative would be:

        - Set up the first dp_packet as currently.

        - Set up the second dp_packet so that the bytes are received
          into it starting at offset (mtu + headroom).

        - If more than mtu bytes are received, then copy those bytes
          into the headroom of the second dp_packet and return it to the
          caller instead of the first dp_packet.

The advantage is that we do a 1500-byte copy instead of a 64000-byte
copy.  The disadvantage is that we waste any memory leftover in the
second dp_packet, e.g. 32 kB if it's only a 32 kB packet instead of 64
kB.  Also we need slightly more sophisticated dp_packet allocation (we
will need to replenish the supply of aux_bufs).
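
Just to illustrate what I mean, here is a rough, untested sketch of the
receive side.  It borrows the variable names from the patch's
netdev_linux_batch_rxq_recv_tap() and assumes each aux_buf is allocated
with std_len extra bytes in front of the 64 kB area; headroom is ignored
for brevity:

    iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
    iov[IOV_PACKET].iov_len = std_len;
    /* Leave a std_len-sized hole at the front of the aux buffer. */
    iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i] + std_len;
    iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;

    retval = readv(rx->fd, iov, iovlen);
    if (retval > std_len) {
        /* Copy only the std_len head in front of the tail that readv()
         * already placed in the aux buffer, then return a packet built
         * around the aux buffer instead of expanding 'buffer'. */
        memcpy(rx->aux_bufs[i], dp_packet_data(buffer), std_len);
    }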

Thanks,

Ben.
Flavio Leitner Jan. 22, 2020, 8:54 a.m. UTC | #7
Hi Ben,

Thanks for reviewing it!

On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > Thanks all for review/testing, pushed to master.
> > 
> > OK, thanks Ian.
> > 
> > @Ben, even though this patch already merged, I'd ask you to take a look
> > at the code in case you'll spot some issues especially in non-DPDK related
> > parts.
> 
> I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> name suggested to me "test whether the packet is IPv4" not "test whether
> the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> much more, it makes it obvious at a glance that it's a
> checksum-offloading check.

hwol = hardware offloading. I hear that all the time, but maybe there is a
better name. I will improve that if no one gets on it first.

> In the case where we actually receive a 64 kB packet, I think that this
> code is going to be relatively inefficient.  If I'm reading the code
> correctly (I did it quickly), then this is what happens:
> 
>         - The first 1500 bytes of the packet land in the first
>           dp_packet.
> 
>         - The remaining 64000ish bytes land in the second dp_packet.
> 
>         - Then we expand the first dp_packet to the needed size and copy
>           the remaining 64000 bytes into it.

That's correct.

> An alternative would be:
> 
>         - Set up the first dp_packet as currently.
> 
>         - Set up the second dp_packet so that the bytes are received
>           into it starting at offset (mtu + headroom).
> 
>         - If more than mtu bytes are received, then copy those bytes
>           into the headroom of the second dp_packet and return it to the
>           caller instead of the first dp_packet.

I wanted to avoid doing more extensive processing if it's not a TSO packet
to avoid performance regressions, since it's very sensitive. Right now the 64k
buffer is preallocated and is static for each queue to avoid the malloc
performance issue. For the TSO case, we have more time per packet for
processing.

> The advantage is that we do a 1500-byte copy instead of a 64000-byte
> copy.  The disadvantage is that we waste any memory leftover in the
> second dp_packet, e.g. 32 kB if it's only a 32 kB packet instead of 64
> kB.  Also we need slightly more sophisticated dp_packet allocation (we
> will need to replenish the supply of aux_bufs).

I also tried to avoid waste of memory, which becomes a problem with multiple
ports and queues working in parallel.
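
Just to put a number on it: with NETDEV_MAX_BURST at 32 and a 64k aux buffer
per slot, the preallocation alone is already 32 * 64 kB = 2 MB per rx queue,
and that scales with every port and queue.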
William Tu Jan. 22, 2020, 6:33 p.m. UTC | #8
On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <fbl@sysclose.org> wrote:
>
>
> Hi Ben,
>
> Thanks for reviewing it!
>
> On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > Thanks all for review/testing, pushed to master.
> > >
> > > OK, thanks Ian.
> > >
> > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > at the code in case you'll spot some issues especially in non-DPDK related
> > > parts.
> >
> > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > name suggested to me "test whether the packet is IPv4" not "test whether
> > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > much more, it makes it obvious at a glance that it's a
> > checksum-offloading check.
>
> hwol = hardware offloading. I hear that all the time, but maybe there is a
> better name. I will improve that if no one gets on it first.
>
> > In the case where we actually receive a 64 kB packet, I think that this
> > code is going to be relatively inefficient.  If I'm reading the code
> > correctly (I did it quickly), then this is what happens:
> >
> >         - The first 1500 bytes of the packet land in the first
> >           dp_packet.
> >
> >         - The remaining 64000ish bytes land in the second dp_packet.

It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).

struct netdev_rxq_linux {
    struct netdev_rxq up;
    bool is_tap;
    int fd;
    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
};

> >
> >         - Then we expand the first dp_packet to the needed size and copy
> >           the remaining 64000 bytes into it.
>
> That's correct.
>
> > An alternative would be:
> >
> >         - Set up the first dp_packet as currently.
> >
> >         - Set up the second dp_packet so that the bytes are received
> >           into it starting at offset (mtu + headroom).
> >
> >         - If more than mtu bytes are received, then copy those bytes
> >           into the headroom of the second dp_packet and return it to the
> >           caller instead of the first dp_packet.
>
> I wanted to avoid doing more extensive processing if it's not a TSO packet
> to avoid performance regressions since it' very sensitive. Right now the 64k
> buffer is preallocated and is static for each queue to avoid the malloc
> performance issue. Now for TSO case, we have more time per packet for
> processing.

Can we implement Ben's idea as follows (rough sketch after the list)?
1) set the size of aux_buf to 64k + mtu
2) create a 2nd dp_packet using this aux_buf and copy the first packet to
the first mtu bytes of aux_buf
3) since we steal this aux_buf, allocate a new aux_buf by
rxq->aux_bufs[i] = xmalloc(64k + mtu)
4) free the first dp_packet, and use the second dp_packet
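
Something like this for the large-packet branch (untested, just to show the
idea; it reuses the names from the patch hunk and assumes the aux_bufs are
now allocated with std_len + 64k bytes):

    if (retval > std_len) {
        struct dp_packet *big = xmalloc(sizeof *big);
        char *aux = rx->aux_bufs[i];

        memcpy(aux, dp_packet_data(buffer), std_len);               /* 2) */
        dp_packet_use(big, aux, std_len + LINUX_RXQ_TSO_MAX_LEN);
        dp_packet_set_size(big, retval);
        rx->aux_bufs[i] = xmalloc(std_len + LINUX_RXQ_TSO_MAX_LEN); /* 3) */
        dp_packet_delete(buffer);                                   /* 4) */
        buffer = big;
    }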

Regards,
William
>
> > The advantage is that we do a 1500-byte copy instead of a 64000-byte
> > copy.  The disadvantage is that we waste any memory leftover in the
> > second dp_packet, e.g. 32 kB if it's only a 32 kB packet instead of 64
> > kB.  Also we need slightly more sophisticated dp_packet allocation (we
> > will need to replenish the supply of aux_bufs).
>
> I also tried to avoid waste of memory, which becomes a problem with multiple
> ports and queues working in parallel.
>
> --
> fbl
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Flavio Leitner Jan. 24, 2020, 2:40 p.m. UTC | #9
On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <fbl@sysclose.org> wrote:
> >
> >
> > Hi Ben,
> >
> > Thanks for reviewing it!
> >
> > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > Thanks all for review/testing, pushed to master.
> > > >
> > > > OK, thanks Ian.
> > > >
> > > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > > at the code in case you'll spot some issues especially in non-DPDK related
> > > > parts.
> > >
> > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > much more, it makes it obvious at a glance that it's a
> > > checksum-offloading check.
> >
> > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > better name. I will improve that if no one gets on it first.
> >
> > > In the case where we actually receive a 64 kB packet, I think that this
> > > code is going to be relatively inefficient.  If I'm reading the code
> > > correctly (I did it quickly), then this is what happens:
> > >
> > >         - The first 1500 bytes of the packet land in the first
> > >           dp_packet.
> > >
> > >         - The remaining 64000ish bytes land in the second dp_packet.
> 
> It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> 
> struct netdev_rxq_linux {
>     struct netdev_rxq up;
>     bool is_tap;
>     int fd;
>     char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> };
> 
> > >
> > >         - Then we expand the first dp_packet to the needed size and copy
> > >           the remaining 64000 bytes into it.
> >
> > That's correct.
> >
> > > An alternative would be:
> > >
> > >         - Set up the first dp_packet as currently.
> > >
> > >         - Set up the second dp_packet so that the bytes are received
> > >           into it starting at offset (mtu + headroom).
> > >
> > >         - If more than mtu bytes are received, then copy those bytes
> > >           into the headroom of the second dp_packet and return it to the
> > >           caller instead of the first dp_packet.
> >
> > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > to avoid performance regressions since it' very sensitive. Right now the 64k
> > buffer is preallocated and is static for each queue to avoid the malloc
> > performance issue. Now for TSO case, we have more time per packet for
> > processing.
> 
> Can we implement Ben's idea by
> 1) set size of aux_buf to 64k + mtu
> 2) create 2nd dp_packet using this aux_buf and copy first packet to
> first mtu bytes of aux_buf
> 3) since we steal this aux_bufs, allocate a new aux_buf by
> rxq->aux_bufs[i] = xmalloc(64k + mtu)
> 4) free the first dp_packet, and use the second dp_packet

I did a quick experiment while at the conference and Ben's idea is
indeed a bit faster (2.7%) when the packet is not resized due to #1.

If the buffer gets resized to what's actually used, then it becomes
a bit slower (1.8%).

Anyways, feel free to have a look at the code[1]. Perhaps it could
be changed to be more efficient. Just send me a patch and I will be
happy to test again.

[1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
Thanks,
fbl
William Tu Jan. 24, 2020, 6:17 p.m. UTC | #10
On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner <fbl@sysclose.org> wrote:
>
> On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <fbl@sysclose.org> wrote:
> > >
> > >
> > > Hi Ben,
> > >
> > > Thanks for reviewing it!
> > >
> > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > Thanks all for review/testing, pushed to master.
> > > > >
> > > > > OK, thanks Ian.
> > > > >
> > > > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > > > at the code in case you'll spot some issues especially in non-DPDK related
> > > > > parts.
> > > >
> > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > > much more, it makes it obvious at a glance that it's a
> > > > checksum-offloading check.
> > >
> > > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > > better name. I will improve that if no one gets on it first.
> > >
> > > > In the case where we actually receive a 64 kB packet, I think that this
> > > > code is going to be relatively inefficient.  If I'm reading the code
> > > > correctly (I did it quickly), then this is what happens:
> > > >
> > > >         - The first 1500 bytes of the packet land in the first
> > > >           dp_packet.
> > > >
> > > >         - The remaining 64000ish bytes land in the second dp_packet.
> >
> > It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> >
> > struct netdev_rxq_linux {
> >     struct netdev_rxq up;
> >     bool is_tap;
> >     int fd;
> >     char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> > };
> >
> > > >
> > > >         - Then we expand the first dp_packet to the needed size and copy
> > > >           the remaining 64000 bytes into it.
> > >
> > > That's correct.
> > >
> > > > An alternative would be:
> > > >
> > > >         - Set up the first dp_packet as currently.
> > > >
> > > >         - Set up the second dp_packet so that the bytes are received
> > > >           into it starting at offset (mtu + headroom).
> > > >
> > > >         - If more than mtu bytes are received, then copy those bytes
> > > >           into the headroom of the second dp_packet and return it to the
> > > >           caller instead of the first dp_packet.
> > >
> > > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > > to avoid performance regressions since it' very sensitive. Right now the 64k
> > > buffer is preallocated and is static for each queue to avoid the malloc
> > > performance issue. Now for TSO case, we have more time per packet for
> > > processing.
> >
> > Can we implement Ben's idea by
> > 1) set size of aux_buf to 64k + mtu
> > 2) create 2nd dp_packet using this aux_buf and copy first packet to
> > first mtu bytes of aux_buf
> > 3) since we steal this aux_bufs, allocate a new aux_buf by
> > rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > 4) free the first dp_packet, and use the second dp_packet
>
> I did a quick experiment while at the conference and Ben's idea is
> indeed a bit faster (2.7%) when the packet is not resized due to #1.
>
> If the buffer gets resized to what's actually used, then it becomes
> a bit slower (1.8%).

Do we have to resize it?

>
> Anyways, feel free to have a look at the code[1]. Perhaps it could
> be changed to be more efficient. Just send me a patch and I will be
> happy to test again.
>
> [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben

Thanks!

I tested it by applying
https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
Implemented Ben suggestion.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Using
    iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)

Test 100 second TCP

without the patch
[  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec

with the patch
[  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec

I think it's pretty good improvement!
Regards,
William
Flavio Leitner Jan. 24, 2020, 9:38 p.m. UTC | #11
On Fri, Jan 24, 2020 at 10:17:10AM -0800, William Tu wrote:
> On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner <fbl@sysclose.org> wrote:
> >
> > On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <fbl@sysclose.org> wrote:
> > > >
> > > >
> > > > Hi Ben,
> > > >
> > > > Thanks for reviewing it!
> > > >
> > > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > > Thanks all for review/testing, pushed to master.
> > > > > >
> > > > > > OK, thanks Ian.
> > > > > >
> > > > > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > > > > at the code in case you'll spot some issues especially in non-DPDK related
> > > > > > parts.
> > > > >
> > > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > > > much more, it makes it obvious at a glance that it's a
> > > > > checksum-offloading check.
> > > >
> > > > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > > > better name. I will improve that if no one gets on it first.
> > > >
> > > > > In the case where we actually receive a 64 kB packet, I think that this
> > > > > code is going to be relatively inefficient.  If I'm reading the code
> > > > > correctly (I did it quickly), then this is what happens:
> > > > >
> > > > >         - The first 1500 bytes of the packet land in the first
> > > > >           dp_packet.
> > > > >
> > > > >         - The remaining 64000ish bytes land in the second dp_packet.
> > >
> > > It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> > >
> > > struct netdev_rxq_linux {
> > >     struct netdev_rxq up;
> > >     bool is_tap;
> > >     int fd;
> > >     char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> > > };
> > >
> > > > >
> > > > >         - Then we expand the first dp_packet to the needed size and copy
> > > > >           the remaining 64000 bytes into it.
> > > >
> > > > That's correct.
> > > >
> > > > > An alternative would be:
> > > > >
> > > > >         - Set up the first dp_packet as currently.
> > > > >
> > > > >         - Set up the second dp_packet so that the bytes are received
> > > > >           into it starting at offset (mtu + headroom).
> > > > >
> > > > >         - If more than mtu bytes are received, then copy those bytes
> > > > >           into the headroom of the second dp_packet and return it to the
> > > > >           caller instead of the first dp_packet.
> > > >
> > > > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > > > to avoid performance regressions since it' very sensitive. Right now the 64k
> > > > buffer is preallocated and is static for each queue to avoid the malloc
> > > > performance issue. Now for TSO case, we have more time per packet for
> > > > processing.
> > >
> > > Can we implement Ben's idea by
> > > 1) set size of aux_buf to 64k + mtu
> > > 2) create 2nd dp_packet using this aux_buf and copy first packet to
> > > first mtu bytes of aux_buf
> > > 3) since we steal this aux_bufs, allocate a new aux_buf by
> > > rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > > 4) free the first dp_packet, and use the second dp_packet
> >
> > I did a quick experiment while at the conference and Ben's idea is
> > indeed a bit faster (2.7%) when the packet is not resized due to #1.
> >
> > If the buffer gets resized to what's actually used, then it becomes
> > a bit slower (1.8%).
> 
> Do we have to resize it?

Well, if there is congestion, the packet sizes will be proportional to
the TCP window size, which can be around ~4k or ~9k, while we are allocating
64k. Assuming many ports and many packets in parallel, that might be
a waste of memory.

> > Anyways, feel free to have a look at the code[1]. Perhaps it could
> > be changed to be more efficient. Just send me a patch and I will be
> > happy to test again.
> >
> > [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
> 
> Thanks!
> 
> I tested it by applying
> https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
> Implemented Ben suggestion.
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> 
> Using
>     iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
> 
> Test 100 second TCP
> 
> without the patch
> [  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec
> 
> with the patch
> [  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec
> 
> I think it's pretty good improvement!

I agree. Could you test with the resize also (next patch in that
branch) and see how the numbers look?

Thanks,
William Tu Jan. 24, 2020, 11:06 p.m. UTC | #12
On Fri, Jan 24, 2020 at 1:38 PM Flavio Leitner <fbl@sysclose.org> wrote:
>
> On Fri, Jan 24, 2020 at 10:17:10AM -0800, William Tu wrote:
> > On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner <fbl@sysclose.org> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > > > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <fbl@sysclose.org> wrote:
> > > > >
> > > > >
> > > > > Hi Ben,
> > > > >
> > > > > Thanks for reviewing it!
> > > > >
> > > > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > > > Thanks all for review/testing, pushed to master.
> > > > > > >
> > > > > > > OK, thanks Ian.
> > > > > > >
> > > > > > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > > > > > at the code in case you'll spot some issues especially in non-DPDK related
> > > > > > > parts.
> > > > > >
> > > > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > > > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > > > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > > > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > > > > much more, it makes it obvious at a glance that it's a
> > > > > > checksum-offloading check.
> > > > >
> > > > > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > > > > better name. I will improve that if no one gets on it first.
> > > > >
> > > > > > In the case where we actually receive a 64 kB packet, I think that this
> > > > > > code is going to be relatively inefficient.  If I'm reading the code
> > > > > > correctly (I did it quickly), then this is what happens:
> > > > > >
> > > > > >         - The first 1500 bytes of the packet land in the first
> > > > > >           dp_packet.
> > > > > >
> > > > > >         - The remaining 64000ish bytes land in the second dp_packet.
> > > >
> > > > It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> > > >
> > > > struct netdev_rxq_linux {
> > > >     struct netdev_rxq up;
> > > >     bool is_tap;
> > > >     int fd;
> > > >     char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> > > > };
> > > >
> > > > > >
> > > > > >         - Then we expand the first dp_packet to the needed size and copy
> > > > > >           the remaining 64000 bytes into it.
> > > > >
> > > > > That's correct.
> > > > >
> > > > > > An alternative would be:
> > > > > >
> > > > > >         - Set up the first dp_packet as currently.
> > > > > >
> > > > > >         - Set up the second dp_packet so that the bytes are received
> > > > > >           into it starting at offset (mtu + headroom).
> > > > > >
> > > > > >         - If more than mtu bytes are received, then copy those bytes
> > > > > >           into the headroom of the second dp_packet and return it to the
> > > > > >           caller instead of the first dp_packet.
> > > > >
> > > > > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > > > > to avoid performance regressions since it' very sensitive. Right now the 64k
> > > > > buffer is preallocated and is static for each queue to avoid the malloc
> > > > > performance issue. Now for TSO case, we have more time per packet for
> > > > > processing.
> > > >
> > > > Can we implement Ben's idea by
> > > > 1) set size of aux_buf to 64k + mtu
> > > > 2) create 2nd dp_packet using this aux_buf and copy first packet to
> > > > first mtu bytes of aux_buf
> > > > 3) since we steal this aux_bufs, allocate a new aux_buf by
> > > > rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > > > 4) free the first dp_packet, and use the second dp_packet
> > >
> > > I did a quick experiment while at the conference and Ben's idea is
> > > indeed a bit faster (2.7%) when the packet is not resized due to #1.
> > >
> > > If the buffer gets resized to what's actually used, then it becomes
> > > a bit slower (1.8%).
> >
> > Do we have to resize it?
>
> Well, if there is congestion the packets will get proportional to
> the TCP window size which can be like ~4k, ~9k, while it is allocating
> 64k. Assuming many ports and many packets in parallel, that might be
> a waste of memory.
>
> > > Anyways, feel free to have a look at the code[1]. Perhaps it could
> > > be changed to be more efficient. Just send me a patch and I will be
> > > happy to test again.
> > >
> > > [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
> >
> > Thanks!
> >
> > I tested it by applying
> > https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
> > Implemented Ben suggestion.
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> >
> > Using
> >     iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
> >
> > Test 100 second TCP
> >
> > without the patch
> > [  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec
> >
> > with the patch
> > [  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec
> >
> > I think it's pretty good improvement!
>
OK, I applied "Resize the large packet to the exact size"

The performance is
[  3]  0.0-100.0 sec  93.4 GBytes  8.02 Gbits/sec

Still pretty good. If you want to give it a try:

#!/bin/bash
ovs-vswitchd --no-chdir --pidfile
--log-file=/root/ovs/ovs-vswitchd.log --disable-system --detach
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3
ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true

ip netns add at_ns0
ip link add p0 type veth peer name afxdp-p0
ip link set p0 netns at_ns0
ip link set dev afxdp-p0 up
ovs-vsctl add-port br0 afxdp-p0

ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
ip addr add "10.1.1.1/24" dev p0
ip link set dev p0 up
NS_EXEC_HEREDOC

ip netns add at_ns1
ip link add p1 type veth peer name afxdp-p1
ip link set p1 netns at_ns1
ip link set dev afxdp-p1 up
ovs-vsctl add-port br0 afxdp-p1

ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
ip addr add "10.1.1.2/24" dev p1
ip link set dev p1 up
NS_EXEC_HEREDOC

Then iperf....

Regards,
William
Flavio Leitner Jan. 28, 2020, 2:40 p.m. UTC | #13
On Fri, Jan 24, 2020 at 03:06:47PM -0800, William Tu wrote:
> On Fri, Jan 24, 2020 at 1:38 PM Flavio Leitner <fbl@sysclose.org> wrote:
> >
> > On Fri, Jan 24, 2020 at 10:17:10AM -0800, William Tu wrote:
> > > On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner <fbl@sysclose.org> wrote:
> > > >
> > > > On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > > > > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <fbl@sysclose.org> wrote:
> > > > > >
> > > > > >
> > > > > > Hi Ben,
> > > > > >
> > > > > > Thanks for reviewing it!
> > > > > >
> > > > > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > > > > Thanks all for review/testing, pushed to master.
> > > > > > > >
> > > > > > > > OK, thanks Ian.
> > > > > > > >
> > > > > > > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > > > > > > at the code in case you'll spot some issues especially in non-DPDK related
> > > > > > > > parts.
> > > > > > >
> > > > > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > > > > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > > > > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > > > > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > > > > > much more, it makes it obvious at a glance that it's a
> > > > > > > checksum-offloading check.
> > > > > >
> > > > > > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > > > > > better name. I will improve that if no one gets on it first.
> > > > > >
> > > > > > > In the case where we actually receive a 64 kB packet, I think that this
> > > > > > > code is going to be relatively inefficient.  If I'm reading the code
> > > > > > > correctly (I did it quickly), then this is what happens:
> > > > > > >
> > > > > > >         - The first 1500 bytes of the packet land in the first
> > > > > > >           dp_packet.
> > > > > > >
> > > > > > >         - The remaining 64000ish bytes land in the second dp_packet.
> > > > >
> > > > > It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> > > > >
> > > > > struct netdev_rxq_linux {
> > > > >     struct netdev_rxq up;
> > > > >     bool is_tap;
> > > > >     int fd;
> > > > >     char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> > > > > };
> > > > >
> > > > > > >
> > > > > > >         - Then we expand the first dp_packet to the needed size and copy
> > > > > > >           the remaining 64000 bytes into it.
> > > > > >
> > > > > > That's correct.
> > > > > >
> > > > > > > An alternative would be:
> > > > > > >
> > > > > > >         - Set up the first dp_packet as currently.
> > > > > > >
> > > > > > >         - Set up the second dp_packet so that the bytes are received
> > > > > > >           into it starting at offset (mtu + headroom).
> > > > > > >
> > > > > > >         - If more than mtu bytes are received, then copy those bytes
> > > > > > >           into the headroom of the second dp_packet and return it to the
> > > > > > >           caller instead of the first dp_packet.
> > > > > >
> > > > > > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > > > > > to avoid performance regressions since it' very sensitive. Right now the 64k
> > > > > > buffer is preallocated and is static for each queue to avoid the malloc
> > > > > > performance issue. Now for TSO case, we have more time per packet for
> > > > > > processing.
> > > > >
> > > > > Can we implement Ben's idea by
> > > > > 1) set size of aux_buf to 64k + mtu
> > > > > 2) create 2nd dp_packet using this aux_buf and copy first packet to
> > > > > first mtu bytes of aux_buf
> > > > > 3) since we steal this aux_bufs, allocate a new aux_buf by
> > > > > rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > > > > 4) free the first dp_packet, and use the second dp_packet
> > > >
> > > > I did a quick experiment while at the conference and Ben's idea is
> > > > indeed a bit faster (2.7%) when the packet is not resized due to #1.
> > > >
> > > > If the buffer gets resized to what's actually used, then it becomes
> > > > a bit slower (1.8%).
> > >
> > > Do we have to resize it?
> >
> > Well, if there is congestion the packets will get proportional to
> > the TCP window size which can be like ~4k, ~9k, while it is allocating
> > 64k. Assuming many ports and many packets in parallel, that might be
> > a waste of memory.
> >
> > > > Anyways, feel free to have a look at the code[1]. Perhaps it could
> > > > be changed to be more efficient. Just send me a patch and I will be
> > > > happy to test again.
> > > >
> > > > [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
> > >
> > > Thanks!
> > >
> > > I tested it by applying
> > > https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
> > > Implemented Ben suggestion.
> > > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > >
> > > Using
> > >     iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
> > >
> > > Test 100 second TCP
> > >
> > > without the patch
> > > [  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec
> > >
> > > with the patch
> > > [  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec
> > >
> > > I think it's pretty good improvement!
> >
> OK, I applied "Resize the large packet to the exact size"
> 
> The performance is
> [  3]  0.0-100.0 sec  93.4 GBytes  8.02 Gbits/sec
> 
> Still pretty good. If you want to give it a try:
> 
> #!/bin/bash
> ovs-vswitchd --no-chdir --pidfile
> --log-file=/root/ovs/ovs-vswitchd.log --disable-system --detach
> ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3
> ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
> 
> ip netns add at_ns0
> ip link add p0 type veth peer name afxdp-p0
> ip link set p0 netns at_ns0
> ip link set dev afxdp-p0 up
> ovs-vsctl add-port br0 afxdp-p0
> 
> ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> ip addr add "10.1.1.1/24" dev p0
> ip link set dev p0 up
> NS_EXEC_HEREDOC
> 
> ip netns add at_ns1
> ip link add p1 type veth peer name afxdp-p1
> ip link set p1 netns at_ns1
> ip link set dev afxdp-p1 up
> ovs-vsctl add-port br0 afxdp-p1
> 
> ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> ip addr add "10.1.1.2/24" dev p1
> ip link set dev p1 up
> NS_EXEC_HEREDOC
> 
> Then iperf....

I was testing VM-to-netns, and moving to netns-to-netns helped a bit more.
It went from 5.6 Gbits/sec to 6.3 Gbits/sec (~ +12%), and using the
resizing patch didn't change much, similar to what happened in your test.

Ok, I will finish up the patch to move to Ben's idea plus resizing.

Thanks!
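
For reference, here is a minimal sketch of the receive path discussed above:
read into the regular packet buffer plus the preallocated aux_buf with a
single readv(), and only expand and copy when the kernel actually delivered
more than an MTU-sized frame.  The function and variable names below are
illustrative only; the actual patch further down works on struct dp_packet
and uses dp_packet_prealloc_tailroom() rather than realloc().

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

#define AUX_BUF_LEN 65536               /* Preallocated 64k TSO buffer. */

/* 'pkt_buf' is a malloc'ed buffer of 'std_len' bytes (headroom + MTU),
 * 'aux_buf' is the per-queue 64k scratch buffer.  On success, '*out' points
 * to a single linear buffer holding the whole frame and '*out_len' is its
 * length. */
static ssize_t
recv_possible_tso(int fd, char *pkt_buf, size_t std_len,
                  char *aux_buf, char **out, size_t *out_len)
{
    struct iovec iov[2] = {
        { .iov_base = pkt_buf, .iov_len = std_len },     /* MTU-sized part. */
        { .iov_base = aux_buf, .iov_len = AUX_BUF_LEN }, /* TSO overflow.   */
    };
    ssize_t n;

    do {
        n = readv(fd, iov, 2);
    } while (n < 0 && errno == EINTR);

    if (n < 0) {
        return -errno;
    }

    if ((size_t) n <= std_len) {
        /* Common case: an ordinary MTU-sized frame, no extra copy needed. */
        *out = pkt_buf;
        *out_len = n;
        return n;
    }

    /* TSO case: grow the packet to the size actually received and append
     * the overflow, so callers always see one linear buffer. */
    char *big = realloc(pkt_buf, n);
    if (!big) {
        return -ENOMEM;
    }
    memcpy(big + std_len, aux_buf, n - std_len);
    *out = big;
    *out_len = n;
    return n;
}

The copy only happens for oversized (TSO) frames, which is why the common
MTU-sized case stays as fast as before, and resizing to the received length
avoids holding on to 64k per packet under congestion.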

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index f2ca17bad..22976a3cd 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -57,6 +57,7 @@  DOC_SOURCE = \
 	Documentation/topics/ovsdb-replication.rst \
 	Documentation/topics/porting.rst \
 	Documentation/topics/tracing.rst \
+	Documentation/topics/userspace-tso.rst \
 	Documentation/topics/windows.rst \
 	Documentation/howto/index.rst \
 	Documentation/howto/dpdk.rst \
diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
index 34c4b10e0..08af3a24d 100644
--- a/Documentation/topics/index.rst
+++ b/Documentation/topics/index.rst
@@ -50,5 +50,6 @@  OVS
    language-bindings
    testing
    tracing
+   userspace-tso
    idl-compound-indexes
    ovs-extensions
diff --git a/Documentation/topics/userspace-tso.rst b/Documentation/topics/userspace-tso.rst
new file mode 100644
index 000000000..893c64839
--- /dev/null
+++ b/Documentation/topics/userspace-tso.rst
@@ -0,0 +1,98 @@ 
+..
+      Copyright 2020, Red Hat, Inc.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+========================
+Userspace Datapath - TSO
+========================
+
+**Note:** This feature is considered experimental.
+
+TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
+of an oversized TCP segment to the underlying physical NIC. Offload of frame
+segmentation achieves computational savings in the core, freeing up CPU cycles
+for more useful work.
+
+A common use case for TSO is when using virtualization, where traffic that's
+coming in from a VM can offload the TCP segmentation, thus avoiding the
+fragmentation in software. Additionally, if the traffic is headed to a VM
+within the same host, further optimization can be expected. As the traffic
+never leaves the machine, no MTU needs to be accounted for, and thus no
+segmentation and checksum calculations are required, which saves yet more
+cycles. Only when the traffic actually leaves the host does the segmentation
+need to happen, in which case it will be performed by the egress NIC. First,
+consult your controller's datasheet for compatibility. Second, the NIC must
+have an associated DPDK Poll Mode Driver (PMD) which supports `TSO`. For a
+list of features per PMD, refer to the `DPDK documentation`__.
+
+__ https://doc.dpdk.org/guides-19.11/nics/overview.html
+
+Enabling TSO
+~~~~~~~~~~~~
+
+The TSO support may be enabled via a global config value
+``userspace-tso-enable``.  Setting this to ``true`` enables TSO support for
+all ports::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
+
+The default value is ``false``.
+
+Changing ``userspace-tso-enable`` requires restarting the daemon.
+
+When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled
+as follows.
+
+`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
+connection is established, `TSO` is thus advertised to the guest as an
+available feature:
+
+1. QEMU Command Line Parameter::
+
+    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
+    ...
+    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
+    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
+    ...
+
+2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
+used to enable the same::
+
+    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
+    $ ethtool -K eth0 tso on
+    $ ethtool -k eth0
+
+~~~~~~~~~~~
+Limitations
+~~~~~~~~~~~
+
+The current OvS userspace `TSO` implementation supports flat and VLAN networks
+only (i.e. no support for `TSO` over tunneled connections [VxLAN, GRE, IPinIP,
+etc.]).
+
+There is no software implementation of TSO, so all ports attached to the
+datapath must support TSO or packets using that feature will be dropped
+on ports without TSO support.  That also means guests using vhost-user
+in client mode will receive TSO packets regardless of TSO being enabled
+or disabled within the guest.
diff --git a/NEWS b/NEWS
index 579e91c89..c6d3b6053 100644
--- a/NEWS
+++ b/NEWS
@@ -30,6 +30,7 @@  Post-v2.12.0
      * Add support for DPDK 19.11.
      * Add hardware offload support for output, drop, set of MAC, IPv4 and
        TCP/UDP ports actions (experimental).
+     * Add experimental support for TSO.
    - RSTP:
      * The rstp_statistics column in Port table will only be updated every
        stats-update-interval configured in Open_vSwitch table.
diff --git a/lib/automake.mk b/lib/automake.mk
index ebf714501..95925b57c 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -314,6 +314,8 @@  lib_libopenvswitch_la_SOURCES = \
 	lib/unicode.h \
 	lib/unixctl.c \
 	lib/unixctl.h \
+	lib/userspace-tso.c \
+	lib/userspace-tso.h \
 	lib/util.c \
 	lib/util.h \
 	lib/uuid.c \
diff --git a/lib/conntrack.c b/lib/conntrack.c
index b80080e72..60222ca53 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -2022,7 +2022,8 @@  conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
         if (hwol_bad_l3_csum) {
             ok = false;
         } else {
-            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
+            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
+                                     || dp_packet_hwol_is_ipv4(pkt);
             /* Validate the checksum only when hwol is not supported. */
             ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
                                  !hwol_good_l3_csum);
@@ -2036,7 +2037,8 @@  conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
     if (ok) {
         bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
         if (!hwol_bad_l4_csum) {
-            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
+            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
+                                      || dp_packet_hwol_tx_l4_checksum(pkt);
             /* Validate the checksum only when hwol is not supported. */
             if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
                            &ctx->icmp_related, l3, !hwol_good_l4_csum,
@@ -3237,8 +3239,11 @@  handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
                 }
                 if (seq_skew) {
                     ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
-                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
-                                          l3_hdr->ip_tot_len, htons(ip_len));
+                    if (!dp_packet_hwol_is_ipv4(pkt)) {
+                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
+                                                        l3_hdr->ip_tot_len,
+                                                        htons(ip_len));
+                    }
                     l3_hdr->ip_tot_len = htons(ip_len);
                 }
             }
@@ -3256,13 +3261,15 @@  handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
     }
 
     th->tcp_csum = 0;
-    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
-        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
-                           dp_packet_l4_size(pkt));
-    } else {
-        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
-        th->tcp_csum = csum_finish(
-             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
+    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
+        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
+            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
+                               dp_packet_l4_size(pkt));
+        } else {
+            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
+            th->tcp_csum = csum_finish(
+                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
+        }
     }
 
     if (seq_skew) {
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 133942155..69ae5dfac 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -456,7 +456,7 @@  dp_packet_init_specific(struct dp_packet *p)
 {
     /* This initialization is needed for packets that do not come from DPDK
      * interfaces, when vswitchd is built with --with-dpdk. */
-    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
+    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
     p->mbuf.nb_segs = 1;
     p->mbuf.next = NULL;
 }
@@ -519,6 +519,95 @@  dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
     b->mbuf.buf_len = s;
 }
 
+/* Returns 'true' if packet 'b' is marked for TCP segmentation offloading. */
+static inline bool
+dp_packet_hwol_is_tso(const struct dp_packet *b)
+{
+    return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG);
+}
+
+/* Returns 'true' if packet 'b' is marked for IPv4 checksum offloading. */
+static inline bool
+dp_packet_hwol_is_ipv4(const struct dp_packet *b)
+{
+    return !!(b->mbuf.ol_flags & PKT_TX_IPV4);
+}
+
+/* Returns the L4 cksum offload bitmask. */
+static inline uint64_t
+dp_packet_hwol_l4_mask(const struct dp_packet *b)
+{
+    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
+}
+
+/* Returns 'true' if packet 'b' is marked for TCP checksum offloading. */
+static inline bool
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM;
+}
+
+/* Returns 'true' if packet 'b' is marked for UDP checksum offloading. */
+static inline bool
+dp_packet_hwol_l4_is_udp(struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM;
+}
+
+/* Returns 'true' if packet 'b' is marked for SCTP checksum offloading. */
+static inline bool
+dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM;
+}
+
+/* Mark packet 'b' for IPv4 checksum offloading. */
+static inline void
+dp_packet_hwol_set_tx_ipv4(struct dp_packet *b)
+{
+    b->mbuf.ol_flags |= PKT_TX_IPV4;
+}
+
+/* Mark packet 'b' for IPv6 checksum offloading. */
+static inline void
+dp_packet_hwol_set_tx_ipv6(struct dp_packet *b)
+{
+    b->mbuf.ol_flags |= PKT_TX_IPV6;
+}
+
+/* Mark packet 'b' for TCP checksum offloading.  It implies that the
+ * packet 'b' is also marked for either IPv4 or IPv6 checksum offloading. */
+static inline void
+dp_packet_hwol_set_csum_tcp(struct dp_packet *b)
+{
+    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
+}
+
+/* Mark packet 'b' for UDP checksum offloading.  It implies that the
+ * packet 'b' is also marked for either IPv4 or IPv6 checksum offloading. */
+static inline void
+dp_packet_hwol_set_csum_udp(struct dp_packet *b)
+{
+    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
+}
+
+/* Mark packet 'b' for SCTP checksum offloading.  It implies that the
+ * packet 'b' is also marked for either IPv4 or IPv6 checksum offloading. */
+static inline void
+dp_packet_hwol_set_csum_sctp(struct dp_packet *b)
+{
+    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
+}
+
+/* Mark packet 'b' for TCP segmentation offloading.  It implies that the
+ * packet 'b' is also marked for either IPv4 or IPv6 checksum offloading
+ * as well as for TCP checksum offloading. */
+static inline void
+dp_packet_hwol_set_tcp_seg(struct dp_packet *b)
+{
+    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
+}
+
 /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
  * correct only if 'dp_packet_rss_valid(p)' returns true */
 static inline uint32_t
@@ -648,6 +737,84 @@  dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
     b->allocated_ = s;
 }
 
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline bool
+dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline bool
+dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline uint64_t
+dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
+{
+    return 0;
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline bool
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline bool
+dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline bool
+dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline void
+dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED)
+{
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline void
+dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED)
+{
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline void
+dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED)
+{
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline void
+dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED)
+{
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline void
+dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED)
+{
+}
+
+/* There is no implementation when the datapath is not built with DPDK. */
+static inline void
+dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED)
+{
+}
+
 /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
  * correct only if 'dp_packet_rss_valid(p)' returns true */
 static inline uint32_t
@@ -939,6 +1106,13 @@  dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
     }
 }
 
+/* Return true if the packet 'b' requested L4 checksum offload. */
+static inline bool
+dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
+{
+    return !!dp_packet_hwol_l4_mask(b);
+}
+
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/ipf.c b/lib/ipf.c
index 45c489122..446e89d13 100644
--- a/lib/ipf.c
+++ b/lib/ipf.c
@@ -433,9 +433,11 @@  ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
     len += rest_len;
     l3 = dp_packet_l3(pkt);
     ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
-    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
-                                new_ip_frag_off);
-    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+    if (!dp_packet_hwol_is_ipv4(pkt)) {
+        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
+                                    new_ip_frag_off);
+        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+    }
     l3->ip_tot_len = htons(len);
     l3->ip_frag_off = new_ip_frag_off;
     dp_packet_set_l2_pad_size(pkt, 0);
@@ -606,6 +608,7 @@  ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
     }
 
     if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
+                     && !dp_packet_hwol_is_ipv4(pkt)
                      && csum(l3, ip_hdr_len) != 0)) {
         goto invalid_pkt;
     }
@@ -1181,16 +1184,21 @@  ipf_post_execute_reass_pkts(struct ipf *ipf,
                 } else {
                     struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
                     struct ip_header *l3_reass = dp_packet_l3(pkt);
-                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
-                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
-                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
-                                                     frag_ip, reass_ip);
-                    l3_frag->ip_src = l3_reass->ip_src;
+                    if (!dp_packet_hwol_is_ipv4(frag_0->pkt)) {
+                        ovs_be32 reass_ip =
+                            get_16aligned_be32(&l3_reass->ip_src);
+                        ovs_be32 frag_ip =
+                            get_16aligned_be32(&l3_frag->ip_src);
+
+                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+                                                         frag_ip, reass_ip);
+                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
+                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
+                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+                                                         frag_ip, reass_ip);
+                    }
 
-                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
-                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
-                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
-                                                     frag_ip, reass_ip);
+                    l3_frag->ip_src = l3_reass->ip_src;
                     l3_frag->ip_dst = l3_reass->ip_dst;
                 }
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index d1469f6f2..b108cbd6b 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -72,6 +72,7 @@ 
 #include "timeval.h"
 #include "unaligned.h"
 #include "unixctl.h"
+#include "userspace-tso.h"
 #include "util.h"
 #include "uuid.h"
 
@@ -201,6 +202,8 @@  struct netdev_dpdk_sw_stats {
     uint64_t tx_qos_drops;
     /* Packet drops in ingress policer processing. */
     uint64_t rx_qos_drops;
+    /* Packet drops in HWOL processing. */
+    uint64_t tx_invalid_hwol_drops;
 };
 
 enum { DPDK_RING_SIZE = 256 };
@@ -410,7 +413,8 @@  struct ingress_policer {
 enum dpdk_hw_ol_features {
     NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
     NETDEV_RX_HW_CRC_STRIP = 1 << 1,
-    NETDEV_RX_HW_SCATTER = 1 << 2
+    NETDEV_RX_HW_SCATTER = 1 << 2,
+    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
 };
 
 /*
@@ -992,6 +996,12 @@  dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
         conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
     }
 
+    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
+        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
+        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
+    }
+
     /* Limit configured rss hash functions to only those supported
      * by the eth device. */
     conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
@@ -1093,6 +1103,9 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
     uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
                                      DEV_RX_OFFLOAD_TCP_CKSUM |
                                      DEV_RX_OFFLOAD_IPV4_CKSUM;
+    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
+                                   DEV_TX_OFFLOAD_TCP_CKSUM |
+                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
 
     rte_eth_dev_info_get(dev->port_id, &info);
 
@@ -1119,6 +1132,14 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
         dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
     }
 
+    if (info.tx_offload_capa & tx_tso_offload_capa) {
+        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+    } else {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+        VLOG_WARN("Tx TSO offload is not supported on %s port "
+                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
+    }
+
     n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
     n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
 
@@ -1369,14 +1390,16 @@  netdev_dpdk_vhost_construct(struct netdev *netdev)
         goto out;
     }
 
-    err = rte_vhost_driver_disable_features(dev->vhost_id,
-                                1ULL << VIRTIO_NET_F_HOST_TSO4
-                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
-                                | 1ULL << VIRTIO_NET_F_CSUM);
-    if (err) {
-        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
-                 "port: %s\n", name);
-        goto out;
+    if (!userspace_tso_enabled()) {
+        err = rte_vhost_driver_disable_features(dev->vhost_id,
+                                    1ULL << VIRTIO_NET_F_HOST_TSO4
+                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
+                                    | 1ULL << VIRTIO_NET_F_CSUM);
+        if (err) {
+            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
+                     "port: %s\n", name);
+            goto out;
+        }
     }
 
     err = rte_vhost_driver_start(dev->vhost_id);
@@ -1711,6 +1734,11 @@  netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
         } else {
             smap_add(args, "rx_csum_offload", "false");
         }
+        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+            smap_add(args, "tx_tso_offload", "true");
+        } else {
+            smap_add(args, "tx_tso_offload", "false");
+        }
         smap_add(args, "lsc_interrupt_mode",
                  dev->lsc_interrupt_mode ? "true" : "false");
     }
@@ -2138,6 +2166,67 @@  netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
     rte_free(rx);
 }
 
+/* Prepare the packet for HWOL.
+ * Returns 'true' if the packet is valid and can be transmitted. */
+static bool
+netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
+{
+    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
+
+    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
+        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
+        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
+        mbuf->outer_l2_len = 0;
+        mbuf->outer_l3_len = 0;
+    }
+
+    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
+        struct tcp_header *th = dp_packet_l4(pkt);
+
+        if (!th) {
+            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
+                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
+            return false;
+        }
+
+        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
+        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
+        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
+
+        if (mbuf->ol_flags & PKT_TX_IPV4) {
+            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
+        }
+    }
+    return true;
+}
+
+/* Prepare a batch for HWOL.
+ * Returns the number of valid packets remaining in the batch. */
+static int
+netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
+                            int pkt_cnt)
+{
+    int i = 0;
+    int cnt = 0;
+    struct rte_mbuf *pkt;
+
+    /* Prepare and filter bad HWOL packets. */
+    for (i = 0; i < pkt_cnt; i++) {
+        pkt = pkts[i];
+        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
+            rte_pktmbuf_free(pkt);
+            continue;
+        }
+
+        if (OVS_UNLIKELY(i != cnt)) {
+            pkts[cnt] = pkt;
+        }
+        cnt++;
+    }
+
+    return cnt;
+}
+
 /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
  * 'pkts', even in case of failure.
  *
@@ -2147,11 +2236,22 @@  netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
                          struct rte_mbuf **pkts, int cnt)
 {
     uint32_t nb_tx = 0;
+    uint16_t nb_tx_prep = cnt;
+
+    if (userspace_tso_enabled()) {
+        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
+        if (nb_tx_prep != cnt) {
+            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
+                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
+                         cnt, rte_strerror(rte_errno));
+        }
+    }
 
-    while (nb_tx != cnt) {
+    while (nb_tx != nb_tx_prep) {
         uint32_t ret;
 
-        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
+        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
+                               nb_tx_prep - nb_tx);
         if (!ret) {
             break;
         }
@@ -2437,11 +2537,14 @@  netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
     int cnt = 0;
     struct rte_mbuf *pkt;
 
+    /* Filter oversized packets, unless they are marked for TSO. */
     for (i = 0; i < pkt_cnt; i++) {
         pkt = pkts[i];
-        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
-            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
-                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
+        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
+            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
+            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
+                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
+                         dev->max_packet_len);
             rte_pktmbuf_free(pkt);
             continue;
         }
@@ -2463,7 +2566,8 @@  netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
 {
     int dropped = sw_stats_add->tx_mtu_exceeded_drops +
                   sw_stats_add->tx_qos_drops +
-                  sw_stats_add->tx_failure_drops;
+                  sw_stats_add->tx_failure_drops +
+                  sw_stats_add->tx_invalid_hwol_drops;
     struct netdev_stats *stats = &dev->stats;
     int sent = attempted - dropped;
     int i;
@@ -2482,6 +2586,7 @@  netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
         sw_stats->tx_failure_drops      += sw_stats_add->tx_failure_drops;
         sw_stats->tx_mtu_exceeded_drops += sw_stats_add->tx_mtu_exceeded_drops;
         sw_stats->tx_qos_drops          += sw_stats_add->tx_qos_drops;
+        sw_stats->tx_invalid_hwol_drops += sw_stats_add->tx_invalid_hwol_drops;
     }
 }
 
@@ -2513,8 +2618,15 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
         rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
 
+    sw_stats_add.tx_invalid_hwol_drops = cnt;
+    if (userspace_tso_enabled()) {
+        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
+    }
+
+    sw_stats_add.tx_invalid_hwol_drops -= cnt;
+    sw_stats_add.tx_mtu_exceeded_drops = cnt;
     cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
-    sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
+    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
 
     /* Check has QoS has been configured for the netdev */
     sw_stats_add.tx_qos_drops = cnt;
@@ -2562,6 +2674,120 @@  out:
     }
 }
 
+static void
+netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
+{
+    rte_free(opaque);
+}
+
+static struct rte_mbuf *
+dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
+{
+    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
+    struct rte_mbuf_ext_shared_info *shinfo = NULL;
+    uint16_t buf_len;
+    void *buf;
+
+    if (rte_pktmbuf_tailroom(pkt) >= sizeof *shinfo) {
+        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
+    } else {
+        total_len += sizeof *shinfo + sizeof(uintptr_t);
+        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+    }
+
+    if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
+        VLOG_ERR("Can't copy packet: too big %u", total_len);
+        return NULL;
+    }
+
+    buf_len = total_len;
+    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+    if (OVS_UNLIKELY(buf == NULL)) {
+        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
+        return NULL;
+    }
+
+    /* Initialize shinfo. */
+    if (shinfo) {
+        shinfo->free_cb = netdev_dpdk_extbuf_free;
+        shinfo->fcb_opaque = buf;
+        rte_mbuf_ext_refcnt_set(shinfo, 1);
+    } else {
+        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+                                                    netdev_dpdk_extbuf_free,
+                                                    buf);
+        if (OVS_UNLIKELY(shinfo == NULL)) {
+            rte_free(buf);
+            VLOG_ERR("Failed to initialize shared info for mbuf while "
+                     "attempting to attach an external buffer.");
+            return NULL;
+        }
+    }
+
+    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
+                              shinfo);
+    rte_pktmbuf_reset_headroom(pkt);
+
+    return pkt;
+}
+
+static struct rte_mbuf *
+dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
+{
+    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+    if (OVS_UNLIKELY(!pkt)) {
+        return NULL;
+    }
+
+    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
+        return pkt;
+    }
+
+    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
+        return pkt;
+    }
+
+    rte_pktmbuf_free(pkt);
+
+    return NULL;
+}
+
+static struct dp_packet *
+dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
+{
+    struct rte_mbuf *mbuf_dest;
+    struct dp_packet *pkt_dest;
+    uint32_t pkt_len;
+
+    pkt_len = dp_packet_size(pkt_orig);
+    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
+    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
+            return NULL;
+    }
+
+    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
+    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
+    dp_packet_set_size(pkt_dest, pkt_len);
+
+    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
+    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
+    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
+                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
+
+    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
+           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
+
+    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
+        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
+                                - (char *)dp_packet_eth(pkt_dest);
+        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
+                                - (char *) dp_packet_l3(pkt_dest);
+    }
+
+    return pkt_dest;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2575,7 +2801,7 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
     enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
 #endif
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
+    struct dp_packet *pkts[PKT_ARRAY_SIZE];
     struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
     uint32_t cnt = batch_cnt;
     uint32_t dropped = 0;
@@ -2596,34 +2822,30 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
         struct dp_packet *packet = batch->packets[i];
         uint32_t size = dp_packet_size(packet);
 
-        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
-            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
-                         size, dev->max_packet_len);
-
+        if (size > dev->max_packet_len
+            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
+            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
+                         dev->max_packet_len);
             mtu_drops++;
             continue;
         }
 
-        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
+        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
         if (OVS_UNLIKELY(!pkts[txcnt])) {
             dropped = cnt - i;
             break;
         }
 
-        /* We have to do a copy for now */
-        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-               dp_packet_data(packet), size);
-        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
-
         txcnt++;
     }
 
     if (OVS_LIKELY(txcnt)) {
         if (dev->type == DPDK_DEV_VHOST) {
-            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
-                                     txcnt);
+            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
         } else {
-            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
+            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
+                                                   (struct rte_mbuf **)pkts,
+                                                   txcnt);
         }
     }
 
@@ -2676,26 +2898,33 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
         dp_packet_delete_batch(batch, true);
     } else {
         struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
-        int tx_cnt, dropped;
-        int tx_failure, mtu_drops, qos_drops;
+        int dropped;
+        int tx_failure, mtu_drops, qos_drops, hwol_drops;
         int batch_cnt = dp_packet_batch_size(batch);
         struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
 
-        tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
-        mtu_drops = batch_cnt - tx_cnt;
-        qos_drops = tx_cnt;
-        tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
-        qos_drops -= tx_cnt;
+        hwol_drops = batch_cnt;
+        if (userspace_tso_enabled()) {
+            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
+        }
+        hwol_drops -= batch_cnt;
+        mtu_drops = batch_cnt;
+        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
+        mtu_drops -= batch_cnt;
+        qos_drops = batch_cnt;
+        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
+        qos_drops -= batch_cnt;
 
-        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt);
+        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt);
 
-        dropped = tx_failure + mtu_drops + qos_drops;
+        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
         if (OVS_UNLIKELY(dropped)) {
             rte_spinlock_lock(&dev->stats_lock);
             dev->stats.tx_dropped += dropped;
             sw_stats->tx_failure_drops += tx_failure;
             sw_stats->tx_mtu_exceeded_drops += mtu_drops;
             sw_stats->tx_qos_drops += qos_drops;
+            sw_stats->tx_invalid_hwol_drops += hwol_drops;
             rte_spinlock_unlock(&dev->stats_lock);
         }
     }
@@ -3011,7 +3240,8 @@  netdev_dpdk_get_sw_custom_stats(const struct netdev *netdev,
     SW_CSTAT(tx_failure_drops)       \
     SW_CSTAT(tx_mtu_exceeded_drops)  \
     SW_CSTAT(tx_qos_drops)           \
-    SW_CSTAT(rx_qos_drops)
+    SW_CSTAT(rx_qos_drops)           \
+    SW_CSTAT(tx_invalid_hwol_drops)
 
 #define SW_CSTAT(NAME) + 1
     custom_stats->size = SW_CSTATS;
@@ -4874,6 +5104,12 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
 
     rte_free(dev->tx_q);
     err = dpdk_eth_dev_init(dev);
+    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+    }
+
     dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
     if (!dev->tx_q) {
         err = ENOMEM;
@@ -4903,6 +5139,11 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
         dev->tx_q[0].map = 0;
     }
 
+    if (userspace_tso_enabled()) {
+        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+        VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
+    }
+
     netdev_dpdk_remap_txqs(dev);
 
     err = netdev_dpdk_mempool_configure(dev);
@@ -4975,6 +5216,11 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
             vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
         }
 
+        /* Enable External Buffers if TCP Segmentation Offload is enabled. */
+        if (userspace_tso_enabled()) {
+            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
+        }
+
         err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
         if (err) {
             VLOG_ERR("vhost-user device setup failure for device %s\n",
@@ -4999,14 +5245,20 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
             goto unlock;
         }
 
-        err = rte_vhost_driver_disable_features(dev->vhost_id,
-                                    1ULL << VIRTIO_NET_F_HOST_TSO4
-                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
-                                    | 1ULL << VIRTIO_NET_F_CSUM);
-        if (err) {
-            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
-                     "client port: %s\n", dev->up.name);
-            goto unlock;
+        if (userspace_tso_enabled()) {
+            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+        } else {
+            err = rte_vhost_driver_disable_features(dev->vhost_id,
+                                        1ULL << VIRTIO_NET_F_HOST_TSO4
+                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
+                                        | 1ULL << VIRTIO_NET_F_CSUM);
+            if (err) {
+                VLOG_ERR("rte_vhost_driver_disable_features failed for "
+                         "vhost user client port: %s\n", dev->up.name);
+                goto unlock;
+            }
         }
 
         err = rte_vhost_driver_start(dev->vhost_id);
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index f08159aa7..9dbc67658 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -27,6 +27,7 @@ 
 #include <stdint.h>
 #include <stdbool.h>
 
+#include "dp-packet.h"
 #include "netdev-afxdp.h"
 #include "netdev-afxdp-pool.h"
 #include "netdev-provider.h"
@@ -37,10 +38,13 @@ 
 
 struct netdev;
 
+#define LINUX_RXQ_TSO_MAX_LEN 65536
+
 struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
     int fd;
+    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
 };
 
 int netdev_linux_construct(struct netdev *);
@@ -92,6 +96,7 @@  struct netdev_linux {
     int tap_fd;
     bool present;               /* If the device is present in the namespace */
     uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+    uint64_t rx_dropped;        /* Packets dropped while recv from kernel. */
 
     /* LAG information. */
     bool is_lag_master;         /* True if the netdev is a LAG master. */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 41d1e9273..a4a666657 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -29,16 +29,18 @@ 
 #include <linux/filter.h>
 #include <linux/gen_stats.h>
 #include <linux/if_ether.h>
+#include <linux/if_packet.h>
 #include <linux/if_tun.h>
 #include <linux/types.h>
 #include <linux/ethtool.h>
 #include <linux/mii.h>
 #include <linux/rtnetlink.h>
 #include <linux/sockios.h>
+#include <linux/virtio_net.h>
 #include <sys/ioctl.h>
 #include <sys/socket.h>
+#include <sys/uio.h>
 #include <sys/utsname.h>
-#include <netpacket/packet.h>
 #include <net/if.h>
 #include <net/if_arp.h>
 #include <net/route.h>
@@ -75,6 +77,7 @@ 
 #include "timer.h"
 #include "unaligned.h"
 #include "openvswitch/vlog.h"
+#include "userspace-tso.h"
 #include "util.h"
 
 VLOG_DEFINE_THIS_MODULE(netdev_linux);
@@ -237,6 +240,16 @@  enum {
     VALID_DRVINFO           = 1 << 6,
     VALID_FEATURES          = 1 << 7,
 };
+
+/* Use one for the packet buffer and another for the aux buffer to receive
+ * TSO packets. */
+#define IOV_STD_SIZE 1
+#define IOV_TSO_SIZE 2
+
+enum {
+    IOV_PACKET = 0,
+    IOV_AUXBUF = 1,
+};
 
 struct linux_lag_slave {
    uint32_t block_id;
@@ -501,6 +514,8 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  * changes in the device miimon status, so we can use atomic_count. */
 static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 
+static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
+static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
 static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
                                    int cmd, const char *cmd_name);
 static int get_flags(const struct netdev *, unsigned int *flags);
@@ -902,6 +917,13 @@  netdev_linux_common_construct(struct netdev *netdev_)
     /* The device could be in the same network namespace or in another one. */
     netnsid_unset(&netdev->netnsid);
     ovs_mutex_init(&netdev->mutex);
+
+    if (userspace_tso_enabled()) {
+        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+    }
+
     return 0;
 }
 
@@ -961,6 +983,10 @@  netdev_linux_construct_tap(struct netdev *netdev_)
     /* Create tap device. */
     get_flags(&netdev->up, &netdev->ifi_flags);
     ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+    if (userspace_tso_enabled()) {
+        ifr.ifr_flags |= IFF_VNET_HDR;
+    }
+
     ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
     if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
         VLOG_WARN("%s: creating tap device failed: %s", name,
@@ -1024,6 +1050,15 @@  static struct netdev_rxq *
 netdev_linux_rxq_alloc(void)
 {
     struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
+    if (userspace_tso_enabled()) {
+        int i;
+
+        /* Allocate auxiliary buffers to receive TSO packets. */
+        for (i = 0; i < NETDEV_MAX_BURST; i++) {
+            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
+        }
+    }
+
     return &rx->up;
 }
 
@@ -1069,6 +1104,15 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
             goto error;
         }
 
+        if (userspace_tso_enabled()
+            && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
+                          sizeof val)) {
+            error = errno;
+            VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s",
+                     netdev_get_name(netdev_), ovs_strerror(errno));
+            goto error;
+        }
+
         /* Set non-blocking mode. */
         error = set_nonblocking(rx->fd);
         if (error) {
@@ -1119,10 +1163,15 @@  static void
 netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
 {
     struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    int i;
 
     if (!rx->is_tap) {
         close(rx->fd);
     }
+
+    for (i = 0; i < NETDEV_MAX_BURST; i++) {
+        free(rx->aux_bufs[i]);
+    }
 }
 
 static void
@@ -1159,12 +1208,14 @@  auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
  * It also used recvmmsg to reduce multiple syscalls overhead;
  */
 static int
-netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
+netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
                                  struct dp_packet_batch *batch)
 {
-    size_t size;
+    int iovlen;
+    size_t std_len;
     ssize_t retval;
-    struct iovec iovs[NETDEV_MAX_BURST];
+    int virtio_net_hdr_size;
+    struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE];
     struct cmsghdr *cmsg;
     union {
         struct cmsghdr cmsg;
@@ -1174,41 +1225,87 @@  netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
     struct dp_packet *buffers[NETDEV_MAX_BURST];
     int i;
 
+    if (userspace_tso_enabled()) {
+        /* Use the buffer from the allocated packet below to receive MTU
+         * sized packets and an aux_buf for extra TSO data. */
+        iovlen = IOV_TSO_SIZE;
+        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
+    } else {
+        /* Use only the buffer from the allocated packet. */
+        iovlen = IOV_STD_SIZE;
+        virtio_net_hdr_size = 0;
+    }
+
+    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
-         buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
-                                                  DP_NETDEV_HEADROOM);
-         /* Reserve headroom for a single VLAN tag */
-         dp_packet_reserve(buffers[i], VLAN_HEADER_LEN);
-         size = dp_packet_tailroom(buffers[i]);
-         iovs[i].iov_base = dp_packet_data(buffers[i]);
-         iovs[i].iov_len = size;
+         buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
+         iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
+         iovs[i][IOV_PACKET].iov_len = std_len;
+         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
+         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
          mmsgs[i].msg_hdr.msg_name = NULL;
          mmsgs[i].msg_hdr.msg_namelen = 0;
-         mmsgs[i].msg_hdr.msg_iov = &iovs[i];
-         mmsgs[i].msg_hdr.msg_iovlen = 1;
+         mmsgs[i].msg_hdr.msg_iov = iovs[i];
+         mmsgs[i].msg_hdr.msg_iovlen = iovlen;
          mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i];
          mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
          mmsgs[i].msg_hdr.msg_flags = 0;
     }
 
     do {
-        retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
+        retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
     } while (retval < 0 && errno == EINTR);
 
     if (retval < 0) {
-        /* Save -errno to retval temporarily */
-        retval = -errno;
-        i = 0;
-        goto free_buffers;
+        retval = errno;
+        for (i = 0; i < NETDEV_MAX_BURST; i++) {
+            dp_packet_delete(buffers[i]);
+        }
+
+        return retval;
     }
 
     for (i = 0; i < retval; i++) {
         if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
-            break;
+            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
+            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+            dp_packet_delete(buffers[i]);
+            netdev->rx_dropped += 1;
+            VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether hdr size",
+                         netdev_get_name(netdev_));
+            continue;
+        }
+
+        if (mmsgs[i].msg_len > std_len) {
+            /* Build a single linear TSO packet by expanding the current packet
+             * to append the data received in the aux_buf. */
+            size_t extra_len = mmsgs[i].msg_len - std_len;
+
+            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
+                               + std_len);
+            dp_packet_prealloc_tailroom(buffers[i], extra_len);
+            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
+            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
+                               + extra_len);
+        } else {
+            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
+                               + mmsgs[i].msg_len);
         }
 
-        dp_packet_set_size(buffers[i],
-                           dp_packet_size(buffers[i]) + mmsgs[i].msg_len);
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) {
+            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
+            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+            /* Unexpected error situation: the virtio header is not present
+             * or corrupted. Drop the packet but continue in case next ones
+             * are correct. */
+            dp_packet_delete(buffers[i]);
+            netdev->rx_dropped += 1;
+            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
+                         netdev_get_name(netdev_));
+            continue;
+        }
 
         for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg;
                  cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) {
@@ -1238,22 +1335,11 @@  netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
         dp_packet_batch_add(batch, buffers[i]);
     }
 
-free_buffers:
-    /* Free unused buffers, including buffers whose size is less than
-     * ETH_HEADER_LEN.
-     *
-     * Note: i has been set correctly by the above for loop, so don't
-     * try to re-initialize it.
-     */
+    /* Delete unused buffers. */
     for (; i < NETDEV_MAX_BURST; i++) {
         dp_packet_delete(buffers[i]);
     }
 
-    /* netdev_linux_rxq_recv needs it to return 0 or positive errno */
-    if (retval < 0) {
-        return -retval;
-    }
-
     return 0;
 }
 
@@ -1263,20 +1349,40 @@  free_buffers:
  * packets are added into *batch. The return value is 0 or errno.
  */
 static int
-netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
+netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
+                                struct dp_packet_batch *batch)
 {
     struct dp_packet *buffer;
+    int virtio_net_hdr_size;
     ssize_t retval;
-    size_t size;
+    size_t std_len;
+    int iovlen;
     int i;
 
+    if (userspace_tso_enabled()) {
+        /* Use the buffer from the allocated packet below to receive MTU
+         * sized packets and an aux_buf for extra TSO data. */
+        iovlen = IOV_TSO_SIZE;
+        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
+    } else {
+        /* Use only the buffer from the allocated packet. */
+        iovlen = IOV_STD_SIZE;
+        virtio_net_hdr_size = 0;
+    }
+
+    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
+        struct iovec iov[IOV_TSO_SIZE];
+
         /* Assume Ethernet port. No need to set packet_type. */
-        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
-                                             DP_NETDEV_HEADROOM);
-        size = dp_packet_tailroom(buffer);
+        buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
+        iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
+        iov[IOV_PACKET].iov_len = std_len;
+        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
+        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
+
         do {
-            retval = read(fd, dp_packet_data(buffer), size);
+            retval = readv(rx->fd, iov, iovlen);
         } while (retval < 0 && errno == EINTR);
 
         if (retval < 0) {
@@ -1284,7 +1390,33 @@  netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
             break;
         }
 
-        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+        if (retval > std_len) {
+            /* Build a single linear TSO packet by expanding the current packet
+             * to append the data received in the aux_buf. */
+            size_t extra_len = retval - std_len;
+
+            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
+            dp_packet_prealloc_tailroom(buffer, extra_len);
+            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
+            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+        } else {
+            dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+        }
+
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
+            struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
+            struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+            /* Unexpected error situation: the virtio header is not present
+             * or corrupted. Drop the packet but continue in case next ones
+             * are correct. */
+            dp_packet_delete(buffer);
+            netdev->rx_dropped += 1;
+            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
+                         netdev_get_name(netdev_));
+            continue;
+        }
+
         dp_packet_batch_add(batch, buffer);
     }
 
@@ -1310,8 +1442,8 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
 
     dp_packet_batch_init(batch);
     retval = (rx->is_tap
-              ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch)
-              : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch));
+              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
+              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
 
     if (retval) {
         if (retval != EAGAIN && retval != EMSGSIZE) {
@@ -1353,7 +1485,7 @@  netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
 }
 
 static int
-netdev_linux_sock_batch_send(int sock, int ifindex,
+netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
                              struct dp_packet_batch *batch)
 {
     const size_t size = dp_packet_batch_size(batch);
@@ -1367,6 +1499,10 @@  netdev_linux_sock_batch_send(int sock, int ifindex,
 
     struct dp_packet *packet;
     DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        if (tso) {
+            netdev_linux_prepend_vnet_hdr(packet, mtu);
+        }
+
         iov[i].iov_base = dp_packet_data(packet);
         iov[i].iov_len = dp_packet_size(packet);
         mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
@@ -1399,7 +1535,7 @@  netdev_linux_sock_batch_send(int sock, int ifindex,
  * on other interface types because we attach a socket filter to the rx
  * socket. */
 static int
-netdev_linux_tap_batch_send(struct netdev *netdev_,
+netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
                             struct dp_packet_batch *batch)
 {
     struct netdev_linux *netdev = netdev_linux_cast(netdev_);
@@ -1416,10 +1552,15 @@  netdev_linux_tap_batch_send(struct netdev *netdev_,
     }
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-        size_t size = dp_packet_size(packet);
+        size_t size;
         ssize_t retval;
         int error;
 
+        if (tso) {
+            netdev_linux_prepend_vnet_hdr(packet, mtu);
+        }
+
+        size = dp_packet_size(packet);
         do {
             retval = write(netdev->tap_fd, dp_packet_data(packet), size);
             error = retval < 0 ? errno : 0;
@@ -1454,9 +1595,15 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
                   struct dp_packet_batch *batch,
                   bool concurrent_txq OVS_UNUSED)
 {
+    bool tso = userspace_tso_enabled();
+    int mtu = ETH_PAYLOAD_MAX;
     int error = 0;
     int sock = 0;
 
+    if (tso) {
+        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
+    }
+
     if (!is_tap_netdev(netdev_)) {
         if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
             error = EOPNOTSUPP;
@@ -1475,9 +1622,9 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
             goto free_batch;
         }
 
-        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
+        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
     } else {
-        error = netdev_linux_tap_batch_send(netdev_, batch);
+        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
     }
     if (error) {
         if (error == ENOBUFS) {
@@ -2045,6 +2192,7 @@  netdev_tap_get_stats(const struct netdev *netdev_, struct netdev_stats *stats)
         stats->collisions          += dev_stats.collisions;
     }
     stats->tx_dropped += netdev->tx_dropped;
+    stats->rx_dropped += netdev->rx_dropped;
     ovs_mutex_unlock(&netdev->mutex);
 
     return error;
@@ -6223,6 +6371,17 @@  af_packet_sock(void)
             if (error) {
                 close(sock);
                 sock = -error;
+            } else if (userspace_tso_enabled()) {
+                int val = 1;
+                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
+                                   sizeof val);
+                if (error) {
+                    error = errno;
+                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
+                             ovs_strerror(errno));
+                    close(sock);
+                    sock = -error;
+                }
             }
         } else {
             sock = -errno;
@@ -6234,3 +6393,136 @@  af_packet_sock(void)
 
     return sock;
 }
+
+static int
+netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
+{
+    struct eth_header *eth_hdr;
+    ovs_be16 eth_type;
+    int l2_len;
+
+    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
+    if (!eth_hdr) {
+        return -EINVAL;
+    }
+
+    l2_len = ETH_HEADER_LEN;
+    eth_type = eth_hdr->eth_type;
+    if (eth_type_vlan(eth_type)) {
+        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
+
+        if (!vlan) {
+            return -EINVAL;
+        }
+
+        eth_type = vlan->vlan_next_type;
+        l2_len += VLAN_HEADER_LEN;
+    }
+
+    if (eth_type == htons(ETH_TYPE_IP)) {
+        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
+
+        if (!ip_hdr) {
+            return -EINVAL;
+        }
+
+        *l4proto = ip_hdr->ip_proto;
+        dp_packet_hwol_set_tx_ipv4(b);
+    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
+        struct ovs_16aligned_ip6_hdr *nh6;
+
+        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
+        if (!nh6) {
+            return -EINVAL;
+        }
+
+        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
+        dp_packet_hwol_set_tx_ipv6(b);
+    }
+
+    return 0;
+}
+
+static int
+netdev_linux_parse_vnet_hdr(struct dp_packet *b)
+{
+    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
+    uint16_t l4proto = 0;
+
+    if (OVS_UNLIKELY(!vnet)) {
+        return -EINVAL;
+    }
+
+    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
+        return 0;
+    }
+
+    if (netdev_linux_parse_l2(b, &l4proto)) {
+        return -EINVAL;
+    }
+
+    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+        if (l4proto == IPPROTO_TCP) {
+            dp_packet_hwol_set_csum_tcp(b);
+        } else if (l4proto == IPPROTO_UDP) {
+            dp_packet_hwol_set_csum_udp(b);
+        } else if (l4proto == IPPROTO_SCTP) {
+            dp_packet_hwol_set_csum_sctp(b);
+        }
+    }
+
+    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
+                                | VIRTIO_NET_HDR_GSO_TCPV6
+                                | VIRTIO_NET_HDR_GSO_UDP;
+        uint8_t type = vnet->gso_type & allowed_mask;
+
+        if (type == VIRTIO_NET_HDR_GSO_TCPV4
+            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
+            dp_packet_hwol_set_tcp_seg(b);
+        }
+    }
+
+    return 0;
+}
+
+static void
+netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
+{
+    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
+
+    if (dp_packet_hwol_is_tso(b)) {
+        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
+                            + TCP_HEADER_LEN;
+
+        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
+        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
+        if (dp_packet_hwol_is_ipv4(b)) {
+            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+        } else {
+            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+        }
+
+    } else {
+        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
+    }
+
+    if (dp_packet_hwol_l4_mask(b)) {
+        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
+                                                  - (char *)dp_packet_eth(b));
+
+        if (dp_packet_hwol_l4_is_tcp(b)) {
+            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+                                    struct tcp_header, tcp_csum);
+        } else if (dp_packet_hwol_l4_is_udp(b)) {
+            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+                                    struct udp_header, udp_csum);
+        } else if (dp_packet_hwol_l4_is_sctp(b)) {
+            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+                                    struct sctp_header, sctp_csum);
+        } else {
+            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
+        }
+    }
+}
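
Note: the csum_start/csum_offset pair written above follows the virtio
convention, i.e. checksumming starts csum_start bytes into the frame and the
16-bit result is stored csum_offset bytes beyond that point.  A minimal
sketch of how a consumer locates the checksum field (illustrative only, not
part of this patch; 'frame' is assumed to point at the Ethernet header that
follows the virtio_net_hdr):

    #include <stdint.h>

    /* Returns a pointer to the L4 checksum field described by a vnet header.
     * E.g. for TCP over IPv4, csum_start is the Ethernet plus IP header
     * length and csum_offset is offsetof(struct tcp_header, tcp_csum). */
    static inline uint16_t *
    vnet_csum_field(uint8_t *frame, uint16_t csum_start, uint16_t csum_offset)
    {
        return (uint16_t *) (frame + csum_start + csum_offset);
    }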
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index f109c4e66..22f4cde33 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -37,6 +37,12 @@  extern "C" {
 struct netdev_tnl_build_header_params;
 #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
 
+enum netdev_ol_flags {
+    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
+    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
+    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
+};
+
 /* A network device (e.g. an Ethernet device).
  *
  * Network device implementations may read these members but should not modify
@@ -51,6 +57,9 @@  struct netdev {
      * opening this device, and therefore got assigned to the "system" class */
     bool auto_classified;
 
+    /* Bitmask of offloading features enabled by the netdev. */
+    uint64_t ol_flags;
+
     /* If this is 'true', the user explicitly specified an MTU for this
      * netdev.  Otherwise, Open vSwitch is allowed to override it. */
     bool mtu_user_config;
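
Note: a provider is expected to advertise only the offloads its device can
actually perform by setting these bits at construction time; netdev.c then
consults netdev->ol_flags before handing packets to the send path.  A rough
sketch of how a provider might do this (the nic_has_* parameters are
illustrative; the real capability checks for netdev-dpdk are made elsewhere
in this patch):

    #include "netdev-provider.h"

    static void
    example_netdev_set_tx_offloads(struct netdev *netdev, bool nic_has_l4_csum,
                                   bool nic_has_tso)
    {
        /* Advertise only what the underlying device supports. */
        if (nic_has_l4_csum) {
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM
                                | NETDEV_TX_OFFLOAD_TCP_CKSUM;
        }
        if (nic_has_tso) {
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
        }
    }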
diff --git a/lib/netdev.c b/lib/netdev.c
index 405c98c68..f95b19af4 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -66,6 +66,8 @@  COVERAGE_DEFINE(netdev_received);
 COVERAGE_DEFINE(netdev_sent);
 COVERAGE_DEFINE(netdev_add_router);
 COVERAGE_DEFINE(netdev_get_stats);
+COVERAGE_DEFINE(netdev_send_prepare_drops);
+COVERAGE_DEFINE(netdev_push_header_drops);
 
 struct netdev_saved_flags {
     struct netdev *netdev;
@@ -782,6 +784,54 @@  netdev_get_pt_mode(const struct netdev *netdev)
             : NETDEV_PT_LEGACY_L2);
 }
 
+/* Check if a 'packet' is compatible with 'netdev_flags'.
+ * If a packet is incompatible, return 'false' with the 'errormsg'
+ * pointing to a reason. */
+static bool
+netdev_send_prepare_packet(const uint64_t netdev_flags,
+                           struct dp_packet *packet, char **errormsg)
+{
+    if (dp_packet_hwol_is_tso(packet)
+        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
+            /* Fall back to GSO in software. */
+            VLOG_ERR_BUF(errormsg, "No TSO support");
+            return false;
+    }
+
+    if (dp_packet_hwol_l4_mask(packet)
+        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
+            /* Fall back to L4 csum in software. */
+            VLOG_ERR_BUF(errormsg, "No L4 checksum support");
+            return false;
+    }
+
+    return true;
+}
+
+/* Check if each packet in 'batch' is compatible with 'netdev' features,
+ * otherwise either fall back to software implementation or drop it. */
+static void
+netdev_send_prepare_batch(const struct netdev *netdev,
+                          struct dp_packet_batch *batch)
+{
+    struct dp_packet *packet;
+    size_t i, size = dp_packet_batch_size(batch);
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+        char *errormsg = NULL;
+
+        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
+            dp_packet_batch_refill(batch, packet, i);
+        } else {
+            dp_packet_delete(packet);
+            COVERAGE_INC(netdev_send_prepare_drops);
+            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
+                         netdev_get_name(netdev), errormsg);
+            free(errormsg);
+        }
+    }
+}
+
 /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
  * otherwise a positive errno value.  Returns EAGAIN without blocking if
  * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
@@ -811,8 +861,14 @@  int
 netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
             bool concurrent_txq)
 {
-    int error = netdev->netdev_class->send(netdev, qid, batch,
-                                           concurrent_txq);
+    int error;
+
+    netdev_send_prepare_batch(netdev, batch);
+    if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
+        return 0;
+    }
+
+    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
     if (!error) {
         COVERAGE_INC(netdev_sent);
     }
@@ -878,9 +934,21 @@  netdev_push_header(const struct netdev *netdev,
                    const struct ovs_action_push_tnl *data)
 {
     struct dp_packet *packet;
-    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-        netdev->netdev_class->push_header(netdev, packet, data);
-        pkt_metadata_init(&packet->md, data->out_port);
+    size_t i, size = dp_packet_batch_size(batch);
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+        if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
+                         || dp_packet_hwol_l4_mask(packet))) {
+            COVERAGE_INC(netdev_push_header_drops);
+            dp_packet_delete(packet);
+            VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload flags is "
+                         "not supported: packet dropped",
+                         netdev_get_name(netdev));
+        } else {
+            netdev->netdev_class->push_header(netdev, packet, data);
+            pkt_metadata_init(&packet->md, data->out_port);
+            dp_packet_batch_refill(batch, packet, i);
+        }
     }
 
     return 0;
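
Note: the two coverage counters added above make these drops observable at
runtime; with a running ovs-vswitchd, something like the following lists them
once they have fired:

    $ ovs-appctl coverage/show | grep -E 'netdev_send_prepare_drops|netdev_push_header_drops'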
diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c
new file mode 100644
index 000000000..6a4a0149b
--- /dev/null
+++ b/lib/userspace-tso.c
@@ -0,0 +1,53 @@ 
+/*
+ * Copyright (c) 2020 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "smap.h"
+#include "ovs-thread.h"
+#include "openvswitch/vlog.h"
+#include "dpdk.h"
+#include "userspace-tso.h"
+#include "vswitch-idl.h"
+
+VLOG_DEFINE_THIS_MODULE(userspace_tso);
+
+static bool userspace_tso = false;
+
+void
+userspace_tso_init(const struct smap *ovs_other_config)
+{
+    if (smap_get_bool(ovs_other_config, "userspace-tso-enable", false)) {
+        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+
+        if (ovsthread_once_start(&once)) {
+#ifdef DPDK_NETDEV
+            VLOG_INFO("Userspace TCP Segmentation Offloading support enabled");
+            userspace_tso = true;
+#else
+            VLOG_WARN("Userspace TCP Segmentation Offloading can not be "
+                      "enabled since OVS is built without DPDK support.");
+#endif
+            ovsthread_once_done(&once);
+        }
+    }
+}
+
+bool
+userspace_tso_enabled(void)
+{
+    return userspace_tso;
+}
diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h
new file mode 100644
index 000000000..0758274c0
--- /dev/null
+++ b/lib/userspace-tso.h
@@ -0,0 +1,23 @@ 
+/*
+ * Copyright (c) 2020 Red Hat Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef USERSPACE_TSO_H
+#define USERSPACE_TSO_H 1
+
+void userspace_tso_init(const struct smap *ovs_other_config);
+bool userspace_tso_enabled(void);
+
+#endif /* userspace-tso.h */
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index 86c7b10a9..e591c26a6 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -65,6 +65,7 @@ 
 #include "system-stats.h"
 #include "timeval.h"
 #include "tnl-ports.h"
+#include "userspace-tso.h"
 #include "util.h"
 #include "unixctl.h"
 #include "lib/vswitch-idl.h"
@@ -3285,6 +3286,7 @@  bridge_run(void)
     if (cfg) {
         netdev_set_flow_api_enabled(&cfg->other_config);
         dpdk_init(&cfg->other_config);
+        userspace_tso_init(&cfg->other_config);
     }
 
     /* Initialize the ofproto library.  This only needs to run once, but
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index c43cb1aa4..3ddaaefda 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -690,6 +690,26 @@ 
          once in few hours or a day or a week.
         </p>
       </column>
+      <column name="other_config" key="userspace-tso-enable"
+              type='{"type": "boolean"}'>
+        <p>
+          Set this value to <code>true</code> to enable userspace support for
+          TCP Segmentation Offloading (TSO). When it is enabled, the interfaces
+          can provide an oversized TCP segment to the datapath and the datapath
+          will offload the TCP segmentation and checksum calculation to the
+          interfaces when necessary.
+        </p>
+        <p>
+          The default value is <code>false</code>. Changing this value requires
+          restarting the daemon.
+        </p>
+        <p>
+          The feature only works if Open vSwitch is built with DPDK support.
+        </p>
+        <p>
+          The feature is considered experimental.
+        </p>
+      </column>
     </group>
     <group title="Status">
       <column name="next_cfg">
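
Note: as documented above, the feature is disabled by default.  On a
DPDK-enabled build it can be turned on (followed by a restart of the daemon)
with, for example:

    $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true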