[ovs-dev,v3,3/3] netdev-dpdk: Add TCP Segmentation Offload support

Message ID 20200109144457.2489481-4-fbl@sysclose.org
State Changes Requested
Series
  • Add support for TSO with DPDK

Commit Message

Flavio Leitner Jan. 9, 2020, 2:44 p.m. UTC
TCP Segmentation Offload (TSO) is a feature which enables the network
stack to delegate TCP segmentation to the NIC, reducing the per-packet
CPU overhead.
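As a rough sketch of the arithmetic behind this saving, the patch derives the NIC's TSO segment size from the MTU and the L3/L4 header lengths (see netdev_dpdk_prep_hwol_packet() in the diff below); the helper names and sizes here are illustrative only:

```c
#include <stdint.h>

/* Mirrors the patch's tso_segsz computation: the NIC cuts the TCP payload
 * into chunks of (MTU - l3_len - l4_len) bytes. */
static uint16_t
tso_segsz(uint16_t mtu, uint16_t l3_len, uint16_t l4_len)
{
    return mtu - l3_len - l4_len;
}

/* Number of wire frames the NIC emits for a given TCP payload length.
 * E.g. a 64000-byte payload with a 1500-byte MTU and 20-byte IPv4 and
 * TCP headers (1460-byte segments) yields 44 frames. */
static uint32_t
tso_nb_segs(uint32_t payload_len, uint16_t segsz)
{
    return (payload_len + segsz - 1) / segsz;
}
```

Without TSO the stack would build each of those frames (and their checksums) in software; with TSO the switch handles a single oversized packet instead.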

A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
packet processing inside the virtual switch.

If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check that the NIC hardware supports TSO before
enabling the feature, which is off by default. For additional
information, see the tso.rst document.
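That hardware check amounts to a bitmask test against the device's advertised TX offload capabilities (the patch gates NETDEV_TX_TSO_OFFLOAD on the TCP TSO, TCP checksum, and IPv4 checksum offloads). A strict all-bits variant of that test, sketched with stand-in constants rather than DPDK's real DEV_TX_OFFLOAD_* values:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in capability bits; DPDK defines the real DEV_TX_OFFLOAD_* values. */
#define TX_OFFLOAD_TCP_TSO    (1u << 0)
#define TX_OFFLOAD_TCP_CKSUM  (1u << 1)
#define TX_OFFLOAD_IPV4_CKSUM (1u << 2)

/* TSO needs segmentation plus TCP and IPv4 checksum offload in hardware;
 * require that every needed bit is present in the capability mask. */
static bool
nic_supports_tso(uint32_t tx_offload_capa)
{
    const uint32_t need = TX_OFFLOAD_TCP_TSO | TX_OFFLOAD_TCP_CKSUM
                          | TX_OFFLOAD_IPV4_CKSUM;
    return (tx_offload_capa & need) == need;
}
```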

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 Documentation/automake.mk           |   1 +
 Documentation/topics/dpdk/index.rst |   1 +
 Documentation/topics/dpdk/tso.rst   |  96 +++++++++
 NEWS                                |   1 +
 lib/automake.mk                     |   2 +
 lib/conntrack.c                     |  29 ++-
 lib/dp-packet.h                     | 152 +++++++++++++-
 lib/ipf.c                           |  32 +--
 lib/netdev-dpdk.c                   | 312 ++++++++++++++++++++++++----
 lib/netdev-linux-private.h          |   4 +
 lib/netdev-linux.c                  | 296 +++++++++++++++++++++++---
 lib/netdev-provider.h               |  10 +
 lib/netdev.c                        |  66 +++++-
 lib/tso.c                           |  54 +++++
 lib/tso.h                           |  23 ++
 vswitchd/bridge.c                   |   2 +
 vswitchd/vswitch.xml                |  12 ++
 17 files changed, 1002 insertions(+), 91 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst
 create mode 100644 lib/tso.c
 create mode 100644 lib/tso.h

Changelog:
- v3
 * Improved the documentation.
 * Updated copyright year to 2020.
 * TSO offloaded msg now includes the netdev's name.
 * Added period at the end of all code comments.
 * Warn and drop encapsulation of TSO packets.
 * Fixed travis issue with restricted virtio types.
 * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
   which caused packet corruption.
 * Fixed netdev_dpdk_prep_hwol_packet() to set PKT_TX_IP_CKSUM only
   for IPv4 packets (it was previously set unconditionally).
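The last bullet corresponds to logic along these lines (the flag values are stand-ins, not DPDK's real PKT_TX_* bits):

```c
#include <stdint.h>

/* Stand-in flag bits; DPDK defines the real PKT_TX_* values in rte_mbuf.h. */
#define PKT_TX_IPV4      (1u << 0)
#define PKT_TX_IP_CKSUM  (1u << 1)
#define PKT_TX_TCP_SEG   (1u << 2)
#define PKT_TX_TCP_CKSUM (1u << 3)

/* v3 behavior: for a TSO-marked packet, always request TCP checksum
 * offload, but request IP checksum offload only when the packet is IPv4
 * (IPv6 has no header checksum). */
static uint32_t
prep_tso_flags(uint32_t ol_flags)
{
    if (ol_flags & PKT_TX_TCP_SEG) {
        ol_flags |= PKT_TX_TCP_CKSUM;
        if (ol_flags & PKT_TX_IPV4) {
            ol_flags |= PKT_TX_IP_CKSUM;
        }
    }
    return ol_flags;
}
```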

Comments

Stokes, Ian Jan. 14, 2020, 3:41 p.m. UTC | #1
On 1/9/2020 2:44 PM, Flavio Leitner wrote:
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
> 
> A guest using vhostuser interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
> 
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
> 
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhostuser interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
> 
> It is recommended to check if NIC hardware supports TSO before enabling
> the feature, which is off by default. For additional information please
> check the tso.rst document.

Thanks for the patch, Flavio. You've addressed my comments at least, and
I can see that Ciara has tested the series.

I think this will need to be rebased, however, as there has been a change
to netdev-linux to operate on batches rather than single packets. Can I
ask you to rebase the series for these changes?

@Ilya: I believe Flavio has addressed your comments to date, but I'm not
sure if you have more?

Thanks
Ian
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
>   Documentation/automake.mk           |   1 +
>   Documentation/topics/dpdk/index.rst |   1 +
>   Documentation/topics/dpdk/tso.rst   |  96 +++++++++
>   NEWS                                |   1 +
>   lib/automake.mk                     |   2 +
>   lib/conntrack.c                     |  29 ++-
>   lib/dp-packet.h                     | 152 +++++++++++++-
>   lib/ipf.c                           |  32 +--
>   lib/netdev-dpdk.c                   | 312 ++++++++++++++++++++++++----
>   lib/netdev-linux-private.h          |   4 +
>   lib/netdev-linux.c                  | 296 +++++++++++++++++++++++---
>   lib/netdev-provider.h               |  10 +
>   lib/netdev.c                        |  66 +++++-
>   lib/tso.c                           |  54 +++++
>   lib/tso.h                           |  23 ++
>   vswitchd/bridge.c                   |   2 +
>   vswitchd/vswitch.xml                |  12 ++
>   17 files changed, 1002 insertions(+), 91 deletions(-)
>   create mode 100644 Documentation/topics/dpdk/tso.rst
>   create mode 100644 lib/tso.c
>   create mode 100644 lib/tso.h
> 
> Changelog:
> - v3
>   * Improved the documentation.
>   * Updated copyright year to 2020.
>   * TSO offloaded msg now includes the netdev's name.
>   * Added period at the end of all code comments.
>   * Warn and drop encapsulation of TSO packets.
>   * Fixed travis issue with restricted virtio types.
>   * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
>     which caused packet corruption.
>   * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
>     PKT_TX_IP_CKSUM only for IPv4 packets.
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index f2ca17bad..284327edd 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -35,6 +35,7 @@ DOC_SOURCE = \
>   	Documentation/topics/dpdk/index.rst \
>   	Documentation/topics/dpdk/bridge.rst \
>   	Documentation/topics/dpdk/jumbo-frames.rst \
> +	Documentation/topics/dpdk/tso.rst \
>   	Documentation/topics/dpdk/memory.rst \
>   	Documentation/topics/dpdk/pdump.rst \
>   	Documentation/topics/dpdk/phy.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index f2862ea70..400d56051 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,4 +40,5 @@ DPDK Support
>      /topics/dpdk/qos
>      /topics/dpdk/pdump
>      /topics/dpdk/jumbo-frames
> +   /topics/dpdk/tso
>      /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
> new file mode 100644
> index 000000000..189c86480
> --- /dev/null
> +++ b/Documentation/topics/dpdk/tso.rst
> @@ -0,0 +1,96 @@
> +..
> +      Copyright 2020, Red Hat, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +========================
> +Userspace Datapath - TSO
> +========================
> +
> +**Note:** This feature is considered experimental.
> +
> +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> +segmentation achieves computational savings in the core, freeing up CPU cycles
> +for more useful work.
> +
> +A common use case for TSO is when using virtualization, where traffic that's
> +coming in from a VM can offload the TCP segmentation, thus avoiding the
> +fragmentation in software. Additionally, if the traffic is headed to a VM
> +within the same host further optimization can be expected. As the traffic never
> +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> +and checksum calculations are required, which saves yet more cycles. Only when
> +the traffic actually leaves the host does segmentation need to happen, in which
> +case it is performed by the egress NIC. Firstly, the NIC must support `TSO`;
> +consult your controller's datasheet for compatibility. Secondly, the NIC must
> +have an associated DPDK Poll Mode Driver (PMD) which supports `TSO`. For a
> +list of features per PMD, refer to the `DPDK documentation`__.
> +
> +__ https://doc.dpdk.org/guides/nics/overview.html
> +
> +Enabling TSO
> +~~~~~~~~~~~~
> +
> +The TSO support may be enabled via a global config value ``tso-support``.
> +Setting this to ``true`` enables TSO support for all ports.
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
> +
> +The default value is ``false``.
> +
> +Changing ``tso-support`` requires restarting the daemon.
> +
> +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
> +
> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> +connection is established, `TSO` is thus advertised to the guest as an
> +available feature:
> +
> +1. QEMU Command Line Parameter::
> +
> +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> +    ...
> +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> +    ...
> +
> +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> +used to enable the same::
> +
> +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> +    $ ethtool -K eth0 tso on
> +    $ ethtool -k eth0
> +
> +~~~~~~~~~~~
> +Limitations
> +~~~~~~~~~~~
> +
> +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> +only (i.e. no support for `TSO` over tunneled connections [VxLAN, GRE, IPinIP,
> +etc.]).
> +
> +There is no software implementation of TSO, so all ports attached to the
> +datapath must support TSO or packets using that feature will be dropped
> +on ports without TSO support.  That also means guests using vhost-user
> +in client mode will receive TSO packets regardless of TSO being enabled
> +or disabled within the guest.
> diff --git a/NEWS b/NEWS
> index 965facaf8..306c0493d 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -26,6 +26,7 @@ Post-v2.12.0
>        * DPDK ring ports (dpdkr) are deprecated and will be removed in next
>          releases.
>        * Add support for DPDK 19.11.
> +     * Add experimental support for TSO.
>      - RSTP:
>        * The rstp_statistics column in Port table will only be updated every
>          stats-update-interval configured in Open_vSwtich table.
> diff --git a/lib/automake.mk b/lib/automake.mk
> index ebf714501..94a1b4459 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
>   	lib/tnl-neigh-cache.h \
>   	lib/tnl-ports.c \
>   	lib/tnl-ports.h \
> +	lib/tso.c \
> +	lib/tso.h \
>   	lib/netdev-native-tnl.c \
>   	lib/netdev-native-tnl.h \
>   	lib/token-bucket.c \
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index b80080e72..679054b98 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
>           if (hwol_bad_l3_csum) {
>               ok = false;
>           } else {
> -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> +                                     || dp_packet_hwol_tx_ip_checksum(pkt);
>               /* Validate the checksum only when hwol is not supported. */
>               ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
>                                    !hwol_good_l3_csum);
> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
>       if (ok) {
>           bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
>           if (!hwol_bad_l4_csum) {
> -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
>               /* Validate the checksum only when hwol is not supported. */
>               if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
>                              &ctx->icmp_related, l3, !hwol_good_l4_csum,
> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>                   }
>                   if (seq_skew) {
>                       ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> -                                          l3_hdr->ip_tot_len, htons(ip_len));
> +                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> +                                                        l3_hdr->ip_tot_len,
> +                                                        htons(ip_len));
> +                    }
>                       l3_hdr->ip_tot_len = htons(ip_len);
>                   }
>               }
> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>       }
>   
>       th->tcp_csum = 0;
> -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> -                           dp_packet_l4_size(pkt));
> -    } else {
> -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> -        th->tcp_csum = csum_finish(
> -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> +                               dp_packet_l4_size(pkt));
> +        } else {
> +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> +            th->tcp_csum = csum_finish(
> +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> +        }
>       }
>   
>       if (seq_skew) {
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 133942155..d10a0416e 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
>   static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
>   static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
>   
> +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
> +
>   void *dp_packet_resize_l2(struct dp_packet *, int increment);
>   void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
>   static inline void *dp_packet_eth(const struct dp_packet *);
> @@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p)
>   {
>       /* This initialization is needed for packets that do not come from DPDK
>        * interfaces, when vswitchd is built with --with-dpdk. */
> -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>       p->mbuf.nb_segs = 1;
>       p->mbuf.next = NULL;
>   }
> @@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
>       b->mbuf.buf_len = s;
>   }
>   
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
> +           ? true
> +           : false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> +{
> +    return b->mbuf.ol_flags & PKT_TX_IPV4 ? true : false;
> +}
> +
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> +{
> +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM
> +           ? true
> +           : false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM
> +           ? true
> +           : false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM
> +           ? true
> +           : false;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_IPV4;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_IPV6;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
> +}
> +
>   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>   static inline uint32_t
> @@ -648,6 +724,66 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
>       b->allocated_ = s;
>   }
>   
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return 0;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) {
> +}
> +
>   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
>    * correct only if 'dp_packet_rss_valid(p)' returns true */
>   static inline uint32_t
> @@ -939,6 +1075,20 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
>       }
>   }
>   
> +static inline bool
> +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
> +{
> +
> +    return dp_packet_hwol_l4_mask(p) ? true : false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *p)
> +{
> +
> +    return dp_packet_hwol_l4_mask(p) ? true : false;
> +}
> +
>   #ifdef  __cplusplus
>   }
>   #endif
> diff --git a/lib/ipf.c b/lib/ipf.c
> index 45c489122..0f43593a2 100644
> --- a/lib/ipf.c
> +++ b/lib/ipf.c
> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
>       len += rest_len;
>       l3 = dp_packet_l3(pkt);
>       ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> -                                new_ip_frag_off);
> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> +                                    new_ip_frag_off);
> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    }
>       l3->ip_tot_len = htons(len);
>       l3->ip_frag_off = new_ip_frag_off;
>       dp_packet_set_l2_pad_size(pkt, 0);
> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
>       }
>   
>       if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> +                     && !dp_packet_hwol_tx_ip_checksum(pkt)
>                        && csum(l3, ip_hdr_len) != 0)) {
>           goto invalid_pkt;
>       }
> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
>                   } else {
>                       struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
>                       struct ip_header *l3_reass = dp_packet_l3(pkt);
> -                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> -                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> -                                                     frag_ip, reass_ip);
> -                    l3_frag->ip_src = l3_reass->ip_src;
> +                    if (!dp_packet_hwol_tx_ip_checksum(frag_0->pkt)) {
> +                        ovs_be32 reass_ip =
> +                            get_16aligned_be32(&l3_reass->ip_src);
> +                        ovs_be32 frag_ip =
> +                            get_16aligned_be32(&l3_frag->ip_src);
> +
> +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                         frag_ip, reass_ip);
> +                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                         frag_ip, reass_ip);
> +                    }
>   
> -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> -                                                     frag_ip, reass_ip);
> +                    l3_frag->ip_src = l3_reass->ip_src;
>                       l3_frag->ip_dst = l3_reass->ip_dst;
>                   }
>   
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 5e09786ac..2de60aa3f 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -64,6 +64,7 @@
>   #include "smap.h"
>   #include "sset.h"
>   #include "timeval.h"
> +#include "tso.h"
>   #include "unaligned.h"
>   #include "unixctl.h"
>   #include "util.h"
> @@ -360,7 +361,8 @@ struct ingress_policer {
>   enum dpdk_hw_ol_features {
>       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
>       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
> -    NETDEV_RX_HW_SCATTER = 1 << 2
> +    NETDEV_RX_HW_SCATTER = 1 << 2,
> +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
>   };
>   
>   /*
> @@ -942,6 +944,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>           conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
>       }
>   
> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>       /* Limit configured rss hash functions to only those supported
>        * by the eth device. */
>       conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
> @@ -1043,6 +1051,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
>                                        DEV_RX_OFFLOAD_TCP_CKSUM |
>                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
> +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
> +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
> +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
>   
>       rte_eth_dev_info_get(dev->port_id, &info);
>   
> @@ -1069,6 +1080,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
>       }
>   
> +    if (info.tx_offload_capa & tx_tso_offload_capa) {
> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> +    } else {
> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
> +        VLOG_WARN("Tx TSO offload is not supported on %s port "
> +                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
> +    }
> +
>       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>   
> @@ -1319,14 +1338,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
>           goto out;
>       }
>   
> -    err = rte_vhost_driver_disable_features(dev->vhost_id,
> -                                1ULL << VIRTIO_NET_F_HOST_TSO4
> -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
> -                                | 1ULL << VIRTIO_NET_F_CSUM);
> -    if (err) {
> -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> -                 "port: %s\n", name);
> -        goto out;
> +    if (!tso_enabled()) {
> +        err = rte_vhost_driver_disable_features(dev->vhost_id,
> +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> +                                    | 1ULL << VIRTIO_NET_F_CSUM);
> +        if (err) {
> +            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> +                     "port: %s\n", name);
> +            goto out;
> +        }
>       }
>   
>       err = rte_vhost_driver_start(dev->vhost_id);
> @@ -1661,6 +1682,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
>           } else {
>               smap_add(args, "rx_csum_offload", "false");
>           }
> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +            smap_add(args, "tx_tso_offload", "true");
> +        } else {
> +            smap_add(args, "tx_tso_offload", "false");
> +        }
>           smap_add(args, "lsc_interrupt_mode",
>                    dev->lsc_interrupt_mode ? "true" : "false");
>       }
> @@ -2088,6 +2114,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
>       rte_free(rx);
>   }
>   
> +/* Prepare the packet for HWOL.
> + * Return True if the packet is OK to continue. */
> +static bool
> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
> +{
> +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> +
> +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
> +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
> +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
> +        mbuf->outer_l2_len = 0;
> +        mbuf->outer_l3_len = 0;
> +    }
> +
> +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
> +        struct tcp_header *th = dp_packet_l4(pkt);
> +
> +        if (!th) {
> +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
> +                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
> +            return false;
> +        }
> +
> +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
> +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
> +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
> +
> +        if (mbuf->ol_flags & PKT_TX_IPV4) {
> +            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
> +        }
> +    }
> +    return true;
> +}
> +
> +/* Prepare a batch for HWOL.
> + * Return the number of good packets in the batch. */
> +static int
> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> +                            int pkt_cnt)
> +{
> +    int i = 0;
> +    int cnt = 0;
> +    struct rte_mbuf *pkt;
> +
> +    /* Prepare and filter bad HWOL packets. */
> +    for (i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
> +            rte_pktmbuf_free(pkt);
> +            continue;
> +        }
> +
> +        if (OVS_UNLIKELY(i != cnt)) {
> +            pkts[cnt] = pkt;
> +        }
> +        cnt++;
> +    }
> +
> +    return cnt;
> +}
> +
>   /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
>    * 'pkts', even in case of failure.
>    *
> @@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
>                            struct rte_mbuf **pkts, int cnt)
>   {
>       uint32_t nb_tx = 0;
> +    uint16_t nb_tx_prep = cnt;
> +
> +    if (tso_enabled()) {
> +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> +        if (nb_tx_prep != cnt) {
> +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> +                         cnt, rte_strerror(rte_errno));
> +        }
> +    }
>   
> -    while (nb_tx != cnt) {
> +    while (nb_tx != nb_tx_prep) {
>           uint32_t ret;
>   
> -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> +                               nb_tx_prep - nb_tx);
>           if (!ret) {
>               break;
>           }
> @@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
>       int cnt = 0;
>       struct rte_mbuf *pkt;
>   
> +    /* Filter oversized packets, unless they are marked for TSO. */
>       for (i = 0; i < pkt_cnt; i++) {
>           pkt = pkts[i];
> -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> +                         dev->max_packet_len);
>               rte_pktmbuf_free(pkt);
>               continue;
>           }
> @@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>       struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
>       struct netdev_dpdk_sw_stats sw_stats_add;
>       unsigned int n_packets_to_free = cnt;
> -    unsigned int total_packets = cnt;
> +    unsigned int total_packets;
>       int i, retries = 0;
>       int max_retries = VHOST_ENQ_RETRY_MIN;
>       int vid = netdev_dpdk_get_vid(dev);
> @@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>           rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>       }
>   
> -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> +    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> +    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
>       sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
>   
>       /* Check has QoS has been configured for the netdev */
> @@ -2511,6 +2613,121 @@ out:
>       }
>   }
>   
> +static void
> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> +{
> +    rte_free(opaque);
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> +{
> +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> +    uint16_t buf_len;
> +    void *buf;
> +
> +    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> +    } else {
> +        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> +    }
> +
> +    if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
> +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> +        return NULL;
> +    }
> +
> +    buf_len = total_len;
> +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> +    if (OVS_UNLIKELY(buf == NULL)) {
> +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> +        return NULL;
> +    }
> +
> +    /* Initialize shinfo. */
> +    if (shinfo) {
> +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> +        shinfo->fcb_opaque = buf;
> +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> +    } else {
> +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> +                                                    netdev_dpdk_extbuf_free,
> +                                                    buf);
> +        if (OVS_UNLIKELY(shinfo == NULL)) {
> +            rte_free(buf);
> +            VLOG_ERR("Failed to initialize shared info for mbuf while "
> +                     "attempting to attach an external buffer.");
> +            return NULL;
> +        }
> +    }
> +
> +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> +                              shinfo);
> +    rte_pktmbuf_reset_headroom(pkt);
> +
> +    return pkt;
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> +{
> +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> +
> +    if (OVS_UNLIKELY(!pkt)) {
> +        return NULL;
> +    }
> +
> +    dp_packet_init_specific((struct dp_packet *)pkt);
> +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> +        return pkt;
> +    }
> +
> +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> +        return pkt;
> +    }
> +
> +    rte_pktmbuf_free(pkt);
> +
> +    return NULL;
> +}
> +
> +static struct dp_packet *
> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> +{
> +    struct rte_mbuf *mbuf_dest;
> +    struct dp_packet *pkt_dest;
> +    uint32_t pkt_len;
> +
> +    pkt_len = dp_packet_size(pkt_orig);
> +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> +        return NULL;
> +    }
> +
> +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> +    dp_packet_set_size(pkt_dest, pkt_len);
> +
> +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> +
> +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> +
> +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> +                                - (char *)dp_packet_eth(pkt_dest);
> +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> +                                - (char *)dp_packet_l3(pkt_dest);
> +    }
> +
> +    return pkt_dest;
> +}
> +
>   /* Tx function. Transmit packets indefinitely */
>   static void
>   dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> @@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>       enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
>   #endif
>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
>       struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>       uint32_t cnt = batch_cnt;
>       uint32_t dropped = 0;
> @@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>           struct dp_packet *packet = batch->packets[i];
>           uint32_t size = dp_packet_size(packet);
>   
> -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> -                         size, dev->max_packet_len);
> -
> +        if (size > dev->max_packet_len
> +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> +                         dev->max_packet_len);
>               mtu_drops++;
>               continue;
>           }
>   
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>               dropped = cnt - i;
>               break;
>           }
>   
> -        /* We have to do a copy for now */
> -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> -               dp_packet_data(packet), size);
> -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> -
>           txcnt++;
>       }
>   
>       if (OVS_LIKELY(txcnt)) {
>           if (dev->type == DPDK_DEV_VHOST) {
> -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> -                                     txcnt);
> +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
>           } else {
> -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> +                                                   (struct rte_mbuf **)pkts,
> +                                                   txcnt);
>           }
>       }
>   
> @@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>           int batch_cnt = dp_packet_batch_size(batch);
>           struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>   
> +        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
>           tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
>           mtu_drops = batch_cnt - tx_cnt;
>           qos_drops = tx_cnt;
> @@ -4345,6 +4559,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>   
>       rte_free(dev->tx_q);
>       err = dpdk_eth_dev_init(dev);
> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>       if (!dev->tx_q) {
>           err = ENOMEM;
> @@ -4374,6 +4594,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>           dev->tx_q[0].map = 0;
>       }
>   
> +    if (tso_enabled()) {
> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> +        VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
> +    }
> +
>       netdev_dpdk_remap_txqs(dev);
>   
>       err = netdev_dpdk_mempool_configure(dev);
> @@ -4446,6 +4671,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>               vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>           }
>   
> +        /* Enable External Buffers if TCP Segmentation Offload is enabled. */
> +        if (tso_enabled()) {
> +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
> +        }
> +
>           err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>           if (err) {
>               VLOG_ERR("vhost-user device setup failure for device %s\n",
> @@ -4470,14 +4700,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>               goto unlock;
>           }
>   
> -        err = rte_vhost_driver_disable_features(dev->vhost_id,
> -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> -                                    | 1ULL << VIRTIO_NET_F_CSUM);
> -        if (err) {
> -            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> -                     "client port: %s\n", dev->up.name);
> -            goto unlock;
> +        if (tso_enabled()) {
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +        } else {
> +            err = rte_vhost_driver_disable_features(dev->vhost_id,
> +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
> +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
> +                                        | 1ULL << VIRTIO_NET_F_CSUM);
> +            if (err) {
> +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
> +                         "vhost user client port: %s\n", dev->up.name);
> +                goto unlock;
> +            }
>           }
>   
>           err = rte_vhost_driver_start(dev->vhost_id);
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> index f08159aa7..102548db7 100644
> --- a/lib/netdev-linux-private.h
> +++ b/lib/netdev-linux-private.h
> @@ -37,10 +37,14 @@
>   
>   struct netdev;
>   
> +#define LINUX_RXQ_TSO_MAX_LEN 65536
> +
>   struct netdev_rxq_linux {
>       struct netdev_rxq up;
>       bool is_tap;
>       int fd;
> +    char *bufaux;          /* Extra buffer to recv TSO pkt. */
> +    int bufaux_len;        /* Extra buffer length. */
>   };
>   
>   int netdev_linux_construct(struct netdev *);
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 8a62f9d74..604cb6913 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -29,16 +29,18 @@
>   #include <linux/filter.h>
>   #include <linux/gen_stats.h>
>   #include <linux/if_ether.h>
> +#include <linux/if_packet.h>
>   #include <linux/if_tun.h>
>   #include <linux/types.h>
>   #include <linux/ethtool.h>
>   #include <linux/mii.h>
>   #include <linux/rtnetlink.h>
>   #include <linux/sockios.h>
> +#include <linux/virtio_net.h>
>   #include <sys/ioctl.h>
>   #include <sys/socket.h>
> +#include <sys/uio.h>
>   #include <sys/utsname.h>
> -#include <netpacket/packet.h>
>   #include <net/if.h>
>   #include <net/if_arp.h>
>   #include <net/route.h>
> @@ -72,6 +74,7 @@
>   #include "socket-util.h"
>   #include "sset.h"
>   #include "tc.h"
> +#include "tso.h"
>   #include "timer.h"
>   #include "unaligned.h"
>   #include "openvswitch/vlog.h"
> @@ -501,6 +504,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>    * changes in the device miimon status, so we can use atomic_count. */
>   static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>   
> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
>   static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>                                      int cmd, const char *cmd_name);
>   static int get_flags(const struct netdev *, unsigned int *flags);
> @@ -902,6 +907,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
>       /* The device could be in the same network namespace or in another one. */
>       netnsid_unset(&netdev->netnsid);
>       ovs_mutex_init(&netdev->mutex);
> +
> +    if (tso_enabled()) {
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>       return 0;
>   }
>   
> @@ -961,6 +973,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
>       /* Create tap device. */
>       get_flags(&netdev->up, &netdev->ifi_flags);
>       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> +    if (tso_enabled()) {
> +        ifr.ifr_flags |= IFF_VNET_HDR;
> +    }
> +
>       ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
>       if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
>           VLOG_WARN("%s: creating tap device failed: %s", name,
> @@ -1024,6 +1040,13 @@ static struct netdev_rxq *
>   netdev_linux_rxq_alloc(void)
>   {
>       struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> +    if (tso_enabled()) {
> +        /* xmalloc() aborts on failure, so the result needs no check. */
> +        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> +        rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
> +    }
> +
>       return &rx->up;
>   }
>   
> @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>               goto error;
>           }
>   
> +        if (tso_enabled()) {
> +            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> +                               sizeof val);
> +            if (error) {
> +                error = errno;
> +                VLOG_ERR("%s: failed to enable vnet hdr in rxq raw socket: %s",
> +                         netdev_get_name(netdev_), ovs_strerror(errno));
> +                goto error;
> +            }
> +        }
> +
>           /* Set non-blocking mode. */
>           error = set_nonblocking(rx->fd);
>           if (error) {
> @@ -1123,6 +1157,8 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>       if (!rx->is_tap) {
>           close(rx->fd);
>       }
> +
> +    free(rx->bufaux);
>   }
>   
>   static void
> @@ -1152,11 +1188,13 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
>   }
>   
>   static int
> -netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> +netdev_linux_rxq_recv_sock(int fd, char *bufaux, int bufaux_len,
> +                           struct dp_packet *buffer)
>   {
> -    size_t size;
> +    size_t std_len;
> +    size_t total_len;
>       ssize_t retval;
> -    struct iovec iov;
> +    struct iovec iov[2];
>       struct cmsghdr *cmsg;
>       union {
>           struct cmsghdr cmsg;
> @@ -1166,14 +1204,17 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
>   
>       /* Reserve headroom for a single VLAN tag */
>       dp_packet_reserve(buffer, VLAN_HEADER_LEN);
> -    size = dp_packet_tailroom(buffer);
> +    std_len = dp_packet_tailroom(buffer);
> +    total_len = std_len + bufaux_len;
>   
> -    iov.iov_base = dp_packet_data(buffer);
> -    iov.iov_len = size;
> +    iov[0].iov_base = dp_packet_data(buffer);
> +    iov[0].iov_len = std_len;
> +    iov[1].iov_base = bufaux;
> +    iov[1].iov_len = bufaux_len;
>       msgh.msg_name = NULL;
>       msgh.msg_namelen = 0;
> -    msgh.msg_iov = &iov;
> -    msgh.msg_iovlen = 1;
> +    msgh.msg_iov = iov;
> +    msgh.msg_iovlen = 2;
>       msgh.msg_control = &cmsg_buffer;
>       msgh.msg_controllen = sizeof cmsg_buffer;
>       msgh.msg_flags = 0;
> @@ -1184,11 +1225,26 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
>   
>       if (retval < 0) {
>           return errno;
> -    } else if (retval > size) {
> +    } else if (retval > total_len) {
>           return EMSGSIZE;
>       }
>   
> -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    if (retval > std_len) {
> +        /* Build a single linear TSO packet. */
> +        size_t extra_len = retval - std_len;
> +
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> +        dp_packet_prealloc_tailroom(buffer, extra_len);
> +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> +    } else {
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    }
> +
> +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> +        return EINVAL;
> +    }
>   
>       for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg; cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
>           const struct tpacket_auxdata *aux;
> @@ -1221,20 +1277,44 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
>   }
>   
>   static int
> -netdev_linux_rxq_recv_tap(int fd, struct dp_packet *buffer)
> +netdev_linux_rxq_recv_tap(int fd, char *bufaux, int bufaux_len,
> +                          struct dp_packet *buffer)
>   {
>       ssize_t retval;
> -    size_t size = dp_packet_tailroom(buffer);
> +    size_t std_len;
> +    struct iovec iov[2];
> +
> +    std_len = dp_packet_tailroom(buffer);
> +    iov[0].iov_base = dp_packet_data(buffer);
> +    iov[0].iov_len = std_len;
> +    iov[1].iov_base = bufaux;
> +    iov[1].iov_len = bufaux_len;
>   
>       do {
> -        retval = read(fd, dp_packet_data(buffer), size);
> +        retval = readv(fd, iov, 2);
>       } while (retval < 0 && errno == EINTR);
>   
>       if (retval < 0) {
>           return errno;
>       }
>   
> -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    if (retval > std_len) {
> +        /* Build a single linear TSO packet. */
> +        size_t extra_len = retval - std_len;
> +
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> +        dp_packet_prealloc_tailroom(buffer, extra_len);
> +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> +    } else {
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    }
> +
> +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> +        return EINVAL;
> +    }
> +
>       return 0;
>   }
>   
> @@ -1245,6 +1325,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>       struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>       struct netdev *netdev = rx->up.netdev;
>       struct dp_packet *buffer;
> +    size_t buffer_len;
>       ssize_t retval;
>       int mtu;
>   
> @@ -1252,12 +1333,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>           mtu = ETH_PAYLOAD_MAX;
>       }
>   
> +    buffer_len = VLAN_ETH_HEADER_LEN + mtu;
> +    if (tso_enabled()) {
> +        buffer_len += sizeof(struct virtio_net_hdr);
> +    }
> +
>       /* Assume Ethernet port. No need to set packet_type. */
> -    buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> -                                           DP_NETDEV_HEADROOM);
> +    buffer = dp_packet_new_with_headroom(buffer_len, DP_NETDEV_HEADROOM);
>       retval = (rx->is_tap
> -              ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> -              : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> +              ? netdev_linux_rxq_recv_tap(rx->fd, rx->bufaux, rx->bufaux_len,
> +                                          buffer)
> +              : netdev_linux_rxq_recv_sock(rx->fd, rx->bufaux, rx->bufaux_len,
> +                                           buffer));
>   
>       if (retval) {
>           if (retval != EAGAIN && retval != EMSGSIZE) {
> @@ -1302,7 +1389,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
>   }
>   
>   static int
> -netdev_linux_sock_batch_send(int sock, int ifindex,
> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>                                struct dp_packet_batch *batch)
>   {
>       const size_t size = dp_packet_batch_size(batch);
> @@ -1316,6 +1403,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>   
>       struct dp_packet *packet;
>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (tso) {
> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> +        }
> +
>           iov[i].iov_base = dp_packet_data(packet);
>           iov[i].iov_len = dp_packet_size(packet);
>           mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> @@ -1348,7 +1439,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>    * on other interface types because we attach a socket filter to the rx
>    * socket. */
>   static int
> -netdev_linux_tap_batch_send(struct netdev *netdev_,
> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
>                               struct dp_packet_batch *batch)
>   {
>       struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> @@ -1365,10 +1456,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
>       }
>   
>       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> -        size_t size = dp_packet_size(packet);
> +        size_t size;
>           ssize_t retval;
>           int error;
>   
> +        if (tso) {
> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> +        }
> +
> +        size = dp_packet_size(packet);
>           do {
>               retval = write(netdev->tap_fd, dp_packet_data(packet), size);
>               error = retval < 0 ? errno : 0;
> @@ -1403,9 +1499,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>                     struct dp_packet_batch *batch,
>                     bool concurrent_txq OVS_UNUSED)
>   {
> +    bool tso = tso_enabled();
> +    int mtu = ETH_PAYLOAD_MAX;
>       int error = 0;
>       int sock = 0;
>   
> +    if (tso) {
> +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
> +    }
> +
>       if (!is_tap_netdev(netdev_)) {
>           if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>               error = EOPNOTSUPP;
> @@ -1424,9 +1526,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>               goto free_batch;
>           }
>   
> -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
>       } else {
> -        error = netdev_linux_tap_batch_send(netdev_, batch);
> +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
>       }
>       if (error) {
>           if (error == ENOBUFS) {
> @@ -6173,6 +6275,19 @@ af_packet_sock(void)
>                   close(sock);
>                   sock = -error;
>               }
> +
> +            if (tso_enabled()) {
> +                int val = 1;
> +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> +                                   sizeof val);
> +                if (error) {
> +                    error = errno;
> +                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
> +                             ovs_strerror(errno));
> +                    close(sock);
> +                    sock = -error;
> +                }
> +            }
>           } else {
>               sock = -errno;
>               VLOG_ERR("failed to create packet socket: %s",
> @@ -6183,3 +6298,136 @@ af_packet_sock(void)
>   
>       return sock;
>   }
> +
> +static int
> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
> +{
> +    struct eth_header *eth_hdr;
> +    ovs_be16 eth_type;
> +    int l2_len;
> +
> +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
> +    if (!eth_hdr) {
> +        return -EINVAL;
> +    }
> +
> +    l2_len = ETH_HEADER_LEN;
> +    eth_type = eth_hdr->eth_type;
> +    if (eth_type_vlan(eth_type)) {
> +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
> +
> +        if (!vlan) {
> +            return -EINVAL;
> +        }
> +
> +        eth_type = vlan->vlan_next_type;
> +        l2_len += VLAN_HEADER_LEN;
> +    }
> +
> +    if (eth_type == htons(ETH_TYPE_IP)) {
> +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
> +
> +        if (!ip_hdr) {
> +            return -EINVAL;
> +        }
> +
> +        *l4proto = ip_hdr->ip_proto;
> +        dp_packet_hwol_set_tx_ipv4(b);
> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> +        struct ovs_16aligned_ip6_hdr *nh6;
> +
> +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
> +        if (!nh6) {
> +            return -EINVAL;
> +        }
> +
> +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
> +        dp_packet_hwol_set_tx_ipv6(b);
> +    }
> +
> +    return 0;
> +}
> +
> +static int
> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
> +{
> +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
> +    uint16_t l4proto = 0;
> +
> +    if (OVS_UNLIKELY(!vnet)) {
> +        return -EINVAL;
> +    }
> +
> +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
> +        return 0;
> +    }
> +
> +    if (netdev_linux_parse_l2(b, &l4proto)) {
> +        return -EINVAL;
> +    }
> +
> +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> +        if (l4proto == IPPROTO_TCP) {
> +            dp_packet_hwol_set_csum_tcp(b);
> +        } else if (l4proto == IPPROTO_UDP) {
> +            dp_packet_hwol_set_csum_udp(b);
> +        } else if (l4proto == IPPROTO_SCTP) {
> +            dp_packet_hwol_set_csum_sctp(b);
> +        }
> +    }
> +
> +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
> +                                | VIRTIO_NET_HDR_GSO_TCPV6
> +                                | VIRTIO_NET_HDR_GSO_UDP;
> +        uint8_t type = vnet->gso_type & allowed_mask;
> +
> +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
> +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
> +            dp_packet_hwol_set_tcp_seg(b);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static void
> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
> +{
> +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
> +
> +    if ((dp_packet_size(b) > mtu) && dp_packet_hwol_is_tso(b)) {
> +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
> +                            + TCP_HEADER_LEN;
> +
> +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
> +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
> +        if (dp_packet_hwol_is_ipv4(b)) {
> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +        } else {
> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +        }
> +
> +    } else {
> +        vnet->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +    }
> +
> +    if (dp_packet_hwol_l4_mask(b)) {
> +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
> +                                                  - (char *)dp_packet_eth(b));
> +
> +        if (dp_packet_hwol_l4_is_tcp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct tcp_header, tcp_csum);
> +        } else if (dp_packet_hwol_l4_is_udp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct udp_header, udp_csum);
> +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct sctp_header, sctp_csum);
> +        } else {
> +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
> +        }
> +    }
> +}
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index f109c4e66..87c375b47 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -37,6 +37,12 @@ extern "C" {
>   struct netdev_tnl_build_header_params;
>   #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>   
> +enum netdev_ol_flags {
> +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
> +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
> +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
> +};
> +
>   /* A network device (e.g. an Ethernet device).
>    *
>    * Network device implementations may read these members but should not modify
> @@ -51,6 +57,10 @@ struct netdev {
>        * opening this device, and therefore got assigned to the "system" class */
>       bool auto_classified;
>   
> +    /* Bitmask of the offloading features enabled or supported by the
> +     * netdev. */
> +    uint64_t ol_flags;
> +
>       /* If this is 'true', the user explicitly specified an MTU for this
>        * netdev.  Otherwise, Open vSwitch is allowed to override it. */
>       bool mtu_user_config;
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 405c98c68..998525875 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
>               : NETDEV_PT_LEGACY_L2);
>   }
>   
> +/* Check if a 'packet' is compatible with 'netdev_flags'.
> + * If a packet is incompatible, return 'false' with the 'errormsg'
> + * pointing to a reason. */
> +static bool
> +netdev_send_prepare_packet(const uint64_t netdev_flags,
> +                           struct dp_packet *packet, char **errormsg)
> +{
> +    if (dp_packet_hwol_is_tso(packet)
> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
> +        /* Software GSO fallback is not implemented yet, so the caller
> +         * drops the packet. */
> +        *errormsg = "No TSO support";
> +        return false;
> +    }
> +
> +    if (dp_packet_hwol_l4_mask(packet)
> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
> +        /* Software L4 csum fallback is not implemented yet, so the
> +         * caller drops the packet. */
> +        *errormsg = "No L4 checksum support";
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
> + * otherwise either fall back to software implementation or drop it. */
> +static void
> +netdev_send_prepare_batch(const struct netdev *netdev,
> +                          struct dp_packet_batch *batch)
> +{
> +    struct dp_packet *packet;
> +    size_t i, size = dp_packet_batch_size(batch);
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        char *errormsg = NULL;
> +
> +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> +            dp_packet_batch_refill(batch, packet, i);
> +        } else {
> +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> +                         netdev_get_name(netdev),
> +                         errormsg ? errormsg : "Unsupported feature");
> +        }
> +    }
> +}
> +
>   /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
>    * otherwise a positive errno value.  Returns EAGAIN without blocking if
>    * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> @@ -811,8 +857,10 @@ int
>   netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
>               bool concurrent_txq)
>   {
> -    int error = netdev->netdev_class->send(netdev, qid, batch,
> -                                           concurrent_txq);
> +    int error;
> +
> +    netdev_send_prepare_batch(netdev, batch);
> +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
>       if (!error) {
>           COVERAGE_INC(netdev_sent);
>       }
> @@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev,
>                      const struct ovs_action_push_tnl *data)
>   {
>       struct dp_packet *packet;
> -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> -        netdev->netdev_class->push_header(netdev, packet, data);
> -        pkt_metadata_init(&packet->md, data->out_port);
> +    size_t i, size = dp_packet_batch_size(batch);
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        if (!dp_packet_hwol_is_tso(packet)) {
> +            netdev->netdev_class->push_header(netdev, packet, data);
> +            pkt_metadata_init(&packet->md, data->out_port);
> +            dp_packet_batch_refill(batch, packet, i);
> +        } else {
> +            VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
> +                         "packet dropped", netdev_get_name(netdev));
> +        }
>       }
>   
>       return 0;
> diff --git a/lib/tso.c b/lib/tso.c
> new file mode 100644
> index 000000000..9dc15e146
> --- /dev/null
> +++ b/lib/tso.c
> @@ -0,0 +1,54 @@
> +/*
> + * Copyright (c) 2020 Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "smap.h"
> +#include "ovs-thread.h"
> +#include "openvswitch/vlog.h"
> +#include "dpdk.h"
> +#include "tso.h"
> +#include "vswitch-idl.h"
> +
> +VLOG_DEFINE_THIS_MODULE(tso);
> +
> +static bool tso_support_enabled = false;
> +
> +void
> +tso_init(const struct smap *ovs_other_config)
> +{
> +    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
> +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +
> +        if (ovsthread_once_start(&once)) {
> +            if (dpdk_available()) {
> +                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
> +                tso_support_enabled = true;
> +            } else {
> +                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
> +                         "without enabling DPDK");
> +                tso_support_enabled = false;
> +            }
> +            ovsthread_once_done(&once);
> +        }
> +    }
> +}
> +
> +bool
> +tso_enabled(void)
> +{
> +    return tso_support_enabled;
> +}
> diff --git a/lib/tso.h b/lib/tso.h
> new file mode 100644
> index 000000000..6594496ac
> --- /dev/null
> +++ b/lib/tso.h
> @@ -0,0 +1,23 @@
> +/*
> + * Copyright (c) 2020 Red Hat Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef TSO_H
> +#define TSO_H 1
> +
> +void tso_init(const struct smap *ovs_other_config);
> +bool tso_enabled(void);
> +
> +#endif /* tso.h */
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 86c7b10a9..6d73922f6 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -65,6 +65,7 @@
>   #include "system-stats.h"
>   #include "timeval.h"
>   #include "tnl-ports.h"
> +#include "tso.h"
>   #include "util.h"
>   #include "unixctl.h"
>   #include "lib/vswitch-idl.h"
> @@ -3285,6 +3286,7 @@ bridge_run(void)
>       if (cfg) {
>           netdev_set_flow_api_enabled(&cfg->other_config);
>           dpdk_init(&cfg->other_config);
> +        tso_init(&cfg->other_config);
>       }
>   
>       /* Initialize the ofproto library.  This only needs to run once, but
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0ec726c39..354dcabfa 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -690,6 +690,18 @@
>            once in few hours or a day or a week.
>           </p>
>         </column>
> +      <column name="other_config" key="tso-support"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Set this value to <code>true</code> to enable support for TSO (TCP
> +          Segmentation Offloading). When TSO is enabled, vhost-user client
> +          interfaces can transmit packets up to 64KB.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>. Changing this value requires
> +          restarting the daemon.
> +        </p>
> +      </column>
>       </group>
>       <group title="Status">
>         <column name="next_cfg">
>
Flavio Leitner Jan. 14, 2020, 4:16 p.m. UTC | #2
On Tue, Jan 14, 2020 at 03:41:57PM +0000, Stokes, Ian wrote:
> 
> 
> On 1/9/2020 2:44 PM, Flavio Leitner wrote:
> > Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> > the network stack to delegate the TCP segmentation to the NIC reducing
> > the per packet CPU overhead.
> > 
> > A guest using vhostuser interface with TSO enabled can send TCP packets
> > much bigger than the MTU, which saves CPU cycles normally used to break
> > the packets down to MTU size and to calculate checksums.
> > 
> > It also saves CPU cycles used to parse multiple packets/headers during
> > the packet processing inside virtual switch.
> > 
> > If the destination of the packet is another guest in the same host, then
> > the same big packet can be sent through a vhostuser interface skipping
> > the segmentation completely. However, if the destination is not local,
> > the NIC hardware is instructed to do the TCP segmentation and checksum
> > calculation.
> > 
> > It is recommended to check if NIC hardware supports TSO before enabling
> > the feature, which is off by default. For additional information please
> > check the tso.rst document.
> 
> Thanks for the patch Flavio. You've addressed my comments at least and I can
> see that Ciara has tested the series.
> 
> I think this will need to be rebased however as there has been a change to
> netdev-linux to operate on batches rather than single packets. Can I ask you
> to rebase the series for these changes?

Ok, will do.
fbl


> 
> @Ilya: I believe Flavio has addressed your comments to date but not sure if
> you have more?
> 
> Thanks
> Ian
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > ---
> >   Documentation/automake.mk           |   1 +
> >   Documentation/topics/dpdk/index.rst |   1 +
> >   Documentation/topics/dpdk/tso.rst   |  96 +++++++++
> >   NEWS                                |   1 +
> >   lib/automake.mk                     |   2 +
> >   lib/conntrack.c                     |  29 ++-
> >   lib/dp-packet.h                     | 152 +++++++++++++-
> >   lib/ipf.c                           |  32 +--
> >   lib/netdev-dpdk.c                   | 312 ++++++++++++++++++++++++----
> >   lib/netdev-linux-private.h          |   4 +
> >   lib/netdev-linux.c                  | 296 +++++++++++++++++++++++---
> >   lib/netdev-provider.h               |  10 +
> >   lib/netdev.c                        |  66 +++++-
> >   lib/tso.c                           |  54 +++++
> >   lib/tso.h                           |  23 ++
> >   vswitchd/bridge.c                   |   2 +
> >   vswitchd/vswitch.xml                |  12 ++
> >   17 files changed, 1002 insertions(+), 91 deletions(-)
> >   create mode 100644 Documentation/topics/dpdk/tso.rst
> >   create mode 100644 lib/tso.c
> >   create mode 100644 lib/tso.h
> > 
> > Changelog:
> > - v3
> >   * Improved the documentation.
> >   * Updated copyright year to 2020.
> >   * TSO offloaded msg now includes the netdev's name.
> >   * Added period at the end of all code comments.
> >   * Warn and drop encapsulation of TSO packets.
> >   * Fixed travis issue with restricted virtio types.
> >   * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
> >     which caused packet corruption.
> >   * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
> >     PKT_TX_IP_CKSUM only for IPv4 packets.
> > 
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index f2ca17bad..284327edd 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -35,6 +35,7 @@ DOC_SOURCE = \
> >   	Documentation/topics/dpdk/index.rst \
> >   	Documentation/topics/dpdk/bridge.rst \
> >   	Documentation/topics/dpdk/jumbo-frames.rst \
> > +	Documentation/topics/dpdk/tso.rst \
> >   	Documentation/topics/dpdk/memory.rst \
> >   	Documentation/topics/dpdk/pdump.rst \
> >   	Documentation/topics/dpdk/phy.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> > index f2862ea70..400d56051 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,4 +40,5 @@ DPDK Support
> >      /topics/dpdk/qos
> >      /topics/dpdk/pdump
> >      /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/tso
> >      /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
> > new file mode 100644
> > index 000000000..189c86480
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/tso.rst
> > @@ -0,0 +1,96 @@
> > +..
> > +      Copyright 2020, Red Hat, Inc.
> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +========================
> > +Userspace Datapath - TSO
> > +========================
> > +
> > +**Note:** This feature is considered experimental.
> > +
> > +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> > +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> > +segmentation achieves computational savings in the core, freeing up CPU cycles
> > +for more useful work.
> > +
> > +A common use case for TSO is when using virtualization, where traffic that's
> > +coming in from a VM can offload the TCP segmentation, thus avoiding the
> > +fragmentation in software. Additionally, if the traffic is headed to a VM
> > +within the same host further optimization can be expected. As the traffic never
> > +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> > +and checksum calculations are required, which saves yet more cycles. Only when
> > +the traffic actually leaves the host the segmentation needs to happen, in which
> > +case it will be performed by the egress NIC. Consult your controller's
> > +datasheet for compatibility. Secondly, the NIC must have an associated DPDK
> > +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per PMD,
> > +refer to the `DPDK documentation`__.
> > +
> > +__ https://doc.dpdk.org/guides/nics/overview.html
> > +
> > +Enabling TSO
> > +~~~~~~~~~~~~
> > +
> > +The TSO support may be enabled via a global config value ``tso-support``.
> > +Setting this to ``true`` enables TSO support for all ports.
> > +
> > +    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
> > +
> > +The default value is ``false``.
> > +
> > +Changing ``tso-support`` requires restarting the daemon.
> > +
> > +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
> > +
> > +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> > +connection is established, `TSO` is thus advertised to the guest as an
> > +available feature:
> > +
> > +1. QEMU Command Line Parameter::
> > +
> > +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> > +    ...
> > +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> > +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> > +    ...
> > +
> > +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> > +used to enable the same::
> > +
> > +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> > +    $ ethtool -K eth0 tso on
> > +    $ ethtool -k eth0
> > +
> > +Limitations
> > +~~~~~~~~~~~
> > +
> > +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> > +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP,
> > +etc.]).
> > +
> > +There is no software implementation of TSO, so all ports attached to the
> > +datapath must support TSO or packets using that feature will be dropped
> > +on ports without TSO support.  That also means guests using vhost-user
> > +in client mode will receive TSO packets regardless of TSO being enabled
> > +or disabled within the guest.
> > diff --git a/NEWS b/NEWS
> > index 965facaf8..306c0493d 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -26,6 +26,7 @@ Post-v2.12.0
> >        * DPDK ring ports (dpdkr) are deprecated and will be removed in next
> >          releases.
> >        * Add support for DPDK 19.11.
> > +     * Add experimental support for TSO.
> >      - RSTP:
> >        * The rstp_statistics column in Port table will only be updated every
> >          stats-update-interval configured in Open_vSwtich table.
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index ebf714501..94a1b4459 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
> >   	lib/tnl-neigh-cache.h \
> >   	lib/tnl-ports.c \
> >   	lib/tnl-ports.h \
> > +	lib/tso.c \
> > +	lib/tso.h \
> >   	lib/netdev-native-tnl.c \
> >   	lib/netdev-native-tnl.h \
> >   	lib/token-bucket.c \
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index b80080e72..679054b98 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >           if (hwol_bad_l3_csum) {
> >               ok = false;
> >           } else {
> > -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> > +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> > +                                     || dp_packet_hwol_tx_ip_checksum(pkt);
> >               /* Validate the checksum only when hwol is not supported. */
> >               ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
> >                                    !hwol_good_l3_csum);
> > @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >       if (ok) {
> >           bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
> >           if (!hwol_bad_l4_csum) {
> > -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> > +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> > +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
> >               /* Validate the checksum only when hwol is not supported. */
> >               if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
> >                              &ctx->icmp_related, l3, !hwol_good_l4_csum,
> > @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >                   }
> >                   if (seq_skew) {
> >                       ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> > -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > -                                          l3_hdr->ip_tot_len, htons(ip_len));
> > +                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> > +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > +                                                        l3_hdr->ip_tot_len,
> > +                                                        htons(ip_len));
> > +                    }
> >                       l3_hdr->ip_tot_len = htons(ip_len);
> >                   }
> >               }
> > @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >       }
> >       th->tcp_csum = 0;
> > -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > -                           dp_packet_l4_size(pkt));
> > -    } else {
> > -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > -        th->tcp_csum = csum_finish(
> > -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> > +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > +                               dp_packet_l4_size(pkt));
> > +        } else {
> > +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > +            th->tcp_csum = csum_finish(
> > +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +        }
> >       }
> >       if (seq_skew) {
> > diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > index 133942155..d10a0416e 100644
> > --- a/lib/dp-packet.h
> > +++ b/lib/dp-packet.h
> > @@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
> >   static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
> >   static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
> > +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
> > +
> >   void *dp_packet_resize_l2(struct dp_packet *, int increment);
> >   void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
> >   static inline void *dp_packet_eth(const struct dp_packet *);
> > @@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p)
> >   {
> >       /* This initialization is needed for packets that do not come from DPDK
> >        * interfaces, when vswitchd is built with --with-dpdk. */
> > -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> > +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> >       p->mbuf.nb_segs = 1;
> >       p->mbuf.next = NULL;
> >   }
> > @@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> >       b->mbuf.buf_len = s;
> >   }
> > +static inline bool
> > +dp_packet_hwol_is_tso(const struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> > +{
> > +    return b->mbuf.ol_flags & PKT_TX_IPV4 ? true : false;
> > +}
> > +
> > +static inline uint64_t
> > +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> > +{
> > +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_IPV4;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_IPV6;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_tcp(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_udp(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_sctp(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tcp_seg(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
> > +}
> > +
> >   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
> >    * correct only if 'dp_packet_rss_valid(p)' returns true */
> >   static inline uint32_t
> > @@ -648,6 +724,66 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> >       b->allocated_ = s;
> >   }
> > +static inline bool
> > +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline uint64_t
> > +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return 0;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> >   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
> >    * correct only if 'dp_packet_rss_valid(p)' returns true */
> >   static inline uint32_t
> > @@ -939,6 +1075,20 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
> >       }
> >   }
> > +static inline bool
> > +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
> > +{
> > +    return dp_packet_hwol_l4_mask(p) ? true : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *p)
> > +{
> > +    return dp_packet_hwol_l4_mask(p) ? true : false;
> > +}
> > +
> >   #ifdef  __cplusplus
> >   }
> >   #endif
> > diff --git a/lib/ipf.c b/lib/ipf.c
> > index 45c489122..0f43593a2 100644
> > --- a/lib/ipf.c
> > +++ b/lib/ipf.c
> > @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
> >       len += rest_len;
> >       l3 = dp_packet_l3(pkt);
> >       ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> > -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> > -                                new_ip_frag_off);
> > -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> > +    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> > +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> > +                                    new_ip_frag_off);
> > +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> > +    }
> >       l3->ip_tot_len = htons(len);
> >       l3->ip_frag_off = new_ip_frag_off;
> >       dp_packet_set_l2_pad_size(pkt, 0);
> > @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
> >       }
> >       if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> > +                     && !dp_packet_hwol_tx_ip_checksum(pkt)
> >                        && csum(l3, ip_hdr_len) != 0)) {
> >           goto invalid_pkt;
> >       }
> > @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
> >                   } else {
> >                       struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
> >                       struct ip_header *l3_reass = dp_packet_l3(pkt);
> > -                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> > -                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> > -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > -                                                     frag_ip, reass_ip);
> > -                    l3_frag->ip_src = l3_reass->ip_src;
> > +                    if (!dp_packet_hwol_tx_ip_checksum(frag_0->pkt)) {
> > +                        ovs_be32 reass_ip =
> > +                            get_16aligned_be32(&l3_reass->ip_src);
> > +                        ovs_be32 frag_ip =
> > +                            get_16aligned_be32(&l3_frag->ip_src);
> > +
> > +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > +                                                         frag_ip, reass_ip);
> > +                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> > +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> > +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > +                                                         frag_ip, reass_ip);
> > +                    }
> > -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> > -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> > -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > -                                                     frag_ip, reass_ip);
> > +                    l3_frag->ip_src = l3_reass->ip_src;
> >                       l3_frag->ip_dst = l3_reass->ip_dst;
> >                   }
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 5e09786ac..2de60aa3f 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -64,6 +64,7 @@
> >   #include "smap.h"
> >   #include "sset.h"
> >   #include "timeval.h"
> > +#include "tso.h"
> >   #include "unaligned.h"
> >   #include "unixctl.h"
> >   #include "util.h"
> > @@ -360,7 +361,8 @@ struct ingress_policer {
> >   enum dpdk_hw_ol_features {
> >       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
> >       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
> > -    NETDEV_RX_HW_SCATTER = 1 << 2
> > +    NETDEV_RX_HW_SCATTER = 1 << 2,
> > +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
> >   };
> >   /*
> > @@ -942,6 +944,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> >           conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
> >       }
> > +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> > +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
> > +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
> > +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
> > +    }
> > +
> >       /* Limit configured rss hash functions to only those supported
> >        * by the eth device. */
> >       conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
> > @@ -1043,6 +1051,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
> >                                        DEV_RX_OFFLOAD_TCP_CKSUM |
> >                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
> > +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
> > +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
> > +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
> >       rte_eth_dev_info_get(dev->port_id, &info);
> > @@ -1069,6 +1080,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
> >       }
> > +    if (info.tx_offload_capa & tx_tso_offload_capa) {
> > +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> > +    } else {
> > +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
> > +        VLOG_WARN("Tx TSO offload is not supported on %s port "
> > +                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
> > +    }
> > +
> >       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
> >       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
> > @@ -1319,14 +1338,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
> >           goto out;
> >       }
> > -    err = rte_vhost_driver_disable_features(dev->vhost_id,
> > -                                1ULL << VIRTIO_NET_F_HOST_TSO4
> > -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > -                                | 1ULL << VIRTIO_NET_F_CSUM);
> > -    if (err) {
> > -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> > -                 "port: %s\n", name);
> > -        goto out;
> > +    if (!tso_enabled()) {
> > +        err = rte_vhost_driver_disable_features(dev->vhost_id,
> > +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> > +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > +                                    | 1ULL << VIRTIO_NET_F_CSUM);
> > +        if (err) {
> > +            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> > +                     "port: %s\n", name);
> > +            goto out;
> > +        }
> >       }
> >       err = rte_vhost_driver_start(dev->vhost_id);
> > @@ -1661,6 +1682,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
> >           } else {
> >               smap_add(args, "rx_csum_offload", "false");
> >           }
> > +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> > +            smap_add(args, "tx_tso_offload", "true");
> > +        } else {
> > +            smap_add(args, "tx_tso_offload", "false");
> > +        }
> >           smap_add(args, "lsc_interrupt_mode",
> >                    dev->lsc_interrupt_mode ? "true" : "false");
> >       }
> > @@ -2088,6 +2114,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
> >       rte_free(rx);
> >   }
> > +/* Prepare the packet for HWOL.
> > + * Return True if the packet is OK to continue. */
> > +static bool
> > +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
> > +{
> > +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> > +
> > +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
> > +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
> > +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
> > +        mbuf->outer_l2_len = 0;
> > +        mbuf->outer_l3_len = 0;
> > +    }
> > +
> > +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
> > +        struct tcp_header *th = dp_packet_l4(pkt);
> > +
> > +        if (!th) {
> > +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
> > +                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
> > +            return false;
> > +        }
> > +
> > +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
> > +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
> > +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
> > +
> > +        if (mbuf->ol_flags & PKT_TX_IPV4) {
> > +            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
> > +        }
> > +    }
> > +    return true;
> > +}
> > +
> > +/* Prepare a batch for HWOL.
> > + * Return the number of good packets in the batch. */
> > +static int
> > +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> > +                            int pkt_cnt)
> > +{
> > +    int i = 0;
> > +    int cnt = 0;
> > +    struct rte_mbuf *pkt;
> > +
> > +    /* Prepare packets for HWOL and drop any that fail preparation. */
> > +    for (i = 0; i < pkt_cnt; i++) {
> > +        pkt = pkts[i];
> > +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
> > +            rte_pktmbuf_free(pkt);
> > +            continue;
> > +        }
> > +
> > +        if (OVS_UNLIKELY(i != cnt)) {
> > +            pkts[cnt] = pkt;
> > +        }
> > +        cnt++;
> > +    }
> > +
> > +    return cnt;
> > +}
> > +
> >   /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
> >    * 'pkts', even in case of failure.
> >    *
> > @@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> >                            struct rte_mbuf **pkts, int cnt)
> >   {
> >       uint32_t nb_tx = 0;
> > +    uint16_t nb_tx_prep = cnt;
> > +
> > +    if (tso_enabled()) {
> > +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> > +        if (nb_tx_prep != cnt) {
> > +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> > +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> > +                         cnt, rte_strerror(rte_errno));
> > +        }
> > +    }
> > -    while (nb_tx != cnt) {
> > +    while (nb_tx != nb_tx_prep) {
> >           uint32_t ret;
> > -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> > +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> > +                               nb_tx_prep - nb_tx);
> >           if (!ret) {
> >               break;
> >           }
> > @@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> >       int cnt = 0;
> >       struct rte_mbuf *pkt;
> > +    /* Filter oversized packets, unless they are marked for TSO. */
> >       for (i = 0; i < pkt_cnt; i++) {
> >           pkt = pkts[i];
> > -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> > -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> > +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> > +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> > +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> > +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> > +                         dev->max_packet_len);
> >               rte_pktmbuf_free(pkt);
> >               continue;
> >           }
> > @@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >       struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> >       struct netdev_dpdk_sw_stats sw_stats_add;
> >       unsigned int n_packets_to_free = cnt;
> > -    unsigned int total_packets = cnt;
> > +    unsigned int total_packets;
> >       int i, retries = 0;
> >       int max_retries = VHOST_ENQ_RETRY_MIN;
> >       int vid = netdev_dpdk_get_vid(dev);
> > @@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >           rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> >       }
> > -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> > +    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> > +    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
> >       sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
> >       /* Check whether QoS has been configured for the netdev. */
> > @@ -2511,6 +2613,121 @@ out:
> >       }
> >   }
> > +static void
> > +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> > +{
> > +    rte_free(opaque);
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> > +{
> > +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> > +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> > +    uint16_t buf_len;
> > +    void *buf;
> > +
> > +    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> > +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> > +    } else {
> > +        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> > +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> > +    }
> > +
> > +    if (unlikely(total_len > UINT16_MAX)) {
> > +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> > +        return NULL;
> > +    }
> > +
> > +    buf_len = total_len;
> > +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> > +    if (unlikely(buf == NULL)) {
> > +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> > +        return NULL;
> > +    }
> > +
> > +    /* Initialize shinfo. */
> > +    if (shinfo) {
> > +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> > +        shinfo->fcb_opaque = buf;
> > +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +    } else {
> > +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> > +                                                    netdev_dpdk_extbuf_free,
> > +                                                    buf);
> > +        if (unlikely(shinfo == NULL)) {
> > +            rte_free(buf);
> > +            VLOG_ERR("Failed to initialize shared info for mbuf while "
> > +                     "attempting to attach an external buffer.");
> > +            return NULL;
> > +        }
> > +    }
> > +
> > +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> > +                              shinfo);
> > +    rte_pktmbuf_reset_headroom(pkt);
> > +
> > +    return pkt;
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> > +{
> > +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> > +
> > +    if (OVS_UNLIKELY(!pkt)) {
> > +        return NULL;
> > +    }
> > +
> > +    dp_packet_init_specific((struct dp_packet *)pkt);
> > +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> > +        return pkt;
> > +    }
> > +
> > +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> > +        return pkt;
> > +    }
> > +
> > +    rte_pktmbuf_free(pkt);
> > +
> > +    return NULL;
> > +}
> > +
> > +static struct dp_packet *
> > +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> > +{
> > +    struct rte_mbuf *mbuf_dest;
> > +    struct dp_packet *pkt_dest;
> > +    uint32_t pkt_len;
> > +
> > +    pkt_len = dp_packet_size(pkt_orig);
> > +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> > +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> > +        return NULL;
> > +    }
> > +
> > +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> > +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> > +    dp_packet_set_size(pkt_dest, pkt_len);
> > +
> > +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> > +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> > +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> > +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> > +
> > +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> > +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> > +
> > +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> > +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> > +                                - (char *)dp_packet_eth(pkt_dest);
> > +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> > +                                - (char *)dp_packet_l3(pkt_dest);
> > +    }
> > +
> > +    return pkt_dest;
> > +}
> > +
> >   /* Tx function. Transmit packets indefinitely */
> >   static void
> >   dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> > @@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >       enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> >   #endif
> >       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> > +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
> >       struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> >       uint32_t cnt = batch_cnt;
> >       uint32_t dropped = 0;
> > @@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >           struct dp_packet *packet = batch->packets[i];
> >           uint32_t size = dp_packet_size(packet);
> > -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> > -                         size, dev->max_packet_len);
> > -
> > +        if (size > dev->max_packet_len
> > +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> > +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> > +                         dev->max_packet_len);
> >               mtu_drops++;
> >               continue;
> >           }
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> > +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
> >           if (OVS_UNLIKELY(!pkts[txcnt])) {
> >               dropped = cnt - i;
> >               break;
> >           }
> > -        /* We have to do a copy for now */
> > -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> > -               dp_packet_data(packet), size);
> > -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> > -
> >           txcnt++;
> >       }
> >       if (OVS_LIKELY(txcnt)) {
> >           if (dev->type == DPDK_DEV_VHOST) {
> > -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> > -                                     txcnt);
> > +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> >           } else {
> > -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> > +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> > +                                                   (struct rte_mbuf **)pkts,
> > +                                                   txcnt);
> >           }
> >       }
> > @@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> >           int batch_cnt = dp_packet_batch_size(batch);
> >           struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> > +        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
> >           tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> >           mtu_drops = batch_cnt - tx_cnt;
> >           qos_drops = tx_cnt;
> > @@ -4345,6 +4559,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >       rte_free(dev->tx_q);
> >       err = dpdk_eth_dev_init(dev);
> > +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> > +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> > +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> > +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> > +    }
> > +
> >       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
> >       if (!dev->tx_q) {
> >           err = ENOMEM;
> > @@ -4374,6 +4594,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
> >           dev->tx_q[0].map = 0;
> >       }
> > +    if (tso_enabled()) {
> > +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> > +        VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
> > +    }
> > +
> >       netdev_dpdk_remap_txqs(dev);
> >       err = netdev_dpdk_mempool_configure(dev);
> > @@ -4446,6 +4671,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> >               vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> >           }
> > +        /* Enable External Buffers if TCP Segmentation Offload is enabled. */
> > +        if (tso_enabled()) {
> > +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
> > +        }
> > +
> >           err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
> >           if (err) {
> >               VLOG_ERR("vhost-user device setup failure for device %s\n",
> > @@ -4470,14 +4700,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> >               goto unlock;
> >           }
> > -        err = rte_vhost_driver_disable_features(dev->vhost_id,
> > -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> > -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > -                                    | 1ULL << VIRTIO_NET_F_CSUM);
> > -        if (err) {
> > -            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> > -                     "client port: %s\n", dev->up.name);
> > -            goto unlock;
> > +        if (tso_enabled()) {
> > +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> > +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> > +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> > +        } else {
> > +            err = rte_vhost_driver_disable_features(dev->vhost_id,
> > +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
> > +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > +                                        | 1ULL << VIRTIO_NET_F_CSUM);
> > +            if (err) {
> > +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
> > +                         "vhost user client port: %s\n", dev->up.name);
> > +                goto unlock;
> > +            }
> >           }
> >           err = rte_vhost_driver_start(dev->vhost_id);
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > index f08159aa7..102548db7 100644
> > --- a/lib/netdev-linux-private.h
> > +++ b/lib/netdev-linux-private.h
> > @@ -37,10 +37,14 @@
> >   struct netdev;
> > +#define LINUX_RXQ_TSO_MAX_LEN 65536
> > +
> >   struct netdev_rxq_linux {
> >       struct netdev_rxq up;
> >       bool is_tap;
> >       int fd;
> > +    char *bufaux;          /* Extra buffer to receive a TSO packet. */
> > +    int bufaux_len;        /* Extra buffer length. */
> >   };
> >   int netdev_linux_construct(struct netdev *);
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index 8a62f9d74..604cb6913 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -29,16 +29,18 @@
> >   #include <linux/filter.h>
> >   #include <linux/gen_stats.h>
> >   #include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> >   #include <linux/if_tun.h>
> >   #include <linux/types.h>
> >   #include <linux/ethtool.h>
> >   #include <linux/mii.h>
> >   #include <linux/rtnetlink.h>
> >   #include <linux/sockios.h>
> > +#include <linux/virtio_net.h>
> >   #include <sys/ioctl.h>
> >   #include <sys/socket.h>
> > +#include <sys/uio.h>
> >   #include <sys/utsname.h>
> > -#include <netpacket/packet.h>
> >   #include <net/if.h>
> >   #include <net/if_arp.h>
> >   #include <net/route.h>
> > @@ -72,6 +74,7 @@
> >   #include "socket-util.h"
> >   #include "sset.h"
> >   #include "tc.h"
> > +#include "tso.h"
> >   #include "timer.h"
> >   #include "unaligned.h"
> >   #include "openvswitch/vlog.h"
> > @@ -501,6 +504,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >    * changes in the device miimon status, so we can use atomic_count. */
> >   static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
> > +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
> > +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
> >   static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
> >                                      int cmd, const char *cmd_name);
> >   static int get_flags(const struct netdev *, unsigned int *flags);
> > @@ -902,6 +907,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
> >       /* The device could be in the same network namespace or in another one. */
> >       netnsid_unset(&netdev->netnsid);
> >       ovs_mutex_init(&netdev->mutex);
> > +
> > +    if (tso_enabled()) {
> > +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> > +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> > +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> > +    }
> > +
> >       return 0;
> >   }
> > @@ -961,6 +973,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
> >       /* Create tap device. */
> >       get_flags(&netdev->up, &netdev->ifi_flags);
> >       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> > +    if (tso_enabled()) {
> > +        ifr.ifr_flags |= IFF_VNET_HDR;
> > +    }
> > +
> >       ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
> >       if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
> >           VLOG_WARN("%s: creating tap device failed: %s", name,
> > @@ -1024,6 +1040,13 @@ static struct netdev_rxq *
> >   netdev_linux_rxq_alloc(void)
> >   {
> >       struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > +    if (tso_enabled()) {
> > +        /* xmalloc() never returns NULL, so no check is needed. */
> > +        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > +        rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
> > +    }
> > +
> >       return &rx->up;
> >   }
> > @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> >               goto error;
> >           }
> > +        if (tso_enabled()) {
> > +            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                               sizeof val);
> > +            if (error) {
> > +                error = errno;
> > +                VLOG_ERR("%s: failed to enable vnet hdr in rxq raw socket: %s",
> > +                         netdev_get_name(netdev_), ovs_strerror(errno));
> > +                goto error;
> > +            }
> > +        }
> > +
> >           /* Set non-blocking mode. */
> >           error = set_nonblocking(rx->fd);
> >           if (error) {
> > @@ -1123,6 +1157,8 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
> >       if (!rx->is_tap) {
> >           close(rx->fd);
> >       }
> > +
> > +    free(rx->bufaux);
> >   }
> >   static void
> > @@ -1152,11 +1188,13 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
> >   }
> >   static int
> > -netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> > +netdev_linux_rxq_recv_sock(int fd, char *bufaux, int bufaux_len,
> > +                           struct dp_packet *buffer)
> >   {
> > -    size_t size;
> > +    size_t std_len;
> > +    size_t total_len;
> >       ssize_t retval;
> > -    struct iovec iov;
> > +    struct iovec iov[2];
> >       struct cmsghdr *cmsg;
> >       union {
> >           struct cmsghdr cmsg;
> > @@ -1166,14 +1204,17 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> >       /* Reserve headroom for a single VLAN tag */
> >       dp_packet_reserve(buffer, VLAN_HEADER_LEN);
> > -    size = dp_packet_tailroom(buffer);
> > +    std_len = dp_packet_tailroom(buffer);
> > +    total_len = std_len + bufaux_len;
> > -    iov.iov_base = dp_packet_data(buffer);
> > -    iov.iov_len = size;
> > +    iov[0].iov_base = dp_packet_data(buffer);
> > +    iov[0].iov_len = std_len;
> > +    iov[1].iov_base = bufaux;
> > +    iov[1].iov_len = bufaux_len;
> >       msgh.msg_name = NULL;
> >       msgh.msg_namelen = 0;
> > -    msgh.msg_iov = &iov;
> > -    msgh.msg_iovlen = 1;
> > +    msgh.msg_iov = iov;
> > +    msgh.msg_iovlen = 2;
> >       msgh.msg_control = &cmsg_buffer;
> >       msgh.msg_controllen = sizeof cmsg_buffer;
> >       msgh.msg_flags = 0;
> > @@ -1184,11 +1225,26 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> >       if (retval < 0) {
> >           return errno;
> > -    } else if (retval > size) {
> > +    } else if (retval > total_len) {
> >           return EMSGSIZE;
> >       }
> > -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    if (retval > std_len) {
> > +        /* Build a single linear TSO packet. */
> > +        size_t extra_len = retval - std_len;
> > +
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > +        dp_packet_prealloc_tailroom(buffer, extra_len);
> > +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > +    } else {
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    }
> > +
> > +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> > +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> > +        return EINVAL;
> > +    }
> >       for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg; cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
> >           const struct tpacket_auxdata *aux;
> > @@ -1221,20 +1277,44 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> >   }
> >   static int
> > -netdev_linux_rxq_recv_tap(int fd, struct dp_packet *buffer)
> > +netdev_linux_rxq_recv_tap(int fd, char *bufaux, int bufaux_len,
> > +                          struct dp_packet *buffer)
> >   {
> >       ssize_t retval;
> > -    size_t size = dp_packet_tailroom(buffer);
> > +    size_t std_len;
> > +    struct iovec iov[2];
> > +
> > +    std_len = dp_packet_tailroom(buffer);
> > +    iov[0].iov_base = dp_packet_data(buffer);
> > +    iov[0].iov_len = std_len;
> > +    iov[1].iov_base = bufaux;
> > +    iov[1].iov_len = bufaux_len;
> >       do {
> > -        retval = read(fd, dp_packet_data(buffer), size);
> > +        retval = readv(fd, iov, 2);
> >       } while (retval < 0 && errno == EINTR);
> >       if (retval < 0) {
> >           return errno;
> >       }
> > -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    if (retval > std_len) {
> > +        /* Build a single linear TSO packet. */
> > +        size_t extra_len = retval - std_len;
> > +
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > +        dp_packet_prealloc_tailroom(buffer, extra_len);
> > +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > +    } else {
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    }
> > +
> > +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> > +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> > +        return EINVAL;
> > +    }
> > +
> >       return 0;
> >   }
> > @@ -1245,6 +1325,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >       struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >       struct netdev *netdev = rx->up.netdev;
> >       struct dp_packet *buffer;
> > +    size_t buffer_len;
> >       ssize_t retval;
> >       int mtu;
> > @@ -1252,12 +1333,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >           mtu = ETH_PAYLOAD_MAX;
> >       }
> > +    buffer_len = VLAN_ETH_HEADER_LEN + mtu;
> > +    if (tso_enabled()) {
> > +        buffer_len += sizeof(struct virtio_net_hdr);
> > +    }
> > +
> >       /* Assume Ethernet port. No need to set packet_type. */
> > -    buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> > -                                           DP_NETDEV_HEADROOM);
> > +    buffer = dp_packet_new_with_headroom(buffer_len, DP_NETDEV_HEADROOM);
> >       retval = (rx->is_tap
> > -              ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> > -              : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > +              ? netdev_linux_rxq_recv_tap(rx->fd, rx->bufaux, rx->bufaux_len,
> > +                                          buffer)
> > +              : netdev_linux_rxq_recv_sock(rx->fd, rx->bufaux, rx->bufaux_len,
> > +                                           buffer));
> >       if (retval) {
> >           if (retval != EAGAIN && retval != EMSGSIZE) {
> > @@ -1302,7 +1389,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
> >   }
> >   static int
> > -netdev_linux_sock_batch_send(int sock, int ifindex,
> > +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
> >                                struct dp_packet_batch *batch)
> >   {
> >       const size_t size = dp_packet_batch_size(batch);
> > @@ -1316,6 +1403,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> >       struct dp_packet *packet;
> >       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        if (tso) {
> > +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> > +        }
> > +
> >           iov[i].iov_base = dp_packet_data(packet);
> >           iov[i].iov_len = dp_packet_size(packet);
> >           mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> > @@ -1348,7 +1439,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> >    * on other interface types because we attach a socket filter to the rx
> >    * socket. */
> >   static int
> > -netdev_linux_tap_batch_send(struct netdev *netdev_,
> > +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
> >                               struct dp_packet_batch *batch)
> >   {
> >       struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > @@ -1365,10 +1456,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
> >       }
> >       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > -        size_t size = dp_packet_size(packet);
> > +        size_t size;
> >           ssize_t retval;
> >           int error;
> > +        if (tso) {
> > +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> > +        }
> > +
> > +        size = dp_packet_size(packet);
> >           do {
> >               retval = write(netdev->tap_fd, dp_packet_data(packet), size);
> >               error = retval < 0 ? errno : 0;
> > @@ -1403,9 +1499,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >                     struct dp_packet_batch *batch,
> >                     bool concurrent_txq OVS_UNUSED)
> >   {
> > +    bool tso = tso_enabled();
> > +    int mtu = ETH_PAYLOAD_MAX;
> >       int error = 0;
> >       int sock = 0;
> > +    if (tso) {
> > +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
> > +    }
> > +
> >       if (!is_tap_netdev(netdev_)) {
> >           if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> >               error = EOPNOTSUPP;
> > @@ -1424,9 +1526,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >               goto free_batch;
> >           }
> > -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
> >       } else {
> > -        error = netdev_linux_tap_batch_send(netdev_, batch);
> > +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
> >       }
> >       if (error) {
> >           if (error == ENOBUFS) {
> > @@ -6173,6 +6275,19 @@ af_packet_sock(void)
> >                   close(sock);
> >                   sock = -error;
> >               }
> > +
> > +            if (tso_enabled()) {
> > +                int val = 1;
> > +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                                   sizeof val);
> > +                if (error) {
> > +                    error = errno;
> > +                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
> > +                             ovs_strerror(errno));
> > +                    close(sock);
> > +                    sock = -error;
> > +                }
> > +            }
> >           } else {
> >               sock = -errno;
> >               VLOG_ERR("failed to create packet socket: %s",
> > @@ -6183,3 +6298,136 @@ af_packet_sock(void)
> >       return sock;
> >   }
> > +
> > +static int
> > +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
> > +{
> > +    struct eth_header *eth_hdr;
> > +    ovs_be16 eth_type;
> > +    int l2_len;
> > +
> > +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
> > +    if (!eth_hdr) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    l2_len = ETH_HEADER_LEN;
> > +    eth_type = eth_hdr->eth_type;
> > +    if (eth_type_vlan(eth_type)) {
> > +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
> > +
> > +        if (!vlan) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        eth_type = vlan->vlan_next_type;
> > +        l2_len += VLAN_HEADER_LEN;
> > +    }
> > +
> > +    if (eth_type == htons(ETH_TYPE_IP)) {
> > +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
> > +
> > +        if (!ip_hdr) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        *l4proto = ip_hdr->ip_proto;
> > +        dp_packet_hwol_set_tx_ipv4(b);
> > +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> > +        struct ovs_16aligned_ip6_hdr *nh6;
> > +
> > +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
> > +        if (!nh6) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
> > +        dp_packet_hwol_set_tx_ipv6(b);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int
> > +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
> > +{
> > +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
> > +    uint16_t l4proto = 0;
> > +
> > +    if (OVS_UNLIKELY(!vnet)) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
> > +        return 0;
> > +    }
> > +
> > +    if (netdev_linux_parse_l2(b, &l4proto)) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > +        if (l4proto == IPPROTO_TCP) {
> > +            dp_packet_hwol_set_csum_tcp(b);
> > +        } else if (l4proto == IPPROTO_UDP) {
> > +            dp_packet_hwol_set_csum_udp(b);
> > +        } else if (l4proto == IPPROTO_SCTP) {
> > +            dp_packet_hwol_set_csum_sctp(b);
> > +        }
> > +    }
> > +
> > +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> > +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
> > +                                | VIRTIO_NET_HDR_GSO_TCPV6
> > +                                | VIRTIO_NET_HDR_GSO_UDP;
> > +        uint8_t type = vnet->gso_type & allowed_mask;
> > +
> > +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
> > +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
> > +            dp_packet_hwol_set_tcp_seg(b);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void
> > +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
> > +{
> > +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
> > +
> > +    if ((dp_packet_size(b) > mtu) && dp_packet_hwol_is_tso(b)) {
> > +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
> > +                            + TCP_HEADER_LEN;
> > +
> > +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
> > +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
> > +        if (dp_packet_hwol_is_ipv4(b)) {
> > +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> > +        } else {
> > +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> > +        }
> > +    } else {
> > +        vnet->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> > +    }
> > +
> > +    if (dp_packet_hwol_l4_mask(b)) {
> > +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> > +        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
> > +                                                  - (char *)dp_packet_eth(b));
> > +
> > +        if (dp_packet_hwol_l4_is_tcp(b)) {
> > +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> > +                                    struct tcp_header, tcp_csum);
> > +        } else if (dp_packet_hwol_l4_is_udp(b)) {
> > +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> > +                                    struct udp_header, udp_csum);
> > +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
> > +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> > +                                    struct sctp_header, sctp_csum);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
> > +        }
> > +    }
> > +}
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index f109c4e66..87c375b47 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -37,6 +37,12 @@ extern "C" {
> >   struct netdev_tnl_build_header_params;
> >   #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
> > +enum netdev_ol_flags {
> > +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
> > +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
> > +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
> > +};
> > +
> >   /* A network device (e.g. an Ethernet device).
> >    *
> >    * Network device implementations may read these members but should not modify
> > @@ -51,6 +57,10 @@ struct netdev {
> >        * opening this device, and therefore got assigned to the "system" class */
> >       bool auto_classified;
> > +    /* This bitmask of the offloading features enabled/supported by the
> > +     * netdev. */
> > +    uint64_t ol_flags;
> > +
> >       /* If this is 'true', the user explicitly specified an MTU for this
> >        * netdev.  Otherwise, Open vSwitch is allowed to override it. */
> >       bool mtu_user_config;
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 405c98c68..998525875 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
> >               : NETDEV_PT_LEGACY_L2);
> >   }
> > +/* Check if a 'packet' is compatible with 'netdev_flags'.
> > + * If a packet is incompatible, return 'false' with the 'errormsg'
> > + * pointing to a reason. */
> > +static bool
> > +netdev_send_prepare_packet(const uint64_t netdev_flags,
> > +                           struct dp_packet *packet, char **errormsg)
> > +{
> > +    if (dp_packet_hwol_is_tso(packet)
> > +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
> > +            /* Fall back to GSO in software. */
> > +            *errormsg = "No TSO support";
> > +            return false;
> > +    }
> > +
> > +    if (dp_packet_hwol_l4_mask(packet)
> > +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
> > +            /* Fall back to L4 csum in software. */
> > +            *errormsg = "No L4 checksum support";
> > +            return false;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +/* Check if each packet in 'batch' is compatible with 'netdev' features,
> > + * otherwise either fall back to software implementation or drop it. */
> > +static void
> > +netdev_send_prepare_batch(const struct netdev *netdev,
> > +                          struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet *packet;
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        char *errormsg = NULL;
> > +
> > +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> > +                         netdev_get_name(netdev),
> > +                         errormsg ? errormsg : "Unsupported feature");
> > +        }
> > +    }
> > +}
> > +
> >   /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
> >    * otherwise a positive errno value.  Returns EAGAIN without blocking if
> >    * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> > @@ -811,8 +857,10 @@ int
> >   netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
> >               bool concurrent_txq)
> >   {
> > -    int error = netdev->netdev_class->send(netdev, qid, batch,
> > -                                           concurrent_txq);
> > +    int error;
> > +
> > +    netdev_send_prepare_batch(netdev, batch);
> > +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
> >       if (!error) {
> >           COVERAGE_INC(netdev_sent);
> >       }
> > @@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev,
> >                      const struct ovs_action_push_tnl *data)
> >   {
> >       struct dp_packet *packet;
> > -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > -        netdev->netdev_class->push_header(netdev, packet, data);
> > -        pkt_metadata_init(&packet->md, data->out_port);
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        if (!dp_packet_hwol_is_tso(packet)) {
> > +            netdev->netdev_class->push_header(netdev, packet, data);
> > +            pkt_metadata_init(&packet->md, data->out_port);
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
> > +                         "packet dropped", netdev_get_name(netdev));
> > +        }
> >       }
> >       return 0;
> > diff --git a/lib/tso.c b/lib/tso.c
> > new file mode 100644
> > index 000000000..9dc15e146
> > --- /dev/null
> > +++ b/lib/tso.c
> > @@ -0,0 +1,54 @@
> > +/*
> > + * Copyright (c) 2020 Red Hat, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#include <config.h>
> > +
> > +#include "smap.h"
> > +#include "ovs-thread.h"
> > +#include "openvswitch/vlog.h"
> > +#include "dpdk.h"
> > +#include "tso.h"
> > +#include "vswitch-idl.h"
> > +
> > +VLOG_DEFINE_THIS_MODULE(tso);
> > +
> > +static bool tso_support_enabled = false;
> > +
> > +void
> > +tso_init(const struct smap *ovs_other_config)
> > +{
> > +    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
> > +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> > +
> > +        if (ovsthread_once_start(&once)) {
> > +            if (dpdk_available()) {
> > +                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
> > +                tso_support_enabled = true;
> > +            } else {
> > +                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
> > +                         "without enabling DPDK");
> > +                tso_support_enabled = false;
> > +            }
> > +            ovsthread_once_done(&once);
> > +        }
> > +    }
> > +}
> > +
> > +bool
> > +tso_enabled(void)
> > +{
> > +    return tso_support_enabled;
> > +}
> > diff --git a/lib/tso.h b/lib/tso.h
> > new file mode 100644
> > index 000000000..6594496ac
> > --- /dev/null
> > +++ b/lib/tso.h
> > @@ -0,0 +1,23 @@
> > +/*
> > + * Copyright (c) 2020 Red Hat Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef TSO_H
> > +#define TSO_H 1
> > +
> > +void tso_init(const struct smap *ovs_other_config);
> > +bool tso_enabled(void);
> > +
> > +#endif /* tso.h */
> > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> > index 86c7b10a9..6d73922f6 100644
> > --- a/vswitchd/bridge.c
> > +++ b/vswitchd/bridge.c
> > @@ -65,6 +65,7 @@
> >   #include "system-stats.h"
> >   #include "timeval.h"
> >   #include "tnl-ports.h"
> > +#include "tso.h"
> >   #include "util.h"
> >   #include "unixctl.h"
> >   #include "lib/vswitch-idl.h"
> > @@ -3285,6 +3286,7 @@ bridge_run(void)
> >       if (cfg) {
> >           netdev_set_flow_api_enabled(&cfg->other_config);
> >           dpdk_init(&cfg->other_config);
> > +        tso_init(&cfg->other_config);
> >       }
> >       /* Initialize the ofproto library.  This only needs to run once, but
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 0ec726c39..354dcabfa 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -690,6 +690,18 @@
> >            once in few hours or a day or a week.
> >           </p>
> >         </column>
> > +      <column name="other_config" key="tso-support"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          Set this value to <code>true</code> to enable support for TSO (TCP
> > +          Segmentation Offloading). When TSO is enabled, vhost-user client
> > +          interfaces can transmit packets up to 64KB.
> > +        </p>
> > +        <p>
> > +          The default value is <code>false</code>. Changing this value requires
> > +          restarting the daemon.
> > +        </p>
> > +      </column>
> >       </group>
> >       <group title="Status">
> >         <column name="next_cfg">
> > 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ilya Maximets Jan. 14, 2020, 5:48 p.m. UTC | #3
On 09.01.2020 15:44, Flavio Leitner wrote:
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
> 
> A guest using vhostuser interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
> 
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
> 
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhostuser interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
> 
> It is recommended to check if NIC hardware supports TSO before enabling
> the feature, which is off by default. For additional information please
> check the tso.rst document.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---

It seems this patch needs a rebase due to the recvmmsg-related changes in
netdev-linux.

I didn't check the sizes and offsets inside the code and didn't look
closely at how the features are enabled on devices. Some comments inline.

>  Documentation/automake.mk           |   1 +
>  Documentation/topics/dpdk/index.rst |   1 +
>  Documentation/topics/dpdk/tso.rst   |  96 +++++++++
>  NEWS                                |   1 +
>  lib/automake.mk                     |   2 +
>  lib/conntrack.c                     |  29 ++-
>  lib/dp-packet.h                     | 152 +++++++++++++-
>  lib/ipf.c                           |  32 +--
>  lib/netdev-dpdk.c                   | 312 ++++++++++++++++++++++++----
>  lib/netdev-linux-private.h          |   4 +
>  lib/netdev-linux.c                  | 296 +++++++++++++++++++++++---
>  lib/netdev-provider.h               |  10 +
>  lib/netdev.c                        |  66 +++++-
>  lib/tso.c                           |  54 +++++
>  lib/tso.h                           |  23 ++
>  vswitchd/bridge.c                   |   2 +
>  vswitchd/vswitch.xml                |  12 ++
>  17 files changed, 1002 insertions(+), 91 deletions(-)
>  create mode 100644 Documentation/topics/dpdk/tso.rst
>  create mode 100644 lib/tso.c
>  create mode 100644 lib/tso.h
> 
> Changelog:
> - v3
>  * Improved the documentation.
>  * Updated copyright year to 2020.
>  * TSO offloaded msg now includes the netdev's name.
>  * Added period at the end of all code comments.
>  * Warn and drop encapsulation of TSO packets.
>  * Fixed travis issue with restricted virtio types.
>  * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
>    which caused packet corruption.
>  * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
>    PKT_TX_IP_CKSUM only for IPv4 packets.
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index f2ca17bad..284327edd 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -35,6 +35,7 @@ DOC_SOURCE = \
>  	Documentation/topics/dpdk/index.rst \
>  	Documentation/topics/dpdk/bridge.rst \
>  	Documentation/topics/dpdk/jumbo-frames.rst \
> +	Documentation/topics/dpdk/tso.rst \
>  	Documentation/topics/dpdk/memory.rst \
>  	Documentation/topics/dpdk/pdump.rst \
>  	Documentation/topics/dpdk/phy.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index f2862ea70..400d56051 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,4 +40,5 @@ DPDK Support
>     /topics/dpdk/qos
>     /topics/dpdk/pdump
>     /topics/dpdk/jumbo-frames
> +   /topics/dpdk/tso
>     /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
> new file mode 100644
> index 000000000..189c86480
> --- /dev/null
> +++ b/Documentation/topics/dpdk/tso.rst
> @@ -0,0 +1,96 @@
> +..
> +      Copyright 2020, Red Hat, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +========================
> +Userspace Datapath - TSO
> +========================
> +
> +**Note:** This feature is considered experimental.
> +
> +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> +segmentation achieves computational savings in the core, freeing up CPU cycles
> +for more useful work.
> +
> +A common use case for TSO is when using virtualization, where traffic that's
> +coming in from a VM can offload the TCP segmentation, thus avoiding the
> +fragmentation in software. Additionally, if the traffic is headed to a VM
> +within the same host further optimization can be expected. As the traffic never
> +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> +and checksum calculations are required, which saves yet more cycles. Only when
> +the traffic actually leaves the host the segmentation needs to happen, in which
> +case it will be performed by the egress NIC. Two prerequisites apply: first,
> +the NIC must support `TSO` (consult your controller's datasheet for
> +compatibility); second, the NIC must have an associated DPDK Poll Mode
> +Driver (PMD) which supports `TSO`. For a list of features per PMD, refer to
> +the `DPDK documentation`__.
> +
> +__ https://doc.dpdk.org/guides/nics/overview.html

This should point to the 19.11 version of the guide instead of the latest one:
https://doc.dpdk.org/guides-19.11/nics/overview.html

> +
> +Enabling TSO
> +~~~~~~~~~~~~
> +
> +The TSO support may be enabled via a global config value ``tso-support``.
> +Setting this to ``true`` enables TSO support for all ports.
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true


I'd suggest renaming this to 'userspace-tso-support' to avoid misunderstanding.

> +
> +The default value is ``false``.
> +
> +Changing ``tso-support`` requires restarting the daemon.
> +
> +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
> +
> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> +connection is established, `TSO` is thus advertised to the guest as an
> +available feature:
> +
> +1. QEMU Command Line Parameter::
> +
> +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> +    ...
> +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> +    ...
> +
> +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> +used to enable the same::
> +
> +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> +    $ ethtool -K eth0 tso on
> +    $ ethtool -k eth0
> +
> +Limitations
> +~~~~~~~~~~~
> +
> +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP,
> +etc.]).
> +
> +There is no software implementation of TSO, so all ports attached to the
> +datapath must support TSO or packets using that feature will be dropped
> +on ports without TSO support.  That also means guests using vhost-user
> +in client mode will receive TSO packets regardless of TSO being enabled
> +or disabled within the guest.
> diff --git a/NEWS b/NEWS
> index 965facaf8..306c0493d 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -26,6 +26,7 @@ Post-v2.12.0
>       * DPDK ring ports (dpdkr) are deprecated and will be removed in next
>         releases.
>       * Add support for DPDK 19.11.
> +     * Add experimental support for TSO.
>     - RSTP:
>       * The rstp_statistics column in Port table will only be updated every
>         stats-update-interval configured in Open_vSwtich table.
> diff --git a/lib/automake.mk b/lib/automake.mk
> index ebf714501..94a1b4459 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
>  	lib/tnl-neigh-cache.h \
>  	lib/tnl-ports.c \
>  	lib/tnl-ports.h \
> +	lib/tso.c \
> +	lib/tso.h \

s/tso/userspace-tso/

>  	lib/netdev-native-tnl.c \
>  	lib/netdev-native-tnl.h \
>  	lib/token-bucket.c \
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index b80080e72..679054b98 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
>          if (hwol_bad_l3_csum) {
>              ok = false;
>          } else {
> -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> +                                     || dp_packet_hwol_tx_ip_checksum(pkt);
>              /* Validate the checksum only when hwol is not supported. */
>              ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
>                                   !hwol_good_l3_csum);
> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
>      if (ok) {
>          bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
>          if (!hwol_bad_l4_csum) {
> -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
>              /* Validate the checksum only when hwol is not supported. */
>              if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
>                             &ctx->icmp_related, l3, !hwol_good_l4_csum,
> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>                  }
>                  if (seq_skew) {
>                      ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> -                                          l3_hdr->ip_tot_len, htons(ip_len));
> +                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> +                                                        l3_hdr->ip_tot_len,
> +                                                        htons(ip_len));
> +                    }
>                      l3_hdr->ip_tot_len = htons(ip_len);
>                  }
>              }
> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>      }
>  
>      th->tcp_csum = 0;
> -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> -                           dp_packet_l4_size(pkt));
> -    } else {
> -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> -        th->tcp_csum = csum_finish(
> -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> +                               dp_packet_l4_size(pkt));
> +        } else {
> +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> +            th->tcp_csum = csum_finish(
> +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> +        }
>      }
>  
>      if (seq_skew) {
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 133942155..d10a0416e 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
>  static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
>  static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
>  
> +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);

No such function.

> +
>  void *dp_packet_resize_l2(struct dp_packet *, int increment);
>  void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
>  static inline void *dp_packet_eth(const struct dp_packet *);
> @@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p)
>  {
>      /* This initialization is needed for packets that do not come from DPDK
>       * interfaces, when vswitchd is built with --with-dpdk. */
> -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
>      p->mbuf.nb_segs = 1;
>      p->mbuf.next = NULL;
>  }
> @@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
>      b->mbuf.buf_len = s;
>  }
>  
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
> +           ? true
> +           : false;

The usual way to convert to bool is '!!'.  This will also save some space.

> +}
> +
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> +{
> +    return b->mbuf.ol_flags & PKT_TX_IPV4 ? true : false;

Ditto.

> +}
> +
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> +{
> +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM

The result of '==' is already boolean.

> +           ? true
> +           : false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM

Ditto.

> +           ? true
> +           : false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
> +{
> +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM

Ditto.

> +           ? true
> +           : false;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {

'{' should be on the next line.  Same for all the functions below.

Also, some comments on these functions would be nice, or at least a single
comment per group of functions.

> +    b->mbuf.ol_flags |= PKT_TX_IPV4;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_IPV6;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b) {
> +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
> +}
> +
>  /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
>   * correct only if 'dp_packet_rss_valid(p)' returns true */
>  static inline uint32_t
> @@ -648,6 +724,66 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
>      b->allocated_ = s;
>  }
>  
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return 0;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> +{
> +    return false;
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) {
> +}
> +
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) {
> +}
> +
>  /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
>   * correct only if 'dp_packet_rss_valid(p)' returns true */
>  static inline uint32_t
> @@ -939,6 +1075,20 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
>      }
>  }
>  

Some comments for below functions too.

> +static inline bool
> +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
> +{
> +
> +    return dp_packet_hwol_l4_mask(p) ? true : false;

'!!'?
Also, it seems strange to test the L4 offloading mask to decide whether we
have an IP checksum offload.  Shouldn't we check for PKT_TX_IPV4/6
instead?  The current check might not work for pure IP packets (without L4).

> +}
> +
> +static inline bool
> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *p)
> +{
> +
> +    return dp_packet_hwol_l4_mask(p) ? true : false;

!!

> +}
> +
>  #ifdef  __cplusplus
>  }
>  #endif
> diff --git a/lib/ipf.c b/lib/ipf.c
> index 45c489122..0f43593a2 100644
> --- a/lib/ipf.c
> +++ b/lib/ipf.c
> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
>      len += rest_len;
>      l3 = dp_packet_l3(pkt);
>      ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> -                                new_ip_frag_off);
> -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> +                                    new_ip_frag_off);
> +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    }
>      l3->ip_tot_len = htons(len);
>      l3->ip_frag_off = new_ip_frag_off;
>      dp_packet_set_l2_pad_size(pkt, 0);
> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
>      }
>  
>      if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> +                     && !dp_packet_hwol_tx_ip_checksum(pkt)
>                       && csum(l3, ip_hdr_len) != 0)) {
>          goto invalid_pkt;
>      }
> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
>                  } else {
>                      struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
>                      struct ip_header *l3_reass = dp_packet_l3(pkt);
> -                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> -                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> -                                                     frag_ip, reass_ip);
> -                    l3_frag->ip_src = l3_reass->ip_src;
> +                    if (!dp_packet_hwol_tx_ip_checksum(frag_0->pkt)) {
> +                        ovs_be32 reass_ip =
> +                            get_16aligned_be32(&l3_reass->ip_src);
> +                        ovs_be32 frag_ip =
> +                            get_16aligned_be32(&l3_frag->ip_src);
> +
> +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                         frag_ip, reass_ip);
> +                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                         frag_ip, reass_ip);
> +                    }
>  
> -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> -                                                     frag_ip, reass_ip);
> +                    l3_frag->ip_src = l3_reass->ip_src;
>                      l3_frag->ip_dst = l3_reass->ip_dst;
>                  }
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 5e09786ac..2de60aa3f 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -64,6 +64,7 @@
>  #include "smap.h"
>  #include "sset.h"
>  #include "timeval.h"
> +#include "tso.h"
>  #include "unaligned.h"
>  #include "unixctl.h"
>  #include "util.h"
> @@ -360,7 +361,8 @@ struct ingress_policer {
>  enum dpdk_hw_ol_features {
>      NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
>      NETDEV_RX_HW_CRC_STRIP = 1 << 1,
> -    NETDEV_RX_HW_SCATTER = 1 << 2
> +    NETDEV_RX_HW_SCATTER = 1 << 2,
> +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
>  };
>  
>  /*
> @@ -942,6 +944,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>          conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
>      }
>  
> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
> +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>      /* Limit configured rss hash functions to only those supported
>       * by the eth device. */
>      conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
> @@ -1043,6 +1051,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>      uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
>                                       DEV_RX_OFFLOAD_TCP_CKSUM |
>                                       DEV_RX_OFFLOAD_IPV4_CKSUM;
> +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
> +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
> +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
>  
>      rte_eth_dev_info_get(dev->port_id, &info);
>  
> @@ -1069,6 +1080,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>          dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
>      }
>  
> +    if (info.tx_offload_capa & tx_tso_offload_capa) {
> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> +    } else {
> +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
> +        VLOG_WARN("Tx TSO offload is not supported on %s port "
> +                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
> +    }
> +
>      n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>      n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>  
> @@ -1319,14 +1338,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
>          goto out;
>      }
>  
> -    err = rte_vhost_driver_disable_features(dev->vhost_id,
> -                                1ULL << VIRTIO_NET_F_HOST_TSO4
> -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
> -                                | 1ULL << VIRTIO_NET_F_CSUM);
> -    if (err) {
> -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> -                 "port: %s\n", name);
> -        goto out;
> +    if (!tso_enabled()) {
> +        err = rte_vhost_driver_disable_features(dev->vhost_id,
> +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> +                                    | 1ULL << VIRTIO_NET_F_CSUM);
> +        if (err) {
> +            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> +                     "port: %s\n", name);
> +            goto out;
> +        }
>      }
>  
>      err = rte_vhost_driver_start(dev->vhost_id);
> @@ -1661,6 +1682,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
>          } else {
>              smap_add(args, "rx_csum_offload", "false");
>          }
> +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +            smap_add(args, "tx_tso_offload", "true");
> +        } else {
> +            smap_add(args, "tx_tso_offload", "false");
> +        }

I know that we're currently returning all of this stuff in get_config(),
but ideally it should be in get_status().  We can probably fix all of these
keys at once later.

>          smap_add(args, "lsc_interrupt_mode",
>                   dev->lsc_interrupt_mode ? "true" : "false");
>      }
> @@ -2088,6 +2114,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
>      rte_free(rx);
>  }
>  
> +/* Prepare the packet for HWOL.
> + * Return True if the packet is OK to continue. */
> +static bool
> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
> +{
> +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> +
> +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
> +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
> +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
> +        mbuf->outer_l2_len = 0;
> +        mbuf->outer_l3_len = 0;
> +    }
> +
> +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
> +        struct tcp_header *th = dp_packet_l4(pkt);
> +
> +        if (!th) {
> +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
> +                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
> +            return false;
> +        }
> +
> +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
> +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
> +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
> +
> +        if (mbuf->ol_flags & PKT_TX_IPV4) {
> +            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
> +        }
> +    }
> +    return true;
> +}
> +
> +/* Prepare a batch for HWOL.
> + * Return the number of good packets in the batch. */
> +static int
> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> +                            int pkt_cnt)
> +{
> +    int i = 0;
> +    int cnt = 0;
> +    struct rte_mbuf *pkt;
> +
> +    /* Prepare and filter bad HWOL packets. */
> +    for (i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
> +            rte_pktmbuf_free(pkt);
> +            continue;
> +        }
> +
> +        if (OVS_UNLIKELY(i != cnt)) {
> +            pkts[cnt] = pkt;
> +        }
> +        cnt++;
> +    }
> +
> +    return cnt;
> +}
> +
>  /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
>   * 'pkts', even in case of failure.
>   *
> @@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
>                           struct rte_mbuf **pkts, int cnt)
>  {
>      uint32_t nb_tx = 0;
> +    uint16_t nb_tx_prep = cnt;
> +
> +    if (tso_enabled()) {
> +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);

Packets dropped here are not counted in any statistics.

> +        if (nb_tx_prep != cnt) {
> +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> +                         cnt, rte_strerror(rte_errno));
> +        }
> +    }
>  
> -    while (nb_tx != cnt) {
> +    while (nb_tx != nb_tx_prep) {
>          uint32_t ret;
>  
> -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> +                               nb_tx_prep - nb_tx);
>          if (!ret) {
>              break;
>          }
> @@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
>      int cnt = 0;
>      struct rte_mbuf *pkt;
>  
> +    /* Filter oversized packets, unless they are marked for TSO. */
>      for (i = 0; i < pkt_cnt; i++) {
>          pkt = pkts[i];
> -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> +                         dev->max_packet_len);
>              rte_pktmbuf_free(pkt);
>              continue;
>          }
> @@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
>      struct netdev_dpdk_sw_stats sw_stats_add;
>      unsigned int n_packets_to_free = cnt;
> -    unsigned int total_packets = cnt;
> +    unsigned int total_packets;
>      int i, retries = 0;
>      int max_retries = VHOST_ENQ_RETRY_MIN;
>      int vid = netdev_dpdk_get_vid(dev);
> @@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>          rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>      }
>  
> -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> +    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);

Have you checked the performance impact for a non-TSO setup?

> +    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
>      sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
>  
>      /* Check has QoS has been configured for the netdev */
> @@ -2511,6 +2613,121 @@ out:
>      }
>  }
>  
> +static void
> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> +{
> +    rte_free(opaque);
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> +{
> +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> +    uint16_t buf_len;
> +    void *buf;
> +
> +    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> +    } else {
> +        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> +    }
> +
> +    if (unlikely(total_len > UINT16_MAX)) {
> +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> +        return NULL;
> +    }
> +
> +    buf_len = total_len;
> +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> +    if (unlikely(buf == NULL)) {
> +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> +        return NULL;
> +    }
> +
> +    /* Initialize shinfo. */
> +    if (shinfo) {
> +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> +        shinfo->fcb_opaque = buf;
> +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> +    } else {
> +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> +                                                    netdev_dpdk_extbuf_free,
> +                                                    buf);
> +        if (unlikely(shinfo == NULL)) {
> +            rte_free(buf);
> +            VLOG_ERR("Failed to initialize shared info for mbuf while "
> +                     "attempting to attach an external buffer.");
> +            return NULL;
> +        }
> +    }
> +
> +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> +                              shinfo);
> +    rte_pktmbuf_reset_headroom(pkt);
> +
> +    return pkt;
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> +{
> +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> +
> +    if (OVS_UNLIKELY(!pkt)) {
> +        return NULL;
> +    }
> +
> +    dp_packet_init_specific((struct dp_packet *)pkt);
> +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> +        return pkt;
> +    }
> +
> +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> +        return pkt;
> +    }
> +
> +    rte_pktmbuf_free(pkt);
> +
> +    return NULL;
> +}
> +
> +static struct dp_packet *
> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> +{
> +    struct rte_mbuf *mbuf_dest;
> +    struct dp_packet *pkt_dest;
> +    uint32_t pkt_len;
> +
> +    pkt_len = dp_packet_size(pkt_orig);
> +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> +            return NULL;
> +    }
> +
> +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> +    dp_packet_set_size(pkt_dest, pkt_len);
> +
> +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> +
> +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> +
> +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> +                                - (char *)dp_packet_eth(pkt_dest);
> +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> +                                - (char *) dp_packet_l3(pkt_dest);
> +    }
> +
> +    return pkt_dest;
> +}
> +
>  /* Tx function. Transmit packets indefinitely */
>  static void
>  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> @@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>      enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
>  #endif
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
>      struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
>      uint32_t cnt = batch_cnt;
>      uint32_t dropped = 0;
> @@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>          struct dp_packet *packet = batch->packets[i];
>          uint32_t size = dp_packet_size(packet);
>  
> -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> -                         size, dev->max_packet_len);
> -
> +        if (size > dev->max_packet_len
> +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> +                         dev->max_packet_len);
>              mtu_drops++;
>              continue;
>          }
>  
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
>          if (OVS_UNLIKELY(!pkts[txcnt])) {
>              dropped = cnt - i;
>              break;
>          }
>  
> -        /* We have to do a copy for now */
> -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> -               dp_packet_data(packet), size);
> -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> -
>          txcnt++;
>      }
>  
>      if (OVS_LIKELY(txcnt)) {
>          if (dev->type == DPDK_DEV_VHOST) {
> -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> -                                     txcnt);
> +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
>          } else {
> -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> +                                                   (struct rte_mbuf **)pkts,
> +                                                   txcnt);
>          }
>      }
>  
> @@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>          int batch_cnt = dp_packet_batch_size(batch);
>          struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>  
> +        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);

Packets dropped here are not counted in any statistics.

Also, this function is called unconditionally.  What is the performance
impact in the non-TSO case?

>          tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
>          mtu_drops = batch_cnt - tx_cnt;
>          qos_drops = tx_cnt;
> @@ -4345,6 +4559,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>  
>      rte_free(dev->tx_q);
>      err = dpdk_eth_dev_init(dev);
> +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>      dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>      if (!dev->tx_q) {
>          err = ENOMEM;
> @@ -4374,6 +4594,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>          dev->tx_q[0].map = 0;
>      }
>  
> +    if (tso_enabled()) {
> +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> +        VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
> +    }
> +
>      netdev_dpdk_remap_txqs(dev);
>  
>      err = netdev_dpdk_mempool_configure(dev);
> @@ -4446,6 +4671,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>              vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>          }
>  
> +        /* Enable External Buffers if TCP Segmentation Offload is enabled. */
> +        if (tso_enabled()) {
> +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
> +        }
> +
>          err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>          if (err) {
>              VLOG_ERR("vhost-user device setup failure for device %s\n",
> @@ -4470,14 +4700,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>              goto unlock;
>          }
>  
> -        err = rte_vhost_driver_disable_features(dev->vhost_id,
> -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> -                                    | 1ULL << VIRTIO_NET_F_CSUM);
> -        if (err) {
> -            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> -                     "client port: %s\n", dev->up.name);
> -            goto unlock;
> +        if (tso_enabled()) {
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +        } else {
> +            err = rte_vhost_driver_disable_features(dev->vhost_id,
> +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
> +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
> +                                        | 1ULL << VIRTIO_NET_F_CSUM);
> +            if (err) {
> +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
> +                         "vhost user client port: %s\n", dev->up.name);
> +                goto unlock;
> +            }
>          }
>  
>          err = rte_vhost_driver_start(dev->vhost_id);
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> index f08159aa7..102548db7 100644
> --- a/lib/netdev-linux-private.h
> +++ b/lib/netdev-linux-private.h
> @@ -37,10 +37,14 @@
>  
>  struct netdev;
>  
> +#define LINUX_RXQ_TSO_MAX_LEN 65536
> +
>  struct netdev_rxq_linux {
>      struct netdev_rxq up;
>      bool is_tap;
>      int fd;
> +    char *bufaux;          /* Extra buffer to recv TSO pkt. */
> +    int bufaux_len;        /* Extra buffer length. */

The length never changes.  Why do we need 'bufaux_len'?

>  };
>  
>  int netdev_linux_construct(struct netdev *);
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 8a62f9d74..604cb6913 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -29,16 +29,18 @@
>  #include <linux/filter.h>
>  #include <linux/gen_stats.h>
>  #include <linux/if_ether.h>
> +#include <linux/if_packet.h>
>  #include <linux/if_tun.h>
>  #include <linux/types.h>
>  #include <linux/ethtool.h>
>  #include <linux/mii.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/sockios.h>
> +#include <linux/virtio_net.h>
>  #include <sys/ioctl.h>
>  #include <sys/socket.h>
> +#include <sys/uio.h>
>  #include <sys/utsname.h>
> -#include <netpacket/packet.h>
>  #include <net/if.h>
>  #include <net/if_arp.h>
>  #include <net/route.h>
> @@ -72,6 +74,7 @@
>  #include "socket-util.h"
>  #include "sset.h"
>  #include "tc.h"
> +#include "tso.h"
>  #include "timer.h"
>  #include "unaligned.h"
>  #include "openvswitch/vlog.h"
> @@ -501,6 +504,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>   * changes in the device miimon status, so we can use atomic_count. */
>  static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>  
> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
>  static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
>                                     int cmd, const char *cmd_name);
>  static int get_flags(const struct netdev *, unsigned int *flags);
> @@ -902,6 +907,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
>      /* The device could be in the same network namespace or in another one. */
>      netnsid_unset(&netdev->netnsid);
>      ovs_mutex_init(&netdev->mutex);
> +
> +    if (tso_enabled()) {
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> +    }
> +
>      return 0;
>  }
>  
> @@ -961,6 +973,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
>      /* Create tap device. */
>      get_flags(&netdev->up, &netdev->ifi_flags);
>      ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> +    if (tso_enabled()) {
> +        ifr.ifr_flags |= IFF_VNET_HDR;
> +    }
> +
>      ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
>      if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
>          VLOG_WARN("%s: creating tap device failed: %s", name,
> @@ -1024,6 +1040,13 @@ static struct netdev_rxq *
>  netdev_linux_rxq_alloc(void)
>  {
>      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> +    if (tso_enabled()) {
> +        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> +        if (rx->bufaux) {

xmalloc() cannot fail, so this check is unnecessary.

> +            rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
> +        }
> +    }
> +
>      return &rx->up;
>  }
>  
> @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>              goto error;
>          }
>  
> +        if (tso_enabled()) {
> +            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> +                               sizeof val);
> +            if (error) {

You're not using the value of 'error' here.  Make it just "if (setsockopt(...)) {".

> +                error = errno;
> +                VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s",
> +                         netdev_get_name(netdev_), ovs_strerror(errno));
> +                goto error;
> +            }
> +        }
> +
>          /* Set non-blocking mode. */
>          error = set_nonblocking(rx->fd);
>          if (error) {
> @@ -1123,6 +1157,8 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>      if (!rx->is_tap) {
>          close(rx->fd);
>      }
> +
> +    free(rx->bufaux);
>  }
>  
>  static void
> @@ -1152,11 +1188,13 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
>  }
>  
>  static int
> -netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> +netdev_linux_rxq_recv_sock(int fd, char *bufaux, int bufaux_len,
> +                           struct dp_packet *buffer)
>  {
> -    size_t size;
> +    size_t std_len;
> +    size_t total_len;
>      ssize_t retval;
> -    struct iovec iov;
> +    struct iovec iov[2];
>      struct cmsghdr *cmsg;
>      union {
>          struct cmsghdr cmsg;
> @@ -1166,14 +1204,17 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
>  
>      /* Reserve headroom for a single VLAN tag */
>      dp_packet_reserve(buffer, VLAN_HEADER_LEN);
> -    size = dp_packet_tailroom(buffer);
> +    std_len = dp_packet_tailroom(buffer);
> +    total_len = std_len + bufaux_len;
>  
> -    iov.iov_base = dp_packet_data(buffer);
> -    iov.iov_len = size;
> +    iov[0].iov_base = dp_packet_data(buffer);
> +    iov[0].iov_len = std_len;
> +    iov[1].iov_base = bufaux;
> +    iov[1].iov_len = bufaux_len;
>      msgh.msg_name = NULL;
>      msgh.msg_namelen = 0;
> -    msgh.msg_iov = &iov;
> -    msgh.msg_iovlen = 1;
> +    msgh.msg_iov = iov;
> +    msgh.msg_iovlen = 2;
>      msgh.msg_control = &cmsg_buffer;
>      msgh.msg_controllen = sizeof cmsg_buffer;
>      msgh.msg_flags = 0;
> @@ -1184,11 +1225,26 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
>  
>      if (retval < 0) {
>          return errno;
> -    } else if (retval > size) {
> +    } else if (retval > total_len) {
>          return EMSGSIZE;
>      }
>  
> -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    if (retval > std_len) {
> +        /* Build a single linear TSO packet. */
> +        size_t extra_len = retval - std_len;
> +
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> +        dp_packet_prealloc_tailroom(buffer, extra_len);
> +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> +    } else {
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    }
> +
> +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> +        return EINVAL;
> +    }
>  
>      for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg; cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
>          const struct tpacket_auxdata *aux;
> @@ -1221,20 +1277,44 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
>  }
>  
>  static int
> -netdev_linux_rxq_recv_tap(int fd, struct dp_packet *buffer)
> +netdev_linux_rxq_recv_tap(int fd, char *bufaux, int bufaux_len,
> +                          struct dp_packet *buffer)
>  {
>      ssize_t retval;
> -    size_t size = dp_packet_tailroom(buffer);
> +    size_t std_len;
> +    struct iovec iov[2];
> +
> +    std_len = dp_packet_tailroom(buffer);
> +    iov[0].iov_base = dp_packet_data(buffer);
> +    iov[0].iov_len = std_len;
> +    iov[1].iov_base = bufaux;
> +    iov[1].iov_len = bufaux_len;
>  
>      do {
> -        retval = read(fd, dp_packet_data(buffer), size);
> +        retval = readv(fd, iov, 2);
>      } while (retval < 0 && errno == EINTR);
>  
>      if (retval < 0) {
>          return errno;
>      }
>  
> -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    if (retval > std_len) {
> +        /* Build a single linear TSO packet. */
> +        size_t extra_len = retval - std_len;
> +
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> +        dp_packet_prealloc_tailroom(buffer, extra_len);
> +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> +    } else {
> +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +    }
> +
> +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> +        return EINVAL;
> +    }
> +
>      return 0;
>  }
>  
> @@ -1245,6 +1325,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>      struct netdev *netdev = rx->up.netdev;
>      struct dp_packet *buffer;
> +    size_t buffer_len;
>      ssize_t retval;
>      int mtu;
>  
> @@ -1252,12 +1333,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          mtu = ETH_PAYLOAD_MAX;
>      }
>  
> +    buffer_len = VLAN_ETH_HEADER_LEN + mtu;
> +    if (tso_enabled()) {
> +            buffer_len += sizeof(struct virtio_net_hdr);
> +    }
> +
>      /* Assume Ethernet port. No need to set packet_type. */
> -    buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> -                                           DP_NETDEV_HEADROOM);
> +    buffer = dp_packet_new_with_headroom(buffer_len, DP_NETDEV_HEADROOM);
>      retval = (rx->is_tap
> -              ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> -              : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> +              ? netdev_linux_rxq_recv_tap(rx->fd, rx->bufaux, rx->bufaux_len,
> +                                          buffer)
> +              : netdev_linux_rxq_recv_sock(rx->fd, rx->bufaux, rx->bufaux_len,
> +                                           buffer));
>  
>      if (retval) {
>          if (retval != EAGAIN && retval != EMSGSIZE) {
> @@ -1302,7 +1389,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
>  }
>  
>  static int
> -netdev_linux_sock_batch_send(int sock, int ifindex,
> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>                               struct dp_packet_batch *batch)
>  {
>      const size_t size = dp_packet_batch_size(batch);
> @@ -1316,6 +1403,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>  
>      struct dp_packet *packet;
>      DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (tso) {
> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> +        }
> +
>          iov[i].iov_base = dp_packet_data(packet);
>          iov[i].iov_len = dp_packet_size(packet);
>          mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> @@ -1348,7 +1439,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>   * on other interface types because we attach a socket filter to the rx
>   * socket. */
>  static int
> -netdev_linux_tap_batch_send(struct netdev *netdev_,
> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
>                              struct dp_packet_batch *batch)
>  {
>      struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> @@ -1365,10 +1456,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
>      }
>  
>      DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> -        size_t size = dp_packet_size(packet);
> +        size_t size;
>          ssize_t retval;
>          int error;
>  
> +        if (tso) {
> +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> +        }
> +
> +        size = dp_packet_size(packet);
>          do {
>              retval = write(netdev->tap_fd, dp_packet_data(packet), size);
>              error = retval < 0 ? errno : 0;
> @@ -1403,9 +1499,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>                    struct dp_packet_batch *batch,
>                    bool concurrent_txq OVS_UNUSED)
>  {
> +    bool tso = tso_enabled();
> +    int mtu = ETH_PAYLOAD_MAX;
>      int error = 0;
>      int sock = 0;
>  
> +    if (tso) {
> +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
> +    }
> +
>      if (!is_tap_netdev(netdev_)) {
>          if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>              error = EOPNOTSUPP;
> @@ -1424,9 +1526,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>              goto free_batch;
>          }
>  
> -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
>      } else {
> -        error = netdev_linux_tap_batch_send(netdev_, batch);
> +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
>      }
>      if (error) {
>          if (error == ENOBUFS) {
> @@ -6173,6 +6275,19 @@ af_packet_sock(void)
>                  close(sock);
>                  sock = -error;
>              }
> +
> +            if (tso_enabled()) {
> +                int val = 1;
> +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> +                                   sizeof val);

The socket might already be closed at this point (after the failure path
just above), which would lead to a double close.

> +                if (error) {

Ditto.

> +                    error = errno;
> +                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
> +                             ovs_strerror(errno));
> +                    close(sock);
> +                    sock = -error;
> +                }
> +            }
>          } else {
>              sock = -errno;
>              VLOG_ERR("failed to create packet socket: %s",
> @@ -6183,3 +6298,136 @@ af_packet_sock(void)
>  
>      return sock;
>  }
> +
> +static int
> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
> +{
> +    struct eth_header *eth_hdr;
> +    ovs_be16 eth_type;
> +    int l2_len;
> +
> +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
> +    if (!eth_hdr) {
> +        return -EINVAL;
> +    }
> +
> +    l2_len = ETH_HEADER_LEN;
> +    eth_type = eth_hdr->eth_type;
> +    if (eth_type_vlan(eth_type)) {
> +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
> +
> +        if (!vlan) {
> +            return -EINVAL;
> +        }
> +
> +        eth_type = vlan->vlan_next_type;
> +        l2_len += VLAN_HEADER_LEN;
> +    }
> +
> +    if (eth_type == htons(ETH_TYPE_IP)) {
> +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
> +
> +        if (!ip_hdr) {
> +            return -EINVAL;
> +        }
> +
> +        *l4proto = ip_hdr->ip_proto;
> +        dp_packet_hwol_set_tx_ipv4(b);
> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> +        struct ovs_16aligned_ip6_hdr *nh6;
> +
> +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
> +        if (!nh6) {
> +            return -EINVAL;
> +        }
> +
> +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
> +        dp_packet_hwol_set_tx_ipv6(b);
> +    }
> +
> +    return 0;
> +}
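For illustration, the L2 walk above boils down to the following self-contained sketch (structs replaced by raw byte offsets, constants inlined; not the OVS definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_TYPE_IP     0x0800
#define ETH_TYPE_IPV6   0x86dd
#define ETH_TYPE_VLAN   0x8100
#define ETH_HEADER_LEN  14
#define VLAN_HEADER_LEN 4

/* Returns the L2 length (14, or 18 with one VLAN tag) and stores the inner
 * ethertype, mirroring netdev_linux_parse_l2()'s walk; -1 on a truncated
 * frame, like the dp_packet_at() checks above. */
static int
l2_len(const uint8_t *frame, size_t len, uint16_t *eth_type)
{
    if (len < ETH_HEADER_LEN) {
        return -1;
    }
    *eth_type = (frame[12] << 8) | frame[13];
    if (*eth_type == ETH_TYPE_VLAN) {
        if (len < ETH_HEADER_LEN + VLAN_HEADER_LEN) {
            return -1;
        }
        *eth_type = (frame[16] << 8) | frame[17];
        return ETH_HEADER_LEN + VLAN_HEADER_LEN;
    }
    return ETH_HEADER_LEN;
}
```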
> +
> +static int
> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
> +{
> +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
> +    uint16_t l4proto = 0;
> +
> +    if (OVS_UNLIKELY(!vnet)) {
> +        return -EINVAL;
> +    }
> +
> +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
> +        return 0;
> +    }
> +
> +    if (netdev_linux_parse_l2(b, &l4proto)) {
> +        return -EINVAL;
> +    }
> +
> +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> +        if (l4proto == IPPROTO_TCP) {
> +            dp_packet_hwol_set_csum_tcp(b);
> +        } else if (l4proto == IPPROTO_UDP) {
> +            dp_packet_hwol_set_csum_udp(b);
> +        } else if (l4proto == IPPROTO_SCTP) {
> +            dp_packet_hwol_set_csum_sctp(b);
> +        }
> +    }
> +
> +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
> +                                | VIRTIO_NET_HDR_GSO_TCPV6
> +                                | VIRTIO_NET_HDR_GSO_UDP;
> +        uint8_t type = vnet->gso_type & allowed_mask;
> +
> +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
> +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
> +            dp_packet_hwol_set_tcp_seg(b);
> +        }
> +    }
> +
> +    return 0;
> +}
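The flags/gso_type decision tree above can be summarized as a small classifier; a sketch with the struct trimmed to the two fields inspected (constants are from the virtio spec; note the patch itself silently ignores VIRTIO_NET_HDR_GSO_UDP rather than rejecting it):

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
#define VIRTIO_NET_HDR_GSO_NONE     0
#define VIRTIO_NET_HDR_GSO_TCPV4    1
#define VIRTIO_NET_HDR_GSO_UDP      3
#define VIRTIO_NET_HDR_GSO_TCPV6    4

struct vnet_hdr {
    uint8_t flags;
    uint8_t gso_type;
};

/* Returns a request bitmask: bit 0 = L4 checksum needed, bit 1 = TCP
 * segmentation needed; -1 for a GSO type this sketch does not handle. */
static int
vnet_requests(const struct vnet_hdr *h)
{
    int req = 0;

    if (h->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
        req |= 1;
    }
    if (h->gso_type == VIRTIO_NET_HDR_GSO_TCPV4
        || h->gso_type == VIRTIO_NET_HDR_GSO_TCPV6) {
        req |= 2;
    } else if (h->gso_type != VIRTIO_NET_HDR_GSO_NONE
               && h->gso_type != VIRTIO_NET_HDR_GSO_UDP) {
        return -1;
    }
    return req;
}
```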
> +
> +static void
> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
> +{
> +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
> +
> +    if ((dp_packet_size(b) > mtu) && dp_packet_hwol_is_tso(b)) {
> +        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
> +                            + TCP_HEADER_LEN;
> +
> +        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
> +        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
> +        if (dp_packet_hwol_is_ipv4(b)) {
> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +        } else {
> +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +        }
> +
> +    } else {
> +        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
> +    }
> +
> +    if (dp_packet_hwol_l4_mask(b)) {
> +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
> +                                                  - (char *)dp_packet_eth(b));
> +
> +        if (dp_packet_hwol_l4_is_tcp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct tcp_header, tcp_csum);
> +        } else if (dp_packet_hwol_l4_is_udp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct udp_header, udp_csum);
> +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
> +            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> +                                    struct sctp_header, sctp_csum);
> +        } else {
> +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
> +        }
> +    }
> +}
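To make the arithmetic above concrete: for an untagged IPv4/TCP frame with no IP or TCP options, hdr_len spans Ethernet + IP + TCP, gso_size is whatever MTU remains after subtracting it, and csum_offset is the checksum field's offset within the L4 header. A sketch mirroring the patch's computation (the struct layout matches OVS's tcp_header; the no-options assumption is mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HEADER_LEN 14
#define IP_HEADER_LEN  20
#define TCP_HEADER_LEN 20

/* Field layout matching OVS's struct tcp_header. */
struct tcp_header {
    uint16_t tcp_src, tcp_dst;
    uint32_t tcp_seq, tcp_ack;
    uint16_t tcp_ctl, tcp_winsz;
    uint16_t tcp_csum, tcp_urg;
};

/* gso_size as the patch computes it: MTU minus everything up to and
 * including the TCP header (note the Ethernet header is subtracted from
 * the MTU as well). */
static int
vnet_gso_size(int mtu)
{
    int hdr_len = ETH_HEADER_LEN + IP_HEADER_LEN + TCP_HEADER_LEN;

    return mtu - hdr_len;
}

/* csum_offset for TCP, via offsetof as in the patch. */
static size_t
tcp_csum_offset(void)
{
    return offsetof(struct tcp_header, tcp_csum);
}
```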
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index f109c4e66..87c375b47 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -37,6 +37,12 @@ extern "C" {
>  struct netdev_tnl_build_header_params;
>  #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>  
> +enum netdev_ol_flags {
> +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
> +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
> +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
> +};
> +
>  /* A network device (e.g. an Ethernet device).
>   *
>   * Network device implementations may read these members but should not modify
> @@ -51,6 +57,10 @@ struct netdev {
>       * opening this device, and therefore got assigned to the "system" class */
>      bool auto_classified;
>  
> +    /* This bitmask of the offloading features enabled/supported by the
> +     * supported by the netdev. */

So, enabled or supported?  Please, choose one.

> +    uint64_t ol_flags;
> +
>      /* If this is 'true', the user explicitly specified an MTU for this
>       * netdev.  Otherwise, Open vSwitch is allowed to override it. */
>      bool mtu_user_config;
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 405c98c68..998525875 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
>              : NETDEV_PT_LEGACY_L2);
>  }
>  
> +/* Check if a 'packet' is compatible with 'netdev_flags'.
> + * If a packet is incompatible, return 'false' with the 'errormsg'
> + * pointing to a reason. */
> +static bool
> +netdev_send_prepare_packet(const uint64_t netdev_flags,
> +                           struct dp_packet *packet, char **errormsg)

Passing char ** like this is not OVS style; it's better to use VLOG_ERR_BUF instead.

> +{
> +    if (dp_packet_hwol_is_tso(packet)
> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
> +            /* Fall back to GSO in software. */
> +            *errormsg = "No TSO support";
> +            return false;
> +    }
> +
> +    if (dp_packet_hwol_l4_mask(packet)
> +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
> +            /* Fall back to L4 csum in software. */
> +            *errormsg = "No L4 checksum support";
> +            return false;
> +    }
> +
> +    return true;
> +}
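The compatibility rule here is simply: a packet is sendable only if every offload it was marked for is present in the netdev's advertised feature set. A self-contained sketch (the pkt_marks struct stands in for the dp_packet_hwol_*() lookups; flag values as in the enum above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
    NETDEV_TX_OFFLOAD_TCP_TSO   = 1 << 2,
};

/* Packet-side marks, standing in for dp_packet_hwol_*() lookups. */
struct pkt_marks {
    bool tso;
    bool l4_csum;
};

/* Mirrors netdev_send_prepare_packet(): reject the packet if it requests
 * an offload the device does not advertise. */
static bool
compatible(uint64_t netdev_flags, const struct pkt_marks *m)
{
    if (m->tso && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
        return false;
    }
    if (m->l4_csum && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
        return false;
    }
    return true;
}
```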
> +
> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
> + * otherwise either fall back to software implementation or drop it. */
> +static void
> +netdev_send_prepare_batch(const struct netdev *netdev,
> +                          struct dp_packet_batch *batch)
> +{
> +    struct dp_packet *packet;
> +    size_t i, size = dp_packet_batch_size(batch);
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        char *errormsg = NULL;
> +
> +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> +            dp_packet_batch_refill(batch, packet, i);
> +        } else {
> +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> +                         errormsg ? errormsg : "Unsupported feature",
> +                         netdev_get_name(netdev));

Hmm, it seems that the packet will be leaked here.

Also, you're dropping the packet without accounting for it in any way.  We merged a few patches about counting dropped packets recently, so we should now add new counters for each dropping case in order to keep things consistent.

> +        }
> +    }
> +}
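For clarity, the REFILL loop used here filters the batch in place: kept elements are written back from the front and the batch shrinks, with no second array. A simplified sketch over plain ints (the real code must additionally free each dropped packet and count the drop, which is what the comment above flags as missing):

```c
#include <assert.h>
#include <stddef.h>

/* In-place filtering in the style of DP_PACKET_BATCH_REFILL_FOR_EACH. */
static size_t
refill_filter(int *batch, size_t n, int (*keep)(int))
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (keep(batch[i])) {
            batch[out++] = batch[i];   /* "Refill" the slot. */
        }
        /* else: free the element and bump a drop counter here. */
    }
    return out;
}

static int
is_even(int v)
{
    return (v & 1) == 0;
}
```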
> +
>  /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
>   * otherwise a positive errno value.  Returns EAGAIN without blocking if
>   * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> @@ -811,8 +857,10 @@ int
>  netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
>              bool concurrent_txq)
>  {
> -    int error = netdev->netdev_class->send(netdev, qid, batch,
> -                                           concurrent_txq);
> +    int error;
> +
> +    netdev_send_prepare_batch(netdev, batch);

send() doesn't expect empty batches.  You need to check before calling it.

> +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
>      if (!error) {
>          COVERAGE_INC(netdev_sent);
>      }
> @@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev,
>                     const struct ovs_action_push_tnl *data)
>  {
>      struct dp_packet *packet;
> -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> -        netdev->netdev_class->push_header(netdev, packet, data);
> -        pkt_metadata_init(&packet->md, data->out_port);
> +    size_t i, size = dp_packet_batch_size(batch);
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        if (!dp_packet_hwol_is_tso(packet)) {
> +            netdev->netdev_class->push_header(netdev, packet, data);
> +            pkt_metadata_init(&packet->md, data->out_port);
> +            dp_packet_batch_refill(batch, packet, i);
> +        } else {
> +            VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
> +                         "packet dropped", netdev_get_name(netdev));

Packet leaked.  Drop not counted.

> +        }
>      }
>  
>      return 0;
> diff --git a/lib/tso.c b/lib/tso.c
> new file mode 100644
> index 000000000..9dc15e146
> --- /dev/null
> +++ b/lib/tso.c
> @@ -0,0 +1,54 @@
> +/*
> + * Copyright (c) 2020 Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "smap.h"
> +#include "ovs-thread.h"
> +#include "openvswitch/vlog.h"
> +#include "dpdk.h"
> +#include "tso.h"
> +#include "vswitch-idl.h"
> +
> +VLOG_DEFINE_THIS_MODULE(tso);

userspace_tso

> +
> +static bool tso_support_enabled = false;
> +
> +void
> +tso_init(const struct smap *ovs_other_config)

userspace_tso_init

> +{
> +    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
> +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +
> +        if (ovsthread_once_start(&once)) {
> +            if (dpdk_available()) {
> +                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
> +                tso_support_enabled = true;
> +            } else {
> +                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
> +                         "without enabling DPDK");

This is an artificial restriction, but we may remove it later.

> +                tso_support_enabled = false;
> +            }
> +            ovsthread_once_done(&once);
> +        }
> +    }
> +}
> +
> +bool
> +tso_enabled(void)

userspace_tso_enabled()

> +{
> +    return tso_support_enabled;
> +}
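The ovsthread_once guard above ensures the enable/disable decision and its logging happen exactly once per process, while every later call just reads the cached flag. A hedged, self-contained sketch of the same latch pattern (using pthread_once as a stand-in for OVSTHREAD_ONCE_INITIALIZER, and folding the decision into the getter purely for illustration; the patch decides in tso_init() instead):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;
static bool tso_support_enabled = false;
static int init_runs = 0;

/* Runs exactly once per process, no matter how many callers race here. */
static void
do_init(void)
{
    init_runs++;
    tso_support_enabled = true;   /* Stand-in for the dpdk_available() check. */
}

static bool
tso_enabled_sketch(void)
{
    pthread_once(&once, do_init);
    return tso_support_enabled;
}
```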
> diff --git a/lib/tso.h b/lib/tso.h
> new file mode 100644
> index 000000000..6594496ac
> --- /dev/null
> +++ b/lib/tso.h
> @@ -0,0 +1,23 @@
> +/*
> + * Copyright (c) 2020 Red Hat Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef TSO_H
> +#define TSO_H 1
> +
> +void tso_init(const struct smap *ovs_other_config);
> +bool tso_enabled(void);
> +
> +#endif /* tso.h */
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 86c7b10a9..6d73922f6 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -65,6 +65,7 @@
>  #include "system-stats.h"
>  #include "timeval.h"
>  #include "tnl-ports.h"
> +#include "tso.h"
>  #include "util.h"
>  #include "unixctl.h"
>  #include "lib/vswitch-idl.h"
> @@ -3285,6 +3286,7 @@ bridge_run(void)
>      if (cfg) {
>          netdev_set_flow_api_enabled(&cfg->other_config);
>          dpdk_init(&cfg->other_config);
> +        tso_init(&cfg->other_config);
>      }
>  
>      /* Initialize the ofproto library.  This only needs to run once, but
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0ec726c39..354dcabfa 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -690,6 +690,18 @@
>           once in few hours or a day or a week.
>          </p>
>        </column>
> +      <column name="other_config" key="tso-support"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Set this value to <code>true</code> to enable support for TSO (TCP
> +          Segmentation Offloading). When TSO is enabled, vhost-user client

Why are we talking only about vhost-user interfaces here?  Physical DPDK ports
and netdev-linux will be able to transmit such packets too.  Sounds strange.

> +          interfaces can transmit packets up to 64KB.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>. Changing this value requires
> +          restarting the daemon.
> +        </p>
> +      </column>
>      </group>
>      <group title="Status">
>        <column name="next_cfg">
>
Flavio Leitner Jan. 14, 2020, 10:17 p.m. UTC | #4
On Tue, Jan 14, 2020 at 06:48:14PM +0100, Ilya Maximets wrote:
> On 09.01.2020 15:44, Flavio Leitner wrote:
> > Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> > the network stack to delegate the TCP segmentation to the NIC reducing
> > the per packet CPU overhead.
> > 
> > A guest using vhostuser interface with TSO enabled can send TCP packets
> > much bigger than the MTU, which saves CPU cycles normally used to break
> > the packets down to MTU size and to calculate checksums.
> > 
> > It also saves CPU cycles used to parse multiple packets/headers during
> > the packet processing inside virtual switch.
> > 
> > If the destination of the packet is another guest in the same host, then
> > the same big packet can be sent through a vhostuser interface skipping
> > the segmentation completely. However, if the destination is not local,
> > the NIC hardware is instructed to do the TCP segmentation and checksum
> > calculation.
> > 
> > It is recommended to check if NIC hardware supports TSO before enabling
> > the feature, which is off by default. For additional information please
> > check the tso.rst document.
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > ---
> 
> It seems this patch needs a rebase due to recvmmsg related changes in
> netdev-linux.

Yeah, Ian requested that too. I am working on it.

 
> I didn't check the sizes and offsets inside the code and didn't look
> close to the features enabling on devices. Some comments inline.

OK, thanks for this review anyways.

> >  Documentation/automake.mk           |   1 +
> >  Documentation/topics/dpdk/index.rst |   1 +
> >  Documentation/topics/dpdk/tso.rst   |  96 +++++++++
> >  NEWS                                |   1 +
> >  lib/automake.mk                     |   2 +
> >  lib/conntrack.c                     |  29 ++-
> >  lib/dp-packet.h                     | 152 +++++++++++++-
> >  lib/ipf.c                           |  32 +--
> >  lib/netdev-dpdk.c                   | 312 ++++++++++++++++++++++++----
> >  lib/netdev-linux-private.h          |   4 +
> >  lib/netdev-linux.c                  | 296 +++++++++++++++++++++++---
> >  lib/netdev-provider.h               |  10 +
> >  lib/netdev.c                        |  66 +++++-
> >  lib/tso.c                           |  54 +++++
> >  lib/tso.h                           |  23 ++
> >  vswitchd/bridge.c                   |   2 +
> >  vswitchd/vswitch.xml                |  12 ++
> >  17 files changed, 1002 insertions(+), 91 deletions(-)
> >  create mode 100644 Documentation/topics/dpdk/tso.rst
> >  create mode 100644 lib/tso.c
> >  create mode 100644 lib/tso.h
> > 
> > Changelog:
> > - v3
> >  * Improved the documentation.
> >  * Updated copyright year to 2020.
> >  * TSO offloaded msg now includes the netdev's name.
> >  * Added period at the end of all code comments.
> >  * Warn and drop encapsulation of TSO packets.
> >  * Fixed travis issue with restricted virtio types.
> >  * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
> >    which caused packet corruption.
> >  * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
> >    PKT_TX_IP_CKSUM only for IPv4 packets.
> > 
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index f2ca17bad..284327edd 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -35,6 +35,7 @@ DOC_SOURCE = \
> >  	Documentation/topics/dpdk/index.rst \
> >  	Documentation/topics/dpdk/bridge.rst \
> >  	Documentation/topics/dpdk/jumbo-frames.rst \
> > +	Documentation/topics/dpdk/tso.rst \
> >  	Documentation/topics/dpdk/memory.rst \
> >  	Documentation/topics/dpdk/pdump.rst \
> >  	Documentation/topics/dpdk/phy.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> > index f2862ea70..400d56051 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,4 +40,5 @@ DPDK Support
> >     /topics/dpdk/qos
> >     /topics/dpdk/pdump
> >     /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/tso
> >     /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
> > new file mode 100644
> > index 000000000..189c86480
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/tso.rst
> > @@ -0,0 +1,96 @@
> > +..
> > +      Copyright 2020, Red Hat, Inc.
> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +========================
> > +Userspace Datapath - TSO
> > +========================
> > +
> > +**Note:** This feature is considered experimental.
> > +
> > +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> > +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> > +segmentation achieves computational savings in the core, freeing up CPU cycles
> > +for more useful work.
> > +
> > +A common use case for TSO is when using virtualization, where traffic that's
> > +coming in from a VM can offload the TCP segmentation, thus avoiding the
> > +fragmentation in software. Additionally, if the traffic is headed to a VM
> > +within the same host further optimization can be expected. As the traffic never
> > +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> > +and checksum calculations are required, which saves yet more cycles. Only when
> > +the traffic actually leaves the host the segmentation needs to happen, in which
> > +case it will be performed by the egress NIC. Consult your controller's
> > +datasheet for compatibility. Secondly, the NIC must have an associated DPDK
> > +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per PMD,
> > +refer to the `DPDK documentation`__.
> > +
> > +__ https://doc.dpdk.org/guides/nics/overview.html
> 
> This should point to 19.11 version of a guide instead of latest one:
> https://doc.dpdk.org/guides-19.11/nics/overview.html

Sounds fair, ok.

> 
> > +
> > +Enabling TSO
> > +~~~~~~~~~~~~
> > +
> > +The TSO support may be enabled via a global config value ``tso-support``.
> > +Setting this to ``true`` enables TSO support for all ports.
> > +
> > +    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
> 
> 
> I'd suggest to rename this to 'userspace-tso-support' to avoid misunderstanding.

ok, I will rename.


> > +
> > +The default value is ``false``.
> > +
> > +Changing ``tso-support`` requires restarting the daemon.
> > +
> > +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
> > +
> > +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> > +connection is established, `TSO` is thus advertised to the guest as an
> > +available feature:
> > +
> > +QEMU Command Line Parameter::
> > +
> > +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> > +    ...
> > +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> > +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> > +    ...
> > +
> > +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> > +used to enable same::
> > +
> > +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> > +    $ ethtool -K eth0 tso on
> > +    $ ethtool -k eth0
> > +
> > +~~~~~~~~~~~
> > +Limitations
> > +~~~~~~~~~~~
> > +
> > +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> > +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP,
> > +etc.]).
> > +
> > +There is no software implementation of TSO, so all ports attached to the
> > +datapath must support TSO or packets using that feature will be dropped
> > +on ports without TSO support.  That also means guests using vhost-user
> > +in client mode will receive TSO packet regardless of TSO being enabled
> > +or disabled within the guest.
> > diff --git a/NEWS b/NEWS
> > index 965facaf8..306c0493d 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -26,6 +26,7 @@ Post-v2.12.0
> >       * DPDK ring ports (dpdkr) are deprecated and will be removed in next
> >         releases.
> >       * Add support for DPDK 19.11.
> > +     * Add experimental support for TSO.
> >     - RSTP:
> >       * The rstp_statistics column in Port table will only be updated every
> >         stats-update-interval configured in Open_vSwtich table.
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index ebf714501..94a1b4459 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
> >  	lib/tnl-neigh-cache.h \
> >  	lib/tnl-ports.c \
> >  	lib/tnl-ports.h \
> > +	lib/tso.c \
> > +	lib/tso.h \
> 
> s/tso/userspace-tso/

Yup

> 
> >  	lib/netdev-native-tnl.c \
> >  	lib/netdev-native-tnl.h \
> >  	lib/token-bucket.c \
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index b80080e72..679054b98 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >          if (hwol_bad_l3_csum) {
> >              ok = false;
> >          } else {
> > -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> > +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> > +                                     || dp_packet_hwol_tx_ip_checksum(pkt);
> >              /* Validate the checksum only when hwol is not supported. */
> >              ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
> >                                   !hwol_good_l3_csum);
> > @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >      if (ok) {
> >          bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
> >          if (!hwol_bad_l4_csum) {
> > -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> > +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> > +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
> >              /* Validate the checksum only when hwol is not supported. */
> >              if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
> >                             &ctx->icmp_related, l3, !hwol_good_l4_csum,
> > @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >                  }
> >                  if (seq_skew) {
> >                      ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> > -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > -                                          l3_hdr->ip_tot_len, htons(ip_len));
> > +                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> > +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > +                                                        l3_hdr->ip_tot_len,
> > +                                                        htons(ip_len));
> > +                    }
> >                      l3_hdr->ip_tot_len = htons(ip_len);
> >                  }
> >              }
> > @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >      }
> >  
> >      th->tcp_csum = 0;
> > -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > -                           dp_packet_l4_size(pkt));
> > -    } else {
> > -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > -        th->tcp_csum = csum_finish(
> > -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> > +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > +                               dp_packet_l4_size(pkt));
> > +        } else {
> > +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > +            th->tcp_csum = csum_finish(
> > +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +        }
> >      }
> >  
> >      if (seq_skew) {
> > diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > index 133942155..d10a0416e 100644
> > --- a/lib/dp-packet.h
> > +++ b/lib/dp-packet.h
> > @@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
> >  static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
> >  static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
> >  
> > +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
> 
> No such function.

Good catch, it got moved to netdev_linux_prepend_vnet_hdr().

> > +
> >  void *dp_packet_resize_l2(struct dp_packet *, int increment);
> >  void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
> >  static inline void *dp_packet_eth(const struct dp_packet *);
> > @@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p)
> >  {
> >      /* This initialization is needed for packets that do not come from DPDK
> >       * interfaces, when vswitchd is built with --with-dpdk. */
> > -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> > +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> >      p->mbuf.nb_segs = 1;
> >      p->mbuf.next = NULL;
> >  }
> > @@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> >      b->mbuf.buf_len = s;
> >  }
> >  
> > +static inline bool
> > +dp_packet_hwol_is_tso(const struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
> > +           ? true
> > +           : false;
> 
> Usual way for converting to bool is to use '!!'.  This will save some space.

Ok will change that and the other similar cases.

> > +static inline void
> > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
> 
> '{' should be on the next line.  Same for all the functions below.
> 
> And some comments to these functions would be nice.  At least a single
> comment for a group of functions.

Ok.

> Some comments for below functions too.

Ok.
> 
> > +static inline bool
> > +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
> > +{
> > +
> > +    return dp_packet_hwol_l4_mask(p) ? true : false;
> 
> '!!'?
> Also, it seems strange to check l4 offloading mask to check if
> we have ip checksum.  Shouldn't we check for PKT_TX_IPV4/6
> instead?  This might not work for pure IP packets (without L4).

I remember there were cases where the flag was not set. I don't have
my notes handy now, but I will double check later.

[...]
> > @@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> >                           struct rte_mbuf **pkts, int cnt)
> >  {
> >      uint32_t nb_tx = 0;
> > +    uint16_t nb_tx_prep = cnt;
> > +
> > +    if (tso_enabled()) {
> > +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> 
> Packets dropped and not counted.


It returns cnt (total) - nb_tx (transmitted), which the caller will
account as tx failures.
 

> 
> > +        if (nb_tx_prep != cnt) {
> > +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> > +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> > +                         cnt, rte_strerror(rte_errno));
> > +        }
> > +    }
> >  
> > -    while (nb_tx != cnt) {
> > +    while (nb_tx != nb_tx_prep) {
> >          uint32_t ret;
> >  
> > -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> > +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> > +                               nb_tx_prep - nb_tx);
> >          if (!ret) {
> >              break;
> >          }
> > @@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> >      int cnt = 0;
> >      struct rte_mbuf *pkt;
> >  
> > +    /* Filter oversized packets, unless are marked for TSO. */
> >      for (i = 0; i < pkt_cnt; i++) {
> >          pkt = pkts[i];
> > -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> > -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> > +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> > +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> > +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> > +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> > +                         dev->max_packet_len);
> >              rte_pktmbuf_free(pkt);
> >              continue;
> >          }
> > @@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> >      struct netdev_dpdk_sw_stats sw_stats_add;
> >      unsigned int n_packets_to_free = cnt;
> > -    unsigned int total_packets = cnt;
> > +    unsigned int total_packets;
> >      int i, retries = 0;
> >      int max_retries = VHOST_ENQ_RETRY_MIN;
> >      int vid = netdev_dpdk_get_vid(dev);
> > @@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >          rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> >      }
> >  
> > -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> > +    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> 
> Have you checked the performance
> impact for non-TSO setup?

I did a light test and noticed no difference. It only checks for two flags
in the same field for the non-TSO case.
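That per-packet check boils down to a single bitmask test, roughly as below (the flag values are illustrative stand-ins, not the real PKT_TX_* constants from DPDK's rte_mbuf.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real PKT_TX_TCP_SEG / PKT_TX_L4_MASK
 * constants live in DPDK's rte_mbuf.h and differ from these. */
#define FAKE_PKT_TX_TCP_SEG (UINT64_C(1) << 50)
#define FAKE_PKT_TX_L4_MASK (UINT64_C(3) << 52)

/* Mirrors dp_packet_hwol_is_tso(): a single AND against ol_flags, so a
 * non-TSO packet costs one test-and-branch in the prepare step. */
static bool
needs_hwol_prepare(uint64_t ol_flags)
{
    return (ol_flags & (FAKE_PKT_TX_TCP_SEG | FAKE_PKT_TX_L4_MASK)) != 0;
}
```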


> 
> > +    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
> >      sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
> >  
> >      /* Check has QoS has been configured for the netdev */
> > @@ -2511,6 +2613,121 @@ out:
> >      }
> >  }
> >  
> > +static void
> > +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> > +{
> > +    rte_free(opaque);
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> > +{
> > +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> > +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> > +    uint16_t buf_len;
> > +    void *buf;
> > +
> > +    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> > +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> > +    } else {
> > +        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> > +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> > +    }
> > +
> > +    if (unlikely(total_len > UINT16_MAX)) {
> > +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> > +        return NULL;
> > +    }
> > +
> > +    buf_len = total_len;
> > +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> > +    if (unlikely(buf == NULL)) {
> > +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> > +        return NULL;
> > +    }
> > +
> > +    /* Initialize shinfo. */
> > +    if (shinfo) {
> > +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> > +        shinfo->fcb_opaque = buf;
> > +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +    } else {
> > +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> > +                                                    netdev_dpdk_extbuf_free,
> > +                                                    buf);
> > +        if (unlikely(shinfo == NULL)) {
> > +            rte_free(buf);
> > +            VLOG_ERR("Failed to initialize shared info for mbuf while "
> > +                     "attempting to attach an external buffer.");
> > +            return NULL;
> > +        }
> > +    }
> > +
> > +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> > +                              shinfo);
> > +    rte_pktmbuf_reset_headroom(pkt);
> > +
> > +    return pkt;
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> > +{
> > +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> > +
> > +    if (OVS_UNLIKELY(!pkt)) {
> > +        return NULL;
> > +    }
> > +
> > +    dp_packet_init_specific((struct dp_packet *)pkt);
> > +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> > +        return pkt;
> > +    }
> > +
> > +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> > +        return pkt;
> > +    }
> > +
> > +    rte_pktmbuf_free(pkt);
> > +
> > +    return NULL;
> > +}
> > +
> > +static struct dp_packet *
> > +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> > +{
> > +    struct rte_mbuf *mbuf_dest;
> > +    struct dp_packet *pkt_dest;
> > +    uint32_t pkt_len;
> > +
> > +    pkt_len = dp_packet_size(pkt_orig);
> > +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> > +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> > +            return NULL;
> > +    }
> > +
> > +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> > +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> > +    dp_packet_set_size(pkt_dest, pkt_len);
> > +
> > +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> > +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> > +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> > +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> > +
> > +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> > +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> > +
> > +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> > +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> > +                                - (char *)dp_packet_eth(pkt_dest);
> > +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> > +                                - (char *) dp_packet_l3(pkt_dest);
> > +    }
> > +
> > +    return pkt_dest;
> > +}
> > +
> >  /* Tx function. Transmit packets indefinitely */
> >  static void
> >  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> > @@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >      enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> >  #endif
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> > +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
> >      struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> >      uint32_t cnt = batch_cnt;
> >      uint32_t dropped = 0;
> > @@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >          struct dp_packet *packet = batch->packets[i];
> >          uint32_t size = dp_packet_size(packet);
> >  
> > -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> > -                         size, dev->max_packet_len);
> > -
> > +        if (size > dev->max_packet_len
> > +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> > +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> > +                         dev->max_packet_len);
> >              mtu_drops++;
> >              continue;
> >          }
> >  
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> > +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
> >          if (OVS_UNLIKELY(!pkts[txcnt])) {
> >              dropped = cnt - i;
> >              break;
> >          }
> >  
> > -        /* We have to do a copy for now */
> > -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> > -               dp_packet_data(packet), size);
> > -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> > -
> >          txcnt++;
> >      }
> >  
> >      if (OVS_LIKELY(txcnt)) {
> >          if (dev->type == DPDK_DEV_VHOST) {
> > -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> > -                                     txcnt);
> > +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> >          } else {
> > -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> > +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> > +                                                   (struct rte_mbuf **)pkts,
> > +                                                   txcnt);
> >          }
> >      }
> >  
> > @@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> >          int batch_cnt = dp_packet_batch_size(batch);
> >          struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> >  
> > +        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
> 
> Packets dropped and not counted.

Right.


> Also, this function called unconditionally. Perfomance impact on non-TSO case?

It's the same as above.


> >          tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> >          mtu_drops = batch_cnt - tx_cnt;
> >          qos_drops = tx_cnt;
[...]
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > index f08159aa7..102548db7 100644
> > --- a/lib/netdev-linux-private.h
> > +++ b/lib/netdev-linux-private.h
> > @@ -37,10 +37,14 @@
> >  
> >  struct netdev;
> >  
> > +#define LINUX_RXQ_TSO_MAX_LEN 65536
> > +
> >  struct netdev_rxq_linux {
> >      struct netdev_rxq up;
> >      bool is_tap;
> >      int fd;
> > +    char *bufaux;          /* Extra buffer to recv TSO pkt. */
> > +    int bufaux_len;        /* Extra buffer length. */
> 
> Length never changes.  Why we need 'bufaux_len' ?

I have no strong opinion, so either way is fine by me.

[...]
> > @@ -1024,6 +1040,13 @@ static struct netdev_rxq *
> >  netdev_linux_rxq_alloc(void)
> >  {
> >      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > +    if (tso_enabled()) {
> > +        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > +        if (rx->bufaux) {
> 
> xmalloc can not fail.

Old habits, will fix that.
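For context, OVS's xmalloc() (lib/util.c) aborts on allocation failure rather than returning NULL, so the check is dead code. A simplified sketch of that contract:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified illustration of OVS's xmalloc() contract: it never returns
 * NULL, because allocation failure aborts the process.  (The real
 * implementation is in lib/util.c; this is not a copy of it.) */
static void *
xmalloc_sketch(size_t size)
{
    void *p = malloc(size ? size : 1);  /* malloc(0) may return NULL */

    if (!p) {
        fprintf(stderr, "virtual memory exhausted\n");
        abort();
    }
    return p;
}
```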


> > +            rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
> > +        }
> > +    }
> > +
> >      return &rx->up;
> >  }
> >  
> > @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> >              goto error;
> >          }
> >  
> > +        if (tso_enabled()) {
> > +            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                               sizeof val);
> > +            if (error) {
> 
> You're not using the 'error'.  Make it just "if (setsockopt()) {".

I agree.
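A hedged sketch of the agreed-upon shape (the helper name is made up; only the call pattern matters):

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the agreed style: test setsockopt() directly and read errno
 * only on failure, instead of saving a return value that is not a real
 * error code. */
static int
enable_int_sockopt(int fd, int level, int optname)
{
    int val = 1;

    if (setsockopt(fd, level, optname, &val, sizeof val)) {
        return errno;       /* caller logs with ovs_strerror() and bails */
    }
    return 0;
}
```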

> 
> > +                error = errno;
> > +                VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s",
> > +                         netdev_get_name(netdev_), ovs_strerror(errno));
> > +                goto error;
> > +            }
> > +        }
> > +
> >          /* Set non-blocking mode. */
> >          error = set_nonblocking(rx->fd);
> >          if (error) {
> > @@ -6173,6 +6275,19 @@ af_packet_sock(void)
> >                  close(sock);
> >                  sock = -error;
> >              }
> > +
> > +            if (tso_enabled()) {
> > +                int val = 1;
> > +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                                   sizeof val);
> 
> socket might be already closed here and there will be double close.

Good catch, thanks.
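One way to make the cleanup path immune to this class of bug, sketched here as an illustration rather than the actual fix:

```c
#include <assert.h>
#include <unistd.h>

/* Illustrative double-close guard: close through a pointer and
 * invalidate the descriptor, so a later error path cannot close the
 * same fd a second time. */
static void
close_once(int *fd)
{
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
    }
}
```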

[...]
> > @@ -51,6 +57,10 @@ struct netdev {
> >       * opening this device, and therefore got assigned to the "system" class */
> >      bool auto_classified;
> >  
> > +    /* This bitmask of the offloading features enabled/supported by the
> > +     * supported by the netdev. */
> 
> So, enabled or supported?  Please, choose one.

I see, will fix that.


> 
> > +    uint64_t ol_flags;
> > +
> >      /* If this is 'true', the user explicitly specified an MTU for this
> >       * netdev.  Otherwise, Open vSwitch is allowed to override it. */
> >      bool mtu_user_config;
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 405c98c68..998525875 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
> >              : NETDEV_PT_LEGACY_L2);
> >  }
> >  
> > +/* Check if a 'packet' is compatible with 'netdev_flags'.
> > + * If a packet is incompatible, return 'false' with the 'errormsg'
> > + * pointing to a reason. */
> > +static bool
> > +netdev_send_prepare_packet(const uint64_t netdev_flags,
> > +                           struct dp_packet *packet, char **errormsg)
> 
> It's better to use VLOG_ERR_BUF instead of passing char **.
> Not an OVS style.

Sounds good to me.


> > +netdev_send_prepare_batch(const struct netdev *netdev,
> > +                          struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet *packet;
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        char *errormsg = NULL;
> > +
> > +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> > +                         errormsg ? errormsg : "Unsupported feature",
> > +                         netdev_get_name(netdev));
> 
> Hmm, it seems that packet will be leaked here.
> 
> Also, you're dropping packet without accounting it anyhow.  We merged few
> patches about counting dropped packets recently, so, now we should add new
> counters for each dropping case in order to keep things consistent.

Right, will fix that too.
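A sketch of the intended fix (stand-in types; the real code would free with dp_packet_delete() and bump a per-netdev counter, and the counter name below is hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for struct dp_packet, just to show the lifecycle. */
struct fake_pkt {
    char *buf;
};

/* Drop an unsupported packet: free it and account the drop, rather
 * than silently leaking it as the reviewed hunk does. */
static void
drop_packet(struct fake_pkt *pkt, unsigned int *tx_invalid_drops)
{
    free(pkt->buf);
    pkt->buf = NULL;
    ++*tx_invalid_drops;
}
```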


> > +        }
> > +    }
> > +}
> > +
> >  /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
> >   * otherwise a positive errno value.  Returns EAGAIN without blocking if
> >   * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> > @@ -811,8 +857,10 @@ int
> >  netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
> >              bool concurrent_txq)
> >  {
> > -    int error = netdev->netdev_class->send(netdev, qid, batch,
> > -                                           concurrent_txq);
> > +    int error;
> > +
> > +    netdev_send_prepare_batch(netdev, batch);
> 
> send() doesn't expect empty batches.  You need to check before calling it.

OK.
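The prepare-then-guard pattern could look roughly like this (names and flag handling illustrative, not the actual OVS API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative prepare step: keeps only packets whose offload flags
 * are all supported by the netdev; the rest are freed and counted as
 * drops.  Returns how many packets remain in the batch. */
static size_t
prepare_batch(const uint64_t *ol_flags, size_t cnt, uint64_t supported)
{
    size_t kept = 0;

    for (size_t i = 0; i < cnt; i++) {
        if (!(ol_flags[i] & ~supported)) {
            kept++;            /* compatible packet stays in the batch */
        }                      /* else: freed and counted as a drop */
    }
    return kept;
}

/* Class send() implementations do not expect empty batches, so the
 * caller must check the remaining count before invoking send(). */
static bool
batch_should_send(size_t cnt_after_prepare)
{
    return cnt_after_prepare > 0;
}
```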


> > +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
> >      if (!error) {
> >          COVERAGE_INC(netdev_sent);
> >      }
> > @@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev,
> >                     const struct ovs_action_push_tnl *data)
> >  {
> >      struct dp_packet *packet;
> > -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > -        netdev->netdev_class->push_header(netdev, packet, data);
> > -        pkt_metadata_init(&packet->md, data->out_port);
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        if (!dp_packet_hwol_is_tso(packet)) {
> > +            netdev->netdev_class->push_header(netdev, packet, data);
> > +            pkt_metadata_init(&packet->md, data->out_port);
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
> > +                         "packet dropped", netdev_get_name(netdev));
> 
> Packet leaked.  Drop not counted.

Yeah, same as above.


> > +        }
> >      }
> >  
> >      return 0;
> > diff --git a/lib/tso.c b/lib/tso.c
> > new file mode 100644
> > index 000000000..9dc15e146
> > --- /dev/null
> > +++ b/lib/tso.c
> > @@ -0,0 +1,54 @@
> > +/*
> > + * Copyright (c) 2020 Red Hat, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#include <config.h>
> > +
> > +#include "smap.h"
> > +#include "ovs-thread.h"
> > +#include "openvswitch/vlog.h"
> > +#include "dpdk.h"
> > +#include "tso.h"
> > +#include "vswitch-idl.h"
> > +
> > +VLOG_DEFINE_THIS_MODULE(tso);
> 
> userspace_tso
> 
> > +
> > +static bool tso_support_enabled = false;
> > +
> > +void
> > +tso_init(const struct smap *ovs_other_config)
> 
> userspace_tso_init
> 
> > +{
> > +    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
> > +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> > +
> > +        if (ovsthread_once_start(&once)) {
> > +            if (dpdk_available()) {
> > +                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
> > +                tso_support_enabled = true;
> > +            } else {
> > +                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
> > +                         "without enabling DPDK");
> 
> This is an artificial restriction.  But we may remove it later.
> 
> > +                tso_support_enabled = false;
> > +            }
> > +            ovsthread_once_done(&once);
> > +        }
> > +    }
> > +}
> > +
> > +bool
> > +tso_enabled(void)
> 
> userspace_tso_enabled()
> 
> > +{
> > +    return tso_support_enabled;
> > +}
> > diff --git a/lib/tso.h b/lib/tso.h
> > new file mode 100644
> > index 000000000..6594496ac
> > --- /dev/null
> > +++ b/lib/tso.h
> > @@ -0,0 +1,23 @@
> > +/*
> > + * Copyright (c) 2020 Red Hat Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef TSO_H
> > +#define TSO_H 1
> > +
> > +void tso_init(const struct smap *ovs_other_config);
> > +bool tso_enabled(void);
> > +
> > +#endif /* tso.h */
> > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> > index 86c7b10a9..6d73922f6 100644
> > --- a/vswitchd/bridge.c
> > +++ b/vswitchd/bridge.c
> > @@ -65,6 +65,7 @@
> >  #include "system-stats.h"
> >  #include "timeval.h"
> >  #include "tnl-ports.h"
> > +#include "tso.h"
> >  #include "util.h"
> >  #include "unixctl.h"
> >  #include "lib/vswitch-idl.h"
> > @@ -3285,6 +3286,7 @@ bridge_run(void)
> >      if (cfg) {
> >          netdev_set_flow_api_enabled(&cfg->other_config);
> >          dpdk_init(&cfg->other_config);
> > +        tso_init(&cfg->other_config);
> >      }
> >  
> >      /* Initialize the ofproto library.  This only needs to run once, but
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 0ec726c39..354dcabfa 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -690,6 +690,18 @@
> >           once in few hours or a day or a week.
> >          </p>
> >        </column>
> > +      <column name="other_config" key="tso-support"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          Set this value to <code>true</code> to enable support for TSO (TCP
> > +          Segmentation Offloading). When TSO is enabled, vhost-user client
> 
> Why we're talking about vhost-user interfaces here?  Physical DPDK ports and
> netdev-linux will be able too.  Sounds strange.

Yeah, that comment can be improved now.

Thanks Ilya!

Patch
diff mbox series

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index f2ca17bad..284327edd 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -35,6 +35,7 @@  DOC_SOURCE = \
 	Documentation/topics/dpdk/index.rst \
 	Documentation/topics/dpdk/bridge.rst \
 	Documentation/topics/dpdk/jumbo-frames.rst \
+	Documentation/topics/dpdk/tso.rst \
 	Documentation/topics/dpdk/memory.rst \
 	Documentation/topics/dpdk/pdump.rst \
 	Documentation/topics/dpdk/phy.rst \
diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
index f2862ea70..400d56051 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,4 +40,5 @@  DPDK Support
    /topics/dpdk/qos
    /topics/dpdk/pdump
    /topics/dpdk/jumbo-frames
+   /topics/dpdk/tso
    /topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
new file mode 100644
index 000000000..189c86480
--- /dev/null
+++ b/Documentation/topics/dpdk/tso.rst
@@ -0,0 +1,96 @@ 
+..
+      Copyright 2020, Red Hat, Inc.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+========================
+Userspace Datapath - TSO
+========================
+
+**Note:** This feature is considered experimental.
+
+TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
+of an oversized TCP segment to the underlying physical NIC. Offload of frame
+segmentation achieves computational savings in the core, freeing up CPU cycles
+for more useful work.
+
+A common use case for TSO is virtualization, where traffic coming in from a
+VM can have its TCP segmentation offloaded, avoiding segmentation in
+software. Additionally, if the traffic is headed to a VM within the same
+host, further optimization can be expected. As the traffic never leaves the
+machine, no MTU needs to be accounted for, so no segmentation or checksum
+calculation is required, which saves yet more cycles. Only when the traffic
+actually leaves the host does segmentation need to happen, and it is then
+performed by the egress NIC. TSO in hardware requires a physical NIC that
+supports the feature (consult your controller's datasheet) plus an
+associated DPDK Poll Mode Driver (PMD) that supports `TSO`. For a list of
+features per PMD, refer to the `DPDK documentation`__.
+
+__ https://doc.dpdk.org/guides/nics/overview.html
+
+Enabling TSO
+~~~~~~~~~~~~
+
+TSO support may be enabled via the global config value ``tso-support``.
+Setting this to ``true`` enables TSO support for all ports::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
+
+The default value is ``false``.
+
+Changing ``tso-support`` requires restarting the daemon.
+
+When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
+
+`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
+connection is established, `TSO` is thus advertised to the guest as an
+available feature:
+
+1. QEMU Command Line Parameter::
+
+    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
+    ...
+    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
+    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
+    ...
+
+2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
+used to enable it::
+
+    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
+    $ ethtool -K eth0 tso on
+    $ ethtool -k eth0
+
+
+Limitations
+~~~~~~~~~~~
+
+The current OvS userspace `TSO` implementation supports flat and VLAN networks
+only (i.e. no support for `TSO` over tunneled connections [VxLAN, GRE, IPinIP,
+etc.]).
+
+There is no software implementation of TSO, so all ports attached to the
+datapath must support TSO or packets using that feature will be dropped
+on ports without TSO support.  That also means guests using vhost-user
+in client mode will receive TSO packets regardless of TSO being enabled
+or disabled within the guest.
diff --git a/NEWS b/NEWS
index 965facaf8..306c0493d 100644
--- a/NEWS
+++ b/NEWS
@@ -26,6 +26,7 @@  Post-v2.12.0
      * DPDK ring ports (dpdkr) are deprecated and will be removed in next
        releases.
      * Add support for DPDK 19.11.
+     * Add experimental support for TSO.
    - RSTP:
      * The rstp_statistics column in Port table will only be updated every
        stats-update-interval configured in Open_vSwtich table.
diff --git a/lib/automake.mk b/lib/automake.mk
index ebf714501..94a1b4459 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -304,6 +304,8 @@  lib_libopenvswitch_la_SOURCES = \
 	lib/tnl-neigh-cache.h \
 	lib/tnl-ports.c \
 	lib/tnl-ports.h \
+	lib/tso.c \
+	lib/tso.h \
 	lib/netdev-native-tnl.c \
 	lib/netdev-native-tnl.h \
 	lib/token-bucket.c \
diff --git a/lib/conntrack.c b/lib/conntrack.c
index b80080e72..679054b98 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -2022,7 +2022,8 @@  conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
         if (hwol_bad_l3_csum) {
             ok = false;
         } else {
-            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
+            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
+                                     || dp_packet_hwol_tx_ip_checksum(pkt);
             /* Validate the checksum only when hwol is not supported. */
             ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
                                  !hwol_good_l3_csum);
@@ -2036,7 +2037,8 @@  conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
     if (ok) {
         bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
         if (!hwol_bad_l4_csum) {
-            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
+            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
+                                      || dp_packet_hwol_tx_l4_checksum(pkt);
             /* Validate the checksum only when hwol is not supported. */
             if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
                            &ctx->icmp_related, l3, !hwol_good_l4_csum,
@@ -3237,8 +3239,11 @@  handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
                 }
                 if (seq_skew) {
                     ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
-                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
-                                          l3_hdr->ip_tot_len, htons(ip_len));
+                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
+                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
+                                                        l3_hdr->ip_tot_len,
+                                                        htons(ip_len));
+                    }
                     l3_hdr->ip_tot_len = htons(ip_len);
                 }
             }
@@ -3256,13 +3261,15 @@  handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
     }
 
     th->tcp_csum = 0;
-    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
-        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
-                           dp_packet_l4_size(pkt));
-    } else {
-        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
-        th->tcp_csum = csum_finish(
-             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
+    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
+        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
+            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
+                               dp_packet_l4_size(pkt));
+        } else {
+            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
+            th->tcp_csum = csum_finish(
+                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
+        }
     }
 
     if (seq_skew) {
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 133942155..d10a0416e 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,6 +114,8 @@  static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -456,7 +458,7 @@  dp_packet_init_specific(struct dp_packet *p)
 {
     /* This initialization is needed for packets that do not come from DPDK
      * interfaces, when vswitchd is built with --with-dpdk. */
-    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
+    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
     p->mbuf.nb_segs = 1;
     p->mbuf.next = NULL;
 }
@@ -519,6 +521,80 @@  dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
     b->mbuf.buf_len = s;
 }
 
+static inline bool
+dp_packet_hwol_is_tso(const struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
+           ? true
+           : false;
+}
+
+static inline bool
+dp_packet_hwol_is_ipv4(const struct dp_packet *b)
+{
+    return b->mbuf.ol_flags & PKT_TX_IPV4 ? true : false;
+}
+
+static inline uint64_t
+dp_packet_hwol_l4_mask(const struct dp_packet *b)
+{
+    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM
+           ? true
+           : false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_udp(struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM
+           ? true
+           : false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
+{
+    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM
+           ? true
+           : false;
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
+    b->mbuf.ol_flags |= PKT_TX_IPV4;
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) {
+    b->mbuf.ol_flags |= PKT_TX_IPV6;
+}
+
+static inline void
+dp_packet_hwol_set_csum_tcp(struct dp_packet *b) {
+    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
+}
+
+static inline void
+dp_packet_hwol_set_csum_udp(struct dp_packet *b) {
+    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
+}
+
+static inline void
+dp_packet_hwol_set_csum_sctp(struct dp_packet *b) {
+    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
+}
+
+static inline void
+dp_packet_hwol_set_tcp_seg(struct dp_packet *b) {
+    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
+}
+
 /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
  * correct only if 'dp_packet_rss_valid(p)' returns true */
 static inline uint32_t
@@ -648,6 +724,66 @@  dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
     b->allocated_ = s;
 }
 
+static inline bool
+dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+static inline bool
+dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+static inline uint64_t
+dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
+{
+    return 0;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+static inline bool
+dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
+{
+    return false;
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) {
+}
+
+static inline void
+dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) {
+}
+
 /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
  * correct only if 'dp_packet_rss_valid(p)' returns true */
 static inline uint32_t
@@ -939,6 +1075,20 @@  dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
     }
 }
 
+static inline bool
+dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
+{
+    /* In this design, IP checksum offload is only requested together with
+     * an L4 offload, so the L4 mask doubles as the indicator here. */
+    return dp_packet_hwol_l4_mask(p) != 0;
+}
+
+static inline bool
+dp_packet_hwol_tx_l4_checksum(const struct dp_packet *p)
+{
+    return dp_packet_hwol_l4_mask(p) != 0;
+}
+
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/ipf.c b/lib/ipf.c
index 45c489122..0f43593a2 100644
--- a/lib/ipf.c
+++ b/lib/ipf.c
@@ -433,9 +433,11 @@  ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
     len += rest_len;
     l3 = dp_packet_l3(pkt);
     ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
-    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
-                                new_ip_frag_off);
-    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
+        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
+                                    new_ip_frag_off);
+        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+    }
     l3->ip_tot_len = htons(len);
     l3->ip_frag_off = new_ip_frag_off;
     dp_packet_set_l2_pad_size(pkt, 0);
@@ -606,6 +608,7 @@  ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
     }
 
     if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
+                     && !dp_packet_hwol_tx_ip_checksum(pkt)
                      && csum(l3, ip_hdr_len) != 0)) {
         goto invalid_pkt;
     }
@@ -1181,16 +1184,21 @@  ipf_post_execute_reass_pkts(struct ipf *ipf,
                 } else {
                     struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
                     struct ip_header *l3_reass = dp_packet_l3(pkt);
-                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
-                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
-                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
-                                                     frag_ip, reass_ip);
-                    l3_frag->ip_src = l3_reass->ip_src;
+                    if (!dp_packet_hwol_tx_ip_checksum(frag_0->pkt)) {
+                        ovs_be32 reass_ip =
+                            get_16aligned_be32(&l3_reass->ip_src);
+                        ovs_be32 frag_ip =
+                            get_16aligned_be32(&l3_frag->ip_src);
+
+                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+                                                         frag_ip, reass_ip);
+                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
+                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
+                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+                                                         frag_ip, reass_ip);
+                    }
 
-                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
-                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
-                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
-                                                     frag_ip, reass_ip);
+                    l3_frag->ip_src = l3_reass->ip_src;
                     l3_frag->ip_dst = l3_reass->ip_dst;
                 }
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 5e09786ac..2de60aa3f 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -64,6 +64,7 @@ 
 #include "smap.h"
 #include "sset.h"
 #include "timeval.h"
+#include "tso.h"
 #include "unaligned.h"
 #include "unixctl.h"
 #include "util.h"
@@ -360,7 +361,8 @@  struct ingress_policer {
 enum dpdk_hw_ol_features {
     NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
     NETDEV_RX_HW_CRC_STRIP = 1 << 1,
-    NETDEV_RX_HW_SCATTER = 1 << 2
+    NETDEV_RX_HW_SCATTER = 1 << 2,
+    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
 };
 
 /*
@@ -942,6 +944,12 @@  dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
         conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
     }
 
+    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
+        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
+        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
+    }
+
     /* Limit configured rss hash functions to only those supported
      * by the eth device. */
     conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
@@ -1043,6 +1051,9 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
     uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
                                      DEV_RX_OFFLOAD_TCP_CKSUM |
                                      DEV_RX_OFFLOAD_IPV4_CKSUM;
+    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
+                                   DEV_TX_OFFLOAD_TCP_CKSUM |
+                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
 
     rte_eth_dev_info_get(dev->port_id, &info);
 
@@ -1069,6 +1080,14 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
         dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
     }
 
+    if (info.tx_offload_capa & tx_tso_offload_capa) {
+        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+    } else {
+        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
+        VLOG_WARN("Tx TSO offload is not supported on %s port "
+                  DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
+    }
+
     n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
     n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
 
@@ -1319,14 +1338,16 @@  netdev_dpdk_vhost_construct(struct netdev *netdev)
         goto out;
     }
 
-    err = rte_vhost_driver_disable_features(dev->vhost_id,
-                                1ULL << VIRTIO_NET_F_HOST_TSO4
-                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
-                                | 1ULL << VIRTIO_NET_F_CSUM);
-    if (err) {
-        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
-                 "port: %s\n", name);
-        goto out;
+    if (!tso_enabled()) {
+        err = rte_vhost_driver_disable_features(dev->vhost_id,
+                                    1ULL << VIRTIO_NET_F_HOST_TSO4
+                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
+                                    | 1ULL << VIRTIO_NET_F_CSUM);
+        if (err) {
+            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
+                     "port: %s\n", name);
+            goto out;
+        }
     }
 
     err = rte_vhost_driver_start(dev->vhost_id);
@@ -1661,6 +1682,11 @@  netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
         } else {
             smap_add(args, "rx_csum_offload", "false");
         }
+        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+            smap_add(args, "tx_tso_offload", "true");
+        } else {
+            smap_add(args, "tx_tso_offload", "false");
+        }
         smap_add(args, "lsc_interrupt_mode",
                  dev->lsc_interrupt_mode ? "true" : "false");
     }
@@ -2088,6 +2114,67 @@  netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
     rte_free(rx);
 }
 
+/* Prepare the packet for HWOL.
+ * Returns true if the packet is OK to continue, false if it must be
+ * dropped. */
+static bool
+netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
+{
+    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
+
+    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
+        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
+        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
+        mbuf->outer_l2_len = 0;
+        mbuf->outer_l3_len = 0;
+    }
+
+    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
+        struct tcp_header *th = dp_packet_l4(pkt);
+
+        if (!th) {
+            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
+                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
+            return false;
+        }
+
+        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
+        mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
+        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
+
+        if (mbuf->ol_flags & PKT_TX_IPV4) {
+            mbuf->ol_flags |= PKT_TX_IP_CKSUM;
+        }
+    }
+    return true;
+}
+
+/* Prepare a batch for HWOL.
+ * Returns the number of good packets, compacted at the front of 'pkts';
+ * bad packets are freed. */
+static int
+netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
+                            int pkt_cnt)
+{
+    int i = 0;
+    int cnt = 0;
+    struct rte_mbuf *pkt;
+
+    /* Prepare and filter bad HWOL packets. */
+    for (i = 0; i < pkt_cnt; i++) {
+        pkt = pkts[i];
+        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
+            rte_pktmbuf_free(pkt);
+            continue;
+        }
+
+        if (OVS_UNLIKELY(i != cnt)) {
+            pkts[cnt] = pkt;
+        }
+        cnt++;
+    }
+
+    return cnt;
+}
+
 /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
  * 'pkts', even in case of failure.
  *
@@ -2097,11 +2184,22 @@  netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
                          struct rte_mbuf **pkts, int cnt)
 {
     uint32_t nb_tx = 0;
+    uint16_t nb_tx_prep = cnt;
+
+    if (tso_enabled()) {
+        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
+        if (nb_tx_prep != cnt) {
+            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
+                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
+                         cnt, rte_strerror(rte_errno));
+        }
+    }
 
-    while (nb_tx != cnt) {
+    while (nb_tx != nb_tx_prep) {
         uint32_t ret;
 
-        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
+        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
+                               nb_tx_prep - nb_tx);
         if (!ret) {
             break;
         }
@@ -2386,11 +2484,14 @@  netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
     int cnt = 0;
     struct rte_mbuf *pkt;
 
+    /* Filter oversized packets, unless they are marked for TSO. */
     for (i = 0; i < pkt_cnt; i++) {
         pkt = pkts[i];
-        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
-            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
-                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
+        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
+            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
+            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
+                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
+                         dev->max_packet_len);
             rte_pktmbuf_free(pkt);
             continue;
         }
@@ -2442,7 +2543,7 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
     struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
     struct netdev_dpdk_sw_stats sw_stats_add;
     unsigned int n_packets_to_free = cnt;
-    unsigned int total_packets = cnt;
+    unsigned int total_packets;
     int i, retries = 0;
     int max_retries = VHOST_ENQ_RETRY_MIN;
     int vid = netdev_dpdk_get_vid(dev);
@@ -2462,7 +2563,8 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
         rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
 
-    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
+    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
+    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
     sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
 
     /* Check if QoS has been configured for the netdev. */
@@ -2511,6 +2613,121 @@  out:
     }
 }
 
+static void
+netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
+{
+    rte_free(opaque);
+}
+
+static struct rte_mbuf *
+dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
+{
+    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
+    struct rte_mbuf_ext_shared_info *shinfo = NULL;
+    uint16_t buf_len;
+    void *buf;
+
+    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
+        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
+    } else {
+        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+    }
+
+    if (unlikely(total_len > UINT16_MAX)) {
+        VLOG_ERR("Can't copy packet: too big %u", total_len);
+        return NULL;
+    }
+
+    buf_len = total_len;
+    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+    if (unlikely(buf == NULL)) {
+        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
+        return NULL;
+    }
+
+    /* Initialize shinfo. */
+    if (shinfo) {
+        shinfo->free_cb = netdev_dpdk_extbuf_free;
+        shinfo->fcb_opaque = buf;
+        rte_mbuf_ext_refcnt_set(shinfo, 1);
+    } else {
+        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+                                                    netdev_dpdk_extbuf_free,
+                                                    buf);
+        if (unlikely(shinfo == NULL)) {
+            rte_free(buf);
+            VLOG_ERR("Failed to initialize shared info for mbuf while "
+                     "attempting to attach an external buffer.");
+            return NULL;
+        }
+    }
+
+    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
+                              shinfo);
+    rte_pktmbuf_reset_headroom(pkt);
+
+    return pkt;
+}
+
+static struct rte_mbuf *
+dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
+{
+    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+    if (OVS_UNLIKELY(!pkt)) {
+        return NULL;
+    }
+
+    dp_packet_init_specific((struct dp_packet *)pkt);
+    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
+        return pkt;
+    }
+
+    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
+        return pkt;
+    }
+
+    rte_pktmbuf_free(pkt);
+
+    return NULL;
+}
+
+static struct dp_packet *
+dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
+{
+    struct rte_mbuf *mbuf_dest;
+    struct dp_packet *pkt_dest;
+    uint32_t pkt_len;
+
+    pkt_len = dp_packet_size(pkt_orig);
+    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
+    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
+        return NULL;
+    }
+
+    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
+    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
+    dp_packet_set_size(pkt_dest, pkt_len);
+
+    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
+    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
+    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
+                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
+
+    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
+           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
+
+    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
+        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
+                                - (char *)dp_packet_eth(pkt_dest);
+        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
+                                - (char *) dp_packet_l3(pkt_dest);
+    }
+
+    return pkt_dest;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2524,7 +2741,7 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
     enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
 #endif
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
+    struct dp_packet *pkts[PKT_ARRAY_SIZE];
     struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
     uint32_t cnt = batch_cnt;
     uint32_t dropped = 0;
@@ -2545,34 +2762,30 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
         struct dp_packet *packet = batch->packets[i];
         uint32_t size = dp_packet_size(packet);
 
-        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
-            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
-                         size, dev->max_packet_len);
-
+        if (size > dev->max_packet_len
+            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
+            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
+                         dev->max_packet_len);
             mtu_drops++;
             continue;
         }
 
-        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
+        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
         if (OVS_UNLIKELY(!pkts[txcnt])) {
             dropped = cnt - i;
             break;
         }
 
-        /* We have to do a copy for now */
-        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-               dp_packet_data(packet), size);
-        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
-
         txcnt++;
     }
 
     if (OVS_LIKELY(txcnt)) {
         if (dev->type == DPDK_DEV_VHOST) {
-            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
-                                     txcnt);
+            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
         } else {
-            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
+            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
+                                                   (struct rte_mbuf **)pkts,
+                                                   txcnt);
         }
     }
 
@@ -2630,6 +2843,7 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
         int batch_cnt = dp_packet_batch_size(batch);
         struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
 
+        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
         tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
         mtu_drops = batch_cnt - tx_cnt;
         qos_drops = tx_cnt;
@@ -4345,6 +4559,12 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
 
     rte_free(dev->tx_q);
     err = dpdk_eth_dev_init(dev);
+    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
+        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+    }
+
     dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
     if (!dev->tx_q) {
         err = ENOMEM;
@@ -4374,6 +4594,11 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
         dev->tx_q[0].map = 0;
     }
 
+    if (tso_enabled()) {
+        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
+        VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
+    }
+
     netdev_dpdk_remap_txqs(dev);
 
     err = netdev_dpdk_mempool_configure(dev);
@@ -4446,6 +4671,11 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
             vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
         }
 
+        /* Enable External Buffers if TCP Segmentation Offload is enabled. */
+        if (tso_enabled()) {
+            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
+        }
+
         err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
         if (err) {
             VLOG_ERR("vhost-user device setup failure for device %s\n",
@@ -4470,14 +4700,20 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
             goto unlock;
         }
 
-        err = rte_vhost_driver_disable_features(dev->vhost_id,
-                                    1ULL << VIRTIO_NET_F_HOST_TSO4
-                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
-                                    | 1ULL << VIRTIO_NET_F_CSUM);
-        if (err) {
-            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
-                     "client port: %s\n", dev->up.name);
-            goto unlock;
+        if (tso_enabled()) {
+            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+        } else {
+            err = rte_vhost_driver_disable_features(dev->vhost_id,
+                                        1ULL << VIRTIO_NET_F_HOST_TSO4
+                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
+                                        | 1ULL << VIRTIO_NET_F_CSUM);
+            if (err) {
+                VLOG_ERR("rte_vhost_driver_disable_features failed for "
+                         "vhost user client port: %s\n", dev->up.name);
+                goto unlock;
+            }
         }
 
         err = rte_vhost_driver_start(dev->vhost_id);
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index f08159aa7..102548db7 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -37,10 +37,14 @@ 
 
 struct netdev;
 
+#define LINUX_RXQ_TSO_MAX_LEN 65536
+
 struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
     int fd;
+    char *bufaux;          /* Extra buffer to recv TSO pkt. */
+    int bufaux_len;        /* Extra buffer length. */
 };
 
 int netdev_linux_construct(struct netdev *);
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 8a62f9d74..604cb6913 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -29,16 +29,18 @@ 
 #include <linux/filter.h>
 #include <linux/gen_stats.h>
 #include <linux/if_ether.h>
+#include <linux/if_packet.h>
 #include <linux/if_tun.h>
 #include <linux/types.h>
 #include <linux/ethtool.h>
 #include <linux/mii.h>
 #include <linux/rtnetlink.h>
 #include <linux/sockios.h>
+#include <linux/virtio_net.h>
 #include <sys/ioctl.h>
 #include <sys/socket.h>
+#include <sys/uio.h>
 #include <sys/utsname.h>
-#include <netpacket/packet.h>
 #include <net/if.h>
 #include <net/if_arp.h>
 #include <net/route.h>
@@ -72,6 +74,7 @@ 
 #include "socket-util.h"
 #include "sset.h"
 #include "tc.h"
+#include "tso.h"
 #include "timer.h"
 #include "unaligned.h"
 #include "openvswitch/vlog.h"
@@ -501,6 +504,8 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  * changes in the device miimon status, so we can use atomic_count. */
 static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 
+static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
+static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
 static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
                                    int cmd, const char *cmd_name);
 static int get_flags(const struct netdev *, unsigned int *flags);
@@ -902,6 +907,13 @@  netdev_linux_common_construct(struct netdev *netdev_)
     /* The device could be in the same network namespace or in another one. */
     netnsid_unset(&netdev->netnsid);
     ovs_mutex_init(&netdev->mutex);
+
+    if (tso_enabled()) {
+        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
+        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
+        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
+    }
+
     return 0;
 }
 
@@ -961,6 +973,10 @@  netdev_linux_construct_tap(struct netdev *netdev_)
     /* Create tap device. */
     get_flags(&netdev->up, &netdev->ifi_flags);
     ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+    if (tso_enabled()) {
+        ifr.ifr_flags |= IFF_VNET_HDR;
+    }
+
     ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
     if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
         VLOG_WARN("%s: creating tap device failed: %s", name,
@@ -1024,6 +1040,13 @@  static struct netdev_rxq *
 netdev_linux_rxq_alloc(void)
 {
     struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
+    if (tso_enabled()) {
+        /* xmalloc() never returns NULL, so no result check is needed. */
+        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
+        rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
+    }
+
     return &rx->up;
 }
 
@@ -1069,6 +1092,17 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
             goto error;
         }
 
+        if (tso_enabled()) {
+            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
+                               sizeof val);
+            if (error) {
+                error = errno;
+                VLOG_ERR("%s: failed to enable vnet hdr in rxq raw socket: %s",
+                         netdev_get_name(netdev_), ovs_strerror(errno));
+                goto error;
+            }
+        }
+
         /* Set non-blocking mode. */
         error = set_nonblocking(rx->fd);
         if (error) {
@@ -1123,6 +1157,8 @@  netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     if (!rx->is_tap) {
         close(rx->fd);
     }
+
+    free(rx->bufaux);
 }
 
 static void
@@ -1152,11 +1188,13 @@  auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
 }
 
 static int
-netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
+netdev_linux_rxq_recv_sock(int fd, char *bufaux, int bufaux_len,
+                           struct dp_packet *buffer)
 {
-    size_t size;
+    size_t std_len;
+    size_t total_len;
     ssize_t retval;
-    struct iovec iov;
+    struct iovec iov[2];
     struct cmsghdr *cmsg;
     union {
         struct cmsghdr cmsg;
@@ -1166,14 +1204,17 @@  netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
 
     /* Reserve headroom for a single VLAN tag */
     dp_packet_reserve(buffer, VLAN_HEADER_LEN);
-    size = dp_packet_tailroom(buffer);
+    std_len = dp_packet_tailroom(buffer);
+    total_len = std_len + bufaux_len;
 
-    iov.iov_base = dp_packet_data(buffer);
-    iov.iov_len = size;
+    iov[0].iov_base = dp_packet_data(buffer);
+    iov[0].iov_len = std_len;
+    iov[1].iov_base = bufaux;
+    iov[1].iov_len = bufaux_len;
     msgh.msg_name = NULL;
     msgh.msg_namelen = 0;
-    msgh.msg_iov = &iov;
-    msgh.msg_iovlen = 1;
+    msgh.msg_iov = iov;
+    msgh.msg_iovlen = 2;
     msgh.msg_control = &cmsg_buffer;
     msgh.msg_controllen = sizeof cmsg_buffer;
     msgh.msg_flags = 0;
@@ -1184,11 +1225,26 @@  netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
 
     if (retval < 0) {
         return errno;
-    } else if (retval > size) {
+    } else if (retval > total_len) {
         return EMSGSIZE;
     }
 
-    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+    if (retval > std_len) {
+        /* Build a single linear TSO packet. */
+        size_t extra_len = retval - std_len;
+
+        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
+        dp_packet_prealloc_tailroom(buffer, extra_len);
+        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
+        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+    } else {
+        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+    }
+
+    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
+        VLOG_WARN_RL(&rl, "Invalid virtio net header");
+        return EINVAL;
+    }
 
     for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg; cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
         const struct tpacket_auxdata *aux;
@@ -1221,20 +1277,44 @@  netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
 }
 
 static int
-netdev_linux_rxq_recv_tap(int fd, struct dp_packet *buffer)
+netdev_linux_rxq_recv_tap(int fd, char *bufaux, int bufaux_len,
+                          struct dp_packet *buffer)
 {
     ssize_t retval;
-    size_t size = dp_packet_tailroom(buffer);
+    size_t std_len;
+    struct iovec iov[2];
+
+    std_len = dp_packet_tailroom(buffer);
+    iov[0].iov_base = dp_packet_data(buffer);
+    iov[0].iov_len = std_len;
+    iov[1].iov_base = bufaux;
+    iov[1].iov_len = bufaux_len;
 
     do {
-        retval = read(fd, dp_packet_data(buffer), size);
+        retval = readv(fd, iov, 2);
     } while (retval < 0 && errno == EINTR);
 
     if (retval < 0) {
         return errno;
     }
 
-    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+    if (retval > std_len) {
+        /* Build a single linear TSO packet. */
+        size_t extra_len = retval - std_len;
+
+        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
+        dp_packet_prealloc_tailroom(buffer, extra_len);
+        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
+        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+    } else {
+        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+    }
+
+    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
+        VLOG_WARN_RL(&rl, "Invalid virtio net header");
+        return EINVAL;
+    }
+
     return 0;
 }
 
@@ -1245,6 +1325,7 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
     struct netdev *netdev = rx->up.netdev;
     struct dp_packet *buffer;
+    size_t buffer_len;
     ssize_t retval;
     int mtu;
 
@@ -1252,12 +1333,18 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         mtu = ETH_PAYLOAD_MAX;
     }
 
+    buffer_len = VLAN_ETH_HEADER_LEN + mtu;
+    if (tso_enabled()) {
+        buffer_len += sizeof(struct virtio_net_hdr);
+    }
+
     /* Assume Ethernet port. No need to set packet_type. */
-    buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
-                                           DP_NETDEV_HEADROOM);
+    buffer = dp_packet_new_with_headroom(buffer_len, DP_NETDEV_HEADROOM);
     retval = (rx->is_tap
-              ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
-              : netdev_linux_rxq_recv_sock(rx->fd, buffer));
+              ? netdev_linux_rxq_recv_tap(rx->fd, rx->bufaux, rx->bufaux_len,
+                                          buffer)
+              : netdev_linux_rxq_recv_sock(rx->fd, rx->bufaux, rx->bufaux_len,
+                                           buffer));
 
     if (retval) {
         if (retval != EAGAIN && retval != EMSGSIZE) {
@@ -1302,7 +1389,7 @@  netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
 }
 
 static int
-netdev_linux_sock_batch_send(int sock, int ifindex,
+netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
                              struct dp_packet_batch *batch)
 {
     const size_t size = dp_packet_batch_size(batch);
@@ -1316,6 +1403,10 @@  netdev_linux_sock_batch_send(int sock, int ifindex,
 
     struct dp_packet *packet;
     DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        if (tso) {
+            netdev_linux_prepend_vnet_hdr(packet, mtu);
+        }
+
         iov[i].iov_base = dp_packet_data(packet);
         iov[i].iov_len = dp_packet_size(packet);
         mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
@@ -1348,7 +1439,7 @@  netdev_linux_sock_batch_send(int sock, int ifindex,
  * on other interface types because we attach a socket filter to the rx
  * socket. */
 static int
-netdev_linux_tap_batch_send(struct netdev *netdev_,
+netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
                             struct dp_packet_batch *batch)
 {
     struct netdev_linux *netdev = netdev_linux_cast(netdev_);
@@ -1365,10 +1456,15 @@  netdev_linux_tap_batch_send(struct netdev *netdev_,
     }
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-        size_t size = dp_packet_size(packet);
+        size_t size;
         ssize_t retval;
         int error;
 
+        if (tso) {
+            netdev_linux_prepend_vnet_hdr(packet, mtu);
+        }
+
+        size = dp_packet_size(packet);
         do {
             retval = write(netdev->tap_fd, dp_packet_data(packet), size);
             error = retval < 0 ? errno : 0;
@@ -1403,9 +1499,15 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
                   struct dp_packet_batch *batch,
                   bool concurrent_txq OVS_UNUSED)
 {
+    bool tso = tso_enabled();
+    int mtu = ETH_PAYLOAD_MAX;
     int error = 0;
     int sock = 0;
 
+    if (tso) {
+        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
+    }
+
     if (!is_tap_netdev(netdev_)) {
         if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
             error = EOPNOTSUPP;
@@ -1424,9 +1526,9 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
             goto free_batch;
         }
 
-        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
+        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
     } else {
-        error = netdev_linux_tap_batch_send(netdev_, batch);
+        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
     }
     if (error) {
         if (error == ENOBUFS) {
@@ -6173,6 +6275,19 @@  af_packet_sock(void)
                 close(sock);
                 sock = -error;
             }
+
+            if (tso_enabled()) {
+                int val = 1;
+                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
+                                   sizeof val);
+                if (error) {
+                    error = errno;
+                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
+                             ovs_strerror(errno));
+                    close(sock);
+                    sock = -error;
+                }
+            }
         } else {
             sock = -errno;
             VLOG_ERR("failed to create packet socket: %s",
@@ -6183,3 +6298,136 @@  af_packet_sock(void)
 
     return sock;
 }
+
+static int
+netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
+{
+    struct eth_header *eth_hdr;
+    ovs_be16 eth_type;
+    int l2_len;
+
+    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
+    if (!eth_hdr) {
+        return -EINVAL;
+    }
+
+    l2_len = ETH_HEADER_LEN;
+    eth_type = eth_hdr->eth_type;
+    if (eth_type_vlan(eth_type)) {
+        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
+
+        if (!vlan) {
+            return -EINVAL;
+        }
+
+        eth_type = vlan->vlan_next_type;
+        l2_len += VLAN_HEADER_LEN;
+    }
+
+    if (eth_type == htons(ETH_TYPE_IP)) {
+        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
+
+        if (!ip_hdr) {
+            return -EINVAL;
+        }
+
+        *l4proto = ip_hdr->ip_proto;
+        dp_packet_hwol_set_tx_ipv4(b);
+    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
+        struct ovs_16aligned_ip6_hdr *nh6;
+
+        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
+        if (!nh6) {
+            return -EINVAL;
+        }
+
+        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
+        dp_packet_hwol_set_tx_ipv6(b);
+    }
+
+    return 0;
+}
+
+static int
+netdev_linux_parse_vnet_hdr(struct dp_packet *b)
+{
+    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
+    uint16_t l4proto = 0;
+
+    if (OVS_UNLIKELY(!vnet)) {
+        return -EINVAL;
+    }
+
+    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
+        return 0;
+    }
+
+    if (netdev_linux_parse_l2(b, &l4proto)) {
+        return -EINVAL;
+    }
+
+    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+        if (l4proto == IPPROTO_TCP) {
+            dp_packet_hwol_set_csum_tcp(b);
+        } else if (l4proto == IPPROTO_UDP) {
+            dp_packet_hwol_set_csum_udp(b);
+        } else if (l4proto == IPPROTO_SCTP) {
+            dp_packet_hwol_set_csum_sctp(b);
+        }
+    }
+
+    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
+                                | VIRTIO_NET_HDR_GSO_TCPV6
+                                | VIRTIO_NET_HDR_GSO_UDP;
+        uint8_t type = vnet->gso_type & allowed_mask;
+
+        if (type == VIRTIO_NET_HDR_GSO_TCPV4
+            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
+            dp_packet_hwol_set_tcp_seg(b);
+        }
+    }
+
+    return 0;
+}
+
+static void
+netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
+{
+    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
+
+    if ((dp_packet_size(b) > mtu) && dp_packet_hwol_is_tso(b)) {
+        uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
+                            + TCP_HEADER_LEN;
+
+        vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
+        vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
+        if (dp_packet_hwol_is_ipv4(b)) {
+            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+        } else {
+            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+        }
+
+    } else {
+        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
+    }
+
+    if (dp_packet_hwol_l4_mask(b)) {
+        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+        vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
+                                                  - (char *)dp_packet_eth(b));
+
+        if (dp_packet_hwol_l4_is_tcp(b)) {
+            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+                                    struct tcp_header, tcp_csum);
+        } else if (dp_packet_hwol_l4_is_udp(b)) {
+            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+                                    struct udp_header, udp_csum);
+        } else if (dp_packet_hwol_l4_is_sctp(b)) {
+            vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
+                                    struct sctp_header, sctp_csum);
+        } else {
+            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
+        }
+    }
+}
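[Editor's note, not part of the patch] For readers unfamiliar with the virtio-net GSO fields that netdev_linux_prepend_vnet_hdr() fills in, here is a minimal stand-alone sketch (plain C, not OVS code; the struct and function names are invented for illustration) of how `hdr_len` and `gso_size` relate: the per-segment payload is the MTU minus the combined L2..L4 header length.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the two virtio_net_hdr fields the patch
 * computes; field names mirror the kernel struct, types are plain. */
struct vnet_hdr_sketch {
    uint16_t hdr_len;   /* Ethernet + IP + TCP header length. */
    uint16_t gso_size;  /* Payload bytes per segment after GSO. */
};

/* Compute the GSO parameters the way the patch does: everything that
 * is not header fits in one segment of at most 'mtu' bytes. */
static void
fill_gso_sketch(struct vnet_hdr_sketch *vnet, int l2_len, int l3_len,
                int l4_len, int mtu)
{
    uint16_t hdr_len = (uint16_t) (l2_len + l3_len + l4_len);

    vnet->hdr_len = hdr_len;
    vnet->gso_size = (uint16_t) (mtu - hdr_len);
}
```

For a plain TCP/IPv4 packet (14-byte Ethernet, 20-byte IP, 20-byte TCP) on a 1500-byte MTU this yields the familiar 1446-byte MSS-like segment size.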
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index f109c4e66..87c375b47 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -37,6 +37,12 @@  extern "C" {
 struct netdev_tnl_build_header_params;
 #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
 
+enum netdev_ol_flags {
+    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
+    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
+    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
+};
+
 /* A network device (e.g. an Ethernet device).
  *
  * Network device implementations may read these members but should not modify
@@ -51,6 +57,10 @@  struct netdev {
      * opening this device, and therefore got assigned to the "system" class */
     bool auto_classified;
 
+    /* This bitmask of the offloading features enabled/supported by the
+     * netdev. */
+    uint64_t ol_flags;
+
     /* If this is 'true', the user explicitly specified an MTU for this
      * netdev.  Otherwise, Open vSwitch is allowed to override it. */
     bool mtu_user_config;
diff --git a/lib/netdev.c b/lib/netdev.c
index 405c98c68..998525875 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -782,6 +782,52 @@  netdev_get_pt_mode(const struct netdev *netdev)
             : NETDEV_PT_LEGACY_L2);
 }
 
+/* Check if a 'packet' is compatible with 'netdev_flags'.
+ * If a packet is incompatible, return 'false' with the 'errormsg'
+ * pointing to a reason. */
+static bool
+netdev_send_prepare_packet(const uint64_t netdev_flags,
+                           struct dp_packet *packet, char **errormsg)
+{
+    if (dp_packet_hwol_is_tso(packet)
+        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
+            /* The netdev does not support TSO: drop the packet. */
+            *errormsg = "No TSO support";
+            return false;
+    }
+
+    if (dp_packet_hwol_l4_mask(packet)
+        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
+            /* The netdev does not support L4 csum offload: drop the packet. */
+            *errormsg = "No L4 checksum support";
+            return false;
+    }
+
+    return true;
+}
+
+/* Check if each packet in 'batch' is compatible with 'netdev' features,
+ * otherwise drop it. */
+static void
+netdev_send_prepare_batch(const struct netdev *netdev,
+                          struct dp_packet_batch *batch)
+{
+    struct dp_packet *packet;
+    size_t i, size = dp_packet_batch_size(batch);
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+        char *errormsg = NULL;
+
+        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
+            dp_packet_batch_refill(batch, packet, i);
+        } else {
+            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
+                         netdev_get_name(netdev),
+                         errormsg ? errormsg : "Unsupported feature");
+        }
+    }
+}
+
 /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
  * otherwise a positive errno value.  Returns EAGAIN without blocking if
  * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
@@ -811,8 +857,10 @@  int
 netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
             bool concurrent_txq)
 {
-    int error = netdev->netdev_class->send(netdev, qid, batch,
-                                           concurrent_txq);
+    int error;
+
+    netdev_send_prepare_batch(netdev, batch);
+    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
     if (!error) {
         COVERAGE_INC(netdev_sent);
     }
@@ -878,9 +926,17 @@  netdev_push_header(const struct netdev *netdev,
                    const struct ovs_action_push_tnl *data)
 {
     struct dp_packet *packet;
-    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-        netdev->netdev_class->push_header(netdev, packet, data);
-        pkt_metadata_init(&packet->md, data->out_port);
+    size_t i, size = dp_packet_batch_size(batch);
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+        if (!dp_packet_hwol_is_tso(packet)) {
+            netdev->netdev_class->push_header(netdev, packet, data);
+            pkt_metadata_init(&packet->md, data->out_port);
+            dp_packet_batch_refill(batch, packet, i);
+        } else {
+            VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
+                         "packet dropped", netdev_get_name(netdev));
+        }
     }
 
     return 0;
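[Editor's note, not part of the patch] The DP_PACKET_BATCH_REFILL_FOR_EACH pattern used by netdev_send_prepare_batch() and netdev_push_header() above compacts the batch in place: compatible packets are written back at the front, incompatible ones fall out. A toy model (plain C, not the OVS API; names are invented) of that pattern:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy batch: ints stand in for struct dp_packet pointers.  Each packet
 * value is the set of offload flags it requires from the netdev. */
struct toy_batch {
    int pkts[8];
    size_t count;
};

/* A packet is compatible if the netdev supports every flag it needs. */
static bool
toy_packet_ok(int pkt_flags, int netdev_flags)
{
    return (pkt_flags & ~netdev_flags) == 0;
}

/* Mirror of the refill loop: keep compatible packets by rewriting them
 * at the front of the array; the real code logs each dropped packet. */
static void
toy_prepare_batch(struct toy_batch *batch, int netdev_flags)
{
    size_t kept = 0;

    for (size_t i = 0; i < batch->count; i++) {
        if (toy_packet_ok(batch->pkts[i], netdev_flags)) {
            batch->pkts[kept++] = batch->pkts[i];   /* "refill" a slot. */
        }
    }
    batch->count = kept;
}
```

With a netdev supporting only flag 0x1, a batch of packets needing {0x1, 0x4, 0x1, 0x2} is compacted down to the two packets needing only 0x1.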
diff --git a/lib/tso.c b/lib/tso.c
new file mode 100644
index 000000000..9dc15e146
--- /dev/null
+++ b/lib/tso.c
@@ -0,0 +1,54 @@ 
+/*
+ * Copyright (c) 2020 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "smap.h"
+#include "ovs-thread.h"
+#include "openvswitch/vlog.h"
+#include "dpdk.h"
+#include "tso.h"
+#include "vswitch-idl.h"
+
+VLOG_DEFINE_THIS_MODULE(tso);
+
+static bool tso_support_enabled = false;
+
+void
+tso_init(const struct smap *ovs_other_config)
+{
+    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
+        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+
+        if (ovsthread_once_start(&once)) {
+            if (dpdk_available()) {
+                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
+                tso_support_enabled = true;
+            } else {
+                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
+                         "without enabling DPDK");
+                tso_support_enabled = false;
+            }
+            ovsthread_once_done(&once);
+        }
+    }
+}
+
+bool
+tso_enabled(void)
+{
+    return tso_support_enabled;
+}
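[Editor's note, not part of the patch] tso_init() latches its decision behind an OVSTHREAD_ONCE guard, so the flag is decided exactly once, on the first run where `other_config:tso-support=true` is seen; later reconfigurations cannot change it (hence the "restart required" note in vswitch.xml). A minimal single-threaded model of that latching behavior (plain C, not OVS code; `dpdk_ok` stands in for dpdk_available() and the string argument for smap_get_bool()):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool tso_flag;        /* Mirrors tso_support_enabled. */
static bool tso_init_done;   /* Mirrors the OVSTHREAD_ONCE guard. */

/* Decide the TSO flag once, only when the config key is "true", like
 * tso_init(): with DPDK available the flag turns on, otherwise it
 * stays off, and either way the decision is final. */
static void
toy_tso_init(const char *tso_support, bool dpdk_ok)
{
    if (tso_support && !strcmp(tso_support, "true") && !tso_init_done) {
        tso_flag = dpdk_ok;
        tso_init_done = true;
    }
}

static bool
toy_tso_enabled(void)
{
    return tso_flag;
}
```

Note that, as in the patch, a run with the key unset or "false" does not consume the once-guard, so enabling the key on a later (re)start still works.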
diff --git a/lib/tso.h b/lib/tso.h
new file mode 100644
index 000000000..6594496ac
--- /dev/null
+++ b/lib/tso.h
@@ -0,0 +1,23 @@ 
+/*
+ * Copyright (c) 2020 Red Hat Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef TSO_H
+#define TSO_H 1
+
+void tso_init(const struct smap *ovs_other_config);
+bool tso_enabled(void);
+
+#endif /* tso.h */
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index 86c7b10a9..6d73922f6 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -65,6 +65,7 @@ 
 #include "system-stats.h"
 #include "timeval.h"
 #include "tnl-ports.h"
+#include "tso.h"
 #include "util.h"
 #include "unixctl.h"
 #include "lib/vswitch-idl.h"
@@ -3285,6 +3286,7 @@  bridge_run(void)
     if (cfg) {
         netdev_set_flow_api_enabled(&cfg->other_config);
         dpdk_init(&cfg->other_config);
+        tso_init(&cfg->other_config);
     }
 
     /* Initialize the ofproto library.  This only needs to run once, but
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 0ec726c39..354dcabfa 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -690,6 +690,18 @@ 
          once in few hours or a day or a week.
         </p>
       </column>
+      <column name="other_config" key="tso-support"
+              type='{"type": "boolean"}'>
+        <p>
+          Set this value to <code>true</code> to enable support for TSO (TCP
+          Segmentation Offloading). When TSO is enabled, vhost-user client
+          interfaces can transmit packets up to 64KB.
+        </p>
+        <p>
+          The default value is <code>false</code>. Changing this value requires
+          restarting the daemon.
+        </p>
+      </column>
     </group>
     <group title="Status">
       <column name="next_cfg">