
[ovs-dev,v2] netdev-linux: Prepend the std packet in the TSO packet

Message ID: 20200203214550.19320-1-fbl@sysclose.org
State: Accepted
Series: [ovs-dev,v2] netdev-linux: Prepend the std packet in the TSO packet

Commit Message

Flavio Leitner Feb. 3, 2020, 9:45 p.m. UTC
TSO packets are usually close to 50k-60k bytes long, so to copy fewer
bytes when receiving a packet from the kernel, change the approach.
Instead of extending the received MTU-sized packet and appending the
remaining TSO data from the TSO buffer, allocate a TSO packet with
enough headroom to prepend the std packet data.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 lib/dp-packet.c            |   8 +--
 lib/dp-packet.h            |   2 +
 lib/netdev-linux-private.h |   3 +-
 lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
 4 files changed, 78 insertions(+), 52 deletions(-)

V2:
  - TSO packets' tailroom now depends on the headroom in netdev_linux_rxq_recv().
  - iov_len uses the packet's tailroom.

  This patch depends on a previously posted patch to work:
  Subject: netdev-linux-private: fix max length to be 16 bits
  https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.html

  With both patches applied, I can run iperf3 and scp in both directions
  with good performance and no issues.
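
To make the copy savings concrete: std_len covers only the virtio header
plus the VLAN Ethernet header plus the MTU (roughly 10 + 18 + 1500 bytes
for a default MTU), while a TSO frame approaches 64k, so prepending copies
~1.5k instead of appending ~60k. Below is a minimal sketch of the resulting
receive path, with batching, vnet-header parsing, and error handling elided;
recv_one() and its arguments are simplified stand-ins for the real
netdev_linux_batch_rxq_recv_*() code, not the actual OVS functions:

    #include <sys/uio.h>
    #include "dp-packet.h"

    /* Sketch: scatter one read across a small std_len buffer and a large
     * preallocated aux buffer, then keep whichever holds the bulk of the
     * data.  'buffer' has std_len bytes of tailroom; 'aux_buf' was created
     * by dp_packet_new_with_headroom(data_len, std_len). */
    static struct dp_packet *
    recv_one(int fd, size_t std_len, struct dp_packet *buffer,
             struct dp_packet *aux_buf)
    {
        struct iovec iov[2];
        iov[0].iov_base = dp_packet_data(buffer);    /* MTU-sized head */
        iov[0].iov_len = std_len;
        iov[1].iov_base = dp_packet_data(aux_buf);   /* TSO tail, if any */
        iov[1].iov_len = dp_packet_tailroom(aux_buf);

        ssize_t n = readv(fd, iov, 2);
        if (n > (ssize_t) std_len) {
            /* TSO packet: the large tail already sits in aux_buf, so only
             * the std_len head is copied, prepended into aux_buf's
             * headroom.  The old code copied the ~60k tail instead. */
            dp_packet_set_size(aux_buf, n - std_len);
            dp_packet_push(aux_buf, dp_packet_data(buffer), std_len);
            dp_packet_delete(buffer);
            return aux_buf;      /* caller replenishes rx->aux_bufs[i] */
        }
        dp_packet_set_size(buffer, n);  /* common case: nothing copied */
        return buffer;
    }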

Comments

Flavio Leitner Feb. 4, 2020, 3 p.m. UTC | #1
On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨燚) - Cloud Services Group wrote:
> Hi, Flavio
> 
> With this one patch and the previous several merged TSO-related patches, can
> veth work with "ethtool -K vethX tx on"? I could never figure out how veth
> can work in the DPDK data path when tx offload features are on; it looks like
> you're fixing this big issue, right?

If you have TSO enabled with DPDK, then veth works with vethX tx on
(which is the default setting for veth). Otherwise TSO is not
enabled and you need to turn tx offloading off.

You said "can work in dpdk data path when tx ... are on", so I think
it's okay? Not sure though.

> For the tap interface, it can't support TSO; do you Red Hat guys have a plan
> to enable it on the kernel side?

Yeah, currently it only works in one direction (OvS -> kernel). I am
looking into this now.
See below my iperf3 results as a reference:

Traffic Direction      TSO disabled         TSO enabled  
VM->tap                2.98 Gbits/sec       22.7 Gbits/sec
VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
tap->VM                2.29 Gbits/sec       2.19 Gbits/sec

fbl
 
> -----Original Message-----
> From: Flavio Leitner [mailto:fbl@sysclose.org]
> Sent: February 4, 2020 5:46
> To: dev@openvswitch.org
> Cc: Stokes Ian <ian.stokes@intel.com>; Loftus Ciara <ciara.loftus@intel.com>;
> Ilya Maximets <i.maximets@ovn.org>; Yi Yang (杨燚) - Cloud Services Group
> <yangyi01@inspur.com>; txfh2007 <txfh2007@aliyun.com>; Ben Pfaff <blp@ovn.org>;
> Flavio Leitner <fbl@sysclose.org>
> Subject: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet
> 
> TSO packets are usually close to 50k-60k bytes long, so to copy fewer
> bytes when receiving a packet from the kernel, change the approach.
> Instead of extending the received MTU-sized packet and appending the
> remaining TSO data from the TSO buffer, allocate a TSO packet with
> enough headroom to prepend the std packet data.
> 
> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> Suggested-by: Ben Pfaff <blp@ovn.org>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
>  lib/dp-packet.c            |   8 +--
>  lib/dp-packet.h            |   2 +
>  lib/netdev-linux-private.h |   3 +-
>  lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
>  4 files changed, 78 insertions(+), 52 deletions(-)
> 
> V2:
>   - TSO packets' tailroom now depends on the headroom in netdev_linux_rxq_recv().
>   - iov_len uses the packet's tailroom.
> 
>   This patch depends on a previously posted patch to work:
>   Subject: netdev-linux-private: fix max length to be 16 bits
>   https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.html
> 
>   With both patches applied, I can run iperf3 and scp in both directions
>   with good performance and no issues.
> 
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 8dfedcb7c..cd2623500 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -243,8 +243,8 @@ dp_packet_copy__(struct dp_packet *b, uint8_t *new_base,
>  
>  /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
>   * bytes of headroom and tailroom, respectively. */
> -static void
> -dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
> +void
> +dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
>  {
>      void *new_base, *new_data;
>      size_t new_allocated;
> @@ -297,7 +297,7 @@ void
>  dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)
>  {
>      if (size > dp_packet_tailroom(b)) {
> -        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
> +        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 64));
>      }
>  }
>  
> @@ -308,7 +308,7 @@ void
>  dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)
>  {
>      if (size > dp_packet_headroom(b)) {
> -        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
> +        dp_packet_resize(b, MAX(size, 64), dp_packet_tailroom(b));
>      }
>  }
>  
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 69ae5dfac..9a9d35183 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -152,6 +152,8 @@ struct dp_packet *dp_packet_clone_with_headroom(const struct dp_packet *,
>  struct dp_packet *dp_packet_clone_data(const void *, size_t);
>  struct dp_packet *dp_packet_clone_data_with_headroom(const void *, size_t,
>                                                       size_t headroom);
> +void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
> +                      size_t new_tailroom);
>  static inline void dp_packet_delete(struct dp_packet *);
>  
>  static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> index be2d7b10b..c7c515f70 100644
> --- a/lib/netdev-linux-private.h
> +++ b/lib/netdev-linux-private.h
> @@ -45,7 +45,8 @@ struct netdev_rxq_linux {
>      struct netdev_rxq up;
>      bool is_tap;
>      int fd;
> -    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> +    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
> +                                                     packets. */
>  };
>  
>  int netdev_linux_construct(struct netdev *);
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 6add3e2fc..c6f3d2740 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -1052,15 +1052,6 @@ static struct netdev_rxq *
>  netdev_linux_rxq_alloc(void)
>  {
>      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> -    if (userspace_tso_enabled()) {
> -        int i;
> -
> -        /* Allocate auxiliay buffers to receive TSO packets. */
> -        for (i = 0; i < NETDEV_MAX_BURST; i++) {
> -            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> -        }
> -    }
> -
>      return &rx->up;
>  }
>  
> @@ -1172,7 +1163,7 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>      }
>  
>      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> -        free(rx->aux_bufs[i]);
> +        dp_packet_delete(rx->aux_bufs[i]);
>      }
>  }
>  
> @@ -1238,13 +1229,18 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>          virtio_net_hdr_size = 0;
>      }
>  
> -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> +    /* The length here needs to be accounted in the same way when the
> +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
>      for (i = 0; i < NETDEV_MAX_BURST; i++) {
>           buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
>           iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
>           iovs[i][IOV_PACKET].iov_len = std_len;
> -         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> -         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> +         if (iovlen == IOV_TSO_SIZE) {
> +             iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> +             iovs[i][IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
> +         }
> +
>           mmsgs[i].msg_hdr.msg_name = NULL;
>           mmsgs[i].msg_hdr.msg_namelen = 0;
>           mmsgs[i].msg_hdr.msg_iov = iovs[i];
> @@ -1268,6 +1264,8 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>      }
>  
>      for (i = 0; i < retval; i++) {
> +        struct dp_packet *pkt;
> +
>          if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
>              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>              struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> @@ -1280,29 +1278,29 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>          }
>  
>          if (mmsgs[i].msg_len > std_len) {
> -            /* Build a single linear TSO packet by expanding the current packet
> -             * to append the data received in the aux_buf. */
> -            size_t extra_len = mmsgs[i].msg_len - std_len;
> -
> -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> -                               + std_len);
> -            dp_packet_prealloc_tailroom(buffers[i], extra_len);
> -            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
> -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> -                               + extra_len);
> -        } else {
> -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> -                               + mmsgs[i].msg_len);
> -        }
> +            /* Build a single linear TSO packet by prepending the data from
> +             * std_len buffer to the aux_buf. */
> +            pkt = rx->aux_bufs[i];
> +            dp_packet_set_size(pkt, mmsgs[i].msg_len - std_len);
> +            dp_packet_push(pkt, dp_packet_data(buffers[i]), std_len);
> +            /* The headroom should be the same in buffers[i], pkt and
> +             * DP_NETDEV_HEADROOM. */
> +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> +            dp_packet_delete(buffers[i]);
> +            rx->aux_bufs[i] = NULL;
> +         } else {
> +            dp_packet_set_size(buffers[i], mmsgs[i].msg_len);
> +            pkt = buffers[i];
> +         }
>  
> -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) {
> +        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
>              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>              struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>  
>              /* Unexpected error situation: the virtio header is not present
>               * or corrupted. Drop the packet but continue in case next ones
>               * are correct. */
> -            dp_packet_delete(buffers[i]);
> +            dp_packet_delete(pkt);
>              netdev->rx_dropped += 1;
>              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
>                           netdev_get_name(netdev_));
> @@ -1325,16 +1323,16 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>                  struct eth_header *eth;
>                  bool double_tagged;
>  
> -                eth = dp_packet_data(buffers[i]);
> +                eth = dp_packet_data(pkt);
>                  double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
>  
> -                eth_push_vlan(buffers[i],
> +                eth_push_vlan(pkt,
>                                auxdata_to_vlan_tpid(aux, double_tagged),
>                                htons(aux->tp_vlan_tci));
>                  break;
>              }
>          }
> -        dp_packet_batch_add(batch, buffers[i]);
> +        dp_packet_batch_add(batch, pkt);
>      }
>  
>      /* Delete unused buffers. */
> @@ -1354,7 +1352,6 @@ static int
>  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
>                                  struct dp_packet_batch *batch)
>  {
> -    struct dp_packet *buffer;
>      int virtio_net_hdr_size;
>      ssize_t retval;
>      size_t std_len;
> @@ -1372,16 +1369,22 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
>          virtio_net_hdr_size = 0;
>      }
>  
> -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> +    /* The length here needs to be accounted in the same way when the
> +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
>      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> +        struct dp_packet *buffer;
> +        struct dp_packet *pkt;
>          struct iovec iov[IOV_TSO_SIZE];
>  
>          /* Assume Ethernet port. No need to set packet_type. */
>          buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
>          iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
>          iov[IOV_PACKET].iov_len = std_len;
> -        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> -        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> +        if (iovlen == IOV_TSO_SIZE) {
> +            iov[IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> +            iov[IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
> +        }
>  
>          do {
>              retval = readv(rx->fd, iov, iovlen);
> @@ -1393,33 +1396,36 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
>          }
>  
>          if (retval > std_len) {
> -            /* Build a single linear TSO packet by expanding the current packet
> -             * to append the data received in the aux_buf. */
> -            size_t extra_len = retval - std_len;
> -
> -            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> -            dp_packet_prealloc_tailroom(buffer, extra_len);
> -            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
> -            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> +            /* Build a single linear TSO packet by prepending the data from
> +             * std_len buffer to the aux_buf. */
> +            pkt = rx->aux_bufs[i];
> +            dp_packet_set_size(pkt, retval - std_len);
> +            dp_packet_push(pkt, dp_packet_data(buffer), std_len);
> +            /* The headroom should be the same in buffers[i], pkt and
> +             * DP_NETDEV_HEADROOM. */
> +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> +            dp_packet_delete(buffer);
> +            rx->aux_bufs[i] = NULL;
>          } else {
>              dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> +            pkt = buffer;
>          }
>  
> -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
> +        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
>              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
>              struct netdev_linux *netdev = netdev_linux_cast(netdev_);
>  
>              /* Unexpected error situation: the virtio header is not present
>               * or corrupted. Drop the packet but continue in case next ones
>               * are correct. */
> -            dp_packet_delete(buffer);
> +            dp_packet_delete(pkt);
>              netdev->rx_dropped += 1;
>              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
>                           netdev_get_name(netdev_));
>              continue;
>          }
>  
> -        dp_packet_batch_add(batch, buffer);
> +        dp_packet_batch_add(batch, pkt);
>      }
>  
>      if ((i == 0) && (retval < 0)) {
> @@ -1442,6 +1448,23 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          mtu = ETH_PAYLOAD_MAX;
>      }
>  
> +    if (userspace_tso_enabled()) {
> +        /* Allocate TSO packets. The packet has enough headroom to store
> +         * a full non-TSO packet. When a TSO packet is received, the data
> +         * from non-TSO buffer (std_len) is prepended to the TSO packet
> +         * (aux_buf). */
> +        size_t std_len = sizeof(struct virtio_net_hdr) + VLAN_ETH_HEADER_LEN
> +                         + DP_NETDEV_HEADROOM + mtu;
> +        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - std_len;
> +        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
> +            if (rx->aux_bufs[i]) {
> +                continue;
> +            }
> +
> +            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len, std_len);
> +        }
> +    }
> +
>      dp_packet_batch_init(batch);
>      retval = (rx->is_tap
>                ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> -- 
> 2.24.1
> 
Flavio Leitner Feb. 4, 2020, 5:12 p.m. UTC | #2
On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨燚) - Cloud Services Group wrote:
> > Hi, Flavio
> > 
> > With this one patch and the previous several merged TSO-related patches, can
> > veth work with "ethtool -K vethX tx on"? I could never figure out how veth
> > can work in the DPDK data path when tx offload features are on; it looks like
> > you're fixing this big issue, right?
> 
> If you have TSO enabled with DPDK, then veth works with vethX tx on
> (which is the default setting for veth). Otherwise TSO is not
> enabled and you need to turn tx offloading off.
> 
> You said "can work in dpdk data path when tx ... are on", so I think
> it's okay? Not sure though.
> 
> > For the tap interface, it can't support TSO; do you Red Hat guys have a plan
> > to enable it on the kernel side?
> 
> Yeah, currently it only works in one direction (OvS -> kernel). I am
> looking into this now.

With TSO enabled on the TAP device:

Traffic Direction      TSO disabled         TSO enabled  
VM->tap                2.98 Gbits/sec       22.9 Gbits/sec
tap->VM                2.29 Gbits/sec       18.0 Gbits/sec

The code is in my github branch:
https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1

commit 884371df3bf3df836d4c2ab2d62b420339691fe8
Author: Flavio Leitner <fbl@sysclose.org>
Date:   Tue Feb 4 11:18:49 2020 -0300

    netdev-linux: Enable TSO in the TAP device.
    
    Use the TUNSETOFFLOAD ioctl, if the kernel supports it, to enable
    TSO offloading in the tap device.
    
    Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
    Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Please try it out and let me know if it works for you as well.
Thanks
fbl
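
For reference, TUNSETOFFLOAD takes a bitmask of TUN_F_* flags from
<linux/if_tun.h>; below is a minimal sketch of the enablement only, and
the exact flag set and error handling in the branch above may differ:

    #include <sys/ioctl.h>
    #include <linux/if_tun.h>
    #include <stdio.h>

    /* Sketch: ask the kernel to accept checksum and TCPv4/v6 segmentation
     * offload on an already-open tap fd.  TUN_F_CSUM is a prerequisite
     * for the TUN_F_TSO* flags. */
    static void
    tap_enable_tso(int tap_fd)
    {
        unsigned long offloads = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;

        if (ioctl(tap_fd, TUNSETOFFLOAD, offloads) == -1) {
            /* Kernel too old or flags unsupported: run without offloads. */
            perror("TUNSETOFFLOAD");
        }
    }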


Yi Yang (杨燚) - Cloud Services Group Feb. 5, 2020, 12:05 a.m. UTC | #3
Thanks Flavio, which kernel version supports TSO for the tap device?

Flavio Leitner Feb. 5, 2020, 11:45 a.m. UTC | #4
On Wed, Feb 05, 2020 at 12:05:23AM +0000, Yi Yang (杨燚) - Cloud Services Group wrote:
> Thanks Flavio, which kernel version supports TSO for the tap device?

That ioctl was introduced in 2.6.27.
fbl

> 
> -----邮件原件-----
> 发件人: Flavio Leitner [mailto:fbl@sysclose.org] 
> 发送时间: 2020年2月5日 1:12
> 收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> 抄送: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> 主题: Re: [ovs-dev] 答复: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet
> 
> On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> > On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨 D)-云服务集团 wrote:
> > > Hi, Flavio
> > > 
> > > With this one patch and previous several merged TSO-related patches, 
> > > can veth work with "ethtool -K vethX tx on"? I always can't figure 
> > > out why veth can work in dpdk data path when tx offload features are 
> > > on, it looks like you're fixing this big issue, right?
> > 
> > If you have tso enabled with dpdk, then veth works with vethX tx on 
> > (which is the default setting for veth). Otherwise TSO is not enabled 
> > and then you need to turn tx offloading off.
> > 
> > You said "can work in dpdk data path when tx ... are on", so I think 
> > it's okay? Not sure though.
> > 
> > > For tap interface, it can't support TSO, do you Redhat guys have 
> > > plan to enable it on kernel side.
> > 
> > Yeah, currently it only works in one direction (OvS -> kernel). I am 
> > looking into this now.
> 
> With TSO enabled on the TAP device:
> 
> Traffic Direction      TSO disabled         TSO enabled  
> VM->tap                2.98 Gbits/sec       22.9 Gbits/sec
> tap->VM                2.29 Gbits/sec       18.0 Gbits/sec
> 
> The code is in my github branch:
> https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
> 
> commit 884371df3bf3df836d4c2ab2d62b420339691fe8
> Author: Flavio Leitner <fbl@sysclose.org>
> Date:   Tue Feb 4 11:18:49 2020 -0300
> 
>     netdev-linux: Enable TSO in the TAP device.
>     
>     Use ioctl TUNSETOFFLOAD if kernel supports to enable TSO
>     offloading in the tap device.
>     
>     Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
>     Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> 
> Please try it out and let me know if it works for you as well.
> Thanks
> fbl
> 
> 
> > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> 
> 
> 
> > See below my iperf3 results as a reference:
> > 
> > Traffic Direction      TSO disabled         TSO enabled  
> > VM->tap                2.98 Gbits/sec       22.7 Gbits/sec
> > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > tap->VM                2.29 Gbits/sec       2.19 Gbits/sec
> > 
> > fbl
> >  
> > > -----邮件原件-----
> > > 发件人: Flavio Leitner [mailto:fbl@sysclose.org]
> > > 发送时间: 2020年2月4日 5:46
> > > 收件人: dev@openvswitch.org
> > > 抄送: Stokes Ian <ian.stokes@intel.com>; Loftus Ciara 
> > > <ciara.loftus@intel.com>; Ilya Maximets <i.maximets@ovn.org>; Yi 
> > > Yang (杨
> > > ?D)-云服务集团 <yangyi01@inspur.com>; txfh2007 <txfh2007@aliyun.com>; Ben 
> > > Pfaff <blp@ovn.org>; Flavio Leitner <fbl@sysclose.org>
> > > 主题: [PATCH v2] netdev-linux: Prepend the std packet in the TSO 
> > > packet
> > > 
> > > Usually TSO packets are close to 50k, 60k bytes long, so to to copy 
> > > less bytes when receiving a packet from the kernel change the 
> > > approach. Instead of extending the MTU sized packet received and 
> > > append with remaining TSO data from the TSO buffer, allocate a TSO 
> > > packet with enough headroom to prepend the std packet data.
> > > 
> > > Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload 
> > > support")
> > > Suggested-by: Ben Pfaff <blp@ovn.org>
> > > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > > ---
> > >  lib/dp-packet.c            |   8 +--
> > >  lib/dp-packet.h            |   2 +
> > >  lib/netdev-linux-private.h |   3 +-
> > >  lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
> > >  4 files changed, 78 insertions(+), 52 deletions(-)
> > > 
> > > V2:
> > >   - tso packets tailroom depends on headroom in netdev_linux_rxq_recv()
> > >   - iov_len uses packet's tailroom.
> > > 
> > >   This patch depends on a previous posted patch to work:
> > >   Subject: netdev-linux-private: fix max length to be 16 bits
> > >   
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.
> > > html
> > > 
> > >   With both patches applied, I can run iperf3 and scp in both directions
> > >   with good performance and no issues.
> > > 
> > > diff --git a/lib/dp-packet.c b/lib/dp-packet.c index 
> > > 8dfedcb7c..cd2623500
> > > 100644
> > > --- a/lib/dp-packet.c
> > > +++ b/lib/dp-packet.c
> > > @@ -243,8 +243,8 @@ dp_packet_copy__(struct dp_packet *b, uint8_t 
> > > *new_base,
> > >  
> > >  /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
> > >   * bytes of headroom and tailroom, respectively. */ -static void 
> > > -dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t
> > > new_tailroom)
> > > +void
> > > +dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t
> > > +new_tailroom)
> > >  {
> > >      void *new_base, *new_data;
> > >      size_t new_allocated;
> > > @@ -297,7 +297,7 @@ void
> > >  dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)  {
> > >      if (size > dp_packet_tailroom(b)) {
> > > -        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
> > > +        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 64));
> > >      }
> > >  }
> > >  
> > > @@ -308,7 +308,7 @@ void
> > >  dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)  {
> > >      if (size > dp_packet_headroom(b)) {
> > > -        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
> > > +        dp_packet_resize(b, MAX(size, 64), dp_packet_tailroom(b));
> > >      }
> > >  }
> > >  
> > > diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 
> > > 69ae5dfac..9a9d35183
> > > 100644
> > > --- a/lib/dp-packet.h
> > > +++ b/lib/dp-packet.h
> > > @@ -152,6 +152,8 @@ struct dp_packet 
> > > *dp_packet_clone_with_headroom(const
> > > struct dp_packet *,  struct dp_packet *dp_packet_clone_data(const 
> > > void *, size_t);  struct dp_packet 
> > > *dp_packet_clone_data_with_headroom(const void *, size_t,
> > >                                                       size_t 
> > > headroom);
> > > +void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
> > > +                      size_t new_tailroom);
> > >  static inline void dp_packet_delete(struct dp_packet *);
> > >  
> > >  static inline void *dp_packet_at(const struct dp_packet *, size_t 
> > > offset, diff --git a/lib/netdev-linux-private.h 
> > > b/lib/netdev-linux-private.h index
> > > be2d7b10b..c7c515f70 100644
> > > --- a/lib/netdev-linux-private.h
> > > +++ b/lib/netdev-linux-private.h
> > > @@ -45,7 +45,8 @@ struct netdev_rxq_linux {
> > >      struct netdev_rxq up;
> > >      bool is_tap;
> > >      int fd;
> > > -    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers.
> > > */
> > > +    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
> > > +                                                     packets. */
> > >  };
> > >  
> > >  int netdev_linux_construct(struct netdev *); diff --git a/lib/netdev-linux.
> > > c b/lib/netdev-linux.c index 6add3e2fc..c6f3d2740 100644
> > > --- a/lib/netdev-linux.c
> > > +++ b/lib/netdev-linux.c
> > > @@ -1052,15 +1052,6 @@ static struct netdev_rxq *
> > >  netdev_linux_rxq_alloc(void)
> > >  {
> > >      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > > -    if (userspace_tso_enabled()) {
> > > -        int i;
> > > -
> > > -        /* Allocate auxiliay buffers to receive TSO packets. */
> > > -        for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > -            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > > -        }
> > > -    }
> > > -
> > >      return &rx->up;
> > >  }
> > >  
> > > @@ -1172,7 +1163,7 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
> > >      }
> > >  
> > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > -        free(rx->aux_bufs[i]);
> > > +        dp_packet_delete(rx->aux_bufs[i]);
> > >      }
> > >  }
> > >  
> > > @@ -1238,13 +1229,18 @@ netdev_linux_batch_rxq_recv_sock(struct
> > > netdev_rxq_linux *rx, int mtu,
> > >          virtio_net_hdr_size = 0;
> > >      }
> > >  
> > > -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> > > +    /* The length here needs to be accounted in the same way when the
> > > +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> > > +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
> > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > >           buffers[i] = dp_packet_new_with_headroom(std_len,
> > > DP_NETDEV_HEADROOM);
> > >           iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
> > >           iovs[i][IOV_PACKET].iov_len = std_len;
> > > -         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> > > -         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> > > +         if (iovlen == IOV_TSO_SIZE) {
> > > +             iovs[i][IOV_AUXBUF].iov_base =
> > > dp_packet_data(rx->aux_bufs[i]);
> > > +             iovs[i][IOV_AUXBUF].iov_len =
> > > dp_packet_tailroom(rx->aux_bufs[i]);
> > > +         }
> > > +
> > >           mmsgs[i].msg_hdr.msg_name = NULL;
> > >           mmsgs[i].msg_hdr.msg_namelen = 0;
> > >           mmsgs[i].msg_hdr.msg_iov = iovs[i]; @@ -1268,6 +1264,8 @@ 
> > > netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
> > >      }
> > >  
> > >      for (i = 0; i < retval; i++) {
> > > +        struct dp_packet *pkt;
> > > +
> > >          if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
> > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > >              struct netdev_linux *netdev = 
> > > netdev_linux_cast(netdev_); @@
> > > -1280,29 +1278,29 @@ netdev_linux_batch_rxq_recv_sock(struct
> > > netdev_rxq_linux *rx, int mtu,
> > >          }
> > >  
> > >          if (mmsgs[i].msg_len > std_len) {
> > > -            /* Build a single linear TSO packet by expanding the current
> > > packet
> > > -             * to append the data received in the aux_buf. */
> > > -            size_t extra_len = mmsgs[i].msg_len - std_len;
> > > -
> > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > -                               + std_len);
> > > -            dp_packet_prealloc_tailroom(buffers[i], extra_len);
> > > -            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
> > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > -                               + extra_len);
> > > -        } else {
> > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > -                               + mmsgs[i].msg_len);
> > > -        }
> > > +            /* Build a single linear TSO packet by prepending the data from
> > > +             * std_len buffer to the aux_buf. */
> > > +            pkt = rx->aux_bufs[i];
> > > +            dp_packet_set_size(pkt, mmsgs[i].msg_len - std_len);
> > > +            dp_packet_push(pkt, dp_packet_data(buffers[i]), std_len);
> > > +            /* The headroom should be the same in buffers[i], pkt and
> > > +             * DP_NETDEV_HEADROOM. */
> > > +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> > > +            dp_packet_delete(buffers[i]);
> > > +            rx->aux_bufs[i] = NULL;
> > > +         } else {
> > > +            dp_packet_set_size(buffers[i], mmsgs[i].msg_len);
> > > +            pkt = buffers[i];
> > > +         }
> > >  
> > > -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i]))
> > > {
> > > +        if (virtio_net_hdr_size && 
> > > + netdev_linux_parse_vnet_hdr(pkt)) {
> > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > >              struct netdev_linux *netdev = 
> > > netdev_linux_cast(netdev_);
> > >  
> > >              /* Unexpected error situation: the virtio header is not present
> > >               * or corrupted. Drop the packet but continue in case next ones
> > >               * are correct. */
> > > -            dp_packet_delete(buffers[i]);
> > > +            dp_packet_delete(pkt);
> > >              netdev->rx_dropped += 1;
> > >              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio 
> > > net header",
> > >                           netdev_get_name(netdev_)); @@ -1325,16 
> > > +1323,16 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
> > >                  struct eth_header *eth;
> > >                  bool double_tagged;
> > >  
> > > -                eth = dp_packet_data(buffers[i]);
> > > +                eth = dp_packet_data(pkt);
> > >                  double_tagged = eth->eth_type == 
> > > htons(ETH_TYPE_VLAN_8021Q);
> > >  
> > > -                eth_push_vlan(buffers[i],
> > > +                eth_push_vlan(pkt,
> > >                                auxdata_to_vlan_tpid(aux, double_tagged),
> > >                                htons(aux->tp_vlan_tci));
> > >                  break;
> > >              }
> > >          }
> > > -        dp_packet_batch_add(batch, buffers[i]);
> > > +        dp_packet_batch_add(batch, pkt);
> > >      }
> > >  
> > >      /* Delete unused buffers. */
> > > @@ -1354,7 +1352,6 @@ static int
> > >  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> > >                                  struct dp_packet_batch *batch)  {
> > > -    struct dp_packet *buffer;
> > >      int virtio_net_hdr_size;
> > >      ssize_t retval;
> > >      size_t std_len;
> > > @@ -1372,16 +1369,22 @@ netdev_linux_batch_rxq_recv_tap(struct
> > > netdev_rxq_linux *rx, int mtu,
> > >          virtio_net_hdr_size = 0;
> > >      }
> > >  
> > > -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> > > +    /* The length here needs to be accounted in the same way when the
> > > +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> > > +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
> > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > +        struct dp_packet *buffer;
> > > +        struct dp_packet *pkt;
> > >          struct iovec iov[IOV_TSO_SIZE];
> > >  
> > >          /* Assume Ethernet port. No need to set packet_type. */
> > >          buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> > >          iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
> > >          iov[IOV_PACKET].iov_len = std_len;
> > > -        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> > > -        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> > > +        if (iovlen == IOV_TSO_SIZE) {
> > > +            iov[IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> > > +            iov[IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
> > > +        }
> > >  
> > >          do {
> > >              retval = readv(rx->fd, iov, iovlen); @@ -1393,33 
> > > +1396,36 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> > >          }
> > >  
> > >          if (retval > std_len) {
> > > -            /* Build a single linear TSO packet by expanding the current
> > > packet
> > > -             * to append the data received in the aux_buf. */
> > > -            size_t extra_len = retval - std_len;
> > > -
> > > -            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > > -            dp_packet_prealloc_tailroom(buffer, extra_len);
> > > -            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
> > > -            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > > +            /* Build a single linear TSO packet by prepending the data from
> > > +             * std_len buffer to the aux_buf. */
> > > +            pkt = rx->aux_bufs[i];
> > > +            dp_packet_set_size(pkt, retval - std_len);
> > > +            dp_packet_push(pkt, dp_packet_data(buffer), std_len);
> > > +            /* The headroom should be the same in buffers[i], pkt and
> > > +             * DP_NETDEV_HEADROOM. */
> > > +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> > > +            dp_packet_delete(buffer);
> > > +            rx->aux_bufs[i] = NULL;
> > >          } else {
> > >              dp_packet_set_size(buffer, dp_packet_size(buffer) + 
> > > retval);
> > > +            pkt = buffer;
> > >          }
> > >  
> > > -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
> > > +        if (virtio_net_hdr_size && 
> > > + netdev_linux_parse_vnet_hdr(pkt)) {
> > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > >              struct netdev_linux *netdev = 
> > > netdev_linux_cast(netdev_);
> > >  
> > >              /* Unexpected error situation: the virtio header is not present
> > >               * or corrupted. Drop the packet but continue in case next ones
> > >               * are correct. */
> > > -            dp_packet_delete(buffer);
> > > +            dp_packet_delete(pkt);
> > >              netdev->rx_dropped += 1;
> > >              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio 
> > > net header",
> > >                           netdev_get_name(netdev_));
> > >              continue;
> > >          }
> > >  
> > > -        dp_packet_batch_add(batch, buffer);
> > > +        dp_packet_batch_add(batch, pkt);
> > >      }
> > >  
> > >      if ((i == 0) && (retval < 0)) { @@ -1442,6 +1448,23 @@ 
> > > netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct 
> > > dp_packet_batch *batch,
> > >          mtu = ETH_PAYLOAD_MAX;
> > >      }
> > >  
> > > +    if (userspace_tso_enabled()) {
> > > +        /* Allocate TSO packets. The packet has enough headroom to store
> > > +         * a full non-TSO packet. When a TSO packet is received, the data
> > > +         * from non-TSO buffer (std_len) is prepended to the TSO packet
> > > +         * (aux_buf). */
> > > +        size_t std_len = sizeof(struct virtio_net_hdr) +
> > > VLAN_ETH_HEADER_LEN
> > > +                         + DP_NETDEV_HEADROOM + mtu;
> > > +        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - std_len;
> > > +        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
> > > +            if (rx->aux_bufs[i]) {
> > > +                continue;
> > > +            }
> > > +
> > > +            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len,
> > > std_len);
> > > +        }
> > > +    }
> > > +
> > >      dp_packet_batch_init(batch);
> > >      retval = (rx->is_tap
> > >                ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> > > --
> > > 2.24.1
> > > 
> > 
> > 
> > 
> > --
> > fbl
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
> --
> fbl
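
To make the prepend approach concrete, here is a simplified sketch of the
receive path after the patch quoted above, assuming the kernel has already
filled the two iovecs in a single read. The dp_packet_* calls are the ones
from the diff; msg_len, buffer and aux are illustrative names:

    /* 'buffer' holds the first std_len bytes; 'aux' is the preallocated
     * TSO packet that received the remainder.  'aux' was created with
     * dp_packet_new_with_headroom(data_len, std_len), so its headroom is
     * large enough to absorb the std_len prefix. */
    struct dp_packet *pkt;

    if (msg_len > std_len) {
        pkt = aux;
        dp_packet_set_size(pkt, msg_len - std_len);       /* TSO remainder. */
        dp_packet_push(pkt, dp_packet_data(buffer), std_len); /* Prepend. */
        dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);  /* Uniform headroom. */
        dp_packet_delete(buffer);       /* MTU-sized buffer is not needed. */
    } else {
        dp_packet_set_size(buffer, msg_len);    /* Non-TSO: 'buffer' only. */
        pkt = buffer;
    }

Only std_len bytes are copied this way, instead of the old approach of
copying the whole TSO remainder into a reallocated, extended packet.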
Yi Yang (杨燚)-云服务集团 Feb. 6, 2020, 7:34 a.m. UTC | #5
Hi, Flavio

I tried the current OVS master plus your two patches (this one and the tap one); tap-to-tap and veth-to-veth performance is very bad when TSO is on, so I think the merged TSO patch series is wrong for such use cases. BTW, the performance data is normal if I turn TSO off ("sudo ip netns exec nsXXX ethtool -K vethX tso off"). For tap the same problem is there (I used your tap patch for this); it is OK when I turn TSO off.

Here is the performance data:

Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 36024 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  7.59 MBytes  6.37 Mbits/sec  2410   7.07 KBytes
[  4]  10.00-20.00  sec  7.76 MBytes  6.51 Mbits/sec  2496   5.66 KBytes
[  4]  20.00-30.00  sec  7.99 MBytes  6.70 Mbits/sec  2536   5.66 KBytes
[  4]  30.00-40.00  sec  7.85 MBytes  6.58 Mbits/sec  2506   5.66 KBytes
[  4]  40.00-50.00  sec  7.93 MBytes  6.65 Mbits/sec  2556   5.66 KBytes
[  4]  50.00-60.00  sec  8.25 MBytes  6.92 Mbits/sec  2696   7.07 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  47.4 MBytes  6.62 Mbits/sec  15200             sender
[  4]   0.00-60.00  sec  47.3 MBytes  6.61 Mbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 36022
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 36024
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  7.44 MBytes  6.24 Mbits/sec
[  5]  10.00-20.00  sec  7.80 MBytes  6.54 Mbits/sec
[  5]  20.00-30.00  sec  7.95 MBytes  6.67 Mbits/sec
[  5]  30.00-40.00  sec  7.83 MBytes  6.57 Mbits/sec
[  5]  40.00-50.00  sec  8.00 MBytes  6.71 Mbits/sec
[  5]  50.00-60.00  sec  8.23 MBytes  6.91 Mbits/sec


iperf Done.

You can use the script below to check the performance (run it as root: sudo run-iperf3.sh).

$ cat run-iperf3.sh
#!/bin/bash

ovs-vsctl add-br br-int -- set bridge br-int datapath_type=netdev protocols=OpenFlow10,OpenFlow12,OpenFlow13 stp_enable=false
ip link add veth1 type veth peer name vethbr1
ip link add veth2 type veth peer name vethbr2
ip netns add ns01
ip netns add ns02

ip link set veth1 netns ns01
ip link set veth2 netns ns02

ip netns exec ns01 ifconfig lo 127.0.0.1 up
ip netns exec ns01 ifconfig veth1 10.15.1.2/24 up

ip netns exec ns02 ifconfig lo 127.0.0.1 up
ip netns exec ns02 ifconfig veth2 10.15.1.3/24 up

ifconfig vethbr1 0 up
ifconfig vethbr2 0 up


ovs-vsctl add-port br-int vethbr1
ovs-vsctl add-port br-int vethbr2

ip netns exec ns01 ping 10.15.1.3 -c 3
ip netns exec ns02 ping 10.15.1.2 -c 3

ip netns exec ns02 iperf3 -s -i 10 -D
ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output

-----Original Message-----
From: Flavio Leitner [mailto:fbl@sysclose.org]
Sent: February 5, 2020, 19:46
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
Subject: Re: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet

On Wed, Feb 05, 2020 at 12:05:23AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Thanks Flavio, which kernel version supports TSO for the tap device?

That ioctl was introduced in 2.6.27.
fbl
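
For reference, a minimal sketch (not the code from the branch above) of how
that ioctl is typically used on an already-open tap fd; the exact flag set
OVS passes may differ:

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/if_tun.h>  /* TUNSETOFFLOAD and the TUN_F_* flag bits. */

    /* Enable checksum and TSO offloads on a tap device; returns 0 on
     * success or a negative errno value on failure. */
    static int
    tap_enable_tso(int tap_fd)
    {
        unsigned long offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;

        if (ioctl(tap_fd, TUNSETOFFLOAD, offload) == -1) {
            /* Kernels older than 2.6.27 do not know this ioctl. */
            return -errno;
        }
        return 0;
    }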

> [snip -- the rest of the quote repeats the earlier thread verbatim,
> including the full v2 patch shown above]
Flavio Leitner Feb. 6, 2020, 12:10 p.m. UTC | #6
On Thu, Feb 06, 2020 at 07:34:35AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, Flavio
> 
> I tried the current OVS master plus your two patches (this one and
> the tap one); tap-to-tap and veth-to-veth performance is very bad
> when TSO is on, so I think the merged TSO patch series is wrong for
> such use cases. BTW, the performance data is normal if I turn TSO
> off ("sudo ip netns exec nsXXX ethtool -K vethX tso off"). For tap
> the same problem is there (I used your tap patch for this); it is
> OK when I turn TSO off.

Can you try out the code from here instead:
https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1 
 
> here is performance data:
> 
> Connecting to host 10.15.1.3, port 5201
> [  4] local 10.15.1.2 port 36024 connected to 10.15.1.3 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-10.00  sec  7.59 MBytes  6.37 Mbits/sec  2410   7.07 KBytes
> [  4]  10.00-20.00  sec  7.76 MBytes  6.51 Mbits/sec  2496   5.66 KBytes
> [  4]  20.00-30.00  sec  7.99 MBytes  6.70 Mbits/sec  2536   5.66 KBytes
> [  4]  30.00-40.00  sec  7.85 MBytes  6.58 Mbits/sec  2506   5.66 KBytes
> [  4]  40.00-50.00  sec  7.93 MBytes  6.65 Mbits/sec  2556   5.66 KBytes
> [  4]  50.00-60.00  sec  8.25 MBytes  6.92 Mbits/sec  2696   7.07 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-60.00  sec  47.4 MBytes  6.62 Mbits/sec  15200             sender
> [  4]   0.00-60.00  sec  47.3 MBytes  6.61 Mbits/sec                  receiver
> 
> Server output:
> Accepted connection from 10.15.1.2, port 36022
> [  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 36024
> [ ID] Interval           Transfer     Bandwidth
> [  5]   0.00-10.00  sec  7.44 MBytes  6.24 Mbits/sec
> [  5]  10.00-20.00  sec  7.80 MBytes  6.54 Mbits/sec
> [  5]  20.00-30.00  sec  7.95 MBytes  6.67 Mbits/sec
> [  5]  30.00-40.00  sec  7.83 MBytes  6.57 Mbits/sec
> [  5]  40.00-50.00  sec  8.00 MBytes  6.71 Mbits/sec
> [  5]  50.00-60.00  sec  8.23 MBytes  6.91 Mbits/sec

Yeah, that looks to be the performance of the TCP retransmits only.

Here is your script running with the code I provided above:
[root@wsfd-netdev93 yiyang]# ./yiyang-test.sh 
PING 10.15.1.3 (10.15.1.3) 56(84) bytes of data.
64 bytes from 10.15.1.3: icmp_seq=1 ttl=64 time=0.425 ms
64 bytes from 10.15.1.3: icmp_seq=2 ttl=64 time=0.253 ms
64 bytes from 10.15.1.3: icmp_seq=3 ttl=64 time=0.356 ms

--- 10.15.1.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 44ms
rtt min/avg/max/mdev = 0.253/0.344/0.425/0.073 ms
PING 10.15.1.2 (10.15.1.2) 56(84) bytes of data.
64 bytes from 10.15.1.2: icmp_seq=1 ttl=64 time=0.302 ms
64 bytes from 10.15.1.2: icmp_seq=2 ttl=64 time=0.475 ms
64 bytes from 10.15.1.2: icmp_seq=3 ttl=64 time=0.383 ms

--- 10.15.1.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 43ms
rtt min/avg/max/mdev = 0.302/0.386/0.475/0.074 ms
Connecting to host 10.15.1.3, port 5201
[  5] local 10.15.1.2 port 59660 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  7.31 GBytes  6.28 Gbits/sec  49510    187 KBytes       
[  5]  10.00-20.00  sec  7.37 GBytes  6.33 Gbits/sec  51415    215 KBytes       
[  5]  20.00-30.00  sec  7.30 GBytes  6.27 Gbits/sec  46197    182 KBytes       
[  5]  30.00-40.00  sec  7.30 GBytes  6.27 Gbits/sec  46344    202 KBytes       
[  5]  40.00-50.00  sec  7.16 GBytes  6.15 Gbits/sec  49123    287 KBytes       
[  5]  50.00-60.00  sec  7.33 GBytes  6.30 Gbits/sec  48734    214 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec  291323             sender
[  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 59658
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 59660
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  7.30 GBytes  6.27 Gbits/sec                  
[  5]  10.00-20.00  sec  7.37 GBytes  6.33 Gbits/sec                  
[  5]  20.00-30.00  sec  7.30 GBytes  6.27 Gbits/sec                  
[  5]  30.00-40.00  sec  7.30 GBytes  6.27 Gbits/sec                  
[  5]  40.00-50.00  sec  7.16 GBytes  6.15 Gbits/sec                  
[  5]  50.00-60.00  sec  7.33 GBytes  6.30 Gbits/sec                  
[  5]  60.00-60.00  sec   127 KBytes  4.09 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec                  receiver


iperf Done.

HTH,
fbl


> 
> 
> iperf Done.
> 
> You can use the below script to check performance (need sudo run-iperf3.sh).
> 
> $ cat run-iperf3.sh
> #!/bin/bash
> 
> ovs-vsctl add-br br-int -- set bridge br-int datapath_type=netdev protocols=OpenFlow10,OpenFlow12,OpenFlow13 stp_enable=false
> ip link add veth1 type veth peer name vethbr1
> ip link add veth2 type veth peer name vethbr2
> ip netns add ns01
> ip netns add ns02
> 
> ip link set veth1 netns ns01
> ip link set veth2 netns ns02
> 
> ip netns exec ns01 ifconfig lo 127.0.0.1 up
> ip netns exec ns01 ifconfig veth1 10.15.1.2/24 up
> 
> ip netns exec ns02 ifconfig lo 127.0.0.1 up
> ip netns exec ns02 ifconfig veth2 10.15.1.3/24 up
> 
> ifconfig vethbr1 0 up
> ifconfig vethbr2 0 up
> 
> 
> ovs-vsctl add-port br-int vethbr1
> ovs-vsctl add-port br-int vethbr2
> 
> ip netns exec ns01 ping 10.15.1.3 -c 3
> ip netns exec ns02 ping 10.15.1.2 -c 3
> 
> ip netns exec ns02 iperf3 -s -i 10 -D
> ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output
> 
> -----邮件原件-----
> 发件人: Flavio Leitner [mailto:fbl@sysclose.org] 
> 发送时间: 2020年2月5日 19:46
> 收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> 抄送: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> 主题: Re: 答复: [ovs-dev] 答复: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet
> 
> On Wed, Feb 05, 2020 at 12:05:23AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > Thanks Flavio, which kernel version can support TSO for tap device?
> 
> That ioctl was introduced in 2.6.27.
> fbl
> 
> > 
> > -----邮件原件-----
> > 发件人: Flavio Leitner [mailto:fbl@sysclose.org]
> > 发送时间: 2020年2月5日 1:12
> > 收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> > 抄送: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> > 主题: Re: [ovs-dev] 答复: [PATCH v2] netdev-linux: Prepend the std packet 
> > in the TSO packet
> > 
> > On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> > > On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨 D)-云服务集团 wrote:
> > > > Hi, Flavio
> > > > 
> > > > With this one patch and previous several merged TSO-related 
> > > > patches, can veth work with "ethtool -K vethX tx on"? I always 
> > > > can't figure out why veth can work in dpdk data path when tx 
> > > > offload features are on, it looks like you're fixing this big issue, right?
> > > 
> > > If you have tso enabled with dpdk, then veth works with vethX tx on 
> > > (which is the default setting for veth). Otherwise TSO is not 
> > > enabled and then you need to turn tx offloading off.
> > > 
> > > You said "can work in dpdk data path when tx ... are on", so I think 
> > > it's okay? Not sure though.
> > > 
> > > > For tap interface, it can't support TSO, do you Redhat guys have 
> > > > plan to enable it on kernel side.
> > > 
> > > Yeah, currently it only works in one direction (OvS -> kernel). I am 
> > > looking into this now.
> > 
> > With TSO enabled on the TAP device:
> > 
> > Traffic Direction      TSO disabled         TSO enabled  
> > VM->tap                2.98 Gbits/sec       22.9 Gbits/sec
> > tap->VM                2.29 Gbits/sec       18.0 Gbits/sec
> > 
> > The code is in my github branch:
> > https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
> > 
> > commit 884371df3bf3df836d4c2ab2d62b420339691fe8
> > Author: Flavio Leitner <fbl@sysclose.org>
> > Date:   Tue Feb 4 11:18:49 2020 -0300
> > 
> >     netdev-linux: Enable TSO in the TAP device.
> >     
> >     Use ioctl TUNSETOFFLOAD if kernel supports to enable TSO
> >     offloading in the tap device.
> >     
> >     Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> >     Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > 
> > Please try it out and let me know if it works for you as well.
> > Thanks
> > fbl
> > 
> > 
> > > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > 
> > 
> > 
> > > See below my iperf3 results as a reference:
> > > 
> > > Traffic Direction      TSO disabled         TSO enabled  
> > > VM->tap                2.98 Gbits/sec       22.7 Gbits/sec
> > > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > > tap->VM                2.29 Gbits/sec       2.19 Gbits/sec
> > > 
> > > fbl
> > >  
> > > > -----邮件原件-----
> > > > 发件人: Flavio Leitner [mailto:fbl@sysclose.org]
> > > > 发送时间: 2020年2月4日 5:46
> > > > 收件人: dev@openvswitch.org
> > > > 抄送: Stokes Ian <ian.stokes@intel.com>; Loftus Ciara 
> > > > <ciara.loftus@intel.com>; Ilya Maximets <i.maximets@ovn.org>; Yi 
> > > > Yang (杨
> > > > ?D)-云服务集团 <yangyi01@inspur.com>; txfh2007 <txfh2007@aliyun.com>; 
> > > > Ben Pfaff <blp@ovn.org>; Flavio Leitner <fbl@sysclose.org>
> > > > 主题: [PATCH v2] netdev-linux: Prepend the std packet in the TSO 
> > > > packet
> > > > 
> > > > Usually TSO packets are close to 50k, 60k bytes long, so to to 
> > > > copy less bytes when receiving a packet from the kernel change the 
> > > > approach. Instead of extending the MTU sized packet received and 
> > > > append with remaining TSO data from the TSO buffer, allocate a TSO 
> > > > packet with enough headroom to prepend the std packet data.
> > > > 
> > > > Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload
> > > > support")
> > > > Suggested-by: Ben Pfaff <blp@ovn.org>
> > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > > > ---
> > > >  lib/dp-packet.c            |   8 +--
> > > >  lib/dp-packet.h            |   2 +
> > > >  lib/netdev-linux-private.h |   3 +-
> > > >  lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
> > > >  4 files changed, 78 insertions(+), 52 deletions(-)
> > > > 
> > > > V2:
> > > >   - tso packets tailroom depends on headroom in netdev_linux_rxq_recv()
> > > >   - iov_len uses packet's tailroom.
> > > > 
> > > >   This patch depends on a previous posted patch to work:
> > > >   Subject: netdev-linux-private: fix max length to be 16 bits
> > > >   
> > > > https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.
> > > > html
> > > > 
> > > >   With both patches applied, I can run iperf3 and scp on both directions
> > > >   with good performance and no issues.
> > > > 
> > > > diff --git a/lib/dp-packet.c b/lib/dp-packet.c index
> > > > 8dfedcb7c..cd2623500
> > > > 100644
> > > > --- a/lib/dp-packet.c
> > > > +++ b/lib/dp-packet.c
> > > > @@ -243,8 +243,8 @@ dp_packet_copy__(struct dp_packet *b, uint8_t 
> > > > *new_base,
> > > >  
> > > >  /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
> > > >   * bytes of headroom and tailroom, respectively. */ -static void 
> > > > -dp_packet_resize__(struct dp_packet *b, size_t new_headroom, 
> > > > size_t
> > > > new_tailroom)
> > > > +void
> > > > +dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t
> > > > +new_tailroom)
> > > >  {
> > > >      void *new_base, *new_data;
> > > >      size_t new_allocated;
> > > > @@ -297,7 +297,7 @@ void
> > > >  dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)  {
> > > >      if (size > dp_packet_tailroom(b)) {
> > > > -        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
> > > > +        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 
> > > > + 64));
> > > >      }
> > > >  }
> > > >  
> > > > @@ -308,7 +308,7 @@ void
> > > >  dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)  {
> > > >      if (size > dp_packet_headroom(b)) {
> > > > -        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
> > > > +        dp_packet_resize(b, MAX(size, 64), 
> > > > + dp_packet_tailroom(b));
> > > >      }
> > > >  }
> > > >  
> > > > diff --git a/lib/dp-packet.h b/lib/dp-packet.h index
> > > > 69ae5dfac..9a9d35183
> > > > 100644
> > > > --- a/lib/dp-packet.h
> > > > +++ b/lib/dp-packet.h
> > > > @@ -152,6 +152,8 @@ struct dp_packet 
> > > > *dp_packet_clone_with_headroom(const
> > > > struct dp_packet *,  struct dp_packet *dp_packet_clone_data(const 
> > > > void *, size_t);  struct dp_packet 
> > > > *dp_packet_clone_data_with_headroom(const void *, size_t,
> > > >                                                       size_t 
> > > > headroom);
> > > > +void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
> > > > +                      size_t new_tailroom);
> > > >  static inline void dp_packet_delete(struct dp_packet *);
> > > >  
> > > >  static inline void *dp_packet_at(const struct dp_packet *, size_t 
> > > > offset, diff --git a/lib/netdev-linux-private.h 
> > > > b/lib/netdev-linux-private.h index
> > > > be2d7b10b..c7c515f70 100644
> > > > --- a/lib/netdev-linux-private.h
> > > > +++ b/lib/netdev-linux-private.h
> > > > @@ -45,7 +45,8 @@ struct netdev_rxq_linux {
> > > >      struct netdev_rxq up;
> > > >      bool is_tap;
> > > >      int fd;
> > > > -    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers.
> > > > */
> > > > +    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
> > > > +                                                     packets. */
> > > >  };
> > > >  
> > > >  int netdev_linux_construct(struct netdev *); diff --git a/lib/netdev-linux.
> > > > c b/lib/netdev-linux.c index 6add3e2fc..c6f3d2740 100644
> > > > --- a/lib/netdev-linux.c
> > > > +++ b/lib/netdev-linux.c
> > > > @@ -1052,15 +1052,6 @@ static struct netdev_rxq *
> > > >  netdev_linux_rxq_alloc(void)
> > > >  {
> > > >      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > > > -    if (userspace_tso_enabled()) {
> > > > -        int i;
> > > > -
> > > > -        /* Allocate auxiliay buffers to receive TSO packets. */
> > > > -        for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > -            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > > > -        }
> > > > -    }
> > > > -
> > > >      return &rx->up;
> > > >  }
> > > >  
> > > > @@ -1172,7 +1163,7 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
> > > >      }
> > > >  
> > > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > -        free(rx->aux_bufs[i]);
> > > > +        dp_packet_delete(rx->aux_bufs[i]);
> > > >      }
> > > >  }
> > > >  
> > > > @@ -1238,13 +1229,18 @@ netdev_linux_batch_rxq_recv_sock(struct
> > > > netdev_rxq_linux *rx, int mtu,
> > > >          virtio_net_hdr_size = 0;
> > > >      }
> > > >  
> > > > -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> > > > +    /* The length here needs to be accounted in the same way when the
> > > > +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> > > > +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
> > > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > >           buffers[i] = dp_packet_new_with_headroom(std_len,
> > > > DP_NETDEV_HEADROOM);
> > > >           iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
> > > >           iovs[i][IOV_PACKET].iov_len = std_len;
> > > > -         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> > > > -         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> > > > +         if (iovlen == IOV_TSO_SIZE) {
> > > > +             iovs[i][IOV_AUXBUF].iov_base =
> > > > dp_packet_data(rx->aux_bufs[i]);
> > > > +             iovs[i][IOV_AUXBUF].iov_len =
> > > > dp_packet_tailroom(rx->aux_bufs[i]);
> > > > +         }
> > > > +
> > > >           mmsgs[i].msg_hdr.msg_name = NULL;
> > > >           mmsgs[i].msg_hdr.msg_namelen = 0;
> > > >           mmsgs[i].msg_hdr.msg_iov = iovs[i]; @@ -1268,6 +1264,8 
> > > > @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
> > > >      }
> > > >  
> > > >      for (i = 0; i < retval; i++) {
> > > > +        struct dp_packet *pkt;
> > > > +
> > > >          if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
> > > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > > >              struct netdev_linux *netdev = 
> > > > netdev_linux_cast(netdev_); @@
> > > > -1280,29 +1278,29 @@ netdev_linux_batch_rxq_recv_sock(struct
> > > > netdev_rxq_linux *rx, int mtu,
> > > >          }
> > > >  
> > > >          if (mmsgs[i].msg_len > std_len) {
> > > > -            /* Build a single linear TSO packet by expanding the current
> > > > packet
> > > > -             * to append the data received in the aux_buf. */
> > > > -            size_t extra_len = mmsgs[i].msg_len - std_len;
> > > > -
> > > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > > -                               + std_len);
> > > > -            dp_packet_prealloc_tailroom(buffers[i], extra_len);
> > > > -            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
> > > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > > -                               + extra_len);
> > > > -        } else {
> > > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > > -                               + mmsgs[i].msg_len);
> > > > -        }
> > > > +            /* Build a single linear TSO packet by prepending the data from
> > > > +             * std_len buffer to the aux_buf. */
> > > > +            pkt = rx->aux_bufs[i];
> > > > +            dp_packet_set_size(pkt, mmsgs[i].msg_len - std_len);
> > > > +            dp_packet_push(pkt, dp_packet_data(buffers[i]), std_len);
> > > > +            /* The headroom should be the same in buffers[i], pkt and
> > > > +             * DP_NETDEV_HEADROOM. */
> > > > +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> > > > +            dp_packet_delete(buffers[i]);
> > > > +            rx->aux_bufs[i] = NULL;
> > > > +         } else {
> > > > +            dp_packet_set_size(buffers[i], mmsgs[i].msg_len);
> > > > +            pkt = buffers[i];
> > > > +         }
> > > >  
> > > > -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i]))
> > > > {
> > > > +        if (virtio_net_hdr_size &&
> > > > + netdev_linux_parse_vnet_hdr(pkt)) {
> > > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > > >              struct netdev_linux *netdev = 
> > > > netdev_linux_cast(netdev_);
> > > >  
> > > >              /* Unexpected error situation: the virtio header is not present
> > > >               * or corrupted. Drop the packet but continue in case next ones
> > > >               * are correct. */
> > > > -            dp_packet_delete(buffers[i]);
> > > > +            dp_packet_delete(pkt);
> > > >              netdev->rx_dropped += 1;
> > > >              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
> > > >                           netdev_get_name(netdev_));
> > > > @@ -1325,16 +1323,16 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
> > > >                  struct eth_header *eth;
> > > >                  bool double_tagged;
> > > >  
> > > > -                eth = dp_packet_data(buffers[i]);
> > > > +                eth = dp_packet_data(pkt);
> > > >                  double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
> > > >  
> > > > -                eth_push_vlan(buffers[i],
> > > > +                eth_push_vlan(pkt,
> > > >                                auxdata_to_vlan_tpid(aux, double_tagged),
> > > >                                htons(aux->tp_vlan_tci));
> > > >                  break;
> > > >              }
> > > >          }
> > > > -        dp_packet_batch_add(batch, buffers[i]);
> > > > +        dp_packet_batch_add(batch, pkt);
> > > >      }
> > > >  
> > > >      /* Delete unused buffers. */
> > > > @@ -1354,7 +1352,6 @@ static int
> > > >  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> > > >                                  struct dp_packet_batch *batch)
> > > >  {
> > > > -    struct dp_packet *buffer;
> > > >      int virtio_net_hdr_size;
> > > >      ssize_t retval;
> > > >      size_t std_len;
> > > > @@ -1372,16 +1369,22 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> > > >          virtio_net_hdr_size = 0;
> > > >      }
> > > >  
> > > > -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> > > > +    /* The length here needs to be accounted in the same way when the
> > > > +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> > > > +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
> > > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > +        struct dp_packet *buffer;
> > > > +        struct dp_packet *pkt;
> > > >          struct iovec iov[IOV_TSO_SIZE];
> > > >  
> > > >          /* Assume Ethernet port. No need to set packet_type. */
> > > >          buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> > > >          iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
> > > >          iov[IOV_PACKET].iov_len = std_len;
> > > > -        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> > > > -        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> > > > +        if (iovlen == IOV_TSO_SIZE) {
> > > > +            iov[IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> > > > +            iov[IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
> > > > +        }
> > > >  
> > > >          do {
> > > >              retval = readv(rx->fd, iov, iovlen);
> > > > @@ -1393,33 +1396,36 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> > > >          }
> > > >  
> > > >          if (retval > std_len) {
> > > > -            /* Build a single linear TSO packet by expanding the current packet
> > > > -             * to append the data received in the aux_buf. */
> > > > -            size_t extra_len = retval - std_len;
> > > > -
> > > > -            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > > > -            dp_packet_prealloc_tailroom(buffer, extra_len);
> > > > -            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
> > > > -            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > > > +            /* Build a single linear TSO packet by prepending the data from
> > > > +             * std_len buffer to the aux_buf. */
> > > > +            pkt = rx->aux_bufs[i];
> > > > +            dp_packet_set_size(pkt, retval - std_len);
> > > > +            dp_packet_push(pkt, dp_packet_data(buffer), std_len);
> > > > +            /* The headroom should be the same in buffers[i], pkt and
> > > > +             * DP_NETDEV_HEADROOM. */
> > > > +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> > > > +            dp_packet_delete(buffer);
> > > > +            rx->aux_bufs[i] = NULL;
> > > >          } else {
> > > >              dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > > > +            pkt = buffer;
> > > >          }
> > > >  
> > > > -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
> > > > +        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
> > > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > > >              struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > > >  
> > > >              /* Unexpected error situation: the virtio header is not present
> > > >               * or corrupted. Drop the packet but continue in case next ones
> > > >               * are correct. */
> > > > -            dp_packet_delete(buffer);
> > > > +            dp_packet_delete(pkt);
> > > >              netdev->rx_dropped += 1;
> > > >              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
> > > >                           netdev_get_name(netdev_));
> > > >              continue;
> > > >          }
> > > >  
> > > > -        dp_packet_batch_add(batch, buffer);
> > > > +        dp_packet_batch_add(batch, pkt);
> > > >      }
> > > >  
> > > >      if ((i == 0) && (retval < 0)) {
> > > > @@ -1442,6 +1448,23 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> > > >          mtu = ETH_PAYLOAD_MAX;
> > > >      }
> > > >  
> > > > +    if (userspace_tso_enabled()) {
> > > > +        /* Allocate TSO packets. The packet has enough headroom to store
> > > > +         * a full non-TSO packet. When a TSO packet is received, the data
> > > > +         * from non-TSO buffer (std_len) is prepended to the TSO packet
> > > > +         * (aux_buf). */
> > > > +        size_t std_len = sizeof(struct virtio_net_hdr) + VLAN_ETH_HEADER_LEN
> > > > +                         + DP_NETDEV_HEADROOM + mtu;
> > > > +        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - std_len;
> > > > +        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > +            if (rx->aux_bufs[i]) {
> > > > +                continue;
> > > > +            }
> > > > +
> > > > +            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len, std_len);
> > > > +        }
> > > > +    }
> > > > +
> > > >      dp_packet_batch_init(batch);
> > > >      retval = (rx->is_tap
> > > >                ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> > > > --
> > > > 2.24.1
> > > > 
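As a sanity check of the sizing in the netdev_linux_rxq_recv() hunk above
(assuming the usual OVS values of DP_NETDEV_HEADROOM = 128,
VLAN_ETH_HEADER_LEN = 18, sizeof(struct virtio_net_hdr) = 10 and
LINUX_RXQ_TSO_MAX_LEN = 64 kB), an MTU of 1500 gives:

    aux_buf headroom:           10 + 18 + 128 + 1500 = 1656 bytes
    std_len in the recv paths:  10 + 18 + 1500       = 1528 bytes

so after dp_packet_push() copies the std_len bytes in front of the TSO
data, exactly 1656 - 1528 = 128 bytes of headroom remain, which is the
DP_NETDEV_HEADROOM that the dp_packet_resize() call asks for.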
> > > 
> > > 
> > > 
> > > --
> > > fbl
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > 
> > --
> > fbl
> 
> 
> 
> --
> fbl
Yi Yang (杨燚)-云服务集团 Feb. 6, 2020, 1:03 p.m. UTC | #7
Hi, Flavio

What's the difference between https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1 and ovs master? My network is very slow, so maybe you can tell me which commits are new compared to ovs master; I can get those commits and apply them against my local ovs tree, which will be a faster way to do a quick check. I used current ovs master and didn't apply any special patch except your patches.

-----Original Message-----
From: Flavio Leitner [mailto:fbl@sysclose.org]
Sent: February 6, 2020 20:10
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
Subject: Re: Re: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet

On Thu, Feb 06, 2020 at 07:34:35AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, Flavio
> 
> I tried the current ovs master and your two patches (this one and the one
> for tap); tap-to-tap and veth-to-veth performance is very bad when tso is
> on, so I think the merged TSO patch series is wrong for such use cases.
> BTW, the performance data is normal if I turn off tso ("sudo ip netns
> exec nsXXX ethtool -K vethX tso off"). For tap the same problem is there
> (I used your tap patch for this); it is ok when I turn off tso.

Can you try out the code from here instead:
https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1 
 
> here is performance data:
> 
> Connecting to host 10.15.1.3, port 5201
> [  4] local 10.15.1.2 port 36024 connected to 10.15.1.3 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-10.00  sec  7.59 MBytes  6.37 Mbits/sec  2410   7.07 KBytes
> [  4]  10.00-20.00  sec  7.76 MBytes  6.51 Mbits/sec  2496   5.66 KBytes
> [  4]  20.00-30.00  sec  7.99 MBytes  6.70 Mbits/sec  2536   5.66 KBytes
> [  4]  30.00-40.00  sec  7.85 MBytes  6.58 Mbits/sec  2506   5.66 KBytes
> [  4]  40.00-50.00  sec  7.93 MBytes  6.65 Mbits/sec  2556   5.66 KBytes
> [  4]  50.00-60.00  sec  8.25 MBytes  6.92 Mbits/sec  2696   7.07 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-60.00  sec  47.4 MBytes  6.62 Mbits/sec  15200             sender
> [  4]   0.00-60.00  sec  47.3 MBytes  6.61 Mbits/sec                  receiver
> 
> Server output:
> Accepted connection from 10.15.1.2, port 36022
> [  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 36024
> [ ID] Interval           Transfer     Bandwidth
> [  5]   0.00-10.00  sec  7.44 MBytes  6.24 Mbits/sec
> [  5]  10.00-20.00  sec  7.80 MBytes  6.54 Mbits/sec
> [  5]  20.00-30.00  sec  7.95 MBytes  6.67 Mbits/sec
> [  5]  30.00-40.00  sec  7.83 MBytes  6.57 Mbits/sec
> [  5]  40.00-50.00  sec  8.00 MBytes  6.71 Mbits/sec
> [  5]  50.00-60.00  sec  8.23 MBytes  6.91 Mbits/sec

Yeah, that looks like it is the performance of TCP retransmits only.

Here is your script running with the code I provided before/above:
[root@wsfd-netdev93 yiyang]# ./yiyang-test.sh
PING 10.15.1.3 (10.15.1.3) 56(84) bytes of data.
64 bytes from 10.15.1.3: icmp_seq=1 ttl=64 time=0.425 ms
64 bytes from 10.15.1.3: icmp_seq=2 ttl=64 time=0.253 ms
64 bytes from 10.15.1.3: icmp_seq=3 ttl=64 time=0.356 ms

--- 10.15.1.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 44ms
rtt min/avg/max/mdev = 0.253/0.344/0.425/0.073 ms
PING 10.15.1.2 (10.15.1.2) 56(84) bytes of data.
64 bytes from 10.15.1.2: icmp_seq=1 ttl=64 time=0.302 ms
64 bytes from 10.15.1.2: icmp_seq=2 ttl=64 time=0.475 ms
64 bytes from 10.15.1.2: icmp_seq=3 ttl=64 time=0.383 ms

--- 10.15.1.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 43ms
rtt min/avg/max/mdev = 0.302/0.386/0.475/0.074 ms
Connecting to host 10.15.1.3, port 5201
[  5] local 10.15.1.2 port 59660 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  7.31 GBytes  6.28 Gbits/sec  49510    187 KBytes       
[  5]  10.00-20.00  sec  7.37 GBytes  6.33 Gbits/sec  51415    215 KBytes       
[  5]  20.00-30.00  sec  7.30 GBytes  6.27 Gbits/sec  46197    182 KBytes       
[  5]  30.00-40.00  sec  7.30 GBytes  6.27 Gbits/sec  46344    202 KBytes       
[  5]  40.00-50.00  sec  7.16 GBytes  6.15 Gbits/sec  49123    287 KBytes       
[  5]  50.00-60.00  sec  7.33 GBytes  6.30 Gbits/sec  48734    214 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec  291323             sender
[  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 59658
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 59660
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  7.30 GBytes  6.27 Gbits/sec                  
[  5]  10.00-20.00  sec  7.37 GBytes  6.33 Gbits/sec                  
[  5]  20.00-30.00  sec  7.30 GBytes  6.27 Gbits/sec                  
[  5]  30.00-40.00  sec  7.30 GBytes  6.27 Gbits/sec                  
[  5]  40.00-50.00  sec  7.16 GBytes  6.15 Gbits/sec                  
[  5]  50.00-60.00  sec  7.33 GBytes  6.30 Gbits/sec                  
[  5]  60.00-60.00  sec   127 KBytes  4.09 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec                  receiver


iperf Done.

HTH,
fbl


> 
> 
> iperf Done.
> 
> You can use the below script to check performance (run-iperf3.sh needs to be run with sudo).
> 
> $ cat run-iperf3.sh
> #!/bin/bash
> 
> ovs-vsctl add-br br-int -- set bridge br-int datapath_type=netdev protocols=OpenFlow10,OpenFlow12,OpenFlow13 stp_enable=false
> ip link add veth1 type veth peer name vethbr1
> ip link add veth2 type veth peer name vethbr2
> ip netns add ns01
> ip netns add ns02
> 
> ip link set veth1 netns ns01
> ip link set veth2 netns ns02
> 
> ip netns exec ns01 ifconfig lo 127.0.0.1 up
> ip netns exec ns01 ifconfig veth1 10.15.1.2/24 up
> 
> ip netns exec ns02 ifconfig lo 127.0.0.1 up
> ip netns exec ns02 ifconfig veth2 10.15.1.3/24 up
> 
> ifconfig vethbr1 0 up
> ifconfig vethbr2 0 up
> 
> 
> ovs-vsctl add-port br-int vethbr1
> ovs-vsctl add-port br-int vethbr2
> 
> ip netns exec ns01 ping 10.15.1.3 -c 3
> ip netns exec ns02 ping 10.15.1.2 -c 3
> 
> ip netns exec ns02 iperf3 -s -i 10 -D
> ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output
> 
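For anyone reproducing this: the script assumes userspace TSO support has
already been switched on globally in OVS, which (assuming the 2.13-era
knob) is done with:

ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true

Without that setting, the userspace datapath does not negotiate TSO at all.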
> -----Original Message-----
> From: Flavio Leitner [mailto:fbl@sysclose.org]
> Sent: February 5, 2020 19:46
> To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> Subject: Re: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std
> packet in the TSO packet
> 
> On Wed, Feb 05, 2020 at 12:05:23AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > Thanks Flavio, which kernel version can support TSO for tap device?
> 
> That ioctl was introduced in 2.6.27.
> fbl
> 
> > 
> > -----Original Message-----
> > From: Flavio Leitner [mailto:fbl@sysclose.org]
> > Sent: February 5, 2020 1:12
> > To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> > Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> > Subject: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std
> > packet in the TSO packet
> > 
> > On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> > > On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > > > Hi, Flavio
> > > > 
> > > > With this one patch and previous several merged TSO-related 
> > > > patches, can veth work with "ethtool -K vethX tx on"? I always 
> > > > can't figure out why veth can work in dpdk data path when tx 
> > > > offload features are on, it looks like you're fixing this big issue, right?
> > > 
> > > If you have tso enabled with dpdk, then veth works with vethX tx 
> > > on (which is the default setting for veth). Otherwise TSO is not 
> > > enabled and then you need to turn tx offloading off.
> > > 
> > > You said "can work in dpdk data path when tx ... are on", so I 
> > > think it's okay? Not sure though.
> > > 
> > > > For tap interface, it can't support TSO, do you Redhat guys have 
> > > > plan to enable it on kernel side.
> > > 
> > > Yeah, currently it only works in one direction (OvS -> kernel). I 
> > > am looking into this now.
> > 
> > With TSO enabled on the TAP device:
> > 
> > Traffic Direction      TSO disabled         TSO enabled  
> > VM->tap                2.98 Gbits/sec       22.9 Gbits/sec
> > tap->VM                2.29 Gbits/sec       18.0 Gbits/sec
> > 
> > The code is in my github branch:
> > https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
> > 
> > commit 884371df3bf3df836d4c2ab2d62b420339691fe8
> > Author: Flavio Leitner <fbl@sysclose.org>
> > Date:   Tue Feb 4 11:18:49 2020 -0300
> > 
> >     netdev-linux: Enable TSO in the TAP device.
> >     
> >     Use the TUNSETOFFLOAD ioctl, if the kernel supports it, to enable
> >     TSO offloading in the tap device.
> >     
> >     Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> >     Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > 
> > Please try it out and let me know if it works for you as well.
> > Thanks
> > fbl
> > 
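For reference, the enabling step that commit message describes boils down
to a single ioctl on the tap fd. A minimal sketch, not the patch's actual
code (the helper name is made up, the flag set is illustrative, and error
handling is omitted):

#include <linux/if_tun.h>
#include <sys/ioctl.h>

/* Advertise to the kernel which offloads userspace can handle, so the
 * tap device may pass through unsegmented (TSO) packets; TUN_F_CSUM is
 * a prerequisite for the TSO flags. */
static int
tap_enable_tso(int tap_fd)
{
    unsigned long offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;

    return ioctl(tap_fd, TUNSETOFFLOAD, offload);
}

A kernel without TUNSETOFFLOAD (pre-2.6.27, as noted elsewhere in this
thread) fails the ioctl, which is the "if the kernel supports it" case the
commit message refers to.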
> > 
> > > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > 
> > 
> > 
> > > See below my iperf3 results as a reference:
> > > 
> > > Traffic Direction      TSO disabled         TSO enabled  
> > > VM->tap                2.98 Gbits/sec       22.7 Gbits/sec
> > > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > > tap->VM                2.29 Gbits/sec       2.19 Gbits/sec
> > > 
> > > fbl
> > >  
> > > > -----Original Message-----
> > > > From: Flavio Leitner [mailto:fbl@sysclose.org]
> > > > Sent: February 4, 2020 5:46
> > > > To: dev@openvswitch.org
> > > > Cc: Stokes Ian <ian.stokes@intel.com>; Loftus Ciara
> > > > <ciara.loftus@intel.com>; Ilya Maximets <i.maximets@ovn.org>; Yi
> > > > Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; txfh2007 <txfh2007@aliyun.com>;
> > > > Ben Pfaff <blp@ovn.org>; Flavio Leitner <fbl@sysclose.org>
> > > > Subject: [PATCH v2] netdev-linux: Prepend the std packet in the TSO
> > > > packet
> > > > 
> > > > Usually TSO packets are close to 50k, 60k bytes long, so to
> > > > copy fewer bytes when receiving a packet from the kernel, change
> > > > the approach. Instead of extending the MTU sized packet received 
> > > > and append with remaining TSO data from the TSO buffer, allocate 
> > > > a TSO packet with enough headroom to prepend the std packet data.
> > > > 
> > > > Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload
> > > > support")
> > > > Suggested-by: Ben Pfaff <blp@ovn.org>
> > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > > > ---
> > > >  lib/dp-packet.c            |   8 +--
> > > >  lib/dp-packet.h            |   2 +
> > > >  lib/netdev-linux-private.h |   3 +-
> > > >  lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
> > > >  4 files changed, 78 insertions(+), 52 deletions(-)
> > > > 
> > > > V2:
> > > >   - tso packets tailroom depends on headroom in netdev_linux_rxq_recv()
> > > >   - iov_len uses packet's tailroom.
> > > > 
> > > >   This patch depends on a previous posted patch to work:
> > > >   Subject: netdev-linux-private: fix max length to be 16 bits
> > > >   
> > > >   https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.html
> > > > 
> > > >   With both patches applied, I can run iperf3 and scp on both directions
> > > >   with good performance and no issues.
> > > > 
> > > > [v2 diff snipped; it is quoted in full earlier in the thread]
> > > > 
> > > 
> > > 
> > > 
> > > --
> > > fbl
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > 
> > --
> > fbl
> 
> 
> 
> --
> fbl



--
fbl
Flavio Leitner Feb. 6, 2020, 1:11 p.m. UTC | #8
On Thu, Feb 06, 2020 at 01:03:32PM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, Flavio
> 
> What's the difference between
> https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1 and ovs
> master? My network is very slow, maybe you can tell me which
> commits are new compared to ovs master, I can get those commits
> and apply them against my local ovs tree, that will be a faster way
> to do a quick check. I used current ovs master, I didn't apply any
> special patch except your patches.

You can browse the list of patches there:

netdev-linux: Enable TSO in the TAP device.
netdev-linux: Prepend the std packet in the TSO packet 
netdev-linux-private: fix max length to be 16 bits 

They apply on top of commit "Prepare for 2.13.0.".

You can always add a git remote and fetch only the changes you need.
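For example (the remote name here is arbitrary, and the repository URL is
inferred from the branch link above):

git remote add fbl https://github.com/fleitner/ovs.git
git fetch fbl tso-tap-enable-tx-v1
git checkout -b tso-tap-enable-tx-v1 fbl/tso-tap-enable-tx-v1

This fetches only that branch instead of making a whole new clone.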

fbl


> > > > >              continue;
> > > > >          }
> > > > >  
> > > > > -        dp_packet_batch_add(batch, buffer);
> > > > > +        dp_packet_batch_add(batch, pkt);
> > > > >      }
> > > > >  
> > > > >      if ((i == 0) && (retval < 0)) { @@ -1442,6 +1448,23 @@ 
> > > > > netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct 
> > > > > dp_packet_batch *batch,
> > > > >          mtu = ETH_PAYLOAD_MAX;
> > > > >      }
> > > > >  
> > > > > +    if (userspace_tso_enabled()) {
> > > > > +        /* Allocate TSO packets. The packet has enough headroom to store
> > > > > +         * a full non-TSO packet. When a TSO packet is received, the data
> > > > > +         * from non-TSO buffer (std_len) is prepended to the TSO packet
> > > > > +         * (aux_buf). */
> > > > > +        size_t std_len = sizeof(struct virtio_net_hdr) +
> > > > > VLAN_ETH_HEADER_LEN
> > > > > +                         + DP_NETDEV_HEADROOM + mtu;
> > > > > +        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - std_len;
> > > > > +        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > > +            if (rx->aux_bufs[i]) {
> > > > > +                continue;
> > > > > +            }
> > > > > +
> > > > > +            rx->aux_bufs[i] =
> > > > > + dp_packet_new_with_headroom(data_len,
> > > > > std_len);
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > >      dp_packet_batch_init(batch);
> > > > >      retval = (rx->is_tap
> > > > >                ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> > > > > --
> > > > > 2.24.1
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > --
> > > > fbl
> > > > _______________________________________________
> > > > dev mailing list
> > > > dev@openvswitch.org
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > > 
> > > --
> > > fbl
> > 
> > 
> > 
> > --
> > fbl
> 
> 
> 
> --
> fbl
Yi Yang (杨燚)-云服务集团 Feb. 6, 2020, 1:45 p.m. UTC | #9
Got it. I didn't apply the third one, "netdev-linux-private: fix max length to be 16 bits"; that should be the reason for the issue. I had thought you had just added a comment in that patch, so I didn't apply it, but it changes 65536 to 65535. My mistake. Thanks a lot.
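
A side note on that fix: a 65536-byte cap cannot be stored in a 16-bit
length field, which appears to be the motivation for the change to 65535.
A minimal standalone C sketch of the wraparound (illustrative code, not
part of OVS):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t full = 65536;              /* one byte past the 16-bit maximum */
    uint16_t len = (uint16_t) full;   /* conversion wraps modulo 2^16 */

    printf("%zu -> %u\n", full, len); /* prints "65536 -> 0" */
    return 0;
}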

-----Original Message-----
From: Flavio Leitner [mailto:fbl@sysclose.org]
Sent: February 6, 2020 21:12
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
Subject: Re: Re: Re: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet

On Thu, Feb 06, 2020 at 01:03:32PM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, Flavio
> 
> What's the difference between
> https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1 and ovs 
> master? My network is very slow; maybe you can tell me which commits
> are new compared to ovs master, so I can get those commits and apply
> them against my local ovs tree. That will be a faster way to do a quick
> check. I used current ovs master and didn't apply any special patch
> except your patches.

You can browse the list of patches there:

netdev-linux: Enable TSO in the TAP device.
netdev-linux: Prepend the std packet in the TSO packet
netdev-linux-private: fix max length to be 16 bits 

On top of: Prepare for 2.13.0. 

You can always add a git remote and fetch only the changes you need.

fbl


> 
> -----Original Message-----
> From: Flavio Leitner [mailto:fbl@sysclose.org]
> Sent: February 6, 2020 20:10
> To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> Subject: Re: Re: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std
> packet in the TSO packet
> 
> On Thu, Feb 06, 2020 at 07:34:35AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > Hi, Flavio
> > 
> > I tried current ovs master plus your two patches (this one and the
> > one for tap); tap-to-tap and veth-to-veth performance is very bad
> > when TSO is on, so I think the merged TSO patch series is wrong for
> > such use cases. BTW, the performance data is normal if I turn off TSO
> > with "sudo ip netns exec nsXXX ethtool -K vethX tso off". For tap, the
> > same problem is there (I used your tap patch for this); it is OK when
> > I turn off TSO.
> 
> Can you try out the code from here instead:
> https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
>  
> > here is performance data:
> > 
> > Connecting to host 10.15.1.3, port 5201
> > [  4] local 10.15.1.2 port 36024 connected to 10.15.1.3 port 5201
> > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > [  4]   0.00-10.00  sec  7.59 MBytes  6.37 Mbits/sec  2410   7.07 KBytes
> > [  4]  10.00-20.00  sec  7.76 MBytes  6.51 Mbits/sec  2496   5.66 KBytes
> > [  4]  20.00-30.00  sec  7.99 MBytes  6.70 Mbits/sec  2536   5.66 KBytes
> > [  4]  30.00-40.00  sec  7.85 MBytes  6.58 Mbits/sec  2506   5.66 KBytes
> > [  4]  40.00-50.00  sec  7.93 MBytes  6.65 Mbits/sec  2556   5.66 KBytes
> > [  4]  50.00-60.00  sec  8.25 MBytes  6.92 Mbits/sec  2696   7.07 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bandwidth       Retr
> > [  4]   0.00-60.00  sec  47.4 MBytes  6.62 Mbits/sec  15200             sender
> > [  4]   0.00-60.00  sec  47.3 MBytes  6.61 Mbits/sec                  receiver
> > 
> > Server output:
> > Accepted connection from 10.15.1.2, port 36022
> > [  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 36024
> > [ ID] Interval           Transfer     Bandwidth
> > [  5]   0.00-10.00  sec  7.44 MBytes  6.24 Mbits/sec
> > [  5]  10.00-20.00  sec  7.80 MBytes  6.54 Mbits/sec
> > [  5]  20.00-30.00  sec  7.95 MBytes  6.67 Mbits/sec
> > [  5]  30.00-40.00  sec  7.83 MBytes  6.57 Mbits/sec
> > [  5]  40.00-50.00  sec  8.00 MBytes  6.71 Mbits/sec
> > [  5]  50.00-60.00  sec  8.23 MBytes  6.91 Mbits/sec
> 
> Yeah, that looks to be the performance only for TCP retransmits.
> 
> Here is your script running with the code I provided before/above:
> [root@wsfd-netdev93 yiyang]# ./yiyang-test.sh
> PING 10.15.1.3 (10.15.1.3) 56(84) bytes of data.
> 64 bytes from 10.15.1.3: icmp_seq=1 ttl=64 time=0.425 ms
> 64 bytes from 10.15.1.3: icmp_seq=2 ttl=64 time=0.253 ms
> 64 bytes from 10.15.1.3: icmp_seq=3 ttl=64 time=0.356 ms
> 
> --- 10.15.1.3 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 44ms
> rtt min/avg/max/mdev = 0.253/0.344/0.425/0.073 ms
> PING 10.15.1.2 (10.15.1.2) 56(84) bytes of data.
> 64 bytes from 10.15.1.2: icmp_seq=1 ttl=64 time=0.302 ms
> 64 bytes from 10.15.1.2: icmp_seq=2 ttl=64 time=0.475 ms
> 64 bytes from 10.15.1.2: icmp_seq=3 ttl=64 time=0.383 ms
> 
> --- 10.15.1.2 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 43ms
> rtt min/avg/max/mdev = 0.302/0.386/0.475/0.074 ms
> Connecting to host 10.15.1.3, port 5201
> [  5] local 10.15.1.2 port 59660 connected to 10.15.1.3 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-10.00  sec  7.31 GBytes  6.28 Gbits/sec  49510    187 KBytes       
> [  5]  10.00-20.00  sec  7.37 GBytes  6.33 Gbits/sec  51415    215 KBytes       
> [  5]  20.00-30.00  sec  7.30 GBytes  6.27 Gbits/sec  46197    182 KBytes       
> [  5]  30.00-40.00  sec  7.30 GBytes  6.27 Gbits/sec  46344    202 KBytes       
> [  5]  40.00-50.00  sec  7.16 GBytes  6.15 Gbits/sec  49123    287 KBytes       
> [  5]  50.00-60.00  sec  7.33 GBytes  6.30 Gbits/sec  48734    214 KBytes       
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec  291323             sender
> [  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec                  receiver
> 
> Server output:
> Accepted connection from 10.15.1.2, port 59658
> [  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 59660
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  7.30 GBytes  6.27 Gbits/sec                  
> [  5]  10.00-20.00  sec  7.37 GBytes  6.33 Gbits/sec                  
> [  5]  20.00-30.00  sec  7.30 GBytes  6.27 Gbits/sec                  
> [  5]  30.00-40.00  sec  7.30 GBytes  6.27 Gbits/sec                  
> [  5]  40.00-50.00  sec  7.16 GBytes  6.15 Gbits/sec                  
> [  5]  50.00-60.00  sec  7.33 GBytes  6.30 Gbits/sec                  
> [  5]  60.00-60.00  sec   127 KBytes  4.09 Gbits/sec                  
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-60.00  sec  43.8 GBytes  6.27 Gbits/sec                  receiver
> 
> 
> iperf Done.
> 
> HTH,
> fbl
> 
> 
> > 
> > 
> > iperf Done.
> > 
> > You can use the script below to check performance (run it as root: sudo ./run-iperf3.sh).
> > 
> > $ cat run-iperf3.sh
> > #!/bin/bash
> > 
> > ovs-vsctl add-br br-int -- set bridge br-int datapath_type=netdev protocols=OpenFlow10,OpenFlow12,OpenFlow13 stp_enable=false
> > ip link add veth1 type veth peer name vethbr1
> > ip link add veth2 type veth peer name vethbr2
> > ip netns add ns01
> > ip netns add ns02
> > 
> > ip link set veth1 netns ns01
> > ip link set veth2 netns ns02
> > 
> > ip netns exec ns01 ifconfig lo 127.0.0.1 up
> > ip netns exec ns01 ifconfig veth1 10.15.1.2/24 up
> > 
> > ip netns exec ns02 ifconfig lo 127.0.0.1 up
> > ip netns exec ns02 ifconfig veth2 10.15.1.3/24 up
> > 
> > ifconfig vethbr1 0 up
> > ifconfig vethbr2 0 up
> > 
> > 
> > ovs-vsctl add-port br-int vethbr1
> > ovs-vsctl add-port br-int vethbr2
> > 
> > ip netns exec ns01 ping 10.15.1.3 -c 3
> > ip netns exec ns02 ping 10.15.1.2 -c 3
> > 
> > ip netns exec ns02 iperf3 -s -i 10 -D
> > ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output
> > 
> > -----Original Message-----
> > From: Flavio Leitner [mailto:fbl@sysclose.org]
> > Sent: February 5, 2020 19:46
> > To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> > Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> > Subject: Re: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std
> > packet in the TSO packet
> > 
> > On Wed, Feb 05, 2020 at 12:05:23AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > > Thanks Flavio, which kernel version supports TSO for the tap device?
> > 
> > That ioctl was introduced in 2.6.27.
> > fbl
> > 
> > > 
> > > -----Original Message-----
> > > From: Flavio Leitner [mailto:fbl@sysclose.org]
> > > Sent: February 5, 2020 1:12
> > > To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> > > Cc: dev@openvswitch.org; i.maximets@ovn.org; txfh2007@aliyun.com
> > > Subject: Re: [ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std
> > > packet in the TSO packet
> > > 
> > > On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> > > > On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > > > > Hi, Flavio
> > > > > 
> > > > > With this one patch and previous several merged TSO-related 
> > > > > patches, can veth work with "ethtool -K vethX tx on"? I always 
> > > > > can't figure out why veth can work in dpdk data path when tx 
> > > > > offload features are on, it looks like you're fixing this big issue, right?
> > > > 
> > > > If you have tso enabled with dpdk, then veth works with vethX tx 
> > > > on (which is the default setting for veth). Otherwise TSO is not 
> > > > enabled and then you need to turn tx offloading off.
> > > > 
> > > > You said "can work in dpdk data path when tx ... are on", so I 
> > > > think it's okay? Not sure though.
> > > > 
> > > > > For tap interface, it can't support TSO, do you Redhat guys 
> > > > > have plan to enable it on kernel side.
> > > > 
> > > > Yeah, currently it only works in one direction (OvS -> kernel). 
> > > > I am looking into this now.
> > > 
> > > With TSO enabled on the TAP device:
> > > 
> > > Traffic Direction      TSO disabled         TSO enabled  
> > > VM->tap                2.98 Gbits/sec       22.9 Gbits/sec
> > > tap->VM                2.29 Gbits/sec       18.0 Gbits/sec
> > > 
> > > The code is in my github branch:
> > > https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
> > > 
> > > commit 884371df3bf3df836d4c2ab2d62b420339691fe8
> > > Author: Flavio Leitner <fbl@sysclose.org>
> > > Date:   Tue Feb 4 11:18:49 2020 -0300
> > > 
> > >     netdev-linux: Enable TSO in the TAP device.
> > >     
> > >     Use the TUNSETOFFLOAD ioctl, if the kernel supports it, to enable
> > >     TSO offloading in the tap device.
> > >     
> > >     Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> > >     Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > > 
> > > Please try it out and let me know if it works for you as well.
> > > Thanks
> > > fbl
> > > 
> > > 
> > > > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > > > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > > 
> > > 
> > > 
> > > > See below my iperf3 results as a reference:
> > > > 
> > > > Traffic Direction      TSO disabled         TSO enabled  
> > > > VM->tap                2.98 Gbits/sec       22.7 Gbits/sec
> > > > VM->veth               2.96 Gbits/sec       22.6 Gbits/sec
> > > > veth->VM               2.30 Gbits/sec       9.58 Gbits/sec
> > > > tap->VM                2.29 Gbits/sec       2.19 Gbits/sec
> > > > 
> > > > fbl
> > > >  
> > > > > -----Original Message-----
> > > > > From: Flavio Leitner [mailto:fbl@sysclose.org]
> > > > > Sent: February 4, 2020 5:46
> > > > > To: dev@openvswitch.org
> > > > > Cc: Stokes Ian <ian.stokes@intel.com>; Loftus Ciara
> > > > > <ciara.loftus@intel.com>; Ilya Maximets <i.maximets@ovn.org>;
> > > > > Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; txfh2007
> > > > > <txfh2007@aliyun.com>; Ben Pfaff <blp@ovn.org>; Flavio Leitner
> > > > > <fbl@sysclose.org>
> > > > > Subject: [PATCH v2] netdev-linux: Prepend the std packet in the TSO
> > > > > packet
> > > > > 
> > > > > Usually TSO packets are close to 50k, 60k bytes long, so, to copy
> > > > > fewer bytes when receiving a packet from the kernel, change the
> > > > > approach. Instead of extending the MTU sized packet received and
> > > > > appending the remaining TSO data from the TSO buffer, allocate a
> > > > > TSO packet with enough headroom to prepend the std packet data.
> > > > > 
> > > > > Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload
> > > > > support")
> > > > > Suggested-by: Ben Pfaff <blp@ovn.org>
> > > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > > > > ---
> > > > >  lib/dp-packet.c            |   8 +--
> > > > >  lib/dp-packet.h            |   2 +
> > > > >  lib/netdev-linux-private.h |   3 +-
> > > > >  lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
> > > > >  4 files changed, 78 insertions(+), 52 deletions(-)
> > > > > 
> > > > > V2:
> > > > >   - tso packets tailroom depends on headroom in netdev_linux_rxq_recv()
> > > > >   - iov_len uses packet's tailroom.
> > > > > 
> > > > >   This patch depends on a previous posted patch to work:
> > > > >   Subject: netdev-linux-private: fix max length to be 16 bits
> > > > >   
> > > > > https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.
> > > > > html
> > > > > 
> > > > >   With both patches applied, I can run iperf3 and scp on both directions
> > > > >   with good performance and no issues.
> > > > > 
> > > > > diff --git a/lib/dp-packet.c b/lib/dp-packet.c index
> > > > > 8dfedcb7c..cd2623500
> > > > > 100644
> > > > > --- a/lib/dp-packet.c
> > > > > +++ b/lib/dp-packet.c
> > > > > @@ -243,8 +243,8 @@ dp_packet_copy__(struct dp_packet *b, 
> > > > > uint8_t *new_base,
> > > > >  
> > > > >  /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
> > > > >   * bytes of headroom and tailroom, respectively. */ -static 
> > > > > void -dp_packet_resize__(struct dp_packet *b, size_t 
> > > > > new_headroom, size_t
> > > > > new_tailroom)
> > > > > +void
> > > > > +dp_packet_resize(struct dp_packet *b, size_t new_headroom, 
> > > > > +size_t
> > > > > +new_tailroom)
> > > > >  {
> > > > >      void *new_base, *new_data;
> > > > >      size_t new_allocated;
> > > > > @@ -297,7 +297,7 @@ void
> > > > >  dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)  {
> > > > >      if (size > dp_packet_tailroom(b)) {
> > > > > -        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
> > > > > +        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 
> > > > > + 64));
> > > > >      }
> > > > >  }
> > > > >  
> > > > > @@ -308,7 +308,7 @@ void
> > > > >  dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)  {
> > > > >      if (size > dp_packet_headroom(b)) {
> > > > > -        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
> > > > > +        dp_packet_resize(b, MAX(size, 64), 
> > > > > + dp_packet_tailroom(b));
> > > > >      }
> > > > >  }
> > > > >  
> > > > > diff --git a/lib/dp-packet.h b/lib/dp-packet.h index
> > > > > 69ae5dfac..9a9d35183
> > > > > 100644
> > > > > --- a/lib/dp-packet.h
> > > > > +++ b/lib/dp-packet.h
> > > > > @@ -152,6 +152,8 @@ struct dp_packet 
> > > > > *dp_packet_clone_with_headroom(const
> > > > > struct dp_packet *,  struct dp_packet 
> > > > > *dp_packet_clone_data(const void *, size_t);  struct dp_packet 
> > > > > *dp_packet_clone_data_with_headroom(const void *, size_t,
> > > > >                                                       size_t 
> > > > > headroom);
> > > > > +void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
> > > > > +                      size_t new_tailroom);
> > > > >  static inline void dp_packet_delete(struct dp_packet *);
> > > > >  
> > > > >  static inline void *dp_packet_at(const struct dp_packet *, 
> > > > > size_t offset, diff --git a/lib/netdev-linux-private.h 
> > > > > b/lib/netdev-linux-private.h index
> > > > > be2d7b10b..c7c515f70 100644
> > > > > --- a/lib/netdev-linux-private.h
> > > > > +++ b/lib/netdev-linux-private.h
> > > > > @@ -45,7 +45,8 @@ struct netdev_rxq_linux {
> > > > >      struct netdev_rxq up;
> > > > >      bool is_tap;
> > > > >      int fd;
> > > > > -    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers.
> > > > > */
> > > > > +    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
> > > > > +                                                     packets. 
> > > > > + */
> > > > >  };
> > > > >  
> > > > >  int netdev_linux_construct(struct netdev *); diff --git a/lib/netdev-linux.
> > > > > c b/lib/netdev-linux.c index 6add3e2fc..c6f3d2740 100644
> > > > > --- a/lib/netdev-linux.c
> > > > > +++ b/lib/netdev-linux.c
> > > > > @@ -1052,15 +1052,6 @@ static struct netdev_rxq *
> > > > >  netdev_linux_rxq_alloc(void)
> > > > >  {
> > > > >      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > > > > -    if (userspace_tso_enabled()) {
> > > > > -        int i;
> > > > > -
> > > > > -        /* Allocate auxiliay buffers to receive TSO packets. */
> > > > > -        for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > > -            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > > > > -        }
> > > > > -    }
> > > > > -
> > > > >      return &rx->up;
> > > > >  }
> > > > >  
> > > > > @@ -1172,7 +1163,7 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
> > > > >      }
> > > > >  
> > > > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > > -        free(rx->aux_bufs[i]);
> > > > > +        dp_packet_delete(rx->aux_bufs[i]);
> > > > >      }
> > > > >  }
> > > > >  
> > > > > @@ -1238,13 +1229,18 @@ 
> > > > > netdev_linux_batch_rxq_recv_sock(struct
> > > > > netdev_rxq_linux *rx, int mtu,
> > > > >          virtio_net_hdr_size = 0;
> > > > >      }
> > > > >  
> > > > > -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> > > > > +    /* The length here needs to be accounted in the same way when the
> > > > > +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> > > > > +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + 
> > > > > + mtu;
> > > > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > >           buffers[i] = dp_packet_new_with_headroom(std_len,
> > > > > DP_NETDEV_HEADROOM);
> > > > >           iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
> > > > >           iovs[i][IOV_PACKET].iov_len = std_len;
> > > > > -         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> > > > > -         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> > > > > +         if (iovlen == IOV_TSO_SIZE) {
> > > > > +             iovs[i][IOV_AUXBUF].iov_base =
> > > > > dp_packet_data(rx->aux_bufs[i]);
> > > > > +             iovs[i][IOV_AUXBUF].iov_len =
> > > > > dp_packet_tailroom(rx->aux_bufs[i]);
> > > > > +         }
> > > > > +
> > > > >           mmsgs[i].msg_hdr.msg_name = NULL;
> > > > >           mmsgs[i].msg_hdr.msg_namelen = 0;
> > > > >           mmsgs[i].msg_hdr.msg_iov = iovs[i]; @@ -1268,6 
> > > > > +1264,8 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
> > > > >      }
> > > > >  
> > > > >      for (i = 0; i < retval; i++) {
> > > > > +        struct dp_packet *pkt;
> > > > > +
> > > > >          if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
> > > > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > > > >              struct netdev_linux *netdev = 
> > > > > netdev_linux_cast(netdev_); @@
> > > > > -1280,29 +1278,29 @@ netdev_linux_batch_rxq_recv_sock(struct
> > > > > netdev_rxq_linux *rx, int mtu,
> > > > >          }
> > > > >  
> > > > >          if (mmsgs[i].msg_len > std_len) {
> > > > > -            /* Build a single linear TSO packet by expanding the current
> > > > > packet
> > > > > -             * to append the data received in the aux_buf. */
> > > > > -            size_t extra_len = mmsgs[i].msg_len - std_len;
> > > > > -
> > > > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > > > -                               + std_len);
> > > > > -            dp_packet_prealloc_tailroom(buffers[i], extra_len);
> > > > > -            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
> > > > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > > > -                               + extra_len);
> > > > > -        } else {
> > > > > -            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> > > > > -                               + mmsgs[i].msg_len);
> > > > > -        }
> > > > > +            /* Build a single linear TSO packet by prepending the data from
> > > > > +             * std_len buffer to the aux_buf. */
> > > > > +            pkt = rx->aux_bufs[i];
> > > > > +            dp_packet_set_size(pkt, mmsgs[i].msg_len - std_len);
> > > > > +            dp_packet_push(pkt, dp_packet_data(buffers[i]), std_len);
> > > > > +            /* The headroom should be the same in buffers[i], pkt and
> > > > > +             * DP_NETDEV_HEADROOM. */
> > > > > +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> > > > > +            dp_packet_delete(buffers[i]);
> > > > > +            rx->aux_bufs[i] = NULL;
> > > > > +         } else {
> > > > > +            dp_packet_set_size(buffers[i], mmsgs[i].msg_len);
> > > > > +            pkt = buffers[i];
> > > > > +         }
> > > > >  
> > > > > -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i]))
> > > > > {
> > > > > +        if (virtio_net_hdr_size &&
> > > > > + netdev_linux_parse_vnet_hdr(pkt)) {
> > > > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > > > >              struct netdev_linux *netdev = 
> > > > > netdev_linux_cast(netdev_);
> > > > >  
> > > > >              /* Unexpected error situation: the virtio header is not present
> > > > >               * or corrupted. Drop the packet but continue in case next ones
> > > > >               * are correct. */
> > > > > -            dp_packet_delete(buffers[i]);
> > > > > +            dp_packet_delete(pkt);
> > > > >              netdev->rx_dropped += 1;
> > > > >              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid 
> > > > > virtio net header",
> > > > >                           netdev_get_name(netdev_)); @@ 
> > > > > -1325,16
> > > > > +1323,16 @@ netdev_linux_batch_rxq_recv_sock(struct
> > > > > +netdev_rxq_linux *rx, int mtu,
> > > > >                  struct eth_header *eth;
> > > > >                  bool double_tagged;
> > > > >  
> > > > > -                eth = dp_packet_data(buffers[i]);
> > > > > +                eth = dp_packet_data(pkt);
> > > > >                  double_tagged = eth->eth_type == 
> > > > > htons(ETH_TYPE_VLAN_8021Q);
> > > > >  
> > > > > -                eth_push_vlan(buffers[i],
> > > > > +                eth_push_vlan(pkt,
> > > > >                                auxdata_to_vlan_tpid(aux, double_tagged),
> > > > >                                htons(aux->tp_vlan_tci));
> > > > >                  break;
> > > > >              }
> > > > >          }
> > > > > -        dp_packet_batch_add(batch, buffers[i]);
> > > > > +        dp_packet_batch_add(batch, pkt);
> > > > >      }
> > > > >  
> > > > >      /* Delete unused buffers. */ @@ -1354,7 +1352,6 @@ static 
> > > > > int  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux 
> > > > > *rx, int mtu,
> > > > >                                  struct dp_packet_batch *batch)  {
> > > > > -    struct dp_packet *buffer;
> > > > >      int virtio_net_hdr_size;
> > > > >      ssize_t retval;
> > > > >      size_t std_len;
> > > > > @@ -1372,16 +1369,22 @@ netdev_linux_batch_rxq_recv_tap(struct
> > > > > netdev_rxq_linux *rx, int mtu,
> > > > >          virtio_net_hdr_size = 0;
> > > > >      }
> > > > >  
> > > > > -    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> > > > > +    /* The length here needs to be accounted in the same way when the
> > > > > +     * aux_buf is allocated so that it can be prepended to TSO buffer. */
> > > > > +    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + 
> > > > > + mtu;
> > > > >      for (i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > > +        struct dp_packet *buffer;
> > > > > +        struct dp_packet *pkt;
> > > > >          struct iovec iov[IOV_TSO_SIZE];
> > > > >  
> > > > >          /* Assume Ethernet port. No need to set packet_type. */
> > > > >          buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> > > > >          iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
> > > > >          iov[IOV_PACKET].iov_len = std_len;
> > > > > -        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> > > > > -        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> > > > > +        if (iovlen == IOV_TSO_SIZE) {
> > > > > +            iov[IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> > > > > +            iov[IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
> > > > > +        }
> > > > >  
> > > > >          do {
> > > > >              retval = readv(rx->fd, iov, iovlen); @@ -1393,33
> > > > > +1396,36 @@ netdev_linux_batch_rxq_recv_tap(struct
> > > > > +netdev_rxq_linux *rx, int mtu,
> > > > >          }
> > > > >  
> > > > >          if (retval > std_len) {
> > > > > -            /* Build a single linear TSO packet by expanding the current
> > > > > packet
> > > > > -             * to append the data received in the aux_buf. */
> > > > > -            size_t extra_len = retval - std_len;
> > > > > -
> > > > > -            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > > > > -            dp_packet_prealloc_tailroom(buffer, extra_len);
> > > > > -            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
> > > > > -            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > > > > +            /* Build a single linear TSO packet by prepending the data from
> > > > > +             * std_len buffer to the aux_buf. */
> > > > > +            pkt = rx->aux_bufs[i];
> > > > > +            dp_packet_set_size(pkt, retval - std_len);
> > > > > +            dp_packet_push(pkt, dp_packet_data(buffer), std_len);
> > > > > +            /* The headroom should be the same in buffers[i], pkt and
> > > > > +             * DP_NETDEV_HEADROOM. */
> > > > > +            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
> > > > > +            dp_packet_delete(buffer);
> > > > > +            rx->aux_bufs[i] = NULL;
> > > > >          } else {
> > > > >              dp_packet_set_size(buffer, dp_packet_size(buffer) 
> > > > > + retval);
> > > > > +            pkt = buffer;
> > > > >          }
> > > > >  
> > > > > -        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
> > > > > +        if (virtio_net_hdr_size &&
> > > > > + netdev_linux_parse_vnet_hdr(pkt)) {
> > > > >              struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> > > > >              struct netdev_linux *netdev = 
> > > > > netdev_linux_cast(netdev_);
> > > > >  
> > > > >              /* Unexpected error situation: the virtio header is not present
> > > > >               * or corrupted. Drop the packet but continue in case next ones
> > > > >               * are correct. */
> > > > > -            dp_packet_delete(buffer);
> > > > > +            dp_packet_delete(pkt);
> > > > >              netdev->rx_dropped += 1;
> > > > >              VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid 
> > > > > virtio net header",
> > > > >                           netdev_get_name(netdev_));
> > > > >              continue;
> > > > >          }
> > > > >  
> > > > > -        dp_packet_batch_add(batch, buffer);
> > > > > +        dp_packet_batch_add(batch, pkt);
> > > > >      }
> > > > >  
> > > > >      if ((i == 0) && (retval < 0)) { @@ -1442,6 +1448,23 @@ 
> > > > > netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct 
> > > > > dp_packet_batch *batch,
> > > > >          mtu = ETH_PAYLOAD_MAX;
> > > > >      }
> > > > >  
> > > > > +    if (userspace_tso_enabled()) {
> > > > > +        /* Allocate TSO packets. The packet has enough headroom to store
> > > > > +         * a full non-TSO packet. When a TSO packet is received, the data
> > > > > +         * from non-TSO buffer (std_len) is prepended to the TSO packet
> > > > > +         * (aux_buf). */
> > > > > +        size_t std_len = sizeof(struct virtio_net_hdr) +
> > > > > VLAN_ETH_HEADER_LEN
> > > > > +                         + DP_NETDEV_HEADROOM + mtu;
> > > > > +        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - std_len;
> > > > > +        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > > +            if (rx->aux_bufs[i]) {
> > > > > +                continue;
> > > > > +            }
> > > > > +
> > > > > +            rx->aux_bufs[i] = 
> > > > > + dp_packet_new_with_headroom(data_len,
> > > > > std_len);
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > >      dp_packet_batch_init(batch);
> > > > >      retval = (rx->is_tap
> > > > >                ? netdev_linux_batch_rxq_recv_tap(rx, mtu, 
> > > > > batch)
> > > > > --
> > > > > 2.24.1
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > --
> > > > fbl
> > > > _______________________________________________
> > > > dev mailing list
> > > > dev@openvswitch.org
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > > 
> > > --
> > > fbl
> > 
> > 
> > 
> > --
> > fbl
> 
> 
> 
> --
> fbl



--
fbl
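
To summarize the receive-path change discussed in this thread: the kernel
fills the MTU-sized buffer first and spills any TSO remainder into the aux
buffer, and the patch then prepends the small head into the aux packet's
headroom instead of copying the large tail. A minimal standalone sketch of
that two-iovec pattern, in plain C with illustrative names rather than the
OVS dp_packet API:

#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* std_buf holds std_len bytes; aux_buf points at the payload area of a
 * buffer that has at least std_len bytes of headroom in front of it,
 * mirroring dp_packet_new_with_headroom(data_len, std_len) above. */
static ssize_t
recv_tso_linear(int fd, char *std_buf, size_t std_len,
                char *aux_buf, size_t aux_room)
{
    struct iovec iov[2] = {
        { .iov_base = std_buf, .iov_len = std_len },
        { .iov_base = aux_buf, .iov_len = aux_room },
    };
    ssize_t n = readv(fd, iov, 2);

    if (n > (ssize_t) std_len) {
        /* TSO frame: prepend the head into the aux buffer's headroom so
         * the packet becomes linear without copying the large tail. */
        memcpy(aux_buf - std_len, std_buf, std_len);
    }
    return n;
}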
Ben Pfaff Feb. 6, 2020, 8:26 p.m. UTC | #10
On Mon, Feb 03, 2020 at 06:45:50PM -0300, Flavio Leitner wrote:
> Usually TSO packets are close to 50k, 60k bytes long, so, to
> copy fewer bytes when receiving a packet from the kernel,
> change the approach. Instead of extending the MTU sized
> packet received and appending the remaining TSO data from
> the TSO buffer, allocate a TSO packet with enough headroom
> to prepend the std packet data.
> 
> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> Suggested-by: Ben Pfaff <blp@ovn.org>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Thanks, applied to master.

Do you want this one backported to 2.13?
Flavio Leitner Feb. 6, 2020, 8:53 p.m. UTC | #11
On Thu, Feb 06, 2020 at 12:26:38PM -0800, Ben Pfaff wrote:
> On Mon, Feb 03, 2020 at 06:45:50PM -0300, Flavio Leitner wrote:
> > Usually TSO packets are close to 50k, 60k bytes long, so, to
> > copy fewer bytes when receiving a packet from the kernel,
> > change the approach. Instead of extending the MTU sized
> > packet received and appending the remaining TSO data from
> > the TSO buffer, allocate a TSO packet with enough headroom
> > to prepend the std packet data.
> > 
> > Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> > Suggested-by: Ben Pfaff <blp@ovn.org>
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> 
> Thanks, applied to master.
> 
> Do you want this one backported to 2.13?

Yes, please.
Thanks!
Ben Pfaff Feb. 6, 2020, 9 p.m. UTC | #12
On Thu, Feb 06, 2020 at 05:53:21PM -0300, Flavio Leitner wrote:
> On Thu, Feb 06, 2020 at 12:26:38PM -0800, Ben Pfaff wrote:
> > On Mon, Feb 03, 2020 at 06:45:50PM -0300, Flavio Leitner wrote:
> > > Usually TSO packets are close to 50k, 60k bytes long, so, to
> > > copy fewer bytes when receiving a packet from the kernel,
> > > change the approach. Instead of extending the MTU sized
> > > packet received and appending the remaining TSO data from
> > > the TSO buffer, allocate a TSO packet with enough headroom
> > > to prepend the std packet data.
> > > 
> > > Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> > > Suggested-by: Ben Pfaff <blp@ovn.org>
> > > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > 
> > Thanks, applied to master.
> > 
> > Do you want this one backported to 2.13?
> 
> Yes, please.
> Thanks!

Done.
William Tu Feb. 20, 2020, 7:39 p.m. UTC | #13
On Tue, Feb 4, 2020 at 9:12 AM Flavio Leitner <fbl@sysclose.org> wrote:
>
> On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> > > On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > > Hi, Flavio
> The code is in my github branch:
> https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
>
> commit 884371df3bf3df836d4c2ab2d62b420339691fe8
> Author: Flavio Leitner <fbl@sysclose.org>
> Date:   Tue Feb 4 11:18:49 2020 -0300
>
>     netdev-linux: Enable TSO in the TAP device.
>
>     Use the TUNSETOFFLOAD ioctl, if the kernel supports it, to enable
>     TSO offloading in the tap device.
>
>     Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
>     Signed-off-by: Flavio Leitner <fbl@sysclose.org>
>
Hi Flavio,
Are you sending out this patch?
I've tested it and it looks good; it improves performance a lot.
William
Flavio Leitner Feb. 20, 2020, 8:04 p.m. UTC | #14
On Thu, Feb 20, 2020 at 11:39:21AM -0800, William Tu wrote:
> On Tue, Feb 4, 2020 at 9:12 AM Flavio Leitner <fbl@sysclose.org> wrote:
> >
> > On Tue, Feb 04, 2020 at 12:00:19PM -0300, Flavio Leitner wrote:
> > > > On Tue, Feb 04, 2020 at 12:51:24AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> > > > Hi, Flavio
> > The code is in my github branch:
> > https://github.com/fleitner/ovs/tree/tso-tap-enable-tx-v1
> >
> > commit 884371df3bf3df836d4c2ab2d62b420339691fe8
> > Author: Flavio Leitner <fbl@sysclose.org>
> > Date:   Tue Feb 4 11:18:49 2020 -0300
> >
> >     netdev-linux: Enable TSO in the TAP device.
> >
> >     Use the TUNSETOFFLOAD ioctl, if the kernel supports it, to enable
> >     TSO offloading in the tap device.
> >
> >     Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> >     Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> >
> Hi Flavio,
> Are you sending out this patch?
> I've tested it and it looks good; it improves performance a lot.

Hi William,

I was waiting for the test feedback, so thanks for providing it.
I will rebase and propose it.
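
For readers following along, a minimal sketch of what enabling TSO on a
tap fd with TUNSETOFFLOAD can look like (the function name is illustrative,
and this is not necessarily the exact code on the branch above):

#include <sys/ioctl.h>
#include <linux/if_tun.h>

static int
tap_enable_tso(int tap_fd)
{
    /* TUN_F_CSUM is a prerequisite for the TSO flags. */
    unsigned long offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;

    /* Fails with errno set (e.g. EINVAL) on kernels without support;
     * the caller can then fall back to plain non-offloaded I/O. */
    return ioctl(tap_fd, TUNSETOFFLOAD, offload);
}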
Patch

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 8dfedcb7c..cd2623500 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -243,8 +243,8 @@  dp_packet_copy__(struct dp_packet *b, uint8_t *new_base,
 
 /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
  * bytes of headroom and tailroom, respectively. */
-static void
-dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
+void
+dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
 {
     void *new_base, *new_data;
     size_t new_allocated;
@@ -297,7 +297,7 @@  void
 dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)
 {
     if (size > dp_packet_tailroom(b)) {
-        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
+        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 64));
     }
 }
 
@@ -308,7 +308,7 @@  void
 dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)
 {
     if (size > dp_packet_headroom(b)) {
-        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
+        dp_packet_resize(b, MAX(size, 64), dp_packet_tailroom(b));
     }
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 69ae5dfac..9a9d35183 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -152,6 +152,8 @@  struct dp_packet *dp_packet_clone_with_headroom(const struct dp_packet *,
 struct dp_packet *dp_packet_clone_data(const void *, size_t);
 struct dp_packet *dp_packet_clone_data_with_headroom(const void *, size_t,
                                                      size_t headroom);
+void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
+                      size_t new_tailroom);
 static inline void dp_packet_delete(struct dp_packet *);
 
 static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index be2d7b10b..c7c515f70 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -45,7 +45,8 @@  struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
     int fd;
-    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
+    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
+                                                     packets. */
 };
 
 int netdev_linux_construct(struct netdev *);
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6add3e2fc..c6f3d2740 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1052,15 +1052,6 @@  static struct netdev_rxq *
 netdev_linux_rxq_alloc(void)
 {
     struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
-    if (userspace_tso_enabled()) {
-        int i;
-
-        /* Allocate auxiliay buffers to receive TSO packets. */
-        for (i = 0; i < NETDEV_MAX_BURST; i++) {
-            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
-        }
-    }
-
     return &rx->up;
 }
 
@@ -1172,7 +1163,7 @@  netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     }
 
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
-        free(rx->aux_bufs[i]);
+        dp_packet_delete(rx->aux_bufs[i]);
     }
 }
 
@@ -1238,13 +1229,18 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
         virtio_net_hdr_size = 0;
     }
 
-    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
+    /* The length here needs to be accounted in the same way when the
+     * aux_buf is allocated so that it can be prepended to TSO buffer. */
+    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
          buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
          iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
          iovs[i][IOV_PACKET].iov_len = std_len;
-         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
-         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
+         if (iovlen == IOV_TSO_SIZE) {
+             iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
+             iovs[i][IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
+         }
+
          mmsgs[i].msg_hdr.msg_name = NULL;
          mmsgs[i].msg_hdr.msg_namelen = 0;
          mmsgs[i].msg_hdr.msg_iov = iovs[i];
@@ -1268,6 +1264,8 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
     }
 
     for (i = 0; i < retval; i++) {
+        struct dp_packet *pkt;
+
         if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
             struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
             struct netdev_linux *netdev = netdev_linux_cast(netdev_);
@@ -1280,29 +1278,29 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
         }
 
         if (mmsgs[i].msg_len > std_len) {
-            /* Build a single linear TSO packet by expanding the current packet
-             * to append the data received in the aux_buf. */
-            size_t extra_len = mmsgs[i].msg_len - std_len;
-
-            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
-                               + std_len);
-            dp_packet_prealloc_tailroom(buffers[i], extra_len);
-            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
-            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
-                               + extra_len);
-        } else {
-            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
-                               + mmsgs[i].msg_len);
-        }
+            /* Build a single linear TSO packet by prepending the data from
+             * std_len buffer to the aux_buf. */
+            pkt = rx->aux_bufs[i];
+            dp_packet_set_size(pkt, mmsgs[i].msg_len - std_len);
+            dp_packet_push(pkt, dp_packet_data(buffers[i]), std_len);
+            /* The headroom should be the same in buffers[i], pkt and
+             * DP_NETDEV_HEADROOM. */
+            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
+            dp_packet_delete(buffers[i]);
+            rx->aux_bufs[i] = NULL;
+         } else {
+            dp_packet_set_size(buffers[i], mmsgs[i].msg_len);
+            pkt = buffers[i];
+         }
 
-        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) {
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
             struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
             struct netdev_linux *netdev = netdev_linux_cast(netdev_);
 
             /* Unexpected error situation: the virtio header is not present
              * or corrupted. Drop the packet but continue in case next ones
              * are correct. */
-            dp_packet_delete(buffers[i]);
+            dp_packet_delete(pkt);
             netdev->rx_dropped += 1;
             VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
                          netdev_get_name(netdev_));
@@ -1325,16 +1323,16 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
                 struct eth_header *eth;
                 bool double_tagged;
 
-                eth = dp_packet_data(buffers[i]);
+                eth = dp_packet_data(pkt);
                 double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
 
-                eth_push_vlan(buffers[i],
+                eth_push_vlan(pkt,
                               auxdata_to_vlan_tpid(aux, double_tagged),
                               htons(aux->tp_vlan_tci));
                 break;
             }
         }
-        dp_packet_batch_add(batch, buffers[i]);
+        dp_packet_batch_add(batch, pkt);
     }
 
     /* Delete unused buffers. */
@@ -1354,7 +1352,6 @@  static int
 netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
                                 struct dp_packet_batch *batch)
 {
-    struct dp_packet *buffer;
     int virtio_net_hdr_size;
     ssize_t retval;
     size_t std_len;
@@ -1372,16 +1369,22 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
         virtio_net_hdr_size = 0;
     }
 
-    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
+    /* The length here needs to be accounted in the same way when the
+     * aux_buf is allocated so that it can be prepended to TSO buffer. */
+    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
+        struct dp_packet *buffer;
+        struct dp_packet *pkt;
         struct iovec iov[IOV_TSO_SIZE];
 
         /* Assume Ethernet port. No need to set packet_type. */
         buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
         iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
         iov[IOV_PACKET].iov_len = std_len;
-        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
-        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
+        if (iovlen == IOV_TSO_SIZE) {
+            iov[IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
+            iov[IOV_AUXBUF].iov_len = dp_packet_tailroom(rx->aux_bufs[i]);
+        }
 
         do {
             retval = readv(rx->fd, iov, iovlen);
@@ -1393,33 +1396,36 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
         }
 
         if (retval > std_len) {
-            /* Build a single linear TSO packet by expanding the current packet
-             * to append the data received in the aux_buf. */
-            size_t extra_len = retval - std_len;
-
-            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
-            dp_packet_prealloc_tailroom(buffer, extra_len);
-            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
-            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+            /* Build a single linear TSO packet by prepending the data from
+             * std_len buffer to the aux_buf. */
+            pkt = rx->aux_bufs[i];
+            dp_packet_set_size(pkt, retval - std_len);
+            dp_packet_push(pkt, dp_packet_data(buffer), std_len);
+            /* The headroom should be the same in buffers[i], pkt and
+             * DP_NETDEV_HEADROOM. */
+            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
+            dp_packet_delete(buffer);
+            rx->aux_bufs[i] = NULL;
         } else {
             dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+            pkt = buffer;
         }
 
-        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
             struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
             struct netdev_linux *netdev = netdev_linux_cast(netdev_);
 
             /* Unexpected error situation: the virtio header is not present
              * or corrupted. Drop the packet but continue in case next ones
              * are correct. */
-            dp_packet_delete(buffer);
+            dp_packet_delete(pkt);
             netdev->rx_dropped += 1;
             VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
                          netdev_get_name(netdev_));
             continue;
         }
 
-        dp_packet_batch_add(batch, buffer);
+        dp_packet_batch_add(batch, pkt);
     }
 
     if ((i == 0) && (retval < 0)) {
@@ -1442,6 +1448,23 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         mtu = ETH_PAYLOAD_MAX;
     }
 
+    if (userspace_tso_enabled()) {
+        /* Allocate TSO packets. The packet has enough headroom to store
+         * a full non-TSO packet. When a TSO packet is received, the data
+         * from non-TSO buffer (std_len) is prepended to the TSO packet
+         * (aux_buf). */
+        size_t std_len = sizeof(struct virtio_net_hdr) + VLAN_ETH_HEADER_LEN
+                         + DP_NETDEV_HEADROOM + mtu;
+        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - std_len;
+        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
+            if (rx->aux_bufs[i]) {
+                continue;
+            }
+
+            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len, std_len);
+        }
+    }
+
     dp_packet_batch_init(batch);
     retval = (rx->is_tap
               ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
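
A worked example of the sizing in the final hunk above, assuming
illustrative values of mtu = 1500, DP_NETDEV_HEADROOM = 128, a 10-byte
struct virtio_net_hdr, VLAN_ETH_HEADER_LEN = 18 and
LINUX_RXQ_TSO_MAX_LEN = 65535 (after the 16-bit fix):

    std_len  = 10 + 18 + 128 + 1500 = 1656   (headroom of each aux packet)
    data_len = 65535 - 1656         = 63879  (tailroom left for TSO data)

The receive-side std_len is virtio_net_hdr_size + VLAN_ETH_HEADER_LEN +
mtu = 10 + 18 + 1500 = 1528 bytes, which is exactly DP_NETDEV_HEADROOM
less than the allocation headroom. So after dp_packet_push() of the head,
precisely DP_NETDEV_HEADROOM bytes appear to remain in front of the
linearized TSO packet, which is why the code can normalize it with
dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0) and note that the headroom
should be the same in buffers[i], pkt and DP_NETDEV_HEADROOM.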