diff mbox series

[ovs-dev] netdev-linux: Prepend the std packet in the TSO packet

Message ID 20200129231511.94186-1-fbl@sysclose.org
State Superseded
Headers show
Series [ovs-dev] netdev-linux: Prepend the std packet in the TSO packet | expand

Commit Message

Flavio Leitner Jan. 29, 2020, 11:15 p.m. UTC
Usually TSO packets are close to 50k, 60k bytes long, so to
to copy less bytes when receiving a packet from the kernel
change the approach. Instead of extending the MTU sized
packet received and append with remaining TSO data from
the TSO buffer, allocate a TSO packet with enough headroom
to prepend the std packet data.

Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 lib/dp-packet.c            |   8 +--
 lib/dp-packet.h            |   2 +
 lib/netdev-linux-private.h |   3 +-
 lib/netdev-linux.c         | 117 ++++++++++++++++++++++---------------
 4 files changed, 78 insertions(+), 52 deletions(-)

Comments

Ben Pfaff Jan. 31, 2020, 4:54 p.m. UTC | #1
On Wed, Jan 29, 2020 at 08:15:11PM -0300, Flavio Leitner wrote:
> Usually TSO packets are close to 50k, 60k bytes long, so to
> to copy less bytes when receiving a packet from the kernel
> change the approach. Instead of extending the MTU sized
> packet received and append with remaining TSO data from
> the TSO buffer, allocate a TSO packet with enough headroom
> to prepend the std packet data.
> 
> Suggested-by: Ben Pfaff <blp@ovn.org>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Did you test this with TSO packets?  I think I see an inconsistency.
netdev_linux_rxq_recv() constructs dp_packets like this with a size of 0
(and a tailroom of data_len):

> +            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len, std_len);

and then later on dp_packet_size() on the aux_bufs should report 0,
which won't work properly:

> +         if (iovlen == IOV_TSO_SIZE) {
> +             iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> +             iovs[i][IOV_AUXBUF].iov_len = dp_packet_size(rx->aux_bufs[i]);
> +         }

I think that the above should use dp_packet_tailroom() instead, and the
inconsistency makes me nervous.

Thanks,

Ben.
Flavio Leitner Jan. 31, 2020, 6:10 p.m. UTC | #2
Hi Ben,

On Fri, Jan 31, 2020 at 08:54:23AM -0800, Ben Pfaff wrote:
> On Wed, Jan 29, 2020 at 08:15:11PM -0300, Flavio Leitner wrote:
> > Usually TSO packets are close to 50k, 60k bytes long, so to
> > to copy less bytes when receiving a packet from the kernel
> > change the approach. Instead of extending the MTU sized
> > packet received and append with remaining TSO data from
> > the TSO buffer, allocate a TSO packet with enough headroom
> > to prepend the std packet data.
> > 
> > Suggested-by: Ben Pfaff <blp@ovn.org>
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> 
> Did you test this with TSO packets?  I think I see an inconsistency.
> netdev_linux_rxq_recv() constructs dp_packets like this with a size of 0
> (and a tailroom of data_len):
> 
> > +            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len, std_len);
> 
> and then later on dp_packet_size() on the aux_bufs should report 0,
> which won't work properly:
> 
> > +         if (iovlen == IOV_TSO_SIZE) {
> > +             iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
> > +             iovs[i][IOV_AUXBUF].iov_len = dp_packet_size(rx->aux_bufs[i]);
> > +         }
> 
> I think that the above should use dp_packet_tailroom() instead, and the
> inconsistency makes me nervous.

You're right. I did test and the short log is below. It sounds like
iperf3 is not the best tool to verify this feature, so I switched to
'scp' and noticed an issue.

I will fix the test script to run all tests (vm-vm, vm-ns, ns-ns...)
with scp to copy data and sha256 to verify it, and then follow up with
a V2 of this patch.

Short log:
ns1 TX side:
# iperf3 -c 192.168.100.32 -t 10
Connecting to host 192.168.100.32, port 5201
[  5] local 192.168.100.31 port 41354 connected to 192.168.100.32 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.01 GBytes  8.67 Gbits/sec  7339    165 KBytes       
[  5]   1.00-2.00   sec  1.04 GBytes  8.91 Gbits/sec  8331    216 KBytes       
[  5]   2.00-3.00   sec  1006 MBytes  8.44 Gbits/sec  5957    291 KBytes       
[  5]   3.00-4.00   sec  1.02 GBytes  8.79 Gbits/sec  5534    245 KBytes       
[  5]   4.00-5.00   sec  1024 MBytes  8.59 Gbits/sec  8400    219 KBytes       
[  5]   5.00-6.00   sec  1.01 GBytes  8.67 Gbits/sec  7792    223 KBytes       
[  5]   6.00-7.00   sec  1.01 GBytes  8.68 Gbits/sec  7860    170 KBytes       
[  5]   7.00-8.00   sec  1.00 GBytes  8.60 Gbits/sec  7687    246 KBytes       
[  5]   8.00-9.00   sec  1.02 GBytes  8.72 Gbits/sec  8168    182 KBytes       
[  5]   9.00-10.00  sec  1.03 GBytes  8.86 Gbits/sec  9886    225 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.1 GBytes  8.69 Gbits/sec  76954             sender
[  5]   0.00-10.00  sec  10.1 GBytes  8.69 Gbits/sec                  receiver


ns2 RX side:
12:50:53.307719 IP 192.168.100.31.41354 > 192.168.100.32.5201: Flags [P.], seq 4235400:4300560, ack 1, win 502, options [nop,nop,TS val 3623839079 ecr 3548400447], length 65160
12:50:53.307743 IP 192.168.100.31.41354 > 192.168.100.32.5201: Flags [P.], seq 4300560:4365720, ack 1, win 502, options [nop,nop,TS val 3623839079 ecr 3548400447], length 65160
12:50:53.307762 IP 192.168.100.31.41354 > 192.168.100.32.5201: Flags [P.], seq 4365720:4419296, ack 1, win 502, options [nop,nop,TS val 3623839079 ecr 3548400447], length 53576
12:50:53.307759 IP 192.168.100.32.5201 > 192.168.100.31.41354: Flags [.], ack 4300560, win 24317, options [nop,nop,TS val 3548400447 ecr 3623839079], length 0
[...]

Thanks for the review, nice catch!
diff mbox series

Patch

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 8dfedcb7c..cd2623500 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -243,8 +243,8 @@  dp_packet_copy__(struct dp_packet *b, uint8_t *new_base,
 
 /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
  * bytes of headroom and tailroom, respectively. */
-static void
-dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
+void
+dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
 {
     void *new_base, *new_data;
     size_t new_allocated;
@@ -297,7 +297,7 @@  void
 dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)
 {
     if (size > dp_packet_tailroom(b)) {
-        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
+        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 64));
     }
 }
 
@@ -308,7 +308,7 @@  void
 dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)
 {
     if (size > dp_packet_headroom(b)) {
-        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
+        dp_packet_resize(b, MAX(size, 64), dp_packet_tailroom(b));
     }
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 69ae5dfac..9a9d35183 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -152,6 +152,8 @@  struct dp_packet *dp_packet_clone_with_headroom(const struct dp_packet *,
 struct dp_packet *dp_packet_clone_data(const void *, size_t);
 struct dp_packet *dp_packet_clone_data_with_headroom(const void *, size_t,
                                                      size_t headroom);
+void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
+                      size_t new_tailroom);
 static inline void dp_packet_delete(struct dp_packet *);
 
 static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index 143616ca8..e5c13fc37 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -44,7 +44,8 @@  struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
     int fd;
-    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
+    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
+                                                     packets. */
 };
 
 int netdev_linux_construct(struct netdev *);
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6add3e2fc..c30a22287 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1052,15 +1052,6 @@  static struct netdev_rxq *
 netdev_linux_rxq_alloc(void)
 {
     struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
-    if (userspace_tso_enabled()) {
-        int i;
-
-        /* Allocate auxiliay buffers to receive TSO packets. */
-        for (i = 0; i < NETDEV_MAX_BURST; i++) {
-            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
-        }
-    }
-
     return &rx->up;
 }
 
@@ -1172,7 +1163,7 @@  netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     }
 
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
-        free(rx->aux_bufs[i]);
+        dp_packet_delete(rx->aux_bufs[i]);
     }
 }
 
@@ -1238,13 +1229,18 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
         virtio_net_hdr_size = 0;
     }
 
-    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
+    /* The length here needs to be accounted in the same way when the
+     * aux_buf is allocated so that it can be prepended to TSO buffer. */
+    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
          buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
          iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
          iovs[i][IOV_PACKET].iov_len = std_len;
-         iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
-         iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
+         if (iovlen == IOV_TSO_SIZE) {
+             iovs[i][IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
+             iovs[i][IOV_AUXBUF].iov_len = dp_packet_size(rx->aux_bufs[i]);
+         }
+
          mmsgs[i].msg_hdr.msg_name = NULL;
          mmsgs[i].msg_hdr.msg_namelen = 0;
          mmsgs[i].msg_hdr.msg_iov = iovs[i];
@@ -1268,6 +1264,8 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
     }
 
     for (i = 0; i < retval; i++) {
+        struct dp_packet *pkt;
+
         if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
             struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
             struct netdev_linux *netdev = netdev_linux_cast(netdev_);
@@ -1280,29 +1278,29 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
         }
 
         if (mmsgs[i].msg_len > std_len) {
-            /* Build a single linear TSO packet by expanding the current packet
-             * to append the data received in the aux_buf. */
-            size_t extra_len = mmsgs[i].msg_len - std_len;
-
-            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
-                               + std_len);
-            dp_packet_prealloc_tailroom(buffers[i], extra_len);
-            memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
-            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
-                               + extra_len);
-        } else {
-            dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
-                               + mmsgs[i].msg_len);
-        }
+            /* Build a single linear TSO packet by prepending the data from
+             * std_len buffer to the aux_buf. */
+            pkt = rx->aux_bufs[i];
+            dp_packet_set_size(pkt, mmsgs[i].msg_len - std_len);
+            dp_packet_push(pkt, dp_packet_data(buffers[i]), std_len);
+            /* The headroom should be the same in buffers[i], pkt and
+             * DP_NETDEV_HEADROOM. */
+            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
+            dp_packet_delete(buffers[i]);
+            rx->aux_bufs[i] = NULL;
+         } else {
+            dp_packet_set_size(buffers[i], mmsgs[i].msg_len);
+            pkt = buffers[i];
+         }
 
-        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) {
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
             struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
             struct netdev_linux *netdev = netdev_linux_cast(netdev_);
 
             /* Unexpected error situation: the virtio header is not present
              * or corrupted. Drop the packet but continue in case next ones
              * are correct. */
-            dp_packet_delete(buffers[i]);
+            dp_packet_delete(pkt);
             netdev->rx_dropped += 1;
             VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
                          netdev_get_name(netdev_));
@@ -1325,16 +1323,16 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
                 struct eth_header *eth;
                 bool double_tagged;
 
-                eth = dp_packet_data(buffers[i]);
+                eth = dp_packet_data(pkt);
                 double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
 
-                eth_push_vlan(buffers[i],
+                eth_push_vlan(pkt,
                               auxdata_to_vlan_tpid(aux, double_tagged),
                               htons(aux->tp_vlan_tci));
                 break;
             }
         }
-        dp_packet_batch_add(batch, buffers[i]);
+        dp_packet_batch_add(batch, pkt);
     }
 
     /* Delete unused buffers. */
@@ -1354,7 +1352,6 @@  static int
 netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
                                 struct dp_packet_batch *batch)
 {
-    struct dp_packet *buffer;
     int virtio_net_hdr_size;
     ssize_t retval;
     size_t std_len;
@@ -1372,16 +1369,22 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
         virtio_net_hdr_size = 0;
     }
 
-    std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
+    /* The length here needs to be accounted in the same way when the
+     * aux_buf is allocated so that it can be prepended to TSO buffer. */
+    std_len = virtio_net_hdr_size + VLAN_ETH_HEADER_LEN + mtu;
     for (i = 0; i < NETDEV_MAX_BURST; i++) {
+        struct dp_packet *buffer;
+        struct dp_packet *pkt;
         struct iovec iov[IOV_TSO_SIZE];
 
         /* Assume Ethernet port. No need to set packet_type. */
         buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
         iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
         iov[IOV_PACKET].iov_len = std_len;
-        iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
-        iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
+        if (iovlen == IOV_TSO_SIZE) {
+            iov[IOV_AUXBUF].iov_base = dp_packet_data(rx->aux_bufs[i]);
+            iov[IOV_AUXBUF].iov_len = dp_packet_size(rx->aux_bufs[i]);
+        }
 
         do {
             retval = readv(rx->fd, iov, iovlen);
@@ -1393,33 +1396,36 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
         }
 
         if (retval > std_len) {
-            /* Build a single linear TSO packet by expanding the current packet
-             * to append the data received in the aux_buf. */
-            size_t extra_len = retval - std_len;
-
-            dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
-            dp_packet_prealloc_tailroom(buffer, extra_len);
-            memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
-            dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
+            /* Build a single linear TSO packet by prepending the data from
+             * std_len buffer to the aux_buf. */
+            pkt = rx->aux_bufs[i];
+            dp_packet_set_size(pkt, retval - std_len);
+            dp_packet_push(pkt, dp_packet_data(buffer), std_len);
+            /* The headroom should be the same in buffers[i], pkt and
+             * DP_NETDEV_HEADROOM. */
+            dp_packet_resize(pkt, DP_NETDEV_HEADROOM, 0);
+            dp_packet_delete(buffer);
+            rx->aux_bufs[i] = NULL;
         } else {
             dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
+            pkt = buffer;
         }
 
-        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(pkt)) {
             struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
             struct netdev_linux *netdev = netdev_linux_cast(netdev_);
 
             /* Unexpected error situation: the virtio header is not present
              * or corrupted. Drop the packet but continue in case next ones
              * are correct. */
-            dp_packet_delete(buffer);
+            dp_packet_delete(pkt);
             netdev->rx_dropped += 1;
             VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
                          netdev_get_name(netdev_));
             continue;
         }
 
-        dp_packet_batch_add(batch, buffer);
+        dp_packet_batch_add(batch, pkt);
     }
 
     if ((i == 0) && (retval < 0)) {
@@ -1442,6 +1448,23 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         mtu = ETH_PAYLOAD_MAX;
     }
 
+    if (userspace_tso_enabled()) {
+        /* Allocate TSO packets. The packet has enough headroom to store
+         * a full non-TSO packet. When a TSO packet is received, the data
+         * from non-TSO buffer (std_len) is prepended to the TSO packet
+         * (aux_buf). */
+        size_t data_len = LINUX_RXQ_TSO_MAX_LEN - mtu;
+        size_t std_len = sizeof(struct virtio_net_hdr) + VLAN_ETH_HEADER_LEN
+                         + DP_NETDEV_HEADROOM + mtu;
+        for (int i = 0; i < NETDEV_MAX_BURST; i++) {
+            if (rx->aux_bufs[i]) {
+                continue;
+            }
+
+            rx->aux_bufs[i] = dp_packet_new_with_headroom(data_len, std_len);
+        }
+    }
+
     dp_packet_batch_init(batch);
     retval = (rx->is_tap
               ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)