[ovs-dev] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

Message ID 1579592987-2358-1-git-send-email-yang_y_yi@163.com
State New
Series
  • [ovs-dev] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

Commit Message

yang_y_yi Jan. 21, 2020, 7:49 a.m. UTC
From: Yi Yang <yangyi01@inspur.com>

We can avoid high system call overhead by using TPACKET_V1/V2/V3
and a DPDK-like poll loop to receive and send packets (note: send
still needs to call sendto() to trigger the final transmission).

With TPACKET_V3 I see about a 30% improvement over the previous
recvmmsg optimization. TPACKET_V1/V2 is slower than TPACKET_V3,
but it still gives an improvement of about 20%.

For veth, throughput is 1.47 Gbps before this patch and about 1.98
Gbps after it is applied. But it is about 4.00 Gbps if we use
af_packet for veth. The bottleneck lies in the ovs-vswitchd thread,
which handles too many things in every loop iteration (as below),
so it cannot work as efficiently as pmd_thread.

        memory_run();
        bridge_run();
        unixctl_server_run(unixctl);
        netdev_run();

        memory_wait();
        bridge_wait();
        unixctl_server_wait(unixctl);
        netdev_wait();
        poll_block();

As a next step, it would be better to let pmd_thread handle tap
and veth interfaces.

Signed-off-by: Yi Yang <yangyi01@inspur.com>
Co-authored-by: William Tu <u9012063@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
---
 acinclude.m4               |  23 +++
 configure.ac               |   1 +
 lib/netdev-linux-private.h |  27 +++
 lib/netdev-linux.c         | 481 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 527 insertions(+), 5 deletions(-)
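
For reference, the ring sizes that tpacket_v1_v2_fill_ring() in this patch
ends up requesting can be worked out by hand. The following is a minimal
sketch of that arithmetic only, assuming 4 KiB pages and a TPACKET_ALIGNMENT
of 16; the ring_geometry name is hypothetical and does not appear in the
patch, and no socket is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: it mirrors the arithmetic in
 * tpacket_v1_v2_fill_ring() so the numbers can be checked in isolation. */
struct ring_geometry {
    uint32_t block_size;   /* Bytes per block. */
    uint32_t frame_size;   /* Bytes per frame. */
    uint32_t block_nr;     /* Number of blocks. */
    uint32_t frame_nr;     /* Total frames in the ring. */
    size_t mm_len;         /* Bytes to mmap for this ring. */
};

static struct ring_geometry
ring_geometry(uint32_t page_size, uint32_t tpacket_alignment, uint32_t blocks)
{
    struct ring_geometry g;

    g.block_size = page_size << 2;          /* Four pages per block. */
    g.frame_size = tpacket_alignment << 7;  /* 128 alignment units per frame. */
    g.block_nr = blocks;

    /* This sketch assumes frames tile a block exactly, which holds for the
     * values used in the patch. */
    assert(g.block_size % g.frame_size == 0);

    g.frame_nr = g.block_size / g.frame_size * g.block_nr;
    g.mm_len = (size_t) g.block_size * g.block_nr;
    return g;
}
```

With page_size 4096, tpacket_alignment 16 and the patch's blocks value of
256, this gives 16 KiB blocks, 2 KiB frames, 2048 frames and a 4 MiB mapping
per ring, so the combined rx and tx mmap() would cover 8 MiB.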

Comments

Ben Pfaff Jan. 21, 2020, 7:25 p.m. UTC | #1
On Tue, Jan 21, 2020 at 02:49:47AM -0500, yang_y_yi@163.com wrote:
> From: Yi Yang <yangyi01@inspur.com>
> 
> We can avoid high system call overhead by using TPACKET_V1/V2/V3
> and use DPDK-like poll to receive and send packets (Note: send
> still needs to call sendto to trigger final packet transmission).
> 
> I can see about 30% improvement compared to last recvmmsg
> optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> TPACKET_V3, but it still can improve about 20%.
> 
> For veth, it is 1.47 Gbps before this patch, it is about 1.98
> Gbps after applied this patch. But it is about 4.00 Gbps if we
> use af_packet for veth, the bottle neck lies in ovs-vswitchd
> thread, it will handle too many things for every loop (as below)
> , so it can't work very efficintly as pmd_thread.
> 
>         memory_run();
>         bridge_run();
>         unixctl_server_run(unixctl);
>         netdev_run();
> 
>         memory_wait();
>         bridge_wait();
>         unixctl_server_wait(unixctl);
>         netdev_wait();
>         poll_block();
> 
> In the next step, it will be better if let pmd_thread to handle
> tap and veth interface.
> 
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> Co-authored-by: William Tu <u9012063@gmail.com>
> Signed-off-by: William Tu <u9012063@gmail.com>

Thanks for the patch!

I am a bit concerned about version compatibility issues here.  There are
two relevant kinds of versions.  The first is the version of the
kernel/library headers.  This patch works pretty hard to adapt to the
headers that are available at compile time, only dealing with the
versions of the protocols that are available from the headers.  This
approach is sometimes fine, but an approach that can be better is to simply
declare the structures or constants that the headers lack.  This is
often pretty easy for Linux data structures.  OVS does this for some
structures that it cares about with the headers in ovs/include/linux.
This approach has two advantages: the OVS code (outside these special
declarations) doesn't have to care whether particular structures are
declared, because they are always declared, and the OVS build always
supports a particular feature regardless of the headers of the system on
which it was built.
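
As a rough illustration of declaring what the headers lack (a sketch only,
not code from OVS or from this patch): the compat_ and COMPAT_ names below
are hypothetical, picked so they cannot clash with <linux/if_packet.h>, and
the numeric values are the ones I believe the Linux UAPI header uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local copies of socket option numbers from the Linux UAPI.
 * Defining them here means the build does not depend on how new the
 * installed <linux/if_packet.h> is. */
#define COMPAT_PACKET_RX_RING  5
#define COMPAT_PACKET_VERSION  10
#define COMPAT_PACKET_TX_RING  13

/* Mirrors enum tpacket_versions. */
enum compat_tpacket_versions {
    COMPAT_TPACKET_V1,
    COMPAT_TPACKET_V2,
    COMPAT_TPACKET_V3
};

/* Mirrors struct tpacket_req3. The kernel declares these members as
 * 'unsigned int'; uint32_t is used here on the assumption that the two
 * are the same width on the target ABI. */
struct compat_tpacket_req3 {
    uint32_t tp_block_size;       /* Minimal size of contiguous block. */
    uint32_t tp_block_nr;         /* Number of blocks. */
    uint32_t tp_frame_size;       /* Size of frame. */
    uint32_t tp_frame_nr;         /* Total number of frames. */
    uint32_t tp_retire_blk_tov;   /* Block retire timeout in msecs. */
    uint32_t tp_sizeof_priv;      /* Offset to private data area. */
    uint32_t tp_feature_req_word; /* Requested feature bits. */
};

/* Catch layout drift at compile time: seven 32-bit members, no padding. */
static_assert(sizeof(struct compat_tpacket_req3) == 7 * sizeof(uint32_t),
              "compat_tpacket_req3 must match the kernel layout");
```

With declarations like these always present, code elsewhere would not need
HAVE_TPACKET_V2/HAVE_TPACKET_V3 style conditionals at all.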

The second kind of version is the version of the system that OVS runs
on.  Unless a given feature is one that is supported by every version
that OVS cares about, OVS needs to test at runtime whether the feature
is supported and, if not, fall back to the older feature.  I don't see
that in this code.  Instead, it looks to me like it assumes that if the
feature was available at build time, then it is available at runtime.
This is not a good way to do things, since we want people to be able to
get builds from distributors such as Red Hat or Debian and then run
those builds on a diverse collection of kernels.
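
A runtime check with fallback could look roughly like the sketch below. It
is an assumption-laden illustration rather than a proposed implementation:
probe_tpacket_version, try_version_setsockopt, reject_v3_stub and the
SKETCH_ constants are all hypothetical names, and the "try one version" step
is passed in as a callback only so the fallback order can be exercised
without a real AF_PACKET socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical local copies of the Linux values, so that the sketch does
 * not depend on <linux/if_packet.h> being recent. */
#define SKETCH_SOL_PACKET      263
#define SKETCH_PACKET_VERSION  10

/* One "try this version" step: returns 0 on success, else a positive errno. */
typedef int (*try_version_fn)(int fd, int version);

/* Real step: asks the kernel to switch 'fd' to TPACKET version 'version'. */
static int
try_version_setsockopt(int fd, int version)
{
    return setsockopt(fd, SKETCH_SOL_PACKET, SKETCH_PACKET_VERSION,
                      &version, sizeof version) ? errno : 0;
}

/* Stand-in for an older kernel that rejects V3 (2) but accepts V2 and V1. */
static int
reject_v3_stub(int fd, int version)
{
    (void) fd;
    return version == 2 ? EINVAL : 0;
}

/* Tries V3 (2), then V2 (1), then V1 (0).  Returns the first version that
 * is accepted, or -1 if none is, in which case the caller would keep using
 * the plain recvmmsg()/sendmmsg() path. */
static int
probe_tpacket_version(int fd, try_version_fn try_version)
{
    assert(try_version != NULL);

    for (int version = 2; version >= 0; version--) {
        if (!try_version(fd, version)) {
            return version;
        }
    }
    return -1;
}
```

On a real socket the call would be probe_tpacket_version(fd,
try_version_setsockopt). A fuller probe would probably also have to treat a
failing PACKET_RX_RING or PACKET_TX_RING setup as a reason to fall back,
since a kernel accepting the version number does not by itself guarantee
that ring setup succeeds.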

One specific comment I have here is that, in acinclude.m4, it would be
better to use AC_CHECK_TYPE or AC_CHECK_TYPES than OVS_GREP_IFELSE.
The latter is for testing for kernel builds only; we can't use the
normal AC_* tests for those because we often can't successfully build
kernel headers using the compiler and flags that Autoconf sets up for
building OVS.

Thanks,

Ben.
William Tu Jan. 22, 2020, 6:13 p.m. UTC | #2
On Mon, Jan 20, 2020 at 11:50 PM <yang_y_yi@163.com> wrote:
>
> From: Yi Yang <yangyi01@inspur.com>
>
> We can avoid high system call overhead by using TPACKET_V1/V2/V3
> and use DPDK-like poll to receive and send packets (Note: send
> still needs to call sendto to trigger final packet transmission).
>
> I can see about 30% improvement compared to last recvmmsg
> optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> TPACKET_V3, but it still can improve about 20%.
>
> For veth, it is 1.47 Gbps before this patch, it is about 1.98
> Gbps after applied this patch. But it is about 4.00 Gbps if we
> use af_packet for veth, the bottle neck lies in ovs-vswitchd

Hi Yiyang,

I don't understand these three numbers.
Don't you also use af_packet for veth for 1.47 Gbps and 1.98 Gbps?
What's the difference between your 4.00 Gbps and 1.98 Gbps?

William
> thread, it will handle too many things for every loop (as below)
> , so it can't work very efficintly as pmd_thread.
>
>         memory_run();
>         bridge_run();
>         unixctl_server_run(unixctl);
>         netdev_run();
>
>         memory_wait();
>         bridge_wait();
>         unixctl_server_wait(unixctl);
>         netdev_wait();
>         poll_block();
>
> In the next step, it will be better if let pmd_thread to handle
> tap and veth interface.
>
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> Co-authored-by: William Tu <u9012063@gmail.com>
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>  acinclude.m4               |  23 +++
>  configure.ac               |   1 +
>  lib/netdev-linux-private.h |  27 +++
>  lib/netdev-linux.c         | 481 ++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 527 insertions(+), 5 deletions(-)
>
> diff --git a/acinclude.m4 b/acinclude.m4
> index c1470cc..e99aff1 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -1095,6 +1095,29 @@ AC_DEFUN([OVS_CHECK_IF_DL],
>        AC_SEARCH_LIBS([pcap_open_live], [pcap])
>     fi])
>
> +dnl OVS_CHECK_LINUX_TPACKET
> +dnl
> +dnl Configure Linux TPACKET.
> +AC_DEFUN([OVS_CHECK_LINUX_TPACKET], [
> +  AC_CHECK_HEADER([linux/if_packet.h],
> +                  [HAVE_TPACKET=yes],
> +                  [HAVE_TPACKET=no])
> +  AM_CONDITIONAL([HAVE_TPACKET], [test "$HAVE_TPACKET" = yes])
> +  if test "$HAVE_TPACKET" = yes; then
> +    AC_DEFINE([HAVE_TPACKET], [1],
> +              [Define to 1 if linux/if_packet.h is available.])
> +    OVS_GREP_IFELSE([/usr/include/linux/if_packet.h], [struct tpacket3_hdr ],
> +                    [AC_DEFINE([HAVE_TPACKET_V3], [1],
> +                      [Define to 1 if struct tpacket3_hdr is defined])])
> +    OVS_GREP_IFELSE([/usr/include/linux/if_packet.h], [struct tpacket2_hdr ],
> +                    [AC_DEFINE([HAVE_TPACKET_V2], [1],
> +                      [Define to 1 if struct tpacket2_hdr is defined])])
> +    OVS_GREP_IFELSE([/usr/include/linux/if_packet.h], [struct tpacket_hdr ],
> +                    [AC_DEFINE([HAVE_TPACKET_V1], [1],
> +                      [Define to 1 if struct tpacket_hdr is defined])])
> +  fi
> +])
> +
>  dnl Checks for buggy strtok_r.
>  dnl
>  dnl Some versions of glibc 2.7 has a bug in strtok_r when compiling
> diff --git a/configure.ac b/configure.ac
> index 4f483fa..51c288b 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -89,6 +89,7 @@ OVS_CHECK_VISUAL_STUDIO_DDK
>  OVS_CHECK_COVERAGE
>  OVS_CHECK_NDEBUG
>  OVS_CHECK_NETLINK
> +OVS_CHECK_LINUX_TPACKET
>  OVS_CHECK_OPENSSL
>  OVS_CHECK_LIBCAPNG
>  OVS_CHECK_LOGDIR
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> index 143616c..e8febfe 100644
> --- a/lib/netdev-linux-private.h
> +++ b/lib/netdev-linux-private.h
> @@ -26,6 +26,9 @@
>  #include <linux/mii.h>
>  #include <stdint.h>
>  #include <stdbool.h>
> +#ifdef HAVE_TPACKET
> +#include <linux/if_packet.h>
> +#endif
>
>  #include "dp-packet.h"
>  #include "netdev-afxdp.h"
> @@ -40,6 +43,25 @@ struct netdev;
>
>  #define LINUX_RXQ_TSO_MAX_LEN 65536
>
> +#ifdef HAVE_TPACKET
> +struct tpacket_ring {
> +    int sockfd;
> +    struct iovec *rd;
> +    uint8_t *mm_space;
> +    size_t mm_len, rd_len;
> +    struct sockaddr_ll ll;
> +    int type, rd_num, flen, version;
> +    union {
> +        struct tpacket_req  req;
> +        struct tpacket_req3 req3;
> +    };
> +    uint32_t block_num;
> +    uint32_t frame_num;
> +    uint32_t frame_num_in_block;
> +    void * ppd;
> +};
> +#endif /* HAVE_TPACKET */
> +
>  struct netdev_rxq_linux {
>      struct netdev_rxq up;
>      bool is_tap;
> @@ -103,6 +125,11 @@ struct netdev_linux {
>
>      int numa_id;                /* NUMA node id. */
>
> +#ifdef HAVE_TPACKET
> +    struct tpacket_ring *tp_rx_ring;
> +    struct tpacket_ring *tp_tx_ring;
> +#endif
> +
>  #ifdef HAVE_AF_XDP
>      /* AF_XDP information. */
>      struct xsk_socket_info **xsks;
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 6add3e2..b5becf3 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -48,6 +48,9 @@
>  #include <stdlib.h>
>  #include <string.h>
>  #include <unistd.h>
> +#ifdef HAVE_TPACKET
> +#include <sys/mman.h>
> +#endif
>
>  #include "coverage.h"
>  #include "dp-packet.h"
> @@ -153,6 +156,34 @@ struct tpacket_auxdata {
>      uint16_t tp_vlan_tpid;
>  };
>
> +#ifdef HAVE_TPACKET /* All the definitions for TPACKET */
> +#ifndef __aligned_tpacket
> +# define __aligned_tpacket __attribute__((aligned(TPACKET_ALIGNMENT)))
> +#endif
> +
> +#ifndef __align_tpacket
> +# define __align_tpacket(x) __attribute__((aligned(TPACKET_ALIGN(x))))
> +#endif
> +
> +struct block_desc {
> +    uint32_t version;
> +    uint32_t offset_to_priv;
> +    struct tpacket_hdr_v1 h1;
> +};
> +
> +union frame_map {
> +    struct {
> +        struct tpacket_hdr tp_h __aligned_tpacket;
> +        struct sockaddr_ll s_ll __align_tpacket(sizeof(struct tpacket_hdr));
> +    } *v1;
> +    struct {
> +        struct tpacket2_hdr tp_h __aligned_tpacket;
> +        struct sockaddr_ll s_ll __align_tpacket(sizeof(struct tpacket2_hdr));
> +    } *v2;
> +    void *raw;
> +};
> +#endif /* HAVE_TPACKET */
> +
>  /* Linux 2.6.27 introduced ethtool_cmd_speed
>   *
>   * To avoid revisiting problems reported with using configure to detect
> @@ -1064,6 +1095,141 @@ netdev_linux_rxq_alloc(void)
>      return &rx->up;
>  }
>
> +#ifdef HAVE_TPACKET
> +static inline int
> +tpacket_set_packet_loss_discard(int sock)
> +{
> +    int discard = 1;
> +
> +    return setsockopt(sock, SOL_PACKET, PACKET_LOSS, (void *) &discard,
> +                      sizeof(discard));
> +}
> +
> +static inline void *
> +tpacket_get_next_frame(struct tpacket_ring *ring, uint32_t frame_num)
> +{
> +#ifdef HAVE_TPACKET_V3
> +    uint8_t *f0 = ring->rd[0].iov_base;
> +
> +    return f0 + (frame_num * ring->req3.tp_frame_size);
> +#else
> +    return ring->rd[frame_num].iov_base;
> +#endif
> +}
> +
> +/*
> + * For TPACKET_V1&V2, ring->rd_num is tp_frame_nr, ring->flen is tp_frame_size
> + */
> +static inline void
> +tpacket_v1_v2_fill_ring(struct tpacket_ring *ring, unsigned int blocks)
> +{
> +    ring->req.tp_block_size = getpagesize() << 2;
> +    ring->req.tp_frame_size = TPACKET_ALIGNMENT << 7;
> +    ring->req.tp_block_nr = blocks;
> +
> +    ring->req.tp_frame_nr = ring->req.tp_block_size /
> +                            ring->req.tp_frame_size *
> +                            ring->req.tp_block_nr;
> +
> +    ring->mm_len = ring->req.tp_block_size * ring->req.tp_block_nr;
> +    ring->rd_num = ring->req.tp_frame_nr;
> +    ring->flen = ring->req.tp_frame_size;
> +}
> +
> +/*
> + * For TPACKET_V3, ring->rd_num is tp_block_nr, ring->flen is tp_block_size
> + */
> +static inline void
> +tpacket_v3_fill_ring(struct tpacket_ring *ring, unsigned int blocks, int type)
> +{
> +    if (type == PACKET_RX_RING) {
> +        ring->req3.tp_retire_blk_tov = 0;
> +        ring->req3.tp_sizeof_priv = 0;
> +        ring->req3.tp_feature_req_word = 0;
> +    }
> +    ring->req3.tp_block_size = getpagesize() << 2;
> +    ring->req3.tp_frame_size = TPACKET_ALIGNMENT << 7;
> +    ring->req3.tp_block_nr = blocks;
> +
> +    ring->req3.tp_frame_nr = ring->req3.tp_block_size /
> +                             ring->req3.tp_frame_size *
> +                             ring->req3.tp_block_nr;
> +
> +    ring->mm_len = ring->req3.tp_block_size * ring->req3.tp_block_nr;
> +    ring->rd_num = ring->req3.tp_block_nr;
> +    ring->flen = ring->req3.tp_block_size;
> +}
> +
> +static int
> +tpacket_setup_ring(int sock, struct tpacket_ring *ring, int version, int type)
> +{
> +    int ret = 0;
> +    unsigned int blocks = 256;
> +
> +    ring->type = type;
> +    ring->version = version;
> +
> +    switch (version) {
> +    case TPACKET_V1:
> +    case TPACKET_V2:
> +            if (type == PACKET_TX_RING) {
> +                    tpacket_set_packet_loss_discard(sock);
> +            }
> +            tpacket_v1_v2_fill_ring(ring, blocks);
> +            ret = setsockopt(sock, SOL_PACKET, type, &ring->req,
> +                             sizeof(ring->req));
> +            break;
> +
> +    case TPACKET_V3:
> +            tpacket_v3_fill_ring(ring, blocks, type);
> +            ret = setsockopt(sock, SOL_PACKET, type, &ring->req3,
> +                             sizeof(ring->req3));
> +            break;
> +    }
> +
> +    if (ret == -1) {
> +        return -1;
> +    }
> +
> +    ring->rd_len = ring->rd_num * sizeof(*ring->rd);
> +    ring->rd = xmalloc(ring->rd_len);
> +    if (ring->rd == NULL) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static inline int
> +tpacket_mmap_rx_tx_ring(int sock, struct tpacket_ring *rx_ring,
> +                struct tpacket_ring *tx_ring)
> +{
> +    int i;
> +
> +    rx_ring->mm_space = mmap(0, rx_ring->mm_len + tx_ring->mm_len,
> +                          PROT_READ | PROT_WRITE,
> +                          MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock, 0);
> +    if (rx_ring->mm_space == MAP_FAILED) {
> +        return -1;
> +    }
> +
> +    memset(rx_ring->rd, 0, rx_ring->rd_len);
> +    for (i = 0; i < rx_ring->rd_num; ++i) {
> +            rx_ring->rd[i].iov_base = rx_ring->mm_space + (i * rx_ring->flen);
> +            rx_ring->rd[i].iov_len = rx_ring->flen;
> +    }
> +
> +    tx_ring->mm_space = rx_ring->mm_space + rx_ring->mm_len;
> +    memset(tx_ring->rd, 0, tx_ring->rd_len);
> +    for (i = 0; i < tx_ring->rd_num; ++i) {
> +            tx_ring->rd[i].iov_base = tx_ring->mm_space + (i * tx_ring->flen);
> +            tx_ring->rd[i].iov_len = tx_ring->flen;
> +    }
> +
> +    return 0;
> +}
> +#endif
> +
>  static int
>  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>  {
> @@ -1079,6 +1245,15 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>      } else {
>          struct sockaddr_ll sll;
>          int ifindex, val;
> +#ifdef HAVE_TPACKET
> +#ifdef HAVE_TPACKET_V3
> +        int ver = TPACKET_V3;
> +#elif defined(HAVE_TPACKET_V2)
> +        int ver = TPACKET_V2;
> +#else
> +        int ver = TPACKET_V1;
> +#endif
> +#endif
>          /* Result of tcpdump -dd inbound */
>          static const struct sock_filter filt[] = {
>              { 0x28, 0, 0, 0xfffff004 }, /* ldh [0] */
> @@ -1091,13 +1266,52 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>          };
>
>          /* Create file descriptor. */
> -        rx->fd = socket(PF_PACKET, SOCK_RAW, 0);
> +        rx->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
>          if (rx->fd < 0) {
>              error = errno;
>              VLOG_ERR("failed to create raw socket (%s)", ovs_strerror(error));
>              goto error;
>          }
>
> +#ifdef HAVE_TPACKET
> +        error = setsockopt(rx->fd, SOL_PACKET, PACKET_VERSION, &ver,
> +                           sizeof(ver));
> +        if (error != 0) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to set tpacket version (%s)",
> +                     netdev_get_name(netdev_), ovs_strerror(error));
> +            goto error;
> +        }
> +        netdev->tp_rx_ring = xzalloc(sizeof(struct tpacket_ring));
> +        netdev->tp_tx_ring = xzalloc(sizeof(struct tpacket_ring));
> +        netdev->tp_rx_ring->sockfd = rx->fd;
> +        netdev->tp_tx_ring->sockfd = rx->fd;
> +        error = tpacket_setup_ring(rx->fd, netdev->tp_rx_ring, ver,
> +                                   PACKET_RX_RING);
> +        if (error != 0) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to set tpacket rx ring (%s)",
> +                     netdev_get_name(netdev_), ovs_strerror(error));
> +            goto error;
> +        }
> +        error = tpacket_setup_ring(rx->fd, netdev->tp_tx_ring, ver,
> +                                   PACKET_TX_RING);
> +        if (error != 0) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to set tpacket tx ring (%s)",
> +                     netdev_get_name(netdev_), ovs_strerror(error));
> +            goto error;
> +        }
> +        error = tpacket_mmap_rx_tx_ring(rx->fd, netdev->tp_rx_ring,
> +                                       netdev->tp_tx_ring);
> +        if (error != 0) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to mmap tpacket rx & tx ring (%s)",
> +                     netdev_get_name(netdev_), ovs_strerror(error));
> +            goto error;
> +        }
> +#endif
> +
>          val = 1;
>          if (setsockopt(rx->fd, SOL_PACKET, PACKET_AUXDATA, &val, sizeof val)) {
>              error = errno;
> @@ -1129,7 +1343,12 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>
>          /* Bind to specific ethernet device. */
>          memset(&sll, 0, sizeof sll);
> -        sll.sll_family = AF_PACKET;
> +        sll.sll_family = PF_PACKET;
> +#ifdef HAVE_TPACKET
> +        sll.sll_hatype = 0;
> +        sll.sll_pkttype = 0;
> +        sll.sll_halen = 0;
> +#endif
>          sll.sll_ifindex = ifindex;
>          sll.sll_protocol = htons(ETH_P_ALL);
>          if (bind(rx->fd, (struct sockaddr *) &sll, sizeof sll) < 0) {
> @@ -1168,6 +1387,17 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
>      int i;
>
>      if (!rx->is_tap) {
> +#ifdef HAVE_TPACKET
> +        struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
> +
> +        if (netdev->tp_rx_ring) {
> +            munmap(netdev->tp_rx_ring->mm_space,
> +                   2 * netdev->tp_rx_ring->mm_len);
> +            free(netdev->tp_rx_ring->rd);
> +            free(netdev->tp_tx_ring->rd);
> +        }
> +#endif
> +
>          close(rx->fd);
>      }
>
> @@ -1184,6 +1414,7 @@ netdev_linux_rxq_dealloc(struct netdev_rxq *rxq_)
>      free(rx);
>  }
>
> +#ifndef HAVE_TPACKET
>  static ovs_be16
>  auxdata_to_vlan_tpid(const struct tpacket_auxdata *aux, bool double_tagged)
>  {
> @@ -1345,6 +1576,7 @@ netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
>      return 0;
>  }
>
> +#else /* ifdef HAVE_TPACKET */
>  /*
>   * Receive packets from tap by batch process for better performance,
>   * it can receive NETDEV_MAX_BURST packets at most once, the received
> @@ -1428,6 +1660,125 @@ netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
>
>      return 0;
>  }
> +static int
> +netdev_linux_batch_recv_tpacket(struct netdev_rxq *rxq_, int mtu,
> +                                struct dp_packet_batch *batch)
> +{
> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
> +    struct dp_packet *buffer;
> +    int i = 0;
> +
> +#ifdef HAVE_TPACKET_V3
> +    unsigned int block_num;
> +    unsigned int fn_in_block;
> +    struct block_desc *pbd;
> +    struct tpacket3_hdr *ppd;
> +
> +    ppd = (struct tpacket3_hdr *)netdev->tp_rx_ring->ppd;
> +    block_num = netdev->tp_rx_ring->block_num;
> +    fn_in_block = netdev->tp_rx_ring->frame_num_in_block;
> +    pbd = (struct block_desc *) netdev->tp_rx_ring->rd[block_num].iov_base;
> +#else
> +#if defined(HAVE_TPACKET_V2)
> +    struct tpacket2_hdr *ppd;
> +#else
> +    struct tpacket_hdr *ppd;
> +#endif
> +    unsigned int frame_num;
> +    unsigned int frame_nr = netdev->tp_rx_ring->rd_num;
> +
> +    frame_num = netdev->tp_rx_ring->frame_num;
> +#endif
> +
> +    while (i < NETDEV_MAX_BURST) {
> +#ifdef HAVE_TPACKET_V3
> +        if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
> +            break;
> +        }
> +        if (fn_in_block == 0) {
> +            ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
> +                                           pbd->h1.offset_to_first_pkt);
> +        }
> +#elif defined(HAVE_TPACKET_V2)
> +        ppd = (struct tpacket2_hdr *)
> +                  netdev->tp_rx_ring->rd[frame_num].iov_base;
> +        if ((ppd->tp_status & TP_STATUS_USER) == 0) {
> +            break;
> +        }
> +#else
> +        ppd = (struct tpacket_hdr *)netdev->tp_rx_ring->rd[frame_num].iov_base;
> +        if ((ppd->tp_status & TP_STATUS_USER) == 0) {
> +            break;
> +        }
> +#endif
> +
> +        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> +                                             DP_NETDEV_HEADROOM);
> +        memcpy(dp_packet_data(buffer),
> +               (uint8_t *) ppd + ppd->tp_mac, ppd->tp_snaplen);
> +        dp_packet_set_size(buffer,
> +                           dp_packet_size(buffer) + ppd->tp_snaplen);
> +#if defined(HAVE_TPACKET_V2) || defined(HAVE_TPACKET_V3)
> +        if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
> +            struct eth_header *eth;
> +            bool double_tagged;
> +            ovs_be16 vlan_tpid;
> +
> +            eth = dp_packet_data(buffer);
> +            double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
> +            if (ppd->tp_status & TP_STATUS_VLAN_TPID_VALID) {
> +#ifdef HAVE_TPACKET_V3
> +                vlan_tpid = htons(ppd->hv1.tp_vlan_tpid);
> +#else
> +                vlan_tpid = htons(ppd->tp_vlan_tpid);
> +#endif
> +            } else if (double_tagged) {
> +                vlan_tpid = htons(ETH_TYPE_VLAN_8021AD);
> +            } else {
> +                vlan_tpid = htons(ETH_TYPE_VLAN_8021Q);
> +            }
> +#ifdef HAVE_TPACKET_V3
> +            eth_push_vlan(buffer, vlan_tpid, htons(ppd->hv1.tp_vlan_tci));
> +#else
> +            eth_push_vlan(buffer, vlan_tpid, htons(ppd->tp_vlan_tci));
> +#endif
> +        }
> +#endif
> +        dp_packet_batch_add(batch, buffer);
> +
> +#ifdef HAVE_TPACKET_V3
> +        fn_in_block++;
> +        if (fn_in_block >= pbd->h1.num_pkts) {
> +            pbd->h1.block_status = TP_STATUS_KERNEL;
> +            block_num = (block_num + 1) %
> +                            netdev->tp_rx_ring->req3.tp_block_nr;
> +            pbd = (struct block_desc *)
> +                     netdev->tp_rx_ring->rd[block_num].iov_base;
> +            fn_in_block = 0;
> +            ppd = NULL;
> +        } else {
> +            ppd = (struct tpacket3_hdr *)
> +                   ((uint8_t *) ppd + ppd->tp_next_offset);
> +        }
> +#else
> +        ppd->tp_status = TP_STATUS_KERNEL;
> +        frame_num = (frame_num + 1) % frame_nr;
> +#endif
> +        i++;
> +    }
> +
> +#ifdef HAVE_TPACKET_V3
> +    netdev->tp_rx_ring->block_num = block_num;
> +    netdev->tp_rx_ring->frame_num_in_block = fn_in_block;
> +    netdev->tp_rx_ring->ppd = ppd;
> +#else
> +    netdev->tp_rx_ring->frame_num = frame_num;
> +#endif
> +
> +    return 0;
> +}
> +#endif
>
>  static int
>  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> @@ -1443,9 +1794,15 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>      }
>
>      dp_packet_batch_init(batch);
> -    retval = (rx->is_tap
> -              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> -              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
> +    if (rx->is_tap) {
> +        retval = netdev_linux_batch_rxq_recv_tap(rx, mtu, batch);
> +    } else {
> +#ifndef HAVE_TPACKET
> +        retval = netdev_linux_batch_rxq_recv_sock(rx, mtu, batch);
> +#else
> +        retval = netdev_linux_batch_recv_tpacket(rxq_, mtu, batch);
> +#endif
> +    }
>
>      if (retval) {
>          if (retval != EAGAIN && retval != EMSGSIZE) {
> @@ -1486,6 +1843,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
>      }
>  }
>
> +#ifndef HAVE_TPACKET
>  static int
>  netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>                               struct dp_packet_batch *batch)
> @@ -1531,6 +1889,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
>      return error;
>  }
>
> +#else /* ifdef HAVE_TPACKET */
>  /* Use the tap fd to send 'batch' to tap device 'netdev'.  Using the tap fd is
>   * essential, because packets sent to a tap device with an AF_PACKET socket
>   * will loop back to be *received* again on the tap device.  This doesn't occur
> @@ -1650,6 +2009,114 @@ netdev_linux_get_numa_id(const struct netdev *netdev_)
>      return numa_id;
>  }
>
> +static inline int
> +tpacket_tx_is_ready(void * next_frame)
> +{
> +#ifdef HAVE_TPACKET_V3
> +    struct tpacket3_hdr *hdr = (struct tpacket3_hdr *)next_frame;
> +#elif defined(HAVE_TPACKET_V2)
> +    struct tpacket2_hdr *hdr = (struct tpacket2_hdr *)next_frame;
> +#else
> +    struct tpacket_hdr *hdr = (struct tpacket_hdr *)next_frame;
> +#endif
> +    return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
> +}
> +
> +static int
> +netdev_linux_tpacket_batch_send(struct netdev *netdev_,
> +                            struct dp_packet_batch *batch)
> +{
> +    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +    struct dp_packet *packet;
> +    int sockfd;
> +    ssize_t bytes_sent;
> +    int total_pkts = 0;
> +
> +#ifdef HAVE_TPACKET_V3
> +    unsigned int frame_nr = netdev->tp_tx_ring->req3.tp_frame_nr;
> +#else
> +    unsigned int frame_nr = netdev->tp_tx_ring->rd_num;
> +#endif
> +    unsigned int frame_num = netdev->tp_tx_ring->frame_num;
> +
> +    /* The Linux tap driver returns EIO if the device is not up,
> +     * so if the device is not up, don't waste time sending it.
> +     * However, if the device is in another network namespace
> +     * then OVS can't retrieve the state. In that case, send the
> +     * packets anyway. */
> +    if (netdev->present && !(netdev->ifi_flags & IFF_UP)) {
> +        netdev->tx_dropped += dp_packet_batch_size(batch);
> +        return 0;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        union frame_map ppd;
> +        size_t size = dp_packet_size(packet);
> +#ifdef HAVE_TPACKET_V3
> +        struct tpacket3_hdr *next_frame
> +                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
> +#elif defined(HAVE_TPACKET_V2)
> +        struct tpacket2_hdr *next_frame
> +                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
> +#else
> +        struct tpacket_hdr *next_frame
> +                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
> +#endif
> +
> +        ppd.raw = next_frame;
> +        if (!tpacket_tx_is_ready(next_frame)) {
> +            break;
> +        }
> +#ifdef HAVE_TPACKET_V3
> +        next_frame->tp_snaplen = size;
> +        next_frame->tp_len = size;
> +        next_frame->tp_next_offset = 0;
> +
> +        memcpy((uint8_t *)ppd.raw + TPACKET3_HDRLEN
> +                   - sizeof(struct sockaddr_ll),
> +               dp_packet_data(packet),
> +               size);
> +#elif defined(HAVE_TPACKET_V2)
> +        ppd.v2->tp_h.tp_snaplen = size;
> +        ppd.v2->tp_h.tp_len = size;
> +
> +        memcpy((uint8_t *)ppd.raw + TPACKET2_HDRLEN
> +                   - sizeof(struct sockaddr_ll),
> +               dp_packet_data(packet),
> +               size);
> +#else
> +        ppd.v1->tp_h.tp_snaplen = size;
> +        ppd.v1->tp_h.tp_len = size;
> +
> +        memcpy((uint8_t *)ppd.raw + TPACKET_HDRLEN
> +                   - sizeof(struct sockaddr_ll),
> +               dp_packet_data(packet),
> +               size);
> +#endif
> +        next_frame->tp_status = TP_STATUS_SEND_REQUEST;
> +        frame_num = (frame_num + 1) % frame_nr;
> +        total_pkts++;
> +    }
> +    netdev->tp_tx_ring->frame_num = frame_num;
> +
> +    /* kick-off transmits */
> +    if (total_pkts != 0) {
> +        sockfd = netdev->tp_tx_ring->sockfd;
> +        bytes_sent = sendto(sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +        if (bytes_sent == -1 &&
> +                errno != ENOBUFS && errno != EAGAIN) {
> +            /*
> +             * In case of an ENOBUFS/EAGAIN error all of the enqueued
> +             * packets will be considered successful even though only some
> +             * are sent.
> +             */
> +            netdev->tx_dropped += dp_packet_batch_size(batch);
> +        }
> +    }
> +    return 0;
> +}
> +#endif
> +
>  /* Sends 'batch' on 'netdev'.  Returns 0 if successful, otherwise a positive
>   * errno value.  Returns EAGAIN without blocking if the packet cannot be queued
>   * immediately.  Returns EMSGSIZE if a partial packet was transmitted or if
> @@ -1689,7 +2156,11 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
>              goto free_batch;
>          }
>
> +#ifndef HAVE_TPACKET
>          error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
> +#else
> +        error = netdev_linux_tpacket_batch_send(netdev_, batch);
> +#endif
>      } else {
>          error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
>      }
> --
> 1.8.3.1
>
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
William Tu Jan. 22, 2020, 8:22 p.m. UTC | #3
On Tue, Jan 21, 2020 at 11:25 AM Ben Pfaff <blp@ovn.org> wrote:
>
> On Tue, Jan 21, 2020 at 02:49:47AM -0500, yang_y_yi@163.com wrote:
> > From: Yi Yang <yangyi01@inspur.com>
> >
> > We can avoid high system call overhead by using TPACKET_V1/V2/V3
> > and use DPDK-like poll to receive and send packets (Note: send
> > still needs to call sendto to trigger final packet transmission).
> >
> > I can see about 30% improvement compared to last recvmmsg
> > optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> > TPACKET_V3, but it still can improve about 20%.
> >
> > For veth, it is 1.47 Gbps before this patch, it is about 1.98
> > Gbps after applied this patch. But it is about 4.00 Gbps if we
> > use af_packet for veth, the bottle neck lies in ovs-vswitchd
> > thread, it will handle too many things for every loop (as below)
> > , so it can't work very efficintly as pmd_thread.
> >
> >         memory_run();
> >         bridge_run();
> >         unixctl_server_run(unixctl);
> >         netdev_run();
> >
> >         memory_wait();
> >         bridge_wait();
> >         unixctl_server_wait(unixctl);
> >         netdev_wait();
> >         poll_block();
> >
> > In the next step, it will be better if let pmd_thread to handle
> > tap and veth interface.
> >
> > Signed-off-by: Yi Yang <yangyi01@inspur.com>
> > Co-authored-by: William Tu <u9012063@gmail.com>
> > Signed-off-by: William Tu <u9012063@gmail.com>
>
> Thanks for the patch!
>
> I am a bit concerned about version compatibility issues here.  There are
> two relevant kinds of versions.  The first is the version of the
> kernel/library headers.  This patch works pretty hard to adapt to the
> headers that are available at compile time, only dealing with the
> versions of the protocols that are available from the headers.  This
> approach is sometimes fine, but an approach that can be better is to simply
> declare the structures or constants that the headers lack.  This is
> often pretty easy for Linux data structures.  OVS does this for some
> structures that it cares about with the headers in ovs/include/linux.
> This approach has two advantages: the OVS code (outside these special
> declarations) doesn't have to care whether particular structures are
> declared, because they are always declared, and the OVS build always
> supports a particular feature regardless of the headers of the system on
> which it was built.
>
> The second kind of version is the version of the system that OVS runs
> on.  Unless a given feature is one that is supported by every version
> that OVS cares about, OVS needs to test at runtime whether the feature
> is supported and, if not, fall back to the older feature.  I don't see
> that in this code.  Instead, it looks to me like it assumes that if the
> feature was available at build time, then it is available at runtime.
> This is not a good way to do things, since we want people to be able to
> get builds from distributors such as Red Hat or Debian and then run
> those builds on a diverse collection of kernels.
>
Hi Yiyang,

Can we just implement TPACKET v3, and drop v2 and v1?
V3 is supported since kernel 3.10,

commit f6fb8f100b807378fda19e83e5ac6828b638603a
Author: chetan loke <loke.chetan@gmail.com>
Date:   Fri Aug 19 10:18:16 2011 +0000

    af-packet: TPACKET_V3 flexible buffer implementation.

and according to the OVS releases FAQ
http://docs.openvswitch.org/en/latest/faq/releases/
the minimum kernel requirement after OVS 2.12 is 3.10.

Regards,
William
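
The runtime check Ben asks for above could look roughly like the
following. This is only a minimal sketch: tpacket_probe_version() is a
hypothetical helper that does not exist in the patch, and it assumes the
build headers (or OVS compat declarations) define TPACKET_V1, TPACKET_V2
and TPACKET_V3. It relies on setsockopt(PACKET_VERSION) failing with
EINVAL when the running kernel does not know the requested version.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper, not part of the patch: try each TPACKET version on
 * AF_PACKET socket 'sock', newest first, and keep the first one that the
 * running kernel accepts.
 *
 * Returns 0 and stores the accepted version in '*version' on success,
 * otherwise a positive errno value and leaves '*version' untouched. */
static int
tpacket_probe_version(int sock, int *version)
{
    static const int candidates[] = { TPACKET_V3, TPACKET_V2, TPACKET_V1 };
    int error = EINVAL;
    size_t i;

    assert(version != NULL);
    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        int ver = candidates[i];

        if (!setsockopt(sock, SOL_PACKET, PACKET_VERSION, &ver, sizeof ver)) {
            *version = ver;
            return 0;
        }
        error = errno;
        if (error != EINVAL) {
            /* Not a "version unknown" failure (for example EBADF), so
             * trying an older version would not help. */
            break;
        }
    }
    return error;
}
```

A caller such as netdev_linux_rxq_construct() could treat a nonzero
return as "no ring support" and stay on the existing recvmmsg() path
instead of failing the whole rxq.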
William Tu Feb. 3, 2020, 5:23 p.m. UTC | #4
On Sun, Feb 2, 2020 at 8:06 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Hi, William
>
> Sorry for last reply, I don't know why I always can't get your comments
> email from my outlook, Ben's comments are ok, I also can't see your comments
> in outlook junk box.
>
> About your comments in
> https://mail.openvswitch.org/pipermail/ovs-dev/2020-January/367146.html, I
> checked it in my CentOS 7 which has 3.10.0 kernel, TPACKET_V3 sample code
> can work, so I'm ok to remove V1&V2 code.
>

OK, thank you for confirming that v3 works on the 3.10 CentOS 7 kernel!
William
Ilya Maximets Feb. 6, 2020, 11:10 a.m. UTC | #5
> On Sun, Feb 2, 2020 at 8:06 PM Yi Yang (杨燚)-云服务集团 <yangyi01 at inspur.com> wrote:
>>
>> Hi, William
>>
>> Sorry for last reply, I don't know why I always can't get your comments
>> email from my outlook, Ben's comments are ok, I also can't see your comments
>> in outlook junk box.
>>
>> About your comments in
>> https://mail.openvswitch.org/pipermail/ovs-dev/2020-January/367146.html, I
>> checked it in my CentOS 7 which has 3.10.0 kernel, TPACKET_V3 sample code
>> can work, so I'm ok to remove V1&V2 code.
>>
> 
> OK thank you for confirming that v3 works on 3.10 cento 7!


FYI, RHEL/CentOS 3.10 kernels have almost nothing in common with the
upstream 3.10 kernel.

Best regards, Ilya Maximets.
Yi Yang (杨燚)-云服务集团 Feb. 6, 2020, 1:12 p.m. UTC | #6
Thanks, Ilya, for pointing this out. I checked if_packet.h on git.kernel.org at the v3.10 tag, and it does indeed support TPACKET_V3:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/if_packet.h?h=v3.10

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/if_packet.h?h=v3.10#n225

#define TPACKET3_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))

So we can safely support only TPACKET_V3, because the minimum kernel requirement for OVS is 3.10.0.
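
To make the use of TPACKET3_HDRLEN concrete: the send path in this patch copies packet data to TPACKET3_HDRLEN - sizeof(struct sockaddr_ll) bytes past the start of the frame, which is just the aligned size of struct tpacket3_hdr. A minimal sketch follows; tpacket3_tx_data_offset() is a hypothetical name used only for illustration, and the code assumes headers that define struct tpacket3_hdr.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper, not part of the patch: offset from the start of a
 * TPACKET_V3 frame to the place where the patch's send path copies the
 * packet data. */
static size_t
tpacket3_tx_data_offset(void)
{
    size_t off = TPACKET3_HDRLEN - sizeof(struct sockaddr_ll);

    /* By the definition of TPACKET3_HDRLEN quoted above, subtracting the
     * sockaddr_ll leaves exactly the aligned header size. */
    assert(off == TPACKET_ALIGN(sizeof(struct tpacket3_hdr)));
    return off;
}
```

The exact value depends on the size of struct tpacket3_hdr in the headers used for the build, so the sketch checks only the relationship and not a fixed number.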

-----Original Message-----
From: Ilya Maximets [mailto:i.maximets@ovn.org] 
Sent: February 6, 2020 19:11
To: ovs-dev@openvswitch.org; Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; William Tu <u9012063@gmail.com>
Cc: Ben Pfaff <blp@ovn.org>; Ilya Maximets <i.maximets@ovn.org>
Subject: Re: [ovs-dev] [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

> On Sun, Feb 2, 2020 at 8:06 PM Yi Yang (杨燚)-云服务集团 <yangyi01 at inspur.com> wrote:
>>
>> Hi, William
>>
>> Sorry for last reply, I don't know why I always can't get your 
>> comments email from my outlook, Ben's comments are ok, I also can't 
>> see your comments in outlook junk box.
>>
>> About your comments in
>> https://mail.openvswitch.org/pipermail/ovs-dev/2020-January/367146.ht
>> ml, I checked it in my CentOS 7 which has 3.10.0 kernel, TPACKET_V3 
>> sample code can work, so I'm ok to remove V1&V2 code.
>>
> 
> OK thank you for confirming that v3 works on 3.10 cento 7!


FYI, rhel/centos 3.10 kernels has almost nothing in common with upstream 3.10 kernel.

Best regards, Ilya Maximets.

Patch

diff --git a/acinclude.m4 b/acinclude.m4
index c1470cc..e99aff1 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1095,6 +1095,29 @@  AC_DEFUN([OVS_CHECK_IF_DL],
       AC_SEARCH_LIBS([pcap_open_live], [pcap])
    fi])
 
+dnl OVS_CHECK_LINUX_TPACKET
+dnl
+dnl Configure Linux TPACKET.
+AC_DEFUN([OVS_CHECK_LINUX_TPACKET], [
+  AC_CHECK_HEADER([linux/if_packet.h],
+                  [HAVE_TPACKET=yes],
+                  [HAVE_TPACKET=no])
+  AM_CONDITIONAL([HAVE_TPACKET], [test "$HAVE_TPACKET" = yes])
+  if test "$HAVE_TPACKET" = yes; then
+    AC_DEFINE([HAVE_TPACKET], [1],
+              [Define to 1 if linux/if_packet.h is available.])
+    OVS_GREP_IFELSE([/usr/include/linux/if_packet.h], [struct tpacket3_hdr ],
+                    [AC_DEFINE([HAVE_TPACKET_V3], [1],
+                      [Define to 1 if struct tpacket3_hdr is defined])])
+    OVS_GREP_IFELSE([/usr/include/linux/if_packet.h], [struct tpacket2_hdr ],
+                    [AC_DEFINE([HAVE_TPACKET_V2], [1],
+                      [Define to 1 if struct tpacket2_hdr is defined])])
+    OVS_GREP_IFELSE([/usr/include/linux/if_packet.h], [struct tpacket_hdr ],
+                    [AC_DEFINE([HAVE_TPACKET_V1], [1],
+                      [Define to 1 if struct tpacket_hdr is defined])])
+  fi
+])
+
 dnl Checks for buggy strtok_r.
 dnl
 dnl Some versions of glibc 2.7 has a bug in strtok_r when compiling
diff --git a/configure.ac b/configure.ac
index 4f483fa..51c288b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -89,6 +89,7 @@  OVS_CHECK_VISUAL_STUDIO_DDK
 OVS_CHECK_COVERAGE
 OVS_CHECK_NDEBUG
 OVS_CHECK_NETLINK
+OVS_CHECK_LINUX_TPACKET
 OVS_CHECK_OPENSSL
 OVS_CHECK_LIBCAPNG
 OVS_CHECK_LOGDIR
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index 143616c..e8febfe 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -26,6 +26,9 @@ 
 #include <linux/mii.h>
 #include <stdint.h>
 #include <stdbool.h>
+#ifdef HAVE_TPACKET
+#include <linux/if_packet.h>
+#endif
 
 #include "dp-packet.h"
 #include "netdev-afxdp.h"
@@ -40,6 +43,25 @@  struct netdev;
 
 #define LINUX_RXQ_TSO_MAX_LEN 65536
 
+#ifdef HAVE_TPACKET
+struct tpacket_ring {
+    int sockfd;
+    struct iovec *rd;
+    uint8_t *mm_space;
+    size_t mm_len, rd_len;
+    struct sockaddr_ll ll;
+    int type, rd_num, flen, version;
+    union {
+        struct tpacket_req  req;
+        struct tpacket_req3 req3;
+    };
+    uint32_t block_num;
+    uint32_t frame_num;
+    uint32_t frame_num_in_block;
+    void * ppd;
+};
+#endif /* HAVE_TPACKET */
+
 struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
@@ -103,6 +125,11 @@  struct netdev_linux {
 
     int numa_id;                /* NUMA node id. */
 
+#ifdef HAVE_TPACKET
+    struct tpacket_ring *tp_rx_ring;
+    struct tpacket_ring *tp_tx_ring;
+#endif
+
 #ifdef HAVE_AF_XDP
     /* AF_XDP information. */
     struct xsk_socket_info **xsks;
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6add3e2..b5becf3 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -48,6 +48,9 @@ 
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#ifdef HAVE_TPACKET
+#include <sys/mman.h>
+#endif
 
 #include "coverage.h"
 #include "dp-packet.h"
@@ -153,6 +156,34 @@  struct tpacket_auxdata {
     uint16_t tp_vlan_tpid;
 };
 
+#ifdef HAVE_TPACKET /* All the definitions for TPACKET */
+#ifndef __aligned_tpacket
+# define __aligned_tpacket __attribute__((aligned(TPACKET_ALIGNMENT)))
+#endif
+
+#ifndef __align_tpacket
+# define __align_tpacket(x) __attribute__((aligned(TPACKET_ALIGN(x))))
+#endif
+
+struct block_desc {
+    uint32_t version;
+    uint32_t offset_to_priv;
+    struct tpacket_hdr_v1 h1;
+};
+
+union frame_map {
+    struct {
+        struct tpacket_hdr tp_h __aligned_tpacket;
+        struct sockaddr_ll s_ll __align_tpacket(sizeof(struct tpacket_hdr));
+    } *v1;
+    struct {
+        struct tpacket2_hdr tp_h __aligned_tpacket;
+        struct sockaddr_ll s_ll __align_tpacket(sizeof(struct tpacket2_hdr));
+    } *v2;
+    void *raw;
+};
+#endif /* HAVE_TPACKET */
+
 /* Linux 2.6.27 introduced ethtool_cmd_speed
  *
  * To avoid revisiting problems reported with using configure to detect
@@ -1064,6 +1095,141 @@  netdev_linux_rxq_alloc(void)
     return &rx->up;
 }
 
+#ifdef HAVE_TPACKET
+static inline int
+tpacket_set_packet_loss_discard(int sock)
+{
+    int discard = 1;
+
+    return setsockopt(sock, SOL_PACKET, PACKET_LOSS, (void *) &discard,
+                      sizeof(discard));
+}
+
+static inline void *
+tpacket_get_next_frame(struct tpacket_ring *ring, uint32_t frame_num)
+{
+#ifdef HAVE_TPACKET_V3
+    uint8_t *f0 = ring->rd[0].iov_base;
+
+    return f0 + (frame_num * ring->req3.tp_frame_size);
+#else
+    return ring->rd[frame_num].iov_base;
+#endif
+}
+
+/*
+ * For TPACKET_V1&V2, ring->rd_num is tp_frame_nr, ring->flen is tp_frame_size
+ */
+static inline void
+tpacket_v1_v2_fill_ring(struct tpacket_ring *ring, unsigned int blocks)
+{
+    ring->req.tp_block_size = getpagesize() << 2;
+    ring->req.tp_frame_size = TPACKET_ALIGNMENT << 7;
+    ring->req.tp_block_nr = blocks;
+
+    ring->req.tp_frame_nr = ring->req.tp_block_size /
+                            ring->req.tp_frame_size *
+                            ring->req.tp_block_nr;
+
+    ring->mm_len = ring->req.tp_block_size * ring->req.tp_block_nr;
+    ring->rd_num = ring->req.tp_frame_nr;
+    ring->flen = ring->req.tp_frame_size;
+}
+
+/*
+ * For TPACKET_V3, ring->rd_num is tp_block_nr, ring->flen is tp_block_size
+ */
+static inline void
+tpacket_v3_fill_ring(struct tpacket_ring *ring, unsigned int blocks, int type)
+{
+    if (type == PACKET_RX_RING) {
+        ring->req3.tp_retire_blk_tov = 0;
+        ring->req3.tp_sizeof_priv = 0;
+        ring->req3.tp_feature_req_word = 0;
+    }
+    ring->req3.tp_block_size = getpagesize() << 2;
+    ring->req3.tp_frame_size = TPACKET_ALIGNMENT << 7;
+    ring->req3.tp_block_nr = blocks;
+
+    ring->req3.tp_frame_nr = ring->req3.tp_block_size /
+                             ring->req3.tp_frame_size *
+                             ring->req3.tp_block_nr;
+
+    ring->mm_len = ring->req3.tp_block_size * ring->req3.tp_block_nr;
+    ring->rd_num = ring->req3.tp_block_nr;
+    ring->flen = ring->req3.tp_block_size;
+}
+
+static int
+tpacket_setup_ring(int sock, struct tpacket_ring *ring, int version, int type)
+{
+    int ret = 0;
+    unsigned int blocks = 256;
+
+    ring->type = type;
+    ring->version = version;
+
+    switch (version) {
+    case TPACKET_V1:
+    case TPACKET_V2:
+            if (type == PACKET_TX_RING) {
+                    tpacket_set_packet_loss_discard(sock);
+            }
+            tpacket_v1_v2_fill_ring(ring, blocks);
+            ret = setsockopt(sock, SOL_PACKET, type, &ring->req,
+                             sizeof(ring->req));
+            break;
+
+    case TPACKET_V3:
+            tpacket_v3_fill_ring(ring, blocks, type);
+            ret = setsockopt(sock, SOL_PACKET, type, &ring->req3,
+                             sizeof(ring->req3));
+            break;
+    }
+
+    if (ret == -1) {
+        return -1;
+    }
+
+    ring->rd_len = ring->rd_num * sizeof(*ring->rd);
+    ring->rd = xmalloc(ring->rd_len);
+    if (ring->rd == NULL) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static inline int
+tpacket_mmap_rx_tx_ring(int sock, struct tpacket_ring *rx_ring,
+                struct tpacket_ring *tx_ring)
+{
+    int i;
+
+    rx_ring->mm_space = mmap(0, rx_ring->mm_len + tx_ring->mm_len,
+                          PROT_READ | PROT_WRITE,
+                          MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock, 0);
+    if (rx_ring->mm_space == MAP_FAILED) {
+        return -1;
+    }
+
+    memset(rx_ring->rd, 0, rx_ring->rd_len);
+    for (i = 0; i < rx_ring->rd_num; ++i) {
+            rx_ring->rd[i].iov_base = rx_ring->mm_space + (i * rx_ring->flen);
+            rx_ring->rd[i].iov_len = rx_ring->flen;
+    }
+
+    tx_ring->mm_space = rx_ring->mm_space + rx_ring->mm_len;
+    memset(tx_ring->rd, 0, tx_ring->rd_len);
+    for (i = 0; i < tx_ring->rd_num; ++i) {
+            tx_ring->rd[i].iov_base = tx_ring->mm_space + (i * tx_ring->flen);
+            tx_ring->rd[i].iov_len = tx_ring->flen;
+    }
+
+    return 0;
+}
+#endif
+
 static int
 netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
 {
@@ -1079,6 +1245,15 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     } else {
         struct sockaddr_ll sll;
         int ifindex, val;
+#ifdef HAVE_TPACKET
+#ifdef HAVE_TPACKET_V3
+        int ver = TPACKET_V3;
+#elif defined(HAVE_TPACKET_V2)
+        int ver = TPACKET_V2;
+#else
+        int ver = TPACKET_V1;
+#endif
+#endif
         /* Result of tcpdump -dd inbound */
         static const struct sock_filter filt[] = {
             { 0x28, 0, 0, 0xfffff004 }, /* ldh [0] */
@@ -1091,13 +1266,52 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
         };
 
         /* Create file descriptor. */
-        rx->fd = socket(PF_PACKET, SOCK_RAW, 0);
+        rx->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
         if (rx->fd < 0) {
             error = errno;
             VLOG_ERR("failed to create raw socket (%s)", ovs_strerror(error));
             goto error;
         }
 
+#ifdef HAVE_TPACKET
+        error = setsockopt(rx->fd, SOL_PACKET, PACKET_VERSION, &ver,
+                           sizeof(ver));
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to set tpacket version (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+        netdev->tp_rx_ring = xzalloc(sizeof(struct tpacket_ring));
+        netdev->tp_tx_ring = xzalloc(sizeof(struct tpacket_ring));
+        netdev->tp_rx_ring->sockfd = rx->fd;
+        netdev->tp_tx_ring->sockfd = rx->fd;
+        error = tpacket_setup_ring(rx->fd, netdev->tp_rx_ring, ver,
+                                   PACKET_RX_RING);
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to set tpacket rx ring (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+        error = tpacket_setup_ring(rx->fd, netdev->tp_tx_ring, ver,
+                                   PACKET_TX_RING);
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to set tpacket tx ring (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+        error = tpacket_mmap_rx_tx_ring(rx->fd, netdev->tp_rx_ring,
+                                       netdev->tp_tx_ring);
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to mmap tpacket rx & tx ring (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+#endif
+
         val = 1;
         if (setsockopt(rx->fd, SOL_PACKET, PACKET_AUXDATA, &val, sizeof val)) {
             error = errno;
@@ -1129,7 +1343,12 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
 
         /* Bind to specific ethernet device. */
         memset(&sll, 0, sizeof sll);
-        sll.sll_family = AF_PACKET;
+        sll.sll_family = PF_PACKET;
+#ifdef HAVE_TPACKET
+        sll.sll_hatype = 0;
+        sll.sll_pkttype = 0;
+        sll.sll_halen = 0;
+#endif
         sll.sll_ifindex = ifindex;
         sll.sll_protocol = htons(ETH_P_ALL);
         if (bind(rx->fd, (struct sockaddr *) &sll, sizeof sll) < 0) {
@@ -1168,6 +1387,17 @@  netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     int i;
 
     if (!rx->is_tap) {
+#ifdef HAVE_TPACKET
+        struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
+
+        if (netdev->tp_rx_ring) {
+            munmap(netdev->tp_rx_ring->mm_space,
+                   2 * netdev->tp_rx_ring->mm_len);
+            free(netdev->tp_rx_ring->rd);
+            free(netdev->tp_tx_ring->rd);
+        }
+#endif
+
         close(rx->fd);
     }
 
@@ -1184,6 +1414,7 @@  netdev_linux_rxq_dealloc(struct netdev_rxq *rxq_)
     free(rx);
 }
 
+#ifndef HAVE_TPACKET
 static ovs_be16
 auxdata_to_vlan_tpid(const struct tpacket_auxdata *aux, bool double_tagged)
 {
@@ -1345,6 +1576,7 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
     return 0;
 }
 
+#else /* ifdef HAVE_TPACKET */
 /*
  * Receive packets from tap by batch process for better performance,
  * it can receive NETDEV_MAX_BURST packets at most once, the received
@@ -1428,6 +1660,125 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
 
     return 0;
 }
+static int
+netdev_linux_batch_recv_tpacket(struct netdev_rxq *rxq_, int mtu,
+                                struct dp_packet_batch *batch)
+{
+    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
+    struct dp_packet *buffer;
+    int i = 0;
+
+#ifdef HAVE_TPACKET_V3
+    unsigned int block_num;
+    unsigned int fn_in_block;
+    struct block_desc *pbd;
+    struct tpacket3_hdr *ppd;
+
+    ppd = (struct tpacket3_hdr *)netdev->tp_rx_ring->ppd;
+    block_num = netdev->tp_rx_ring->block_num;
+    fn_in_block = netdev->tp_rx_ring->frame_num_in_block;
+    pbd = (struct block_desc *) netdev->tp_rx_ring->rd[block_num].iov_base;
+#else
+#if defined(HAVE_TPACKET_V2)
+    struct tpacket2_hdr *ppd;
+#else
+    struct tpacket_hdr *ppd;
+#endif
+    unsigned int frame_num;
+    unsigned int frame_nr = netdev->tp_rx_ring->rd_num;
+
+    frame_num = netdev->tp_rx_ring->frame_num;
+#endif
+
+    while (i < NETDEV_MAX_BURST) {
+#ifdef HAVE_TPACKET_V3
+        if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
+            break;
+        }
+        if (fn_in_block == 0) {
+            ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+                                           pbd->h1.offset_to_first_pkt);
+        }
+#elif defined(HAVE_TPACKET_V2)
+        ppd = (struct tpacket2_hdr *)
+                  netdev->tp_rx_ring->rd[frame_num].iov_base;
+        if ((ppd->tp_status & TP_STATUS_USER) == 0) {
+            break;
+        }
+#else
+        ppd = (struct tpacket_hdr *)netdev->tp_rx_ring->rd[frame_num].iov_base;
+        if ((ppd->tp_status & TP_STATUS_USER) == 0) {
+            break;
+        }
+#endif
+
+        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
+                                             DP_NETDEV_HEADROOM);
+        memcpy(dp_packet_data(buffer),
+               (uint8_t *) ppd + ppd->tp_mac, ppd->tp_snaplen);
+        dp_packet_set_size(buffer,
+                           dp_packet_size(buffer) + ppd->tp_snaplen);
+#if defined(HAVE_TPACKET_V2) || defined(HAVE_TPACKET_V3)
+        if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
+            struct eth_header *eth;
+            bool double_tagged;
+            ovs_be16 vlan_tpid;
+
+            eth = dp_packet_data(buffer);
+            double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
+            if (ppd->tp_status & TP_STATUS_VLAN_TPID_VALID) {
+#ifdef HAVE_TPACKET_V3
+                vlan_tpid = htons(ppd->hv1.tp_vlan_tpid);
+#else
+                vlan_tpid = htons(ppd->tp_vlan_tpid);
+#endif
+            } else if (double_tagged) {
+                vlan_tpid = htons(ETH_TYPE_VLAN_8021AD);
+            } else {
+                vlan_tpid = htons(ETH_TYPE_VLAN_8021Q);
+            }
+#ifdef HAVE_TPACKET_V3
+            eth_push_vlan(buffer, vlan_tpid, htons(ppd->hv1.tp_vlan_tci));
+#else
+            eth_push_vlan(buffer, vlan_tpid, htons(ppd->tp_vlan_tci));
+#endif
+        }
+#endif
+        dp_packet_batch_add(batch, buffer);
+
+#ifdef HAVE_TPACKET_V3
+        fn_in_block++;
+        if (fn_in_block >= pbd->h1.num_pkts) {
+            pbd->h1.block_status = TP_STATUS_KERNEL;
+            block_num = (block_num + 1) %
+                            netdev->tp_rx_ring->req3.tp_block_nr;
+            pbd = (struct block_desc *)
+                     netdev->tp_rx_ring->rd[block_num].iov_base;
+            fn_in_block = 0;
+            ppd = NULL;
+        } else {
+            ppd = (struct tpacket3_hdr *)
+                   ((uint8_t *) ppd + ppd->tp_next_offset);
+        }
+#else
+        ppd->tp_status = TP_STATUS_KERNEL;
+        frame_num = (frame_num + 1) % frame_nr;
+#endif
+        i++;
+    }
+
+#ifdef HAVE_TPACKET_V3
+    netdev->tp_rx_ring->block_num = block_num;
+    netdev->tp_rx_ring->frame_num_in_block = fn_in_block;
+    netdev->tp_rx_ring->ppd = ppd;
+#else
+    netdev->tp_rx_ring->frame_num = frame_num;
+#endif
+
+    return 0;
+}
+#endif
 
 static int
 netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
@@ -1443,9 +1794,15 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     }
 
     dp_packet_batch_init(batch);
-    retval = (rx->is_tap
-              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
-              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
+    if (rx->is_tap) {
+        retval = netdev_linux_batch_rxq_recv_tap(rx, mtu, batch);
+    } else {
+#ifndef HAVE_TPACKET
+        retval = netdev_linux_batch_rxq_recv_sock(rx, mtu, batch);
+#else
+        retval = netdev_linux_batch_recv_tpacket(rxq_, mtu, batch);
+#endif
+    }
 
     if (retval) {
         if (retval != EAGAIN && retval != EMSGSIZE) {
@@ -1486,6 +1843,7 @@  netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
     }
 }
 
+#ifndef HAVE_TPACKET
 static int
 netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
                              struct dp_packet_batch *batch)
@@ -1531,6 +1889,7 @@  netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
     return error;
 }
 
+#else /* ifdef HAVE_TPACKET */
 /* Use the tap fd to send 'batch' to tap device 'netdev'.  Using the tap fd is
  * essential, because packets sent to a tap device with an AF_PACKET socket
  * will loop back to be *received* again on the tap device.  This doesn't occur
@@ -1650,6 +2009,114 @@  netdev_linux_get_numa_id(const struct netdev *netdev_)
     return numa_id;
 }
 
+static inline int
+tpacket_tx_is_ready(void * next_frame)
+{
+#ifdef HAVE_TPACKET_V3
+    struct tpacket3_hdr *hdr = (struct tpacket3_hdr *)next_frame;
+#elif defined(HAVE_TPACKET_V2)
+    struct tpacket2_hdr *hdr = (struct tpacket2_hdr *)next_frame;
+#else
+    struct tpacket_hdr *hdr = (struct tpacket_hdr *)next_frame;
+#endif
+    return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
+}
+
+static int
+netdev_linux_tpacket_batch_send(struct netdev *netdev_,
+                            struct dp_packet_batch *batch)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+    struct dp_packet *packet;
+    int sockfd;
+    ssize_t bytes_sent;
+    int total_pkts = 0;
+
+#ifdef HAVE_TPACKET_V3
+    unsigned int frame_nr = netdev->tp_tx_ring->req3.tp_frame_nr;
+#else
+    unsigned int frame_nr = netdev->tp_tx_ring->rd_num;
+#endif
+    unsigned int frame_num = netdev->tp_tx_ring->frame_num;
+
+    /* The Linux tap driver returns EIO if the device is not up,
+     * so if the device is not up, don't waste time sending it.
+     * However, if the device is in another network namespace
+     * then OVS can't retrieve the state. In that case, send the
+     * packets anyway. */
+    if (netdev->present && !(netdev->ifi_flags & IFF_UP)) {
+        netdev->tx_dropped += dp_packet_batch_size(batch);
+        return 0;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        union frame_map ppd;
+        size_t size = dp_packet_size(packet);
+#ifdef HAVE_TPACKET_V3
+        struct tpacket3_hdr *next_frame
+                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
+#elif defined(HAVE_TPACKET_V2)
+        struct tpacket2_hdr *next_frame
+                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
+#else
+        struct tpacket_hdr *next_frame
+                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
+#endif
+
+        ppd.raw = next_frame;
+        if (!tpacket_tx_is_ready(next_frame)) {
+            break;
+        }
+#ifdef HAVE_TPACKET_V3
+        next_frame->tp_snaplen = size;
+        next_frame->tp_len = size;
+        next_frame->tp_next_offset = 0;
+
+        memcpy((uint8_t *)ppd.raw + TPACKET3_HDRLEN
+                   - sizeof(struct sockaddr_ll),
+               dp_packet_data(packet),
+               size);
+#elif defined(HAVE_TPACKET_V2)
+        ppd.v2->tp_h.tp_snaplen = size;
+        ppd.v2->tp_h.tp_len = size;
+
+        memcpy((uint8_t *)ppd.raw + TPACKET2_HDRLEN
+                   - sizeof(struct sockaddr_ll),
+               dp_packet_data(packet),
+               size);
+#else
+        ppd.v1->tp_h.tp_snaplen = size;
+        ppd.v1->tp_h.tp_len = size;
+
+        memcpy((uint8_t *)ppd.raw + TPACKET_HDRLEN
+                   - sizeof(struct sockaddr_ll),
+               dp_packet_data(packet),
+               size);
+#endif
+        next_frame->tp_status = TP_STATUS_SEND_REQUEST;
+        frame_num = (frame_num + 1) % frame_nr;
+        total_pkts++;
+    }
+    netdev->tp_tx_ring->frame_num = frame_num;
+
+    /* kick-off transmits */
+    if (total_pkts != 0) {
+        sockfd = netdev->tp_tx_ring->sockfd;
+        bytes_sent = sendto(sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+        if (bytes_sent == -1 &&
+                errno != ENOBUFS && errno != EAGAIN) {
+            /*
+             * In case of an ENOBUFS/EAGAIN error all of the enqueued
+             * packets will be considered successful even though only some
+             * are sent.
+             */
+            netdev->tx_dropped += dp_packet_batch_size(batch);
+        }
+    }
+    return 0;
+}
+#endif
+
 /* Sends 'batch' on 'netdev'.  Returns 0 if successful, otherwise a positive
  * errno value.  Returns EAGAIN without blocking if the packet cannot be queued
  * immediately.  Returns EMSGSIZE if a partial packet was transmitted or if
@@ -1689,7 +2156,11 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
             goto free_batch;
         }
 
+#ifndef HAVE_TPACKET
         error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
+#else
+        error = netdev_linux_tpacket_batch_send(netdev_, batch);
+#endif
     } else {
         error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
     }
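
For reference, the ring geometry that tpacket_v1_v2_fill_ring() and tpacket_v3_fill_ring() set up in the patch above can be worked through with plain arithmetic. The sketch below is illustrative only: struct ring_geometry and ring_geometry_calc() are hypothetical names, and the page size and alignment are passed in as parameters so that the numbers can be checked without calling getpagesize() or including the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type, not part of the patch: the values that the patch
 * stores in struct tpacket_req / struct tpacket_req3 plus the mmap length. */
struct ring_geometry {
    uint32_t block_size;    /* Bytes per block. */
    uint32_t frame_size;    /* Bytes per frame. */
    uint32_t block_nr;      /* Number of blocks. */
    uint32_t frame_nr;      /* Total number of frames. */
    uint64_t mm_len;        /* Bytes to mmap for this ring. */
};

/* Mirrors the arithmetic in the patch: a block is four pages, a frame is
 * 128 alignment units, and the frame count follows from how many whole
 * frames fit in one block. */
static struct ring_geometry
ring_geometry_calc(uint32_t page_size, uint32_t alignment, uint32_t blocks)
{
    struct ring_geometry g;

    assert(page_size > 0 && alignment > 0);
    g.block_size = page_size << 2;
    g.frame_size = alignment << 7;
    g.block_nr = blocks;
    g.frame_nr = g.block_size / g.frame_size * g.block_nr;
    g.mm_len = (uint64_t) g.block_size * g.block_nr;
    return g;
}
```

With 4096-byte pages, TPACKET_ALIGNMENT of 16 and the 256 blocks chosen in tpacket_setup_ring(), this gives 16384-byte blocks, 2048-byte frames, 2048 frames and a 4 MiB mapping per ring, so the combined RX and TX mapping is 8 MiB per socket.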