[ovs-dev,v2] Use TPACKET_V3 to accelerate veth for userspace datapath

Message ID 20200207115043.26228-1-yang_y_yi@126.com
State Changes Requested
Series
  • [ovs-dev,v2] Use TPACKET_V3 to accelerate veth for userspace datapath

Commit Message

yang_y_yi@126.com Feb. 7, 2020, 11:50 a.m. UTC
From: Yi Yang <yangyi01@inspur.com>

We can avoid high system call overhead by using TPACKET_V3
and a DPDK-like poll to receive and send packets (note: send
still needs to call sendto to trigger the final packet
transmission).

TPACKET_V3 has been supported since Linux kernel 3.10, so all
the kernels that current OVS supports can run TPACKET_V3
without any problem.

With TPACKET_V3 I see about a 30% performance improvement for
veth compared to the previous recvmmsg optimization: about
1.98 Gbps, versus 1.47 Gbps before.

Note: TSO is not supported yet; that work is in progress.

Changelog:
- v1->v2
 * Remove TPACKET_V1 and TPACKET_V2, which are obsolete
 * Add include/linux/if_packet.h
 * Change include/sparse/linux/if_packet.h

Signed-off-by: Yi Yang <yangyi01@inspur.com>
Co-authored-by: William Tu <u9012063@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
---
 acinclude.m4                     |  12 ++
 configure.ac                     |   1 +
 include/linux/automake.mk        |   1 +
 include/linux/if_packet.h        | 122 +++++++++++++++
 include/sparse/linux/if_packet.h | 104 +++++++++++++
 lib/netdev-linux-private.h       |  24 +++
 lib/netdev-linux.c               | 327 ++++++++++++++++++++++++++++++++++++++-
 7 files changed, 586 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/if_packet.h

Comments

Ilya Maximets Feb. 7, 2020, 2:43 p.m. UTC | #1
On 2/7/20 12:50 PM, yang_y_yi@126.com wrote:
> From: Yi Yang <yangyi01@inspur.com>
> 
> We can avoid high system call overhead by using TPACKET_V3
> and using DPDK-like poll to receive and send packets (Note: send
> still needs to call sendto to trigger final packet transmission).
> 
> From Linux kernel 3.10 on, TPACKET_V3 has been supported,
> so all the Linux kernels current OVS supports can run
> TPACKET_V3 without any problem.
> 
> I can see about 30% performance improvement for veth compared to
> last recvmmsg optimization if I use TPACKET_V3, it is about 1.98
> Gbps, but it was 1.47 Gbps before.
> 
> Note: it can't support TSO which is in progress.

So, this patch effectively breaks TSO functionality at compile time,
i.e. it compiles out the TSO-capable function invocation.
I don't think that we should merge that. For this patch to be acceptable,
the tpacket implementation should support TSO, or it should be possible to
dynamically switch to the usual sendmmsg when TSO support is enabled.

NACK for this version. Will wait for v3.

> 
> Changelog:
> - v1->v2
>  * Remove TPACKET_V1 and TPACKET_V2, which are obsolete
>  * Add include/linux/if_packet.h
>  * Change include/sparse/linux/if_packet.h

Please place the change log under the '---'.  It should not be part
of the commit message.

> 
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> Co-authored-by: William Tu <u9012063@gmail.com>
> Signed-off-by: William Tu <u9012063@gmail.com>
William Tu Feb. 14, 2020, 12:38 a.m. UTC | #2
On Fri, Feb 7, 2020 at 6:43 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 2/7/20 12:50 PM, yang_y_yi@126.com wrote:
> > From: Yi Yang <yangyi01@inspur.com>
> >
> > We can avoid high system call overhead by using TPACKET_V3
> > and using DPDK-like poll to receive and send packets (Note: send
> > still needs to call sendto to trigger final packet transmission).
> >
> > From Linux kernel 3.10 on, TPACKET_V3 has been supported,
> > so all the Linux kernels current OVS supports can run
> > TPACKET_V3 without any problem.
> >
> > I can see about 30% performance improvement for veth compared to
> > last recvmmsg optimization if I use TPACKET_V3, it is about 1.98
> > Gbps, but it was 1.47 Gbps before.
> >
> > Note: it can't support TSO which is in progress.
>
> So, this patch effectively breaks TSO functionality in compile time,
> i.e. it compiles out the TSO capable function invocation.
> I don't think that we should merge that. For this patch to be acceptable,
> tpacket implementation should support TSO or it should be possible to
> dynamically switch to usual sendmmsg if we want to enable TSO support.
>
I think it's impossible to support tpacket + TSO, because tpacket
pre-allocates a ring buffer with a 2K buffer size, and each descriptor
can only point to one entry (if I understand correctly).

So I think we should dynamically switch back to sendmmsg when TSO is
enabled.

Regards,
William
Yi Yang (杨燚)-云服务集团 Feb. 14, 2020, 12:59 a.m. UTC | #3
No, the block size and frame size are defined by the user; you can specify any size, but the block size must be page-size aligned. Please read the v3 patch and try it in your environment.

William Tu Feb. 14, 2020, 1:55 a.m. UTC | #4
On Thu, Feb 13, 2020 at 5:00 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> No, block size and frame size are defined by user, you can specify any size, but block size must be pagesize aligned, please read v3 patch and try it in your environment.
>

Right, but how do we set the block size and frame size to
accommodate a 64K TSO packet?

William

Yi Yang (杨燚)-云服务集团 Feb. 14, 2020, 3:17 a.m. UTC | #5
The maximum packet size for TSO is 65535 plus the Ethernet header (plus the VLAN header, if present). For TSO, the frame size is set to 64K+4K (the extra 4K is for tpacket3_hdr), and the block size is the same as the frame size. tpacket send requires that a packet fit entirely inside one frame; a packet cannot cross a frame boundary.

Please read my code for details at https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367689.html. I have sent out the v3 patch, but there have been no comments so far.



Patch

diff --git a/acinclude.m4 b/acinclude.m4
index 1212a46..b39bbb9 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1093,6 +1093,18 @@  AC_DEFUN([OVS_CHECK_IF_DL],
       AC_SEARCH_LIBS([pcap_open_live], [pcap])
    fi])
 
+dnl OVS_CHECK_LINUX_TPACKET
+dnl
+dnl Configure Linux TPACKET.
+AC_DEFUN([OVS_CHECK_LINUX_TPACKET], [
+  AC_COMPILE_IFELSE([
+    AC_LANG_PROGRAM([#include <linux/if_packet.h>], [
+        struct tpacket3_hdr x =  { 0 };
+    ])],
+    [AC_DEFINE([HAVE_TPACKET_V3], [1],
+    [Define to 1 if struct tpacket3_hdr is available.])])
+])
+
 dnl Checks for buggy strtok_r.
 dnl
 dnl Some versions of glibc 2.7 has a bug in strtok_r when compiling
diff --git a/configure.ac b/configure.ac
index 1877aae..b61a1f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -89,6 +89,7 @@  OVS_CHECK_VISUAL_STUDIO_DDK
 OVS_CHECK_COVERAGE
 OVS_CHECK_NDEBUG
 OVS_CHECK_NETLINK
+OVS_CHECK_LINUX_TPACKET
 OVS_CHECK_OPENSSL
 OVS_CHECK_LIBCAPNG
 OVS_CHECK_LOGDIR
diff --git a/include/linux/automake.mk b/include/linux/automake.mk
index 8f063f4..a659e65 100644
--- a/include/linux/automake.mk
+++ b/include/linux/automake.mk
@@ -1,4 +1,5 @@ 
 noinst_HEADERS += \
+	include/linux/if_packet.h \
 	include/linux/netlink.h \
 	include/linux/netfilter/nf_conntrack_sctp.h \
 	include/linux/pkt_cls.h \
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
new file mode 100644
index 0000000..4864464
--- /dev/null
+++ b/include/linux/if_packet.h
@@ -0,0 +1,122 @@ 
+#ifndef __LINUX_IF_PACKET_WRAPPER_H
+#define __LINUX_IF_PACKET_WRAPPER_H 1
+
+#ifdef HAVE_TPACKET_V3
+#include_next <linux/if_packet.h>
+#else
+#define HAVE_TPACKET_V3 1
+
+struct sockaddr_pkt {
+        unsigned short spkt_family;
+        unsigned char spkt_device[14];
+        ovs_be16 spkt_protocol;
+};
+
+struct sockaddr_ll {
+        unsigned short  sll_family;
+        ovs_be16        sll_protocol;
+        int             sll_ifindex;
+        unsigned short  sll_hatype;
+        unsigned char   sll_pkttype;
+        unsigned char   sll_halen;
+        unsigned char   sll_addr[8];
+};
+
+/* Packet socket options */
+#define PACKET_RX_RING                  5
+#define PACKET_TX_RING                 13
+
+/* Rx ring - header status */
+#define TP_STATUS_KERNEL                0
+#define TP_STATUS_USER            (1 << 0)
+#define TP_STATUS_VLAN_VALID      (1 << 4) /* auxdata has valid tp_vlan_tci */
+#define TP_STATUS_VLAN_TPID_VALID (1 << 6) /* auxdata has valid tp_vlan_tpid */
+
+/* Tx ring - header status */
+#define TP_STATUS_SEND_REQUEST    (1 << 0)
+#define TP_STATUS_SENDING         (1 << 1)
+
+struct tpacket_hdr {
+    unsigned long tp_status;
+    unsigned int tp_len;
+    unsigned int tp_snaplen;
+    unsigned short tp_mac;
+    unsigned short tp_net;
+    unsigned int tp_sec;
+    unsigned int tp_usec;
+};
+
+#define TPACKET_ALIGNMENT 16
+#define TPACKET_ALIGN(x) (((x)+TPACKET_ALIGNMENT-1)&~(TPACKET_ALIGNMENT-1))
+
+struct tpacket_hdr_variant1 {
+    uint32_t tp_rxhash;
+    uint32_t tp_vlan_tci;
+    uint16_t tp_vlan_tpid;
+    uint16_t tp_padding;
+};
+
+struct tpacket3_hdr {
+    uint32_t  tp_next_offset;
+    uint32_t  tp_sec;
+    uint32_t  tp_nsec;
+    uint32_t  tp_snaplen;
+    uint32_t  tp_len;
+    uint32_t  tp_status;
+    uint16_t  tp_mac;
+    uint16_t  tp_net;
+    /* pkt_hdr variants */
+    union {
+        struct tpacket_hdr_variant1 hv1;
+    };
+    uint8_t  tp_padding[8];
+};
+
+struct tpacket_bd_ts {
+    unsigned int ts_sec;
+    union {
+        unsigned int ts_usec;
+        unsigned int ts_nsec;
+    };
+};
+
+struct tpacket_hdr_v1 {
+    uint32_t block_status;
+    uint32_t num_pkts;
+    uint32_t offset_to_first_pkt;
+    uint32_t blk_len;
+    uint64_t __attribute__((aligned(8))) seq_num;
+    struct tpacket_bd_ts ts_first_pkt, ts_last_pkt;
+};
+
+union tpacket_bd_header_u {
+    struct tpacket_hdr_v1 bh1;
+};
+
+struct tpacket_block_desc {
+    uint32_t version;
+    uint32_t offset_to_priv;
+    union tpacket_bd_header_u hdr;
+};
+
+#define TPACKET3_HDRLEN \
+    (TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
+
+enum tpacket_versions {
+    TPACKET_V1,
+    TPACKET_V2,
+    TPACKET_V3
+};
+
+struct tpacket_req3 {
+    unsigned int tp_block_size; /* Minimal size of contiguous block */
+    unsigned int tp_block_nr; /* Number of blocks */
+    unsigned int tp_frame_size; /* Size of frame */
+    unsigned int tp_frame_nr; /* Total number of frames */
+    unsigned int tp_retire_blk_tov; /* timeout in msecs */
+    unsigned int tp_sizeof_priv; /* offset to private data area */
+    unsigned int tp_feature_req_word;
+};
+
+#endif /* HAVE_TPACKET_V3 */
+#endif /* __LINUX_IF_PACKET_WRAPPER_H */
diff --git a/include/sparse/linux/if_packet.h b/include/sparse/linux/if_packet.h
index 5ff6d47..8a7c652 100644
--- a/include/sparse/linux/if_packet.h
+++ b/include/sparse/linux/if_packet.h
@@ -27,4 +27,108 @@  struct sockaddr_ll {
         unsigned char   sll_addr[8];
 };
 
+/* Packet socket options */
+#define PACKET_RX_RING                  5
+#define PACKET_TX_RING                 13
+
+/* Rx ring - header status */
+#define TP_STATUS_KERNEL                0
+#define TP_STATUS_USER            (1 << 0)
+#define TP_STATUS_VLAN_VALID      (1 << 4) /* auxdata has valid tp_vlan_tci */
+#define TP_STATUS_VLAN_TPID_VALID (1 << 6) /* auxdata has valid tp_vlan_tpid */
+
+/* Tx ring - header status */
+#define TP_STATUS_SEND_REQUEST    (1 << 0)
+#define TP_STATUS_SENDING         (1 << 1)
+
+#define tpacket_hdr rpl_tpacket_hdr
+struct tpacket_hdr {
+    unsigned long tp_status;
+    unsigned int tp_len;
+    unsigned int tp_snaplen;
+    unsigned short tp_mac;
+    unsigned short tp_net;
+    unsigned int tp_sec;
+    unsigned int tp_usec;
+};
+
+#define TPACKET_ALIGNMENT 16
+#define TPACKET_ALIGN(x) (((x)+TPACKET_ALIGNMENT-1)&~(TPACKET_ALIGNMENT-1))
+
+#define tpacket_hdr_variant1 rpl_tpacket_hdr_variant1
+struct tpacket_hdr_variant1 {
+    uint32_t tp_rxhash;
+    uint32_t tp_vlan_tci;
+    uint16_t tp_vlan_tpid;
+    uint16_t tp_padding;
+};
+
+#define tpacket3_hdr rpl_tpacket3_hdr
+struct tpacket3_hdr {
+    uint32_t  tp_next_offset;
+    uint32_t  tp_sec;
+    uint32_t  tp_nsec;
+    uint32_t  tp_snaplen;
+    uint32_t  tp_len;
+    uint32_t  tp_status;
+    uint16_t  tp_mac;
+    uint16_t  tp_net;
+    /* pkt_hdr variants */
+    union {
+        struct tpacket_hdr_variant1 hv1;
+    };
+    uint8_t  tp_padding[8];
+};
+
+#define tpacket_bd_ts rpl_tpacket_bd_ts
+struct tpacket_bd_ts {
+    unsigned int ts_sec;
+    union {
+        unsigned int ts_usec;
+        unsigned int ts_nsec;
+    };
+};
+
+#define tpacket_hdr_v1 rpl_tpacket_hdr_v1
+struct tpacket_hdr_v1 {
+    uint32_t block_status;
+    uint32_t num_pkts;
+    uint32_t offset_to_first_pkt;
+    uint32_t blk_len;
+    uint64_t __attribute__((aligned(8))) seq_num;
+    struct tpacket_bd_ts ts_first_pkt, ts_last_pkt;
+};
+
+#define tpacket_bd_header_u rpl_tpacket_bd_header_u
+union tpacket_bd_header_u {
+    struct tpacket_hdr_v1 bh1;
+};
+
+#define tpacket_block_desc rpl_tpacket_block_desc
+struct tpacket_block_desc {
+    uint32_t version;
+    uint32_t offset_to_priv;
+    union tpacket_bd_header_u hdr;
+};
+
+#define TPACKET3_HDRLEN \
+    (TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
+
+enum rpl_tpacket_versions {
+    TPACKET_V1,
+    TPACKET_V2,
+    TPACKET_V3
+};
+
+#define tpacket_req3 rpl_tpacket_req3
+struct tpacket_req3 {
+    unsigned int tp_block_size; /* Minimal size of contiguous block */
+    unsigned int tp_block_nr; /* Number of blocks */
+    unsigned int tp_frame_size; /* Size of frame */
+    unsigned int tp_frame_nr; /* Total number of frames */
+    unsigned int tp_retire_blk_tov; /* timeout in msecs */
+    unsigned int tp_sizeof_priv; /* offset to private data area */
+    unsigned int tp_feature_req_word;
+};
+
 #endif
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index c7c515f..bcbd00a 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -26,6 +26,9 @@ 
 #include <linux/mii.h>
 #include <stdint.h>
 #include <stdbool.h>
+#ifdef HAVE_TPACKET_V3
+#include <linux/if_packet.h>
+#endif
 
 #include "dp-packet.h"
 #include "netdev-afxdp.h"
@@ -41,6 +44,22 @@  struct netdev;
 /* The maximum packet length is 16 bits */
 #define LINUX_RXQ_TSO_MAX_LEN 65535
 
+#ifdef HAVE_TPACKET_V3
+struct tpacket_ring {
+    int sockfd;
+    struct iovec *rd;
+    uint8_t *mm_space;
+    size_t mm_len, rd_len;
+    struct sockaddr_ll ll;
+    int type, rd_num, flen;
+    struct tpacket_req3 req;
+    uint32_t block_num;
+    uint32_t frame_num;
+    uint32_t frame_num_in_block;
+    void * ppd;
+};
+#endif /* HAVE_TPACKET_V3 */
+
 struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
@@ -105,6 +124,11 @@  struct netdev_linux {
 
     int numa_id;                /* NUMA node id. */
 
+#ifdef HAVE_TPACKET_V3
+    struct tpacket_ring *tp_rx_ring;
+    struct tpacket_ring *tp_tx_ring;
+#endif
+
 #ifdef HAVE_AF_XDP
     /* AF_XDP information. */
     struct xsk_socket_info **xsks;
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index c6f3d27..e2b82d0 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -48,6 +48,9 @@ 
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#ifdef HAVE_TPACKET_V3
+#include <sys/mman.h>
+#endif
 
 #include "coverage.h"
 #include "dp-packet.h"
@@ -1055,6 +1058,94 @@  netdev_linux_rxq_alloc(void)
     return &rx->up;
 }
 
+#ifdef HAVE_TPACKET_V3
+static inline struct tpacket3_hdr *
+tpacket_get_next_frame(struct tpacket_ring *ring, uint32_t frame_num)
+{
+    uint8_t *f0 = ring->rd[0].iov_base;
+
+    return (struct tpacket3_hdr *)
+               (f0 + (frame_num * ring->req.tp_frame_size));
+}
+
+/*
+ * ring->rd_num is tp_block_nr, ring->flen is tp_block_size
+ */
+static inline void
+tpacket_fill_ring(struct tpacket_ring *ring, unsigned int blocks, int type)
+{
+    if (type == PACKET_RX_RING) {
+        ring->req.tp_retire_blk_tov = 0;
+        ring->req.tp_sizeof_priv = 0;
+        ring->req.tp_feature_req_word = 0;
+    }
+    ring->req.tp_block_size = getpagesize() << 2;
+    ring->req.tp_frame_size = TPACKET_ALIGNMENT << 7;
+    ring->req.tp_block_nr = blocks;
+
+    ring->req.tp_frame_nr = ring->req.tp_block_size /
+                             ring->req.tp_frame_size *
+                             ring->req.tp_block_nr;
+
+    ring->mm_len = ring->req.tp_block_size * ring->req.tp_block_nr;
+    ring->rd_num = ring->req.tp_block_nr;
+    ring->flen = ring->req.tp_block_size;
+}
+
+static int
+tpacket_setup_ring(int sock, struct tpacket_ring *ring, int type)
+{
+    int ret = 0;
+    unsigned int blocks = 256;
+
+    ring->type = type;
+    tpacket_fill_ring(ring, blocks, type);
+    ret = setsockopt(sock, SOL_PACKET, type, &ring->req,
+                     sizeof(ring->req));
+
+    if (ret == -1) {
+        return -1;
+    }
+
+    ring->rd_len = ring->rd_num * sizeof(*ring->rd);
+    ring->rd = xmalloc(ring->rd_len);
+    if (ring->rd == NULL) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static inline int
+tpacket_mmap_rx_tx_ring(int sock, struct tpacket_ring *rx_ring,
+                struct tpacket_ring *tx_ring)
+{
+    int i;
+
+    rx_ring->mm_space = mmap(0, rx_ring->mm_len + tx_ring->mm_len,
+                          PROT_READ | PROT_WRITE,
+                          MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock, 0);
+    if (rx_ring->mm_space == MAP_FAILED) {
+        return -1;
+    }
+
+    memset(rx_ring->rd, 0, rx_ring->rd_len);
+    for (i = 0; i < rx_ring->rd_num; ++i) {
+            rx_ring->rd[i].iov_base = rx_ring->mm_space + (i * rx_ring->flen);
+            rx_ring->rd[i].iov_len = rx_ring->flen;
+    }
+
+    tx_ring->mm_space = rx_ring->mm_space + rx_ring->mm_len;
+    memset(tx_ring->rd, 0, tx_ring->rd_len);
+    for (i = 0; i < tx_ring->rd_num; ++i) {
+            tx_ring->rd[i].iov_base = tx_ring->mm_space + (i * tx_ring->flen);
+            tx_ring->rd[i].iov_len = tx_ring->flen;
+    }
+
+    return 0;
+}
+#endif
+
 static int
 netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
 {
@@ -1062,6 +1153,9 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     struct netdev *netdev_ = rx->up.netdev;
     struct netdev_linux *netdev = netdev_linux_cast(netdev_);
     int error;
+#ifdef HAVE_TPACKET_V3
+    int ver = TPACKET_V3;
+#endif
 
     ovs_mutex_lock(&netdev->mutex);
     rx->is_tap = is_tap_netdev(netdev_);
@@ -1070,6 +1164,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     } else {
         struct sockaddr_ll sll;
         int ifindex, val;
+
         /* Result of tcpdump -dd inbound */
         static const struct sock_filter filt[] = {
             { 0x28, 0, 0, 0xfffff004 }, /* ldh [0] */
@@ -1082,7 +1177,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
         };
 
         /* Create file descriptor. */
-        rx->fd = socket(PF_PACKET, SOCK_RAW, 0);
+        rx->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
         if (rx->fd < 0) {
             error = errno;
             VLOG_ERR("failed to create raw socket (%s)", ovs_strerror(error));
@@ -1106,6 +1201,49 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
             goto error;
         }
 
+#ifdef HAVE_TPACKET_V3
+        /* TPACKET_V3 ring setup must be after setsockopt
+         * PACKET_VNET_HDR because PACKET_VNET_HDR will return error
+         * (EBUSY) if ring is set up
+         */
+        error = setsockopt(rx->fd, SOL_PACKET, PACKET_VERSION, &ver,
+                           sizeof(ver));
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to set tpacket version (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+        netdev->tp_rx_ring = xzalloc(sizeof(struct tpacket_ring));
+        netdev->tp_tx_ring = xzalloc(sizeof(struct tpacket_ring));
+        netdev->tp_rx_ring->sockfd = rx->fd;
+        netdev->tp_tx_ring->sockfd = rx->fd;
+        error = tpacket_setup_ring(rx->fd, netdev->tp_rx_ring,
+                                   PACKET_RX_RING);
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to set tpacket rx ring (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+        error = tpacket_setup_ring(rx->fd, netdev->tp_tx_ring,
+                                   PACKET_TX_RING);
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to set tpacket tx ring (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+        error = tpacket_mmap_rx_tx_ring(rx->fd, netdev->tp_rx_ring,
+                                       netdev->tp_tx_ring);
+        if (error != 0) {
+            error = errno;
+            VLOG_ERR("%s: failed to mmap tpacket rx & tx ring (%s)",
+                     netdev_get_name(netdev_), ovs_strerror(error));
+            goto error;
+        }
+#endif
+
         /* Set non-blocking mode. */
         error = set_nonblocking(rx->fd);
         if (error) {
@@ -1120,7 +1258,12 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
 
         /* Bind to specific ethernet device. */
         memset(&sll, 0, sizeof sll);
-        sll.sll_family = AF_PACKET;
+        sll.sll_family = PF_PACKET;
+#ifdef HAVE_TPACKET_V3
+        sll.sll_hatype = 0;
+        sll.sll_pkttype = 0;
+        sll.sll_halen = 0;
+#endif
         sll.sll_ifindex = ifindex;
         sll.sll_protocol = htons(ETH_P_ALL);
         if (bind(rx->fd, (struct sockaddr *) &sll, sizeof sll) < 0) {
@@ -1159,6 +1302,17 @@  netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     int i;
 
     if (!rx->is_tap) {
+#ifdef HAVE_TPACKET_V3
+        struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
+
+        if (netdev->tp_rx_ring) {
+            munmap(netdev->tp_rx_ring->mm_space,
+                   2 * netdev->tp_rx_ring->mm_len);
+            free(netdev->tp_rx_ring->rd);
+            free(netdev->tp_tx_ring->rd);
+        }
+#endif
+
         close(rx->fd);
     }
 
@@ -1175,6 +1329,7 @@  netdev_linux_rxq_dealloc(struct netdev_rxq *rxq_)
     free(rx);
 }
 
+#ifndef HAVE_TPACKET_V3
 static ovs_be16
 auxdata_to_vlan_tpid(const struct tpacket_auxdata *aux, bool double_tagged)
 {
@@ -1342,6 +1497,7 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
 
     return 0;
 }
+#endif /* HAVE_TPACKET_V3 */
 
 /*
  * Receive packets from tap by batch process for better performance,
@@ -1435,6 +1591,84 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
     return 0;
 }
 
+#ifdef HAVE_TPACKET_V3
+static int
+netdev_linux_batch_recv_tpacket(struct netdev_rxq_linux *rx, int mtu,
+                                struct dp_packet_batch *batch)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
+    struct dp_packet *buffer;
+    int i = 0;
+    unsigned int block_num;
+    unsigned int fn_in_block;
+    struct tpacket_block_desc *pbd;
+    struct tpacket3_hdr *ppd;
+
+    ppd = (struct tpacket3_hdr *)netdev->tp_rx_ring->ppd;
+    block_num = netdev->tp_rx_ring->block_num;
+    fn_in_block = netdev->tp_rx_ring->frame_num_in_block;
+    pbd = (struct tpacket_block_desc *)
+              netdev->tp_rx_ring->rd[block_num].iov_base;
+
+    while (i < NETDEV_MAX_BURST) {
+        if ((pbd->hdr.bh1.block_status & TP_STATUS_USER) == 0) {
+            break;
+        }
+        if (fn_in_block == 0) {
+            ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+                                           pbd->hdr.bh1.offset_to_first_pkt);
+        }
+
+        buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
+                                             DP_NETDEV_HEADROOM);
+        memcpy(dp_packet_data(buffer),
+               (uint8_t *) ppd + ppd->tp_mac, ppd->tp_snaplen);
+        dp_packet_set_size(buffer,
+                           dp_packet_size(buffer) + ppd->tp_snaplen);
+
+        if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
+            struct eth_header *eth;
+            bool double_tagged;
+            ovs_be16 vlan_tpid;
+
+            eth = dp_packet_data(buffer);
+            double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
+            if (ppd->tp_status & TP_STATUS_VLAN_TPID_VALID) {
+                vlan_tpid = htons(ppd->hv1.tp_vlan_tpid);
+            } else if (double_tagged) {
+                vlan_tpid = htons(ETH_TYPE_VLAN_8021AD);
+            } else {
+                vlan_tpid = htons(ETH_TYPE_VLAN_8021Q);
+            }
+            eth_push_vlan(buffer, vlan_tpid, htons(ppd->hv1.tp_vlan_tci));
+        }
+
+        dp_packet_batch_add(batch, buffer);
+
+        fn_in_block++;
+        if (fn_in_block >= pbd->hdr.bh1.num_pkts) {
+            pbd->hdr.bh1.block_status = TP_STATUS_KERNEL;
+            block_num = (block_num + 1) %
+                            netdev->tp_rx_ring->req.tp_block_nr;
+            pbd = (struct tpacket_block_desc *)
+                     netdev->tp_rx_ring->rd[block_num].iov_base;
+            fn_in_block = 0;
+            ppd = NULL;
+        } else {
+            ppd = (struct tpacket3_hdr *)
+                   ((uint8_t *) ppd + ppd->tp_next_offset);
+        }
+        i++;
+    }
+
+    netdev->tp_rx_ring->block_num = block_num;
+    netdev->tp_rx_ring->frame_num_in_block = fn_in_block;
+    netdev->tp_rx_ring->ppd = ppd;
+
+    return 0;
+}
+#endif /* HAVE_TPACKET_V3 */
+
 static int
 netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
                       int *qfill)
@@ -1466,9 +1700,15 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     }
 
     dp_packet_batch_init(batch);
-    retval = (rx->is_tap
-              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
-              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
+    if (rx->is_tap) {
+        retval = netdev_linux_batch_rxq_recv_tap(rx, mtu, batch);
+    } else {
+#ifndef HAVE_TPACKET_V3
+        retval = netdev_linux_batch_rxq_recv_sock(rx, mtu, batch);
+#else
+        retval = netdev_linux_batch_recv_tpacket(rx, mtu, batch);
+#endif
+    }
 
     if (retval) {
         if (retval != EAGAIN && retval != EMSGSIZE) {
@@ -1509,6 +1749,7 @@  netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
     }
 }
 
+#ifndef HAVE_TPACKET_V3
 static int
 netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
                              struct dp_packet_batch *batch)
@@ -1553,6 +1794,7 @@  netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
     free(iov);
     return error;
 }
+#endif /* HAVE_TPACKET_V3 */
 
 /* Use the tap fd to send 'batch' to tap device 'netdev'.  Using the tap fd is
  * essential, because packets sent to a tap device with an AF_PACKET socket
@@ -1673,6 +1915,77 @@  netdev_linux_get_numa_id(const struct netdev *netdev_)
     return numa_id;
 }
 
+#ifdef HAVE_TPACKET_V3
+static inline int
+tpacket_tx_is_ready(void *next_frame)
+{
+    struct tpacket3_hdr *hdr = (struct tpacket3_hdr *)next_frame;
+
+    return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
+}
+
+static int
+netdev_linux_tpacket_batch_send(struct netdev *netdev_,
+                                struct dp_packet_batch *batch)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+    struct dp_packet *packet;
+    int sockfd;
+    ssize_t bytes_sent;
+    int total_pkts = 0;
+
+    unsigned int frame_nr = netdev->tp_tx_ring->req.tp_frame_nr;
+    unsigned int frame_num = netdev->tp_tx_ring->frame_num;
+
+    /* The Linux tap driver returns EIO if the device is not up,
+     * so if the device is not up, don't waste time sending it.
+     * However, if the device is in another network namespace
+     * then OVS can't retrieve the state. In that case, send the
+     * packets anyway. */
+    if (netdev->present && !(netdev->ifi_flags & IFF_UP)) {
+        netdev->tx_dropped += dp_packet_batch_size(batch);
+        return 0;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        size_t size = dp_packet_size(packet);
+        struct tpacket3_hdr *ppd
+                    = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
+
+        if (!tpacket_tx_is_ready(ppd)) {
+            break;
+        }
+        ppd->tp_snaplen = size;
+        ppd->tp_len = size;
+        ppd->tp_next_offset = 0;
+
+        memcpy((uint8_t *)ppd + TPACKET3_HDRLEN - sizeof(struct sockaddr_ll),
+               dp_packet_data(packet),
+               size);
+        ppd->tp_status = TP_STATUS_SEND_REQUEST;
+        frame_num = (frame_num + 1) % frame_nr;
+        total_pkts++;
+    }
+    netdev->tp_tx_ring->frame_num = frame_num;
+
+    /* Kick off the actual transmission. */
+    if (total_pkts != 0) {
+        sockfd = netdev->tp_tx_ring->sockfd;
+        bytes_sent = sendto(sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+        if (bytes_sent == -1 &&
+                errno != ENOBUFS && errno != EAGAIN) {
+            /*
+             * ENOBUFS and EAGAIN are not counted as drops: the enqueued
+             * frames remain in the TX ring, so the packets are treated
+             * as successfully sent even though only some of them may
+             * actually have gone out.  Any other sendto() error drops
+             * the whole batch.
+             */
+            netdev->tx_dropped += dp_packet_batch_size(batch);
+        }
+    }
+    return 0;
+}
+#endif /* HAVE_TPACKET_V3 */
+
 /* Sends 'batch' on 'netdev'.  Returns 0 if successful, otherwise a positive
  * errno value.  Returns EAGAIN without blocking if the packet cannot be queued
  * immediately.  Returns EMSGSIZE if a partial packet was transmitted or if
@@ -1712,7 +2025,11 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
             goto free_batch;
         }
 
+#ifndef HAVE_TPACKET_V3
         error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
+#else
+        error = netdev_linux_tpacket_batch_send(netdev_, batch);
+#endif
     } else {
         error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
     }