
[ovs-dev,RFC] WIP: netdev-tpacket: Add AF_PACKET v3 support.

Message ID 1576802485-15017-1-git-send-email-u9012063@gmail.com
State RFC
Series [ovs-dev,RFC] WIP: netdev-tpacket: Add AF_PACKET v3 support.

Commit Message

William Tu Dec. 20, 2019, 12:41 a.m. UTC
Currently the performance of sending packets from userspace
OVS to a kernel veth device is pretty bad, as reported by YiYang[1].
The patch adds AF_PACKET v3, tpacket v3, as another way to
tx/rx packets to a Linux device, hopefully showing better performance.

AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However,
my current patch using iperf TCP shows only 1.4 Gbps, so maybe
I'm doing something wrong.  DPDK also has a similar implementation
using AF_PACKET v2[3].  This is still work-in-progress, but any
feedback is welcome.

[1] https://patchwork.ozlabs.org/patch/1204939/
[2] slide 18, https://www.netdevconf.info/2.2/slides/karlsson-afpacket-talk.pdf
[3] dpdk/drivers/net/af_packet/rte_eth_af_packet.c
---
 lib/automake.mk            |   2 +
 lib/netdev-linux-private.h |  23 +++
 lib/netdev-linux.c         |  24 ++-
 lib/netdev-provider.h      |   1 +
 lib/netdev-tpacket.c       | 487 +++++++++++++++++++++++++++++++++++++++++++++
 lib/netdev-tpacket.h       |  43 ++++
 lib/netdev.c               |   1 +
 7 files changed, 580 insertions(+), 1 deletion(-)
 create mode 100644 lib/netdev-tpacket.c
 create mode 100644 lib/netdev-tpacket.h

Comments

Ben Pfaff Dec. 20, 2019, 4:44 a.m. UTC | #1
On Thu, Dec 19, 2019 at 04:41:25PM -0800, William Tu wrote:
> Currently the performance of sending packets from userspace
> OVS to a kernel veth device is pretty bad, as reported by YiYang[1].
> The patch adds AF_PACKET v3, tpacket v3, as another way to
> tx/rx packets to a Linux device, hopefully showing better performance.
>
> AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However,
> my current patch using iperf TCP shows only 1.4 Gbps, so maybe
> I'm doing something wrong.  DPDK also has a similar implementation
> using AF_PACKET v2[3].  This is still work-in-progress, but any
> feedback is welcome.

Is there a good reason that this is implemented as a new kind of netdev
rather than just a new way for the existing netdev implementation to do
packet i/o?
William Tu Dec. 20, 2019, 5:42 p.m. UTC | #2
On Fri, Dec 20, 2019 at 06:09:08AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Hi, William
> 
> What kernel version can support AF_PACKET v3? I can try it with your patch.

Hi Yiyang,

Kernel 4.0+ should have v3 support.
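
If you want to check at runtime instead of by kernel version, a minimal
probe is to create an AF_PACKET socket and try to switch it to
TPACKET_V3 (just a sketch, not part of the patch; it needs CAP_NET_RAW,
and an unsupported version fails with EINVAL):

#include <linux/if_packet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int ver = TPACKET_V3;
    int fd = socket(AF_PACKET, SOCK_RAW, 0);  /* Protocol 0: probe only. */

    if (fd < 0) {
        perror("socket");                     /* Needs CAP_NET_RAW. */
        return 1;
    }
    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof ver)) {
        perror("TPACKET_V3");                 /* EINVAL on old kernels. */
        close(fd);
        return 1;
    }
    puts("TPACKET_V3 supported");
    close(fd);
    return 0;
}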

I'm also reading this doc:
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt

-------------------------------------------------------------------------------
+ AF_PACKET TPACKET_V3 example
-------------------------------------------------------------------------------

AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks, where polling
works on a per-block basis instead of per-ring as in TPACKET_V2 and its
predecessors.

It is said that TPACKET_V3 brings the following benefits:
 *) ~15 - 20% reduction in CPU-usage
 *) ~20% increase in packet capture rate
 *) ~2x increase in packet density
 *) Port aggregation analysis
 *) Non static frame size to capture entire packet payload

So it seems to be a good candidate to be used with packet fanout.

The DPDK library is using TPACKET_V2, and V3 is better due to the following (see the sketch after this list):
TPACKET_V2 --> TPACKET_V3:
	- Flexible buffer implementation for RX_RING:
		1. Blocks can be configured with non-static frame-size
		2. Read/poll is at a block-level (as opposed to packet-level)
		3. Added poll timeout to avoid indefinite user-space wait
		   on idle links
		4. Added user-configurable knobs:
			4.1 block::timeout
			4.2 tpkt_hdr::sk_rxhash
	- RX Hash data available in user space
	- TX_RING semantics are conceptually similar to TPACKET_V2;
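
To make the block-level receive model concrete, the per-block step
boils down to roughly this (a simplified sketch of
netdev_tpacket_rxq_recv() in the patch below; walk_block and process
are illustrative names, and ring setup, batching, and error handling
are omitted):

#include <linux/if_packet.h>
#include <stdint.h>

static void
walk_block(struct tpacket_block_desc *desc)
{
    if (!(desc->hdr.bh1.block_status & TP_STATUS_USER)) {
        return;                 /* Block is still owned by the kernel. */
    }

    struct tpacket3_hdr *hdr = (struct tpacket3_hdr *)
        ((char *) desc + desc->hdr.bh1.offset_to_first_pkt);
    uint32_t i;

    /* One block may carry many variable-sized frames. */
    for (i = 0; i < desc->hdr.bh1.num_pkts; i++) {
        /* process((char *) hdr + hdr->tp_mac, hdr->tp_snaplen); */
        hdr = (struct tpacket3_hdr *) ((char *) hdr + hdr->tp_next_offset);
    }

    __sync_synchronize();       /* Finish reading before the handoff. */
    desc->hdr.bh1.block_status = TP_STATUS_KERNEL;  /* Return the block. */
}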

Thanks
William

William Tu Dec. 20, 2019, 5:49 p.m. UTC | #3
On Thu, Dec 19, 2019 at 08:44:30PM -0800, Ben Pfaff wrote:
> On Thu, Dec 19, 2019 at 04:41:25PM -0800, William Tu wrote:
> > Currently the performance of sending packets from userspace
> > OVS to a kernel veth device is pretty bad, as reported by YiYang[1].
> > The patch adds AF_PACKET v3, tpacket v3, as another way to
> > tx/rx packets to a Linux device, hopefully showing better performance.
> >
> > AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However,
> > my current patch using iperf TCP shows only 1.4 Gbps, so maybe
> > I'm doing something wrong.  DPDK also has a similar implementation
> > using AF_PACKET v2[3].  This is still work-in-progress, but any
> > feedback is welcome.
> 
> Is there a good reason that this is implemented as a new kind of netdev
> rather than just a new way for the existing netdev implementation to do
> packet i/o?

AF_PACKET v3 is more like a PMD mode driver (like netdev-afxdp and
the other DPDK netdevs): it has its own memory management and ring
structure and polls the descriptors. So I implemented it as a new kind.
I feel it's pretty different from the tap or existing af_packet netdevs.

But integrating it into the existing netdev (lib/netdev-linux.c) is also OK.

William
Ben Pfaff Dec. 20, 2019, 9:14 p.m. UTC | #4
On Fri, Dec 20, 2019 at 09:49:44AM -0800, William Tu wrote:
> On Thu, Dec 19, 2019 at 08:44:30PM -0800, Ben Pfaff wrote:
> > On Thu, Dec 19, 2019 at 04:41:25PM -0800, William Tu wrote:
> > > Currently the performance of sending packets from userspace
> > > OVS to a kernel veth device is pretty bad, as reported by YiYang[1].
> > > The patch adds AF_PACKET v3, tpacket v3, as another way to
> > > tx/rx packets to a Linux device, hopefully showing better performance.
> > >
> > > AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However,
> > > my current patch using iperf TCP shows only 1.4 Gbps, so maybe
> > > I'm doing something wrong.  DPDK also has a similar implementation
> > > using AF_PACKET v2[3].  This is still work-in-progress, but any
> > > feedback is welcome.
> > 
> > Is there a good reason that this is implemented as a new kind of netdev
> > rather than just a new way for the existing netdev implementation to do
> > packet i/o?
> 
> AF_PACKET v3 is more like a PMD mode driver (like netdev-afxdp and
> the other DPDK netdevs): it has its own memory management and ring
> structure and polls the descriptors. So I implemented it as a new kind.
> I feel it's pretty different from the tap or existing af_packet netdevs.
>
> But integrating it into the existing netdev (lib/netdev-linux.c) is also OK.

Do you think it's sufficiently different from a user's point of view?  I
think that's probably an important point of view here.  It's great if
the user can just suddenly get better performance without having to do
anything else.

On the other hand, if the user might need to know that tpacket is in
use, like maybe if there is some downside or tradeoff to using it (for
example, it needs a ring--does that use a lot of memory and would that
be regrettable sometimes?), then that argues toward making it
configurable.

You say it's more like afxdp.  Maybe it should be a fallback for afxdp,
so that if afxdp isn't available for some reason then it automatically
uses tpacket itself.  Or maybe that's a ridiculous idea for some
reason, I don't know.

Do you think it is likely that a system supports tpacket but not afxdp?

Thanks,

Ben.
William Tu Dec. 21, 2019, 12:42 a.m. UTC | #5
On Fri, Dec 20, 2019 at 01:14:37PM -0800, Ben Pfaff wrote:
> On Fri, Dec 20, 2019 at 09:49:44AM -0800, William Tu wrote:
> > On Thu, Dec 19, 2019 at 08:44:30PM -0800, Ben Pfaff wrote:
> > > On Thu, Dec 19, 2019 at 04:41:25PM -0800, William Tu wrote:
> > > > Currently the performance of sending packets from userspace
> > > > OVS to a kernel veth device is pretty bad, as reported by YiYang[1].
> > > > The patch adds AF_PACKET v3, tpacket v3, as another way to
> > > > tx/rx packets to a Linux device, hopefully showing better performance.
> > > >
> > > > AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However,
> > > > my current patch using iperf TCP shows only 1.4 Gbps, so maybe
> > > > I'm doing something wrong.  DPDK also has a similar implementation
> > > > using AF_PACKET v2[3].  This is still work-in-progress, but any
> > > > feedback is welcome.
> > > 
> > > Is there a good reason that this is implemented as a new kind of netdev
> > > rather than just a new way for the existing netdev implementation to do
> > > packet i/o?
> > 
> > AF_PACKET v3 is more like a PMD mode driver (like netdev-afxdp and
> > the other DPDK netdevs): it has its own memory management and ring
> > structure and polls the descriptors. So I implemented it as a new kind.
> > I feel it's pretty different from the tap or existing af_packet netdevs.
> >
> > But integrating it into the existing netdev (lib/netdev-linux.c) is also OK.
> 
> Do you think it's sufficiently different from a user's point of view?  I
> think that's probably an important point of view here.  It's great if
> the user can just suddenly get better performance without having to do
> anything else.
> 
> On the other hand, if the user might need to know that tpacket is in
> use, like maybe if there is some downside or tradeoff to using it (for
> example, it needs a ring--does that use a lot of memory and would that
> be regrettable sometimes?), then that argues toward making it
> configurable.

Good point.
Let me spend more time on optimizing the patch's performance so I can
better answer this question.

> 
> You say it's more like afxdp.  Maybe it should be a fallback for afxdp,
> so that if afxdp isn't available for some reason then it automatically
> uses tpacket itself.  Or maybe that's a ridiculous idea for some
> reason, I don't know.
> 
> Do you think it is likely that a system supports tpacket but not afxdp?

It's possible; before the 5.0 kernel there is no afxdp,
so tpacket could be a good fallback case.
But with this patch the performance differs a lot.

William
Yi Yang (杨燚)-云服务集团 Dec. 23, 2019, 12:29 a.m. UTC | #6
Thanks William, https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt is a very good document for TPACKET_V*. I completely agree TPACKET_V3 is the best way to improve tap and veth performance. Can you share with us how to use your patch? lib/netdev-linux.c is still there; which recv function will be called when I add a veth/tap in OVS DPDK?

William Tu Dec. 23, 2019, 11 p.m. UTC | #7
On Mon, Dec 23, 2019 at 12:29:25AM +0000, Yi Yang (杨燚)-云服务集团 wrote:
> Thanks William, https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt is a very good document for TPACKET_V*. I completely agree TPACKET_V3 is the best way to improve tap and veth performance. Can you share with us how to use your patch? lib/netdev-linux.c is still there; which recv function will be called when I add a veth/tap in OVS DPDK?
> 

Hi Yiyang,

To use my patch, simply add a port (e.g., a veth) by doing:
ovs-vsctl add-port br0 afxdp-p0
ovs-vsctl -- set interface afxdp-p0 options:n_rxq=1 type="tpacket"

btw, this is not using OVS-DPDK.

--William
William Tu Dec. 24, 2019, 12:16 a.m. UTC | #8
On Sun, Dec 22, 2019 at 4:35 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Thanks William, af_packet can only open a tap interface; it can't create a
> tap interface. A tap interface can only be created in the following way:
>
> ovs-vsctl add-port tapX -- set interface tapX type=internal
>
> This tap is very special; it is like a mystery to me so far. "ip tuntap add
> tapX mode tap" doesn't work for such a tap interface.

Why not? What's the error message?
You can create a tapX device using ip tuntap first,
and then add tapX to OVS

using ovs-vsctl add-port br0 tapX -- set interface tapX type=afxdp

Regards,
William
>
> Can anybody tell me how I can create such a tap interface without using
> "ovs-vsctl add-port tapX"?
>
> By the way, I tried af_packet for veth; the performance is very good, about
> 4 Gbps on my machine, but it used TPACKET_V2.
Yi Yang (杨燚)-云服务集团 Dec. 24, 2019, 1:22 a.m. UTC | #9
William, maybe you don't know that the kind of tap interface you're talking about can only be used for a VM; that is why Open vSwitch has to introduce the internal type for the case I'm describing.

In the OVS DPDK case, if you create the interface below, it is a tap interface.

ovs-vsctl add-port tapX -- set interface type=internal

It won't work if you create the tap interface in the following way:

ip tuntap add tapX mode tap
ovs-vsctl add-port br-int tapX

I have tried af_packet for it and it doesn't work. I don't think af_xdp can work for such a tap either; maybe you can double-check this.
William Tu Jan. 2, 2020, 11:54 p.m. UTC | #10
On Mon, Dec 23, 2019 at 5:22 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> William, maybe you don't know that the kind of tap interface you're talking about can only be used for a VM; that is why Open vSwitch has to introduce the internal type for the case I'm describing.
>
> In the OVS DPDK case, if you create the interface below, it is a tap interface.
>
> ovs-vsctl add-port tapX -- set interface type=internal
>
> It won't work if you create the tap interface in the following way:
>
> ip tuntap add tapX mode tap
> ovs-vsctl add-port br-int tapX
>
Hi Yi,

I think this is mentioned in Documentation/faq/issues.rst,
see
Q: I created a tap device tap0, configured an IP address on it, and added it to
a bridge, like this::

    $ tunctl -t tap0
    $ ip addr add 192.168.0.123/24 dev tap0
    $ ip link set tap0 up
    $ ovs-vsctl add-br br0
    $ ovs-vsctl add-port br0 tap0

I expected that I could then use this IP address to contact other hosts on the
network, but it doesn't work.  Why not?
...

Regards,
William
Ilya Maximets Jan. 3, 2020, 1:28 p.m. UTC | #11
On 03.01.2020 00:54, William Tu wrote:
> On Mon, Dec 23, 2019 at 5:22 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>>
>> William, maybe you don't know that the kind of tap interface you're talking about can only be used for a VM; that is why Open vSwitch has to introduce the internal type for the case I'm describing.

There is no such thing as "can only be used for a VM".
QEMU creates a usual tap interface, and OVS can open
it with an AF_PACKET/XDP socket as usual.  See below.

>>
>> In the OVS DPDK case, if you create the interface below, it is a tap interface.
>>
>> ovs-vsctl add-port tapX -- set interface type=internal
>>
>> It won't work if you create the tap interface in the following way:
>>
>> ip tuntap add tapX mode tap
>> ovs-vsctl add-port br-int tapX
>>
> Hi Yi,
> 
> I think this is mentioned in Documentation/faq/issues.rst,
> see
> Q: I created a tap device tap0, configured an IP address on it, and added it to
> a bridge, like this::
> 
>     $ tunctl -t tap0
>     $ ip addr add 192.168.0.123/24 dev tap0
>     $ ip link set tap0 up
>     $ ovs-vsctl add-br br0
>     $ ovs-vsctl add-port br0 tap0
> 
> I expected that I could then use this IP address to contact other hosts on the
> network, but it doesn't work.  Why not?

You're doing something really strange here.  A TUN/TAP interface
is a way to communicate between the userspace application that creates
it and the kernel network stack.  In the command sequence above there is
no application that actually listens on the other side of this
tap0 network interface, so it's effectively in the DOWN state and
cannot be used.  'tunctl' just allows you to create a persistent
tap interface that will survive a restart of the owning application
without losing its IP address and other configuration.

  $ tunctl -t tap0
  $ ip addr add 192.168.0.123/24 dev tap0
  $ ip link set tap0 up
  $ ip link show tap0
    5: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
       state DOWN mode DEFAULT group default qlen 1000
       link/ether 82:51:60:3a:08:e0 brd ff:ff:ff:ff:ff:ff

To make it work, some application needs to open /dev/net/tun and
perform a TUNSETIFF ioctl to open this or create a new tap interface.
OVS will not do that; it will just open a usual AF_PACKET socket
that will try to get packets from the DOWN kernel interface (interface
type=system by default).
The same will happen if you try to open it with type=afxdp.
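
For illustration, the "owning application" side is just this (a minimal
sketch; "tap0" is an example name and error handling is trimmed):

#include <fcntl.h>
#include <linux/if_tun.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0) {
        perror("/dev/net/tun");
        return 1;
    }

    memset(&ifr, 0, sizeof ifr);
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;   /* Ethernet frames, no header. */
    strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {  /* Attach to or create tap0. */
        perror("TUNSETIFF");
        return 1;
    }

    /* While this fd stays open, tap0 has a carrier; read()/write() on
     * the fd exchange frames with the kernel network stack. */
    pause();
    return 0;
}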

OVS will open and configure the tap interface correctly only if you
provide type=tap.  In this case, OVS will open /dev/net/tun and
perform a TUNSETIFF ioctl to open a persistent or create a new tap
interface. The retrieved tap_fd will be used for data transmission. After
that, tap0 will get the carrier and the state will finally become UP.
(type=internal is equal to type=tap for the userspace datapath.)

In the case of a tap interface created by QEMU, OVS is able to open it with
a usual AF_PACKET/XDP socket just because QEMU is the userspace application
that owns it (it opens /dev/net/tun and performs the TUNSETIFF ioctl).  The
interface is in the UP state as long as the QEMU process is alive.

A TAP interface is not a stand-alone entity; it's a pipe between a particular
userspace application and the kernel network stack.  It will obviously
not work if you're connecting to it from the kernel side (via a socket)
without any application listening on the userspace side.

Best regards, Ilya Maximets.
William Tu Jan. 3, 2020, 5:19 p.m. UTC | #12
On Fri, Jan 3, 2020 at 5:28 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> [... full quote of the message above snipped ...]
>
Thanks Ilya!
Lots of people get confused when using tap and internal devices.
It's very clear with your explanation!
Would you consider adding this text to Documentation/faq/issues.rst?

William
William Tu Jan. 3, 2020, 6:36 p.m. UTC | #13
On Fri, Jan 03, 2020 at 02:28:03PM +0100, Ilya Maximets wrote:
> [... snipped; see Ilya's full message above ...]
>
> OVS will open and configure the tap interface correctly only if you
> provide type=tap.  In this case, OVS will open /dev/net/tun and
> perform a TUNSETIFF ioctl to open a persistent or create a new tap
> interface. The retrieved tap_fd will be used for data transmission. After
> that, tap0 will get the carrier and the state will finally become UP.
> (type=internal is equal to type=tap for the userspace datapath.)

Thanks, I tested it and it works OK if ovs-vsctl opens the device with
type=internal or type=tap, because OVS will re-open /dev/net/tun
and get the tap fd. Below is my script:

ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev

ip tuntap add tap0 mode tap
tunctl -t tap1
ovs-vsctl add-port br0 tap0 -- set int tap0 type=internal
ovs-vsctl add-port br0 tap1 -- set int tap1 type=tap

ip netns add ns0
ip netns add ns1
ip link set tap0 netns ns0
ip link set tap1 netns ns1
ip netns exec ns0 ip addr add 10.1.1.1/24 dev tap0
ip netns exec ns1 ip addr add 10.1.1.2/24 dev tap1
ip netns exec ns0 ip link set dev tap0 up
ip netns exec ns1 ip link set dev tap1 up

ip netns exec ns0 ifconfig
ip netns exec ns0 ping 10.1.1.2

Regards,
William



Patch

diff --git a/lib/automake.mk b/lib/automake.mk
index 17b36b43d9d7..0c635404cb43 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -398,6 +398,8 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
 	lib/netdev-linux-private.h \
+	lib/netdev-tpacket.c \
+	lib/netdev-tpacket.h \
 	lib/netdev-offload-tc.c \
 	lib/netlink-conntrack.c \
 	lib/netlink-conntrack.h \
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index f08159aa7b53..99a2c03bb2a6 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -20,6 +20,7 @@ 
 #include <linux/filter.h>
 #include <linux/gen_stats.h>
 #include <linux/if_ether.h>
+#include <linux/if_packet.h>
 #include <linux/if_tun.h>
 #include <linux/types.h>
 #include <linux/ethtool.h>
@@ -37,6 +38,24 @@ 
 
 struct netdev;
 
+/* tpacket rx and tx ring structure. */
+struct tp_ring {
+    struct iovec *rd;   /* rd[n] points to mmap area. */
+    int rd_len;
+    int rd_num;
+    char *mm;           /* mmap address. */
+    size_t mm_len;
+    unsigned int next_avail_block;
+    int frame_len;
+};
+
+struct tpacket_info {
+    int fd;
+    struct tpacket_req3 req;
+    struct tp_ring rxring;
+    struct tp_ring txring;
+};
+
 struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
@@ -110,6 +129,10 @@  struct netdev_linux {
 
     struct netdev_afxdp_tx_lock *tx_locks;  /* Array of locks for TX queues. */
 #endif
+
+    /* tpacket v3 information. */
+    struct tpacket_info **tps;
+    int n_tps;
 };
 
 static bool
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index f8e59bacfb13..edfc389ee6f2 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -36,9 +36,10 @@ 
 #include <linux/rtnetlink.h>
 #include <linux/sockios.h>
 #include <sys/ioctl.h>
+#include <sys/mman.h>
 #include <sys/socket.h>
 #include <sys/utsname.h>
-#include <netpacket/packet.h>
+//#include <netpacket/packet.h>
 #include <net/if.h>
 #include <net/if_arp.h>
 #include <net/route.h>
@@ -57,6 +58,7 @@ 
 #include "openvswitch/hmap.h"
 #include "netdev-afxdp.h"
 #include "netdev-provider.h"
+#include "netdev-tpacket.h"
 #include "netdev-vport.h"
 #include "netlink-notifier.h"
 #include "netlink-socket.h"
@@ -3315,6 +3317,26 @@  const struct netdev_class netdev_afxdp_class = {
     .rxq_recv = netdev_afxdp_rxq_recv,
 };
 #endif
+
+const struct netdev_class netdev_tpacket_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "tpacket",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
+    .get_stats = netdev_linux_get_stats,
+    .get_features = netdev_linux_get_features,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_tpacket_set_config,
+    .get_config = netdev_tpacket_get_config,
+    .reconfigure = netdev_tpacket_reconfigure,
+    .get_block_id = netdev_linux_get_block_id,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+    .send = netdev_tpacket_batch_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_destruct = netdev_linux_rxq_destruct,
+    .rxq_recv = netdev_tpacket_rxq_recv,
+};
 
 
 #define CODEL_N_QUEUES 0x0000
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index f109c4e66f0d..518d1dc6e02c 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -833,6 +833,7 @@  extern const struct netdev_class netdev_bsd_class;
 extern const struct netdev_class netdev_windows_class;
 #else
 extern const struct netdev_class netdev_linux_class;
+extern const struct netdev_class netdev_tpacket_class;
 #endif
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
diff --git a/lib/netdev-tpacket.c b/lib/netdev-tpacket.c
new file mode 100644
index 000000000000..798ce776838f
--- /dev/null
+++ b/lib/netdev-tpacket.c
@@ -0,0 +1,487 @@ 
+/*
+ * Copyright (c) 2019 VMware, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "netdev-linux-private.h"
+#include "netdev-linux.h"
+#include "netdev-tpacket.h"
+
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_packet.h>
+#include <net/if.h>
+#include <poll.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dpif-netdev.h"
+#include "fatal-signal.h"
+#include "openvswitch/compiler.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/list.h"
+#include "openvswitch/thread.h"
+#include "openvswitch/vlog.h"
+#include "packets.h"
+#include "socket-util.h"
+#include "util.h"
+
+COVERAGE_DEFINE(tpacket_rx_busy);
+COVERAGE_DEFINE(tpacket_tx_busy);
+
+VLOG_DEFINE_THIS_MODULE(netdev_tpacket);
+//static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+/* One block contains two frames. */
+#define TP_BLOCKSZ      4096
+#define TP_FRAMESZ      2048
+#define TP_NUM_DESCS    1024
+#define TP_BLOCKNR      1024
+#define TP_BLOCKNR_MASK (TP_BLOCKNR - 1)
+#define TP_FRAMENR      (TP_BLOCKNR * (TP_BLOCKSZ/TP_FRAMESZ))
+#define TP_FRAMENR_MASK (TP_FRAMENR - 1)
+#define BATCH_SIZE      NETDEV_MAX_BURST
+
+#define barrier() __asm__ __volatile__("" : : : "memory")
+
+static struct tpacket_info *tpacket_configure(struct netdev_linux *dev);
+static int tpacket_configure_all(struct netdev_linux *dev);
+static void tpacket_destroy(struct tpacket_info *tp);
+static void tpacket_destroy_all(struct netdev_linux *dev);
+
+static void
+tpacket_fill_v3(struct tpacket_req3 *r)
+{
+    memset(r, 0, sizeof *r);
+
+    r->tp_block_size = TP_BLOCKSZ;  /* Minimal size of contiguous block. */
+    r->tp_frame_size = TP_FRAMESZ;  /* Size of frame. */
+    r->tp_block_nr = TP_BLOCKNR;    /* Number of blocks. */
+    r->tp_frame_nr = TP_FRAMENR;    /* Number of frames. */
+    r->tp_retire_blk_tov = 0;       /* Timeout in msecs. */
+    r->tp_sizeof_priv = 0;          /* Offset to private data area. */
+    r->tp_feature_req_word = 0;
+    //r->tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
+}
+
+int
+netdev_tpacket_set_config(struct netdev *netdev,
+                          const struct smap *args OVS_UNUSED,
+                          char **errp OVS_UNUSED)
+{
+    netdev_request_reconfigure(netdev);
+    return 0;
+}
+
+int
+netdev_tpacket_get_config(const struct netdev *netdev OVS_UNUSED,
+                          struct smap *args OVS_UNUSED)
+{
+    return 0;
+}
+
+static struct tpacket_info *
+tpacket_configure(struct netdev_linux *dev)
+{
+    struct tpacket_req3 req;
+    struct tpacket_info *tp;
+    struct sockaddr_ll ll;
+    int ver, fd, ifindex;
+    int error, i, noqdisc;
+
+    tp = xmalloc(sizeof *tp);
+    if (!tp) {
+        ovs_mutex_unlock(&dev->mutex);
+        return NULL;
+    }
+    memset(tp, 0, sizeof *tp);
+
+    tp->fd = fd = socket(PF_PACKET, SOCK_RAW, 0);
+    if (fd < 0) {
+        VLOG_ERR("tpacket: create PF_PACKET failed: %s", ovs_strerror(errno));
+        error = errno;
+        goto error;
+    }
+
+    ver = TPACKET_V3;
+    error = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
+    if (error) {
+        VLOG_ERR("tpacket: set version failed: %s", ovs_strerror(errno));
+        goto error;
+    }
+
+    tpacket_fill_v3(&req);
+    error = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof req);
+    if (error) {
+        VLOG_ERR("tpacket: set rx_ring failed: %s", ovs_strerror(errno));
+        goto error;
+    }
+    error = setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof req);
+    if (error) {
+        VLOG_ERR("tpacket: set tx_ring failed: %s", ovs_strerror(errno));
+        goto error;
+    }
+    tp->req = req;
+
+    /* Configure rx/tx ring. */
+    tp->rxring.mm_len = req.tp_block_size * req.tp_block_nr;
+    tp->rxring.mm = mmap(0, 2 * tp->rxring.mm_len, PROT_READ | PROT_WRITE,
+                         MAP_SHARED | MAP_LOCKED | MAP_POPULATE, fd, 0);
+    if (tp->rxring.mm == MAP_FAILED) {
+        VLOG_ERR("tpacket: mmap rx_ring failed: %s", ovs_strerror(errno));
+        goto error;
+    }
+    tp->txring.mm_len = tp->rxring.mm_len;
+    tp->txring.mm = tp->rxring.mm + tp->rxring.mm_len;
+
+    tp->rxring.rd_num = tp->txring.rd_num = req.tp_block_nr;
+    tp->rxring.rd_len = tp->txring.rd_len =
+                        req.tp_block_nr * sizeof *tp->rxring.rd;
+
+    tp->rxring.rd = xmalloc(tp->rxring.rd_len);
+    if (!tp->rxring.rd) {
+        return NULL;
+    }
+    memset(tp->rxring.rd, 0, tp->rxring.rd_len);
+
+    tp->txring.rd = xmalloc(tp->txring.rd_len);
+    if (!tp->txring.rd) {
+        return NULL;
+    }
+    memset(tp->txring.rd, 0, tp->txring.rd_len);
+
+    for (i = 0; i < tp->rxring.rd_num; i++) {
+        tp->rxring.rd[i].iov_base = tp->rxring.mm + (i * req.tp_block_size);
+        tp->rxring.rd[i].iov_len = req.tp_block_size;
+    }
+    for (i = 0; i < tp->txring.rd_num; i++) {
+        tp->txring.rd[i].iov_base = tp->txring.mm + (i * req.tp_block_size);
+        tp->txring.rd[i].iov_len = req.tp_block_size;
+    }
+
+    noqdisc = 1;
+    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
+               &noqdisc, sizeof(noqdisc));
+
+    ifindex = linux_get_ifindex(netdev_get_name(&dev->up));
+
+    ll.sll_family = PF_PACKET;
+    ll.sll_protocol = htons(ETH_P_ALL);
+    ll.sll_ifindex = ifindex;
+    ll.sll_hatype = 0;
+    ll.sll_pkttype = 0;
+    ll.sll_halen = 0;
+
+    error = bind(fd, (struct sockaddr *)&ll, sizeof ll);
+    if (error) {
+        VLOG_ERR("tpacket: bind failed: %s", ovs_strerror(errno));
+        goto error_unmap;
+    }
+
+    return tp;
+
+error_unmap:
+    munmap(tp->rxring.mm, tp->rxring.mm_len * 2);
+error:
+    if (tp) {
+        free(tp);
+    }
+    if (fd >= 0) {
+        close(fd);
+    }
+
+    return NULL;
+}
+
+static int
+tpacket_configure_all(struct netdev_linux *dev)
+{
+    int n_tps, i;
+
+    n_tps = dev->n_tps;
+    dev->tps = calloc(n_tps, sizeof(struct tpacket_info *));
+
+    for (i = 0; i < n_tps; i++) {
+        VLOG_INFO("tpacket: configure %dth queue.", i);
+        dev->tps[i] = tpacket_configure(dev);
+        if (!dev->tps[i]) {
+            VLOG_ERR("tpacket: configure %dth queue failed.", i);
+            goto error;
+        }
+    }
+    return 0;
+
+error:
+    tpacket_destroy_all(dev);
+    return EINVAL;
+}
+
+static void
+tpacket_destroy(struct tpacket_info *tp)
+{
+    if (!tp) {
+        return;
+    }
+    munmap(tp->rxring.mm, tp->rxring.mm_len * 2);  /* Both rx and tx. */
+    close(tp->fd);
+    free(tp->rxring.rd);
+    free(tp->txring.rd);
+    free(tp);
+}
+
+static void
+tpacket_destroy_all(struct netdev_linux *dev)
+{
+    int i;
+
+    if (!dev->tps) {
+        return;
+    }
+    for (i = 0; i < dev->n_tps; i++) {
+        tpacket_destroy(dev->tps[i]);
+    }
+}
+
+int
+netdev_tpacket_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    netdev->n_rxq = 1;
+    dev->n_tps = netdev->n_rxq;
+    tpacket_destroy_all(dev);
+
+    err = tpacket_configure_all(dev);
+    if (err) {
+        VLOG_ERR("%s: tpacket reconfiguration failed.",
+                 netdev_get_name(netdev));
+    }
+    netdev_change_seq_changed(netdev);
+    ovs_mutex_unlock(&dev->mutex);
+
+    return err;
+}
+
+static inline uint32_t
+get_block_status(struct tpacket_block_desc *desc)
+{
+    barrier();
+    return desc->hdr.bh1.block_status;
+}
+
+static inline void
+set_block_status(struct tpacket_block_desc *desc, uint32_t status)
+{
+    desc->hdr.bh1.block_status = status;
+    barrier();
+}
+
+static inline uint32_t
+get_num_pkts(struct tpacket_block_desc *desc)
+{
+    return desc->hdr.bh1.num_pkts;
+}
+
+static inline uint32_t
+first_pkt_ofs(struct tpacket_block_desc *desc)
+{
+    return desc->hdr.bh1.offset_to_first_pkt;
+}
+
+static uint64_t block_seq_num = 0;
+static void OVS_UNUSED
+check_seq_num(struct tpacket_block_desc *desc)
+{
+    uint64_t seq = desc->hdr.bh1.seq_num;
+
+    if (block_seq_num + 1 != seq) {
+        VLOG_ERR("seq no %"PRIu64" + 1 != %"PRIu64,
+                 block_seq_num, seq);
+    } else {
+        block_seq_num = seq;
+    }
+}
+int
+netdev_tpacket_rxq_recv(struct netdev_rxq *rxq_,
+                        struct dp_packet_batch *batch,
+                        int *qfill)
+{
+    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    struct netdev *netdev = rx->up.netdev;
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int qid = rxq_->queue_id;
+    struct tpacket_block_desc *desc;
+    struct tpacket_info *tp;
+    struct tp_ring *rxring;
+    unsigned int block_num, n_pkts = 0;
+
+    tp = dev->tps[qid];
+    if (!tp) {
+        return EAGAIN;
+    }
+    rx->fd = tp->fd;
+    rxring = &tp->rxring;
+    block_num = rxring->next_avail_block;
+    dp_packet_batch_init(batch);
+
+    while (n_pkts < BATCH_SIZE) {
+        struct tpacket3_hdr *tphdr;
+        struct dp_packet *packet;
+        uint32_t num_pkts;
+        char *data;
+        int i;
+
+        block_num = block_num & TP_BLOCKNR_MASK;
+        desc = (struct tpacket_block_desc *)rxring->rd[block_num].iov_base;
+        while ((get_block_status(desc) & TP_STATUS_USER) == 0) {
+            if (batch->count == 0) {
+#if 0
+                struct pollfd pfd;
+                memset(&pfd, 0, sizeof pfd);
+                pfd.fd = tp->fd;
+                pfd.events = POLLIN | POLLERR;
+                pfd.events = 0;
+                poll(&pfd, 1, 1);
+#endif
+                COVERAGE_INC(tpacket_rx_busy);
+                return EAGAIN;
+            } else {
+                goto out;
+            }
+        }
+
+        check_seq_num(desc);
+        num_pkts = get_num_pkts(desc);
+        tphdr = (struct tpacket3_hdr *)
+                ((char *)desc + first_pkt_ofs(desc));
+
+        /* A block might have multiple frames(packets). */
+        for (i = 0; i < num_pkts; i++) {
+            data = (char *)tphdr + tphdr->tp_mac;
+            packet = dp_packet_clone_data_with_headroom(data,
+                                                        tphdr->tp_snaplen,
+                                                        DP_NETDEV_HEADROOM);
+            dp_packet_set_size(packet, tphdr->tp_snaplen);
+            dp_packet_set_rss_hash(packet, tphdr->hv1.tp_rxhash);
+            dp_packet_batch_add(batch, packet);
+
+            tphdr = (struct tpacket3_hdr *)((char *)tphdr +
+                                            tphdr->tp_next_offset);
+            barrier();
+            n_pkts++;
+        }
+
+        block_num++;
+        rxring->next_avail_block++;
+        set_block_status(desc, TP_STATUS_KERNEL);
+    }
+
+out:
+    if (qfill) {
+        *qfill = 0;
+    }
+
+    return 0;
+}
+
+static inline struct tpacket3_hdr *
+get_next_tx_frame(struct tp_ring *txring, int n)
+{
+    char *start = txring->rd[0].iov_base;
+
+    return (struct tpacket3_hdr *)(start + (n * TP_FRAMESZ));
+}
+
+int
+netdev_tpacket_batch_send(struct netdev *netdev, int qid,
+                          struct dp_packet_batch *batch,
+                          bool concurrent_txq OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct dp_packet *packet;
+    struct tpacket_info *tp;
+    struct tp_ring *txring;
+    unsigned int frame_num;
+    int error = 0;
+    int retries = 3;
+
+    tp = dev->tps[qid];
+    if (!tp) {
+        error = EAGAIN;
+        goto out;
+    }
+    txring = &tp->txring;
+    frame_num = txring->next_avail_block;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        struct tpacket3_hdr *tphdr;
+        int size;
+
+        frame_num = frame_num & TP_FRAMENR_MASK;
+        tphdr = get_next_tx_frame(txring, frame_num);
+#if 0
+        if (!(tphdr->tp_status & TP_STATUS_AVAILABLE)) {
+            COVERAGE_INC(tpacket_tx_busy);
+        }
+#endif
+        if (tphdr->tp_status &
+           (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) {
+            barrier();
+            COVERAGE_INC(tpacket_tx_busy);
+            error = EAGAIN;
+            goto out;
+        }
+
+        size = dp_packet_size(packet);
+        tphdr->tp_snaplen = size;
+        tphdr->tp_len = size;
+        tphdr->tp_next_offset = 0;
+
+        memcpy((char *)tphdr + TPACKET3_HDRLEN - sizeof(struct sockaddr_ll),
+               dp_packet_data(packet), size);
+
+        frame_num++;
+        txring->next_avail_block++;
+        barrier();
+        tphdr->tp_status = TP_STATUS_SEND_REQUEST;
+    }
+
+kick_retry:
+    error = sendto(tp->fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+    if (error < 0) {
+        if (retries-- && errno == EAGAIN)  {
+            COVERAGE_INC(tpacket_tx_busy);
+            goto kick_retry;
+        } else {
+            goto out;
+        }
+    }
+
+    return 0;
+
+out:
+    dp_packet_delete_batch(batch, true);
+    return error;
+}
diff --git a/lib/netdev-tpacket.h b/lib/netdev-tpacket.h
new file mode 100644
index 000000000000..2a80f962e0b7
--- /dev/null
+++ b/lib/netdev-tpacket.h
@@ -0,0 +1,43 @@ 
+/*
+ * Copyright (c) 2018, 2019 VMware, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_TPACKET_H
+#define NETDEV_TPACKET_H 1
+
+#include <stdint.h>
+#include <stdbool.h>
+
+struct dp_packet;
+struct dp_packet_batch;
+struct netdev;
+struct netdev_custom_stats;
+struct netdev_rxq;
+struct netdev_stats;
+struct smap;
+
+int netdev_tpacket_rxq_recv(struct netdev_rxq *rxq_,
+                            struct dp_packet_batch *batch,
+                            int *qfill);
+int netdev_tpacket_batch_send(struct netdev *netdev_, int qid,
+                            struct dp_packet_batch *batch,
+                            bool concurrent_txq);
+int netdev_tpacket_set_config(struct netdev *netdev, const struct smap *args,
+                              char **errp);
+int netdev_tpacket_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_tpacket_get_custom_stats(const struct netdev *netdev,
+                                    struct netdev_custom_stats *custom_stats);
+int netdev_tpacket_reconfigure(struct netdev *netdev);
+#endif /* netdev-tpacket.h */
diff --git a/lib/netdev.c b/lib/netdev.c
index 405c98c687fa..3710834521d5 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -145,6 +145,7 @@  netdev_initialize(void)
 
 #ifdef __linux__
         netdev_register_provider(&netdev_linux_class);
+        netdev_register_provider(&netdev_tpacket_class);
         netdev_register_provider(&netdev_internal_class);
         netdev_register_provider(&netdev_tap_class);
         netdev_vport_tunnel_register();