[ovs-dev,v7] Use TPACKET_V3 to accelerate veth for userspace datapath
diff mbox series

Message ID 20200318090240.17608-1-yang_y_yi@163.com
State New
Headers show
Series
  • [ovs-dev,v7] Use TPACKET_V3 to accelerate veth for userspace datapath
Related show

Commit Message

yang_y_yi March 18, 2020, 9:02 a.m. UTC
From: Yi Yang <yangyi01@inspur.com>

We can avoid high system call overhead by using TPACKET_V3
and using DPDK-like poll to receive and send packets (Note: send
still needs to call sendto to trigger final packet transmission).

From Linux kernel 3.10 on, TPACKET_V3 has been supported,
so all the Linux kernels current OVS supports can run
TPACKET_V3 without any problem.

I can see about 50% performance improvement for veth compared to
last recvmmsg optimization if I use TPACKET_V3, it is about 2.21
Gbps, but it was 1.47 Gbps before.

After is_pmd is set to true, performance can be improved much
more, it is about 180% performance improvement.

TPACKET_V3 can support TSO, but its performance isn't good because
of TPACKET_V3 kernel implementation issue, so it falls back to
recvmmsg in case userspace-tso-enable is set to true, but its
performance is better than recvmmsg in case userspace-tso-enable is
set to false, so just use TPACKET_V3 in that case.

Note: how much performance improvement is up to your platform,
some platforms can see huge improvement, some ones aren't so
noticeable, but if is_pmd is set to true, you can see big
performance improvement, the prerequisite is your tested veth
interfaces should be attached to different pmd threads.

Signed-off-by: Yi Yang <yangyi01@inspur.com>
Co-authored-by: William Tu <u9012063@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
---
 acinclude.m4                     |  12 ++
 configure.ac                     |   1 +
 include/sparse/linux/if_packet.h | 111 +++++++++++
 lib/dp-packet.c                  |  18 ++
 lib/dp-packet.h                  |   9 +
 lib/netdev-linux-private.h       |  26 +++
 lib/netdev-linux.c               | 419 +++++++++++++++++++++++++++++++++++++--
 7 files changed, 579 insertions(+), 17 deletions(-)

Changelog:
- v6->v7
 * is_pmd is set to true for system interfaces
 * Use zero copy for tpacket_v3 receiving
 * Fix comments by William
 * Remove include/linux/if_packet.h

- v5->v6
 * Fall back to recvmmsg in case userspace-tso-enable is true
   because of TPACKET_V3 kernel implementation issue for tso
   support

- v4->v5
 * Fix travis build issues
 * Fix comments issues (capitalize the first letter)
 * Verify TSO on Ubuntu 18.04 3.5.0-40-generic

- v3->v4
 * Fix sparse check errors

- v2->v3
 * Fix build issues in case HAVE_TPACKET_V3 is not defined
 * Add tso-related support code
 * make sure it can work normally in case userspace-tso-enable is true

- v1->v2
 * Remove TPACKET_V1 and TPACKET_V2 which is obsolete
 * Add include/linux/if_packet.h
 * Change include/sparse/linux/if_packet.h

Comments

Ilya Maximets March 18, 2020, 11:44 a.m. UTC | #1
On 3/18/20 10:02 AM, yang_y_yi@163.com wrote:
> From: Yi Yang <yangyi01@inspur.com>
> 
> We can avoid high system call overhead by using TPACKET_V3
> and using DPDK-like poll to receive and send packets (Note: send
> still needs to call sendto to trigger final packet transmission).
> 
> From Linux kernel 3.10 on, TPACKET_V3 has been supported,
> so all the Linux kernels current OVS supports can run
> TPACKET_V3 without any problem.
> 
> I can see about 50% performance improvement for veth compared to
> last recvmmsg optimization if I use TPACKET_V3, it is about 2.21
> Gbps, but it was 1.47 Gbps before.
> 
> After is_pmd is set to true, performance can be improved much
> more, it is about 180% performance improvement.
> 
> TPACKET_V3 can support TSO, but its performance isn't good because
> of TPACKET_V3 kernel implementation issue, so it falls back to
> recvmmsg in case userspace-tso-enable is set to true, but its
> performance is better than recvmmsg in case userspace-tso-enable is
> set to false, so just use TPACKET_V3 in that case.
> 
> Note: how much performance improvement is up to your platform,
> some platforms can see huge improvement, some ones aren't so
> noticeable, but if is_pmd is set to true, you can see big
> performance improvement, the prerequisite is your tested veth
> interfaces should be attached to different pmd threads.
> 
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> Co-authored-by: William Tu <u9012063@gmail.com>
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>  acinclude.m4                     |  12 ++
>  configure.ac                     |   1 +
>  include/sparse/linux/if_packet.h | 111 +++++++++++
>  lib/dp-packet.c                  |  18 ++
>  lib/dp-packet.h                  |   9 +
>  lib/netdev-linux-private.h       |  26 +++
>  lib/netdev-linux.c               | 419 +++++++++++++++++++++++++++++++++++++--
>  7 files changed, 579 insertions(+), 17 deletions(-)
> 
> Changelog:
> - v6->v7
>  * is_pmd is set to true for system interfaces

This can not be done that simple and should not be done unconditionally
anyways.  netdev-linux is not thread safe in many ways.  At least, stats
accounting will be messed up.  Second thing is that this change will
harm all the usual DPDK-based setups since PMD threads will start make
a lot of syscalls and sleep inside the kernel missing packets from the
fast DPDK interfaces.  Third thing is that this change will fire up at
least one PMD thread consuming 100% CPU constantly even for setups where
it's not needed.
So, this version is definitely not acceptable.

Best regards, Ilya Maximets.
Yi Yang (杨燚)-云服务集团 March 18, 2020, 1:22 p.m. UTC | #2
Ilya, raw socket for the interface type of which is "system" has been set to
non-block mode, can you explain which syscall will lead to sleep? Yes, pmd
thread will consume CPU resource even if it has nothing to do, but all the
type=dpdk ports are handled by pmd thread, here we just let system
interfaces look like a DPDK interface. I didn't see any problem in my test,
it will be better if you can tell me what will result in a problem and how I
can reproduce it. By the way, type=tap/internal interfaces are still be
handled by ovs-vswitchd thread.

In addition, only one line change is there, ".is_pmd = true,", ".is_pmd =
false," will keep it in ovs-vswitchd if there is any other concern. We can
change non-thread-safe parts to support pmd.

-----邮件原件-----
发件人: dev [mailto:ovs-dev-bounces@openvswitch.org] 代表 Ilya Maximets
发送时间: 2020年3月18日 19:45
收件人: yang_y_yi@163.com; ovs-dev@openvswitch.org
抄送: i.maximets@ovn.org
主题: Re: [ovs-dev] [PATCH v7] Use TPACKET_V3 to accelerate veth for
userspace datapath

On 3/18/20 10:02 AM, yang_y_yi@163.com wrote:
> From: Yi Yang <yangyi01@inspur.com>
> 
> We can avoid high system call overhead by using TPACKET_V3 and using 
> DPDK-like poll to receive and send packets (Note: send still needs to 
> call sendto to trigger final packet transmission).
> 
> From Linux kernel 3.10 on, TPACKET_V3 has been supported, so all the 
> Linux kernels current OVS supports can run
> TPACKET_V3 without any problem.
> 
> I can see about 50% performance improvement for veth compared to last 
> recvmmsg optimization if I use TPACKET_V3, it is about 2.21 Gbps, but 
> it was 1.47 Gbps before.
> 
> After is_pmd is set to true, performance can be improved much more, it 
> is about 180% performance improvement.
> 
> TPACKET_V3 can support TSO, but its performance isn't good because of 
> TPACKET_V3 kernel implementation issue, so it falls back to recvmmsg 
> in case userspace-tso-enable is set to true, but its performance is 
> better than recvmmsg in case userspace-tso-enable is set to false, so 
> just use TPACKET_V3 in that case.
> 
> Note: how much performance improvement is up to your platform, some 
> platforms can see huge improvement, some ones aren't so noticeable, 
> but if is_pmd is set to true, you can see big performance improvement, 
> the prerequisite is your tested veth interfaces should be attached to 
> different pmd threads.
> 
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> Co-authored-by: William Tu <u9012063@gmail.com>
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>  acinclude.m4                     |  12 ++
>  configure.ac                     |   1 +
>  include/sparse/linux/if_packet.h | 111 +++++++++++
>  lib/dp-packet.c                  |  18 ++
>  lib/dp-packet.h                  |   9 +
>  lib/netdev-linux-private.h       |  26 +++
>  lib/netdev-linux.c               | 419
+++++++++++++++++++++++++++++++++++++--
>  7 files changed, 579 insertions(+), 17 deletions(-)
> 
> Changelog:
> - v6->v7
>  * is_pmd is set to true for system interfaces

This can not be done that simple and should not be done unconditionally
anyways.  netdev-linux is not thread safe in many ways.  At least, stats
accounting will be messed up.  Second thing is that this change will harm
all the usual DPDK-based setups since PMD threads will start make a lot of
syscalls and sleep inside the kernel missing packets from the fast DPDK
interfaces.  Third thing is that this change will fire up at least one PMD
thread consuming 100% CPU constantly even for setups where it's not needed.
So, this version is definitely not acceptable.

Best regards, Ilya Maximets.
William Tu March 18, 2020, 9:41 p.m. UTC | #3
On Wed, Mar 18, 2020 at 6:22 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Ilya, raw socket for the interface type of which is "system" has been set to
> non-block mode, can you explain which syscall will lead to sleep? Yes, pmd
> thread will consume CPU resource even if it has nothing to do, but all the
> type=dpdk ports are handled by pmd thread, here we just let system
> interfaces look like a DPDK interface. I didn't see any problem in my test,
> it will be better if you can tell me what will result in a problem and how I
> can reproduce it. By the way, type=tap/internal interfaces are still be
> handled by ovs-vswitchd thread.
>
> In addition, only one line change is there, ".is_pmd = true,", ".is_pmd =
> false," will keep it in ovs-vswitchd if there is any other concern. We can
> change non-thread-safe parts to support pmd.
>

Hi Yiyang an Ilya,

How about making tpacket_v3 a new netdev class with type="tpacket"?
Like my original patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/366229.html

Users have to create it specifically by doing type="tpacket", ex:
  $ ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="tpacket"
And we can set is_pmd=true for this particular type.

Regards
William
Yi Yang (杨燚)-云服务集团 March 18, 2020, 11:56 p.m. UTC | #4
William, that can't fix Ilya's concern, we can fix thread-safe issues if most of code in lib/netdev-linux.c is not thread-safe, we also have to consider how we can fix scalability issue, obviously it isn't scalable that one ovs-vswitchd thread handles all the such interfaces, this is performance bottleneck and the performance is linearly related with number of interfaces. I think pmd thread is our natural choice, we have no reason to refuse this way, we can first fix them if current code does have issues to support pmd thread.

If it is difficult to support pmd currently, I can remove ".is_pmd = true", if a user wants to do that way, maybe we can add an option in interface level, options:is_pmd=true, but I'm not sure if is_pmd can be set on adding interface.

-----邮件原件-----
发件人: William Tu [mailto:u9012063@gmail.com] 
发送时间: 2020年3月19日 5:42
收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
抄送: i.maximets@ovn.org; yang_y_yi@163.com; ovs-dev@openvswitch.org
主题: Re: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath

On Wed, Mar 18, 2020 at 6:22 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Ilya, raw socket for the interface type of which is "system" has been 
> set to non-block mode, can you explain which syscall will lead to 
> sleep? Yes, pmd thread will consume CPU resource even if it has 
> nothing to do, but all the type=dpdk ports are handled by pmd thread, 
> here we just let system interfaces look like a DPDK interface. I 
> didn't see any problem in my test, it will be better if you can tell 
> me what will result in a problem and how I can reproduce it. By the 
> way, type=tap/internal interfaces are still be handled by ovs-vswitchd thread.
>
> In addition, only one line change is there, ".is_pmd = true,", 
> ".is_pmd = false," will keep it in ovs-vswitchd if there is any other 
> concern. We can change non-thread-safe parts to support pmd.
>

Hi Yiyang an Ilya,

How about making tpacket_v3 a new netdev class with type="tpacket"?
Like my original patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/366229.html

Users have to create it specifically by doing type="tpacket", ex:
  $ ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="tpacket"
And we can set is_pmd=true for this particular type.

Regards
William
Yi Yang (杨燚)-云服务集团 March 19, 2020, 3:12 a.m. UTC | #5
Hi, folks

As I said, TPACKET_V3 does have kernel implementation issue, I tried to fix it in Linux kernel 5.5.9, here is my test data with tpacket_v3 and tso enabled. On my low end server, my goal is to reach 16Gbps at least, I still have another idea to improve it.

[yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 42336 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  12.9 GBytes  11.1 Gbits/sec    1   3.09 MBytes
[  4]  10.00-20.00  sec  12.9 GBytes  11.1 Gbits/sec    0   3.09 MBytes
[  4]  20.00-30.00  sec  12.9 GBytes  11.1 Gbits/sec    3   3.09 MBytes
[  4]  30.00-40.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
[  4]  40.00-50.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
[  4]  50.00-60.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  77.2 GBytes  11.1 Gbits/sec    4             sender
[  4]   0.00-60.00  sec  77.2 GBytes  11.1 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 42334
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 42336
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  10.00-20.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  20.00-30.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  30.00-40.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  40.00-50.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  50.00-60.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  60.00-60.01  sec  14.3 MBytes  12.4 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.01  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-60.01  sec  77.2 GBytes  11.0 Gbits/sec                  receiver


iperf Done.
[yangyi@localhost ovs-master]$

-----邮件原件-----
发件人: William Tu [mailto:u9012063@gmail.com] 
发送时间: 2020年3月19日 5:42
收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
抄送: i.maximets@ovn.org; yang_y_yi@163.com; ovs-dev@openvswitch.org
主题: Re: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath

On Wed, Mar 18, 2020 at 6:22 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Ilya, raw socket for the interface type of which is "system" has been 
> set to non-block mode, can you explain which syscall will lead to 
> sleep? Yes, pmd thread will consume CPU resource even if it has 
> nothing to do, but all the type=dpdk ports are handled by pmd thread, 
> here we just let system interfaces look like a DPDK interface. I 
> didn't see any problem in my test, it will be better if you can tell 
> me what will result in a problem and how I can reproduce it. By the 
> way, type=tap/internal interfaces are still be handled by ovs-vswitchd thread.
>
> In addition, only one line change is there, ".is_pmd = true,", 
> ".is_pmd = false," will keep it in ovs-vswitchd if there is any other 
> concern. We can change non-thread-safe parts to support pmd.
>

Hi Yiyang an Ilya,

How about making tpacket_v3 a new netdev class with type="tpacket"?
Like my original patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/366229.html

Users have to create it specifically by doing type="tpacket", ex:
  $ ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="tpacket"
And we can set is_pmd=true for this particular type.

Regards
William
William Tu March 19, 2020, 2:52 p.m. UTC | #6
On Wed, Mar 18, 2020 at 8:12 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Hi, folks
>
> As I said, TPACKET_V3 does have kernel implementation issue, I tried to fix it in Linux kernel 5.5.9, here is my test data with tpacket_v3 and tso enabled. On my low end server, my goal is to reach 16Gbps at least, I still have another idea to improve it.
>
Can you share your kernel fix?
Or have you sent patch somewhere?
William
Yi Yang (杨燚)-云服务集团 March 20, 2020, 1:24 a.m. UTC | #7
William, this is just a simple experiment, I'm still trying other ideas to get higher performance improvement, final patch is for Linux kernel net tree, not for ovs.

-----邮件原件-----
发件人: William Tu [mailto:u9012063@gmail.com] 
发送时间: 2020年3月19日 22:53
收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
抄送: i.maximets@ovn.org; yang_y_yi@163.com; ovs-dev@openvswitch.org
主题: Re: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath

On Wed, Mar 18, 2020 at 8:12 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Hi, folks
>
> As I said, TPACKET_V3 does have kernel implementation issue, I tried to fix it in Linux kernel 5.5.9, here is my test data with tpacket_v3 and tso enabled. On my low end server, my goal is to reach 16Gbps at least, I still have another idea to improve it.
>
Can you share your kernel fix?
Or have you sent patch somewhere?
William
Yi Yang (杨燚)-云服务集团 March 23, 2020, 10 a.m. UTC | #8
Hi, folks

I implemented my goal in Ubuntu kernel 4.15.0-92.93, here is my performance data with tpacket_v3 and tso. So now I'm very sure tpacket_v3 can do better.

[yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
iperf3: no process found
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 44976 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
[  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
[  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
[  4]  30.00-40.00  sec  19.9 GBytes  17.1 Gbits/sec  102262    346 KBytes
[  4]  40.00-50.00  sec  19.8 GBytes  17.0 Gbits/sec  105383    225 KBytes
[  4]  50.00-60.00  sec  19.9 GBytes  17.1 Gbits/sec  103177    294 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec   119 GBytes  17.0 Gbits/sec  628995             sender
[  4]   0.00-60.00  sec   119 GBytes  17.0 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 44974
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 44976
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  19.5 GBytes  16.7 Gbits/sec
[  5]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec
[  5]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec
[  5]  30.00-40.00  sec  19.9 GBytes  17.1 Gbits/sec
[  5]  40.00-50.00  sec  19.8 GBytes  17.0 Gbits/sec
[  5]  50.00-60.00  sec  19.9 GBytes  17.1 Gbits/sec
[  5]  60.00-60.04  sec  89.1 MBytes  17.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.04  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-60.04  sec   119 GBytes  17.0 Gbits/sec                  receiver


iperf Done.
[yangyi@localhost ovs-master]$

-----邮件原件-----
发件人: Yi Yang (杨燚)-云服务集团 
发送时间: 2020年3月19日 11:12
收件人: 'u9012063@gmail.com' <u9012063@gmail.com>
抄送: 'i.maximets@ovn.org' <i.maximets@ovn.org>; 'yang_y_yi@163.com' <yang_y_yi@163.com>; 'ovs-dev@openvswitch.org' <ovs-dev@openvswitch.org>
主题: 答复: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath
重要性: 高

Hi, folks

As I said, TPACKET_V3 does have kernel implementation issue, I tried to fix it in Linux kernel 5.5.9, here is my test data with tpacket_v3 and tso enabled. On my low end server, my goal is to reach 16Gbps at least, I still have another idea to improve it.

[yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 42336 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  12.9 GBytes  11.1 Gbits/sec    1   3.09 MBytes
[  4]  10.00-20.00  sec  12.9 GBytes  11.1 Gbits/sec    0   3.09 MBytes
[  4]  20.00-30.00  sec  12.9 GBytes  11.1 Gbits/sec    3   3.09 MBytes
[  4]  30.00-40.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
[  4]  40.00-50.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
[  4]  50.00-60.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  77.2 GBytes  11.1 Gbits/sec    4             sender
[  4]   0.00-60.00  sec  77.2 GBytes  11.1 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 42334
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 42336
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  10.00-20.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  20.00-30.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  30.00-40.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  40.00-50.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  50.00-60.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  60.00-60.01  sec  14.3 MBytes  12.4 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.01  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-60.01  sec  77.2 GBytes  11.0 Gbits/sec                  receiver


iperf Done.
[yangyi@localhost ovs-master]$

-----邮件原件-----
发件人: William Tu [mailto:u9012063@gmail.com] 
发送时间: 2020年3月19日 5:42
收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
抄送: i.maximets@ovn.org; yang_y_yi@163.com; ovs-dev@openvswitch.org
主题: Re: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath

On Wed, Mar 18, 2020 at 6:22 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Ilya, raw socket for the interface type of which is "system" has been 
> set to non-block mode, can you explain which syscall will lead to 
> sleep? Yes, pmd thread will consume CPU resource even if it has 
> nothing to do, but all the type=dpdk ports are handled by pmd thread, 
> here we just let system interfaces look like a DPDK interface. I 
> didn't see any problem in my test, it will be better if you can tell 
> me what will result in a problem and how I can reproduce it. By the 
> way, type=tap/internal interfaces are still be handled by ovs-vswitchd thread.
>
> In addition, only one line change is there, ".is_pmd = true,", 
> ".is_pmd = false," will keep it in ovs-vswitchd if there is any other 
> concern. We can change non-thread-safe parts to support pmd.
>

Hi Yiyang an Ilya,

How about making tpacket_v3 a new netdev class with type="tpacket"?
Like my original patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/366229.html

Users have to create it specifically by doing type="tpacket", ex:
  $ ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="tpacket"
And we can set is_pmd=true for this particular type.

Regards
William
Yi Yang (杨燚)-云服务集团 March 25, 2020, 2:17 p.m. UTC | #9
William, FYI, my Linux kernel fix patch for TPACKET_V3 TSO performance issue. You can try it in Ubuntu 4.15.0 kernel.

https://patchwork.ozlabs.org/patch/1261410/


-----邮件原件-----
发件人: Yi Yang (杨燚)-云服务集团 
发送时间: 2020年3月23日 18:00
收件人: 'u9012063@gmail.com' <u9012063@gmail.com>
抄送: 'i.maximets@ovn.org' <i.maximets@ovn.org>; 'yang_y_yi@163.com' <yang_y_yi@163.com>; 'ovs-dev@openvswitch.org' <ovs-dev@openvswitch.org>
主题: 答复: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath
重要性: 高

Hi, folks

I implemented my goal in Ubuntu kernel 4.15.0-92.93, here is my performance data with tpacket_v3 and tso. So now I'm very sure tpacket_v3 can do better.

[yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
iperf3: no process found
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 44976 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
[  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
[  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
[  4]  30.00-40.00  sec  19.9 GBytes  17.1 Gbits/sec  102262    346 KBytes
[  4]  40.00-50.00  sec  19.8 GBytes  17.0 Gbits/sec  105383    225 KBytes
[  4]  50.00-60.00  sec  19.9 GBytes  17.1 Gbits/sec  103177    294 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec   119 GBytes  17.0 Gbits/sec  628995             sender
[  4]   0.00-60.00  sec   119 GBytes  17.0 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 44974
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 44976
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  19.5 GBytes  16.7 Gbits/sec
[  5]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec
[  5]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec
[  5]  30.00-40.00  sec  19.9 GBytes  17.1 Gbits/sec
[  5]  40.00-50.00  sec  19.8 GBytes  17.0 Gbits/sec
[  5]  50.00-60.00  sec  19.9 GBytes  17.1 Gbits/sec
[  5]  60.00-60.04  sec  89.1 MBytes  17.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.04  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-60.04  sec   119 GBytes  17.0 Gbits/sec                  receiver


iperf Done.
[yangyi@localhost ovs-master]$

-----邮件原件-----
发件人: Yi Yang (杨燚)-云服务集团 
发送时间: 2020年3月19日 11:12
收件人: 'u9012063@gmail.com' <u9012063@gmail.com>
抄送: 'i.maximets@ovn.org' <i.maximets@ovn.org>; 'yang_y_yi@163.com' <yang_y_yi@163.com>; 'ovs-dev@openvswitch.org' <ovs-dev@openvswitch.org>
主题: 答复: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath
重要性: 高

Hi, folks

As I said, TPACKET_V3 does have kernel implementation issue, I tried to fix it in Linux kernel 5.5.9, here is my test data with tpacket_v3 and tso enabled. On my low end server, my goal is to reach 16Gbps at least, I still have another idea to improve it.

[yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 42336 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  12.9 GBytes  11.1 Gbits/sec    1   3.09 MBytes
[  4]  10.00-20.00  sec  12.9 GBytes  11.1 Gbits/sec    0   3.09 MBytes
[  4]  20.00-30.00  sec  12.9 GBytes  11.1 Gbits/sec    3   3.09 MBytes
[  4]  30.00-40.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
[  4]  40.00-50.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
[  4]  50.00-60.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.09 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  77.2 GBytes  11.1 Gbits/sec    4             sender
[  4]   0.00-60.00  sec  77.2 GBytes  11.1 Gbits/sec                  receiver

Server output:
Accepted connection from 10.15.1.2, port 42334
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 42336
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  10.00-20.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  20.00-30.00  sec  12.9 GBytes  11.1 Gbits/sec
[  5]  30.00-40.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  40.00-50.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  50.00-60.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  60.00-60.01  sec  14.3 MBytes  12.4 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.01  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-60.01  sec  77.2 GBytes  11.0 Gbits/sec                  receiver


iperf Done.
[yangyi@localhost ovs-master]$

-----邮件原件-----
发件人: William Tu [mailto:u9012063@gmail.com] 
发送时间: 2020年3月19日 5:42
收件人: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
抄送: i.maximets@ovn.org; yang_y_yi@163.com; ovs-dev@openvswitch.org
主题: Re: [ovs-dev] 答复: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath

On Wed, Mar 18, 2020 at 6:22 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Ilya, raw socket for the interface type of which is "system" has been 
> set to non-block mode, can you explain which syscall will lead to 
> sleep? Yes, pmd thread will consume CPU resource even if it has 
> nothing to do, but all the type=dpdk ports are handled by pmd thread, 
> here we just let system interfaces look like a DPDK interface. I 
> didn't see any problem in my test, it will be better if you can tell 
> me what will result in a problem and how I can reproduce it. By the 
> way, type=tap/internal interfaces are still be handled by ovs-vswitchd thread.
>
> In addition, only one line change is there, ".is_pmd = true,", 
> ".is_pmd = false," will keep it in ovs-vswitchd if there is any other 
> concern. We can change non-thread-safe parts to support pmd.
>

Hi Yiyang an Ilya,

How about making tpacket_v3 a new netdev class with type="tpacket"?
Like my original patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/366229.html

Users have to create it specifically by doing type="tpacket", ex:
  $ ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="tpacket"
And we can set is_pmd=true for this particular type.

Regards
William

Patch
diff mbox series

diff --git a/acinclude.m4 b/acinclude.m4
index 02efea6..1488ded 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1082,6 +1082,18 @@  AC_DEFUN([OVS_CHECK_IF_DL],
       AC_SEARCH_LIBS([pcap_open_live], [pcap])
    fi])
 
+dnl OVS_CHECK_LINUX_TPACKET
+dnl
+dnl Configure Linux TPACKET.
+AC_DEFUN([OVS_CHECK_LINUX_TPACKET], [
+  AC_COMPILE_IFELSE([
+    AC_LANG_PROGRAM([#include <linux/if_packet.h>], [
+        struct tpacket3_hdr x =  { 0 };
+    ])],
+    [AC_DEFINE([HAVE_TPACKET_V3], [1],
+    [Define to 1 if struct tpacket3_hdr is available.])])
+])
+
 dnl Checks for buggy strtok_r.
 dnl
 dnl Some versions of glibc 2.7 has a bug in strtok_r when compiling
diff --git a/configure.ac b/configure.ac
index 1877aae..b61a1f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -89,6 +89,7 @@  OVS_CHECK_VISUAL_STUDIO_DDK
 OVS_CHECK_COVERAGE
 OVS_CHECK_NDEBUG
 OVS_CHECK_NETLINK
+OVS_CHECK_LINUX_TPACKET
 OVS_CHECK_OPENSSL
 OVS_CHECK_LIBCAPNG
 OVS_CHECK_LOGDIR
diff --git a/include/sparse/linux/if_packet.h b/include/sparse/linux/if_packet.h
index 5ff6d47..0ac3fce 100644
--- a/include/sparse/linux/if_packet.h
+++ b/include/sparse/linux/if_packet.h
@@ -5,6 +5,7 @@ 
 #error "Use this header only with sparse.  It is not a correct implementation."
 #endif
 
+#include <openvswitch/types.h>
 #include_next <linux/if_packet.h>
 
 /* Fix endianness of 'spkt_protocol' and 'sll_protocol' members. */
@@ -27,4 +28,114 @@  struct sockaddr_ll {
         unsigned char   sll_addr[8];
 };
 
+/* Packet types */
+#define PACKET_HOST                     0 /* To us                */
+#define PACKET_OTHERHOST                3 /* To someone else 	*/
+#define PACKET_LOOPBACK                 5 /* MC/BRD frame looped back */
+
+/* Packet socket options */
+#define PACKET_RX_RING                  5
+#define PACKET_VERSION                 10
+#define PACKET_TX_RING                 13
+#define PACKET_VNET_HDR                15
+
+/* Rx ring - header status */
+#define TP_STATUS_KERNEL                0
+#define TP_STATUS_USER            (1 << 0)
+#define TP_STATUS_VLAN_VALID      (1 << 4) /* auxdata has valid tp_vlan_tci */
+#define TP_STATUS_VLAN_TPID_VALID (1 << 6) /* auxdata has valid tp_vlan_tpid */
+
+/* Tx ring - header status */
+#define TP_STATUS_SEND_REQUEST    (1 << 0)
+#define TP_STATUS_SENDING         (1 << 1)
+
+#define tpacket_hdr rpl_tpacket_hdr
+struct tpacket_hdr {
+    unsigned long tp_status;
+    unsigned int tp_len;
+    unsigned int tp_snaplen;
+    unsigned short tp_mac;
+    unsigned short tp_net;
+    unsigned int tp_sec;
+    unsigned int tp_usec;
+};
+
+#define TPACKET_ALIGNMENT 16
+#define TPACKET_ALIGN(x) (((x)+TPACKET_ALIGNMENT-1)&~(TPACKET_ALIGNMENT-1))
+
+#define tpacket_hdr_variant1 rpl_tpacket_hdr_variant1
+struct tpacket_hdr_variant1 {
+    uint32_t tp_rxhash;
+    uint32_t tp_vlan_tci;
+    uint16_t tp_vlan_tpid;
+    uint16_t tp_padding;
+};
+
+#define tpacket3_hdr rpl_tpacket3_hdr
+struct tpacket3_hdr {
+    uint32_t  tp_next_offset;
+    uint32_t  tp_sec;
+    uint32_t  tp_nsec;
+    uint32_t  tp_snaplen;
+    uint32_t  tp_len;
+    uint32_t  tp_status;
+    uint16_t  tp_mac;
+    uint16_t  tp_net;
+    /* pkt_hdr variants */
+    union {
+        struct tpacket_hdr_variant1 hv1;
+    };
+    uint8_t  tp_padding[8];
+};
+
+#define tpacket_bd_ts rpl_tpacket_bd_ts
+struct tpacket_bd_ts {
+    unsigned int ts_sec;
+    union {
+        unsigned int ts_usec;
+        unsigned int ts_nsec;
+    };
+};
+
+#define tpacket_hdr_v1 rpl_tpacket_hdr_v1
+struct tpacket_hdr_v1 {
+    uint32_t block_status;
+    uint32_t num_pkts;
+    uint32_t offset_to_first_pkt;
+    uint32_t blk_len;
+    uint64_t __attribute__((aligned(8))) seq_num;
+    struct tpacket_bd_ts ts_first_pkt, ts_last_pkt;
+};
+
+#define tpacket_bd_header_u rpl_tpacket_bd_header_u
+union tpacket_bd_header_u {
+    struct tpacket_hdr_v1 bh1;
+};
+
+#define tpacket_block_desc rpl_tpacket_block_desc
+struct tpacket_block_desc {
+    uint32_t version;
+    uint32_t offset_to_priv;
+    union tpacket_bd_header_u hdr;
+};
+
+#define TPACKET3_HDRLEN \
+    (TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
+
+enum rpl_tpacket_versions {
+    TPACKET_V1,
+    TPACKET_V2,
+    TPACKET_V3
+};
+
+#define tpacket_req3 rpl_tpacket_req3
+struct tpacket_req3 {
+    unsigned int tp_block_size; /* Minimal size of contiguous block */
+    unsigned int tp_block_nr; /* Number of blocks */
+    unsigned int tp_frame_size; /* Size of frame */
+    unsigned int tp_frame_nr; /* Total number of frames */
+    unsigned int tp_retire_blk_tov; /* Timeout in msecs */
+    unsigned int tp_sizeof_priv; /* Offset to private data area */
+    unsigned int tp_feature_req_word;
+};
 #endif
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index cd26235..82f4934 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -76,6 +76,21 @@  dp_packet_use_afxdp(struct dp_packet *b, void *data, size_t allocated,
 }
 #endif
 
+#if HAVE_TPACKET_V3
+/* Initialize 'b' as an dp_packet that contains tpacket data.
+ */
+void
+dp_packet_use_tpacket(struct dp_packet *b, void *data, size_t allocated,
+                      size_t headroom)
+{
+    dp_packet_set_base(b, (char *)data - headroom);
+    dp_packet_set_data(b, data);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_init__(b, allocated, DPBUF_TPACKET_V3);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -271,6 +286,9 @@  dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
     case DPBUF_AFXDP:
         OVS_NOT_REACHED();
 
+    case DPBUF_TPACKET_V3:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 9f8991f..955c6f8 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -44,6 +44,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
     DPBUF_AFXDP,               /* Buffer data from XDP frame. */
+    DPBUF_TPACKET_V3           /* Buffer data from TPACKET_V3 rx ring */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -139,6 +140,9 @@  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 #if HAVE_AF_XDP
 void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t);
 #endif
+#if HAVE_TPACKET_V3
+void dp_packet_use_tpacket(struct dp_packet *, void *, size_t, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -207,6 +211,11 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+        if (b->source == DPBUF_TPACKET_V3) {
+            /* TPACKET_V3 buffer needn't free, it is recycled. */
+            return;
+        }
+
         dp_packet_uninit(b);
         free(b);
     }
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index c7c515f..296f085 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -20,6 +20,7 @@ 
 #include <linux/filter.h>
 #include <linux/gen_stats.h>
 #include <linux/if_ether.h>
+#include <linux/if_packet.h>
 #include <linux/if_tun.h>
 #include <linux/types.h>
 #include <linux/ethtool.h>
@@ -41,6 +42,26 @@  struct netdev;
 /* The maximum packet length is 16 bits */
 #define LINUX_RXQ_TSO_MAX_LEN 65535
 
+#ifdef HAVE_TPACKET_V3
+#define TPACKET_MAX_FRAME_NUM 64
+struct tpacket_ring {
+    int sockfd;                  /* Raw socket fd */
+    struct iovec *rd;            /* Ring buffer descriptors */
+    uint8_t *mm_space;           /* Mmap base address */
+    size_t mm_len;               /* Total mmap length */
+    size_t rd_len;               /* Total ring buffer descriptors length */
+    int type;                    /* Ring type: rx or tx */
+    int rd_num;                  /* Number of ring buffer descriptor */
+    int flen;                    /* Block size */
+    struct tpacket_req3 req;     /* TPACKET_V3 req */
+    uint32_t block_num;          /* Current block number */
+    uint32_t frame_num;          /* Current frame number */
+    uint32_t frame_num_in_block; /* Frame number in current block */
+    void * ppd;                  /* Packet pointer in current block */
+    struct dp_packet *pkts;      /* Preallocated dp_packet pool */
+};
+#endif /* HAVE_TPACKET_V3 */
+
 struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
@@ -105,6 +126,11 @@  struct netdev_linux {
 
     int numa_id;                /* NUMA node id. */
 
+#ifdef HAVE_TPACKET_V3
+    struct tpacket_ring *tp_rx_ring;
+    struct tpacket_ring *tp_tx_ring;
+#endif
+
 #ifdef HAVE_AF_XDP
     /* AF_XDP information. */
     struct xsk_socket_info **xsks;
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index c6e46f1..963bb06 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -38,6 +38,9 @@ 
 #include <linux/sockios.h>
 #include <linux/virtio_net.h>
 #include <sys/ioctl.h>
+#ifdef HAVE_TPACKET_V3
+#include <sys/mman.h>
+#endif
 #include <sys/socket.h>
 #include <sys/uio.h>
 #include <sys/utsname.h>
@@ -970,6 +973,7 @@  netdev_linux_construct_tap(struct netdev *netdev_)
     static const char tap_dev[] = "/dev/net/tun";
     const char *name = netdev_->name;
     struct ifreq ifr;
+    bool tso = userspace_tso_enabled();
 
     int error = netdev_linux_common_construct(netdev_);
     if (error) {
@@ -987,7 +991,7 @@  netdev_linux_construct_tap(struct netdev *netdev_)
     /* Create tap device. */
     get_flags(&netdev->up, &netdev->ifi_flags);
     ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
-    if (userspace_tso_enabled()) {
+    if (tso) {
         ifr.ifr_flags |= IFF_VNET_HDR;
     }
 
@@ -1012,7 +1016,7 @@  netdev_linux_construct_tap(struct netdev *netdev_)
         goto error_close;
     }
 
-    if (userspace_tso_enabled()) {
+    if (tso) {
         /* Old kernels don't support TUNSETOFFLOAD. If TUNSETOFFLOAD is
          * available, it will return EINVAL when a flag is unknown.
          * Therefore, try enabling offload with no flags to check
@@ -1074,6 +1078,116 @@  netdev_linux_rxq_alloc(void)
     return &rx->up;
 }
 
+#ifdef HAVE_TPACKET_V3
+static inline struct tpacket3_hdr *
+tpacket_get_next_frame(struct tpacket_ring *ring, uint32_t frame_num)
+{
+    uint8_t *f0 = ring->rd[0].iov_base;
+
+    return ALIGNED_CAST(struct tpacket3_hdr *,
+               f0 + (frame_num * ring->req.tp_frame_size));
+}
+
+static inline void
+tpacket_fill_ring(struct tpacket_ring *ring, unsigned int blocks, int type)
+{
+    if (type == PACKET_RX_RING) {
+        ring->req.tp_retire_blk_tov = 0;
+        ring->req.tp_sizeof_priv = 0;
+        ring->req.tp_feature_req_word = 0;
+    }
+
+    if (userspace_tso_enabled()) {
+        /* For TX ring, the whole packet must be in one frame
+         * so tp_frame_size must big enough to accommodate
+         * 64K packet, tpacket3_hdr will occupy some bytes,
+         * the final frame size is 64K + 4K = 68K.
+         */
+        ring->req.tp_frame_size = (getpagesize() << 4) + getpagesize();
+        ring->req.tp_block_size = ring->req.tp_frame_size;
+    } else {
+        ring->req.tp_block_size = getpagesize() << 2;
+        ring->req.tp_frame_size = TPACKET_ALIGNMENT << 7;
+    }
+
+    ring->req.tp_block_nr = blocks;
+
+    ring->req.tp_frame_nr = ring->req.tp_block_size /
+                             ring->req.tp_frame_size *
+                             ring->req.tp_block_nr;
+
+    ring->mm_len = ring->req.tp_block_size * ring->req.tp_block_nr;
+    ring->rd_num = ring->req.tp_block_nr;
+    ring->flen = ring->req.tp_block_size;
+}
+
+static int
+tpacket_setup_ring(int sock, struct tpacket_ring *ring, int type)
+{
+    int ret = 0;
+    unsigned int blocks;
+
+    if (userspace_tso_enabled()) {
+        blocks = 128;
+    } else {
+        blocks = 256;
+    }
+    ring->type = type;
+    tpacket_fill_ring(ring, blocks, type);
+    ret = setsockopt(sock, SOL_PACKET, type, &ring->req,
+                     sizeof(ring->req));
+
+    if (ret == -1) {
+        return -1;
+    }
+
+    ring->rd_len = ring->rd_num * sizeof(*ring->rd);
+    ring->rd = xmalloc(ring->rd_len);
+    if (ring->rd == NULL) {
+        return -1;
+    }
+
+    /* Preallocated dp_packet pool */
+    if (type == PACKET_RX_RING) {
+        ring->pkts = xmalloc(sizeof(struct dp_packet) * TPACKET_MAX_FRAME_NUM);
+        if (ring->pkts == NULL) {
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+static inline int
+tpacket_mmap_rx_tx_ring(int sock, struct tpacket_ring *rx_ring,
+                struct tpacket_ring *tx_ring)
+{
+    int i;
+
+    rx_ring->mm_space = mmap(NULL, rx_ring->mm_len + tx_ring->mm_len,
+                          PROT_READ | PROT_WRITE,
+                          MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock, 0);
+    if (rx_ring->mm_space == MAP_FAILED) {
+        return -1;
+    }
+
+    memset(rx_ring->rd, 0, rx_ring->rd_len);
+    for (i = 0; i < rx_ring->rd_num; ++i) {
+        rx_ring->rd[i].iov_base = rx_ring->mm_space + (i * rx_ring->flen);
+        rx_ring->rd[i].iov_len = rx_ring->flen;
+    }
+
+    tx_ring->mm_space = rx_ring->mm_space + rx_ring->mm_len;
+    memset(tx_ring->rd, 0, tx_ring->rd_len);
+    for (i = 0; i < tx_ring->rd_num; ++i) {
+        tx_ring->rd[i].iov_base = tx_ring->mm_space + (i * tx_ring->flen);
+        tx_ring->rd[i].iov_len = tx_ring->flen;
+    }
+
+    return 0;
+}
+#endif
+
 static int
 netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
 {
@@ -1081,6 +1195,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     struct netdev *netdev_ = rx->up.netdev;
     struct netdev_linux *netdev = netdev_linux_cast(netdev_);
     int error;
+    bool tso = userspace_tso_enabled();
 
     ovs_mutex_lock(&netdev->mutex);
     rx->is_tap = is_tap_netdev(netdev_);
@@ -1089,6 +1204,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     } else {
         struct sockaddr_ll sll;
         int ifindex, val;
+
         /* Result of tcpdump -dd inbound */
         static const struct sock_filter filt[] = {
             { 0x28, 0, 0, 0xfffff004 }, /* ldh [0] */
@@ -1101,7 +1217,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
         };
 
         /* Create file descriptor. */
-        rx->fd = socket(PF_PACKET, SOCK_RAW, 0);
+        rx->fd = socket(PF_PACKET, SOCK_RAW, (OVS_FORCE int) htons(ETH_P_ALL));
         if (rx->fd < 0) {
             error = errno;
             VLOG_ERR("failed to create raw socket (%s)", ovs_strerror(error));
@@ -1116,7 +1232,7 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
             goto error;
         }
 
-        if (userspace_tso_enabled()
+        if (tso
             && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
                           sizeof val)) {
             error = errno;
@@ -1125,6 +1241,53 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
             goto error;
         }
 
+#ifdef HAVE_TPACKET_V3
+        if (!tso) {
+            static int ver = TPACKET_V3;
+
+            /* TPACKET_V3 ring setup must be after setsockopt
+             * PACKET_VNET_HDR because PACKET_VNET_HDR will return error
+             * (EBUSY) if ring is set up
+             */
+            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VERSION, &ver,
+                               sizeof(ver));
+            if (error != 0) {
+                error = errno;
+                VLOG_ERR("%s: failed to set tpacket version (%s)",
+                         netdev_get_name(netdev_), ovs_strerror(error));
+                goto error;
+            }
+            netdev->tp_rx_ring = xzalloc(sizeof(struct tpacket_ring));
+            netdev->tp_tx_ring = xzalloc(sizeof(struct tpacket_ring));
+            netdev->tp_rx_ring->sockfd = rx->fd;
+            netdev->tp_tx_ring->sockfd = rx->fd;
+            error = tpacket_setup_ring(rx->fd, netdev->tp_rx_ring,
+                                       PACKET_RX_RING);
+            if (error != 0) {
+                error = errno;
+                VLOG_ERR("%s: failed to set tpacket rx ring (%s)",
+                         netdev_get_name(netdev_), ovs_strerror(error));
+                goto error;
+            }
+            error = tpacket_setup_ring(rx->fd, netdev->tp_tx_ring,
+                                       PACKET_TX_RING);
+            if (error != 0) {
+                error = errno;
+                VLOG_ERR("%s: failed to set tpacket tx ring (%s)",
+                         netdev_get_name(netdev_), ovs_strerror(error));
+                goto error;
+            }
+            error = tpacket_mmap_rx_tx_ring(rx->fd, netdev->tp_rx_ring,
+                                           netdev->tp_tx_ring);
+            if (error != 0) {
+                error = errno;
+                VLOG_ERR("%s: failed to mmap tpacket rx & tx ring (%s)",
+                         netdev_get_name(netdev_), ovs_strerror(error));
+                goto error;
+            }
+        }
+#endif
+
         /* Set non-blocking mode. */
         error = set_nonblocking(rx->fd);
         if (error) {
@@ -1139,9 +1302,16 @@  netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
 
         /* Bind to specific ethernet device. */
         memset(&sll, 0, sizeof sll);
-        sll.sll_family = AF_PACKET;
+        sll.sll_family = PF_PACKET;
+#ifdef HAVE_TPACKET_V3
+        if (!tso) {
+            sll.sll_hatype = 0;
+            sll.sll_pkttype = 0;
+            sll.sll_halen = 0;
+        }
+#endif
         sll.sll_ifindex = ifindex;
-        sll.sll_protocol = htons(ETH_P_ALL);
+        sll.sll_protocol = (OVS_FORCE ovs_be16) htons(ETH_P_ALL);
         if (bind(rx->fd, (struct sockaddr *) &sll, sizeof sll) < 0) {
             error = errno;
             VLOG_ERR("%s: failed to bind raw socket (%s)",
@@ -1178,6 +1348,19 @@  netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     int i;
 
     if (!rx->is_tap) {
+#ifdef HAVE_TPACKET_V3
+        if (!userspace_tso_enabled()) {
+            struct netdev_linux *netdev = netdev_linux_cast(rx->up.netdev);
+
+            if (netdev->tp_rx_ring) {
+                munmap(netdev->tp_rx_ring->mm_space,
+                       2 * netdev->tp_rx_ring->mm_len);
+                free(netdev->tp_rx_ring->rd);
+                free(netdev->tp_tx_ring->rd);
+            }
+        }
+#endif
+
         close(rx->fd);
     }
 
@@ -1220,8 +1403,8 @@  auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
  * It also used recvmmsg to reduce multiple syscalls overhead;
  */
 static int
-netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
-                                 struct dp_packet_batch *batch)
+netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, bool tso,
+                                 int mtu, struct dp_packet_batch *batch)
 {
     int iovlen;
     size_t std_len;
@@ -1237,7 +1420,7 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
     struct dp_packet *buffers[NETDEV_MAX_BURST];
     int i;
 
-    if (userspace_tso_enabled()) {
+    if (tso) {
         /* Use the buffer from the allocated packet below to receive MTU
          * sized packets and an aux_buf for extra TSO data. */
         iovlen = IOV_TSO_SIZE;
@@ -1368,7 +1551,7 @@  netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
  * packets are added into *batch. The return value is 0 or errno.
  */
 static int
-netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
+netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, bool tso, int mtu,
                                 struct dp_packet_batch *batch)
 {
     int virtio_net_hdr_size;
@@ -1377,7 +1560,7 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
     int iovlen;
     int i;
 
-    if (userspace_tso_enabled()) {
+    if (tso) {
         /* Use the buffer from the allocated packet below to receive MTU
          * sized packets and an aux_buf for extra TSO data. */
         iovlen = IOV_TSO_SIZE;
@@ -1454,6 +1637,110 @@  netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
     return 0;
 }
 
+#ifdef HAVE_TPACKET_V3
+static int
+netdev_linux_batch_recv_tpacket(struct netdev_rxq_linux *rx, bool tso,
+                                int mtu OVS_UNUSED,
+                                struct dp_packet_batch *batch)
+{
+    struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+    struct dp_packet *buffer;
+    int i = 0;
+    unsigned int block_num;
+    unsigned int frame_num;
+    unsigned int fn_in_block;
+    struct tpacket_block_desc *pbd;
+    struct tpacket3_hdr *ppd;
+    int virtio_net_hdr_size;
+
+    if (tso) {
+        virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
+    } else {
+        virtio_net_hdr_size = 0;
+    }
+
+    ppd = ALIGNED_CAST(struct tpacket3_hdr *, netdev->tp_rx_ring->ppd);
+    block_num = netdev->tp_rx_ring->block_num;
+    frame_num = netdev->tp_rx_ring->frame_num;
+    fn_in_block = netdev->tp_rx_ring->frame_num_in_block;
+    pbd = ALIGNED_CAST(struct tpacket_block_desc *,
+              netdev->tp_rx_ring->rd[block_num].iov_base);
+
+    while (i < NETDEV_MAX_BURST) {
+        if ((pbd->hdr.bh1.block_status & TP_STATUS_USER) == 0) {
+            break;
+        }
+        if (fn_in_block == 0) {
+            ppd = ALIGNED_CAST(struct tpacket3_hdr *, (uint8_t *) pbd +
+                                   pbd->hdr.bh1.offset_to_first_pkt);
+        }
+
+        /* Use preallocated dp_packet and tpacket_v3 rx ring buffer
+         * to avoid memory allocating and packet copy.
+         */
+        buffer = &netdev->tp_rx_ring->pkts[frame_num];
+        dp_packet_use_tpacket(buffer, (uint8_t *)ppd + ppd->tp_mac
+                                           - virtio_net_hdr_size,
+                              ppd->tp_snaplen + virtio_net_hdr_size
+                                  + VLAN_ETH_HEADER_LEN,
+                              DP_NETDEV_HEADROOM);
+        dp_packet_set_size(buffer, ppd->tp_snaplen + virtio_net_hdr_size);
+
+        if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
+            /* Unexpected error situation: the virtio header is not present
+             * or corrupted. Drop the packet but continue in case next ones
+             * are correct. */
+            dp_packet_delete(buffer);
+            netdev->rx_dropped += 1;
+            VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
+                         netdev_get_name(netdev_));
+        } else {
+            if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
+                struct eth_header *eth;
+                bool double_tagged;
+                ovs_be16 vlan_tpid;
+
+                eth = dp_packet_data(buffer);
+                double_tagged = eth->eth_type == htons(ETH_TYPE_VLAN_8021Q);
+                if (ppd->tp_status & TP_STATUS_VLAN_TPID_VALID) {
+                    vlan_tpid = htons(ppd->hv1.tp_vlan_tpid);
+                } else if (double_tagged) {
+                    vlan_tpid = htons(ETH_TYPE_VLAN_8021AD);
+                } else {
+                    vlan_tpid = htons(ETH_TYPE_VLAN_8021Q);
+                }
+                eth_push_vlan(buffer, vlan_tpid, htons(ppd->hv1.tp_vlan_tci));
+            }
+            dp_packet_batch_add(batch, buffer);
+            frame_num = (frame_num + 1) % TPACKET_MAX_FRAME_NUM;
+        }
+
+        fn_in_block++;
+        if (fn_in_block >= pbd->hdr.bh1.num_pkts) {
+            pbd->hdr.bh1.block_status = TP_STATUS_KERNEL;
+            block_num = (block_num + 1) %
+                            netdev->tp_rx_ring->req.tp_block_nr;
+            pbd = (struct tpacket_block_desc *)
+                     netdev->tp_rx_ring->rd[block_num].iov_base;
+            fn_in_block = 0;
+            ppd = NULL;
+        } else {
+            ppd = ALIGNED_CAST(struct tpacket3_hdr *,
+                   (uint8_t *) ppd + ppd->tp_next_offset);
+        }
+        i++;
+    }
+
+    netdev->tp_rx_ring->block_num = block_num;
+    netdev->tp_rx_ring->frame_num = frame_num;
+    netdev->tp_rx_ring->frame_num_in_block = fn_in_block;
+    netdev->tp_rx_ring->ppd = ppd;
+
+    return 0;
+}
+#endif /* HAVE_TPACKET_V3 */
+
 static int
 netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
                       int *qfill)
@@ -1462,12 +1749,13 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     struct netdev *netdev = rx->up.netdev;
     ssize_t retval;
     int mtu;
+    bool tso = userspace_tso_enabled();
 
     if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
         mtu = ETH_PAYLOAD_MAX;
     }
 
-    if (userspace_tso_enabled()) {
+    if (tso) {
         /* Allocate TSO packets. The packet has enough headroom to store
          * a full non-TSO packet. When a TSO packet is received, the data
          * from non-TSO buffer (std_len) is prepended to the TSO packet
@@ -1485,9 +1773,19 @@  netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
     }
 
     dp_packet_batch_init(batch);
-    retval = (rx->is_tap
-              ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
-              : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
+    if (rx->is_tap) {
+        retval = netdev_linux_batch_rxq_recv_tap(rx, tso, mtu, batch);
+    } else {
+        if (tso) {
+            retval = netdev_linux_batch_rxq_recv_sock(rx, tso, mtu, batch);
+        } else {
+#ifndef HAVE_TPACKET_V3
+            retval = netdev_linux_batch_rxq_recv_sock(rx, tso, mtu, batch);
+#else
+            retval = netdev_linux_batch_recv_tpacket(rx, tso, mtu, batch);
+#endif
+        }
+    }
 
     if (retval) {
         if (retval != EAGAIN && retval != EMSGSIZE) {
@@ -1692,6 +1990,83 @@  netdev_linux_get_numa_id(const struct netdev *netdev_)
     return numa_id;
 }
 
+#ifdef HAVE_TPACKET_V3
+static inline int
+tpacket_tx_is_ready(void * next_frame)
+{
+    struct tpacket3_hdr *hdr = ALIGNED_CAST(struct tpacket3_hdr *, next_frame);
+
+    return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
+}
+
+static int
+netdev_linux_tpacket_batch_send(struct netdev *netdev_, bool tso, int mtu,
+                            struct dp_packet_batch *batch)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+    struct dp_packet *packet;
+    int sockfd;
+    ssize_t bytes_sent;
+    int total_pkts = 0;
+
+    unsigned int frame_nr = netdev->tp_tx_ring->req.tp_frame_nr;
+    unsigned int frame_num = netdev->tp_tx_ring->frame_num;
+
+    /* The Linux tap driver returns EIO if the device is not up,
+     * so if the device is not up, don't waste time sending it.
+     * However, if the device is in another network namespace
+     * then OVS can't retrieve the state. In that case, send the
+     * packets anyway. */
+    if (netdev->present && !(netdev->ifi_flags & IFF_UP)) {
+        netdev->tx_dropped += dp_packet_batch_size(batch);
+        return 0;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        size_t size;
+        struct tpacket3_hdr *ppd;
+
+        if (tso) {
+            netdev_linux_prepend_vnet_hdr(packet, mtu);
+        }
+
+        size = dp_packet_size(packet);
+        ppd = tpacket_get_next_frame(netdev->tp_tx_ring, frame_num);
+
+        if (!tpacket_tx_is_ready(ppd)) {
+            break;
+        }
+        ppd->tp_snaplen = size;
+        ppd->tp_len = size;
+        ppd->tp_next_offset = 0;
+
+        memcpy((uint8_t *)ppd + TPACKET3_HDRLEN - sizeof(struct sockaddr_ll),
+               dp_packet_data(packet),
+               size);
+        ppd->tp_status = TP_STATUS_SEND_REQUEST;
+        frame_num = (frame_num + 1) % frame_nr;
+        total_pkts++;
+    }
+    netdev->tp_tx_ring->frame_num = frame_num;
+
+    /* Kick-off transmits */
+    if (total_pkts != 0) {
+        sockfd = netdev->tp_tx_ring->sockfd;
+        bytes_sent = sendto(sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+        if (bytes_sent == -1 &&
+                errno != ENOBUFS && errno != EAGAIN) {
+            /*
+             * In case of an ENOBUFS/EAGAIN error all of the enqueued
+             * packets will be considered successful even though only some
+             * are sent.
+             */
+            netdev->tx_dropped += dp_packet_batch_size(batch);
+        }
+    }
+    return 0;
+}
+#endif
+
 /* Sends 'batch' on 'netdev'.  Returns 0 if successful, otherwise a positive
  * errno value.  Returns EAGAIN without blocking if the packet cannot be queued
  * immediately.  Returns EMSGSIZE if a partial packet was transmitted or if
@@ -1731,7 +2106,17 @@  netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
             goto free_batch;
         }
 
-        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
+        if (tso) {
+            error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu,
+                                                 batch);
+        } else {
+#ifndef HAVE_TPACKET_V3
+            error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu,
+                                                 batch);
+#else
+            error = netdev_linux_tpacket_batch_send(netdev_, tso, mtu, batch);
+#endif
+        }
     } else {
         error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
     }
@@ -3562,7 +3947,7 @@  exit:
 const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "system",
-    .is_pmd = false,
+    .is_pmd = true,
     .construct = netdev_linux_construct,
     .destruct = netdev_linux_destruct,
     .get_stats = netdev_linux_get_stats,