[ovs-dev,v1,0/6] Memory access optimization for flow scalability of userspace datapath.

Message ID 20200602071005.29925-1-Yanqin.Wei@arm.com

Message

Yanqin Wei June 2, 2020, 7:09 a.m. UTC
The OVS userspace datapath is a program with heavy memory access. It needs to
load/store a large amount of memory, including packet headers, metadata, the
EMC/SMC/DPCLS tables, and so on. This causes many cache-line misses and
refills, which has a great impact on flow scalability. In some cases, EMC even
has a negative impact on overall performance, and it is difficult for users to
manage enabling EMC dynamically.

This series of patches improves memory access in the userspace datapath as
follows:
1. Reduce the number of metadata cache lines accessed by non-tunnel traffic.
2. Remove unnecessary memory loads/stores for batches/flows.
3. Change the layout of the EMC data structure, centralizing the storage of
hash values.
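The first optimization can be sketched roughly as follows. Note this is an
illustrative layout, not the actual OVS `struct pkt_metadata`; the flag name
and field sizes are assumptions made for the sketch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative metadata layout: the tunnel fields occupy extra cache
 * lines and are only meaningful when tunnel_valid is set. */
struct pkt_metadata_sketch {
    bool tunnel_valid;       /* hypothetical flag, in the spirit of patch 2 */
    uint32_t in_port;
    uint8_t tunnel[128];     /* stand-in for a large tunnel struct */
};

/* For non-tunnel traffic, checking the flag first lets us skip loading
 * and comparing the large tunnel block, so those cache lines are never
 * touched. */
static bool
metadata_equal(const struct pkt_metadata_sketch *a,
               const struct pkt_metadata_sketch *b)
{
    if (a->tunnel_valid != b->tunnel_valid) {
        return false;
    }
    if (!a->tunnel_valid) {
        return a->in_port == b->in_port;  /* tunnel bytes not accessed */
    }
    return a->in_port == b->in_port
           && !memcmp(a->tunnel, b->tunnel, sizeof a->tunnel);
}
```

The point of the early `!a->tunnel_valid` branch is that the common
(non-tunnel) path finishes after reading only the first cache line of the
metadata.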

In NIC2NIC traffic tests, an overall performance improvement is observed,
especially in the multi-flow cases.
Flows           delta
1-1K flows      5-10%
10K flows       20%
100K flows      40%
EMC disable     10%

Malvika Gupta (1):
  [ovs-dev] dpif-netdev: Modify dfc_processing function to void function

Yanqin Wei (5):
  netdev: avoid unnecessary packet batch refilling in netdev feature
    check
  dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
  dpif-netdev: improve emc lookup performance by contiguous storage of
    hash value.
  dpif-netdev: skip flow hash calculation in case of smc disabled
  dpif-netdev: remove unnecessary key length calculation in fast path

 lib/dp-packet.h   |  12 +++--
 lib/dpif-netdev.c | 115 ++++++++++++++++++++++++----------------------
 lib/flow.c        |   2 +-
 lib/netdev.c      |  13 ++++--
 lib/packets.h     |  46 ++++++++++++++++---
 5 files changed, 120 insertions(+), 68 deletions(-)

Comments

Yanqin Wei June 30, 2020, 9:26 a.m. UTC | #1
Hi, every contributor

These patches can significantly improve the multi-flow throughput of the userspace datapath. If reviewing all the patches would take too much time, I suggest looking at the 2nd and 3rd first, which contain the major improvements:
[ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
[ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by contiguous storage of hash value.
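For readers skimming patch 3, the idea behind contiguous hash storage can be
sketched as below. The entry sizes and the bucket shape are illustrative
assumptions, not the actual OVS EMC structures:

```c
#include <stdint.h>

#define EMC_SKETCH_ENTRIES 8

/* "Before" (illustrative): each entry carries its hash next to a large
 * key, so comparing the hashes of N candidate entries touches N widely
 * separated cache lines. */
struct emc_entry_old {
    uint32_t hash;
    uint8_t key[124];            /* stand-in for the miniflow key */
};

/* "After": the hashes of all candidate entries are stored contiguously,
 * so the hash-compare loop typically stays within one 64-byte line. */
struct emc_bucket_new {
    uint32_t hashes[EMC_SKETCH_ENTRIES];   /* 32 contiguous bytes */
    uint8_t keys[EMC_SKETCH_ENTRIES][124];
};

/* Hash probe against the contiguous array: a single streaming scan.
 * Only on a hash match would the (cold) full key be loaded and compared. */
static int
emc_probe_sketch(const struct emc_bucket_new *b, uint32_t hash)
{
    for (int i = 0; i < EMC_SKETCH_ENTRIES; i++) {
        if (b->hashes[i] == hash) {
            return i;
        }
    }
    return -1;
}
```

With eight 4-byte hashes the whole compare loop reads 32 bytes, half of a
typical 64-byte cache line, instead of pulling in one line per candidate
entry.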

Any comments from anyone are appreciated.

Best Regards,
Wei Yanqin

> -----Original Message-----
> From: Yanqin Wei <Yanqin.Wei@arm.com>
> Sent: Tuesday, June 2, 2020 3:10 PM
> To: dev@openvswitch.org
> Cc: nd <nd@arm.com>; i.maximets@ovn.org; u9012063@gmail.com; Malvika
> Gupta <Malvika.Gupta@arm.com>; Lijian Zhang <Lijian.Zhang@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Lance Yang
> <Lance.Yang@arm.com>; Yanqin Wei <Yanqin.Wei@arm.com>
> Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> scalability of userspace datapath.
> 
> OVS userspace datapath is a program with heavy memory access. It needs to
> load/store a large number of memory, including packet header, metadata,
> EMC/SMC/DPCLS tables and so on. It causes a lot of cache line missing and
> refilling, which has a great impact on flow scalability. And in some cases, EMC
> has a negative impact on the overall performance. It is difficult for user to
> dynamically manage the enabling of EMC.
> 
> This series of patches improve memory access of userspace datapath as
> follows:
> 1. Reduce the number of metadata cache line accessed by non-tunnel traffic.
> 2. Decrease unnecessary memory load/store for batch/flow.
> 3. Modify the layout of EMC data struct. Centralize the storage of hash value.
> 
> In the NIC2NIC traffic tests, the overall performance improvement is observed,
> especially in multi-flow cases.
> Flows           delta
> 1-1K flows      5-10%
> 10K flows       20%
> 100K flows      40%
> EMC disable     10%
> 
> Malvika Gupta (1):
>   [ovs-dev] dpif-netdev: Modify dfc_processing function to void function
> 
> Yanqin Wei (5):
>   netdev: avoid unnecessary packet batch refilling in netdev feature
>     check
>   dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
>   dpif-netdev: improve emc lookup performance by contiguous storage of
>     hash value.
>   dpif-netdev: skip flow hash calculation in case of smc disabled
>   dpif-netdev: remove unnecessary key length calculation in fast path
> 
>  lib/dp-packet.h   |  12 +++--
>  lib/dpif-netdev.c | 115 ++++++++++++++++++++++++----------------------
>  lib/flow.c        |   2 +-
>  lib/netdev.c      |  13 ++++--
>  lib/packets.h     |  46 ++++++++++++++++---
>  5 files changed, 120 insertions(+), 68 deletions(-)
> 
> --
> 2.17.1
William Tu July 5, 2020, 1:22 p.m. UTC | #2
On Tue, Jun 30, 2020 at 2:26 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
>
> Hi, every contributor
>
> These patches could significantly improve multi-flow throughput of userspace datapath.  If you feel it will take too much time to review all patches, I suggest you could look at the 2nd/3rd first, which have the major improvement in these patches.
> [ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
> [ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by contiguous storage of hash value.
>
> Any comments from anyone are appreciated.
>
> Best Regards,
> Wei Yanqin
>
> > -----Original Message-----
> > From: Yanqin Wei <Yanqin.Wei@arm.com>
> > Sent: Tuesday, June 2, 2020 3:10 PM
> > To: dev@openvswitch.org
> > Cc: nd <nd@arm.com>; i.maximets@ovn.org; u9012063@gmail.com; Malvika
> > Gupta <Malvika.Gupta@arm.com>; Lijian Zhang <Lijian.Zhang@arm.com>;
> > Ruifeng Wang <Ruifeng.Wang@arm.com>; Lance Yang
> > <Lance.Yang@arm.com>; Yanqin Wei <Yanqin.Wei@arm.com>
> > Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> > scalability of userspace datapath.
> >
> > OVS userspace datapath is a program with heavy memory access. It needs to
> > load/store a large number of memory, including packet header, metadata,
> > EMC/SMC/DPCLS tables and so on. It causes a lot of cache line missing and
> > refilling, which has a great impact on flow scalability. And in some cases, EMC
> > has a negative impact on the overall performance. It is difficult for user to
> > dynamically manage the enabling of EMC.
> >
> > This series of patches improve memory access of userspace datapath as
> > follows:
> > 1. Reduce the number of metadata cache line accessed by non-tunnel traffic.
> > 2. Decrease unnecessary memory load/store for batch/flow.
> > 3. Modify the layout of EMC data struct. Centralize the storage of hash value.
> >
> > In the NIC2NIC traffic tests, the overall performance improvement is observed,
> > especially in multi-flow cases.
> > Flows           delta
> > 1-1K flows      5-10%
> > 10K flows       20%
> > 100K flows      40%
> > EMC disable     10%

Thanks for submitting the patch series. I applied the series and I do see the
performance improvement you describe above.
Btw, are your numbers from an ARM server or x86?
Below are my numbers using a single flow and a drop action on an Intel(R)
Xeon(R) CPU @ 2.00GHz.
In summary, I see around a 10% improvement with 1 flow.

=== master ===
root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 0:
  packets received: 96269888
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 87513839
  smc hits: 0
  megaflow hits: 8755584
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 432
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 20083008856 (100.00%)
  avg cycles per packet: 208.61 (20083008856/96269888)
  avg processing cycles per packet: 208.61 (20083008856/96269888)

=== master without EMC ===
pmd thread numa_id 0 core_id 1:
  packets received: 90775936
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 90775424
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 479
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 21239087946 (100.00%)
  avg cycles per packet: 233.97 (21239087946/90775936)
  avg processing cycles per packet: 233.97 (21239087946/90775936)

=== yanqin v1: ===
pmd thread numa_id 0 core_id 1:
  packets received: 156582112
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 142344109
  smc hits: 0
  megaflow hits: 14237554
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 448
  avg. packets per output batch: 0.00
  idle cycles: 4320112 (0.01%)
  processing cycles: 30503055968 (99.99%)
  avg cycles per packet: 194.83 (30507376080/156582112)
  avg processing cycles per packet: 194.81 (30503055968/156582112)

=== yanqin v1 without EMC: ===
pmd thread numa_id 0 core_id 0:
  packets received: 48441664
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 48441182
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 449
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 10513468302 (100.00%)
  avg cycles per packet: 217.03 (10513468302/48441664)
  avg processing cycles per packet: 217.03 (10513468302/48441664)
William Tu July 5, 2020, 1:26 p.m. UTC | #3
Hi Yanqin,

On Tue, Jun 2, 2020 at 12:10 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
>
> OVS userspace datapath is a program with heavy memory access. It needs to
> load/store a large number of memory, including packet header, metadata,
> EMC/SMC/DPCLS tables and so on. It causes a lot of cache line missing and
> refilling, which has a great impact on flow scalability. And in some cases,
> EMC has a negative impact on the overall performance. It is difficult for
> user to dynamically manage the enabling of EMC.

I'm just curious.
Did you do some micro performance benchmark to find out these cache line issues?
If so, what kind of tool do you use?
Or do you do it by inspecting the code?

Thanks
William
Yanqin Wei July 6, 2020, 10:22 a.m. UTC | #4
Hi William,

Many thanks for taking the time to test these patches. The numbers were collected on an Arm server, but x86 shows a similar improvement.
CPU cache size will slightly affect the performance data, because the larger the cache, the lower the probability of cache-line refills/evictions.

Best Regards,
Wei Yanqin 
> 
> On Tue, Jun 30, 2020 at 2:26 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
> >
> > Hi, every contributor
> >
> > These patches could significantly improve multi-flow throughput of
> userspace datapath.  If you feel it will take too much time to review all patches,
> I suggest you could look at the 2nd/3rd first, which have the major
> improvement in these patches.
> > [ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip
> > ip/ipv6 address comparison [ovs-dev][PATCH v1 3/6] dpif-netdev: improve
> emc lookup performance by contiguous storage of hash value.
> >
> > Any comments from anyone are appreciated.
> >
> > Best Regards,
> > Wei Yanqin
> >
> > > -----Original Message-----
> > > From: Yanqin Wei <Yanqin.Wei@arm.com>
> > > Sent: Tuesday, June 2, 2020 3:10 PM
> > > To: dev@openvswitch.org
> > > Cc: nd <nd@arm.com>; i.maximets@ovn.org; u9012063@gmail.com;
> Malvika
> > > Gupta <Malvika.Gupta@arm.com>; Lijian Zhang <Lijian.Zhang@arm.com>;
> > > Ruifeng Wang <Ruifeng.Wang@arm.com>; Lance Yang
> > > <Lance.Yang@arm.com>; Yanqin Wei <Yanqin.Wei@arm.com>
> > > Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> > > scalability of userspace datapath.
> > >
> > > OVS userspace datapath is a program with heavy memory access. It
> > > needs to load/store a large number of memory, including packet
> > > header, metadata, EMC/SMC/DPCLS tables and so on. It causes a lot of
> > > cache line missing and refilling, which has a great impact on flow
> > > scalability. And in some cases, EMC has a negative impact on the
> > > overall performance. It is difficult for user to dynamically manage the
> enabling of EMC.
> > >
> > > This series of patches improve memory access of userspace datapath
> > > as
> > > follows:
> > > 1. Reduce the number of metadata cache line accessed by non-tunnel
> traffic.
> > > 2. Decrease unnecessary memory load/store for batch/flow.
> > > 3. Modify the layout of EMC data struct. Centralize the storage of hash
> value.
> > >
> > > In the NIC2NIC traffic tests, the overall performance improvement is
> > > observed, especially in multi-flow cases.
> > > Flows           delta
> > > 1-1K flows      5-10%
> > > 10K flows       20%
> > > 100K flows      40%
> > > EMC disable     10%
> 
> Thanks for submitting the patch series. I apply the series and I do see the
> above performance improvement you describe above.
> btw, is your number on ARM server or x86?

> Below is my number using single flow and drop action on Intel(R)
> Xeon(R) CPU @ 2.00GHz
> In summary I see around 10% improvement using 1flow.
> 
> === master ===
> root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show pmd thread
> numa_id 0 core_id 0:
>   packets received: 96269888
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 87513839
>   smc hits: 0
>   megaflow hits: 8755584
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 432
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 20083008856 (100.00%)
>   avg cycles per packet: 208.61 (20083008856/96269888)
>   avg processing cycles per packet: 208.61 (20083008856/96269888)
> 
> === master without EMC ===
> pmd thread numa_id 0 core_id 1:
>   packets received: 90775936
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 90775424
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 479
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 21239087946 (100.00%)
>   avg cycles per packet: 233.97 (21239087946/90775936)
>   avg processing cycles per packet: 233.97 (21239087946/90775936)
> 
> === yanqin v1: ===
> pmd thread numa_id 0 core_id 1:
>   packets received: 156582112
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 142344109
>   smc hits: 0
>   megaflow hits: 14237554
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 448
>   avg. packets per output batch: 0.00
>   idle cycles: 4320112 (0.01%)
>   processing cycles: 30503055968 (99.99%)
>   avg cycles per packet: 194.83 (30507376080/156582112)
>   avg processing cycles per packet: 194.81 (30503055968/156582112)
> 
> === yanqin v1 without EMC: ===
> pmd thread numa_id 0 core_id 0:
>   packets received: 48441664
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 48441182
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 449
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 10513468302 (100.00%)
>   avg cycles per packet: 217.03 (10513468302/48441664)
>   avg processing cycles per packet: 217.03 (10513468302/48441664)
Yanqin Wei July 6, 2020, 10:55 a.m. UTC | #5
Hi William,
> 
> On Tue, Jun 2, 2020 at 12:10 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
> >
> > OVS userspace datapath is a program with heavy memory access. It needs
> > to load/store a large number of memory, including packet header,
> > metadata, EMC/SMC/DPCLS tables and so on. It causes a lot of cache
> > line missing and refilling, which has a great impact on flow
> > scalability. And in some cases, EMC has a negative impact on the
> > overall performance. It is difficult for user to dynamically manage the
> enabling of EMC.
> 
> I'm just curious.
> Did you do some micro performance benchmark to find out these cache line
> issues?
Yes, we did some micro-benchmarking of packet parsing, EMC, and DPCLS. But the end-to-end test is more important, because in the fast path different data accesses affect each other. For example, a large EMC table will also impact the access efficiency of the metadata and packet headers.

> If so, what kind of tool do you use?
"perf stat -e" can record many kinds of PMU events. "perf list" shows all available events; some of them can be used to measure memory access efficiency (cache misses/refills/evictions).
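As a concrete example of that workflow, something along these lines can be
used. The event names and the 10-second window are illustrative; the exact
cache events available vary by CPU vendor and kernel, so check `perf list`
first:

```shell
# List the hardware cache events this CPU exposes
# (names differ between Arm and x86 machines).
perf list cache

# Count cache-related PMU events for a running ovs-vswitchd for 10 seconds.
# Substitute event names reported by `perf list` on your system.
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    -p "$(pidof ovs-vswitchd)" -- sleep 10
```

Comparing the miss ratios before and after a patch is one way to confirm that
a layout change actually reduced cache-line traffic.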

> Or do you do it by inspecting the code?
Code analysis is also important. We need to analyze the main data accessed in the fast path and its layout.

> 
> Thanks
> William
Van Haaren, Harry July 7, 2020, 2:56 p.m. UTC | #6
> -----Original Message-----
> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Yanqin Wei
> Sent: Tuesday, June 2, 2020 8:10 AM
> To: dev@openvswitch.org
> Cc: Ruifeng.Wang@arm.com; Lijian.Zhang@arm.com; i.maximets@ovn.org;
> nd@arm.com
> Subject: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability
> of userspace datapath.
> 
> OVS userspace datapath is a program with heavy memory access. It needs to
> load/store a large number of memory, including packet header, metadata,
> EMC/SMC/DPCLS tables and so on. It causes a lot of cache line missing and
> refilling, which has a great impact on flow scalability. And in some cases,
> EMC has a negative impact on the overall performance. It is difficult for
> user to dynamically manage the enabling of EMC.
> 
> This series of patches improve memory access of userspace datapath as
> follows:
> 1. Reduce the number of metadata cache line accessed by non-tunnel traffic.
> 2. Decrease unnecessary memory load/store for batch/flow.
> 3. Modify the layout of EMC data struct. Centralize the storage of hash
> value.
> 
> In the NIC2NIC traffic tests, the overall performance improvement is
> observed, especially in multi-flow cases.
> Flows           delta
> 1-1K flows      5-10%
> 10K flows       20%
> 100K flows      40%
> EMC disable     10%

Hi Yanqin,

A quick simple test here with EMC disabled shows performance results similar to
your data above, nice work. I think the optimizations here make sense: avoid touching
extra cache lines until required (e.g. tunnel metadata), particularly for outer packet parsing.

I hope to enable more optimizations around dpif-netdev in 2.15, so if you are also
planning to do more work in this area, it would be good to sync to avoid excessive
rebasing in future?

Regards, -Harry

<snip patch details>
Yanqin Wei July 8, 2020, 8:17 a.m. UTC | #7
Hi Harry,

> >
> > OVS userspace datapath is a program with heavy memory access. It needs
> > to load/store a large number of memory, including packet header,
> > metadata, EMC/SMC/DPCLS tables and so on. It causes a lot of cache
> > line missing and refilling, which has a great impact on flow
> > scalability. And in some cases, EMC has a negative impact on the
> > overall performance. It is difficult for user to dynamically manage the
> enabling of EMC.
> >
> > This series of patches improve memory access of userspace datapath as
> > follows:
> > 1. Reduce the number of metadata cache line accessed by non-tunnel traffic.
> > 2. Decrease unnecessary memory load/store for batch/flow.
> > 3. Modify the layout of EMC data struct. Centralize the storage of
> > hash value.
> >
> > In the NIC2NIC traffic tests, the overall performance improvement is
> > observed, especially in multi-flow cases.
> > Flows           delta
> > 1-1K flows      5-10%
> > 10K flows       20%
> > 100K flows      40%
> > EMC disable     10%
> 
> Hi Yanqin,
> 
> A quick simple test here with EMC disabled shows similar performance results
> to your data above, nice work. I think the optimizations here make sense, to
> not touch extra cache-lines until required (eg tunnel metadata), particularly
> for outer packet parsing.
Many thanks for your time to test and review the patch. 
> 
> I hope to enable more optimizations around dpif-netdev in 2.15, so if you are
> also planning to do more work in this area, it would be good to sync to avoid
> excessive rebasing in future?
That is great to hear. If we have new work planned for 2.15, we will discuss it with you and the community.
> 
> Regards, -Harry
> 
> <snip patch details>
Van Haaren, Harry July 8, 2020, 10:46 a.m. UTC | #8
> -----Original Message-----
> From: Yanqin Wei <Yanqin.Wei@arm.com>
> Sent: Wednesday, July 8, 2020 9:17 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; Lijian Zhang
> <Lijian.Zhang@arm.com>; i.maximets@ovn.org; nd <nd@arm.com>; Stokes, Ian
> <ian.stokes@intel.com>; William Tu <u9012063@gmail.com>
> Subject: RE: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow
> scalability of userspace datapath.
> 
> Hi Harry,
> 
> > >
> > > OVS userspace datapath is a program with heavy memory access. It needs
> > > to load/store a large number of memory, including packet header,
> > > metadata, EMC/SMC/DPCLS tables and so on. It causes a lot of cache
> > > line missing and refilling, which has a great impact on flow
> > > scalability. And in some cases, EMC has a negative impact on the
> > > overall performance. It is difficult for user to dynamically manage the
> > enabling of EMC.
> > >
> > > This series of patches improve memory access of userspace datapath as
> > > follows:
> > > 1. Reduce the number of metadata cache line accessed by non-tunnel traffic.
> > > 2. Decrease unnecessary memory load/store for batch/flow.
> > > 3. Modify the layout of EMC data struct. Centralize the storage of
> > > hash value.
> > >
> > > In the NIC2NIC traffic tests, the overall performance improvement is
> > > observed, especially in multi-flow cases.
> > > Flows           delta
> > > 1-1K flows      5-10%
> > > 10K flows       20%
> > > 100K flows      40%
> > > EMC disable     10%
> >
> > Hi Yanqin,
> >
> > A quick simple test here with EMC disabled shows similar performance results
> > to your data above, nice work. I think the optimizations here make sense, to
> > not touch extra cache-lines until required (eg tunnel metadata), particularly
> > for outer packet parsing.
> Many thanks for your time to test and review the patch.

Performed some EMC based tests and multi-flow tests, ran "make check" unit tests;
Tested-by: Harry van Haaren <harry.van.haaren@intel.com>

> > I hope to enable more optimizations around dpif-netdev in 2.15, so if you are
> > also planning to do more work in this area, it would be good to sync to avoid
> > excessive rebasing in future?
> That is great to hear that. If we have new work planed in 2.15, we will discuss
> with you and community.

That would be great - looking forward to collaborating there.

> > Regards, -Harry
> >
> > <snip patch details>