Message ID: 20200602071005.29925-1-Yanqin.Wei@arm.com
Series: Memory access optimization for flow scalability of userspace datapath.
Hi, every contributor,

These patches can significantly improve the multi-flow throughput of the userspace datapath. If reviewing all the patches would take too much time, I suggest looking at the 2nd and 3rd first, which contain the major improvements in this series:

[ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
[ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by contiguous storage of hash value.

Any comments from anyone are appreciated.

Best Regards,
Wei Yanqin

> -----Original Message-----
> From: Yanqin Wei <Yanqin.Wei@arm.com>
> Sent: Tuesday, June 2, 2020 3:10 PM
> To: dev@openvswitch.org
> Cc: nd <nd@arm.com>; i.maximets@ovn.org; u9012063@gmail.com; Malvika
> Gupta <Malvika.Gupta@arm.com>; Lijian Zhang <Lijian.Zhang@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Lance Yang
> <Lance.Yang@arm.com>; Yanqin Wei <Yanqin.Wei@arm.com>
> Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> scalability of userspace datapath.
>
> The OVS userspace datapath is a memory-access-heavy program. It
> loads/stores a large amount of memory, including packet headers, metadata,
> and the EMC/SMC/DPCLS tables. This causes many cache line misses and
> refills, which have a great impact on flow scalability. In some cases, EMC
> even has a negative impact on overall performance, and it is difficult for
> users to manage the enabling of EMC dynamically.
>
> This series of patches improves the memory access of the userspace
> datapath as follows:
> 1. Reduce the number of metadata cache lines accessed by non-tunnel traffic.
> 2. Decrease unnecessary memory loads/stores per batch/flow.
> 3. Modify the layout of the EMC data structure to centralize the storage
> of hash values.
>
> In NIC2NIC traffic tests, an overall performance improvement is observed,
> especially in multi-flow cases.
> Flows        delta
> 1-1K flows   5-10%
> 10K flows    20%
> 100K flows   40%
> EMC disable  10%
>
> Malvika Gupta (1):
>   [ovs-dev] dpif-netdev: Modify dfc_processing function to void function
>
> Yanqin Wei (5):
>   netdev: avoid unnecessary packet batch refilling in netdev feature
>     check
>   dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
>   dpif-netdev: improve emc lookup performance by contiguous storage of
>     hash value.
>   dpif-netdev: skip flow hash calculation in case of smc disabled
>   dpif-netdev: remove unnecessary key length calculation in fast path
>
>  lib/dp-packet.h   |  12 +++--
>  lib/dpif-netdev.c | 115 ++++++++++++++++++++++++----------------------
>  lib/flow.c        |   2 +-
>  lib/netdev.c      |  13 ++++--
>  lib/packets.h     |  46 ++++++++++++++++---
>  5 files changed, 120 insertions(+), 68 deletions(-)
>
> --
> 2.17.1
On Tue, Jun 30, 2020 at 2:26 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
>
> Hi, every contributor
>
> These patches could significantly improve multi-flow throughput of
> userspace datapath. If you feel it will take too much time to review all
> patches, I suggest you could look at the 2nd/3rd first, which have the
> major improvement in these patches.
> [ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
> [ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by contiguous storage of hash value.
>
> Any comments from anyone are appreciated.
>
> Best Regards,
> Wei Yanqin
>
> <snip original cover letter>
> >
> > In the NIC2NIC traffic tests, the overall performance improvement is
> > observed, especially in multi-flow cases.
> > Flows        delta
> > 1-1K flows   5-10%
> > 10K flows    20%
> > 100K flows   40%
> > EMC disable  10%

Thanks for submitting the patch series. I applied the series, and I do see the performance improvement you describe above. By the way, are your numbers from an Arm server or x86?

Below are my numbers using a single flow and a drop action on an Intel(R) Xeon(R) CPU @ 2.00GHz. In summary, I see around a 10% improvement with 1 flow.

=== master ===
root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 0:
  packets received: 96269888
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 87513839
  smc hits: 0
  megaflow hits: 8755584
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 432
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 20083008856 (100.00%)
  avg cycles per packet: 208.61 (20083008856/96269888)
  avg processing cycles per packet: 208.61 (20083008856/96269888)

=== master without EMC ===
pmd thread numa_id 0 core_id 1:
  packets received: 90775936
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 90775424
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 479
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 21239087946 (100.00%)
  avg cycles per packet: 233.97 (21239087946/90775936)
  avg processing cycles per packet: 233.97 (21239087946/90775936)

=== yanqin v1 ===
pmd thread numa_id 0 core_id 1:
  packets received: 156582112
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 142344109
  smc hits: 0
  megaflow hits: 14237554
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 448
  avg. packets per output batch: 0.00
  idle cycles: 4320112 (0.01%)
  processing cycles: 30503055968 (99.99%)
  avg cycles per packet: 194.83 (30507376080/156582112)
  avg processing cycles per packet: 194.81 (30503055968/156582112)

=== yanqin v1 without EMC ===
pmd thread numa_id 0 core_id 0:
  packets received: 48441664
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 48441182
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 449
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 10513468302 (100.00%)
  avg cycles per packet: 217.03 (10513468302/48441664)
  avg processing cycles per packet: 217.03 (10513468302/48441664)
Hi Yanqin,

On Tue, Jun 2, 2020 at 12:10 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
>
> OVS userspace datapath is a program with heavy memory access. It needs to
> load/store a large number of memory, including packet header, metadata,
> EMC/SMC/DPCLS tables and so on. It causes a lot of cache line missing and
> refilling, which has a great impact on flow scalability. And in some cases,
> EMC has a negative impact on the overall performance. It is difficult for
> user to dynamically manage the enabling of EMC.

I'm just curious. Did you do some micro performance benchmarking to find out these cache line issues? If so, what kind of tool did you use? Or did you do it by inspecting the code?

Thanks,
William
Hi William,

Many thanks for your time testing these patches. The numbers were achieved on an Arm server, but x86 shows a similar improvement. CPU cache size will slightly affect the performance data, because the larger the cache, the lower the probability of cache refilling/eviction.

Best Regards,
Wei Yanqin

> On Tue, Jun 30, 2020 at 2:26 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
> >
> > Hi, every contributor
> >
> > <snip review request and original cover letter>
>
> Thanks for submitting the patch series. I apply the series and I do see the
> above performance improvement you describe above.
> btw, is your number on ARM server or x86?
> Below is my number using single flow and drop action on Intel(R)
> Xeon(R) CPU @ 2.00GHz
> In summary I see around 10% improvement using 1flow.
>
> <snip pmd-stats-show output>
Hi William,

> On Tue, Jun 2, 2020 at 12:10 AM Yanqin Wei <Yanqin.Wei@arm.com> wrote:
> >
> > <snip original cover letter>
>
> I'm just curious.
> Did you do some micro performance benchmark to find out these cache line
> issues?

Yes, we did some micro-benchmarking for packet parsing, EMC, and DPCLS. But the end-to-end test is more important, because in the fast path different data accesses affect each other. For example, a large EMC table will also impact the access efficiency of the metadata or packet headers.

> If so, what kind of tool do you use?

"perf stat -e" can record many kinds of PMU events. We can use "perf list" to list all events; some of them can be used to measure memory access efficiency (cache misses/refills/evictions).

> Or do you do it by inspecting the code?

Code analysis is also important. We need to analyze the main data accessed in the fast path and its layout.

> Thanks
> William
> -----Original Message-----
> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Yanqin Wei
> Sent: Tuesday, June 2, 2020 8:10 AM
> To: dev@openvswitch.org
> Cc: Ruifeng.Wang@arm.com; Lijian.Zhang@arm.com; i.maximets@ovn.org;
> nd@arm.com
> Subject: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow
> scalability of userspace datapath.
>
> <snip cover letter>
>
> In the NIC2NIC traffic tests, the overall performance improvement is
> observed, especially in multi-flow cases.
> Flows        delta
> 1-1K flows   5-10%
> 10K flows    20%
> 100K flows   40%
> EMC disable  10%

Hi Yanqin,

A quick simple test here with EMC disabled shows similar performance results to your data above, nice work. I think the optimizations here make sense: don't touch extra cache lines until required (e.g. tunnel metadata), particularly for outer packet parsing.

I hope to enable more optimizations around dpif-netdev in 2.15, so if you are also planning to do more work in this area, it would be good to sync up to avoid excessive rebasing in the future.

Regards, -Harry

<snip patch details>
Hi Harry,

> > <snip cover letter and performance table>
>
> Hi Yanqin,
>
> A quick simple test here with EMC disabled shows similar performance
> results to your data above, nice work. I think the optimizations here
> make sense, to not touch extra cache-lines until required (eg tunnel
> metadata), particularly for outer packet parsing.

Many thanks for your time testing and reviewing the patch.

> I hope to enable more optimizations around dpif-netdev in 2.15, so if you
> are also planning to do more work in this area, it would be good to sync
> to avoid excessive rebasing in future?

That is great to hear. If we have new work planned for 2.15, we will discuss it with you and the community.

> Regards, -Harry
>
> <snip patch details>
> -----Original Message-----
> From: Yanqin Wei <Yanqin.Wei@arm.com>
> Sent: Wednesday, July 8, 2020 9:17 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; Lijian Zhang
> <Lijian.Zhang@arm.com>; i.maximets@ovn.org; nd <nd@arm.com>; Stokes, Ian
> <ian.stokes@intel.com>; William Tu <u9012063@gmail.com>
> Subject: RE: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow
> scalability of userspace datapath.
>
> Hi Harry,
>
> > <snip cover letter and performance table>
> >
> > A quick simple test here with EMC disabled shows similar performance
> > results to your data above, nice work. I think the optimizations here
> > make sense, to not touch extra cache-lines until required (eg tunnel
> > metadata), particularly for outer packet parsing.
>
> Many thanks for your time to test and review the patch.

Performed some EMC-based tests and multi-flow tests, and ran the "make check" unit tests;

Tested-by: Harry van Haaren <harry.van.haaren@intel.com>

> > I hope to enable more optimizations around dpif-netdev in 2.15, so if
> > you are also planning to do more work in this area, it would be good to
> > sync to avoid excessive rebasing in future?
>
> That is great to hear. If we have new work planned for 2.15, we will
> discuss it with you and the community.

That would be great - looking forward to collaborating there.