Message ID | 20210517140434.59555-16-cian.ferriter@intel.com
State      | Changes Requested
Series     | DPIF Framework + Optimizations
> From: Harry van Haaren <harry.van.haaren@intel.com>
>
> This commit optimizes the output action, by enabling the compiler to
> optimize the code better through reducing code complexity.
>
> The core concept of this optimization is that the array-length checks
> have already been performed above the copying code, so can be removed.
> Removing the per-packet length checks allows the compiler to
> auto-vectorize the stores using SIMD registers.
>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

Thanks for the patch Cian/Harry.

Comments inline.

> ---
>
> v8: Add NEWS entry.
> ---
>  NEWS              |  1 +
>  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
>  2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index d04dac746..a7ec34af1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -17,6 +17,7 @@ Post-v2.15.0
>       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if
>         the CPU supports it. This enhances performance by using the native
>         vpopcount instructions, instead of the emulated version of vpopcount.
> +     * Optimize dp_netdev_output by enhancing compiler optimization potential.
>     - ovs-ctl:
>       * New option '--no-record-hostname' to disable hostname configuration
>         in ovsdb on startup.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 04a4f71cb..f46269ae3 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
>          pmd->n_output_batches++;
>      }
>
> -    struct dp_packet *packet;
> -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -            pmd->ctx.last_rxq;
> -        dp_packet_batch_add(&p->output_pkts, packet);
> +    /* The above checks ensure that there is enough space in the output batch.
> +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> +     * This branch reduces the compiler's ability to optimize efficiently. The
> +     * below code implements packet movement between batches without checks,
> +     * with the required semantics of output batch perhaps containing packets.
> +     */

What is the performance gain recorded with this approach with compiler
optimizations?

> +    int batch_size = dp_packet_batch_size(packets_);
> +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> +    struct dp_packet_batch *output_batch = &p->output_pkts;
> +
> +    for (int i = 0; i < batch_size; i++) {
> +        struct dp_packet *packet = packets_->packets[i];
> +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> +        output_batch->packets[out_batch_idx] = packet;
> +        out_batch_idx++;
>      }
> +    output_batch->count += batch_size;

So I guess the counter-argument here is that the previous approach uses the
standard dp_packet functions.

There has been work over time to ensure that dp_packets are handled in a
similar way using the dp_packet API.

This, I think, reverses that position. That's why I'd like to get a sense of
the performance gain, as I think it's important to justify breaking from
code conformance.

Has there been thought to modifying dp_packet_batch_add?

Regards
Ian

> +
>      return true;
>  }
>
> --
> 2.31.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Hi Ian,

Thanks for the review. My responses are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes@intel.com>
> Sent: Tuesday 15 June 2021 18:04
> To: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> Ferriter, Cian <cian.ferriter@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: i.maximets@ovn.org
> Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
>
> > From: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > This commit optimizes the output action, by enabling the compiler to
> > optimize the code better through reducing code complexity.
> >
> > [...]
>
> What is the performance gain recorded with this approach with compiler
> optimizations?

The test was run using the scalar DPIF, 1 flow and EMC on, where the cycle
cost is 243 cycles per packet before the patch and 230 cycles per packet
after. Hence this optimization saves 13 cycles per packet in my
measurements.

> > [...]
>
> So I guess the counter-argument here is that the previous approach uses
> the standard dp_packet functions.
>
> There has been work over time to ensure that dp_packets are handled in a
> similar way using the dp_packet API.
>
> This, I think, reverses that position. That's why I'd like to get a sense
> of the performance gain, as I think it's important to justify breaking
> from code conformance.
>
> Has there been thought to modifying dp_packet_batch_add?

We thought about adding an "unsafe" version of dp_packet_batch_add() which
doesn't check whether the packet batch is full. Because this optimization
depends on other checks in the surrounding code, it isn't appropriate to
have a generic function containing this optimization.

> Regards
> Ian
>
> > [...]
> Hi Ian,
>
> Thanks for the review. My responses are inline.
>
> > [...]
> >
> > What is the performance gain recorded with this approach with compiler
> > optimizations?
>
> The test was run using the scalar DPIF, 1 flow and EMC on, where the
> cycle cost is 243 cycles per packet before the patch and 230 cycles per
> packet after. Hence this optimization saves 13 cycles per packet in my
> measurements.

So I think that info is important and should be added to the commit
message, and it will help inform whether to make the change.

Ilya has worked in this area extensively, so I would be interested to hear
his opinion.

I don't think this is a blocker to the DPIF work, so it could possibly be
submitted separately as a patch for master?

Regards
Ian

> [...]
Hi Ian,

Further responses are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes@intel.com>
> Sent: Wednesday 16 June 2021 13:47
> To: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> Van Haaren, Harry <harry.van.haaren@intel.com>; i.maximets@ovn.org
> Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
>
> > [...]
> >
> > The test was run using the scalar DPIF, 1 flow and EMC on, where the
> > cycle cost is 243 cycles per packet before the patch and 230 cycles
> > per packet after. Hence this optimization saves 13 cycles per packet
> > in my measurements.
>
> So I think that info is important and should be added to the commit
> message, and it will help inform whether to make the change.

I added this information to the commit message.

> Ilya has worked in this area extensively, so I would be interested to
> hear his opinion.
>
> I don't think this is a blocker to the DPIF work, so it could possibly
> be submitted separately as a patch for master?

Correct, it doesn't block the DPIF work. I'll submit it as a separate
patch.

> Regards
> Ian
>
> > [...]
diff --git a/NEWS b/NEWS
index d04dac746..a7ec34af1 100644
--- a/NEWS
+++ b/NEWS
@@ -17,6 +17,7 @@ Post-v2.15.0
      * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if
        the CPU supports it. This enhances performance by using the native
        vpopcount instructions, instead of the emulated version of vpopcount.
+     * Optimize dp_netdev_output by enhancing compiler optimization potential.
    - ovs-ctl:
      * New option '--no-record-hostname' to disable hostname configuration
        in ovsdb on startup.
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 04a4f71cb..f46269ae3 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7290,12 +7290,25 @@ dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
         pmd->n_output_batches++;
     }
 
-    struct dp_packet *packet;
-    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
-        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
-            pmd->ctx.last_rxq;
-        dp_packet_batch_add(&p->output_pkts, packet);
+    /* The above checks ensure that there is enough space in the output batch.
+     * Using dp_packet_batch_add() has a branch to check if the batch is full.
+     * This branch reduces the compiler's ability to optimize efficiently. The
+     * below code implements packet movement between batches without checks,
+     * with the required semantics of output batch perhaps containing packets.
+     */
+    int batch_size = dp_packet_batch_size(packets_);
+    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
+    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
+    struct dp_packet_batch *output_batch = &p->output_pkts;
+
+    for (int i = 0; i < batch_size; i++) {
+        struct dp_packet *packet = packets_->packets[i];
+        p->output_pkts_rxqs[out_batch_idx] = rxq;
+        output_batch->packets[out_batch_idx] = packet;
+        out_batch_idx++;
     }
+    output_batch->count += batch_size;
+
     return true;
 }