diff mbox series

[ovs-dev,v12,15/16] dpif-netdev: Optimize dp output action.

Message ID 20210517140434.59555-16-cian.ferriter@intel.com
State Changes Requested
Headers show
Series DPIF Framework + Optimizations | expand

Commit Message

Ferriter, Cian May 17, 2021, 2:04 p.m. UTC
From: Harry van Haaren <harry.van.haaren@intel.com>

This commit optimizes the output action, by enabling the compiler to
optimize the code better through reducing code complexity.

The core concept of this optimization is that the array-length checks
have already been performed above the copying code, so can be removed.
Removing of the per-packet length checks allows the compiler to auto-vectorize
the stores using SIMD registers.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

v8: Add NEWS entry.
---
 NEWS              |  1 +
 lib/dpif-netdev.c | 23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

Comments

Stokes, Ian June 15, 2021, 5:04 p.m. UTC | #1
> From: Harry van Haaren <harry.van.haaren@intel.com>
> 
> This commit optimizes the output action, by enabling the compiler to
> optimize the code better through reducing code complexity.
> 
> The core concept of this optimization is that the array-length checks
> have already been performed above the copying code, so can be removed.
> Removing of the per-packet length checks allows the compiler to auto-vectorize
> the stores using SIMD registers.
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 

Thanks for the patch Cian/Harry.

Comments inline.
> ---
> 
> v8: Add NEWS entry.
> ---
>  NEWS              |  1 +
>  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
>  2 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index d04dac746..a7ec34af1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -17,6 +17,7 @@ Post-v2.15.0
>       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if
> the
>         CPU supports it. This enhances performance by using the native vpopcount
>         instructions, instead of the emulated version of vpopcount.
> +     * Optimize dp_netdev_output by enhancing compiler optimization potential.
>     - ovs-ctl:
>       * New option '--no-record-hostname' to disable hostname configuration
>         in ovsdb on startup.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 04a4f71cb..f46269ae3 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct
> dp_netdev_pmd_thread *pmd,
>          pmd->n_output_batches++;
>      }
> 
> -    struct dp_packet *packet;
> -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -            pmd->ctx.last_rxq;
> -        dp_packet_batch_add(&p->output_pkts, packet);
> +    /* The above checks ensure that there is enough space in the output batch.
> +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> +     * This branch reduces the compiler's ability to optimize efficiently. The
> +     * below code implements packet movement between batches without
> checks,
> +     * with the required semantics of output batch perhaps containing packets.
> +     */

What is the performance gain recorded with this approach with compiler optimizations?

> +    int batch_size = dp_packet_batch_size(packets_);
> +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> +    struct dp_packet_batch *output_batch = &p->output_pkts;
> +
> +    for (int i = 0; i < batch_size; i++) {
> +        struct dp_packet *packet = packets_->packets[i];
> +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> +        output_batch->packets[out_batch_idx] = packet;
> +        out_batch_idx++;
>      }
> +    output_batch->count += batch_size;

So I guess the counter argument here is that the previous approach uses the dp_packet standard functions.

There has been work over time to ensure that dp_packets are handled in a similar way using the dp_packet api.

This I think reverses the position. That's why I'd like to get a sense of the performance gain as I think it's important to justify breaking from code conformance.

Has there been thought to modifying the dp_packet_batch_add ?

Regards
Ian
> +
>      return true;
>  }
> 
> --
> 2.31.1
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ferriter, Cian June 16, 2021, 10:48 a.m. UTC | #2
Hi Ian,

Thanks for the review. My responses are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes@intel.com>
> Sent: Tuesday 15 June 2021 18:04
> To: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; Ferriter, Cian <cian.ferriter@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: i.maximets@ovn.org
> Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
> 
> > From: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > This commit optimizes the output action, by enabling the compiler to
> > optimize the code better through reducing code complexity.
> >
> > The core concept of this optimization is that the array-length checks
> > have already been performed above the copying code, so can be removed.
> > Removing of the per-packet length checks allows the compiler to auto-vectorize
> > the stores using SIMD registers.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> 
> Thanks for the patch Cian/Harry.
> 
> Comments inline.
> > ---
> >
> > v8: Add NEWS entry.
> > ---
> >  NEWS              |  1 +
> >  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
> >  2 files changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/NEWS b/NEWS
> > index d04dac746..a7ec34af1 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -17,6 +17,7 @@ Post-v2.15.0
> >       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if
> > the
> >         CPU supports it. This enhances performance by using the native vpopcount
> >         instructions, instead of the emulated version of vpopcount.
> > +     * Optimize dp_netdev_output by enhancing compiler optimization potential.
> >     - ovs-ctl:
> >       * New option '--no-record-hostname' to disable hostname configuration
> >         in ovsdb on startup.
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 04a4f71cb..f46269ae3 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct
> > dp_netdev_pmd_thread *pmd,
> >          pmd->n_output_batches++;
> >      }
> >
> > -    struct dp_packet *packet;
> > -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> > -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> > -            pmd->ctx.last_rxq;
> > -        dp_packet_batch_add(&p->output_pkts, packet);
> > +    /* The above checks ensure that there is enough space in the output batch.
> > +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> > +     * This branch reduces the compiler's ability to optimize efficiently. The
> > +     * below code implements packet movement between batches without
> > checks,
> > +     * with the required semantics of output batch perhaps containing packets.
> > +     */
> 
> What is the performance gain recorded with this approach with compiler optimizations?
> 

The test was run using the scalar DPIF, 1 flow and EMC on, where the before/after cycle cost of the patch is 243 cycles before to 230 cycles per packet after. Hence this optimization saves 13 cycles per packet in my measurements.

> > +    int batch_size = dp_packet_batch_size(packets_);
> > +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> > +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> > +    struct dp_packet_batch *output_batch = &p->output_pkts;
> > +
> > +    for (int i = 0; i < batch_size; i++) {
> > +        struct dp_packet *packet = packets_->packets[i];
> > +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> > +        output_batch->packets[out_batch_idx] = packet;
> > +        out_batch_idx++;
> >      }
> > +    output_batch->count += batch_size;
> 
> So I guess the counter argument here is that the previous approach uses the dp_packet standard functions.
> 
> There has been work over time to ensure that dp_packets are handled in a similar way using the dp_packet api.
> 
> This I think reverses the position. That's why I'd like to get a sense of the performance gain as I think it's important to justify breaking
> from code conformance.
> 
> Has there been thought to modifying the dp_packet_batch_add ?
> 

We thought about adding an "unsafe" version of dp_packet_batch_add() which doesn't check whether the packet batch is full. Due to this optimization being dependent on other checks in the surrounding code, it isn't appropriate to have a generic function containing this optimization.

> Regards
> Ian
> > +
> >      return true;
> >  }
> >
> > --
> > 2.31.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Stokes, Ian June 16, 2021, 12:46 p.m. UTC | #3
> Hi Ian,
> 
> Thanks for the review. My responses are inline.
> 
> > -----Original Message-----
> > From: Stokes, Ian <ian.stokes@intel.com>
> > Sent: Tuesday 15 June 2021 18:04
> > To: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> Ferriter, Cian <cian.ferriter@intel.com>; Van Haaren, Harry
> > <harry.van.haaren@intel.com>
> > Cc: i.maximets@ovn.org
> > Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
> >
> > > From: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> > > This commit optimizes the output action, by enabling the compiler to
> > > optimize the code better through reducing code complexity.
> > >
> > > The core concept of this optimization is that the array-length checks
> > > have already been performed above the copying code, so can be removed.
> > > Removing of the per-packet length checks allows the compiler to auto-
> vectorize
> > > the stores using SIMD registers.
> > >
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> >
> > Thanks for the patch Cian/Harry.
> >
> > Comments inline.
> > > ---
> > >
> > > v8: Add NEWS entry.
> > > ---
> > >  NEWS              |  1 +
> > >  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
> > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/NEWS b/NEWS
> > > index d04dac746..a7ec34af1 100644
> > > --- a/NEWS
> > > +++ b/NEWS
> > > @@ -17,6 +17,7 @@ Post-v2.15.0
> > >       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction
> if
> > > the
> > >         CPU supports it. This enhances performance by using the native
> vpopcount
> > >         instructions, instead of the emulated version of vpopcount.
> > > +     * Optimize dp_netdev_output by enhancing compiler optimization
> potential.
> > >     - ovs-ctl:
> > >       * New option '--no-record-hostname' to disable hostname configuration
> > >         in ovsdb on startup.
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > index 04a4f71cb..f46269ae3 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct
> > > dp_netdev_pmd_thread *pmd,
> > >          pmd->n_output_batches++;
> > >      }
> > >
> > > -    struct dp_packet *packet;
> > > -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> > > -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> > > -            pmd->ctx.last_rxq;
> > > -        dp_packet_batch_add(&p->output_pkts, packet);
> > > +    /* The above checks ensure that there is enough space in the output
> batch.
> > > +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> > > +     * This branch reduces the compiler's ability to optimize efficiently. The
> > > +     * below code implements packet movement between batches without
> > > checks,
> > > +     * with the required semantics of output batch perhaps containing
> packets.
> > > +     */
> >
> > What is the performance gain recorded with this approach with compiler
> optimizations?
> >
> 
> The test was run using the scalar DPIF, 1 flow and EMC on, where the
> before/after cycle cost of the patch is 243 cycles before to 230 cycles per packet
> after. Hence this optimization saves 13 cycles per packet in my measurements.

So I think that info is important and should be added to the commit message and will help inform whether to make the change.

Ilya has worked in this area extensively so would be interested to hear his opinion?

I don't think this is a blocker to the DPIF work so could possibly be submitted separately as a patch for master?

Regards
Ian
> 
> > > +    int batch_size = dp_packet_batch_size(packets_);
> > > +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> > > +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> > > +    struct dp_packet_batch *output_batch = &p->output_pkts;
> > > +
> > > +    for (int i = 0; i < batch_size; i++) {
> > > +        struct dp_packet *packet = packets_->packets[i];
> > > +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> > > +        output_batch->packets[out_batch_idx] = packet;
> > > +        out_batch_idx++;
> > >      }
> > > +    output_batch->count += batch_size;
> >
> > So I guess the counter argument here is that the previous approach uses the
> dp_packet standard functions.
> >
> > There has been work over time to ensure that dp_packets are handled in a
> similar way using the dp_packet api.
> >
> > This I think reverses the position. That's why I'd like to get a sense of the
> performance gain as I think it's important to justify breaking
> > from code conformance.
> >
> > Has there been thought to modifying the dp_packet_batch_add ?
> >
> 
> We thought about adding an "unsafe" version of dp_packet_batch_add() which
> doesn't check whether the packet batch is full. Due to this optimization being
> dependent on other checks in the surrounding code, it isn't appropriate to have
> a generic function containing this optimization.
> 
> > Regards
> > Ian
> > > +
> > >      return true;
> > >  }
> > >
> > > --
> > > 2.31.1
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Ferriter, Cian June 16, 2021, 4:23 p.m. UTC | #4
Hi Ian,

Further responses are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes@intel.com>
> Sent: Wednesday 16 June 2021 13:47
> To: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org; Van Haaren, Harry <harry.van.haaren@intel.com>;
> i.maximets@ovn.org
> Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
> 
> > Hi Ian,
> >
> > Thanks for the review. My responses are inline.
> >
> > > -----Original Message-----
> > > From: Stokes, Ian <ian.stokes@intel.com>
> > > Sent: Tuesday 15 June 2021 18:04
> > > To: Ferriter, Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> > Ferriter, Cian <cian.ferriter@intel.com>; Van Haaren, Harry
> > > <harry.van.haaren@intel.com>
> > > Cc: i.maximets@ovn.org
> > > Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
> > >
> > > > From: Harry van Haaren <harry.van.haaren@intel.com>
> > > >
> > > > This commit optimizes the output action, by enabling the compiler to
> > > > optimize the code better through reducing code complexity.
> > > >
> > > > The core concept of this optimization is that the array-length checks
> > > > have already been performed above the copying code, so can be removed.
> > > > Removing of the per-packet length checks allows the compiler to auto-
> > vectorize
> > > > the stores using SIMD registers.
> > > >
> > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > >
> > >
> > > Thanks for the patch Cian/Harry.
> > >
> > > Comments inline.
> > > > ---
> > > >
> > > > v8: Add NEWS entry.
> > > > ---
> > > >  NEWS              |  1 +
> > > >  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
> > > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/NEWS b/NEWS
> > > > index d04dac746..a7ec34af1 100644
> > > > --- a/NEWS
> > > > +++ b/NEWS
> > > > @@ -17,6 +17,7 @@ Post-v2.15.0
> > > >       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction
> > if
> > > > the
> > > >         CPU supports it. This enhances performance by using the native
> > vpopcount
> > > >         instructions, instead of the emulated version of vpopcount.
> > > > +     * Optimize dp_netdev_output by enhancing compiler optimization
> > potential.
> > > >     - ovs-ctl:
> > > >       * New option '--no-record-hostname' to disable hostname configuration
> > > >         in ovsdb on startup.
> > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > > index 04a4f71cb..f46269ae3 100644
> > > > --- a/lib/dpif-netdev.c
> > > > +++ b/lib/dpif-netdev.c
> > > > @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >          pmd->n_output_batches++;
> > > >      }
> > > >
> > > > -    struct dp_packet *packet;
> > > > -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> > > > -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> > > > -            pmd->ctx.last_rxq;
> > > > -        dp_packet_batch_add(&p->output_pkts, packet);
> > > > +    /* The above checks ensure that there is enough space in the output
> > batch.
> > > > +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> > > > +     * This branch reduces the compiler's ability to optimize efficiently. The
> > > > +     * below code implements packet movement between batches without
> > > > checks,
> > > > +     * with the required semantics of output batch perhaps containing
> > packets.
> > > > +     */
> > >
> > > What is the performance gain recorded with this approach with compiler
> > optimizations?
> > >
> >
> > The test was run using the scalar DPIF, 1 flow and EMC on, where the
> > before/after cycle cost of the patch is 243 cycles before to 230 cycles per packet
> > after. Hence this optimization saves 13 cycles per packet in my measurements.
> 
> So I think that info is important and should be added to the commit message and will help inform whether to make the change.
> 

I added this information to the commit message.

> Ilya has worked in this area extensively so would be interested to hear his opinion?
> 
> I don't think this is a blocker to the DPIF work so could possibly be submitted separately as a patch for master?
> 

Correct, it doesn't block the DPIF work. I'll submit as a separate patch.

> Regards
> Ian
> >
> > > > +    int batch_size = dp_packet_batch_size(packets_);
> > > > +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> > > > +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> > > > +    struct dp_packet_batch *output_batch = &p->output_pkts;
> > > > +
> > > > +    for (int i = 0; i < batch_size; i++) {
> > > > +        struct dp_packet *packet = packets_->packets[i];
> > > > +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> > > > +        output_batch->packets[out_batch_idx] = packet;
> > > > +        out_batch_idx++;
> > > >      }
> > > > +    output_batch->count += batch_size;
> > >
> > > So I guess the counter argument here is that the previous approach uses the
> > dp_packet standard functions.
> > >
> > > There has been work over time to ensure that dp_packets are handled in a
> > similar way using the dp_packet api.
> > >
> > > This I think reverses the position. That's why I'd like to get a sense of the
> > performance gain as I think it's important to justify breaking
> > > from code conformance.
> > >
> > > Has there been thought to modifying the dp_packet_batch_add ?
> > >
> >
> > We thought about adding an "unsafe" version of dp_packet_batch_add() which
> > doesn't check whether the packet batch is full. Due to this optimization being
> > dependent on other checks in the surrounding code, it isn't appropriate to have
> > a generic function containing this optimization.
> >
> > > Regards
> > > Ian
> > > > +
> > > >      return true;
> > > >  }
> > > >
> > > > --
> > > > 2.31.1
> > > >
> > > > _______________________________________________
> > > > dev mailing list
> > > > dev@openvswitch.org
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>
diff mbox series

Patch

diff --git a/NEWS b/NEWS
index d04dac746..a7ec34af1 100644
--- a/NEWS
+++ b/NEWS
@@ -17,6 +17,7 @@  Post-v2.15.0
      * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if the
        CPU supports it. This enhances performance by using the native vpopcount
        instructions, instead of the emulated version of vpopcount.
+     * Optimize dp_netdev_output by enhancing compiler optimization potential.
    - ovs-ctl:
      * New option '--no-record-hostname' to disable hostname configuration
        in ovsdb on startup.
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 04a4f71cb..f46269ae3 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7290,12 +7290,25 @@  dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
         pmd->n_output_batches++;
     }
 
-    struct dp_packet *packet;
-    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
-        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
-            pmd->ctx.last_rxq;
-        dp_packet_batch_add(&p->output_pkts, packet);
+    /* The above checks ensure that there is enough space in the output batch.
+     * Using dp_packet_batch_add() has a branch to check if the batch is full.
+     * This branch reduces the compiler's ability to optimize efficiently. The
+     * below code implements packet movement between batches without checks,
+     * with the required semantics of output batch perhaps containing packets.
+     */
+    int batch_size = dp_packet_batch_size(packets_);
+    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
+    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
+    struct dp_packet_batch *output_batch = &p->output_pkts;
+
+    for (int i = 0; i < batch_size; i++) {
+        struct dp_packet *packet = packets_->packets[i];
+        p->output_pkts_rxqs[out_batch_idx] = rxq;
+        output_batch->packets[out_batch_idx] = packet;
+        out_batch_idx++;
     }
+    output_batch->count += batch_size;
+
     return true;
 }