diff mbox series

[ovs-dev,v11,5/5] dpif-netdev: Add specialized generic scalar functions

Message ID 20190715163636.51572-6-harry.van.haaren@intel.com
State Superseded
Headers show
Series dpcls func ptrs & optimizations | expand

Commit Message

Van Haaren, Harry July 15, 2019, 4:36 p.m. UTC
This commit adds a number of specialized functions, that handle
common miniflow fingerprints. This enables compiler optimization,
resulting in higher performance. Below a quick description of
how this optimization actually works;

"Specialized functions" are "instances" of the generic implementation,
but the compiler is given extra context when compiling. In the case of
iterating miniflow datastructures, the most interesting value to enable
compile time optimizations is the loop trip count per unit.

In order to create a specialized function, there is a generic implementation,
which uses a for() loop without the compiler knowing the loop trip count at
compile time. The loop trip count is passed in as an argument to the function:

uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
{
    for(uint32_t i = 0; i < loop_count; i++)
        // do work
}

In order to "specialize" the function, we call the generic implementation
with hard-coded numbers - these are compile time constants!

uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
{
    // use hard coded constant for compile-time constant-propogation
    return miniflow_impl_generic(mf, 5);
}

Given the compiler is aware of the loop trip count at compile time,
it can perform an optimization known as "constant propogation". Combined
with inlining of the miniflow_impl_generic() function, the compiler is
now enabled to *compile time* unroll the loop 5x, and produce "flat" code.

The last step to using the specialized functions is to utilize a
function-pointer to choose the specialized (or generic) implementation.
The selection of the function pointer is performed at subtable creation
time, when miniflow fingerprint of the subtable is known. This technique
is known as "multiple dispatch" in some literature, as it uses multiple
items of information (miniflow bit counts) to select the dispatch function.

By pointing the function pointer at the optimized implementation, OvS
benefits from the compile time optimizations at runtime.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>

---

v11:
- Use MACROs to declare and check optimized functions (Ilya)
- Use captial letter for commit message (Ilya)
- Rebase onto latest patchset changes
- Added NEWS entry for data-path subtable specialization (Ian/Harry)
- Checkpatch notes an "incorrect bracketing" in the MACROs, however I
  didn't find a solution that it does like.

v10:
- Rebase changes from previous patches.
- Remove "restrict" keyword as windows CI failed, see here for details:
  https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228

v8:
- Rework to use blocks_cache from the dpcls instance, to avoid variable
  lenght arrays in the data-path.
---
 NEWS                             |  4 +++
 lib/dpif-netdev-lookup-generic.c | 51 ++++++++++++++++++++++++++++++++
 lib/dpif-netdev-private.h        |  8 +++++
 lib/dpif-netdev.c                |  9 ++++--
 4 files changed, 70 insertions(+), 2 deletions(-)

Comments

0-day Robot July 15, 2019, 5:12 p.m. UTC | #1
Bleep bloop.  Greetings Harry van Haaren, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Inappropriate bracing around statement
#112 FILE: lib/dpif-netdev-lookup-generic.c:266:
    if (!f && u0_bits == U0 && u1_bits == U1) {                               \

Lines checked: 178, Warnings: 0, Errors: 1


Please check this out.  If you feel there has been an error, please email aconole@bytheb.org

Thanks,
0-day Robot
Stokes, Ian July 16, 2019, 9:07 p.m. UTC | #2
On 7/15/2019 5:36 PM, Harry van Haaren wrote:
> This commit adds a number of specialized functions, that handle
> common miniflow fingerprints. This enables compiler optimization,
> resulting in higher performance. Below a quick description of
> how this optimization actually works;
> 
> "Specialized functions" are "instances" of the generic implementation,
> but the compiler is given extra context when compiling. In the case of
> iterating miniflow datastructures, the most interesting value to enable
> compile time optimizations is the loop trip count per unit.
> 
> In order to create a specialized function, there is a generic implementation,
> which uses a for() loop without the compiler knowing the loop trip count at
> compile time. The loop trip count is passed in as an argument to the function:
> 
> uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
> {
>      for(uint32_t i = 0; i < loop_count; i++)
>          // do work
> }
> 
> In order to "specialize" the function, we call the generic implementation
> with hard-coded numbers - these are compile time constants!
> 
> uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
> {
>      // use hard coded constant for compile-time constant-propogation
>      return miniflow_impl_generic(mf, 5);
> }
> 
> Given the compiler is aware of the loop trip count at compile time,
> it can perform an optimization known as "constant propogation". Combined
> with inlining of the miniflow_impl_generic() function, the compiler is
> now enabled to *compile time* unroll the loop 5x, and produce "flat" code.
> 
> The last step to using the specialized functions is to utilize a
> function-pointer to choose the specialized (or generic) implementation.
> The selection of the function pointer is performed at subtable creation
> time, when miniflow fingerprint of the subtable is known. This technique
> is known as "multiple dispatch" in some literature, as it uses multiple
> items of information (miniflow bit counts) to select the dispatch function.
> 
> By pointing the function pointer at the optimized implementation, OvS
> benefits from the compile time optimizations at runtime.
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Tested-by: Malvika Gupta <malvika.gupta@arm.com>
> 

Thanks for the v11 Harry, some minor comments inline below.

> ---
> 
> v11:
> - Use MACROs to declare and check optimized functions (Ilya)
> - Use captial letter for commit message (Ilya)
> - Rebase onto latest patchset changes
> - Added NEWS entry for data-path subtable specialization (Ian/Harry)
> - Checkpatch notes an "incorrect bracketing" in the MACROs, however I
>    didn't find a solution that it does like.
> 
> v10:
> - Rebase changes from previous patches.
> - Remove "restrict" keyword as windows CI failed, see here for details:
>    https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228
> 
> v8:
> - Rework to use blocks_cache from the dpcls instance, to avoid variable
>    lenght arrays in the data-path.
> ---
>   NEWS                             |  4 +++
>   lib/dpif-netdev-lookup-generic.c | 51 ++++++++++++++++++++++++++++++++
>   lib/dpif-netdev-private.h        |  8 +++++
>   lib/dpif-netdev.c                |  9 ++++--
>   4 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 81130e667..4cfffb1bc 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -34,6 +34,10 @@ Post-v2.11.0
>        * 'ovs-appctl exit' now implies cleanup of non-internal ports in userspace
>          datapath regardless of '--cleanup' option. Use '--cleanup' to remove
>          internal ports too.
> +     * Datapath classifer code refactored to enable function pointers to select
> +       the lookup implementation at runtime. This enables specialization of
> +       specific subtables based on the miniflow attributes, enhancing the
> +       performance of the subtable search.
>      - OVSDB:
>        * OVSDB clients can now resynchronize with clustered servers much more
>          quickly after a brief disconnection, saving bandwidth and CPU time.
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index abd166fc3..259c36645 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -233,7 +233,58 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                 const struct netdev_flow_key *keys[],
>                                 struct dpcls_rule **rules)
>   {
> +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> +     * compiler to iterate normal for() loops. Due to this limitation in the
> +     * compilers available optimizations, this function has lower performance
> +     * than the below specialized functions.
> +     */
>       return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
>                                  subtable->mf_bits_set_unit0,
>                                  subtable->mf_bits_set_unit1);
>   }
> +
> +/* Expand out specialized functions with U0 and U1 bit attributes */

Minor, missing period at end of comment (can fix this on commit if there 
are no other revisions).

> +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
> +    static uint32_t                                                           \
> +    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
> +                                         struct dpcls_subtable *subtable,     \
> +                                         uint64_t *blocks_scratch,            \
> +                                         uint32_t keys_map,                   \
> +                                         const struct netdev_flow_key *keys[],\
> +                                         struct dpcls_rule **rules)           \
> +    {                                                                         \
> +        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
> +                                   keys, rules, U0, U1);                      \
> +    }                                                                         \
> +
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
> +
> +/* Check if a speicalized function is valid for the required subtable. */
Minor, speicalized -> specialized, again can be fixed on commit otherwise.

> +#define CHECK_LOOKUP_FUNCTION(U0,U1)                                          \
> +    if (!f && u0_bits == U0 && u1_bits == U1) {                               \
> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;                       \
> +    }
> +
> +/* Probe function to lookup an available specialized function.
> + * If capable to run the requested miniflow fingerprint, this function returns
> + * the most optimal implementation for that miniflow fingerprint.
> + * @retval FunctionAddress A valid function to handle the miniflow bit pattern
> + * @retval 0 The requested miniflow is not supported here, NULL is returned
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> +{
> +    dpcls_subtable_lookup_func f = NULL;

In the comments you return FunctionAddress but you return f below, 
should this not be FunctionAddress or maybe another variable name if 
'FunctionAddress' is a bit unwieldy?

> +
> +    CHECK_LOOKUP_FUNCTION(5, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 0);
> +
> +    if (f) {
> +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> +                  u0_bits, u1_bits);
> +    }
> +    return f;
> +}
> diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
> index e1ceaa641..f541343f4 100644
> --- a/lib/dpif-netdev-private.h
> +++ b/lib/dpif-netdev-private.h
> @@ -69,6 +69,14 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                 const struct netdev_flow_key *keys[],
>                                 struct dpcls_rule **rules);
>   
> +/* Probe function to select a specialized version of the generic lookup
> + * implementation. This provides performance benefit due to compile-time
> + * optimizations such as loop-unrolling. These are enabled by the compile-time
> + * constants in the specific function implementations.
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
> +
>   /* A set of rules that all have the same fields wildcarded. */
>   struct dpcls_subtable {
>       /* The fields are only used by writers. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8acc1445a..996dec35e 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7735,8 +7735,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>           cls->blocks_scratch_size = blocks_required_per_pkt;
>       }
>   
> -    /* Assign the generic lookup - this works with any miniflow fingerprint. */
> -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    /* Probe for a specialized generic lookup function. */
> +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> +
> +    /* If not set, assign generic lookup. Generic works for any miniflow. */
> +    if (!subtable->lookup_func) {
> +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    }
>   
>       cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
>       /* Add the new subtable at the end of the pvector (with no hits yet) */
Missing period above.
> 

Regards
Ian
Ilya Maximets July 17, 2019, 9:52 a.m. UTC | #3
On 15.07.2019 19:36, Harry van Haaren wrote:
> This commit adds a number of specialized functions, that handle
> common miniflow fingerprints. This enables compiler optimization,
> resulting in higher performance. Below a quick description of
> how this optimization actually works;
> 
> "Specialized functions" are "instances" of the generic implementation,
> but the compiler is given extra context when compiling. In the case of
> iterating miniflow datastructures, the most interesting value to enable
> compile time optimizations is the loop trip count per unit.
> 
> In order to create a specialized function, there is a generic implementation,
> which uses a for() loop without the compiler knowing the loop trip count at
> compile time. The loop trip count is passed in as an argument to the function:
> 
> uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
> {
>     for(uint32_t i = 0; i < loop_count; i++)
>         // do work
> }
> 
> In order to "specialize" the function, we call the generic implementation
> with hard-coded numbers - these are compile time constants!
> 
> uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
> {
>     // use hard coded constant for compile-time constant-propogation
>     return miniflow_impl_generic(mf, 5);
> }
> 
> Given the compiler is aware of the loop trip count at compile time,
> it can perform an optimization known as "constant propogation". Combined
> with inlining of the miniflow_impl_generic() function, the compiler is
> now enabled to *compile time* unroll the loop 5x, and produce "flat" code.
> 
> The last step to using the specialized functions is to utilize a
> function-pointer to choose the specialized (or generic) implementation.
> The selection of the function pointer is performed at subtable creation
> time, when miniflow fingerprint of the subtable is known. This technique
> is known as "multiple dispatch" in some literature, as it uses multiple
> items of information (miniflow bit counts) to select the dispatch function.
> 
> By pointing the function pointer at the optimized implementation, OvS
> benefits from the compile time optimizations at runtime.
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Tested-by: Malvika Gupta <malvika.gupta@arm.com>
> 
> ---
> 
> v11:
> - Use MACROs to declare and check optimized functions (Ilya)
> - Use captial letter for commit message (Ilya)
> - Rebase onto latest patchset changes
> - Added NEWS entry for data-path subtable specialization (Ian/Harry)
> - Checkpatch notes an "incorrect bracketing" in the MACROs, however I
>   didn't find a solution that it does like.
> 
> v10:
> - Rebase changes from previous patches.
> - Remove "restrict" keyword as windows CI failed, see here for details:
>   https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228
> 
> v8:
> - Rework to use blocks_cache from the dpcls instance, to avoid variable
>   lenght arrays in the data-path.
> ---
>  NEWS                             |  4 +++
>  lib/dpif-netdev-lookup-generic.c | 51 ++++++++++++++++++++++++++++++++
>  lib/dpif-netdev-private.h        |  8 +++++
>  lib/dpif-netdev.c                |  9 ++++--
>  4 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 81130e667..4cfffb1bc 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -34,6 +34,10 @@ Post-v2.11.0
>       * 'ovs-appctl exit' now implies cleanup of non-internal ports in userspace
>         datapath regardless of '--cleanup' option. Use '--cleanup' to remove
>         internal ports too.
> +     * Datapath classifer code refactored to enable function pointers to select
> +       the lookup implementation at runtime. This enables specialization of
> +       specific subtables based on the miniflow attributes, enhancing the
> +       performance of the subtable search.
>     - OVSDB:
>       * OVSDB clients can now resynchronize with clustered servers much more
>         quickly after a brief disconnection, saving bandwidth and CPU time.
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index abd166fc3..259c36645 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -233,7 +233,58 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                const struct netdev_flow_key *keys[],
>                                struct dpcls_rule **rules)
>  {
> +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> +     * compiler to iterate normal for() loops. Due to this limitation in the
> +     * compilers available optimizations, this function has lower performance
> +     * than the below specialized functions.
> +     */
>      return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
>                                 subtable->mf_bits_set_unit0,
>                                 subtable->mf_bits_set_unit1);
>  }
> +
> +/* Expand out specialized functions with U0 and U1 bit attributes */
> +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
> +    static uint32_t                                                           \
> +    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
> +                                         struct dpcls_subtable *subtable,     \
> +                                         uint64_t *blocks_scratch,            \
> +                                         uint32_t keys_map,                   \
> +                                         const struct netdev_flow_key *keys[],\
> +                                         struct dpcls_rule **rules)           \
> +    {                                                                         \
> +        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
> +                                   keys, rules, U0, U1);                      \
> +    }                                                                         \
> +
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)

Some spaces needed after the comma.

> +
> +/* Check if a speicalized function is valid for the required subtable. */
> +#define CHECK_LOOKUP_FUNCTION(U0,U1)                                          \
> +    if (!f && u0_bits == U0 && u1_bits == U1) {                               \
> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;                       \
> +    }

The reason why I initially placed this macro inside the function is that
'u0_bits' and 'u1_bits' only makes sense inside the function.
IMHO, it's better to move this inside, but it's up to you.

> +
> +/* Probe function to lookup an available specialized function.
> + * If capable to run the requested miniflow fingerprint, this function returns
> + * the most optimal implementation for that miniflow fingerprint.
> + * @retval FunctionAddress A valid function to handle the miniflow bit pattern
> + * @retval 0 The requested miniflow is not supported here, NULL is returned
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> +{
> +    dpcls_subtable_lookup_func f = NULL;
> +
> +    CHECK_LOOKUP_FUNCTION(5, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 0);
> +
> +    if (f) {
> +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> +                  u0_bits, u1_bits);

Can we move this message out to 'dpcls_create_subtable' and log the
non-optimized case too?
Also, this needs to be DBG as all other subtable related logs are DBG logs.

We could just extend the 'Creating subtable' log message to include this
information:

    VLOG_DBG("Creating %"PRIuSIZE". subtable %p for in_port %d with a %s "
             "lookup function (units: %"PRIu32", %"PRIu32").",
             cmap_count(&cls->subtables_map), subtable, cls->in_port,
             (subtable->lookup_func == dpcls_subtable_lookup_generic)
             ? "generic" : "specialized", unit0, unit1);

> +    }
> +    return f;
> +}
> diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
> index e1ceaa641..f541343f4 100644
> --- a/lib/dpif-netdev-private.h
> +++ b/lib/dpif-netdev-private.h
> @@ -69,6 +69,14 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                const struct netdev_flow_key *keys[],
>                                struct dpcls_rule **rules);
>  
> +/* Probe function to select a specialized version of the generic lookup
> + * implementation. This provides performance benefit due to compile-time
> + * optimizations such as loop-unrolling. These are enabled by the compile-time
> + * constants in the specific function implementations.
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
> +
>  /* A set of rules that all have the same fields wildcarded. */
>  struct dpcls_subtable {
>      /* The fields are only used by writers. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8acc1445a..996dec35e 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7735,8 +7735,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>          cls->blocks_scratch_size = blocks_required_per_pkt;
>      }
>  
> -    /* Assign the generic lookup - this works with any miniflow fingerprint. */
> -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    /* Probe for a specialized generic lookup function. */
> +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> +
> +    /* If not set, assign generic lookup. Generic works for any miniflow. */
> +    if (!subtable->lookup_func) {
> +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    }
>  
>      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
>      /* Add the new subtable at the end of the pvector (with no hits yet) */
>
Van Haaren, Harry July 17, 2019, 10:18 a.m. UTC | #4
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Wednesday, July 17, 2019 10:52 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
> <ian.stokes@intel.com>
> Subject: Re: [PATCH v11 5/5] dpif-netdev: Add specialized generic scalar
> functions

<snip>

> > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
> > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
> > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
> 
> Some spaces needed after the comma.

Will fix in v12.

> > +/* Check if a speicalized function is valid for the required subtable. */
> > +#define CHECK_LOOKUP_FUNCTION(U0,U1)
> \
> > +    if (!f && u0_bits == U0 && u1_bits == U1) {
> \
> > +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;
> \
> > +    }
> 
> The reason why I initially placed this macro inside the function is that
> 'u0_bits' and 'u1_bits' only makes sense inside the function.
> IMHO, it's better to move this inside, but it's up to you.

I'd prefer leave outside the function - this seems to be the standard way of
doing MACRO defines in OVS? Eg in dpif-netdev other macros are also outside the
functions, so I'll follow that method.


<snip>
> > +dpcls_subtable_lookup_func
> > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> > +{
> > +    dpcls_subtable_lookup_func f = NULL;
> > +
> > +    CHECK_LOOKUP_FUNCTION(5, 1);
> > +    CHECK_LOOKUP_FUNCTION(4, 1);
> > +    CHECK_LOOKUP_FUNCTION(4, 0);
> > +
> > +    if (f) {
> > +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> > +                  u0_bits, u1_bits);
> 
> Can we move this message out to 'dpcls_create_subtable' and log the
> non-optimized case too?
> Also, this needs to be DBG as all other subtable related logs are DBG logs.
> 
> We could just extend the 'Creating subtable' log message to include this
> information:
> 
>     VLOG_DBG("Creating %"PRIuSIZE". subtable %p for in_port %d with a %s "
>              "lookup function (units: %"PRIu32", %"PRIu32").",
>              cmap_count(&cls->subtables_map), subtable, cls->in_port,
>              (subtable->lookup_func == dpcls_subtable_lookup_generic)
>              ? "generic" : "specialized", unit0, unit1);


This might seem a nice idea now, however in future we can plug in other optimized
flavors of lookups, and then the dpcls_create_subtable() function won't know
the details of the implementation. Hence having the log in the implementation
is the better solution.

Will change to DEBUG instead of INFO, good idea thanks.

<snip>
Ilya Maximets July 17, 2019, 10:24 a.m. UTC | #5
On 17.07.2019 13:18, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>> Sent: Wednesday, July 17, 2019 10:52 AM
>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
>> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
>> <ian.stokes@intel.com>
>> Subject: Re: [PATCH v11 5/5] dpif-netdev: Add specialized generic scalar
>> functions
> 
> <snip>
> 
>>> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
>>> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
>>> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
>>
>> Some spaces needed after the comma.
> 
> Will fix in v12.
> 
>>> +/* Check if a speicalized function is valid for the required subtable. */
>>> +#define CHECK_LOOKUP_FUNCTION(U0,U1)
>> \
>>> +    if (!f && u0_bits == U0 && u1_bits == U1) {
>> \
>>> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;
>> \
>>> +    }
>>
>> The reason why I initially placed this macro inside the function is that
>> 'u0_bits' and 'u1_bits' only makes sense inside the function.
>> IMHO, it's better to move this inside, but it's up to you.
> 
> I'd prefer leave outside the function - this seems to be the standard way of
> doing MACRO defines in OVS? Eg in dpif-netdev other macros are also outside the
> functions, so I'll follow that method.
> 
> 
> <snip>
>>> +dpcls_subtable_lookup_func
>>> +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
>>> +{
>>> +    dpcls_subtable_lookup_func f = NULL;
>>> +
>>> +    CHECK_LOOKUP_FUNCTION(5, 1);
>>> +    CHECK_LOOKUP_FUNCTION(4, 1);
>>> +    CHECK_LOOKUP_FUNCTION(4, 0);
>>> +
>>> +    if (f) {
>>> +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
>>> +                  u0_bits, u1_bits);
>>
>> Can we move this message out to 'dpcls_create_subtable' and log the
>> non-optimized case too?
>> Also, this needs to be DBG as all other subtable related logs are DBG logs.
>>
>> We could just extend the 'Creating subtable' log message to include this
>> information:
>>
>>     VLOG_DBG("Creating %"PRIuSIZE". subtable %p for in_port %d with a %s "
>>              "lookup function (units: %"PRIu32", %"PRIu32").",
>>              cmap_count(&cls->subtables_map), subtable, cls->in_port,
>>              (subtable->lookup_func == dpcls_subtable_lookup_generic)
>>              ? "generic" : "specialized", unit0, unit1);
> 
> 
> This might seem a nice idea now, however in future we can plug in other optimized
> flavors of lookups, and then the dpcls_create_subtable() function won't know
> the details of the implementation. Hence having the log in the implementation
> is the better solution.

If there will be other implementations we'll have to change the prototype of
'dpcls_subtable_generic_probe' and refactor the code around anyway. So, changing
the log message would be a minor thing. Your version of 'dpcls_create_subtable'
already highly depends on the implementation of 'dpcls_subtable_generic_probe'.

> 
> Will change to DEBUG instead of INFO, good idea thanks.
> 
> <snip>
>
Van Haaren, Harry July 17, 2019, 10:29 a.m. UTC | #6
> -----Original Message-----
> From: Stokes, Ian
> Sent: Tuesday, July 16, 2019 10:07 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: echaudro@redhat.com; i.maximets@samsung.com; malvika.gupta@arm.com
> Subject: Re: [PATCH v11 5/5] dpif-netdev: Add specialized generic scalar
> functions
> 
> On 7/15/2019 5:36 PM, Harry van Haaren wrote:
> > This commit adds a number of specialized functions, that handle

<snip>

> Thanks for the v11 Harry, some minor comments inline below.

Thanks for review!

> > v11:
> > - Use MACROs to declare and check optimized functions (Ilya)
> > - Use captial letter for commit message (Ilya)
> > - Rebase onto latest patchset changes
> > - Added NEWS entry for data-path subtable specialization (Ian/Harry)
> > - Checkpatch notes an "incorrect bracketing" in the MACROs, however I
> >    didn't find a solution that it does like.
> >
> > v10:
> > - Rebase changes from previous patches.
> > - Remove "restrict" keyword as windows CI failed, see here for details:
> >    https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228
> >
> > v8:
> > - Rework to use blocks_cache from the dpcls instance, to avoid variable
> >    lenght arrays in the data-path.
> > ---
> >   NEWS                             |  4 +++
> >   lib/dpif-netdev-lookup-generic.c | 51 ++++++++++++++++++++++++++++++++
> >   lib/dpif-netdev-private.h        |  8 +++++
> >   lib/dpif-netdev.c                |  9 ++++--
> >   4 files changed, 70 insertions(+), 2 deletions(-)
> >
> > diff --git a/NEWS b/NEWS
> > index 81130e667..4cfffb1bc 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -34,6 +34,10 @@ Post-v2.11.0
> >        * 'ovs-appctl exit' now implies cleanup of non-internal ports in
> userspace
> >          datapath regardless of '--cleanup' option. Use '--cleanup' to
> remove
> >          internal ports too.
> > +     * Datapath classifer code refactored to enable function pointers to
> select
> > +       the lookup implementation at runtime. This enables specialization of
> > +       specific subtables based on the miniflow attributes, enhancing the
> > +       performance of the subtable search.
> >      - OVSDB:
> >        * OVSDB clients can now resynchronize with clustered servers much
> more
> >          quickly after a brief disconnection, saving bandwidth and CPU time.
> > diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-
> generic.c
> > index abd166fc3..259c36645 100644
> > --- a/lib/dpif-netdev-lookup-generic.c
> > +++ b/lib/dpif-netdev-lookup-generic.c
> > @@ -233,7 +233,58 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable
> *subtable,
> >                                 const struct netdev_flow_key *keys[],
> >                                 struct dpcls_rule **rules)
> >   {
> > +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> > +     * compiler to iterate normal for() loops. Due to this limitation in
> the
> > +     * compilers available optimizations, this function has lower
> performance
> > +     * than the below specialized functions.
> > +     */
> >       return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
> rules,
> >                                  subtable->mf_bits_set_unit0,
> >                                  subtable->mf_bits_set_unit1);
> >   }
> > +
> > +/* Expand out specialized functions with U0 and U1 bit attributes */
> 
> Minor, missing period at end of comment (can fix this on commit if there
> are no other revisions).

Fixed.

<snip>
> > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
> > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
> > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
> > +
> > +/* Check if a speicalized function is valid for the required subtable. */
> Minor, speicalized -> specialized, again can be fixed on commit otherwise.

Fixed.

> 
> > +#define CHECK_LOOKUP_FUNCTION(U0,U1)
> \
> > +    if (!f && u0_bits == U0 && u1_bits == U1) {
> \
> > +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;
> \
> > +    }
> > +
> > +/* Probe function to lookup an available specialized function.
> > + * If capable to run the requested miniflow fingerprint, this function
> returns
> > + * the most optimal implementation for that miniflow fingerprint.
> > + * @retval FunctionAddress A valid function to handle the miniflow bit
> pattern
> > + * @retval 0 The requested miniflow is not supported here, NULL is returned
> > + */
> > +dpcls_subtable_lookup_func
> > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> > +{
> > +    dpcls_subtable_lookup_func f = NULL;
> 
> In the comments you return FunctionAddress but you return f below,
> should this not be FunctionAddress or maybe another variable name if
> 'FunctionAddress' is a bit unwieldy?

Updated return value descriptions as Non-NULL and NULL, and updated comments to
read well. This better describes the code than "FunctionAddress".


<snip>
> > -    /* Assign the generic lookup - this works with any miniflow
> fingerprint. */
> > -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> > +    /* Probe for a specialized generic lookup function. */
> > +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> > +
> > +    /* If not set, assign generic lookup. Generic works for any miniflow.
> */
> > +    if (!subtable->lookup_func) {
> > +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> > +    }
> >
> >       cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> >       /* Add the new subtable at the end of the pvector (with no hits yet)
> */
> Missing period above.


I'd prefer not fix this one - that code is patch context and isn't currently being modified.
Van Haaren, Harry July 17, 2019, 10:35 a.m. UTC | #7
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Wednesday, July 17, 2019 11:24 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
> <ian.stokes@intel.com>
> Subject: Re: [PATCH v11 5/5] dpif-netdev: Add specialized generic scalar
> functions
> 
> On 17.07.2019 13:18, Van Haaren, Harry wrote:

<snip>

> >>> +    if (f) {
> >>> +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> >>> +                  u0_bits, u1_bits);
> >>
> >> Can we move this message out to 'dpcls_create_subtable' and log the
> >> non-optimized case too?
> >> Also, this needs to be DBG as all other subtable related logs are DBG logs.
> >>
> >> We could just extend the 'Creating subtable' log message to include this
> >> information:
> >>
> >>     VLOG_DBG("Creating %"PRIuSIZE". subtable %p for in_port %d with a %s "
> >>              "lookup function (units: %"PRIu32", %"PRIu32").",
> >>              cmap_count(&cls->subtables_map), subtable, cls->in_port,
> >>              (subtable->lookup_func == dpcls_subtable_lookup_generic)
> >>              ? "generic" : "specialized", unit0, unit1);
> >
> >
> > This might seem a nice idea now, however in future we can plug in other
> optimized
> > flavors of lookups, and then the dpcls_create_subtable() function won't know
> > the details of the implementation. Hence having the log in the
> implementation
> > is the better solution.
> 
> If there will be other implementations we'll have to change the prototype of
> 'dpcls_subtable_generic_probe' and refactor the code around anyway. So,
> changing
> the log message would be a minor thing. Your version of
> 'dpcls_create_subtable'
> already highly depends on the implementation of
> 'dpcls_subtable_generic_probe'.

If we plug in new implementations, they will also have their own probe() function,
so there is no need to modify the existing probe prototype.

The log message is implementation specific (the name and details of the implementation),
so a print at that level (not an abstraction level above it) makes more sense here.

I'll leave this as is for v12, but change to DEBUG as suggested.
Stokes, Ian July 17, 2019, 11:28 a.m. UTC | #8
On 7/17/2019 11:29 AM, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Stokes, Ian
>> Sent: Tuesday, July 16, 2019 10:07 PM
>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
>> Cc: echaudro@redhat.com; i.maximets@samsung.com; malvika.gupta@arm.com
>> Subject: Re: [PATCH v11 5/5] dpif-netdev: Add specialized generic scalar
>> functions
>>
>> On 7/15/2019 5:36 PM, Harry van Haaren wrote:
>>> This commit adds a number of specialized functions, that handle
> 
> <snip>
> 
>> Thanks for the v11 Harry, some minor comments inline below.
> 
> Thanks for review!
> 
>>> v11:
>>> - Use MACROs to declare and check optimized functions (Ilya)
>>> - Use captial letter for commit message (Ilya)
>>> - Rebase onto latest patchset changes
>>> - Added NEWS entry for data-path subtable specialization (Ian/Harry)
>>> - Checkpatch notes an "incorrect bracketing" in the MACROs, however I
>>>     didn't find a solution that it does like.
>>>
>>> v10:
>>> - Rebase changes from previous patches.
>>> - Remove "restrict" keyword as windows CI failed, see here for details:
>>>     https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228
>>>
>>> v8:
>>> - Rework to use blocks_cache from the dpcls instance, to avoid variable
>>>     lenght arrays in the data-path.
>>> ---
>>>    NEWS                             |  4 +++
>>>    lib/dpif-netdev-lookup-generic.c | 51 ++++++++++++++++++++++++++++++++
>>>    lib/dpif-netdev-private.h        |  8 +++++
>>>    lib/dpif-netdev.c                |  9 ++++--
>>>    4 files changed, 70 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/NEWS b/NEWS
>>> index 81130e667..4cfffb1bc 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -34,6 +34,10 @@ Post-v2.11.0
>>>         * 'ovs-appctl exit' now implies cleanup of non-internal ports in
>> userspace
>>>           datapath regardless of '--cleanup' option. Use '--cleanup' to
>> remove
>>>           internal ports too.
>>> +     * Datapath classifer code refactored to enable function pointers to
>> select
>>> +       the lookup implementation at runtime. This enables specialization of
>>> +       specific subtables based on the miniflow attributes, enhancing the
>>> +       performance of the subtable search.
>>>       - OVSDB:
>>>         * OVSDB clients can now resynchronize with clustered servers much
>> more
>>>           quickly after a brief disconnection, saving bandwidth and CPU time.
>>> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-
>> generic.c
>>> index abd166fc3..259c36645 100644
>>> --- a/lib/dpif-netdev-lookup-generic.c
>>> +++ b/lib/dpif-netdev-lookup-generic.c
>>> @@ -233,7 +233,58 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable
>> *subtable,
>>>                                  const struct netdev_flow_key *keys[],
>>>                                  struct dpcls_rule **rules)
>>>    {
>>> +    /* Here the runtime subtable->mf_bits counts are used, which forces the
>>> +     * compiler to iterate normal for() loops. Due to this limitation in
>> the
>>> +     * compilers available optimizations, this function has lower
>> performance
>>> +     * than the below specialized functions.
>>> +     */
>>>        return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
>> rules,
>>>                                   subtable->mf_bits_set_unit0,
>>>                                   subtable->mf_bits_set_unit1);
>>>    }
>>> +
>>> +/* Expand out specialized functions with U0 and U1 bit attributes */
>>
>> Minor, missing period at end of comment (can fix this on commit if there
>> are no other revisions).
> 
> Fixed.
> 
> <snip>
>>> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
>>> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
>>> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
>>> +
>>> +/* Check if a speicalized function is valid for the required subtable. */
>> Minor, speicalized -> specialized, again can be fixed on commit otherwise.
> 
> Fixed.
> 
>>
>>> +#define CHECK_LOOKUP_FUNCTION(U0,U1)
>> \
>>> +    if (!f && u0_bits == U0 && u1_bits == U1) {
>> \
>>> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;
>> \
>>> +    }
>>> +
>>> +/* Probe function to lookup an available specialized function.
>>> + * If capable to run the requested miniflow fingerprint, this function
>> returns
>>> + * the most optimal implementation for that miniflow fingerprint.
>>> + * @retval FunctionAddress A valid function to handle the miniflow bit
>> pattern
>>> + * @retval 0 The requested miniflow is not supported here, NULL is returned
>>> + */
>>> +dpcls_subtable_lookup_func
>>> +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
>>> +{
>>> +    dpcls_subtable_lookup_func f = NULL;
>>
>> In the comments you return FunctionAddress but you return f below,
>> should this not be FunctionAddress or maybe another variable name if
>> 'FunctionAddress' is a bit unwieldy?
> 
> Updated return value descriptions as Non-NULL and NULL, and updated comments to
> read well. This better describes the code than "FunctionAddress".
> 
> 
> <snip>
>>> -    /* Assign the generic lookup - this works with any miniflow
>> fingerprint. */
>>> -    subtable->lookup_func = dpcls_subtable_lookup_generic;
>>> +    /* Probe for a specialized generic lookup function. */
>>> +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
>>> +
>>> +    /* If not set, assign generic lookup. Generic works for any miniflow.
>> */
>>> +    if (!subtable->lookup_func) {
>>> +        subtable->lookup_func = dpcls_subtable_lookup_generic;
>>> +    }
>>>
>>>        cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
>>>        /* Add the new subtable at the end of the pvector (with no hits yet)
>> */
>> Missing period above.
> 
> 
> I'd prefer not fix this one - that code is patch context and isn't currently being modified.

Ah, good catch, sounds good to me.

Ian
>
diff mbox series

Patch

diff --git a/NEWS b/NEWS
index 81130e667..4cfffb1bc 100644
--- a/NEWS
+++ b/NEWS
@@ -34,6 +34,10 @@  Post-v2.11.0
      * 'ovs-appctl exit' now implies cleanup of non-internal ports in userspace
        datapath regardless of '--cleanup' option. Use '--cleanup' to remove
        internal ports too.
+     * Datapath classifer code refactored to enable function pointers to select
+       the lookup implementation at runtime. This enables specialization of
+       specific subtables based on the miniflow attributes, enhancing the
+       performance of the subtable search.
    - OVSDB:
      * OVSDB clients can now resynchronize with clustered servers much more
        quickly after a brief disconnection, saving bandwidth and CPU time.
diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index abd166fc3..259c36645 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -233,7 +233,58 @@  dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
                               const struct netdev_flow_key *keys[],
                               struct dpcls_rule **rules)
 {
+    /* Here the runtime subtable->mf_bits counts are used, which forces the
+     * compiler to iterate normal for() loops. Due to this limitation in the
+     * compilers available optimizations, this function has lower performance
+     * than the below specialized functions.
+     */
     return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
                                subtable->mf_bits_set_unit0,
                                subtable->mf_bits_set_unit1);
 }
+
+/* Expand out specialized functions with U0 and U1 bit attributes */
+#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
+    static uint32_t                                                           \
+    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
+                                         struct dpcls_subtable *subtable,     \
+                                         uint64_t *blocks_scratch,            \
+                                         uint32_t keys_map,                   \
+                                         const struct netdev_flow_key *keys[],\
+                                         struct dpcls_rule **rules)           \
+    {                                                                         \
+        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
+                                   keys, rules, U0, U1);                      \
+    }                                                                         \
+
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
+
+/* Check if a speicalized function is valid for the required subtable. */
+#define CHECK_LOOKUP_FUNCTION(U0,U1)                                          \
+    if (!f && u0_bits == U0 && u1_bits == U1) {                               \
+        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;                       \
+    }
+
+/* Probe function to lookup an available specialized function.
+ * If capable to run the requested miniflow fingerprint, this function returns
+ * the most optimal implementation for that miniflow fingerprint.
+ * @retval FunctionAddress A valid function to handle the miniflow bit pattern
+ * @retval 0 The requested miniflow is not supported here, NULL is returned
+ */
+dpcls_subtable_lookup_func
+dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
+{
+    dpcls_subtable_lookup_func f = NULL;
+
+    CHECK_LOOKUP_FUNCTION(5, 1);
+    CHECK_LOOKUP_FUNCTION(4, 1);
+    CHECK_LOOKUP_FUNCTION(4, 0);
+
+    if (f) {
+        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
+                  u0_bits, u1_bits);
+    }
+    return f;
+}
diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
index e1ceaa641..f541343f4 100644
--- a/lib/dpif-netdev-private.h
+++ b/lib/dpif-netdev-private.h
@@ -69,6 +69,14 @@  dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
                               const struct netdev_flow_key *keys[],
                               struct dpcls_rule **rules);
 
+/* Probe function to select a specialized version of the generic lookup
+ * implementation. This provides performance benefit due to compile-time
+ * optimizations such as loop-unrolling. These are enabled by the compile-time
+ * constants in the specific function implementations.
+ */
+dpcls_subtable_lookup_func
+dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
+
 /* A set of rules that all have the same fields wildcarded. */
 struct dpcls_subtable {
     /* The fields are only used by writers. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8acc1445a..996dec35e 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7735,8 +7735,13 @@  dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
         cls->blocks_scratch_size = blocks_required_per_pkt;
     }
 
-    /* Assign the generic lookup - this works with any miniflow fingerprint. */
-    subtable->lookup_func = dpcls_subtable_lookup_generic;
+    /* Probe for a specialized generic lookup function. */
+    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
+
+    /* If not set, assign generic lookup. Generic works for any miniflow. */
+    if (!subtable->lookup_func) {
+        subtable->lookup_func = dpcls_subtable_lookup_generic;
+    }
 
     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
     /* Add the new subtable at the end of the pvector (with no hits yet) */