Message ID | 20190718130306.66970-6-harry.van.haaren@intel.com |
---|---|
State | Superseded |
Headers | show |
Series | dpcls func ptrs & optimizations | expand |
Bleep bloop. Greetings Harry van Haaren, I am a robot and I have tried out your patch. Thanks for your contribution. I encountered some error that I wasn't expecting. See the details below. git-am: Failed to merge in the changes. Patch failed at 0001 dpif-netdev: Add specialized generic scalar functions The copy of the patch that failed is found in: /var/lib/jenkins/jobs/upstream_build_from_pw/workspace/.git/rebase-apply/patch When you have resolved this problem, run "git am --resolved". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". Please check this out. If you feel there has been an error, please email aconole@bytheb.org Thanks, 0-day Robot
On 18.07.2019 16:03, Harry van Haaren wrote: > This commit adds a number of specialized functions, that handle > common miniflow fingerprints. This enables compiler optimization, > resulting in higher performance. Below a quick description of > how this optimization actually works; > > "Specialized functions" are "instances" of the generic implementation, > but the compiler is given extra context when compiling. In the case of > iterating miniflow datastructures, the most interesting value to enable > compile time optimizations is the loop trip count per unit. > > In order to create a specialized function, there is a generic implementation, > which uses a for() loop without the compiler knowing the loop trip count at > compile time. The loop trip count is passed in as an argument to the function: > > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count) > { > for(uint32_t i = 0; i < loop_count; i++) > // do work > } > > In order to "specialize" the function, we call the generic implementation > with hard-coded numbers - these are compile time constants! > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count) > { > // use hard coded constant for compile-time constant-propogation > return miniflow_impl_generic(mf, 5); > } > > Given the compiler is aware of the loop trip count at compile time, > it can perform an optimization known as "constant propogation". Combined > with inlining of the miniflow_impl_generic() function, the compiler is > now enabled to *compile time* unroll the loop 5x, and produce "flat" code. > > The last step to using the specialized functions is to utilize a > function-pointer to choose the specialized (or generic) implementation. > The selection of the function pointer is performed at subtable creation > time, when miniflow fingerprint of the subtable is known. This technique > is known as "multiple dispatch" in some literature, as it uses multiple > items of information (miniflow bit counts) to select the dispatch function. > > By pointing the function pointer at the optimized implementation, OvS > benefits from the compile time optimizations at runtime. > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> > Tested-by: Malvika Gupta <malvika.gupta@arm.com> > > --- > > v13: > - Update macros to new lookup function prototype without blocks scratch > > v12: > - Fix typo (Ian) > - Fix missing . after comments (Ian) > - Improve return value comments for probe function (Ian) > - Spaces after number in optimized lookup declarations (Ilya) > - Make VLOG level debug instead of info (Ilya) > > v11: > - Use MACROs to declare and check optimized functions (Ilya) > - Use captial letter for commit message (Ilya) > - Rebase onto latest patchset changes > - Added NEWS entry for data-path subtable specialization (Ian/Harry) > - Checkpatch notes an "incorrect bracketing" in the MACROs, however I > didn't find a solution that it does like. > > v10: > - Rebase changes from previous patches. > - Remove "restrict" keyword as windows CI failed, see here for details: > https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228 > > v8: > - Rework to use blocks_cache from the dpcls instance, to avoid variable > lenght arrays in the data-path. > --- > NEWS | 4 +++ > lib/dpif-netdev-lookup-generic.c | 49 ++++++++++++++++++++++++++++++++ > lib/dpif-netdev-private.h | 8 ++++++ > lib/dpif-netdev.c | 9 ++++-- > 4 files changed, 68 insertions(+), 2 deletions(-) > > diff --git a/NEWS b/NEWS > index 81130e667..4cfffb1bc 100644 > --- a/NEWS > +++ b/NEWS > @@ -34,6 +34,10 @@ Post-v2.11.0 > * 'ovs-appctl exit' now implies cleanup of non-internal ports in userspace > datapath regardless of '--cleanup' option. Use '--cleanup' to remove > internal ports too. > + * Datapath classifer code refactored to enable function pointers to select > + the lookup implementation at runtime. This enables specialization of > + specific subtables based on the miniflow attributes, enhancing the > + performance of the subtable search. > - OVSDB: > * OVSDB clients can now resynchronize with clustered servers much more > quickly after a brief disconnection, saving bandwidth and CPU time. > diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c > index 4dcf4640e..10ac41d8e 100644 > --- a/lib/dpif-netdev-lookup-generic.c > +++ b/lib/dpif-netdev-lookup-generic.c > @@ -260,7 +260,56 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable, > const struct netdev_flow_key *keys[], > struct dpcls_rule **rules) > { > + /* Here the runtime subtable->mf_bits counts are used, which forces the > + * compiler to iterate normal for() loops. Due to this limitation in the > + * compilers available optimizations, this function has lower performance > + * than the below specialized functions. > + */ > return lookup_generic_impl(subtable, keys_map, keys, rules, > subtable->mf_bits_set_unit0, > subtable->mf_bits_set_unit1); > } > + > +/* Expand out specialized functions with U0 and U1 bit attributes. */ > +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1) \ > + static uint32_t \ > + dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1( \ > + struct dpcls_subtable *subtable, \ > + uint32_t keys_map, \ > + const struct netdev_flow_key *keys[],\ > + struct dpcls_rule **rules) \ > + { \ > + return lookup_generic_impl(subtable, keys_map, keys, rules, U0, U1); \ > + } \ > + > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1) > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1) > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0) > + > +/* Check if a specialized function is valid for the required subtable. */ > +#define CHECK_LOOKUP_FUNCTION(U0,U1) \ Ian, could you, please, add a space after a comma here before applying the patch?
> On 18.07.2019 16:03, Harry van Haaren wrote: > > This commit adds a number of specialized functions, that handle > > common miniflow fingerprints. This enables compiler optimization, > > resulting in higher performance. Below a quick description of > > how this optimization actually works; > > > > "Specialized functions" are "instances" of the generic implementation, > > but the compiler is given extra context when compiling. In the case of > > iterating miniflow datastructures, the most interesting value to enable > > compile time optimizations is the loop trip count per unit. > > > > In order to create a specialized function, there is a generic > implementation, > > which uses a for() loop without the compiler knowing the loop trip count > at > > compile time. The loop trip count is passed in as an argument to the > function: > > > > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count) > > { > > for(uint32_t i = 0; i < loop_count; i++) > > // do work > > } > > > > In order to "specialize" the function, we call the generic > implementation > > with hard-coded numbers - these are compile time constants! > > > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count) > > { > > // use hard coded constant for compile-time constant-propogation > > return miniflow_impl_generic(mf, 5); > > } > > > > Given the compiler is aware of the loop trip count at compile time, > > it can perform an optimization known as "constant propogation". Combined > > with inlining of the miniflow_impl_generic() function, the compiler is > > now enabled to *compile time* unroll the loop 5x, and produce "flat" > code. > > > > The last step to using the specialized functions is to utilize a > > function-pointer to choose the specialized (or generic) implementation. > > The selection of the function pointer is performed at subtable creation > > time, when miniflow fingerprint of the subtable is known. This technique > > is known as "multiple dispatch" in some literature, as it uses multiple > > items of information (miniflow bit counts) to select the dispatch > function. > > > > By pointing the function pointer at the optimized implementation, OvS > > benefits from the compile time optimizations at runtime. > > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> > > Tested-by: Malvika Gupta <malvika.gupta@arm.com> > > > > --- > > > > v13: > > - Update macros to new lookup function prototype without blocks scratch > > > > v12: > > - Fix typo (Ian) > > - Fix missing . after comments (Ian) > > - Improve return value comments for probe function (Ian) > > - Spaces after number in optimized lookup declarations (Ilya) > > - Make VLOG level debug instead of info (Ilya) > > > > v11: > > - Use MACROs to declare and check optimized functions (Ilya) > > - Use captial letter for commit message (Ilya) > > - Rebase onto latest patchset changes > > - Added NEWS entry for data-path subtable specialization (Ian/Harry) > > - Checkpatch notes an "incorrect bracketing" in the MACROs, however I > > didn't find a solution that it does like. > > > > v10: > > - Rebase changes from previous patches. > > - Remove "restrict" keyword as windows CI failed, see here for details: > > https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228 > > > > v8: > > - Rework to use blocks_cache from the dpcls instance, to avoid variable > > lenght arrays in the data-path. > > --- > > NEWS | 4 +++ > > lib/dpif-netdev-lookup-generic.c | 49 ++++++++++++++++++++++++++++++++ > > lib/dpif-netdev-private.h | 8 ++++++ > > lib/dpif-netdev.c | 9 ++++-- > > 4 files changed, 68 insertions(+), 2 deletions(-) > > > > diff --git a/NEWS b/NEWS > > index 81130e667..4cfffb1bc 100644 > > --- a/NEWS > > +++ b/NEWS > > @@ -34,6 +34,10 @@ Post-v2.11.0 > > * 'ovs-appctl exit' now implies cleanup of non-internal ports in > userspace > > datapath regardless of '--cleanup' option. Use '--cleanup' to > remove > > internal ports too. > > + * Datapath classifer code refactored to enable function pointers > to select > > + the lookup implementation at runtime. This enables > specialization of > > + specific subtables based on the miniflow attributes, enhancing > the > > + performance of the subtable search. > > - OVSDB: > > * OVSDB clients can now resynchronize with clustered servers much > more > > quickly after a brief disconnection, saving bandwidth and CPU > time. > > diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup- > generic.c > > index 4dcf4640e..10ac41d8e 100644 > > --- a/lib/dpif-netdev-lookup-generic.c > > +++ b/lib/dpif-netdev-lookup-generic.c > > @@ -260,7 +260,56 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable > *subtable, > > const struct netdev_flow_key *keys[], > > struct dpcls_rule **rules) > > { > > + /* Here the runtime subtable->mf_bits counts are used, which forces > the > > + * compiler to iterate normal for() loops. Due to this limitation > in the > > + * compilers available optimizations, this function has lower > performance > > + * than the below specialized functions. > > + */ > > return lookup_generic_impl(subtable, keys_map, keys, rules, > > subtable->mf_bits_set_unit0, > > subtable->mf_bits_set_unit1); > > } > > + > > +/* Expand out specialized functions with U0 and U1 bit attributes. */ > > +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1) > \ > > + static uint32_t > \ > > + dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1( > \ > > + struct dpcls_subtable > *subtable, \ > > + uint32_t keys_map, > \ > > + const struct netdev_flow_key > *keys[],\ > > + struct dpcls_rule **rules) > \ > > + { > \ > > + return lookup_generic_impl(subtable, keys_map, keys, rules, U0, > U1); \ > > + } > \ > > + > > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1) > > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1) > > +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0) > > + > > +/* Check if a specialized function is valid for the required subtable. > */ > > +#define CHECK_LOOKUP_FUNCTION(U0,U1) > \ > > Ian, could you, please, add a space after a comma here before applying the > patch? Sure, no problem. Ian
diff --git a/NEWS b/NEWS index 81130e667..4cfffb1bc 100644 --- a/NEWS +++ b/NEWS @@ -34,6 +34,10 @@ Post-v2.11.0 * 'ovs-appctl exit' now implies cleanup of non-internal ports in userspace datapath regardless of '--cleanup' option. Use '--cleanup' to remove internal ports too. + * Datapath classifer code refactored to enable function pointers to select + the lookup implementation at runtime. This enables specialization of + specific subtables based on the miniflow attributes, enhancing the + performance of the subtable search. - OVSDB: * OVSDB clients can now resynchronize with clustered servers much more quickly after a brief disconnection, saving bandwidth and CPU time. diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c index 4dcf4640e..10ac41d8e 100644 --- a/lib/dpif-netdev-lookup-generic.c +++ b/lib/dpif-netdev-lookup-generic.c @@ -260,7 +260,56 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable, const struct netdev_flow_key *keys[], struct dpcls_rule **rules) { + /* Here the runtime subtable->mf_bits counts are used, which forces the + * compiler to iterate normal for() loops. Due to this limitation in the + * compilers available optimizations, this function has lower performance + * than the below specialized functions. + */ return lookup_generic_impl(subtable, keys_map, keys, rules, subtable->mf_bits_set_unit0, subtable->mf_bits_set_unit1); } + +/* Expand out specialized functions with U0 and U1 bit attributes. */ +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1) \ + static uint32_t \ + dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1( \ + struct dpcls_subtable *subtable, \ + uint32_t keys_map, \ + const struct netdev_flow_key *keys[],\ + struct dpcls_rule **rules) \ + { \ + return lookup_generic_impl(subtable, keys_map, keys, rules, U0, U1); \ + } \ + +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1) +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1) +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0) + +/* Check if a specialized function is valid for the required subtable. */ +#define CHECK_LOOKUP_FUNCTION(U0,U1) \ + if (!f && u0_bits == U0 && u1_bits == U1) { \ + f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1; \ + } + +/* Probe function to lookup an available specialized function. + * If capable to run the requested miniflow fingerprint, this function returns + * the most optimal implementation for that miniflow fingerprint. + * @retval Non-NULL A valid function to handle the miniflow bit pattern + * @retval NULL The requested miniflow is not supported by this implementation. + */ +dpcls_subtable_lookup_func +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits) +{ + dpcls_subtable_lookup_func f = NULL; + + CHECK_LOOKUP_FUNCTION(5, 1); + CHECK_LOOKUP_FUNCTION(4, 1); + CHECK_LOOKUP_FUNCTION(4, 0); + + if (f) { + VLOG_DBG("Subtable using Generic Optimized for u0 %d, u1 %d\n", + u0_bits, u1_bits); + } + return f; +} diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h index 610851a10..68c33a0f9 100644 --- a/lib/dpif-netdev-private.h +++ b/lib/dpif-netdev-private.h @@ -67,6 +67,14 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable, const struct netdev_flow_key *keys[], struct dpcls_rule **rules); +/* Probe function to select a specialized version of the generic lookup + * implementation. This provides performance benefit due to compile-time + * optimizations such as loop-unrolling. These are enabled by the compile-time + * constants in the specific function implementations. + */ +dpcls_subtable_lookup_func +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count); + /* A set of rules that all have the same fields wildcarded. */ struct dpcls_subtable { /* The fields are only used by writers. */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 702170698..d0a1c58ad 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -7712,8 +7712,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask) subtable->mf_masks = xmalloc(sizeof(uint64_t) * (unit0 + unit1)); netdev_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1); - /* Assign the generic lookup - this works with any miniflow fingerprint. */ - subtable->lookup_func = dpcls_subtable_lookup_generic; + /* Probe for a specialized generic lookup function. */ + subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1); + + /* If not set, assign generic lookup. Generic works for any miniflow. */ + if (!subtable->lookup_func) { + subtable->lookup_func = dpcls_subtable_lookup_generic; + } cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash); /* Add the new subtable at the end of the pvector (with no hits yet) */