
[ovs-dev,v10,0/5] dpcls func ptrs & optimizations

Message ID 20190709123440.45519-1-harry.van.haaren@intel.com

Message

Van Haaren, Harry July 9, 2019, 12:34 p.m. UTC
Hey All,


Here is v10 of the DPCLS Function Pointer patchset, as presented at
OVS Conf in Nov '18 and discussed on the ML since then. I'm aware of
the soft-freeze for 2.12, but I feel this patchset has had enough
reviews/versions/testing to be merged in 2.12.

Thanks to Ilya and Ian for the review comments on v9; they should all be
addressed in this v10.

Thanks to Malvika Gupta for testing (Tested-by tags added to the patches)
and for reporting ARM performance gains; see here for details:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/360088.html


Regards, -Harry


Patchset details:
1) Refactor dpcls_lookup and the subtable for flexibility.
In particular, add a function pointer to the subtable
structure, which enables "plugging in" a lookup function
at runtime. This enables a number of future optimizations
(see the sketch after this list).

2) and 3)
With the function pointer in place, we refactor the existing
dpcls_lookup matching code into its own function, and later its
own file. Splitting it into its own file requires making various
dpcls data structures available in the dpif-netdev.h header.

4)
Refactor the existing code to favour computing flat arrays of
miniflows instead of the macro-based iteration. This simplifies
the code itself, and makes future optimizations possible due to
simplified loop structures and loop trip counts passed in via
function arguments. See commit message for more details.

5)
This patch implements a select few specialized functions for handling
miniflows with 5-1, 4-1, and 4-0 unit bit patterns (i.e. the number of
bits set in miniflow units 0 and 1). More of these functions can (and
should) be added to accelerate other patterns of subtable lookups! See
commit message for more details.
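
As a rough illustration of the mechanism in patch 1, here is a minimal
sketch. The lookup function signature matches the code in the incremental
diff later in this thread; the subtable layout and any field types not
visible there are illustrative assumptions, not the exact committed code.

/* Per-subtable lookup function, "plugged in" at runtime. */
typedef uint32_t (*dpcls_subtable_lookup_func)(
        struct dpcls_subtable *subtable,
        uint64_t *blocks_scratch,             /* Scratch for flattened blocks. */
        uint32_t keys_map,                    /* Bitmap of keys to look up. */
        const struct netdev_flow_key *keys[], /* One key per packet in burst. */
        struct dpcls_rule **rules);           /* Output: matched rules. */

struct dpcls_subtable {
    struct cmap rules;                      /* Rules in this subtable. */
    uint32_t mf_bits_set_unit0;             /* Bits set in miniflow unit 0. */
    uint32_t mf_bits_set_unit1;             /* Bits set in miniflow unit 1. */
    dpcls_subtable_lookup_func lookup_func; /* Lookup chosen at creation. */
    struct netdev_flow_key mask;            /* Wildcard mask for subtable. */
};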

Harry van Haaren (5):
  dpif-netdev: implement function pointers/subtable
  dpif-netdev: move dpcls lookup structures to .h
  dpif-netdev: split out generic lookup function
  dpif-netdev: refactor generic implementation
  dpif-netdev: add specialized generic scalar functions

 lib/automake.mk                  |   1 +
 lib/dpif-netdev-lookup-generic.c | 298 +++++++++++++++++++++++++++++++
 lib/dpif-netdev.c                | 195 ++++++++++----------
 lib/dpif-netdev.h                |  94 ++++++++++
 4 files changed, 491 insertions(+), 97 deletions(-)
 create mode 100644 lib/dpif-netdev-lookup-generic.c

Comments

Ilya Maximets July 11, 2019, 2:13 p.m. UTC | #1
On 09.07.2019 15:34, Harry van Haaren wrote:
> Hey All,
> 
> 
> Here a v10 of the DPCLS Function Pointer patchset, as has been
> presented at OVS Conf in Nov '18, and discussed on the ML since then.
> I'm aware of the soft-freeze for 2.12, I feel this patchset has had
> enough reviews/versions/testing to be merged in 2.12.
> 
> Thanks Ilya and Ian for review comments on v9, they should all be addressed
> in this v10. 
> 
> Thanks Malvika Gupta for testing (Tested-by tag added to patches) and also
> for reporting ARM performance gains, see here for details:
> https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/360088.html
> 
> 
> Regards, -Harry

Hi, Harry.
Thanks for working on this.

I performed some tests with this version in my usual PVP with bonded PHY
setup and here are some observations:

* The bug that redirected packets to wrong rules is gone. At least I can't
  reproduce it in my testing anymore, so I assume it's fixed now.

* The dpcls performance boost for 512B packets is around 12% compared
  with current master.

A few remarks about the test scenario:
Most packets go through the NORMAL action with vlan push/pop.
Packets that go from the VM to the balanced-tcp bonded PHY go through
recirculation. The datapath flows for them look like this:

Before recirculation:
recirc_id=0,eth,ip,vlan_tci=0x0000/0x1fff,dl_src=aa:16:3e:24:30:dd,dl_dst=aa:bb:cc:dd:ee:11,nw_frag=no

After recirculation:
recirc_id=0x1,dp_hash=0xf5/0xff,eth,ip,dl_vlan=42,dl_vlan_pcp=0,nw_frag=no

I have 256 flows in the datapath for different 'dp_hash'es.

So, even when the number of ipv4 flows is as high as 256K, I have only ~270
datapath flows in dpcls. (This gives dpcls a huge advantage over EMC and SMC.)

All the flows fit into the 5+1 case (5 bits set in miniflow unit 0, 1 bit
in unit 1), i.e. the optimized function dpcls_subtable_lookup_mf_u0w5_u1w1
is used.

Most interesting observation:

* The new version of dpcls lookup outperforms SMC in this setup even with
  a relatively small number of flows: dpcls is faster than SMC by 1.5%
  with 8K flows and by 5.7% with 256K flows.
  Of course, SMC is 10% faster than dpcls with 8 flows, but that's not very
  interesting because no-one can beat EMC in this area.

I'd like to read the code more carefully tomorrow and probably give some
more feedback.

Best regards, Ilya Maximets.
Van Haaren, Harry July 11, 2019, 2:40 p.m. UTC | #2
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Thursday, July 11, 2019 3:14 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: malvika.gupta@arm.com; Stokes, Ian <ian.stokes@intel.com>; Michal Orsák
> <michal.orsak@gmail.com>
> Subject: Re: [PATCH v10 0/5] dpcls func ptrs & optimizations
> 
> On 09.07.2019 15:34, Harry van Haaren wrote:
> > <snip> cover letter
> 
> Hi, Harry.
> Thanks for working on this.

My pleasure - it’s a nice part of OVS. And there's lots more to do :)


> I performed some tests with this version in my usual PVP with bonded PHY
> setup and here are some observations:
> 
> * The bug that redirected packets to wrong rules is gone. At least I can't
>   reproduce it in my testing anymore, so I assume it's fixed now.
> 
> * The dpcls performance boost for 512B packets is around 12% compared
>   with current master.

Ah great! Glad to hear it's giving you a performance boost.


> A few remarks about the test scenario:
> Most packets go through the NORMAL action with vlan push/pop.
> Packets that go from the VM to the balanced-tcp bonded PHY go through
> recirculation. The datapath flows for them look like this:
> 
> Before recirculation:
> recirc_id=0,eth,ip,vlan_tci=0x0000/0x1fff,dl_src=aa:16:3e:24:30:dd,dl_dst=aa:bb:cc:dd:ee:11,nw_frag=no
> 
> After recirculation:
> recirc_id=0x1,dp_hash=0xf5/0xff,eth,ip,dl_vlan=42,dl_vlan_pcp=0,nw_frag=no
> 
> I have 256 flows in the datapath for different 'dp_hash'es.
> 
> So, even when the number of ipv4 flows is as high as 256K, I have only ~270
> datapath flows in dpcls. (This gives dpcls a huge advantage over EMC and SMC.)

Right - I'm a big fan of the consistent performance characteristic of DPCLS,
which is due to its wildcarding capabilities and lack of caching concepts.


> All the flows fit into the 5+1 case (5 bits set in miniflow unit 0, 1 bit
> in unit 1), i.e. the optimized function dpcls_subtable_lookup_mf_u0w5_u1w1
> is used.
> 
> Most interesting observation:
> 
> * The new version of dpcls lookup outperforms SMC in this setup even with
>   a relatively small number of flows: dpcls is faster than SMC by 1.5%
>   with 8K flows and by 5.7% with 256K flows.
>   Of course, SMC is 10% faster than dpcls with 8 flows, but that's not very
>   interesting because no-one can beat EMC in this area.
>
> I'd like to read the code more carefully tomorrow and probably give some
> more feedback.
> 
> Best regards, Ilya Maximets.

Thanks for your comments - please do prioritize any further feedback, because
as you know the 2.12 soft-freeze is already in effect.

I'll work on Ian's comments on v10, but hold off sending v11 until there
is some feedback from you too :)

Thanks again, -Harry
Ilya Maximets July 12, 2019, 10:49 a.m. UTC | #3
On 11.07.2019 17:40, Van Haaren, Harry wrote:
> <snip> earlier discussion
> 
> I'll work on Ian's comments on v10, but hold off sending v11 until there
> is some feedback from you too :)

Few comments:

1. I don't really like that the new dpcls depends on the internal structure
   of 'struct flowmap'. At the least, we need to add a build assert to
   notify anyone who tries to change it.

2. There's no need to expose so much internal stuff to every module that
   includes 'dpif-netdev.h'. Can we create 'dpif-netdev-private.h' instead
   and move all the subtable-related stuff there?

3. IMHO, it's better to apply some preprocessor optimizations to make it
   easier to add new optimized functions and decrease code duplication.

4. Please, add a space between 'ULLONG_FOR_EACH_1' and '('.

5. A lot of comments need to be re-formatted according to the coding style,
   i.e. first letter capitalized + period at the end.

6. "generic optimized" sounds weird.

Here is an incremental diff I'm suggesting, which covers comments 1, 3 and 4:

diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index 12406f4c1..2d210bf93 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -31,6 +31,9 @@
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
 
+/* Lookup functions below depends on the internal structure of a flowmap. */
+BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);
+
 /* netdev_flow_key_flatten_unit:
  * Given a packet, table and mf_masks, this function iterates over each bit
  * set in the subtable, and calculates the appropriate metadata to store in the
@@ -129,8 +132,8 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
 {
     const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
     const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
-
     uint64_t not_match = 0;
+
     for (int i = 0; i < mf_bits_total; i++) {
         not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i];
     }
@@ -163,7 +166,7 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
     int i;
 
     /* Flatten the packet metadata into the blocks_scratch[] using subtable */
-    ULLONG_FOR_EACH_1(i, keys_map) {
+    ULLONG_FOR_EACH_1 (i, keys_map) {
             netdev_flow_key_flatten(keys[i],
                                     &subtable->mask,
                                     mf_masks,
@@ -173,9 +176,10 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
     }
 
     /* Hash the now linearized blocks of packet metadata */
-    ULLONG_FOR_EACH_1(i, keys_map) {
+    ULLONG_FOR_EACH_1 (i, keys_map) {
          uint32_t hash = 0;
          uint32_t i_off = i * bit_count_total;
+
          for (int h = 0; h < bit_count_total; h++) {
              hash = hash_add64(hash, blocks_scratch[i_off + h]);
          }
@@ -188,12 +192,13 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
      */
     uint32_t found_map;
     const struct cmap_node *nodes[NETDEV_MAX_BURST];
+
     found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
 
     /* Verify that packet actually matched rule. If not found, a hash
      * collision has taken place, so continue searching with the next node.
      */
-    ULLONG_FOR_EACH_1(i, found_map) {
+    ULLONG_FOR_EACH_1 (i, found_map) {
         struct dpcls_rule *rule;
 
         CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
@@ -236,41 +241,25 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
                                subtable->mf_bits_set_unit1);
 }
 
-static uint32_t
-dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
-                                   uint64_t *blocks_scratch,
-                                   uint32_t keys_map,
-                                   const struct netdev_flow_key *keys[],
-                                   struct dpcls_rule **rules)
-{
-    /* hard coded bit counts - enables compile time loop unrolling, and
-     * generating of optimized code-sequences due to loop unrolled code.
-     */
-    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
-                               5, 1);
-}
+#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
+    static uint32_t                                                           \
+    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
+                                         struct dpcls_subtable *subtable,     \
+                                         uint64_t *blocks_scratch,            \
+                                         uint32_t keys_map,                   \
+                                         const struct netdev_flow_key *keys[],\
+                                         struct dpcls_rule **rules)           \
+    {                                                                         \
+        /* Hard coded bit counts - enables compile time loop unrolling, and   \
+         * generating of optimized code-sequences due to loop unrolled code.  \
+         */                                                                   \
+        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
+                                   keys, rules, U0, U1);                      \
+    }                                                                         \
 
-static uint32_t
-dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
-                                   uint64_t *blocks_scratch,
-                                   uint32_t keys_map,
-                                   const struct netdev_flow_key *keys[],
-                                   struct dpcls_rule **rules)
-{
-    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
-                               4, 1);
-}
-
-static uint32_t
-dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
-                                   uint64_t *blocks_scratch,
-                                   uint32_t keys_map,
-                                   const struct netdev_flow_key *keys[],
-                                   struct dpcls_rule **rules)
-{
-    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
-                               4, 0);
-}
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1);
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1);
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0);
 
 /* Probe function to lookup an available specialized function.
  * If capable to run the requested miniflow fingerprint, this function returns
@@ -283,14 +272,15 @@ dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
 {
     dpcls_subtable_lookup_func f = NULL;
 
-    if (u0_bits == 5 && u1_bits == 1) {
-        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
-    } else if (u0_bits == 4 && u1_bits == 1) {
-        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
-    } else if (u0_bits == 4 && u1_bits == 0) {
-        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
+#define CHECK_LOOKUP_FUNCTION(U0, U1)                      \
+    if (!f && u0_bits == U0 && u1_bits == U1) {            \
+        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;    \
     }
 
+    CHECK_LOOKUP_FUNCTION(5, 1);
+    CHECK_LOOKUP_FUNCTION(4, 1);
+    CHECK_LOOKUP_FUNCTION(4, 0);
+
     if (f) {
         VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
                   u0_bits, u1_bits);
---
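
As a usage illustration (not part of the diff above - the actual call site
isn't shown here, and the 'lookup_func' field name plus the NULL-on-no-match
probe behaviour are assumptions):

    /* At subtable instantiation, try a specialized lookup function first. */
    subtable->lookup_func = dpcls_subtable_generic_probe(unit0_bits,
                                                         unit1_bits);
    if (!subtable->lookup_func) {
        /* No specialized match - fall back to the generic implementation. */
        subtable->lookup_func = dpcls_subtable_lookup_generic;
    }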

Best regards, Ilya Maximets.
Ilya Maximets July 12, 2019, 11:22 a.m. UTC | #4
On 12.07.2019 13:49, Ilya Maximets wrote:
> <snip> earlier discussion
> 
> 6. "generic optimized" sounds weird.

One more thing. Could you, please, rename patches like:

dpif-netdev: implement function pointers/subtable

to

dpif-netdev: Implement function pointers/subtable.


This is not a strict requirement, but it is the common formatting for OVS
(capital letter + period).

> <snip> suggested incremental diff (see previous message)
Van Haaren, Harry July 12, 2019, 2:05 p.m. UTC | #5
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Friday, July 12, 2019 11:50 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: malvika.gupta@arm.com; Stokes, Ian <ian.stokes@intel.com>; Michal Orsák
> <michal.orsak@gmail.com>; dev@openvswitch.org
> Subject: Re: [PATCH v10 0/5] dpcls func ptrs & optimizations

<snip> lots of chatter


Thanks for the prompt review!


> Few comments:
> 
> 1. I don't really like that the new dpcls depends on the internal structure
>    of 'struct flowmap'. At the least, we need to add a build assert to
>    notify anyone who tries to change it.

Agreed - adding a build-time assert is a good idea, will include in v11.


> 2. There's no need to expose so much internal stuff to every module that
>    includes 'dpif-netdev.h'. Can we create 'dpif-netdev-private.h' instead
>    and move all the subtable-related stuff there?

Good idea - keeps things a bit cleaner, will do for v11.


> 3. IMHO, it's better to apply some preprocessor optimizations to make it
>    easier to add new optimized functions and decrease code duplication.

Yes - I had intended to macro them out when adding more; v11 will include
this, thanks!


> 4. Please, add a space between 'ULLONG_FOR_EACH_1' and '('.

Sure - I see you've sent checkpatch improvements - nice one.


> 5. A lot of comments need to be re-formatted according to the coding style,
>    i.e. first letter capitalized + period at the end.

Yes - unfortunately I'm trained & hard-coded to DPDK standards... Will fix
as much as I see while rebasing.


> 6. "generic optimized" sounds weird.

I think it’s the best description of what it is that I can think of.

"Generic" in that it will work for any subtable & packet combination;
"optimized" in that it is specialized for a specific use-case at compile time.
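
To make the compile-time part concrete, here is a toy sketch of the
pattern (not the actual OVS code): the generic implementation takes the
loop trip count as a parameter, and each specialized wrapper passes a
compile-time constant, so the compiler can fully unroll the loop.

static inline uint32_t
hash_blocks_impl(const uint64_t *blocks, uint32_t count)
{
    uint32_t hash = 0;
    /* With a constant 'count', the compiler unrolls this loop. */
    for (uint32_t i = 0; i < count; i++) {
        hash += (uint32_t) blocks[i];
    }
    return hash;
}

/* Specialized instance: the trip count is the compile-time constant 5 + 1. */
static uint32_t
hash_blocks_u0w5_u1w1(const uint64_t *blocks)
{
    return hash_blocks_impl(blocks, 5 + 1);
}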


[Pulling in from other email]
>> Ilya wrote:
>> One more thing. Could you, please, rename patches like:
>> dpif-netdev: Implement function pointers/subtable.

Yes, will do.


> Here is some incremental I'm suggesting which covers comments 1, 3 and 4:

Thanks for the suggestions - working on it.

<snip> actual code changes (see previous email on mailing list archive).

Regards, -Harry