mbox series

[ovs-dev,v12,0/5] dpcls func ptrs & optimizations

Message ID 20190717130033.25114-1-harry.van.haaren@intel.com
Headers show
Series dpcls func ptrs & optimizations | expand

Message

Van Haaren, Harry July 17, 2019, 1 p.m. UTC
Hey Folks,

Here a v12 of the DPCLS Function Pointer patchset, as has been
presented at OVS Conf in Nov '18, and discussed on the ML since then.
I'm aware of the soft-freeze for 2.12, I feel this patchset has had
enough reviews/versions/testing to be merged in 2.12.

Thanks Ilya and Ian for review comments on v11, they should all be addressed
in this v12. Patchset details below, summary of v11 changes as follows:
- Reworked various minor comments (Capitals, stops, and whitespace)
- Moved variable declarations to patch they're used in
- Improved function documentation on probe()
- Reduced log level from info to debug
- See per patch --- v12 notes for details.

Given the nearing soft-freeze and branch deadlines, I'd like to see
this get merged - and any future minor comments / improvements can be
handled before release.

Regards, -Harry


Patchset details;
1) Refactor dpcls_lookup and the subtable for flexibility.
In particular, add a function pointer to the subtable
structure, which enables "plugging-in" a lookup function
at runtime. This enables a number of optimizations in future.

2) and 3)
With the function pointer in place, we refactor the existing
dpcls_lookup matching code into its own function, and later its
own file. To split it to its own file requires making various
dpcls data-structures available in dpif-netdev-private.h.

4)
Refactor the existing code, to favour compute of flat arrays of
miniflows, instead of the MACRO based iteration. This simplifies
the code itself, and makes future optimizations possible due to
simplified loop structures, and loop trip counts pass in via
function arguments. See commit message for more details.

5)
This patch implements a select few specialized functions, for handling
miniflows with 5-1, 4-1, and 4-0 miniflow unit bit patterns. More of
these types of functions can (and should) be added to accelerate other
patterns of subtable lookups! See commit message for more details.


Harry van Haaren (5):
  dpif-netdev: Implement function pointers/subtable
  dpif-netdev: Move dpcls lookup structures to .h
  dpif-netdev: Split out generic lookup function
  dpif-netdev: Refactor generic implementation
  dpif-netdev: Add specialized generic scalar functions

 NEWS                             |   4 +
 lib/automake.mk                  |   2 +
 lib/dpif-netdev-lookup-generic.c | 290 +++++++++++++++++++++++++++++++
 lib/dpif-netdev-private.h        | 129 ++++++++++++++
 lib/dpif-netdev.c                | 197 ++++++++++-----------
 5 files changed, 525 insertions(+), 97 deletions(-)
 create mode 100644 lib/dpif-netdev-lookup-generic.c
 create mode 100644 lib/dpif-netdev-private.h

Comments

Ilya Maximets July 17, 2019, 3:03 p.m. UTC | #1
On 17.07.2019 16:00, Harry van Haaren wrote:
> Hey Folks,
> 
> Here a v12 of the DPCLS Function Pointer patchset, as has been
> presented at OVS Conf in Nov '18, and discussed on the ML since then.
> I'm aware of the soft-freeze for 2.12, I feel this patchset has had
> enough reviews/versions/testing to be merged in 2.12.
> 
> Thanks Ilya and Ian for review comments on v11, they should all be addressed
> in this v12. Patchset details below, summary of v11 changes as follows:
> - Reworked various minor comments (Capitals, stops, and whitespace)
> - Moved variable declarations to patch they're used in
> - Improved function documentation on probe()
> - Reduced log level from info to debug
> - See per patch --- v12 notes for details.
> 
> Given the nearing soft-freeze and branch deadlines, I'd like to see
> this get merged - and any future minor comments / improvements can be
> handled before release.
> 
> Regards, -Harry
> 
> 
> Patchset details;
> 1) Refactor dpcls_lookup and the subtable for flexibility.
> In particular, add a function pointer to the subtable
> structure, which enables "plugging-in" a lookup function
> at runtime. This enables a number of optimizations in future.
> 
> 2) and 3)
> With the function pointer in place, we refactor the existing
> dpcls_lookup matching code into its own function, and later its
> own file. To split it to its own file requires making various
> dpcls data-structures available in dpif-netdev-private.h.
> 
> 4)
> Refactor the existing code, to favour compute of flat arrays of
> miniflows, instead of the MACRO based iteration. This simplifies
> the code itself, and makes future optimizations possible due to
> simplified loop structures, and loop trip counts pass in via
> function arguments. See commit message for more details.
> 
> 5)
> This patch implements a select few specialized functions, for handling
> miniflows with 5-1, 4-1, and 4-0 miniflow unit bit patterns. More of
> these types of functions can (and should) be added to accelerate other
> patterns of subtable lookups! See commit message for more details.
> 
> 
> Harry van Haaren (5):
>   dpif-netdev: Implement function pointers/subtable
>   dpif-netdev: Move dpcls lookup structures to .h
>   dpif-netdev: Split out generic lookup function
>   dpif-netdev: Refactor generic implementation
>   dpif-netdev: Add specialized generic scalar functions
> 
>  NEWS                             |   4 +
>  lib/automake.mk                  |   2 +
>  lib/dpif-netdev-lookup-generic.c | 290 +++++++++++++++++++++++++++++++
>  lib/dpif-netdev-private.h        | 129 ++++++++++++++
>  lib/dpif-netdev.c                | 197 ++++++++++-----------
>  5 files changed, 525 insertions(+), 97 deletions(-)
>  create mode 100644 lib/dpif-netdev-lookup-generic.c
>  create mode 100644 lib/dpif-netdev-private.h
> 


I performed a few tests with v11 of this patch-set on my usual setup to check
performance of the new generic (non-optimized) implementation. The result is that
new generic implementation is ~5% faster for me than current master (it was 12%
for optimized lookup functions) which is good.
The code looks OK for me in general. I still don't really like the fact that
dpcls depends on the internal structure of a flowmap, but we, probably, could
deal with that while we have a build time assertion. Hope, we'll have some better
implementation with the same level of performance in the future.

Best regards, Ilya Maximets.
Van Haaren, Harry July 17, 2019, 4:16 p.m. UTC | #2
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Wednesday, July 17, 2019 4:03 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
> <ian.stokes@intel.com>
> Subject: Re: [PATCH v12 0/5] dpcls func ptrs & optimizations
> 
> On 17.07.2019 16:00, Harry van Haaren wrote:
> > Hey Folks,
> >
> > Here a v12 of the DPCLS Function Pointer patchset, as has been
> > presented at OVS Conf in Nov '18, and discussed on the ML since then.
> > I'm aware of the soft-freeze for 2.12, I feel this patchset has had
> > enough reviews/versions/testing to be merged in 2.12.

<snip>

> > Harry van Haaren (5):
> >   dpif-netdev: Implement function pointers/subtable
> >   dpif-netdev: Move dpcls lookup structures to .h
> >   dpif-netdev: Split out generic lookup function
> >   dpif-netdev: Refactor generic implementation
> >   dpif-netdev: Add specialized generic scalar functions
> >
> >  NEWS                             |   4 +
> >  lib/automake.mk                  |   2 +
> >  lib/dpif-netdev-lookup-generic.c | 290 +++++++++++++++++++++++++++++++
> >  lib/dpif-netdev-private.h        | 129 ++++++++++++++
> >  lib/dpif-netdev.c                | 197 ++++++++++-----------
> >  5 files changed, 525 insertions(+), 97 deletions(-)
> >  create mode 100644 lib/dpif-netdev-lookup-generic.c
> >  create mode 100644 lib/dpif-netdev-private.h
> >
> 
> 
> I performed a few tests with v11 of this patch-set on my usual setup to
> check
> performance of the new generic (non-optimized) implementation. The result is
> that
> new generic implementation is ~5% faster for me than current master (it was
> 12%
> for optimized lookup functions) which is good.
> The code looks OK for me in general. I still don't really like the fact that
> dpcls depends on the internal structure of a flowmap, but we, probably,
> could
> deal with that while we have a build time assertion. Hope, we'll have some
> better
> implementation with the same level of performance in the future.


Thanks for the feedback on performance Ilya - good to see that we're going
in the right direction performance wise anyway.

I have one item I'd still like to improve in this patchset, and its regarding
where the blocks scratch array is being stored. I'll rework and find a better
solution than the current, and post that as the lucky patchset v13 :)

Cheers, -Harry
Ilya Maximets July 17, 2019, 4:26 p.m. UTC | #3
On 17.07.2019 19:16, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>> Sent: Wednesday, July 17, 2019 4:03 PM
>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
>> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
>> <ian.stokes@intel.com>
>> Subject: Re: [PATCH v12 0/5] dpcls func ptrs & optimizations
>>
>> On 17.07.2019 16:00, Harry van Haaren wrote:
>>> Hey Folks,
>>>
>>> Here a v12 of the DPCLS Function Pointer patchset, as has been
>>> presented at OVS Conf in Nov '18, and discussed on the ML since then.
>>> I'm aware of the soft-freeze for 2.12, I feel this patchset has had
>>> enough reviews/versions/testing to be merged in 2.12.
> 
> <snip>
> 
>>> Harry van Haaren (5):
>>>   dpif-netdev: Implement function pointers/subtable
>>>   dpif-netdev: Move dpcls lookup structures to .h
>>>   dpif-netdev: Split out generic lookup function
>>>   dpif-netdev: Refactor generic implementation
>>>   dpif-netdev: Add specialized generic scalar functions
>>>
>>>  NEWS                             |   4 +
>>>  lib/automake.mk                  |   2 +
>>>  lib/dpif-netdev-lookup-generic.c | 290 +++++++++++++++++++++++++++++++
>>>  lib/dpif-netdev-private.h        | 129 ++++++++++++++
>>>  lib/dpif-netdev.c                | 197 ++++++++++-----------
>>>  5 files changed, 525 insertions(+), 97 deletions(-)
>>>  create mode 100644 lib/dpif-netdev-lookup-generic.c
>>>  create mode 100644 lib/dpif-netdev-private.h
>>>
>>
>>
>> I performed a few tests with v11 of this patch-set on my usual setup to
>> check
>> performance of the new generic (non-optimized) implementation. The result is
>> that
>> new generic implementation is ~5% faster for me than current master (it was
>> 12%
>> for optimized lookup functions) which is good.
>> The code looks OK for me in general. I still don't really like the fact that
>> dpcls depends on the internal structure of a flowmap, but we, probably,
>> could
>> deal with that while we have a build time assertion. Hope, we'll have some
>> better
>> implementation with the same level of performance in the future.
> 
> 
> Thanks for the feedback on performance Ilya - good to see that we're going
> in the right direction performance wise anyway.
> 
> I have one item I'd still like to improve in this patchset, and its regarding
> where the blocks scratch array is being stored. I'll rework and find a better
> solution than the current, and post that as the lucky patchset v13 :)

As an idea, I could suggest DEFINE_STATIC_PER_THREAD_DATA inside the dpif-netdev-lookup-generic.c.
Van Haaren, Harry July 17, 2019, 4:33 p.m. UTC | #4
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Wednesday, July 17, 2019 5:26 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
> <ian.stokes@intel.com>
> Subject: Re: [PATCH v12 0/5] dpcls func ptrs & optimizations

<snip>

> >> I performed a few tests with v11 of this patch-set on my usual setup to
> >> check
> >> performance of the new generic (non-optimized) implementation. The result
> is
> >> that
> >> new generic implementation is ~5% faster for me than current master (it
> was
> >> 12%
> >> for optimized lookup functions) which is good.
> >> The code looks OK for me in general. I still don't really like the fact
> that
> >> dpcls depends on the internal structure of a flowmap, but we, probably,
> >> could
> >> deal with that while we have a build time assertion. Hope, we'll have
> some
> >> better
> >> implementation with the same level of performance in the future.
> >
> >
> > Thanks for the feedback on performance Ilya - good to see that we're going
> > in the right direction performance wise anyway.
> >
> > I have one item I'd still like to improve in this patchset, and its
> regarding
> > where the blocks scratch array is being stored. I'll rework and find a
> better
> > solution than the current, and post that as the lucky patchset v13 :)
> 
> As an idea, I could suggest DEFINE_STATIC_PER_THREAD_DATA inside the dpif-
> netdev-lookup-generic.c.


Thanks for the suggestion - that would make the blocks scratch pointer available
on a per-thread basis, and be initialized to NULL correct?

As a result, we would must a (predictable) branch to check if the pointer is
NULL, to allocate the required space on first usage of the subtable_lookup?

Or is there a better way to initialize it for all threads that call dpcls_lookup()?
My concern here is what if one thread creates a subtable (and allocs blocks_scratch for itself),
but then the PMD is polled by another thread - which uses its own blocks_scratch which is NULL.

Hence, I think the runtime check + alloc is probably the best/safest way,
but I'd appreciate your input on the above threading logic Ilya :)
Ilya Maximets July 18, 2019, 11:29 a.m. UTC | #5
On 17.07.2019 19:33, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>> Sent: Wednesday, July 17, 2019 5:26 PM
>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@openvswitch.org
>> Cc: echaudro@redhat.com; malvika.gupta@arm.com; Stokes, Ian
>> <ian.stokes@intel.com>
>> Subject: Re: [PATCH v12 0/5] dpcls func ptrs & optimizations
> 
> <snip>
> 
>>>> I performed a few tests with v11 of this patch-set on my usual setup to
>>>> check
>>>> performance of the new generic (non-optimized) implementation. The result
>> is
>>>> that
>>>> new generic implementation is ~5% faster for me than current master (it
>> was
>>>> 12%
>>>> for optimized lookup functions) which is good.
>>>> The code looks OK for me in general. I still don't really like the fact
>> that
>>>> dpcls depends on the internal structure of a flowmap, but we, probably,
>>>> could
>>>> deal with that while we have a build time assertion. Hope, we'll have
>> some
>>>> better
>>>> implementation with the same level of performance in the future.
>>>
>>>
>>> Thanks for the feedback on performance Ilya - good to see that we're going
>>> in the right direction performance wise anyway.
>>>
>>> I have one item I'd still like to improve in this patchset, and its
>> regarding
>>> where the blocks scratch array is being stored. I'll rework and find a
>> better
>>> solution than the current, and post that as the lucky patchset v13 :)
>>
>> As an idea, I could suggest DEFINE_STATIC_PER_THREAD_DATA inside the dpif-
>> netdev-lookup-generic.c.
> 
> 
> Thanks for the suggestion - that would make the blocks scratch pointer available
> on a per-thread basis, and be initialized to NULL correct?
> 
> As a result, we would must a (predictable) branch to check if the pointer is
> NULL, to allocate the required space on first usage of the subtable_lookup?
> 
> Or is there a better way to initialize it for all threads that call dpcls_lookup()?
> My concern here is what if one thread creates a subtable (and allocs blocks_scratch for itself),
> but then the PMD is polled by another thread - which uses its own blocks_scratch which is NULL.
> 
> Hence, I think the runtime check + alloc is probably the best/safest way,
> but I'd appreciate your input on the above threading logic Ilya :)
> 


Sorry, I was already out of office when this message arrived. I'll send my comments
on your v13.

Best regards, Ilya Maximets.