
[ovs-dev,2/2] dpif-netlink: Fix overriding the number of handler threads.

Message ID e2207ec1aeb4aa3490e9aac71383d76a9fcc5283.1707226223.git.echaudro@redhat.com
State Changes Requested
Delegated to: Eelco Chaudron
Series [ovs-dev,1/2] dpif-netdev: Do not create handler threads.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/intel-ovs-compilation success test: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Eelco Chaudron Feb. 6, 2024, 1:30 p.m. UTC
Previously, the ability to override the system default for the number
of handler threads has been broken since the introduction of the
per-CPU upcall handlers.

This patch rectifies this behavior by re-enabling manual configuration
of the handler thread count.

Fixes: b1e517bd2f81 ("dpif-netlink: Introduce per-cpu upcall dispatch.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
---
 lib/dpif-netlink.c            | 8 +++++++-
 lib/dpif-provider.h           | 4 ++++
 ofproto/ofproto-dpif-upcall.c | 2 +-
 3 files changed, 12 insertions(+), 2 deletions(-)

Comments

Ilya Maximets Feb. 6, 2024, 1:46 p.m. UTC | #1
On 2/6/24 14:30, Eelco Chaudron wrote:
> Previously, the ability to override the system default for the number
> of handler threads was broken since to the introduction of the per-CPU
> upcall handlers.

It wasn't broken, it was intentionally disabled.  I don't think we should
introduce the ability to configure the number of handlers for per-CPU dispatch
mode.  It only allows users to shoot themselves in the foot if they do not
understand what they are doing.

Many integration scripts in the wild are still blindly setting these options
without even knowing that the underlying implementation changed completely.
And those setups may suffer if we suddenly decide to respect the configuration
that is ignored on modern kernels for all currently supported versions of OVS.

Best regards, Ilya Maximets.
Eelco Chaudron Feb. 6, 2024, 2:07 p.m. UTC | #2
On 6 Feb 2024, at 14:46, Ilya Maximets wrote:

> On 2/6/24 14:30, Eelco Chaudron wrote:
>> Previously, the ability to override the system default for the number
>> of handler threads was broken since to the introduction of the per-CPU
>> upcall handlers.
>
> It wasn't broken, it was intentionally disabled.  I don't think we should
> introduce ability to configure the number of handlers for per-CPU dispatch
> mode.  It only allows users to shoot themselves in a foot if they do not
> understand what they are doing.

I think the design was not thought through enough, and it resulted in a lot of corner-case issues. From the start, we should have kept allowing tuning of this option, as it was before.

> Many integration scripts in a wild are still blindly setting these options
> without even knowing that the underlying implementation changed completely.
> And those setups may suffer if we suddenly decide to respect the configuration
> that is ignored on modern kernels for all currently supported versions of OVS.

I feel like this is not OVS’s problem. If people do not know what they are doing, they should not touch it… We have a nice Dutch saying for this: “Wie zijn billen brandt, moet op de blaren zitten” (roughly: who burns his bottom must sit on the blisters) :)

However, I do feel like we should have an option to configure this, as it makes no sense on a 256-core system to create 256 handler threads. Maybe we should add a new option, n-forced-handler-threads. This will avoid the above issue with existing scripts.

Thoughts?

//Eelco
Ilya Maximets Feb. 6, 2024, 3:01 p.m. UTC | #3
On 2/6/24 15:07, Eelco Chaudron wrote:
> 
> 
> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
> 
>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>> Previously, the ability to override the system default for the number
>>> of handler threads was broken since to the introduction of the per-CPU
>>> upcall handlers.
>>
>> It wasn't broken, it was intentionally disabled.  I don't think we should
>> introduce ability to configure the number of handlers for per-CPU dispatch
>> mode.  It only allows users to shoot themselves in a foot if they do not
>> understand what they are doing.
> 
> I think the design was not thought through enough, and resulted in a lot of
> corner case issues. From the start, we should still have allowed tuning of
> this option as it was before.

The tuning is much more complicated and has very different consequences
with per-cpu dispatch compared to per-vport.  So, the results of adjusting
this option can be very different.

> 
>> Many integration scripts in a wild are still blindly setting these options
>> without even knowing that the underlying implementation changed completely.
>> And those setups may suffer if we suddenly decide to respect the configuration
>> that is ignored on modern kernels for all currently supported versions of OVS.
> 
> I feel like this is not OVS’s problem. If people do not know what they are doing,
> they should not touch it… We have a nice Dutch saying for this "Wie zijn billen
> brandt, moet op de blaren zitten” :)

It is OVS's problem for the same reason as above.  We changed the logic
under the hood so much that users who expected a certain behavior in the
past may see very different behavior now.  We can't just change the
meaning of the knob that easily.  See below.

> 
> However, I do feel like we should have an option to configure this, as it makes
> no sense on a 256-core system to create 256 handler threads.

It actually makes perfect sense.  Since the traffic can appear on 256 cores, we
should be able to handle it on each one of them in order to have balanced load.

With per-vport dispatch, on a system with 10 cores and 5 handler threads,
all 5 handler threads will share the load, because each one of them listens
on all the ports, hence each one of them can process packets whenever it
has time to do so.

With per-CPU dispatch, on the same system these 5 threads are mapped to specific
cores, 2 cores per thread.  If all the traffic comes in on 4 cores, we may have
2 threads overloaded while the other 3 threads do nothing.  In the same scenario
with per-vport dispatch we would have all 5 threads sharing the load.

So, in order for the user to make a decision on the number of threads, they
have to know on which CPUs they are most likely to receive interrupts and how
exactly OVS will map handlers to CPU numbers.  That is too much for a user to
know, IMO.  Some of that is actually impossible to know.  And when the traffic
pattern changes, the number of threads may need to be changed to game the
mapping and avoid overload.

Also, multiply that complexity by the CPU affinity of the ovs-vswitchd that
changes the mapping.

AFAIU, the only way to achieve an even load without dynamic reconfiguration is
to have a thread per CPU core.

> Maybe we should
> add a new option, n-forced-handler-threads. This will avoid the above issue with
> existing scripts.

Users can limit the number of threads by setting CPU affinity for ovs-vswitchd.
If these users are knowledgeable enough to predict traffic patterns and CPU
mappings, they can set appropriate CPU affinity masks as well.

For others, changing the meaning of a knob underneath them would be kind
of irresponsible from our side.

Best regards, Ilya Maximets.
Eelco Chaudron Feb. 8, 2024, 12:44 p.m. UTC | #4
On 6 Feb 2024, at 16:01, Ilya Maximets wrote:

> On 2/6/24 15:07, Eelco Chaudron wrote:
>>
>>
>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>
>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>> Previously, the ability to override the system default for the number
>>>> of handler threads was broken since to the introduction of the per-CPU
>>>> upcall handlers.
>>>
>>> It wasn't broken, it was intentionally disabled.  I don't think we should
>>> introduce ability to configure the number of handlers for per-CPU dispatch
>>> mode.  It only allows users to shoot themselves in a foot if they do not
>>> understand what they are doing.
>>
>> I think the design was not thought through enough, and resulted in a lot of
>> corner case issues. From the start, we should still have allowed tuning of
>> this option as it was before.
>
> The tuning is much more complicated and has very different consequences
> with per-cpu dispatch compared to per-vport.  So, the results of adjusting
> this option can be very different.

>>> Many integration scripts in a wild are still blindly setting these options
>>> without even knowing that the underlying implementation changed completely.
>>> And those setups may suffer if we suddenly decide to respect the configuration
>>> that is ignored on modern kernels for all currently supported versions of OVS.
>>
>> I feel like this is not OVS’s problem. If people do not know what they are doing,
>> they should not touch it… We have a nice Dutch saying for this "Wie zijn billen
>> brandt, moet op de blaren zitten” :)
>
> It is an OVS's problem for the same reason as above.  We changed the logic
> under the hood so much that users that expected a certain behavior in the
> past may have a very different behavior now.  We can't just change the
> meaning of the knob that easily.  See below.

Ack, so I would suggest adding a new knob.

>>
>> However, I do feel like we should have an option to configure this, as it makes
>> no sense on a 256-core system to create 256 handler threads.
>
> It actually makes perfect sense.  Since the traffic can appear on 256 cores, we
> should be able to handle it on each one of them in order to have balanced load.
>
> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
> all 5 handler threads will share the load, because each one of them listens
> on all the ports, hence each one of them can process packets whenever it
> has time to do so.

Yes, but this had other problems :)

> With per-CPU dispatch, on a same system these 5 threads are mapped to specific
> cores, 2 cores per thread.  If all the traffic comes on 4 cores, we may have
> 2 threads overloaded and 3 other threads do nothing.  In the same scenario with
> per-vport dispatch we would have all 5 threads sharing the load.

Yes, but with 256 we might have 250 doing nothing. Looking at some data from a system with decent OpenFlow tables (1M+ entries, but only a couple of tables), a single handler thread can process about 232k upcalls for 64-byte packets before we run out of socket buffer; around 212k for 512-byte packets, and around 189k for 1500-byte packets.

If we have more than 200k upcalls per second, we have other problems, especially as the default dp flow size is 200k.

So I guess the new knob becomes a tunable for resource consumption vs upcall handling, which I think is fine, and might be desirable in certain configurations. We have other knobs in the system that require a decent amount of knowledge.

> So, in order for the user to make a decision on the number of threads, they
> have to know on which CPUs they most likely to have interrupts and how exactly
> OVS will map handlers to CPU numbers.  That is too much for a user to know, IMO.
> Some of that is actually impossible to know.  And when the traffic pattern changes,
> the number of threads may need to be changed to game the mapping to avoid
> overload.
>
> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd that
> changes the mapping.

Well, using an affinity mask will mess up the above anyway as we will get fewer threads, sharing potential CPU resources, etc. So making the n_threads configurable will have the same problem as changing the affinity mask. But we might have other benefits (see below).

> AFAIU, The only way to achieve an even load without dynamic reconfiguration is
> to have a thread per CPU core.

Yes, only if we have no affinity mask, but at the cost of additional resources (which we might not want).

>> Maybe we should
>> add a new option, n-forced-handler-threads. This will avoid the above issue with
>> existing scripts.
>
> Users can limit the number of threads by setting CPU affinity for ovs-vswitchd.
> If these users are knowledgeable enough to predict traffic patterns and CPU
> mappings, they can set appropriate CPU affinity masks as well.

Well, you cannot set the exact number of handler threads, as the count will be rounded up to a prime (which should not be a big problem). But the main problem I see is that you now limit all of OVS to that set of CPUs. So a burst of upcalls could mess with your revalidator handling (or other threads running on that set).

> For others, changing the meaning of a knob underneath them would be kind
> of irresponsible from our side.

Ack, I will introduce a new knob (also some changes to dpif-netlink, as the patch has a bug).
Ilya Maximets Feb. 8, 2024, 2 p.m. UTC | #5
On 2/8/24 13:44, Eelco Chaudron wrote:
> 
> 
> On 6 Feb 2024, at 16:01, Ilya Maximets wrote:
> 
>> On 2/6/24 15:07, Eelco Chaudron wrote:
>>>
>>>
>>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>>
>>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>>> Previously, the ability to override the system default for the number
>>>>> of handler threads was broken since to the introduction of the per-CPU
>>>>> upcall handlers.
>>>>
>>>> It wasn't broken, it was intentionally disabled.  I don't think we should
>>>> introduce ability to configure the number of handlers for per-CPU dispatch
>>>> mode.  It only allows users to shoot themselves in a foot if they do not
>>>> understand what they are doing.
>>>
>>> I think the design was not thought through enough, and resulted in a lot of
>>> corner case issues. From the start, we should still have allowed tuning of
>>> this option as it was before.
>>
>> The tuning is much more complicated and has very different consequences
>> with per-cpu dispatch compared to per-vport.  So, the results of adjusting
>> this option can be very different.
> 
>>>> Many integration scripts in a wild are still blindly setting these options
>>>> without even knowing that the underlying implementation changed completely.
>>>> And those setups may suffer if we suddenly decide to respect the configuration
>>>> that is ignored on modern kernels for all currently supported versions of OVS.
>>>
>>> I feel like this is not OVS’s problem. If people do not know what they are doing,
>>> they should not touch it… We have a nice Dutch saying for this "Wie zijn billen
>>> brandt, moet op de blaren zitten” :)
>>
>> It is an OVS's problem for the same reason as above.  We changed the logic
>> under the hood so much that users that expected a certain behavior in the
>> past may have a very different behavior now.  We can't just change the
>> meaning of the knob that easily.  See below.
> 
> Ack, so I would suggest adding a new nob.
> 
>>>
>>> However, I do feel like we should have an option to configure this, as it makes
>>> no sense on a 256-core system to create 256 handler threads.
>>
>> It actually makes perfect sense.  Since the traffic can appear on 256 cores, we
>> should be able to handle it on each one of them in order to have balanced load.
>>
>> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
>> all 5 handler threads will share the load, because each one of them listens
>> on all the ports, hence each one of them can process packets whenever it
>> has time to do so.
> 
> Yes, but this had other problems :)

Sure, that's why we went away from that logic. :)

> 
>> With per-CPU dispatch, on a same system these 5 threads are mapped to specific
>> cores, 2 cores per thread.  If all the traffic comes on 4 cores, we may have
>> 2 threads overloaded and 3 other threads do nothing.  In the same scenario with
>> per-vport dispatch we would have all 5 threads sharing the load.
> 
> Yes, but with 256 we might have 250 doing nothing, and looking at some data from a
> system with decent OpenFlow tables (1M+ entries, but only a couple of tables), we
> can handle about 232k upcalls on a single handler thread for 64 bytes packets before
> we run out of socket buffer. For 512bytes around 212k, and 1500 around 189k.

Having idle threads in a system is generally not a problem.  As long as we
don't have hundreds of thousands of them, they will not impact scheduling
performance.

Also, the actual performance of handlers is hard to quantify, since OVS can be
running on a large variety of different hardware.  Yours is likely on the
powerful side of the spectrum.

> 
> If we have more than 200k upcalls per second, we have other problems, especially as
> the default dp flow size is 200k.
> 
> So I guess the new nob becomes tunable for resource consumption vs upcall handling,
> which I think is fine, and might be desirable in certain configurations. We have other
> nobs in the system that require a decent amount of knowledge.

Having more knobs just for the sake of having something configurable
doesn't sound good to me.  We should try not to introduce ways to
configure something that doesn't need to be configured.

Sure, creating spare threads consumes memory.  However, there are a few
things to consider here:

1. A system with 256 cores likely has 512 MB of RAM to spare
   (by default we limit threads to 2 MB of stack each).
   I haven't seen any complaints from users about memory consumption
   on large systems.

2. By reducing the number of threads we're working around the problem
   of a large stack use during flow translation.  If we can reduce the
   stack usage, we can reduce the overall memory consumption.
   So, maybe that should be addressed instead of working around with
   a hard-to-use-right knob.

3. We can probably limit the number of revalidators instead (without
   a knob).  We may estimate the number of revalidators needed by the
   datapath flow hard limit and not create more than that even if we
   have more cores.  Should help with memory consumption as well.

> 
>> So, in order for the user to make a decision on the number of threads, they
>> have to know on which CPUs they most likely to have interrupts and how exactly
>> OVS will map handlers to CPU numbers.  That is too much for a user to know, IMO.
>> Some of that is actually impossible to know.  And when the traffic pattern changes,
>> the number of threads may need to be changed to game the mapping to avoid
>> overload.
>>
>> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd that
>> changes the mapping.
> 
> Well, using an affinity mask will mess up the above anyway as we will get fewer
> threads, sharing potential CPU resources, etc. So making the n_threads configurable
> will have the same problem as changing the affinity mask. But we might have other
> benefits (see below).
> 
>> AFAIU, The only way to achieve an even load without dynamic reconfiguration is
>> to have a thread per CPU core.
> 
> Yes, only if we have no affinity mask, but at the cost of additional resources
> (which we might not want).
> 
>>> Maybe we should
>>> add a new option, n-forced-handler-threads. This will avoid the above issue with
>>> existing scripts.
>>
>> Users can limit the number of threads by setting CPU affinity for ovs-vswitchd.
>> If these users are knowledgeable enough to predict traffic patterns and CPU
>> mappings, they can set appropriate CPU affinity masks as well.
> 
> Well, you can not exactly set the number of handler threads as there will be
> round-up to prime (which should not be a big problem). But the main problem I see
> is that now you limit all OVS to that set of CPU. So a burst of upcall could mess
> with your revalidator handling (or other threads running on that set).

Should not be a problem.  We always run handlers and revalidators on the same
cores by default.  And revalidators do not need millisecond precision.  Also
revalidators can float between cores.  So, the use of the same cores should
not be a problem as long as not all handlers are at 100% load, which will be
a problem on its own regardless.

> 
>> For others, changing the meaning of a knob underneath them would be kind
>> of irresponsible from our side.
> 
> Ack, I will introduce a new nob (also some changes to dpif-netlink, as the patch has a bug).
> 

I'm still not convinced we need a knob for the reasons listed above.

Best regards, Ilya Maximets.
Eelco Chaudron Feb. 8, 2024, 3:53 p.m. UTC | #6
On 8 Feb 2024, at 15:00, Ilya Maximets wrote:

> On 2/8/24 13:44, Eelco Chaudron wrote:
>>
>>
>> On 6 Feb 2024, at 16:01, Ilya Maximets wrote:
>>
>>> On 2/6/24 15:07, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>>>
>>>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>>>> Previously, the ability to override the system default for the number
>>>>>> of handler threads was broken since to the introduction of the per-CPU
>>>>>> upcall handlers.
>>>>>
>>>>> It wasn't broken, it was intentionally disabled.  I don't think we should
>>>>> introduce ability to configure the number of handlers for per-CPU dispatch
>>>>> mode.  It only allows users to shoot themselves in a foot if they do not
>>>>> understand what they are doing.
>>>>
>>>> I think the design was not thought through enough, and resulted in a lot of
>>>> corner case issues. From the start, we should still have allowed tuning of
>>>> this option as it was before.
>>>
>>> The tuning is much more complicated and has very different consequences
>>> with per-cpu dispatch compared to per-vport.  So, the results of adjusting
>>> this option can be very different.
>>
>>>>> Many integration scripts in a wild are still blindly setting these options
>>>>> without even knowing that the underlying implementation changed completely.
>>>>> And those setups may suffer if we suddenly decide to respect the configuration
>>>>> that is ignored on modern kernels for all currently supported versions of OVS.
>>>>
>>>> I feel like this is not OVS’s problem. If people do not know what they are doing,
>>>> they should not touch it… We have a nice Dutch saying for this "Wie zijn billen
>>>> brandt, moet op de blaren zitten” :)
>>>
>>> It is an OVS's problem for the same reason as above.  We changed the logic
>>> under the hood so much that users that expected a certain behavior in the
>>> past may have a very different behavior now.  We can't just change the
>>> meaning of the knob that easily.  See below.
>>
>> Ack, so I would suggest adding a new nob.
>>
>>>>
>>>> However, I do feel like we should have an option to configure this, as it makes
>>>> no sense on a 256-core system to create 256 handler threads.
>>>
>>> It actually makes perfect sense.  Since the traffic can appear on 256 cores, we
>>> should be able to handle it on each one of them in order to have balanced load.
>>>
>>> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
>>> all 5 handler threads will share the load, because each one of them listens
>>> on all the ports, hence each one of them can process packets whenever it
>>> has time to do so.
>>
>> Yes, but this had other problems :)
>
> Sure, that's why we went away from that logic. :)
>
>>
>>> With per-CPU dispatch, on a same system these 5 threads are mapped to specific
>>> cores, 2 cores per thread.  If all the traffic comes on 4 cores, we may have
>>> 2 threads overloaded and 3 other threads do nothing.  In the same scenario with
>>> per-vport dispatch we would have all 5 threads sharing the load.
>>
>> Yes, but with 256 we might have 250 doing nothing, and looking at some data from a
>> system with decent OpenFlow tables (1M+ entries, but only a couple of tables), we
>> can handle about 232k upcalls on a single handler thread for 64 bytes packets before
>> we run out of socket buffer. For 512bytes around 212k, and 1500 around 189k.
>
> Having idle threads in a system is generally not a problem.  As long as we
> don't have hundreds of thousands of them, they will not impact scheduling
> performance.
>
> Also, actual performance of handlers is hard to quantify, since OVS can be
> running on a large variety of different hardware.  Yours is likely on a
> powerful side of a spectrum.
>
>>
>> If we have more than 200k upcalls per second, we have other problems, especially as
>> the default dp flow size is 200k.
>>
>> So I guess the new nob becomes tunable for resource consumption vs upcall handling,
>> which I think is fine, and might be desirable in certain configurations. We have other
>> nobs in the system that require a decent amount of knowledge.
>
> Having more knobs just for the sake of having something configurable
> doesn't sound good to me.  We should try to not introduce ways to
> configure something that doesn't need to be configured.
>
> Sure, creating spare threads consumes memory.  However there are few
> things to consider here:
>
> 1. A system with 256 cores likely has a 512 MB of RAM to spare
>    (By default we limit to 2 MB of stack per thread).
>    I haven't seen any complaint from users about memory consumption
>    on large systems.
>
> 2. By reducing the number of threads we're working around the problem
>    of a large stack use during flow translation.  If we can reduce the
>    stack usage, we can reduce the overall memory consumption.
>    So, maybe that should be addressed instead of working around with
>    a hard-to-use-right knob.

I guess the problem is that various people have tried, and the stack size did get reduced, but not enough. The code is very complex, and due to all the corner cases introduced over the years, it’s (almost) impossible to remove the recursion without a complete rewrite (while trying not to break all those corner cases).

And in the past, we did receive complaints about the memory size. In some cases, we still need 8 MB, which is 2 GB on a 256-core system.

Unless someone finds a solution to this, I still feel we should have this option in case we need it.

//Eelco

> 3. We can probably limit the number of revalidators instead (without
>    a knob).  We may estimate the number of revalidators needed by the
>    datapath flow hard limit and not create more than that even if we
>    have more cores.  Should help with memory consumption as well.
>
>>
>>> So, in order for the user to make a decision on the number of threads, they
>>> have to know on which CPUs they most likely to have interrupts and how exactly
>>> OVS will map handlers to CPU numbers.  That is too much for a user to know, IMO.
>>> Some of that is actually impossible to know.  And when the traffic pattern changes,
>>> the number of threads may need to be changed to game the mapping to avoid
>>> overload.
>>>
>>> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd that
>>> changes the mapping.
>>
>> Well, using an affinity mask will mess up the above anyway as we will get fewer
>> threads, sharing potential CPU resources, etc. So making the n_threads configurable
>> will have the same problem as changing the affinity mask. But we might have other
>> benefits (see below).
>>
>>> AFAIU, The only way to achieve an even load without dynamic reconfiguration is
>>> to have a thread per CPU core.
>>
>> Yes, only if we have no affinity mask, but at the cost of additional resources
>> (which we might not want).
>>
>>>> Maybe we should
>>>> add a new option, n-forced-handler-threads. This will avoid the above issue with
>>>> existing scripts.
>>>
>>> Users can limit the number of threads by setting CPU affinity for ovs-vswitchd.
>>> If these users are knowledgeable enough to predict traffic patterns and CPU
>>> mappings, they can set appropriate CPU affinity masks as well.
>>
>> Well, you can not exactly set the number of handler threads as there will be
>> round-up to prime (which should not be a big problem). But the main problem I see
>> is that now you limit all OVS to that set of CPU. So a burst of upcall could mess
>> with your revalidator handling (or other threads running on that set).
>
> Should not be a problem.  We always run handlers and revalidators on the same
> cores by default.  And revalidators do not need millisecond precision.  Also
> revalidators can float between cores.  So, the use of the same cores should
> not be a problem as long as not all handlers are at 100% load, which will be
> a problem on its own regardless.
>
>>
>>> For others, changing the meaning of a knob underneath them would be kind
>>> of irresponsible from our side.
>>
>> Ack, I will introduce a new nob (also some changes to dpif-netlink, as the patch has a bug).
>>
>
> I'm still not convinced we need a knob for the reasons listed above.
>
> Best regards, Ilya Maximets.
Ilya Maximets Feb. 19, 2024, 7:23 p.m. UTC | #7
On 2/8/24 16:53, Eelco Chaudron wrote:
> 
> 
> On 8 Feb 2024, at 15:00, Ilya Maximets wrote:
> 
>> On 2/8/24 13:44, Eelco Chaudron wrote:
>>>
>>>
>>> On 6 Feb 2024, at 16:01, Ilya Maximets wrote:
>>>
>>>> On 2/6/24 15:07, Eelco Chaudron wrote:
>>>>>
>>>>>
>>>>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>>>>
>>>>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>>>>> Previously, the ability to override the system default for the number
>>>>>>> of handler threads was broken since to the introduction of the per-CPU
>>>>>>> upcall handlers.
>>>>>>
>>>>>> It wasn't broken, it was intentionally disabled.  I don't think we should
>>>>>> introduce ability to configure the number of handlers for per-CPU dispatch
>>>>>> mode.  It only allows users to shoot themselves in a foot if they do not
>>>>>> understand what they are doing.
>>>>>
>>>>> I think the design was not thought through enough, and resulted in a lot of
>>>>> corner case issues. From the start, we should still have allowed tuning of
>>>>> this option as it was before.
>>>>
>>>> The tuning is much more complicated and has very different consequences
>>>> with per-cpu dispatch compared to per-vport.  So, the results of adjusting
>>>> this option can be very different.
>>>
>>>>>> Many integration scripts in a wild are still blindly setting these options
>>>>>> without even knowing that the underlying implementation changed completely.
>>>>>> And those setups may suffer if we suddenly decide to respect the configuration
>>>>>> that is ignored on modern kernels for all currently supported versions of OVS.
>>>>>
>>>>> I feel like this is not OVS’s problem. If people do not know what they are doing,
>>>>> they should not touch it… We have a nice Dutch saying for this "Wie zijn billen
>>>>> brandt, moet op de blaren zitten” :)
>>>>
>>>> It is an OVS's problem for the same reason as above.  We changed the logic
>>>> under the hood so much that users that expected a certain behavior in the
>>>> past may have a very different behavior now.  We can't just change the
>>>> meaning of the knob that easily.  See below.
>>>
>>> Ack, so I would suggest adding a new nob.
>>>
>>>>>
>>>>> However, I do feel like we should have an option to configure this, as it makes
>>>>> no sense on a 256-core system to create 256 handler threads.
>>>>
>>>> It actually makes perfect sense.  Since the traffic can appear on 256 cores, we
>>>> should be able to handle it on each one of them in order to have balanced load.
>>>>
>>>> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
>>>> all 5 handler threads will share the load, because each one of them listens
>>>> on all the ports, hence each one of them can process packets whenever it
>>>> has time to do so.
>>>
>>> Yes, but this had other problems :)
>>
>> Sure, that's why we went away from that logic. :)
>>
>>>
>>>> With per-CPU dispatch, on a same system these 5 threads are mapped to specific
>>>> cores, 2 cores per thread.  If all the traffic comes on 4 cores, we may have
>>>> 2 threads overloaded and 3 other threads do nothing.  In the same scenario with
>>>> per-vport dispatch we would have all 5 threads sharing the load.
>>>
>>> Yes, but with 256 we might have 250 doing nothing, and looking at some data from a
>>> system with decent OpenFlow tables (1M+ entries, but only a couple of tables), we
>>> can handle about 232k upcalls on a single handler thread for 64 bytes packets before
>>> we run out of socket buffer. For 512bytes around 212k, and 1500 around 189k.
>>
>> Having idle threads in a system is generally not a problem.  As long as we
>> don't have hundreds of thousands of them, they will not impact scheduling
>> performance.
>>
>> Also, actual performance of handlers is hard to quantify, since OVS can be
>> running on a large variety of different hardware.  Yours is likely on a
>> powerful side of a spectrum.
>>
>>>
>>> If we have more than 200k upcalls per second, we have other problems, especially as
>>> the default dp flow size is 200k.
>>>
>>> So I guess the new knob becomes a tunable for resource consumption vs upcall handling,
>>> which I think is fine, and might be desirable in certain configurations. We have other
>>> knobs in the system that require a decent amount of knowledge.
>>
>> Having more knobs just for the sake of having something configurable
>> doesn't sound good to me.  We should try to not introduce ways to
>> configure something that doesn't need to be configured.
>>
>> Sure, creating spare threads consumes memory.  However, there are a few
>> things to consider here:
>>
>> 1. A system with 256 cores likely has 512 MB of RAM to spare
>>    (By default we limit to 2 MB of stack per thread).
>>    I haven't seen any complaint from users about memory consumption
>>    on large systems.
>>
>> 2. By reducing the number of threads we're working around the problem
>>    of a large stack use during flow translation.  If we can reduce the
>>    stack usage, we can reduce the overall memory consumption.
>>    So, maybe that should be addressed instead of working around with
>>    a hard-to-use-right knob.
> 
> I guess the problem is that various people have tried, and the stack size did get reduced,
> but not enough. The code is very complex and due to all the corner cases introduced over
> the years, it’s (almost) impossible to get the recursion removed without a complete rewrite
> (trying to not break all these corner cases).

I don't remember anyone actually trying to reduce the recursion depth.
But there was some work on reducing the stack allocations, true.

It's not necessary to completely remove recursion, but it should be
possible to short-circuit it in most common cases.

> 
> And in the past, we did receive complaints about the memory size. In some cases, we still
> need 8M, which is 2G on a 256-core system.
> 
> Unless someone finds a solution to this, I still feel we should have this option in case
> we need it.

CPU affinity is an available option, as discussed below.

Another option we can explore is to use MCL_ONFAULT instead of locking
all the memory right away.  MCL_ONFAULT didn't exist when --mlockall
was introduced, so it may be reasonable to use it.  Some testing on
systems under memory pressure is needed, of course.  But if the main
concern is that OVS allocates too much memory that it doesn't use, it
will no longer be the case.

> 
> //Eelco
> 
>> 3. We can probably limit the number of revalidators instead (without
>>    a knob).  We may estimate the number of revalidators needed by the
>>    datapath flow hard limit and not create more than that even if we
>>    have more cores.  Should help with memory consumption as well.
>>
>>>
>>>> So, in order for the user to make a decision on the number of threads, they
>>>> have to know on which CPUs they are most likely to have interrupts and how exactly
>>>> OVS will map handlers to CPU numbers.  That is too much for a user to know, IMO.
>>>> Some of that is actually impossible to know.  And when the traffic pattern changes,
>>>> the number of threads may need to be changed to game the mapping to avoid
>>>> overload.
>>>>
>>>> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd that
>>>> changes the mapping.
>>>
>>> Well, using an affinity mask will mess up the above anyway as we will get fewer
>>> threads, sharing potential CPU resources, etc. So making the n_threads configurable
>>> will have the same problem as changing the affinity mask. But we might have other
>>> benefits (see below).
>>>
>>>> AFAIU, the only way to achieve an even load without dynamic reconfiguration is
>>>> to have a thread per CPU core.
>>>
>>> Yes, only if we have no affinity mask, but at the cost of additional resources
>>> (which we might not want).
>>>
>>>>> Maybe we should
>>>>> add a new option, n-forced-handler-threads. This will avoid the above issue with
>>>>> existing scripts.
>>>>
>>>> Users can limit the number of threads by setting CPU affinity for ovs-vswitchd.
>>>> If these users are knowledgeable enough to predict traffic patterns and CPU
>>>> mappings, they can set appropriate CPU affinity masks as well.
>>>
>>> Well, you cannot exactly set the number of handler threads as there will be a
>>> round-up to prime (which should not be a big problem). But the main problem I see
>>> is that now you limit all OVS to that set of CPUs. So a burst of upcalls could mess
>>> with your revalidator handling (or other threads running on that set).
>>
>> Should not be a problem.  We always run handlers and revalidators on the same
>> cores by default.  And revalidators do not need millisecond precision.  Also
>> revalidators can float between cores.  So, the use of the same cores should
>> not be a problem as long as not all handlers are at 100% load, which will be
>> a problem on its own regardless.
>>
>>>
>>>> For others, changing the meaning of a knob underneath them would be kind
>>>> of irresponsible from our side.
>>>
>>> Ack, I will introduce a new knob (also some changes to dpif-netlink, as the patch has a bug).
>>>
>>
>> I'm still not convinced we need a knob for the reasons listed above.
>>
>> Best regards, Ilya Maximets.
>
Eelco Chaudron Feb. 20, 2024, 12:47 p.m. UTC | #8
On 19 Feb 2024, at 20:23, Ilya Maximets wrote:

> On 2/8/24 16:53, Eelco Chaudron wrote:
>>
>>
>> On 8 Feb 2024, at 15:00, Ilya Maximets wrote:
>>
>>> On 2/8/24 13:44, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 6 Feb 2024, at 16:01, Ilya Maximets wrote:
>>>>
>>>>> On 2/6/24 15:07, Eelco Chaudron wrote:
>>>>>>
>>>>>>
>>>>>> On 6 Feb 2024, at 14:46, Ilya Maximets wrote:
>>>>>>
>>>>>>> On 2/6/24 14:30, Eelco Chaudron wrote:
>>>>>>>> Previously, the ability to override the system default for the number
>>>>>>>> of handler threads was broken since the introduction of the per-CPU
>>>>>>>> upcall handlers.
>>>>>>>
>>>>>>> It wasn't broken, it was intentionally disabled.  I don't think we should
>>>>>>> introduce ability to configure the number of handlers for per-CPU dispatch
>>>>>>> mode.  It only allows users to shoot themselves in a foot if they do not
>>>>>>> understand what they are doing.
>>>>>>
>>>>>> I think the design was not thought through enough, and resulted in a lot of
>>>>>> corner case issues. From the start, we should still have allowed tuning of
>>>>>> this option as it was before.
>>>>>
>>>>> The tuning is much more complicated and has very different consequences
>>>>> with per-cpu dispatch compared to per-vport.  So, the results of adjusting
>>>>> this option can be very different.
>>>>
>>>>>>> Many integration scripts in a wild are still blindly setting these options
>>>>>>> without even knowing that the underlying implementation changed completely.
>>>>>>> And those setups may suffer if we suddenly decide to respect the configuration
>>>>>>> that is ignored on modern kernels for all currently supported versions of OVS.
>>>>>>
>>>>>> I feel like this is not OVS’s problem. If people do not know what they are doing,
>>>>>> they should not touch it… We have a nice Dutch saying for this: "Wie zijn billen
>>>>>> brandt, moet op de blaren zitten" ("who burns his behind must sit on the blisters") :)
>>>>>
>>>>> It is an OVS's problem for the same reason as above.  We changed the logic
>>>>> under the hood so much that users that expected a certain behavior in the
>>>>> past may have a very different behavior now.  We can't just change the
>>>>> meaning of the knob that easily.  See below.
>>>>
>>>> Ack, so I would suggest adding a new knob.
>>>>
>>>>>>
>>>>>> However, I do feel like we should have an option to configure this, as it makes
>>>>>> no sense on a 256-core system to create 256 handler threads.
>>>>>
>>>>> It actually makes perfect sense.  Since the traffic can appear on 256 cores, we
>>>>> should be able to handle it on each one of them in order to have balanced load.
>>>>>
>>>>> With per-vport dispatch, on a system with 10 cores and 5 handler threads,
>>>>> all 5 handler threads will share the load, because each one of them listens
>>>>> on all the ports, hence each one of them can process packets whenever it
>>>>> has time to do so.
>>>>
>>>> Yes, but this had other problems :)
>>>
>>> Sure, that's why we went away from that logic. :)
>>>
>>>>
>>>>> With per-CPU dispatch, on the same system these 5 threads are mapped to specific
>>>>> cores, 2 cores per thread.  If all the traffic comes on 4 cores, we may have
>>>>> 2 threads overloaded and 3 other threads do nothing.  In the same scenario with
>>>>> per-vport dispatch we would have all 5 threads sharing the load.
>>>>
>>>> Yes, but with 256 we might have 250 doing nothing, and looking at some data from a
>>>> system with decent OpenFlow tables (1M+ entries, but only a couple of tables), we
>>>> can handle about 232k upcalls on a single handler thread for 64-byte packets before
>>>> we run out of socket buffer. For 512-byte packets it is around 212k, and for 1500-byte packets around 189k.
>>>
>>> Having idle threads in a system is generally not a problem.  As long as we
>>> don't have hundreds of thousands of them, they will not impact scheduling
>>> performance.
>>>
>>> Also, actual performance of handlers is hard to quantify, since OVS can be
>>> running on a large variety of different hardware.  Yours is likely on a
>>> powerful side of a spectrum.
>>>
>>>>
>>>> If we have more than 200k upcalls per second, we have other problems, especially as
>>>> the default dp flow size is 200k.
>>>>
>>>> So I guess the new knob becomes a tunable for resource consumption vs upcall handling,
>>>> which I think is fine, and might be desirable in certain configurations. We have other
>>>> knobs in the system that require a decent amount of knowledge.
>>>
>>> Having more knobs just for the sake of having something configurable
>>> doesn't sound good to me.  We should try to not introduce ways to
>>> configure something that doesn't need to be configured.
>>>
>>> Sure, creating spare threads consumes memory.  However, there are a few
>>> things to consider here:
>>>
>>> 1. A system with 256 cores likely has 512 MB of RAM to spare
>>>    (By default we limit to 2 MB of stack per thread).
>>>    I haven't seen any complaint from users about memory consumption
>>>    on large systems.
>>>
>>> 2. By reducing the number of threads we're working around the problem
>>>    of a large stack use during flow translation.  If we can reduce the
>>>    stack usage, we can reduce the overall memory consumption.
>>>    So, maybe that should be addressed instead of working around with
>>>    a hard-to-use-right knob.
>>
>> I guess the problem is that various people have tried, and the stack size did get reduced,
>> but not enough. The code is very complex and due to all the corner cases introduced over
>> the years, it’s (almost) impossible to get the recursion removed without a complete rewrite
>> (trying to not break all these corner cases).
>
> I don't remember anyone actually trying to reduce the recursion depth.
> But there was some work on reducing the stack allocations, true.
>
> It's not necessary to completely remove recursion, but it should be
> possible to short-circuit it in most common cases.

Mike did some work on reducing stack usage, and Aaron has a task to limit the recursion. I thought he hit a dead end, but maybe he can comment, or you can share your ideas with him.

>>
>> And in the past, we did receive complaints about the memory size. In some cases, we still
>> need 8M, which is 2G on a 256-core system.
>>
>> Unless someone finds a solution to this, I still feel we should have this option in case
>> we need it.
>
> CPU affinity is an available option, as discussed below.
>
> Another option we can explore is to use MCL_ONFAULT instead of locking
> all the memory right away.  MCL_ONFAULT didn't exist when --mlockall
> was introduced, so it may be reasonable to use it.  Some testing on
> systems under memory pressure is needed, of course.  But if the main
> concern is that OVS allocates too much memory that it doesn't use, it
> will no longer be the case.

I’ll do some experimenting, and try to find the use case that would break on a large machine. To be continued…

In the meantime, I’ll re-send a v2 only containing the first patch.

//Eelco

>>> 3. We can probably limit the number of revalidators instead (without
>>>    a knob).  We may estimate the number of revalidators needed by the
>>>    datapath flow hard limit and not create more than that even if we
>>>    have more cores.  Should help with memory consumption as well.
>>>
>>>>
>>>>> So, in order for the user to make a decision on the number of threads, they
>>>>> have to know on which CPUs they are most likely to have interrupts and how exactly
>>>>> OVS will map handlers to CPU numbers.  That is too much for a user to know, IMO.
>>>>> Some of that is actually impossible to know.  And when the traffic pattern changes,
>>>>> the number of threads may need to be changed to game the mapping to avoid
>>>>> overload.
>>>>>
>>>>> Also, multiply that complexity by the CPU affinity of the ovs-vswitchd that
>>>>> changes the mapping.
>>>>
>>>> Well, using an affinity mask will mess up the above anyway as we will get fewer
>>>> threads, sharing potential CPU resources, etc. So making the n_threads configurable
>>>> will have the same problem as changing the affinity mask. But we might have other
>>>> benefits (see below).
>>>>
>>>>> AFAIU, the only way to achieve an even load without dynamic reconfiguration is
>>>>> to have a thread per CPU core.
>>>>
>>>> Yes, only if we have no affinity mask, but at the cost of additional resources
>>>> (which we might not want).
>>>>
>>>>>> Maybe we should
>>>>>> add a new option, n-forced-handler-threads. This will avoid the above issue with
>>>>>> existing scripts.
>>>>>
>>>>> Users can limit the number of threads by setting CPU affinity for ovs-vswitchd.
>>>>> If these users are knowledgeable enough to predict traffic patterns and CPU
>>>>> mappings, they can set appropriate CPU affinity masks as well.
>>>>
>>>> Well, you cannot exactly set the number of handler threads as there will be a
>>>> round-up to prime (which should not be a big problem). But the main problem I see
>>>> is that now you limit all OVS to that set of CPUs. So a burst of upcalls could mess
>>>> with your revalidator handling (or other threads running on that set).
>>>
>>> Should not be a problem.  We always run handlers and revalidators on the same
>>> cores by default.  And revalidators do not need millisecond precision.  Also
>>> revalidators can float between cores.  So, the use of the same cores should
>>> not be a problem as long as not all handlers are at 100% load, which will be
>>> a problem on its own regardless.
>>>
>>>>
>>>>> For others, changing the meaning of a knob underneath them would be kind
>>>>> of irresponsible from our side.
>>>>
>>>> Ack, I will introduce a new knob (also some changes to dpif-netlink, as the patch has a bug).
>>>>
>>>
>>> I'm still not convinced we need a knob for the reasons listed above.
>>>
>>> Best regards, Ilya Maximets.
>>
Patch

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 84e2bd8ea..fd5532014 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2839,7 +2839,13 @@  dpif_netlink_number_handlers_required(struct dpif *dpif_, uint32_t *n_handlers)
     struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
 
     if (dpif_netlink_upcall_per_cpu(dpif)) {
-        *n_handlers = dpif_netlink_calculate_n_handlers();
+        uint32_t calc_handlers = dpif_netlink_calculate_n_handlers();
+
+        /* When a specific number of handlers is requested, we honor it
+         * as long as it does not exceed the calculated number. */
+        if (!*n_handlers || *n_handlers > calc_handlers) {
+            *n_handlers = calc_handlers;
+        }
         return true;
     }
 
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index 520e21e68..fc3d39803 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -369,6 +369,10 @@  struct dpif_class {
     /* Queries 'dpif' to see if a certain number of handlers are required by
      * the implementation.
      *
+     * The 'n_handlers' value can be initialized with a system/user suggestion.
+     * It's up to the implementation whether this suggestion fits the
+     * requirements for correct operation.
+     *
      * If a certain number of handlers are required, returns 'true' and sets
      * 'n_handlers' to that number of handler threads.
      *
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 9a5c5c29c..f9d8f9cc7 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -660,7 +660,7 @@  udpif_set_threads(struct udpif *udpif, uint32_t n_handlers_,
                   uint32_t n_revalidators_)
 {
     ovs_assert(udpif);
-    uint32_t n_handlers_requested;
+    uint32_t n_handlers_requested = n_handlers_;
     uint32_t n_revalidators_requested;
     bool forced = false;