
[ovs-dev] docs: Add HyperThreading notes for auto-lb usage.

Message ID 1673150145-29368-1-git-send-email-lic121@chinatelecom.cn
State Changes Requested
Series [ovs-dev] docs: Add HyperThreading notes for auto-lb usage.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Cheng Li Jan. 8, 2023, 3:55 a.m. UTC
In my test, if one logical core is pinned to a PMD thread while the
other logical core (of the same physical core) is not, the PMD
performance is affected by the load on the not-pinned logical core.
This makes it difficult to estimate the loads during a dry-run.

Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
---
 Documentation/topics/dpdk/pmd.rst | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Simon Horman Jan. 20, 2023, 3:29 p.m. UTC | #1
On Sun, Jan 08, 2023 at 03:55:45AM +0000, Cheng Li wrote:
> In my test, if one logical core is pinned to a PMD thread while the
> other logical core (of the same physical core) is not, the PMD
> performance is affected by the load on the not-pinned logical core.
> This makes it difficult to estimate the loads during a dry-run.
> 
> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>

Makes sense to me.

Reviewed-by: Simon Horman <simon.horman@corigine.com>

> ---
>  Documentation/topics/dpdk/pmd.rst | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 9006fd4..b220199 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
>      when all PMD threads are running on cores from a single NUMA node. In this
>      case cross-NUMA datapaths will not change after reassignment.
>  
> +    For the same reason, please ensure that the pmd threads are pinned to SMT
> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
> +    not have the same performance.
> +
>  The minimum time between 2 consecutive PMD auto load balancing iterations can
>  also be configured by::
>  
> -- 
> 1.8.3.1
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Kevin Traynor Jan. 24, 2023, 3:52 p.m. UTC | #2
On 08/01/2023 03:55, Cheng Li wrote:
> In my test, if one logical core is pinned to a PMD thread while the
> other logical core (of the same physical core) is not, the PMD
> performance is affected by the load on the not-pinned logical core.
> This makes it difficult to estimate the loads during a dry-run.
> 
> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> ---
>   Documentation/topics/dpdk/pmd.rst | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 9006fd4..b220199 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
>       when all PMD threads are running on cores from a single NUMA node. In this
>       case cross-NUMA datapaths will not change after reassignment.
>   
> +    For the same reason, please ensure that the pmd threads are pinned to SMT
> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
> +    not have the same performance.
> +
>   The minimum time between 2 consecutive PMD auto load balancing iterations can
>   also be configured by::
>   

I don't think it's a hard requirement as siblings should not impact as 
much as cross-numa might but it's probably good advice in general.

Acked-by: Kevin Traynor <ktraynor@redhat.com>
Ilya Maximets Jan. 27, 2023, 3:04 p.m. UTC | #3
On 1/24/23 16:52, Kevin Traynor wrote:
> On 08/01/2023 03:55, Cheng Li wrote:
>> In my test, if one logical core is pinned to a PMD thread while the
>> other logical core (of the same physical core) is not, the PMD
>> performance is affected by the load on the not-pinned logical core.
>> This makes it difficult to estimate the loads during a dry-run.
>>
>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>> ---
>>   Documentation/topics/dpdk/pmd.rst | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index 9006fd4..b220199 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
>>       when all PMD threads are running on cores from a single NUMA node. In this
>>       case cross-NUMA datapaths will not change after reassignment.
>>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
>> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
>> +    not have the same performance.

Uhm... Am I reading this wrong or this note suggests to pin PMD threads
to SMT siblings?  It sounds like that's the opposite of what you were
trying to say.  Siblings are sharing the same physical core, so if some
PMDs are pinned to siblings, the load prediction can not work correctly.

Nit: s/pmd/PMD/

Best regards, Ilya Maximets.

>> +
>>   The minimum time between 2 consecutive PMD auto load balancing iterations can
>>   also be configured by::
>>   
> 
> I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
> 
> Acked-by: Kevin Traynor <ktraynor@redhat.com>
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Cheng Li Jan. 29, 2023, 1:33 a.m. UTC | #4
On Fri, Jan 27, 2023 at 04:04:55PM +0100, Ilya Maximets wrote:
> On 1/24/23 16:52, Kevin Traynor wrote:
> > On 08/01/2023 03:55, Cheng Li wrote:
> >> In my test, if one logical core is pinned to a PMD thread while the
> >> other logical core (of the same physical core) is not, the PMD
> >> performance is affected by the load on the not-pinned logical core.
> >> This makes it difficult to estimate the loads during a dry-run.
> >>
> >> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> >> ---
> >>   Documentation/topics/dpdk/pmd.rst | 4 ++++
> >>   1 file changed, 4 insertions(+)
> >>
> >> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> >> index 9006fd4..b220199 100644
> >> --- a/Documentation/topics/dpdk/pmd.rst
> >> +++ b/Documentation/topics/dpdk/pmd.rst
> >> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
> >>       when all PMD threads are running on cores from a single NUMA node. In this
> >>       case cross-NUMA datapaths will not change after reassignment.
> >>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
> >> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
> >> +    not have the same performance.
> 
> Uhm... Am I reading this wrong or this note suggests to pin PMD threads
> to SMT siblings?  It sounds like that's the opposite of what you were
> trying to say.  Siblings are sharing the same physical core, so if some
> PMDs are pinned to siblings, the load prediction can not work correctly.

Thanks for the review, Ilya.

The note indeed suggests pinning PMD threads to siblings. Siblings share
the same physical core; if PMDs are pinned to one sibling while the other
sibling of the same physical core is left unpinned, the load prediction may
not work correctly, because the performance of the pinned sibling may be
affected by the workload on the not-pinned sibling. So we suggest pinning
both siblings of the same physical core.
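
As an illustration only (the CPU numbers below are hypothetical and the
actual sibling pairs must be checked on the system in question), the
siblings of a logical CPU can be read from sysfs and the pmd-cpu-mask
built so that both siblings of every used physical core are included:

    # Show the SMT siblings of logical CPU 2 (example output: 2,22).
    $ cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list

    # Pin PMDs to complete sibling pairs, e.g. 2/22 and 4/24:
    # (1<<2)|(1<<22)|(1<<4)|(1<<24) = 0x1400014
    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1400014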


> 
> Nit: s/pmd/PMD/
> 
> Best regards, Ilya Maximets.
> 
> >> +
> >>   The minimum time between 2 consecutive PMD auto load balancing iterations can
> >>   also be configured by::
> >>   
> > 
> > I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
> > 
> > Acked-by: Kevin Traynor <ktraynor@redhat.com>
> > 
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > 
>
Ilya Maximets Jan. 30, 2023, 12:38 p.m. UTC | #5
On 1/29/23 02:33, Cheng Li wrote:
> On Fri, Jan 27, 2023 at 04:04:55PM +0100, Ilya Maximets wrote:
>> On 1/24/23 16:52, Kevin Traynor wrote:
>>> On 08/01/2023 03:55, Cheng Li wrote:
>>>> In my test, if one logical core is pinned to a PMD thread while the
>>>> other logical core (of the same physical core) is not, the PMD
>>>> performance is affected by the load on the not-pinned logical core.
>>>> This makes it difficult to estimate the loads during a dry-run.
>>>>
>>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>>>> ---
>>>>   Documentation/topics/dpdk/pmd.rst | 4 ++++
>>>>   1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>>>> index 9006fd4..b220199 100644
>>>> --- a/Documentation/topics/dpdk/pmd.rst
>>>> +++ b/Documentation/topics/dpdk/pmd.rst
>>>> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
>>>>       when all PMD threads are running on cores from a single NUMA node. In this
>>>>       case cross-NUMA datapaths will not change after reassignment.
>>>>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
>>>> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
>>>> +    not have the same performance.
>>
>> Uhm... Am I reading this wrong or this note suggests to pin PMD threads
>> to SMT siblings?  It sounds like that's the opposite of what you were
>> trying to say.  Siblings are sharing the same physical core, so if some
>> PMDs are pinned to siblings, the load prediction can not work correctly.
> 
> Thanks for the review, Ilya.
> 
> The note indeed suggests pinning PMD threads to siblings. Siblings share
> the same physical core; if PMDs are pinned to one sibling while the other
> sibling of the same physical core is left unpinned, the load prediction may
> not work correctly, because the performance of the pinned sibling may be
> affected by the workload on the not-pinned sibling. So we suggest pinning
> both siblings of the same physical core.

But this makes sense only if all the PMD threads are on siblings of the
same physical core.  If more than one physical core is involved, the load
calculations will be incorrect.  For example, let's say we have 4 threads
A, B, C and D, where A and B are siblings and C and D are siblings.  And
it happened that we have only 2 ports, both of which are assigned to A.
It makes a huge difference whether we move one of the ports from A to B
or if we move it from A to C.  It is an oversimplified example, but we
can't rely on load calculations in general case if PMD threads are running
on SMT siblings.
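
Whether two logical CPUs such as A and B above share a physical core can be
checked with a generic topology dump, for example:

    $ lscpu --extended=CPU,CORE,SOCKET,NODE

Logical CPUs that report the same CORE value are SMT siblings.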

> 
> 
>>
>> Nit: s/pmd/PMD/
>>
>> Best regards, Ilya Maximets.
>>
>>>> +
>>>>   The minimum time between 2 consecutive PMD auto load balancing iterations can
>>>>   also be configured by::
>>>>   
>>>
>>> I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
>>>
>>> Acked-by: Kevin Traynor <ktraynor@redhat.com>
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>
>>
Cheng Li Jan. 30, 2023, 3:13 p.m. UTC | #6
On Mon, Jan 30, 2023 at 01:38:45PM +0100, Ilya Maximets wrote:
> On 1/29/23 02:33, Cheng Li wrote:
> > On Fri, Jan 27, 2023 at 04:04:55PM +0100, Ilya Maximets wrote:
> >> On 1/24/23 16:52, Kevin Traynor wrote:
> >>> On 08/01/2023 03:55, Cheng Li wrote:
> >>>> In my test, if one logical core is pinned to a PMD thread while the
> >>>> other logical core (of the same physical core) is not, the PMD
> >>>> performance is affected by the load on the not-pinned logical core.
> >>>> This makes it difficult to estimate the loads during a dry-run.
> >>>>
> >>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> >>>> ---
> >>>>   Documentation/topics/dpdk/pmd.rst | 4 ++++
> >>>>   1 file changed, 4 insertions(+)
> >>>>
> >>>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> >>>> index 9006fd4..b220199 100644
> >>>> --- a/Documentation/topics/dpdk/pmd.rst
> >>>> +++ b/Documentation/topics/dpdk/pmd.rst
> >>>> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
> >>>>       when all PMD threads are running on cores from a single NUMA node. In this
> >>>>       case cross-NUMA datapaths will not change after reassignment.
> >>>>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
> >>>> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
> >>>> +    not have the same performance.
> >>
> >> Uhm... Am I reading this wrong or this note suggests to pin PMD threads
> >> to SMT siblings?  It sounds like that's the opposite of what you were
> >> trying to say.  Siblings are sharing the same physical core, so if some
> >> PMDs are pinned to siblings, the load prediction can not work correctly.
> > 
> > Thanks for the review, Ilya.
> > 
> > The note indeed suggests pinning PMD threads to siblings. Siblings share
> > the same physical core; if PMDs are pinned to one sibling while the other
> > sibling of the same physical core is left unpinned, the load prediction may
> > not work correctly, because the performance of the pinned sibling may be
> > affected by the workload on the not-pinned sibling. So we suggest pinning
> > both siblings of the same physical core.
> 
> But this makes sense only if all the PMD threads are on siblings of the
> same physical core.  If more than one physical core is involved, the load
> calculations will be incorrect.  For example, let's say we have 4 threads
> A, B, C and D, where A and B are siblings and C and D are siblings.  And
> it happened that we have only 2 ports, both of which are assigned to A.
> It makes a huge difference whether we move one of the ports from A to B
> or if we move it from A to C.  It is an oversimplified example, but we
> can't rely on load calculations in general case if PMD threads are running
> on SMT siblings.

Thanks for the detailed explanation, now I get your point.

In your example, PMDs B, C and D have no rxq assigned and will be sleeping,
which costs few CPU cycles. When a logical core (B) is sleeping, its
sibling core (A) uses most of the physical core's resources, so it is
more powerful. If we move one port from A to B, one physical core is
running; if we move one port from A to C, two physical cores are
running. So the resulting performance will be hugely different. Hope I
understand correctly.

To cover this case, one choice is to use only one of the siblings while
leaving the other sibling unused (isolated). I have done some tests:
using both siblings gives a 25% performance improvement over using only
one sibling while leaving the other sibling unused. So this may not be
a good choice.
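
For completeness, that alternative is sketched below with hypothetical
sibling pairs 2/22 and 4/24: only one logical CPU per physical core is
given to the PMDs and its sibling is left idle.

    # Use only CPUs 2 and 4: (1<<2)|(1<<4) = 0x14
    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x14
    # CPUs 22 and 24 should then be kept free of other work, e.g. by
    # excluding them from other workloads or isolating them at boot.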

> 
> > 
> > 
> >>
> >> Nit: s/pmd/PMD/
> >>
> >> Best regards, Ilya Maximets.
> >>
> >>>> +
> >>>>   The minimum time between 2 consecutive PMD auto load balancing iterations can
> >>>>   also be configured by::
> >>>>   
> >>>
> >>> I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
> >>>
> >>> Acked-by: Kevin Traynor <ktraynor@redhat.com>
> >>>
> >>> _______________________________________________
> >>> dev mailing list
> >>> dev@openvswitch.org
> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>
> >>
>
Ilya Maximets Jan. 30, 2023, 4:51 p.m. UTC | #7
On 1/30/23 16:13, Cheng Li wrote:
> On Mon, Jan 30, 2023 at 01:38:45PM +0100, Ilya Maximets wrote:
>> On 1/29/23 02:33, Cheng Li wrote:
>>> On Fri, Jan 27, 2023 at 04:04:55PM +0100, Ilya Maximets wrote:
>>>> On 1/24/23 16:52, Kevin Traynor wrote:
>>>>> On 08/01/2023 03:55, Cheng Li wrote:
>>>>>> In my test, if one logical core is pinned to a PMD thread while the
>>>>>> other logical core (of the same physical core) is not, the PMD
>>>>>> performance is affected by the load on the not-pinned logical core.
>>>>>> This makes it difficult to estimate the loads during a dry-run.
>>>>>>
>>>>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>>>>>> ---
>>>>>>   Documentation/topics/dpdk/pmd.rst | 4 ++++
>>>>>>   1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>>>>>> index 9006fd4..b220199 100644
>>>>>> --- a/Documentation/topics/dpdk/pmd.rst
>>>>>> +++ b/Documentation/topics/dpdk/pmd.rst
>>>>>> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
>>>>>>       when all PMD threads are running on cores from a single NUMA node. In this
>>>>>>       case cross-NUMA datapaths will not change after reassignment.
>>>>>>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
>>>>>> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
>>>>>> +    not have the same performance.
>>>>
>>>> Uhm... Am I reading this wrong or this note suggests to pin PMD threads
>>>> to SMT siblings?  It sounds like that's the opposite of what you were
>>>> trying to say.  Siblings are sharing the same physical core, so if some
>>>> PMDs are pinned to siblings, the load prediction can not work correctly.
>>>
>>> Thanks for the review, Ilya.
>>>
>>> The note indeed suggests pinning PMD threads to siblings. Siblings share
>>> the same physical core; if PMDs are pinned to one sibling while the other
>>> sibling of the same physical core is left unpinned, the load prediction may
>>> not work correctly, because the performance of the pinned sibling may be
>>> affected by the workload on the not-pinned sibling. So we suggest pinning
>>> both siblings of the same physical core.
>>
>> But this makes sense only if all the PMD threads are on siblings of the
>> same physical core.  If more than one physical core is involved, the load
>> calculations will be incorrect.  For example, let's say we have 4 threads
>> A, B, C and D, where A and B are siblings and C and D are siblings.  And
>> it happened that we have only 2 ports, both of which are assigned to A.
>> It makes a huge difference whether we move one of the ports from A to B
>> or if we move it from A to C.  It is an oversimplified example, but we
>> can't rely on load calculations in general case if PMD threads are running
>> on SMT siblings.
> 
> Thanks for the detailed explanation, now I get your point.
> 
> In your example, PMDs B, C and D have no rxq assigned and will be sleeping,
> which costs few CPU cycles. When a logical core (B) is sleeping, its
> sibling core (A) uses most of the physical core's resources, so it is
> more powerful. If we move one port from A to B, one physical core is
> running; if we move one port from A to C, two physical cores are
> running. So the resulting performance will be hugely different. Hope I
> understand correctly.

Yes.

And even if they are not sleeping.  E.g. if each thread has 1 port
to poll, and thread A has 3 ports to poll.  The calculated variance
will not match the actual performance impact as the actual available
CPU power is different across cores.

> 
> To cover this case, one choice is to use only one of the siblings while
> leaving the other sibling unused (isolated). I have done some tests:
> using both siblings gives a 25% performance improvement over using only
> one sibling while leaving the other sibling unused. So this may not be
> a good choice.

Leaving 20-25% of performance on the table might not be a wise choice,
but it seems to be the only way to have predictable results with the
current implementation of auto load-balancing.

To confidently suggest users to use SMT siblings with ALB enabled, we
will need to make ALB aware of SMT topology.  Maybe make it hierarchical,
i.e. balance between SMT siblings, then between physical core packages
(or in the opposite order).

Note: balancing between NUMA nodes is more complicated because of device
      locality.

Best regards, Ilya Maximets.

> 
>>
>>>
>>>
>>>>
>>>> Nit: s/pmd/PMD/
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>>>> +
>>>>>>   The minimum time between 2 consecutive PMD auto load balancing iterations can
>>>>>>   also be configured by::
>>>>>>   
>>>>>
>>>>> I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
>>>>>
>>>>> Acked-by: Kevin Traynor <ktraynor@redhat.com>
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev@openvswitch.org
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>
>>>>
>>
Cheng Li Jan. 31, 2023, 1:07 a.m. UTC | #8
On Mon, Jan 30, 2023 at 05:51:57PM +0100, Ilya Maximets wrote:
> On 1/30/23 16:13, Cheng Li wrote:
> > On Mon, Jan 30, 2023 at 01:38:45PM +0100, Ilya Maximets wrote:
> >> On 1/29/23 02:33, Cheng Li wrote:
> >>> On Fri, Jan 27, 2023 at 04:04:55PM +0100, Ilya Maximets wrote:
> >>>> On 1/24/23 16:52, Kevin Traynor wrote:
> >>>>> On 08/01/2023 03:55, Cheng Li wrote:
> >>>>>> In my test, if one logical core is pinned to a PMD thread while the
> >>>>>> other logical core (of the same physical core) is not, the PMD
> >>>>>> performance is affected by the load on the not-pinned logical core.
> >>>>>> This makes it difficult to estimate the loads during a dry-run.
> >>>>>>
> >>>>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> >>>>>> ---
> >>>>>>   Documentation/topics/dpdk/pmd.rst | 4 ++++
> >>>>>>   1 file changed, 4 insertions(+)
> >>>>>>
> >>>>>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> >>>>>> index 9006fd4..b220199 100644
> >>>>>> --- a/Documentation/topics/dpdk/pmd.rst
> >>>>>> +++ b/Documentation/topics/dpdk/pmd.rst
> >>>>>> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
> >>>>>>       when all PMD threads are running on cores from a single NUMA node. In this
> >>>>>>       case cross-NUMA datapaths will not change after reassignment.
> >>>>>>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
> >>>>>> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
> >>>>>> +    not have the same performance.
> >>>>
> >>>> Uhm... Am I reading this wrong or this note suggests to pin PMD threads
> >>>> to SMT siblings?  It sounds like that's the opposite of what you were
> >>>> trying to say.  Siblings are sharing the same physical core, so if some
> >>>> PMDs are pinned to siblings, the load prediction can not work correctly.
> >>>
> >>> Thanks for the review, Ilya.
> >>>
> >>> The note indeed suggests pinning PMD threads to siblings. Siblings share
> >>> the same physical core; if PMDs are pinned to one sibling while the other
> >>> sibling of the same physical core is left unpinned, the load prediction may
> >>> not work correctly, because the performance of the pinned sibling may be
> >>> affected by the workload on the not-pinned sibling. So we suggest pinning
> >>> both siblings of the same physical core.
> >>
> >> But this makes sense only if all the PMD threads are on siblings of the
> >> same physical core.  If more than one physical core is involved, the load
> >> calculations will be incorrect.  For example, let's say we have 4 threads
> >> A, B, C and D, where A and B are siblings and C and D are siblings.  And
> >> it happened that we have only 2 ports, both of which are assigned to A.
> >> It makes a huge difference whether we move one of the ports from A to B
> >> or if we move it from A to C.  It is an oversimplified example, but we
> >> can't rely on load calculations in general case if PMD threads are running
> >> on SMT siblings.
> > 
> > Thanks for the detailed explanation, now I get your point.
> > 
> > In your example, PMDs B, C and D have no rxq assigned and will be sleeping,
> > which costs few CPU cycles. When a logical core (B) is sleeping, its
> > sibling core (A) uses most of the physical core's resources, so it is
> > more powerful. If we move one port from A to B, one physical core is
> > running; if we move one port from A to C, two physical cores are
> > running. So the resulting performance will be hugely different. Hope I
> > understand correctly.
> 
> Yes.
> 
> And even if they are not sleeping.  E.g. if each thread has 1 port
> to poll, and thread A has 3 ports to poll.  The calculated variance
> will not match the actual performance impact as the actual available
> CPU power is different across cores.

"CPU power is different across cores". How does this happen? Because of
cpufreq may be different across cores? Or because of different poll
count?
As I understand it, no matter how many rxq to poll, no matter the rxqs
are busy or free(pps), PMD is always in poll loop which cost 100% CPU
cycles. Maybe it costs different cache resource?

> 
> > 
> > To cover this case, one choice is to use only one of the siblings while
> > leaving the other sibling unused (isolated). I have done some tests:
> > using both siblings gives a 25% performance improvement over using only
> > one sibling while leaving the other sibling unused. So this may not be
> > a good choice.
> 
> Leaving 20-25% of performance on the table might not be a wise choice,
> but it seems to be the only way to have predictable results with the
> current implementation of auto load-balancing.
> 
> To confidently suggest users to use SMT siblings with ALB enabled, we
> will need to make ALB aware of SMT topology.  Maybe make it hierarchical,
> i.e. balance between SMT siblings, then between physical core packages
> (or in the opposite order).

Agree, this would be a good solution for the case you mentioned.

> 
> Note: balancing between NUMA nodes is more complicated because of device
>       locality.
> 
> Best regards, Ilya Maximets.
> 
> > 
> >>
> >>>
> >>>
> >>>>
> >>>> Nit: s/pmd/PMD/
> >>>>
> >>>> Best regards, Ilya Maximets.
> >>>>
> >>>>>> +
> >>>>>>   The minimum time between 2 consecutive PMD auto load balancing iterations can
> >>>>>>   also be configured by::
> >>>>>>   
> >>>>>
> >>>>> I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
> >>>>>
> >>>>> Acked-by: Kevin Traynor <ktraynor@redhat.com>
> >>>>>
> >>>>> _______________________________________________
> >>>>> dev mailing list
> >>>>> dev@openvswitch.org
> >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>>>
> >>>>
> >>
>
Ilya Maximets Jan. 31, 2023, 12:19 p.m. UTC | #9
On 1/31/23 02:07, Cheng Li wrote:
> On Mon, Jan 30, 2023 at 05:51:57PM +0100, Ilya Maximets wrote:
>> On 1/30/23 16:13, Cheng Li wrote:
>>> On Mon, Jan 30, 2023 at 01:38:45PM +0100, Ilya Maximets wrote:
>>>> On 1/29/23 02:33, Cheng Li wrote:
>>>>> On Fri, Jan 27, 2023 at 04:04:55PM +0100, Ilya Maximets wrote:
>>>>>> On 1/24/23 16:52, Kevin Traynor wrote:
>>>>>>> On 08/01/2023 03:55, Cheng Li wrote:
>>>>>>>> In my test, if one logical core is pinned to a PMD thread while the
>>>>>>>> other logical core (of the same physical core) is not, the PMD
>>>>>>>> performance is affected by the load on the not-pinned logical core.
>>>>>>>> This makes it difficult to estimate the loads during a dry-run.
>>>>>>>>
>>>>>>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>>>>>>>> ---
>>>>>>>>   Documentation/topics/dpdk/pmd.rst | 4 ++++
>>>>>>>>   1 file changed, 4 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>>>>>>>> index 9006fd4..b220199 100644
>>>>>>>> --- a/Documentation/topics/dpdk/pmd.rst
>>>>>>>> +++ b/Documentation/topics/dpdk/pmd.rst
>>>>>>>> @@ -312,6 +312,10 @@ If not set, the default variance improvement threshold is 25%.
>>>>>>>>       when all PMD threads are running on cores from a single NUMA node. In this
>>>>>>>>       case cross-NUMA datapaths will not change after reassignment.
>>>>>>>>   +    For the same reason, please ensure that the pmd threads are pinned to SMT
>>>>>>>> +    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
>>>>>>>> +    not have the same performance.
>>>>>>
>>>>>> Uhm... Am I reading this wrong or this note suggests to pin PMD threads
>>>>>> to SMT siblings?  It sounds like that's the opposite of what you were
>>>>>> trying to say.  Siblings are sharing the same physical core, so if some
>>>>>> PMDs are pinned to siblings, the load prediction can not work correctly.
>>>>>
>>>>> Thanks for the review, Ilya.
>>>>>
>>>>> The note indeed suggests pinning PMD threads to siblings. Siblings share
>>>>> the same physical core; if PMDs are pinned to one sibling while the other
>>>>> sibling of the same physical core is left unpinned, the load prediction may
>>>>> not work correctly, because the performance of the pinned sibling may be
>>>>> affected by the workload on the not-pinned sibling. So we suggest pinning
>>>>> both siblings of the same physical core.
>>>>
>>>> But this makes sense only if all the PMD threads are on siblings of the
>>>> same physical core.  If more than one physical core is involved, the load
>>>> calculations will be incorrect.  For example, let's say we have 4 threads
>>>> A, B, C and D, where A and B are siblings and C and D are siblings.  And
>>>> it happened that we have only 2 ports, both of which are assigned to A.
>>>> It makes a huge difference whether we move one of the ports from A to B
>>>> or if we move it from A to C.  It is an oversimplified example, but we
>>>> can't rely on load calculations in general case if PMD threads are running
>>>> on SMT siblings.
>>>
>>> Thanks for the detailed explanation, now I get your point.
>>>
>>> In your example, PMDs B, C and D have no rxq assigned and will be sleeping,
>>> which costs few CPU cycles. When a logical core (B) is sleeping, its
>>> sibling core (A) uses most of the physical core's resources, so it is
>>> more powerful. If we move one port from A to B, one physical core is
>>> running; if we move one port from A to C, two physical cores are
>>> running. So the resulting performance will be hugely different. Hope I
>>> understand correctly.
>>
>> Yes.
>>
>> And even if they are not sleeping.  E.g. if each thread has 1 port
>> to poll, and thread A has 3 ports to poll.  The calculated variance
>> will not match the actual performance impact as the actual available
>> CPU power is different across cores.
> 
> "CPU power is different across cores". How does this happen? Because of
> cpufreq may be different across cores? Or because of different poll
> count?
> As I understand it, no matter how many rxq to poll, no matter the rxqs
> are busy or free(pps), PMD is always in poll loop which cost 100% CPU
> cycles. Maybe it costs different cache resource?

While it's true that PMD threads are just running at 100% CPU usage, when
we talk about SMT siblings, their, let's call it, "effective power" is not
the same as "effective power" of the sibling of another physical core.
Simply because their siblings are running different workloads and utilizing
different physical components of their cores at different times, including,
yes, differences in cache utilization.  So, 100% of one core is not equal to
100% of another core.  Unless they are fully independent physical cores with
no noisy siblings.

Hopefully, thresholds can amortize the difference.  But they will not help
if some of the threads are actually sleeping.

I didn't take into account dynamic frequency adjustments which, of course,
will make everything even more complicated.  Though, if the cores are in
a busy loop, I'd expect frequencies to stay more or less on the same level,
even if dynamic management is enabled.  And I'm not sure if cpufreq can
actually control the clock speed for siblings separately, I'd expect them
to be the same.
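
On a given system this is easy to check; for example, assuming the cpufreq
sysfs interface is present and 2/22 is a sibling pair:

    $ grep . /sys/devices/system/cpu/cpu{2,22}/cpufreq/scaling_cur_freq

If the two reported frequencies always track each other, per-sibling
frequency differences can be ruled out as a factor.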

The new 'pmd-maxsleep' configuration though might introduce some more
interesting cases into a load balancing math as well, since threads will
not consume 100% CPU anymore.
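
For reference, a minimal sketch of setting that knob, with an arbitrary
example value (the option takes the maximum sleep request in microseconds):

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=50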

> 
>>
>>>
>>> To cover this case, one choice is to use only one of the siblings while
>>> leaving the other sibling unused (isolated). I have done some tests:
>>> using both siblings gives a 25% performance improvement over using only
>>> one sibling while leaving the other sibling unused. So this may not be
>>> a good choice.
>>
>> Leaving 20-25% of performance on the table might not be a wise choice,
>> but it seems to be the only way to have predictable results with the
>> current implementation of auto load-balancing.
>>
>> To confidently suggest users to use SMT siblings with ALB enabled, we
>> will need to make ALB aware of SMT topology.  Maybe make it hierarchical,
>> i.e. balance between SMT siblings, then between physical core packages
>> (or in the opposite order).
> 
> Agree, this would be a good solution for the case you mentioned.
> 
>>
>> Note: balancing between NUMA nodes is more complicated because of device
>>       locality.
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Nit: s/pmd/PMD/
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>>> +
>>>>>>>>   The minimum time between 2 consecutive PMD auto load balancing iterations can
>>>>>>>>   also be configured by::
>>>>>>>>   
>>>>>>>
>>>>>>> I don't think it's a hard requirement as siblings should not impact as much as cross-numa might but it's probably good advice in general.
>>>>>>>
>>>>>>> Acked-by: Kevin Traynor <ktraynor@redhat.com>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dev mailing list
>>>>>>> dev@openvswitch.org
>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>>>
>>>>>>
>>>>
>>

Patch

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index 9006fd4..b220199 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -312,6 +312,10 @@  If not set, the default variance improvement threshold is 25%.
     when all PMD threads are running on cores from a single NUMA node. In this
     case cross-NUMA datapaths will not change after reassignment.
 
+    For the same reason, please ensure that the pmd threads are pinned to SMT
+    siblings if HyperThreading is enabled. Otherwise, PMDs within a NUMA may
+    not have the same performance.
+
 The minimum time between 2 consecutive PMD auto load balancing iterations can
 also be configured by::