spapr_numa.c: FORM2 table handle nodes with no distance info

Message ID 20211105135137.1584840-1-npiggin@gmail.com
State New

Commit Message

Nicholas Piggin Nov. 5, 2021, 1:51 p.m. UTC
A configuration that specifies multiple nodes without distance info
results in the non-local points in the FORM2 matrix having a distance of
0. This causes Linux to complain "Invalid distance value range" because
a node distance is smaller than the local distance.

Fix this by building a simple local / remote fallback for points where
distance information is missing.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 hw/ppc/spapr_numa.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

Comments

Daniel Henrique Barboza Nov. 5, 2021, 6:52 p.m. UTC | #1
On 11/5/21 10:51, Nicholas Piggin wrote:
> A configuration that specifies multiple nodes without distance info
> results in the non-local points in the FORM2 matrix having a distance of
> 0. This causes Linux to complain "Invalid distance value range" because
> a node distance is smaller than the local distance.
> 
> Fix this by building a simple local / remote fallback for points where
> distance information is missing.

Thanks for looking this up. I checked the output of this same scenario with
a FORM1 guest and 4 distance-less NUMA nodes. This is what I got:

[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
(...)
node distances:
node   0   1   2   3
   0:  10  160  160  160
   1:  160  10  160  160
   2:  160  160  10  160
   3:  160  160  160  10
[root@localhost ~]#


With this patch we're getting '20' instead of '160' because you're using
NUMA_DISTANCE_DEFAULT, while FORM1 will default this case to the maximum
NUMA distance the kernel allows for that affinity (160).

I do not have strong feelings about changing this behavior between FORM1 and
FORM2. I tested the same scenario with an x86_64 guest and it also uses '20'
in this case, so as far as QEMU goes, using NUMA_DISTANCE_DEFAULT is
consistent.

Aneesh is already in CC, so I believe he'll let us know if there's something
we're missing and we need to preserve the '160' distance in FORM2 for this
case as well.

For now:


> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---


Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>



>   hw/ppc/spapr_numa.c | 22 +++++++++++++++++-----
>   1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 5822938448..56ab2a5fb6 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -546,12 +546,24 @@ static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
>                * NUMA nodes, but QEMU adds the default NUMA node without
>                * adding the numa_info to retrieve distance info from.
>                */
> -            if (src == dst) {
> -                distance_table[i++] = NUMA_DISTANCE_MIN;
> -                continue;
> +            distance_table[i] = numa_info[src].distance[dst];
> +            if (distance_table[i] == 0) {
> +                /*
> +                 * In case QEMU adds a default NUMA single node when the user
> +                 * did not add any, or where the user did not supply distances,
> +                 * the value will be 0 here. Populate the table with a fallback
> +                 * simple local / remote distance.
> +                 */
> +                if (src == dst) {
> +                    distance_table[i] = NUMA_DISTANCE_MIN;
> +                } else {
> +                    distance_table[i] = numa_info[src].distance[dst];
> +                    if (distance_table[i] < NUMA_DISTANCE_MIN) {
> +                        distance_table[i] = NUMA_DISTANCE_DEFAULT;
> +                    }
> +                }
>               }
> -
> -            distance_table[i++] = numa_info[src].distance[dst];
> +            i++;
>           }
>       }
>   
>
David Gibson Nov. 8, 2021, 3:26 a.m. UTC | #2
On Fri, Nov 05, 2021 at 03:52:13PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 11/5/21 10:51, Nicholas Piggin wrote:
> > A configuration that specifies multiple nodes without distance info
> > results in the non-local points in the FORM2 matrix having a distance of
> > 0. This causes Linux to complain "Invalid distance value range" because
> > a node distance is smaller than the local distance.
> > 
> > Fix this by building a simple local / remote fallback for points where
> > distance information is missing.
> 
> Thanks for looking this up. I checked the output of this same scenario with
> a FORM1 guest and 4 distance-less NUMA nodes. This is what I got:
> 
> [root@localhost ~]# numactl -H
> available: 4 nodes (0-3)
> (...)
> node distances:
> node   0   1   2   3
>   0:  10  160  160  160
>   1:  160  10  160  160
>   2:  160  160  10  160
>   3:  160  160  160  10
> [root@localhost ~]#
> 
> 
> With this patch we're getting '20' instead of '160' because you're using
> NUMA_DISTANCE_DEFAULT, while FORM1 will default this case to the maximum
> NUMA distance the kernel allows for that affinity (160).
> 
> I do not have strong feelings about changing this behavior between FORM1 and
> FORM2. I tested the same scenario with an x86_64 guest and it also uses '20'
> in this case, so as far as QEMU goes, using NUMA_DISTANCE_DEFAULT is
> consistent.
> 
> Aneesh is already in CC, so I believe he'll let us know if there's something
> we're missing and we need to preserve the '160' distance in FORM2 for this
> case as well.
> 
> For now:
> 
> 
> > 
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> 
> 
> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Applied to ppc-for-6.2, thanks.

> 
> 
> 
> >   hw/ppc/spapr_numa.c | 22 +++++++++++++++++-----
> >   1 file changed, 17 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > index 5822938448..56ab2a5fb6 100644
> > --- a/hw/ppc/spapr_numa.c
> > +++ b/hw/ppc/spapr_numa.c
> > @@ -546,12 +546,24 @@ static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
> >                * NUMA nodes, but QEMU adds the default NUMA node without
> >                * adding the numa_info to retrieve distance info from.
> >                */
> > -            if (src == dst) {
> > -                distance_table[i++] = NUMA_DISTANCE_MIN;
> > -                continue;
> > +            distance_table[i] = numa_info[src].distance[dst];
> > +            if (distance_table[i] == 0) {
> > +                /*
> > +                 * In case QEMU adds a default NUMA single node when the user
> > +                 * did not add any, or where the user did not supply distances,
> > +                 * the value will be 0 here. Populate the table with a fallback
> > +                 * simple local / remote distance.
> > +                 */
> > +                if (src == dst) {
> > +                    distance_table[i] = NUMA_DISTANCE_MIN;
> > +                } else {
> > +                    distance_table[i] = numa_info[src].distance[dst];
> > +                    if (distance_table[i] < NUMA_DISTANCE_MIN) {
> > +                        distance_table[i] = NUMA_DISTANCE_DEFAULT;
> > +                    }
> > +                }
> >               }
> > -
> > -            distance_table[i++] = numa_info[src].distance[dst];
> > +            i++;
> >           }
> >       }
> > 
>
Aneesh Kumar K V Nov. 8, 2021, 4:22 a.m. UTC | #3
Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> On 11/5/21 10:51, Nicholas Piggin wrote:
>> A configuration that specifies multiple nodes without distance info
>> results in the non-local points in the FORM2 matrix having a distance of
>> 0. This causes Linux to complain "Invalid distance value range" because
>> a node distance is smaller than the local distance.
>> 
>> Fix this by building a simple local / remote fallback for points where
>> distance information is missing.
>
> Thanks for looking this up. I checked the output of this same scenario with
> a FORM1 guest and 4 distance-less NUMA nodes. This is what I got:
>
> [root@localhost ~]# numactl -H
> available: 4 nodes (0-3)
> (...)
> node distances:
> node   0   1   2   3
>    0:  10  160  160  160
>    1:  160  10  160  160
>    2:  160  160  10  160
>    3:  160  160  160  10
> [root@localhost ~]#
>
>
> With this patch we're getting '20' instead of '160' because you're using
> NUMA_DISTANCE_DEFAULT, while FORM1 will default this case to the maximum
> NUMA distance the kernel allows for that affinity (160).

Where is that enforced? Do we know why FORM1 picked 160?

>
> I do not have strong feelings about changing this behavior between FORM1 and
> FORM2. I tested the same scenario with an x86_64 guest and it also uses '20'
> in this case, so as far as QEMU goes, using NUMA_DISTANCE_DEFAULT is
> consistent.
>

For FORM2 I would suggest that 20 is the right value, and it is also
consistent with other architectures.

> Aneesh is already in CC, so I believe he'll let us know if there's something
> we're missing and we need to preserve the '160' distance in FORM2 for this
> case as well.
>
> For now:
>
>
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>
>
> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>
>
>
>>   hw/ppc/spapr_numa.c | 22 +++++++++++++++++-----
>>   1 file changed, 17 insertions(+), 5 deletions(-)
>> 
>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>> index 5822938448..56ab2a5fb6 100644
>> --- a/hw/ppc/spapr_numa.c
>> +++ b/hw/ppc/spapr_numa.c
>> @@ -546,12 +546,24 @@ static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
>>                * NUMA nodes, but QEMU adds the default NUMA node without
>>                * adding the numa_info to retrieve distance info from.
>>                */
>> -            if (src == dst) {
>> -                distance_table[i++] = NUMA_DISTANCE_MIN;
>> -                continue;

Did we always initialize the local distance to NUMA_DISTANCE_MIN before,
irrespective of what was specified on the QEMU command line? If so,
won't the above change break that?

>> +            distance_table[i] = numa_info[src].distance[dst];
>> +            if (distance_table[i] == 0) {

We know distance_table[i] == 0 here, and ...

>> +                /*
>> +                 * In case QEMU adds a default NUMA single node when the user
>> +                 * did not add any, or where the user did not supply distances,
>> +                 * the value will be 0 here. Populate the table with a fallback
>> +                 * simple local / remote distance.
>> +                 */
>> +                if (src == dst) {
>> +                    distance_table[i] = NUMA_DISTANCE_MIN;
>> +                } else {
>> +                    distance_table[i] = numa_info[src].distance[dst];
>> +                    if (distance_table[i] < NUMA_DISTANCE_MIN) {


Considering we got here after checking distance_table[i] == 0, do we
need the above two lines?

>> +                        distance_table[i] = NUMA_DISTANCE_DEFAULT;
>> +                    }
>> +                }
>>               }
>> -
>> -            distance_table[i++] = numa_info[src].distance[dst];
>> +            i++;
>>           }
>>       }
Nicholas Piggin Nov. 8, 2021, 1:51 p.m. UTC | #4
Excerpts from Aneesh Kumar K.V's message of November 8, 2021 2:22 pm:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> On 11/5/21 10:51, Nicholas Piggin wrote:
>>> A configuration that specifies multiple nodes without distance info
>>> results in the non-local points in the FORM2 matrix having a distance of
>>> 0. This causes Linux to complain "Invalid distance value range" because
>>> a node distance is smaller than the local distance.
>>> 
>>> Fix this by building a simple local / remote fallback for points where
>>> distance information is missing.
>>
>> Thanks for looking this up. I checked the output of this same scenario with
>> a FORM1 guest and 4 distance-less NUMA nodes. This is what I got:
>>
>> [root@localhost ~]# numactl -H
>> available: 4 nodes (0-3)
>> (...)
>> node distances:
>> node   0   1   2   3
>>    0:  10  160  160  160
>>    1:  160  10  160  160
>>    2:  160  160  10  160
>>    3:  160  160  160  10
>> [root@localhost ~]#
>>
>>
>> With this patch we're getting '20' instead of '160' because you're using
>> NUMA_DISTANCE_DEFAULT, while FORM1 will default this case to the maximum
>> NUMA distance the kernel allows for that affinity (160).
> 
> Where is that enforced? Do we know why FORM1 picked 160?
> 
>>
>> I do not have strong feelings about changing this behavior between FORM1 and
>> FORM2. I tested the same scenario with an x86_64 guest and it also uses '20'
>> in this case, so as far as QEMU goes, using NUMA_DISTANCE_DEFAULT is
>> consistent.
>>
> 
> For FORM2 I would suggest that 20 is the right value, and it is also
> consistent with other architectures.
> 
>> Aneesh is already in CC, so I believe he'll let us know if there's something
>> we're missing and we need to preserve the '160' distance in FORM2 for this
>> case as well.
>>
>> For now:
>>
>>
>>> 
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>
>>
>> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>
>>
>>
>>>   hw/ppc/spapr_numa.c | 22 +++++++++++++++++-----
>>>   1 file changed, 17 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>>> index 5822938448..56ab2a5fb6 100644
>>> --- a/hw/ppc/spapr_numa.c
>>> +++ b/hw/ppc/spapr_numa.c
>>> @@ -546,12 +546,24 @@ static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
>>>                * NUMA nodes, but QEMU adds the default NUMA node without
>>>                * adding the numa_info to retrieve distance info from.
>>>                */
>>> -            if (src == dst) {
>>> -                distance_table[i++] = NUMA_DISTANCE_MIN;
>>> -                continue;
> 
> Did we always initialize the local distance to NUMA_DISTANCE_MIN before,
> irrespective of what was specified on the QEMU command line? If so,
> won't the above change break that?

That's true. I think the command line should take priority, and if we have
to override it for some reason then we should print a warning.

> 
>>> +            distance_table[i] = numa_info[src].distance[dst];
>>> +            if (distance_table[i] == 0) {
> 
> We know distance_table[i] == 0 here, and ...
> 
>>> +                /*
>>> +                 * In case QEMU adds a default NUMA single node when the user
>>> +                 * did not add any, or where the user did not supply distances,
>>> +                 * the value will be 0 here. Populate the table with a fallback
>>> +                 * simple local / remote distance.
>>> +                 */
>>> +                if (src == dst) {
>>> +                    distance_table[i] = NUMA_DISTANCE_MIN;
>>> +                } else {
>>> +                    distance_table[i] = numa_info[src].distance[dst];
>>> +                    if (distance_table[i] < NUMA_DISTANCE_MIN) {
> 
> 
> Considering we got here after checking distance_table[i] == 0, do we
> need the above two lines?

Oh, that's true. I think the lines could just be removed.

Thanks,
Nick

> 
>>> +                        distance_table[i] = NUMA_DISTANCE_DEFAULT;
>>> +                    }
>>> +                }
>>>               }
>>> -
>>> -            distance_table[i++] = numa_info[src].distance[dst];
>>> +            i++;
>>>           }
>>>       }
> 
> 
>
Daniel Henrique Barboza Nov. 8, 2021, 9:12 p.m. UTC | #5
On 11/8/21 01:22, Aneesh Kumar K.V wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> On 11/5/21 10:51, Nicholas Piggin wrote:
>>> A configuration that specifies multiple nodes without distance info
>>> results in the non-local points in the FORM2 matrix having a distance of
>>> 0. This causes Linux to complain "Invalid distance value range" because
>>> a node distance is smaller than the local distance.
>>>
>>> Fix this by building a simple local / remote fallback for points where
>>> distance information is missing.
>>
>> Thanks for looking this up. I checked the output of this same scenario with
>> a FORM1 guest and 4 distance-less NUMA nodes. This is what I got:
>>
>> [root@localhost ~]# numactl -H
>> available: 4 nodes (0-3)
>> (...)
>> node distances:
>> node   0   1   2   3
>>     0:  10  160  160  160
>>     1:  160  10  160  160
>>     2:  160  160  10  160
>>     3:  160  160  160  10
>> [root@localhost ~]#
>>
>>
>> With this patch we're getting '20' instead of '160' because you're using
>> NUMA_DISTANCE_DEFAULT, while FORM1 will default this case to the maximum
>> NUMA distance the kernel allows for that affinity (160).
> 
> Where is that enforced? Do we know why FORM1 picked 160?


It's the kernel algorithm that determines FORM1 distance. It doubles the
distance value of the previous level. It starts with the LOCAL_DISTANCE (10)
for the first NUMA level, second level is 10*2, and so on for all 4
reference-points (10, 20, 40, 80). If no match is found in the 4th level,
it doubles once more, giving us 160.

What is happening here is that the absence of a distance (distance == 0)
is handled by the FORM1 code in QEMU in such a way that the resulting
associativity domains trigger the kernel behavior I described above.
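
As a rough illustration of the doubling described above, here is a minimal
C sketch. It is not the kernel source: the helper name
form1_distance_sketch, the REF_POINTS_MAX constant and the flat arrays of
domain IDs are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_DISTANCE  10   /* distance of a node to itself */
#define REF_POINTS_MAX  4    /* assumed number of reference points */

/*
 * Hypothetical helper, not the kernel's function: a[] and b[] hold the
 * associativity domain IDs of two nodes at each reference point.
 * Start at LOCAL_DISTANCE and double once per level until the domains
 * of both nodes match.
 */
static int form1_distance_sketch(const int *a, const int *b, size_t depth)
{
    int distance = LOCAL_DISTANCE;
    size_t i;

    assert(depth <= REF_POINTS_MAX);

    for (i = 0; i < depth; i++) {
        if (a[i] == b[i]) {
            break;
        }
        distance *= 2;
    }
    return distance;
}
```

With four reference points and no shared domain at any level, the loop
doubles four times and returns 10 * 2^4 = 160, which matches the numactl
output quoted earlier. A match at the first level returns 10.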

I'll check it out later and see if that's easily fixable.

> 
>>
>> I do not have strong feelings about changing this behavior between FORM1 and
>> FORM2. I tested the same scenario with an x86_64 guest and it also uses '20'
>> in this case, so as far as QEMU goes, using NUMA_DISTANCE_DEFAULT is
>> consistent.
>>
> 
> For FORM2 I would suggest that 20 is the right value, and it is also
> consistent with other architectures.
> 
>> Aneesh is already in CC, so I believe he'll let us know if there's something
>> we're missing and we need to preserve the '160' distance in FORM2 for this
>> case as well.
>>
>> For now:
>>
>>
>>>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>
>>
>> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>
>>
>>
>>>    hw/ppc/spapr_numa.c | 22 +++++++++++++++++-----
>>>    1 file changed, 17 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>>> index 5822938448..56ab2a5fb6 100644
>>> --- a/hw/ppc/spapr_numa.c
>>> +++ b/hw/ppc/spapr_numa.c
>>> @@ -546,12 +546,24 @@ static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
>>>                 * NUMA nodes, but QEMU adds the default NUMA node without
>>>                 * adding the numa_info to retrieve distance info from.
>>>                 */
>>> -            if (src == dst) {
>>> -                distance_table[i++] = NUMA_DISTANCE_MIN;
>>> -                continue;
> 
> Did we always initialize the local distance to NUMA_DISTANCE_MIN before,
> irrespective of what was specified on the QEMU command line? If so,
> won't the above change break that?

No. I added this piece of code above because QEMU can auto-generate a single
NUMA node if the user added no NUMA nodes in the command line. This
auto-generated NUMA node didn't have the local_distance for itself set. That's
the only case where I was setting distance = 10. The remaining entries
were being written as-is. And now we need Nick's patch as well because
I missed other instances of absent distances hehe

I don't believe that we're breaking anything with this patch because we're
checking for distance = 0 first, and QEMU doesn't allow any distance < 10 to
be set:

-numa dist,src=0,dst=1,val=3
qemu-system-x86_64: -numa dist,src=0,dst=1,val=3: NUMA distance (3) is invalid, it shouldn't be less than 10


This means that we're not overwriting any user setting by accident.
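
The reasoning can be sketched in C as follows. This is only an
illustration under stated assumptions: both helper names are hypothetical
and are not QEMU functions, and the value 10 for NUMA_DISTANCE_MIN is
taken from the error message above.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUMA_DISTANCE_MIN 10  /* smallest distance the command line accepts */

/* Hypothetical check mirroring the command-line validation shown above. */
static bool distance_is_valid_sketch(uint8_t val)
{
    return val >= NUMA_DISTANCE_MIN;
}

/*
 * Because every accepted value is >= NUMA_DISTANCE_MIN, a stored 0 can
 * only mean that the user never supplied a distance for this pair.
 */
static bool distance_is_unset_sketch(uint8_t stored)
{
    return stored == 0;
}
```

Since no valid value is ever 0, a fallback that only triggers on 0 cannot
replace a distance the user actually set.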

> 
>>> +            distance_table[i] = numa_info[src].distance[dst];
>>> +            if (distance_table[i] == 0) {
> 
> We know distance_table[i] == 0 here, and ...
> 
>>> +                /*
>>> +                 * In case QEMU adds a default NUMA single node when the user
>>> +                 * did not add any, or where the user did not supply distances,
>>> +                 * the value will be 0 here. Populate the table with a fallback
>>> +                 * simple local / remote distance.
>>> +                 */
>>> +                if (src == dst) {
>>> +                    distance_table[i] = NUMA_DISTANCE_MIN;
>>> +                } else {
>>> +                    distance_table[i] = numa_info[src].distance[dst];
>>> +                    if (distance_table[i] < NUMA_DISTANCE_MIN) {
> 
> 
> Considering we got here after checking distance_table[i] == 0, do we
> need the above two lines?

You're right. I believe we can make it work with

                 if (src == dst) {
                     distance_table[i] = NUMA_DISTANCE_MIN;
                 } else {
                     distance_table[i] = NUMA_DISTANCE_DEFAULT;
                 }
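
To show the simplified fallback in context, here is a self-contained
sketch of the whole fill loop. It is a stand-in for illustration and not
the actual spapr_numa.c code: the function name, the SKETCH_MAX_NODES
bound and the plain 2D array (in place of numa_info) are assumptions. The
values 10 and 20 are the local and default distances discussed in this
thread.

```c
#include <stdint.h>

#define NUMA_DISTANCE_MIN      10  /* local distance */
#define NUMA_DISTANCE_DEFAULT  20  /* remote fallback discussed above */
#define SKETCH_MAX_NODES       4   /* assumption: small bound for the example */

/*
 * Hypothetical stand-in for the loop in the patch: copy user-supplied
 * distances and fill in a local / remote fallback where the value is 0.
 */
static void fill_distance_table_sketch(
    uint8_t *table,
    uint8_t dist[SKETCH_MAX_NODES][SKETCH_MAX_NODES],
    int nb_nodes)
{
    int src, dst, i = 0;

    for (src = 0; src < nb_nodes; src++) {
        for (dst = 0; dst < nb_nodes; dst++) {
            table[i] = dist[src][dst];
            if (table[i] == 0) {
                /* No distance supplied for this pair: use the fallback. */
                table[i] = (src == dst) ? NUMA_DISTANCE_MIN
                                        : NUMA_DISTANCE_DEFAULT;
            }
            i++;
        }
    }
}
```

For two nodes with no distances supplied this yields 10, 20, 20, 10, and
any non-zero distance the user provided is copied through unchanged.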


Nick, care to re-send?



Thanks,



Daniel

> 
>>> +                        distance_table[i] = NUMA_DISTANCE_DEFAULT;
>>> +                    }
>>> +                }
>>>                }
>>> -
>>> -            distance_table[i++] = numa_info[src].distance[dst];
>>> +            i++;
>>>            }
>>>        }
> 
>

Patch

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 5822938448..56ab2a5fb6 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -546,12 +546,24 @@  static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
              * NUMA nodes, but QEMU adds the default NUMA node without
              * adding the numa_info to retrieve distance info from.
              */
-            if (src == dst) {
-                distance_table[i++] = NUMA_DISTANCE_MIN;
-                continue;
+            distance_table[i] = numa_info[src].distance[dst];
+            if (distance_table[i] == 0) {
+                /*
+                 * In case QEMU adds a default NUMA single node when the user
+                 * did not add any, or where the user did not supply distances,
+                 * the value will be 0 here. Populate the table with a fallback
+                 * simple local / remote distance.
+                 */
+                if (src == dst) {
+                    distance_table[i] = NUMA_DISTANCE_MIN;
+                } else {
+                    distance_table[i] = numa_info[src].distance[dst];
+                    if (distance_table[i] < NUMA_DISTANCE_MIN) {
+                        distance_table[i] = NUMA_DISTANCE_DEFAULT;
+                    }
+                }
             }
-
-            distance_table[i++] = numa_info[src].distance[dst];
+            i++;
         }
     }