diff mbox series

powerpc/numa: Restrict possible nodes based on platform

Message ID 20200706064002.14848-1-srikar@linux.vnet.ibm.com (mailing list archive)
State Superseded
Headers show
Series powerpc/numa: Restrict possible nodes based on platform | expand

Commit Message

Srikar Dronamraju July 6, 2020, 6:40 a.m. UTC
As per PAPR, there are 2 device tree property
ibm,max-associativity-domains (which defines the maximum number of
domains that the firmware i.e PowerVM can support) and
ibm,current-associativity-domains (which defines the maximum number of
domains that the platform can support). Value of
ibm,max-associativity-domains property is always greater than or equal
to ibm,current-associativity-domains property.

Powerpc currently uses ibm,max-associativity-domains  property while
setting the possible number of nodes. This is currently set at 32.
However the possible number of nodes for a platform may be significantly
less. Hence set the possible number of nodes based on
ibm,current-associativity-domains property.

$ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
/proc/device-tree/rtas/ibm,current-associativity-domains
		 00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
		 00000005 00000001 00000008 00000020 00000020 00000100

$ cat /sys/devices/system/node/possible ##Before patch
0-31

$ cat /sys/devices/system/node/possible ##After patch
0-1

Note the maximum nodes this platform can support is only 2 but the
possible nodes is set to 32.

This is important because lot of kernel and user space code allocate
structures for all possible nodes leading to a lot of memory that is
allocated but not used.

I ran a simple experiment to create and destroy 100 memory cgroups on
boot on a 8 node machine (Power8 Alpine).

Before patch
free -k at boot
              total        used        free      shared  buff/cache   available
Mem:      523498176     4106816   518820608       22272      570752   516606720
Swap:       4194240           0     4194240

free -k after creating 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:      523498176     4628416   518246464       22336      623296   516058688
Swap:       4194240           0     4194240

free -k after destroying 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:      523498176     4697408   518173760       22400      627008   515987904
Swap:       4194240           0     4194240

After patch
free -k at boot
              total        used        free      shared  buff/cache   available
Mem:      523498176     3969472   518933888       22272      594816   516731776
Swap:       4194240           0     4194240

free -k after creating 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:      523498176     4181888   518676096       22208      640192   516496448
Swap:       4194240           0     4194240

free -k after destroying 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:      523498176     4232320   518619904       22272      645952   516443264
Swap:       4194240           0     4194240

Observations:
Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.

Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Tyrel Datwyler July 6, 2020, 8:58 p.m. UTC | #1
On 7/5/20 11:40 PM, Srikar Dronamraju wrote:
> As per PAPR, there are 2 device tree property
> ibm,max-associativity-domains (which defines the maximum number of
> domains that the firmware i.e PowerVM can support) and
> ibm,current-associativity-domains (which defines the maximum number of
> domains that the platform can support). Value of
> ibm,max-associativity-domains property is always greater than or equal
> to ibm,current-associativity-domains property.
> 
> Powerpc currently uses ibm,max-associativity-domains  property while
> setting the possible number of nodes. This is currently set at 32.
> However the possible number of nodes for a platform may be significantly
> less. Hence set the possible number of nodes based on
> ibm,current-associativity-domains property.
> 
> $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
> /proc/device-tree/rtas/ibm,current-associativity-domains
> 		 00000005 00000001 00000002 00000002 00000002 00000010
> /proc/device-tree/rtas/ibm,max-associativity-domains
> 		 00000005 00000001 00000008 00000020 00000020 00000100
> 
> $ cat /sys/devices/system/node/possible ##Before patch
> 0-31
> 
> $ cat /sys/devices/system/node/possible ##After patch
> 0-1
> 
> Note the maximum nodes this platform can support is only 2 but the
> possible nodes is set to 32.
> 
> This is important because lot of kernel and user space code allocate
> structures for all possible nodes leading to a lot of memory that is
> allocated but not used.
> 
> I ran a simple experiment to create and destroy 100 memory cgroups on
> boot on a 8 node machine (Power8 Alpine).
> 
> Before patch
> free -k at boot
>               total        used        free      shared  buff/cache   available
> Mem:      523498176     4106816   518820608       22272      570752   516606720
> Swap:       4194240           0     4194240
> 
> free -k after creating 100 memory cgroups
>               total        used        free      shared  buff/cache   available
> Mem:      523498176     4628416   518246464       22336      623296   516058688
> Swap:       4194240           0     4194240
> 
> free -k after destroying 100 memory cgroups
>               total        used        free      shared  buff/cache   available
> Mem:      523498176     4697408   518173760       22400      627008   515987904
> Swap:       4194240           0     4194240
> 
> After patch
> free -k at boot
>               total        used        free      shared  buff/cache   available
> Mem:      523498176     3969472   518933888       22272      594816   516731776
> Swap:       4194240           0     4194240
> 
> free -k after creating 100 memory cgroups
>               total        used        free      shared  buff/cache   available
> Mem:      523498176     4181888   518676096       22208      640192   516496448
> Swap:       4194240           0     4194240
> 
> free -k after destroying 100 memory cgroups
>               total        used        free      shared  buff/cache   available
> Mem:      523498176     4232320   518619904       22272      645952   516443264
> Swap:       4194240           0     4194240
> 
> Observations:
> Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
> Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
> 
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 9fcf2d195830..3d55cef1a2dc 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
>  		return;
> 
>  	if (of_property_read_u32_index(rtas,
> -				"ibm,max-associativity-domains",
> +				"ibm,current-associativity-domains",

I'm not sure ibm,current-associativity-domains is guaranteed to exist on older
firmware. You may need check that it exists and fall back to
ibm,max-associativity-domains in the event it doesn't

-Tyrel

>  				min_common_depth, &numnodes))
>  		goto out;
>
Nathan Lynch July 6, 2020, 11:19 p.m. UTC | #2
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 9fcf2d195830..3d55cef1a2dc 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
>  		return;
>  
>  	if (of_property_read_u32_index(rtas,
> -				"ibm,max-associativity-domains",
> +				"ibm,current-associativity-domains",
>  				min_common_depth, &numnodes))

Looks good if ibm,current-associativity-domains[min_common_depth]
actually denotes the range of possible values, i.e. a value of 2 implies
node numbers 0 and 1. PAPR+ says it's the "number of unique values",
which isn't how I would specify the property if it's supposed to express
a range. But it's probably OK...
Nathan Lynch July 7, 2020, 12:44 a.m. UTC | #3
Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
>>  		return;
>> 
>>  	if (of_property_read_u32_index(rtas,
>> -				"ibm,max-associativity-domains",
>> +				"ibm,current-associativity-domains",
>
> I'm not sure ibm,current-associativity-domains is guaranteed to exist on older
> firmware. You may need check that it exists and fall back to
> ibm,max-associativity-domains in the event it doesn't

Yes. Looks like it's a PowerVM-specific property.
Srikar Dronamraju July 7, 2020, 2:50 a.m. UTC | #4
* Tyrel Datwyler <tyreld@linux.ibm.com> [2020-07-06 13:58:42]:

> On 7/5/20 11:40 PM, Srikar Dronamraju wrote:
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 9fcf2d195830..3d55cef1a2dc 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
> >  		return;
> > 
> >  	if (of_property_read_u32_index(rtas,
> > -				"ibm,max-associativity-domains",
> > +				"ibm,current-associativity-domains",
> 
> I'm not sure ibm,current-associativity-domains is guaranteed to exist on older
> firmware. You may need check that it exists and fall back to
> ibm,max-associativity-domains in the event it doesn't
> 
> -Tyrel
> 

Oh, 
thanks Tyrel thats a good observation.
Will fallback on max-associativity.
Srikar Dronamraju July 7, 2020, 2:53 a.m. UTC | #5
* Nathan Lynch <nathanl@linux.ibm.com> [2020-07-06 19:44:31]:

> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
> >>  		return;
> >> 
> >>  	if (of_property_read_u32_index(rtas,
> >> -				"ibm,max-associativity-domains",
> >> +				"ibm,current-associativity-domains",
> >
> > I'm not sure ibm,current-associativity-domains is guaranteed to exist on older
> > firmware. You may need check that it exists and fall back to
> > ibm,max-associativity-domains in the event it doesn't
> 
> Yes. Looks like it's a PowerVM-specific property.

Right, this is just a PowerVM specific property and doesn't affect PowerNV.
On PowerNV thought we have sparse nodes, the max possible nodes is set
correctly.
Michael Ellerman July 7, 2020, 5:02 a.m. UTC | #6
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> As per PAPR, there are 2 device tree property
> ibm,max-associativity-domains (which defines the maximum number of
> domains that the firmware i.e PowerVM can support) and
> ibm,current-associativity-domains (which defines the maximum number of
> domains that the platform can support). Value of
> ibm,max-associativity-domains property is always greater than or equal
> to ibm,current-associativity-domains property.

Where is it documented?

It's definitely not in LoPAPR.

> Powerpc currently uses ibm,max-associativity-domains  property while
> setting the possible number of nodes. This is currently set at 32.
> However the possible number of nodes for a platform may be significantly
> less. Hence set the possible number of nodes based on
> ibm,current-associativity-domains property.
>
> $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
> /proc/device-tree/rtas/ibm,current-associativity-domains
> 		 00000005 00000001 00000002 00000002 00000002 00000010
> /proc/device-tree/rtas/ibm,max-associativity-domains
> 		 00000005 00000001 00000008 00000020 00000020 00000100
>
> $ cat /sys/devices/system/node/possible ##Before patch
> 0-31
>
> $ cat /sys/devices/system/node/possible ##After patch
> 0-1
>
> Note the maximum nodes this platform can support is only 2 but the
> possible nodes is set to 32.

But what about LPM to a system with more nodes?

cheers
Srikar Dronamraju July 7, 2020, 8:42 a.m. UTC | #7
* Michael Ellerman <mpe@ellerman.id.au> [2020-07-07 15:02:17]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > As per PAPR, there are 2 device tree property
> > ibm,max-associativity-domains (which defines the maximum number of
> > domains that the firmware i.e PowerVM can support) and
> > ibm,current-associativity-domains (which defines the maximum number of
> > domains that the platform can support). Value of
> > ibm,max-associativity-domains property is always greater than or equal
> > to ibm,current-associativity-domains property.
> 
> Where is it documented?
> 
> It's definitely not in LoPAPR.
> 

https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
Page number 833.

which says 
ibm,current-associativity-domains”
	property name to define the current number of associativity
	domains for this platform.
	prop-encoded-array: An associativity list such that all values are
	the number of unique values that the current platform supports
	in that location. The associativity list consisting of a number of
	entries integer (N) encoded as with encode-int followed by N
	integers encoded as with encode-int each representing current
	number of unique associativity domains the platform supports at
	that level.

> > Powerpc currently uses ibm,max-associativity-domains  property while
> > setting the possible number of nodes. This is currently set at 32.
> > However the possible number of nodes for a platform may be significantly
> > less. Hence set the possible number of nodes based on
> > ibm,current-associativity-domains property.
> >
> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
> > /proc/device-tree/rtas/ibm,current-associativity-domains
> > 		 00000005 00000001 00000002 00000002 00000002 00000010
> > /proc/device-tree/rtas/ibm,max-associativity-domains
> > 		 00000005 00000001 00000008 00000020 00000020 00000100
> >
> > $ cat /sys/devices/system/node/possible ##Before patch
> > 0-31
> >
> > $ cat /sys/devices/system/node/possible ##After patch
> > 0-1
> >
> > Note the maximum nodes this platform can support is only 2 but the
> > possible nodes is set to 32.
> 
> But what about LPM to a system with more nodes?
> 

I have very less info on LPM, so I checked with Nathan Lynch before posting
and as per Nathan in the current design of LPM, Linux wouldn't use the new
node numbers.
Nathan Lynch July 10, 2020, 5:41 p.m. UTC | #8
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-07 15:02:17]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
>> > /proc/device-tree/rtas/ibm,current-associativity-domains
>> > 		 00000005 00000001 00000002 00000002 00000002 00000010
>> > /proc/device-tree/rtas/ibm,max-associativity-domains
>> > 		 00000005 00000001 00000008 00000020 00000020 00000100
>> >
>> > $ cat /sys/devices/system/node/possible ##Before patch
>> > 0-31
>> >
>> > $ cat /sys/devices/system/node/possible ##After patch
>> > 0-1
>> >
>> > Note the maximum nodes this platform can support is only 2 but the
>> > possible nodes is set to 32.
>> 
>> But what about LPM to a system with more nodes?
>> 
>
> I have very less info on LPM, so I checked with Nathan Lynch before posting
> and as per Nathan in the current design of LPM, Linux wouldn't use the new
> node numbers.

(I see a v2 has been posted already but I needed a little time to check
with our hypervisor people on this point.)

It's less of a design and more of a least-bad option in the absence of a
more flexible NUMA architecture in Linux.

For now, the "rule" with LPM and NUMA has to be that Linux uses the NUMA
information from the device tree that it was booted with, and it must
disregard ibm,associativity and similar information after LPM or certain
other platform events. Historically there has been code that tried to
honor changes in NUMA information but it caused much worse problems than
degraded performance. That code has been disabled by default since last
year and is now subject to removal:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897

Most NUMA-aware code happens to follow that rule because the device tree
associativity information tends to get cached on first access in Linux's
own data structures. It all feels a little fragile to me though,
especially since we can DLPAR add processors and memory after LPM with
"new" associativity properties which don't relate to the logical
topology Linux has already built. However, on currently available hw, as
long as we're using ibm,max-associativity-domains to limit the set of
possible nodes, I believe such resources will always receive valid (but
possibly suboptimal) NUMA assignments. That's because as of this
writing ibm,max-associativity-domains has the same contents on all
currently available PowerVM systems.

Now if we change to using ibm,current-associativity-domains, which we
*can* expect to differ between differently configured systems, post-LPM
DLPAR additions can yield resources with node assignments that
fall outside of the possible range, especially when we've migrated from
a smaller system to a larger one.

Is the current code robust against that possibility? I don't think so:
it looks like of_node_to_nid_single(), of_drconf_to_nid_single() and
possibly more code need to guard against this in order to prevent
NODE_DATA() null dereferences and the like. Probably those functions
should be made to clamp the nid assignment at num_possible_nodes()
instead of MAX_NUMNODES, which strikes me as more correct regardless of
your patch.
diff mbox series

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..3d55cef1a2dc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -897,7 +897,7 @@  static void __init find_possible_nodes(void)
 		return;
 
 	if (of_property_read_u32_index(rtas,
-				"ibm,max-associativity-domains",
+				"ibm,current-associativity-domains",
 				min_common_depth, &numnodes))
 		goto out;