Message ID | 20200706064002.14848-1-srikar@linux.vnet.ibm.com (mailing list archive)
---|---
State | Superseded
Series | powerpc/numa: Restrict possible nodes based on platform
On 7/5/20 11:40 PM, Srikar Dronamraju wrote:
> As per PAPR, there are 2 device tree properties:
> ibm,max-associativity-domains (which defines the maximum number of
> domains that the firmware, i.e. PowerVM, can support) and
> ibm,current-associativity-domains (which defines the maximum number of
> domains that the platform can support). The value of the
> ibm,max-associativity-domains property is always greater than or equal
> to that of the ibm,current-associativity-domains property.
>
> Powerpc currently uses the ibm,max-associativity-domains property while
> setting the possible number of nodes. This is currently set at 32.
> However, the possible number of nodes for a platform may be significantly
> less. Hence, set the possible number of nodes based on the
> ibm,current-associativity-domains property.
>
> $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
> /proc/device-tree/rtas/ibm,current-associativity-domains
> 00000005 00000001 00000002 00000002 00000002 00000010
> /proc/device-tree/rtas/ibm,max-associativity-domains
> 00000005 00000001 00000008 00000020 00000020 00000100
>
> $ cat /sys/devices/system/node/possible ##Before patch
> 0-31
>
> $ cat /sys/devices/system/node/possible ##After patch
> 0-1
>
> Note the maximum number of nodes this platform can support is only 2,
> but the possible nodes count is set to 32.
>
> This is important because a lot of kernel and user space code allocates
> structures for all possible nodes, leading to a lot of memory that is
> allocated but not used.
>
> I ran a simple experiment to create and destroy 100 memory cgroups on
> boot on an 8-node machine (Power8 Alpine).
>
> Before patch
>
> free -k at boot
>               total     used       free  shared  buff/cache  available
> Mem:      523498176  4106816  518820608   22272      570752  516606720
> Swap:       4194240        0    4194240
>
> free -k after creating 100 memory cgroups
>               total     used       free  shared  buff/cache  available
> Mem:      523498176  4628416  518246464   22336      623296  516058688
> Swap:       4194240        0    4194240
>
> free -k after destroying 100 memory cgroups
>               total     used       free  shared  buff/cache  available
> Mem:      523498176  4697408  518173760   22400      627008  515987904
> Swap:       4194240        0    4194240
>
> After patch
>
> free -k at boot
>               total     used       free  shared  buff/cache  available
> Mem:      523498176  3969472  518933888   22272      594816  516731776
> Swap:       4194240        0    4194240
>
> free -k after creating 100 memory cgroups
>               total     used       free  shared  buff/cache  available
> Mem:      523498176  4181888  518676096   22208      640192  516496448
> Swap:       4194240        0    4194240
>
> free -k after destroying 100 memory cgroups
>               total     used       free  shared  buff/cache  available
> Mem:      523498176  4232320  518619904   22272      645952  516443264
> Swap:       4194240        0    4194240
>
> Observations:
> The fixed kernel takes 137344 kB (4106816 - 3969472) less to boot.
> The fixed kernel takes 309184 kB (4628416 - 4181888 - 137344) less to
> create 100 memcgs.
>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 9fcf2d195830..3d55cef1a2dc 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
>  		return;
>
>  	if (of_property_read_u32_index(rtas,
> -			"ibm,max-associativity-domains",
> +			"ibm,current-associativity-domains",

I'm not sure ibm,current-associativity-domains is guaranteed to exist on
older firmware. You may need to check that it exists and fall back to
ibm,max-associativity-domains in the event it doesn't.

-Tyrel

>  			min_common_depth, &numnodes))
>  		goto out;
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 9fcf2d195830..3d55cef1a2dc 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
>  		return;
>
>  	if (of_property_read_u32_index(rtas,
> -			"ibm,max-associativity-domains",
> +			"ibm,current-associativity-domains",
>  			min_common_depth, &numnodes))

Looks good, if ibm,current-associativity-domains[min_common_depth] actually
denotes the range of possible values, i.e. a value of 2 implies node numbers
0 and 1. PAPR+ says it's the "number of unique values", which isn't how I
would specify the property if it's supposed to express a range. But it's
probably OK...
Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
>>  		return;
>>
>>  	if (of_property_read_u32_index(rtas,
>> -			"ibm,max-associativity-domains",
>> +			"ibm,current-associativity-domains",
>
> I'm not sure ibm,current-associativity-domains is guaranteed to exist on
> older firmware. You may need to check that it exists and fall back to
> ibm,max-associativity-domains in the event it doesn't.

Yes. Looks like it's a PowerVM-specific property.
* Tyrel Datwyler <tyreld@linux.ibm.com> [2020-07-06 13:58:42]:

> On 7/5/20 11:40 PM, Srikar Dronamraju wrote:
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 9fcf2d195830..3d55cef1a2dc 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
> >  		return;
> >
> >  	if (of_property_read_u32_index(rtas,
> > -			"ibm,max-associativity-domains",
> > +			"ibm,current-associativity-domains",
>
> I'm not sure ibm,current-associativity-domains is guaranteed to exist on
> older firmware. You may need to check that it exists and fall back to
> ibm,max-associativity-domains in the event it doesn't.
>
> -Tyrel

Oh, thanks Tyrel, that's a good observation. Will fall back on
max-associativity-domains.
* Nathan Lynch <nathanl@linux.ibm.com> [2020-07-06 19:44:31]:

> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
> >>  		return;
> >>
> >>  	if (of_property_read_u32_index(rtas,
> >> -			"ibm,max-associativity-domains",
> >> +			"ibm,current-associativity-domains",
> >
> > I'm not sure ibm,current-associativity-domains is guaranteed to exist on
> > older firmware. You may need to check that it exists and fall back to
> > ibm,max-associativity-domains in the event it doesn't.
>
> Yes. Looks like it's a PowerVM-specific property.

Right, this is just a PowerVM-specific property and doesn't affect PowerNV.
On PowerNV, though we have sparse nodes, the max possible nodes value is set
correctly.
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> As per PAPR, there are 2 device tree properties:
> ibm,max-associativity-domains (which defines the maximum number of
> domains that the firmware, i.e. PowerVM, can support) and
> ibm,current-associativity-domains (which defines the maximum number of
> domains that the platform can support). The value of the
> ibm,max-associativity-domains property is always greater than or equal
> to that of the ibm,current-associativity-domains property.

Where is it documented?

It's definitely not in LoPAPR.

> Powerpc currently uses the ibm,max-associativity-domains property while
> setting the possible number of nodes. This is currently set at 32.
> However, the possible number of nodes for a platform may be significantly
> less. Hence, set the possible number of nodes based on the
> ibm,current-associativity-domains property.
>
> $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
> /proc/device-tree/rtas/ibm,current-associativity-domains
> 00000005 00000001 00000002 00000002 00000002 00000010
> /proc/device-tree/rtas/ibm,max-associativity-domains
> 00000005 00000001 00000008 00000020 00000020 00000100
>
> $ cat /sys/devices/system/node/possible ##Before patch
> 0-31
>
> $ cat /sys/devices/system/node/possible ##After patch
> 0-1
>
> Note the maximum nodes this platform can support is only 2 but the
> possible nodes is set to 32.

But what about LPM to a system with more nodes?

cheers
* Michael Ellerman <mpe@ellerman.id.au> [2020-07-07 15:02:17]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > As per PAPR, there are 2 device tree properties:
> > ibm,max-associativity-domains (which defines the maximum number of
> > domains that the firmware, i.e. PowerVM, can support) and
> > ibm,current-associativity-domains (which defines the maximum number of
> > domains that the platform can support). The value of the
> > ibm,max-associativity-domains property is always greater than or equal
> > to that of the ibm,current-associativity-domains property.
>
> Where is it documented?
>
> It's definitely not in LoPAPR.

https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf

Page number 833, which says:

  "ibm,current-associativity-domains" property name to define the current
  number of associativity domains for this platform.

  prop-encoded-array: An associativity list such that all values are the
  number of unique values that the current platform supports in that
  location. The associativity list consists of a number of entries integer
  (N) encoded as with encode-int, followed by N integers encoded as with
  encode-int, each representing the current number of unique associativity
  domains the platform supports at that level.

> > Powerpc currently uses the ibm,max-associativity-domains property while
> > setting the possible number of nodes. This is currently set at 32.
> > However, the possible number of nodes for a platform may be significantly
> > less. Hence, set the possible number of nodes based on the
> > ibm,current-associativity-domains property.
> >
> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
> > /proc/device-tree/rtas/ibm,current-associativity-domains
> > 00000005 00000001 00000002 00000002 00000002 00000010
> > /proc/device-tree/rtas/ibm,max-associativity-domains
> > 00000005 00000001 00000008 00000020 00000020 00000100
> >
> > $ cat /sys/devices/system/node/possible ##Before patch
> > 0-31
> >
> > $ cat /sys/devices/system/node/possible ##After patch
> > 0-1
> >
> > Note the maximum nodes this platform can support is only 2 but the
> > possible nodes is set to 32.
>
> But what about LPM to a system with more nodes?

I have very little info on LPM, so I checked with Nathan Lynch before
posting, and as per Nathan, in the current design of LPM, Linux wouldn't use
the new node numbers.
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-07 15:02:17]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
>> > /proc/device-tree/rtas/ibm,current-associativity-domains
>> > 00000005 00000001 00000002 00000002 00000002 00000010
>> > /proc/device-tree/rtas/ibm,max-associativity-domains
>> > 00000005 00000001 00000008 00000020 00000020 00000100
>> >
>> > $ cat /sys/devices/system/node/possible ##Before patch
>> > 0-31
>> >
>> > $ cat /sys/devices/system/node/possible ##After patch
>> > 0-1
>> >
>> > Note the maximum nodes this platform can support is only 2 but the
>> > possible nodes is set to 32.
>>
>> But what about LPM to a system with more nodes?
>
> I have very little info on LPM, so I checked with Nathan Lynch before
> posting, and as per Nathan, in the current design of LPM, Linux wouldn't
> use the new node numbers.

(I see a v2 has been posted already, but I needed a little time to check
with our hypervisor people on this point.)

It's less of a design and more of a least-bad option in the absence of a
more flexible NUMA architecture in Linux. For now, the "rule" with LPM and
NUMA has to be that Linux uses the NUMA information from the device tree it
was booted with, and it must disregard ibm,associativity and similar
information after LPM or certain other platform events. Historically there
has been code that tried to honor changes in NUMA information, but it
caused much worse problems than degraded performance. That code has been
disabled by default since last year and is now subject to removal:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897

Most NUMA-aware code happens to follow that rule because the device tree
associativity information tends to get cached on first access in Linux's
own data structures.
It all feels a little fragile to me, though, especially since we can
DLPAR-add processors and memory after LPM with "new" associativity
properties which don't relate to the logical topology Linux has already
built. However, on currently available hardware, as long as we're using
ibm,max-associativity-domains to limit the set of possible nodes, I believe
such resources will always receive valid (but possibly suboptimal) NUMA
assignments. That's because, as of this writing,
ibm,max-associativity-domains has the same contents on all currently
available PowerVM systems.

Now, if we change to using ibm,current-associativity-domains, which we
*can* expect to differ between differently configured systems, post-LPM
DLPAR additions can yield resources with node assignments that fall outside
of the possible range, especially when we've migrated from a smaller system
to a larger one. Is the current code robust against that possibility? I
don't think so: it looks like of_node_to_nid_single(),
of_drconf_to_nid_single(), and possibly more code need to guard against
this in order to prevent NODE_DATA() NULL dereferences and the like.
Probably those functions should be made to clamp the nid assignment at
num_possible_nodes() instead of MAX_NUMNODES, which strikes me as more
correct regardless of your patch.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..3d55cef1a2dc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -897,7 +897,7 @@ static void __init find_possible_nodes(void)
 		return;
 
 	if (of_property_read_u32_index(rtas,
-			"ibm,max-associativity-domains",
+			"ibm,current-associativity-domains",
 			min_common_depth, &numnodes))
 		goto out;
As per PAPR, there are 2 device tree properties:
ibm,max-associativity-domains (which defines the maximum number of
domains that the firmware, i.e. PowerVM, can support) and
ibm,current-associativity-domains (which defines the maximum number of
domains that the platform can support). The value of the
ibm,max-associativity-domains property is always greater than or equal
to that of the ibm,current-associativity-domains property.

Powerpc currently uses the ibm,max-associativity-domains property while
setting the possible number of nodes. This is currently set at 32.
However, the possible number of nodes for a platform may be significantly
less. Hence, set the possible number of nodes based on the
ibm,current-associativity-domains property.

$ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
/proc/device-tree/rtas/ibm,current-associativity-domains
00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
00000005 00000001 00000008 00000020 00000020 00000100

$ cat /sys/devices/system/node/possible ##Before patch
0-31

$ cat /sys/devices/system/node/possible ##After patch
0-1

Note the maximum number of nodes this platform can support is only 2,
but the possible nodes count is set to 32.

This is important because a lot of kernel and user space code allocates
structures for all possible nodes, leading to a lot of memory that is
allocated but not used.

I ran a simple experiment to create and destroy 100 memory cgroups on
boot on an 8-node machine (Power8 Alpine).
Before patch

free -k at boot
              total     used       free  shared  buff/cache  available
Mem:      523498176  4106816  518820608   22272      570752  516606720
Swap:       4194240        0    4194240

free -k after creating 100 memory cgroups
              total     used       free  shared  buff/cache  available
Mem:      523498176  4628416  518246464   22336      623296  516058688
Swap:       4194240        0    4194240

free -k after destroying 100 memory cgroups
              total     used       free  shared  buff/cache  available
Mem:      523498176  4697408  518173760   22400      627008  515987904
Swap:       4194240        0    4194240

After patch

free -k at boot
              total     used       free  shared  buff/cache  available
Mem:      523498176  3969472  518933888   22272      594816  516731776
Swap:       4194240        0    4194240

free -k after creating 100 memory cgroups
              total     used       free  shared  buff/cache  available
Mem:      523498176  4181888  518676096   22208      640192  516496448
Swap:       4194240        0    4194240

free -k after destroying 100 memory cgroups
              total     used       free  shared  buff/cache  available
Mem:      523498176  4232320  518619904   22272      645952  516443264
Swap:       4194240        0    4194240

Observations:
The fixed kernel takes 137344 kB (4106816 - 3969472) less to boot.
The fixed kernel takes 309184 kB (4628416 - 4181888 - 137344) less to create
100 memcgs.

Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)