Message ID: a5f7fdb2-d974-03c6-5ecc-c7f726f2b244@linux.vnet.ibm.com (mailing list archive)
State: Superseded
Series: powerpc/nodes/hotplug: Fix problem with memoryless nodes
Michael Bringmann <mwb@linux.vnet.ibm.com> writes:

> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU,
> it may occur that the new resources are to be inserted into nodes
> that were not used for memory resources at bootup. Many different
> configurations of PowerPC resources may need to be supported depending
> upon the environment.

Give me some detail please?!

> This patch fixes some problems encountered at

What problems?

> runtime with configurations that support memory-less nodes, but which
> allow CPUs to be added at and after boot.

How does it fix those problems?

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b385cd0..e811dd1 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>  	return rc;
>  }
>
> +static int verify_node_preparation(int nid)
> +{

I would not expect a function called "verify" ...

> +	if ((NODE_DATA(nid) == NULL) ||
> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
> +		if (try_online_node(nid))

.. to do something like online a node.

> +			return first_online_node;
> +	}
> +
> +	return nid;
> +}
> +
>  /*
>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>   * characteristics change. This function doesn't perform any locking and is
> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>  		/* Use associativity from first thread for all siblings */
>  		vphn_get_associativity(cpu, associativity);
>  		new_nid = associativity_to_nid(associativity);
> -		if (new_nid < 0 || !node_online(new_nid))
> +		if (new_nid < 0 || !node_possible(new_nid))
>  			new_nid = first_online_node;
>
> +		new_nid = verify_node_preparation(new_nid);

You're being called part-way through CPU hotplug here, are we sure it's
safe to go and do memory hotplug from there? What's the locking
situation?

cheers
On 10/16/2017 07:54 AM, Michael Ellerman wrote:
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>
>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU,
>> it may occur that the new resources are to be inserted into nodes
>> that were not used for memory resources at bootup. Many different
>> configurations of PowerPC resources may need to be supported depending
>> upon the environment.
>
> Give me some detail please?!

Configurations that demonstrated problems included 'memoryless' nodes
that possessed only CPUs at boot, and nodes that contained neither CPUs
nor memory at boot. The calculations in the kernel resulted in a
different node use layout on many SAP HANA configured systems.

>
>> This patch fixes some problems encountered at
>
> What problems?

The previous implementation collapsed all node assignments after
affinity calculations to use only those nodes that had memory at boot.
This resulted in calculation and configuration differences between the
FSP code and the Linux kernel.

>
>> runtime with configurations that support memory-less nodes, but which
>> allow CPUs to be added at and after boot.
>
> How does it fix those problems?

The change involves completing the initialization of nodes that were
not used at boot, but which were selected by VPHN affinity calculations
during subsequent hotplug operations.

Michael

>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index b385cd0..e811dd1 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>>  	return rc;
>>  }
>>
>> +static int verify_node_preparation(int nid)
>> +{
>
> I would not expect a function called "verify" ...
>
>> +	if ((NODE_DATA(nid) == NULL) ||
>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>> +		if (try_online_node(nid))
>
> .. to do something like online a node.
>
>> +			return first_online_node;
>> +	}
>> +
>> +	return nid;
>> +}
>> +
>>  /*
>>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>>   * characteristics change. This function doesn't perform any locking and is
>> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>>  		/* Use associativity from first thread for all siblings */
>>  		vphn_get_associativity(cpu, associativity);
>>  		new_nid = associativity_to_nid(associativity);
>> -		if (new_nid < 0 || !node_online(new_nid))
>> +		if (new_nid < 0 || !node_possible(new_nid))
>>  			new_nid = first_online_node;
>>
>> +		new_nid = verify_node_preparation(new_nid);
>
> You're being called part-way through CPU hotplug here, are we sure it's
> safe to go and do memory hotplug from there? What's the locking
> situation?
>
> cheers
Hello:
See below.

On 10/16/2017 07:54 AM, Michael Ellerman wrote:
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>
>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU,
>> it may occur that the new resources are to be inserted into nodes
>> that were not used for memory resources at bootup. Many different
>> configurations of PowerPC resources may need to be supported depending
>> upon the environment.
>
> Give me some detail please?!

The most important characteristics that I have observed are:

* Dedicated vs. shared resources. Shared resources require information
  such as the VPHN hcall for CPU assignment to nodes.
* Memoryless nodes at boot. Nodes need to be defined as 'possible' at
  boot for operation with other code modules. Previously, the powerpc
  code would limit the set of possible/online nodes to those which have
  memory assigned at boot. Subsequent add/remove of CPUs or memory
  would only work with this subset of possible nodes.
* Memoryless nodes with CPUs at boot. Due to the previous restriction
  on nodes, nodes that had CPUs but no memory were being collapsed into
  other nodes that did have memory at boot. In practice this meant that
  the node assignment presented by the runtime kernel differed from the
  affinity and associativity attributes presented by the device tree or
  VPHN hcalls. Nodes that might be known to the pHyp were not
  'possible' in the runtime kernel because they did not have memory at
  boot.

>
>> This patch fixes some problems encountered at
>
> What problems?

This patch set fixes a couple of problems.

* Nodes known to powerpc to be memoryless at boot, but to have CPUs in
  them, are allowed to be 'possible' and 'online'. Memory allocations
  for those nodes are taken from another node that does have memory
  until and if memory is hot-added to the node.
* Nodes which have no resources assigned at boot, but which may still
  be referenced subsequently by affinity or associativity attributes,
  are kept in the list of 'possible' nodes for powerpc. Hot-add of
  memory or CPUs to the system can reference these nodes and bring them
  online instead of redirecting the resources to the set of nodes known
  to have memory at boot.

>
>> runtime with configurations that support memory-less nodes, but which
>> allow CPUs to be added at and after boot.
>
> How does it fix those problems?

This problem was fixed in a couple of ways. First, the code now checks
whether the node to which a CPU is mapped by 'numa_update_cpu_topology'
/ 'arch_update_cpu_topology' has been initialized and has memory
available. If either test is false, a call is made to
'try_online_node()' to finish the data structure initialization. Only
if we are unable to initialize the node at this point will the CPU node
assignment be collapsed into an existing node. After initialization by
'try_online_node()', calls to 'local_memory_node' no longer crash for
these memoryless nodes.

>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index b385cd0..e811dd1 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>>  	return rc;
>>  }
>>
>> +static int verify_node_preparation(int nid)
>> +{
>
> I would not expect a function called "verify" ...
>
>> +	if ((NODE_DATA(nid) == NULL) ||
>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>> +		if (try_online_node(nid))
>
> .. to do something like online a node.

We have changed the function name to 'find_cpu_nid'.

>
>> +			return first_online_node;
>> +	}
>> +
>> +	return nid;
>> +}
>> +
>>  /*
>>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>>   * characteristics change. This function doesn't perform any locking and is
>> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>>  		/* Use associativity from first thread for all siblings */
>>  		vphn_get_associativity(cpu, associativity);
>>  		new_nid = associativity_to_nid(associativity);
>> -		if (new_nid < 0 || !node_online(new_nid))
>> +		if (new_nid < 0 || !node_possible(new_nid))
>>  			new_nid = first_online_node;
>>
>> +		new_nid = verify_node_preparation(new_nid);
>
> You're being called part-way through CPU hotplug here, are we sure it's
> safe to go and do memory hotplug from there? What's the locking
> situation?

We are not doing memory hotplug. We are initializing a node that may be
used by CPUs or memory before it can be referenced as invalid by a CPU
hotplug operation. CPU hotplug operations are protected by a range of
APIs including cpu_maps_update_begin/cpu_maps_update_done,
cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
Memory hotplug operations, including try_online_node, are protected by
mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the case
of CPUs being hot-added to a previously memoryless node, the
try_online_node operation occurs wholly within the CPU locks with no
overlap. Using HMC hot-add/hot-remove operations, I have been able to
add and remove CPUs to any possible node without failures. HMC
operations involve a degree of self-serialization, though.

>
> cheers
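[The node-preparation fallback discussed in this thread can be sketched as a small user-space model. This is a simplified illustration, not the kernel code: node_data, try_online_node(), and the node masks below are hypothetical stand-ins for the kernel symbols, and real onlining does far more work than this stub.]

```c
#include <stddef.h>

/* Simplified user-space model -- NOT the kernel implementation.
 * Four fake nodes; node 3 is not 'possible', so onlining it fails. */
#define MAX_NUMNODES 4

struct pglist_data {
	unsigned long node_spanned_pages;
};

static struct pglist_data *node_data[MAX_NUMNODES];   /* all NULL at "boot" */
static struct pglist_data backing[MAX_NUMNODES];
static int node_possible_map[MAX_NUMNODES] = { 1, 1, 1, 0 };
static int first_online_node = 0;

/* Stand-in for try_online_node(): succeeds (returns 0) only for
 * possible nodes, and "initializes" the node's pgdat. */
static int try_online_node(int nid)
{
	if (!node_possible_map[nid])
		return -1;
	node_data[nid] = &backing[nid];
	return 0;
}

/* Mirrors the patch's verify_node_preparation()/find_cpu_nid logic:
 * if the node's pgdat is missing or spans no pages, try to online it;
 * collapse to first_online_node only when onlining fails. */
static int find_cpu_nid(int nid)
{
	if (node_data[nid] == NULL ||
	    node_data[nid]->node_spanned_pages == 0) {
		if (try_online_node(nid))
			return first_online_node;
	}
	return nid;
}
```

Under this model, a CPU mapped to a possible-but-uninitialized node keeps its node id (the node gets onlined on the way), while a CPU mapped to an impossible node falls back to first_online_node.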
On 11/15/2017 12:28 PM, Michael Bringmann wrote:
> Hello:
> See below.
>
> On 10/16/2017 07:54 AM, Michael Ellerman wrote:
>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>
>>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU,
>>> it may occur that the new resources are to be inserted into nodes
>>> that were not used for memory resources at bootup. Many different
>>> configurations of PowerPC resources may need to be supported depending
>>> upon the environment.
>>
>> Give me some detail please?!
>
> The most important characteristics that I have observed are:
>
> * Dedicated vs. shared resources. Shared resources require information
>   such as the VPHN hcall for CPU assignment to nodes.
> * Memoryless nodes at boot. Nodes need to be defined as 'possible' at
>   boot for operation with other code modules. Previously, the powerpc
>   code would limit the set of possible/online nodes to those which have
>   memory assigned at boot. Subsequent add/remove of CPUs or memory
>   would only work with this subset of possible nodes.
> * Memoryless nodes with CPUs at boot. Due to the previous restriction
>   on nodes, nodes that had CPUs but no memory were being collapsed into
>   other nodes that did have memory at boot. In practice this meant that
>   the node assignment presented by the runtime kernel differed from the
>   affinity and associativity attributes presented by the device tree or
>   VPHN hcalls. Nodes that might be known to the pHyp were not
>   'possible' in the runtime kernel because they did not have memory at
>   boot.
>
>>
>>> This patch fixes some problems encountered at
>>
>> What problems?
>
> This patch set fixes a couple of problems.
>
> * Nodes known to powerpc to be memoryless at boot, but to have CPUs in
>   them, are allowed to be 'possible' and 'online'. Memory allocations
>   for those nodes are taken from another node that does have memory
>   until and if memory is hot-added to the node.
> * Nodes which have no resources assigned at boot, but which may still
>   be referenced subsequently by affinity or associativity attributes,
>   are kept in the list of 'possible' nodes for powerpc. Hot-add of
>   memory or CPUs to the system can reference these nodes and bring them
>   online instead of redirecting the resources to the set of nodes known
>   to have memory at boot.
>
>>
>>> runtime with configurations that support memory-less nodes, but which
>>> allow CPUs to be added at and after boot.
>>
>> How does it fix those problems?
>
> This problem was fixed in a couple of ways. First, the code now checks
> whether the node to which a CPU is mapped by 'numa_update_cpu_topology'
> / 'arch_update_cpu_topology' has been initialized and has memory
> available. If either test is false, a call is made to
> 'try_online_node()' to finish the data structure initialization. Only
> if we are unable to initialize the node at this point will the CPU node
> assignment be collapsed into an existing node. After initialization by
> 'try_online_node()', calls to 'local_memory_node' no longer crash for
> these memoryless nodes.
>
>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index b385cd0..e811dd1 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>>>  	return rc;
>>>  }
>>>
>>> +static int verify_node_preparation(int nid)
>>> +{
>>
>> I would not expect a function called "verify" ...
>>
>>> +	if ((NODE_DATA(nid) == NULL) ||
>>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>>> +		if (try_online_node(nid))
>>
>> .. to do something like online a node.
>
> We have changed the function name to 'find_cpu_nid'.

Ok, but I would still not expect 'find_cpu_nid' to online the node.

>
>>
>>> +			return first_online_node;
>>> +	}
>>> +
>>> +	return nid;
>>> +}
>>> +
>>>  /*
>>>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>>>   * characteristics change. This function doesn't perform any locking and is
>>> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>>>  		/* Use associativity from first thread for all siblings */
>>>  		vphn_get_associativity(cpu, associativity);
>>>  		new_nid = associativity_to_nid(associativity);
>>> -		if (new_nid < 0 || !node_online(new_nid))
>>> +		if (new_nid < 0 || !node_possible(new_nid))
>>>  			new_nid = first_online_node;
>>>
>>> +		new_nid = verify_node_preparation(new_nid);
>>
>> You're being called part-way through CPU hotplug here, are we sure it's
>> safe to go and do memory hotplug from there? What's the locking
>> situation?
>
> We are not doing memory hotplug. We are initializing a node that may be
> used by CPUs or memory before it can be referenced as invalid by a CPU
> hotplug operation. CPU hotplug operations are protected by a range of
> APIs including cpu_maps_update_begin/cpu_maps_update_done,
> cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
> Memory hotplug operations, including try_online_node, are protected by
> mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the case
> of CPUs being hot-added to a previously memoryless node, the
> try_online_node operation occurs wholly within the CPU locks with no
> overlap. Using HMC hot-add/hot-remove operations, I have been able to
> add and remove CPUs to any possible node without failures. HMC
> operations involve a degree of self-serialization, though.

For both memory and cpu DLPAR operations we hold the device hotplug
lock.

-Nathan

>
>>
>> cheers
>>>
>>>> +	if ((NODE_DATA(nid) == NULL) ||
>>>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>>>> +		if (try_online_node(nid))
>>>
>>> .. to do something like online a node.
>>
>> We have changed the function name to 'find_cpu_nid'.
>
> Ok, but I would still not expect 'find_cpu_nid' to online the node.
>

We would have to talk to the developer that created try_online_node(),
which fully initializes the node and all of the related data structures.
A few of the APIs are external, and 'numa.c' knows how to allocate the
base 'pgdat' structure, but everything else that the kernel depends upon
for a node is handled in mm/page_alloc.c and mm/memory_hotplug.c. I was
trying to avoid piecemeal changes to that code -- avoid any changes if
it comes to it. Even if it was not expected to put the node online, it
is convenient, as otherwise the patches to 'numa.c' would have to put
the node online -- which is expected for a CPU that is online.

Regards,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b385cd0..e811dd1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -551,7 +551,7 @@ static int numa_setup_cpu(unsigned long lcpu)
 		nid = of_node_to_nid_single(cpu);
 
 out_present:
-	if (nid < 0 || !node_online(nid))
+	if (nid < 0 || !node_possible(nid))
 		nid = first_online_node;
 
 	map_cpu_to_node(lcpu, nid);
@@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
 	return rc;
 }
 
+static int verify_node_preparation(int nid)
+{
+	if ((NODE_DATA(nid) == NULL) ||
+	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
+		if (try_online_node(nid))
+			return first_online_node;
+	}
+
+	return nid;
+}
+
 /*
  * Update the CPU maps and sysfs entries for a single CPU when its NUMA
  * characteristics change. This function doesn't perform any locking and is
@@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
 		/* Use associativity from first thread for all siblings */
 		vphn_get_associativity(cpu, associativity);
 		new_nid = associativity_to_nid(associativity);
-		if (new_nid < 0 || !node_online(new_nid))
+		if (new_nid < 0 || !node_possible(new_nid))
 			new_nid = first_online_node;
 
+		new_nid = verify_node_preparation(new_nid);
+
 		if (new_nid == numa_cpu_lookup_table[cpu]) {
 			cpumask_andnot(&cpu_associativity_changes_mask,
 					&cpu_associativity_changes_mask,
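[The effect of the node_online() → node_possible() change in the patch can be illustrated with a minimal user-space model. The node masks and helper names below are hypothetical stand-ins, not kernel code: with the old check, a CPU whose associativity maps it to a possible-but-offline (memoryless) node is silently collapsed into first_online_node; with the new check it keeps its real node id.]

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's node masks. Node 2 is
 * known to firmware ('possible') but not online at boot because it
 * had no memory; node 3 does not exist at all. */
static bool node_online_map[4]   = { true, true, false, false };
static bool node_possible_map[4] = { true, true, true,  false };
static const int first_online_node = 0;

/* Old check: collapse unless the node is already online. */
static int pick_nid_old(int nid)
{
	if (nid < 0 || !node_online_map[nid])
		return first_online_node;
	return nid;
}

/* New check (as in the patch): collapse only if the node is
 * not even possible. */
static int pick_nid_new(int nid)
{
	if (nid < 0 || !node_possible_map[nid])
		return first_online_node;
	return nid;
}
```

A CPU hot-added to node 2 gets nid 0 under the old check but keeps nid 2 under the new one, which is why the patch can then go on to prepare (online) that node.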
powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU,
it may occur that the new resources are to be inserted into nodes that
were not used for memory resources at bootup. Many different
configurations of PowerPC resources may need to be supported depending
upon the environment. This patch fixes some problems encountered at
runtime with configurations that support memory-less nodes, but which
allow CPUs to be added at and after boot.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)