[2/2] powerpc/hotplug: Ensure nodes initialized for hotplug

Message ID a5f7fdb2-d974-03c6-5ecc-c7f726f2b244@linux.vnet.ibm.com (mailing list archive)
State Superseded
Series powerpc/nodes/hotplug: Fix problem with memoryless nodes

Commit Message

Michael Bringmann Sept. 18, 2017, 6:28 p.m. UTC
powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPUs,
it may occur that the new resources are to be inserted into nodes
that were not used for memory resources at bootup.  Many different
configurations of PowerPC resources may need to be supported depending
upon the environment.  This patch fixes some problems encountered at
runtime with configurations that support memory-less nodes, but which
allow CPUs to be added at and after boot.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Comments

Michael Ellerman Oct. 16, 2017, 12:54 p.m. UTC | #1
Michael Bringmann <mwb@linux.vnet.ibm.com> writes:

> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPUs,
> it may occur that the new resources are to be inserted into nodes
> that were not used for memory resources at bootup.  Many different
> configurations of PowerPC resources may need to be supported depending
> upon the environment.

Give me some detail please?!

> This patch fixes some problems encountered at

What problems?

> runtime with configurations that support memory-less nodes, but which
> allow CPUs to be added at and after boot.

How does it fix those problems?

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b385cd0..e811dd1 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>  	return rc;
>  }
>  
> +static int verify_node_preparation(int nid)
> +{

I would not expect a function called "verify" ...

> +	if ((NODE_DATA(nid) == NULL) ||
> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
> +		if (try_online_node(nid))

.. to do something like online a node.

> +			return first_online_node;
> +	}
> +
> +	return nid;
> +}
> +
>  /*
>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>   * characteristics change. This function doesn't perform any locking and is
> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>  		/* Use associativity from first thread for all siblings */
>  		vphn_get_associativity(cpu, associativity);
>  		new_nid = associativity_to_nid(associativity);
> -		if (new_nid < 0 || !node_online(new_nid))
> +		if (new_nid < 0 || !node_possible(new_nid))
>  			new_nid = first_online_node;
>  
> +		new_nid = verify_node_preparation(new_nid);

You're being called part-way through CPU hotplug here, are we sure it's
safe to go and do memory hotplug from there? What's the locking
situation?

cheers
Michael Bringmann Oct. 17, 2017, 3:08 p.m. UTC | #2
On 10/16/2017 07:54 AM, Michael Ellerman wrote:
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> 
>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPUs,
>> it may occur that the new resources are to be inserted into nodes
>> that were not used for memory resources at bootup.  Many different
>> configurations of PowerPC resources may need to be supported depending
>> upon the environment.
> 
> Give me some detail please?!

Configurations that demonstrated problems included 'memoryless' nodes
that possessed only CPUs at boot, and nodes that contained neither CPUs
nor memory at boot.  The calculations in the kernel resulted in a different
node usage layout on many SAP HANA-configured systems.

> 
>> This patch fixes some problems encountered at
> 
> What problems?

The previous implementation collapsed all node assignments after affinity
calculations to use only those nodes that had memory at boot.  This
resulted in calculation and configuration differences between the FSP
code and the Linux kernel.

> 
>> runtime with configurations that support memory-less nodes, but which
>> allow CPUs to be added at and after boot.
> 
> How does it fix those problems?

The change involves completing the initialization of nodes that were not
used at boot, but which were selected by VPHN affinity calculations during
subsequent hotplug operations.

Michael

> 
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index b385cd0..e811dd1 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>>  	return rc;
>>  }
>>  
>> +static int verify_node_preparation(int nid)
>> +{
> 
> I would not expect a function called "verify" ...
> 
>> +	if ((NODE_DATA(nid) == NULL) ||
>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>> +		if (try_online_node(nid))
> 
> .. to do something like online a node.
> 
>> +			return first_online_node;
>> +	}
>> +
>> +	return nid;
>> +}
>> +
>>  /*
>>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>>   * characteristics change. This function doesn't perform any locking and is
>> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>>  		/* Use associativity from first thread for all siblings */
>>  		vphn_get_associativity(cpu, associativity);
>>  		new_nid = associativity_to_nid(associativity);
>> -		if (new_nid < 0 || !node_online(new_nid))
>> +		if (new_nid < 0 || !node_possible(new_nid))
>>  			new_nid = first_online_node;
>>  
>> +		new_nid = verify_node_preparation(new_nid);
> 
> You're being called part-way through CPU hotplug here, are we sure it's
> safe to go and do memory hotplug from there? What's the locking
> situation?
> 
> cheers
> 
>
Michael Bringmann Nov. 15, 2017, 6:28 p.m. UTC | #3
Hello:
    See below.

On 10/16/2017 07:54 AM, Michael Ellerman wrote:
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> 
>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPUs,
>> it may occur that the new resources are to be inserted into nodes
>> that were not used for memory resources at bootup.  Many different
>> configurations of PowerPC resources may need to be supported depending
>> upon the environment.
> 
> Give me some detail please?!

The most important characteristics that I have observed are:

* Dedicated vs. shared resources.  Shared resources require information
  from sources such as the VPHN hcall to determine CPU assignment to nodes.
* Memoryless nodes at boot.  Nodes need to be defined as 'possible' at
  boot for operation with other code modules.  Previously, the powerpc
  code would limit the set of possible/online nodes to those which have
  memory assigned at boot.  Subsequent add/remove of CPUs or memory would
  only work with this subset of possible nodes (see the sketch after
  this list).
* Memoryless nodes with CPUs at boot.  Due to the previous restriction on
  nodes, nodes that had CPUs but no memory were being collapsed into other
  nodes that did have memory at boot.  In practice this meant that the
  node assignment presented by the runtime kernel differed from the affinity
  and associativity attributes presented by the device tree or VPHN hcalls.
  Nodes that might be known to the pHyp were not 'possible' in the runtime
  kernel because they did not have memory at boot.
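
For reference, the node-state distinction involved here, paraphrased from
include/linux/nodemask.h (a sketch, not a verbatim copy):

	/*
	 * node_possible(nid) - node_state(nid, N_POSSIBLE): the node may
	 *                      ever exist on this system.
	 * node_online(nid)   - node_state(nid, N_ONLINE): the node has an
	 *                      initialized pg_data_t right now.
	 */

The old code tested node_online(), which on pseries effectively meant
"had memory at boot", so any other node was collapsed into
first_online_node; the patch below switches those tests to
node_possible().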

> 
>> This patch fixes some problems encountered at
> 
> What problems?

This patch set fixes a couple of problems.

* Nodes known to powerpc to be memoryless at boot, but to have CPUs in them,
  are allowed to be 'possible' and 'online'.  Memory allocations for those
  nodes are taken from another node that does have memory unless and until
  memory is hot-added to the node.
* Nodes which have no resources assigned at boot, but which may still be
  referenced subsequently by affinity or associativity attributes, are kept
  in the list of 'possible' nodes for powerpc.  Hot-add of memory or CPUs
  to the system can reference these nodes and bring them online instead of
  redirecting the resources to the set of nodes known to have memory at boot.

> 
>> runtime with configurations that support memory-less nodes, but which
>> allow CPUs to be added at and after boot.
> 
> How does it fix those problems?

These problems were fixed in a couple of ways.  First, the code now checks
whether the node to which a CPU is mapped by 'numa_update_cpu_topology' /
'arch_update_cpu_topology' has been initialized and has memory available.
If either test is false, a call is made to 'try_online_node()' to finish
the data structure initialization.  Only if we are unable to initialize
the node at this point will the CPU node assignment be collapsed into an
existing node.  After initialization by 'try_online_node()', calls to
'local_memory_node' no longer crash for these memoryless nodes.
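
For context on the crash: 'local_memory_node' walks the node's zonelists
to find the nearest node with memory, roughly as below (a paraphrase from
memory of mm/page_alloc.c and include/linux/gfp.h of this era, not a
verbatim copy):

	int local_memory_node(int node)
	{
		struct zoneref *z;

		/* node_zonelist() expands to
		 *   NODE_DATA(node)->node_zonelists + gfp_zonelist(GFP_KERNEL)
		 * so an uninitialized node (NODE_DATA(node) == NULL)
		 * dereferences NULL here.
		 */
		z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
					 gfp_zone(GFP_KERNEL), NULL);
		return z->zone->node;
	}

Once 'try_online_node()' has allocated the pg_data_t and built the
zonelists, this walk succeeds and returns a nearby node that does have
memory.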

> 
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index b385cd0..e811dd1 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>>  	return rc;
>>  }
>>  
>> +static int verify_node_preparation(int nid)
>> +{
> 
> I would not expect a function called "verify" ...
> 
>> +	if ((NODE_DATA(nid) == NULL) ||
>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>> +		if (try_online_node(nid))
> 
> .. to do something like online a node.

We have changed the function name to 'find_cpu_nid'.
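
The renamed helper in the next revision should look essentially like this
(a sketch assuming only the rename; the body is unchanged from the patch
below):

	static int find_cpu_nid(int cpu_nid)
	{
		/* Node never initialized, or initialized without memory? */
		if ((NODE_DATA(cpu_nid) == NULL) ||
		    (NODE_DATA(cpu_nid)->node_spanned_pages == 0)) {
			/* Finish initializing it; fall back on failure */
			if (try_online_node(cpu_nid))
				return first_online_node;
		}

		return cpu_nid;
	}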

> 
>> +			return first_online_node;
>> +	}
>> +
>> +	return nid;
>> +}
>> +
>>  /*
>>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>>   * characteristics change. This function doesn't perform any locking and is
>> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>>  		/* Use associativity from first thread for all siblings */
>>  		vphn_get_associativity(cpu, associativity);
>>  		new_nid = associativity_to_nid(associativity);
>> -		if (new_nid < 0 || !node_online(new_nid))
>> +		if (new_nid < 0 || !node_possible(new_nid))
>>  			new_nid = first_online_node;
>>  
>> +		new_nid = verify_node_preparation(new_nid);
> 
> You're being called part-way through CPU hotplug here, are we sure it's
> safe to go and do memory hotplug from there? What's the locking
> situation?

We are not doing memory hotplug.  We are initializing a node that may be used
by CPUs or memory, before a CPU hotplug operation can reference it in an
uninitialized state.  CPU hotplug operations are protected by a range of APIs
including cpu_maps_update_begin/cpu_maps_update_done,
cpus_read_lock/cpus_write_lock and the matching unlocks, device locks, and
more.  Memory hotplug operations, including try_online_node, are protected
by mem_hotplug_begin/mem_hotplug_done, device locks, and more.  In the case
of CPUs being hot-added to a previously memoryless node, the try_online_node
operation occurs wholly within the CPU locks with no overlap.  Using HMC
hot-add/hot-remove operations, I have been able to add and remove CPUs on
any possible node without failures.  HMC operations involve a degree of
self-serialization, though.
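
To illustrate the nesting described above, here is a sketch of my
understanding of the sequence on the DLPAR CPU hot-add path (not verbatim
kernel code; the exact call sites may differ):

	/*
	 * dlpar_cpu_add()                device hotplug lock held
	 *   cpu_maps_update_begin();     CPU hotplug locks, kernel/cpu.c
	 *     cpus_write_lock();
	 *       ... bring the CPU up ...
	 *     cpus_write_unlock();
	 *   cpu_maps_update_done();
	 *
	 * numa_update_cpu_topology()     later, under the CPU hotplug locks
	 *   verify_node_preparation()    ('find_cpu_nid' in the next spin)
	 *     try_online_node()          takes mem_hotplug_begin()/
	 *                                mem_hotplug_done() internally, so
	 *                                the two lock classes nest rather
	 *                                than overlap
	 */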

> 
> cheers
> 
>
Nathan Fontenot Nov. 16, 2017, 4:57 p.m. UTC | #4
On 11/15/2017 12:28 PM, Michael Bringmann wrote:
> Hello:
>     See below.
> 
> On 10/16/2017 07:54 AM, Michael Ellerman wrote:
>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>
>>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPUs,
>>> it may occur that the new resources are to be inserted into nodes
>>> that were not used for memory resources at bootup.  Many different
>>> configurations of PowerPC resources may need to be supported depending
>>> upon the environment.
>>
>> Give me some detail please?!
> 
> The most important characteristics that I have observed are:
> 
> * Dedicated vs. shared resources.  Shared resources require information
>   from sources such as the VPHN hcall to determine CPU assignment to nodes.
> * Memoryless nodes at boot.  Nodes need to be defined as 'possible' at
>   boot for operation with other code modules.  Previously, the powerpc
>   code would limit the set of possible/online nodes to those which have
>   memory assigned at boot.  Subsequent add/remove of CPUs or memory would
>   only work with this subset of possible nodes (see the sketch after
>   this list).
> * Memoryless nodes with CPUs at boot.  Due to the previous restriction on
>   nodes, nodes that had CPUs but no memory were being collapsed into other
>   nodes that did have memory at boot.  In practice this meant that the
>   node assignment presented by the runtime kernel differed from the affinity
>   and associativity attributes presented by the device tree or VPHN hcalls.
>   Nodes that might be known to the pHyp were not 'possible' in the runtime
>   kernel because they did not have memory at boot.
> 
>>
>>> This patch fixes some problems encountered at
>>
>> What problems?
> 
> This patch set fixes a couple of problems.
> 
> * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them,
>   are allowed to be 'possible' and 'online'.  Memory allocations for those
>   nodes are taken from another node that does have memory unless and until
>   memory is hot-added to the node.
> * Nodes which have no resources assigned at boot, but which may still be
>   referenced subsequently by affinity or associativity attributes, are kept
>   in the list of 'possible' nodes for powerpc.  Hot-add of memory or CPUs
>   to the system can reference these nodes and bring them online instead of
>   redirecting the resources to the set of nodes known to have memory at boot.
> 
>>
>>> runtime with configurations that support memory-less nodes, but which
>>> allow CPUs to be added at and after boot.
>>
>> How does it fix those problems?
> 
> These problems were fixed in a couple of ways.  First, the code now checks
> whether the node to which a CPU is mapped by 'numa_update_cpu_topology' /
> 'arch_update_cpu_topology' has been initialized and has memory available.
> If either test is false, a call is made to 'try_online_node()' to finish
> the data structure initialization.  Only if we are unable to initialize
> the node at this point will the CPU node assignment be collapsed into an
> existing node.  After initialization by 'try_online_node()', calls to
> 'local_memory_node' no longer crash for these memoryless nodes.
> 
>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index b385cd0..e811dd1 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -1325,6 +1325,17 @@ static long vphn_get_associativity(unsigned long cpu,
>>>  	return rc;
>>>  }
>>>  
>>> +static int verify_node_preparation(int nid)
>>> +{
>>
>> I would not expect a function called "verify" ...
>>
>>> +	if ((NODE_DATA(nid) == NULL) ||
>>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>>> +		if (try_online_node(nid))
>>
>> .. to do something like online a node.
> 
> We have changed the function name to 'find_cpu_nid'.

Ok, but I would still not expect 'find_cpu_nid' to online the node.

> 
>>
>>> +			return first_online_node;
>>> +	}
>>> +
>>> +	return nid;
>>> +}
>>> +
>>>  /*
>>>   * Update the CPU maps and sysfs entries for a single CPU when its NUMA
>>>   * characteristics change. This function doesn't perform any locking and is
>>> @@ -1433,9 +1444,11 @@ int numa_update_cpu_topology(bool cpus_locked)
>>>  		/* Use associativity from first thread for all siblings */
>>>  		vphn_get_associativity(cpu, associativity);
>>>  		new_nid = associativity_to_nid(associativity);
>>> -		if (new_nid < 0 || !node_online(new_nid))
>>> +		if (new_nid < 0 || !node_possible(new_nid))
>>>  			new_nid = first_online_node;
>>>  
>>> +		new_nid = verify_node_preparation(new_nid);
>>
>> You're being called part-way through CPU hotplug here, are we sure it's
>> safe to go and do memory hotplug from there? What's the locking
>> situation?
> 
> We are not doing memory hotplug.  We are initializing a node that may be used
> by CPUs or memory, before a CPU hotplug operation can reference it in an
> uninitialized state.  CPU hotplug operations are protected by a range of APIs
> including cpu_maps_update_begin/cpu_maps_update_done,
> cpus_read_lock/cpus_write_lock and the matching unlocks, device locks, and
> more.  Memory hotplug operations, including try_online_node, are protected
> by mem_hotplug_begin/mem_hotplug_done, device locks, and more.  In the case
> of CPUs being hot-added to a previously memoryless node, the try_online_node
> operation occurs wholly within the CPU locks with no overlap.  Using HMC
> hot-add/hot-remove operations, I have been able to add and remove CPUs on
> any possible node without failures.  HMC operations involve a degree of
> self-serialization, though.

For both memory and cpu DLPAR operations we hold the device hotplug lock.

-Nathan
 
> 
>>
>> cheers
>>
>>
>
Michael Bringmann Nov. 16, 2017, 5:36 p.m. UTC | #5
>>>
>>>> +	if ((NODE_DATA(nid) == NULL) ||
>>>> +	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
>>>> +		if (try_online_node(nid))
>>>
>>> .. to do something like online a node.
>>
>> We have changed the function name to 'find_cpu_nid'.
> 
> Ok, but I would still not expect 'find_cpu_nid' to online the node.
> 

We would have to talk to the developer who created try_online_node(),
which fully initializes the node and all of the related data structures.
A few of the APIs are external, and 'numa.c' knows how to allocate the
base 'pgdat' structure, but everything else that the kernel depends
upon for a node is handled in mm/page_alloc.c and mm/memory_hotplug.c.
I was trying to avoid piecemeal changes to that code -- ideally, to
avoid any changes to it at all.

Even if 'find_cpu_nid' is not expected to put the node online, doing so
there is convenient: otherwise the patches to 'numa.c' would have to
online the node separately, since an online CPU is expected to map to
an online node.
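
For reference, the initialization that try_online_node() performs on our
behalf, roughly (paraphrased from mm/memory_hotplug.c of this era; details
may differ slightly between kernel versions):

	int try_online_node(int nid)
	{
		pg_data_t *pgdat;
		int ret;

		if (node_online(nid))
			return 0;

		mem_hotplug_begin();
		/* Allocate and initialize the node's pg_data_t */
		pgdat = hotadd_new_pgdat(nid, 0);
		if (!pgdat) {
			ret = -ENOMEM;
			goto out;
		}
		node_set_online(nid);
		/* Create the /sys/devices/system/node/nodeN device */
		ret = register_one_node(nid);
		BUG_ON(ret);

		/* Rebuild zonelists so allocations for this memoryless
		 * node fall back to nodes that do have memory.
		 */
		build_all_zonelists(pgdat);
	out:
		mem_hotplug_done();
		return ret;
	}

That is exactly the piecemeal initialization I would rather not duplicate
in 'numa.c'.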

Regards,

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b385cd0..e811dd1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -551,7 +551,7 @@  static int numa_setup_cpu(unsigned long lcpu)
 	nid = of_node_to_nid_single(cpu);
 
 out_present:
-	if (nid < 0 || !node_online(nid))
+	if (nid < 0 || !node_possible(nid))
 		nid = first_online_node;
 
 	map_cpu_to_node(lcpu, nid);
@@ -1325,6 +1325,17 @@  static long vphn_get_associativity(unsigned long cpu,
 	return rc;
 }
 
+static int verify_node_preparation(int nid)
+{
+	if ((NODE_DATA(nid) == NULL) ||
+	    (NODE_DATA(nid)->node_spanned_pages == 0)) {
+		if (try_online_node(nid))
+			return first_online_node;
+	}
+
+	return nid;
+}
+
 /*
  * Update the CPU maps and sysfs entries for a single CPU when its NUMA
  * characteristics change. This function doesn't perform any locking and is
@@ -1433,9 +1444,11 @@  int numa_update_cpu_topology(bool cpus_locked)
 		/* Use associativity from first thread for all siblings */
 		vphn_get_associativity(cpu, associativity);
 		new_nid = associativity_to_nid(associativity);
-		if (new_nid < 0 || !node_online(new_nid))
+		if (new_nid < 0 || !node_possible(new_nid))
 			new_nid = first_online_node;
 
+		new_nid = verify_node_preparation(new_nid);
+
 		if (new_nid == numa_cpu_lookup_table[cpu]) {
 			cpumask_andnot(&cpu_associativity_changes_mask,
 					&cpu_associativity_changes_mask,