[v2] powerpc/numa: set node_possible_map to only node_online_map during boot

Message ID 20150306052750.GA9576@linux.vnet.ibm.com
State Superseded

Commit Message

Nishanth Aravamudan March 6, 2015, 5:27 a.m. UTC
On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> 
> > So if we compare to x86:
> > 
> > arch/x86/mm/numa.c::numa_init():
> > 
> >         nodes_clear(numa_nodes_parsed);
> >         nodes_clear(node_possible_map);
> >         nodes_clear(node_online_map);
> > 	...
> > 	numa_register_memblks(...);
> > 
> > arch/x86/mm/numa.c::numa_register_memblks():
> > 
> > 	node_possible_map = numa_nodes_parsed;
> > 
> > Basically, it looks like x86 NUMA init clears out possible map and
> > online map, probably for a similar reason to what I gave in the
> > changelog that by default, the possible map seems to be based off
> > MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
> > 
> > My patch was an attempt to emulate the same thing on powerpc. You are
> > right that there is a window in which the node_possible_map and
> > node_online_map are out of sync with my patch. It seems like it
> > shouldn't matter given how early in boot we are, but perhaps the
> > following would have been clearer:
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..1a118b08fad2 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,6 +958,13 @@ void __init initmem_init(void)
> >  
> >         memblock_dump_all();
> >  
> > +       /*
> > +        * Reduce the possible NUMA nodes to the online NUMA nodes,
> > +        * since we do not support node hotplug. This ensures that  we
> > +        * lower the maximum NUMA node ID to what is actually present.
> > +        */
> > +       nodes_and(node_possible_map, node_possible_map, node_online_map);
> 
> If you don't support node hotplug, then a node should always be possible 
> if it's online unless there are other tricks powerpc plays with 
> node_possible_map.  Shouldn't this just be 
> node_possible_map = node_online_map?

Yeah, but I was too dumb to think of that before sending :)

Updated version follows...

-Nish


Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (on the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which sizes the default node_possible_map, which in turn
determines the value of nr_node_ids in setup_nr_node_ids and the range
walked by for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.
    
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

---
v1 -> v2:
  Rather than clear node_possible_map and set it nid-by-nid, just
  directly assign node_online_map to it, as suggested by Michael
  Ellerman and Tejun Heo.

Comments

Raghavendra K T March 6, 2015, 11:29 a.m. UTC | #1
On 03/06/2015 10:57 AM, Nishanth Aravamudan wrote:
> On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
>> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
>>
>>> So if we compare to x86:
>>>
>>> arch/x86/mm/numa.c::numa_init():
>>>
>>>          nodes_clear(numa_nodes_parsed);
>>>          nodes_clear(node_possible_map);
>>>          nodes_clear(node_online_map);
>>> 	...
>>> 	numa_register_memblks(...);
>>>
>>> arch/x86/mm/numa.c::numa_register_memblks():
>>>
>>> 	node_possible_map = numa_nodes_parsed;
>>>
>>> Basically, it looks like x86 NUMA init clears out possible map and
>>> online map, probably for a similar reason to what I gave in the
>>> changelog that by default, the possible map seems to be based off
>>> MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
>>>
>>> My patch was an attempt to emulate the same thing on powerpc. You are
>>> right that there is a window in which the node_possible_map and
>>> node_online_map are out of sync with my patch. It seems like it
>>> shouldn't matter given how early in boot we are, but perhaps the
>>> following would have been clearer:
>>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 0257a7d659ef..1a118b08fad2 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>>>
>>>          memblock_dump_all();
>>>
>>> +       /*
>>> +        * Reduce the possible NUMA nodes to the online NUMA nodes,
>>> +        * since we do not support node hotplug. This ensures that  we
>>> +        * lower the maximum NUMA node ID to what is actually present.
>>> +        */
>>> +       nodes_and(node_possible_map, node_possible_map, node_online_map);
>>
>> If you don't support node hotplug, then a node should always be possible
>> if it's online unless there are other tricks powerpc plays with
>> node_possible_map.  Shouldn't this just be
>> node_possible_map = node_online_map?
>
> Yeah, but I was too dumb to think of that before sending :)
>
> Updated version follows...
>
> -Nish
>
---8<---
>
> Raghu noticed an issue with excessive memory allocation on power with a
> simple cgroup test, specifically, in mem_cgroup_css_alloc ->
> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
> up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
> directories).
Should we also add that after this patch it is reduced to around 2MB?
>
> The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
> possible), which defines node_possible_map, which in turn defines the
> value of nr_node_ids in setup_nr_node_ids and the iteration of
> for_each_node.
>
> In practice, we never see a system with 256 NUMA nodes, and in fact, we
> do not support node hotplug on power in the first place, so the nodes
> that are online when we come up are the nodes that will be present for
> the lifetime of this kernel. So let's, at least, drop the NUMA possible
> map down to the online map at runtime. This is similar to what x86 does
> in its initialization routines.
>
> mem_cgroup_css_alloc should also be fixed to only iterate over
> memory-populated nodes and handle hotplug, but that is a separate
> change.
>
Maybe we could formally add
Reported-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> To: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Tejun Heo <tj@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Anton Blanchard <anton@samba.org>
> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> ---
> v1 -> v2:
>    Rather than clear node_possible_map and set it nid-by-nid, just
>    directly assign node_online_map to it, as suggested by Michael
>    Ellerman and Tejun Heo.
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>
>   	memblock_dump_all();
>
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that  we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */

  Hope we remember this change when we add hotplug :)

> +	node_possible_map = node_online_map;
> +
>   	for_each_online_node(nid) {
>   		unsigned long start_pfn, end_pfn;
>
Michael Ellerman March 9, 2015, 11:55 p.m. UTC | #2
On Thu, 2015-03-05 at 21:27 -0800, Nishanth Aravamudan wrote:
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>  
>  	memblock_dump_all();
>  
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that  we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */
> +	node_possible_map = node_online_map;

That looks nice, but is it generating what we want?

i.e. is the content of node_online_map being *copied* into node_possible_map?

Or are we changing node_possible_map to point at node_online_map?

cheers

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..0c1716cd271f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@  void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * Reduce the possible NUMA nodes to the online NUMA nodes,
+	 * since we do not support node hotplug. This ensures that  we
+	 * lower the maximum NUMA node ID to what is actually present.
+	 */
+	node_possible_map = node_online_map;
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;