Message ID | 20150306052750.GA9576@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Superseded |
On 03/06/2015 10:57 AM, Nishanth Aravamudan wrote:
> On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
>> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
>>
>>> So if we compare to x86:
>>>
>>> arch/x86/mm/numa.c::numa_init():
>>>
>>>     nodes_clear(numa_nodes_parsed);
>>>     nodes_clear(node_possible_map);
>>>     nodes_clear(node_online_map);
>>>     ...
>>>     numa_register_memblks(...);
>>>
>>> arch/x86/mm/numa.c::numa_register_memblks():
>>>
>>>     node_possible_map = numa_nodes_parsed;
>>>
>>> Basically, it looks like x86 NUMA init clears out possible map and
>>> online map, probably for a similar reason to what I gave in the
>>> changelog that by default, the possible map seems to be based off
>>> MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
>>>
>>> My patch was an attempt to emulate the same thing on powerpc. You are
>>> right that there is a window in which the node_possible_map and
>>> node_online_map are out of sync with my patch. It seems like it
>>> shouldn't matter given how early in boot we are, but perhaps the
>>> following would have been clearer:
>>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 0257a7d659ef..1a118b08fad2 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>>>
>>>  	memblock_dump_all();
>>>
>>> +	/*
>>> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
>>> +	 * since we do not support node hotplug. This ensures that we
>>> +	 * lower the maximum NUMA node ID to what is actually present.
>>> +	 */
>>> +	nodes_and(node_possible_map, node_possible_map, node_online_map);
>>
>> If you don't support node hotplug, then a node should always be possible
>> if it's online unless there are other tricks powerpc plays with
>> node_possible_map. Shouldn't this just be
>> node_possible_map = node_online_map?
>
> Yeah, but I was too dumb to think of that before sending :)
>
> Updated version follows...
>
> -Nish
> ---8<---
>
> Raghu noticed an issue with excessive memory allocation on power with a
> simple cgroup test, specifically, in mem_cgroup_css_alloc ->
> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
> up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
> directories).

Should we also add that after this patch it has reduced to around 2MB?

>
> The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
> possible), which defines node_possible_map, which in turn defines the
> value of nr_node_ids in setup_nr_node_ids and the iteration of
> for_each_node.
>
> In practice, we never see a system with 256 NUMA nodes, and in fact, we
> do not support node hotplug on power in the first place, so the nodes
> that are online when we come up are the nodes that will be present for
> the lifetime of this kernel. So let's, at least, drop the NUMA possible
> map down to the online map at runtime. This is similar to what x86 does
> in its initialization routines.
>
> mem_cgroup_css_alloc should also be fixed to only iterate over
> memory-populated nodes and handle hotplug, but that is a separate
> change.
>

Maybe we could formally add:
Reported-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> To: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Tejun Heo <tj@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Anton Blanchard <anton@samba.org>
> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> ---
> v1 -> v2:
> Rather than clear node_possible_map and set it nid-by-nid, just
> directly assign node_online_map to it, as suggested by Michael
> Ellerman and Tejun Heo.
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>
>  	memblock_dump_all();
>
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */

Hope we remember this change when we add hotplug :)

> +	node_possible_map = node_online_map;
> +
>  	for_each_online_node(nid) {
>  		unsigned long start_pfn, end_pfn;
>
On Thu, 2015-03-05 at 21:27 -0800, Nishanth Aravamudan wrote:
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>
>  	memblock_dump_all();
>
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */
> +	node_possible_map = node_online_map;

That looks nice, but is it generating what we want?

ie. is the content of node_online_map being *copied* into
node_possible_map?

Or are we changing node_possible_map to point at node_online_map?

cheers
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..0c1716cd271f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * Reduce the possible NUMA nodes to the online NUMA nodes,
+	 * since we do not support node hotplug. This ensures that we
+	 * lower the maximum NUMA node ID to what is actually present.
+	 */
+	node_possible_map = node_online_map;
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;