[RFC] powerpc/numa: reset node_possible_map to only node_online_map

Message ID 20150305180549.GA29601@linux.vnet.ibm.com (mailing list archive)
State Superseded

Commit Message

Nishanth Aravamudan March 5, 2015, 6:05 p.m. UTC
Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test: specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (on the order of 200MB for 400 cgroup
directories).
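
A simplified sketch of the allocation pattern at issue (not the exact
mm/memcontrol.c code):

	for_each_node(node)	/* walks node_possible_map */
		if (alloc_mem_cgroup_per_zone_info(memcg, node))
			goto free_out;

Each cgroup directory therefore allocates per-zone info for every
*possible* node, including nodes that can never come online.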

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
iteration of for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's at least drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.
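
Roughly, x86 narrows the possible map at boot to what it actually
parsed; paraphrasing arch/x86/mm/numa.c:

	/* numa_register_memblks(): possible = what the platform enumerated */
	node_possible_map = numa_nodes_parsed;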

One could alternatively nodes_and(node_possible_map, node_possible_map,
node_online_map), but I think the cost of ANDing the two maps will
always be higher than clearing the map and then setting the few online
bits in practice.
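
As a sketch, the two options side by side (both assume the online map
is final once the device-tree has been parsed):

	/* (a) intersect, preserving possible >= online at every step */
	nodes_and(node_possible_map, node_possible_map, node_online_map);

	/* (b) clear, then re-set exactly the online nodes (this patch) */
	nodes_clear(node_possible_map);
	for_each_online_node(nid)
		node_set(nid, node_possible_map);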

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
While looking at this, I noticed that nr_node_ids is actually a
misnomer, it seems. It's not the number of nodes but the maximum node
ID plus one: with sparse NUMA nodes, you might have only two NUMA nodes
possible, but to make certain loops work, nr_node_ids will be, e.g., 17.
Should it be changed?
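
Concretely (stats and init_node_stats are made-up names): with possible
nodes {0, 16}, nr_node_ids is 17 so that flat, ID-indexed allocations
stay in bounds, while the sparse iterator touches only the real nodes:

	/* 17 slots allocated, although only 2 nodes exist */
	stats = kcalloc(nr_node_ids, sizeof(*stats), GFP_KERNEL);

	for_each_node(nid)	/* visits only nodes 0 and 16 */
		init_node_stats(&stats[nid]);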

Comments

David Rientjes March 5, 2015, 9:16 p.m. UTC | #1
On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..24de29b3651b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,9 +958,17 @@ void __init initmem_init(void)
>  
>  	memblock_dump_all();
>  
> +	/*
> +	 * zero out the possible nodes after we parse the device-tree,
> +	 * so that we lower the maximum NUMA node ID to what is actually
> +	 * present.
> +	 */
> +	nodes_clear(node_possible_map);
> +
>  	for_each_online_node(nid) {
>  		unsigned long start_pfn, end_pfn;
>  
> +		node_set(nid, node_possible_map);
>  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  		setup_node_data(nid, start_pfn, end_pfn);
>  		sparse_memory_present_with_active_regions(nid);

This seems a bit strange; node_possible_map is supposed to be a superset
of node_online_map, and this loop is iterating over node_online_map to set
nodes in node_possible_map.
Michael Ellerman March 5, 2015, 9:48 p.m. UTC | #2
On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..24de29b3651b 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> >  
> >  	memblock_dump_all();
> >  
> > +	/*
> > +	 * zero out the possible nodes after we parse the device-tree,
> > +	 * so that we lower the maximum NUMA node ID to what is actually
> > +	 * present.
> > +	 */
> > +	nodes_clear(node_possible_map);
> > +
> >  	for_each_online_node(nid) {
> >  		unsigned long start_pfn, end_pfn;
> >  
> > +		node_set(nid, node_possible_map);
> >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >  		setup_node_data(nid, start_pfn, end_pfn);
> >  		sparse_memory_present_with_active_regions(nid);
> 
> This seems a bit strange; node_possible_map is supposed to be a superset
> of node_online_map, and this loop is iterating over node_online_map to set
> nodes in node_possible_map.
 
Yeah. Though at this point in boot I don't think it matters that the two maps
are out-of-sync temporarily.

But it would be simpler to just set the possible map to be the online map. That
would also maintain the invariant that the possible map is always a superset of
the online map.

Or did I miss a detail there (sleep deprived parent mode).

cheers
David Rientjes March 5, 2015, 9:58 p.m. UTC | #3
On Fri, 6 Mar 2015, Michael Ellerman wrote:

> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 0257a7d659ef..24de29b3651b 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > >  
> > >  	memblock_dump_all();
> > >  
> > > +	/*
> > > +	 * zero out the possible nodes after we parse the device-tree,
> > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > +	 * present.
> > > +	 */
> > > +	nodes_clear(node_possible_map);
> > > +
> > >  	for_each_online_node(nid) {
> > >  		unsigned long start_pfn, end_pfn;
> > >  
> > > +		node_set(nid, node_possible_map);
> > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >  		setup_node_data(nid, start_pfn, end_pfn);
> > >  		sparse_memory_present_with_active_regions(nid);
> > 
> > This seems a bit strange; node_possible_map is supposed to be a superset
> > of node_online_map, and this loop is iterating over node_online_map to set
> > nodes in node_possible_map.
>  
> Yeah. Though at this point in boot I don't think it matters that the two maps
> are out-of-sync temporarily.
> 
> But it would be simpler to just set the possible map to be the online map. That
> would also maintain the invariant that the possible map is always a superset of
> the online map.
> 
> Or did I miss a detail there (sleep deprived parent mode).
> 

I think reset_numa_cpu_lookup_table(), which iterates over the possible
map, and thus now only a subset of nodes, may be concerning.

I'm not sure why this is being proposed as a powerpc patch and not a patch 
for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
for all possible nodes?  We should only be allocating for online nodes in 
N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
transition from memoryless -> memory.  The extra bonus is that 
alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
TODO in that function can be removed.
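
A hedged sketch of that direction (memcg_mem_callback is a made-up
name, and a real version would walk every memcg rather than a single
one):

	static int memcg_mem_callback(struct notifier_block *self,
				      unsigned long action, void *arg)
	{
		struct memory_notify *mn = arg;

		/* a node just transitioned memoryless -> memory */
		if (action == MEM_ONLINE && mn->status_change_nid >= 0)
			alloc_mem_cgroup_per_zone_info(memcg,
						mn->status_change_nid);
		return NOTIFY_OK;
	}

	/* registered once at init time */
	hotplug_memory_notifier(memcg_mem_callback, 0);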
Tejun Heo March 5, 2015, 10:08 p.m. UTC | #4
Hello,

On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> I'm not sure why this is being proposed as a powerpc patch and not a patch 
> for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
> for all possible nodes?  We should only be allocating for online nodes in 
> N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
> callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
> transition from memoryless -> memory.  The extra bonus is that 
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
> TODO in that function can be removed.

For cpus, the general direction is allocating for all possible cpus.
For iterations, we alternate between using all possibles and onlines
depending on the use case; however, the general idea is that the
possibles and onlines aren't gonna be very different.  NR_CPUS and
MAX_NUMNODES gotta accommodate the worst possible case the kernel may
run on but the possible masks should be set to the actually possible
subset during boot so that the kernel doesn't end up allocating for and
iterating over things which can't ever exist.

It can be argued that we should always stick to the online masks for
allocation and iteration; however, that usually requires more
complexity and the only cases where this mattered have been when the
boot code got it wrong and failed to set the possible masks correctly,
which also seems to be the case here.  I don't see any reason to
deviate here.

Thanks.
Tejun Heo March 5, 2015, 10:13 p.m. UTC | #5
On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> While looking at this, I noticed that nr_node_ids is actually a
> misnomer, it seems. It's not the number of nodes but the maximum node
> ID plus one: with sparse NUMA nodes, you might have only two NUMA nodes
> possible, but to make certain loops work, nr_node_ids will be, e.g., 17.
> Should it be changed?

It's the same for nr_cpu_ids.  It's counting the number of valid IDs
during that boot instance.  In the above case, whether the nodes are
sparse or not, there exist 17 node IDs, 0 to 16.  Maybe numa_max_id
would have been a better name (but would that equal the highest number
or +1?), but nr_node_ids != nr_nodes, so I don't think it's a misnomer
either.  Doesn't really matter at this point.  Maybe add comments on
top of both?
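
For reference, a sketch of how mm/page_alloc.c derives it (paraphrased,
not verbatim): the highest possible node ID plus one, which is exactly
why a sparse topology makes the value look inflated.

	static void __init setup_nr_node_ids(void)
	{
		unsigned int node, highest = 0;

		for_each_node_mask(node, node_possible_map)
			highest = node;
		nr_node_ids = highest + 1;
	}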

Thanks.
Tejun Heo March 5, 2015, 10:18 p.m. UTC | #6
On Thu, Mar 05, 2015 at 05:08:04PM -0500, Tejun Heo wrote:
> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here.  I don't see any reason to
> deviate here.

Hmm... but yeah, as you wrote, keeping the allocation local could be a
reason, but let's please not do this just to reduce memory consumption.
If memory locality of the field affects performance noticeably, sure.

Thanks.
Nishanth Aravamudan March 5, 2015, 11:17 p.m. UTC | #7
On 06.03.2015 [08:48:52 +1100], Michael Ellerman wrote:
> On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> > On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> > 
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 0257a7d659ef..24de29b3651b 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > >  
> > >  	memblock_dump_all();
> > >  
> > > +	/*
> > > +	 * zero out the possible nodes after we parse the device-tree,
> > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > +	 * present.
> > > +	 */
> > > +	nodes_clear(node_possible_map);
> > > +
> > >  	for_each_online_node(nid) {
> > >  		unsigned long start_pfn, end_pfn;
> > >  
> > > +		node_set(nid, node_possible_map);
> > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >  		setup_node_data(nid, start_pfn, end_pfn);
> > >  		sparse_memory_present_with_active_regions(nid);
> > 
> > This seems a bit strange; node_possible_map is supposed to be a superset
> > of node_online_map, and this loop is iterating over node_online_map to set
> > nodes in node_possible_map.
>  
> Yeah. Though at this point in boot I don't think it matters that the two maps
> are out-of-sync temporarily.
> 
> But it would be simpler to just set the possible map to be the online
> map. That would also maintain the invariant that the possible map is
> always a superset of the online map.

Yes, we could do that (see my reply to David just now). I didn't
consider just setting the map directly; that would be clearer. I didn't
want to post my nodes_and() version, because the cost of nodes_and()
seemed higher than a nodes_clear() followed by the appropriate
node_set() calls.

-Nish
Nishanth Aravamudan March 5, 2015, 11:20 p.m. UTC | #8
On 05.03.2015 [13:58:27 -0800], David Rientjes wrote:
> On Fri, 6 Mar 2015, Michael Ellerman wrote:
> 
> > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > > index 0257a7d659ef..24de29b3651b 100644
> > > > --- a/arch/powerpc/mm/numa.c
> > > > +++ b/arch/powerpc/mm/numa.c
> > > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > > >  
> > > >  	memblock_dump_all();
> > > >  
> > > > +	/*
> > > > +	 * zero out the possible nodes after we parse the device-tree,
> > > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > > +	 * present.
> > > > +	 */
> > > > +	nodes_clear(node_possible_map);
> > > > +
> > > >  	for_each_online_node(nid) {
> > > >  		unsigned long start_pfn, end_pfn;
> > > >  
> > > > +		node_set(nid, node_possible_map);
> > > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > > >  		setup_node_data(nid, start_pfn, end_pfn);
> > > >  		sparse_memory_present_with_active_regions(nid);
> > > 
> > > This seems a bit strange; node_possible_map is supposed to be a superset
> > > of node_online_map, and this loop is iterating over node_online_map to set
> > > nodes in node_possible_map.
> >  
> > Yeah. Though at this point in boot I don't think it matters that the
> > two maps are out-of-sync temporarily.
> > 
> > But it would be simpler to just set the possible map to be the online
> > map. That would also maintain the invariant that the possible map is
> > always a superset of the online map.
> > 
> > Or did I miss a detail there (sleep deprived parent mode).
> > 
> 
> I think reset_numa_cpu_lookup_table(), which iterates over the possible
> map, and thus now only a subset of nodes, may be concerning.


I think you are confusing the CPU online map and the NUMA node online
map. reset_numa_cpu_lookup_table() resets a cpu->node mapping, is only
called at boot time, and iterates over the CPU online map, which is
unaltered by my patch.

> I'm not sure why this is being proposed as a powerpc patch and not a
> patch for mem_cgroup_css_alloc().

I think mem_cgroup_css_alloc() is just an example of a larger issue. I
should have made that clearer in my changelog. Even if we change
mem_cgroup_css_alloc(), I think we want to fix the node_possible_map on
powerpc to be accurate at run-time, just like x86 does.

> In other words, why do we have to allocate for all possible nodes?  We
> should only be allocating for online nodes in N_MEMORY with mem
> hotplug disabled initially and then have a mem hotplug callback
> implemented to alloc_mem_cgroup_per_zone_info() for nodes that
> transition from memoryless -> memory.  The extra bonus is that
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and
> the TODO in that function can be removed.

This is a good idea, and seems like it can be a follow-on parallel patch
to the one I provided (which does need an updated changelog now).

Thanks,
Nish
Nishanth Aravamudan March 5, 2015, 11:21 p.m. UTC | #9
On 05.03.2015 [17:08:04 -0500], Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> > I'm not sure why this is being proposed as a powerpc patch and not a patch 
> > for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
> > for all possible nodes?  We should only be allocating for online nodes in 
> > N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
> > callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
> > transition from memoryless -> memory.  The extra bonus is that 
> > alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
> > TODO in that function can be removed.
> 
> For cpus, the general direction is allocating for all possible cpus.
> For iterations, we alternate between using all possibles and onlines
> depending on the use case; however, the general idea is that the
> possibles and onlines aren't gonna be very different.  NR_CPUS and
> MAX_NUMNODES gotta accommodate the worst possible case the kernel may
> run on but the possible masks should be set to the actually possible
> subset during boot so that the kernel doesn't end up allocating for and
> iterating over things which can't ever exist.

Makes sense to me.

> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here.  I don't see any reason to
> deviate here.

So, do you agree with the general direction of my change? :)

Thanks,
Nish
Tejun Heo March 5, 2015, 11:24 p.m. UTC | #10
On Thu, Mar 05, 2015 at 03:21:35PM -0800, Nishanth Aravamudan wrote:
> So, do you agree with the general direction of my change? :)

Yeah, I mean it's an obvious bug fix.  I don't know when or how it
should be set on powerpc but if the machine can't do NUMA node
hotplug, its node online and possible masks must be equal.

Thanks.
Nishanth Aravamudan March 5, 2015, 11:27 p.m. UTC | #11
On 05.03.2015 [17:13:08 -0500], Tejun Heo wrote:
> On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> > While looking at this, I noticed that nr_node_ids is actually a
> > misnomer, it seems. It's not the number of nodes but the maximum node
> > ID plus one: with sparse NUMA nodes, you might have only two NUMA nodes
> > possible, but to make certain loops work, nr_node_ids will be, e.g., 17.
> > Should it be changed?
> 
> It's the same for nr_cpu_ids.  It's counting the number of valid IDs
> during that boot instance.  In the above case, whether the nodes are
> sparse or not, there exist 17 node IDs, 0 to 16.  Maybe numa_max_id
> would have been a better name (but would that equal the highest number
> or +1?), but nr_node_ids != nr_nodes, so I don't think it's a misnomer
> either.  Doesn't really matter at this point.  Maybe add comments on
> top of both?

Yes, I will consider that. To me, I guess it's more a matter of:

a) How does nr_node_ids relate to the number of possible NUMA node IDs
at runtime?

They are identical.

b) How does nr_node_ids relate to the number of NUMA node IDs in use?

There is no relation.

c) How does nr_node_ids relate to the maximum NUMA node ID in use?

It is one larger than that value.

However, for a), at least, we don't care about that on power, really. We
don't have node hotplug, so the "possible" is the "online" in practice,
for a given system.

Iteration seems to generally not be a problem (since we have sparse
iterators anyways) and we shouldn't be allocating for non-present nodes.

But we run into excessive allocations (I'm looking into a few others
Dipankar has found now) with array allocations based on nr_node_ids or
MAX_NUMNODES when the NUMA topology is sparse.
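
The shape of the problem, as a sketch (the array names are made up):

	/* worst case: 256 entries on power (NODES_SHIFT == 8) */
	static struct node_stats *table[MAX_NUMNODES];

	/* still 17 entries for possible nodes {0, 16}, i.e. 2 real nodes */
	counters = kcalloc(nr_node_ids, sizeof(*counters), GFP_KERNEL);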

-Nish

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * zero out the possible nodes after we parse the device-tree,
+	 * so that we lower the maximum NUMA node ID to what is actually
+	 * present.
+	 */
+	nodes_clear(node_possible_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
+		node_set(nid, node_possible_map);
 		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 		setup_node_data(nid, start_pfn, end_pfn);
 		sparse_memory_present_with_active_regions(nid);