Message ID | 20150305180549.GA29601@linux.vnet.ibm.com (mailing list archive)
---|---
State | Superseded
On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..24de29b3651b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,9 +958,17 @@ void __init initmem_init(void)
>
>  	memblock_dump_all();
>
> +	/*
> +	 * zero out the possible nodes after we parse the device-tree,
> +	 * so that we lower the maximum NUMA node ID to what is actually
> +	 * present.
> +	 */
> +	nodes_clear(node_possible_map);
> +
>  	for_each_online_node(nid) {
>  		unsigned long start_pfn, end_pfn;
>
> +		node_set(nid, node_possible_map);
>  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  		setup_node_data(nid, start_pfn, end_pfn);
>  		sparse_memory_present_with_active_regions(nid);

This seems a bit strange: node_possible_map is supposed to be a superset of node_online_map, and this loop is iterating over node_online_map to set nodes in node_possible_map.
On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
>
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0257a7d659ef..24de29b3651b 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> >
> >  	memblock_dump_all();
> >
> > +	/*
> > +	 * zero out the possible nodes after we parse the device-tree,
> > +	 * so that we lower the maximum NUMA node ID to what is actually
> > +	 * present.
> > +	 */
> > +	nodes_clear(node_possible_map);
> > +
> >  	for_each_online_node(nid) {
> >  		unsigned long start_pfn, end_pfn;
> >
> > +		node_set(nid, node_possible_map);
> >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >  		setup_node_data(nid, start_pfn, end_pfn);
> >  		sparse_memory_present_with_active_regions(nid);
>
> This seems a bit strange, node_possible_map is supposed to be a superset
> of node_online_map and this loop is iterating over node_online_map to set
> nodes in node_possible_map.

Yeah. Though at this point in boot I don't think it matters that the two maps are out-of-sync temporarily.

But it would be simpler to just set the possible map to be the online map. That would also maintain the invariant that the possible map is always a superset of the online map.

Or did I miss a detail there (sleep-deprived parent mode)?

cheers
On Fri, 6 Mar 2015, Michael Ellerman wrote:

> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 0257a7d659ef..24de29b3651b 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > >
> > >  	memblock_dump_all();
> > >
> > > +	/*
> > > +	 * zero out the possible nodes after we parse the device-tree,
> > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > +	 * present.
> > > +	 */
> > > +	nodes_clear(node_possible_map);
> > > +
> > >  	for_each_online_node(nid) {
> > >  		unsigned long start_pfn, end_pfn;
> > >
> > > +		node_set(nid, node_possible_map);
> > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >  		setup_node_data(nid, start_pfn, end_pfn);
> > >  		sparse_memory_present_with_active_regions(nid);
> >
> > This seems a bit strange, node_possible_map is supposed to be a superset
> > of node_online_map and this loop is iterating over node_online_map to set
> > nodes in node_possible_map.
>
> Yeah. Though at this point in boot I don't think it matters that the two maps
> are out-of-sync temporarily.
>
> But it would be simpler to just set the possible map to be the online map.
> That would also maintain the invariant that the possible map is always a
> superset of the online map.
>
> Or did I miss a detail there (sleep deprived parent mode).

I think reset_numa_cpu_lookup_table(), which iterates over the possible map, and thus only a subset of nodes now, may be concerning.

I'm not sure why this is being proposed as a powerpc patch and not a patch for mem_cgroup_css_alloc(). In other words, why do we have to allocate for all possible nodes? We should only be allocating for online nodes in N_MEMORY with mem hotplug disabled initially, and then have a mem hotplug callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that transition from memoryless -> memory.

The extra bonus is that alloc_mem_cgroup_per_zone_info() need never allocate remote memory, and the TODO in that function can be removed.
Hello,

On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> I'm not sure why this is being proposed as a powerpc patch and not a patch
> for mem_cgroup_css_alloc(). In other words, why do we have to allocate
> for all possible nodes? We should only be allocating for online nodes in
> N_MEMORY with mem hotplug disabled initially and then have a mem hotplug
> callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that
> transition from memoryless -> memory. The extra bonus is that
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the
> TODO in that function can be removed.

For cpus, the general direction is allocating for all possible cpus. For iterations, we alternate between using all possibles and onlines depending on the use case; however, the general idea is that the possibles and onlines aren't gonna be very different. NR_CPUS and MAX_NUMNODES have to accommodate the worst possible case the kernel may run on, but the possible masks should be set to the actually possible subset during boot, so that the kernel doesn't end up allocating for and iterating over things which can't ever exist.

It can be argued that we should always stick to the online masks for allocation and iteration; however, that usually requires more complexity, and the only cases where this mattered have been when the boot code got it wrong and failed to set the possible masks correctly, which also seems to be the case here. I don't see any reason to deviate here.

Thanks.
On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> While looking at this, I noticed that nr_node_ids is actually a
> misnomer, it seems. It's not the number, but the maximum_node_id, as
> with sparse NUMA nodes, you might only have two NUMA nodes possible, but
> to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
> changed?

It's the same for nr_cpu_ids. It's counting the number of valid IDs during that boot instance. In the above case, whether the nodes are sparse or not, there exist 17 node IDs - 0 to 16. Maybe numa_max_id would have been a better name (but would that equal the highest number or +1?), but nr_node_ids != nr_nodes, so I don't think it's a misnomer either. It doesn't really matter at this point. Maybe add comments on top of both?

Thanks.
On Thu, Mar 05, 2015 at 05:08:04PM -0500, Tejun Heo wrote:
> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here. I don't see any reason to
> deviate here.

Hmm... but yeah, as you wrote, keeping the allocation local could be a reason, but let's please not do this just to reduce memory consumption. If memory locality of the field affects performance noticeably, sure.

Thanks.
On 06.03.2015 [08:48:52 +1100], Michael Ellerman wrote:
> On Thu, 2015-03-05 at 13:16 -0800, David Rientjes wrote:
> > On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
> >
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 0257a7d659ef..24de29b3651b 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > >
> > >  	memblock_dump_all();
> > >
> > > +	/*
> > > +	 * zero out the possible nodes after we parse the device-tree,
> > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > +	 * present.
> > > +	 */
> > > +	nodes_clear(node_possible_map);
> > > +
> > >  	for_each_online_node(nid) {
> > >  		unsigned long start_pfn, end_pfn;
> > >
> > > +		node_set(nid, node_possible_map);
> > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >  		setup_node_data(nid, start_pfn, end_pfn);
> > >  		sparse_memory_present_with_active_regions(nid);
> >
> > This seems a bit strange, node_possible_map is supposed to be a superset
> > of node_online_map and this loop is iterating over node_online_map to set
> > nodes in node_possible_map.
>
> Yeah. Though at this point in boot I don't think it matters that the two maps
> are out-of-sync temporarily.
>
> But it would be simpler to just set the possible map to be the online
> map. That would also maintain the invariant that the possible map is
> always a superset of the online map.

Yes, we could do that (see my reply to David just now). I didn't consider just setting the map directly; that would be clearer. I didn't want to post my nodes_and() version, because the cost of nodes_and() seemed higher than a nodes_clear() followed by the appropriate node_set() calls.

-Nish
On 05.03.2015 [13:58:27 -0800], David Rientjes wrote:
> On Fri, 6 Mar 2015, Michael Ellerman wrote:
>
> > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > > index 0257a7d659ef..24de29b3651b 100644
> > > > --- a/arch/powerpc/mm/numa.c
> > > > +++ b/arch/powerpc/mm/numa.c
> > > > @@ -958,9 +958,17 @@ void __init initmem_init(void)
> > > >
> > > >  	memblock_dump_all();
> > > >
> > > > +	/*
> > > > +	 * zero out the possible nodes after we parse the device-tree,
> > > > +	 * so that we lower the maximum NUMA node ID to what is actually
> > > > +	 * present.
> > > > +	 */
> > > > +	nodes_clear(node_possible_map);
> > > > +
> > > >  	for_each_online_node(nid) {
> > > >  		unsigned long start_pfn, end_pfn;
> > > >
> > > > +		node_set(nid, node_possible_map);
> > > >  		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > > >  		setup_node_data(nid, start_pfn, end_pfn);
> > > >  		sparse_memory_present_with_active_regions(nid);
> > >
> > > This seems a bit strange, node_possible_map is supposed to be a superset
> > > of node_online_map and this loop is iterating over node_online_map to set
> > > nodes in node_possible_map.
> >
> > Yeah. Though at this point in boot I don't think it matters that the
> > two maps are out-of-sync temporarily.
> >
> > But it would be simpler to just set the possible map to be the online
> > map. That would also maintain the invariant that the possible map is
> > always a superset of the online map.
> >
> > Or did I miss a detail there (sleep deprived parent mode).
>
> I think reset_numa_cpu_lookup_table() which iterates over the possible
> map, and thus only a subset of nodes now, may be concerning.

I think you are confusing the CPU online map and the NUMA node online map. reset_numa_cpu_lookup_table() sets up a cpu->node mapping, is only called at boot-time, and iterates over the CPU online map, which is unaltered by my patch.

> I'm not sure why this is being proposed as a powerpc patch and not a
> patch for mem_cgroup_css_alloc().

I think mem_cgroup_css_alloc() is just an example of a larger issue; I should have made that clearer in my changelog. Even if we change mem_cgroup_css_alloc(), I think we want to fix node_possible_map on powerpc to be accurate at run-time, just like x86 does.

> In other words, why do we have to allocate for all possible nodes? We
> should only be allocating for online nodes in N_MEMORY with mem
> hotplug disabled initially and then have a mem hotplug callback
> implemented to alloc_mem_cgroup_per_zone_info() for nodes that
> transition from memoryless -> memory. The extra bonus is that
> alloc_mem_cgroup_per_zone_info() need never allocate remote memory and
> the TODO in that function can be removed.

This is a good idea, and seems like it can be a follow-on parallel patch to the one I provided (which does need an updated changelog now).

Thanks,
Nish
On 05.03.2015 [17:08:04 -0500], Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 05, 2015 at 01:58:27PM -0800, David Rientjes wrote:
> > I'm not sure why this is being proposed as a powerpc patch and not a patch
> > for mem_cgroup_css_alloc(). In other words, why do we have to allocate
> > for all possible nodes? We should only be allocating for online nodes in
> > N_MEMORY with mem hotplug disabled initially and then have a mem hotplug
> > callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that
> > transition from memoryless -> memory. The extra bonus is that
> > alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the
> > TODO in that function can be removed.
>
> For cpus, the general direction is allocating for all possible cpus.
> For iterations, we alternate between using all possibles and onlines
> depending on the use case; however, the general idea is that the
> possibles and onlines aren't gonna be very different. NR_CPUS and
> MAX_NUMNODES have to accommodate the worst possible case the kernel may
> run on, but the possible masks should be set to the actually possible
> subset during boot, so that the kernel doesn't end up allocating for and
> iterating over things which can't ever exist.

Makes sense to me.

> It can be argued that we should always stick to the online masks for
> allocation and iteration; however, that usually requires more
> complexity, and the only cases where this mattered have been when the
> boot code got it wrong and failed to set the possible masks correctly,
> which also seems to be the case here. I don't see any reason to
> deviate here.

So, do you agree with the general direction of my change? :)

Thanks,
Nish
On Thu, Mar 05, 2015 at 03:21:35PM -0800, Nishanth Aravamudan wrote:
> So, do you agree with the general direction of my change? :)
Yeah, I mean it's an obvious bug fix. I don't know when or how it
should be set on powerpc but if the machine can't do NUMA node
hotplug, its node online and possible masks must be equal.
Thanks.
On 05.03.2015 [17:13:08 -0500], Tejun Heo wrote:
> On Thu, Mar 05, 2015 at 10:05:49AM -0800, Nishanth Aravamudan wrote:
> > While looking at this, I noticed that nr_node_ids is actually a
> > misnomer, it seems. It's not the number, but the maximum_node_id, as
> > with sparse NUMA nodes, you might only have two NUMA nodes possible, but
> > to make certain loops work, nr_node_ids will be, e.g., 17. Should it be
> > changed?
>
> It's the same for nr_cpu_ids. It's counting the number of valid IDs
> during that boot instance. In the above case, whether the nodes are
> sparse or not, there exist 17 node ids - 0 to 16. Maybe numa_max_id
> had been a better name (but would that equal the highest number or
> +1?) but nr_node_ids != nr_nodes so I don't think it's a misnomer
> either. Doesn't really matter at this point. Maybe add comments on
> top of both?

Yes, I will consider that. To me, I guess it's more a matter of:

a) How does nr_node_ids relate to the number of possible NUMA node IDs at runtime? They are identical.

b) How does nr_node_ids relate to the number of NUMA node IDs in use? There is no relation.

c) How does nr_node_ids relate to the maximum NUMA node ID in use? It is one larger than that value.

However, for a), at least, we don't care about that on power, really. We don't have node hotplug, so the "possible" is the "online" in practice, for a given system. Iteration is generally not a problem (since we have sparse iterators anyway), and we shouldn't be allocating for non-present nodes. But we run into excessive allocations (I'm looking into a few others Dipankar has found now) with array allocations based on nr_node_ids or MAX_NUMNODES when the NUMA topology is sparse.

-Nish
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * zero out the possible nodes after we parse the device-tree,
+	 * so that we lower the maximum NUMA node ID to what is actually
+	 * present.
+	 */
+	nodes_clear(node_possible_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
+		node_set(nid, node_possible_map);
 		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 		setup_node_data(nid, start_pfn, end_pfn);
 		sparse_memory_present_with_active_regions(nid);
Raghu noticed an issue with excessive memory allocation on power with a simple cgroup test, specifically in mem_cgroup_css_alloc -> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes possible), which defines node_possible_map, which in turn defines the iteration of for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we do not support node hotplug on power in the first place, so the nodes that are online when we come up are the nodes that will be present for the lifetime of this kernel. So let's, at least, drop the NUMA possible map down to the online map at runtime. This is similar to what x86 does in its initialization routines.

One could alternatively nodes_and() the possible and online maps, but I think the cost of anding the two will always be higher than zeroing the map and setting a few bits in practice.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---

While looking at this, I noticed that nr_node_ids is actually a misnomer, it seems. It's not the number, but the maximum node ID, as with sparse NUMA nodes, you might only have two NUMA nodes possible, but to make certain loops work, nr_node_ids will be, e.g., 17. Should it be changed?