| Message ID | CAAmzW4PXkdpNi5pZ=4BzdXNvqTEAhcuw-x0pWidqrxzdePxXxA@mail.gmail.com (mailing list archive) |
|---|---|
| State | Not Applicable |
On 06.02.2014 [19:29:16 +0900], Joonsoo Kim wrote: > 2014-02-06 David Rientjes <rientjes@google.com>: > > On Thu, 6 Feb 2014, Joonsoo Kim wrote: > > > >> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> > >> > > > > I may be misunderstanding this patch and there's no help because there's > > no changelog. > > Sorry about that. > I made this patch just for testing. :) > Thanks for looking this. > > >> diff --git a/include/linux/topology.h b/include/linux/topology.h > >> index 12ae6ce..a6d5438 100644 > >> --- a/include/linux/topology.h > >> +++ b/include/linux/topology.h > >> @@ -233,11 +233,20 @@ static inline int numa_node_id(void) > >> * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem(). > >> */ > >> DECLARE_PER_CPU(int, _numa_mem_); > >> +int _node_numa_mem_[MAX_NUMNODES]; > >> > >> #ifndef set_numa_mem > >> static inline void set_numa_mem(int node) > >> { > >> this_cpu_write(_numa_mem_, node); > >> + _node_numa_mem_[numa_node_id()] = node; > >> +} > >> +#endif > >> + > >> +#ifndef get_numa_mem > >> +static inline int get_numa_mem(int node) > >> +{ > >> + return _node_numa_mem_[node]; > >> } > >> #endif > >> > >> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu) > >> static inline void set_cpu_numa_mem(int cpu, int node) > >> { > >> per_cpu(_numa_mem_, cpu) = node; > >> + _node_numa_mem_[numa_node_id()] = node; > > > > The intention seems to be that _node_numa_mem_[X] for a node X will return > > a node Y with memory that has the nearest distance? In other words, > > caching the value returned by local_memory_node(X)? > > Yes, you are right. > > > That doesn't seem to be what it's doing since numa_node_id() is the node > > of the cpu that current is running on so this ends up getting initialized > > to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in > > cpu_possible_mask. > > Yes, I made a mistake. > Thanks for pointer. > I fix it and attach v2. > Now I'm out of office, so I'm not sure this second version is correct :( > > Thanks. > > ----------8<-------------- > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001 > From: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Date: Thu, 6 Feb 2014 17:07:05 +0900 > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for > determining the > fallback node > > We need to determine the fallback node in slub allocator if the allocation > target node is memoryless node. Without it, the SLUB wrongly select > the node which has no memory and can't use a partial slab, because of node > mismatch. Introduced function, node_numa_mem(X), will return > a node Y with memory that has the nearest distance. If X is memoryless > node, it will return nearest distance node, but, if > X is normal node, it will return itself. > > We will use this function in following patch to determine the fallback > node. > > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> > > diff --git a/include/linux/topology.h b/include/linux/topology.h > index 12ae6ce..66b19b8 100644 > --- a/include/linux/topology.h > +++ b/include/linux/topology.h > @@ -233,11 +233,20 @@ static inline int numa_node_id(void) > * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem(). > */ > DECLARE_PER_CPU(int, _numa_mem_); > +int _node_numa_mem_[MAX_NUMNODES]; Should be static, I think? 
> > #ifndef set_numa_mem > static inline void set_numa_mem(int node) > { > this_cpu_write(_numa_mem_, node); > + _node_numa_mem_[numa_node_id()] = node; > +} > +#endif > + > +#ifndef get_numa_mem > +static inline int get_numa_mem(int node) > +{ > + return _node_numa_mem_[node]; > } > #endif > > @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu) > static inline void set_cpu_numa_mem(int cpu, int node) > { > per_cpu(_numa_mem_, cpu) = node; > + _node_numa_mem_[cpu_to_node(cpu)] = node; > } > #endif > > @@ -273,6 +283,13 @@ static inline int numa_mem_id(void) > } > #endif > > +#ifndef get_numa_mem > +static inline int get_numa_mem(int node) > +{ > + return node; > +} > +#endif > + > #ifndef cpu_to_mem > static inline int cpu_to_mem(int cpu) > { > -- > 1.7.9.5 >
On Thu, 6 Feb 2014, Joonsoo Kim wrote: > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001 > From: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Date: Thu, 6 Feb 2014 17:07:05 +0900 > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for > determining the > fallback node > > We need to determine the fallback node in slub allocator if the allocation > target node is memoryless node. Without it, the SLUB wrongly select > the node which has no memory and can't use a partial slab, because of node > mismatch. Introduced function, node_numa_mem(X), will return > a node Y with memory that has the nearest distance. If X is memoryless > node, it will return nearest distance node, but, if > X is normal node, it will return itself. > > We will use this function in following patch to determine the fallback > node. > I like the approach and it may fix the problem today, but it may not be sufficient in the future: nodes may not only be memoryless but they may also be cpuless. It's possible that a node can only have I/O, networking, or storage devices and we can define affinity for them that is remote from every cpu and/or memory by the ACPI specification. It seems like a better approach would be to do this when a node is brought online and determine the fallback node based not on the zonelists as you do here but rather on locality (such as through a SLIT if provided, see node_distance()). Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make a lot of sense in generic code. I'd suggest something like node_to_mem_node().
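An illustrative sketch of the suggestion above: recompute the nearest node with memory from node_distance() (i.e. the SLIT, when firmware provides one) whenever a node is brought online, using the node_to_mem_node() naming proposed here. The helper name update_node_to_mem_node() and the recompute-on-online placement are assumptions for illustration, not code from the thread.

```c
#include <linux/nodemask.h>
#include <linux/topology.h>

static int _node_numa_mem_[MAX_NUMNODES];

static void update_node_to_mem_node(int node)
{
	int n, best = NUMA_NO_NODE;

	if (node_state(node, N_MEMORY)) {
		/* a node with its own memory maps to itself */
		_node_numa_mem_[node] = node;
		return;
	}

	/* otherwise pick the node with memory at the smallest distance */
	for_each_node_state(n, N_MEMORY) {
		if (best == NUMA_NO_NODE ||
		    node_distance(node, n) < node_distance(node, best))
			best = n;
	}
	_node_numa_mem_[node] = best;
}

static inline int node_to_mem_node(int node)
{
	return _node_numa_mem_[node];
}
```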
On Thu, Feb 06, 2014 at 11:11:31AM -0800, Nishanth Aravamudan wrote:
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 12ae6ce..66b19b8 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >  */
> >  DECLARE_PER_CPU(int, _numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
>
> Should be static, I think?

Yes, will update it.

Thanks.
On Thu, Feb 06, 2014 at 12:52:11PM -0800, David Rientjes wrote: > On Thu, 6 Feb 2014, Joonsoo Kim wrote: > > > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001 > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com> > > Date: Thu, 6 Feb 2014 17:07:05 +0900 > > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for > > determining the > > fallback node > > > > We need to determine the fallback node in slub allocator if the allocation > > target node is memoryless node. Without it, the SLUB wrongly select > > the node which has no memory and can't use a partial slab, because of node > > mismatch. Introduced function, node_numa_mem(X), will return > > a node Y with memory that has the nearest distance. If X is memoryless > > node, it will return nearest distance node, but, if > > X is normal node, it will return itself. > > > > We will use this function in following patch to determine the fallback > > node. > > > > I like the approach and it may fix the problem today, but it may not be > sufficient in the future: nodes may not only be memoryless but they may > also be cpuless. It's possible that a node can only have I/O, networking, > or storage devices and we can define affinity for them that is remote from > every cpu and/or memory by the ACPI specification. > > It seems like a better approach would be to do this when a node is brought > online and determine the fallback node based not on the zonelists as you > do here but rather on locality (such as through a SLIT if provided, see > node_distance()). Hmm... I guess that zonelist is base on locality. Zonelist is generated using node_distance(), so I think that it reflects locality. But, I'm not expert on NUMA, so please let me know what I am missing here :) > Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make > a lot of sense in generic code. I'd suggest something like > node_to_mem_node(). It's much better! If this patch eventually will be needed, I will update it. Thanks.
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm...
> I guess that zonelist is base on locality. Zonelist is generated using
> node_distance(), so I think that it reflects locality. But, I'm not expert
> on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one to the other node. The building
of the fallback list for every node in build_zonelists() relies on that.
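The walk described here looks roughly like the sketch below: build_zonelists() orders each node's fallback zonelist by node_distance(), so the first populated zone in it belongs to the nearest node with memory. The in-kernel local_memory_node() already does essentially this via the zonelist; the helper name below is made up for illustration only.

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/mmzone.h>

static int nearest_node_with_memory(int node)
{
	struct zoneref *z;
	struct zone *zone;

	/* zonelist entries are ordered by distance from 'node' */
	for_each_zone_zonelist(zone, z, node_zonelist(node, GFP_KERNEL),
			       gfp_zone(GFP_KERNEL)) {
		if (populated_zone(zone))
			return zone_to_nid(zone);
	}

	return node;	/* unreachable in practice: every zonelist has memory */
}
```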
On Fri, 7 Feb 2014, Joonsoo Kim wrote: > > It seems like a better approach would be to do this when a node is brought > > online and determine the fallback node based not on the zonelists as you > > do here but rather on locality (such as through a SLIT if provided, see > > node_distance()). > > Hmm... > I guess that zonelist is base on locality. Zonelist is generated using > node_distance(), so I think that it reflects locality. But, I'm not expert > on NUMA, so please let me know what I am missing here :) > The zonelist is, yes, but I'm talking about memoryless and cpuless nodes. If your solution is going to become the generic kernel API that determines what node has local memory for a particular node, then it will have to support all definitions of node. That includes nodes that consist solely of I/O, chipsets, networking, or storage devices. These nodes may not have memory or cpus, so doing it as part of onlining cpus isn't going to be generic enough. You want a node_to_mem_node() API for all possible node types (the possible node types listed above are straight from the ACPI spec). For 99% of people, node_to_mem_node(X) is always going to be X and we can optimize for that, but any solution that relies on cpu online is probably shortsighted right now. I think it would be much better to do this as a part of setting a node to be online.
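The "node_to_mem_node(X) is always going to be X for 99% of people" point suggests a compile-time split, so configurations that can never have memoryless nodes keep the identity mapping and pay nothing for the lookup. A sketch of how the header could look (illustrative, not the posted patch):

```c
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
extern int _node_numa_mem_[MAX_NUMNODES];

static inline int node_to_mem_node(int node)
{
	/* nearest node with memory, cached when the node came online */
	return _node_numa_mem_[node];
}
#else
static inline int node_to_mem_node(int node)
{
	/* every node has memory: the mapping is the identity */
	return node;
}
#endif
```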
On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote: > On Fri, 7 Feb 2014, Joonsoo Kim wrote: > > > > It seems like a better approach would be to do this when a node is brought > > > online and determine the fallback node based not on the zonelists as you > > > do here but rather on locality (such as through a SLIT if provided, see > > > node_distance()). > > > > Hmm... > > I guess that zonelist is base on locality. Zonelist is generated using > > node_distance(), so I think that it reflects locality. But, I'm not expert > > on NUMA, so please let me know what I am missing here :) > > > > The zonelist is, yes, but I'm talking about memoryless and cpuless nodes. > If your solution is going to become the generic kernel API that determines > what node has local memory for a particular node, then it will have to > support all definitions of node. That includes nodes that consist solely > of I/O, chipsets, networking, or storage devices. These nodes may not > have memory or cpus, so doing it as part of onlining cpus isn't going to > be generic enough. You want a node_to_mem_node() API for all possible > node types (the possible node types listed above are straight from the > ACPI spec). For 99% of people, node_to_mem_node(X) is always going to be > X and we can optimize for that, but any solution that relies on cpu online > is probably shortsighted right now. > > I think it would be much better to do this as a part of setting a node to > be online. Okay. I got your point. I will change it to rely on node online if this patch is really needed. Thanks!
On 10.02.2014 [10:09:36 +0900], Joonsoo Kim wrote: > On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote: > > On Fri, 7 Feb 2014, Joonsoo Kim wrote: > > > > > > It seems like a better approach would be to do this when a node is brought > > > > online and determine the fallback node based not on the zonelists as you > > > > do here but rather on locality (such as through a SLIT if provided, see > > > > node_distance()). > > > > > > Hmm... > > > I guess that zonelist is base on locality. Zonelist is generated using > > > node_distance(), so I think that it reflects locality. But, I'm not expert > > > on NUMA, so please let me know what I am missing here :) > > > > > > > The zonelist is, yes, but I'm talking about memoryless and cpuless nodes. > > If your solution is going to become the generic kernel API that determines > > what node has local memory for a particular node, then it will have to > > support all definitions of node. That includes nodes that consist solely > > of I/O, chipsets, networking, or storage devices. These nodes may not > > have memory or cpus, so doing it as part of onlining cpus isn't going to > > be generic enough. You want a node_to_mem_node() API for all possible > > node types (the possible node types listed above are straight from the > > ACPI spec). For 99% of people, node_to_mem_node(X) is always going to be > > X and we can optimize for that, but any solution that relies on cpu online > > is probably shortsighted right now. > > > > I think it would be much better to do this as a part of setting a node to > > be online. > > Okay. I got your point. > I will change it to rely on node online if this patch is really needed. Sorry for bringing up this old thread again, but I had a question for you, David. node_to_mem_node(), which does seem like a useful API, doesn't seem like it can just node_distance() solely, right? Because that just tells us the relative cost (or so I think about it) of using resources from that node. But we also need to know if that node itself has memory, etc. So using the zonelists is required no matter what? And upon memory hotplug (or unplug), the topology can change in a way that affects things, so node online time isn't right either? Thanks, Nish
On Mon, 21 Jul 2014, Nishanth Aravamudan wrote: > Sorry for bringing up this old thread again, but I had a question for > you, David. node_to_mem_node(), which does seem like a useful API, > doesn't seem like it can just node_distance() solely, right? Because > that just tells us the relative cost (or so I think about it) of using > resources from that node. But we also need to know if that node itself > has memory, etc. So using the zonelists is required no matter what? And > upon memory hotplug (or unplug), the topology can change in a way that > affects things, so node online time isn't right either? > I think there's two use cases of interest: - allocating from a memoryless node where numa_node_id() is memoryless, and - using node_to_mem_node() for a possibly-memoryless node for kmalloc(). I believe the first should have its own node_zonelist[0], whether it's memoryless or not, that points to a list of zones that start with those with the smallest distance. I think its own node_zonelist[1], for __GFP_THISNODE allocations, should point to the node with present memory that has the smallest distance. For sure node_zonelist[0] cannot be NULL since things like first_online_pgdat() would break and it should be unnecessary to do node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES since the zonelists should already be defined properly. All nodes, regardless of whether they have memory or not, should probably end up having a struct pglist_data unless there's a reason for another level of indirection.
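For background on the node_zonelist[0]/node_zonelist[1] distinction used above: the zonelist index is derived from the gfp mask, so __GFP_THISNODE allocations use the second, no-fallback list. Paraphrased from include/linux/gfp.h of that era, not new code:

```c
static inline int gfp_zonelist(gfp_t flags)
{
	if (IS_ENABLED(CONFIG_NUMA) && unlikely(flags & __GFP_THISNODE))
		return 1;	/* node_zonelists[1]: this node only */

	return 0;		/* node_zonelists[0]: fallback, distance-ordered */
}

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
```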
Hi David, On 21.07.2014 [18:16:58 -0700], David Rientjes wrote: > On Mon, 21 Jul 2014, Nishanth Aravamudan wrote: > > > Sorry for bringing up this old thread again, but I had a question for > > you, David. node_to_mem_node(), which does seem like a useful API, > > doesn't seem like it can just node_distance() solely, right? Because > > that just tells us the relative cost (or so I think about it) of using > > resources from that node. But we also need to know if that node itself > > has memory, etc. So using the zonelists is required no matter what? And > > upon memory hotplug (or unplug), the topology can change in a way that > > affects things, so node online time isn't right either? > > > > I think there's two use cases of interest: > > - allocating from a memoryless node where numa_node_id() is memoryless, > and > > - using node_to_mem_node() for a possibly-memoryless node for kmalloc(). > > I believe the first should have its own node_zonelist[0], whether it's > memoryless or not, that points to a list of zones that start with those > with the smallest distance. Ok, and that would be used for falling back in the appropriate priority? > I think its own node_zonelist[1], for __GFP_THISNODE allocations, > should point to the node with present memory that has the smallest > distance. And so would this, but with the caveat that we can fail here and don't go further? Semantically, __GFP_THISNODE then means "as close as physically possible ignoring run-time memory constraints". I say that because obviously we might get off-node memory without memoryless nodes, but that shouldn't be used to satisfy __GPF_THISNODE allocations. > For sure node_zonelist[0] cannot be NULL since things like > first_online_pgdat() would break and it should be unnecessary to do > node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES > since the zonelists should already be defined properly. All nodes, > regardless of whether they have memory or not, should probably end up > having a struct pglist_data unless there's a reason for another level of > indirection. So I've re-tested Joonsoo's patch 2 and 3 from the series he sent, and on powerpc now, things look really good. On a KVM instance with the following topology: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 node 1 size: 16336 MB node 1 free: 14274 MB node distances: node 0 1 0: 10 40 1: 40 10 3.16.0-rc6 gives: Slab: 1039744 kB SReclaimable: 38976 kB SUnreclaim: 1000768 kB Joonsoo's patches give: Slab: 366144 kB SReclaimable: 36928 kB SUnreclaim: 329216 kB For reference, CONFIG_SLAB gives: Slab: 122496 kB SReclaimable: 14912 kB SUnreclaim: 107584 kB At Tejun's request [adding him to Cc], I also partially reverted 81c98869faa5 ("kthread: ensure locality of task_struct allocations"): Slab: 428864 kB SReclaimable: 44288 kB SUnreclaim: 384576 kB This seems slightly worse, but I think it's because of the same root-cause that I indicated in my RFC patch 2/2, quoting it here: " There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes. 
NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate after start_secondary(), similar to ia64, which is invoked via smp_init(). Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as early_initcall()") made init_workqueues() be invoked via do_pre_smp_initcalls(), which is obviously before the secondary processors are online. ... Therefore, when init_workqueues() runs, it sees all CPUs as being on Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to a high number of slab deactivations (http://www.spinics.net/lists/linux-mm/msg67489.html)." Christoph/Tejun, do you see the issue I'm referring to? Is my analysis correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we have to be especially careful that users of cpu_to_{node,mem} and related APIs run *after* correct values are stored for all used CPUs? In any case, with Joonsoo's patches, we shouldn't see slab deactivations *if* the NUMA topology information is stored correctly. The full changelog and patch is at http://patchwork.ozlabs.org/patch/371266/. Adding my patch on top of Joonsoo's and the revert, I get: Slab: 411776 kB SReclaimable: 40960 kB SUnreclaim: 370816 kB So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so much that we are close to OOM with small VM/LPAR sizes. Thoughts? I would like to push: 1) Joonsoo's patch to add get_numa_mem, renamed to node_to_mem_node(), which is caching the result of local_memory_node() for each node. 2) Joonsoo's patch to use node_to_mem_node in __slab_alloc() and get_partial() when memoryless nodes are encountered. 3) Partial revert of 81c98869faa5 ("kthread: ensure locality of task_struct allocations") to remove a reference to cpu_to_mem() from the kthread code. After this, the only references to cpu_to_mem() are in headers, mm/slab.c, and kernel/profile.c (the last of which is because of the use of alloc_pages_exact_node(), it seems). 4) Re-post of my patch to fix an ordering issue for the per-CPU NUMA information on powerpc I understand your concerns, I think, about Joonsoo's patches, but we're hitting this pretty regularly in the field and it would be nice to have something workable in the short-term, while I try and follow-up on these more invasive ideas. Thanks, Nish
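Item 2 in the list above, the __slab_alloc()/get_partial() change, amounts to redirecting a memoryless target node to its nearest node with memory before SLUB's node-match check, so existing partial slabs get reused instead of new slabs being allocated and then deactivated. A simplified sketch based on the posted series (mm/slub.c context; not necessarily the exact code that was eventually merged):

```c
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
			 struct kmem_cache_cpu *c)
{
	void *object;
	int searchnode = node;

	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	else if (!node_present_pages(node))
		searchnode = node_to_mem_node(node);	/* helper from item 1 */

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;

	/* explicit node request failed; fall back to any node's partials */
	return get_any_partial(s, flags, c);
}
```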
Hello, On Tue, Jul 22, 2014 at 02:43:11PM -0700, Nishanth Aravamudan wrote: ... > " There is an issue currently where NUMA information is used on powerpc > (and possibly ia64) before it has been read from the device-tree, which > leads to large slab consumption with CONFIG_SLUB and memoryless nodes. > > NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate > after start_secondary(), similar to ia64, which is invoked via > smp_init(). > > Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as > early_initcall()") made init_workqueues() be invoked via > do_pre_smp_initcalls(), which is obviously before the secondary > processors are online. > ... > Therefore, when init_workqueues() runs, it sees all CPUs as being on > Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to > a high number of slab deactivations > (http://www.spinics.net/lists/linux-mm/msg67489.html)." > > Christoph/Tejun, do you see the issue I'm referring to? Is my analysis > correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we > have to be especially careful that users of cpu_to_{node,mem} and > related APIs run *after* correct values are stored for all used CPUs? Without delving into the code, yes, NUMA info should be set up as soon as possible before major allocations happen. All allocations which happen beforehand would naturally be done with bogus NUMA information. Thanks.
On 22.07.2014 [14:43:11 -0700], Nishanth Aravamudan wrote: > Hi David, <snip> > on powerpc now, things look really good. On a KVM instance with the > following topology: > > available: 2 nodes (0-1) > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 > node 0 size: 0 MB > node 0 free: 0 MB > node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 > node 1 size: 16336 MB > node 1 free: 14274 MB > node distances: > node 0 1 > 0: 10 40 > 1: 40 10 > > 3.16.0-rc6 gives: > > Slab: 1039744 kB > SReclaimable: 38976 kB > SUnreclaim: 1000768 kB <snip> > Adding my patch on top of Joonsoo's and the revert, I get: > > Slab: 411776 kB > SReclaimable: 40960 kB > SUnreclaim: 370816 kB > > So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so > much that we are close to OOM with small VM/LPAR sizes. Just to clarify/add one more datapoint, with a balanced topology: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 node 0 size: 8154 MB node 0 free: 8075 MB node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 node 1 size: 8181 MB node 1 free: 7776 MB node distances: node 0 1 0: 10 40 1: 40 10 I see the following for my patch + Joonsoo's + the revert: Slab: 495872 kB SReclaimable: 46528 kB SUnreclaim: 449344 kB (Although these numbers fluctuate quite a bit between 250M and 500M), which indicates that the memoryless node slab consumption is now on-par with a populated topology. And both are still more than CONFIG_SLAB requires. Thanks, Nish
On Tue, 22 Jul 2014, Nishanth Aravamudan wrote: > > I think there's two use cases of interest: > > > > - allocating from a memoryless node where numa_node_id() is memoryless, > > and > > > > - using node_to_mem_node() for a possibly-memoryless node for kmalloc(). > > > > I believe the first should have its own node_zonelist[0], whether it's > > memoryless or not, that points to a list of zones that start with those > > with the smallest distance. > > Ok, and that would be used for falling back in the appropriate priority? > There's no real fallback since there's never a case when you can allocate on a memoryless node. The zonelist defines the appropriate order in which to try to allocate from zones, so it depends on things like the numa_node_id() in alloc_pages_current() and whether the zonelist for a memoryless node is properly initialized or whether this needs to be numa_mem_id(). It depends on the intended behavior of calling alloc_pages_{node,vma}() with a memoryless node, the complexity of (re-)building the zonelists at bootstrap and for memory hotplug isn't a hotpath. This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL is set. > > I think its own node_zonelist[1], for __GFP_THISNODE allocations, > > should point to the node with present memory that has the smallest > > distance. > > And so would this, but with the caveat that we can fail here and don't > go further? Semantically, __GFP_THISNODE then means "as close as > physically possible ignoring run-time memory constraints". I say that > because obviously we might get off-node memory without memoryless nodes, > but that shouldn't be used to satisfy __GPF_THISNODE allocations. > alloc_pages_current() substitutes any existing mempolicy for the default local policy when __GFP_THISNODE is set, and that would require local allocation. That, currently, is numa_node_id() and not numa_mem_id(). The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so it will allocate remotely anyway.
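A simplified paraphrase of the alloc_pages_current() behaviour described above (mm/mempolicy.c of that era; interleave, cpuset, and error handling omitted): __GFP_THISNODE drops any task mempolicy in favour of the default local policy, and the preferred node comes from numa_node_id(), not numa_mem_id().

```c
struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
	struct mempolicy *pol = current->mempolicy;

	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
		pol = &default_policy;	/* MPOL_PREFERRED, local allocation */

	return __alloc_pages_nodemask(gfp, order,
				      policy_zonelist(gfp, pol, numa_node_id()),
				      policy_nodemask(gfp, pol));
}
```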
```diff
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..66b19b8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];
 
 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif
 
@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif
 
@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif
 
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
```
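A usage sketch for the accessor this RFC adds (v1 naming, get_numa_mem()): translate the target node once, then allocate as usual. The wrapper and its name are hypothetical, for illustration only.

```c
#include <linux/gfp.h>
#include <linux/topology.h>

static struct page *alloc_page_on_nearest(int node, gfp_t gfp, unsigned int order)
{
	/* the node itself if it has memory, otherwise the nearest one that does */
	int mem_node = get_numa_mem(node);

	return alloc_pages_node(mem_node, gfp, order);
}
```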