[RFC,2/3] topology: support node_numa_mem() for determining the fallback node

Message ID CAAmzW4PXkdpNi5pZ=4BzdXNvqTEAhcuw-x0pWidqrxzdePxXxA@mail.gmail.com (mailing list archive)
State Not Applicable

Commit Message

Joonsoo Kim Feb. 6, 2014, 10:29 a.m. UTC
2014-02-06 David Rientjes <rientjes@google.com>:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>
> I may be misunderstanding this patch and there's no help because there's
> no changelog.

Sorry about that.
I made this patch just for testing. :)
Thanks for looking at this.

>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 12ae6ce..a6d5438 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>>   */
>>  DECLARE_PER_CPU(int, _numa_mem_);
>> +int _node_numa_mem_[MAX_NUMNODES];
>>
>>  #ifndef set_numa_mem
>>  static inline void set_numa_mem(int node)
>>  {
>>       this_cpu_write(_numa_mem_, node);
>> +     _node_numa_mem_[numa_node_id()] = node;
>> +}
>> +#endif
>> +
>> +#ifndef get_numa_mem
>> +static inline int get_numa_mem(int node)
>> +{
>> +     return _node_numa_mem_[node];
>>  }
>>  #endif
>>
>> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>>  static inline void set_cpu_numa_mem(int cpu, int node)
>>  {
>>       per_cpu(_numa_mem_, cpu) = node;
>> +     _node_numa_mem_[numa_node_id()] = node;
>
> The intention seems to be that _node_numa_mem_[X] for a node X will return
> a node Y with memory that has the nearest distance?  In other words,
> caching the value returned by local_memory_node(X)?

Yes, you are right.

> That doesn't seem to be what it's doing since numa_node_id() is the node
> of the cpu that current is running on so this ends up getting initialized
> to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> cpu_possible_mask.

Yes, I made a mistake.
Thanks for the pointer.
I've fixed it and attached v2.
Now I'm out of office, so I'm not sure this second version is correct :(

Thanks.

----------8<--------------
From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date: Thu, 6 Feb 2014 17:07:05 +0900
Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
 determining the fallback node

We need to determine the fallback node in the slub allocator if the
allocation target node is a memoryless node. Without it, SLUB wrongly
selects a node which has no memory and cannot use a partial slab there,
because of the node mismatch. The newly introduced function,
node_numa_mem(X), returns a node Y with memory that is nearest to X: if
X is a memoryless node, it returns the nearest node with memory; if X
is a normal node, it returns X itself.

We will use this function in the following patch to determine the
fallback node.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
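
For context, patch 3/3 of the series consults this cache on the SLUB
allocation path. A minimal sketch of that usage in get_partial(),
written against the get_numa_mem() helper above (an illustration, not
the exact posted hunk):

/*
 * Sketch: if the requested node has no present pages, redirect the
 * partial-list search to the nearest node with memory, as cached by
 * get_numa_mem().
 */
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
			 struct kmem_cache_cpu *c)
{
	void *object;
	int searchnode = node;

	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	else if (!node_present_pages(node))
		searchnode = get_numa_mem(node);

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;

	return get_any_partial(s, flags, c);
}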

Comments

Nishanth Aravamudan Feb. 6, 2014, 7:11 p.m. UTC | #1
On 06.02.2014 [19:29:16 +0900], Joonsoo Kim wrote:
> 2014-02-06 David Rientjes <rientjes@google.com>:
> > On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> >
> >> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>
> >
> > I may be misunderstanding this patch and there's no help because there's
> > no changelog.
> 
> Sorry about that.
> I made this patch just for testing. :)
> Thanks for looking at this.
> 
> >> diff --git a/include/linux/topology.h b/include/linux/topology.h
> >> index 12ae6ce..a6d5438 100644
> >> --- a/include/linux/topology.h
> >> +++ b/include/linux/topology.h
> >> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >>   */
> >>  DECLARE_PER_CPU(int, _numa_mem_);
> >> +int _node_numa_mem_[MAX_NUMNODES];
> >>
> >>  #ifndef set_numa_mem
> >>  static inline void set_numa_mem(int node)
> >>  {
> >>       this_cpu_write(_numa_mem_, node);
> >> +     _node_numa_mem_[numa_node_id()] = node;
> >> +}
> >> +#endif
> >> +
> >> +#ifndef get_numa_mem
> >> +static inline int get_numa_mem(int node)
> >> +{
> >> +     return _node_numa_mem_[node];
> >>  }
> >>  #endif
> >>
> >> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
> >>  static inline void set_cpu_numa_mem(int cpu, int node)
> >>  {
> >>       per_cpu(_numa_mem_, cpu) = node;
> >> +     _node_numa_mem_[numa_node_id()] = node;
> >
> > The intention seems to be that _node_numa_mem_[X] for a node X will return
> > a node Y with memory that has the nearest distance?  In other words,
> > caching the value returned by local_memory_node(X)?
> 
> Yes, you are right.
> 
> > That doesn't seem to be what it's doing since numa_node_id() is the node
> > of the cpu that current is running on so this ends up getting initialized
> > to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> > cpu_possible_mask.
> 
> Yes, I made a mistake.
> Thanks for the pointer.
> I've fixed it and attached v2.
> Now I'm out of office, so I'm not sure this second version is correct :(
> 
> Thanks.
> 
> ----------8<--------------
> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
>  determining the fallback node
> 
> We need to determine the fallback node in the slub allocator if the
> allocation target node is a memoryless node. Without it, SLUB wrongly
> selects a node which has no memory and cannot use a partial slab there,
> because of the node mismatch. The newly introduced function,
> node_numa_mem(X), returns a node Y with memory that is nearest to X: if
> X is a memoryless node, it returns the nearest node with memory; if X
> is a normal node, it returns X itself.
> 
> We will use this function in the following patch to determine the
> fallback node.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..66b19b8 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>   */
>  DECLARE_PER_CPU(int, _numa_mem_);
> +int _node_numa_mem_[MAX_NUMNODES];

Should be static, I think?

> 
>  #ifndef set_numa_mem
>  static inline void set_numa_mem(int node)
>  {
>  	this_cpu_write(_numa_mem_, node);
> +	_node_numa_mem_[numa_node_id()] = node;
> +}
> +#endif
> +
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return _node_numa_mem_[node];
>  }
>  #endif
> 
> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>  static inline void set_cpu_numa_mem(int cpu, int node)
>  {
>  	per_cpu(_numa_mem_, cpu) = node;
> +	_node_numa_mem_[cpu_to_node(cpu)] = node;
>  }
>  #endif
> 
> @@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
>  }
>  #endif
> 
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return node;
> +}
> +#endif
> +
>  #ifndef cpu_to_mem
>  static inline int cpu_to_mem(int cpu)
>  {
> -- 
> 1.7.9.5
>
David Rientjes Feb. 6, 2014, 8:52 p.m. UTC | #2
On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
>  determining the fallback node
> 
> We need to determine the fallback node in the slub allocator if the
> allocation target node is a memoryless node. Without it, SLUB wrongly
> selects a node which has no memory and cannot use a partial slab there,
> because of the node mismatch. The newly introduced function,
> node_numa_mem(X), returns a node Y with memory that is nearest to X: if
> X is a memoryless node, it returns the nearest node with memory; if X
> is a normal node, it returns X itself.
> 
> We will use this function in the following patch to determine the
> fallback node.
> 

I like the approach and it may fix the problem today, but it may not be 
sufficient in the future: nodes may not only be memoryless but they may 
also be cpuless.  It's possible that a node can only have I/O, networking, 
or storage devices and we can define affinity for them that is remote from 
every cpu and/or memory by the ACPI specification.

It seems like a better approach would be to do this when a node is brought 
online and determine the fallback node based not on the zonelists as you 
do here but rather on locality (such as through a SLIT if provided, see 
node_distance()).

Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make 
a lot of sense in generic code.  I'd suggest something like 
node_to_mem_node().
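
For illustration, an online-time computation based on node_distance()
might look like the sketch below; update_node_numa_mem() and its hook
placement are hypothetical, and _node_numa_mem_[] is the cache from
the patch under discussion:

#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

/*
 * Hypothetical sketch: cache the nearest node with memory for @node by
 * scanning node_distance() directly, e.g. from the node-online path,
 * rather than deriving it from the zonelists.
 */
static void update_node_numa_mem(int node)
{
	int n, best_node = node, best_dist = INT_MAX;

	for_each_node_state(n, N_MEMORY) {
		int dist = node_distance(node, n);

		if (dist < best_dist) {
			best_dist = dist;
			best_node = n;
		}
	}
	_node_numa_mem_[node] = best_node;
}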
Joonsoo Kim Feb. 7, 2014, 5:42 a.m. UTC | #3
On Thu, Feb 06, 2014 at 11:11:31AM -0800, Nishanth Aravamudan wrote:
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 12ae6ce..66b19b8 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >   */
> >  DECLARE_PER_CPU(int, _numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
> 
> Should be static, I think?

Yes, will update it.

Thanks.
Joonsoo Kim Feb. 7, 2014, 5:48 a.m. UTC | #4
On Thu, Feb 06, 2014 at 12:52:11PM -0800, David Rientjes wrote:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> 
> > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Date: Thu, 6 Feb 2014 17:07:05 +0900
> > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> >  determining the fallback node
> > 
> > We need to determine the fallback node in the slub allocator if the
> > allocation target node is a memoryless node. Without it, SLUB wrongly
> > selects a node which has no memory and cannot use a partial slab there,
> > because of the node mismatch. The newly introduced function,
> > node_numa_mem(X), returns a node Y with memory that is nearest to X: if
> > X is a memoryless node, it returns the nearest node with memory; if X
> > is a normal node, it returns X itself.
> > 
> > We will use this function in the following patch to determine the
> > fallback node.
> > 
> 
> I like the approach and it may fix the problem today, but it may not be 
> sufficient in the future: nodes may not only be memoryless but they may 
> also be cpuless.  It's possible that a node can only have I/O, networking, 
> or storage devices and we can define affinity for them that is remote from 
> every cpu and/or memory by the ACPI specification.
> 
> It seems like a better approach would be to do this when a node is brought 
> online and determine the fallback node based not on the zonelists as you 
> do here but rather on locality (such as through a SLIT if provided, see 
> node_distance()).

Hmm...
I guess that the zonelist is based on locality. The zonelist is generated
using node_distance(), so I think that it reflects locality. But I'm not
an expert on NUMA, so please let me know what I am missing here :)

> Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make 
> a lot of sense in generic code.  I'd suggest something like 
> node_to_mem_node().

It's much better!
If this patch eventually will be needed, I will update it.

Thanks.
Christoph Lameter (Ampere) Feb. 7, 2014, 5:53 p.m. UTC | #5
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> >
> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated
> using node_distance(), so I think that it reflects locality. But I'm not
> an expert on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one node to another.
The building of the fallback list for every node in build_zonelists()
relies on that.
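
As a concrete sketch of the walk described above, loosely modeled on
fallback_alloc() in mm/slab.c (the helper name is made up for
illustration):

#include <linux/gfp.h>
#include <linux/mmzone.h>

/*
 * Sketch: walk @node's zonelist -- which build_zonelists() ordered by
 * node_distance() -- and return the first node that has memory.
 */
static int nearest_node_with_memory(int node)
{
	struct zonelist *zonelist = node_zonelist(node, GFP_KERNEL);
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(GFP_KERNEL)) {
		int nid = zone_to_nid(zone);

		if (node_state(nid, N_MEMORY))
			return nid;
	}
	return NUMA_NO_NODE;
}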
David Rientjes Feb. 8, 2014, 9:57 a.m. UTC | #6
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > It seems like a better approach would be to do this when a node is brought 
> > online and determine the fallback node based not on the zonelists as you 
> > do here but rather on locality (such as through a SLIT if provided, see 
> > node_distance()).
> 
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated
> using node_distance(), so I think that it reflects locality. But I'm not
> an expert on NUMA, so please let me know what I am missing here :)
> 

The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
If your solution is going to become the generic kernel API that determines 
what node has local memory for a particular node, then it will have to 
support all definitions of node.  That includes nodes that consist solely 
of I/O, chipsets, networking, or storage devices.  These nodes may not 
have memory or cpus, so doing it as part of onlining cpus isn't going to 
be generic enough.  You want a node_to_mem_node() API for all possible 
node types (the possible node types listed above are straight from the 
ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
X and we can optimize for that, but any solution that relies on cpu online 
is probably shortsighted right now.

I think it would be much better to do this as a part of setting a node to 
be online.
Joonsoo Kim Feb. 10, 2014, 1:09 a.m. UTC | #7
On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > It seems like a better approach would be to do this when a node is brought 
> > > online and determine the fallback node based not on the zonelists as you 
> > > do here but rather on locality (such as through a SLIT if provided, see 
> > > node_distance()).
> > 
> > Hmm...
> > I guess that the zonelist is based on locality. The zonelist is generated
> > using node_distance(), so I think that it reflects locality. But I'm not
> > an expert on NUMA, so please let me know what I am missing here :)
> > 
> 
> The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> If your solution is going to become the generic kernel API that determines 
> what node has local memory for a particular node, then it will have to 
> support all definitions of node.  That includes nodes that consist solely 
> of I/O, chipsets, networking, or storage devices.  These nodes may not 
> have memory or cpus, so doing it as part of onlining cpus isn't going to 
> be generic enough.  You want a node_to_mem_node() API for all possible 
> node types (the possible node types listed above are straight from the 
> ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> X and we can optimize for that, but any solution that relies on cpu online 
> is probably shortsighted right now.
> 
> I think it would be much better to do this as a part of setting a node to 
> be online.

Okay. I got your point.
I will change it to do this at node online time if this patch is really needed.

Thanks!
Nishanth Aravamudan July 22, 2014, 1:03 a.m. UTC | #8
On 10.02.2014 [10:09:36 +0900], Joonsoo Kim wrote:
> On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> > On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> > 
> > > > It seems like a better approach would be to do this when a node is brought 
> > > > online and determine the fallback node based not on the zonelists as you 
> > > > do here but rather on locality (such as through a SLIT if provided, see 
> > > > node_distance()).
> > > 
> > > Hmm...
> > > I guess that the zonelist is based on locality. The zonelist is generated
> > > using node_distance(), so I think that it reflects locality. But I'm not
> > > an expert on NUMA, so please let me know what I am missing here :)
> > > 
> > 
> > The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> > If your solution is going to become the generic kernel API that determines 
> > what node has local memory for a particular node, then it will have to 
> > support all definitions of node.  That includes nodes that consist solely 
> > of I/O, chipsets, networking, or storage devices.  These nodes may not 
> > have memory or cpus, so doing it as part of onlining cpus isn't going to 
> > be generic enough.  You want a node_to_mem_node() API for all possible 
> > node types (the possible node types listed above are straight from the 
> > ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> > X and we can optimize for that, but any solution that relies on cpu online 
> > is probably shortsighted right now.
> > 
> > I think it would be much better to do this as a part of setting a node to 
> > be online.
> 
> Okay. I got your point.
> I will change it to do this at node online time if this patch is really needed.

Sorry for bringing up this old thread again, but I had a question for
you, David. node_to_mem_node(), which does seem like a useful API,
doesn't seem like it can rely on node_distance() alone, right? Because
that just tells us the relative cost (or so I think about it) of using
resources from that node. But we also need to know if that node itself
has memory, etc. So using the zonelists is required no matter what? And
upon memory hotplug (or unplug), the topology can change in a way that
affects things, so node online time isn't right either?

Thanks,
Nish
David Rientjes July 22, 2014, 1:16 a.m. UTC | #9
On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:

> Sorry for bringing up this old thread again, but I had a question for
> you, David. node_to_mem_node(), which does seem like a useful API,
> doesn't seem like it can rely on node_distance() alone, right? Because
> that just tells us the relative cost (or so I think about it) of using
> resources from that node. But we also need to know if that node itself
> has memory, etc. So using the zonelists is required no matter what? And
> upon memory hotplug (or unplug), the topology can change in a way that
> affects things, so node online time isn't right either?
> 

I think there are two use cases of interest:

 - allocating from a memoryless node where numa_node_id() is memoryless, 
   and

 - using node_to_mem_node() for a possibly-memoryless node for kmalloc().

I believe the first should have its own node_zonelist[0], whether it's 
memoryless or not, that points to a list of zones that start with those 
with the smallest distance.  I think its own node_zonelist[1], for 
__GFP_THISNODE allocations, should point to the node with present memory 
that has the smallest distance.

For sure node_zonelist[0] cannot be NULL since things like 
first_online_pgdat() would break and it should be unnecessary to do 
node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
since the zonelists should already be defined properly.  All nodes, 
regardless of whether they have memory or not, should probably end up 
having a struct pglist_data unless there's a reason for another level of 
indirection.
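
For reference, the zonelist indexing discussed here looked roughly like
this in include/linux/gfp.h at the time (abridged; index 0 is the
distance-ordered fallback list, index 1 the __GFP_THISNODE list on
NUMA):

static inline int gfp_zonelist(gfp_t flags)
{
	if (IS_ENABLED(CONFIG_NUMA) && unlikely(flags & __GFP_THISNODE))
		return 1;	/* node_zonelists[1]: this node only */

	return 0;		/* node_zonelists[0]: full fallback list */
}

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}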
Nishanth Aravamudan July 22, 2014, 9:43 p.m. UTC | #10
Hi David,

On 21.07.2014 [18:16:58 -0700], David Rientjes wrote:
> On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:
> 
> > Sorry for bringing up this old thread again, but I had a question for
> > you, David. node_to_mem_node(), which does seem like a useful API,
> > doesn't seem like it can rely on node_distance() alone, right? Because
> > that just tells us the relative cost (or so I think about it) of using
> > resources from that node. But we also need to know if that node itself
> > has memory, etc. So using the zonelists is required no matter what? And
> > upon memory hotplug (or unplug), the topology can change in a way that
> > affects things, so node online time isn't right either?
> > 
> 
> I think there are two use cases of interest:
> 
>  - allocating from a memoryless node where numa_node_id() is memoryless, 
>    and
> 
>  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> 
> I believe the first should have its own node_zonelist[0], whether it's 
> memoryless or not, that points to a list of zones that start with those 
> with the smallest distance.

Ok, and that would be used for falling back in the appropriate priority?

> I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> should point to the node with present memory that has the smallest
> distance.

And so would this, but with the caveat that we can fail here and don't
go further? Semantically, __GFP_THISNODE then means "as close as
physically possible ignoring run-time memory constraints". I say that
because obviously we might get off-node memory without memoryless nodes,
but that shouldn't be used to satisfy __GPF_THISNODE allocations.

> For sure node_zonelist[0] cannot be NULL since things like 
> first_online_pgdat() would break and it should be unnecessary to do 
> node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
> since the zonelists should already be defined properly.  All nodes, 
> regardless of whether they have memory or not, should probably end up 
> having a struct pglist_data unless there's a reason for another level of 
> indirection.

So I've re-tested Joonsoo's patches 2 and 3 from the series he sent, and
on powerpc now, things look really good. On a KVM instance with the
following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 14274 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

3.16.0-rc6 gives:

	Slab:            1039744 kB
	SReclaimable:      38976 kB
	SUnreclaim:      1000768 kB

Joonsoo's patches give:

	Slab:             366144 kB
	SReclaimable:      36928 kB
	SUnreclaim:       329216 kB

For reference, CONFIG_SLAB gives:

	Slab:             122496 kB
	SReclaimable:      14912 kB
	SUnreclaim:       107584 kB

At Tejun's request [adding him to Cc], I also partially reverted
81c98869faa5 ("kthread: ensure locality of task_struct allocations"): 

	Slab:             428864 kB
	SReclaimable:      44288 kB
	SUnreclaim:       384576 kB

This seems slightly worse, but I think it's because of the same
root-cause that I indicated in my RFC patch 2/2, quoting it here:

"    There is an issue currently where NUMA information is used on powerpc
    (and possibly ia64) before it has been read from the device-tree, which
    leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
    
    NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
    after start_secondary(), similar to ia64, which is invoked via
    smp_init().
    
    Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
    early_initcall()") made init_workqueues() be invoked via
    do_pre_smp_initcalls(), which is obviously before the secondary
    processors are online.
    ...
    Therefore, when init_workqueues() runs, it sees all CPUs as being on
    Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
    a high number of slab deactivations
    (http://www.spinics.net/lists/linux-mm/msg67489.html)."

Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
have to be especially careful that users of cpu_to_{node,mem} and
related APIs run *after* correct values are stored for all used CPUs?

In any case, with Joonsoo's patches, we shouldn't see slab deactivations
*if* the NUMA topology information is stored correctly. The full
changelog and patch is at http://patchwork.ozlabs.org/patch/371266/.

Adding my patch on top of Joonsoo's and the revert, I get:

	Slab:             411776 kB
	SReclaimable:      40960 kB
	SUnreclaim:       370816 kB

So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
much that we are close to OOM with small VM/LPAR sizes.

Thoughts?

I would like to push:

1) Joonsoo's patch to add get_numa_mem, renamed to node_to_mem_node(),
which caches the result of local_memory_node() for each node (see the
reference snippet after this list).

2) Joonsoo's patch to use node_to_mem_node in __slab_alloc() and
get_partial() when memoryless nodes are encountered.

3) Partial revert of 81c98869faa5 ("kthread: ensure locality of
task_struct allocations") to remove a reference to cpu_to_mem() from the
kthread code. After this, the only references to cpu_to_mem() are in
headers, mm/slab.c, and kernel/profile.c (the last of which is because
of the use of alloc_pages_exact_node(), it seems).

4) Re-post of my patch to fix an ordering issue for the per-CPU NUMA
information on powerpc
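
For reference (item 1 above), local_memory_node() in mm/page_alloc.c
looked roughly like this at the time, under CONFIG_HAVE_MEMORYLESS_NODES:

/*
 * Return the node used for "local" allocations on behalf of @node,
 * i.e. the node of the first zone in @node's generic zonelist.
 */
int local_memory_node(int node)
{
	struct zone *zone;

	(void)first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
				   gfp_zone(GFP_KERNEL),
				   NULL,
				   &zone);
	return zone->node;
}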

I understand your concerns, I think, about Joonsoo's patches, but we're
hitting this pretty regularly in the field and it would be nice to have
something workable in the short term, while I try to follow up on these
more invasive ideas.

Thanks,
Nish
Tejun Heo July 22, 2014, 9:49 p.m. UTC | #11
Hello,

On Tue, Jul 22, 2014 at 02:43:11PM -0700, Nishanth Aravamudan wrote:
...
> "    There is an issue currently where NUMA information is used on powerpc
>     (and possibly ia64) before it has been read from the device-tree, which
>     leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
>     
>     NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
>     after start_secondary(), similar to ia64, which is invoked via
>     smp_init().
>     
>     Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
>     early_initcall()") made init_workqueues() be invoked via
>     do_pre_smp_initcalls(), which is obviously before the secondary
>     processors are online.
>     ...
>     Therefore, when init_workqueues() runs, it sees all CPUs as being on
>     Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
>     a high number of slab deactivations
>     (http://www.spinics.net/lists/linux-mm/msg67489.html)."
> 
> Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
> correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
> have to be especially careful that users of cpu_to_{node,mem} and
> related APIs run *after* correct values are stored for all used CPUs?

Without delving into the code, yes, NUMA info should be set up as soon
as possible before major allocations happen.  All allocations which
happen beforehand would naturally be done with bogus NUMA information.

Thanks.
Nishanth Aravamudan July 22, 2014, 11:47 p.m. UTC | #12
On 22.07.2014 [14:43:11 -0700], Nishanth Aravamudan wrote:
> Hi David,

<snip>

> on powerpc now, things look really good. On a KVM instance with the
> following topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
> node 1 size: 16336 MB
> node 1 free: 14274 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> 3.16.0-rc6 gives:
> 
> 	Slab:            1039744 kB
> 	SReclaimable:      38976 kB
> 	SUnreclaim:      1000768 kB

<snip>

> Adding my patch on top of Joonsoo's and the revert, I get:
> 
> 	Slab:             411776 kB
> 	SReclaimable:      40960 kB
> 	SUnreclaim:       370816 kB
> 
> So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
> much that we are close to OOM with small VM/LPAR sizes.

Just to clarify/add one more datapoint, with a balanced topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 8154 MB
node 0 free: 8075 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 8181 MB
node 1 free: 7776 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10

I see the following for my patch + Joonsoo's + the revert:

Slab:             495872 kB
SReclaimable:      46528 kB
SUnreclaim:       449344 kB

(Although these numbers fluctuate quite a bit, between 250M and 500M.)
This indicates that the memoryless-node slab consumption is now on par
with a populated topology. And both still consume more than CONFIG_SLAB
requires.

Thanks,
Nish
David Rientjes July 23, 2014, 12:43 a.m. UTC | #13
On Tue, 22 Jul 2014, Nishanth Aravamudan wrote:

> > I think there are two use cases of interest:
> > 
> >  - allocating from a memoryless node where numa_node_id() is memoryless, 
> >    and
> > 
> >  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> > 
> > I believe the first should have its own node_zonelist[0], whether it's 
> > memoryless or not, that points to a list of zones that start with those 
> > with the smallest distance.
> 
> Ok, and that would be used for falling back in the appropriate priority?
> 

There's no real fallback since there's never a case when you can allocate 
on a memoryless node.  The zonelist defines the appropriate order in which 
to try to allocate from zones, so it depends on things like the 
numa_node_id() in alloc_pages_current() and whether the zonelist for a 
memoryless node is properly initialized or whether this needs to be 
numa_mem_id().  It depends on the intended behavior of calling 
alloc_pages_{node,vma}() with a memoryless node; the complexity of
(re-)building the zonelists at bootstrap and for memory hotplug isn't an
issue since those aren't hotpaths.

This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL 
is set.

> > I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> > should point to the node with present memory that has the smallest
> > distance.
> 
> And so would this, but with the caveat that we can fail here and don't
> go further? Semantically, __GFP_THISNODE then means "as close as
> physically possible ignoring run-time memory constraints". I say that
> because obviously we might get off-node memory without memoryless nodes,
> but that shouldn't be used to satisfy __GFP_THISNODE allocations.
> 

alloc_pages_current() substitutes any existing mempolicy for the default 
local policy when __GFP_THISNODE is set, and that would require local 
allocation.  That, currently, is numa_node_id() and not numa_mem_id().

The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so 
it will allocate remotely anyway.
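
For reference, an abridged sketch of alloc_pages_current() from
mm/mempolicy.c around this time (cpuset retry logic omitted; exact
details may differ):

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
	struct mempolicy *pol = current->mempolicy;

	/*
	 * __GFP_THISNODE drops any task mempolicy in favor of the
	 * default local policy, which is keyed on numa_node_id()
	 * below -- not numa_mem_id().
	 */
	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
		pol = &default_policy;

	if (pol->mode == MPOL_INTERLEAVE)
		return alloc_page_interleave(gfp, order,
					     interleave_nodes(pol));

	return __alloc_pages_nodemask(gfp, order,
			policy_zonelist(gfp, pol, numa_node_id()),
			policy_nodemask(gfp, pol));
}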

Patch

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..66b19b8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];

 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif

@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif

@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif

+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {