
slub: Don't throw away partial remote slabs if there is no local memory

Message ID 20140127055805.GA2471@lge.com (mailing list archive)
State Not Applicable

Commit Message

Joonsoo Kim Jan. 27, 2014, 5:58 a.m. UTC
On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > 
> > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > the system showing the original problem, configured to have 15GB of
> > > memory.
> > > 
> > > With your patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:         8768192 kB
> > > Slab:            3882560 kB
> > > SReclaimable:     105408 kB
> > > SUnreclaim:      3777152 kB
> > > 
> > > With Anton's patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:        11195008 kB
> > > Slab:            1427968 kB
> > > SReclaimable:     109184 kB
> > > SUnreclaim:      1318784 kB
> > > 
> > > 
> > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > 

Hello,

I think there is one mistake in David's patch, although I'm not sure
it is the reason for this result.

With David's patch, get_partial() in new_slab_objects() doesn't work properly,
because we only change the node id in the !node_match() case. If we hit just
the !freelist case, we pass the node id directly to new_slab_objects(), so we
always try to allocate a new slab page regardless of the existence of partial
pages. We should solve that.

Could you try this one?

Thanks.

Comments

Nishanth Aravamudan Jan. 28, 2014, 6:29 p.m. UTC | #1
On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > 
> > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > the system showing the original problem, configured to have 15GB of
> > > > memory.
> > > > 
> > > > With your patch after boot:
> > > > 
> > > > MemTotal:       15604736 kB
> > > > MemFree:         8768192 kB
> > > > Slab:            3882560 kB
> > > > SReclaimable:     105408 kB
> > > > SUnreclaim:      3777152 kB
> > > > 
> > > > With Anton's patch after boot:
> > > > 
> > > > MemTotal:       15604736 kB
> > > > MemFree:        11195008 kB
> > > > Slab:            1427968 kB
> > > > SReclaimable:     109184 kB
> > > > SUnreclaim:      1318784 kB
> > > > 
> > > > 
> > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > 
> 
> Hello,
> 
> I think there is one mistake in David's patch, although I'm not sure
> it is the reason for this result.
> 
> With David's patch, get_partial() in new_slab_objects() doesn't work
> properly, because we only change the node id in the !node_match() case.
> If we hit just the !freelist case, we pass the node id directly to
> new_slab_objects(), so we always try to allocate a new slab page
> regardless of the existence of partial pages. We should solve that.
> 
> Could you try this one?

This helps about the same as David's patch -- but I found the reason
why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
shortly for that and one other case I found.

This patch on its own seems to help on our test system by saving around
1.5GB of slab.

Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

with the caveat below.

Thanks,
Nish

> 
> Thanks.
> 
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>                 struct kmem_cache_cpu *c)
>  {
>         void *object;
> -       int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +       int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
> +       if (node != NUMA_NO_NODE && !node_present_pages(node))
> +               searchnode = numa_mem_id();

This might be clearer as:

int searchnode = node;
if (node == NUMA_NO_NODE || !node_present_pages(node))
	searchnode = numa_mem_id();

>         object = get_partial_node(s, get_node(s, searchnode), c, flags);
>         if (object || node != NUMA_NO_NODE)
>                 return object;
> @@ -2278,10 +2280,14 @@ redo:
> 
>         if (unlikely(!node_match(page, node))) {
>                 stat(s, ALLOC_NODE_MISMATCH);
> -               deactivate_slab(s, page, c->freelist);
> -               c->page = NULL;
> -               c->freelist = NULL;
> -               goto new_slab;
> +               if (unlikely(!node_present_pages(node)))
> +                       node = numa_mem_id();
> +               if (!node_match(page, node)) {
> +                       deactivate_slab(s, page, c->freelist);
> +                       c->page = NULL;
> +                       c->freelist = NULL;
> +                       goto new_slab;
> +               }
>         }
> 
>         /*
>
Christoph Lameter (Ampere) Jan. 29, 2014, 3:54 p.m. UTC | #2
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.

Oww...
Nishanth Aravamudan Jan. 29, 2014, 10:36 p.m. UTC | #3
On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> > On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > 
> > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > > 
> > > > > With your patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > > 
> > > > > With Anton's patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > > 
> > > > > 
> > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > 
> > 
> > Hello,
> > 
> > I think there is one mistake in David's patch, although I'm not sure
> > it is the reason for this result.
> > 
> > With David's patch, get_partial() in new_slab_objects() doesn't work
> > properly, because we only change the node id in the !node_match() case.
> > If we hit just the !freelist case, we pass the node id directly to
> > new_slab_objects(), so we always try to allocate a new slab page
> > regardless of the existence of partial pages. We should solve that.
> > 
> > Could you try this one?
> 
> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.
> 
> This patch on its own seems to help on our test system by saving around
> 1.5GB of slab.
> 
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> with the caveat below.
> 
> Thanks,
> Nish
> 
> > 
> > Thanks.
> > 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >                 struct kmem_cache_cpu *c)
> >  {
> >         void *object;
> > -       int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +       int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> > 
> > +       if (node != NUMA_NO_NODE && !node_present_pages(node))
> > +               searchnode = numa_mem_id();
> 
> This might be clearer as:
> 
> int searchnode = node;
> if (node == NUMA_NO_NODE || !node_present_pages(node))
> 	searchnode = numa_mem_id();

Cody Schafer mentioned to me on IRC that this may not always reflect
exactly what the caller intends.

int searchnode = node;
if (node == NUMA_NO_NODE)
	searchnode = numa_mem_id();
else if (!node_present_pages(node))
	searchnode = local_memory_node(node);

The difference in semantics from the previous version is that here, if we
have a memoryless node, rather than using the NUMA node nearest to the CPU,
we use the NUMA node with memory closest to the requested one?

> >         object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >         if (object || node != NUMA_NO_NODE)
> >                 return object;
> > @@ -2278,10 +2280,14 @@ redo:
> > 
> >         if (unlikely(!node_match(page, node))) {
> >                 stat(s, ALLOC_NODE_MISMATCH);
> > -               deactivate_slab(s, page, c->freelist);
> > -               c->page = NULL;
> > -               c->freelist = NULL;
> > -               goto new_slab;
> > +               if (unlikely(!node_present_pages(node)))
> > +                       node = numa_mem_id();

Similarly here?

-Nish

> > +               if (!node_match(page, node)) {
> > +                       deactivate_slab(s, page, c->freelist);
> > +                       c->page = NULL;
> > +                       c->freelist = NULL;
> > +                       goto new_slab;
> > +               }
> >         }
> > 
> >         /*
> >
Christoph Lameter (Ampere) Jan. 30, 2014, 4:26 p.m. UTC | #4
On Wed, 29 Jan 2014, Nishanth Aravamudan wrote:

> exactly what the caller intends.
>
> int searchnode = node;
> if (node == NUMA_NO_NODE)
> 	searchnode = numa_mem_id();
> else if (!node_present_pages(node))
> 	searchnode = local_memory_node(node);
>
> The difference in semantics from the previous is that here, if we have a
> memoryless node, rather than using the CPU's nearest NUMA node, we use
> the NUMA node closest to the requested one?

The idea here is that the page allocator will do the fallback to other
nodes. This check for !node_present_pages() should not be necessary. SLUB
needs to accept the page from whatever node the page allocator returned and
work with that.

The problem is that the check for having a slab from the "right" node may
fail again after another attempt to allocate from the same node. SLUB will
then push the slab from the *wrong* node back to the partial lists and may
attempt another allocation that will again be successful but return memory
from another node. That way the partial lists of a particular node grow
uselessly.

One way to solve this may be to check if memory was actually allocated from
the requested node and fall back to NUMA_NO_NODE (which will use the last
allocated slab) for future allocs if the page allocator returned memory from
a different node (unless GFP_THISNODE is set, of course). Otherwise we end
up replicating the page allocator logic in slub like in slab. That is what I
wanted to avoid.
Nishanth Aravamudan Feb. 3, 2014, 11 p.m. UTC | #5
On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> > On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > 
> > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > > 
> > > > > With your patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > > 
> > > > > With Anton's patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > > 
> > > > > 
> > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > 
> > 
> > Hello,
> > 
> > I think there is one mistake in David's patch, although I'm not sure
> > it is the reason for this result.
> > 
> > With David's patch, get_partial() in new_slab_objects() doesn't work
> > properly, because we only change the node id in the !node_match() case.
> > If we hit just the !freelist case, we pass the node id directly to
> > new_slab_objects(), so we always try to allocate a new slab page
> > regardless of the existence of partial pages. We should solve that.
> > 
> > Could you try this one?
> 
> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.
> 
> This patch on its own seems to help on our test system by saving around
> 1.5GB of slab.
> 
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> with the caveat below.

So what's the status of this patch? Christoph, do you think this is fine
as it is?

Thanks,
Nish
Christoph Lameter (Ampere) Feb. 4, 2014, 3:38 a.m. UTC | #6
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> So what's the status of this patch? Christoph, do you think this is fine
> as it is?

Certainly enabling CONFIG_MEMORYLESS_NODES is the right thing to do and I
already acked the patch.
Nishanth Aravamudan Feb. 4, 2014, 7:26 a.m. UTC | #7
On 03.02.2014 [21:38:36 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
> 
> > So what's the status of this patch? Christoph, do you think this is fine
> > as it is?
> 
> Certainly enabling CONFIG_MEMORYLESS_NODES is the right thing to do and I
> already acked the patch.

Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
the $SUBJECT issue.

Thanks,
Nish
Christoph Lameter (Ampere) Feb. 4, 2014, 8:39 p.m. UTC | #8
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> the $SUBJECT issue.

Hmmm... I am not sure that this is a general solution. The fallback to
other nodes does not only occur because a node has no memory, as his patch
assumes.

If the target node allocation fails (for whatever reason) then I would
recommend, for simplicity's sake, changing the target node to NUMA_NO_NODE
and just taking whatever is in the current cpu slab. A more complex solution
would be to look through the partial lists in increasing distance to find a
partially used slab that is reasonably close to the current node. Slab has
logic like that in fallback_alloc(). SLUB's get_any_partial() function does
something close to what you want.
Nishanth Aravamudan Feb. 5, 2014, 12:13 a.m. UTC | #9
On 04.02.2014 [14:39:32 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
> 
> > Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> > the $SUBJECT issue.
> 
> Hmmm... I am not sure that this is a general solution. The fallback to
> other nodes does not only occur because a node has no memory, as his patch
> assumes.

Thanks, Christoph. I see your point.

Something in this area would be nice, though, as it does produce a
fairly significant bump in the slab usage on our test system.

> If the target node allocation fails (for whatever reason) then I would
> recommend, for simplicity's sake, changing the target node to
> NUMA_NO_NODE and just taking whatever is in the current cpu slab. A more
> complex solution would be to look through the partial lists in increasing
> distance to find a partially used slab that is reasonably close to the
> current node. Slab has logic like that in fallback_alloc(). SLUB's
> get_any_partial() function does something close to what you want.

I apologize for my own ignorance, but I'm having trouble following.
Anton's original patch did fallback to the current cpu slab, but I'm not
sure any NUMA_NO_NODE change is necessary there. At the point we're
deactivating the slab (in the current code, in __slab_alloc()), we have
successfully allocated from somewhere, it's just not on the node we
expected to be on.

So perhaps you are saying to make a change lower in the code? I'm not
sure where it makes sense to change the target node in that case. I'd
appreciate any guidance you can give.

Thanks,
Nish
Christoph Lameter (Ampere) Feb. 5, 2014, 7:28 p.m. UTC | #10
On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason) then I would
> > recommend, for simplicity's sake, changing the target node to
> > NUMA_NO_NODE and just taking whatever is in the current cpu slab. A more
> > complex solution would be to look through the partial lists in increasing
> > distance to find a partially used slab that is reasonably close to the
> > current node. Slab has logic like that in fallback_alloc(). SLUB's
> > get_any_partial() function does something close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fallback to the current cpu slab, but I'm not
> sure any NUMA_NO_NODE change is necessary there. At the point we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere, it's just not on the node we
> expected to be on.

Right so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This is not an easy thing to do. If the current slab is not from the right
node but is from the node that the page allocator would be returning memory
from, then the current slab can still be allocated from. If the fallback is
to another node then the current cpu slab needs to be deactivated and the
allocation from that node needs to proceed. Have a look at fallback_alloc()
in the slab allocator.

An allocation attempt from the page allocator can be restricted to a
specific node through GFP_THISNODE.
Nishanth Aravamudan Feb. 6, 2014, 2:08 a.m. UTC | #11
On 05.02.2014 [13:28:03 -0600], Christoph Lameter wrote:
> On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:
> 
> > > If the target node allocation fails (for whatever reason) then I would
> > > recommend, for simplicity's sake, changing the target node to
> > > NUMA_NO_NODE and just taking whatever is in the current cpu slab. A more
> > > complex solution would be to look through the partial lists in increasing
> > > distance to find a partially used slab that is reasonably close to the
> > > current node. Slab has logic like that in fallback_alloc(). SLUB's
> > > get_any_partial() function does something close to what you want.
> >
> > I apologize for my own ignorance, but I'm having trouble following.
> > Anton's original patch did fallback to the current cpu slab, but I'm not
> > sure any NUMA_NO_NODE change is necessary there. At the point we're
> > deactivating the slab (in the current code, in __slab_alloc()), we have
> > successfully allocated from somewhere, it's just not on the node we
> > expected to be on.
> 
> Right so if we are ignoring the node then the simplest thing to do is to
> not deactivate the current cpu slab but to take an object from it.

Ok, that's what Anton's patch does, I believe. Are you ok with that
patch as it is?

> > So perhaps you are saying to make a change lower in the code? I'm not
> > sure where it makes sense to change the target node in that case. I'd
> > appreciate any guidance you can give.
> 
> This is not an easy thing to do. If the current slab is not from the right
> node but is from the node that the page allocator would be returning memory
> from, then the current slab can still be allocated from. If the fallback is
> to another node then the current cpu slab needs to be deactivated and the
> allocation from that node needs to proceed. Have a look at fallback_alloc()
> in the slab allocator.
> 
> An allocation attempt from the page allocator can be restricted to a
> specific node through GFP_THISNODE.

Thanks for the pointers, I will try and take a look.

Thanks,
Nish
Christoph Lameter (Ampere) Feb. 6, 2014, 5:25 p.m. UTC | #12
On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:

> > Right so if we are ignoring the node then the simplest thing to do is to
> > not deactivate the current cpu slab but to take an object from it.
>
> Ok, that's what Anton's patch does, I believe. Are you ok with that
> patch as it is?

No. Again, his patch only works if the node is memoryless, not if there are
other issues that prevent allocation from that node.

Patch

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1698,8 +1698,10 @@  static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
                struct kmem_cache_cpu *c)
 {
        void *object;
-       int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+       int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
+       if (node != NUMA_NO_NODE && !node_present_pages(node))
+               searchnode = numa_mem_id();
        object = get_partial_node(s, get_node(s, searchnode), c, flags);
        if (object || node != NUMA_NO_NODE)
                return object;
@@ -2278,10 +2280,14 @@  redo:
 
        if (unlikely(!node_match(page, node))) {
                stat(s, ALLOC_NODE_MISMATCH);
-               deactivate_slab(s, page, c->freelist);
-               c->page = NULL;
-               c->freelist = NULL;
-               goto new_slab;
+               if (unlikely(!node_present_pages(node)))
+                       node = numa_mem_id();
+               if (!node_match(page, node)) {
+                       deactivate_slab(s, page, c->freelist);
+                       c->page = NULL;
+                       c->freelist = NULL;
+                       goto new_slab;
+               }
        }
 
        /*