slub: Don't throw away partial remote slabs if there is no local memory

Message ID 20140107132100.5b5ad198@kryten (mailing list archive)
State Not Applicable

Commit Message

Anton Blanchard Jan. 7, 2014, 2:21 a.m. UTC
We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local
memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
sees it isn't node local, deactivates it and tries to allocate a new
slab. On empty nodes we will allocate a new remote slab and use the
first slot, but as explained above when we get called a second time
we will just deactivate that slab and retry.

As such we end up only using 1 entry in each slab:

slab                    mem  objects
                       used   active
------------------------------------
kmalloc-16384       1404 MB    4.90%
task_struct          668 MB    2.90%
kmalloc-128          193 MB    3.61%
kmalloc-192          152 MB    5.23%
kmalloc-8192          72 MB   23.40%
kmalloc-16            64 MB    7.43%
kmalloc-512           33 MB   22.41%

The patch below checks that a node is not empty before deactivating a
slab and trying to allocate it again. With this patch applied we now
use about 352MB:

Slab:             360192 kB

And our efficiency is much better:

slab                    mem  objects
                       used   active
------------------------------------
kmalloc-16384         92 MB   74.27%
task_struct           23 MB   83.46%
idr_layer_cache       18 MB  100.00%
pgtable-2^12          17 MB  100.00%
kmalloc-65536         15 MB  100.00%
inode_cache           14 MB  100.00%
kmalloc-256           14 MB   97.81%
kmalloc-8192          14 MB   85.71%

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Thoughts? It seems like we could hit a similar situation if a machine
is balanced but we run out of memory on a single node.
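
For readers less familiar with the slub fast path, the check involved is roughly
the following, a paraphrase of node_match() from mm/slub.c of that era rather than
a verbatim copy. A cpu slab only matches if the caller did not ask for a specific
node, or the slab's backing page really lives on that node; a memoryless node can
never satisfy the second condition, so every node-local request deactivates the
cpu slab and starts over.

	/*
	 * Paraphrase of the fast path check: NUMA_NO_NODE means "any node
	 * is fine"; otherwise the page backing the cpu slab must live on
	 * the requested node, which a memoryless node can never provide.
	 */
	static inline int node_match(struct page *page, int node)
	{
		return node == NUMA_NO_NODE || page_to_nid(page) == node;
	}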

Comments

Wanpeng Li Jan. 7, 2014, 4:19 a.m. UTC | #1
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>
>We noticed a huge amount of slab memory consumed on a large ppc64 box:
>
>Slab:            2094336 kB
>
>Almost 2GB. This box is not balanced and some nodes do not have local
>memory, causing slub to be very inefficient in its slab usage.
>
>Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
>sees it isn't node local, deactivates it and tries to allocate a new
>slab. On empty nodes we will allocate a new remote slab and use the
>first slot, but as explained above when we get called a second time
>we will just deactivate that slab and retry.
>
>As such we end up only using 1 entry in each slab:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384       1404 MB    4.90%
>task_struct          668 MB    2.90%
>kmalloc-128          193 MB    3.61%
>kmalloc-192          152 MB    5.23%
>kmalloc-8192          72 MB   23.40%
>kmalloc-16            64 MB    7.43%
>kmalloc-512           33 MB   22.41%
>
>The patch below checks that a node is not empty before deactivating a
>slab and trying to allocate it again. With this patch applied we now
>use about 352MB:
>
>Slab:             360192 kB
>
>And our efficiency is much better:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384         92 MB   74.27%
>task_struct           23 MB   83.46%
>idr_layer_cache       18 MB  100.00%
>pgtable-2^12          17 MB  100.00%
>kmalloc-65536         15 MB  100.00%
>inode_cache           14 MB  100.00%
>kmalloc-256           14 MB   97.81%
>kmalloc-8192          14 MB   85.71%
>
>Signed-off-by: Anton Blanchard <anton@samba.org>

Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>

>---
>
>Thoughts? It seems like we could hit a similar situation if a machine
>is balanced but we run out of memory on a single node.
>
>Index: b/mm/slub.c
>===================================================================
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -2278,10 +2278,17 @@ redo:
>
> 	if (unlikely(!node_match(page, node))) {
> 		stat(s, ALLOC_NODE_MISMATCH);
>-		deactivate_slab(s, page, c->freelist);
>-		c->page = NULL;
>-		c->freelist = NULL;
>-		goto new_slab;
>+
>+		/*
>+		 * If the node contains no memory there is no point in trying
>+		 * to allocate a new node local slab
>+		 */
>+		if (node_spanned_pages(node)) {

s/node_spanned_pages/node_present_pages 

>+			deactivate_slab(s, page, c->freelist);
>+			c->page = NULL;
>+			c->freelist = NULL;
>+			goto new_slab;
>+		}
> 	}
>
> 	/*
>
Andi Kleen Jan. 7, 2014, 6:49 a.m. UTC | #2
Anton Blanchard <anton@samba.org> writes:
>
> Thoughts? It seems like we could hit a similar situation if a machine
> is balanced but we run out of memory on a single node.

Yes I agree, but your patch doesn't seem to attempt to handle this?

-Andi
>
> Index: b/mm/slub.c
> ===================================================================
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2278,10 +2278,17 @@ redo:
>  
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +
> +		/*
> +		 * If the node contains no memory there is no point in trying
> +		 * to allocate a new node local slab
> +		 */
> +		if (node_spanned_pages(node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}
>  	}
>  
>  	/*
Wanpeng Li Jan. 7, 2014, 8:48 a.m. UTC | #3
Hi Joonsoo,
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>> 
[...]
>Hello,
>
>I think that we need more effort to solve the unbalanced node problem.
>
>With this patch, even if the node of the current cpu slab is not favorable for
>the unbalanced node, the allocation would proceed and we would get unintended memory.
>

We have a machine:

[    0.000000] Node 0 Memory:
[    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
[    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
[    0.000000] Node 10 Memory: 0xc0000000-0x180000000

[    0.041486] Node 0 CPUs: 0-19
[    0.041490] Node 4 CPUs:
[    0.041492] Node 6 CPUs:
[    0.041495] Node 10 CPUs:

The pages of the current cpu slab should be allocated from the fallback zones/nodes
of the memoryless node in the buddy system, so how can an unfavorable node happen?

>And there is one more problem. Even if we have some partial slabs on a
>compatible node, we would allocate a new slab, because get_partial() cannot handle
>this unbalanced node case.
>
>To fix this correctly, how about the following patch?
>

So I think we should fold your two patches into one.

Regards,
Wanpeng Li 

>Thanks.
>
>------------->8--------------------
>diff --git a/mm/slub.c b/mm/slub.c
>index c3eb3d3..a1f6dfa 100644
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> {
>        void *object;
>        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>+       struct zonelist *zonelist;
>+       struct zoneref *z;
>+       struct zone *zone;
>+       enum zone_type high_zoneidx = gfp_zone(flags);
>
>+       if (!node_present_pages(searchnode)) {
>+               zonelist = node_zonelist(searchnode, flags);
>+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>+                       searchnode = zone_to_nid(zone);
>+                       if (node_present_pages(searchnode))
>+                               break;
>+               }
>+       }
>        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>        if (object || node != NUMA_NO_NODE)
>                return object;
>
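
To make the quoted get_partial() change easier to follow, here is the same idea
pulled out into a small standalone helper; the name nearest_memory_node is
illustrative and not from the thread:

	/*
	 * Illustrative helper, extracted from the get_partial() hunk quoted
	 * above: starting from a node with no present pages, walk its
	 * zonelist and return the first node that actually has memory.
	 */
	static int nearest_memory_node(int node, gfp_t flags)
	{
		struct zonelist *zonelist;
		struct zoneref *z;
		struct zone *zone;
		enum zone_type high_zoneidx = gfp_zone(flags);

		if (node_present_pages(node))
			return node;

		zonelist = node_zonelist(node, flags);
		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
			if (node_present_pages(zone_to_nid(zone)))
				return zone_to_nid(zone);
		}

		return node;
	}
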
Wanpeng Li Jan. 7, 2014, 9:21 a.m. UTC | #4
On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
>> Hi Joonsoo,
>> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
>> >On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>> >> 
>> [...]
>> >Hello,
>> >
>> >I think that we need more effort to solve the unbalanced node problem.
>> >
>> >With this patch, even if the node of the current cpu slab is not favorable for
>> >the unbalanced node, the allocation would proceed and we would get unintended memory.
>> >
>> 
>> We have a machine:
>> 
>> [    0.000000] Node 0 Memory:
>> [    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
>> [    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
>> [    0.000000] Node 10 Memory: 0xc0000000-0x180000000
>> 
>> [    0.041486] Node 0 CPUs: 0-19
>> [    0.041490] Node 4 CPUs:
>> [    0.041492] Node 6 CPUs:
>> [    0.041495] Node 10 CPUs:
>> 
>> The pages of the current cpu slab should be allocated from the fallback zones/nodes
>> of the memoryless node in the buddy system, so how can an unfavorable node happen?
>
>Hi, Wanpeng.
>
>IIRC, if we call kmem_cache_alloc_node() with a certain node #, we try to
>allocate the page from the fallback zones/nodes of that node #. So that fallback
>list isn't related to the fallback list of the memoryless node #. Am I wrong?
>

Anton adds a node_spanned_pages(node) check, so the current cpu slab mentioned
above is against a memoryless node. Am I missing something?

Regards,
Wanpeng Li 

>Thanks.
>
>> 
>> >And there is one more problem. Even if we have some partial slabs on a
>> >compatible node, we would allocate a new slab, because get_partial() cannot handle
>> >this unbalanced node case.
>> >
>> >To fix this correctly, how about the following patch?
>> >
>> 
>> So I think we should fold your two patches into one.
>> 
>> Regards,
>> Wanpeng Li 
>> 
>> >Thanks.
>> >
>> >------------->8--------------------
>> >diff --git a/mm/slub.c b/mm/slub.c
>> >index c3eb3d3..a1f6dfa 100644
>> >--- a/mm/slub.c
>> >+++ b/mm/slub.c
>> >@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>> > {
>> >        void *object;
>> >        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>> >+       struct zonelist *zonelist;
>> >+       struct zoneref *z;
>> >+       struct zone *zone;
>> >+       enum zone_type high_zoneidx = gfp_zone(flags);
>> >
>> >+       if (!node_present_pages(searchnode)) {
>> >+               zonelist = node_zonelist(searchnode, flags);
>> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>> >+                       searchnode = zone_to_nid(zone);
>> >+                       if (node_present_pages(searchnode))
>> >+                               break;
>> >+               }
>> >+       }
>> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> >        if (object || node != NUMA_NO_NODE)
>> >                return object;
>> >
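
For context on whose fallback list applies where: when slub allocates a brand new
slab for a specific node, it hands that node number straight to the page allocator,
so the buddy fallback that kicks in is the requested node's own zonelist. Roughly,
paraphrased from mm/slub.c of that era with details trimmed:

	/*
	 * Rough paraphrase of slub's slab page allocation: a concrete node #
	 * goes straight to the buddy allocator, which then applies that
	 * node's zonelist fallback if the node itself has nothing to give.
	 */
	static struct page *alloc_slab_page(gfp_t flags, int node, unsigned int order)
	{
		if (node == NUMA_NO_NODE)
			return alloc_pages(flags, order);

		return alloc_pages_exact_node(node, flags, order);
	}
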
David Laight Jan. 7, 2014, 9:42 a.m. UTC | #5
> From: Anton Blanchard
> We noticed a huge amount of slab memory consumed on a large ppc64 box:
> 
> Slab:            2094336 kB
> 
> Almost 2GB. This box is not balanced and some nodes do not have local
> memory, causing slub to be very inefficient in its slab usage.
> 
> Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
> sees it isn't node local, deactivates it and tries to allocate a new
> slab. ...
...
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> 		deactivate_slab(s, page, c->freelist);
> 		c->page = NULL;
> 		c->freelist = NULL;
> 		goto new_slab;
>  	}

Why not just delete the entire test?
Presumably some time a little earlier no local memory was available.
Even if some is available now, it is very likely that it will run
out again in the near future.

	David.
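
What deleting the entire test would look like against the code being patched, as
an illustrative hunk rather than anything posted in the thread; with it, the node
hint never forces a deactivation and kmem_cache_alloc_node() callers simply take
objects from whatever cpu slab happens to be active:

-	if (unlikely(!node_match(page, node))) {
-		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
-	}
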
Wanpeng Li Jan. 7, 2014, 10:28 a.m. UTC | #6
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>
>We noticed a huge amount of slab memory consumed on a large ppc64 box:
>
>Slab:            2094336 kB
>
>Almost 2GB. This box is not balanced and some nodes do not have local
>memory, causing slub to be very inefficient in its slab usage.
>
>Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
>sees it isn't node local, deactivates it and tries to allocate a new
>slab. On empty nodes we will allocate a new remote slab and use the
>first slot, but as explained above when we get called a second time
>we will just deactivate that slab and retry.
>

Deactivating the cpu slab doesn't always mean the slab is freed back to the buddy
system; the slab may be put back on the remote node's partial list if there
are objects still in use in this unbalanced situation. In that case the slub slow
path can freeze that partial slab on the remote node again. So why do the slab
caches end up as fragmented as shown below?

Regards,
Wanpeng Li 

>As such we end up only using 1 entry in each slab:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384       1404 MB    4.90%
>task_struct          668 MB    2.90%
>kmalloc-128          193 MB    3.61%
>kmalloc-192          152 MB    5.23%
>kmalloc-8192          72 MB   23.40%
>kmalloc-16            64 MB    7.43%
>kmalloc-512           33 MB   22.41%
>
>The patch below checks that a node is not empty before deactivating a
>slab and trying to allocate it again. With this patch applied we now
>use about 352MB:
>
>Slab:             360192 kB
>
>And our efficiency is much better:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384         92 MB   74.27%
>task_struct           23 MB   83.46%
>idr_layer_cache       18 MB  100.00%
>pgtable-2^12          17 MB  100.00%
>kmalloc-65536         15 MB  100.00%
>inode_cache           14 MB  100.00%
>kmalloc-256           14 MB   97.81%
>kmalloc-8192          14 MB   85.71%
>
>Signed-off-by: Anton Blanchard <anton@samba.org>
>---
>
>Thoughts? It seems like we could hit a similar situation if a machine
>is balanced but we run out of memory on a single node.
>
>Index: b/mm/slub.c
>===================================================================
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -2278,10 +2278,17 @@ redo:
>
> 	if (unlikely(!node_match(page, node))) {
> 		stat(s, ALLOC_NODE_MISMATCH);
>-		deactivate_slab(s, page, c->freelist);
>-		c->page = NULL;
>-		c->freelist = NULL;
>-		goto new_slab;
>+
>+		/*
>+		 * If the node contains no memory there is no point in trying
>+		 * to allocate a new node local slab
>+		 */
>+		if (node_spanned_pages(node)) {
>+			deactivate_slab(s, page, c->freelist);
>+			c->page = NULL;
>+			c->freelist = NULL;
>+			goto new_slab;
>+		}
> 	}
>
> 	/*
>
Anton Blanchard Jan. 8, 2014, 2:03 p.m. UTC | #7
Hi Andi,

> > Thoughts? It seems like we could hit a similar situation if a
> > machine is balanced but we run out of memory on a single node.
> 
> Yes I agree, but your patch doesn't seem to attempt to handle this?

It doesn't. I was hoping someone with more mm knowledge than I could
suggest a lightweight way of doing this.

Anton
Anton Blanchard Jan. 8, 2014, 2:14 p.m. UTC | #8
Hi David,

> Why not just delete the entire test?
> Presumably some time a little earlier no local memory was available.
> Even if there is some available now, it is very likely that some won't
> be available again in the near future.

I agree, the current behaviour seems strange, but it has been around
since the initial slub commit.

Anton
Anton Blanchard Jan. 8, 2014, 2:17 p.m. UTC | #9
Hi Wanpeng,

> >+		if (node_spanned_pages(node)) {
> 
> s/node_spanned_pages/node_present_pages 

Thanks, I hadn't come across node_present_pages() before.

Anton
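
For reference, the distinction behind the s/node_spanned_pages/node_present_pages
comment: spanned pages cover the node's whole physical range including holes, while
present pages are the pages that actually exist, so node_present_pages() is the
right test for "does this node have any memory at all". Paraphrased from
include/linux/mmzone.h, both are thin wrappers around fields of the node's
pg_data_t:

	/* Paraphrased from include/linux/mmzone.h */
	#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
	#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)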

Patch

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@  redo:
 
 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*