Message ID | 20140127055805.GA2471@lge.com (mailing list archive)
---|---
State | Not Applicable
On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > >
> > > > Thank you for clarifying and providing a test patch. I ran with this on
> > > > the system showing the original problem, configured to have 15GB of
> > > > memory.
> > > >
> > > > With your patch after boot:
> > > >
> > > > MemTotal:       15604736 kB
> > > > MemFree:         8768192 kB
> > > > Slab:            3882560 kB
> > > > SReclaimable:     105408 kB
> > > > SUnreclaim:      3777152 kB
> > > >
> > > > With Anton's patch after boot:
> > > >
> > > > MemTotal:       15604736 kB
> > > > MemFree:        11195008 kB
> > > > Slab:            1427968 kB
> > > > SReclaimable:     109184 kB
> > > > SUnreclaim:      1318784 kB
> > > >
> > > > I know that's fairly unscientific, but the numbers are reproducible.
>
> Hello,
>
> I think that there is one mistake in David's patch, although I'm not sure
> that it is the reason for this result.
>
> With David's patch, get_partial() in new_slab_objects() doesn't work
> properly, because we only change the node id in the !node_match() case. If
> we hit just the !freelist case, we pass the node id directly to
> new_slab_objects(), so we always try to allocate a new slab page
> regardless of the existence of partial pages. We should solve that.
>
> Could you try this one?

This helps about the same as David's patch -- but I found the reason
why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
shortly for that and one other case I found.

This patch on its own seems to help on our test system by saving around
1.5GB of slab.

Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

with the caveat below.

Thanks,
Nish

> Thanks.
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
>
> +	if (node != NUMA_NO_NODE && !node_present_pages(node))
> +		searchnode = numa_mem_id();

This might be clearer as:

	int searchnode = node;

	if (node == NUMA_NO_NODE || !node_present_pages(node))
		searchnode = numa_mem_id();

>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
>  		return object;
> @@ -2278,10 +2280,14 @@ redo:
>
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		if (unlikely(!node_present_pages(node)))
> +			node = numa_mem_id();
> +		if (!node_match(page, node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}
>  	}
>
>  	/*
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.

Oww...
On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> [...]
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  		struct kmem_cache_cpu *c)
> >  {
> >  	void *object;
> > -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> > +	if (node != NUMA_NO_NODE && !node_present_pages(node))
> > +		searchnode = numa_mem_id();
>
> This might be clearer as:
>
> 	int searchnode = node;
>
> 	if (node == NUMA_NO_NODE || !node_present_pages(node))
> 		searchnode = numa_mem_id();

Cody Schafer mentioned to me on IRC that this may not always reflect
exactly what the caller intends:

	int searchnode = node;

	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	if (!node_present_pages(node))
		searchnode = local_memory_node(node);

The difference in semantics from the previous version is that here, if we
have a memoryless node, rather than using the CPU's nearest NUMA node, we
use the NUMA node closest to the requested one?

> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
> > @@ -2278,10 +2280,14 @@ redo:
> >
> >  	if (unlikely(!node_match(page, node))) {
> >  		stat(s, ALLOC_NODE_MISMATCH);
> > -		deactivate_slab(s, page, c->freelist);
> > -		c->page = NULL;
> > -		c->freelist = NULL;
> > -		goto new_slab;
> > +		if (unlikely(!node_present_pages(node)))
> > +			node = numa_mem_id();

Similarly here?

-Nish

> > +		if (!node_match(page, node)) {
> > +			deactivate_slab(s, page, c->freelist);
> > +			c->page = NULL;
> > +			c->freelist = NULL;
> > +			goto new_slab;
> > +		}
> >  	}
> >
> >  	/*
On Wed, 29 Jan 2014, Nishanth Aravamudan wrote:

> Cody Schafer mentioned to me on IRC that this may not always reflect
> exactly what the caller intends:
>
> 	int searchnode = node;
>
> 	if (node == NUMA_NO_NODE)
> 		searchnode = numa_mem_id();
> 	if (!node_present_pages(node))
> 		searchnode = local_memory_node(node);
>
> The difference in semantics from the previous version is that here, if we
> have a memoryless node, rather than using the CPU's nearest NUMA node, we
> use the NUMA node closest to the requested one?

The idea here is that the page allocator will do the fallback to other
nodes. This check for !node_present_pages() should not be necessary. SLUB
needs to accept the page from whatever node the page allocator returned
and work with that.

The problem is that the check for having a slab from the "right" node may
fail again after another attempt to allocate from the same node. SLUB will
then push the slab from the *wrong* node back to the partial lists and may
attempt another allocation that will again be successful but return memory
from another node. That way the partial lists for a particular node grow
uselessly.

One way to solve this may be to check if memory was actually allocated
from the requested node, and fall back to NUMA_NO_NODE (which will use the
last allocated slab) for future allocs if the page allocator returned
memory from a different node (unless GFP_THISNODE is set, of course).
Otherwise we end up replicating the page allocator logic in slub like in
slab. That is what I wanted to avoid.
On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> [...]

So what's the status of this patch? Christoph, do you think this is fine
as it is?

Thanks,
Nish
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> So what's the status of this patch? Christoph, do you think this is fine
> as it is?

Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to do,
and I already acked the patch.
On 03.02.2014 [21:38:36 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
>
> > So what's the status of this patch? Christoph, do you think this is fine
> > as it is?
>
> Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to do,
> and I already acked the patch.

Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
the $SUBJECT issue.

Thanks,
Nish
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> the $SUBJECT issue.

Hmmm... I am not sure that this is a general solution. The fallback to
other nodes can occur not only because a node has no memory, as his patch
assumes.

If the target node allocation fails (for whatever reason), then I would
recommend, for simplicity's sake, changing the target node to NUMA_NO_NODE
and just taking whatever is in the current cpu slab. A more complex
solution would be to look through the partial lists in increasing distance
to find a partially used slab that is reasonably close to the current
node. Slab has logic like that in fallback_alloc(). Slub's
get_any_partial() function does something close to what you want.
On 04.02.2014 [14:39:32 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
>
> > Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> > the $SUBJECT issue.
>
> Hmmm... I am not sure that this is a general solution. The fallback to
> other nodes can occur not only because a node has no memory, as his patch
> assumes.

Thanks, Christoph. I see your point. Something in this area would be
nice, though, as it does produce a fairly significant improvement in slab
usage on our test system.

> If the target node allocation fails (for whatever reason), then I would
> recommend, for simplicity's sake, changing the target node to NUMA_NO_NODE
> and just taking whatever is in the current cpu slab. A more complex
> solution would be to look through the partial lists in increasing distance
> to find a partially used slab that is reasonably close to the current
> node. Slab has logic like that in fallback_alloc(). Slub's
> get_any_partial() function does something close to what you want.

I apologize for my own ignorance, but I'm having trouble following.
Anton's original patch did fall back to the current cpu slab, but I'm not
sure any NUMA_NO_NODE change is necessary there. At the point we're
deactivating the slab (in the current code, in __slab_alloc()), we have
successfully allocated from somewhere, it's just not on the node we
expected to be on.

So perhaps you are saying to make a change lower in the code? I'm not
sure where it makes sense to change the target node in that case. I'd
appreciate any guidance you can give.

Thanks,
Nish
On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason), then I would
> > recommend, for simplicity's sake, changing the target node to
> > NUMA_NO_NODE and just taking whatever is in the current cpu slab. A more
> > complex solution would be to look through the partial lists in
> > increasing distance to find a partially used slab that is reasonably
> > close to the current node. Slab has logic like that in fallback_alloc().
> > Slub's get_any_partial() function does something close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fall back to the current cpu slab, but I'm not
> sure any NUMA_NO_NODE change is necessary there. At the point we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere, it's just not on the node we
> expected to be on.

Right, so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This is not an easy thing to do. If the current slab is not on the right
node but is on the node from which the page allocator would be returning
memory anyway, then the current slab can still be allocated from. If the
fallback is to another node, then the current cpu slab needs to be
deactivated and the allocation from that node needs to proceed. Have a
look at fallback_alloc() in the slab allocator.

An allocation attempt from the page allocator can be restricted to a
specific node through GFP_THISNODE.
On 05.02.2014 [13:28:03 -0600], Christoph Lameter wrote:
> On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:
> [...]
> Right, so if we are ignoring the node then the simplest thing to do is to
> not deactivate the current cpu slab but to take an object from it.

Ok, that's what Anton's patch does, I believe. Are you ok with that
patch as it is?

> This is not an easy thing to do. If the current slab is not on the right
> node but is on the node from which the page allocator would be returning
> memory anyway, then the current slab can still be allocated from. If the
> fallback is to another node, then the current cpu slab needs to be
> deactivated and the allocation from that node needs to proceed. Have a
> look at fallback_alloc() in the slab allocator.
>
> An allocation attempt from the page allocator can be restricted to a
> specific node through GFP_THISNODE.

Thanks for the pointers, I will try and take a look.

Thanks,
Nish
On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:

> > Right, so if we are ignoring the node then the simplest thing to do is
> > to not deactivate the current cpu slab but to take an object from it.
>
> Ok, that's what Anton's patch does, I believe. Are you ok with that
> patch as it is?

No. Again, his patch only works if the node is memoryless, not if there
are other issues that prevent allocation from that node.
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
+	if (node != NUMA_NO_NODE && !node_present_pages(node))
+		searchnode = numa_mem_id();
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
@@ -2278,10 +2280,14 @@ redo:
 
 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+		if (unlikely(!node_present_pages(node)))
+			node = numa_mem_id();
+		if (!node_match(page, node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*