Patchwork [3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations

Submitter Mel Gorman
Date May 13, 2011, 2:03 p.m.
Message ID <1305295404-12129-4-git-send-email-mgorman@suse.de>
Permalink /patch/95488/
State Not Applicable

Comments

Mel Gorman - May 13, 2011, 2:03 p.m.
To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower-order allocations if they
fail. However, by simply trying to allocate, the caller can enter
compaction or reclaim - both of which are likely to cost more than the
benefit of using high-order pages in SLUB. On a desktop system, two
users report that the system is getting stalled with kswapd using large
amounts of CPU.

This patch prevents SLUB taking any expensive steps when trying to use
high-order allocations. Instead, it is expected to fall back to smaller
orders more aggressively. Testing was somewhat inconclusive on how much
this helped but it makes sense that falling back to order-0 allocations
is faster than entering compaction or direct reclaim.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)
David Rientjes - May 16, 2011, 9:16 p.m.
On Fri, 13 May 2011, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..057f1e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (!wait) {
> +	if (!wait && can_wake_kswapd) {
>  		/*
>  		 * Not worth trying to allocate harder for
>  		 * __GFP_NOMEMALLOC even if it can't schedule.
> diff --git a/mm/slub.c b/mm/slub.c
> index 98c358d..c5797ab 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	 * Let the initial higher-order allocation fail under memory pressure
>  	 * so we fall-back to the minimum order allocation.
>  	 */
> -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
>  
>  	page = alloc_slab_page(alloc_gfp, node, oo);
>  	if (unlikely(!page)) {

It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations 
will immediately fail.

alloc_gfp would probably benefit from having a comment about why 
__GFP_WAIT should be masked off here: that we don't want to do compaction 
or direct reclaim or retry the allocation more than once (so both 
__GFP_NORETRY and __GFP_REPEAT are no-ops).
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman - May 17, 2011, 8:42 a.m.
On Mon, May 16, 2011 at 02:16:46PM -0700, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9f8a97b..057f1e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  {
> >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> >  
> >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	 */
> >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> >  
> > -	if (!wait) {
> > +	if (!wait && can_wake_kswapd) {
> >  		/*
> >  		 * Not worth trying to allocate harder for
> >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98c358d..c5797ab 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> >  	 * Let the initial higher-order allocation fail under memory pressure
> >  	 * so we fall-back to the minimum order allocation.
> >  	 */
> > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> >  
> >  	page = alloc_slab_page(alloc_gfp, node, oo);
> >  	if (unlikely(!page)) {
> 
> It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations 
> will immediately fail.
> 

We can enter direct compaction or direct reclaim at least once. If
compaction is enabled and we enter reclaim/compaction, the presence
of __GFP_REPEAT makes a difference in should_continue_reclaim().
With compaction disabled, the presence of the flag is relevant in
should_alloc_retry(), where it is possible to loop in the allocator
instead of failing the SLUB allocation and dropping back.

Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL, which makes
more sense. In that case, we clear both flags because
__GFP_REPEAT && !__GFP_WAIT is a senseless combination of flags.
If for whatever reason __GFP_WAIT was re-added, the presence of
__GFP_REPEAT could cause problems in reclaim that would be hard to
spot again.

> alloc_gfp would probably benefit from having a comment about why 
> __GFP_WAIT should be masked off here: that we don't want to do compaction 
> or direct reclaim or retry the allocation more than once (so both 
> __GFP_NORETRY and __GFP_REPEAT are no-ops).

That would have been helpful all right. I should have caught that
and explained it properly. In the event there is a new version of
the patch, I'll add one. For the moment, I'm dropping this patch
entirely. Christoph wants to maintain historic behaviour of SLUB to
maximise the number of high-order pages it uses and at the end of the
day, which option performs better depends entirely on the workload
and machine configuration.
Christoph Lameter - May 17, 2011, 1:51 p.m.
On Tue, 17 May 2011, Mel Gorman wrote:

> entirely. Christoph wants to maintain historic behaviour of SLUB to
> maximise the number of high-order pages it uses and at the end of the
> day, which option performs better depends entirely on the workload
> and machine configuration.

That is not what I meant. I would like more higher order allocations to
succeed. That does not mean that slub's allocation methods and the flags
passed have to stay the same. You can change the slub behavior if it helps.

I am just suspicious of compaction. If these mods are needed to reduce the
amount of higher order pages then compaction does not have the
beneficial effect that it should have. It does not actually
increase the available higher order pages. Fix that first.



Mel Gorman - May 17, 2011, 4:22 p.m.
On Tue, May 17, 2011 at 08:51:47AM -0500, Christoph Lameter wrote:
> On Tue, 17 May 2011, Mel Gorman wrote:
> 
> > entirely. Christoph wants to maintain historic behaviour of SLUB to
> > maximise the number of high-order pages it uses and at the end of the
> > day, which option performs better depends entirely on the workload
> > and machine configuration.
> 
> That is not what I meant. I would like more higher order allocations to
> succeed. That does not mean that slubs allocation methods and flags passed
> have to stay the same. You can change the slub behavior if it helps.
> 

In this particular patch, the success rate for high-order allocations
would likely decrease in low memory conditions, although the latency
when calling the page allocator will be lower and the disruption to
the system will be less (no copying or reclaim of pages). My
expectation would be that it's cheaper for SLUB to fall back than to
compact memory or reclaim pages, even if this means a slab page is
smaller until more memory is free. However, if the "goodness"
criterion is high-order allocation success rate, the patch shouldn't
be merged.

> I am just suspicious of compaction. If these mods are needed to reduce the
> amount of higher order pages then compaction does not have the
> beneficial effect that it should have. It does not actually
> increase the available higher order pages. Fix that first.
> 

The problem being addressed was the machine being hung at worst and in
other cases having kswapd pinned at 99-100% CPU. It's now been shown
that modifying SLUB is not necessary to fix this because the bug was
in page reclaim. The high-order allocation success rate didn't come
into it.
Christoph Lameter - May 17, 2011, 5:52 p.m.
On Tue, 17 May 2011, Mel Gorman wrote:

> > That is not what I meant. I would like more higher order allocations to
> > succeed. That does not mean that slubs allocation methods and flags passed
> > have to stay the same. You can change the slub behavior if it helps.
> >
>
> In this particular patch, the success rate for high order allocations
> would likely decrease in low memory conditions albeit the latency when
> calling the page allocator will be lower and the disruption to the
> system will be less (no copying or reclaim of pages). My expectation
> would be that it's cheaper for SLUB to fall back than compact memory
> or reclaim pages even if this means a slab page is smaller until more
> memory is free. However, if the "goodness" criteria is high order
> allocation success rate, the patch shouldn't be merged.

The criteria is certainly overall system performance and not a high order
allocation rate.
David Rientjes - May 17, 2011, 7:31 p.m.
On Tue, 17 May 2011, Mel Gorman wrote:

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 9f8a97b..057f1e2 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  {
> > >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> > >  
> > >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> > >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	 */
> > >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> > >  
> > > -	if (!wait) {
> > > +	if (!wait && can_wake_kswapd) {
> > >  		/*
> > >  		 * Not worth trying to allocate harder for
> > >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 98c358d..c5797ab 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > >  	 * Let the initial higher-order allocation fail under memory pressure
> > >  	 * so we fall-back to the minimum order allocation.
> > >  	 */
> > > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > > +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> > >  
> > >  	page = alloc_slab_page(alloc_gfp, node, oo);
> > >  	if (unlikely(!page)) {
> > 
> > It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations 
> > will immediately fail.
> > 
> 
> We can enter direct compaction or direct reclaim at least once. If
> compaction is enabled and we enter reclaim/compaction, the presence
> of __GFP_REPEAT makes a difference in should_continue_reclaim().
> With compaction disabled, the presence of the flag is relevant in
> should_alloc_retry(), where it is possible to loop in the allocator
> instead of failing the SLUB allocation and dropping back.
> 

You've cleared __GFP_WAIT, so it cannot enter direct compaction or direct 
reclaim, so clearing __GFP_REPEAT here doesn't actually do anything.  
That's why I suggested adding a comment about why you're clearing 
__GFP_WAIT: to make it obvious that these allocations will immediately 
fail if the alloc is unsuccessful and we don't need to add __GFP_NORETRY 
or remove __GFP_REPEAT.

> Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL which makes
> more sense.

No, I meant !__GFP_NOFAIL since the high priority allocations (if 
PF_MEMALLOC or TIF_MEMDIE) will not loop forever looking for a page 
without that bit.  That allows this !__GFP_WAIT allocation to immediately 
fail.  __GFP_NORETRY and __GFP_REPEAT are no-ops unless you have 
__GFP_WAIT.
David Rientjes - May 17, 2011, 7:35 p.m.
On Tue, 17 May 2011, Christoph Lameter wrote:

> > In this particular patch, the success rate for high order allocations
> > would likely decrease in low memory conditions albeit the latency when
> > calling the page allocator will be lower and the disruption to the
> > system will be less (no copying or reclaim of pages). My expectation
> > would be that it's cheaper for SLUB to fall back than compact memory
> > or reclaim pages even if this means a slab page is smaller until more
> > memory is free. However, if the "goodness" criteria is high order
> > allocation success rate, the patch shouldn't be merged.
> 
> The criteria is certainly overall system performance and not a high order
> allocation rate.
> 

SLUB definitely depends on these higher-order allocations being
successful for performance; dropping back to the min order is a last
resort as opposed to failing the kmalloc(). If it's the last resort,
then it makes sense that we'd want to try both compaction and reclaim
while we're already in the page allocator as we go down the slub
slowpath. Why not try just a little harder (compaction and/or reclaim)
to alloc the cache's preferred order?

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..057f1e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1972,6 +1972,7 @@  gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -1984,7 +1985,7 @@  gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	if (!wait && can_wake_kswapd) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
diff --git a/mm/slub.c b/mm/slub.c
index 98c358d..c5797ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,8 @@  static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
+			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {