Patchwork [3/3] mm: slub: Default slub_max_order to 0

Submitter Mel Gorman
Date May 11, 2011, 3:29 p.m.
Message ID <1305127773-10570-4-git-send-email-mgorman@suse.de>
Permalink /patch/95164/
State Not Applicable

Comments

Mel Gorman - May 11, 2011, 3:29 p.m.
To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations up to order-3 by default and falls back to
lower allocations if they fail. While care is taken that the caller
and kswapd take no unusual steps in response to this, there are
further consequences like shrinkers who have to free more objects to
release any memory. There is anecdotal evidence that significant time
is being spent looping in shrinkers with insufficient progress being
made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.

SLUB is now the default allocator and some bug reports have been
pinned down to SLUB using high orders during operations like
copying large amounts of data. SLUB's use of high orders benefits
applications that are sized to memory appropriately but this does not
necessarily apply to large file servers or desktops.  This patch
causes SLUB to use order-0 pages like SLAB does by default.
There is further evidence that this keeps kswapd's usage lower
(https://lkml.org/lkml/2011/5/10/383).

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/vm/slub.txt |    2 +-
 mm/slub.c                 |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
David Rientjes - May 11, 2011, 8:38 p.m.
On Wed, 11 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations up to order-3 by default and falls back to
> lower allocations if they fail. While care is taken that the caller
> and kswapd take no unusual steps in response to this, there are
> further consequences like shrinkers who have to free more objects to
> release any memory. There is anecdotal evidence that significant time
> is being spent looping in shrinkers with insufficient progress being
> made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> 
> SLUB is now the default allocator and some bug reports have been
> pinned down to SLUB using high orders during operations like
> copying large amounts of data. SLUB's use of high orders benefits
> applications that are sized to memory appropriately but this does not
> necessarily apply to large file servers or desktops.  This patch
> causes SLUB to use order-0 pages like SLAB does by default.
> There is further evidence that this keeps kswapd's usage lower
> (https://lkml.org/lkml/2011/5/10/383).
> 

This is going to severely impact slub's performance by default for users 
who don't pass slub_max_order=3 on the command line.  For applications on 
machines with plenty of memory available, where fragmentation isn't a 
concern, allocations from caches with large object sizes suffer (this even 
changes the min order of kmalloc-256 from 1 to 0!).  SLUB relies heavily 
on allocating from the cpu slab and freeing to the cpu slab to avoid the 
slowpaths, so higher order slabs are important for its performance.

I can get numbers for a simple netperf TCP_RR benchmark to show the 
degradation this patch causes on a server with >32GB of RAM.

It would be ideal if this default could be adjusted based on the amount of 
memory available in the smallest node to determine whether we're concerned 
about making higher order allocations.  (Using the smallest node as a 
metric so that mempolicies and cpusets don't get unfairly biased against.)  
With the previous changes in this patchset, specifically avoiding waking 
kswapd and doing compaction for the higher order allocs before falling 
back to the min order, it shouldn't be devastating to try an order-3 alloc 
that will fail quickly.

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/vm/slub.txt |    2 +-
>  mm/slub.c                 |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> index 07375e7..778e9fa 100644
> --- a/Documentation/vm/slub.txt
> +++ b/Documentation/vm/slub.txt
> @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
>  
>  slub_min_objects=x		(default 4)
>  slub_min_order=x		(default 0)
> -slub_max_order=x		(default 1)
> +slub_max_order=x		(default 0)

Hmm, that was wrong to begin with, it should have been 3.

>  
>  slub_min_objects allows to specify how many objects must at least fit
>  into one slab in order for the allocation order to be acceptable.
> diff --git a/mm/slub.c b/mm/slub.c
> index 1071723..23a4789 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>   * take the list_lock.
>   */
>  static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;
>  static int slub_min_objects;
>  
>  /*
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley - May 11, 2011, 8:53 p.m.
On Wed, 2011-05-11 at 13:38 -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > high-order allocations up to order-3 by default and falls back to
> > lower allocations if they fail. While care is taken that the caller
> > and kswapd take no unusual steps in response to this, there are
> > further consequences like shrinkers who have to free more objects to
> > release any memory. There is anecdotal evidence that significant time
> > is being spent looping in shrinkers with insufficient progress being
> > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> > 
> > SLUB is now the default allocator and some bug reports have been
> > pinned down to SLUB using high orders during operations like
> > copying large amounts of data. SLUB's use of high orders benefits
> > applications that are sized to memory appropriately but this does not
> > necessarily apply to large file servers or desktops.  This patch
> > causes SLUB to use order-0 pages like SLAB does by default.
> > There is further evidence that this keeps kswapd's usage lower
> > (https://lkml.org/lkml/2011/5/10/383).
> > 
> 
> This is going to severely impact slub's performance by default for users 
> who don't pass slub_max_order=3 on the command line.  For applications on 
> machines with plenty of memory available, where fragmentation isn't a 
> concern, allocations from caches with large object sizes suffer (this even 
> changes the min order of kmalloc-256 from 1 to 0!).  SLUB relies heavily 
> on allocating from the cpu slab and freeing to the cpu slab to avoid the 
> slowpaths, so higher order slabs are important for its performance.
> 
> I can get numbers for a simple netperf TCP_RR benchmark to show the 
> degradation this patch causes on a server with >32GB of RAM.
> 
> It would be ideal if this default could be adjusted based on the amount of 
> memory available in the smallest node to determine whether we're concerned 
> about making higher order allocations.  (Using the smallest node as a 
> metric so that mempolicies and cpusets don't get unfairly biased against.)  
> With the previous changes in this patchset, specifically avoiding waking 
> kswapd and doing compaction for the higher order allocs before falling 
> back to the min order, it shouldn't be devastating to try an order-3 alloc 
> that will fail quickly.

So my testing has shown that simply booting the kernel with
slub_max_order=0 makes the hang I'm seeing go away.  This definitely
implicates the higher order allocations in the kswapd problem.  I think
it would be wise not to make it the default until we can sort out the
root cause.

James




Mel Gorman - May 11, 2011, 9:09 p.m.
On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > high-order allocations up to order-3 by default and falls back to
> > lower allocations if they fail. While care is taken that the caller
> > and kswapd take no unusual steps in response to this, there are
> > further consequences like shrinkers who have to free more objects to
> > release any memory. There is anecdotal evidence that significant time
> > is being spent looping in shrinkers with insufficient progress being
> > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> > 
> > SLUB is now the default allocator and some bug reports have been
> > pinned down to SLUB using high orders during operations like
> > copying large amounts of data. SLUB's use of high orders benefits
> > applications that are sized to memory appropriately but this does not
> > necessarily apply to large file servers or desktops.  This patch
> > causes SLUB to use order-0 pages like SLAB does by default.
> > There is further evidence that this keeps kswapd's usage lower
> > (https://lkml.org/lkml/2011/5/10/383).
> > 
> 
> This is going to severely impact slub's performance by default for users 
> who don't pass slub_max_order=3 on the command line.  For applications on 
> machines with plenty of memory available, where fragmentation isn't a 
> concern, allocations from caches with large object sizes suffer (this even 
> changes the min order of kmalloc-256 from 1 to 0!).  SLUB relies heavily 
> on allocating from the cpu slab and freeing to the cpu slab to avoid the 
> slowpaths, so higher order slabs are important for its performance.
> 

I agree with you that there are situations where plenty of memory
means that it'll perform much better. However, indications are
that it breaks down with high CPU usage when memory is low.  Worse,
once fragmentation becomes a problem, large amounts of UNMOVABLE and
RECLAIMABLE pages will make it progressively more expensive to find
the necessary pages. Perhaps with patches 1 and 2 this is not as much
of a problem, but figures in the leader indicated that for a simple
workload with large amounts of files and data exceeding physical
memory, it was better off not to use high orders at all. That is a
situation I'd expect to be encountered by more users than
performance-sensitive applications.

In other words, we're taking one hit or the other.

> I can get numbers for a simple netperf TCP_RR benchmark to show the 
> degradation this patch causes on a server with >32GB of RAM.
> 

Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit,
particularly on a local machine where the recycling of pages will
impact it heavily.

> It would be ideal if this default could be adjusted based on the amount of 
> memory available in the smallest node to determine whether we're concerned 
> about making higher order allocations. 

It's not a function of memory size; working set size is what
is important, or at least how many new pages have been allocated
recently. Fit your workload in physical memory and high orders are
great. Go larger than that and you hit problems. James' testing
indicated that kswapd CPU usage dropped to far lower levels with this
patch applied in his test of untarring a large file, for example.

> (Using the smallest node as a 
> metric so that mempolicies and cpusets don't get unfairly biased against.)  
> With the previous changes in this patchset, specifically avoiding waking 
> kswapd and doing compaction for the higher order allocs before falling 
> back to the min order, it shouldn't be devastating to try an order-3 alloc 
> that will fail quickly.
> 

Which is more reasonable? That an ordinary user gets a default that
is fairly safe even if benchmarks that demand the highest performance
from SLUB take a hit or that administrators running such workloads
set slub_max_order=3?

> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  Documentation/vm/slub.txt |    2 +-
> >  mm/slub.c                 |    2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> > index 07375e7..778e9fa 100644
> > --- a/Documentation/vm/slub.txt
> > +++ b/Documentation/vm/slub.txt
> > @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
> >  
> >  slub_min_objects=x		(default 4)
> >  slub_min_order=x		(default 0)
> > -slub_max_order=x		(default 1)
> > +slub_max_order=x		(default 0)
> 
> Hmm, that was wrong to begin with, it should have been 3.
> 

True, but I didn't see the point of fixing it in a separate patch. If this
patch gets rejected, I'll submit a documentation fix.

> >  
> >  slub_min_objects allows to specify how many objects must at least fit
> >  into one slab in order for the allocation order to be acceptable.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 1071723..23a4789 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
> >   * take the list_lock.
> >   */
> >  static int slub_min_order;
> > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > +static int slub_max_order;
> >  static int slub_min_objects;
> >  
> >  /*
David Rientjes - May 11, 2011, 10:27 p.m.
On Wed, 11 May 2011, Mel Gorman wrote:

> I agree with you that there are situations where plenty of memory
> means that it'll perform much better. However, indications are
> that it breaks down with high CPU usage when memory is low.  Worse,
> once fragmentation becomes a problem, large amounts of UNMOVABLE and
> RECLAIMABLE pages will make it progressively more expensive to find
> the necessary pages. Perhaps with patches 1 and 2 this is not as much
> of a problem, but figures in the leader indicated that for a simple
> workload with large amounts of files and data exceeding physical
> memory, it was better off not to use high orders at all. That is a
> situation I'd expect to be encountered by more users than
> performance-sensitive applications.
> 
> In other words, we're taking one hit or the other.
> 

Seems like the ideal solution would then be to find how best to set the 
default, and that can probably only be done with the size of the smallest 
node, since a small node has a higher likelihood of encountering a large 
amount of unreclaimable slab when memory is low.

> > I can get numbers for a simple netperf TCP_RR benchmark to show the 
> > degradation this patch causes on a server with >32GB of RAM.
> > 
> 
> Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit,
> particularly on a local machine where the recycling of pages will
> impact it heavily.
> 

Ignoring the local machine for a second, TCP_RR probably shouldn't be 
taking any more of a hit with slub than it already is.  When I benchmarked 
slab vs. slub a couple of months ago with two machines, each with four 
quad-core Opterons and 64GB of memory, this benchmark showed slub was 
already 10-15% slower.  That's why slub has always been unusable for us, 
and I'm surprised that it's now becoming the favorite of distros 
everywhere (and, yes, Ubuntu now defaults to it as well).

> > It would be ideal if this default could be adjusted based on the amount of 
> > memory available in the smallest node to determine whether we're concerned 
> > about making higher order allocations. 
> 
> It's not a function of memory size; working set size is what
> is important, or at least how many new pages have been allocated
> recently. Fit your workload in physical memory and high orders are
> great. Go larger than that and you hit problems. James' testing
> indicated that kswapd CPU usage dropped to far lower levels with this
> patch applied in his test of untarring a large file, for example.
> 

My point is that it would probably be better to tune the default based on 
how much memory is available at boot, since that implies the probability of 
having an abundance of memory while populating the caches' partial lists 
up to min_partial, rather than change it for everyone when it is known 
that it will cause performance degradations on machines where memory is 
never low.  We probably don't want to be doing order-3 allocations for 
half the slab caches when we have 1G of memory available, but that's 
acceptable with 64GB.

> > (Using the smallest node as a 
> > metric so that mempolicies and cpusets don't get unfairly biased against.)  
> > With the previous changes in this patchset, specifically avoiding waking 
> > kswapd and doing compaction for the higher order allocs before falling 
> > back to the min order, it shouldn't be devastating to try an order-3 alloc 
> > that will fail quickly.
> > 
> 
> Which is more reasonable? That an ordinary user gets a default that
> is fairly safe even if benchmarks that demand the highest performance
> from SLUB take a hit or that administrators running such workloads
> set slub_max_order=3?
> 

Not sure what is more reasonable since it depends on what the workload is, 
but what probably is unreasonable is changing a slub default that is known 
to directly impact performance on the basis of a single benchmark, without 
some due diligence in testing others like netperf.

We all know that slub has some disadvantages compared to slab that are only 
now being realized because it has become the Debian default, but it does 
excel at some workloads -- it was initially presented to beat slab in 
kernbench, hackbench, sysbench, and aim9 when it was merged.  Those 
advantages may never be fully realized on laptops or desktop machines, but 
on machines with plenty of memory available, slub often does perform 
better than slab.

That's why I suggested tuning the min order default based on total memory, 
it would probably be easier to justify than changing it for everyone and 
demanding users who are completely happy with using slub, the kernel.org 
default for years, now use command line options.
Christoph Lameter - May 12, 2011, 2:43 p.m.
On Wed, 11 May 2011, Mel Gorman wrote:

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>   * take the list_lock.
>   */
>  static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;

If we really need to do this then do not push this down to zero, please.
SLAB uses order 1 for the max. Let's at least keep it there.

We have been using SLUB for a long time. Why is this issue arising now?
Due to compaction etc making reclaim less efficient?

James Bottomley - May 12, 2011, 3:15 p.m.
On Thu, 2011-05-12 at 09:43 -0500, Christoph Lameter wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
> >   * take the list_lock.
> >   */
> >  static int slub_min_order;
> > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > +static int slub_max_order;
> 
> If we really need to do this then do not push this down to zero, please.
> SLAB uses order 1 for the max. Let's at least keep it there.

1 is the current value.  Reducing it to zero seems to fix the
kswapd-induced hangs.  The problem does look to be some shrinker/allocator
interference somewhere in vmscan.c, but the fact is that it's triggered
by SLUB and not SLAB.  I really think that what's happening is some type
of feedback loop where one of the shrinkers is issuing a
wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU
on non-preempt).

> We have been using SLUB for a long time. Why is this issue arising now?
> Due to compaction etc making reclaim less efficient?

This is the snark argument (I've said it thrice the bellman cried and
what I tell you three times is true).  The fact is that no enterprise
distribution at all uses SLUB.  It's only recently that the desktop
distributions started to ... the bugs are showing up under FC15 beta,
which is the first fedora distribution to enable it.  I'd say we're only
just beginning widespread SLUB testing.

James


Christoph Lameter - May 12, 2011, 3:27 p.m.
On Thu, 12 May 2011, James Bottomley wrote:

> > >   */
> > >  static int slub_min_order;
> > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > > +static int slub_max_order;
> >
> > If we really need to do this then do not push this down to zero, please.
> > SLAB uses order 1 for the max. Let's at least keep it there.
>
> 1 is the current value.  Reducing it to zero seems to fix the
> kswapd-induced hangs.  The problem does look to be some shrinker/allocator
> interference somewhere in vmscan.c, but the fact is that it's triggered
> by SLUB and not SLAB.  I really think that what's happening is some type
> of feedback loop where one of the shrinkers is issuing a
> wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU
> on non-preempt).

The current value is PAGE_ALLOC_COSTLY_ORDER which is 3.

> > We have been using SLUB for a long time. Why is this issue arising now?
> > Due to compaction etc making reclaim less efficient?
>
> This is the snark argument (I've said it thrice the bellman cried and
> what I tell you three times is true).  The fact is that no enterprise
> distribution at all uses SLUB.  It's only recently that the desktop
> distributions started to ... the bugs are showing up under FC15 beta,
> which is the first fedora distribution to enable it.  I'd say we're only
> just beginning widespread SLUB testing.

Debian and Ubuntu have been using SLUB for a long time (and AFAICT from my
archives so has Fedora). I have been running those here for a couple of
years and the issues that I see here seem to be only with the most
recent kernels that now do compaction and other reclaim tricks.





James Bottomley - May 12, 2011, 3:43 p.m.
On Thu, 2011-05-12 at 10:27 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > > >   */
> > > >  static int slub_min_order;
> > > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > > > +static int slub_max_order;
> > >
> > > If we really need to do this then do not push this down to zero, please.
> > > SLAB uses order 1 for the max. Let's at least keep it there.
> >
> > 1 is the current value.  Reducing it to zero seems to fix the
> > kswapd-induced hangs.  The problem does look to be some shrinker/allocator
> > interference somewhere in vmscan.c, but the fact is that it's triggered
> > by SLUB and not SLAB.  I really think that what's happening is some type
> > of feedback loop where one of the shrinkers is issuing a
> > wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU
> > on non-preempt).
> 
> The current value is PAGE_ALLOC_COSTLY_ORDER which is 3.
> 
> > > We have been using SLUB for a long time. Why is this issue arising now?
> > > Due to compaction etc making reclaim less efficient?
> >
> > This is the snark argument (I've said it thrice the bellman cried and
> > what I tell you three times is true).  The fact is that no enterprise
> > distribution at all uses SLUB.  It's only recently that the desktop
> > distributions started to ... the bugs are showing up under FC15 beta,
> > which is the first fedora distribution to enable it.  I'd say we're only
> > just beginning widespread SLUB testing.
> 
> Debian and Ubuntu have been using SLUB for a long time

Only from Squeeze, which has been released for ~3 months.  That doesn't
qualify as a "long time" in my book.

>  (and AFAICT from my
> archives so has Fedora).

As I said above, no released fedora version uses SLUB.  It's only just
been enabled for the unreleased FC15; I'm testing a beta copy.

>  I have been running those here for a couple of
> years and the issues that I see here seem to be only with the most
> recent kernels that now do compaction and other reclaim tricks.

but a sample of one doeth not great testing make.

However, since you admit even you see problems, let's concentrate on
fixing them rather than recriminations?

James


Dave Jones - May 12, 2011, 3:45 p.m.
On Thu, May 12, 2011 at 10:27:00AM -0500, Christoph Lameter wrote:
 > On Thu, 12 May 2011, James Bottomley wrote:
 > > It's only recently that the desktop
 > > distributions started to ... the bugs are showing up under FC15 beta,
 > > which is the first fedora distribution to enable it.  I'd say we're only
 > > just beginning widespread SLUB testing.
 > 
 > Debian and Ubuntu have been using SLUB for a long time (and AFAICT from my
 > archives so has Fedora).

Indeed. It was enabled in Fedora pretty much as soon as it appeared in mainline.

	Dave

Dave Jones - May 12, 2011, 3:46 p.m.
On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:

 > As I said above, no released fedora version uses SLUB.  It's only just
 > been enabled for the unreleased FC15; I'm testing a beta copy.

James, this isn't true.

$ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y

(That's the oldest release I have right now, but it's been enabled even
before that release).

	Dave
Pekka Enberg - May 12, 2011, 3:55 p.m.
On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> However, since you admit even you see problems, let's concentrate on
> fixing them rather than recriminations?

Yes, please. So does dropping max_order to 1 help?
PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.

			Pekka

James Bottomley - May 12, 2011, 4 p.m.
On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote:
> On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:
> 
>  > As I said above, no released fedora version uses SLUB.  It's only just
>  > been enabled for the unreleased FC15; I'm testing a beta copy.
> 
> James, this isn't true.
> 
> $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> 
> (That's the oldest release I have right now, but it's been enabled even
> before that release).

OK, I concede the point ... I haven't actually kept any of my FC
machines current for a while.

However, the fact remains that this seems to be a slub problem and it
needs fixing.

James


Christoph Lameter - May 12, 2011, 4:01 p.m.
On Thu, 12 May 2011, James Bottomley wrote:

> > Debian and Ubuntu have been using SLUB for a long time
>
> Only from Squeeze, which has been released for ~3 months.  That doesn't
> qualify as a "long time" in my book.

I am sorry, but I have never used a Debian/Ubuntu system in the last 3
years that did not use SLUB, and it was that way by default. But then we
usually do not run the "released" Debian version; typically one runs
testing. Ubuntu is different: there we usually run releases. But those
have been SLUB for as long as I remember.

And so far it is rock solid and is widely rolled out throughout our
infrastructure (mostly 2.6.32 kernels).

> but a sample of one doeth not great testing make.
>
> However, since you admit even you see problems, let's concentrate on
> fixing them rather than recriminations?

I do not see problems here with earlier kernels. I only see these on one
testing system with the latest kernels on Ubuntu 11.04.
Dave Jones - May 12, 2011, 4:08 p.m.
On Thu, May 12, 2011 at 11:00:23AM -0500, James Bottomley wrote:
 > On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote:
 > > On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:
 > > 
 > >  > As I said above, no released fedora version uses SLUB.  It's only just
 > >  > been enabled for the unreleased FC15; I'm testing a beta copy.
 > > 
 > > James, this isn't true.
 > > 
 > > $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 
 > > CONFIG_SLUB_DEBUG=y
 > > CONFIG_SLUB=y
 > > 
 > > (That's the oldest release I have right now, but it's been enabled even
 > > before that release).
 > 
 > OK, I concede the point ... I haven't actually kept any of my FC
 > machines current for a while.

'a while' is an understatement :)
It was first enabled in Fedora 8 in 2007.

	Dave
 
Eric Dumazet - May 12, 2011, 4:10 p.m.
On Thursday, 12 May 2011 at 11:01 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > > Debian and Ubuntu have been using SLUB for a long time
> >
> > Only from Squeeze, which has been released for ~3 months.  That doesn't
> > qualify as a "long time" in my book.
> 
> I am sorry, but I have never used a Debian/Ubuntu system in the last 3
> years that did not use SLUB, and it was that way by default. But then we
> usually do not run the "released" Debian version; typically one runs
> testing. Ubuntu is different: there we usually run releases. But those
> have been SLUB for as long as I remember.
> 
> And so far it is rock solid and is widely rolled out throughout our
> infrastructure (mostly 2.6.32 kernels).
> 
> > but a sample of one doeth not great testing make.
> >
> > However, since you admit even you see problems, let's concentrate on
> > fixing them rather than recriminations?
> 
> I do not see problems here with earlier kernels. I only see these on one
> testing system with the latest kernels on Ubuntu 11.04.

More fuel for this discussion with commit 6d4831c2.

Something is wrong with high order allocations on some machines.

Maybe we can find the real cause instead of limiting ourselves to
order-0 pages in the end... ;)

commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Wed Apr 27 15:26:41 2011 -0700

    vfs: avoid large kmalloc()s for the fdtable
    
    Azurit reports large increases in system time after 2.6.36 when running
    Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
    to allocate fdmem if possible").
    
    That patch caused the vfs to use kmalloc() for very large allocations and
    this is causing excessive work (and presumably excessive reclaim) within
    the page allocator.
    
    Fix it by falling back to vmalloc() earlier - when the allocation attempt
    would have been considered "costly" by reclaim.

Christoph Lameter - May 12, 2011, 4:27 p.m.
On Thu, 12 May 2011, James Bottomley wrote:

> However, the fact remains that this seems to be a slub problem and it
> needs fixing.

Why are you so fixated on SLUB in these matters? It's a key component,
but there is a high degree of interaction with other subsystems. There
was no recent change in SLUB that altered the order of its allocations;
the recent changes affected the reclaim logic. SLUB has been working
just fine with the existing allocation scheme for a long time.

James Bottomley - May 12, 2011, 4:30 p.m.
On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > However, the fact remains that this seems to be a slub problem and it
> > needs fixing.
> 
> Why are you so fixed on slub in these matters?

Because, as has been hashed out in the thread, changing SLUB to SLAB
makes the hang go away.

>  Its an key component but
> there is a high interaction with other subsystems. There was no recent
> change in slub that changed the order of allocations. There were changes
> affecting the reclaim logic. Slub has been working just fine with the
> existing allocation schemes for a long time.

So suggest an alternative root cause and a test to expose it.

James


Christoph Lameter - May 12, 2011, 4:48 p.m.
On Thu, 12 May 2011, James Bottomley wrote:

> On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote:
> > On Thu, 12 May 2011, James Bottomley wrote:
> >
> > > However, the fact remains that this seems to be a slub problem and it
> > > needs fixing.
> >
> > Why are you so fixed on slub in these matters?
>
> Because, as has been hashed out in the thread, changing SLUB to SLAB
> makes the hang go away.

SLUB doesn't hang here with earlier kernel versions either. So the
higher-order allocations are no longer as effective as they were
before; this is due to a change in another subsystem.

> >  Its an key component but
> > there is a high interaction with other subsystems. There was no recent
> > change in slub that changed the order of allocations. There were changes
> > affecting the reclaim logic. Slub has been working just fine with the
> > existing allocation schemes for a long time.
>
> So suggest an alternative root cause and a test to expose it.

Have a look at my other emails? It seems I am just repeating myself.

Try order = 1, which gives you SLAB-like interaction with the page
allocator. Then we would at least know that it is the order-2 and
order-3 allocations that are the problem and not something else.


Pekka Enberg - May 12, 2011, 5:06 p.m.
On Thu, May 12, 2011 at 7:30 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> So suggest an alternative root cause and a test to expose it.

Is your .config available somewhere, btw?
Pekka Enberg - May 12, 2011, 5:11 p.m.
On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>> So suggest an alternative root cause and a test to expose it.
>
> Is your .config available somewhere, btw?

If it's this:

http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD

I'd love to see what happens if you disable

CONFIG_TRANSPARENT_HUGEPAGE=y

because that's going to reduce high order allocations as well, no?

                        Pekka
Andrea Arcangeli - May 12, 2011, 5:36 p.m.
On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> kswapd and doing compaction for the higher order allocs before falling 

Note that patch 2 disabled compaction by clearing __GFP_WAIT.

What you describe here would be patch 2 without the ~__GFP_WAIT
addition (so keeping only ~GFP_NOFAIL).

Not clearing __GFP_WAIT when compaction is enabled is possible and
shouldn't result in bad behavior (if compaction is not enabled with
current SLUB it's hard to imagine how it could perform decently if
there's fragmentation). You should try to benchmark to see if it's
worth it on the large NUMA systems with heavy network traffic (for
normal systems I doubt compaction is worth it but I'm not against
trying to keep it enabled just in case).

On a side note, this reminds me to rebuild with slub_max_order in .bss
on my cellphone (where I can't switch to SLAB because of some silly
rfs vfat-on-steroids proprietary module).
Andrew Morton - May 12, 2011, 5:37 p.m.
On Thu, 12 May 2011 18:10:38 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> More fuel to this discussion with commit 6d4831c2
> 
> Something is wrong with high order allocations, on some machines.
> 
> Maybe we can find real cause instead of limiting us to use order-0 pages
> in the end... ;)
> 
> commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
> Author: Andrew Morton <akpm@linux-foundation.org>
> Date:   Wed Apr 27 15:26:41 2011 -0700
> 
>     vfs: avoid large kmalloc()s for the fdtable

Well, it's always been the case that satisfying higher-order
allocations takes a disproportionate amount of work in page reclaim,
and often causes excessive reclaim.

That's why we've traditionally worked to avoid higher-order
allocations, and this has always been a problem with slub.

But the higher-order allocations shouldn't cause the VM to melt down. 
We changed something, and now it melts down.  Changing slub to avoid
that meltdown doesn't fix the thing we broke.

Christoph Lameter - May 12, 2011, 5:38 p.m.
On Thu, 12 May 2011, Pekka Enberg wrote:

> On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> > On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> >> So suggest an alternative root cause and a test to expose it.
> >
> > Is your .config available somewhere, btw?
>
> If it's this:
>
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
>
> I'd love to see what happens if you disable
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
>
> because that's going to reduce high order allocations as well, no?

I don't think that will change much, since huge pages are at MAX_ORDER size.
Either you can get them or not. The challenge with the small-order
allocations is that they require contiguous memory. Compaction is likely
not as effective as the prior mechanism, which did opportunistic reclaim of
neighboring pages.

Andrea Arcangeli - May 12, 2011, 5:40 p.m.
On Thu, May 12, 2011 at 11:27:04AM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > However, the fact remains that this seems to be a slub problem and it
> > needs fixing.
> 
> Why are you so fixed on slub in these matters? Its an key component but
> there is a high interaction with other subsystems. There was no recent
> change in slub that changed the order of allocations. There were changes
> affecting the reclaim logic. Slub has been working just fine with the
> existing allocation schemes for a long time.

It should work just fine when compaction is enabled.

The COMPACTION=n case would also work decently if we eliminated lumpy
reclaim. Lumpy reclaim tells the VM to ignore all young bits in the
pagetables and take everything down in order to generate the order-3
page that SLUB asks for. You can't expect decent behavior the moment you
take everything down regardless of the referenced bits on pages and
young bits in ptes. I doubt it's a new issue, but lumpy may have become
more or less aggressive over time. The good thing is that lumpy is
eliminated (basically at runtime, not compile time) by enabling compaction.
Andrea Arcangeli - May 12, 2011, 5:46 p.m.
On Thu, May 12, 2011 at 11:48:19AM -0500, Christoph Lameter wrote:
> Try order = 1 which gives you SLAB like interaction with the page
> allocator. Then we at least know that it is the order 2 and 3 allocs that
> are the problem and not something else.

order 1 should work better, because it's less likely we end up here
(which leaves RECLAIM_MODE_LUMPYRECLAIM on; then see what happens
at the top of page_check_references()):

   else if (sc->order && priority < DEF_PRIORITY - 2)
   	sc->reclaim_mode |= syncmode;

with order 1 it's more likely we end up here, as enough pages are freed
for order 1 and we're safe:

     else
	sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;

None of these issues should materialize with COMPACTION=y. Even
__GFP_WAIT can be left enabled to run compaction without expecting
adverse behavior, but running compaction may still not be worth it for
small systems, where the benefit of order-1/2/3 allocations may not
outweigh the cost of compaction itself.
Andrea Arcangeli - May 12, 2011, 5:51 p.m.
On Thu, May 12, 2011 at 08:11:05PM +0300, Pekka Enberg wrote:
> If it's this:
> 
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
> 
> I'd love to see what happens if you disable
> 
> CONFIG_TRANSPARENT_HUGEPAGE=y
> 
> because that's going to reduce high order allocations as well, no?

Well, THP forces COMPACTION=y, so lumpy won't risk being activated. I
once got a complaint asking not to make THP force COMPACTION=y (there
is no real dependency here; THP will just call alloc_pages with
__GFP_NO_KSWAPD and order 9, or 10 on x86-nopae), but I preferred to
keep it forced exactly to avoid issues like these when THP is on. If
even order 3 is causing trouble (which doesn't immediately activate
lumpy; it only activates when priority is < DEF_PRIORITY-2, so
after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
was happening at order 9 every time firefox, gcc and mutt allocated
memory ;).
Christoph Lameter - May 12, 2011, 6 p.m.
On Thu, 12 May 2011, Andrea Arcangeli wrote:

> order 1 should work better, because it's less likely we end up here
> (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens
> at the top of page_check_references())
>
>    else if (sc->order && priority < DEF_PRIORITY - 2)

Why is this DEF_PRIORITY - 2? Shouldn't it be DEF_PRIORITY? An accommodation
for SLAB's order-1 allocations?

May I assume that the case of order-2 and order-3 allocations was not
very well tested after the changes to introduce compaction, since people
were focusing on RHEL testing?

Andrea Arcangeli - May 12, 2011, 6 p.m.
On Thu, May 12, 2011 at 12:38:34PM -0500, Christoph Lameter wrote:
> I dont think that will change much since huge pages are at MAX_ORDER size.
> Either you can get them or not. The challenge with the small order
> allocations is that they require contiguous memory. Compaction is likely
> not as effective as the prior mechanism that did opportunistic reclaim of
> neighboring pages.

THP requires contiguous pages too; the issue is similar, and worse
with THP, but THP enables compaction by default, so likely this only
happens with compaction off. We really have to differentiate between
compaction on and off; it makes a world of difference (a THP-enabled
kernel with compaction off also runs into swap storms and temporary
hangs all the time; it's probably the same issue as SLUB=y
COMPACTION=n). At least THP didn't activate kswapd; kswapd running
lumpy too makes things worse, as it'll probably keep running in the
background after direct reclaim fails.

The original reports talk about kernels with SLUB=y and
COMPACTION=n. Not sure if anybody is having trouble with SLUB=y
COMPACTION=y...

Compaction is more effective than the prior mechanism too (the prior
mechanism being lumpy reclaim), and it doesn't cause VM disruptions
that ignore all referenced information and take down anything in
the way.

I think that when COMPACTION=n, lumpy either should go away or should
only be activated by __GFP_REPEAT, so that only hugetlbfs makes use of
it. Halting the system for a while is acceptable when increasing
nr_hugepages, but when all allocations are doing that, the system
becomes unusable, kind of livelocked.

BTW, it comes to mind that in patch 2, SLUB should clear __GFP_REPEAT
too (not only __GFP_NOFAIL). Clearing __GFP_WAIT may or may not be worth
it with COMPACTION=y; it is definitely a good idea to clear __GFP_WAIT
unless lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
Christoph Lameter - May 12, 2011, 6:03 p.m.
On Thu, 12 May 2011, Andrea Arcangeli wrote:

> even order 3 is causing troubles (which doesn't immediately make lumpy
> activated, it only activates when priority is < DEF_PRIORITY-2, so
> after 2 loops failing to reclaim nr_to_reclaim pages), imagine what

That is a significant change for SLUB with the merge of the compaction
code.
Andrea Arcangeli - May 12, 2011, 6:09 p.m.
On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
> 
> > even order 3 is causing troubles (which doesn't immediately make lumpy
> > activated, it only activates when priority is < DEF_PRIORITY-2, so
> > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
> 
> That is a significant change for SLUB with the merge of the compaction
> code.

Even before compaction was posted, I had to shut off lumpy reclaim or
it'd hang all the time with frequent order-9 allocations. Maybe lumpy
was better before, maybe lumpy "improved" its reliability recently,
but it definitely wasn't performing well. That definitely applies to
>=2.6.32 (I had to nuke lumpy from it and only keep compaction
enabled, pretty much like upstream with COMPACTION=y). I think I never
tried lumpy code earlier than 2.6.32; maybe it was less aggressive
back then, I don't exclude it, but I thought the whole notion of lumpy
was to take down everything in the way, which usually leads to
processes hanging in swapins or pageins for frequently used memory.
Christoph Lameter - May 12, 2011, 6:16 p.m.
On Thu, 12 May 2011, Andrea Arcangeli wrote:

> On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote:
> > On Thu, 12 May 2011, Andrea Arcangeli wrote:
> >
> > > even order 3 is causing troubles (which doesn't immediately make lumpy
> > > activated, it only activates when priority is < DEF_PRIORITY-2, so
> > > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
> >
> > That is a significant change for SLUB with the merge of the compaction
> > code.
>
> Even before compaction was posted, I had to shut off lumpy reclaim or
> it'd hang all the time with frequent order 9 allocations. Maybe lumpy
> was better before, maybe lumpy "improved" its reliability recently,

Well, we are concerned about order-2 and order-3 allocations here. Checking
for < PAGE_ALLOC_COSTLY_ORDER to avoid the order-9 lumpy reclaim looks okay.
Andrea Arcangeli - May 12, 2011, 6:18 p.m.
On Thu, May 12, 2011 at 01:00:10PM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
> 
> > order 1 should work better, because it's less likely we end up here
> > (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens
> > at the top of page_check_references())
> >
> >    else if (sc->order && priority < DEF_PRIORITY - 2)
> 
> Why is this DEF_PRIORITY - 2? Shouldnt it be DEF_PRIORITY? An accomodation
> for SLAB order 1 allocs?

That's to allow a few loops of the shrinker (i.e. not take down
everything in the way regardless of any aging information in pte/page
if there's no memory pressure). This "- 2" is independent of the
allocation order. If it were < DEF_PRIORITY, it'd trigger lumpy already
at the second loop (in do_try_to_free_pages), so it'd make things
worse. Similarly, it would make things worse to decrease the
PAGE_ALLOC_COSTLY_ORDER define to 2 while keeping SLUB at 3.

> May I assume that the case of order 2 and 3 allocs in that case was not
> very well tested after the changes to introduce compaction since people
> were focusing on RHEL testing?

Not really; I had to eliminate lumpy before compaction was
developed. RHEL6 has zero lumpy code (not even at compile time) and
compaction enabled by default, so even if we enabled SLUB=y it should
work OK (not sure why James still sees the problem with patch 2 applied,
which clears __GFP_WAIT; that likely has nothing to do with compaction
or lumpy, as both are off with __GFP_WAIT not set).

Lumpy is also eliminated upstream now (but only at runtime, when
COMPACTION=y), unless __GFP_REPEAT is set, in which case I think lumpy
will still run upstream too; but only a few infrequent things, like
increasing nr_hugepages, use that.
James Bottomley - May 12, 2011, 6:36 p.m.
On Thu, 2011-05-12 at 20:11 +0300, Pekka Enberg wrote:
> On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> > On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> >> So suggest an alternative root cause and a test to expose it.
> >
> > Is your .config available somewhere, btw?
> 
> If it's this:
> 
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
> 
> I'd love to see what happens if you disable
> 
> CONFIG_TRANSPARENT_HUGEPAGE=y
> 
> because that's going to reduce high order allocations as well, no?

So yes, it's a default FC15 config.

Disabling THP was initially tried a long time ago and didn't make a
difference (it was originally suggested by Chris Mason).

James


James Bottomley - May 12, 2011, 6:37 p.m.
On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > However, since you admit even you see problems, let's concentrate on
> > fixing them rather than recriminations?
> 
> Yes, please. So does dropping max_order to 1 help?
> PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.

Just booting with max_slab_order=1 (and none of the other patches
applied) I can still get the machine to go into kswapd at 99%, so it
doesn't seem to make much of a difference.

Do you want me to try with the other two patches and max_slab_order=1?

James


Christoph Lameter - May 12, 2011, 6:46 p.m.
On Thu, 12 May 2011, James Bottomley wrote:

> On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > However, since you admit even you see problems, let's concentrate on
> > > fixing them rather than recriminations?
> >
> > Yes, please. So does dropping max_order to 1 help?
> > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
>
> Just booting with max_slab_order=1 (and none of the other patches
> applied) I can still get the machine to go into kswapd at 99%, so it
> doesn't seem to make much of a difference.

slub_max_order=1 right? Not max_slab_order.
James Bottomley - May 12, 2011, 7:21 p.m.
On Thu, 2011-05-12 at 13:46 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > However, since you admit even you see problems, let's concentrate on
> > > > fixing them rather than recriminations?
> > >
> > > Yes, please. So does dropping max_order to 1 help?
> > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> >
> > Just booting with max_slab_order=1 (and none of the other patches
> > applied) I can still get the machine to go into kswapd at 99%, so it
> > doesn't seem to make much of a difference.
> 
> slub_max_order=1 right? Not max_slab_order.

Yes.

James


James Bottomley - May 12, 2011, 7:44 p.m.
On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > However, since you admit even you see problems, let's concentrate on
> > > fixing them rather than recriminations?
> > 
> > Yes, please. So does dropping max_order to 1 help?
> > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> 
> Just booting with max_slab_order=1 (and none of the other patches
> applied) I can still get the machine to go into kswapd at 99%, so it
> doesn't seem to make much of a difference.
> 
> Do you want me to try with the other two patches and max_slab_order=1?

OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
trigger the problem (kswapd spinning at 99%).  This is still with
PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
it.

James


James Bottomley - May 12, 2011, 8:04 p.m.
On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > However, since you admit even you see problems, let's concentrate on
> > > > fixing them rather than recriminations?
> > > 
> > > Yes, please. So does dropping max_order to 1 help?
> > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > 
> > Just booting with max_slab_order=1 (and none of the other patches
> > applied) I can still get the machine to go into kswapd at 99%, so it
> > doesn't seem to make much of a difference.
> > 
> > Do you want me to try with the other two patches and max_slab_order=1?
> 
> OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> trigger the problem (kswapd spinning at 99%).  This is still with
> PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> it.

Confirmed, I'm afraid ... I can trigger the problem with all three
patches under PREEMPT.  It's not a hang this time, it's just kswapd
taking 100% system time on 1 CPU and it won't calm down after I unload
the system.

James


Johannes Weiner - May 12, 2011, 8:29 p.m.
On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > > However, since you admit even you see problems, let's concentrate on
> > > > > fixing them rather than recriminations?
> > > > 
> > > > Yes, please. So does dropping max_order to 1 help?
> > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > > 
> > > Just booting with max_slab_order=1 (and none of the other patches
> > > applied) I can still get the machine to go into kswapd at 99%, so it
> > > doesn't seem to make much of a difference.
> > > 
> > > Do you want me to try with the other two patches and max_slab_order=1?
> > 
> > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> > trigger the problem (kswapd spinning at 99%).  This is still with
> > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> > it.
> 
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT.  It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

That is kind of expected, though.  If one CPU is busy with a streaming
IO load generating new pages, kswapd is busy reclaiming the old ones
so that the generator does not have to do the reclaim itself.

By unload, do you mean stopping the generator?  And if so, how quickly
after you stop the generator does kswapd go back to sleep?
Johannes Weiner - May 12, 2011, 8:31 p.m.
On Thu, May 12, 2011 at 10:29:17PM +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > > > However, since you admit even you see problems, let's concentrate on
> > > > > > fixing them rather than recriminations?
> > > > > 
> > > > > Yes, please. So does dropping max_order to 1 help?
> > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > > > 
> > > > Just booting with max_slab_order=1 (and none of the other patches
> > > > applied) I can still get the machine to go into kswapd at 99%, so it
> > > > doesn't seem to make much of a difference.
> > > > 
> > > > Do you want me to try with the other two patches and max_slab_order=1?
> > > 
> > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> > > trigger the problem (kswapd spinning at 99%).  This is still with
> > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> > > it.
> > 
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.

I am so sorry, I missed the "won't" here.  Please ignore.

> That is kind of expected, though.  If one CPU is busy with a streaming
> IO load generating new pages, kswapd is busy reclaiming the old ones
> so that the generator does not have to do the reclaim itself.
> 
> By unload, do you mean stopping the generator?  And if so, how quickly
> after you stop the generator does kswapd go back to sleep?
James Bottomley - May 12, 2011, 8:31 p.m.
On Thu, 2011-05-12 at 22:29 +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > > > However, since you admit even you see problems, let's concentrate on
> > > > > > fixing them rather than recriminations?
> > > > > 
> > > > > Yes, please. So does dropping max_order to 1 help?
> > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > > > 
> > > > Just booting with max_slab_order=1 (and none of the other patches
> > > > applied) I can still get the machine to go into kswapd at 99%, so it
> > > > doesn't seem to make much of a difference.
> > > > 
> > > > Do you want me to try with the other two patches and max_slab_order=1?
> > > 
> > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> > > trigger the problem (kswapd spinning at 99%).  This is still with
> > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> > > it.
> > 
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.
> 
> That is kind of expected, though.  If one CPU is busy with a streaming
> IO load generating new pages, kswapd is busy reclaiming the old ones
> so that the generator does not have to do the reclaim itself.
> 
> By unload, do you mean stopping the generator? 

Correct.

>  And if so, how quickly
> after you stop the generator does kswapd go back to sleep?

It doesn't.  At least not on its own; the CPU stays pegged.  If I start
other work (like a kernel compile), then sometimes it does go back to
nothing.

I'm speculating that this is the hang case for non-PREEMPT.

James


Pekka Enberg - May 13, 2011, 6:16 a.m.
Hi,

On Thu, May 12, 2011 at 11:04 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT.  It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

OK, that's good to know. I'd still like to take patches 1-2, though. Mel?

                        Pekka
Mel Gorman - May 13, 2011, 9:49 a.m.
On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> <SNIP>
>
> BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.

This is in V2 (unreleased, testing in progress and was running
overnight). I noticed that clearing __GFP_REPEAT is required for
reclaim/compaction if direct reclaimers from SLUB are to return false in
should_continue_reclaim() and bail out from high-order allocation
properly. As it is, there is a possibility for slub high-order direct
reclaimers to loop in reclaim/compaction for a long time. This is
only important when CONFIG_COMPACTION=y.
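
In case it helps review, the flag-clearing idea can be sketched in
user space; the flag values and the helper name below are made up for
illustration and are not the kernel's actual definitions:

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's gfp flags (values invented). */
#define __GFP_WAIT    0x01u
#define __GFP_NOFAIL  0x02u
#define __GFP_REPEAT  0x04u
#define __GFP_NORETRY 0x08u

/*
 * For SLUB's opportunistic high-order attempt: strip the flags that
 * make the page allocator and reclaim/compaction try too hard, and
 * mark the attempt as one that is allowed to fail quickly.  Clearing
 * __GFP_REPEAT is what lets should_continue_reclaim() bail out.
 */
static unsigned int high_order_alloc_flags(unsigned int flags)
{
	flags &= ~(__GFP_NOFAIL | __GFP_REPEAT);
	flags |= __GFP_NORETRY;
	return flags;
}
```

If the high-order attempt fails, the caller would retry at the minimum
order with the original flags intact.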
Mel Gorman - May 13, 2011, 10:05 a.m.
On Fri, May 13, 2011 at 09:16:24AM +0300, Pekka Enberg wrote:
> Hi,
> 
> On Thu, May 12, 2011 at 11:04 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.
> 
> OK, that's good to know. I'd still like to take patches 1-2, though. Mel?
> 

Wait for a V2 please. __GFP_REPEAT should also be removed.
Mel Gorman - May 13, 2011, 10:14 a.m.
On Wed, May 11, 2011 at 03:27:11PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > I agree with you that there are situations where plenty of memory
> > means that it'll perform much better. However, indications are
> > that it breaks down with high CPU usage when memory is low.  Worse,
> > once fragmentation becomes a problem, large amounts of UNMOVABLE and
> > RECLAIMABLE will make it progressively more expensive to find the
> > necessary pages. Perhaps with patches 1 and 2, this is not as much
> > of a problem but figures in the leader indicated that for a simple
> > workload with large amounts of files and data exceeding physical
> > memory that it was better off not to use high orders at all which
> > is a situation I'd expect to be encountered by more users than
> > performance-sensitive applications.
> > 
> > In other words, we're taking one hit or the other.
> > 
> 
> Seems like the ideal solution would then be to find how to best set the 
> default, and that can probably only be done with the size of the smallest 
> node since it has a higher likelihood of encountering a large amount of
> unreclaimable slab when memory is low.
> 

Ideally yes, but glancing through this thread and thinking on it a bit
more, I'm going to drop this patch. As pointed out, SLUB with high
orders has been in use with distributions already so the breakage is
elsewhere. Patches 1 and 2 still make some sense but they're not the
root cause.

> <SNIP>
Andrea Arcangeli - May 15, 2011, 4:39 p.m.
On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote:
> On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> > <SNIP>
> >
> > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
> 
> This is in V2 (unreleased, testing in progress and was running
> overnight). I noticed that clearing __GFP_REPEAT is required for
> reclaim/compaction if direct reclaimers from SLUB are to return false in
> should_continue_reclaim() and bail out from high-order allocation
> properly. As it is, there is a possibility for slub high-order direct
> reclaimers to loop in reclaim/compaction for a long time. This is
> only important when CONFIG_COMPACTION=y.

Agreed. However I don't expect anyone to allocate from slub(/slab)
with __GFP_REPEAT so it's probably only theoretical but more correct
indeed ;).
Mel Gorman - May 16, 2011, 8:42 a.m.
On Sun, May 15, 2011 at 06:39:06PM +0200, Andrea Arcangeli wrote:
> On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote:
> > On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> > > <SNIP>
> > >
> > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> > > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> > > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> > > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
> > 
> > This is in V2 (unreleased, testing in progress and was running
> > overnight). I noticed that clearing __GFP_REPEAT is required for
> > reclaim/compaction if direct reclaimers from SLUB are to return false in
> > should_continue_reclaim() and bail out from high-order allocation
> > properly. As it is, there is a possibility for slub high-order direct
> > reclaimers to loop in reclaim/compaction for a long time. This is
> > only important when CONFIG_COMPACTION=y.
> 
> Agreed. However I don't expect anyone to allocate from slub(/slab)
> with __GFP_REPEAT so it's probably only theoretical but more correct
> indeed ;).

Networking layer does specify __GFP_REPEAT.
David Rientjes - May 16, 2011, 9:03 p.m.
On Thu, 12 May 2011, Andrea Arcangeli wrote:

> On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> > kswapd and doing compaction for the higher order allocs before falling 
> 
> Note that patch 2 disabled compaction by clearing __GFP_WAIT.
> 
> What you describe here would be patch 2 without the ~__GFP_WAIT
> addition (so keeping only ~__GFP_NOFAIL).
> 

It's out of context, my sentence was:

"With the previous changes in this patchset, specifically avoiding waking 
kswapd and doing compaction for the higher order allocs before falling 
back to the min order..."

meaning this patchset avoids waking kswapd and avoids doing compaction.

> Not clearing __GFP_WAIT when compaction is enabled is possible and
> shouldn't result in bad behavior (if compaction is not enabled with
> current SLUB it's hard to imagine how it could perform decently if
> there's fragmentation). You should try to benchmark to see if it's
> worth it on the large NUMA systems with heavy network traffic (for
> normal systems I doubt compaction is worth it but I'm not against
> trying to keep it enabled just in case).
> 

The fragmentation isn't the only issue with the netperf TCP_RR benchmark, 
the problem is that the slub slowpath is being used >95% of the time on 
every allocation and free for the very large number of kmalloc-256 and 
kmalloc-2K caches.  Those caches are order 1 and 3, respectively, on my 
system by default, but the page allocator seldom gets invoked for such a 
benchmark after the partial lists are populated: the overhead is from the 
per-node locking required in the slowpath to traverse the partial lists.  
See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.
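
The arithmetic behind that is easy to check: ignoring slab metadata
overhead, an order-0 page holds only two 2K objects, so the per-cpu
slab is exhausted every second allocation, while an order-3 slab holds
sixteen.  A simplified sketch (4K pages assumed, overhead ignored):

```c
#include <assert.h>

/* Objects that fit in a slab of 2^order pages (4K pages, no overhead). */
static int objects_per_slab(int order, int object_size)
{
	return (4096 << order) / object_size;
}

/*
 * Of N allocations, roughly how many must take the locked slowpath to
 * refill the per-cpu slab: about one refill per slab's worth of objects.
 */
static int slowpath_refills(int allocations, int order, int object_size)
{
	return allocations / objects_per_slab(order, object_size);
}
```

So for kmalloc-2K, dropping from order 3 to order 0 means roughly
eight times as many trips through the locked slowpath.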
Mel Gorman - May 17, 2011, 9:48 a.m.
On Mon, May 16, 2011 at 02:03:33PM -0700, David Rientjes wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
> 
> > On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> > > kswapd and doing compaction for the higher order allocs before falling 
> > 
> > Note that patch 2 disabled compaction by clearing __GFP_WAIT.
> > 
> > What you describe here would be patch 2 without the ~__GFP_WAIT
> > addition (so keeping only ~__GFP_NOFAIL).
> > 
> 
> It's out of context, my sentence was:
> 
> "With the previous changes in this patchset, specifically avoiding waking 
> kswapd and doing compaction for the higher order allocs before falling 
> back to the min order..."
> 
> meaning this patchset avoids waking kswapd and avoids doing compaction.
> 

Ok.

> > Not clearing __GFP_WAIT when compaction is enabled is possible and
> > shouldn't result in bad behavior (if compaction is not enabled with
> > current SLUB it's hard to imagine how it could perform decently if
> > there's fragmentation). You should try to benchmark to see if it's
> > worth it on the large NUMA systems with heavy network traffic (for
> > normal systems I doubt compaction is worth it but I'm not against
> > trying to keep it enabled just in case).
> > 
> 
> The fragmentation isn't the only issue with the netperf TCP_RR benchmark, 
> the problem is that the slub slowpath is being used >95% of the time on 
> every allocation and free for the very large number of kmalloc-256 and 
> kmalloc-2K caches. 

Ok, that makes sense as I'd fully expect that benchmark to exhaust
the per-cpu page (high order or otherwise) of slab objects routinely
under the default settings, and I'd also expect the freeing on the
other side to be releasing slabs frequently to the partial or empty lists.

> Those caches are order 1 and 3, respectively, on my 
> system by default, but the page allocator seldom gets invoked for such a 
> benchmark after the partial lists are populated: the overhead is from the 
> per-node locking required in the slowpath to traverse the partial lists.  
> See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.

Ok, I can see how this patch would indeed make the situation worse. I
vaguely recall that there were other patches that would increase the
per-cpu lists of objects but have no recollection as to what happened
to them.

Maybe Christoph remembers but one way or the other, it's out of scope
for James' and Colin's bug.
David Rientjes - May 17, 2011, 7:25 p.m.
On Tue, 17 May 2011, Mel Gorman wrote:

> > The fragmentation isn't the only issue with the netperf TCP_RR benchmark, 
> > the problem is that the slub slowpath is being used >95% of the time on 
> > every allocation and free for the very large number of kmalloc-256 and 
> > kmalloc-2K caches. 
> 
> Ok, that makes sense as I'd fully expect that benchmark to exhaust
> the per-cpu page (high order or otherwise) of slab objects routinely
> under the default settings, and I'd also expect the freeing on the
> other side to be releasing slabs frequently to the partial or empty lists.
> 

That's most of the problem, but it's compounded on this benchmark because 
the slab pulled from the partial list to replace the per-cpu page 
typically only has a very minimal number (2 or 3) of free objects, so it 
can only serve one allocation and then require the allocation slowpath to 
pull yet another slab from the partial list the next time around.  I had a 
patchset that addressed that, which I called "slab thrashing", by only 
pulling a slab from the partial list when it had a pre-defined proportion 
of available objects and otherwise skipping it, and that ended up helping 
the benchmark by 5-7%.  Smaller orders will make this worse, as well, 
since if there were only 2 or 3 free objects on an order-3 slab before, 
there's no chance that's going to be equivalent on an order-0 slab.
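
For what it's worth, the "slab thrashing" heuristic can be sketched
like this; the names and the 1/4 threshold are illustrative only, the
actual patchset may have chosen differently:

```c
#include <assert.h>
#include <stddef.h>

struct partial_slab {
	int objects;	/* total objects in the slab */
	int free;	/* currently free objects */
};

/*
 * Only take a slab from the partial list if at least num/den of its
 * objects are free; a slab with 2 or 3 free objects serves one or two
 * allocations and sends us straight back to the locked slowpath.
 */
static struct partial_slab *pick_partial(struct partial_slab *list,
					 size_t n, int num, int den)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i].free * den >= list[i].objects * num)
			return &list[i];
	return NULL;	/* nothing worth taking; allocate a fresh slab */
}
```

With a 1/4 threshold, a 16-object slab is skipped until at least 4 of
its objects are free, which is exactly the case this benchmark keeps
hitting.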

Patch

diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 07375e7..778e9fa 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -117,7 +117,7 @@  can be influenced by kernel parameters:
 
 slub_min_objects=x		(default 4)
 slub_min_order=x		(default 0)
-slub_max_order=x		(default 1)
+slub_max_order=x		(default 0)
 
 slub_min_objects allows to specify how many objects must at least fit
 into one slab in order for the allocation order to be acceptable.
diff --git a/mm/slub.c b/mm/slub.c
index 1071723..23a4789 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2198,7 +2198,7 @@  EXPORT_SYMBOL(kmem_cache_free);
  * take the list_lock.
  */
 static int slub_min_order;
-static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
+static int slub_max_order;
 static int slub_min_objects;
 
 /*