Message ID: 1305127773-10570-4-git-send-email-mgorman@suse.de
State: Not Applicable, archived
On Wed, 11 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations up to order-3 by default and falls back to
> lower allocations if they fail. While care is taken that the caller
> and kswapd take no unusual steps in response to this, there are
> further consequences like shrinkers who have to free more objects to
> release any memory. There is anecdotal evidence that significant time
> is being spent looping in shrinkers with insufficient progress being
> made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
>
> SLUB is now the default allocator and some bug reports have been
> pinned down to SLUB using high orders during operations like
> copying large amounts of data. SLUB's use of high orders benefits
> applications that are sized to memory appropriately but this does not
> necessarily apply to large file servers or desktops. This patch
> causes SLUB to use order-0 pages like SLAB does by default.
> There is further evidence that this keeps kswapd's usage lower
> (https://lkml.org/lkml/2011/5/10/383).

This is going to severely impact slub's performance for applications on
machines with plenty of memory available where fragmentation isn't a
concern when allocating from caches with large object sizes (even
changing the min order of kmalloc-256 from 1 to 0!) by default for users
who don't use slub_max_order=3 on the command line. SLUB relies heavily
on allocating from the cpu slab and freeing to the cpu slab to avoid the
slowpaths, so higher order slabs are important for its performance.

I can get numbers for a simple netperf TCP_RR benchmark with this change
applied to show the degradation on a server with >32GB of RAM.

It would be ideal if this default could be adjusted based on the amount
of memory available in the smallest node to determine whether we're
concerned about making higher order allocations. (Using the smallest
node as a metric so that mempolicies and cpusets don't get unfairly
biased against.) With the previous changes in this patchset,
specifically avoiding waking kswapd and doing compaction for the higher
order allocs before falling back to the min order, it shouldn't be
devastating to try an order-3 alloc that will fail quickly.

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/vm/slub.txt |    2 +-
>  mm/slub.c                 |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> index 07375e7..778e9fa 100644
> --- a/Documentation/vm/slub.txt
> +++ b/Documentation/vm/slub.txt
> @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
>
>  slub_min_objects=x (default 4)
>  slub_min_order=x (default 0)
> -slub_max_order=x (default 1)
> +slub_max_order=x (default 0)

Hmm, that was wrong to begin with, it should have been 3.

>
>  slub_min_objects allows to specify how many objects must at least fit
>  into one slab in order for the allocation order to be acceptable.
> diff --git a/mm/slub.c b/mm/slub.c
> index 1071723..23a4789 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>   * take the list_lock.
>   */
>  static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;
>  static int slub_min_objects;
>
>  /*

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-05-11 at 13:38 -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
>
> > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > high-order allocations up to order-3 by default and falls back to
> > lower allocations if they fail. While care is taken that the caller
> > and kswapd take no unusual steps in response to this, there are
> > further consequences like shrinkers who have to free more objects to
> > release any memory. There is anecdotal evidence that significant time
> > is being spent looping in shrinkers with insufficient progress being
> > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> >
> > SLUB is now the default allocator and some bug reports have been
> > pinned down to SLUB using high orders during operations like
> > copying large amounts of data. SLUB's use of high orders benefits
> > applications that are sized to memory appropriately but this does not
> > necessarily apply to large file servers or desktops. This patch
> > causes SLUB to use order-0 pages like SLAB does by default.
> > There is further evidence that this keeps kswapd's usage lower
> > (https://lkml.org/lkml/2011/5/10/383).
>
> This is going to severely impact slub's performance for applications on
> machines with plenty of memory available where fragmentation isn't a
> concern when allocating from caches with large object sizes (even
> changing the min order of kmalloc-256 from 1 to 0!) by default for users
> who don't use slub_max_order=3 on the command line. SLUB relies heavily
> on allocating from the cpu slab and freeing to the cpu slab to avoid the
> slowpaths, so higher order slabs are important for its performance.
>
> I can get numbers for a simple netperf TCP_RR benchmark with this change
> applied to show the degradation on a server with >32GB of RAM.
>
> It would be ideal if this default could be adjusted based on the amount of
> memory available in the smallest node to determine whether we're concerned
> about making higher order allocations. (Using the smallest node as a
> metric so that mempolicies and cpusets don't get unfairly biased against.)
> With the previous changes in this patchset, specifically avoiding waking
> kswapd and doing compaction for the higher order allocs before falling
> back to the min order, it shouldn't be devastating to try an order-3 alloc
> that will fail quickly.

So my testing has shown that simply booting the kernel with
slub_max_order=0 makes the hang I'm seeing go away. This definitely
implicates the higher order allocations in the kswapd problem. I think
it would be wise not to make it the default until we can sort out the
root cause.

James
On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
>
> > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > high-order allocations up to order-3 by default and falls back to
> > lower allocations if they fail. While care is taken that the caller
> > and kswapd take no unusual steps in response to this, there are
> > further consequences like shrinkers who have to free more objects to
> > release any memory. There is anecdotal evidence that significant time
> > is being spent looping in shrinkers with insufficient progress being
> > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> >
> > SLUB is now the default allocator and some bug reports have been
> > pinned down to SLUB using high orders during operations like
> > copying large amounts of data. SLUB's use of high orders benefits
> > applications that are sized to memory appropriately but this does not
> > necessarily apply to large file servers or desktops. This patch
> > causes SLUB to use order-0 pages like SLAB does by default.
> > There is further evidence that this keeps kswapd's usage lower
> > (https://lkml.org/lkml/2011/5/10/383).
>
> This is going to severely impact slub's performance for applications on
> machines with plenty of memory available where fragmentation isn't a
> concern when allocating from caches with large object sizes (even
> changing the min order of kmalloc-256 from 1 to 0!) by default for users
> who don't use slub_max_order=3 on the command line. SLUB relies heavily
> on allocating from the cpu slab and freeing to the cpu slab to avoid the
> slowpaths, so higher order slabs are important for its performance.

I agree with you that there are situations where plenty of memory means
that it'll perform much better. However, indications are that it breaks
down with high CPU usage when memory is low. Worse, once fragmentation
becomes a problem, large amounts of UNMOVABLE and RECLAIMABLE will make
it progressively more expensive to find the necessary pages. Perhaps
with patches 1 and 2, this is not as much of a problem, but figures in
the leader indicated that for a simple workload with large amounts of
files and data exceeding physical memory, it was better off not to use
high orders at all, which is a situation I'd expect to be encountered by
more users than performance-sensitive applications.

In other words, we're taking one hit or the other.

> I can get numbers for a simple netperf TCP_RR benchmark with this change
> applied to show the degradation on a server with >32GB of RAM.

Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit,
particularly on a local machine where the recycling of pages will impact
it heavily.

> It would be ideal if this default could be adjusted based on the amount of
> memory available in the smallest node to determine whether we're concerned
> about making higher order allocations.

It's not a function of memory size; working set size is what is
important, or at least how many new pages have been allocated recently.
Fit your workload in physical memory and high orders are great. Go
larger than that and you hit problems. James' testing indicated that
kswapd CPU usage dropped to far lower levels with this patch applied in
his test of untarring a large file, for example.

> (Using the smallest node as a
> metric so that mempolicies and cpusets don't get unfairly biased against.)
> With the previous changes in this patchset, specifically avoiding waking
> kswapd and doing compaction for the higher order allocs before falling
> back to the min order, it shouldn't be devastating to try an order-3 alloc
> that will fail quickly.

Which is more reasonable? That an ordinary user gets a default that is
fairly safe even if benchmarks that demand the highest performance from
SLUB take a hit, or that administrators running such workloads set
slub_max_order=3?

> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  Documentation/vm/slub.txt |    2 +-
> >  mm/slub.c                 |    2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> > index 07375e7..778e9fa 100644
> > --- a/Documentation/vm/slub.txt
> > +++ b/Documentation/vm/slub.txt
> > @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
> >
> >  slub_min_objects=x (default 4)
> >  slub_min_order=x (default 0)
> > -slub_max_order=x (default 1)
> > +slub_max_order=x (default 0)
>
> Hmm, that was wrong to begin with, it should have been 3.

True, but I didn't see the point of fixing it in a separate patch. If
this patch gets rejected, I'll submit a documentation fix.

> >
> >  slub_min_objects allows to specify how many objects must at least fit
> >  into one slab in order for the allocation order to be acceptable.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 1071723..23a4789 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
> >   * take the list_lock.
> >   */
> >  static int slub_min_order;
> > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > +static int slub_max_order;
> >  static int slub_min_objects;
> >
> >  /*
On Wed, 11 May 2011, Mel Gorman wrote:
> I agree with you that there are situations where plenty of memory
> means that it'll perform much better. However, indications are
> that it breaks down with high CPU usage when memory is low. Worse,
> once fragmentation becomes a problem, large amounts of UNMOVABLE and
> RECLAIMABLE will make it progressively more expensive to find the
> necessary pages. Perhaps with patches 1 and 2, this is not as much
> of a problem but figures in the leader indicated that for a simple
> workload with large amounts of files and data exceeding physical
> memory that it was better off not to use high orders at all which
> is a situation I'd expect to be encountered by more users than
> performance-sensitive applications.
>
> In other words, we're taking one hit or the other.

Seems like the ideal solution would then be to find how to best set the
default, and that can probably only be done with the size of the
smallest node, since it has a higher likelihood of encountering a large
amount of unreclaimable slab when memory is low.

> > I can get numbers for a simple netperf TCP_RR benchmark with this change
> > applied to show the degradation on a server with >32GB of RAM.
>
> Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit,
> particularly on a local machine where the recycling of pages will
> impact it heavily.

Ignoring the local machine for a second, TCP_RR probably shouldn't be
taking any more of a hit with slub than it already is. When I
benchmarked slab vs. slub a couple months ago with two machines, each
four quad-core Opterons with 64GB of memory, this benchmark showed slub
was already 10-15% slower. That's why slub has always been unusable for
us, and I'm surprised that it's now becoming the favorite of distros
everywhere (and, yes, Ubuntu now defaults to it as well).

> > It would be ideal if this default could be adjusted based on the amount of
> > memory available in the smallest node to determine whether we're concerned
> > about making higher order allocations.
>
> It's not a function of memory size; working set size is what is
> important, or at least how many new pages have been allocated recently.
> Fit your workload in physical memory and high orders are great. Go
> larger than that and you hit problems. James' testing indicated that
> kswapd CPU usage dropped to far lower levels with this patch applied in
> his test of untarring a large file, for example.

My point is that it would probably be better to tune the default based
on how much memory is available at boot, since it implies the
probability of having an abundance of memory while populating the
caches' partial lists up to min_partial, rather than change it for
everyone where it is known that it will cause performance degradations
if memory is never low. We probably don't want to be doing order-3
allocations for half the slab caches when we have 1G of memory
available, but that's acceptable with 64GB.

> > (Using the smallest node as a
> > metric so that mempolicies and cpusets don't get unfairly biased against.)
> > With the previous changes in this patchset, specifically avoiding waking
> > kswapd and doing compaction for the higher order allocs before falling
> > back to the min order, it shouldn't be devastating to try an order-3 alloc
> > that will fail quickly.
>
> Which is more reasonable? That an ordinary user gets a default that is
> fairly safe even if benchmarks that demand the highest performance from
> SLUB take a hit, or that administrators running such workloads set
> slub_max_order=3?

Not sure what is more reasonable since it depends on what the workload
is, but what probably is unreasonable is changing a slub default that is
known to directly impact performance by presenting a single benchmark
under consideration without some due diligence in testing others like
netperf. We all know that slub has some disadvantages compared to slab
that are only now being realized because it has become the debian
default, but it does excel at some workloads -- it was initially
presented to beat slab in kernbench, hackbench, sysbench, and aim9 when
it was merged. Those advantages may never be fully realized on laptops
or desktop machines, but on machines with plenty of memory available,
slub often does perform better than slab. That's why I suggested tuning
the min order default based on total memory; it would probably be easier
to justify than changing it for everyone and demanding users who are
completely happy with using slub, the kernel.org default for years, now
use command line options.
On Wed, 11 May 2011, Mel Gorman wrote:

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>   * take the list_lock.
>   */
>  static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;

If we really need to do this then do not push this down to zero, please.
SLAB uses order 1 for the max. Let's at least keep it there.

We have been using SLUB for a long time. Why is this issue arising now?
Due to compaction etc. making reclaim less efficient?
On Thu, 2011-05-12 at 09:43 -0500, Christoph Lameter wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
>
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
> >   * take the list_lock.
> >   */
> >  static int slub_min_order;
> > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > +static int slub_max_order;
>
> If we really need to do this then do not push this down to zero, please.
> SLAB uses order 1 for the max. Let's at least keep it there.

1 is the current value. Reducing it to zero seems to fix the
kswapd-induced hangs. The problem does look to be some
shrinker/allocator interference somewhere in vmscan.c, but the fact is
that it's triggered by SLUB and not SLAB. I really think that what's
happening is some type of feedback loop where one of the shrinkers is
issuing a wakeup_kswapd() so kswapd never sleeps (and never relinquishes
the CPU on non-preempt).

> We have been using SLUB for a long time. Why is this issue arising now?
> Due to compaction etc. making reclaim less efficient?

This is the snark argument ("I've said it thrice," the Bellman cried,
"and what I tell you three times is true"). The fact is that no
enterprise distribution at all uses SLUB. It's only recently that the
desktop distributions started to ... the bugs are showing up under FC15
beta, which is the first fedora distribution to enable it. I'd say we're
only just beginning widespread SLUB testing.

James
On Thu, 12 May 2011, James Bottomley wrote:

> > >   */
> > >  static int slub_min_order;
> > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > > +static int slub_max_order;
> >
> > If we really need to do this then do not push this down to zero, please.
> > SLAB uses order 1 for the max. Let's at least keep it there.
>
> 1 is the current value. Reducing it to zero seems to fix the
> kswapd-induced hangs. The problem does look to be some
> shrinker/allocator interference somewhere in vmscan.c, but the fact is
> that it's triggered by SLUB and not SLAB. I really think that what's
> happening is some type of feedback loop where one of the shrinkers is
> issuing a wakeup_kswapd() so kswapd never sleeps (and never
> relinquishes the CPU on non-preempt).

The current value is PAGE_ALLOC_COSTLY_ORDER, which is 3.

> > We have been using SLUB for a long time. Why is this issue arising now?
> > Due to compaction etc. making reclaim less efficient?
>
> This is the snark argument ("I've said it thrice," the Bellman cried,
> "and what I tell you three times is true"). The fact is that no
> enterprise distribution at all uses SLUB. It's only recently that the
> desktop distributions started to ... the bugs are showing up under FC15
> beta, which is the first fedora distribution to enable it. I'd say we're
> only just beginning widespread SLUB testing.

Debian and Ubuntu have been using SLUB for a long time (and AFAICT from
my archives so has Fedora). I have been running those here for a couple
of years and the issues that I see here seem to be only with the most
recent kernels that now do compaction and other reclaim tricks.
On Thu, 2011-05-12 at 10:27 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
>
> > > >   */
> > > >  static int slub_min_order;
> > > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > > > +static int slub_max_order;
> > >
> > > If we really need to do this then do not push this down to zero, please.
> > > SLAB uses order 1 for the max. Let's at least keep it there.
> >
> > 1 is the current value. Reducing it to zero seems to fix the
> > kswapd-induced hangs. The problem does look to be some
> > shrinker/allocator interference somewhere in vmscan.c, but the fact is
> > that it's triggered by SLUB and not SLAB. I really think that what's
> > happening is some type of feedback loop where one of the shrinkers is
> > issuing a wakeup_kswapd() so kswapd never sleeps (and never
> > relinquishes the CPU on non-preempt).
>
> The current value is PAGE_ALLOC_COSTLY_ORDER, which is 3.
>
> > > We have been using SLUB for a long time. Why is this issue arising now?
> > > Due to compaction etc. making reclaim less efficient?
> >
> > This is the snark argument ("I've said it thrice," the Bellman cried,
> > "and what I tell you three times is true"). The fact is that no
> > enterprise distribution at all uses SLUB. It's only recently that the
> > desktop distributions started to ... the bugs are showing up under FC15
> > beta, which is the first fedora distribution to enable it. I'd say we're
> > only just beginning widespread SLUB testing.
>
> Debian and Ubuntu have been using SLUB for a long time

Only from Squeeze, which has been released for ~3 months. That doesn't
qualify as a "long time" in my book.

> (and AFAICT from my
> archives so has Fedora).

As I said above, no released fedora version uses SLUB. It's only just
been enabled for the unreleased FC15; I'm testing a beta copy.

> I have been running those here for a couple of
> years and the issues that I see here seem to be only with the most
> recent kernels that now do compaction and other reclaim tricks.

but a sample of one doeth not great testing make.

However, since you admit even you see problems, let's concentrate on
fixing them rather than recriminations?

James
On Thu, May 12, 2011 at 10:27:00AM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> > It's only recently that the desktop
> > distributions started to ... the bugs are showing up under FC15 beta,
> > which is the first fedora distribution to enable it. I'd say we're only
> > just beginning widespread SLUB testing.
>
> Debian and Ubuntu have been using SLUB for a long time (and AFAICT from
> my archives so has Fedora).

Indeed. It was enabled in Fedora pretty much as soon as it appeared in
mainline.

Dave
On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:

> As I said above, no released fedora version uses SLUB. It's only just
> been enabled for the unreleased FC15; I'm testing a beta copy.

James, this isn't true.

$ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y

(That's the oldest release I have right now, but it's been enabled even
before that release).

Dave
On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:

> However, since you admit even you see problems, let's concentrate on
> fixing them rather than recriminations?

Yes, please. So does dropping max_order to 1 help? PAGE_ALLOC_COSTLY_ORDER
is set to 3 in 2.6.39-rc7.

Pekka
On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote:
> On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:
>
> > As I said above, no released fedora version uses SLUB. It's only just
> > been enabled for the unreleased FC15; I'm testing a beta copy.
>
> James, this isn't true.
>
> $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
>
> (That's the oldest release I have right now, but it's been enabled even
> before that release).

OK, I concede the point ... I haven't actually kept any of my FC
machines current for a while. However, the fact remains that this seems
to be a slub problem and it needs fixing.

James
On Thu, 12 May 2011, James Bottomley wrote:

> > Debian and Ubuntu have been using SLUB for a long time
>
> Only from Squeeze, which has been released for ~3 months. That doesn't
> qualify as a "long time" in my book.

I am sorry, but I have never used a Debian/Ubuntu system in the last 3
years that did not use SLUB. And it was that by default. But then we
usually do not run the "released" Debian version; typically one runs
testing. Ubuntu is different; there we usually run releases. But those
have been SLUB for as long as I remember.

And so far it is rock solid and is widely rolled out throughout our
infrastructure (mostly 2.6.32 kernels).

> but a sample of one doeth not great testing make.
>
> However, since you admit even you see problems, let's concentrate on
> fixing them rather than recriminations?

I do not see problems here with earlier kernels. I only see these on one
testing system with the latest kernels on Ubuntu 11.04.
On Thu, May 12, 2011 at 11:00:23AM -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote:
> > On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:
> >
> > > As I said above, no released fedora version uses SLUB. It's only just
> > > been enabled for the unreleased FC15; I'm testing a beta copy.
> >
> > James, this isn't true.
> >
> > $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64
> > CONFIG_SLUB_DEBUG=y
> > CONFIG_SLUB=y
> >
> > (That's the oldest release I have right now, but it's been enabled even
> > before that release).
>
> OK, I concede the point ... I haven't actually kept any of my FC
> machines current for a while.

'a while' is an understatement :) It was first enabled in Fedora 8 in
2007.

Dave
On Thursday, 12 May 2011 at 11:01 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
>
> > > Debian and Ubuntu have been using SLUB for a long time
> >
> > Only from Squeeze, which has been released for ~3 months. That doesn't
> > qualify as a "long time" in my book.
>
> I am sorry, but I have never used a Debian/Ubuntu system in the last 3
> years that did not use SLUB. And it was that by default. But then we
> usually do not run the "released" Debian version; typically one runs
> testing. Ubuntu is different; there we usually run releases. But those
> have been SLUB for as long as I remember.
>
> And so far it is rock solid and is widely rolled out throughout our
> infrastructure (mostly 2.6.32 kernels).
>
> > but a sample of one doeth not great testing make.
> >
> > However, since you admit even you see problems, let's concentrate on
> > fixing them rather than recriminations?
>
> I do not see problems here with earlier kernels. I only see these on
> one testing system with the latest kernels on Ubuntu 11.04.

More fuel for this discussion with commit 6d4831c2. Something is wrong
with high-order allocations on some machines. Maybe we can find the real
cause instead of limiting ourselves to order-0 pages in the end... ;)

commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Wed Apr 27 15:26:41 2011 -0700

    vfs: avoid large kmalloc()s for the fdtable

    Azurit reports large increases in system time after 2.6.36 when
    running Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs:
    use kmalloc() to allocate fdmem if possible").

    That patch caused the vfs to use kmalloc() for very large
    allocations and this is causing excessive work (and presumably
    excessive reclaim) within the page allocator.

    Fix it by falling back to vmalloc() earlier - when the allocation
    attempt would have been considered "costly" by reclaim.
On Thu, 12 May 2011, James Bottomley wrote:

> However, the fact remains that this seems to be a slub problem and it
> needs fixing.

Why are you so fixed on slub in these matters? It's a key component, but
there is a high interaction with other subsystems. There was no recent
change in slub that changed the order of allocations. There were changes
affecting the reclaim logic. Slub has been working just fine with the
existing allocation schemes for a long time.
On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
>
> > However, the fact remains that this seems to be a slub problem and it
> > needs fixing.
>
> Why are you so fixed on slub in these matters?

Because, as has been hashed out in the thread, changing SLUB to SLAB
makes the hang go away.

> It's a key component, but
> there is a high interaction with other subsystems. There was no recent
> change in slub that changed the order of allocations. There were changes
> affecting the reclaim logic. Slub has been working just fine with the
> existing allocation schemes for a long time.

So suggest an alternative root cause and a test to expose it.

James
On Thu, 12 May 2011, James Bottomley wrote:

> On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote:
> > On Thu, 12 May 2011, James Bottomley wrote:
> >
> > > However, the fact remains that this seems to be a slub problem and
> > > it needs fixing.
> >
> > Why are you so fixed on slub in these matters?
>
> Because, as has been hashed out in the thread, changing SLUB to SLAB
> makes the hang go away.

SLUB doesn't hang here with earlier kernel versions either. So the
higher allocations are no longer as effective as they were before. This
is due to a change in another subsystem.

> > It's a key component, but
> > there is a high interaction with other subsystems. There was no recent
> > change in slub that changed the order of allocations. There were
> > changes affecting the reclaim logic. Slub has been working just fine
> > with the existing allocation schemes for a long time.
>
> So suggest an alternative root cause and a test to expose it.

Have a look at my other emails? I am just repeating myself again, it
seems. Try order = 1, which gives you SLAB-like interaction with the
page allocator. Then we at least know that it is the order 2 and 3
allocs that are the problem and not something else.
On Thu, May 12, 2011 at 7:30 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> So suggest an alternative root cause and a test to expose it.
Is your .config available somewhere, btw?
On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > So suggest an alternative root cause and a test to expose it.
>
> Is your .config available somewhere, btw?

If it's this:

http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD

I'd love to see what happens if you disable

CONFIG_TRANSPARENT_HUGEPAGE=y

because that's going to reduce high order allocations as well, no?

Pekka
On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> kswapd and doing compaction for the higher order allocs before falling
Note that patch 2 disabled compaction by clearing __GFP_WAIT.
What you describe here would be patch 2 without the ~__GFP_WAIT
addition (so keeping only ~__GFP_NOFAIL).
Not clearing __GFP_WAIT when compaction is enabled is possible and
shouldn't result in bad behavior (if compaction is not enabled with
current SLUB it's hard to imagine how it could perform decently if
there's fragmentation). You should try to benchmark to see if it's
worth it on the large NUMA systems with heavy network traffic (for
normal systems I doubt compaction is worth it but I'm not against
trying to keep it enabled just in case).
On a side note, this reminds me to rebuild with slub_max_order in .bss
on my cellphone (where I can't switch to SLAB because of some silly
rfs vfat-on-steroids proprietary module).
On Thu, 12 May 2011 18:10:38 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> More fuel to this discussion with commit 6d4831c2
>
> Something is wrong with high order allocations, on some machines.
>
> Maybe we can find real cause instead of limiting us to use order-0 pages
> in the end... ;)
>
> commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
> Author: Andrew Morton <akpm@linux-foundation.org>
> Date: Wed Apr 27 15:26:41 2011 -0700
>
>     vfs: avoid large kmalloc()s for the fdtable

Well, it's always been the case that satisfying higher-order
allocations takes a disproportionate amount of work in page reclaim,
and often causes excessive reclaim. That's why we've traditionally
worked to avoid higher-order allocations, and this has always been a
problem with slub.

But the higher-order allocations shouldn't cause the VM to melt down.
We changed something, and now it melts down. Changing slub to avoid
that meltdown doesn't fix the thing we broke.
On Thu, 12 May 2011, Pekka Enberg wrote:
> If it's this:
>
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
>
> I'd love to see what happens if you disable
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
>
> because that's going to reduce high order allocations as well, no?

I don't think that will change much, since huge pages are at MAX_ORDER
size: either you can get them or not. The challenge with the
small-order allocations is that they require contiguous memory.
Compaction is likely not as effective as the prior mechanism that did
opportunistic reclaim of neighboring pages.
On Thu, May 12, 2011 at 11:27:04AM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
>
> > However, the fact remains that this seems to be a slub problem and it
> > needs fixing.
>
> Why are you so fixed on slub in these matters? Its an key component but
> there is a high interaction with other subsystems. There was no recent
> change in slub that changed the order of allocations. There were changes
> affecting the reclaim logic. Slub has been working just fine with the
> existing allocation schemes for a long time.

It should work just fine when compaction is enabled. The COMPACTION=n
case would also work decently if we eliminated lumpy reclaim. Lumpy
reclaim tells the VM to ignore all young bits in the pagetables and
take everything down in order to generate the order-3 page that SLUB
asks for. You can't expect decent behavior the moment you take
everything down regardless of referenced bits on the page and young
bits in the pte. I doubt it's a new issue, but lumpy may have become
more or less aggressive over time. The good thing is that lumpy is
eliminated (basically at runtime, not compile time) by enabling
compaction.
On Thu, May 12, 2011 at 11:48:19AM -0500, Christoph Lameter wrote:
> Try order = 1 which gives you SLAB like interaction with the page
> allocator. Then we at least know that it is the order 2 and 3 allocs that
> are the problem and not something else.

order 1 should work better, because it's less likely we end up here
(which leaves RECLAIM_MODE_LUMPYRECLAIM on; then see what happens at
the top of page_check_references()):

	else if (sc->order && priority < DEF_PRIORITY - 2)
		sc->reclaim_mode |= syncmode;

With order 1 it's more likely we end up here, as enough pages are
freed for order 1 and we're safe:

	else
		sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;

None of these issues should materialize with COMPACTION=y. Even
__GFP_WAIT can be left enabled to run compaction without expecting
adverse behavior, but running compaction may still not be worth it for
small systems, where the benefit of having order 1/2/3 allocations may
not outweigh the cost of compaction itself.
On Thu, May 12, 2011 at 08:11:05PM +0300, Pekka Enberg wrote:
> If it's this:
>
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
>
> I'd love to see what happens if you disable
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
>
> because that's going to reduce high order allocations as well, no?

Well, THP forces COMPACTION=y, so lumpy won't risk being activated. I
once got a complaint asking not to make THP force COMPACTION=y (there
is no real dependency here; THP will just call alloc_pages with
__GFP_NO_KSWAPD and order 9, or 10 on x86-nopae), but I preferred to
keep it forced exactly to avoid issues like these when THP is on. If
even order 3 is causing trouble (and it doesn't immediately activate
lumpy; that only happens when priority is < DEF_PRIORITY-2, so after 2
loops failing to reclaim nr_to_reclaim pages), imagine what was
happening at order 9 every time firefox, gcc and mutt allocated
memory ;).
On Thu, 12 May 2011, Andrea Arcangeli wrote:
> order 1 should work better, because it's less likely we end up here
> (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens
> at the top of page_check_references())
>
> else if (sc->order && priority < DEF_PRIORITY - 2)

Why is this DEF_PRIORITY - 2? Shouldn't it be DEF_PRIORITY? An
accommodation for SLAB order-1 allocs?

May I assume that the case of order 2 and 3 allocs was not very well
tested after the changes to introduce compaction, since people were
focusing on RHEL testing?
On Thu, May 12, 2011 at 12:38:34PM -0500, Christoph Lameter wrote:
> I dont think that will change much since huge pages are at MAX_ORDER size.
> Either you can get them or not. The challenge with the small order
> allocations is that they require contiguous memory. Compaction is likely
> not as effective as the prior mechanism that did opportunistic reclaim of
> neighboring pages.

THP requires contiguous pages too; the issue is similar, and worse
with THP, but THP enables compaction by default, so likely this only
happens with compaction off. We really have to differentiate between
compaction on and off; it makes a world of difference (a THP-enabled
kernel with compaction off also runs into swap storms and temporary
hangs all the time; it's probably the same issue as SLUB=y
COMPACTION=n). At least THP didn't activate kswapd; kswapd running
lumpy too makes things worse, as it'll probably keep running in the
background after direct reclaim fails.

The original reports talk about kernels with SLUB=y and COMPACTION=n.
Not sure if anybody is having trouble with SLUB=y COMPACTION=y...

Compaction is more effective than the prior mechanism too (the prior
mechanism is lumpy reclaim), and it doesn't cause VM disruptions that
ignore all referenced information and take down anything in the way. I
think when COMPACTION=n, lumpy either should go away or should only be
activated by __GFP_REPEAT, so that only hugetlbfs makes use of it.
Halting the system for a while to increase nr_hugepages is OK, but
when all allocations are doing that, the system becomes unusable, kind
of livelocked.

BTW, it comes to mind that in patch 2, SLUB should clear __GFP_REPEAT
too (not only __GFP_NOFAIL). Clearing __GFP_WAIT may or may not be
worth it with COMPACTION=y; it's definitely a good idea to clear
__GFP_WAIT unless lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
On Thu, 12 May 2011, Andrea Arcangeli wrote:
> even order 3 is causing troubles (which doesn't immediately make lumpy
> activated, it only activates when priority is < DEF_PRIORITY-2, so
> after 2 loops failing to reclaim nr_to_reclaim pages), imagine what

That is a significant change for SLUB with the merge of the compaction
code.
On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
>
> > even order 3 is causing troubles (which doesn't immediately make lumpy
> > activated, it only activates when priority is < DEF_PRIORITY-2, so
> > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
>
> That is a significant change for SLUB with the merge of the compaction
> code.

Even before compaction was posted, I had to shut off lumpy reclaim or
it'd hang all the time with frequent order-9 allocations. Maybe lumpy
was better before, maybe lumpy "improved" its reliability recently,
but it definitely wasn't performing well. That definitely applies to
2.6.32 and later (I had to nuke lumpy from it and only keep compaction
enabled, pretty much like upstream with COMPACTION=y). I think I never
tried lumpy code earlier than 2.6.32; maybe it was less aggressive
back then, I don't exclude it, but I thought the whole notion of lumpy
was to take down everything in the way, which usually leads to
processes hanging in swapins or pageins for frequently used memory.
On Thu, 12 May 2011, Andrea Arcangeli wrote:
> Even before compaction was posted, I had to shut off lumpy reclaim or
> it'd hang all the time with frequent order 9 allocations. Maybe lumpy
> was better before, maybe lumpy "improved" its reliability recently,

Well, we are concerned about order 2 and 3 allocs here. Checking for
< PAGE_ALLOC_COSTLY_ORDER to avoid the order-9 lumpy reclaim looks
okay.
On Thu, May 12, 2011 at 01:00:10PM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
>
> > order 1 should work better, because it's less likely we end up here
> > (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens
> > at the top of page_check_references())
> >
> > else if (sc->order && priority < DEF_PRIORITY - 2)
>
> Why is this DEF_PRIORITY - 2? Shouldnt it be DEF_PRIORITY? An accomodation
> for SLAB order 1 allocs?

That's to allow a few loops of the shrinker (i.e. not take down
everything in the way, regardless of any aging information in
pte/page, if there's no memory pressure). This "- 2" is independent of
the allocation order. If it were < DEF_PRIORITY, it'd trigger lumpy
already at the second loop (in do_try_to_free_pages), so it'd make
things worse, just as decreasing the PAGE_ALLOC_COSTLY_ORDER define to
2 while keeping slub at 3 would make things worse.

> May I assume that the case of order 2 and 3 allocs in that case was not
> very well tested after the changes to introduce compaction since people
> were focusing on RHEL testing?

Not really; I had to eliminate lumpy before compaction was developed.
RHEL6 has zero lumpy code (not even at compile time) and compaction
enabled by default, so even if we enabled SLUB=y it should work OK
(not sure why James still crashes with patch 2 applied, which clears
__GFP_WAIT; that crash likely has nothing to do with compaction or
lumpy, as both are off with __GFP_WAIT not set). Lumpy is also
eliminated upstream now (but only at runtime, when COMPACTION=y)
unless __GFP_REPEAT is set, in which case I think lumpy will still
work upstream too, but only a few infrequent things like increasing
nr_hugepages use that.
On Thu, 2011-05-12 at 20:11 +0300, Pekka Enberg wrote:
> If it's this:
>
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
>
> I'd love to see what happens if you disable
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
>
> because that's going to reduce high order allocations as well, no?

So yes, it's a default FC15 config. Disabling THP was initially tried
a long time ago and didn't make a difference (it was originally
suggested by Chris Mason).

James
On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > However, since you admit even you see problems, let's concentrate on
> > fixing them rather than recriminations?
>
> Yes, please. So does dropping max_order to 1 help?
> PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.

Just booting with max_slab_order=1 (and none of the other patches
applied) I can still get the machine to go into kswapd at 99%, so it
doesn't seem to make much of a difference.

Do you want me to try with the other two patches and max_slab_order=1?

James
On Thu, 12 May 2011, James Bottomley wrote:
> Just booting with max_slab_order=1 (and none of the other patches
> applied) I can still get the machine to go into kswapd at 99%, so it
> doesn't seem to make much of a difference.

slub_max_order=1, right? Not max_slab_order.
On Thu, 2011-05-12 at 13:46 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
>
> > Just booting with max_slab_order=1 (and none of the other patches
> > applied) I can still get the machine to go into kswapd at 99%, so it
> > doesn't seem to make much of a difference.
>
> slub_max_order=1 right? Not max_slab_order.

Yes.

James
On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > Yes, please. So does dropping max_order to 1 help?
> > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
>
> Just booting with max_slab_order=1 (and none of the other patches
> applied) I can still get the machine to go into kswapd at 99%, so it
> doesn't seem to make much of a difference.
>
> Do you want me to try with the other two patches and max_slab_order=1?

OK, so patches 1 + 2 plus setting slub_max_order=1 still manage to
trigger the problem (kswapd spinning at 99%). This is still with
PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
patches 1+2+3 with PREEMPT just to see if the perturbation is caused
by it.

James
On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> trigger the problem (kswapd spinning at 99%). This is still with
> PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> it.

Confirmed, I'm afraid... I can trigger the problem with all three
patches under PREEMPT. It's not a hang this time; it's just kswapd
taking 100% system time on one CPU, and it won't calm down after I
unload the system.

James
On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT. It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

That is kind of expected, though. If one CPU is busy with a streaming
IO load generating new pages, kswapd is busy reclaiming the old ones
so that the generator does not have to do the reclaim itself.

By unload, do you mean stopping the generator? And if so, how quickly
after you stop the generator does kswapd go back to sleep?
On Thu, May 12, 2011 at 10:29:17PM +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT. It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.

I am so sorry, I missed the "won't" here. Please ignore.

> That is kind of expected, though. If one CPU is busy with a streaming
> IO load generating new pages, kswapd is busy reclaiming the old ones
> so that the generator does not have to do the reclaim itself.
>
> By unload, do you mean stopping the generator? And if so, how quickly
> after you stop the generator does kswapd go back to sleep?
On Thu, 2011-05-12 at 22:29 +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT. It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.
>
> That is kind of expected, though. If one CPU is busy with a streaming
> IO load generating new pages, kswapd is busy reclaiming the old ones
> so that the generator does not have to do the reclaim itself.
>
> By unload, do you mean stopping the generator?

Correct.

> And if so, how quickly
> after you stop the generator does kswapd go back to sleep?

It doesn't. At least not on its own; the CPU stays pegged. If I start
other work (like a kernel compile), then sometimes it does go back to
nothing.
I'm speculating that this is the hang case for non-PREEMPT.

James
Hi,

On Thu, May 12, 2011 at 11:04 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT. It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

OK, that's good to know. I'd still like to take patches 1-2, though.
Mel?

Pekka
On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote: > <SNIP> > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL. This is in V2 (unreleased, testing in progress and was running overnight). I noticed that clearing __GFP_REPEAT is required for reclaim/compaction if direct reclaimers from SLUB are to return false in should_continue_reclaim() and bail out from high-order allocation properly. As it is, there is a possibility for slub high-order direct reclaimers to loop in reclaim/compaction for a long time. This is only important when CONFIG_COMPACTION=y.
On Fri, May 13, 2011 at 09:16:24AM +0300, Pekka Enberg wrote:
> OK, that's good to know. I'd still like to take patches 1-2, though. Mel?

Wait for a V2, please. __GFP_REPEAT should also be removed.
On Wed, May 11, 2011 at 03:27:11PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
>
> > I agree with you that there are situations where plenty of memory
> > means that that it'll perform much better. However, indications are
> > that it breaks down with high CPU usage when memory is low. Worse,
> > once fragmentation becomes a problem, large amounts of UNMOVABLE and
> > RECLAIMABLE will make it progressively more expensive to find the
> > necessary pages. Perhaps with patches 1 and 2, this is not as much
> > of a problem but figures in the leader indicated that for a simple
> > workload with large amounts of files and data exceeding physical
> > memory that it was better off not to use high orders at all which
> > is a situation I'd expect to be encountered by more users than
> > performance-sensitive applications.
> >
> > In other words, we're taking one hit or the other.
>
> Seems like the ideal solution would then be to find how to best set the
> default, and that can probably only be done with the size of the smallest
> node since it has a higher liklihood of encountering a large amount of
> unreclaimable slab when memory is low.

Ideally yes, but glancing through this thread and thinking on it a bit
more, I'm going to drop this patch. As pointed out, SLUB with high
orders has been in use with distributions already, so the breakage is
elsewhere. Patches 1 and 2 still make some sense, but they're not the
root cause.

> <SNIP>
On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote:
> This is in V2 (unreleased, testing in progress and was running
> overnight). I noticed that clearing __GFP_REPEAT is required for
> reclaim/compaction if direct reclaimers from SLUB are to return false in
> should_continue_reclaim() and bail out from high-order allocation
> properly. As it is, there is a possibility for slub high-order direct
> reclaimers to loop in reclaim/compaction for a long time. This is
> only important when CONFIG_COMPACTION=y.

Agreed. However, I don't expect anyone to allocate from slub(/slab)
with __GFP_REPEAT, so it's probably only theoretical, but more correct
indeed ;).
On Sun, May 15, 2011 at 06:39:06PM +0200, Andrea Arcangeli wrote:
> On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote:
> > On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> > > <SNIP>
> > >
> > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> > > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> > > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> > > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
> >
> > This is in V2 (unreleased, testing in progress and was running
> > overnight). I noticed that clearing __GFP_REPEAT is required for
> > reclaim/compaction if direct reclaimers from SLUB are to return false in
> > should_continue_reclaim() and bail out from high-order allocation
> > properly. As it is, there is a possibility for slub high-order direct
> > reclaimers to loop in reclaim/compaction for a long time. This is
> > only important when CONFIG_COMPACTION=y.
>
> Agreed. However I don't expect anyone to allocate from slub(/slab)
> with __GFP_REPEAT so it's probably only theoretical but more correct
> indeed ;).

Networking layer does specify __GFP_REPEAT.
On Thu, 12 May 2011, Andrea Arcangeli wrote:
> On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> > kswapd and doing compaction for the higher order allocs before falling
>
> Note that patch 2 disabled compaction by clearing __GFP_WAIT.
>
> What you describe here would be patch 2 without the ~__GFP_WAIT
> addition (so keeping only ~GFP_NOFAIL).

It's out of context, my sentence was:

"With the previous changes in this patchset, specifically avoiding waking
kswapd and doing compaction for the higher order allocs before falling
back to the min order..."

meaning this patchset avoids waking kswapd and avoids doing compaction.

> Not clearing __GFP_WAIT when compaction is enabled is possible and
> shouldn't result in bad behavior (if compaction is not enabled with
> current SLUB it's hard to imagine how it could perform decently if
> there's fragmentation). You should try to benchmark to see if it's
> worth it on the large NUMA systems with heavy network traffic (for
> normal systems I doubt compaction is worth it but I'm not against
> trying to keep it enabled just in case).

The fragmentation isn't the only issue with the netperf TCP_RR benchmark;
the problem is that the slub slowpath is being used >95% of the time on
every allocation and free for the very large number of kmalloc-256 and
kmalloc-2K caches. Those caches are order 1 and 3, respectively, on my
system by default, but the page allocator seldom gets invoked for such a
benchmark after the partial lists are populated: the overhead is from the
per-node locking required in the slowpath to traverse the partial lists.
See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.
On Mon, May 16, 2011 at 02:03:33PM -0700, David Rientjes wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
>
> > On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> > > kswapd and doing compaction for the higher order allocs before falling
> >
> > Note that patch 2 disabled compaction by clearing __GFP_WAIT.
> >
> > What you describe here would be patch 2 without the ~__GFP_WAIT
> > addition (so keeping only ~GFP_NOFAIL).
>
> It's out of context, my sentence was:
>
> "With the previous changes in this patchset, specifically avoiding waking
> kswapd and doing compaction for the higher order allocs before falling
> back to the min order..."
>
> meaning this patchset avoids waking kswapd and avoids doing compaction.

Ok.

> > Not clearing __GFP_WAIT when compaction is enabled is possible and
> > shouldn't result in bad behavior (if compaction is not enabled with
> > current SLUB it's hard to imagine how it could perform decently if
> > there's fragmentation). You should try to benchmark to see if it's
> > worth it on the large NUMA systems with heavy network traffic (for
> > normal systems I doubt compaction is worth it but I'm not against
> > trying to keep it enabled just in case).
>
> The fragmentation isn't the only issue with the netperf TCP_RR benchmark,
> the problem is that the slub slowpath is being used >95% of the time on
> every allocation and free for the very large number of kmalloc-256 and
> kmalloc-2K caches.

Ok, that makes sense as I'd fully expect that benchmark to exhaust
the per-cpu page (high order or otherwise) of slab objects routinely
by default and I'd also expect the freeing on the other side to be
releasing slabs frequently to the partial or empty lists.
> Those caches are order 1 and 3, respectively, on my
> system by default, but the page allocator seldom gets invoked for such a
> benchmark after the partial lists are populated: the overhead is from the
> per-node locking required in the slowpath to traverse the partial lists.
> See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.

Ok, I can see how this patch would indeed make the situation worse. I
vaguely recall that there were other patches that would increase the
per-cpu lists of objects but have no recollection as to what happened to
them. Maybe Christoph remembers, but one way or the other, it's out of
scope for James' and Colin's bug.
On Tue, 17 May 2011, Mel Gorman wrote:
> > The fragmentation isn't the only issue with the netperf TCP_RR benchmark,
> > the problem is that the slub slowpath is being used >95% of the time on
> > every allocation and free for the very large number of kmalloc-256 and
> > kmalloc-2K caches.
>
> Ok, that makes sense as I'd fully expect that benchmark to exhaust
> the per-cpu page (high order or otherwise) of slab objects routinely
> by default and I'd also expect the freeing on the other side to be
> releasing slabs frequently to the partial or empty lists.

That's most of the problem, but it's compounded on this benchmark because
the slab pulled from the partial list to replace the per-cpu page
typically only has a very minimal number (2 or 3) of free objects, so it
can only serve one allocation and then requires the allocation slowpath
to pull yet another slab from the partial list the next time around. I
had a patchset that addressed that, which I called "slab thrashing", by
only pulling a slab from the partial list when it had a pre-defined
proportion of available objects and otherwise skipping it, and that ended
up helping the benchmark by 5-7%.

Smaller orders will make this worse, as well, since if there were only 2
or 3 free objects on an order-3 slab before, there's no chance that's
going to be equivalent on an order-0 slab.
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 07375e7..778e9fa 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -117,7 +117,7 @@ can be influenced by kernel parameters:
 
 slub_min_objects=x		(default 4)
 slub_min_order=x		(default 0)
-slub_max_order=x		(default 1)
+slub_max_order=x		(default 0)
 
 slub_min_objects allows to specify how many objects must at least fit
 into one slab in order for the allocation order to be acceptable.
diff --git a/mm/slub.c b/mm/slub.c
index 1071723..23a4789 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
  * take the list_lock.
  */
 static int slub_min_order;
-static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
+static int slub_max_order;
 static int slub_min_objects;
 
 /*
To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations up to order-3 by default and falls back to
lower allocations if they fail. While care is taken that the caller
and kswapd take no unusual steps in response to this, there are
further consequences like shrinkers who have to free more objects to
release any memory. There is anecdotal evidence that significant time
is being spent looping in shrinkers with insufficient progress being
made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.

SLUB is now the default allocator and some bug reports have been
pinned down to SLUB using high orders during operations like
copying large amounts of data. SLUB's use of high orders benefits
applications that are sized to memory appropriately but this does not
necessarily apply to large file servers or desktops. This patch
causes SLUB to use order-0 pages like SLAB does by default.
There is further evidence that this keeps kswapd's usage lower
(https://lkml.org/lkml/2011/5/10/383).

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/vm/slub.txt |    2 +-
 mm/slub.c                 |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)