
[1/4] mm: exclude reserved pages from dirtyable memory

Message ID 1316526315-16801-2-git-send-email-jweiner@redhat.com
State Not Applicable, archived

Commit Message

Johannes Weiner Sept. 20, 2011, 1:45 p.m. UTC
The amount of dirtyable pages should not include the total number of
free pages: there is a number of reserved pages that the page
allocator and kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of
reserved pages, the more likely it becomes for reclaim to run into
dirty pages:

       +----------+ ---
       |   anon   |  |
       +----------+  |
       |          |  |
       |          |  -- dirty limit new    -- flusher new
       |   file   |  |                     |
       |          |  |                     |
       |          |  -- dirty limit old    -- flusher old
       |          |                        |
       +----------+                       --- reclaim
       | reserved |
       +----------+
       |  kernel  |
       +----------+

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix.  In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages
is mostly much smaller in absolute numbers.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/mmzone.h |    1 +
 mm/page-writeback.c    |    8 +++++---
 mm/page_alloc.c        |    1 +
 3 files changed, 7 insertions(+), 3 deletions(-)
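
In short, the global calculation changes as follows (a simplified
sketch distilled from the diff below, not the verbatim patched code):

	/* Before: every free page counts as dirtyable. */
	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();

	/* After: pages the allocator and kswapd keep free are excluded. */
	x = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
	x += global_reclaimable_pages();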

Comments

Rik van Riel Sept. 20, 2011, 3:21 p.m. UTC | #1
On 09/20/2011 09:45 AM, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the total number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
>
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:

> Signed-off-by: Johannes Weiner<jweiner@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>
Mel Gorman Sept. 21, 2011, 2:04 p.m. UTC | #2
On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the total number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
> 
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:
> 
>        +----------+ ---
>        |   anon   |  |
>        +----------+  |
>        |          |  |
>        |          |  -- dirty limit new    -- flusher new
>        |   file   |  |                     |
>        |          |  |                     |
>        |          |  -- dirty limit old    -- flusher old
>        |          |                        |
>        +----------+                       --- reclaim
>        | reserved |
>        +----------+
>        |  kernel  |
>        +----------+
> 
> Not treating reserved pages as dirtyable on a global level is only a
> conceptual fix.  In reality, dirty pages are not distributed equally
> across zones and reclaim runs into dirty pages on a regular basis.
> 
> But it is important to get this right before tackling the problem on a
> per-zone level, where the distance between reclaim and the dirty pages
> is mostly much smaller in absolute numbers.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> ---
>  include/linux/mmzone.h |    1 +
>  mm/page-writeback.c    |    8 +++++---
>  mm/page_alloc.c        |    1 +
>  3 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1ed4116..e28f8e0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -316,6 +316,7 @@ struct zone {
>  	 * sysctl_lowmem_reserve_ratio sysctl changes.
>  	 */
>  	unsigned long		lowmem_reserve[MAX_NR_ZONES];
> +	unsigned long		totalreserve_pages;
>  

This is nit-picking but totalreserve_pages is a poor name because it's
a per-zone value that is one of the lowmem_reserve[] fields instead
of a total. After this patch, we have zone->totalreserve_pages and
totalreserve_pages, but they are not the same thing.

It gets confusing once you consider what the values are
for. lowmem_reserve is part of a placement policy that limits the
number of pages placed in lower zones by allocations that could have
been satisfied from higher zones. totalreserve_pages is related to
the overcommit heuristic, which assumes that the most interesting
type of allocation is GFP_HIGHUSER.

This begs the question - what is this new field, where does it come
from, what does it want from us? Should we take it to our Patch Leader?

This field ultimately affects what zone is used to allocate a new
page so it's related to placement policy. That implies the naming then
should indicate it is related to lowmem_reserve - largest_lowmem_reserve?

Alternatively, make it clear that it's one of the lowmem_reserve
values and store the index instead of the value - largest_reserve_idx?

>  #ifdef CONFIG_NUMA
>  	int node;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index da6d263..9f896db 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -169,8 +169,9 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  		struct zone *z =
>  			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
>  
> -		x += zone_page_state(z, NR_FREE_PAGES) +
> -		     zone_reclaimable_pages(z);
> +		x += zone_page_state(z, NR_FREE_PAGES) -
> +			zone->totalreserve_pages;
> +		x += zone_reclaimable_pages(z);
>  	}

This is highmem so zone->totalreserve_pages should always be 0.

Otherwise, the patch seems fine.

>  	/*
>  	 * Make sure that the number of highmem pages is never larger
> @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
>  {
>  	unsigned long x;
>  
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	x = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
> +	x += global_reclaimable_pages();
>  
>  	if (!vm_highmem_is_dirtyable)
>  		x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1dba05e..7e8e2ee 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5075,6 +5075,7 @@ static void calculate_totalreserve_pages(void)
>  
>  			if (max > zone->present_pages)
>  				max = zone->present_pages;
> +			zone->totalreserve_pages = max;
>  			reserve_pages += max;
>  		}
>  	}
Mel Gorman Sept. 21, 2011, 3:03 p.m. UTC | #3
On Wed, Sep 21, 2011 at 03:04:23PM +0100, Mel Gorman wrote:
> On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> > The amount of dirtyable pages should not include the total number of
> > free pages: there is a number of reserved pages that the page
> > allocator and kswapd always try to keep free.
> > 
> > The closer (reclaimable pages - dirty pages) is to the number of
> > reserved pages, the more likely it becomes for reclaim to run into
> > dirty pages:
> > 
> >        +----------+ ---
> >        |   anon   |  |
> >        +----------+  |
> >        |          |  |
> >        |          |  -- dirty limit new    -- flusher new
> >        |   file   |  |                     |
> >        |          |  |                     |
> >        |          |  -- dirty limit old    -- flusher old
> >        |          |                        |
> >        +----------+                       --- reclaim
> >        | reserved |
> >        +----------+
> >        |  kernel  |
> >        +----------+
> > 
> > Not treating reserved pages as dirtyable on a global level is only a
> > conceptual fix.  In reality, dirty pages are not distributed equally
> > across zones and reclaim runs into dirty pages on a regular basis.
> > 
> > But it is important to get this right before tackling the problem on a
> > per-zone level, where the distance between reclaim and the dirty pages
> > is mostly much smaller in absolute numbers.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > ---
> >  include/linux/mmzone.h |    1 +
> >  mm/page-writeback.c    |    8 +++++---
> >  mm/page_alloc.c        |    1 +
> >  3 files changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1ed4116..e28f8e0 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -316,6 +316,7 @@ struct zone {
> >  	 * sysctl_lowmem_reserve_ratio sysctl changes.
> >  	 */
> >  	unsigned long		lowmem_reserve[MAX_NR_ZONES];
> > +	unsigned long		totalreserve_pages;
> >  
> 
> This is nit-picking but totalreserve_pages is a poor name because it's
> a per-zone value that is one of the lowmem_reserve[] fields instead
> of a total. After this patch, we have zone->totalreserve_pages and
> totalreserve_pages, but they are not the same thing.
> 

As you correctly pointed out to me on IRC, zone->totalreserve_pages
is not the lowmem_reserve because it takes the high_wmark into
account. Sorry about that, I should have kept thinking.  The name is
still poor though because it does not explain what the value is or
what it means.
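
For reference, the per-zone number comes out of
calculate_totalreserve_pages() roughly like this (reconstructed around
the hunk at the bottom of this patch, so the surrounding lines are
approximate):

	/* largest lowmem_reserve[] entry applicable to this zone... */
	for (j = i; j < MAX_NR_ZONES; j++) {
		if (zone->lowmem_reserve[j] > max)
			max = zone->lowmem_reserve[j];
	}

	/* ...plus the high watermark, which is treated as reserved too */
	max += high_wmark_pages(zone);

	if (max > zone->present_pages)
		max = zone->present_pages;
	zone->totalreserve_pages = max;
	reserve_pages += max;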

zone->FOO value needs to be related to lowmem_reserve because this
	is related to balancing zone usage.

zone->FOO value should also be related to the high_wmark because
	this is avoiding writeback from page reclaim

err....... umm... this?

	/*
	 * When allocating a new page that is expected to be
	 * dirtied soon, the number of free pages and the
	 * dirty_balance reserve are taken into account. The
	 * objective is that the globally allowed number of dirty
	 * pages should be distributed throughout the zones such
	 * that it is very unlikely that page reclaim will call
	 * ->writepage.
	 *
	 * dirty_balance_reserve takes both lowmem_reserve and
	 * the high watermark into account. The lowmem_reserve
	 * is taken into account because we don't want the
	 * distribution of dirty pages to unnecessarily increase
	 * lowmem pressure. The watermark is taken into account
	 * because it's correlated with when kswapd wakes up
	 * and how long it stays awake.
	 */
	unsigned long		dirty_balance_reserve;
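
Under that naming, the plumbing might look roughly like this (sketch
only; the global counterpart below is an assumption, not something the
current patch adds):

	/* calculate_totalreserve_pages(), next to the existing sum: */
	zone->dirty_balance_reserve = max;
	dirty_balance_reserve += max;	/* hypothetical global counterpart */
	reserve_pages += max;

	/* determine_dirtyable_memory() would then subtract the new name: */
	x = global_page_state(NR_FREE_PAGES) - dirty_balance_reserve;
	x += global_reclaimable_pages();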
Johannes Weiner Sept. 22, 2011, 9:03 a.m. UTC | #4
On Wed, Sep 21, 2011 at 04:03:28PM +0100, Mel Gorman wrote:
> On Wed, Sep 21, 2011 at 03:04:23PM +0100, Mel Gorman wrote:
> > On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> > > The amount of dirtyable pages should not include the total number of
> > > free pages: there is a number of reserved pages that the page
> > > allocator and kswapd always try to keep free.
> > > 
> > > The closer (reclaimable pages - dirty pages) is to the number of
> > > reserved pages, the more likely it becomes for reclaim to run into
> > > dirty pages:
> > > 
> > >        +----------+ ---
> > >        |   anon   |  |
> > >        +----------+  |
> > >        |          |  |
> > >        |          |  -- dirty limit new    -- flusher new
> > >        |   file   |  |                     |
> > >        |          |  |                     |
> > >        |          |  -- dirty limit old    -- flusher old
> > >        |          |                        |
> > >        +----------+                       --- reclaim
> > >        | reserved |
> > >        +----------+
> > >        |  kernel  |
> > >        +----------+
> > > 
> > > Not treating reserved pages as dirtyable on a global level is only a
> > > conceptual fix.  In reality, dirty pages are not distributed equally
> > > across zones and reclaim runs into dirty pages on a regular basis.
> > > 
> > > But it is important to get this right before tackling the problem on a
> > > per-zone level, where the distance between reclaim and the dirty pages
> > > is mostly much smaller in absolute numbers.
> > > 
> > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > > ---
> > >  include/linux/mmzone.h |    1 +
> > >  mm/page-writeback.c    |    8 +++++---
> > >  mm/page_alloc.c        |    1 +
> > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 1ed4116..e28f8e0 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -316,6 +316,7 @@ struct zone {
> > >  	 * sysctl_lowmem_reserve_ratio sysctl changes.
> > >  	 */
> > >  	unsigned long		lowmem_reserve[MAX_NR_ZONES];
> > > +	unsigned long		totalreserve_pages;
> > >  
> > 
> > This is nit-picking but totalreserve_pages is a poor name because it's
> > a per-zone value that is one of the lowmem_reserve[] fields instead
> > of a total. After this patch, we have zone->totalreserve_pages and
> > totalreserve_pages, but they are not the same thing.
> 
> As you correctly pointed out to me on IRC, zone->totalreserve_pages
> is not the lowmem_reserve because it takes the high_wmark into
> account. Sorry about that, I should have kept thinking.  The name is
> still poor though because it does not explain what the value is or
> what it means.
> 
> zone->FOO value needs to be related to lowmem_reserve because this
> 	is related to balancing zone usage.
> 
> zone->FOO value should also be related to the high_wmark because
> 	this is avoiding writeback from page reclaim
> 
> err....... umm... this?
> 
> 	/*
> 	 * When allocating a new page that is expected to be
> 	 * dirtied soon, the number of free pages and the
> 	 * dirty_balance reserve are taken into account. The
> 	 * objective is that the globally allowed number of dirty
> 	 * pages should be distributed throughout the zones such
> 	 * that it is very unlikely that page reclaim will call
> 	 * ->writepage.
> 	 *
> 	 * dirty_balance_reserve takes both lowmem_reserve and
> 	 * the high watermark into account. The lowmem_reserve
> 	 * is taken into account because we don't want the
> 	 * distribution of dirty pages to unnecessarily increase
> 	 * lowmem pressure. The watermark is taken into account
> 	 * because it's correlated with when kswapd wakes up
> 	 * and how long it stays awake.
> 	 */
> 	unsigned long		dirty_balance_reserve;

Yes, that's much better, thanks.

I assume this is meant the same for both the zone and the global level
and we should not mess with totalreserve_pages in either case?
Mel Gorman Sept. 22, 2011, 10:54 a.m. UTC | #5
On Thu, Sep 22, 2011 at 11:03:26AM +0200, Johannes Weiner wrote:
> On Wed, Sep 21, 2011 at 04:03:28PM +0100, Mel Gorman wrote:
> > On Wed, Sep 21, 2011 at 03:04:23PM +0100, Mel Gorman wrote:
> > > On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> > > > The amount of dirtyable pages should not include the total number of
> > > > free pages: there is a number of reserved pages that the page
> > > > allocator and kswapd always try to keep free.
> > > > 
> > > > The closer (reclaimable pages - dirty pages) is to the number of
> > > > reserved pages, the more likely it becomes for reclaim to run into
> > > > dirty pages:
> > > > 
> > > >        +----------+ ---
> > > >        |   anon   |  |
> > > >        +----------+  |
> > > >        |          |  |
> > > >        |          |  -- dirty limit new    -- flusher new
> > > >        |   file   |  |                     |
> > > >        |          |  |                     |
> > > >        |          |  -- dirty limit old    -- flusher old
> > > >        |          |                        |
> > > >        +----------+                       --- reclaim
> > > >        | reserved |
> > > >        +----------+
> > > >        |  kernel  |
> > > >        +----------+
> > > > 
> > > > Not treating reserved pages as dirtyable on a global level is only a
> > > > conceptual fix.  In reality, dirty pages are not distributed equally
> > > > across zones and reclaim runs into dirty pages on a regular basis.
> > > > 
> > > > But it is important to get this right before tackling the problem on a
> > > > per-zone level, where the distance between reclaim and the dirty pages
> > > > is mostly much smaller in absolute numbers.
> > > > 
> > > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > > > ---
> > > >  include/linux/mmzone.h |    1 +
> > > >  mm/page-writeback.c    |    8 +++++---
> > > >  mm/page_alloc.c        |    1 +
> > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 1ed4116..e28f8e0 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -316,6 +316,7 @@ struct zone {
> > > >  	 * sysctl_lowmem_reserve_ratio sysctl changes.
> > > >  	 */
> > > >  	unsigned long		lowmem_reserve[MAX_NR_ZONES];
> > > > +	unsigned long		totalreserve_pages;
> > > >  
> > > 
> > > This is nit-picking but totalreserve_pages is a poor name because it's
> > > a per-zone value that is one of the lowmem_reserve[] fields instead
> > > of a total. After this patch, we have zone->totalreserve_pages and
> > > totalreserve_pages, but they are not the same thing.
> > 
> > As you correctly pointed out to me on IRC, zone->totalreserve_pages
> > is not the lowmem_reserve because it takes the high_wmark into
> > account. Sorry about that, I should have kept thinking.  The name is
> > still poor though because it does not explain what the value is or
> > what it means.
> > 
> > zone->FOO value needs to be related to lowmem_reserve because this
> > 	is related to balancing zone usage.
> > 
> > zone->FOO value should also be related to the high_wmark because
> > 	this is avoiding writeback from page reclaim
> > 
> > err....... umm... this?
> > 
> > 	/*
> > 	 * When allocating a new page that is expected to be
> > 	 * dirtied soon, the number of free pages and the
> > 	 * dirty_balance reserve are taken into account. The
> > 	 * objective is that the globally allowed number of dirty
> > 	 * pages should be distributed throughout the zones such
> > 	 * that it is very unlikely that page reclaim will call
> > 	 * ->writepage.
> > 	 *
> > 	 * dirty_balance_reserve takes both lowmem_reserve and
> > 	 * the high watermark into account. The lowmem_reserve
> > 	 * is taken into account because we don't want the
> > 	 * distribution of dirty pages to unnecessarily increase
> > 	 * lowmem pressure. The watermark is taken into account
> > 	 * because it's correlated with when kswapd wakes up
> > 	 * and how long it stays awake.
> > 	 */
> > 	unsigned long		dirty_balance_reserve;
> 
> Yes, that's much better, thanks.
> 
> I assume this is meant the same for both the zone and the global level
> and we should not mess with totalreserve_pages in either case?

Yes. I'd even suggest changing the name of totalreserve_pages to make
it clear it is related to overcommit rather than pfmemalloc, dirty
or any other reserve. i.e. s/totalreserve_pages/overcommit_reserve/
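
The overcommit heuristic in question is the OVERCOMMIT_GUESS path of
__vm_enough_memory(), which goes roughly like this (reconstructed from
memory of that era's mm/mmap.c, so treat the exact lines as
approximate):

	free = global_page_state(NR_FREE_PAGES);
	free += global_page_state(NR_FILE_PAGES);
	free += nr_swap_pages;
	free += global_page_state(NR_SLAB_RECLAIMABLE);

	/* Leave reserved pages; they are not available for anonymous pages. */
	if (free <= totalreserve_pages)
		goto error;
	free -= totalreserve_pages;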

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ed4116..e28f8e0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -316,6 +316,7 @@  struct zone {
 	 * sysctl_lowmem_reserve_ratio sysctl changes.
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
+	unsigned long		totalreserve_pages;
 
 #ifdef CONFIG_NUMA
 	int node;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index da6d263..9f896db 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -169,8 +169,9 @@  static unsigned long highmem_dirtyable_memory(unsigned long total)
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
+		x += zone_page_state(z, NR_FREE_PAGES) -
+			zone->totalreserve_pages;
+		x += zone_reclaimable_pages(z);
 	}
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -194,7 +195,8 @@  static unsigned long determine_dirtyable_memory(void)
 {
 	unsigned long x;
 
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	x = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
+	x += global_reclaimable_pages();
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1dba05e..7e8e2ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5075,6 +5075,7 @@  static void calculate_totalreserve_pages(void)
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
+			zone->totalreserve_pages = max;
 			reserve_pages += max;
 		}
 	}