Patchwork [8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

login
register
mail settings
Submitter Mel Gorman
Date Sept. 21, 2012, 10:46 a.m.
Message ID <1348224383-1499-9-git-send-email-mgorman@suse.de>
Download mbox | patch
Permalink /patch/185693/
State New
Headers show

Comments

Mel Gorman - Sept. 21, 2012, 10:46 a.m.
When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are an serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock
is marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still allows
excessive scanning when THP allocations are failing in quick succession
due to memory pressure. Waiting until memory pressure is relieved would
cause compaction to continually fail instead of using reclaim/compaction
to try allocate the page. The time-based mechanism is clumsy but a better
option is not obvious.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h          |    3 ++
 include/linux/pageblock-flags.h |   19 +++++++-
 mm/compaction.c                 |   93 +++++++++++++++++++++++++++++++++++++--
 mm/internal.h                   |    1 +
 mm/page_alloc.c                 |    1 +
 5 files changed, 111 insertions(+), 6 deletions(-)
Rafael Aquini - Sept. 21, 2012, 5:53 p.m.
On Fri, Sep 21, 2012 at 11:46:22AM +0100, Mel Gorman wrote:
> When compaction was implemented it was known that scanning could potentially
> be excessive. The ideal was that a counter be maintained for each pageblock
> but maintaining this information would incur a severe penalty due to a
> shared writable cache line. It has reached the point where the scanning
> costs are an serious problem, particularly on long-lived systems where a
> large process starts and allocates a large number of THPs at the same time.
> 
> Instead of using a shared counter, this patch adds another bit to the
> pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> either migrate or free scanner and 0 pages were isolated, the pageblock
> is marked to be skipped in the future. When scanning, this bit is checked
> before any scanning takes place and the block skipped if set.
> 
> The main difficulty with a patch like this is "when to ignore the cached
> information?" If it's ignored too often, the scanning rates will still
> be excessive. If the information is too stale then allocations will fail
> that might have otherwise succeeded. In this patch
> 
> o CMA always ignores the information
> o If the migrate and free scanner meet then the cached information will
>   be discarded if it's at least 5 seconds since the last time the cache
>   was discarded
> o If there are a large number of allocation failures, discard the cache.
> 
> The time-based heuristic is very clumsy but there are few choices for a
> better event. Depending solely on multiple allocation failures still allows
> excessive scanning when THP allocations are failing in quick succession
> due to memory pressure. Waiting until memory pressure is relieved would
> cause compaction to continually fail instead of using reclaim/compaction
> to try allocate the page. The time-based mechanism is clumsy but a better
> option is not obvious.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>
Andrew Morton - Sept. 21, 2012, 9:36 p.m.
On Fri, 21 Sep 2012 11:46:22 +0100
Mel Gorman <mgorman@suse.de> wrote:

> When compaction was implemented it was known that scanning could potentially
> be excessive. The ideal was that a counter be maintained for each pageblock
> but maintaining this information would incur a severe penalty due to a
> shared writable cache line. It has reached the point where the scanning
> costs are an serious problem, particularly on long-lived systems where a
> large process starts and allocates a large number of THPs at the same time.
> 
> Instead of using a shared counter, this patch adds another bit to the
> pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> either migrate or free scanner and 0 pages were isolated, the pageblock
> is marked to be skipped in the future. When scanning, this bit is checked
> before any scanning takes place and the block skipped if set.
> 
> The main difficulty with a patch like this is "when to ignore the cached
> information?" If it's ignored too often, the scanning rates will still
> be excessive. If the information is too stale then allocations will fail
> that might have otherwise succeeded. In this patch
> 
> o CMA always ignores the information
> o If the migrate and free scanner meet then the cached information will
>   be discarded if it's at least 5 seconds since the last time the cache
>   was discarded
> o If there are a large number of allocation failures, discard the cache.
> 
> The time-based heuristic is very clumsy but there are few choices for a
> better event. Depending solely on multiple allocation failures still allows
> excessive scanning when THP allocations are failing in quick succession
> due to memory pressure. Waiting until memory pressure is relieved would
> cause compaction to continually fail instead of using reclaim/compaction
> to try allocate the page. The time-based mechanism is clumsy but a better
> option is not obvious.

ick.

Wall time has sooo little relationship to what's happening in there. 
If we *have* to use polling, cannot we clock the poll with some metric
which is at least vaguely related to the amount of activity?  Number
(or proportion) of pages scanned, for example?  Or reset everything on
the Nth trip around the zone?  Or even a combination of one of these
*and* of wall time, so the system will at least work harder when MM is
under load.

Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?

>
> ...
>
> +static void reset_isolation_suitable(struct zone *zone)
> +{
> +	unsigned long start_pfn = zone->zone_start_pfn;
> +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Do not reset more than once every five seconds. If allocations are
> +	 * failing sufficiently quickly to allow this to happen then continually
> +	 * scanning for compaction is not going to help. The choice of five
> +	 * seconds is arbitrary but will mitigate excessive scanning.
> +	 */
> +	if (time_before(jiffies, zone->compact_blockskip_expire))
> +		return;
> +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> +
> +	/* Walk the zone and mark every pageblock as suitable for isolation */
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> +		struct page *page;
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		page = pfn_to_page(pfn);
> +		if (zone != page_zone(page))
> +			continue;
> +
> +		clear_pageblock_skip(page);
> +	}

What's the worst-case loop count here?

> +}
> +
>
> ...
>
Mel Gorman - Sept. 24, 2012, 9:39 a.m.
On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2012 11:46:22 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > When compaction was implemented it was known that scanning could potentially
> > be excessive. The ideal was that a counter be maintained for each pageblock
> > but maintaining this information would incur a severe penalty due to a
> > shared writable cache line. It has reached the point where the scanning
> > costs are an serious problem, particularly on long-lived systems where a
> > large process starts and allocates a large number of THPs at the same time.
> > 
> > Instead of using a shared counter, this patch adds another bit to the
> > pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> > either migrate or free scanner and 0 pages were isolated, the pageblock
> > is marked to be skipped in the future. When scanning, this bit is checked
> > before any scanning takes place and the block skipped if set.
> > 
> > The main difficulty with a patch like this is "when to ignore the cached
> > information?" If it's ignored too often, the scanning rates will still
> > be excessive. If the information is too stale then allocations will fail
> > that might have otherwise succeeded. In this patch
> > 
> > o CMA always ignores the information
> > o If the migrate and free scanner meet then the cached information will
> >   be discarded if it's at least 5 seconds since the last time the cache
> >   was discarded
> > o If there are a large number of allocation failures, discard the cache.
> > 
> > The time-based heuristic is very clumsy but there are few choices for a
> > better event. Depending solely on multiple allocation failures still allows
> > excessive scanning when THP allocations are failing in quick succession
> > due to memory pressure. Waiting until memory pressure is relieved would
> > cause compaction to continually fail instead of using reclaim/compaction
> > to try allocate the page. The time-based mechanism is clumsy but a better
> > option is not obvious.
> 
> ick.
> 

I know. I was being generous when I described it as "clumsy".

> Wall time has sooo little relationship to what's happening in there. 
> If we *have* to use polling, cannot we clock the poll with some metric
> which is at least vaguely related to the amount of activity?

Initially I wanted to only depend on just this

        /* Clear pageblock skip if there are numerous alloc failures */
        if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
                reset_isolation_suitable(zone);

because this it at least related to activity but it's weak for two
reasons. One, it's depending on failures to make the decisions - i.e. when
it already is too late. Two, even this condition can be hit very quickly
and can result in many resets per second in the worst case.

> Number
> (or proportion) of pages scanned, for example?  Or reset everything on
> the Nth trip around the zone? 

For a full compaction failure we have scanned all pages in the zone so
there is no proportion to use there.

Resetting everything every Nth trip around the zone is similar to the
above check except it would look like

if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
		zone->compact_reset_laps == COMPACT_MAX_LAPS)
	reset_isolation_suitable(zone)

but it's weak for the same reasons - depending on failures to make decisions
and can happen too quickly.

I also considered using the PGFREE vmstat to reset if a pageblock worth of
pages had been freed since the last reset but this happens very quickly
under memory pressure and would not throttle enough. I also considered
deferring until NR_FREE_PAGES was high enough but this would severely
impact allocation success rates under memory pressure.

> Or even a combination of one of these
> *and* of wall time, so the system will at least work harder when MM is
> under load.
> 

We sortof do that now - we are depending on a number of failures and
time before clearing the bits.

> Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> a pageblock's PB_migrate_skip synchronously, when something was done to
> that pageblock which justifies repolling it?
> 

The "something" event you are looking for is pages being freed or
allocated in the page allocator. A movable page being allocated in block
or a page being freed should clear the PB_migrate_skip bit if it's set.
Unfortunately this would impact the fast path of the alloc and free paths
of the page allocator. I felt that that was too high a price to pay.

> >
> > ...
> >
> > +static void reset_isolation_suitable(struct zone *zone)
> > +{
> > +	unsigned long start_pfn = zone->zone_start_pfn;
> > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > +	unsigned long pfn;
> > +
> > +	/*
> > +	 * Do not reset more than once every five seconds. If allocations are
> > +	 * failing sufficiently quickly to allow this to happen then continually
> > +	 * scanning for compaction is not going to help. The choice of five
> > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > +	 */
> > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > +		return;
> > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > +
> > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > +		struct page *page;
> > +		if (!pfn_valid(pfn))
> > +			continue;
> > +
> > +		page = pfn_to_page(pfn);
> > +		if (zone != page_zone(page))
> > +			continue;
> > +
> > +		clear_pageblock_skip(page);
> > +	}
> 
> What's the worst-case loop count here?
> 

zone->spanned_pages >> pageblock_order

> > +}
> > +
> >
> > ...
> >
>
Andrew Morton - Sept. 24, 2012, 9:26 p.m.
On Mon, 24 Sep 2012 10:39:38 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> 
> > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > a pageblock's PB_migrate_skip synchronously, when something was done to
> > that pageblock which justifies repolling it?
> > 
> 
> The "something" event you are looking for is pages being freed or
> allocated in the page allocator. A movable page being allocated in block
> or a page being freed should clear the PB_migrate_skip bit if it's set.
> Unfortunately this would impact the fast path of the alloc and free paths
> of the page allocator. I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the "fast
path" really - it happens once per pcp unload.  Can we do something
like that?  Drop some hint into the zone without having to visit each
page?

> > >
> > > ...
> > >
> > > +static void reset_isolation_suitable(struct zone *zone)
> > > +{
> > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > +	unsigned long pfn;
> > > +
> > > +	/*
> > > +	 * Do not reset more than once every five seconds. If allocations are
> > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > +	 * scanning for compaction is not going to help. The choice of five
> > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > +	 */
> > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > +		return;
> > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > +
> > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > +		struct page *page;
> > > +		if (!pfn_valid(pfn))
> > > +			continue;
> > > +
> > > +		page = pfn_to_page(pfn);
> > > +		if (zone != page_zone(page))
> > > +			continue;
> > > +
> > > +		clear_pageblock_skip(page);
> > > +	}
> > 
> > What's the worst-case loop count here?
> > 
> 
> zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
Mel Gorman - Sept. 25, 2012, 9:12 a.m.
On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> On Mon, 24 Sep 2012 10:39:38 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > 
> > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > that pageblock which justifies repolling it?
> > > 
> > 
> > The "something" event you are looking for is pages being freed or
> > allocated in the page allocator. A movable page being allocated in block
> > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > Unfortunately this would impact the fast path of the alloc and free paths
> > of the page allocator. I felt that that was too high a price to pay.
> 
> We already do a similar thing in the page allocator: clearing of
> ->all_unreclaimable and ->pages_scanned. 

That is true but that is a simple write (shared cache line but still) to
a struct zone. Worse, now that you point it out, that's pretty stupid. It
should be checking if the value is non-zero before writing to it to avoid
a cache line bounce.

Clearing the PG_migrate_skip in this path to avoid the need to ever pool is
not as cheap as it needs to

set_pageblock_skip
  -> set_pageblock_flags_group
    -> page_zone
    -> page_to_pfn
    -> get_pageblock_bitmap
    -> pfn_to_bitidx
    -> __set_bit

> But that isn't on the "fast
> path" really - it happens once per pcp unload. 

That's still an important enough path that I'm wary of making it fatter
and that only covers the free path. To avoid the polling, the allocation
side needs to be handled too. It could be shoved down into rmqueue() to
put it into a slightly colder path but still, it's a price to pay to keep
compaction happy.

> Can we do something
> like that?  Drop some hint into the zone without having to visit each
> page?
> 

Not without incurring a cost, but yes, t is possible to give a hint on when
PG_migrate_skip should be cleared and move away from that time-based hammer.

First, we'd introduce a variant of get_pageblock_migratetype() that returns
all the bits for the pageblock flags and then helpers to extract either the
migratetype or the PG_migrate_skip. We already are incurring the cost of
get_pageblock_migratetype() so it will not be much more expensive than what
is already there. If there is an allocation or free within a pageblock that
as the PG_migrate_skip bit set then we increment a counter. When the counter
reaches some to-be-decided "threshold" then compaction may clear all the
bits. This would match the criteria of the clearing being based on activity.

There are four potential problems with this

1. The logic to retrieve all the bits and split them up will be a little
   convulated but maybe it would not be that bad.

2. The counter is a shared-writable cache line but obviously it could
   be moved to vmstat and incremented with inc_zone_page_state to offset
   the cost a little.

3. The biggested weakness is that there is not way to know if the
   counter is incremented based on activity in a small subset of blocks.

4. What should the threshold be?

The first problem is minor but the other three are potentially a mess.
Adding another vmstat counter is bad enough in itself but if the counter
is incremented based on a small subsets of pageblocks, the hint becomes
is potentially useless.

However, does this match what you have in mind or am I over-complicating
things?

> > > >
> > > > ...
> > > >
> > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > +{
> > > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > +	unsigned long pfn;
> > > > +
> > > > +	/*
> > > > +	 * Do not reset more than once every five seconds. If allocations are
> > > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > > +	 * scanning for compaction is not going to help. The choice of five
> > > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > > +	 */
> > > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > +		return;
> > > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > +
> > > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > +		struct page *page;
> > > > +		if (!pfn_valid(pfn))
> > > > +			continue;
> > > > +
> > > > +		page = pfn_to_page(pfn);
> > > > +		if (zone != page_zone(page))
> > > > +			continue;
> > > > +
> > > > +		clear_pageblock_skip(page);
> > > > +	}
> > > 
> > > What's the worst-case loop count here?
> > > 
> > 
> > zone->spanned_pages >> pageblock_order
> 
> What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)

Lets take an unlikely case - 128G single-node machine. That loop count
on x86-64 would be 65536. It'll be fast enough, particularly in this
path.
Andrew Morton - Sept. 25, 2012, 8:03 p.m.
On Tue, 25 Sep 2012 10:12:07 +0100
Mel Gorman <mgorman@suse.de> wrote:

> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We already are incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> as the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
> 
> There are four potential problems with this
> 
> 1. The logic to retrieve all the bits and split them up will be a little
>    convulated but maybe it would not be that bad.
> 
> 2. The counter is a shared-writable cache line but obviously it could
>    be moved to vmstat and incremented with inc_zone_page_state to offset
>    the cost a little.
> 
> 3. The biggested weakness is that there is not way to know if the
>    counter is incremented based on activity in a small subset of blocks.
> 
> 4. What should the threshold be?
> 
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subsets of pageblocks, the hint becomes
> is potentially useless.
> 
> However, does this match what you have in mind or am I over-complicating
> things?

Sounds complicated.

Using wall time really does suck.  Are you sure you can't think of
something more logical?

How would we demonstrate the suckage?  What would be the observeable downside of
switching that 5 seconds to 5 hours?

> > > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > +		struct page *page;
> > > > > +		if (!pfn_valid(pfn))
> > > > > +			continue;
> > > > > +
> > > > > +		page = pfn_to_page(pfn);
> > > > > +		if (zone != page_zone(page))
> > > > > +			continue;
> > > > > +
> > > > > +		clear_pageblock_skip(page);
> > > > > +	}
> > > > 
> > > > What's the worst-case loop count here?
> > > > 
> > > 
> > > zone->spanned_pages >> pageblock_order
> > 
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
> 
> Lets take an unlikely case - 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.

That could easily exceed a millisecond.  Can/should we stick a
cond_resched() in there?
Minchan Kim - Sept. 26, 2012, 12:49 a.m.
On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > On Mon, 24 Sep 2012 10:39:38 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > > 
> > > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > that pageblock which justifies repolling it?
> > > > 
> > > 
> > > The "something" event you are looking for is pages being freed or
> > > allocated in the page allocator. A movable page being allocated in block
> > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > Unfortunately this would impact the fast path of the alloc and free paths
> > > of the page allocator. I felt that that was too high a price to pay.
> > 
> > We already do a similar thing in the page allocator: clearing of
> > ->all_unreclaimable and ->pages_scanned. 
> 
> That is true but that is a simple write (shared cache line but still) to
> a struct zone. Worse, now that you point it out, that's pretty stupid. It
> should be checking if the value is non-zero before writing to it to avoid
> a cache line bounce.
> 
> Clearing the PG_migrate_skip in this path to avoid the need to ever pool is
> not as cheap as it needs to
> 
> set_pageblock_skip
>   -> set_pageblock_flags_group
>     -> page_zone
>     -> page_to_pfn
>     -> get_pageblock_bitmap
>     -> pfn_to_bitidx
>     -> __set_bit
> 
> > But that isn't on the "fast
> > path" really - it happens once per pcp unload. 
> 
> That's still an important enough path that I'm wary of making it fatter
> and that only covers the free path. To avoid the polling, the allocation
> side needs to be handled too. It could be shoved down into rmqueue() to
> put it into a slightly colder path but still, it's a price to pay to keep
> compaction happy.
> 
> > Can we do something
> > like that?  Drop some hint into the zone without having to visit each
> > page?
> > 
> 
> Not without incurring a cost, but yes, t is possible to give a hint on when
> PG_migrate_skip should be cleared and move away from that time-based hammer.
> 
> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We already are incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> as the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
> 
> There are four potential problems with this
> 
> 1. The logic to retrieve all the bits and split them up will be a little
>    convulated but maybe it would not be that bad.
> 
> 2. The counter is a shared-writable cache line but obviously it could
>    be moved to vmstat and incremented with inc_zone_page_state to offset
>    the cost a little.
> 
> 3. The biggested weakness is that there is not way to know if the
>    counter is incremented based on activity in a small subset of blocks.
> 
> 4. What should the threshold be?
> 
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subsets of pageblocks, the hint becomes
> is potentially useless.

Another idea is that we can add two bits(PG_check_migrate/PG_check_free)
in pageblock_flags_group.
In allocation path, we can set PG_check_migrate in a pageblock
In free path, we can set PG_check_free in a pageblock.
And they are cleared by compaction's scan like now.
So we can discard 3 and 4 at least.

Another idea is that let's cure it by fixing fundamental problem.
Make zone's locks more fine-grained.
As time goes by, system uses bigger memory but our lock of zone
isn't scalable. Recently, lru_lock and zone->lock contention report
isn't rare so i think it's good time that we move next step.

How about defining struct sub_zone per 2G or 4G?
so a zone can have several sub_zone as size and subzone can replace
current zone's role and zone is just container of subzones.
Of course, it's not easy to implement but I think someday we should
go that way. Is it a really overkill?

> 
> However, does this match what you have in mind or am I over-complicating
> things?
> 
> > > > >
> > > > > ...
> > > > >
> > > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > > +{
> > > > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > > +	unsigned long pfn;
> > > > > +
> > > > > +	/*
> > > > > +	 * Do not reset more than once every five seconds. If allocations are
> > > > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > > > +	 * scanning for compaction is not going to help. The choice of five
> > > > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > > > +	 */
> > > > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > > +		return;
> > > > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > > +
> > > > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > +		struct page *page;
> > > > > +		if (!pfn_valid(pfn))
> > > > > +			continue;
> > > > > +
> > > > > +		page = pfn_to_page(pfn);
> > > > > +		if (zone != page_zone(page))
> > > > > +			continue;
> > > > > +
> > > > > +		clear_pageblock_skip(page);
> > > > > +	}
> > > > 
> > > > What's the worst-case loop count here?
> > > > 
> > > 
> > > zone->spanned_pages >> pageblock_order
> > 
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
> 
> Lets take an unlikely case - 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Mel Gorman - Sept. 27, 2012, 12:14 p.m.
On Wed, Sep 26, 2012 at 09:49:30AM +0900, Minchan Kim wrote:
> On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> > On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > > On Mon, 24 Sep 2012 10:39:38 +0100
> > > Mel Gorman <mgorman@suse.de> wrote:
> > > 
> > > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > > > 
> > > > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > > that pageblock which justifies repolling it?
> > > > > 
> > > > 
> > > > The "something" event you are looking for is pages being freed or
> > > > allocated in the page allocator. A movable page being allocated in block
> > > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > > Unfortunately this would impact the fast path of the alloc and free paths
> > > > of the page allocator. I felt that that was too high a price to pay.
> > > 
> > > We already do a similar thing in the page allocator: clearing of
> > > ->all_unreclaimable and ->pages_scanned. 
> > 
> > That is true but that is a simple write (shared cache line but still) to
> > a struct zone. Worse, now that you point it out, that's pretty stupid. It
> > should be checking if the value is non-zero before writing to it to avoid
> > a cache line bounce.
> > 
> > Clearing the PG_migrate_skip in this path to avoid the need to ever pool is
> > not as cheap as it needs to
> > 
> > set_pageblock_skip
> >   -> set_pageblock_flags_group
> >     -> page_zone
> >     -> page_to_pfn
> >     -> get_pageblock_bitmap
> >     -> pfn_to_bitidx
> >     -> __set_bit
> > 
> > > But that isn't on the "fast
> > > path" really - it happens once per pcp unload. 
> > 
> > That's still an important enough path that I'm wary of making it fatter
> > and that only covers the free path. To avoid the polling, the allocation
> > side needs to be handled too. It could be shoved down into rmqueue() to
> > put it into a slightly colder path but still, it's a price to pay to keep
> > compaction happy.
> > 
> > > Can we do something
> > > like that?  Drop some hint into the zone without having to visit each
> > > page?
> > > 
> > 
> > Not without incurring a cost, but yes, t is possible to give a hint on when
> > PG_migrate_skip should be cleared and move away from that time-based hammer.
> > 
> > First, we'd introduce a variant of get_pageblock_migratetype() that returns
> > all the bits for the pageblock flags and then helpers to extract either the
> > migratetype or the PG_migrate_skip. We already are incurring the cost of
> > get_pageblock_migratetype() so it will not be much more expensive than what
> > is already there. If there is an allocation or free within a pageblock that
> > as the PG_migrate_skip bit set then we increment a counter. When the counter
> > reaches some to-be-decided "threshold" then compaction may clear all the
> > bits. This would match the criteria of the clearing being based on activity.
> > 
> > There are four potential problems with this
> > 
> > 1. The logic to retrieve all the bits and split them up will be a little
> >    convulated but maybe it would not be that bad.
> > 
> > 2. The counter is a shared-writable cache line but obviously it could
> >    be moved to vmstat and incremented with inc_zone_page_state to offset
> >    the cost a little.
> > 
> > 3. The biggested weakness is that there is not way to know if the
> >    counter is incremented based on activity in a small subset of blocks.
> > 
> > 4. What should the threshold be?
> > 
> > The first problem is minor but the other three are potentially a mess.
> > Adding another vmstat counter is bad enough in itself but if the counter
> > is incremented based on a small subsets of pageblocks, the hint becomes
> > is potentially useless.
> 
> Another idea is that we can add two bits(PG_check_migrate/PG_check_free)
> in pageblock_flags_group.
> In allocation path, we can set PG_check_migrate in a pageblock
> In free path, we can set PG_check_free in a pageblock.
> And they are cleared by compaction's scan like now.
> So we can discard 3 and 4 at least.
> 

Adding a second bit does not fix problem 3 or problem 4 at all. With two
bits, all activity could be concentrated on two blocks - one migrate and
one free. The threshold still has to be selected.

> Another idea is that let's cure it by fixing fundamental problem.
> Make zone's locks more fine-grained.

Far easier said than done and only covers the contention problem. It
does nothing for the scanning problem.

> As time goes by, system uses bigger memory but our lock of zone
> isn't scalable. Recently, lru_lock and zone->lock contention report
> isn't rare so i think it's good time that we move next step.
> 

Lock contention on both those locks recently were due to compaction
rather than something more fundamental.

> How about defining struct sub_zone per 2G or 4G?
> so a zone can have several sub_zone as size and subzone can replace
> current zone's role and zone is just container of subzones.
> Of course, it's not easy to implement but I think someday we should
> go that way. Is it a really overkill?
> 

One one side that greatly increases the cost of the page allocator and
the size of the zonelist it must walk as it'll need additional walks for
each of these lists. The interaction with fragmentation avoidance and
how it handles fallbacks would be particularly problematic. On the other
side, multiple sub-zones will also introduce multiple LRUs making the
existing balancing problem considerably worse.

And again, all this would be aimed at contention and do nothing for the
scanning problem at hand.

That introduces a multiple LRUs that must be balanced problem. 

I'm work on a patch that removes the time heuristic that I think might
work. Will hopefully post it today.

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 603d0b5..a456361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,6 +368,9 @@  struct zone {
 	 */
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	unsigned long		compact_blockskip_expire;
+#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 19ef95d..eed27f4 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,6 +30,9 @@  enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
+#ifdef CONFIG_COMPACTION
+	PB_migrate_skip,/* If set the block is skipped by compaction */
+#endif /* CONFIG_COMPACTION */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -65,10 +68,22 @@  unsigned long get_pageblock_flags_group(struct page *page,
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
 					int start_bitidx, int end_bitidx);
 
+#ifdef CONFIG_COMPACTION
+#define get_pageblock_skip(page) \
+			get_pageblock_flags_group(page, PB_migrate_skip,     \
+							PB_migrate_skip + 1)
+#define clear_pageblock_skip(page) \
+			set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
+							PB_migrate_skip + 1)
+#define set_pageblock_skip(page) \
+			set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
+							PB_migrate_skip + 1)
+#endif /* CONFIG_COMPACTION */
+
 #define get_pageblock_flags(page) \
-			get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+			get_pageblock_flags_group(page, 0, PB_migrate_end)
 #define set_pageblock_flags(page, flags) \
 			set_pageblock_flags_group(page, flags,	\
-						  0, NR_PAGEBLOCK_BITS-1)
+						  0, PB_migrate_end)
 
 #endif	/* PAGEBLOCK_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 9fc1b61..9276bc8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,64 @@  static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool isolation_suitable(struct compact_control *cc,
+					struct page *page)
+{
+	if (cc->ignore_skip_hint)
+		return true;
+
+	return !get_pageblock_skip(page);
+}
+
+/*
+ * This function is called to clear all cached information on pageblocks that
+ * should be skipped for page isolation when the migrate and free page scanner
+ * meet.
+ */
+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}
+}
+
+/*
+ * If no pages were isolated then mark this pageblock to be skipped in the
+ * future. The information is later cleared by reset_isolation_suitable().
+ */
+static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+{
+	if (!page)
+		return;
+
+	if (!nr_isolated)
+		set_pageblock_skip(page);
+}
+
 static inline bool should_release_lock(spinlock_t *lock)
 {
 	return need_resched() || spin_is_contended(lock);
@@ -182,7 +240,7 @@  static unsigned long isolate_freepages_block(struct compact_control *cc,
 				bool strict)
 {
 	int nr_scanned = 0, total_isolated = 0;
-	struct page *cursor;
+	struct page *cursor, *valid_page = NULL;
 	unsigned long flags;
 	bool locked = false;
 
@@ -196,6 +254,8 @@  static unsigned long isolate_freepages_block(struct compact_control *cc,
 		if (!pfn_valid_within(blockpfn))
 			goto strict_check;
 		nr_scanned++;
+		if (!valid_page)
+			valid_page = page;
 
 		if (!PageBuddy(page))
 			goto strict_check;
@@ -253,6 +313,10 @@  out:
 	if (locked)
 		spin_unlock_irqrestore(&cc->zone->lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (blockpfn == end_pfn)
+		update_pageblock_skip(valid_page, total_isolated);
+
 	return total_isolated;
 }
 
@@ -390,6 +454,7 @@  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	struct lruvec *lruvec;
 	unsigned long flags;
 	bool locked = false;
+	struct page *page = NULL, *valid_page = NULL;
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -410,7 +475,6 @@  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	/* Time to isolate some pages for migration */
 	cond_resched();
 	for (; low_pfn < end_pfn; low_pfn++) {
-		struct page *page;
 
 		/* give a chance to irqs before checking need_resched() */
 		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
@@ -447,6 +511,14 @@  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (page_zone(page) != zone)
 			continue;
 
+		if (!valid_page)
+			valid_page = page;
+
+		/* If isolation recently failed, do not retry */
+		pageblock_nr = low_pfn >> pageblock_order;
+		if (!isolation_suitable(cc, page))
+			goto next_pageblock;
+
 		/* Skip if free */
 		if (PageBuddy(page))
 			continue;
@@ -456,7 +528,6 @@  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 * migration is optimistic to see if the minimum amount of work
 		 * satisfies the allocation
 		 */
-		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
 			goto next_pageblock;
@@ -531,6 +602,10 @@  next_pageblock:
 	if (locked)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (low_pfn == end_pfn)
+		update_pageblock_skip(valid_page, nr_isolated);
+
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
 	return low_pfn;
@@ -594,6 +669,10 @@  static void isolate_freepages(struct zone *zone,
 		if (!suitable_migration_target(page))
 			continue;
 
+		/* If isolation recently failed, do not retry */
+		if (!isolation_suitable(cc, page))
+			continue;
+
 		/* Found a block suitable for isolating free pages from */
 		isolated = 0;
 		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
@@ -710,8 +789,10 @@  static int compact_finished(struct zone *zone,
 		return COMPACT_PARTIAL;
 
 	/* Compaction run completes if the migrate and free scanner meet */
-	if (cc->free_pfn <= cc->migrate_pfn)
+	if (cc->free_pfn <= cc->migrate_pfn) {
+		reset_isolation_suitable(cc->zone);
 		return COMPACT_COMPLETE;
+	}
 
 	/*
 	 * order == -1 is expected when compacting via
@@ -819,6 +900,10 @@  static int compact_zone(struct zone *zone, struct compact_control *cc)
 	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
 	cc->free_pfn &= ~(pageblock_nr_pages-1);
 
+	/* Clear pageblock skip if there are numerous alloc failures */
+	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
+		reset_isolation_suitable(zone);
+
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
diff --git a/mm/internal.h b/mm/internal.h
index f590a6e..fbf5ddc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -121,6 +121,7 @@  struct compact_control {
 	unsigned long free_pfn;		/* isolate_freepages search base */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
+	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
 
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c03037e..ba34eee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,6 +5662,7 @@  static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 		.order = -1,
 		.zone = page_zone(pfn_to_page(start)),
 		.sync = true,
+		.ignore_skip_hint = true,
 	};
 	INIT_LIST_HEAD(&cc.migratepages);