
powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim

Message ID 20100218222923.GC31681@kryten (mailing list archive)
State Accepted, archived
Commit 27f10907b7cca57df5e2a9c94c14354dd1b7879d
Delegated to: Benjamin Herrenschmidt

Commit Message

Anton Blanchard Feb. 18, 2010, 10:29 p.m. UTC
I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:

        /*
         * If another node is sufficiently far away then it is better
         * to reclaim pages in a zone before going off node.
         */
        if (distance > RECLAIM_DISTANCE)
                zone_reclaim_mode = 1;

Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.

The local to remote bandwidth ratios can be quite large on System p
machines so it makes sense for us to reclaim clean pagecache locally before
going off node.

The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
zone reclaim.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Comments

Anton Blanchard Feb. 19, 2010, 12:07 a.m. UTC | #1
Hi,

> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.

FYI, even with this enabled I could trip it up pretty easily with a
multi-threaded application. I tried running stream across all threads in node 0. The
machine looks like:

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 free: 30254 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 free: 31832 MB

Now create some clean pagecache on node 0:

# taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
# sync

node 0 free: 12880 MB
node 1 free: 31830 MB

I built stream to use about 25GB of memory. I then ran stream across all
threads in node 0:

# OMP_NUM_THREADS=16 taskset -c 0-15 ./stream

We exhaust all memory on node 0, and start using memory on node 1:

node 0 free: 0 MB
node 1 free: 20795 MB

ie about 10GB of node 1. Now if we run the same test with one thread:

# OMP_NUM_THREADS=1 taskset -c 0 ./stream

things are much better:

node 0 free: 11 MB
node 1 free: 31552 MB

Interestingly enough it takes two goes to get completely onto node 0, even
with one thread. The second run looks like:

node 0 free: 14 MB
node 1 free: 31811 MB

I had a quick look at the page allocation logic and I think I understand why
we would have issues with multiple threads all trying to allocate at once.

- The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
  a time, and whatever thread is in zone reclaim probably only frees a small
  amount of memory. Certainly not enough to satisfy all 16 threads.

- We seem to end up racing between zone_watermark_ok, zone_reclaim and
  buffered_rmqueue. Since everyone is in here the memory one thread reclaims
  may be stolen by another thread.

I'm not sure if there is an easy way to fix this without penalising other
workloads though.

Anton
Mel Gorman Feb. 19, 2010, 2:55 p.m. UTC | #2
On Fri, Feb 19, 2010 at 11:07:30AM +1100, Anton Blanchard wrote:
> 
> Hi,
> 
> > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > zone reclaim.
> 

I've no problem with the patch anyway.

> FYI, even with this enabled I could trip it up pretty easily with a
> multi-threaded application. I tried running stream across all threads in node 0. The
> machine looks like:
> 
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 free: 30254 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 free: 31832 MB
> 
> Now create some clean pagecache on node 0:
> 
> # taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
> # sync
> 
> node 0 free: 12880 MB
> node 1 free: 31830 MB
> 
> I built stream to use about 25GB of memory. I then ran stream across all
> threads in node 0:
> 
> # OMP_NUM_THREADS=16 taskset -c 0-15 ./stream
> 
> We exhaust all memory on node 0, and start using memory on node 1:
> 
> node 0 free: 0 MB
> node 1 free: 20795 MB
> 
> ie about 10GB of node 1. Now if we run the same test with one thread:
> 
> # OMP_NUM_THREADS=1 taskset -c 0 ./stream
> 
> things are much better:
> 
> node 0 free: 11 MB
> node 1 free: 31552 MB
> 
> Interestingly enough it takes two goes to get completely onto node 0, even
> with one thread. The second run looks like:
> 
> node 0 free: 14 MB
> node 1 free: 31811 MB
> 
> I had a quick look at the page allocation logic and I think I understand why
> we would have issues with multiple threads all trying to allocate at once.
> 
> - The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
>   a time, and whatever thread is in zone reclaim probably only frees a small
>   amount of memory. Certainly not enough to satisfy all 16 threads.
> 
> - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>   may be stolen by another thread.
> 

You're pretty much on the button here. Only one thread at a time enters
zone_reclaim. The others back off and try the next zone in the zonelist
instead. I'm not sure what the original intention was but most likely it
was to prevent too many parallel reclaimers in the same zone potentially
dumping out way more data than necessary.
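
For anyone following along, the back-off looks roughly like this in
zone_reclaim() in mm/vmscan.c (paraphrased from memory and heavily trimmed,
so treat the details as approximate):

int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
	int ret;

	/* ... min_unmapped_pages / min_slab_pages / node checks elided ... */

	/* Only one task reclaims from a given zone at a time. */
	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
		return ZONE_RECLAIM_NOSCAN;	/* others skip to the next zone */

	ret = __zone_reclaim(zone, gfp_mask, order);

	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

	return ret;
}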

> I'm not sure if there is an easy way to fix this without penalising other
> workloads though.
> 

You could experiment with waiting on the bit if the GFP flags allow it? The
expectation would be that the reclaim operation does not take long. Wait
on the bit and, if forward progress is being made, recheck the
watermarks before continuing.
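
Something like this, maybe (a completely untested sketch; zone_reclaim_wait()
is a made-up helper that just schedules and returns 0, and the watermark
recheck is only approximate):

static int zone_reclaim_wait(void *word)
{
	schedule();
	return 0;
}

and then in zone_reclaim():

	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) {
		/* Atomic callers back off to the next zone as before. */
		if (!(gfp_mask & __GFP_WAIT))
			return ZONE_RECLAIM_NOSCAN;

		/* Sleep until the current reclaimer drops the flag... */
		wait_on_bit(&zone->flags, ZONE_RECLAIM_LOCKED,
			    zone_reclaim_wait, TASK_UNINTERRUPTIBLE);

		/* ...and if it made enough progress, don't reclaim again. */
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
				      zone_idx(zone), 0))
			return ZONE_RECLAIM_SUCCESS;
	}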
Christoph Lameter Feb. 19, 2010, 3:12 p.m. UTC | #3
On Fri, 19 Feb 2010, Mel Gorman wrote:

> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > > zone reclaim.
> >
>
> I've no problem with the patch anyway.

Nor do I.

> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and
> >   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
> >   may be stolen by another thread.
> >
>
> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.

Yes it was to prevent concurrency slowing down reclaim. At that time the
number of processors per NUMA node was 2 or so. The number of pages that
are reclaimed is limited to avoid tossing too many page cache pages.

> You could experiment with waiting on the bit if the GFP flags allow it? The
> expectation would be that the reclaim operation does not take long. Wait
> on the bit and, if forward progress is being made, recheck the
> watermarks before continuing.

You could reclaim more pages during a zone reclaim pass? Increase the
nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim
pass should reclaim enough local pages to keep the processors on a node
happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
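
For reference, the knob in question is the scan_control that __zone_reclaim()
sets up (paraphrased from memory; the 1/16th-of-a-zone figure is just the one
floated above):

	const unsigned long nr_pages = 1 << order;
	struct scan_control sc = {
		.may_writepage	= !!(zone_reclaim_mode & RECLAIM_WRITE),
		.may_unmap	= !!(zone_reclaim_mode & RECLAIM_SWAP),
		.may_swap	= 1,
		/*
		 * Currently this asks for little more than the allocation
		 * itself needs: max_t(unsigned long, nr_pages,
		 * SWAP_CLUSTER_MAX).  The experiment would reclaim a bigger
		 * batch, e.g. a fraction of the zone:
		 */
		.nr_to_reclaim	= max_t(unsigned long, nr_pages,
					zone->present_pages / 16),
		.gfp_mask	= gfp_mask,
		.swappiness	= vm_swappiness,
		.order		= order,
	};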
Balbir Singh Feb. 19, 2010, 3:41 p.m. UTC | #4
On Fri, Feb 19, 2010 at 8:42 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> On Fri, 19 Feb 2010, Mel Gorman wrote:
>
>> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
>> > > zone reclaim.
>> >
>>
>> I've no problem with the patch anyway.
>
> Nor do I.
>
>> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>> >   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>> >   may be stolen by another thread.
>> >
>>
>> You're pretty much on the button here. Only one thread at a time enters
>> zone_reclaim. The others back off and try the next zone in the zonelist
>> instead. I'm not sure what the original intention was but most likely it
>> was to prevent too many parallel reclaimers in the same zone potentially
>> dumping out way more data than necessary.
>
> Yes it was to prevent concurrency slowing down reclaim. At that time the
> number of processors per NUMA node was 2 or so. The number of pages that
> are reclaimed is limited to avoid tossing too many page cache pages.
>

That is interesting; I always thought it was to try and free page
cache first. For example, with zone->min_unmapped_pages, if
zone_pagecache_reclaimable is greater than min_unmapped_pages, we start
reclaiming the cached pages first. The min_unmapped_pages almost sounds
like a higher-level watermark - or am I misreading the code?

Balbir Singh
Balbir Singh Feb. 19, 2010, 3:43 p.m. UTC | #5
On Fri, Feb 19, 2010 at 3:59 AM, Anton Blanchard <anton@samba.org> wrote:
>
> I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
> enabled via this:
>
>        /*
>         * If another node is sufficiently far away then it is better
>         * to reclaim pages in a zone before going off node.
>         */
>        if (distance > RECLAIM_DISTANCE)
>                zone_reclaim_mode = 1;
>
> Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
> RECLAIM_DISTANCE it never kicks in.
>
> The local to remote bandwidth ratios can be quite large on System p
> machines so it makes sense for us to reclaim clean pagecache locally before
> going off node.
>
> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.
>

A reclaim distance of 10 implies a ratio of 1; does that mean we'll always
do zone_reclaim() to free page cache and slab cache before moving on
to another node?

Balbir Singh.
Christoph Lameter Feb. 19, 2010, 3:51 p.m. UTC | #6
On Fri, 19 Feb 2010, Balbir Singh wrote:

> >> zone_reclaim. The others back off and try the next zone in the zonelist
> >> instead. I'm not sure what the original intention was but most likely it
> >> was to prevent too many parallel reclaimers in the same zone potentially
> >> dumping out way more data than necessary.
> >
> > Yes it was to prevent concurrency slowing down reclaim. At that time the
> > number of processors per NUMA node was 2 or so. The number of pages that
> > are reclaimed is limited to avoid tossing too many page cache pages.
> >
>
> That is interesting; I always thought it was to try and free page
> cache first. For example, with zone->min_unmapped_pages, if
> zone_pagecache_reclaimable is greater than min_unmapped_pages, we start
> reclaiming the cached pages first. The min_unmapped_pages almost sounds
> like a higher-level watermark - or am I misreading the code?

Indeed the purpose is to free *old* page cache pages.

The min_unmapped_pages is to protect a minimum of the page cache pages /
fs metadata from zone reclaim so that ongoing file I/O is not impacted.
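
i.e. the gate at the top of zone_reclaim(), roughly (paraphrased from the
code of this era, so details may be slightly off):

	/*
	 * Only reclaim here if the zone has more unmapped file pages than
	 * min_unmapped_pages (or more reclaimable slab than min_slab_pages);
	 * anything below those floors is left alone for ongoing file I/O.
	 */
	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
		return ZONE_RECLAIM_FULL;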
Balbir Singh Feb. 19, 2010, 5:39 p.m. UTC | #7
* Christoph Lameter <cl@linux-foundation.org> [2010-02-19 09:51:12]:

> On Fri, 19 Feb 2010, Balbir Singh wrote:
> 
> > >> zone_reclaim. The others back off and try the next zone in the zonelist
> > >> instead. I'm not sure what the original intention was but most likely it
> > >> was to prevent too many parallel reclaimers in the same zone potentially
> > >> dumping out way more data than necessary.
> > >
> > > Yes it was to prevent concurrency slowing down reclaim. At that time the
> > > number of processors per NUMA node was 2 or so. The number of pages that
> > > are reclaimed is limited to avoid tossing too many page cache pages.
> > >
> >
> > That is interesting; I always thought it was to try and free page
> > cache first. For example, with zone->min_unmapped_pages, if
> > zone_pagecache_reclaimable is greater than min_unmapped_pages, we start
> > reclaiming the cached pages first. The min_unmapped_pages almost sounds
> > like a higher-level watermark - or am I misreading the code?
> 
> Indeed the purpose is to free *old* page cache pages.
> 
> The min_unmapped_pages is to protect a minimum of the page cache pages /
> fs metadata from zone reclaim so that ongoing file I/O is not impacted.

Thanks for the explanation!
Anton Blanchard Feb. 23, 2010, 1:38 a.m. UTC | #8
Hi Balbir,

> A reclaim distance of 10 implies a ratio of 1; does that mean we'll always
> do zone_reclaim() to free page cache and slab cache before moving on
> to another node?

I want to make an effort to reclaim local pagecache before ever going
off node. As an example, a completely off node stream result is almost 3x
slower than on node on my test box.
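
For reference, the arithmetic: if I'm reading include/linux/topology.h right,
the generic defaults are LOCAL_DISTANCE = 10, REMOTE_DISTANCE = 20 and
RECLAIM_DISTANCE = 20, so on a two node box node_distance(0, 1) = 20 and the
"distance > RECLAIM_DISTANCE" test (20 > 20) never fires. With
RECLAIM_DISTANCE at 10 the same box sees 20 > 10, so yes, zone reclaim is
enabled for any remote node and we always try to reclaim locally first.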

Anton

Patch

Index: powerpc.git/arch/powerpc/include/asm/topology.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/topology.h	2010-02-18 14:26:45.736821967 +1100
+++ powerpc.git/arch/powerpc/include/asm/topology.h	2010-02-18 14:51:24.793071748 +1100
@@ -8,6 +8,16 @@  struct device_node;
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Before going off node we want the VM to try and reclaim from the local
+ * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
+ * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
+ * 20, we never reclaim and go off node straight away.
+ *
+ * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ */
+#define RECLAIM_DISTANCE 10
+
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)