| Message ID | 20100218222923.GC31681@kryten (mailing list archive) |
|---|---|
| State | Accepted, archived |
| Commit | 27f10907b7cca57df5e2a9c94c14354dd1b7879d |
| Delegated to | Benjamin Herrenschmidt |
Hi,

> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.

FYI even with this enabled I could trip it up pretty easily with a
multi-threaded application. I tried running stream across all threads in
node 0. The machine looks like:

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 free: 30254 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 free: 31832 MB

Now create some clean pagecache on node 0:

# taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
# sync

node 0 free: 12880 MB
node 1 free: 31830 MB

I built stream to use about 25GB of memory. I then ran stream across all
threads in node 0:

# OMP_NUM_THREADS=16 taskset -c 0-15 ./stream

We exhaust all memory on node 0, and start using memory on node 1:

node 0 free: 0 MB
node 1 free: 20795 MB

ie about 10GB of node 1. Now if we run the same test with one thread:

# OMP_NUM_THREADS=1 taskset -c 0 ./stream

things are much better:

node 0 free: 11 MB
node 1 free: 31552 MB

Interestingly enough it takes two goes to get completely onto node 0, even
with one thread. The second run looks like:

node 0 free: 14 MB
node 1 free: 31811 MB

I had a quick look at the page allocation logic and I think I understand
why we would have issues with multiple threads all trying to allocate at
once:

- The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
  a time, and whatever thread is in zone reclaim probably only frees a
  small amount of memory. Certainly not enough to satisfy all 16 threads.

- We seem to end up racing between zone_watermark_ok, zone_reclaim and
  buffered_rmqueue. Since everyone is in here, the memory one thread
  reclaims may be stolen by another thread.

I'm not sure if there is an easy way to fix this without penalising other
workloads though.

Anton
On Fri, Feb 19, 2010 at 11:07:30AM +1100, Anton Blanchard wrote:
> > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > zone reclaim.

I've no problem with the patch anyway.

> [...]
>
> - The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
>   a time, and whatever thread is in zone reclaim probably only frees a small
>   amount of memory. Certainly not enough to satisfy all 16 threads.
>
> - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>   may be stolen by another thread.

You're pretty much on the button here. Only one thread at a time enters
zone_reclaim. The others back off and try the next zone in the zonelist
instead. I'm not sure what the original intention was, but most likely it
was to prevent too many parallel reclaimers in the same zone potentially
dumping out way more data than necessary.

> I'm not sure if there is an easy way to fix this without penalising other
> workloads though.

You could experiment with waiting on the bit if the GFP flags allow it. The
expectation would be that the reclaim operation does not take long: wait on
the bit, and if forward progress is being made, recheck the watermarks
before continuing.
On Fri, 19 Feb 2010, Mel Gorman wrote:
> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > > zone reclaim.
>
> I've no problem with the patch anyway.

Nor do I.

> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.

Yes, it was to prevent concurrency slowing down reclaim. At that time the
number of processors per NUMA node was 2 or so. The number of pages that
are reclaimed is limited to avoid tossing too many page cache pages.

> You could experiment with waiting on the bit if the GFP flags allow it. The
> expectation would be that the reclaim operation does not take long: wait
> on the bit, and if forward progress is being made, recheck the watermarks
> before continuing.

You could reclaim more pages during a zone reclaim pass: increase
nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim
pass should reclaim enough local pages to keep the processors on a node
happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
On Fri, Feb 19, 2010 at 8:42 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> [...]
>
> Yes, it was to prevent concurrency slowing down reclaim. At that time the
> number of processors per NUMA node was 2 or so. The number of pages that
> are reclaimed is limited to avoid tossing too many page cache pages.

That is interesting; I always thought it was to try and free page cache
first. For example with zone->min_unmapped_pages, if
zone_pagecache_reclaimable is greater than unmapped pages, we start
reclaiming the cached pages first. The min_unmapped_pages almost sounds
like a higher-level watermark - or am I misreading the code?

Balbir Singh
On Fri, Feb 19, 2010 at 3:59 AM, Anton Blanchard <anton@samba.org> wrote:
> I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
> enabled via this:
>
> 	/*
> 	 * If another node is sufficiently far away then it is better
> 	 * to reclaim pages in a zone before going off node.
> 	 */
> 	if (distance > RECLAIM_DISTANCE)
> 		zone_reclaim_mode = 1;
>
> Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
> RECLAIM_DISTANCE it never kicks in.
>
> The local to remote bandwidth ratios can be quite large on System p
> machines so it makes sense for us to reclaim clean pagecache locally before
> going off node.
>
> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.

A reclaim distance of 10 implies a ratio of 1 - does that mean we'll always
do zone_reclaim() to free page cache and slab cache before moving on to
another node?

Balbir Singh
On Fri, 19 Feb 2010, Balbir Singh wrote:
> That is interesting; I always thought it was to try and free page cache
> first. For example with zone->min_unmapped_pages, if
> zone_pagecache_reclaimable is greater than unmapped pages, we start
> reclaiming the cached pages first. The min_unmapped_pages almost sounds
> like a higher-level watermark - or am I misreading the code?

Indeed, the purpose is to free *old* page cache pages.

min_unmapped_pages is there to protect a minimum of the page cache pages /
fs metadata from zone reclaim so that ongoing file I/O is not impacted.
* Christoph Lameter <cl@linux-foundation.org> [2010-02-19 09:51:12]:
> Indeed, the purpose is to free *old* page cache pages.
>
> min_unmapped_pages is there to protect a minimum of the page cache pages /
> fs metadata from zone reclaim so that ongoing file I/O is not impacted.

Thanks for the explanation!
Hi Balbir,

> A reclaim distance of 10 implies a ratio of 1, that means we'll always
> do zone_reclaim() to free page cache and slab cache before moving on
> to another node?

I want to make an effort to reclaim local pagecache before ever going off
node. As an example, a completely off-node stream result is almost 3x
slower than on-node on my test box.

Anton
Index: powerpc.git/arch/powerpc/include/asm/topology.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/topology.h	2010-02-18 14:26:45.736821967 +1100
+++ powerpc.git/arch/powerpc/include/asm/topology.h	2010-02-18 14:51:24.793071748 +1100
@@ -8,6 +8,16 @@ struct device_node;
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Before going off node we want the VM to try and reclaim from the local
+ * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
+ * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
+ * 20, we never reclaim and go off node straight away.
+ *
+ * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ */
+#define RECLAIM_DISTANCE 10
+
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)
I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:

	/*
	 * If another node is sufficiently far away then it is better
	 * to reclaim pages in a zone before going off node.
	 */
	if (distance > RECLAIM_DISTANCE)
		zone_reclaim_mode = 1;

Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.

The local to remote bandwidth ratios can be quite large on System p
machines, so it makes sense for us to reclaim clean pagecache locally
before going off node.

The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
zone reclaim.

Signed-off-by: Anton Blanchard <anton@samba.org>
---