diff mbox

powerpc/mm: Fix RECLAIM_DISTANCE

Message ID 1485214348-19487-1-git-send-email-gwshan@linux.vnet.ibm.com
State New
Headers show

Commit Message

Gavin Shan Jan. 23, 2017, 11:32 p.m. UTC
When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled,
the nodes in the specified distance (< RECLAIM_DISTANCE) to the preferred
one will be checked for page direct reclaim in the fast path, as below
function call chain indicates. Currently, RECLAIM_DISTANCE is set to 10,
equal to LOCAL_DISTANCE. It means no nodes, including the preferred one,
don't match the conditions. So no nodes are checked for direct reclaim
in the fast path.

   __alloc_pages_nodemask
   get_page_from_freelist
   zone_allows_reclaim

This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred
node and its directly adjacent nodes will be checked for page direct
reclaim. The comments explaining RECLAIM_DISTANCE is out of date. This
updates and makes it correct.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Balbir Singh Jan. 25, 2017, 3:57 a.m. UTC | #1
On Tue, Jan 24, 2017 at 10:32:28AM +1100, Gavin Shan wrote:
> When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled,
> the nodes in the specified distance (< RECLAIM_DISTANCE) to the preferred
> one will be checked for page direct reclaim in the fast path, as below
> function call chain indicates. Currently, RECLAIM_DISTANCE is set to 10,
> equal to LOCAL_DISTANCE. It means no nodes, including the preferred one,
> don't match the conditions. So no nodes are checked for direct reclaim
> in the fast path.
> 
>    __alloc_pages_nodemask
>    get_page_from_freelist
>    zone_allows_reclaim
> 
> This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred
> node and its directly adjacent nodes will be checked for page direct
> reclaim. The comments explaining RECLAIM_DISTANCE is out of date. This
> updates and makes it correct.
> 
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---

I spoke about this at length with Anton and the larger kernel team.
We need performance data before we can commit to the change. Do we
benchmarks to show that the change does not introduce regression
w.r.t runtime cost?

Balbir Singh.
Gavin Shan Jan. 25, 2017, 4:58 a.m. UTC | #2
On Wed, Jan 25, 2017 at 09:27:44AM +0530, Balbir Singh wrote:
>On Tue, Jan 24, 2017 at 10:32:28AM +1100, Gavin Shan wrote:
>> When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled,
>> the nodes in the specified distance (< RECLAIM_DISTANCE) to the preferred
>> one will be checked for page direct reclaim in the fast path, as below
>> function call chain indicates. Currently, RECLAIM_DISTANCE is set to 10,
>> equal to LOCAL_DISTANCE. It means no nodes, including the preferred one,
>> don't match the conditions. So no nodes are checked for direct reclaim
>> in the fast path.
>> 
>>    __alloc_pages_nodemask
>>    get_page_from_freelist
>>    zone_allows_reclaim
>> 
>> This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred
>> node and its directly adjacent nodes will be checked for page direct
>> reclaim. The comments explaining RECLAIM_DISTANCE is out of date. This
>> updates and makes it correct.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>
>I spoke about this at length with Anton and the larger kernel team.
>We need performance data before we can commit to the change. Do we
>benchmarks to show that the change does not introduce regression
>w.r.t runtime cost?
>

Thanks for review. I just found the problem when studying the code
last year. It sounds reasonable not to rely on the slow path for page
reclaim if the fast path can reclaim enough pages. From this point,
I believe the performance should be improved. In the meanwhile, the
page cache/buffer could be released, as part of the output of page
reclaim. It's going to affect fs's performance for sure. So do you
have recommended test examples to measure the improved performance
because of this?

Thanks,
Gavin
Balbir Singh Jan. 27, 2017, 12:49 p.m. UTC | #3
On Wed, Jan 25, 2017 at 03:58:22PM +1100, Gavin Shan wrote:
> On Wed, Jan 25, 2017 at 09:27:44AM +0530, Balbir Singh wrote:
> >On Tue, Jan 24, 2017 at 10:32:28AM +1100, Gavin Shan wrote:
> >> When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled,
> >> the nodes in the specified distance (< RECLAIM_DISTANCE) to the preferred
> >> one will be checked for page direct reclaim in the fast path, as below
> >> function call chain indicates. Currently, RECLAIM_DISTANCE is set to 10,
> >> equal to LOCAL_DISTANCE. It means no nodes, including the preferred one,
> >> don't match the conditions. So no nodes are checked for direct reclaim
> >> in the fast path.
> >> 
> >>    __alloc_pages_nodemask
> >>    get_page_from_freelist
> >>    zone_allows_reclaim
> >> 
> >> This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred
> >> node and its directly adjacent nodes will be checked for page direct
> >> reclaim. The comments explaining RECLAIM_DISTANCE is out of date. This
> >> updates and makes it correct.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >
> >I spoke about this at length with Anton and the larger kernel team.
> >We need performance data before we can commit to the change. Do we
> >benchmarks to show that the change does not introduce regression
> >w.r.t runtime cost?
> >
> 
> Thanks for review. I just found the problem when studying the code
> last year. It sounds reasonable not to rely on the slow path for page
> reclaim if the fast path can reclaim enough pages. From this point,
> I believe the performance should be improved. In the meanwhile, the
> page cache/buffer could be released, as part of the output of page
> reclaim. It's going to affect fs's performance for sure. So do you
> have recommended test examples to measure the improved performance
> because of this?
>

Anton suggested that NUMA distances in powerpc mattered and hurted
performance without this setting. We need to validate to see if this
is still true. A simple way to start would be benchmarking

Balbir Singh.
Anton Blanchard Jan. 30, 2017, 1:02 a.m. UTC | #4
Hi,

> Anton suggested that NUMA distances in powerpc mattered and hurted
> performance without this setting. We need to validate to see if this
> is still true. A simple way to start would be benchmarking

The original issue was that we never reclaimed local clean pagecache.

I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none
of them caused me to reclaim local clean pagecache! We are very broken.

I would think we have test cases for this, but here is a dumb one.
First something to consume memory:

# cat alloc.c

#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

int main(int argc, char *argv[])
{
	void *p;

	unsigned long size;

	size = strtoul(argv[1], NULL, 0);

	p = malloc(size);
	assert(p);
	memset(p, 0, size);
	printf("%p\n", p);

	sleep(3600);

	return 0;
}

Now create a file to consume pagecache. My nodes have 32GB each, so
I create 16GB, enough to consume half of the node:

dd if=/dev/zero of=/tmp/file bs=1G count=16

Clear out our pagecache:

sync
echo 3 > /proc/sys/vm/drop_caches

Bring it in on node 0:

taskset -c 0 cat /tmp/file > /dev/null

Consume 24GB of memory on node 0:

taskset -c 0 ./alloc 25769803776

In all zone reclaim modes, the pagecache never gets reclaimed:

# grep FilePages /sys/devices/system/node/node0/meminfo

Node 0 FilePages:      16757376 kB

And our alloc process shows lots of off node memory used:

3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64

Clearly nothing is working. Gavin, if your patch fixes this we should
get it into stable too.

Anton
Gavin Shan Jan. 30, 2017, 4:38 a.m. UTC | #5
On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>> Anton suggested that NUMA distances in powerpc mattered and hurted
>> performance without this setting. We need to validate to see if this
>> is still true. A simple way to start would be benchmarking
>
>The original issue was that we never reclaimed local clean pagecache.
>
>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none
>of them caused me to reclaim local clean pagecache! We are very broken.
>
>I would think we have test cases for this, but here is a dumb one.
>First something to consume memory:
>
># cat alloc.c
>
>#include <stdlib.h>
>#include <unistd.h>
>#include <string.h>
>#include <assert.h>
>
>int main(int argc, char *argv[])
>{
>	void *p;
>
>	unsigned long size;
>
>	size = strtoul(argv[1], NULL, 0);
>
>	p = malloc(size);
>	assert(p);
>	memset(p, 0, size);
>	printf("%p\n", p);
>
>	sleep(3600);
>
>	return 0;
>}
>
>Now create a file to consume pagecache. My nodes have 32GB each, so
>I create 16GB, enough to consume half of the node:
>
>dd if=/dev/zero of=/tmp/file bs=1G count=16
>
>Clear out our pagecache:
>
>sync
>echo 3 > /proc/sys/vm/drop_caches
>
>Bring it in on node 0:
>
>taskset -c 0 cat /tmp/file > /dev/null
>
>Consume 24GB of memory on node 0:
>
>taskset -c 0 ./alloc 25769803776
>
>In all zone reclaim modes, the pagecache never gets reclaimed:
>
># grep FilePages /sys/devices/system/node/node0/meminfo
>
>Node 0 FilePages:      16757376 kB
>
>And our alloc process shows lots of off node memory used:
>
>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>
>Clearly nothing is working. Gavin, if your patch fixes this we should
>get it into stable too.
>

Anton, thanks for the detailed test case. I tried what you suggested
on the box that has only one node. The memory capacity is 16GB. So
the parameters I used are different from what you had. First of all,
I observed same behaviour that the pagecache can't be reclaimed when
allocating memory for heap. With the patch applied, the pagecache
can be dropped for page reclaim and more details are showed as below
Everything looks good. I'll put your testcase, its result and stable tag
to next revision.

   root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
   Node 0 FilePages:        142400 kB

   root@palm8:/home/gavin# sync
   root@palm8:/home/gavin# echo 3 > /proc/sys/vm/drop_caches

   root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
   Node 0 FilePages:         62848 kB

   root@palm8:/home/gavin# du -sh file
   8.1G    file
   root@palm8:/home/gavin# cat file > /dev/null
   root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
   Node 0 FilePages:       8448000 kB

   root@palm8:/home/gavin# ./alloc 17179869184
   root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
   Node 0 FilePages:        387584 kB

Thanks,
Gavin
Michael Ellerman Jan. 30, 2017, 9:11 p.m. UTC | #6
Gavin Shan <gwshan@linux.vnet.ibm.com> writes:

> On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>>> Anton suggested that NUMA distances in powerpc mattered and hurted
>>> performance without this setting. We need to validate to see if this
>>> is still true. A simple way to start would be benchmarking
>>
>>The original issue was that we never reclaimed local clean pagecache.
>>
>>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none
>>of them caused me to reclaim local clean pagecache! We are very broken.
>>
>>I would think we have test cases for this, but here is a dumb one.
>>First something to consume memory:
>>
>># cat alloc.c
>>
>>#include <stdlib.h>
>>#include <unistd.h>
>>#include <string.h>
>>#include <assert.h>
>>
>>int main(int argc, char *argv[])
>>{
>>	void *p;
>>
>>	unsigned long size;
>>
>>	size = strtoul(argv[1], NULL, 0);
>>
>>	p = malloc(size);
>>	assert(p);
>>	memset(p, 0, size);
>>	printf("%p\n", p);
>>
>>	sleep(3600);
>>
>>	return 0;
>>}
>>
>>Now create a file to consume pagecache. My nodes have 32GB each, so
>>I create 16GB, enough to consume half of the node:
>>
>>dd if=/dev/zero of=/tmp/file bs=1G count=16
>>
>>Clear out our pagecache:
>>
>>sync
>>echo 3 > /proc/sys/vm/drop_caches
>>
>>Bring it in on node 0:
>>
>>taskset -c 0 cat /tmp/file > /dev/null
>>
>>Consume 24GB of memory on node 0:
>>
>>taskset -c 0 ./alloc 25769803776
>>
>>In all zone reclaim modes, the pagecache never gets reclaimed:
>>
>># grep FilePages /sys/devices/system/node/node0/meminfo
>>
>>Node 0 FilePages:      16757376 kB
>>
>>And our alloc process shows lots of off node memory used:
>>
>>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>>
>>Clearly nothing is working. Gavin, if your patch fixes this we should
>>get it into stable too.
>>
>
> Anton, thanks for the detailed test case. I tried what you suggested
> on the box that has only one node. The memory capacity is 16GB. So
> the parameters I used are different from what you had. First of all,
> I observed same behaviour that the pagecache can't be reclaimed when
> allocating memory for heap. With the patch applied, the pagecache
> can be dropped for page reclaim and more details are showed as below
> Everything looks good. I'll put your testcase, its result and stable tag
> to next revision.

Hi Gavin,

I'd like to see some test results from multi-node systems.

I'd also like to understand what has changed since we changed
RECLAIM_DISTANCE in the first place, ie. why did it used to work and now
doesn't?

cheers
Gavin Shan Jan. 31, 2017, 4:33 a.m. UTC | #7
On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>Hi,
>
>> Anton suggested that NUMA distances in powerpc mattered and hurted
>> performance without this setting. We need to validate to see if this
>> is still true. A simple way to start would be benchmarking
>
>The original issue was that we never reclaimed local clean pagecache.
>
>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none
>of them caused me to reclaim local clean pagecache! We are very broken.
>
>I would think we have test cases for this, but here is a dumb one.
>First something to consume memory:
>
># cat alloc.c
>
>#include <stdlib.h>
>#include <unistd.h>
>#include <string.h>
>#include <assert.h>
>
>int main(int argc, char *argv[])
>{
>	void *p;
>
>	unsigned long size;
>
>	size = strtoul(argv[1], NULL, 0);
>
>	p = malloc(size);
>	assert(p);
>	memset(p, 0, size);
>	printf("%p\n", p);
>
>	sleep(3600);
>
>	return 0;
>}
>
>Now create a file to consume pagecache. My nodes have 32GB each, so
>I create 16GB, enough to consume half of the node:
>
>dd if=/dev/zero of=/tmp/file bs=1G count=16
>
>Clear out our pagecache:
>
>sync
>echo 3 > /proc/sys/vm/drop_caches
>
>Bring it in on node 0:
>
>taskset -c 0 cat /tmp/file > /dev/null
>
>Consume 24GB of memory on node 0:
>
>taskset -c 0 ./alloc 25769803776
>
>In all zone reclaim modes, the pagecache never gets reclaimed:
>
># grep FilePages /sys/devices/system/node/node0/meminfo
>
>Node 0 FilePages:      16757376 kB
>
>And our alloc process shows lots of off node memory used:
>
>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>
>Clearly nothing is working. Gavin, if your patch fixes this we should
>get it into stable too.
>

Anton, I think the behaviour looks good. Actually, it's not very relevant 
to the issue addressed by the patch. I will reply to Michael's reply about
the reason. There are two nodes in your system and the memory is expected
to be allocated from node-0. If node-0 doesn't have enough free memory,
the allocater switches to node-1. It means we need more stress.

In the experiment, 38GB is allocated: 16GB for pagecache and 24GB for heap.
It's not exceeding the memory capacity (64GB). So page reclaim in the fast
and slow path weren't triggered. It's why the pagecache wasn't dropped.
I think __GFP_THISNODE isn't specified when page-fault handler tries to
allocate page to accomodate the VMA for the heap.

*Without* the patch applied, I got something as below in the system where
two NUMA nodes and each of them has 64GB memory. Also, I don't think the
patch is going to change the behaviour:

# cat /proc/sys/vm/zone_reclaim_mode 
0

Drop pagecache
Read 8GB file, for pagecache to consume 8GB memory.
Node 0 FilePages:       8496960 kB
taskset -c 0 ./alloc 137438953472       <- 128GB sized heap
Node 0 FilePages:        503424 kB

Eventually, some of swap clusters have been used as well:

# free -m
              total        used        free      shared  buff/cache   available
Mem:         130583      129203         861          10         518         297
Swap:         10987        3145        7842

Thanks,
Gavin
Anton Blanchard Jan. 31, 2017, 4:58 a.m. UTC | #8
Hi,


> Anton, I think the behaviour looks good. Actually, it's not very
> relevant to the issue addressed by the patch. I will reply to
> Michael's reply about the reason. There are two nodes in your system
> and the memory is expected to be allocated from node-0. If node-0
> doesn't have enough free memory, the allocater switches to node-1. It
> means we need more stress.

Did you try setting zone_reclaim_mode? Surely we should reclaim local
clean pagecache if enabled?

Anton
--

zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

zone_reclaim_mode is disabled by default.  For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction.  The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.


> 
> In the experiment, 38GB is allocated: 16GB for pagecache and 24GB for
> heap. It's not exceeding the memory capacity (64GB). So page reclaim
> in the fast and slow path weren't triggered. It's why the pagecache
> wasn't dropped. I think __GFP_THISNODE isn't specified when
> page-fault handler tries to allocate page to accomodate the VMA for
> the heap.
> 
> *Without* the patch applied, I got something as below in the system
> where two NUMA nodes and each of them has 64GB memory. Also, I don't
> think the patch is going to change the behaviour:
> 
> # cat /proc/sys/vm/zone_reclaim_mode 
> 0
> 
> Drop pagecache
> Read 8GB file, for pagecache to consume 8GB memory.
> Node 0 FilePages:       8496960 kB
> taskset -c 0 ./alloc 137438953472       <- 128GB sized heap
> Node 0 FilePages:        503424 kB
> 
> Eventually, some of swap clusters have been used as well:
> 
> # free -m
>               total        used        free      shared  buff/cache
> available Mem:         130583      129203         861
> 10         518         297 Swap:         10987        3145        7842
> 
> Thanks,
> Gavin
>
Gavin Shan Jan. 31, 2017, 5:01 a.m. UTC | #9
On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote:
>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>
>I'd like to see some test results from multi-node systems.
>
>I'd also like to understand what has changed since we changed
>RECLAIM_DISTANCE in the first place, ie. why did it used to work and now
>doesn't?
>

[Ccing Mel]

Michael, thanks for review. I would like to explain a bit more. The issue
addressed by the patch is irrelevant to the number of NUMA nodes. There
is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds
to variable @node_reclaim_mode (their names don't match!). it can have
belowing bits or any combination of them. Its default value is RECLAIM_OFF (0).
Note RECLAIM_ZONE was obsoleted and I will send one patch to remove it
later.

#define RECLAIM_OFF 0
#define RECLAIM_ZONE (1<<0)     /* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1<<1)    /* Writeout pages during reclaim */
#define RECLAIM_UNMAP (1<<2)    /* Unmap pages during reclaim */

When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), node_reclaim()
isn't called on the preferred node as the condition is false: zone_allows_reclaim(
node-A, node-A). As I observed, the distance from node-A to node-A is 10, equal to
RECLAIM_DISTANCE.

static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
        return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
                                RECLAIM_DISTANCE;
}


__alloc_pages_nodemask
   get_page_from_freelist     <- WATERMARK_LOW
      zone_watermark_fast     <- Assume the allocation is breaking WATERMARK_LOW
         node_reclaim         <- @node_reclaim_node isn't 0 and
                                 zone_allows_reclaim(preferred_zone, current_zone) returns true
         __node_reclaim       <- SWAP, WRITEPAGE and UNMAP setting from @node_reclaim_node
         shrink_node
      buffered_rmqueue
   __alloc_pages_slowpath
      get_page_from_freelist        <- WATERMARK_MIN
      __alloc_pages_direct_compact  <- If it's costly allocation (order > 3)
      wake_all_kswapds
      get_page_from_freelist        <- NO_WATERMARK, CPU local node is set to preferred one
      __alloc_pages_direct_reclaim
         __perform_reclaim
         try_to_free_pages          <- WRITEPAGE + UNMAP + SWAP
         do_try_to_free_pages
         shrink_zones               <- Stop until priority (12) reaches to 0 or reclaimed enough
         shrink_node
      __alloc_pages_direct_compact


Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h. It's used when arch
doesn't provide one. It's why I set this macro to 30 in this patch. This issue is 
introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim distances").
In the patch, it had wrong replacement. So I would correct the wrong replacement
alternatively. Or both of them. Which way do you think is the best? Maybe Mel also
has thoughts.

     39  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
     40  {
     41 -       return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
     42 -}
     43 -
     44 -static void __paginginit init_zone_allows_reclaim(int nid)
     45 -{
     46 -       int i;
     47 -
     48 -       for_each_node_state(i, N_MEMORY)
     49 -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
     50 -                       node_set(i, NODE_DATA(nid)->reclaim_nodes);
     51 +       return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
     52 +                               RECLAIM_DISTANCE;
     53  }


Thanks,
Gavin
Gavin Shan Jan. 31, 2017, 5:30 a.m. UTC | #10
On Tue, Jan 31, 2017 at 03:58:16PM +1100, Anton Blanchard wrote:
>Hi,
>
>
>> Anton, I think the behaviour looks good. Actually, it's not very
>> relevant to the issue addressed by the patch. I will reply to
>> Michael's reply about the reason. There are two nodes in your system
>> and the memory is expected to be allocated from node-0. If node-0
>> doesn't have enough free memory, the allocater switches to node-1. It
>> means we need more stress.
>
>Did you try setting zone_reclaim_mode? Surely we should reclaim local
>clean pagecache if enabled?
>

In last experiment, I didn't enable zone_reclaim_mode. After changed it to 0x2
(RECLAIM_WRITE), the local pagecache isn't reclaimed from node-0 as we observed
before.

root@P83-p1:~# cat /proc/sys/vm/zone_reclaim_mode
2
root@P83-p1:~# sync
root@P83-p1:~# echo 3 > /proc/sys/vm/drop_caches
root@P83-p1:~# taskset -c 0 cat /tmp/file.8G > /dev/null
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo 
Node 0 FilePages:       8497920 kB
root@P83-p1:~# taskset -c 0 ./alloc 68719476736
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo 
Node 0 FilePages:       8497920 kB

With the patch applied, the local pagecache is reclaimed:

root@P83-p1:~# cat /proc/sys/vm/zone_reclaim_mode
2
root@P83-p1:~# sync
root@P83-p1:~# echo 3 > /proc/sys/vm/drop_caches
root@P83-p1:~# taskset -c 0 cat /tmp/file.8G > /dev/null
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:       8441472 kB
root@P83-p1:~# taskset -c 0 ./alloc 68719476736
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:        712960 kB

Thanks,
Gavin
Gavin Shan Jan. 31, 2017, 5:40 a.m. UTC | #11
On Tue, Jan 31, 2017 at 04:01:05PM +1100, Gavin Shan wrote:
>On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote:
>>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>>
>>I'd like to see some test results from multi-node systems.
>>
>>I'd also like to understand what has changed since we changed
>>RECLAIM_DISTANCE in the first place, ie. why did it used to work and now
>>doesn't?
>>
>
>[Ccing Mel]
>
>Michael, thanks for review. I would like to explain a bit more. The issue
>addressed by the patch is irrelevant to the number of NUMA nodes. There
>is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds
>to variable @node_reclaim_mode (their names don't match!). it can have
>belowing bits or any combination of them. Its default value is RECLAIM_OFF (0).
>Note RECLAIM_ZONE was obsoleted and I will send one patch to remove it
>later.
>
>#define RECLAIM_OFF 0
>#define RECLAIM_ZONE (1<<0)     /* Run shrink_inactive_list on the zone */
>#define RECLAIM_WRITE (1<<1)    /* Writeout pages during reclaim */
>#define RECLAIM_UNMAP (1<<2)    /* Unmap pages during reclaim */
>
>When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), node_reclaim()
>isn't called on the preferred node as the condition is false: zone_allows_reclaim(
>node-A, node-A). As I observed, the distance from node-A to node-A is 10, equal to
>RECLAIM_DISTANCE.
>
>static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>{
>        return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
>                                RECLAIM_DISTANCE;
>}
>
>
>__alloc_pages_nodemask
>   get_page_from_freelist     <- WATERMARK_LOW
>      zone_watermark_fast     <- Assume the allocation is breaking WATERMARK_LOW
>         node_reclaim         <- @node_reclaim_node isn't 0 and
>                                 zone_allows_reclaim(preferred_zone, current_zone) returns true
>         __node_reclaim       <- SWAP, WRITEPAGE and UNMAP setting from @node_reclaim_node
>         shrink_node
>      buffered_rmqueue
>   __alloc_pages_slowpath
>      get_page_from_freelist        <- WATERMARK_MIN
>      __alloc_pages_direct_compact  <- If it's costly allocation (order > 3)
>      wake_all_kswapds
>      get_page_from_freelist        <- NO_WATERMARK, CPU local node is set to preferred one
>      __alloc_pages_direct_reclaim
>         __perform_reclaim
>         try_to_free_pages          <- WRITEPAGE + UNMAP + SWAP
>         do_try_to_free_pages
>         shrink_zones               <- Stop until priority (12) reaches to 0 or reclaimed enough
>         shrink_node
>      __alloc_pages_direct_compact
>
>
>Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h. It's used when arch
>doesn't provide one. It's why I set this macro to 30 in this patch. This issue is 
>introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim distances").
>In the patch, it had wrong replacement. So I would correct the wrong replacement
>alternatively. Or both of them. Which way do you think is the best? Maybe Mel also
>has thoughts.
>
>     39  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>     40  {
>     41 -       return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
>     42 -}
>     43 -
>     44 -static void __paginginit init_zone_allows_reclaim(int nid)
>     45 -{
>     46 -       int i;
>     47 -
>     48 -       for_each_node_state(i, N_MEMORY)
>     49 -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
>     50 -                       node_set(i, NODE_DATA(nid)->reclaim_nodes);
>     51 +       return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
>     52 +                               RECLAIM_DISTANCE;
>     53  }
>

Sorry, to make it clear. The patch replaced "<=" with "<" wrongly :)

Thanks,
Gavin
Gavin Shan Feb. 7, 2017, 11:40 p.m. UTC | #12
On Tue, Jan 31, 2017 at 04:01:05PM +1100, Gavin Shan wrote:
>On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote:
>>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>>
>>I'd like to see some test results from multi-node systems.
>>
>>I'd also like to understand what has changed since we changed
>>RECLAIM_DISTANCE in the first place, ie. why did it used to work and now
>>doesn't?
>>
>
>[Ccing Mel]
>
>Michael, thanks for review. I would like to explain a bit more. The issue
>addressed by the patch is irrelevant to the number of NUMA nodes. There
>is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds
>to variable @node_reclaim_mode (their names don't match!). it can have
>belowing bits or any combination of them. Its default value is RECLAIM_OFF (0).
>Note RECLAIM_ZONE was obsoleted and I will send one patch to remove it
>later.
>
>#define RECLAIM_OFF 0
>#define RECLAIM_ZONE (1<<0)     /* Run shrink_inactive_list on the zone */
>#define RECLAIM_WRITE (1<<1)    /* Writeout pages during reclaim */
>#define RECLAIM_UNMAP (1<<2)    /* Unmap pages during reclaim */
>
>When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), node_reclaim()
>isn't called on the preferred node as the condition is false: zone_allows_reclaim(
>node-A, node-A). As I observed, the distance from node-A to node-A is 10, equal to
>RECLAIM_DISTANCE.
>
>static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>{
>        return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
>                                RECLAIM_DISTANCE;
>}
>
>
>__alloc_pages_nodemask
>   get_page_from_freelist     <- WATERMARK_LOW
>      zone_watermark_fast     <- Assume the allocation is breaking WATERMARK_LOW
>         node_reclaim         <- @node_reclaim_node isn't 0 and
>                                 zone_allows_reclaim(preferred_zone, current_zone) returns true
>         __node_reclaim       <- SWAP, WRITEPAGE and UNMAP setting from @node_reclaim_node
>         shrink_node
>      buffered_rmqueue
>   __alloc_pages_slowpath
>      get_page_from_freelist        <- WATERMARK_MIN
>      __alloc_pages_direct_compact  <- If it's costly allocation (order > 3)
>      wake_all_kswapds
>      get_page_from_freelist        <- NO_WATERMARK, CPU local node is set to preferred one
>      __alloc_pages_direct_reclaim
>         __perform_reclaim
>         try_to_free_pages          <- WRITEPAGE + UNMAP + SWAP
>         do_try_to_free_pages
>         shrink_zones               <- Stop until priority (12) reaches to 0 or reclaimed enough
>         shrink_node
>      __alloc_pages_direct_compact
>
>
>Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h. It's used when arch
>doesn't provide one. It's why I set this macro to 30 in this patch. This issue is 
>introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim distances").
>In the patch, it had wrong replacement. So I would correct the wrong replacement
>alternatively. Or both of them. Which way do you think is the best? Maybe Mel also
>has thoughts.
>
>     39  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>     40  {
>     41 -       return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
>     42 -}
>     43 -
>     44 -static void __paginginit init_zone_allows_reclaim(int nid)
>     45 -{
>     46 -       int i;
>     47 -
>     48 -       for_each_node_state(i, N_MEMORY)
>     49 -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
>     50 -                       node_set(i, NODE_DATA(nid)->reclaim_nodes);
>     51 +       return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
>     52 +                               RECLAIM_DISTANCE;
>     53  }
>

Michael, I guess Mel might have missed this thread. I will send one patch to linux-mm
to address "<" and "<=" issue. If the replacement was intentional in Mel's patch, the
patch will introduce more discussion there.

On the other hand, I will repost this patch with upated changelog, meaning to set
RECLAIM_DISTANCE to 30. Its value matches with that in include/linux/topology.h.
The value 10 for RECLAIM_DISTANCE should work if "<" is changed to "<=". However,
30 isn't harmful because there is subsequent check to limit reclaiming on CPU's
local node or those who aren't associated with any CPUs.

Thanks,
Gavin
diff mbox

Patch

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 8b3b46b..ce1a156 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -9,10 +9,11 @@  struct device_node;
 #ifdef CONFIG_NUMA
 
 /*
- * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
- * all zones on all nodes will be eligible for zone_reclaim().
+ * If node_reclaim_mode is enabled, a RECLAIM_DISTANCE of 30 means that
+ * the preferred node and its directly adjacent nodes are eligible for
+ * node_reclaim().
  */
-#define RECLAIM_DISTANCE 10
+#define RECLAIM_DISTANCE	30
 
 #include <asm/mmzone.h>