
[4/4] mm: numa: Slow PTE scan rate if migration failures occur

Message ID 1425741651-29152-5-git-send-email-mgorman@suse.de (mailing list archive)
State Not Applicable
Delegated to: Michael Ellerman

Commit Message

Mel Gorman March 7, 2015, 3:20 p.m. UTC
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

Across the board the 4.0-rc1 numbers are much slower, and the degradation
is far worse when using the large memory footprint configs. Perf points
straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive migrations
are meant to get throttled. Normally, the scan rate is tuned on a per-task
basis depending on the locality of faults.  However, if migrations fail
for any reason then the PTE scanner may scan faster if the faults continue
to be remote. This means there is higher system CPU overhead and fault
trapping at exactly the time we know that migrations cannot happen. This
patch tracks when migration failures occur and slows the PTE scanner.
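
As a rough sketch of the throttling idea only -- this is a toy user-space
model, not the kernel change, and the names and thresholds are invented --
the per-task adjustment amounts to something like this:

   /* Toy model: if recent hinting faults led to failed migrations, grow
    * the scan period so the PTE scanner runs less often; otherwise let
    * fault locality drive it as before. */
   #include <stdio.h>

   #define SCAN_PERIOD_MIN   1000        /* ms, illustrative only */
   #define SCAN_PERIOD_MAX  60000        /* ms, illustrative only */

   static unsigned int update_scan_period(unsigned int period,
                                          unsigned long local_faults,
                                          unsigned long remote_faults,
                                          unsigned long failed_migrations)
   {
           if (failed_migrations || local_faults >= remote_faults) {
                   /* Migrations failing or faults already local: back off. */
                   period *= 2;
                   if (period > SCAN_PERIOD_MAX)
                           period = SCAN_PERIOD_MAX;
           } else {
                   /* Faults remote and migrations succeeding: scan faster. */
                   period /= 2;
                   if (period < SCAN_PERIOD_MIN)
                           period = SCAN_PERIOD_MIN;
           }
           return period;
   }

   int main(void)
   {
           unsigned int p = 4000;

           p = update_scan_period(p, 10, 90, 0);   /* remote, migrating fine */
           printf("remote faults, migrations ok:   period = %u ms\n", p);
           p = update_scan_period(p, 10, 90, 50);  /* remote, but failing */
           printf("remote faults, migrations fail: period = %u ms\n", p);
           return 0;
   }

The real change lands in the files listed in the diffstat below, but the
effect it aims for is the same: failed migrations slow the scanner instead
of letting continuing remote faults speed it up.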

This was tested on a 4-socket bare-metal machine with 48 cores. The results
compare 4.0-rc1, the patches applied, and 3.19-vanilla, which was the last
known good kernel. This is the standard autonuma benchmark:

                                           4.0.0-rc1             4.0.0-rc1                3.19.0
                                             vanilla           slowscan-v2               vanilla
Time System-NUMA01                  602.44 (  0.00%)      209.42 ( 65.24%)      194.70 ( 67.68%)
Time System-NUMA01_THEADLOCAL        78.10 (  0.00%)       92.70 (-18.69%)       98.52 (-26.15%)
Time System-NUMA02                    6.47 (  0.00%)        6.06 (  6.34%)        9.28 (-43.43%)
Time System-NUMA02_SMT                5.06 (  0.00%)        3.39 ( 33.00%)        3.79 ( 25.10%)
Time Elapsed-NUMA01                 755.96 (  0.00%)      833.63 (-10.27%)      558.84 ( 26.08%)
Time Elapsed-NUMA01_THEADLOCAL      382.22 (  0.00%)      395.45 ( -3.46%)      382.54 ( -0.08%)
Time Elapsed-NUMA02                  49.38 (  0.00%)       50.21 ( -1.68%)       49.83 ( -0.91%)
Time Elapsed-NUMA02_SMT              47.70 (  0.00%)       48.55 ( -1.78%)       46.59 (  2.33%)

There is a performance drop as a result of this patch, although in the
case of NUMA01 it is not a major concern as it's an adverse workload. The
important point is that in most cases system CPU usage is much lower. Here
are the totals:

           4.0.0-rc1   4.0.0-rc1      3.19.0
             vanilla  slowscan-v2     vanilla
User        53384.29    56093.11    46119.12
System        692.14      311.64      306.41
Elapsed      1236.87     1328.61     1039.88

Note that the system CPU usage is now similar to 3.19-vanilla.

I also tested with a workload very similar to Dave's. The machine
configuration and storage is completely different so it's not an equivalent
test unfortunately. It's reporting the elapsed time and CPU time while
fsmark is running to create the inodes and when running xfsrepair afterwards

xfsrepair
                                    4.0.0-rc1             4.0.0-rc1                3.19.0
                                      vanilla           slowscan-v2               vanilla
Min      real-fsmark        1157.41 (  0.00%)     1150.38 (  0.61%)     1164.44 ( -0.61%)
Min      syst-fsmark        3998.06 (  0.00%)     3988.42 (  0.24%)     4016.12 ( -0.45%)
Min      real-xfsrepair      497.64 (  0.00%)      456.87 (  8.19%)      442.64 ( 11.05%)
Min      syst-xfsrepair      500.61 (  0.00%)      263.41 ( 47.38%)      194.97 ( 61.05%)
Amean    real-fsmark        1166.63 (  0.00%)     1155.97 (  0.91%)     1166.28 (  0.03%)
Amean    syst-fsmark        4020.94 (  0.00%)     4004.19 (  0.42%)     4025.87 ( -0.12%)
Amean    real-xfsrepair      507.85 (  0.00%)      459.58 (  9.50%)      447.66 ( 11.85%)
Amean    syst-xfsrepair      519.88 (  0.00%)      281.63 ( 45.83%)      202.93 ( 60.97%)
Stddev   real-fsmark           6.55 (  0.00%)        3.97 ( 39.30%)        1.44 ( 77.98%)
Stddev   syst-fsmark          16.22 (  0.00%)       15.09 (  6.96%)        9.76 ( 39.86%)
Stddev   real-xfsrepair       11.17 (  0.00%)        3.41 ( 69.43%)        5.57 ( 50.17%)
Stddev   syst-xfsrepair       13.98 (  0.00%)       19.94 (-42.60%)        5.69 ( 59.31%)
CoeffVar real-fsmark           0.56 (  0.00%)        0.34 ( 38.74%)        0.12 ( 77.97%)
CoeffVar syst-fsmark           0.40 (  0.00%)        0.38 (  6.57%)        0.24 ( 39.93%)
CoeffVar real-xfsrepair        2.20 (  0.00%)        0.74 ( 66.22%)        1.24 ( 43.47%)
CoeffVar syst-xfsrepair        2.69 (  0.00%)        7.08 (-163.23%)        2.80 ( -4.23%)
Max      real-fsmark        1171.98 (  0.00%)     1159.25 (  1.09%)     1167.96 (  0.34%)
Max      syst-fsmark        4033.84 (  0.00%)     4024.53 (  0.23%)     4039.20 ( -0.13%)
Max      real-xfsrepair      523.40 (  0.00%)      464.40 ( 11.27%)      455.42 ( 12.99%)
Max      syst-xfsrepair      533.37 (  0.00%)      309.38 ( 42.00%)      207.94 ( 61.01%)

The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
is almost cut in half. It's still not as low as 3.19-vanilla but it's
much closer

                             4.0.0-rc1   4.0.0-rc1      3.19.0
                               vanilla  slowscan-v2     vanilla
NUMA alloc hit               146138883   121929782   104019526
NUMA alloc miss               13146328    11456356     7806370
NUMA interleave hit                  0           0           0
NUMA alloc local             146060848   121865921   103953085
NUMA base PTE updates        242201535   117237258   216624143
NUMA huge PMD updates           113270       52121      127782
NUMA page range updates      300195775   143923210   282048527
NUMA hint faults             180388025    87299060   147235021
NUMA hint local faults        72784532    32939258    61866265
NUMA hint local percent             40          37          42
NUMA pages migrated           71175262    41395302    23237799

Note the big differences in faults trapped and pages migrated. 3.19-vanilla
still migrated fewer pages but if necessary the threshold at which we
start throttling migrations can be lowered.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 9 +++++----
 kernel/sched/fair.c   | 8 ++++++--
 mm/huge_memory.c      | 3 ++-
 mm/memory.c           | 3 ++-
 4 files changed, 15 insertions(+), 8 deletions(-)

Comments

Ingo Molnar March 7, 2015, 4:36 p.m. UTC | #1
* Mel Gorman <mgorman@suse.de> wrote:

> Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
> 
> Across the board the 4.0-rc1 numbers are much slower, and the 
> degradation is far worse when using the large memory footprint 
> configs. Perf points straight at the cause - this is from 4.0-rc1 on 
> the "-o bhash=101073" config:
> 
> [...]

>            4.0.0-rc1   4.0.0-rc1      3.19.0
>              vanilla  slowscan-v2     vanilla
> User        53384.29    56093.11    46119.12
> System        692.14      311.64      306.41
> Elapsed      1236.87     1328.61     1039.88
> 
> Note that the system CPU usage is now similar to 3.19-vanilla.

Similar, but still worse, and also the elapsed time is still much 
worse. User time is much higher, although it's the same amount of work 
done on every kernel, right?

> I also tested with a workload very similar to Dave's. The machine 
> configuration and storage is completely different so it's not an 
> equivalent test unfortunately. It's reporting the elapsed time and 
> CPU time while fsmark is running to create the inodes and when 
> running xfsrepair afterwards
> 
> xfsrepair
>                                     4.0.0-rc1             4.0.0-rc1                3.19.0
>                                       vanilla           slowscan-v2               vanilla
> Min      real-fsmark        1157.41 (  0.00%)     1150.38 (  0.61%)     1164.44 ( -0.61%)
> Min      syst-fsmark        3998.06 (  0.00%)     3988.42 (  0.24%)     4016.12 ( -0.45%)
> Min      real-xfsrepair      497.64 (  0.00%)      456.87 (  8.19%)      442.64 ( 11.05%)
> Min      syst-xfsrepair      500.61 (  0.00%)      263.41 ( 47.38%)      194.97 ( 61.05%)
> Amean    real-fsmark        1166.63 (  0.00%)     1155.97 (  0.91%)     1166.28 (  0.03%)
> Amean    syst-fsmark        4020.94 (  0.00%)     4004.19 (  0.42%)     4025.87 ( -0.12%)
> Amean    real-xfsrepair      507.85 (  0.00%)      459.58 (  9.50%)      447.66 ( 11.85%)
> Amean    syst-xfsrepair      519.88 (  0.00%)      281.63 ( 45.83%)      202.93 ( 60.97%)
> Stddev   real-fsmark           6.55 (  0.00%)        3.97 ( 39.30%)        1.44 ( 77.98%)
> Stddev   syst-fsmark          16.22 (  0.00%)       15.09 (  6.96%)        9.76 ( 39.86%)
> Stddev   real-xfsrepair       11.17 (  0.00%)        3.41 ( 69.43%)        5.57 ( 50.17%)
> Stddev   syst-xfsrepair       13.98 (  0.00%)       19.94 (-42.60%)        5.69 ( 59.31%)
> CoeffVar real-fsmark           0.56 (  0.00%)        0.34 ( 38.74%)        0.12 ( 77.97%)
> CoeffVar syst-fsmark           0.40 (  0.00%)        0.38 (  6.57%)        0.24 ( 39.93%)
> CoeffVar real-xfsrepair        2.20 (  0.00%)        0.74 ( 66.22%)        1.24 ( 43.47%)
> CoeffVar syst-xfsrepair        2.69 (  0.00%)        7.08 (-163.23%)        2.80 ( -4.23%)
> Max      real-fsmark        1171.98 (  0.00%)     1159.25 (  1.09%)     1167.96 (  0.34%)
> Max      syst-fsmark        4033.84 (  0.00%)     4024.53 (  0.23%)     4039.20 ( -0.13%)
> Max      real-xfsrepair      523.40 (  0.00%)      464.40 ( 11.27%)      455.42 ( 12.99%)
> Max      syst-xfsrepair      533.37 (  0.00%)      309.38 ( 42.00%)      207.94 ( 61.01%)
> 
> The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
> is almost cut in half. It's still not as low as 3.19-vanilla but it's
> much closer
> 
>                              4.0.0-rc1   4.0.0-rc1      3.19.0
>                                vanilla  slowscan-v2     vanilla
> NUMA alloc hit               146138883   121929782   104019526
> NUMA alloc miss               13146328    11456356     7806370
> NUMA interleave hit                  0           0           0
> NUMA alloc local             146060848   121865921   103953085
> NUMA base PTE updates        242201535   117237258   216624143
> NUMA huge PMD updates           113270       52121      127782
> NUMA page range updates      300195775   143923210   282048527
> NUMA hint faults             180388025    87299060   147235021
> NUMA hint local faults        72784532    32939258    61866265
> NUMA hint local percent             40          37          42
> NUMA pages migrated           71175262    41395302    23237799
> 
> Note the big differences in faults trapped and pages migrated. 
> 3.19-vanilla still migrated fewer pages but if necessary the 
> threshold at which we start throttling migrations can be lowered.

This too is still worse than what v3.19 had.

So what worries me is that Dave bisected the regression to:

  4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

And clearly your patch #4 just tunes balancing/migration intensity - 
is that a workaround for the real problem/bug?

And the patch Dave bisected to is a relatively simple patch.
Why not simply revert it to see whether that cures much of the 
problem?

Am I missing something fundamental?

Thanks,

	Ingo
Mel Gorman March 7, 2015, 5:37 p.m. UTC | #2
On Sat, Mar 07, 2015 at 05:36:58PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
> > 
> > Across the board the 4.0-rc1 numbers are much slower, and the 
> > degradation is far worse when using the large memory footprint 
> > configs. Perf points straight at the cause - this is from 4.0-rc1 on 
> > the "-o bhash=101073" config:
> > 
> > [...]
> 
> >            4.0.0-rc1   4.0.0-rc1      3.19.0
> >              vanilla  slowscan-v2     vanilla
> > User        53384.29    56093.11    46119.12
> > System        692.14      311.64      306.41
> > Elapsed      1236.87     1328.61     1039.88
> > 
> > Note that the system CPU usage is now similar to 3.19-vanilla.
> 
> Similar, but still worse, and also the elapsed time is still much 
> worse. User time is much higher, although it's the same amount of work 
> done on every kernel, right?
> 

Elapsed time is primarily worse on one benchmark -- numa01, which is an
adverse workload. The user time differences are also dominated by that
benchmark:

                                           4.0.0-rc1             4.0.0-rc1                3.19.0
                                             vanilla         slowscan-v2r7               vanilla
Time User-NUMA01                  32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
Time User-NUMA01_THEADLOCAL       17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
Time User-NUMA02                   2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
Time User-NUMA02_SMT                983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)


> > I also tested with a workload very similar to Dave's. The machine 
> > configuration and storage is completely different so it's not an 
> > equivalent test unfortunately. It's reporting the elapsed time and 
> > CPU time while fsmark is running to create the inodes and when 
> > running xfsrepair afterwards
> > 
> > xfsrepair
> >                                     4.0.0-rc1             4.0.0-rc1                3.19.0
> >                                       vanilla           slowscan-v2               vanilla
> > Min      real-fsmark        1157.41 (  0.00%)     1150.38 (  0.61%)     1164.44 ( -0.61%)
> > Min      syst-fsmark        3998.06 (  0.00%)     3988.42 (  0.24%)     4016.12 ( -0.45%)
> > Min      real-xfsrepair      497.64 (  0.00%)      456.87 (  8.19%)      442.64 ( 11.05%)
> > Min      syst-xfsrepair      500.61 (  0.00%)      263.41 ( 47.38%)      194.97 ( 61.05%)
> > Amean    real-fsmark        1166.63 (  0.00%)     1155.97 (  0.91%)     1166.28 (  0.03%)
> > Amean    syst-fsmark        4020.94 (  0.00%)     4004.19 (  0.42%)     4025.87 ( -0.12%)
> > Amean    real-xfsrepair      507.85 (  0.00%)      459.58 (  9.50%)      447.66 ( 11.85%)
> > Amean    syst-xfsrepair      519.88 (  0.00%)      281.63 ( 45.83%)      202.93 ( 60.97%)
> > Stddev   real-fsmark           6.55 (  0.00%)        3.97 ( 39.30%)        1.44 ( 77.98%)
> > Stddev   syst-fsmark          16.22 (  0.00%)       15.09 (  6.96%)        9.76 ( 39.86%)
> > Stddev   real-xfsrepair       11.17 (  0.00%)        3.41 ( 69.43%)        5.57 ( 50.17%)
> > Stddev   syst-xfsrepair       13.98 (  0.00%)       19.94 (-42.60%)        5.69 ( 59.31%)
> > CoeffVar real-fsmark           0.56 (  0.00%)        0.34 ( 38.74%)        0.12 ( 77.97%)
> > CoeffVar syst-fsmark           0.40 (  0.00%)        0.38 (  6.57%)        0.24 ( 39.93%)
> > CoeffVar real-xfsrepair        2.20 (  0.00%)        0.74 ( 66.22%)        1.24 ( 43.47%)
> > CoeffVar syst-xfsrepair        2.69 (  0.00%)        7.08 (-163.23%)        2.80 ( -4.23%)
> > Max      real-fsmark        1171.98 (  0.00%)     1159.25 (  1.09%)     1167.96 (  0.34%)
> > Max      syst-fsmark        4033.84 (  0.00%)     4024.53 (  0.23%)     4039.20 ( -0.13%)
> > Max      real-xfsrepair      523.40 (  0.00%)      464.40 ( 11.27%)      455.42 ( 12.99%)
> > Max      syst-xfsrepair      533.37 (  0.00%)      309.38 ( 42.00%)      207.94 ( 61.01%)
> > 
> > The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
> > is almost cut in half. It's still not as low as 3.19-vanilla but it's
> > much closer
> > 
> >                              4.0.0-rc1   4.0.0-rc1      3.19.0
> >                                vanilla  slowscan-v2     vanilla
> > NUMA alloc hit               146138883   121929782   104019526
> > NUMA alloc miss               13146328    11456356     7806370
> > NUMA interleave hit                  0           0           0
> > NUMA alloc local             146060848   121865921   103953085
> > NUMA base PTE updates        242201535   117237258   216624143
> > NUMA huge PMD updates           113270       52121      127782
> > NUMA page range updates      300195775   143923210   282048527
> > NUMA hint faults             180388025    87299060   147235021
> > NUMA hint local faults        72784532    32939258    61866265
> > NUMA hint local percent             40          37          42
> > NUMA pages migrated           71175262    41395302    23237799
> > 
> > Note the big differences in faults trapped and pages migrated. 
> > 3.19-vanilla still migrated fewer pages but if necessary the 
> > threshold at which we start throttling migrations can be lowered.
> 
> This too is still worse than what v3.19 had.
> 

Yes.

> So what worries me is that Dave bisected the regression to:
> 
>   4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")
> 
> And clearly your patch #4 just tunes balancing/migration intensity - 
> is that a workaround for the real problem/bug?
> 

The patch makes NUMA hinting faults use standard page table handling routines
and protections to trap the faults. Fundamentally it's safer even though
it appears to cause more traps to be handled. I've been assuming this is
related to the different permissions PTEs get and when they are visible on
all CPUs. This patch is addressing the symptom that more faults are being
handled and that it needs to be less aggressive.
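
As an aside, a rough user-space analogue of "standard protections trapping
the faults" is revoking access with PROT_NONE and catching the resulting
fault. The kernel obviously does not use signals for this -- it flips the
PTE bits directly -- so treat the following only as an illustration of the
trap-and-restore pattern the hinting faults rely on:

   /* Demo only: PROT_NONE makes the next access fault; the handler plays
    * the role of the hinting fault handler and restores access.
    * mprotect() in a signal handler is not formally async-signal-safe,
    * but it is fine for a demonstration on Linux. */
   #include <signal.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>
   #include <sys/mman.h>
   #include <unistd.h>

   static long page_size;

   static void hinting_fault(int sig, siginfo_t *si, void *ctx)
   {
           void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));

           (void)sig; (void)ctx;
           /* A real kernel would record the faulting node and possibly
            * migrate the page here; we just make it accessible again. */
           mprotect(page, page_size, PROT_READ | PROT_WRITE);
   }

   int main(void)
   {
           struct sigaction sa;
           char *p;

           page_size = sysconf(_SC_PAGESIZE);

           memset(&sa, 0, sizeof(sa));
           sigemptyset(&sa.sa_mask);
           sa.sa_flags = SA_SIGINFO;
           sa.sa_sigaction = hinting_fault;
           sigaction(SIGSEGV, &sa, NULL);

           p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           p[0] = 42;

           /* Analogue of the scanner marking the range inaccessible. */
           mprotect(p, page_size, PROT_NONE);

           /* This access traps, the handler restores access, it retries. */
           p[0] = 43;
           printf("value after trapped access: %d\n", p[0]);
           return 0;
   }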

I've gone through that patch and didn't spot anything else it is doing
wrong that is not already handled in this series. Did you spot anything
obviously wrong in that patch that isn't addressed in this series?

> And the patch Dave bisected to is a relatively simple patch.
> Why not simply revert it to see whether that cures much of the 
> problem?
> 

Because it also means reverting all the PROT_NONE handling and going back
to _PAGE_NUMA tricks which I expect would be nacked by Linus.
Linus Torvalds March 7, 2015, 7:12 p.m. UTC | #3
On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> And the patch Dave bisected to is a relatively simple patch.
> Why not simply revert it to see whether that cures much of the
> problem?

So the problem with that is that "pmd_set_numa()" and friends simply
no longer exist. So we can't just revert that one patch, it's the
whole series, and the whole point of the series.

What confuses me is that the only real change that I can see in that
patch is the change to "change_huge_pmd()". Everything else is pretty
much a 100% equivalent transformation, afaik. Of course, I may be
wrong about that, and missing something silly.

And the changes to "change_huge_pmd()" were basically re-done
differently by subsequent patches anyway.

The *only* change I see remaining is that change_huge_pmd() now does

   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
   entry = pmd_modify(entry, newprot);
   set_pmd_at(mm, addr, pmd, entry);

for all changes. It used to do that "pmdp_set_numa()" for the
prot_numa case, which did just

   pmd_t pmd = *pmdp;
   pmd = pmd_mknuma(pmd);
   set_pmd_at(mm, addr, pmdp, pmd);

instead.

I don't like the old pmdp_set_numa() because it can drop dirty bits,
so I think the old code was actively buggy.
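
To spell the race out (an illustrative interleaving, not code from any
patch): the read-modify-write above is not atomic, and the PMD is still
present and writable while it runs, so a hardware update of the dirty bit
can slip in between the read and the write-back:

   /*
    *   CPU 0 (old change_huge_pmd)        CPU 1 (hardware, on a write)
    *
    *   pmd = *pmdp;             // sees D=0
    *                                      write hits the huge page,
    *                                      hardware sets D=1 in *pmdp
    *   pmd = pmd_mknuma(pmd);   // still based on D=0
    *   set_pmd_at(mm, addr, pmdp, pmd);   // writes back D=0:
    *                                      // the dirty bit is lost
    */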

But I do *not* see why the new code would cause more migrations to happen.

There's probably something really stupid I'm missing.

                           Linus
Ingo Molnar March 8, 2015, 9:41 a.m. UTC | #4
* Mel Gorman <mgorman@suse.de> wrote:

> xfsrepair
>                                     4.0.0-rc1             4.0.0-rc1                3.19.0
>                                       vanilla           slowscan-v2               vanilla
> Min      real-fsmark        1157.41 (  0.00%)     1150.38 (  0.61%)     1164.44 ( -0.61%)
> Min      syst-fsmark        3998.06 (  0.00%)     3988.42 (  0.24%)     4016.12 ( -0.45%)
> Min      real-xfsrepair      497.64 (  0.00%)      456.87 (  8.19%)      442.64 ( 11.05%)
> Min      syst-xfsrepair      500.61 (  0.00%)      263.41 ( 47.38%)      194.97 ( 61.05%)
> Amean    real-fsmark        1166.63 (  0.00%)     1155.97 (  0.91%)     1166.28 (  0.03%)
> Amean    syst-fsmark        4020.94 (  0.00%)     4004.19 (  0.42%)     4025.87 ( -0.12%)
> Amean    real-xfsrepair      507.85 (  0.00%)      459.58 (  9.50%)      447.66 ( 11.85%)
> Amean    syst-xfsrepair      519.88 (  0.00%)      281.63 ( 45.83%)      202.93 ( 60.97%)
> Stddev   real-fsmark           6.55 (  0.00%)        3.97 ( 39.30%)        1.44 ( 77.98%)
> Stddev   syst-fsmark          16.22 (  0.00%)       15.09 (  6.96%)        9.76 ( 39.86%)
> Stddev   real-xfsrepair       11.17 (  0.00%)        3.41 ( 69.43%)        5.57 ( 50.17%)
> Stddev   syst-xfsrepair       13.98 (  0.00%)       19.94 (-42.60%)        5.69 ( 59.31%)
> CoeffVar real-fsmark           0.56 (  0.00%)        0.34 ( 38.74%)        0.12 ( 77.97%)
> CoeffVar syst-fsmark           0.40 (  0.00%)        0.38 (  6.57%)        0.24 ( 39.93%)
> CoeffVar real-xfsrepair        2.20 (  0.00%)        0.74 ( 66.22%)        1.24 ( 43.47%)
> CoeffVar syst-xfsrepair        2.69 (  0.00%)        7.08 (-163.23%)        2.80 ( -4.23%)
> Max      real-fsmark        1171.98 (  0.00%)     1159.25 (  1.09%)     1167.96 (  0.34%)
> Max      syst-fsmark        4033.84 (  0.00%)     4024.53 (  0.23%)     4039.20 ( -0.13%)
> Max      real-xfsrepair      523.40 (  0.00%)      464.40 ( 11.27%)      455.42 ( 12.99%)
> Max      syst-xfsrepair      533.37 (  0.00%)      309.38 ( 42.00%)      207.94 ( 61.01%)

Btw., I think it would be nice if these numbers listed v3.19 
performance in the first column, to make it clear at a glance
how much regression we still have?

Thanks,

	Ingo
Ingo Molnar March 8, 2015, 9:54 a.m. UTC | #5
* Mel Gorman <mgorman@suse.de> wrote:

> Elapsed time is primarily worse on one benchmark -- numa01 which is 
> an adverse workload. The user time differences are also dominated by 
> that benchmark
> 
>                                            4.0.0-rc1             4.0.0-rc1                3.19.0
>                                              vanilla         slowscan-v2r7               vanilla
> Time User-NUMA01                  32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
> Time User-NUMA01_THEADLOCAL       17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
> Time User-NUMA02                   2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
> Time User-NUMA02_SMT                983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)

But even for 'numa02', the simplest of the workloads, there appears to 
be something of a regression relative to v3.19, which ought to be beyond 
the noise of the measurement (which would be below 1% I suspect), and 
as such relevant, right?

And the XFS numbers still show significant regression compared to 
v3.19 - and that cannot be ignored as artificial, 'adversarial' 
workload, right?

For example, from your numbers:

xfsrepair
                                    4.0.0-rc1             4.0.0-rc1                3.19.0
                                      vanilla           slowscan-v2               vanilla
...
Amean    real-xfsrepair      507.85 (  0.00%)      459.58 (  9.50%)      447.66 ( 11.85%)
Amean    syst-xfsrepair      519.88 (  0.00%)      281.63 ( 45.83%)      202.93 ( 60.97%)

if I interpret the numbers correctly, it shows that compared to v3.19, 
system time increased by 38% - which is rather significant!

> > So what worries me is that Dave bisected the regression to:
> > 
> >   4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")
> > 
> > And clearly your patch #4 just tunes balancing/migration intensity 
> > - is that a workaround for the real problem/bug?
> 
> The patch makes NUMA hinting faults use standard page table handling 
> routines and protections to trap the faults. Fundamentally it's 
> safer even though it appears to cause more traps to be handled. I've 
> been assuming this is related to the different permissions PTEs get 
> and when they are visible on all CPUs. This patch is addressing the 
> symptom that more faults are being handled and that it needs to be 
> less aggressive.

But the whole cleanup ought to have been close to an identity 
transformation from the CPU's point of view - and your measurements 
seem to confirm Dave's findings.

And your measurement was on bare metal, while Dave's is on a VM, and 
both show a significant slowdown on the xfs tests even with your 
slow-tuning patch applied, so it's unlikely to be a measurement fluke 
or some weird platform property.

> I've gone through that patch and didn't spot anything else it is 
> doing wrong that is not already handled in this series. Did you spot 
> anything obviously wrong in that patch that isn't addressed in this 
> series?

I didn't spot anything wrong, but is that a basis to go forward and 
work around the regression, in a way that doesn't even recover lost 
performance?

> > And the patch Dave bisected to is a relatively simple patch. Why 
> > not simply revert it to see whether that cures much of the 
> > problem?
> 
> Because it also means reverting all the PROT_NONE handling and going 
> back to _PAGE_NUMA tricks which I expect would be nacked by Linus.

Yeah, I realize that (and obviously I support the PROT_NONE direction 
that Peter Zijlstra prototyped with the original sched/numa series), 
but can we leave this much of a regression on the table?

I hate to be such a pain in the neck, but especially the 'down tuning' 
of the scanning intensity will make an apples to apples comparison 
harder!

I'd rather not do the slow-tuning part and leave sucky performance in 
place for now and have an easy method plus the motivation to find and 
fix the real cause of the regression, than to partially hide it this 
way ...

Thanks,

	Ingo
Ingo Molnar March 8, 2015, 10:02 a.m. UTC | #6
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > And the patch Dave bisected to is a relatively simple patch. Why 
> > not simply revert it to see whether that cures much of the 
> > problem?
> 
> So the problem with that is that "pmd_set_numa()" and friends simply 
> no longer exist. So we can't just revert that one patch, it's the 
> whole series, and the whole point of the series.

Yeah.

> What confuses me is that the only real change that I can see in that 
> patch is the change to "change_huge_pmd()". Everything else is 
> pretty much a 100% equivalent transformation, afaik. Of course, I 
> may be wrong about that, and missing something silly.

Well, there's a difference in what we write to the pte:

 #define _PAGE_BIT_NUMA          (_PAGE_BIT_GLOBAL+1)
 #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL

and our expectation was that the two should be equivalent methods from 
the POV of the NUMA balancing code, right?

> And the changes to "change_huge_pmd()" were basically re-done
> differently by subsequent patches anyway.
> 
> The *only* change I see remaining is that change_huge_pmd() now does
> 
>    entry = pmdp_get_and_clear_notify(mm, addr, pmd);
>    entry = pmd_modify(entry, newprot);
>    set_pmd_at(mm, addr, pmd, entry);
> 
> for all changes. It used to do that "pmdp_set_numa()" for the
> prot_numa case, which did just
> 
>    pmd_t pmd = *pmdp;
>    pmd = pmd_mknuma(pmd);
>    set_pmd_at(mm, addr, pmdp, pmd);
> 
> instead.
> 
> I don't like the old pmdp_set_numa() because it can drop dirty bits,
> so I think the old code was actively buggy.

Could we, as a silly testing hack not to be applied, write a 
hack-patch that re-introduces the racy way of setting the NUMA bit, to 
confirm that it is indeed this difference that changes pte visibility 
across CPUs enough to create so many more faults?

Because if the answer is 'yes', then we can safely say: 'we regressed 
performance because correctness [not dropping dirty bits] comes before 
performance'.

If the answer is 'no', then we still have a mystery (and a regression) 
to track down.

As a second hack (not to be applied), could we change:

 #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL

to:

 #define _PAGE_BIT_PROTNONE      (_PAGE_BIT_GLOBAL+1)

to double check that the position of the bit does not matter?

I don't think we've exhausted all avenues of analysis here.

Thanks,

	Ingo
Linus Torvalds March 8, 2015, 6:35 p.m. UTC | #7
On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Well, there's a difference in what we write to the pte:
>
>  #define _PAGE_BIT_NUMA          (_PAGE_BIT_GLOBAL+1)
>  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
>
> and our expectation was that the two should be equivalent methods from
> the POV of the NUMA balancing code, right?

Right.

But yes, we might have screwed something up. In particular, there
might be something that thinks it cares about the global bit, but
doesn't notice that the present bit isn't set, so it considers the
protnone mappings to be global and causes lots more tlb flushes etc.

>> I don't like the old pmdp_set_numa() because it can drop dirty bits,
>> so I think the old code was actively buggy.
>
> Could we, as a silly testing hack not to be applied, write a
> hack-patch that re-introduces the racy way of setting the NUMA bit, to
> confirm that it is indeed this difference that changes pte visibility
> across CPUs enough to create so many more faults?

So one of Mel's patches did that, but I don't know if Dave tested it.

And thinking about it, it *may* be safe for huge-pages, if they always
already have the dirty bit set to begin with. And I don't see how we
could have a clean hugepage (apart from the special case of the
zeropage, which is read-only, so races on the dirty bit aren't an
issue).

So it might actually be that the non-atomic version is safe for
hpages. And we could possibly get rid of the "atomic read-and-clear"
even for the non-numa case.

I'd rather do it for both cases than for just one of them.

But:

> As a second hack (not to be applied), could we change:
>
>  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
>
> to:
>
>  #define _PAGE_BIT_PROTNONE      (_PAGE_BIT_GLOBAL+1)
>
> to double check that the position of the bit does not matter?

Agreed. We should definitely try that.

Dave?

Also, is there some sane way for me to actually see this behavior on a
regular machine with just a single socket? Dave is apparently running
in some fake-numa setup, I'm wondering if this is easy enough to
reproduce that I could see it myself.

                          Linus
Linus Torvalds March 8, 2015, 6:46 p.m. UTC | #8
On Sun, Mar 8, 2015 at 11:35 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>> As a second hack (not to be applied), could we change:
>>
>>  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
>>
>> to:
>>
>>  #define _PAGE_BIT_PROTNONE      (_PAGE_BIT_GLOBAL+1)
>>
>> to double check that the position of the bit does not matter?
>
> Agreed. We should definitely try that.

There's a second reason to do that, actually: the __supported_pte_mask
thing, _and_ the pageattr stuff in __split_large_page() etc play games
with _PAGE_GLOBAL. As does drivers/lguest for some reason.

So looking at this all, there's a lot of room for confusion with _PAGE_GLOBAL.

That kind of confusion would certainly explain the whole "the changes
_look_ like they do the same thing, but don't" - because of silly
semantic conflicts with PROTNONE vs GLOBAL.

                                Linus
Mel Gorman March 8, 2015, 8:40 p.m. UTC | #9
On Sun, Mar 08, 2015 at 11:02:23AM +0100, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > And the patch Dave bisected to is a relatively simple patch. Why 
> > > not simply revert it to see whether that cures much of the 
> > > problem?
> > 
> > So the problem with that is that "pmd_set_numa()" and friends simply 
> > no longer exist. So we can't just revert that one patch, it's the 
> > whole series, and the whole point of the series.
> 
> Yeah.
> 
> > What confuses me is that the only real change that I can see in that 
> > patch is the change to "change_huge_pmd()". Everything else is 
> > pretty much a 100% equivalent transformation, afaik. Of course, I 
> > may be wrong about that, and missing something silly.
> 
> Well, there's a difference in what we write to the pte:
> 
>  #define _PAGE_BIT_NUMA          (_PAGE_BIT_GLOBAL+1)
>  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
> 
> and our expectation was that the two should be equivalent methods from 
> the POV of the NUMA balancing code, right?
> 

Functionally yes but performance-wise no. We are now using the global bit
for NUMA faults at the very least.

> > And the changes to "change_huge_pmd()" were basically re-done
> > differently by subsequent patches anyway.
> > 
> > The *only* change I see remaining is that change_huge_pmd() now does
> > 
> >    entry = pmdp_get_and_clear_notify(mm, addr, pmd);
> >    entry = pmd_modify(entry, newprot);
> >    set_pmd_at(mm, addr, pmd, entry);
> > 
> > for all changes. It used to do that "pmdp_set_numa()" for the
> > prot_numa case, which did just
> > 
> >    pmd_t pmd = *pmdp;
> >    pmd = pmd_mknuma(pmd);
> >    set_pmd_at(mm, addr, pmdp, pmd);
> > 
> > instead.
> > 
> > I don't like the old pmdp_set_numa() because it can drop dirty bits,
> > so I think the old code was actively buggy.
> 
> Could we, as a silly testing hack not to be applied, write a 
> hack-patch that re-introduces the racy way of setting the NUMA bit, to 
> confirm that it is indeed this difference that changes pte visibility 
> across CPUs enough to create so many more faults?
> 

This was already done and tested by Dave but while it helped, it was
not enough.  As the approach was inherently unsafe it was dropped and the
throttling approach taken. However, the fact it made little difference
may indicate that this is somehow related to the global bit being used.

> Because if the answer is 'yes', then we can safely say: 'we regressed 
> performance because correctness [not dropping dirty bits] comes before 
> performance'.
> 
> If the answer is 'no', then we still have a mystery (and a regression) 
> to track down.
> 
> As a second hack (not to be applied), could we change:
> 
>  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
> 
> to:
> 
>  #define _PAGE_BIT_PROTNONE      (_PAGE_BIT_GLOBAL+1)
> 

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
can be done.

> to double check that the position of the bit does not matter?
> 

It's worth checking in case it comes down to how the global bit is
treated. However, note that Dave is currently travelling for LSF/MM in
Boston and there is a chance he cannot test this week at all. I'm just
after landing in the hotel myself. I'll try to find time during one
of the breaks tomorrow but if the wireless is too crap then accessing the
test machine remotely might be an issue.

> I don't think we've exhausted all avenues of analysis here.
> 

True.
Dave Chinner March 9, 2015, 11:29 a.m. UTC | #10
On Sun, Mar 08, 2015 at 11:35:59AM -0700, Linus Torvalds wrote:
> On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar <mingo@kernel.org> wrote:
> But:
> 
> > As a second hack (not to be applied), could we change:
> >
> >  #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
> >
> > to:
> >
> >  #define _PAGE_BIT_PROTNONE      (_PAGE_BIT_GLOBAL+1)
> >
> > to double check that the position of the bit does not matter?
> 
> Agreed. We should definitely try that.
> 
> Dave?

As Mel has already mentioned, I'm in Boston for LSFMM and don't have
access to the test rig I've used to generate this.

> Also, is there some sane way for me to actually see this behavior on a
> regular machine with just a single socket? Dave is apparently running
> in some fake-numa setup, I'm wondering if this is easy enough to
> reproduce that I could see it myself.

Should be - I don't actually use 500TB of storage to generate this -
50GB on an SSD is all you need from the storage side. I just use a
sparse backing file to make it look like a 500TB device. :P

i.e. create an XFS filesystem on a 500TB sparse file with "mkfs.xfs
-d size=500t,file=1 /path/to/file.img", mount it on loopback or as a
virtio,cache=none device for the guest vm and then use fsmark to
generate several million files spread across many, many directories
such as:

$  fs_mark -D 10000 -S0 -n 100000 -s 1 -L 32 -d \
/mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \
/mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \
/mnt/scratch/6 -d /mnt/scratch/7

That should only take a few minutes to run - if you throw 8p at it
then it should run at >100k files/s being created.

Then unmount and run "xfs_repair -o bhash=101073 /path/to/file.img"
on the resultant image file.

Cheers,

Dave.
Linus Torvalds March 9, 2015, 4:52 p.m. UTC | #11
On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner <david@fromorbit.com> wrote:
>
>> Also, is there some sane way for me to actually see this behavior on a
>> regular machine with just a single socket? Dave is apparently running
>> in some fake-numa setup, I'm wondering if this is easy enough to
>> reproduce that I could see it myself.
>
> Should be - I don't actually use 500TB of storage to generate this -
> 50GB on an SSD is all you need from the storage side. I just use a
> sparse backing file to make it look like a 500TB device. :P

What's your virtual environment setup? Kernel config, and
virtualization environment to actually get that odd fake NUMA thing
happening?

                          Linus
Dave Chinner March 9, 2015, 7:19 p.m. UTC | #12
On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner <david@fromorbit.com> wrote:
> >
> >> Also, is there some sane way for me to actually see this behavior on a
> >> regular machine with just a single socket? Dave is apparently running
> >> in some fake-numa setup, I'm wondering if this is easy enough to
> >> reproduce that I could see it myself.
> >
> > Should be - I don't actually use 500TB of storage to generate this -
> > 50GB on an SSD is all you need from the storage side. I just use a
> > sparse backing file to make it look like a 500TB device. :P
> 
> What's your virtual environment setup? Kernel config, and
> virtualization environment to actually get that odd fake NUMA thing
> happening?

I don't have the exact .config with me (test machines at home
are shut down because I'm half a world away), but it's pretty much
this (copied and munged from a similar test vm on my laptop):

$ cat run-vm-4.sh
sudo qemu-system-x86_64 \
        -machine accel=kvm \
        -no-fd-bootchk \
        -localtime \
        -boot c \
        -serial pty \
        -nographic \
        -alt-grab \
        -smp 16 -m 16384 \
        -hda /data/vm-2/root.img \
        -drive file=/vm/vm-4/vm-4-test.img,if=virtio,cache=none \
        -drive file=/vm/vm-4/vm-4-scratch.img,if=virtio,cache=none \
        -drive file=/vm/vm-4/vm-4-500TB.img,if=virtio,cache=none \
        -kernel /vm/vm-4/vmlinuz \
        -append "console=ttyS0,115200 root=/dev/sda1,numa=fake=4"
$

And on the host I have /vm on a ssd that is an XFS filesystem, and
I've created /vm/vm-4/vm-4-500TB.img by doing:

$ xfs_io -f -c "truncate 500t" -c "extsize 1m" /vm/vm-4/vm-4-500TB.img

and in the guest the filesystem is created with:

# mkfs.xfs -f -mcrc=1,finobt=1 /dev/vdc

And that will create a 500TB filesystem that you can then mount and
run fsmark on it, then unmount and run xfs_repair on it.

the .config I have on my laptop is from 3.18-rc something, but it
should work just with a make oldconfig update. It's attached below.

Hopefully this will be sufficient for you, otherwise it'll have to
wait until I get home to get the exact configs for you.

Cheers,

Dave.
Linus Torvalds March 10, 2015, 11:55 p.m. UTC | #13
On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
>>
>> What's your virtual environment setup? Kernel config, and
>> virtualization environment to actually get that odd fake NUMA thing
>> happening?
>
> I don't have the exact .config with me (test machines at home
> are shut down because I'm half a world away), but it's pretty much
> this (copied and munged from a similar test vm on my laptop):

[ snip snip ]

Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
and went back to actually *thinking* about the code instead of trying
to reproduce this and figure things out by trial and error.

And I think I figured it out. Of course, since I didn't actually test
anything, what do I know, but I feel good about it, because I think I
can explain why that patch that on the face of it shouldn't change
anything actually did.

So, the old code just did all those manual page table changes,
clearing the present bit and setting the NUMA bit instead.

The new code _ostensibly_ does the same, except it clears the present
bit and sets the PROTNONE bit instead.

However, rather than playing special games with just those two bits,
it uses the normal pte accessor functions, and in particular uses
vma->vm_page_prot to reset the protections back. Which is a nice
cleanup and really makes the code look saner, and does the same thing.

Except it really isn't the same thing at all.

Why?

The protection bits in the page tables are *not* the same as
vma->vm_page_prot. Yes, they start out that way, but they don't stay
that way. And no, I'm not talking about dirty and accessed bits.

The difference? COW. Any private mapping is marked read-only in
vma->vm_page_prot, and then the COW (or the initial write) makes it
read-write.

And so, when we did

-       pte = pte_mknonnuma(pte);
+       /* Make it present again */
+       pte = pte_modify(pte, vma->vm_page_prot);
+       pte = pte_mkyoung(pte);

that isn't equivalent at all - it makes the page read-only, because it
restores it to its original state.

Now, that isn't actually what hurts most, I suspect. Judging by the
profiles, we don't suddenly take a lot of new COW faults. No, what
hurts most is that the NUMA balancing code does this:

        /*
         * Avoid grouping on DSO/COW pages in specific and RO pages
         * in general, RO pages shouldn't hurt as much anyway since
         * they can be in shared cache state.
         */
        if (!pte_write(pte))
                flags |= TNF_NO_GROUP;

and that "!pte_write(pte)" is basically now *always* true for private
mappings (which is 99% of all mappings).

In other words, I think the patch unintentionally made the NUMA code
basically always do the TNF_NO_GROUP case.

I think that a quick hack for testing might be to just replace that
"!pte_write()" with "!pte_dirty()" and see how that acts.

Comments?

                      Linus
Mel Gorman March 12, 2015, 1:10 p.m. UTC | #14
On Tue, Mar 10, 2015 at 04:55:52PM -0700, Linus Torvalds wrote:
> On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> >>
> >> What's your virtual environment setup? Kernel config, and
> >> virtualization environment to actually get that odd fake NUMA thing
> >> happening?
> >
> > I don't have the exact .config with me (test machines at home
> > are shut down because I'm half a world away), but it's pretty much
> > this (copied and munged from a similar test vm on my laptop):
> 
> [ snip snip ]
> 
> Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
> and went back to actually *thinking* about the code instead of trying
> to reproduce this and figure things out by trial and error.
> 
> And I think I figured it out.
> <SNIP>

I believe you're correct and it matches what was observed. I'm still
travelling and wireless is dirt but managed to queue a test using pmd_dirty

                                              3.19.0             4.0.0-rc1             4.0.0-rc1
                                             vanilla               vanilla        ptewrite-v1r20
Time User-NUMA01                  25695.96 (  0.00%)    32883.59 (-27.97%)    24012.80 (  6.55%)
Time User-NUMA01_THEADLOCAL       17404.36 (  0.00%)    17453.20 ( -0.28%)    17950.54 ( -3.14%)
Time User-NUMA02                   2037.65 (  0.00%)     2063.70 ( -1.28%)     2046.88 ( -0.45%)
Time User-NUMA02_SMT                981.02 (  0.00%)      983.70 ( -0.27%)      983.68 ( -0.27%)
Time System-NUMA01                  194.70 (  0.00%)      602.44 (-209.42%)      158.90 ( 18.39%)
Time System-NUMA01_THEADLOCAL        98.52 (  0.00%)       78.10 ( 20.73%)      107.66 ( -9.28%)
Time System-NUMA02                    9.28 (  0.00%)        6.47 ( 30.28%)        9.25 (  0.32%)
Time System-NUMA02_SMT                3.79 (  0.00%)        5.06 (-33.51%)        3.92 ( -3.43%)
Time Elapsed-NUMA01                 558.84 (  0.00%)      755.96 (-35.27%)      532.41 (  4.73%)
Time Elapsed-NUMA01_THEADLOCAL      382.54 (  0.00%)      382.22 (  0.08%)      390.48 ( -2.08%)
Time Elapsed-NUMA02                  49.83 (  0.00%)       49.38 (  0.90%)       49.79 (  0.08%)
Time Elapsed-NUMA02_SMT              46.59 (  0.00%)       47.70 ( -2.38%)       47.77 ( -2.53%)
Time CPU-NUMA01                    4632.00 (  0.00%)     4429.00 (  4.38%)     4539.00 (  2.01%)
Time CPU-NUMA01_THEADLOCAL         4575.00 (  0.00%)     4586.00 ( -0.24%)     4624.00 ( -1.07%)
Time CPU-NUMA02                    4107.00 (  0.00%)     4191.00 ( -2.05%)     4129.00 ( -0.54%)
Time CPU-NUMA02_SMT                2113.00 (  0.00%)     2072.00 (  1.94%)     2067.00 (  2.18%)

              3.19.0   4.0.0-rc1   4.0.0-rc1
             vanilla     vanilla   ptewrite-v1r20
User        46119.12    53384.29    44994.10
System        306.41      692.14      279.78
Elapsed      1039.88     1236.87     1022.92

There are still some differences but it's much closer to what it was.
The balancing stats are now looking similar to 3.19:

NUMA base PTE updates        222840103   304513172   230724075
NUMA huge PMD updates           434894      594467      450274
NUMA page range updates      445505831   608880276   461264363
NUMA hint faults                601358      733491      626176
NUMA hint local faults          371571      511530      359215
NUMA hint local percent             61          69          57
NUMA pages migrated            7073177    26366701     6829196

XFS repair on the same machine is not fully restored either, but it's a big
enough move in the right direction to indicate this was the relevant
change.

xfsrepair
                                       3.19.0             4.0.0-rc1             4.0.0-rc1
                                      vanilla               vanilla        ptewrite-v1r20
Amean    real-fsmark        1166.28 (  0.00%)     1166.63 ( -0.03%)     1184.97 ( -1.60%)
Amean    syst-fsmark        4025.87 (  0.00%)     4020.94 (  0.12%)     4071.10 ( -1.12%)
Amean    real-xfsrepair      447.66 (  0.00%)      507.85 (-13.45%)      460.94 ( -2.97%)
Amean    syst-xfsrepair      202.93 (  0.00%)      519.88 (-156.19%)      282.45 (-39.19%)
Linus Torvalds March 12, 2015, 4:20 p.m. UTC | #15
On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> I believe you're correct and it matches what was observed. I'm still
> travelling and wireless is dirt but managed to queue a test using pmd_dirty

Ok, thanks.

I'm not entirely happy with that change, and I suspect the whole
heuristic should be looked at much more (maybe it should also look at
whether it's executable, for example), but it's a step in the right
direction.

So I committed it and added a comment, and wrote a commit log about
it. I suspect any further work is post-4.0-release, unless somebody
comes up with something small and simple and obviously better.

                                 Linus
Mel Gorman March 12, 2015, 6:49 p.m. UTC | #16
On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote:
> On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > I believe you're correct and it matches what was observed. I'm still
> > travelling and wireless is dirt but managed to queue a test using pmd_dirty
> 
> Ok, thanks.
> 
> I'm not entirely happy with that change, and I suspect the whole
> heuristic should be looked at much more (maybe it should also look at
> whether it's executable, for example), but it's a step in the right
> direction.
> 

I can follow up when I'm back in work properly. As you have already pulled
this in directly, can you also consider pulling in "mm: thp: return the
correct value for change_huge_pmd" please? The other two patches were very
minor and can be resent through the normal paths later.
Dave Chinner March 17, 2015, 7:06 a.m. UTC | #17
On Thu, Mar 12, 2015 at 06:49:26PM +0000, Mel Gorman wrote:
> On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote:
> > On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > I believe you're correct and it matches what was observed. I'm still
> > > travelling and wireless is dirt but managed to queue a test using pmd_dirty
> > 
> > Ok, thanks.
> > 
> > I'm not entirely happy with that change, and I suspect the whole
> > heuristic should be looked at much more (maybe it should also look at
> > whether it's executable, for example), but it's a step in the right
> > direction.
> > 
> 
> I can follow up when I'm back in work properly. As you have already pulled
> this in directly, can you also consider pulling in "mm: thp: return the
> correct value for change_huge_pmd" please? The other two patches were very
> minor and can be resent through the normal paths later.

To close the loop here, now I'm back home and can run tests:

config                            3.19      4.0-rc1     4.0-rc4
defaults                         8m08s        9m34s       9m14s
-o ag_stride=-1                  4m04s        4m38s       4m11s
-o bhash=101073                  6m04s       17m43s       7m35s
-o ag_stride=-1,bhash=101073     4m54s        9m58s       7m50s

It's better but there are still significant regressions, especially
for the large memory footprint cases. I haven't had a chance to look
at any stats or profiles yet, so I don't know yet whether this is
still page fault related or some other problem....

Cheers,

Dave
Linus Torvalds March 17, 2015, 4:53 p.m. UTC | #18
On Tue, Mar 17, 2015 at 12:06 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> To close the loop here, now I'm back home and can run tests:
>
> config                            3.19      4.0-rc1     4.0-rc4
> defaults                         8m08s        9m34s       9m14s
> -o ag_stride=-1                  4m04s        4m38s       4m11s
> -o bhash=101073                  6m04s       17m43s       7m35s
> -o ag_stride=-1,bhash=101073     4m54s        9m58s       7m50s
>
> It's better but there are still significant regressions, especially
> for the large memory footprint cases. I haven't had a chance to look
> at any stats or profiles yet, so I don't know yet whether this is
> still page fault related or some other problem....

Ok. I'd love to see some data on what changed between 3.19 and rc4 in
the profiles, just to see whether it's "more page faults due to extra
COW", or whether it's due to "more TLB flushes because of the
pte_write() vs pte_dirty()" differences. I'm *guessing* a lot of the
remaining issues are due to extra page fault overhead because I'd
expect write/dirty to be fairly 1:1, but there could be differences
due to shared memory use and/or just writebacks of dirty pages that
become clean.

I guess you can also see in vmstat.mm_migrate_pages whether it's
because of excessive migration (because of bad grouping) or not. So
not just profiles data.

At the same time, I feel fairly happy about the situation - we at
least understand what is going on, and the "3x worse performance" case
is at least gone.  Even if that last case still looks horrible.

So it's still a bad performance regression, but at the same time I
think your test setup (big 500 TB filesystem, but then a fake-numa
thing with just 4GB per node) is specialized and unrealistic enough
that I don't feel it's all that relevant from a *real-world*
standpoint, and so I wouldn't be uncomfortable saying "ok, the page
table handling cleanup caused some issues, but we know about them and
how to fix them longer-term".  So I don't consider this a 4.0
showstopper or a "we need to revert for now" issue.

If it's a case of "we take a lot more page faults because we handle
the NUMA fault and then have a COW fault almost immediately", then the
fix is likely to do the same early-cow that the normal non-numa-fault
case does. In fact, my gut feel is that we should try to unify that
numa/regular fault handling path a bit more, but that would be a pretty
invasive patch.

                     Linus
Dave Chinner March 17, 2015, 8:51 p.m. UTC | #19
On Tue, Mar 17, 2015 at 09:53:57AM -0700, Linus Torvalds wrote:
> On Tue, Mar 17, 2015 at 12:06 AM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > To close the loop here, now I'm back home and can run tests:
> >
> > config                            3.19      4.0-rc1     4.0-rc4
> > defaults                         8m08s        9m34s       9m14s
> > -o ag_stride=-1                  4m04s        4m38s       4m11s
> > -o bhash=101073                  6m04s       17m43s       7m35s
> > -o ag_stride=-1,bhash=101073     4m54s        9m58s       7m50s
> >
> > It's better but there are still significant regressions, especially
> > for the large memory footprint cases. I haven't had a chance to look
> > at any stats or profiles yet, so I don't know yet whether this is
> > still page fault related or some other problem....
> 
> Ok. I'd love to see some data on what changed between 3.19 and rc4 in
> the profiles, just to see whether it's "more page faults due to extra
> COW", or whether it's due to "more TLB flushes because of the
> pte_write() vs pte_dirty()" differences. I'm *guessing* a lot of the
> remaining issues are due to extra page fault overhead because I'd
> expect write/dirty to be fairly 1:1, but there could be differences
> due to shared memory use and/or just writebacks of dirty pages that
> become clean.
> 
> I guess you can also see in vmstat.mm_migrate_pages whether it's
> because of excessive migration (because of bad grouping) or not. So
> not just profiles data.

On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
was using during steady state shows:

     471,752      migrate:mm_migrate_pages ( +-  7.38% )

The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
and 3.19 (~55,000), so that looks like even more of a problem than
before.

And the profile looks like:

-   43.73%     0.05%  [kernel]            [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 99.87% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 99.84% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               + async_page_fault

(grrrr - running perf with call stack profiling for long enough
oom-kills xfs_repair)

And the vmstats are:

3.19:

numa_hit 5163221
numa_miss 121274
numa_foreign 121274
numa_interleave 12116
numa_local 5153127
numa_other 131368
numa_pte_updates 36482466
numa_huge_pte_updates 0
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
pgmigrate_success 1228114
pgmigrate_fail 0

4.0-rc1:

numa_hit 36952043
numa_miss 92471
numa_foreign 92471
numa_interleave 10964
numa_local 36927384
numa_other 117130
numa_pte_updates 84010995
numa_huge_pte_updates 0
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
pgmigrate_success 32916316
pgmigrate_fail 0

4.0-rc4:

numa_hit 23447345
numa_miss 47471
numa_foreign 47471
numa_interleave 10877
numa_local 23438564
numa_other 56252
numa_pte_updates 60901329
numa_huge_pte_updates 0
numa_hint_faults 58777092
numa_hint_faults_local 16478674
numa_pages_migrated 20075156
pgmigrate_success 20075156
pgmigrate_fail 0

Page migrations are still up by a factor of ~20 on 3.19.


> At the same time, I feel fairly happy about the situation - we at
> least understand what is going on, and the "3x worse performance" case
> is at least gone.  Even if that last case still looks horrible.
> 
> So it's still a bad performance regression, but at the same time I
> think your test setup (big 500 TB filesystem, but then a fake-numa
> thing with just 4GB per node) is specialized and unrealistic enough
> that I don't feel it's all that relevant from a *real-world*
> standpoint,

I don't buy it.

The regression is triggered by the search algorithm that
xfs_repair uses, and it's fairly common. It just uses a large hash
table which has at least a 50% miss rate (every I/O misses on the
initial lookup). The page faults are triggered by the pointer chasing
these searches do across all of those misses.
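
Roughly, the access pattern is the classic chained-hash walk; here is an
illustrative sketch only (made-up names, not the actual libxfs_bcache
code), just to show where the pointer chasing comes from:

/* Illustrative sketch only - not the actual libxfs_bcache code, just
 * the access pattern: a big chained hash table where about half the
 * lookups walk a whole chain and then miss. */
struct cache_node {
	struct cache_node	*next;
	unsigned long		key;
	/* ... cached buffer state ... */
};

static struct cache_node *
cache_lookup(struct cache_node **buckets, unsigned int nbuckets,
	     unsigned long key)
{
	struct cache_node *n;

	/* Every step is a dependent pointer load that can land on any
	 * node's memory - that is what feeds the NUMA hint faults. */
	for (n = buckets[key % nbuckets]; n != NULL; n = n->next)
		if (n->key == key)
			return n;

	return NULL;	/* roughly half the lookups end up here */
}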

IOWs the filesystem size under test is irrelevant - the amount of
metadata in the FS determines the xfs_repair memory footprint. The
fs size only determines concurrency, but this particular test case
(ag_stride=-1) turns off the concurrency.  Hence we'll see the same
problem with a 1TB filesystem with 50 million inodes in it, and
there's *lots* of those around.

Also, to address the "fake-numa" setup - the problem is node-local
allocation policy, not the size of the nodes.  The repair threads
wander all over the machine, even when there are only 8 threads
running (ag_stride=-1):

$ ps wauxH |grep  xfs_repair |wc -l
8
$

top - 07:31:08 up 24 min,  2 users,  load average: 1.75, 0.62, 0.69
Tasks: 240 total,   1 running, 239 sleeping,   0 stopped,   0 zombie
%Cpu0  :  3.2 us, 13.4 sy,  0.0 ni, 63.6 id, 19.8 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  1.4 us,  8.5 sy,  0.0 ni, 74.7 id, 15.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  1.0 us,  1.7 sy,  0.0 ni, 83.2 id, 14.1 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.7 sy,  0.0 ni, 98.7 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  4.0 us, 21.7 sy,  0.0 ni, 56.5 id, 17.8 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  1.0 us,  2.3 sy,  0.0 ni, 92.4 id,  4.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  2.0 us, 13.5 sy,  0.0 ni, 80.7 id,  3.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 18.8 us,  2.7 sy,  0.0 ni, 78.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  1.4 us, 10.2 sy,  0.0 ni, 87.4 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  2.4 us, 13.2 sy,  0.0 ni, 84.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  1.0 us,  6.1 sy,  0.0 ni, 88.6 id,  4.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.7 us,  5.4 sy,  0.0 ni, 79.3 id, 14.6 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.3 us,  0.7 sy,  0.0 ni, 93.3 id,  5.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  2.7 us,  9.3 sy,  0.0 ni, 87.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

So, 8 threads doing work, 16 cpu cores, and only one fully idle
processor core in the machine over a 5s sample period.  Hence it
seems to me that process memory is getting sprayed over all nodes
because of the way the scheduler moves processes around, not because
of the small memory size of the numa nodes.

> and so I wouldn't be uncomfortable saying "ok, the page
> table handling cleanup caused some issues, but we know about them and
> how to fix them longer-term".  So I don't consider this a 4.0
> showstopper or a "we need to revert for now" issue.

I don't consider it in need of a revert, either, but I don't want
it swept under the table because of "workload not relevant"
arguments.

Cheers,

Dave.
Linus Torvalds March 17, 2015, 9:30 p.m. UTC | #20
On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
> was using during steady state shows:
>
>      471,752      migrate:mm_migrate_pages ( +-  7.38% )
>
> The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
> and 3.19 (~55,000), so that looks like even more of a problem than
> before.

Hmm. How stable are those numbers boot-to-boot?

That kind of extreme spread makes me suspicious. It's also interesting
that if the numbers really go up even more (and by that big amount),
then why does there seem to be almost no correlation with performance
(which apparently went up since rc1, despite migrate_pages getting
even _worse_).

> And the profile looks like:
>
> -   43.73%     0.05%  [kernel]            [k] native_flush_tlb_others

Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
And flush_tlb_page() does seem to be called about ten times more
(flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
at 0.13%)

Damn. From a performance number standpoint, it looked like we zoomed
in on the right thing. But now it's migrating even more pages than
before. Odd.

> And the vmstats are:
>
> 3.19:
>
> numa_hit 5163221
> numa_local 5153127

> 4.0-rc1:
>
> numa_hit 36952043
> numa_local 36927384
>
> 4.0-rc4:
>
> numa_hit 23447345
> numa_local 23438564
>
> Page migrations are still up by a factor of ~20 on 3.19.

The thing is, those "numa_hit" things come from the zone_statistics()
call in buffered_rmqueue(), which in turn is called straight from the
memory allocator. That has *nothing* to do with virtual memory, and
everything to do with actual physical memory allocations.  So the load
is simply allocating a lot more pages, presumably for those stupid
migration events.
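
From memory, zone_statistics() does roughly the following on every
successful allocation (a sketch, not the exact mainline code), which is
why these counters track physical page allocations rather than faults:

/* Sketch of what zone_statistics() accounts per allocation. */
static void zone_statistics_sketch(struct zone *preferred_zone, struct zone *z)
{
	if (z->node == preferred_zone->node) {
		__inc_zone_state(z, NUMA_HIT);		/* got the node we asked for */
	} else {
		__inc_zone_state(z, NUMA_MISS);
		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
	}

	if (z->node == numa_node_id())
		__inc_zone_state(z, NUMA_LOCAL);	/* allocating CPU's own node */
	else
		__inc_zone_state(z, NUMA_OTHER);
}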

But then it doesn't correlate with performance anyway..

Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
fix up numa read-only thread grouping logic") to 3.19, so that it uses
the same "pte_dirty()" logic as 4.0-rc4. That *should* make the 3.19
and 4.0-rc4 numbers comparable.

It does make me wonder if your load is "chaotic" wrt scheduling. The
load presumably wants to spread out across all cpu's, but then the
numa code tries to group things together for numa accesses, but
depending on just random allocation patterns and layout in the hash
tables, there either are patterns in the page accesses or there aren't.

Which is kind of why I wonder how stable those numbers are boot to
boot. Maybe this is at least partly about lucky allocation patterns.

                              Linus
Dave Chinner March 17, 2015, 10:08 p.m. UTC | #21
On Tue, Mar 17, 2015 at 02:30:57PM -0700, Linus Torvalds wrote:
> On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
> > was using during steady state shows:
> >
> >      471,752      migrate:mm_migrate_pages ( +-  7.38% )
> >
> > The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
> > and 3.19 (~55,000), so that looks like even more of a problem than
> > before.
> 
> Hmm. How stable are those numbers boot-to-boot?

I've run the test several times but only profiled once so far.
Runtimes were 7m45s, 7m50s, 7m44s and 8m2s, and the profiles came from
the 8m2s run.

reboot, run again:

$ sudo perf stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

           572,839      migrate:mm_migrate_pages    ( +-  3.15% )

      10.001664694 seconds time elapsed             ( +-  0.00% )
$

And just to confirm, a minute later, still in phase 3:

	590,974      migrate:mm_migrate_pages       ( +-  2.86% )

Reboot, run again:

	575,344      migrate:mm_migrate_pages       ( +-  0.70% )

So there is boot-to-boot variation, but it doesn't look like it
gets any better....

> That kind of extreme spread makes me suspicious. It's also interesting
> that if the numbers really go up even more (and by that big amount),
> then why does there seem to be almost no correlation with performance
> (which apparently went up since rc1, despite migrate_pages getting
> even _worse_).
> 
> > And the profile looks like:
> >
> > -   43.73%     0.05%  [kernel]            [k] native_flush_tlb_others
> 
> Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
> And flush_tlb_page() does seem to be called about ten times more
> (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
> at 0.13%)
> 
> Damn. From a performance number standpoint, it looked like we zoomed
> in on the right thing. But now it's migrating even more pages than
> before. Odd.

Throttling problem, like Mel originally suspected?

> > And the vmstats are:
> >
> > 3.19:
> >
> > numa_hit 5163221
> > numa_local 5153127
> 
> > 4.0-rc1:
> >
> > numa_hit 36952043
> > numa_local 36927384
> >
> > 4.0-rc4:
> >
> > numa_hit 23447345
> > numa_local 23438564
> >
> > Page migrations are still up by a factor of ~20 on 3.19.
> 
> The thing is, those "numa_hit" things come from the zone_statistics()
> call in buffered_rmqueue(), which in turn is called straight from the
> memory allocator. That has *nothing* to do with virtual memory, and
> everything to do with actual physical memory allocations.  So the load
> is simply allocating a lot more pages, presumably for those stupid
> migration events.
> 
> But then it doesn't correlate with performance anyway..
>
> Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
> fix up numa read-only thread grouping logic") to 3.19, so that it uses
> the same "pte_dirty()" logic as 4.0-rc4. That *should* make the 3.19
> and 4.0-rc4 numbers comparable.

patched 3.19 numbers on this test are slightly worse than stock
3.19, but nowhere near as bad as 4.0-rc4:

	241,718      migrate:mm_migrate_pages		( +-  5.17% )

So that pte_write->pte_dirty change makes this go from ~55k to 240k,
and runtime go from 4m54s to 5m20s. vmstats:

numa_hit 9162476
numa_miss 0
numa_foreign 0
numa_interleave 10685
numa_local 9153740
numa_other 8736
numa_pte_updates 49582103
numa_huge_pte_updates 0
numa_hint_faults 48075098
numa_hint_faults_local 12974704
numa_pages_migrated 5748256
pgmigrate_success 5748256
pgmigrate_fail 0

Cheers,

Dave.
Linus Torvalds March 18, 2015, 4:08 p.m. UTC | #22
On Tue, Mar 17, 2015 at 3:08 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Damn. From a performance number standpoint, it looked like we zoomed
>> in on the right thing. But now it's migrating even more pages than
>> before. Odd.
>
> Throttling problem, like Mel originally suspected?

That doesn't make much sense for the original bisect you did, though.

Although if there are two different issues, maybe that bisect was
wrong. Or rather, incomplete.

>> Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
>> fix up numa read-only thread grouping logic") to 3.19, so that it uses
>> the same "pte_dirty()" logic as 4.0-rc4. That *should* make the 3.19
>> and 4.0-rc4 numbers comparable.
>
> patched 3.19 numbers on this test are slightly worse than stock
> 3.19, but nowhere near as bad as 4.0-rc4:
>
>         241,718      migrate:mm_migrate_pages           ( +-  5.17% )

Ok, that's still much worse than plain 3.19, which was ~55,000.
Assuming your memory/measurements were the same.

So apparently the pte_write() -> pte_dirty() check isn't equivalent at
all. My thinking was that for the common case (ie private mappings) it
would be *exactly* the same, because all normal COW pages turn dirty
at the same time they turn writable (and, in page_mkclean_one(), turn
clean and read-only again at the same time). But if the numbers change
that much, then clearly my simplistic "they are the same in practice"
is just complete BS.

So why am I wrong? Why is testing for dirty not the same as testing
for writable?

I can see a few cases:

 - your load has lots of writable (but not written-to) shared memory,
and maybe the test should be something like

      pte_dirty(pte) ||
          ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))

   and we really should have some helper function for this logic (see
   the sketch after this list).

 - something completely different that I am entirely missing
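
For concreteness, such a helper might look something like this (sketch
only, with a made-up name):

/* Hypothetical helper, sketch only: "could a store through this pte
 * happen without going through a COW fault?" */
static inline bool pte_effectively_writable(pte_t pte, struct vm_area_struct *vma)
{
	if (pte_dirty(pte))
		return true;

	/* Shared writable mapping: stores don't COW even while the pte
	 * is currently clean and read-only. */
	return (vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED);
}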

What am I missing?

                          Linus
Linus Torvalds March 18, 2015, 5:31 p.m. UTC | #23
On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So why am I wrong? Why is testing for dirty not the same as testing
> for writable?
>
> I can see a few cases:
>
>  - your load has lots of writable (but not written-to) shared memory

Hmm. I tried to look at the xfsprog sources, and I don't see any
MAP_SHARED activity.  It looks like it's just using pread64/pwrite64,
and the only MAP_SHARED is for the xfsio mmap test thing, not for
xfsrepair.

So I don't see any shared mappings, but I don't know the code-base.

>  - something completely different that I am entirely missing

So I think there's something I'm missing. For non-shared mappings, I
still have the idea that pte_dirty should be the same as pte_write.
And yet, your testing of 3.19 shows that it's a big difference.
There's clearly something I'm completely missing.

                          Linus
Dave Chinner March 18, 2015, 10:23 p.m. UTC | #24
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So why am I wrong? Why is testing for dirty not the same as testing
> > for writable?
> >
> > I can see a few cases:
> >
> >  - your load has lots of writable (but not written-to) shared memory
> 
> Hmm. I tried to look at the xfsprog sources, and I don't see any
> MAP_SHARED activity.  It looks like it's just using pread64/pwrite64,
> and the only MAP_SHARED is for the xfsio mmap test thing, not for
> xfsrepair.
> 
> So I don't see any shared mappings, but I don't know the code-base.

Right - all the mmap activity in the xfs_repair test is coming from
memory allocation through glibc - we don't use mmap() directly
anywhere in xfs_repair. FWIW, all the IO into these pages that are
allocated is being done via direct IO, if that makes any
difference...

> >  - something completely different that I am entirely missing
> 
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.

This level of pte interactions is beyond my level of knowledge, so
I'm afraid at this point I'm not going to be much help other than to
test patches and report the result.

FWIW, here's the distribution of the hash table we are iterating
over. There are a lot of search misses, which means we are doing a
lot of pointer chasing, but the distribution is centred directly
around the goal of 8 entries per chain and there is no long tail:

libxfs_bcache: 0x67e110
Max supported entries = 808584
Max utilized entries = 808584
Active entries = 808583
Hash table size = 101073
Hits = 9789987
Misses = 8224234
Hit ratio = 54.35
MRU 0 entries =   4667 (  0%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      4 (  0%)
MRU 3 entries = 797447 ( 98%)
MRU 4 entries =    653 (  0%)
MRU 5 entries =      0 (  0%)
MRU 6 entries =   2755 (  0%)
MRU 7 entries =   1518 (  0%)
MRU 8 entries =   1518 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =     21 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Hash buckets with   0 entries     30 (  0%)
Hash buckets with   1 entries    241 (  0%)
Hash buckets with   2 entries   1019 (  0%)
Hash buckets with   3 entries   2787 (  1%)
Hash buckets with   4 entries   5838 (  2%)
Hash buckets with   5 entries   9144 (  5%)
Hash buckets with   6 entries  12165 (  9%)
Hash buckets with   7 entries  14194 ( 12%)
Hash buckets with   8 entries  14387 ( 14%)
Hash buckets with   9 entries  12742 ( 14%)
Hash buckets with  10 entries  10253 ( 12%)
Hash buckets with  11 entries   7308 (  9%)
Hash buckets with  12 entries   4872 (  7%)
Hash buckets with  13 entries   2869 (  4%)
Hash buckets with  14 entries   1578 (  2%)
Hash buckets with  15 entries    894 (  1%)
Hash buckets with  16 entries    430 (  0%)
Hash buckets with  17 entries    188 (  0%)
Hash buckets with  18 entries     88 (  0%)
Hash buckets with  19 entries     24 (  0%)
Hash buckets with  20 entries     11 (  0%)
Hash buckets with  21 entries     10 (  0%)
Hash buckets with  22 entries      1 (  0%)


Cheers,

Dave.
Linus Torvalds March 19, 2015, 9:41 p.m. UTC | #25
On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.

Ahh. The normal page table scanning and page fault handling both clear
and set the dirty bit together with the writable one. But "fork()"
will clear the writable bit without clearing dirty. For some reason I
thought it moved the dirty bit into the struct page like the VM
scanning does, but that was just me having a brainfart. So yeah,
pte_dirty doesn't have to match pte_write even under perfectly normal
circumstances. Maybe there are other cases.
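
Paraphrasing the fork-time pte copy (a sketch, not an exact quote of
copy_one_pte()):

	/* For a COW mapping, fork() write-protects the pte in both the
	 * parent and the child, but leaves the dirty bit alone - so a
	 * pte can end up dirty but not writable. */
	if (is_cow_mapping(vm_flags)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);	/* note: no pte_mkclean() */
	}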

Not that I see a lot of forking in the xfs repair case either, so..

Dave, mind re-running the plain 3.19 numbers to really verify that the
pte_dirty/pte_write change really made that big of a difference. Maybe
your recollection of ~55,000 migrate_pages events was faulty. If the
pte_write -> pte_dirty change is the *only* difference, it's still very
odd how that one difference would make the migration rate go from ~55k to
471k. That's an order of magnitude difference, for what really
shouldn't be a big change.

I'm running a kernel right now with a hacky "update_mmu_cache()" that
warns if pte_dirty is ever different from pte_write().

+void update_mmu_cache(struct vm_area_struct *vma,
+               unsigned long addr, pte_t *ptep)
+{
+       if (!(vma->vm_flags & VM_SHARED)) {
+               pte_t now = READ_ONCE(*ptep);
+               if (!pte_write(now) != !pte_dirty(now)) {
+                       static int count = 20;
+                       static unsigned int prev = 0;
+                       unsigned int val = pte_val(now) & 0xfff;
+                       if (prev != val && count) {
+                               prev = val;
+                               count--;
+                               WARN(1, "pte value %x", val);
+                       }
+               }
+       }
+}

I haven't seen a single warning so far (and there I wrote all that
code to limit repeated warnings), although admittedly
update_mmu_cache() isn't called for all cases where we change a pte
(not for the fork case, for example). But it *is* called for the page
faulting cases.

Maybe a system update has changed libraries and memory allocation
patterns, and there is something bigger than that one-liner
pte_dirty/write change going on?

                             Linus
Dave Chinner March 19, 2015, 10:41 p.m. UTC | #26
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I think there's something I'm missing. For non-shared mappings, I
> > still have the idea that pte_dirty should be the same as pte_write.
> > And yet, your testing of 3.19 shows that it's a big difference.
> > There's clearly something I'm completely missing.
> 
> Ahh. The normal page table scanning and page fault handling both clear
> and set the dirty bit together with the writable one. But "fork()"
> will clear the writable bit without clearing dirty. For some reason I
> thought it moved the dirty bit into the struct page like the VM
> scanning does, but that was just me having a brainfart. So yeah,
> pte_dirty doesn't have to match pte_write even under perfectly normal
> circumstances. Maybe there are other cases.
> 
> Not that I see a lot of forking in the xfs repair case either, so..
> 
> Dave, mind re-running the plain 3.19 numbers to really verify that the
> pte_dirty/pte_write change really made that big of a difference. Maybe
> your recollection of ~55,000 migrate_pages events was faulty. If the
> pte_write -> pte_dirty change is the *only* difference, it's still very
> odd how that one difference would make the migration rate go from ~55k to
> 471k. That's an order of magnitude difference, for what really
> shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email.
That said, the original measurement might have been faulty. I ran
the numbers again on the 3.19 kernel I saved away from the original
testing. That came up at 235k, which is pretty much the same as
yesterday's test. The runtime, however, is unchanged from my original
measurements of 4m54s (pte_hack came in at 5m20s).

Wondering where the 55k number came from, I played around with when
I started the measurement - all the numbers since I did the bisect
have come from starting it at roughly 130AGs into phase 3 where the
memory footprint stabilises and the tlb flush overhead kicks in.

However, if I start the measurement at the same time as the repair
test, I get something much closer to the 55k number. I also note
that my original 4.0-rc1 numbers were much lower than the more
recent steady state measurements (360k vs 470k), so I'd say the
original numbers weren't representative of the steady state
behaviour and so can be ignored...

> Maybe a system update has changed libraries and memory allocation
> patterns, and there is something bigger than that one-liner
> pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but
otherwise the system libraries are unchanged.

Cheers,

Dave.
Linus Torvalds March 19, 2015, 11:05 p.m. UTC | #27
On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> My recollection wasn't faulty - I pulled it from an earlier email.
> That said, the original measurement might have been faulty. I ran
> the numbers again on the 3.19 kernel I saved away from the original
> testing. That came up at 235k, which is pretty much the same as
> yesterday's test. The runtime, however, is unchanged from my original
> measurements of 4m54s (pte_hack came in at 5m20s).

Ok. Good. So the "more than an order of magnitude difference" was
really about measurement differences, not quite as real. Looks like
more a "factor of two" than a factor of 20.

Did you do the profiles the same way? Because that would explain the
differences in the TLB flush percentages too (the "1.4% from
tlb_invalidate_range()" vs "pretty much everything from migration").

The runtime variation does show that there's some *big* subtle
difference for the numa balancing in the exact TNF_NO_GROUP details.
It must be *very* unstable for it to make that big of a difference.
But I feel at least a *bit* better about "unstable algorithm changes a
small variation into a factor-of-two" vs that crazy factor-of-20.

Can you try Mel's change to make it use

        if (!(vma->vm_flags & VM_WRITE))

instead of the pte details? Again, on otherwise plain 3.19, just so
that we have a baseline. I'd be *so* much happier with checking the vma
details over per-pte details, especially ones that change over the
lifetime of the pte entry, and the NUMA code explicitly mucks with.
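
For reference, the variants under discussion correspond roughly to this
check in do_numa_page() (a sketch for comparison, not exact diffs):

	/* plain 3.19 */
	if (!pte_write(pte))
		flags |= TNF_NO_GROUP;

	/* 53da3bc2ba9e, as in 4.0-rc4 */
	if (!pte_dirty(pte))
		flags |= TNF_NO_GROUP;

	/* the suggested vma-based test */
	if (!(vma->vm_flags & VM_WRITE))
		flags |= TNF_NO_GROUP;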

                           Linus
Dave Chinner March 19, 2015, 11:23 p.m. UTC | #28
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my original
> > measurements of 4m54s (pte_hack came in at 5m20s).
> 
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a "factor of two" than a factor of 20.
> 
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the "1.4% from
> tlb_invalidate_range()" vs "pretty much everything from migration").

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of
memory it allocates). It's not until the cache reaches "full" and it
starts recycling old buffers rather than allocating new ones that
the tlb flush problem dominates the profiles.

> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.
> It must be *very* unstable for it to make that big of a difference.
> But I feel at least a *bit* better about "unstable algorithm changes a
> small variation into a factor-of-two" vs that crazy factor-of-20.
> 
> Can you try Mel's change to make it use
> 
>         if (!(vma->vm_flags & VM_WRITE))
> 
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the vma
> details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. Might take an hour or two before I get to it, though...

Cheers,

Dave.
Dave Chinner March 20, 2015, 12:23 a.m. UTC | #29
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> Can you try Mel's change to make it use
> 
>         if (!(vma->vm_flags & VM_WRITE))
> 
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the vma
> details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

    266,750      migrate:mm_migrate_pages ( +-  7.43% )

	  10.002032292 seconds time elapsed ( +-  0.00% )

Bit more variance there than the pte checking, but runtime
difference is in the noise - 5m4s vs 4m54s - and profiles are
identical to the pte checking version.

Cheers,

Dave.
Linus Torvalds March 20, 2015, 1:29 a.m. UTC | #30
On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Bit more variance there than the pte checking, but runtime
> difference is in the noise - 5m4s vs 4m54s - and profiles are
> identical to the pte checking version.

Ahh, so that "!(vma->vm_flags & VM_WRITE)" test works _almost_ as well
as the original !pte_write() test.

Now, can you check that on top of rc4? If I've gotten everything
right, we now have:

 - plain 3.19 (pte_write): 4m54s
 - 3.19 with vm_flags & VM_WRITE: 5m4s
 - 3.19 with pte_dirty: 5m20s

so the pte_dirty version seems to have been a bad choice indeed.

For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
_much_ worse, but I'm wondering whether that VM_WRITE test will at
least shrink the difference like it does for 3.19.

And the VM_WRITE test should be stable and not have any subtle
interaction with the other changes that the numa pte things
introduced. It would be good to see if the profiles then pop something
*else* up as the performance difference (which I'm sure will remain,
since the 7m50s was so far off).

                        Linus
Dave Chinner March 20, 2015, 4:13 a.m. UTC | #31
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Bit more variance there than the pte checking, but runtime
> > difference is in the noise - 5m4s vs 4m54s - and profiles are
> > identical to the pte checking version.
> 
> Ahh, so that "!(vma->vm_flags & VM_WRITE)" test works _almost_ as well
> as the original !pte_write() test.
> 
> Now, can you check that on top of rc4? If I've gotten everything
> right, we now have:
> 
>  - plain 3.19 (pte_write): 4m54s
>  - 3.19 with vm_flags & VM_WRITE: 5m4s
>  - 3.19 with pte_dirty: 5m20s

*nod*

> so the pte_dirty version seems to have been a bad choice indeed.
> 
> For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
> _much_ worse, but I'm wondering whether that VM_WRITE test will at
> least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
7m36s. IOWs, it's a bit better, but not significantly. Page migrations
are pretty much unchanged, too:

	   558,632      migrate:mm_migrate_pages ( +-  6.38% )

> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU
time is still being spent sending IPIs on the tlb flush path.

Cheers,

Dave.
Mel Gorman March 20, 2015, 9:56 a.m. UTC | #32
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my original
> > measurements of 4m54s (pte_hack came in at 5m20s).
> 
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a "factor of two" than a factor of 20.
> 
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the "1.4% from
> tlb_invalidate_range()" vs "pretty much everything from migration").
> 
> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.

TNF_NO_GROUP affects whether the scheduler tries to group related processes
together. Whether migration occurs depends on what node a process is
scheduled on. If processes are grouped aggressively but inappropriately then
it is possible that a bug causes the load balancer to move a process off a
node (one possible migration) with NUMA balancing then trying to pull it
back (another possible migration). Small bugs there can result in excessive
migration.
Mel Gorman March 20, 2015, 10:12 a.m. UTC | #33
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).
> 

As a side-note, I did test a patch that checked pte_write and preserved
it across both faults and setting the protections. It did not alter
migration activity much but there was a drop in minor faults - a 20% drop in
autonumabench and a 58% drop in the xfsrepair workload. I'm assuming this is
due to avoiding the refaults needed to mark pages writable.  The patch looks
and is hacky so I won't post it to save people bleaching their eyes. I'll
spend some time soon (hopefully today) looking at a smooth way of falling
through to WP checks after trapping a NUMA fault.
Linus Torvalds March 20, 2015, 5:02 p.m. UTC | #34
On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> 7m36s. IOWs, it's a bit better, but not significantly. Page migrations
> are pretty much unchanged, too:
>
>            558,632      migrate:mm_migrate_pages ( +-  6.38% )

Ok. That was kind of the expected thing.

I don't really know the NUMA fault rate limiting code, but one thing
that strikes me is that if it tries to balance the NUMA faults against
the *regular* faults, then maybe just the fact that we end up taking
more COW faults after a NUMA fault means that the NUMA rate
limiting code now gets over-eager (because it sees all those extra
non-numa faults).

Mel, does that sound at all possible? I really have never looked at
the magic automatic rate handling..

                         Linus
Mel Gorman March 23, 2015, 12:01 p.m. UTC | #35
On Fri, Mar 20, 2015 at 10:02:23AM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> > 7m36s. IOWs, it's a bit better, but not significantly. Page migrations
> > are pretty much unchanged, too:
> >
> >            558,632      migrate:mm_migrate_pages ( +-  6.38% )
> 
> Ok. That was kind of the expected thing.
> 
> I don't really know the NUMA fault rate limiting code, but one thing
> that strikes me is that if it tries to balance the NUMA faults against
> the *regular* faults, then maybe just the fact that we end up taking
> more COW faults after a NUMA fault means that the NUMA rate
> limiting code now gets over-eager (because it sees all those extra
> non-numa faults).
> 
> Mel, does that sound at all possible? I really have never looked at
> the magic automatic rate handling..
> 

It should not be trying to balance against regular faults as it has no
information on them. The trapping of additional faults to mark the PTE
writable will alter timing, so it indirectly affects how many migration
faults there are, but this is only a side-effect IMO.

There is more overhead now due to losing the writable information and
that should be reduced so I tried a few approaches.  Ultimately, the one
that performed the best and was easiest to understand simply preserved
the writable bit across the protection update and page fault. I'll post
it later when I stick a changelog on it.
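
For the record, the rough shape of that approach is something like the
following (a sketch of the idea only, not the actual patch; "was_writable"
is just the remembered state passed down to the fault handler):

	/* In change_pte_range(), when prot_numa clears the protections,
	 * remember whether the pte was writable and keep the write bit so
	 * the hinting fault is not followed by a second fault just to
	 * restore it. */
	bool preserve_write = prot_numa && pte_write(oldpte);

	ptent = ptep_modify_prot_start(mm, addr, pte);
	ptent = pte_modify(ptent, newprot);
	if (preserve_write)
		ptent = pte_mkwrite(ptent);
	ptep_modify_prot_commit(mm, addr, pte, ptent);

	/* And in do_numa_page(), when the PROT_NONE protection is undone: */
	pte = pte_modify(pte, vma->vm_page_prot);
	if (was_writable)
		pte = pte_mkwrite(pte);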
diff mbox

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..a419b65770d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1625,11 +1625,11 @@  struct task_struct {
 
 	/*
 	 * numa_faults_locality tracks if faults recorded during the last
-	 * scan window were remote/local. The task scan period is adapted
-	 * based on the locality of the faults with different weights
-	 * depending on whether they were shared or private faults
+	 * scan window were remote/local or failed to migrate. The task scan
+	 * period is adapted based on the locality of the faults with different
+	 * weights depending on whether they were shared or private faults
 	 */
-	unsigned long numa_faults_locality[2];
+	unsigned long numa_faults_locality[3];
 
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1719,6 +1719,7 @@  struct task_struct {
 #define TNF_NO_GROUP	0x02
 #define TNF_SHARED	0x04
 #define TNF_FAULT_LOCAL	0x08
+#define TNF_MIGRATE_FAIL 0x10
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..bcfe32088b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,9 +1609,11 @@  static void update_task_scan_period(struct task_struct *p,
 	/*
 	 * If there were no record hinting faults then either the task is
 	 * completely idle or all activity is areas that are not of interest
-	 * to automatic numa balancing. Scan slower
+	 * to automatic numa balancing. Related to that, if there were failed
+	 * migration then it implies we are migrating too quickly or the local
+	 * node is overloaded. In either case, scan slower
 	 */
-	if (local + shared == 0) {
+	if (local + shared == 0 || p->numa_faults_locality[2]) {
 		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period << 1);
 
@@ -2080,6 +2082,8 @@  void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 
 	if (migrated)
 		p->numa_pages_migrated += pages;
+	if (flags & TNF_MIGRATE_FAIL)
+		p->numa_faults_locality[2] += pages;
 
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae13ad31e113..f508fda07d34 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1353,7 +1353,8 @@  int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 	goto out;
 clear_pmdnuma:
diff --git a/mm/memory.c b/mm/memory.c
index 8068893697bb..187daf695f88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3097,7 +3097,8 @@  static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 out:
 	if (page_nid != -1)