[RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead

Message ID 20150210165414.GF2395@suse.de
State New

Commit Message

Mel Gorman Feb. 10, 2015, 4:54 p.m. UTC
On Tue, Feb 10, 2015 at 10:37:52AM -0500, Carlos O'Donell wrote:
> On 02/09/2015 05:49 PM, Mel Gorman wrote:
> > I would also welcome suggestions on how madvise could be throttled without
> > the use of counters. The counters are heap-local where I do not expect
> > there will be cache conflicts and the allocation-side counter is only
> > updated after a recent heap shrink to minimise updates.
> > 
> > Initially I worked around this in the kernel but any solution there
> > breaks the existing semantics of MADV_DONTNEED and was rejected. See
> > last paragraph of https://lkml.org/lkml/2015/2/2/696 .
> 
> The truth is that glibc doesn't want to use MADV_DONTNEED in malloc,
> but it's the only interface we have right now that has similar
> semantics to what we need.
> 
> Similarly Kostya from google told me that ASAN also has this problem since 
> it has tons of statistics pages that it will touch soon, but doesn't care
> if they come back as zero or the original data.
> 
> Two years ago I spoke with Motohiro-san and we talked about
> MADV_"Want but don't need", but no mainline solution was present at the
> time.
> 
> The immediate work that comes to mind is the `vrange` work by Minchan Kim.
> http://lwn.net/Articles/542504/
> 

Yup, I was around for most of the volatile ranges discussions although I
tended to stay away from that particular one.

> I agree with Rik's comment in the above link that we really want MADV_FREE
> in these cases, and it gives a 200% speedup over MADV_DONTNEED (as reported
> by testers using jemalloc patches).
> 

It's true that MADV_FREE will be faster than MADV_DONTNEED in this case
as the pages do not need to be reallocated. That does not make it a
cheap operation and using MADV_FREE thousands of times per second in the
worst-case is still going to hurt. MADV_FREE still incurs a page table
protection update, TLB flush and if it's reused then it's another minor
fault. In the absence of memory pressure, the main gain is that you do
not have to refault, realloc and zero the pages.
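
To make the cost trade-off concrete, here is a sketch (not glibc's
actual code) of how an allocator might prefer MADV_FREE where it is
available; it assumes a kernel whose headers define MADV_FREE, which
at the time of writing means mmotm:

  #include <stddef.h>
  #include <sys/mman.h>

  /* Sketch only.  MADV_FREE lets the kernel reclaim the pages lazily
     and preserves their contents until it does; MADV_DONTNEED drops
     them immediately, so any reuse refaults zeroed pages.  */
  static void
  release_pages (void *addr, size_t len)
  {
  #ifdef MADV_FREE
    if (madvise (addr, len, MADV_FREE) == 0)
      return;   /* EINVAL here means the running kernel is too old.  */
  #endif
    (void) madvise (addr, len, MADV_DONTNEED);
  }

Either way the page table update and TLB flush happen, which is why
throttling the calls still matters.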

> Thus, instead of adding more complex heuristics to glibc's malloc I think

Is it really that complex? It's a basic heuristic comparing two counters
that affects the alloc and free slow paths. The checks are negligible in
comparison to the necessary work if glibc calls into the kernel.
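
In rough terms the check behaves like the sketch below (the struct and
field names are invented for illustration; they are not the fields
from the patch):

  /* Heap-local counters: shrink/madvise events on the free path, and
     allocations that had to grow the heap again soon after a shrink.  */
  struct heap_stats
  {
    unsigned long shrinks;
    unsigned long regrows;
  };

  /* Skip the madvise() call when most recent shrinks were promptly
     undone, i.e. the released pages would only be refaulted.  */
  static int
  madvise_is_worthwhile (const struct heap_stats *s)
  {
    return s->regrows * 2 <= s->shrinks;
  }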

> we should be testing the addition of MADV_FREE in glibc's malloc and then
> supporting or adjusting the proposal for MADV_FREE or vrange dependent on
> the outcome.
> 

They are orthogonal to each other and both can exist side by side. MADV_FREE
is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far
cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will
be merged in the current merge window. The patches are sitting in Andrew
Morton's mmotm tree but he has not sent a merge request yet.

> In the meantime we can talk about mitigating the problems in glibc's
> allocator for systems with old kernels, but it should not be the primary
> solution. In glibc we would conditionalize the changes against the first
> kernel version that included MADV_FREE, and when the minimum supported
> kernel version is higher than that we would remove the code in question.
> 
> My suggested next steps are:
> 
> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
>     MADV_FREE, see how that performs, and inform upstream kernel.
> 

The upstream kernel developers in this case won't care what the performance
is (the changelog is already set), although Minchan would care if there were any
bug reports associated with its usage. MADV_FREE is likely to be supported
either way and it's up to the glibc folk whether they want to support it.

> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
>     measure. Before we go any further here, please increase M_TRIM_THRESHOLD
>     in ebizzy and see if that makes a difference? It should make a difference
>     by increasing the threshold at which we trim back to the OS, both sbrk,
>     and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
>     memory pressure. Changing the default though is not a trivial thing, since
>     it could lead to immediate OOM for existing applications that run close to
>     the limit of RAM. Discussion and analysis will be required.
> 

Altering trim threshold does not appear to help but that's not really
surprising considering that it's only applied to the main arena and not the
per-thread heaps. At least that is how I'm interpreting this check:

      if (av == &main_arena) {
#ifndef MORECORE_CANNOT_TRIM
        if ((unsigned long)(chunksize(av->top)) >=
            (unsigned long)(mp_.trim_threshold))
          systrim(mp_.top_pad, av);
#endif
      } else {
        /* Always try heap_trim(), even if the top chunk is not
           large, because the corresponding heap might go away.  */
        heap_info *heap = heap_for_ptr(top(av));

        assert(heap->ar_ptr == av);
        heap_trim(heap, mp_.top_pad);
      }

In the case of the per-thread heaps, shrinking decisions are made based on
the page size, due to this check:

  extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
  if (extra < (long) pagesz)
    return 0;

  /* Try to shrink. */
  if (shrink_heap (heap, extra) != 0)
    return 0;
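
As an aside, for anyone tuning around this today: M_TRIM_THRESHOLD can
be raised with mallopt() or, without recompiling, via the documented
MALLOC_TRIM_THRESHOLD_ environment variable. A minimal sketch, with an
arbitrary 64MB threshold for illustration:

  #include <malloc.h>

  int
  main (void)
  {
    /* Only trim back to the kernel once at least 64MB sits unused at
       the top of the heap; see mallopt(3).  */
    mallopt (M_TRIM_THRESHOLD, 64 * 1024 * 1024);
    /* ... application workload ... */
    return 0;
  }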


You may have noticed that part of the patch takes mp_.mmap_threshold into
account in the shrinking decision but this was the wrong choice. I should
have at least used the trim threshold there.  I had considered just using
trim threshold but worried that the glibc developers would not like it
as it required manual tuning by the user.

If trim threshold was to be used as a workaround then I think we'd need at
least this patch. Anyone using the ebizzy benchmark without knowing this will
still report a regression between distros with newer glibcs but at least
a google search might find this thread.

Comments

Carlos O'Donell Feb. 10, 2015, 7:01 p.m. UTC | #1
On 02/10/2015 11:54 AM, Mel Gorman wrote:
> It's true that MADV_FREE will be faster than MADV_DONTNEED in this case
> as the pages do not need to be reallocated. That does not make it a
> cheap operation and using MADV_FREE thousands of times per second in the
> worst-case is still going to hurt. MADV_FREE still incurs a page table
> protection update, TLB flush and if it's reused then it's another minor
> fault. In the absence of memory pressure, the main gain is that you do
> not have to refault, realloc and zero the pages.

I agree.

>> Thus, instead of adding more complex heuristics to glibc's malloc I think
> 
> Is it really that complex? It's a basic heuristic comparing two counters
> that affects the alloc and free slow paths. The checks are negligible in
> comparison to the necessary work if glibc calls into the kernel.

It is complex. It's a time-based integral value that is going to 
complicate debugging, reproducers, and users will need to be educated
as to how it works. We will likely also want a tunable for it such that
if it negatively impacts workloads we can disable it. The tunable, though,
adds runtime overhead, and that's all added complexity.

Please keep in mind that glibc is a general purpose allocator which aims
to be good at the average case. Unfortunately, to be honest, we lack any data
with which to measure "average" workloads, which is why glibc as a community
is looking at whole system benchmarks to gather workload data and integrate
it directly into our microbenchmark (see benchtests/) or a future whole system
benchmark framework.

So while the technical complexity of the fix is low, that's not the only
factor I measure when looking at a fix.

A high quality contribution for changes to malloc should IMO include a
microbenchmark or two to flesh out the matching workload against which the
change was tested (see benchtests/README).

>> we should be testing the addition of MADV_FREE in glibc's malloc and then
>> supporting or adjusting the proposal for MADV_FREE or vrange dependent on
>> the outcome.
>>
> 
> They are orthogonal to each other and both can exist side by side. MADV_FREE
> is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far
> cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will
> be merged in the current merge window. The patches are sitting in Andrew
> Morton's mmotm tree but he has not sent a merge request yet.

They are orthogonal in a technical sense, but they solve the same problem:
Make glibc's malloc faster for workload X.

My opinion is that MADV_FREE is a technically superior solution to rate
limiting madvise() calls from malloc. Rate limiting or dynamic heuristics
have been difficult to tune and support over the last decade, and I am
very hesitant to accept more without serious discussion, justification,
and data.

>> In the meantime we can talk about mitigating the problems in glibc's
>> allocator for systems with old kernels, but it should not be the primary
>> solution. In glibc we would conditionalize the changes against the first
>> kernel version that included MADV_FREE, and when the minimum supported
>> kernel version is higher than that we would remove the code in question.
>>
>> My suggested next steps are:
>>
>> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
>>     MADV_FREE, see how that performs, and inform upstream kernel.
>>
> 
> The upstream kernel developers in this case won't care what the performance
> is (the changelog is already set), although Minchan would care if there were any
> bug reports associated with its usage. MADV_FREE is likely to be supported
> either way and it's up to the glibc folk whether they want to support it.

We do want to support it. I also believe that upstream does care.

>> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
>>     measure. Before we go any further here, please increase M_TRIM_THRESHOLD
>>     in ebizzy and see if that makes a difference? It should make a difference
>>     by increasing the threshold at which we trim back to the OS, both sbrk,
>>     and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
>>     memory pressure. Changing the default though is not a trivial thing, since
>>     it could lead to immediate OOM for existing applications that run close to
>>     the limit of RAM. Discussion and analysis will be required.
>>
> 
> Altering trim threshold does not appear to help but that's not really
> surprising considering that it's only applied to the main arena and not the
> per-thread heaps. At least that is how I'm interpreting this check

I agree. It looks like a bug to me. The arenas should from first principles
all behave relatively the same, regardless of the underlying storage, e.g.
brk vs. mmap.
 
> You may have noticed that part of the patch takes mp_.mmap_threshold into
> account in the shrinking decision but this was the wrong choice. I should
> have at least used the trim threshold there.  I had considered just using
> trim threshold but worried that the glibc developers would not like it
> as it required manual tuning by the user.

Please do not hesitate to ask an opinion on the list, like you are doing
right now with your RFC. Just ask if there is a preference, choice, or
rationale for using or not using trim_threshold.

I see algorithm development going through various phases:

(a) Implement algorithm, add tunables.

(b) Evaluate algorithm against workloads and tweak tunables.

(c) Develop dynamic tuning and test.

(d) Replace tunables with dynamic tuning.

With glibc's malloc we've been at (b) for years. I hope that with some
of the tunables unification work we're doing, we can make (b),
(c) and (d) easier for the core developers.

> If trim threshold was to be used as a workaround then I think we'd need at
> least this patch. Anyone using the ebizzy benchmark without knowing this will
> still report a regression between distros with newer glibcs but at least
> a google search might find this thread.

The trimming threshold is another solution.

Limiting the trimming should reduce the number of MADV_DONTNEED calls,
and make all of the arenas behave more uniformly.

> diff --git a/malloc/arena.c b/malloc/arena.c
> index 886defb074a2..a78d4835a825 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
>      }
>    top_size = chunksize (top_chunk);
>    extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> -  if (extra < (long) pagesz)
> +  if (extra < (long) mp_.trim_threshold)
>      return 0;
>  
>    /* Try to shrink. */
> 

This patch looks correct to me. I think it's a bug that non-main arenas
don't use the trim threshold. All arenas should behave the same way,
which makes them easier to debug and use.

How does the patch impact your testing?

I might try this out in Fedora Rawhide for a few months.

Would you post another patch with just this change and your test
results?

Cheers,
Carlos.
Mel Gorman Feb. 10, 2015, 8:24 p.m. UTC | #2
On Tue, Feb 10, 2015 at 02:01:45PM -0500, Carlos O'Donell wrote:
> >> Thus, instead of adding more complex heuristics to glibc's malloc I think
> > 
> > Is it really that complex? It's a basic heuristic comparing two counters
> > that affects the alloc and free slow paths. The checks are negligible in
> > comparison to the necessary work if glibc calls into the kernel.
> 
> It is complex. It's a time-based integral value that is going to 
> complicate debugging, reproducers, and users will need to be educated
> as to how it works. We will likely also want a tunable for it such that
> if it negatively impacts workloads we can disable it. The tunable, though,
> adds runtime overhead, and that's all added complexity.
> 

Ok, I cannot argue with that logic.

> Please keep in mind that glibc is a general purpose allocator which aims
> to be good at the average case. Unfortunately, to be honest, we lack any data
> with which to measure "average" workloads, which is why glibc as a community
> is looking at whole system benchmarks to gather workload data and integrate
> it directly into our microbenchmark (see benchtests/) or a future whole system
> benchmark framework.
> 

Understood. For what it's worth, the kernel memory management has had
numerous similar discussions in the past.

> So while the technical complexity of the fix is low, that's not the only
> factor I measure when looking at a fix.
> 
> A high quality contribution for changes to malloc should IMO include a
> microbenchmark or two to flesh out the matching workload against which the
> change was tested (see benchtests/README).
> 

Ok. In this case, a benchmark exists but I would be hesitant to include it
in benchtests without better evidence that a number of workloads actually
suffer this problem. mariadb hits this during database initialisation but
it's negligible in comparison to the overall cost of that operation.
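
For reference, the worst case is triggered by a pattern roughly like
the contrived sketch below (not the benchmark in question): a worker
thread, whose arena lives in a per-thread heap, tightly allocating,
touching and freeing blocks below the mmap threshold, so that each
free may shrink the heap and the next allocation refaults it:

  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>

  #define SZ (64 * 1024)   /* below the default 128KB mmap threshold */

  static void *
  worker (void *arg)
  {
    for (int i = 0; i < 1000000; i++)
      {
        char *p = malloc (SZ);
        if (p == NULL)
          abort ();
        memset (p, 1, SZ);   /* fault the pages in */
        free (p);            /* may madvise the heap back down */
      }
    return arg;
  }

  int
  main (void)
  {
    pthread_t t;
    pthread_create (&t, NULL, worker, NULL);
    pthread_join (t, NULL);
    return 0;
  }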

> >> we should be testing the addition of MADV_FREE in glibc's malloc and then
> >> supporting or adjusting the proposal for MADV_FREE or vrange dependent on
> >> the outcome.
> >>
> > 
> > They are orthogonal to each other and both can exist side by side. MADV_FREE
> > is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far
> > cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will
> > be merged in the current merge window. The patches are sitting in Andrew
> > Morton's mmotm tree but he has not sent a merge request yet.
> 
> They are orthogonal in a technical sense, but they solve the same problem:
> Make glibc's malloc faster for workload X.
> 
> My opinion is that MADV_FREE is a technically superior solution to rate
> limiting madvise() calls from malloc. Rate limiting or dynamic heuristics
> have been difficult to tune and support over the last decade, and I am
> very hesitant to accept more without serious discussion, justification,
> and data.
> 

Understood.

> >> In the meantime we can talk about mitigating the problems in glibc's
> >> allocator for systems with old kernels, but it should not be the primary
> >> solution. In glibc we would conditionalize the changes against the first
> >> kernel version that included MADV_FREE, and when the minimum supported
> >> kernel version is higher than that we would remove the code in question.
> >>
> >> My suggested next steps are:
> >>
> >> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
> >>     MADV_FREE, see how that performs, and inform upstream kernel.
> >>
> > 
> > The upstream kernel developers in this case won't care what the performance
> > is (the changelog is already set), although Minchan would care if there were any
> > bug reports associated with its usage. MADV_FREE is likely to be supported
> > either way and it's up to the glibc folk whether they want to support it.
> 
> We do want to support it. I also believe that upstream does care.
> 

If glibc hits bug reports when MADV_FREE is used, they will *definitely*
care.

> >> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
> >>     measure. Before we go any further here, please increase M_TRIM_THRESHOLD
> >>     in ebizzy and see if that makes a difference? It should make a difference
> >>     by increasing the threshold at which we trim back to the OS, both sbrk,
> >>     and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
> >>     memory pressure. Changing the default though is not a trivial thing, since
> >>     it could lead to immediate OOM for existing applications that run close to
> >>     the limit of RAM. Discussion and analysis will be required.
> >>
> > 
> > Altering trim threshold does not appear to help but that's not really
> > surprising considering that it's only applied to the main arena and not the
> > per-thread heaps. At least that is how I'm interpreting this check
> 
> I agree. It looks like a bug to me. The arenas should from first principles
> all behave relatively the same, regardless of the underlying storage, e.g.
> brk vs. mmap.
>  

Cool. By coincidence it happens to solve the problem for this
particular workload, but the mileage will vary considerably. At least it
can be tuned around without disabling per-thread heaps entirely.

> > You may have noticed that part of the patch takes mp_.mmap_threshold into
> > account in the shrinking decision but this was the wrong choice. I should
> > have at least used the trim threshold there.  I had considered just using
> > trim threshold but worried that the glibc developers would not like it
> > as it required manual tuning by the user.
> 
> Please do not hesitate to ask an opinion on the list, like you are doing
> right now with your RFC. Just ask if there is a preference, choice, or
> rationale for using or not using trim_threshold.
> 
> I see algorithm development going through various phases:
> 
> (a) Implement algorithm, add tunables.
> 
> (b) Evaluate algorithm against workloads and tweak tunables.
> 
> (c) Develop dynamic tuning and test.
> 
> (d) Replace tunables with dynamic tuning.
> 
> With glibc's malloc we've been at (b) for years. I hope that with some
> of the tunables unification work we're doing, we can make (b),
> (c) and (d) easier for the core developers.
> 

Thanks for the clarification.

> > If trim threshold was to be used as a workaround then I think we'd need at
> > least this patch. Anyone using the ebizzy benchmark without knowing this will
> > still report a regression between distros with newer glibcs but at least
> > a google search might find this thread.
> 
> The trimming threshold is another solution.
> 
> Limiting the trimming should reduce the number of MADV_DONTNEED calls,
> and make all of the arenas behave more uniformly.
> 
> > diff --git a/malloc/arena.c b/malloc/arena.c
> > index 886defb074a2..a78d4835a825 100644
> > --- a/malloc/arena.c
> > +++ b/malloc/arena.c
> > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
> >      }
> >    top_size = chunksize (top_chunk);
> >    extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> > -  if (extra < (long) pagesz)
> > +  if (extra < (long) mp_.trim_threshold)
> >      return 0;
> >  
> >    /* Try to shrink. */
> > 
> 
> This patch looks correct to me. I think it's a bug that non-main arenas
> don't use the trim threshold. All arenas should behave the same way,
> which makes them easier to debug and use.
> 
> How does the patch impact your testing?
> 

ebizzy gets fixed by coincidence. Another test case that has a mix of
mallocs and frees shows a reduction in the number of calls to madvise by
default, and it's possible to eliminate them entirely.
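
For anyone wanting to verify that on their own workload, the madvise
calls are easy to count from outside (the binary name is obviously a
placeholder):

  $ strace -f -c -e trace=madvise ./testcase

The -c summary prints a per-syscall call count on exit.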

> I might try this out in Fedora Rawhide for a few months.
> 
> Would you post another patch with just this change and your test
> results?
> 

No problem. It'll be posted within the next few minutes. The changelog
may be considered excessive but feel free to trim it (ba dum tisch, I'm
here all week).
Rich Felker Feb. 11, 2015, 1:19 p.m. UTC | #3
On Tue, Feb 10, 2015 at 02:01:45PM -0500, Carlos O'Donell wrote:
> >> Thus, instead of adding more complex heuristics to glibc's malloc I think
> > 
> > Is it really that complex? It's a basic heuristic comparing two counters
> > that affects the alloc and free slow paths. The checks are negligible in
> > comparison to the necessary work if glibc calls into the kernel.
> 
> It is complex. It's a time-based integral value that is going to 
> complicate debugging, reproducers, and users will need to be educated
> as to how it works.

One way to express this idea is that the complexity is emergent. The
code itself is not complex, but the resulting behavior, which is the
result of a feedback process with internal state and input from the
entire sequence of allocation and deallocation operations, is very
complex.

Rich

Patch

diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..a78d4835a825 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -696,7 +696,7 @@  heap_trim (heap_info *heap, size_t pad)
     }
   top_size = chunksize (top_chunk);
   extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
-  if (extra < (long) pagesz)
+  if (extra < (long) mp_.trim_threshold)
     return 0;
 
   /* Try to shrink. */