Message ID | 20150210165414.GF2395@suse.de |
---|---|
State | New |
Headers | show |
On 02/10/2015 11:54 AM, Mel Gorman wrote: > It's true that MADV_FREE will be faster than MADV_DONTNEED in this case > as the pages do not need to be reallocated. That does not make it a > cheap operation and using MADV_FREE thousands of times per second in the > worst-case is still going to hurt. MADV_FREE still incurs a page table > protection update, TLB flush and if it's reused then it's another minor > fault. In the absense of memory pressure, the main gain is that you do > not have to refault, realloc and zero the pages. I agree. >> Thus, instead of adding more complex heuristics to glibc's malloc I think > > Is it really that complex? It's a basic heuristic comparing two counters > that affects the alloc and free slow paths. The checks are neglible in > comparison to the necessary work if glibc calls into the kernel. It is complex. It's a time-based integral value that is going to complicate debugging, reproducers, and users will need to be educated as to how it works. We will likely also want a tunnable for it such that if it negatively impacts workloads we can disable it. The tunnable though adds runtime overhead, and that's all added complexity. Please keep in mind that glibc is a general purpose allocator which aims to be good at the average case. Unfortunately to be honest we lack any data with which to measure "average" workloads, which is why glibc as a community is looking at whole system benchmarks to gather workload data and integrate it directly into our microbenchmark (see benchtests/) or a future whole system benchmark framework. So while the technical complexity of the fix is low, that's not the only factor I measure when looking at a fix. A high quality contribution for changes to malloc should IMO include a microbenchmark or two to flesh out the matching workload against which the change was tested (see benchtests/README). >> we should be testing the addition of MADV_FREE in glibc's malloc and then >> supporting or adjusting the proposal for MADV_FREE or vrange dependent on >> the outcome. >> > > They are ortogonal to each other and both can exist side by side. MADV_FREE > is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far > cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will > be merged in the current merge window. The patches are sitting in Andrew > Morton's mmotm tree but he has not sent a merge request yet. They are orthogonal in a technical sense, but they solve the same problem: Make glibc's malloc faster for workload X. My opinion is that MADV_FREE is a technically superior solution to rate limiting madvise() calls from malloc. Rate limiting or dynamic heuristics have been difficult to tune and support over the last decade, and I am very hesitant to accept more without serious discussion, justification, and data. >> In the meantime we can talk about mitigating the problems in glibc's >> allocator for systems with old kernels, but it should not be the primary >> solution. In glibc we would conditionalize the changes against the first >> kernel version that included MADV_FREE, and when the minimum supported >> kernel version is higher than that we would remove the code in question. >> >> My suggested next steps are: >> >> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses >> MADV_FREE, see how that performs, and inform upstream kernel. >> > > The upstream kernel developers in this case won't care what the performance > is (changelog is already set) although Michan would care if there was any > bug reports associated with its usage. MADV_FREE is likely to be supported > either way and it's up to the glibc folk whether they want to support it. We do want to support it. I also believe that upstream does care. >> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary >> measure. Before we go any further here, please increase M_TRIM_THRESHOLD >> in ebizzy and see if that makes a difference? It should make a difference >> by increasing the threshold at which we trim back to the OS, both sbrk, >> and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased >> memory pressure. Changing the default though is not a trivial thing, since >> it could lead to immediate OOM for existing applications that run close to >> the limit of RAM. Discussion and analysis will be required. >> > > Altering trim threshold does not appear to help but that's not really > surprising considering that it's only applied the main arena and not the > per-thread heaps. At least that is how I'm interpreting this check I agree. It looks like a bug to me. The arenas should from first principles all behave relatively the same, regardless of the underlying storage e.g. brk vs. mmap. > You may have noticed that part of the patch takes mp_.mmap_threshold into > account in the shrinking decision but this was the wrong choice. I should > have at least used the trim threshold there. I had considered just using > trim threshold but worried that the glibc developers would not like it > as it requred manual tuning by the user. Please do not hesitate to ask an opinion on the list, like you are doing right now with your RFC. Just ask if there is a preference, choice, or rationale for using or not using trim_threshold. I see algorithm development going through various phases: (a) Implement algorithm, add tunnables. (b) Evaluate algorithm against workloads and tweak tunnables. (c) Develop dynamic tunning and test. (e) Replace tunnable with dynamic tunning. With glibc's malloc we've been at (b) for years. I hope that with some of the tunnables unification work we're doing that we can make (b), (c) and (d) easier for the core developers. > If trim threshold was to be used as a workaround then I'd think we'd need > at this patch. Anyone using the ebizzy benchmark without knowing this will > still report a regression between distros with newer glibcs but at least > a google search might find this thread. The trimming threshold is another solution. Limiting the trimming should reduce the number of MADV_DONTNEED calls, and make all of the arenas behave more uniformly. > diff --git a/malloc/arena.c b/malloc/arena.c > index 886defb074a2..a78d4835a825 100644 > --- a/malloc/arena.c > +++ b/malloc/arena.c > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) > } > top_size = chunksize (top_chunk); > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); > - if (extra < (long) pagesz) > + if (extra < (long) mp_.trim_threshold) > return 0; > > /* Try to shrink. */ > This patch looks correct to me. I think it's a bug that non-main arenas don't use the trim threshold. All arenas should behave the same way, and thus make them easier to debug and use. How does the patch impact your testing? I might try this out in Fedora Rawhide for a few months. Would you post another patch with just this change and your test results? Cheers, Carlos.
On Tue, Feb 10, 2015 at 02:01:45PM -0500, Carlos O'Donell wrote: > >> Thus, instead of adding more complex heuristics to glibc's malloc I think > > > > Is it really that complex? It's a basic heuristic comparing two counters > > that affects the alloc and free slow paths. The checks are neglible in > > comparison to the necessary work if glibc calls into the kernel. > > It is complex. It's a time-based integral value that is going to > complicate debugging, reproducers, and users will need to be educated > as to how it works. We will likely also want a tunnable for it such that > if it negatively impacts workloads we can disable it. The tunnable though > adds runtime overhead, and that's all added complexity. > Ok, I cannot argue with that logic. > Please keep in mind that glibc is a general purpose allocator which aims > to be good at the average case. Unfortunately to be honest we lack any data > with which to measure "average" workloads, which is why glibc as a community > is looking at whole system benchmarks to gather workload data and integrate > it directly into our microbenchmark (see benchtests/) or a future whole system > benchmark framework. > Understood. For what it's worth, the kernel memory management has had numerous similar discussions in the past. > So while the technical complexity of the fix is low, that's not the only > factor I measure when looking at a fix. > > A high quality contribution for changes to malloc should IMO include a > microbenchmark or two to flesh out the matching workload against which the > change was tested (see benchtests/README). > Ok. In this case, a benchmark exists but I would be hesitant to include it in benchtests without better evidence that a number of workloads actually suffer this problem. mariadb hits this during database initialisation but it's negligible in comparison to the overall cost of that operation. > >> we should be testing the addition of MADV_FREE in glibc's malloc and then > >> supporting or adjusting the proposal for MADV_FREE or vrange dependent on > >> the outcome. > >> > > > > They are ortogonal to each other and both can exist side by side. MADV_FREE > > is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far > > cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will > > be merged in the current merge window. The patches are sitting in Andrew > > Morton's mmotm tree but he has not sent a merge request yet. > > They are orthogonal in a technical sense, but they solve the same problem: > Make glibc's malloc faster for workload X. > > My opinion is that MADV_FREE is a technically superior solution to rate > limiting madvise() calls from malloc. Rate limiting or dynamic heuristics > have been difficult to tune and support over the last decade, and I am > very hesitant to accept more without serious discussion, justification, > and data. > Understood. > >> In the meantime we can talk about mitigating the problems in glibc's > >> allocator for systems with old kernels, but it should not be the primary > >> solution. In glibc we would conditionalize the changes against the first > >> kernel version that included MADV_FREE, and when the minimum supported > >> kernel version is higher than that we would remove the code in question. > >> > >> My suggested next steps are: > >> > >> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses > >> MADV_FREE, see how that performs, and inform upstream kernel. > >> > > > > The upstream kernel developers in this case won't care what the performance > > is (changelog is already set) although Michan would care if there was any > > bug reports associated with its usage. MADV_FREE is likely to be supported > > either way and it's up to the glibc folk whether they want to support it. > > We do want to support it. I also believe that upstream does care. > If glibc hits bug reports when MADV_FREE is used, they will *definitely* care. > >> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary > >> measure. Before we go any further here, please increase M_TRIM_THRESHOLD > >> in ebizzy and see if that makes a difference? It should make a difference > >> by increasing the threshold at which we trim back to the OS, both sbrk, > >> and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased > >> memory pressure. Changing the default though is not a trivial thing, since > >> it could lead to immediate OOM for existing applications that run close to > >> the limit of RAM. Discussion and analysis will be required. > >> > > > > Altering trim threshold does not appear to help but that's not really > > surprising considering that it's only applied the main arena and not the > > per-thread heaps. At least that is how I'm interpreting this check > > I agree. It looks like a bug to me. The arenas should from first principles > all behave relatively the same, regardless of the underlying storage e.g. > brk vs. mmap. > Cool. By co-incidence it happens to solve the problem for this particular workload but the milage will vary considerable. At least it can be tuned around without disabling per-thread heaps entirely/ > > You may have noticed that part of the patch takes mp_.mmap_threshold into > > account in the shrinking decision but this was the wrong choice. I should > > have at least used the trim threshold there. I had considered just using > > trim threshold but worried that the glibc developers would not like it > > as it requred manual tuning by the user. > > Please do not hesitate to ask an opinion on the list, like you are doing > right now with your RFC. Just ask if there is a preference, choice, or > rationale for using or not using trim_threshold. > > I see algorithm development going through various phases: > > (a) Implement algorithm, add tunnables. > > (b) Evaluate algorithm against workloads and tweak tunnables. > > (c) Develop dynamic tunning and test. > > (e) Replace tunnable with dynamic tunning. > > With glibc's malloc we've been at (b) for years. I hope that with some > of the tunnables unification work we're doing that we can make (b), > (c) and (d) easier for the core developers. > Thanks for the clarification. > > If trim threshold was to be used as a workaround then I'd think we'd need > > at this patch. Anyone using the ebizzy benchmark without knowing this will > > still report a regression between distros with newer glibcs but at least > > a google search might find this thread. > > The trimming threshold is another solution. > > Limiting the trimming should reduce the number of MADV_DONTNEED calls, > and make all of the arenas behave more uniformly. > > > diff --git a/malloc/arena.c b/malloc/arena.c > > index 886defb074a2..a78d4835a825 100644 > > --- a/malloc/arena.c > > +++ b/malloc/arena.c > > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) > > } > > top_size = chunksize (top_chunk); > > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); > > - if (extra < (long) pagesz) > > + if (extra < (long) mp_.trim_threshold) > > return 0; > > > > /* Try to shrink. */ > > > > This patch looks correct to me. I think it's a bug that non-main arenas > don't use the trim threshold. All arenas should behave the same way, > and thus make them easier to debug and use. > > How does the patch impact your testing? > ebizzy gets fixed by co-incidence. Another test case that has a mix of mallocs and frees shows a reduction in the number of calls to madvise by default and it's possible to eliminate it entirely. > I might try this out in Fedora Rawhide for a few months. > > Would you post another patch with just this change and your test > results? > No problem. It'll be posted within the next few minutes. The changelog may be considered excessive but feel free to trim it (ba dum tisch, I'm here all week).
On Tue, Feb 10, 2015 at 02:01:45PM -0500, Carlos O'Donell wrote: > >> Thus, instead of adding more complex heuristics to glibc's malloc I think > > > > Is it really that complex? It's a basic heuristic comparing two counters > > that affects the alloc and free slow paths. The checks are neglible in > > comparison to the necessary work if glibc calls into the kernel. > > It is complex. It's a time-based integral value that is going to > complicate debugging, reproducers, and users will need to be educated > as to how it works. One way to express this idea is that the complexity is emergent. The code itself is not complex, but the resulting behavior, which is the result of a feedback process with internal state and input from the entire sequence of allocation and deallocation operations, is very complex. Rich
diff --git a/malloc/arena.c b/malloc/arena.c index 886defb074a2..a78d4835a825 100644 --- a/malloc/arena.c +++ b/malloc/arena.c @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) } top_size = chunksize (top_chunk); extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); - if (extra < (long) pagesz) + if (extra < (long) mp_.trim_threshold) return 0; /* Try to shrink. */