Message ID: 20150209140608.GD2395@suse.de
On 02/09/2015 09:06 AM, Mel Gorman wrote:
> while (data_to_process) {
>         buf = malloc(large_size);
>         do_stuff();
>         free(buf);
> }

Why isn't the fix to change the application to hoist the
malloc out of the loop?

buf = malloc(large_size);
while (data_to_process) {
        do_stuff();
}
free(buf);

Is it simply that the software frameworks themselves are
unable to do this directly?

I can understand your position. Ebizzy models the workload and
you use the workload model to improve performance by changing
the runtime to match the workload.

The problem I face as a maintainer is that you've added
complexity to malloc in the form of a decaying counter, and
I need a strong justification for that kind of added complexity.

For example, I see you're from SUSE, have you put this change
through testing in your distribution builds or releases?
What were the results? Under what *real* workloads did this
make a difference?

Cheers,
Carlos.
Thanks for reviewing.

On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > while (data_to_process) {
> >         buf = malloc(large_size);
> >         do_stuff();
> >         free(buf);
> > }
>
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?
>
> buf = malloc(large_size);
> while (data_to_process) {
>         do_stuff();
> }
> free(buf);
>
> Is it simply that the software frameworks themselves are
> unable to do this directly?
>

Fixing the benchmark in this case hides the problem -- glibc malloc has a
pathological worst case for a relatively basic allocation pattern that is
encountered simply because it happens to use threads (processes would have
avoided the problem). It was spotted when comparing versions of a
distribution, and initially I assumed it was a kernel issue until I
analysed the problem. Even if ebizzy was fixed and it had an upstream
maintainer that would accept the patch, glibc would still have the same
problem. It would be a bit of a shame if the recommendation in some cases
was simply to avoid using malloc/free and instead cache buffers within the
application.

> I can understand your position. Ebizzy models the workload and
> you use the workload model to improve performance by changing
> the runtime to match the workload.
>

Exactly.

> The problem I face as a maintainer is that you've added
> complexity to malloc in the form of a decaying counter, and
> I need a strong justification for that kind of added complexity.
>

I would also welcome suggestions on how madvise could be throttled without
the use of counters. The counters are heap-local, where I do not expect
there will be cache conflicts, and the allocation-side counter is only
updated after a recent heap shrink to minimise updates.

Initially I worked around this in the kernel, but any solution there
breaks the existing semantics of MADV_DONTNEED and was rejected. See the
last paragraph of https://lkml.org/lkml/2015/2/2/696 .
> For example, I see you're from SUSE, have you put this change
> through testing in your distribution builds or releases?

I'm not the glibc maintainer for our distribution, but even if I was, the
distribution has an upstream-first policy. A change of this type would have
to be acceptable to the upstream maintainers. If this can be addressed here
then I can ask our glibc maintainers to apply the patch as a backport.

> What were the results? Under what *real* workloads did this
> make a difference?
>

It was detected manually, but the behaviour was also spotted in firefox
during normal browsing (200 calls to madvise on average per second when
monitored for a short period) and in evolution when updating search folders
(110 madvise calls per second). In neither case can I actually quantify the
impact, because the overhead is a relatively small part of the overall
workload.

MariaDB, when populating a database during the startup phase of the
sysbench benchmark, was calling madvise 95 times a second. In that case the
cost of the page table teardown + refault is negligible in comparison to
the IO costs, and the bulk of the CPU time is spent in mariadb itself, but
glancing at perf top it looks like about 25% of system CPU time is spent
tearing down and reallocating pages. During the sysbench run itself,
mariadb was calling madvise 2000 times a second.

I didn't formally quantify the impact as I do not have a test setup for
testing glibc modifications system-wide, and it's essentially the same
problem seen by ebizzy, except ebizzy is a hell of a lot easier to test
with a modified glibc.

Thanks.
On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>> while (data_to_process) {
>>         buf = malloc(large_size);
>>         do_stuff();
>>         free(buf);
>> }
>
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?

For a lot of C++ code, this would require replacing global operator new
with a pool allocator. We do not want programmers to do that, for various
reasons (loss of tooling, our limited malloc hardening, etc.).
On 02/10/2015 07:30 AM, Florian Weimer wrote:
> On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
>> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>>> while (data_to_process) {
>>>         buf = malloc(large_size);
>>>         do_stuff();
>>>         free(buf);
>>> }
>>
>> Why isn't the fix to change the application to hoist the
>> malloc out of the loop?
>
> For a lot of C++ code, this would require replacing global operator new
> with a pool allocator. We do not want programmers to do that, for
> various reasons (loss of tooling, our limited malloc hardening, etc.).

Is this because the objects are implicitly allocated by the language in
the inner loop using new?

Cheers,
Carlos.
On 02/09/2015 05:49 PM, Mel Gorman wrote:
> I would also welcome suggestions on how madvise could be throttled without
> the use of counters. The counters are heap-local where I do not expect
> there will be cache conflicts and the allocation-side counter is only
> updated after a recent heap shrink to minimise updates.
>
> Initially I worked around this in the kernel but any solution there
> breaks the existing semantics of MADV_DONTNEED and was rejected. See
> last paragraph of https://lkml.org/lkml/2015/2/2/696 .

The truth is that glibc doesn't want to use MADV_DONTNEED in malloc, but
it's the only interface we have right now that has similar semantics to
what we need. Similarly, Kostya from Google told me that ASan also has this
problem, since it has tons of statistics pages that it will touch soon but
doesn't care if they come back as zero or as the original data.

Two years ago I spoke with Motohiro-san and we talked about MADV_"want but
don't need", but no mainline solution was present at the time. The
immediate work that comes to mind is the `vrange` work by Minchan Kim:
http://lwn.net/Articles/542504/

I agree with Rik's comment in the above link that we really want MADV_FREE
in these cases, and it gives a 200% speedup over MADV_DONTNEED (as reported
by testers using jemalloc patches). Thus, instead of adding more complex
heuristics to glibc's malloc, I think we should be testing the addition of
MADV_FREE in glibc's malloc and then supporting or adjusting the proposal
for MADV_FREE or vrange dependent on the outcome.

In the meantime we can talk about mitigating the problems in glibc's
allocator for systems with old kernels, but it should not be the primary
solution. In glibc we would conditionalize the changes against the first
kernel version that included MADV_FREE, and when the minimum supported
kernel version is higher than that we would remove the code in question.
My suggested next steps are:

(a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
    MADV_FREE, see how that performs, and inform upstream kernel.

(b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
    measure.

Before we go any further here, please increase M_TRIM_THRESHOLD in ebizzy
and see if that makes a difference? It should make a difference by
increasing the threshold at which we trim back to the OS, both sbrk and
mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
memory pressure.

Changing the default, though, is not a trivial thing, since it could lead
to immediate OOM for existing applications that run close to the limit of
RAM. Discussion and analysis will be required.

Comments?

Cheers,
Carlos.
On 02/10/2015 03:56 PM, Carlos O'Donell wrote:
> On 02/10/2015 07:30 AM, Florian Weimer wrote:
>> On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
>>> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>>>> while (data_to_process) {
>>>>         buf = malloc(large_size);
>>>>         do_stuff();
>>>>         free(buf);
>>>> }
>>>
>>> Why isn't the fix to change the application to hoist the
>>> malloc out of the loop?
>>
>> For a lot of C++ code, this would require replacing global operator new
>> with a pool allocator. We do not want programmers to do that, for
>> various reasons (loss of tooling, our limited malloc hardening, etc.).
>
> Is this because the objects are implicitly allocated by the language
> in the inner loop using new?

Not by the language, but by libraries. The libraries may not offer a
choice to avoid the large allocation and deallocation.

In general, object reuse is no longer considered good design because it
typically complicates the different states an object must support. This is
probably not specific to C++; it may affect modern C code as well (which
tends to hide implementation details with pointers to incomplete structs
etc., and not rely on callers providing sufficiently sized buffers).

Even if there is a way to reuse objects (e.g., we have freopen in addition
to fopen/fclose), I don't want to suggest it to developers because it
eventually leads to custom allocators, pools, and the problems that come
with them.
On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > while (data_to_process) {
> >         buf = malloc(large_size);
> >         do_stuff();
> >         free(buf);
> > }
>
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?

I understand this is impossible for some language idioms (typically OOP,
and despite my personal belief that this indicates they're bad language
idioms, I don't want to descend into that type of argument), but to me the
big question is:

Why, when you have a large buffer -- so large that it can trigger
MADV_DONTNEED or munmap when freed -- are you doing so little with it in
do_stuff() that the work performed on the buffer doesn't dominate the time
spent?

This indicates to me that the problem might actually be significant
over-allocation beyond the size that's actually going to be used. Do we
have some real-world specific examples of where this is happening?

If it's poor design in application code and the applications could be
corrected, I think we should consider whether the right fix is on the
application side.

Rich
On Wed, Feb 11, 2015 at 08:26:31AM -0500, Rich Felker wrote:
> On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> > On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > > while (data_to_process) {
> > >         buf = malloc(large_size);
> > >         do_stuff();
> > >         free(buf);
> > > }
> >
> > Why isn't the fix to change the application to hoist the
> > malloc out of the loop?
>
> I understand this is impossible for some language idioms (typically
> OOP, and despite my personal belief that this indicates they're bad
> language idioms, I don't want to descend into that type of argument),
> but to me the big question is:
>
> Why, when you have a large buffer -- so large that it can trigger
> MADV_DONTNEED or munmap when freed -- are you doing so little with it
> in do_stuff() that the work performed on the buffer doesn't dominate
> the time spent?
>

It's less than ideal application behaviour.

> This indicates to me that the problem might actually be significant
> over-allocation beyond the size that's actually going to be used. Do
> we have some real-world specific examples of where this is happening?

In the case of ebizzy, it is the case that do_stuff is so small that the
allocation/free cost dominates. In the cases where I've seen this happen
on other workloads (firefox, evolution, mariadb during database init from
sysbench) the cost of the operations on the buffer dominated. The
malloc/free cost was there but the performance difference is in the noise.

> If it's poor design in application code and the applications could be
> corrected, I think we should consider whether the right fix is on the
> application side.
>

Could you look at v2 of the patch please? After discussions, I accept that
fixing this with a tricky heuristic is overkill. The second patch just
obeys the trim threshold for per-thread heaps, which is much simpler.

If an application is later identified that is correctly implemented and
still requires the trim threshold to perform, then more complex options
can be considered.

Thanks
On Wed, Feb 11, 2015 at 01:34:53PM +0000, Mel Gorman wrote:
> > This indicates to me that the problem might actually be significant
> > over-allocation beyond the size that's actually going to be used. Do
> > we have some real-world specific examples of where this is happening?
>
> In the case of ebizzy, it is the case that do_stuff is so small that the
> allocation/free cost dominates.

ebizzy is supposed to be a benchmark simulating a typical workload, right?
If so, I think this specific test operation is a mistake, and I think
glibc should be cautious about optimizing for benchmarks that don't
reflect meaningful real-world usage.

> In the cases where I've seen this happen
> on other workloads (firefox, evolution, mariadb during database init from
> sysbench) the cost of the operations on the buffer dominated. The malloc/free
> cost was there but the performance difference is in the noise.

If it's not distinguishable from noise in actual usage cases, then I'm
skeptical that there's a need to fix this issue.

Rich
On Wed, Feb 11, 2015 at 09:07:37AM -0500, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 01:34:53PM +0000, Mel Gorman wrote:
> > > This indicates to me that the problem might actually be significant
> > > over-allocation beyond the size that's actually going to be used. Do
> > > we have some real-world specific examples of where this is happening?
> >
> > In the case of ebizzy, it is the case that do_stuff is so small that the
> > allocation/free cost dominates.
>
> ebizzy is supposed to be a benchmark simulating typical workload,
> right?

Right - a web application server workload specifically. In reality, I
think it would depend on what language said workload was implemented in.
Java workloads would not hit the glibc allocator at all, for example.

> If so, I think this specific test operation is a mistake, and I
> think glibc should be cautious about optimizing for benchmarks that
> don't reflect meaningful real-world usage.
>

That rules out the complex approach in V1 at least.

> > In the cases where I've seen this happen
> > on other workloads (firefox, evolution, mariadb during database init from
> > sysbench) the cost of the operations on the buffer dominated. The malloc/free
> > cost was there but the performance difference is in the noise.
>
> If it's not distinguishable from noise in actual usage cases, then I'm
> skeptical that there's a need to fix this issue.
>

I take it that is a NAK for v1 of the patch. How about v2? It is expected
that heap trims can be controlled with tuning parameters, but right now
it's not possible to tune the trim threshold for per-thread heaps. V2 of
the patch fixes that and is at least consistent behaviour.
On Wed, Feb 11, 2015 at 02:19:26PM +0000, Mel Gorman wrote:
> > > In the cases where I've seen this happen
> > > on other workloads (firefox, evolution, mariadb during database init from
> > > sysbench) the cost of the operations on the buffer dominated. The malloc/free
> > > cost was there but the performance difference is in the noise.
> >
> > If it's not distinguishable from noise in actual usage cases, then I'm
> > skeptical that there's a need to fix this issue.
> >
>
> I take it that is a NAK for v1 of the patch. How about v2? It is expected
> that heap trims can be controlled with tuning parameters, but right now
> it's not possible to tune the trim threshold for per-thread heaps. V2 of
> the patch fixes that and is at least consistent behaviour.

I don't think I'm qualified to discuss specific code changes to glibc's
allocator. I'll leave this to others who know it better. I was just
weighing in on the high-level aspects of the proposed changes which made
sense to discuss without knowledge specific to glibc's malloc internals.

Rich
On 02/11/2015 09:21 AM, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 02:19:26PM +0000, Mel Gorman wrote:
>>>> In the cases where I've seen this happen
>>>> on other workloads (firefox, evolution, mariadb during database init from
>>>> sysbench) the cost of the operations on the buffer dominated. The malloc/free
>>>> cost was there but the performance difference is in the noise.
>>>
>>> If it's not distinguishable from noise in actual usage cases, then I'm
>>> skeptical that there's a need to fix this issue.
>>>
>>
>> I take it that is a NAK for v1 of the patch. How about v2? It is expected
>> that heap trims can be controlled with tuning parameters, but right now
>> it's not possible to tune the trim threshold for per-thread heaps. V2 of
>> the patch fixes that and is at least consistent behaviour.
>
> I don't think I'm qualified to discuss specific code changes to
> glibc's allocator. I'll leave this to others who know it better. I was
> just weighing in on the high-level aspects of the proposed changes
> which made sense to discuss without knowledge specific to glibc's
> malloc internals.

I'm looking at this now.

Cheers,
Carlos.
diff --git a/ChangeLog b/ChangeLog
index dc1ed1ba1249..5208fadaf7c3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2015-02-09  Mel Gorman  <mgorman@suse.de>
+
+	* malloc/malloc.c (malloc): Account for heap grows vs shrinks
+	* malloc/arena.c (free): Limit calls to madvise when shrinking
+
 2015-02-06  Carlos O'Donell  <carlos@systemhalted.org>
 
 	* version.h (RELEASE): Set to "stable".
diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..9012f4d2a0b8 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -45,6 +45,20 @@
    malloc_chunks.  It is allocated with mmap() and always starts at an
    address aligned to HEAP_MAX_SIZE.  */
 
+/* madvise throttling counters. This structure forces the counters to fit
+   into a size_t so that the alignment requirements of _heap_info can be
+   met. */
+typedef struct _madvise_throttle
+{
+  union {
+    struct {
+      uint16_t event_madvise;
+      uint16_t event_large_malloc;
+    };
+    size_t pad;
+  };
+} madvise_throttle;
+
 typedef struct _heap_info
 {
   mstate ar_ptr; /* Arena for this heap. */
@@ -52,10 +66,13 @@ typedef struct _heap_info
   size_t size;   /* Current size in bytes. */
   size_t mprotect_size; /* Size in bytes that has been mprotected
                            PROT_READ|PROT_WRITE.  */
+
+  madvise_throttle mt;
+
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
      MALLOC_ALIGNMENT. */
-  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
+  char pad[-7 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -632,8 +649,31 @@ shrink_heap (heap_info *h, long diff)
 
       h->mprotect_size = new_size;
     }
-  else
-    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+  else {
+    unsigned int ratio;
+
+    /* Only keep track of recent madvise events by decaying counters */
+    if (++h->mt.event_madvise >= 100)
+      {
+        h->mt.event_large_malloc >>= 1;
+        h->mt.event_madvise >>= 1;
+      }
+    ratio = (h->mt.event_large_malloc + 1) * 100 / h->mt.event_madvise;
+
+    /* madvise and a refault is an expensive operation if the shrink request
+       is temporary. Only call madvise if it is a request bigger than
+       mmap_threshold or if it is detected that there are a mix of growths
+       and shrinks but more shrink requests recently. One impact is that a
+       stream of free+shrink requests will be batched under a single
+       madvise call. */
+    if (diff >= mp_.mmap_threshold ||
+        (ratio < 100 && h->mt.event_large_malloc > 1))
+      {
+        __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+        h->mt.event_large_malloc = h->mt.event_madvise = 0;
+      }
+    else
+      {
+        return -1;
+      }
+  }
   /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
 
   h->size = new_size;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index aa7edbfd4571..41f66cb41557 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2388,6 +2388,16 @@ sysmalloc (INTERNAL_SIZE_T nb, mstate av)
           arena_mem += old_heap->size - old_heap_size;
           set_head (old_top, (((char *) old_heap + old_heap->size)
                               - (char *) old_top) | PREV_INUSE);
+
+          /* Track ratio of heap grows/shrinks after recent shrinks */
+          if (old_heap->mt.event_madvise)
+            {
+              /* Only keep track of recent allocations by decaying counters */
+              if (++old_heap->mt.event_large_malloc >= 100)
+                {
+                  old_heap->mt.event_large_malloc >>= 1;
+                  old_heap->mt.event_madvise >>= 1;
+                }
+            }
         }
       else if ((heap = new_heap (nb + (MINSIZE + sizeof (*heap)), mp_.top_pad)))
         {