Message ID: xnpodpxiv9.fsf@greed.delorie.com
State: New
On Wednesday 28 June 2017 02:12 AM, DJ Delorie wrote:
>> Why does this need to be inherited by children of setxid processes?  I
>> guess the fact that the remaining malloc tunables are is one probable
>> reason, but those also need to be reviewed and brought down to
>> SXID_ERASE if necessary.
>
> IIRC Florian asked for that setting; I don't know specifically what the
> difference is but if you two want to fight it out, I'll get some popcorn ;-)

haha, interesting, because I remember Florian and I had this discussion
about the utility of SXID_IGNORE six months ago when I designed the
scope control for the envvars.  Let's have Florian explain this.

>>> +/* We overlay this structure on the user-data portion of a chunk when
>>> +   the chunk is stored in the per-thread cache.  */
>>> +typedef struct tcache_entry {
>>
>> Incorrect formatting.
>
> You weren't specific here, so I had to guess which formatting you were
> referring to, as the style I used matches code elsewhere in malloc...

The braces should be on the next line:

typedef struct tcache_entry
{
...

>>> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
>>
>> __attribute__ should be on its own line.
>
> Likewise, I was copying the formatting used elsewhere in malloc.  What
> is the correct formatting for that line?  Do I need to fix the other
> existing instances also?

Sure, but in a separate (obvious) patch.  The malloc code was in a
completely different format when it was ported in from dlmalloc and
since then it has been a bit of a mishmash that needs a cleanup.

>>> +  for (i = 0; i < TCACHE_MAX_BINS; ++i) {
>>> +    while (tcache_tmp->entries[i])
>>> +      {
>>> +        tcache_entry *e = tcache_tmp->entries[i];
>>> +        tcache_tmp->entries[i] = e->next;
>>> +        __libc_free (e);
>>> +      }
>>
>> Incorrect formatting.
>
> Not sure what formatting you're referring to here, unless it's the tabs
> being messed up by the "> >"...
Opening brace for the for loop should be on the next line and everything
else indented accordingly.

>> This validation should be done in tunables by setting the minval and
>> maxval values in dl-tunables.list.
>
> I don't see how I can put MAX_TCACHE_SIZE in dl-tunables.list.  Should I
> hard-code the default value, and also include the test here?  Or is
> there no other use of min/max in dl-tunables.list?

If you hardcode the value in dl-tunables.list, you won't need the test
since __init_tunables will do that validation for you.  However, I
noticed that it is a function of a bunch of other macros and unless
those are constant across architectures it won't go into tunables as
is.  I guess you can let it be as it is and I'll fix it up later.

I think this is OK with the couple of formatting fixes above and with
the SXID_IGNORE removed, which we can then discuss and add back if
necessary as an add-on change.

Thanks,
Siddhesh

>
> Revised patch attached.
>
> 2017-06-27  DJ Delorie  <dj@delorie.com>
>
> 	* config.make.in: Enable experimental malloc option.
> 	* configure.ac: Likewise.
> 	* configure: Regenerate.
> 	* malloc/Makefile: Likewise.
> 	* malloc/malloc.c: Add per-thread cache (tcache).
> 	(tcache_put): New.
> 	(tcache_get): New.
> 	(tcache_thread_freeres): New.
> 	(tcache_init): New.
> 	(__libc_malloc): Use cached chunks if available.
> 	(__libc_free): Initialize tcache if needed.
> 	(__libc_realloc): Likewise.
> 	(__libc_calloc): Likewise.
> 	(_int_malloc): Prefill tcache when appropriate.
> 	(_int_free): Likewise.
> 	(do_set_tcache_max): New.
> 	(do_set_tcache_count): New.
> 	(do_set_tcache_unsorted_limit): New.
> 	* manual/probes.texi: Document new probes.
> 	* malloc/arena.c: Add new tcache tunables.
> 	* elf/dl-tunables.list: Likewise.
> 	* manual/tunables.texi: Document them.
>
> diff -x .git -x po -r -U3 -p glibc.pristine/config.make.in glibc.djmalloc-tcache/config.make.in
> --- glibc.pristine/config.make.in	2017-06-20 12:54:22.636484566 -0400
> +++ glibc.djmalloc-tcache/config.make.in	2017-06-26 17:32:41.164888977 -0400
> @@ -78,6 +78,8 @@ multi-arch = @multi_arch@
>
>  mach-interface-list = @mach_interface_list@
>
> +experimental-malloc = @experimental_malloc@
> +
>  nss-crypt = @libc_cv_nss_crypt@
>  static-nss-crypt = @libc_cv_static_nss_crypt@
>
> diff -x .git -x po -r -U3 -p glibc.pristine/configure glibc.djmalloc-tcache/configure
> --- glibc.pristine/configure	2017-06-20 12:54:22.637484566 -0400
> +++ glibc.djmalloc-tcache/configure	2017-06-26 17:32:41.163888976 -0400
> @@ -674,6 +674,7 @@ build_obsolete_nsl
>  link_obsolete_rpc
>  libc_cv_static_nss_crypt
>  libc_cv_nss_crypt
> +experimental_malloc
>  enable_werror
>  all_warnings
>  force_install
> @@ -779,6 +780,7 @@ enable_kernel
>  enable_all_warnings
>  enable_werror
>  enable_multi_arch
> +enable_experimental_malloc
>  enable_nss_crypt
>  enable_obsolete_rpc
>  enable_obsolete_nsl
> @@ -1450,6 +1452,8 @@ Optional Features:
>    --disable-werror        do not build with -Werror
>    --enable-multi-arch     enable single DSO with optimizations for multiple
>                            architectures
> +  --disable-experimental-malloc
> +                          disable experimental malloc features
>    --enable-nss-crypt      enable libcrypt to use nss
>    --enable-obsolete-rpc   build and install the obsolete RPC code for
>                            link-time usage
> @@ -3522,6 +3526,15 @@ else
>  fi
>
>
> +# Check whether --enable-experimental-malloc was given.
> +if test "${enable_experimental_malloc+set}" = set; then :
> +  enableval=$enable_experimental_malloc; experimental_malloc=$enableval
> +else
> +  experimental_malloc=yes
> +fi
> +
> +
> +
>  # Check whether --enable-nss-crypt was given.
>  if test "${enable_nss_crypt+set}" = set; then :
>    enableval=$enable_nss_crypt; nss_crypt=$enableval
> diff -x .git -x po -r -U3 -p glibc.pristine/configure.ac glibc.djmalloc-tcache/configure.ac
> --- glibc.pristine/configure.ac	2017-06-20 12:54:22.638484567 -0400
> +++ glibc.djmalloc-tcache/configure.ac	2017-06-26 17:32:41.155888970 -0400
> @@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch],
>  	      [multi_arch=$enableval],
>  	      [multi_arch=default])
>
> +AC_ARG_ENABLE([experimental-malloc],
> +	      AC_HELP_STRING([--disable-experimental-malloc],
> +			     [disable experimental malloc features]),
> +	      [experimental_malloc=$enableval],
> +	      [experimental_malloc=yes])
> +AC_SUBST(experimental_malloc)
> +
>  AC_ARG_ENABLE([nss-crypt],
>  	      AC_HELP_STRING([--enable-nss-crypt],
>  			     [enable libcrypt to use nss]),
> diff -x .git -x po -r -U3 -p glibc.pristine/elf/dl-tunables.list glibc.djmalloc-tcache/elf/dl-tunables.list
> --- glibc.pristine/elf/dl-tunables.list	2017-06-23 20:49:10.174148477 -0400
> +++ glibc.djmalloc-tcache/elf/dl-tunables.list	2017-06-26 17:32:41.154888970 -0400
> @@ -76,6 +76,18 @@ glibc {
>        minval: 1
>        security_level: SXID_IGNORE
>      }
> +    tcache_max {
> +      type: SIZE_T
> +      security_level: SXID_IGNORE
> +    }
> +    tcache_count {
> +      type: SIZE_T
> +      security_level: SXID_IGNORE
> +    }
> +    tcache_unsorted_limit {
> +      type: SIZE_T
> +      security_level: SXID_IGNORE
> +    }
>    }
>    tune {
>      hwcap_mask {
> diff -x .git -x po -r -U3 -p glibc.pristine/malloc/Makefile glibc.djmalloc-tcache/malloc/Makefile
> --- glibc.pristine/malloc/Makefile	2017-06-23 20:49:10.199148493 -0400
> +++ glibc.djmalloc-tcache/malloc/Makefile	2017-06-26 17:32:41.153888968 -0400
> @@ -189,6 +189,11 @@ tst-malloc-usable-static-ENV = $(tst-mal
>  tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3
>  tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV)
>
> +ifeq ($(experimental-malloc),yes)
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=1
> +else
> +CPPFLAGS-malloc.c += -DUSE_TCACHE=0
> +endif
>  # Uncomment this for test releases.  For public releases it is too expensive.
>  #CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1
>
> diff -x .git -x po -r -U3 -p glibc.pristine/malloc/arena.c glibc.djmalloc-tcache/malloc/arena.c
> --- glibc.pristine/malloc/arena.c	2017-06-20 12:54:22.645484573 -0400
> +++ glibc.djmalloc-tcache/malloc/arena.c	2017-06-26 17:49:33.701608305 -0400
> @@ -236,6 +236,11 @@ TUNABLE_CALLBACK_FNDECL (set_perturb_byt
>  TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t)
>  TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t)
>  TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t)
> +#if USE_TCACHE
> +TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t)
> +TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t)
> +#endif
>  #else
>  /* Initialization routine.  */
>  #include <string.h>
> @@ -322,6 +327,12 @@ ptmalloc_init (void)
>    TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max));
>    TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max));
>    TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test));
> +#if USE_TCACHE
> +  TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max));
> +  TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count));
> +  TUNABLE_GET (tcache_unsorted_limit, size_t,
> +	       TUNABLE_CALLBACK (set_tcache_unsorted_limit));
> +#endif
>    __libc_lock_unlock (main_arena.mutex);
>  #else
>    const char *s = NULL;
> diff -x .git -x po -r -U3 -p glibc.pristine/malloc/malloc.c glibc.djmalloc-tcache/malloc/malloc.c
> --- glibc.pristine/malloc/malloc.c	2017-05-03 16:26:47.559839224 -0400
> +++ glibc.djmalloc-tcache/malloc/malloc.c	2017-06-26 17:32:41.151888967 -0400
> @@ -296,6 +296,30 @@ __malloc_assert (const char *assertion,
>  }
>  #endif
>
> +#if USE_TCACHE
> +/* We want 64 entries.  This is an arbitrary limit, which tunables can reduce.  */
> +# define TCACHE_MAX_BINS	64
> +# define MAX_TCACHE_SIZE	tidx2usize (TCACHE_MAX_BINS-1)
> +
> +/* Only used to pre-fill the tunables.  */
> +# define tidx2usize(idx)	(((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ)
> +
> +/* When "x" is from chunksize().  */
> +# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)
> +/* When "x" is a user-provided size.  */
> +# define usize2tidx(x) csize2tidx (request2size (x))
> +
> +/* With rounding and alignment, the bins are...
> +   idx 0   bytes 0..24 (64-bit) or 0..12 (32-bit)
> +   idx 1   bytes 25..40 or 13..20
> +   idx 2   bytes 41..56 or 21..28
> +   etc.  */
> +
> +/* This is another arbitrary limit, which tunables can change.  Each
> +   tcache bin will hold at most this number of chunks.  */
> +# define TCACHE_FILL_COUNT 7
> +#endif
> +
>
>  /*
>    REALLOC_ZERO_BYTES_FREES should be set if a call to
> @@ -1712,6 +1736,17 @@ struct malloc_par
>
>    /* First address handed out by MORECORE/sbrk.  */
>    char *sbrk_base;
> +
> +#if USE_TCACHE
> +  /* Maximum number of buckets to use.  */
> +  size_t tcache_bins;
> +  size_t tcache_max_bytes;
> +  /* Maximum number of chunks in each bucket.  */
> +  size_t tcache_count;
> +  /* Maximum number of chunks to remove from the unsorted list, which
> +     aren't used to prefill the cache.  */
> +  size_t tcache_unsorted_limit;
> +#endif
>  };
>
>  /* There are several instances of this struct ("arenas") in this
> @@ -1750,6 +1785,13 @@ static struct malloc_par mp_ =
>    .trim_threshold = DEFAULT_TRIM_THRESHOLD,
>  #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
>    .arena_test = NARENAS_FROM_NCORES (1)
> +#if USE_TCACHE
> +  ,
> +  .tcache_count = TCACHE_FILL_COUNT,
> +  .tcache_bins = TCACHE_MAX_BINS,
> +  .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1),
> +  .tcache_unsorted_limit = 0 /* No limit.  */
> +#endif
>  };
>
>  /* Maximum size of memory handled in fastbins.  */
> @@ -2875,6 +2917,121 @@ mremap_chunk (mchunkptr p, size_t new_si
>
>  /*------------------------ Public wrappers. --------------------------------*/
>
> +#if USE_TCACHE
> +
> +/* We overlay this structure on the user-data portion of a chunk when
> +   the chunk is stored in the per-thread cache.  */
> +typedef struct tcache_entry {
> +  struct tcache_entry *next;
> +} tcache_entry;
> +
> +/* There is one of these for each thread, which contains the
> +   per-thread cache (hence "tcache_perthread_struct").  Keeping
> +   overall size low is mildly important.  Note that COUNTS and ENTRIES
> +   are redundant (we could have just counted the linked list each
> +   time), this is for performance reasons.  */
> +typedef struct tcache_perthread_struct {
> +  char counts[TCACHE_MAX_BINS];
> +  tcache_entry *entries[TCACHE_MAX_BINS];
> +} tcache_perthread_struct;
> +
> +static __thread char tcache_shutting_down = 0;
> +static __thread tcache_perthread_struct *tcache = NULL;
> +
> +/* Caller must ensure that we know tc_idx is valid and there's room
> +   for more chunks.  */
> +static void
> +tcache_put (mchunkptr chunk, size_t tc_idx)
> +{
> +  tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
> +  assert (tc_idx < TCACHE_MAX_BINS);
> +  e->next = tcache->entries[tc_idx];
> +  tcache->entries[tc_idx] = e;
> +  ++(tcache->counts[tc_idx]);
> +}
> +
> +/* Caller must ensure that we know tc_idx is valid and there's
> +   available chunks to remove.  */
> +static void *
> +tcache_get (size_t tc_idx)
> +{
> +  tcache_entry *e = tcache->entries[tc_idx];
> +  assert (tc_idx < TCACHE_MAX_BINS);
> +  assert (tcache->entries[tc_idx] > 0);
> +  tcache->entries[tc_idx] = e->next;
> +  --(tcache->counts[tc_idx]);
> +  return (void *) e;
> +}
> +
> +static void __attribute__ ((section ("__libc_thread_freeres_fn")))
> +tcache_thread_freeres (void)
> +{
> +  int i;
> +  tcache_perthread_struct *tcache_tmp = tcache;
> +
> +  if (!tcache)
> +    return;
> +
> +  tcache = NULL;
> +
> +  for (i = 0; i < TCACHE_MAX_BINS; ++i) {
> +    while (tcache_tmp->entries[i])
> +      {
> +	tcache_entry *e = tcache_tmp->entries[i];
> +	tcache_tmp->entries[i] = e->next;
> +	__libc_free (e);
> +      }
> +  }
> +
> +  __libc_free (tcache_tmp);
> +
> +  tcache_shutting_down = 1;
> +}
> +text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
> +
> +static void
> +tcache_init(void)
> +{
> +  mstate ar_ptr;
> +  void *victim = 0;
> +  const size_t bytes = sizeof (tcache_perthread_struct);
> +
> +  if (tcache_shutting_down)
> +    return;
> +
> +  arena_get (ar_ptr, bytes);
> +  victim = _int_malloc (ar_ptr, bytes);
> +  if (!victim && ar_ptr != NULL)
> +    {
> +      ar_ptr = arena_get_retry (ar_ptr, bytes);
> +      victim = _int_malloc (ar_ptr, bytes);
> +    }
> +
> +
> +  if (ar_ptr != NULL)
> +    __libc_lock_unlock (ar_ptr->mutex);
> +
> +  /* In a low memory situation, we may not be able to allocate memory
> +     - in which case, we just keep trying later.  However, we
> +     typically do this very early, so either there is sufficient
> +     memory, or there isn't enough memory to do non-trivial
> +     allocations anyway.  */
> +  if (victim)
> +    {
> +      tcache = (tcache_perthread_struct *) victim;
> +      memset (tcache, 0, sizeof (tcache_perthread_struct));
> +    }
> +
> +}
> +
> +#define MAYBE_INIT_TCACHE() \
> +  if (__glibc_unlikely (tcache == NULL)) \
> +    tcache_init();
> +
> +#else
> +#define MAYBE_INIT_TCACHE()
> +#endif
> +
>  void *
>  __libc_malloc (size_t bytes)
>  {
> @@ -2885,6 +3042,21 @@ __libc_malloc (size_t bytes)
>      = atomic_forced_read (__malloc_hook);
>    if (__builtin_expect (hook != NULL, 0))
>      return (*hook)(bytes, RETURN_ADDRESS (0));
> +#if USE_TCACHE
> +  /* int_free also calls request2size, be careful to not pad twice.  */
> +  size_t tbytes = request2size (bytes);
> +  size_t tc_idx = csize2tidx (tbytes);
> +
> +  MAYBE_INIT_TCACHE ();
> +
> +  if (tc_idx < mp_.tcache_bins
> +      && tc_idx < TCACHE_MAX_BINS /* to appease gcc */
> +      && tcache
> +      && tcache->entries[tc_idx] != NULL)
> +    {
> +      return tcache_get (tc_idx);
> +    }
> +#endif
>
>    arena_get (ar_ptr, bytes);
>
> @@ -2944,6 +3116,8 @@ __libc_free (void *mem)
>        return;
>      }
>
> +  MAYBE_INIT_TCACHE ();
> +
>    ar_ptr = arena_for_chunk (p);
>    _int_free (ar_ptr, p, 0);
>  }
> @@ -2981,7 +3155,10 @@ __libc_realloc (void *oldmem, size_t byt
>    if (chunk_is_mmapped (oldp))
>      ar_ptr = NULL;
>    else
> -    ar_ptr = arena_for_chunk (oldp);
> +    {
> +      MAYBE_INIT_TCACHE ();
> +      ar_ptr = arena_for_chunk (oldp);
> +    }
>
>    /* Little security check which won't hurt performance: the allocator
>       never wrapps around at the end of the address space.  Therefore
> @@ -3206,6 +3383,8 @@ __libc_calloc (size_t n, size_t elem_siz
>
>    sz = bytes;
>
> +  MAYBE_INIT_TCACHE ();
> +
>    arena_get (av, sz);
>    if (av)
>      {
> @@ -3336,6 +3515,10 @@ _int_malloc (mstate av, size_t bytes)
>    mchunkptr fwd;                    /* misc temp for linking */
>    mchunkptr bck;                    /* misc temp for linking */
>
> +#if USE_TCACHE
> +  size_t tcache_unsorted_count;     /* count of unsorted chunks processed */
> +#endif
> +
>    const char *errstr = NULL;
>
>    /*
> @@ -3365,19 +3548,22 @@ _int_malloc (mstate av, size_t bytes)
>       can try it without checking, which saves some time on this fast path.
>     */
>
> +#define REMOVE_FB(fb, victim, pp)					\
> +  do									\
> +    {									\
> +      victim = pp;							\
> +      if (victim == NULL)						\
> +	break;								\
> +    }									\
> +  while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
> +	 != victim);							\
> +
>    if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
>      {
>        idx = fastbin_index (nb);
>        mfastbinptr *fb = &fastbin (av, idx);
>        mchunkptr pp = *fb;
> -      do
> -        {
> -          victim = pp;
> -          if (victim == NULL)
> -            break;
> -        }
> -      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
> -             != victim);
> +      REMOVE_FB (fb, victim, pp);
>        if (victim != 0)
>          {
>            if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
> @@ -3388,6 +3574,26 @@ _int_malloc (mstate av, size_t bytes)
>                return NULL;
>              }
>            check_remalloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> +	  /* While we're here, if we see other chunks of the same size,
> +	     stash them in the tcache.  */
> +	  size_t tc_idx = csize2tidx (nb);
> +	  if (tcache && tc_idx < mp_.tcache_bins)
> +	    {
> +	      mchunkptr tc_victim;
> +
> +	      /* While bin not empty and tcache not full, copy chunks over.  */
> +	      while (tcache->counts[tc_idx] < mp_.tcache_count
> +		     && (pp = *fb) != NULL)
> +		{
> +		  REMOVE_FB (fb, tc_victim, pp);
> +		  if (tc_victim != 0)
> +		    {
> +		      tcache_put (tc_victim, tc_idx);
> +		    }
> +		}
> +	    }
> +#endif
>            void *p = chunk2mem (victim);
>            alloc_perturb (p, bytes);
>            return p;
> @@ -3426,6 +3632,32 @@ _int_malloc (mstate av, size_t bytes)
>            if (av != &main_arena)
>              set_non_main_arena (victim);
>            check_malloced_chunk (av, victim, nb);
> +#if USE_TCACHE
> +	  /* While we're here, if we see other chunks of the same size,
> +	     stash them in the tcache.  */
> +	  size_t tc_idx = csize2tidx (nb);
> +	  if (tcache && tc_idx < mp_.tcache_bins)
> +	    {
> +	      mchunkptr tc_victim;
> +
> +	      /* While bin not empty and tcache not full, copy chunks over.  */
> +	      while (tcache->counts[tc_idx] < mp_.tcache_count
> +		     && (tc_victim = last (bin)) != bin)
> +		{
> +		  if (tc_victim != 0)
> +		    {
> +		      bck = tc_victim->bk;
> +		      set_inuse_bit_at_offset (tc_victim, nb);
> +		      if (av != &main_arena)
> +			set_non_main_arena (tc_victim);
> +		      bin->bk = bck;
> +		      bck->fd = bin;
> +
> +		      tcache_put (tc_victim, tc_idx);
> +		    }
> +		}
> +	    }
> +#endif
>            void *p = chunk2mem (victim);
>            alloc_perturb (p, bytes);
>            return p;
> @@ -3464,6 +3696,16 @@ _int_malloc (mstate av, size_t bytes)
>       otherwise need to expand memory to service a "small" request.
>     */
>
> +#if USE_TCACHE
> +  INTERNAL_SIZE_T tcache_nb = 0;
> +  size_t tc_idx = csize2tidx (nb);
> +  if (tcache && tc_idx < mp_.tcache_bins)
> +    tcache_nb = nb;
> +  int return_cached = 0;
> +
> +  tcache_unsorted_count = 0;
> +#endif
> +
>    for (;; )
>      {
>        int iters = 0;
> @@ -3524,10 +3766,26 @@ _int_malloc (mstate av, size_t bytes)
>                set_inuse_bit_at_offset (victim, size);
>                if (av != &main_arena)
>                  set_non_main_arena (victim);
> +#if USE_TCACHE
> +	      /* Fill cache first, return to user only if cache fills.
> +		 We may return one of these chunks later.  */
> +	      if (tcache_nb
> +		  && tcache->counts[tc_idx] < mp_.tcache_count)
> +		{
> +		  tcache_put (victim, tc_idx);
> +		  return_cached = 1;
> +		  continue;
> +		}
> +	      else
> +		{
> +#endif
>                check_malloced_chunk (av, victim, nb);
>                void *p = chunk2mem (victim);
>                alloc_perturb (p, bytes);
>                return p;
> +#if USE_TCACHE
> +		}
> +#endif
>              }
>
>            /* place chunk in bin */
> @@ -3594,11 +3852,31 @@ _int_malloc (mstate av, size_t bytes)
>            fwd->bk = victim;
>            bck->fd = victim;
>
> +#if USE_TCACHE
> +      /* If we've processed as many chunks as we're allowed while
> +	 filling the cache, return one of the cached ones.  */
> +      ++tcache_unsorted_count;
> +      if (return_cached
> +	  && mp_.tcache_unsorted_limit > 0
> +	  && tcache_unsorted_count > mp_.tcache_unsorted_limit)
> +	{
> +	  return tcache_get (tc_idx);
> +	}
> +#endif
> +
>  #define MAX_ITERS       10000
>            if (++iters >= MAX_ITERS)
>              break;
>          }
>
> +#if USE_TCACHE
> +      /* If all the small chunks we found ended up cached, return one now.  */
> +      if (return_cached)
> +	{
> +	  return tcache_get (tc_idx);
> +	}
> +#endif
> +
>        /*
>           If a large request, scan through the chunks of current bin in
>           sorted order to find smallest that fits.  Use the skip list for this.
> @@ -3884,6 +4162,20 @@ _int_free (mstate av, mchunkptr p, int h
>
>    check_inuse_chunk(av, p);
>
> +#if USE_TCACHE
> +  {
> +    size_t tc_idx = csize2tidx (size);
> +
> +    if (tcache
> +	&& tc_idx < mp_.tcache_bins
> +	&& tcache->counts[tc_idx] < mp_.tcache_count)
> +      {
> +	tcache_put (p, tc_idx);
> +	return;
> +      }
> +  }
> +#endif
> +
>    /*
>      If eligible, place chunk on a fastbin so it can be found
>      and used quickly in malloc.
> @@ -4845,6 +5137,38 @@ do_set_arena_max (size_t value)
>    return 1;
>  }
>
> +#if USE_TCACHE
> +static inline int
> +__always_inline
> +do_set_tcache_max (size_t value)
> +{
> +  if (value >= 0 && value <= MAX_TCACHE_SIZE)
> +    {
> +      LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes);
> +      mp_.tcache_max_bytes = value;
> +      mp_.tcache_bins = csize2tidx (request2size(value)) + 1;
> +    }
> +  return 1;
> +}
> +
> +static inline int
> +__always_inline
> +do_set_tcache_count (size_t value)
> +{
> +  LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
> +  mp_.tcache_count = value;
> +  return 1;
> +}
> +
> +static inline int
> +__always_inline
> +do_set_tcache_unsorted_limit (size_t value)
> +{
> +  LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit);
> +  mp_.tcache_unsorted_limit = value;
> +  return 1;
> +}
> +#endif
>
>  int
>  __libc_mallopt (int param_number, int value)
> diff -x .git -x po -r -U3 -p glibc.pristine/manual/probes.texi glibc.djmalloc-tcache/manual/probes.texi
> --- glibc.pristine/manual/probes.texi	2017-01-10 17:02:06.305343587 -0500
> +++ glibc.djmalloc-tcache/manual/probes.texi	2017-06-26 17:32:35.785884967 -0400
> @@ -231,6 +231,25 @@ dynamic brk/mmap thresholds.  Argument @
>  the adjusted mmap and trim thresholds, respectively.
>  @end deftp
>
> +@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_max}
> +tunable is set.  Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the @code{glibc.malloc.tcache_count}
> +tunable is set.  Argument @var{$arg1} is the requested value, and
> +@var{$arg2} is the previous value of this tunable.
> +@end deftp
> +
> +@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2})
> +This probe is triggered when the
> +@code{glibc.malloc.tcache_unsorted_limit} tunable is set.  Argument
> +@var{$arg1} is the requested value, and @var{$arg2} is the previous
> +value of this tunable.
> +@end deftp
> +
>  @node Mathematical Function Probes
>  @section Mathematical Function Probes
>
> diff -x .git -x po -r -U3 -p glibc.pristine/manual/tunables.texi glibc.djmalloc-tcache/manual/tunables.texi
> --- glibc.pristine/manual/tunables.texi	2017-06-23 20:49:10.201148494 -0400
> +++ glibc.djmalloc-tcache/manual/tunables.texi	2017-06-26 17:35:07.952998383 -0400
> @@ -193,6 +193,39 @@ systems the limit is twice the number of
>  is 8 times the number of cores online.
>  @end deftp
>
> +@deftp Tunable glibc.malloc.tcache_max
> +The maximum size of a request (in bytes) which may be met via the
> +per-thread cache.  The default (and maximum) value is 1032 bytes on
> +64-bit systems and 516 bytes on 32-bit systems.
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_count
> +The maximum number of chunks of each size to cache.  The default is 7.
> +There is no upper limit, other than available system memory.  Note
> +that chunks are rounded up to malloc's guaranteed alignment - this
> +count is per rounded size, not per user-provided size.
> +
> +The approximate maximum overhead of the per-thread cache (for each
> +thread, of course) is thus @code{glibc.malloc.tcache_max} (in bins,
> +max 64 bins) times @code{glibc.malloc.tcache_count} times the size for
> +each bin.  With defaults, this is about 236 KB on 64-bit systems and
> +118 KB on 32-bit systems.
> +@end deftp
> +
> +@deftp Tunable glibc.malloc.tcache_unsorted_limit
> +When the user requests memory and the request cannot be met via the
> +per-thread cache, the arenas are used to meet the request.  At this
> +time, additional chunks will be moved from existing arena lists to
> +pre-fill the corresponding cache.  While copies from the fastbins,
> +smallbins, and regular bins are bounded and predictable due to the bin
> +sizes, copies from the unsorted bin are not bounded, and incur
> +additional time penalties as they need to be sorted as they're
> +scanned.  To make scanning the unsorted list more predictable and
> +bounded, the user may set this tunable to limit the number of blocks
> +that are scanned from the unsorted list while searching for chunks to
> +pre-fill the per-thread cache with.  The default, or when set to zero,
> +is no limit.
> +@end deftp
> +
>  @node Hardware Capability Tunables
>  @section Hardware Capability Tunables
>  @cindex hardware capability tunables
On 06/28/2017 09:28 AM, Siddhesh Poyarekar wrote: > On Wednesday 28 June 2017 02:12 AM, DJ Delorie wrote: >>> Why does this need to be inherited by children of setxid processes? I >>> guess the fact that the remaining malloc tunables are is one probable >>> reason, but those also need to be reviewed and brought down to >>> SXID_ERASE if necessary. >> >> IIRC Florian asked for that setting; I don't know specifically what the >> difference is but if you two want to fight it out, I'll get some popcorn ;-) > > haha, interesting, because I remember Florian and I had this discussion > about the utility of SXID_IGNORE six months ago when I designed the > scope control for the envvars. Lets have Florian explain this. I don't recall discussing this. I think we should use SXID_ERASE for new tunables, and revisit that decision only when necessary. Thanks, Florian
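(Concretely, if I understand the tunables framework right, SXID_ERASE is also what you get when security_level is omitted, so Florian's suggestion would make each of the three new dl-tunables.list entries look something like this sketch:)

```
tcache_max {
  type: SIZE_T
  security_level: SXID_ERASE
}
```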
diff -x .git -x po -r -U3 -p glibc.pristine/config.make.in glibc.djmalloc-tcache/config.make.in --- glibc.pristine/config.make.in 2017-06-20 12:54:22.636484566 -0400 +++ glibc.djmalloc-tcache/config.make.in 2017-06-26 17:32:41.164888977 -0400 @@ -78,6 +78,8 @@ multi-arch = @multi_arch@ mach-interface-list = @mach_interface_list@ +experimental-malloc = @experimental_malloc@ + nss-crypt = @libc_cv_nss_crypt@ static-nss-crypt = @libc_cv_static_nss_crypt@ diff -x .git -x po -r -U3 -p glibc.pristine/configure glibc.djmalloc-tcache/configure --- glibc.pristine/configure 2017-06-20 12:54:22.637484566 -0400 +++ glibc.djmalloc-tcache/configure 2017-06-26 17:32:41.163888976 -0400 @@ -674,6 +674,7 @@ build_obsolete_nsl link_obsolete_rpc libc_cv_static_nss_crypt libc_cv_nss_crypt +experimental_malloc enable_werror all_warnings force_install @@ -779,6 +780,7 @@ enable_kernel enable_all_warnings enable_werror enable_multi_arch +enable_experimental_malloc enable_nss_crypt enable_obsolete_rpc enable_obsolete_nsl @@ -1450,6 +1452,8 @@ Optional Features: --disable-werror do not build with -Werror --enable-multi-arch enable single DSO with optimizations for multiple architectures + --disable-experimental-malloc + disable experimental malloc features --enable-nss-crypt enable libcrypt to use nss --enable-obsolete-rpc build and install the obsolete RPC code for link-time usage @@ -3522,6 +3526,15 @@ else fi +# Check whether --enable-experimental-malloc was given. +if test "${enable_experimental_malloc+set}" = set; then : + enableval=$enable_experimental_malloc; experimental_malloc=$enableval +else + experimental_malloc=yes +fi + + + # Check whether --enable-nss-crypt was given. 
if test "${enable_nss_crypt+set}" = set; then : enableval=$enable_nss_crypt; nss_crypt=$enableval diff -x .git -x po -r -U3 -p glibc.pristine/configure.ac glibc.djmalloc-tcache/configure.ac --- glibc.pristine/configure.ac 2017-06-20 12:54:22.638484567 -0400 +++ glibc.djmalloc-tcache/configure.ac 2017-06-26 17:32:41.155888970 -0400 @@ -313,6 +313,13 @@ AC_ARG_ENABLE([multi-arch], [multi_arch=$enableval], [multi_arch=default]) +AC_ARG_ENABLE([experimental-malloc], + AC_HELP_STRING([--disable-experimental-malloc], + [disable experimental malloc features]), + [experimental_malloc=$enableval], + [experimental_malloc=yes]) +AC_SUBST(experimental_malloc) + AC_ARG_ENABLE([nss-crypt], AC_HELP_STRING([--enable-nss-crypt], [enable libcrypt to use nss]), diff -x .git -x po -r -U3 -p glibc.pristine/elf/dl-tunables.list glibc.djmalloc-tcache/elf/dl-tunables.list --- glibc.pristine/elf/dl-tunables.list 2017-06-23 20:49:10.174148477 -0400 +++ glibc.djmalloc-tcache/elf/dl-tunables.list 2017-06-26 17:32:41.154888970 -0400 @@ -76,6 +76,18 @@ glibc { minval: 1 security_level: SXID_IGNORE } + tcache_max { + type: SIZE_T + security_level: SXID_IGNORE + } + tcache_count { + type: SIZE_T + security_level: SXID_IGNORE + } + tcache_unsorted_limit { + type: SIZE_T + security_level: SXID_IGNORE + } } tune { hwcap_mask { diff -x .git -x po -r -U3 -p glibc.pristine/malloc/Makefile glibc.djmalloc-tcache/malloc/Makefile --- glibc.pristine/malloc/Makefile 2017-06-23 20:49:10.199148493 -0400 +++ glibc.djmalloc-tcache/malloc/Makefile 2017-06-26 17:32:41.153888968 -0400 @@ -189,6 +189,11 @@ tst-malloc-usable-static-ENV = $(tst-mal tst-malloc-usable-tunables-ENV = GLIBC_TUNABLES=glibc.malloc.check=3 tst-malloc-usable-static-tunables-ENV = $(tst-malloc-usable-tunables-ENV) +ifeq ($(experimental-malloc),yes) +CPPFLAGS-malloc.c += -DUSE_TCACHE=1 +else +CPPFLAGS-malloc.c += -DUSE_TCACHE=0 +endif # Uncomment this for test releases. For public releases it is too expensive. 
#CPPFLAGS-malloc.o += -DMALLOC_DEBUG=1 diff -x .git -x po -r -U3 -p glibc.pristine/malloc/arena.c glibc.djmalloc-tcache/malloc/arena.c --- glibc.pristine/malloc/arena.c 2017-06-20 12:54:22.645484573 -0400 +++ glibc.djmalloc-tcache/malloc/arena.c 2017-06-26 17:49:33.701608305 -0400 @@ -236,6 +236,11 @@ TUNABLE_CALLBACK_FNDECL (set_perturb_byt TUNABLE_CALLBACK_FNDECL (set_trim_threshold, size_t) TUNABLE_CALLBACK_FNDECL (set_arena_max, size_t) TUNABLE_CALLBACK_FNDECL (set_arena_test, size_t) +#if USE_TCACHE +TUNABLE_CALLBACK_FNDECL (set_tcache_max, size_t) +TUNABLE_CALLBACK_FNDECL (set_tcache_count, size_t) +TUNABLE_CALLBACK_FNDECL (set_tcache_unsorted_limit, size_t) +#endif #else /* Initialization routine. */ #include <string.h> @@ -322,6 +327,12 @@ ptmalloc_init (void) TUNABLE_GET (mmap_max, int32_t, TUNABLE_CALLBACK (set_mmaps_max)); TUNABLE_GET (arena_max, size_t, TUNABLE_CALLBACK (set_arena_max)); TUNABLE_GET (arena_test, size_t, TUNABLE_CALLBACK (set_arena_test)); +#if USE_TCACHE + TUNABLE_GET (tcache_max, size_t, TUNABLE_CALLBACK (set_tcache_max)); + TUNABLE_GET (tcache_count, size_t, TUNABLE_CALLBACK (set_tcache_count)); + TUNABLE_GET (tcache_unsorted_limit, size_t, + TUNABLE_CALLBACK (set_tcache_unsorted_limit)); +#endif __libc_lock_unlock (main_arena.mutex); #else const char *s = NULL; diff -x .git -x po -r -U3 -p glibc.pristine/malloc/malloc.c glibc.djmalloc-tcache/malloc/malloc.c --- glibc.pristine/malloc/malloc.c 2017-05-03 16:26:47.559839224 -0400 +++ glibc.djmalloc-tcache/malloc/malloc.c 2017-06-26 17:32:41.151888967 -0400 @@ -296,6 +296,30 @@ __malloc_assert (const char *assertion, } #endif +#if USE_TCACHE +/* We want 64 entries. This is an arbitrary limit, which tunables can reduce. */ +# define TCACHE_MAX_BINS 64 +# define MAX_TCACHE_SIZE tidx2usize (TCACHE_MAX_BINS-1) + +/* Only used to pre-fill the tunables. */ +# define tidx2usize(idx) (((size_t) idx) * MALLOC_ALIGNMENT + MINSIZE - SIZE_SZ) + +/* When "x" is from chunksize(). 
*/ +# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT) +/* When "x" is a user-provided size. */ +# define usize2tidx(x) csize2tidx (request2size (x)) + +/* With rounding and alignment, the bins are... + idx 0 bytes 0..24 (64-bit) or 0..12 (32-bit) + idx 1 bytes 25..40 or 13..20 + idx 2 bytes 41..56 or 21..28 + etc. */ + +/* This is another arbitrary limit, which tunables can change. Each + tcache bin will hold at most this number of chunks. */ +# define TCACHE_FILL_COUNT 7 +#endif + /* REALLOC_ZERO_BYTES_FREES should be set if a call to @@ -1712,6 +1736,17 @@ struct malloc_par /* First address handed out by MORECORE/sbrk. */ char *sbrk_base; + +#if USE_TCACHE + /* Maximum number of buckets to use. */ + size_t tcache_bins; + size_t tcache_max_bytes; + /* Maximum number of chunks in each bucket. */ + size_t tcache_count; + /* Maximum number of chunks to remove from the unsorted list, which + aren't used to prefill the cache. */ + size_t tcache_unsorted_limit; +#endif }; /* There are several instances of this struct ("arenas") in this @@ -1750,6 +1785,13 @@ static struct malloc_par mp_ = .trim_threshold = DEFAULT_TRIM_THRESHOLD, #define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8)) .arena_test = NARENAS_FROM_NCORES (1) +#if USE_TCACHE + , + .tcache_count = TCACHE_FILL_COUNT, + .tcache_bins = TCACHE_MAX_BINS, + .tcache_max_bytes = tidx2usize (TCACHE_MAX_BINS-1), + .tcache_unsorted_limit = 0 /* No limit. */ +#endif }; /* Maximum size of memory handled in fastbins. */ @@ -2875,6 +2917,121 @@ mremap_chunk (mchunkptr p, size_t new_si /*------------------------ Public wrappers. --------------------------------*/ +#if USE_TCACHE + +/* We overlay this structure on the user-data portion of a chunk when + the chunk is stored in the per-thread cache. 
*/
+typedef struct tcache_entry
+{
+  struct tcache_entry *next;
+} tcache_entry;
+
+/* There is one of these for each thread, which contains the
+   per-thread cache (hence "tcache_perthread_struct").  Keeping
+   overall size low is mildly important.  Note that COUNTS and ENTRIES
+   are redundant (we could have just counted the linked list each
+   time); this is for performance reasons.  */
+typedef struct tcache_perthread_struct
+{
+  char counts[TCACHE_MAX_BINS];
+  tcache_entry *entries[TCACHE_MAX_BINS];
+} tcache_perthread_struct;
+
+static __thread char tcache_shutting_down = 0;
+static __thread tcache_perthread_struct *tcache = NULL;
+
+/* Caller must ensure that we know tc_idx is valid and there's room
+   for more chunks.  */
+static void
+tcache_put (mchunkptr chunk, size_t tc_idx)
+{
+  tcache_entry *e = (tcache_entry *) chunk2mem (chunk);
+  assert (tc_idx < TCACHE_MAX_BINS);
+  e->next = tcache->entries[tc_idx];
+  tcache->entries[tc_idx] = e;
+  ++(tcache->counts[tc_idx]);
+}
+
+/* Caller must ensure that we know tc_idx is valid and that there are
+   chunks available to remove.
*/
+static void *
+tcache_get (size_t tc_idx)
+{
+  tcache_entry *e = tcache->entries[tc_idx];
+  assert (tc_idx < TCACHE_MAX_BINS);
+  assert (tcache->entries[tc_idx] != NULL);
+  tcache->entries[tc_idx] = e->next;
+  --(tcache->counts[tc_idx]);
+  return (void *) e;
+}
+
+static void
+__attribute__ ((section ("__libc_thread_freeres_fn")))
+tcache_thread_freeres (void)
+{
+  int i;
+  tcache_perthread_struct *tcache_tmp = tcache;
+
+  if (!tcache)
+    return;
+
+  tcache = NULL;
+
+  for (i = 0; i < TCACHE_MAX_BINS; ++i)
+    {
+      while (tcache_tmp->entries[i])
+	{
+	  tcache_entry *e = tcache_tmp->entries[i];
+	  tcache_tmp->entries[i] = e->next;
+	  __libc_free (e);
+	}
+    }
+
+  __libc_free (tcache_tmp);
+
+  tcache_shutting_down = 1;
+}
+text_set_element (__libc_thread_subfreeres, tcache_thread_freeres);
+
+static void
+tcache_init (void)
+{
+  mstate ar_ptr;
+  void *victim = 0;
+  const size_t bytes = sizeof (tcache_perthread_struct);
+
+  if (tcache_shutting_down)
+    return;
+
+  arena_get (ar_ptr, bytes);
+  victim = _int_malloc (ar_ptr, bytes);
+  if (!victim && ar_ptr != NULL)
+    {
+      ar_ptr = arena_get_retry (ar_ptr, bytes);
+      victim = _int_malloc (ar_ptr, bytes);
+    }
+
+  if (ar_ptr != NULL)
+    __libc_lock_unlock (ar_ptr->mutex);
+
+  /* In a low memory situation, we may not be able to allocate memory,
+     in which case we just keep trying later.  However, we typically
+     do this very early, so either there is sufficient memory, or
+     there isn't enough memory to do non-trivial allocations anyway.
*/ + if (victim) + { + tcache = (tcache_perthread_struct *) victim; + memset (tcache, 0, sizeof (tcache_perthread_struct)); + } + +} + +#define MAYBE_INIT_TCACHE() \ + if (__glibc_unlikely (tcache == NULL)) \ + tcache_init(); + +#else +#define MAYBE_INIT_TCACHE() +#endif + void * __libc_malloc (size_t bytes) { @@ -2885,6 +3042,21 @@ __libc_malloc (size_t bytes) = atomic_forced_read (__malloc_hook); if (__builtin_expect (hook != NULL, 0)) return (*hook)(bytes, RETURN_ADDRESS (0)); +#if USE_TCACHE + /* int_free also calls request2size, be careful to not pad twice. */ + size_t tbytes = request2size (bytes); + size_t tc_idx = csize2tidx (tbytes); + + MAYBE_INIT_TCACHE (); + + if (tc_idx < mp_.tcache_bins + && tc_idx < TCACHE_MAX_BINS /* to appease gcc */ + && tcache + && tcache->entries[tc_idx] != NULL) + { + return tcache_get (tc_idx); + } +#endif arena_get (ar_ptr, bytes); @@ -2944,6 +3116,8 @@ __libc_free (void *mem) return; } + MAYBE_INIT_TCACHE (); + ar_ptr = arena_for_chunk (p); _int_free (ar_ptr, p, 0); } @@ -2981,7 +3155,10 @@ __libc_realloc (void *oldmem, size_t byt if (chunk_is_mmapped (oldp)) ar_ptr = NULL; else - ar_ptr = arena_for_chunk (oldp); + { + MAYBE_INIT_TCACHE (); + ar_ptr = arena_for_chunk (oldp); + } /* Little security check which won't hurt performance: the allocator never wrapps around at the end of the address space. Therefore @@ -3206,6 +3383,8 @@ __libc_calloc (size_t n, size_t elem_siz sz = bytes; + MAYBE_INIT_TCACHE (); + arena_get (av, sz); if (av) { @@ -3336,6 +3515,10 @@ _int_malloc (mstate av, size_t bytes) mchunkptr fwd; /* misc temp for linking */ mchunkptr bck; /* misc temp for linking */ +#if USE_TCACHE + size_t tcache_unsorted_count; /* count of unsorted chunks processed */ +#endif + const char *errstr = NULL; /* @@ -3365,19 +3548,22 @@ _int_malloc (mstate av, size_t bytes) can try it without checking, which saves some time on this fast path. 
*/ +#define REMOVE_FB(fb, victim, pp) \ + do \ + { \ + victim = pp; \ + if (victim == NULL) \ + break; \ + } \ + while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \ + != victim); \ + if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ())) { idx = fastbin_index (nb); mfastbinptr *fb = &fastbin (av, idx); mchunkptr pp = *fb; - do - { - victim = pp; - if (victim == NULL) - break; - } - while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) - != victim); + REMOVE_FB (fb, victim, pp); if (victim != 0) { if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0)) @@ -3388,6 +3574,26 @@ _int_malloc (mstate av, size_t bytes) return NULL; } check_remalloced_chunk (av, victim, nb); +#if USE_TCACHE + /* While we're here, if we see other chunks of the same size, + stash them in the tcache. */ + size_t tc_idx = csize2tidx (nb); + if (tcache && tc_idx < mp_.tcache_bins) + { + mchunkptr tc_victim; + + /* While bin not empty and tcache not full, copy chunks over. */ + while (tcache->counts[tc_idx] < mp_.tcache_count + && (pp = *fb) != NULL) + { + REMOVE_FB (fb, tc_victim, pp); + if (tc_victim != 0) + { + tcache_put (tc_victim, tc_idx); + } + } + } +#endif void *p = chunk2mem (victim); alloc_perturb (p, bytes); return p; @@ -3426,6 +3632,32 @@ _int_malloc (mstate av, size_t bytes) if (av != &main_arena) set_non_main_arena (victim); check_malloced_chunk (av, victim, nb); +#if USE_TCACHE + /* While we're here, if we see other chunks of the same size, + stash them in the tcache. */ + size_t tc_idx = csize2tidx (nb); + if (tcache && tc_idx < mp_.tcache_bins) + { + mchunkptr tc_victim; + + /* While bin not empty and tcache not full, copy chunks over. 
*/ + while (tcache->counts[tc_idx] < mp_.tcache_count + && (tc_victim = last (bin)) != bin) + { + if (tc_victim != 0) + { + bck = tc_victim->bk; + set_inuse_bit_at_offset (tc_victim, nb); + if (av != &main_arena) + set_non_main_arena (tc_victim); + bin->bk = bck; + bck->fd = bin; + + tcache_put (tc_victim, tc_idx); + } + } + } +#endif void *p = chunk2mem (victim); alloc_perturb (p, bytes); return p; @@ -3464,6 +3696,16 @@ _int_malloc (mstate av, size_t bytes) otherwise need to expand memory to service a "small" request. */ +#if USE_TCACHE + INTERNAL_SIZE_T tcache_nb = 0; + size_t tc_idx = csize2tidx (nb); + if (tcache && tc_idx < mp_.tcache_bins) + tcache_nb = nb; + int return_cached = 0; + + tcache_unsorted_count = 0; +#endif + for (;; ) { int iters = 0; @@ -3524,10 +3766,26 @@ _int_malloc (mstate av, size_t bytes) set_inuse_bit_at_offset (victim, size); if (av != &main_arena) set_non_main_arena (victim); +#if USE_TCACHE + /* Fill cache first, return to user only if cache fills. + We may return one of these chunks later. */ + if (tcache_nb + && tcache->counts[tc_idx] < mp_.tcache_count) + { + tcache_put (victim, tc_idx); + return_cached = 1; + continue; + } + else + { +#endif check_malloced_chunk (av, victim, nb); void *p = chunk2mem (victim); alloc_perturb (p, bytes); return p; +#if USE_TCACHE + } +#endif } /* place chunk in bin */ @@ -3594,11 +3852,31 @@ _int_malloc (mstate av, size_t bytes) fwd->bk = victim; bck->fd = victim; +#if USE_TCACHE + /* If we've processed as many chunks as we're allowed while + filling the cache, return one of the cached ones. */ + ++tcache_unsorted_count; + if (return_cached + && mp_.tcache_unsorted_limit > 0 + && tcache_unsorted_count > mp_.tcache_unsorted_limit) + { + return tcache_get (tc_idx); + } +#endif + #define MAX_ITERS 10000 if (++iters >= MAX_ITERS) break; } +#if USE_TCACHE + /* If all the small chunks we found ended up cached, return one now. 
*/ + if (return_cached) + { + return tcache_get (tc_idx); + } +#endif + /* If a large request, scan through the chunks of current bin in sorted order to find smallest that fits. Use the skip list for this. @@ -3884,6 +4162,20 @@ _int_free (mstate av, mchunkptr p, int h check_inuse_chunk(av, p); +#if USE_TCACHE + { + size_t tc_idx = csize2tidx (size); + + if (tcache + && tc_idx < mp_.tcache_bins + && tcache->counts[tc_idx] < mp_.tcache_count) + { + tcache_put (p, tc_idx); + return; + } + } +#endif + /* If eligible, place chunk on a fastbin so it can be found and used quickly in malloc. @@ -4845,6 +5137,38 @@ do_set_arena_max (size_t value) return 1; } +#if USE_TCACHE +static inline int +__always_inline +do_set_tcache_max (size_t value) +{ + if (value >= 0 && value <= MAX_TCACHE_SIZE) + { + LIBC_PROBE (memory_tunable_tcache_max_bytes, 2, value, mp_.tcache_max_bytes); + mp_.tcache_max_bytes = value; + mp_.tcache_bins = csize2tidx (request2size(value)) + 1; + } + return 1; +} + +static inline int +__always_inline +do_set_tcache_count (size_t value) +{ + LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count); + mp_.tcache_count = value; + return 1; +} + +static inline int +__always_inline +do_set_tcache_unsorted_limit (size_t value) +{ + LIBC_PROBE (memory_tunable_tcache_unsorted_limit, 2, value, mp_.tcache_unsorted_limit); + mp_.tcache_unsorted_limit = value; + return 1; +} +#endif int __libc_mallopt (int param_number, int value) diff -x .git -x po -r -U3 -p glibc.pristine/manual/probes.texi glibc.djmalloc-tcache/manual/probes.texi --- glibc.pristine/manual/probes.texi 2017-01-10 17:02:06.305343587 -0500 +++ glibc.djmalloc-tcache/manual/probes.texi 2017-06-26 17:32:35.785884967 -0400 @@ -231,6 +231,25 @@ dynamic brk/mmap thresholds. Argument @ the adjusted mmap and trim thresholds, respectively. 
@end deftp +@deftp Probe memory_tunable_tcache_max_bytes (int @var{$arg1}, int @var{$arg2}) +This probe is triggered when the @code{glibc.malloc.tcache_max} +tunable is set. Argument @var{$arg1} is the requested value, and +@var{$arg2} is the previous value of this tunable. +@end deftp + +@deftp Probe memory_tunable_tcache_count (int @var{$arg1}, int @var{$arg2}) +This probe is triggered when the @code{glibc.malloc.tcache_count} +tunable is set. Argument @var{$arg1} is the requested value, and +@var{$arg2} is the previous value of this tunable. +@end deftp + +@deftp Probe memory_tunable_tcache_unsorted_limit (int @var{$arg1}, int @var{$arg2}) +This probe is triggered when the +@code{glibc.malloc.tcache_unsorted_limit} tunable is set. Argument +@var{$arg1} is the requested value, and @var{$arg2} is the previous +value of this tunable. +@end deftp + @node Mathematical Function Probes @section Mathematical Function Probes diff -x .git -x po -r -U3 -p glibc.pristine/manual/tunables.texi glibc.djmalloc-tcache/manual/tunables.texi --- glibc.pristine/manual/tunables.texi 2017-06-23 20:49:10.201148494 -0400 +++ glibc.djmalloc-tcache/manual/tunables.texi 2017-06-26 17:35:07.952998383 -0400 @@ -193,6 +193,39 @@ systems the limit is twice the number of is 8 times the number of cores online. @end deftp +@deftp Tunable glibc.malloc.tcache_max +The maximum size of a request (in bytes) which may be met via the +per-thread cache. The default (and maximum) value is 1032 bytes on +64-bit systems and 516 bytes on 32-bit systems. +@end deftp + +@deftp Tunable glibc.malloc.tcache_count +The maximum number of chunks of each size to cache. The default is 7. +There is no upper limit, other than available system memory. Note +that chunks are rounded up to malloc's guaranteed alignment - this +count is per rounded size, not per user-provided size. 
+
+The approximate maximum overhead of the per-thread cache (for each
+thread, of course) is thus @code{glibc.malloc.tcache_max} (in bins,
+max 64 bins) times @code{glibc.malloc.tcache_count} times the size for
+each bin.  With defaults, this is about 236 KB on 64-bit systems and
+118 KB on 32-bit systems.
+@end deftp
+
+@deftp Tunable glibc.malloc.tcache_unsorted_limit
+When the user requests memory and the request cannot be met via the
+per-thread cache, the arenas are used to meet the request.  At this
+time, additional chunks will be moved from existing arena lists to
+pre-fill the corresponding cache.  While copies from the fastbins,
+smallbins, and regular bins are bounded and predictable due to the bin
+sizes, copies from the unsorted bin are not bounded, and incur
+additional time penalties as they need to be sorted as they're
+scanned.  To make scanning the unsorted list more predictable and
+bounded, the user may set this tunable to limit the number of blocks
+that are scanned from the unsorted list while searching for chunks to
+pre-fill the per-thread cache with.  The default, or when set to zero,
+is no limit.
+@end deftp
+
 @node Hardware Capability Tunables
 @section Hardware Capability Tunables
 @cindex hardware capability tunables