Message ID | 20240826025534.472882-3-wangyang.guo@intel.com |
---|---|
State | New |
Headers | show |
Series | malloc: TCACHE improvement for free and calloc | expand |
* Wangyang Guo: > Tcache is an important optimzation to accelerate memory free(), things > within this code path should be kept as simple as possible. This commit > try to remove the function call when free() invokes tcache code path. > > Result of bench-malloc-thread benchmark > > Test Platform: Xeon-8380 > Ratio: New / Original time_per_iteration (Lower is Better) > > Threads# | Ratio > -----------|------ > 1 thread | 0.904 > 4 threads | 0.919 > > The performance data shows it can improve bench-malloc-thread benchmark > by ~10% in single thread and ~8% in multi-thread scenario. Does the speedup come from bypassing these checks at the start of _int_free? size = chunksize (p); /* Little security check which won't hurt performance: the allocator never wraps around at the end of the address space. Therefore we can exclude some size values which might appear here by accident or by "design" from some intruder. */ if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0) || __builtin_expect (misaligned_chunk (p), 0)) malloc_printerr ("free(): invalid pointer"); /* We know that each chunk is at least MINSIZE bytes in size or a multiple of MALLOC_ALIGNMENT. */ if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size))) malloc_printerr ("free(): invalid size"); check_inuse_chunk(av, p); I wonder if it's due to to that, or indeed due to inlining. Thanks, Florian
On 8/28/2024 4:26 PM, Florian Weimer wrote: > * Wangyang Guo: > >> Tcache is an important optimzation to accelerate memory free(), things >> within this code path should be kept as simple as possible. This commit >> try to remove the function call when free() invokes tcache code path. >> >> Result of bench-malloc-thread benchmark >> >> Test Platform: Xeon-8380 >> Ratio: New / Original time_per_iteration (Lower is Better) >> >> Threads# | Ratio >> -----------|------ >> 1 thread | 0.904 >> 4 threads | 0.919 >> >> The performance data shows it can improve bench-malloc-thread benchmark >> by ~10% in single thread and ~8% in multi-thread scenario. > > Does the speedup come from bypassing these checks at the start of > _int_free? > > size = chunksize (p); > > /* Little security check which won't hurt performance: the > allocator never wraps around at the end of the address space. > Therefore we can exclude some size values which might appear > here by accident or by "design" from some intruder. */ > if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0) > || __builtin_expect (misaligned_chunk (p), 0)) > malloc_printerr ("free(): invalid pointer"); > /* We know that each chunk is at least MINSIZE bytes in size or a > multiple of MALLOC_ALIGNMENT. */ > if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size))) > malloc_printerr ("free(): invalid size"); > > check_inuse_chunk(av, p); > > I wonder if it's due to to that, or indeed due to inlining. > > Thanks, > Florian > > These checks are still kept and wrapped as _int_free_check(): size = chunksize (p); _int_free_check (ar_ptr, p, size); if (tcache_free (p, size)) .... The logic is almost the same as original, the only difference is inline. BR Wangyang
* Wangyang Guo: > On 8/28/2024 4:26 PM, Florian Weimer wrote: >> * Wangyang Guo: >> >>> Tcache is an important optimzation to accelerate memory free(), things >>> within this code path should be kept as simple as possible. This commit >>> try to remove the function call when free() invokes tcache code path. >>> >>> Result of bench-malloc-thread benchmark >>> >>> Test Platform: Xeon-8380 >>> Ratio: New / Original time_per_iteration (Lower is Better) >>> >>> Threads# | Ratio >>> -----------|------ >>> 1 thread | 0.904 >>> 4 threads | 0.919 >>> >>> The performance data shows it can improve bench-malloc-thread benchmark >>> by ~10% in single thread and ~8% in multi-thread scenario. >> Does the speedup come from bypassing these checks at the start of >> _int_free? >> size = chunksize (p); >> /* Little security check which won't hurt performance: the >> allocator never wraps around at the end of the address space. >> Therefore we can exclude some size values which might appear >> here by accident or by "design" from some intruder. */ >> if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0) >> || __builtin_expect (misaligned_chunk (p), 0)) >> malloc_printerr ("free(): invalid pointer"); >> /* We know that each chunk is at least MINSIZE bytes in size or a >> multiple of MALLOC_ALIGNMENT. */ >> if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size))) >> malloc_printerr ("free(): invalid size"); >> check_inuse_chunk(av, p); >> I wonder if it's due to to that, or indeed due to inlining. >> Thanks, >> Florian >> > These checks are still kept and wrapped as _int_free_check(): > size = chunksize (p); > _int_free_check (ar_ptr, p, size); > if (tcache_free (p, size)) > .... > The logic is almost the same as original, the only difference is inline. Ahh, missed it. Thanks, Florian
diff --git a/malloc/malloc.c b/malloc/malloc.c index b2373b2212..caa738a96e 100644 --- a/malloc/malloc.c +++ b/malloc/malloc.c @@ -3440,7 +3440,17 @@ __libc_free (void *mem) (void)tag_region (chunk2mem (p), memsize (p)); ar_ptr = arena_for_chunk (p); - _int_free (ar_ptr, p, 0); + INTERNAL_SIZE_T size = chunksize (p); + _int_free_check (ar_ptr, p, size); + +#if USE_TCACHE + if (tcache_free (p, size)) + { + __set_errno (err); + return; + } +#endif + _int_free_chunk (ar_ptr, p, size, 0); } __set_errno (err);
Tcache is an important optimzation to accelerate memory free(), things within this code path should be kept as simple as possible. This commit try to remove the function call when free() invokes tcache code path. Result of bench-malloc-thread benchmark Test Platform: Xeon-8380 Ratio: New / Original time_per_iteration (Lower is Better) Threads# | Ratio -----------|------ 1 thread | 0.904 4 threads | 0.919 The performance data shows it can improve bench-malloc-thread benchmark by ~10% in single thread and ~8% in multi-thread scenario. --- Changes in v2: - _int_free_check() should be put outside of USE_TCACHE. - Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159359.html Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> --- malloc/malloc.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)