diff mbox series

[v2,2/5] malloc: Avoid func call for tcache quick path in free()

Message ID 20240826025534.472882-3-wangyang.guo@intel.com
State New
Headers show
Series malloc: TCACHE improvement for free and calloc | expand

Commit Message

Guo, Wangyang Aug. 26, 2024, 2:55 a.m. UTC
Tcache is an important optimzation to accelerate memory free(), things
within this code path should be kept as simple as possible. This commit
try to remove the function call when free() invokes tcache code path.

Result of bench-malloc-thread benchmark

Test Platform: Xeon-8380
Ratio: New / Original time_per_iteration (Lower is Better)

Threads#   | Ratio
-----------|------
1 thread   | 0.904
4 threads  | 0.919

The performance data shows it can improve bench-malloc-thread benchmark
by ~10% in single thread and ~8% in multi-thread scenario.

---
Changes in v2:
- _int_free_check() should be put outside of USE_TCACHE.
- Link to v1: https://sourceware.org/pipermail/libc-alpha/2024-August/159359.html

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 malloc/malloc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Florian Weimer Aug. 28, 2024, 8:26 a.m. UTC | #1
* Wangyang Guo:

> Tcache is an important optimzation to accelerate memory free(), things
> within this code path should be kept as simple as possible. This commit
> try to remove the function call when free() invokes tcache code path.
>
> Result of bench-malloc-thread benchmark
>
> Test Platform: Xeon-8380
> Ratio: New / Original time_per_iteration (Lower is Better)
>
> Threads#   | Ratio
> -----------|------
> 1 thread   | 0.904
> 4 threads  | 0.919
>
> The performance data shows it can improve bench-malloc-thread benchmark
> by ~10% in single thread and ~8% in multi-thread scenario.

Does the speedup come from bypassing these checks at the start of
_int_free?

  size = chunksize (p);

  /* Little security check which won't hurt performance: the
     allocator never wraps around at the end of the address space.
     Therefore we can exclude some size values which might appear
     here by accident or by "design" from some intruder.  */
  if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0)
      || __builtin_expect (misaligned_chunk (p), 0))
    malloc_printerr ("free(): invalid pointer");
  /* We know that each chunk is at least MINSIZE bytes in size or a
     multiple of MALLOC_ALIGNMENT.  */
  if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size)))
    malloc_printerr ("free(): invalid size");

  check_inuse_chunk(av, p);

I wonder if it's due to to that, or indeed due to inlining.

Thanks,
Florian
Guo, Wangyang Aug. 28, 2024, 8:38 a.m. UTC | #2
On 8/28/2024 4:26 PM, Florian Weimer wrote:
> * Wangyang Guo:
> 
>> Tcache is an important optimzation to accelerate memory free(), things
>> within this code path should be kept as simple as possible. This commit
>> try to remove the function call when free() invokes tcache code path.
>>
>> Result of bench-malloc-thread benchmark
>>
>> Test Platform: Xeon-8380
>> Ratio: New / Original time_per_iteration (Lower is Better)
>>
>> Threads#   | Ratio
>> -----------|------
>> 1 thread   | 0.904
>> 4 threads  | 0.919
>>
>> The performance data shows it can improve bench-malloc-thread benchmark
>> by ~10% in single thread and ~8% in multi-thread scenario.
> 
> Does the speedup come from bypassing these checks at the start of
> _int_free?
> 
>    size = chunksize (p);
> 
>    /* Little security check which won't hurt performance: the
>       allocator never wraps around at the end of the address space.
>       Therefore we can exclude some size values which might appear
>       here by accident or by "design" from some intruder.  */
>    if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0)
>        || __builtin_expect (misaligned_chunk (p), 0))
>      malloc_printerr ("free(): invalid pointer");
>    /* We know that each chunk is at least MINSIZE bytes in size or a
>       multiple of MALLOC_ALIGNMENT.  */
>    if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size)))
>      malloc_printerr ("free(): invalid size");
> 
>    check_inuse_chunk(av, p);
> 
> I wonder if it's due to to that, or indeed due to inlining.
> 
> Thanks,
> Florian
> 
> 
These checks are still kept and wrapped as _int_free_check():
     size = chunksize (p);
     _int_free_check (ar_ptr, p, size);
     if (tcache_free (p, size))
     ....
The logic is almost the same as original, the only difference is inline.

BR
Wangyang
Florian Weimer Aug. 28, 2024, 9:02 a.m. UTC | #3
* Wangyang Guo:

> On 8/28/2024 4:26 PM, Florian Weimer wrote:
>> * Wangyang Guo:
>> 
>>> Tcache is an important optimzation to accelerate memory free(), things
>>> within this code path should be kept as simple as possible. This commit
>>> try to remove the function call when free() invokes tcache code path.
>>>
>>> Result of bench-malloc-thread benchmark
>>>
>>> Test Platform: Xeon-8380
>>> Ratio: New / Original time_per_iteration (Lower is Better)
>>>
>>> Threads#   | Ratio
>>> -----------|------
>>> 1 thread   | 0.904
>>> 4 threads  | 0.919
>>>
>>> The performance data shows it can improve bench-malloc-thread benchmark
>>> by ~10% in single thread and ~8% in multi-thread scenario.
>> Does the speedup come from bypassing these checks at the start of
>> _int_free?
>>    size = chunksize (p);
>>    /* Little security check which won't hurt performance: the
>>       allocator never wraps around at the end of the address space.
>>       Therefore we can exclude some size values which might appear
>>       here by accident or by "design" from some intruder.  */
>>    if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0)
>>        || __builtin_expect (misaligned_chunk (p), 0))
>>      malloc_printerr ("free(): invalid pointer");
>>    /* We know that each chunk is at least MINSIZE bytes in size or a
>>       multiple of MALLOC_ALIGNMENT.  */
>>    if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size)))
>>      malloc_printerr ("free(): invalid size");
>>    check_inuse_chunk(av, p);
>> I wonder if it's due to to that, or indeed due to inlining.
>> Thanks,
>> Florian
>> 
> These checks are still kept and wrapped as _int_free_check():
>     size = chunksize (p);
>     _int_free_check (ar_ptr, p, size);
>     if (tcache_free (p, size))
>     ....
> The logic is almost the same as original, the only difference is inline.

Ahh, missed it.

Thanks,
Florian
diff mbox series

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index b2373b2212..caa738a96e 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3440,7 +3440,17 @@  __libc_free (void *mem)
       (void)tag_region (chunk2mem (p), memsize (p));
 
       ar_ptr = arena_for_chunk (p);
-      _int_free (ar_ptr, p, 0);
+      INTERNAL_SIZE_T size = chunksize (p);
+      _int_free_check (ar_ptr, p, size);
+
+#if USE_TCACHE
+      if (tcache_free (p, size))
+	{
+	  __set_errno (err);
+	  return;
+	}
+#endif
+      _int_free_chunk (ar_ptr, p, size, 0);
     }
 
   __set_errno (err);