Message ID: 20211111162428.2286605-4-hjl.tools@gmail.com
State: New
Series: Optimize CAS [BZ #28537]
Sorry for letting this one slip...

"H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org> writes:

>       size_t n = narenas;
>       if (__glibc_unlikely (n <= narenas_limit - 1))
>         {
> +         if (atomic_load_relaxed (&narenas) != n)
> +           {
> +             atomic_spin_nop ();
> +             goto repeat;
> +           }
>           if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
>             goto repeat;

I understand that a congested spinloop will benefit from this kind of
change, but... we JUST loaded narenas into n, and adding arenas is
rare.  We probably should have loaded it atomically, but still, we just
loaded it.  The odds of malloc being so congested that we miss the CAS
are essentially (but of course not exactly) zero.  Are we just adding an
unneeded atomic read here?  Do any benchmarks say this would be
beneficial?

Also, the malloc code is already complicated enough.  Are the extra
lines of code and slight reduction in readability justified?

Also, we've been migrating to C11-like atomics; would this patch need
changing for that?

Should target-specific atomics optimizations be "hidden" somewhere in
the atomics implementation?  Just because x86 may benefit from a
pre-read doesn't mean that all targets will, and if x86 generally
benefits, it should update its implementation of the atomics to do that
at a lower level.
>           a = _int_new_arena (size);

> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 095d97a3be..403ffb84ef 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
>        pp = REVEAL_PTR (victim->fd);                                     \
>        if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
>          malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
> +      if (atomic_load_relaxed (fb) != victim)                           \
> +        {                                                               \
> +          atomic_spin_nop ();                                           \
> +          continue;                                                     \
> +        }                                                               \
>      }                                                                   \
>    while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim))   \
>           != victim);                                                    \
> @@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>            malloc_printerr ("double free or corruption (fasttop)");
>          old2 = old;
>          p->fd = PROTECT_PTR (&p->fd, old);
> +        if (atomic_load_relaxed (fb) != old2)
> +          {
> +            atomic_spin_nop ();
> +            continue;
> +          }
>        }
>      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>             != old2);

Likewise here; these paths are less rare, but still not so common that
I'd expect a benefit from the extra code.
diff --git a/malloc/arena.c b/malloc/arena.c
index 78ef4cf18c..e7fbe7c183 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -899,6 +899,11 @@ arena_get2 (size_t size, mstate avoid_arena)
              enough address space to create that many arenas.  */
           if (__glibc_unlikely (n <= narenas_limit - 1))
             {
+              if (atomic_load_relaxed (&narenas) != n)
+                {
+                  atomic_spin_nop ();
+                  goto repeat;
+                }
               if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
                 goto repeat;
               a = _int_new_arena (size);
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 095d97a3be..403ffb84ef 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
       pp = REVEAL_PTR (victim->fd);                                     \
       if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
         malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
+      if (atomic_load_relaxed (fb) != victim)                           \
+        {                                                               \
+          atomic_spin_nop ();                                           \
+          continue;                                                     \
+        }                                                               \
     }                                                                   \
   while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim))   \
          != victim);                                                    \
@@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
           malloc_printerr ("double free or corruption (fasttop)");
         old2 = old;
         p->fd = PROTECT_PTR (&p->fd, old);
+        if (atomic_load_relaxed (fb) != old2)
+          {
+            atomic_spin_nop ();
+            continue;
+          }
       }
     while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
            != old2);