Message ID: 20211103150415.1211388-1-hjl.tools@gmail.com
State: New
Series: x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]
On Nov 03 2021, H.J. Lu via Libc-alpha wrote:

> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS 0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret; \
> +     ret = (oldmem == (oldval) \
> +            ? __sync_val_compare_and_swap (mem, oldval, newval) \
> +            : oldmem); \
> +     ret; })

Can't this cause the CAS to fail spuriously if the compiler used some
stale cached value for *mem?

Andreas.
Hi, H.J. Lu

You may not perform plain reads on values you want to be atomic.  This
results in undefined behavior.

For example, the compiler IS NOT obliged to perform the read with a
single CPU instruction -- of course it will not, but it is allowed to
read it in two halves and compare them separately.  Or it may reuse a
cached value from previous evaluations.

This is only the compiler-level issue.  Similar issues will arise at
the CPU level with all the kinds of memory coherency, caching and
instruction reordering.  Or, if the value crossed a cache line
boundary, the plain read might return a half-updated value, with the
part from one cache line being new and the other part being old.

Finally, I'm not following this thread and I have very little knowledge
of the purposes of functions in the library, but the "_acq" suffix
suggests that the memory operation has to exhibit acquire semantics.
I.e., it has to fetch memory updates and make them visible to the
thread.  But this is what you are trying to eliminate with your patch.

On Wed, Nov 3, 2021 at 5:05 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> From the CPU's point of view, getting a cache line for writing is more
> expensive than reading.  See Appendix A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.  Check the current memory value first and
> return immediately if writing cache line may fail to reduce cache line
> bouncing on contended locks.
>
> This fixes BZ# 28537.
> ---
>  sysdeps/x86/atomic-machine.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS 0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret; \
> +     ret = (oldmem == (oldval) \
> +            ? __sync_val_compare_and_swap (mem, oldval, newval) \
> +            : oldmem); \
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ __typeof (*(mem)) old = *(mem); \
> +     int ret; \
> +     if (old != (oldval)) \
> +       ret = 1; \
> +     else \
> +       ret = !__sync_bool_compare_and_swap (mem, oldval, newval); \
> +     ret; })
>
>
>  #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> --
> 2.33.1
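For reference, the "check the value before the cmpxchg" pattern the patch
implements can be expressed with C11 atomics, which sidesteps the
plain-read concerns raised later in this thread: the relaxed atomic load
is a real load each time, and the CAS itself carries the acquire
ordering.  This is only an illustrative sketch with made-up names, not
the glibc code.

```c
#include <stdatomic.h>
#include <assert.h>

/* Sketch of the early-fail CAS: return the current value without
   requesting exclusive cache-line ownership when the compare is bound
   to fail; otherwise fall through to the real acquire CAS.  */
static int
cas_val_acq (atomic_int *mem, int newval, int oldval)
{
  int cur = atomic_load_explicit (mem, memory_order_relaxed);
  if (cur != oldval)
    return cur;                 /* Fail early, line stays Shared.  */
  /* On failure this updates 'oldval' to the observed value, so in
     either case 'oldval' ends up holding the old *mem, matching the
     __sync_val_compare_and_swap return convention.  */
  atomic_compare_exchange_strong_explicit (mem, &oldval, newval,
                                           memory_order_acquire,
                                           memory_order_acquire);
  return oldval;
}
```

Usage mirrors `__sync_val_compare_and_swap`: the caller compares the
return value against the expected old value to see whether the swap
happened.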
* H. J. Lu via Libc-alpha:

> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS 0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret; \
> +     ret = (oldmem == (oldval) \
> +            ? __sync_val_compare_and_swap (mem, oldval, newval) \
> +            : oldmem); \
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ __typeof (*(mem)) old = *(mem); \
> +     int ret; \
> +     if (old != (oldval)) \
> +       ret = 1; \
> +     else \
> +       ret = !__sync_bool_compare_and_swap (mem, oldval, newval); \
> +     ret; })

Shouldn't GCC be fixed to generate the appropriate instruction sequence
for this architecture?  Should we perhaps switch to compiler atomics for
x86 instead of defining our own?

Thanks,
Florian
On 11/3/2021 8:50 AM, Oleh Derevenko wrote:
> Hi, H.J. Lu
>
> You may not perform plain reads on values you want to be atomic.  This
> results in undefined behavior.

So the way the patch works is that it does not DEPEND on that read
being atomic.

What the patch does is first check, non-atomically, whether the actual
atomic operation has a chance of working.  If it has a chance, the
normal atomic operation is done as before.  But if the non-atomic read
already tells you the cmpxchg has no chance to succeed, it errors out
early.

The big gain is for the contended lock case (_acq suffix!).  If there
are, say, 4 threads spinning on a lock, before this patch these 4 CPU
cores would be taking turns bouncing the cacheline around super
aggressively, which causes system degradation and, worse, also makes
the core that will eventually unlock the lock wait for the cacheline.

Now with the patch, the "it is locked already" is noticed before the
cacheline gets taken exclusive, so all 4 spinning cores have the same
cacheline in a shared state -- no pingponging.  The core that is going
to unlock the lock can then do the exclusive acquire of the cacheline
without having to fight with those 4 cores in the exclusive acquire
fight.

> For example, the compiler IS NOT obliged to perform the read with a
> single CPU instruction -- of course it will not, but it is allowed to
> read it in two halves and compare them separately.  Or it may reuse
> cached value from previous evaluations.

> This is only the compiler level issue.  Similar issues will arise at
> CPU level with all the kind of memory coherency, caching and
> instruction reordering.

The CPU in this case won't; the x86 memory model won't allow that (and
this is in the x86 implementation code).

> Or if the value would cross a cache line boundary the plain read might
> return half-updated value with the part from one cache line being new
> and the other part being old.

(I can't say in polite company what cmpxchg across cache lines does.)
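The contended-spin behavior described above is the classic
test-and-test-and-set pattern: waiters spin on a plain atomic load, so
the cache line stays in the Shared state on every spinning core, and the
line-exclusive exchange is only attempted once the lock looks free.  A
minimal C11 sketch (names are illustrative, not glibc's):

```c
#include <stdatomic.h>
#include <assert.h>

typedef struct { atomic_int locked; } ttas_lock;

static void
ttas_acquire (ttas_lock *l)
{
  for (;;)
    {
      /* Read-only spin: no exclusive cache-line ownership requested,
         so many waiters can share the line without ping-ponging.  */
      while (atomic_load_explicit (&l->locked, memory_order_relaxed))
        ;
      /* Lock looks free: now do the expensive exclusive access.  */
      if (!atomic_exchange_explicit (&l->locked, 1, memory_order_acquire))
        return;
    }
}

static void
ttas_release (ttas_lock *l)
{
  atomic_store_explicit (&l->locked, 0, memory_order_release);
}
```

A plain spin on `atomic_exchange` alone would pull the line exclusive on
every iteration of every waiter, which is exactly the bouncing the patch
tries to avoid.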
On Nov 03 2021, Arjan van de Ven via Libc-alpha wrote:

> What the patch does is check non-atomic first if the actual atomic
> operation has a chance of working. if it has a chance, the actual
> normal atomic operation is done as before. But if non-atomic read
> already tells you the cmpxchg has no chance to succeed, it errors
> out early.

But if the compiler keeps reusing the same cached value it may conclude
that the CAS _never_ has a chance to succeed.

Andreas.
On Wed, Nov 3, 2021 at 10:05 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> From the CPU's point of view, getting a cache line for writing is more
> expensive than reading.  See Appendix A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.  Check the current memory value first and
> return immediately if writing cache line may fail to reduce cache line
> bouncing on contended locks.

How does this affect the no-contention case?  AFAIK that is the common
case in most systems.  This seems like something a user can do if their
application can benefit from it, and not something for everyone.

> This fixes BZ# 28537.
> [...]
Arjan,

> What the patch does is check non-atomic first if the actual atomic
> operation has a chance of working. if it has a chance, the actual
> normal atomic operation is done as before. But if non-atomic read
> already tells you the cmpxchg has no chance to succeed, it errors
> out early.

The idea of atomic functions is that they are intended to work fairly
with any type of memory.  In your case, the speculative reads for a
cached device memory may result in cache access only and will prevent
fetching memory updates from the device, thus making the reading thread
"see" the change later than it could.

If you want to make a "RAM-specific" version of compare-n-exchange,
give it a distinct specific name.
On 11/3/2021 10:26 AM, Oleh Derevenko wrote:
> The idea of atomic function is that they are intended to work fairly
> with any type of memory. In your case, the speculative reads for a
> cached device memory may result in cache access only and will prevent
> fetching memory updates from the device, thus making the reading
> thread "see" the change later than it could.

Eh, I am not sure I understand what you say, since cmpxchg uses the
exact same cache protocol/etc. to do its read... it won't go to device
memory either if the cache line is anywhere in the cache hierarchy
(including core-to-core transfers in case another core has it in their
caches).

(And cmpxchg on MMIO space has very, very interesting and unexpected
behavior.  If folks remember the "linux torches your e1000 eeprom" bug
from some years ago, it came from that.)
Arjan,

> eh I am not sure I understand what you say since cmpxchg uses the
> exact same cache protocol/etc to do its read...

Well, if you are sure...  I did not have that information.

One last question, if you permit.  What, in your opinion, are the
reasons they did it this way in the hardware implementation?  This
thing, I mean:

> The full compare and swap will grab the cache line exclusive and
> cause excessive cache line bouncing.

Why was this optimization not initially implemented in the CPU?

Well, I guess I can explain it myself.  Because in the success cases
these extra checks would add to execution time.  And cmpxchg can be
used for many purposes — not only for polling from four threads in
parallel.
On Wed, Nov 3, 2021 at 9:36 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu via Libc-alpha:
>
> [...]
>
> Shouldn't GCC be fixed to generate the appropriate instruction sequence
> for this architecture?  Should we perhaps switch to compiler atomics for
> x86 instead of defining our own?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065

In the meantime, we should help older compilers.
On 11/3/2021 10:17 AM, Andreas Schwab wrote:
> But if the compiler keeps reusing the same cached value it may conclude
> that the CAS _never_ has a chance to succeed.

That's a good point.  Now there SHOULD be a (compiler) barrier around
this somewhere, due to inserting the pause instruction for such a loop.
But putting some explicit compiler barrier in this makes it safer for
sure.
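The "explicit compiler barrier" mentioned here is commonly written in
GNU C as an empty asm with a "memory" clobber: it emits no instructions,
but tells the compiler that memory may have changed, so a value loaded
before the barrier cannot be reused after it.  A sketch of the loop
shape this protects (helper names are ours, not glibc's):

```c
#include <assert.h>

/* Empty asm with a memory clobber: a pure compiler barrier.  */
#define compiler_barrier() __asm__ __volatile__ ("" ::: "memory")

/* Spin until *flag becomes zero.  Without the barrier, a compiler
   that proved *flag unchanged in the loop body could hoist the load
   and spin on a stale register copy forever.  */
static int
spin_until_zero (int *flag)
{
  int spins = 0;
  while (*flag)
    {
      compiler_barrier ();   /* Force *flag to be reloaded.  */
      spins++;
    }
  return spins;
}
```

In glibc terms, `atomic_read_barrier` (or the implicit clobber carried
by the `pause`-containing asm in spin loops) plays this role.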
On 11/3/2021 10:55 AM, Oleh Derevenko wrote:
> What, in your opinion, are the reasons they did it this way in the
> hardware implementation?
> [...]
> Why was this optimization not initially implemented in the CPU?

An instruction that needs the same cacheline for read and maybe
exclusive will get it exclusive; in CPU microarchitecture there usually
isn't some fast way (and we all want locks to be fast, clearly) to do a
two-phase cache protocol on the same line.

> Well, I guess I can explain it myself.  Because in the success cases
> these extra checks would add to execution time.  And cmpxchg can be
> used for many purposes — not only for polling from four threads in
> parallel.
On Wed, Nov 3, 2021 at 12:21 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/3/2021 10:17 AM, Andreas Schwab wrote:
> > But if the compiler keeps reusing the same cached value it may conclude
> > that the CAS _never_ has a chance to succeed.
>
> that's a good point. Now there SHOULD be a (compiler) barrier around
> this somewhere due to inserting the pause instruction for such a loop.
> But putting some explicit compiler barrier in this makes it for sure
> safer.

Like this?

#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ __typeof (*(mem)) oldmem = *(mem), ret; \
     atomic_read_barrier (); \
     ret = (oldmem == (oldval) \
            ? __sync_val_compare_and_swap (mem, oldval, newval) \
            : oldmem); \
     ret; })
#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
  ({ __typeof (*(mem)) oldmem = *(mem); \
     atomic_read_barrier (); \
     int ret; \
     ret = (oldmem == (oldval) \
            ? !__sync_bool_compare_and_swap (mem, oldval, newval) \
            : 1); \
     ret; })
To force the compiler to re-read the value, it is sufficient to take a
pointer to the storage, cast it to volatile, and read via that
pointer-to-volatile.

On Wed, Nov 3, 2021 at 8:00 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> But if the compiler keeps reusing the same cached value it may conclude
> that the CAS _never_ has a chance to succeed.
>
> Andreas.
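The pointer-to-volatile suggestion in miniature (helper name is ours):
reading through a pointer cast to `volatile` forces the compiler to
perform a fresh load on every call, because accesses to volatile
lvalues may not be optimized away or coalesced.

```c
#include <assert.h>

/* Perform one forced load of *p, never a cached register copy.  */
static int
load_once (int *p)
{
  return *(volatile int *) p;
}
```

A spin loop would then call `load_once (mem)` each iteration instead of
dereferencing `mem` directly, so the "CAS can never succeed" conclusion
Andreas describes becomes impossible for the compiler to draw.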
On Wed, Nov 3, 2021 at 1:38 PM Oleh Derevenko <oleh.derevenko@gmail.com> wrote:
>
> To force a compiler to re-read the value it is sufficient to take a
> pointer to the storage, cast it to volatile and read via that
> pointer-to-volatile.

Like this?

#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ volatile __typeof (*(mem)) *memp = (mem); \
     __typeof (*(mem)) oldmem = *memp, ret; \
     ret = (oldmem == (oldval) \
            ? __sync_val_compare_and_swap (mem, oldval, newval) \
            : oldmem); \
     ret; })
#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
  ({ volatile __typeof (*(mem)) *memp = (mem); \
     __typeof (*(mem)) oldmem = *memp; \
     int ret; \
     ret = (oldmem == (oldval) \
            ? !__sync_bool_compare_and_swap (mem, oldval, newval) \
            : 1); \
     ret; })
Hi,

On Thu, Nov 4, 2021 at 12:12 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> Like this?
>
> #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
>   ({ volatile __typeof (*(mem)) *memp = (mem); \
>      __typeof (*(mem)) oldmem = *memp, ret; \
>      ret = (oldmem == (oldval) \
>             ? __sync_val_compare_and_swap (mem, oldval, newval) \
>             : oldmem); \
>      ret; })
> #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
>   ({ volatile __typeof (*(mem)) *memp = (mem); \
>      __typeof (*(mem)) oldmem = *memp; \
>      int ret; \
>      ret = (oldmem == (oldval) \
>             ? !__sync_bool_compare_and_swap (mem, oldval, newval) \
>             : 1); \
>      ret; })

Yes.  The only problem is that I'm not sure whether that extra volatile
will cause a compilation warning/error if the type already had the
volatile modifier in it.

Also, another problem: blindly taking

  __typeof (*(mem)) oldmem = ...

could declare the "oldmem" variable with const and/or volatile
(depending on the actual parameter to the macro) and produce either an
assignment to const (a compile error) or an assignment to volatile (an
unnecessary forced store into memory with a following retrieval from
there).
Hi,

On Thu, Nov 4, 2021 at 10:58 AM Oleh Derevenko <oleh.derevenko@gmail.com> wrote:
> Also, another problem, blindly taking
>   __typeof (*(mem)) oldmem = ...
> could declare the "oldmem" variable with const and/or volatile
> (depending on the actual parameter to the macro) and make either an
> assignment to const (a compile error) or an assignment to volatile
> (unnecessary forced store into memory with following retrieval from
> there).

Well, const cannot appear in the target, of course.  But volatile can.
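One known GNU C idiom for the qualifier problem discussed above: unary
`+` yields an rvalue whose type drops const/volatile (for arithmetic
types), so `__typeof (+*(mem))` names an unqualified type.  The caveat
is that it also integer-promotes types narrower than `int`, so it is
not a drop-in fix for 1- and 2-byte atomics.  A hedged sketch with our
own names:

```c
#include <assert.h>

/* Unqualified (and integer-promoted) version of the expression's
   type: unary '+' performs lvalue conversion, which strips
   const/volatile qualifiers.  */
#define UNQUAL_TYPEOF(expr) __typeof__ (+(expr))

/* Read through a volatile pointer once, then do the arithmetic on a
   plain temporary, with no forced stores/reloads for 'tmp'.  */
static int
bump_volatile (volatile int *p)
{
  UNQUAL_TYPEOF (*p) tmp = *p;  /* 'tmp' is a plain int.  */
  return tmp + 1;
}
```

With this, a macro-local temporary declared from a `volatile`-qualified
macro argument stays an ordinary register variable.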
* H. J. Lu:

>> Shouldn't GCC be fixed to generate the appropriate instruction sequence
>> for this architecture?  Should we perhaps switch to compiler atomics for
>> x86 instead of defining our own?
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
>
> In the meantime, we should help older compilers.

But wouldn't that extra read persist with fixed compilers, due to the
memory barrier?  So that there are two reads?

Maybe we need a hint flag for __atomic_compare_exchange_n, encoded in
the memory order?

Thanks,
Florian
Hi Arjan,

On Wed, Nov 3, 2021 at 7:00 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
> > Or if the value would cross a cache line boundary the plain read might
> > return half-updated value with the part from one cache line being new
> > and the other part being old.
>
> (I can't say in polite company what cmpxchg across cache lines does)

Well, the "IA-32 Intel® Architecture Software Developer's Manual,
Volume 3", section 7.1.2.1, doesn't agree with you:

  The integrity of a bus lock is not affected by the alignment of the
  memory field.  The LOCK semantics are followed for as many bus cycles
  as necessary to update the entire operand.  However, it is recommended
  that locked accesses be aligned on their natural boundaries for better
  system performance.
On 11/4/2021 4:42 AM, Oleh Derevenko wrote:
> Well, the "IA-32 Intel® Architecture Software Developer's Manual,
> Volume 3", section 7.1.2.1, doesn't agree with you:
>
>   The integrity of a bus lock is not affected by the alignment of the
>   memory field.  The LOCK semantics are followed for as many bus cycles
>   as necessary to update the entire operand.

Yes, you can do bus locks; they are functionally correct, unless you
turn on the BIOS/OS option to ban them (at which point they cause a
fault).

What also happens is that the WHOLE SYSTEM (all cores on all sockets)
comes to a screeching halt if you have a pile of bus locks happening to
you (hence the ability to disable them and make them fault).

On a contended lock that is a huge issue, since at that point the core
that holds the lock and wants to unlock it is also put in that very
slow queue of who gets the cachelines.
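The alignment point can be made concrete with a small helper (ours, for
illustration): a locked operation takes the slow bus-lock path exactly
when the operand straddles a cache-line boundary, which naturally
aligned operands never do.  Current x86 cache lines are 64 bytes.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Does an object of 'size' bytes at address 'p' cross a boundary
   between two cache lines of 'line' bytes?  */
static int
crosses_cache_line (const void *p, size_t size, size_t line)
{
  uintptr_t a = (uintptr_t) p;
  return (a / line) != ((a + size - 1) / line);
}
```

An 8-byte operand at offset 60 of a 64-byte line spans two lines and
would bus-lock; the same operand 8-byte aligned never can.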
On Thu, Nov 4, 2021 at 3:15 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> >> Shouldn't GCC be fixed to generate the appropriate instruction sequence
> >> for this architecture?  Should we perhaps switch to compiler atomics for
> >> x86 instead of defining our own?
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
> >
> > In the meantime, we should help older compilers.
>
> But wouldn't that extra read persist with fixed compilers, due to the
> memory barrier?  So that there are two reads?

Yes, this could be a problem.

> Maybe we need a hint flag for __atomic_compare_exchange_n, encoded in
> the memory order?

We need a way to identify the fixed compiler.  We can add a predefined
macro, something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK.
On Thu, Nov 4, 2021 at 7:31 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> We need a way to identify the fixed compiler.  We can add a predefined
> macro, something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK.

Better to define __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK for other
compilers.
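A sketch of how the feature macro proposed above could be consumed.
Note the macro name is the hypothetical one from this discussion, not an
existing predefine, and the fallback uses the thread's volatile-read
idea; the `my_` names are ours.

```c
#include <assert.h>

/* __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK is the *proposed* predefine:
   compilers that already emit the load check themselves would define
   it and get the plain builtin; older compilers get the hand-written
   early check.  Beware: 'oldval' is evaluated more than once here --
   this is only a sketch.  */
#ifdef __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK
# define my_cas_val_acq(mem, newval, oldval) \
  __sync_val_compare_and_swap (mem, oldval, newval)
#else
# define my_cas_val_acq(mem, newval, oldval) \
  ({ __typeof (*(mem)) __old = *(volatile __typeof (*(mem)) *) (mem); \
     __old == (oldval) \
       ? __sync_val_compare_and_swap (mem, oldval, newval) \
       : __old; })
#endif
```

Either branch returns the old value of `*mem`, so callers are unaffected
by which path the compiler selects.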
diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
index 2692d94a92..92c7cf58b7 100644
--- a/sysdeps/x86/atomic-machine.h
+++ b/sysdeps/x86/atomic-machine.h
@@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
 #define ATOMIC_EXCHANGE_USES_CAS 0
 
 #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
-  __sync_val_compare_and_swap (mem, oldval, newval)
+  ({ __typeof (*(mem)) oldmem = *(mem), ret; \
+     ret = (oldmem == (oldval) \
+            ? __sync_val_compare_and_swap (mem, oldval, newval) \
+            : oldmem); \
+     ret; })
 #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
-  (! __sync_bool_compare_and_swap (mem, oldval, newval))
+  ({ __typeof (*(mem)) old = *(mem); \
+     int ret; \
+     if (old != (oldval)) \
+       ret = 1; \
+     else \
+       ret = !__sync_bool_compare_and_swap (mem, oldval, newval); \
+     ret; })
 
 
 #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \