
Fix __atomic to not implement atomic loads with CAS.

Message ID 1485802440.16721.118.camel@redhat.com
State New

Commit Message

Torvald Riegel Jan. 30, 2017, 6:54 p.m. UTC
This patch fixes the __atomic builtins to not implement supposedly
lock-free atomic loads based on just a compare-and-swap operation.

If there is no hardware-backed atomic load for a certain memory
location, the current implementation can implement the load with a CAS
while claiming that the access is lock-free.  This is a bug in the cases
of volatile atomic loads and atomic loads to read-only-mapped memory; it
also creates a lot of contention in case of concurrent atomic loads,
which results in at least counter-intuitive performance because most
users probably understand "lock-free" to mean hardware-backed (and thus
"fast") instead of just in the progress-criteria sense.

This patch implements option 3b of the choices described here:
https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html

In a nutshell, this does change affected accesses to call libatomic
instead of inlining the CAS-based load emulation -- but libatomic will
continue to use the CAS-based approach.  Thus, there's no change in how
the accesses are actually performed (including compatibility with the
__sync builtins, which are not changed) -- but we do make it much easier
to fix this in the future, and we do ensure that less future code will
have the problematic code inlined into it (and thus unfixable).

Second, the return values of __atomic_always_lock_free and
__atomic_is_lock_free are changed to report false for the affected
accesses.  This aims at making users aware of the fact that they don't
get fast hardware-backed performance for the affected accesses.
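
For illustration, here is a small example of the user-visible effect
(not part of the patch; assumes x86_64 with -mcx16 and linking against
libatomic):

    /* Illustrative only.  Build with: gcc -mcx16 example.c -latomic  */
    #include <stdbool.h>
    #include <stdio.h>

    __int128 shared;  /* 16 bytes, naturally aligned per the x86_64 ABI.  */

    int
    main (void)
    {
      /* Before this patch, -mcx16 made this report true and the load
         below was inlined as a cmpxchg16b-based emulation; with the
         patch it reports false and the load becomes a call to
         libatomic's __atomic_load_16 (which for now still uses
         cmpxchg16b).  */
      bool lf = __atomic_is_lock_free (sizeof shared, &shared);
      __int128 val = __atomic_load_n (&shared, __ATOMIC_ACQUIRE);
      printf ("lock-free: %d, low 64 bits: %llu\n",
              (int) lf, (unsigned long long) val);
      return 0;
    }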

This patch does not yet change how OpenMP atomics support is
implemented.  Jakub said he would take care of that.  I suppose one
approach would be to check can_atomic_load_p (introduced by this patch)
to decide in expand_omp_atomic whether to just use the mutex-based
approach; I think that conservatively, OpenMP atomics have to assume
that there could be an atomic load to a particular memory location
elsewhere in the program, so the choice between mutex-backed or not has
to be consistent for a particular memory location.

Thanks to Richard Henderson for providing the libatomic part of this
patch, and thanks to Andrew MacLeod for helping with the compiler parts.

I've tested this on an x86_64-linux bootstrap build and see no
regressions, with the exception of two OpenMP failures, which Jakub
will take care of.  The other failures I saw didn't seem atomics-related
(e.g., OpenACC); I haven't compared against a full bootstrap build and
make check of trunk.

AFAICT, in practice only x86 with -mcx16 (or targets where this option
is implied) is affected.  The other architectures I looked at seemed to
have patterns for true hardware-backed atomic loads whenever they also
had a wider-than-word-sized CAS available.

I know we missed the stage3 deadline with this, unfortunately.  I think
it would be good to have this fix be part of GCC 7 though, because this
would ensure that less future code has the problematic load-via-CAS code
inlined.

Jakub: If this is OK for GCC 7, can you please take care of the OpenMP
bits and commit this?  Changelog entries are in the commit message.

If others could test on other hardware, this would also be appreciated.

Comments

Torvald Riegel Feb. 1, 2017, 5:26 p.m. UTC | #1
On Mon, 2017-01-30 at 19:54 +0100, Torvald Riegel wrote:
> This patch fixes the __atomic builtins to not implement supposedly
> lock-free atomic loads based on just a compare-and-swap operation.

After an off-list OK by Jakub, I have committed this as r245098.
Jakub will take care of the OpenMP side in a follow-up patch.
Thomas Schwinge Feb. 2, 2017, 12:58 p.m. UTC | #2
Hi!

On Mon, 30 Jan 2017 19:54:00 +0100, Torvald Riegel <triegel@redhat.com> wrote:
> This patch fixes the __atomic builtins to not implement supposedly
> lock-free atomic loads based on just a compare-and-swap operation.  [...]

> I've tested this on an x86_64-linux bootstrap build and see no
> regressions.  (With the exception of two OpenMP failures, which Jakub
> will take care of.

These are:

    [-PASS:-]{+FAIL:+} libgomp.c/atomic-2.c (test for excess errors)
    [-PASS:-]{+UNRESOLVED:+} libgomp.c/atomic-2.c [-execution test-]{+compilation failed to produce executable+}

    atomic-2.c:(.text+0x50): undefined reference to `__atomic_load_16'

    [-PASS:-]{+FAIL:+} libgomp.c/atomic-5.c (test for excess errors)
    [-PASS:-]{+UNRESOLVED:+} libgomp.c/atomic-5.c [-execution test-]{+compilation failed to produce executable+}

    atomic-5.c:(.text+0x55): undefined reference to `__atomic_load_16'

(Would've been nice to xfail these as part of your commit, until Jakub
gets to address that.)

> The other failures I saw didn't seem atomics related
> (eg, openacc)

I suppose you're testing without nvptx offloading -- which failures do
you see for OpenACC testing?  (There shouldn't be any for host fallback
testing.)

> I haven't compared it against a full bootstrap build and
> make check of trunk.).

I just happened to do that for x86_64-pc-linux-gnu, and I'm seeing the
following additional changes; posting this just in case that's not
expected to happen:

    -PASS: gcc.dg/atomic-compare-exchange-5.c (test for excess errors)
    -PASS: gcc.dg/atomic-compare-exchange-5.c execution test
    +UNSUPPORTED: gcc.dg/atomic-compare-exchange-5.c

    -PASS: gcc.dg/atomic-exchange-5.c (test for excess errors)
    -PASS: gcc.dg/atomic-exchange-5.c execution test
    +UNSUPPORTED: gcc.dg/atomic-exchange-5.c

    -PASS: gcc.dg/atomic-load-5.c (test for excess errors)
    -PASS: gcc.dg/atomic-load-5.c execution test
    +UNSUPPORTED: gcc.dg/atomic-load-5.c

    -PASS: gcc.dg/atomic-op-5.c (test for excess errors)
    -PASS: gcc.dg/atomic-op-5.c execution test
    +UNSUPPORTED: gcc.dg/atomic-op-5.c

    -PASS: gcc.dg/atomic-store-5.c (test for excess errors)
    -PASS: gcc.dg/atomic-store-5.c execution test
    +UNSUPPORTED: gcc.dg/atomic-store-5.c

    -PASS: gcc.dg/atomic-store-6.c (test for excess errors)
    -PASS: gcc.dg/atomic-store-6.c execution test
    +UNSUPPORTED: gcc.dg/atomic-store-6.c

    -PASS: gcc.dg/simulate-thread/atomic-load-int128.c   -O0 -g  (test for excess errors)
    -PASS: gcc.dg/simulate-thread/atomic-load-int128.c   -O0 -g  thread simulation test
    +UNSUPPORTED: gcc.dg/simulate-thread/atomic-load-int128.c   -O0 -g 
    -PASS: gcc.dg/simulate-thread/atomic-load-int128.c   -O2 -g  (test for excess errors)
    -PASS: gcc.dg/simulate-thread/atomic-load-int128.c   -O2 -g  thread simulation test
    +UNSUPPORTED: gcc.dg/simulate-thread/atomic-load-int128.c   -O2 -g 
    -PASS: gcc.dg/simulate-thread/atomic-load-int128.c   -O3 -g  (test for excess errors)
    -PASS: gcc.dg/simulate-thread/atomic-load-int128.c   -O3 -g  thread simulation test
    +UNSUPPORTED: gcc.dg/simulate-thread/atomic-load-int128.c   -O3 -g 

    -PASS: gcc.dg/simulate-thread/atomic-other-int128.c   -O0 -g  (test for excess errors)
    -PASS: gcc.dg/simulate-thread/atomic-other-int128.c   -O0 -g  thread simulation test
    +UNSUPPORTED: gcc.dg/simulate-thread/atomic-other-int128.c   -O0 -g 
    -PASS: gcc.dg/simulate-thread/atomic-other-int128.c   -O2 -g  (test for excess errors)
    -PASS: gcc.dg/simulate-thread/atomic-other-int128.c   -O2 -g  thread simulation test
    +UNSUPPORTED: gcc.dg/simulate-thread/atomic-other-int128.c   -O2 -g 
    -PASS: gcc.dg/simulate-thread/atomic-other-int128.c   -O3 -g  (test for excess errors)
    -PASS: gcc.dg/simulate-thread/atomic-other-int128.c   -O3 -g  thread simulation test
    +UNSUPPORTED: gcc.dg/simulate-thread/atomic-other-int128.c   -O3 -g 


Regards
 Thomas
Ramana Radhakrishnan Feb. 2, 2017, 2:48 p.m. UTC | #3
On 30/01/17 18:54, Torvald Riegel wrote:
> This patch fixes the __atomic builtins to not implement supposedly
> lock-free atomic loads based on just a compare-and-swap operation.
>
> If there is no hardware-backed atomic load for a certain memory
> location, the current implementation can implement the load with a CAS
> while claiming that the access is lock-free.  This is a bug in the cases
> of volatile atomic loads and atomic loads to read-only-mapped memory; it
> also creates a lot of contention in case of concurrent atomic loads,
> which results in at least counter-intuitive performance because most
> users probably understand "lock-free" to mean hardware-backed (and thus
> "fast") instead of just in the progress-criteria sense.
>
> This patch implements option 3b of the choices described here:
> https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html


Will Deacon pointed me at this thread asking if something similar could 
be done on ARM.

On armv8-a we can implement an atomic load of 16 bytes using an LDXP /
STXP loop, as a 16-byte load isn't single copy atomic.  On armv8.1-a we
do have a CAS on 16 bytes.

This was discussed last year here.

https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html

and the consensus seemed to be that we couldn't really do a CAS on 
readonly data as we would have no way of avoiding a trap.

I'm sure I'm missing some piece of the puzzle, so I'd be interested
in how you avoid this in the x86_64 implementation, given that what I
can see there is a CAS.

regards
Ramana


>
> In a nutshell, this does change affected accesses to call libatomic
> instead of inlining the CAS-based load emulation -- but libatomic will
> continue to do the CAS-based approach.  Thus, there's no change in how
> the changes are actually performed (including compatibility with the
> __sync builtins, which are not changed) -- but we do make it much easier
> to fix this in the future, and we do ensure that less future code will
> have the problematic code inlined into it (and thus unfixable).
>
> Second, the return of __atomic_always_lock_free and
> __atomic_is_lock_free are changed to report false for the affected
> accesses.  This aims at making users aware of the fact that they don't
> get fast hardware-backed performance for the affected accesses.
>
> This patch does not yet change how OpenMP atomics support is
> implemented.  Jakub said he would take care of that.  I suppose one
> approach would be to check can_atomic_load_p (introduced by this patch)
> to decide in expand_omp_atomic whether to just use the mutex-based
> approach; I think that conservatively, OpenMP atomics have to assume
> that there could be an atomic load to a particular memory location
> elsewhere in the program, so the choice between mutex-backed or not has
> to be consistent for a particular memory location.
>
> Thanks to Richard Henderson for providing the libatomic part of this
> patch, and thanks to Andrew MacLeod for helping with the compiler parts.
>
> I've tested this on an x86_64-linux bootstrap build and see no
> regressions.  (With the exception of two OpenMP failures, which Jakub
> will take care of.  The other failures I saw didn't seem atomics related
> (eg, openacc); I haven't compared it against a full bootstrap build and
> make check of trunk.).
>
> AFAICT, in practice only x86 with -mcx16 (or where this is implicitly
> implied) is affected.  The other architectures I looked at seemed to
> have patterns for true hardware-backed atomic loads whenever they also
> had a wider-than-word-sized CAS available.
>
> I know we missed the stage3 deadline with this, unfortunately.  I think
> it would be good to have this fix be part of GCC 7 though, because this
> would ensure that less future code has the problematic load-via-CAS code
> inlined.
>
> Jakub: If this is OK for GCC 7, can you please take care of the OpenMP
> bits and commit this?  Changelog entries are in the commit message.
>
> If others could test on other hardware, this would also be appreciated.
>
Jakub Jelinek Feb. 2, 2017, 2:52 p.m. UTC | #4
On Thu, Feb 02, 2017 at 02:48:42PM +0000, Ramana Radhakrishnan wrote:
> On 30/01/17 18:54, Torvald Riegel wrote:
> > This patch fixes the __atomic builtins to not implement supposedly
> > lock-free atomic loads based on just a compare-and-swap operation.
> > 
> > If there is no hardware-backed atomic load for a certain memory
> > location, the current implementation can implement the load with a CAS
> > while claiming that the access is lock-free.  This is a bug in the cases
> > of volatile atomic loads and atomic loads to read-only-mapped memory; it
> > also creates a lot of contention in case of concurrent atomic loads,
> > which results in at least counter-intuitive performance because most
> > users probably understand "lock-free" to mean hardware-backed (and thus
> > "fast") instead of just in the progress-criteria sense.
> > 
> > This patch implements option 3b of the choices described here:
> > https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> 
> 
> Will Deacon pointed me at this thread asking if something similar could be
> done on ARM.
> 
> On armv8-a we can implement an atomic load of 16 bytes using an LDXP / STXP
> loop as a 16 byte load isnt' single copy atomic. On armv8.1-a we do have a
> CAS on 16 bytes.

If the AArch64 ISA guarantees LDXP is atomic, then yes, you can do that.
The problem we have on x86_64 is that I think neither Intel nor AMD has
given us guarantees that aligned SSE or AVX loads are atomic.

	Jakub
Ramana Radhakrishnan Feb. 2, 2017, 3:14 p.m. UTC | #5
On 02/02/17 14:52, Jakub Jelinek wrote:
> On Thu, Feb 02, 2017 at 02:48:42PM +0000, Ramana Radhakrishnan wrote:
>> On 30/01/17 18:54, Torvald Riegel wrote:
>>> This patch fixes the __atomic builtins to not implement supposedly
>>> lock-free atomic loads based on just a compare-and-swap operation.
>>>
>>> If there is no hardware-backed atomic load for a certain memory
>>> location, the current implementation can implement the load with a CAS
>>> while claiming that the access is lock-free.  This is a bug in the cases
>>> of volatile atomic loads and atomic loads to read-only-mapped memory; it
>>> also creates a lot of contention in case of concurrent atomic loads,
>>> which results in at least counter-intuitive performance because most
>>> users probably understand "lock-free" to mean hardware-backed (and thus
>>> "fast") instead of just in the progress-criteria sense.
>>>
>>> This patch implements option 3b of the choices described here:
>>> https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
>>
>>
>> Will Deacon pointed me at this thread asking if something similar could be
>> done on ARM.
>>
>> On armv8-a we can implement an atomic load of 16 bytes using an LDXP / STXP
>> loop as a 16 byte load isnt' single copy atomic. On armv8.1-a we do have a
>> CAS on 16 bytes.
>
> If the AArch64 ISA guarantees LDXP is atomic, then yes, you can do that.
> The problem we have on x86_64 is that I think neither Intel nor AMD gave us
> guarantees that aligned SSE or AVX loads are guaranteed to be atomic.


LDXP is not single copy atomic.

So this would become something like the following, with appropriate
additional barriers.

You need to write this as a loop of

.retry:
LDXP    x0, x1, [x2]
STXP    w3, x0, x1, [x2]
CBNZ    w3, .retry

You have to do the write again to guarantee atomicity on AArch64.
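
For reference, a compilable sketch of that loop as a GCC inline-asm
helper could look like this (purely illustrative: the function name and
operand constraints are mine, little-endian is assumed, and the barriers
needed for the chosen memory model are omitted):

    #include <stdint.h>

    /* Hypothetical 16-byte "load" on armv8-a via an LDXP/STXP loop.
       Note that it performs a store, which is exactly what makes this
       approach problematic for read-only or volatile-mapped memory.  */
    static inline __uint128_t
    load16_ldxp_stxp (__uint128_t *p)
    {
      uint64_t lo, hi;
      uint32_t fail;

      __asm__ volatile ("0: ldxp %0, %1, %3\n"
                        "stxp %w2, %0, %1, %3\n"
                        "cbnz %w2, 0b"
                        : "=&r" (lo), "=&r" (hi), "=&r" (fail), "+Q" (*p));
      return ((__uint128_t) hi << 64) | lo;
    }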

I missed the first paragraph in this thread on gcc@.

https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html


> I consider this a bug because it can
> result in a store being issued (e.g., when loading from a read-only
> page, or when trying to do a volatile atomic load), and because it can
> increase contention (which makes atomic loads perform much different
> than HW load instructions would).  See the thread "GCC libatomic ABI
> specification draft" for more background.

OK, this means implementing such a change in libatomic for AArch64 will
introduce the bug that Torvald is worried about for x86_64.


regards
Ramana






>
> 	Jakub
>
Torvald Riegel Feb. 2, 2017, 3:21 p.m. UTC | #6
On Thu, 2017-02-02 at 14:48 +0000, Ramana Radhakrishnan wrote:
> On 30/01/17 18:54, Torvald Riegel wrote:
> > This patch fixes the __atomic builtins to not implement supposedly
> > lock-free atomic loads based on just a compare-and-swap operation.
> >
> > If there is no hardware-backed atomic load for a certain memory
> > location, the current implementation can implement the load with a CAS
> > while claiming that the access is lock-free.  This is a bug in the cases
> > of volatile atomic loads and atomic loads to read-only-mapped memory; it
> > also creates a lot of contention in case of concurrent atomic loads,
> > which results in at least counter-intuitive performance because most
> > users probably understand "lock-free" to mean hardware-backed (and thus
> > "fast") instead of just in the progress-criteria sense.
> >
> > This patch implements option 3b of the choices described here:
> > https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> 
> 
> Will Deacon pointed me at this thread asking if something similar could 
> be done on ARM.

It would be nice if someone more familiar with ARM could double-check
that ARM is not affected.  I guess ARM isn't, but that's based on me
looking at machine descriptions, which I hadn't ever done before working
on this patch...

The patch introduces a can_atomic_load_p, which checks that an entry
exists in optabs and the machine descriptions for an atomic load of the
particular mode.

IIRC, arm and aarch64 had atomic load patterns defined whenever they had
a CAS defined.  It would be nice if you could double-check that.  If
that's the case, nothing should change with my patch because
can_atomic_load_p would always return true if a CAS could be issued.
If that's not the case, arm/aarch64 could be in the same boat as x86_64
with cmpxchg16b; then, using Option 3b might be a useful (intermediary)
solution for ARM as well (OTOH, compatibility with existing code and
__sync builtins might perhaps not be as much of a factor as it might be
on x86_64).

The atomic load patterns you have could still be wrong, for example by
generating a LDXP/STXP loop or something else that can store on an
atomic load.  In that case, the problem is similar as not having a
custom load pattern, and the second case in the previous paragraph
applies.

> On armv8-a we can implement an atomic load of 16 bytes using an LDXP / 
> STXP loop as a 16 byte load isnt' single copy atomic. On armv8.1-a we do 
> have a CAS on 16 bytes.
> 
> This was discussed last year here.
> 
> https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html
> 
> and the consensus seemed to be that we couldn't really do a CAS on 
> readonly data as we would have no way of avoiding a trap.

Yes, and I agree that not doing a CAS (or anything that can store on
atomic load) is the right thing to do.

BTW, on the ARM targets, is it possible that the __sync builtins would
use LDXP/STXP on 16-byte accesses and the __atomic loads would fall back
to using libatomic and locks?  Or do you simply not provide 16-byte
__sync builtins?

> I'm sure I'm missing something piece of the puzzle, so I'd be interested 
> in how you avoid this in this implementation on x86_64 with what I can 
> see is a CAS.

If we'd start from scratch, we wouldn't use cmpxchg16b if we don't have
a 16-byte atomic load.  However, we did do that, and there might be code
out there that does have inlined cmpxchg16b.  As discussed in the thread
you cited, changing GCC 7 back to libatomics *and* the use of locks
would be an effective ABI break.  Therefore, my patch (and Option 3b)
makes a compromise and delegates to libatomic but libatomic gets custom
code to actually keep using cmpxchg16b.  That means that the actual
synchronization doesn't change, but we constrain this problem to
libatomic and prevent application code generated with GCC 7 from being
directly affected.

Does this clarify the x86_64 situation?
Torvald Riegel Feb. 2, 2017, 5:37 p.m. UTC | #7
On Thu, 2017-02-02 at 13:58 +0100, Thomas Schwinge wrote:
> > The other failures I saw didn't seem atomics related
> > (eg, openacc)
> 
> I suppose you're testing without nvptx offloading -- which failures do
> you see for OpenACC testing?  (There shouldn't be any for host fallback
> testing.)

Sorry, false alarm.  Those tests print out "FAILED", but they don't
actually fail.

> > I haven't compared it against a full bootstrap build and
> > make check of trunk.).
> 
> I just happened to do that for x86_64-pc-linux-gnu, and I'm seeing the
> following additional changes; posting this just in case that's not
> expected to happen:
> 
>     -PASS: gcc.dg/atomic-compare-exchange-5.c (test for excess errors)
>     -PASS: gcc.dg/atomic-compare-exchange-5.c execution test
>     +UNSUPPORTED: gcc.dg/atomic-compare-exchange-5.c

Those changes are expected because we no longer expect that cmpxchg16b
availability also means that 16-byte atomics will be inlined.
Ramana Radhakrishnan Feb. 3, 2017, 1:44 p.m. UTC | #8
On 02/02/17 15:21, Torvald Riegel wrote:
> On Thu, 2017-02-02 at 14:48 +0000, Ramana Radhakrishnan wrote:
>> On 30/01/17 18:54, Torvald Riegel wrote:
>>> This patch fixes the __atomic builtins to not implement supposedly
>>> lock-free atomic loads based on just a compare-and-swap operation.
>>>
>>> If there is no hardware-backed atomic load for a certain memory
>>> location, the current implementation can implement the load with a CAS
>>> while claiming that the access is lock-free.  This is a bug in the cases
>>> of volatile atomic loads and atomic loads to read-only-mapped memory; it
>>> also creates a lot of contention in case of concurrent atomic loads,
>>> which results in at least counter-intuitive performance because most
>>> users probably understand "lock-free" to mean hardware-backed (and thus
>>> "fast") instead of just in the progress-criteria sense.
>>>
>>> This patch implements option 3b of the choices described here:
>>> https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
>>
>>
>> Will Deacon pointed me at this thread asking if something similar could
>> be done on ARM.
>
> It would be nice if someone more familiar with ARM could double-check
> that ARM is not affected.  I guess ARM isn't, but that's based on me
> looking at machine descriptions, which I hadn't ever done before working
> on this patch...
>

ARM doesn't have __int128 support, so I don't think the problem exists 
there.

On ARM, on architecture levels (i.e. arch < armv6k) that do not have
single copy atomic routines, we end up calling the kernel helper
routines, where the appropriate handling is done by the kernel depending
on whether you are multicore or not.

__atomic_load on ARM appears to be ok as well

except for

__atomic_load_di which should really be the ldrexd / strexd loop but we 
could ameliorate that similar to your option 3b.



On AArch64

* <16 byte loads have always been fine. The architecture allows single 
copy atomic loads using single load instructions for all other sizes and 
memory models, so we are fine there.

* we have gone through the libatomic locks from day one of the port for 
16 byte loads.  This has been a bit of a bugbear for a number of users 
within ARM who would really like to get performance without heavy weight 
locks for 16 byte atomic ops.
We could never switch this around on AArch64 to using the loop (or CAS) 
by default as this would be the reverse issue (i.e. old compilers 
calling libatomic locks and new code using the inlined instruction 
sequence).


My interest in this patch was piqued because if we were to switch 
aarch64 to not use the locks in libatomic and go to a CAS or the loop 
sequence I showed in my reply to Jakub, I don't believe we have an ABI 
issue as I would expect there to be a single copy of libatomic on the 
system and everyone would simply be using that.

However we would end up with the situation of generating stores for 
__atomic_loads as you described above.

We could still in theory do that and gain the same bug for rodata and 
volatile atomic pointers on AArch64 - I don't see a way of avoiding that 
on aarch32 as we have a similar ABI issue to you.


> The patch introduces a can_atomic_load_p, which checks that an entry
> exists in optabs and the machine descriptions for an atomic load of the
> particular mode.
>
> IIRC, arm and aarch64 had atomic load patterns defined whenever they had
> a CAS defined.  It would be nice if you could double-check that.  If
> that's the case, nothing should change with my patch because
> can_atomic_load_p would always return true if a CAS could be issued.
> If that's not the case, arm/aarch64 could be in the same boat as x86_64
> with cmpxchg16b; then, using Option 3b might be a useful (intermediary)
> solution for ARM as well (OTOH, compatibility with existing code and
> __sync builtins might perhaps not be as much of a factor as it might be
> on x86_64).

On AArch64 IIRC only those instructions that are single copy atomic as 
per the architecture are allowed to be atomic loads. Thus we do not 
generate such loops anywhere.


>
> The atomic load patterns you have could still be wrong, for example by
> generating a LDXP/STXP loop or something else that can store on an
> atomic load.  In that case, the problem is similar as not having a
> custom load pattern, and the second case in the previous paragraph
> applies.


>
>> On armv8-a we can implement an atomic load of 16 bytes using an LDXP /
>> STXP loop as a 16 byte load isnt' single copy atomic. On armv8.1-a we do
>> have a CAS on 16 bytes.
>>
>> This was discussed last year here.
>>
>> https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html
>>
>> and the consensus seemed to be that we couldn't really do a CAS on
>> readonly data as we would have no way of avoiding a trap.
>
> Yes, and I agree that not doing a CAS (or anything that can store on
> atomic load) is the right thing to do.
>
> BTW, on the ARM targets, is it possible that the __sync builtins would
> use LDXP/STXP on 16 byte and the __atomic loads would fall back to using
> libatomic and locks?  Or do you simply not provide 16-byte __sync
> builtins?

* Atomic loads always use libatomic and locks for 16 bytes on AArch64.
* There are no sync primitives provided for 16-byte operations on
AArch64, and it's been that way forever.  Thus __sync_fetch_and_add on a
TImode value will cause a link error (a short illustration follows
below).  Arguably that's a bug, but no one seems to have actually
complained so far over the years.
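
To illustrate that asymmetry, here is a minimal example (mine, not from
this thread) for AArch64:

    /* Illustrative only.  Build with: gcc example.c -latomic
       The __atomic load resolves to libatomic's (lock-based)
       __atomic_load_16, whereas the 16-byte __sync builtin has no
       library fallback and fails to link (undefined reference to
       __sync_fetch_and_add_16).  */
    unsigned __int128 x;

    unsigned __int128
    atomic_read (void)
    {
      return __atomic_load_n (&x, __ATOMIC_SEQ_CST);
    }

    unsigned __int128
    sync_add (void)
    {
      return __sync_fetch_and_add (&x, 1);
    }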

>
>> I'm sure I'm missing something piece of the puzzle, so I'd be interested
>> in how you avoid this in this implementation on x86_64 with what I can
>> see is a CAS.
>
> If we'd start from scratch, we wouldn't use cmpxchg16b if we don't have
> a 16-byte atomic load.  However, we did do that, and there might be code
> out there that does have inlined cmpxchg16b.  As discussed in the thread
> you cited, changing GCC 7 back to libatomics *and* the use of locks
> would be an effective ABI break.  Therefore, my patch (and Option 3b)
> makes a compromise and delegates to libatomic but libatomic gets custom
> code to actually keep using cmpxchg16b.  That means that the actual
> synchronization doesn't change, but we constrain this problem to
> libatomic and prevent application code generated with GCC 7 from being
> directly affected.
>
> Does this clarify the x86_64 situation?
>

Thanks for clarifying the x86_64 situation.

I hope this helps in explaining the ARM situation.

regards
Ramana
Torvald Riegel Feb. 3, 2017, 3:07 p.m. UTC | #9
On Fri, 2017-02-03 at 13:44 +0000, Ramana Radhakrishnan wrote:
> __atomic_load on ARM appears to be ok as well
> 
> except for
> 
> __atomic_load_di which should really be the ldrexd / strexd loop but we 
> could ameliorate that similar to your option 3b.

This uses just ldrexd now, and thus is not guaranteed to be atomic?

> On AArch64
> 
> * <16 byte loads have always been fine. The architecture allows single 
> copy atomic loads using single load instructions for all other sizes and 
> memory models, so we are fine there.
> 
> * we have gone through the libatomic locks from day one of the port for 
> 16 byte loads.  This has been a bit of a bugbear for a number of users 
> within ARM who would really like to get performance without heavy weight 
> locks for 16 byte atomic ops.

Would it be acceptable for those users to have loads that perform like
CAS loops, especially under contention?  Or are these users more
concerned about aarch64 not offering a true atomic 16-byte load?

> We could never switch this around on AArch64 to using the loop (or CAS) 
> by default as this would be the reverse issue (i.e. old compilers 
> calling libatomic locks and new code using the inlined instruction 
> sequence).

I think there's an implicit assumption that whenever one uses code
generated with a particular compiler, the libatomic version being used
at runtime is at least as recent as that particular compiler.  In a mix
of old and new code, this would mean that new libatomic has to be used.
However, AFAIK we haven't specified that explicitly yet.  (Richard
Henderson said we'd make similar implicit assumptions for libgcc, if I
understood him correctly.)

If that assumption holds, then switching from locks in libatomic to
inlined instructions is possible, provided that libatomic switches too.

> 
> My interest in this patch was piqued because if we were to switch 
> aarch64 to not use the locks in libatomic and go to a CAS or the loop 
> sequence I showed in my reply to Jakub, I don't believe we have an ABI 
> issue as I would expect there to be a single copy of libatomic on the 
> system and everyone would simply be using that.

Right, and it would have to be at least as recent as the most recent
compiler being used (eg, the system compiler on that system).

> However we would end up with the situation of generating stores for 
> __atomic_loads as you described above.
> 
> We could still in theory do that and gain the same bug for rodata and 
> volatile atomic pointers on AArch64 - I don't see a way of avoiding that 
> on aarch32 as we have a similar ABI issue to you.

Agreed.

> 
> > The patch introduces a can_atomic_load_p, which checks that an entry
> > exists in optabs and the machine descriptions for an atomic load of the
> > particular mode.
> >
> > IIRC, arm and aarch64 had atomic load patterns defined whenever they had
> > a CAS defined.  It would be nice if you could double-check that.  If
> > that's the case, nothing should change with my patch because
> > can_atomic_load_p would always return true if a CAS could be issued.
> > If that's not the case, arm/aarch64 could be in the same boat as x86_64
> > with cmpxchg16b; then, using Option 3b might be a useful (intermediary)
> > solution for ARM as well (OTOH, compatibility with existing code and
> > __sync builtins might perhaps not be as much of a factor as it might be
> > on x86_64).
> 
> On AArch64 IIRC only those instructions that are single copy atomic as 
> per the architecture are allowed to be atomic loads. Thus we do not 
> generate such loops anywhere.

I think the condition we'd have to check is rather that whenever there
is an atomic CAS available, there is also an atomic load available.
That is not quite the same as only having truly atomic loads available,
because a CAS could still be available regardless of the availability
of a matching load.
Jakub Jelinek Feb. 3, 2017, 3:13 p.m. UTC | #10
On Fri, Feb 03, 2017 at 04:07:22PM +0100, Torvald Riegel wrote:
> On Fri, 2017-02-03 at 13:44 +0000, Ramana Radhakrishnan wrote:
> > __atomic_load on ARM appears to be ok as well
> > 
> > except for
> > 
> > __atomic_load_di which should really be the ldrexd / strexd loop but we 
> > could ameliorate that similar to your option 3b.
> 
> This uses just ldrexd now, and thus is not guaranteed to be atomic?
> 
> > On AArch64
> > 
> > * <16 byte loads have always been fine. The architecture allows single 
> > copy atomic loads using single load instructions for all other sizes and 
> > memory models, so we are fine there.
> > 
> > * we have gone through the libatomic locks from day one of the port for 
> > 16 byte loads.  This has been a bit of a bugbear for a number of users 
> > within ARM who would really like to get performance without heavy weight 
> > locks for 16 byte atomic ops.
> 
> Would it be acceptable for those users to have loads that perform like
> CAS loops, especially under contention?  Or are these users more
> concerned about aarch64 not offering a true atomic 16-byte load?

Can the store you need for atomicity be into an automatic var on the stack?
Then there really shouldn't be any contention on that var (especially if it
is padded so the cache line doesn't contain anything else).  Does load
contention on the atomic var matter if there are no stores to it?

	Jakub
Ramana Radhakrishnan Feb. 3, 2017, 4:19 p.m. UTC | #11
On 03/02/17 15:13, Jakub Jelinek wrote:
> On Fri, Feb 03, 2017 at 04:07:22PM +0100, Torvald Riegel wrote:
>> On Fri, 2017-02-03 at 13:44 +0000, Ramana Radhakrishnan wrote:
>>> __atomic_load on ARM appears to be ok as well
>>>
>>> except for
>>>
>>> __atomic_load_di which should really be the ldrexd / strexd loop but we
>>> could ameliorate that similar to your option 3b.
>>
>> This uses just ldrexd now, and thus is not guaranteed to be atomic?
>>
>>> On AArch64
>>>
>>> * <16 byte loads have always been fine. The architecture allows single
>>> copy atomic loads using single load instructions for all other sizes and
>>> memory models, so we are fine there.
>>>
>>> * we have gone through the libatomic locks from day one of the port for
>>> 16 byte loads.  This has been a bit of a bugbear for a number of users
>>> within ARM who would really like to get performance without heavy weight
>>> locks for 16 byte atomic ops.
>>
>> Would it be acceptable for those users to have loads that perform like
>> CAS loops, especially under contention?  Or are these users more
>> concerned about aarch64 not offering a true atomic 16-byte load?
>
> Can the store you need for atomicity be into an automatic var on the stack?

No, it has to be to the same location.

Ramana
Jakub Jelinek Feb. 3, 2017, 4:21 p.m. UTC | #12
On Fri, Feb 03, 2017 at 04:19:58PM +0000, Ramana Radhakrishnan wrote:
> > > Would it be acceptable for those users to have loads that perform like
> > > CAS loops, especially under contention?  Or are these users more
> > > concerned about aarch64 not offering a true atomic 16-byte load?
> > 
> > Can the store you need for atomicity be into an automatic var on the stack?
> 
> No, it has to be to the same location.

But then it is the same problem as using cmpxchg16b on x86_64: the
location could be read-only, or it could be too slow for what users
expect from an atomic load.

	Jakub
Torvald Riegel Feb. 3, 2017, 4:27 p.m. UTC | #13
On Fri, 2017-02-03 at 17:21 +0100, Jakub Jelinek wrote:
> On Fri, Feb 03, 2017 at 04:19:58PM +0000, Ramana Radhakrishnan wrote:
> > > > Would it be acceptable for those users to have loads that perform like
> > > > CAS loops, especially under contention?  Or are these users more
> > > > concerned about aarch64 not offering a true atomic 16-byte load?
> > > 
> > > Can the store you need for atomicity be into an automatic var on the stack?
> > 
> > No, it has to be to the same location.
> 
> But then it is the same problem as using cmpxchg16b on x86_64, the location
> could be read-only, or that it is too slow otherwise for what users expect
> for atomic load.

It would be the same problem.

I was merely interested in the needs and concerns of those users that
Ramana mentioned, regardless of whether these needs could be addressed
in the scope of the __atomic builtins.

For example, if those users just need fast atomic read-modify-write
operations but not actually pure loads in their use cases (e.g.,
reductions in a parallel workload), then something other than __atomic
could provide that.
Richard Henderson Feb. 6, 2017, 8:19 p.m. UTC | #14
On 02/03/2017 05:44 AM, Ramana Radhakrishnan wrote:
> On 02/02/17 15:21, Torvald Riegel wrote:
>> On Thu, 2017-02-02 at 14:48 +0000, Ramana Radhakrishnan wrote:
>>> On 30/01/17 18:54, Torvald Riegel wrote:
>>>> This patch fixes the __atomic builtins to not implement supposedly
>>>> lock-free atomic loads based on just a compare-and-swap operation.
>>>>
>>>> If there is no hardware-backed atomic load for a certain memory
>>>> location, the current implementation can implement the load with a CAS
>>>> while claiming that the access is lock-free.  This is a bug in the cases
>>>> of volatile atomic loads and atomic loads to read-only-mapped memory; it
>>>> also creates a lot of contention in case of concurrent atomic loads,
>>>> which results in at least counter-intuitive performance because most
>>>> users probably understand "lock-free" to mean hardware-backed (and thus
>>>> "fast") instead of just in the progress-criteria sense.
>>>>
>>>> This patch implements option 3b of the choices described here:
>>>> https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
>>>
>>>
>>> Will Deacon pointed me at this thread asking if something similar could
>>> be done on ARM.
>>
>> It would be nice if someone more familiar with ARM could double-check
>> that ARM is not affected.  I guess ARM isn't, but that's based on me
>> looking at machine descriptions, which I hadn't ever done before working
>> on this patch...
>>
>
> ARM doesn't have __int128 support, so I don't think the problem exists there.
>
> On ARM, on architecture levels (i.e arch < armv6k) that do not have single copy
> atomic routines we end up with calling the kernel helper routines where the
> appropriate handling is done by the kernel depending on whether you are
> multicore or not.
>
> __atomic_load on ARM appears to be ok as well
>
> except for
>
> __atomic_load_di which should really be the ldrexd / strexd loop but we could
> ameliorate that similar to your option 3b.

No, look again.  ldrexd has 64-bit single-copy semantics WITHOUT requiring the 
strexd.  It's only the AArch64 (64-bit) LDXP that requires the store.


r~
Tom de Vries April 11, 2018, 9:38 a.m. UTC | #15
On 01/30/2017 07:54 PM, Torvald Riegel wrote:
> This patch fixes the __atomic builtins to not implement supposedly
> lock-free atomic loads based on just a compare-and-swap operation.

Hi,

The internals doc still lists CAS
(https://gcc.gnu.org/onlinedocs/gccint/Standard-Names.html#index-atomic_005floadmode-instruction-pattern):
...
‘atomic_loadmode’

     This pattern implements an atomic load operation with memory model
     semantics.  Operand 1 is the memory address being loaded from.
     Operand 0 is the result of the load.  Operand 2 is the memory model
     to be used for the load operation.

     If not present, the __atomic_load built-in function will either
     resort to a normal load with memory barriers, or a compare-and-swap
     operation if a normal load would not be atomic.
...

Thanks,
- Tom

Patch

commit 1db13cb386e673d5265bcaf2d70fc25dda22e5fd
Author: Torvald Riegel <triegel@redhat.com>
Date:   Fri Jan 27 17:40:44 2017 +0100

    Fix __atomic to not implement atomic loads with CAS.
    
    gcc/
    	* builtins.c (fold_builtin_atomic_always_lock_free): Make "lock-free"
    	conditional on existence of a fast atomic load.
    	* optabs-query.c (can_atomic_load_p): New function.
    	* optabs-query.h (can_atomic_load_p): Declare it.
    	* optabs.c (expand_atomic_exchange): Always delegate to libatomic if
    	no fast atomic load is available for the particular size of access.
    	(expand_atomic_compare_and_swap): Likewise.
    	(expand_atomic_load): Likewise.
    	(expand_atomic_store): Likewise.
    	(expand_atomic_fetch_op): Likewise.
    	* testsuite/lib/target-supports.exp
    	(check_effective_target_sync_int_128): Remove x86 because it provides
    	no fast atomic load.
    	(check_effective_target_sync_int_128_runtime): Likewise.
    
    libatomic/
    	* acinclude.m4: Add #define FAST_ATOMIC_LDST_*.
    	* auto-config.h.in: Regenerate.
    	* config/x86/host-config.h (FAST_ATOMIC_LDST_16): Define to 0.
    	(atomic_compare_exchange_n): New.
    	* glfree.c (EXACT, LARGER): Change condition and add comments.

diff --git a/gcc/builtins.c b/gcc/builtins.c
index bf68e31..0a0e8b9 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6157,8 +6157,9 @@  fold_builtin_atomic_always_lock_free (tree arg0, tree arg1)
 
   /* Check if a compare_and_swap pattern exists for the mode which represents
      the required size.  The pattern is not allowed to fail, so the existence
-     of the pattern indicates support is present.  */
-  if (can_compare_and_swap_p (mode, true))
+     of the pattern indicates support is present.  Also require that an
+     atomic load exists for the required size.  */
+  if (can_compare_and_swap_p (mode, true) && can_atomic_load_p (mode))
     return boolean_true_node;
   else
     return boolean_false_node;
diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
index 6c34a4e..4899333 100644
--- a/gcc/optabs-query.c
+++ b/gcc/optabs-query.c
@@ -584,6 +584,25 @@  can_atomic_exchange_p (machine_mode mode, bool allow_libcall)
   return can_compare_and_swap_p (mode, allow_libcall);
 }
 
+/* Return true if an atomic load can be performed without falling back to
+   a compare-and-swap.  */
+
+bool
+can_atomic_load_p (machine_mode mode)
+{
+  enum insn_code icode;
+
+  /* Does the target support the load directly?  */
+  icode = direct_optab_handler (atomic_load_optab, mode);
+  if (icode != CODE_FOR_nothing)
+    return true;
+
+  /* If the size of the object is greater than word size on this target,
+     then we assume that a load will not be atomic.  Also see
+     expand_atomic_load.  */
+  return GET_MODE_PRECISION (mode) <= BITS_PER_WORD;
+}
+
 /* Determine whether "1 << x" is relatively cheap in word_mode.  */
 
 bool
diff --git a/gcc/optabs-query.h b/gcc/optabs-query.h
index a80a0e7..e85a7f1 100644
--- a/gcc/optabs-query.h
+++ b/gcc/optabs-query.h
@@ -176,6 +176,7 @@  int can_mult_highpart_p (machine_mode, bool);
 bool can_vec_mask_load_store_p (machine_mode, machine_mode, bool);
 bool can_compare_and_swap_p (machine_mode, bool);
 bool can_atomic_exchange_p (machine_mode, bool);
+bool can_atomic_load_p (machine_mode);
 bool lshift_cheap_p (bool);
 
 #endif
diff --git a/gcc/optabs.c b/gcc/optabs.c
index d8831a8..1afd593 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -6086,8 +6086,15 @@  expand_atomic_test_and_set (rtx target, rtx mem, enum memmodel model)
 rtx
 expand_atomic_exchange (rtx target, rtx mem, rtx val, enum memmodel model)
 {
+  machine_mode mode = GET_MODE (mem);
   rtx ret;
 
+  /* If loads are not atomic for the required size and we are not called to
+     provide a __sync builtin, do not do anything so that we stay consistent
+     with atomic loads of the same size.  */
+  if (!can_atomic_load_p (mode) && !is_mm_sync (model))
+    return NULL_RTX;
+
   ret = maybe_emit_atomic_exchange (target, mem, val, model);
 
   /* Next try a compare-and-swap loop for the exchange.  */
@@ -6121,6 +6128,12 @@  expand_atomic_compare_and_swap (rtx *ptarget_bool, rtx *ptarget_oval,
   rtx target_oval, target_bool = NULL_RTX;
   rtx libfunc;
 
+  /* If loads are not atomic for the required size and we are not called to
+     provide a __sync builtin, do not do anything so that we stay consistent
+     with atomic loads of the same size.  */
+  if (!can_atomic_load_p (mode) && !is_mm_sync (succ_model))
+    return false;
+
   /* Load expected into a register for the compare and swap.  */
   if (MEM_P (expected))
     expected = copy_to_reg (expected);
@@ -6316,19 +6329,13 @@  expand_atomic_load (rtx target, rtx mem, enum memmodel model)
     }
 
   /* If the size of the object is greater than word size on this target,
-     then we assume that a load will not be atomic.  */
+     then we assume that a load will not be atomic.  We could try to
+     emulate a load with a compare-and-swap operation, but the store that
+     doing this could result in would be incorrect if this is a volatile
+     atomic load or targeting read-only-mapped memory.  */
   if (GET_MODE_PRECISION (mode) > BITS_PER_WORD)
-    {
-      /* Issue val = compare_and_swap (mem, 0, 0).
-	 This may cause the occasional harmless store of 0 when the value is
-	 already 0, but it seems to be OK according to the standards guys.  */
-      if (expand_atomic_compare_and_swap (NULL, &target, mem, const0_rtx,
-					  const0_rtx, false, model, model))
-	return target;
-      else
-      /* Otherwise there is no atomic load, leave the library call.  */
-        return NULL_RTX;
-    }
+    /* If there is no atomic load, leave the library call.  */
+    return NULL_RTX;
 
   /* Otherwise assume loads are atomic, and emit the proper barriers.  */
   if (!target || target == const0_rtx)
@@ -6370,7 +6377,9 @@  expand_atomic_store (rtx mem, rtx val, enum memmodel model, bool use_release)
 	return const0_rtx;
     }
 
-  /* If using __sync_lock_release is a viable alternative, try it.  */
+  /* If using __sync_lock_release is a viable alternative, try it.
+     Note that this will not be set to true if we are expanding a generic
+     __atomic_store_n.  */
   if (use_release)
     {
       icode = direct_optab_handler (sync_lock_release_optab, mode);
@@ -6389,16 +6398,22 @@  expand_atomic_store (rtx mem, rtx val, enum memmodel model, bool use_release)
     }
 
   /* If the size of the object is greater than word size on this target,
-     a default store will not be atomic, Try a mem_exchange and throw away
-     the result.  If that doesn't work, don't do anything.  */
+     a default store will not be atomic.  */
   if (GET_MODE_PRECISION (mode) > BITS_PER_WORD)
     {
-      rtx target = maybe_emit_atomic_exchange (NULL_RTX, mem, val, model);
-      if (!target)
-        target = maybe_emit_compare_and_swap_exchange_loop (NULL_RTX, mem, val);
-      if (target)
-        return const0_rtx;
-      else
+      /* If loads are atomic or we are called to provide a __sync builtin,
+	 we can try an atomic_exchange and throw away the result.  Otherwise,
+	 don't do anything so that we do not create an inconsistency between
+	 loads and stores.  */
+      if (can_atomic_load_p (mode) || is_mm_sync (model))
+	{
+	  rtx target = maybe_emit_atomic_exchange (NULL_RTX, mem, val, model);
+	  if (!target)
+	    target = maybe_emit_compare_and_swap_exchange_loop (NULL_RTX, mem,
+								val);
+	  if (target)
+	    return const0_rtx;
+	}
         return NULL_RTX;
     }
 
@@ -6713,6 +6728,12 @@  expand_atomic_fetch_op (rtx target, rtx mem, rtx val, enum rtx_code code,
   rtx result;
   bool unused_result = (target == const0_rtx);
 
+  /* If loads are not atomic for the required size and we are not called to
+     provide a __sync builtin, do not do anything so that we stay consistent
+     with atomic loads of the same size.  */
+  if (!can_atomic_load_p (mode) && !is_mm_sync (model))
+    return NULL_RTX;
+
   result = expand_atomic_fetch_op_no_fallback (target, mem, val, code, model,
 					       after);
   
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 95a1c50..7a26008 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6514,9 +6514,7 @@  proc check_effective_target_section_anchors { } {
 # Return 1 if the target supports atomic operations on "int_128" values.
 
 proc check_effective_target_sync_int_128 { } {
-    if { (([istarget i?86-*-*] || [istarget x86_64-*-*])
-	  && ![is-effective-target ia32])
-	 || [istarget spu-*-*] } {
+    if { [istarget spu-*-*] } {
 	return 1
     } else {
 	return 0
@@ -6525,23 +6523,10 @@  proc check_effective_target_sync_int_128 { } {
 
 # Return 1 if the target supports atomic operations on "int_128" values
 # and can execute them.
+# This requires support for both compare-and-swap and true atomic loads.
 
 proc check_effective_target_sync_int_128_runtime { } {
-    if { (([istarget i?86-*-*] || [istarget x86_64-*-*])
-	  && ![is-effective-target ia32]
-	  && [check_cached_effective_target sync_int_128_available {
-	      check_runtime_nocache sync_int_128_available {
-		  #include "cpuid.h"
-		  int main ()
-		  {
-		      unsigned int eax, ebx, ecx, edx;
-		      if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
-			return !(ecx & bit_CMPXCHG16B);
-		      return 1;
-		  }
-	      } ""
-	  }])
-	 || [istarget spu-*-*] } {
+    if { [istarget spu-*-*] } {
 	return 1
     } else {
 	return 0
diff --git a/libatomic/acinclude.m4 b/libatomic/acinclude.m4
index a86e52b..485d731 100644
--- a/libatomic/acinclude.m4
+++ b/libatomic/acinclude.m4
@@ -96,6 +96,7 @@  AC_DEFUN([LIBAT_HAVE_ATOMIC_LOADSTORE],[
   LIBAT_DEFINE_YESNO([HAVE_ATOMIC_LDST_$2], [$libat_cv_have_at_ldst_$2],
 	[Have __atomic_load/store for $2 byte integers.])
   AH_BOTTOM([#define MAYBE_HAVE_ATOMIC_LDST_$2 HAVE_ATOMIC_LDST_$2])
+  AH_BOTTOM([#define FAST_ATOMIC_LDST_$2 HAVE_ATOMIC_LDST_$2])
 ])
 
 dnl
diff --git a/libatomic/auto-config.h.in b/libatomic/auto-config.h.in
index 83e54e2..d5b8a26 100644
--- a/libatomic/auto-config.h.in
+++ b/libatomic/auto-config.h.in
@@ -222,6 +222,16 @@ 
 
 #define MAYBE_HAVE_ATOMIC_LDST_1 HAVE_ATOMIC_LDST_1
 
+#define FAST_ATOMIC_LDST_16 HAVE_ATOMIC_LDST_16
+
+#define MAYBE_HAVE_ATOMIC_TAS_1 HAVE_ATOMIC_TAS_1
+
+#define MAYBE_HAVE_ATOMIC_TAS_2 HAVE_ATOMIC_TAS_2
+
+#define MAYBE_HAVE_ATOMIC_TAS_4 HAVE_ATOMIC_TAS_4
+
+#define MAYBE_HAVE_ATOMIC_TAS_8 HAVE_ATOMIC_TAS_8
+
 #define MAYBE_HAVE_ATOMIC_TAS_16 HAVE_ATOMIC_TAS_16
 
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_1 HAVE_ATOMIC_EXCHANGE_1
@@ -232,6 +242,8 @@ 
 
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_8 HAVE_ATOMIC_EXCHANGE_8
 
+#define FAST_ATOMIC_LDST_1 HAVE_ATOMIC_LDST_1
+
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_16 HAVE_ATOMIC_EXCHANGE_16
 
 #define MAYBE_HAVE_ATOMIC_CAS_1 HAVE_ATOMIC_CAS_1
@@ -242,8 +254,6 @@ 
 
 #define MAYBE_HAVE_ATOMIC_CAS_8 HAVE_ATOMIC_CAS_8
 
-#define MAYBE_HAVE_ATOMIC_LDST_2 HAVE_ATOMIC_LDST_2
-
 #define MAYBE_HAVE_ATOMIC_CAS_16 HAVE_ATOMIC_CAS_16
 
 #define MAYBE_HAVE_ATOMIC_FETCH_ADD_1 HAVE_ATOMIC_FETCH_ADD_1
@@ -254,6 +264,8 @@ 
 
 #define MAYBE_HAVE_ATOMIC_FETCH_ADD_8 HAVE_ATOMIC_FETCH_ADD_8
 
+#define MAYBE_HAVE_ATOMIC_LDST_2 HAVE_ATOMIC_LDST_2
+
 #define MAYBE_HAVE_ATOMIC_FETCH_ADD_16 HAVE_ATOMIC_FETCH_ADD_16
 
 #define MAYBE_HAVE_ATOMIC_FETCH_OP_1 HAVE_ATOMIC_FETCH_OP_1
@@ -264,22 +276,20 @@ 
 
 #define MAYBE_HAVE_ATOMIC_FETCH_OP_8 HAVE_ATOMIC_FETCH_OP_8
 
-#define MAYBE_HAVE_ATOMIC_LDST_4 HAVE_ATOMIC_LDST_4
-
 #define MAYBE_HAVE_ATOMIC_FETCH_OP_16 HAVE_ATOMIC_FETCH_OP_16
 
 #ifndef WORDS_BIGENDIAN
 #define WORDS_BIGENDIAN 0
 #endif
 
-#define MAYBE_HAVE_ATOMIC_LDST_8 HAVE_ATOMIC_LDST_8
+#define FAST_ATOMIC_LDST_2 HAVE_ATOMIC_LDST_2
 
-#define MAYBE_HAVE_ATOMIC_LDST_16 HAVE_ATOMIC_LDST_16
+#define MAYBE_HAVE_ATOMIC_LDST_4 HAVE_ATOMIC_LDST_4
 
-#define MAYBE_HAVE_ATOMIC_TAS_1 HAVE_ATOMIC_TAS_1
+#define FAST_ATOMIC_LDST_4 HAVE_ATOMIC_LDST_4
 
-#define MAYBE_HAVE_ATOMIC_TAS_2 HAVE_ATOMIC_TAS_2
+#define MAYBE_HAVE_ATOMIC_LDST_8 HAVE_ATOMIC_LDST_8
 
-#define MAYBE_HAVE_ATOMIC_TAS_4 HAVE_ATOMIC_TAS_4
+#define FAST_ATOMIC_LDST_8 HAVE_ATOMIC_LDST_8
 
-#define MAYBE_HAVE_ATOMIC_TAS_8 HAVE_ATOMIC_TAS_8
+#define MAYBE_HAVE_ATOMIC_LDST_16 HAVE_ATOMIC_LDST_16
diff --git a/libatomic/config/x86/host-config.h b/libatomic/config/x86/host-config.h
index 5754db4..2e9f85a 100644
--- a/libatomic/config/x86/host-config.h
+++ b/libatomic/config/x86/host-config.h
@@ -47,6 +47,9 @@  extern unsigned int libat_feat1_edx HIDDEN;
 # define MAYBE_HAVE_ATOMIC_EXCHANGE_16	IFUNC_COND_1
 # undef MAYBE_HAVE_ATOMIC_LDST_16
 # define MAYBE_HAVE_ATOMIC_LDST_16	IFUNC_COND_1
+/* Since load and store are implemented with CAS, they are not fast.  */
+# undef FAST_ATOMIC_LDST_16
+# define FAST_ATOMIC_LDST_16		0
 # if IFUNC_ALT == 1
 #  undef HAVE_ATOMIC_CAS_16
 #  define HAVE_ATOMIC_CAS_16 1
@@ -64,6 +67,21 @@  extern unsigned int libat_feat1_edx HIDDEN;
 # endif
 #endif
 
+#if defined(__x86_64__) && N == 16 && IFUNC_ALT == 1
+static inline bool
+atomic_compare_exchange_n (UTYPE *mptr, UTYPE *eptr, UTYPE newval,
+                           bool weak_p UNUSED, int sm UNUSED, int fm UNUSED)
+{
+  UTYPE cmpval = *eptr;
+  UTYPE oldval = __sync_val_compare_and_swap_16 (mptr, cmpval, newval);
+  if (oldval == cmpval)
+    return true;
+  *eptr = oldval;
+  return false;
+}
+# define atomic_compare_exchange_n atomic_compare_exchange_n
+#endif /* Have CAS 16 */
+
 #endif /* HAVE_IFUNC */
 
 #include_next <host-config.h>
diff --git a/libatomic/glfree.c b/libatomic/glfree.c
index b68dec7..59fe533 100644
--- a/libatomic/glfree.c
+++ b/libatomic/glfree.c
@@ -24,26 +24,41 @@ 
 
 #include "libatomic_i.h"
 
-
+/* Accesses with a power-of-two size are not lock-free if we don't have an
+   integer type of this size or if they are not naturally aligned.  They
+   are lock-free if such a naturally aligned access is always lock-free
+   according to the compiler, which requires that both atomic loads and CAS
+   are available.
+   In all other cases, we fall through to LARGER (see below).  */
 #define EXACT(N)						\
   do {								\
     if (!C2(HAVE_INT,N)) break;					\
     if ((uintptr_t)ptr & (N - 1)) break;			\
     if (__atomic_always_lock_free(N, 0)) return true;		\
-    if (C2(MAYBE_HAVE_ATOMIC_CAS_,N)) return true;		\
+    if (!C2(MAYBE_HAVE_ATOMIC_CAS_,N)) break;			\
+    if (C2(FAST_ATOMIC_LDST_,N)) return true;			\
   } while (0)
 
 
+/* We next check to see if an access of a larger size is lock-free.  We use
+   a similar check as in EXACT, except that we also check that the alignment
+   of the access is so that the data to be accessed is completely covered
+   by the larger access.  */
 #define LARGER(N)						\
   do {								\
     uintptr_t r = (uintptr_t)ptr & (N - 1);			\
     if (!C2(HAVE_INT,N)) break;					\
-    if (!C2(HAVE_ATOMIC_LDST_,N)) break;			\
+    if (!C2(FAST_ATOMIC_LDST_,N)) break;			\
     if (!C2(MAYBE_HAVE_ATOMIC_CAS_,N)) break;			\
     if (r + n <= N) return true;				\
   } while (0)
 
 
+/* Note that this can return that a size/alignment is not lock-free even if
+   all the operations that we use to implement the respective accesses provide
+   lock-free forward progress as specified in C++14:  Users likely expect
+   "lock-free" to also mean "fast", which is why we do not return true if, for
+   example, we implement loads with this size/alignment using a CAS.  */
 bool
 libat_is_lock_free (size_t n, void *ptr)
 {