This is the mail archive of the
mailing list for the GCC project.
Re: -mcx16 vs. not using CAS for atomic loads
- From: Richard Henderson <rth at redhat dot com>
- To: Torvald Riegel <triegel at redhat dot com>, GCC <gcc at gcc dot gnu dot org>
- Cc: Bin Fan <bin dot x dot fan at oracle dot com>
- Date: Fri, 20 Jan 2017 09:55:04 -0800
- Subject: Re: -mcx16 vs. not using CAS for atomic loads
On 01/19/2017 10:23 AM, Torvald Riegel wrote:
> * Option 3a:
> -mcx16 continues to only mean that cmpxchg16b is available, and we keep
> __sync builtins unchanged. This doesn't break valid uses of __sync*
> (eg, if they didn't need atomic loads at all).
> We change __atomic for 16-byte to not use cmpxchg16b but to instead call
> out to libatomic. libatomic would continue to use cmpxchg16b
> internally. We retain compatibility between __atomic and __sync. We do
> not change __atomic_*_lock_free.
> This does not fix the load-via-cmpxchg bug, but makes sure that we
> reroute through libatomic early for the __atomic builtins, so that it
> becomes easier in the future to either do something like Option 2 or
> Option 3c. Until then, nothing would really change.
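
For readers unfamiliar with the "load-via-cmpxchg" technique mentioned above, here is a minimal sketch of its shape. It is illustrative only: the names `word_t`, `load_via_cas` and `load_plain` are made up for this example, and a word-sized type stands in for the 16-byte case so that the sketch builds without -mcx16 or libatomic. The point is that a compare-and-swap used as a load needs a writable location, whereas a true atomic load does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Word-sized stand-in for a 16-byte object (assumption made only so
   that this sketch needs neither -mcx16 nor libatomic). */
typedef unsigned long word_t;

/* "Load" built from compare-and-swap: try to replace 0 with 0.
   If *p is 0, the CAS succeeds and stores 0 again. Otherwise it
   fails and copies the current value of *p into 'expected'.
   In both cases 'expected' ends up holding the current value, but
   the pointer must be writable because a CAS is a store operation. */
static word_t load_via_cas(word_t *p)
{
    word_t expected = 0;
    __atomic_compare_exchange_n(p, &expected, 0, false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* A true atomic load: accepts a const pointer and never writes. */
static word_t load_plain(const word_t *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}
```

Both functions return the same value for a writable object. The difference only shows up for read-only memory, or under contention when many readers each perform the store-like operation.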
> * Option 3b:
> Like Option 3a, except that __atomic_*_lock_free return false for 16
> bytes. The benefit over 3a is that this stops advertising "fast"
> atomics when that is arguably not the case because the loads are slowed
> down by contention (I assume a lot more users read "lock-free" as "fast"
> instead of thinking about progress conditions). The potential downside
> is that programs may exist that assert(__atomic_always_lock_free(16,0));
> these assertions would fail, although the rest of the program would
> continue to work.
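
As a rough illustration of the assertion scenario above, a program could branch on the builtin instead of asserting on it, so that a change in the answer for 16 bytes selects a different code path rather than aborting. This is only a sketch: `wide_strategy` and `pick_wide_strategy` are hypothetical names, and what the builtin returns for 16 bytes depends on the target and compiler flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical strategies a program might choose between. */
enum wide_strategy { WIDE_USE_ATOMICS, WIDE_USE_MUTEX };

/* __atomic_always_lock_free is evaluated at compile time, so this
   function folds to a constant. Its first argument must itself be a
   compile-time constant. */
static enum wide_strategy pick_wide_strategy(void)
{
    return __atomic_always_lock_free(16, 0) ? WIDE_USE_ATOMICS
                                            : WIDE_USE_MUTEX;
}
```

Code written this way would keep working if the builtin started returning false for 16 bytes; it would simply take the mutex path.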
> * Option 3c:
> Like Option 3b, but libatomic would not use cmpxchg16b internally but
> fall back to locks for 16-byte atomics. This fixes the load-via-cmpxchg
> bug, but breaks compatibility between old __atomic-using code and new
> __atomic-using code, and between __sync and new __atomic.
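
A lock-based fallback of the kind described above might look roughly like the following. This is a simplified sketch, not libatomic's actual code: the type `pair128`, the single global spinlock and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 16-byte object made of two 64-bit halves. */
typedef struct { uint64_t lo, hi; } pair128;

/* One global spinlock guarding every pair128 (a simplification). */
static atomic_flag pair_lock = ATOMIC_FLAG_INIT;

static void pair_lock_acquire(void)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&pair_lock,
                                             memory_order_acquire)) {
    }
}

static void pair_lock_release(void)
{
    atomic_flag_clear_explicit(&pair_lock, memory_order_release);
}

/* The load only reads the object, so a const pointer is fine. */
static pair128 locked_load(const pair128 *p)
{
    pair_lock_acquire();
    pair128 v = *p;
    pair_lock_release();
    return v;
}

static void locked_store(pair128 *p, pair128 v)
{
    pair_lock_acquire();
    *p = v;
    pair_lock_release();
}
```

The compatibility break follows from this shape: a thread that updates the same object with cmpxchg16b never takes `pair_lock`, so its update is not atomic with respect to `locked_load` or `locked_store`.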
> * Option 4:
> Introduce a -mload16atomic option or similar that asserts that true
> 16-byte atomic loads are supported by the hardware (eg, through SSE).
> Enable this option for all processors where we know that it is true.
> Don't change __sync. Change __atomic to use the 16-byte atomic loads if
> available, and otherwise continue to use cmpxchg16b. Return false from
> __atomic_*_lock_free(16, ...) if 16-byte atomic loads are not available.
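
The selection rules in Option 4 can be written down as a small decision function. This is a sketch of the stated policy only, not compiler code: `target_caps`, the enum and both function names are hypothetical, and the behaviour when cmpxchg16b is also unavailable is an assumption, since the proposal does not spell it out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a target; field names are illustrative. */
struct target_caps {
    bool cx16;         /* cmpxchg16b available (what -mcx16 asserts) */
    bool load16atomic; /* true 16-byte atomic loads (the proposed option) */
};

enum load16_strategy {
    LOAD16_NATIVE,         /* plain 16-byte atomic load */
    LOAD16_VIA_CMPXCHG16B, /* load implemented with cmpxchg16b */
    LOAD16_VIA_LIBRARY     /* call out to a library routine */
};

static enum load16_strategy option4_load_strategy(struct target_caps t)
{
    if (t.load16atomic)
        return LOAD16_NATIVE;
    if (t.cx16)
        return LOAD16_VIA_CMPXCHG16B;
    return LOAD16_VIA_LIBRARY; /* assumption: not stated in the proposal */
}

/* Lock-free is reported only when true 16-byte loads exist. Requiring
   cx16 as well is an assumption of this sketch. */
static bool option4_lock_free_16(struct target_caps t)
{
    return t.cx16 && t.load16atomic;
}
```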
> I think I prefer Option 3b as the short-term solution. It does not
> break programs (except the __atomic_always_lock_free assertion scenario,
> but that's likely to not work anyway given that the atomics will be
> lock-free but not "fast"). It makes programs aware that the atomics
> will not be fast when they are not fast indeed (ie, when getting loads

I agree. Let's go through the library for the loads, giving us a hook to fix
this in the future.

> I'm worried that Option 4 would not be possible until some time in the
> future when we have actually gotten confirmation from the HW vendors
> about 16-byte atomic loads. The additional risk is that we may never
> get such a confirmation (eg, because they do not want to constrain
> future HW), or that this actually holds just for a few processors.

Indeed, I don't think we'll get any proper confirmation from the HW vendors any
time soon. Or possibly ever.
The only light on the horizon that I can see is that HTM is now working in
newly shipping Intel processors, and we could write a pure load path through
libatomic that uses that. Over time the lack of guaranteed SSE atomicity
becomes less relevant.