`std::atomic<__int128>::store` on aarch64 is not lock-free and generates a call to `__atomic_store_16`. However, the required atomic instruction (`stlxp`) exists in armv8 and is used by clang. Should gcc use the same instruction as well? Ref https://llvm.org/bugs/show_bug.cgi?id=27081, which fixes the clang codegen for this case.
Confirmed. For reference, the testcase is:

```
#include <atomic>
#include <stdint.h>

std::atomic<__int128> a;

__attribute__((noinline)) void f(int64_t v)
{
  a.store(v);
}

int main()
{
  f(1);
  return 0;
}
```

g++ at -O2 for aarch64 emits for the function 'f':

```
_Z1fl:
.LFB328:
	.cfi_startproc
	mov	x2, x0
	adrp	x1, .LANCHOR0
	add	x0, x1, :lo12:.LANCHOR0
	mov	w4, 5
	asr	x3, x2, 63
	b	__atomic_store_16
```
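For contrast, here is a minimal sketch of the armv8.0 exclusive-pair store sequence the report has in mind, written as GCC inline asm. This is illustrative only (not clang's literal output; register allocation and the exact memory-order handling differ):

```
#include <stdint.h>

// Hedged sketch of a lock-free 128-bit seq_cst store on armv8.0 using the
// exclusive monitor: load-exclusive to claim the location, then
// store-release-exclusive the new value, retrying if the monitor was lost.
void store128(__int128 *p, uint64_t lo, uint64_t hi)
{
  uint32_t fail;
  uint64_t t0, t1;
  asm volatile(
      "0:\n\t"
      "ldxp  %[t0], %[t1], %[mem]\n\t"            // claim exclusive monitor
      "stlxp %w[fail], %[lo], %[hi], %[mem]\n\t"  // store-release exclusive
      "cbnz  %w[fail], 0b"                        // retry on monitor loss
      : [fail] "=&r" (fail), [t0] "=&r" (t0), [t1] "=&r" (t1),
        [mem] "+Q" (*p)
      : [lo] "r" (lo), [hi] "r" (hi)
      : "memory");
}
```

Note this only covers the store side; as the following comments explain, it is the load side where this scheme runs into trouble.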
As an update, it seems that llvm trunk recently switched to using `__atomic_*` function calls for `__int128` on all platforms (that I can test, anyway). I'm not sure how that decision was made, or whether there was any communication with GCC, but IMHO it would be nice to agree on using the atomic instructions on aarch64, as long as the architecture provides all the necessary ones.
Atomic loads and stores have to use the same policy for a single object; you can't have loads using one model and stores another.

Atomic loads also have to work even if the object is marked const, and C/C++ permit a non-const object pointer to be cast to a const object pointer. That means that any mechanism used for atomic loads has to work with const objects, in particular with objects that are immutable (where an attempted modification would trap).

The ARMv8 ARM (Architecture Reference Manual) says that the way to perform a 128-bit atomic read using `ldxp` is to read the value and then attempt to store it back using `stxp`; if the store succeeds, then the read was atomic, otherwise it was not and must be retried. This model doesn't work if the object is immutable, since the store will fault.

The consequence is that `ldxp` cannot be used for atomic 128-bit reads, and the consequence of that is that `stxp` cannot be used for 128-bit stores. Ergo, the only way to perform 128-bit atomic accesses is by using locks, so I think the existing code is correct.

I'll leave this open for the moment for further discussion, but I'm minded to reject as INVALID.
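To make the constraint concrete, here is a small example (illustrative, not from the original report): a constant-initialized `const` atomic that the compiler is free to place in read-only memory, on which `load()` must nevertheless work:

```
#include <atomic>

// Constant-initialized const object; the implementation may legitimately
// place it in a read-only section such as .rodata.
const std::atomic<__int128> c{42};

__int128 g()
{
  // load() is const-qualified and must succeed even if c is immutable.
  // An ldxp/stxp read-back loop would attempt a store here and fault.
  return c.load();
}
```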
Thanks for the explanation; I didn't realize that the load was the problem. Just curious (since I somehow can't find documentation about it): would `ldaxp` provide the right semantics without the corresponding store?
I don't think so. It's to do with whether the CPU is permitted to 'crack' the operation into multiple micro-ops that proceed independently down the pipeline. LDAXP could still be cracked in this way provided that from that CPU's perspective other memory operations from that core were not re-ordered with respect to that operation in an incompatible manner. However, external memory operations could still cause the separate micro-ops to see different values.
(In reply to Richard Earnshaw from comment #5)
> I don't think so. It's to do with whether the CPU is permitted to 'crack'
> the operation into multiple micro-ops that proceed independently down the
> pipeline. LDAXP could still be cracked in this way provided that from that
> CPU's perspective other memory operations from that core were not
> re-ordered with respect to that operation in an incompatible manner.
> However, external memory operations could still cause the separate
> micro-ops to see different values.

Makes sense (and I know of one microarchitecture which will do that for `casp`). So closing as invalid.

Note the locks used in libatomic are nowhere near fair and can sometimes leave one thread making no progress (but that is a separate issue, recorded as bug 59305).
Closing as invalid. Though for v8.4 or so it would be nice if there were 128-bit atomic loads.
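For reference, Armv8.4's FEAT_LSE2 later did guarantee single-copy atomicity for aligned `ldp`/`stp` within a 16-byte granule, so on such hardware a 128-bit atomic load needs no store at all and works on read-only memory. A minimal sketch, assuming FEAT_LSE2 is present (hypothetical function, relaxed ordering only):

```
#include <stdint.h>

// Hedged sketch: valid only on CPUs with FEAT_LSE2 (Armv8.4), where an
// aligned ldp within a 16-byte granule is single-copy atomic. Unlike the
// ldxp/stxp scheme, this never writes, so const/immutable objects are fine.
__int128 load128_relaxed(const __int128 *p)
{
  uint64_t lo, hi;
  asm volatile("ldp %[lo], %[hi], %[mem]"
               : [lo] "=r" (lo), [hi] "=r" (hi)
               : [mem] "Q" (*p));
  return (__int128)(((unsigned __int128)hi << 64) | lo);
}
```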
(In reply to Andrew Pinski from comment #7)
> Closing as invalid. Though for v8.4 or so it would be nice if there were
> 128-bit atomic loads.

That probably wouldn't help. It would require a new ABI before you could use them, since all code has to swap to that mechanism. Iff you could guarantee that an application had exactly one instance of __atomic_{load,store}_16 (i.e. that it isn't a private function in each DSO component), then you *might* be able to re-implement those routines using the new instructions, but otherwise you're stuck with the requirement to retain backwards compatibility.
(In reply to Richard Earnshaw from comment #8)
> That probably wouldn't help. It would require a new ABI before you could
> use them, since all code has to swap to that mechanism. Iff you could
> guarantee that an application had exactly one instance of
> __atomic_{load,store}_16 (i.e. that it isn't a private function in each
> DSO component), then you *might* be able to re-implement those routines
> using the new instructions, but otherwise you're stuck with the
> requirement to retain backwards compatibility.

On x86, they use ifuncs for this purpose inside libatomic. Basically the requirement is that only one libatomic can be used at a time.
(In reply to Andrew Pinski from comment #9)
> On x86, they use ifuncs for this purpose inside libatomic. Basically the
> requirement is that only one libatomic can be used at a time.

If we can guarantee that, then yes, we could do something similar if we had fully atomic 128-bit loads.
(In reply to Richard Earnshaw from comment #10)
> If we can guarantee that, then yes, we could do something similar if we
> had fully atomic 128-bit loads.

You have to guarantee that anyway because of the way libatomic is implemented: the lock used by the implementation has to be the same across all users of it.
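To illustrate the ifunc dispatch mechanism discussed above, here is a minimal sketch (hypothetical names throughout; libatomic's real implementation differs). The resolver runs once when the symbol is first bound and selects one implementation for the whole process:

```
#include <cstring>

extern "C" {

typedef void store16_fn(void *dst, const void *src);

// Stand-in for the lock-based path libatomic uses today.
static void store16_with_lock(void *dst, const void *src)
{
  std::memcpy(dst, src, 16);  // would be done under libatomic's lock
}

// Stand-in for a path using fully atomic 128-bit instructions.
static void store16_native(void *dst, const void *src)
{
  std::memcpy(dst, src, 16);  // would be a single-copy-atomic sequence
}

// The resolver is called by the dynamic linker before first use.
store16_fn *resolve_store16(void)
{
  bool cpu_has_atomic_128 = false;  // stand-in for a HWCAP/CPUID probe
  return cpu_has_atomic_128 ? store16_native : store16_with_lock;
}

// GCC/binutils ifunc: calls to my_store16 bind to the resolver's result.
void my_store16(void *dst, const void *src)
    __attribute__((ifunc("resolve_store16")));

}  // extern "C"
```

The caveat from the preceding comments still applies: this only works if exactly one copy of the dispatched symbol exists in the process, since mixing a lock-based copy with a native copy would break atomicity between them.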
*** Bug 110061 has been marked as a duplicate of this bug. ***