`std::atomic<__int128>::store` on aarch64 is not lock-free and generates a call to `__atomic_store_16`. However, the required atomic instruction (`stlxp`) exists in armv8 and is used by clang. Should gcc use the same instruction as well? Ref https://llvm.org/bugs/show_bug.cgi?id=27081, which fixes the clang codegen for this case.
Confirmed. For reference, the testcase is:

```
#include <atomic>
#include <stdint.h>

std::atomic<__int128> a;

__attribute__((noinline)) void f(int64_t v)
{
  a.store(v);
}

int main()
{
  f(1);
  return 0;
}
```

g++ at -O2 for aarch64 emits for the function 'f':

```
_Z1fl:
.LFB328:
	.cfi_startproc
	mov	x2, x0
	adrp	x1, .LANCHOR0
	add	x0, x1, :lo12:.LANCHOR0
	mov	w4, 5
	asr	x3, x2, 63
	b	__atomic_store_16
```
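For contrast, here is a minimal sketch of the armv8.0 exclusive-pair store sequence the report has in mind, written as GCC inline asm. This is illustrative only (not clang's literal output; register allocation and the exact memory-order handling differ):

```
#include <stdint.h>

// Hedged sketch of a lock-free 128-bit seq_cst store on armv8.0 using the
// exclusive monitor: load-exclusive to claim the location, then
// store-release-exclusive the new value, retrying if the monitor was lost.
void store128(__int128 *p, uint64_t lo, uint64_t hi)
{
  uint32_t fail;
  uint64_t t0, t1;
  asm volatile(
      "0:\n\t"
      "ldxp  %[t0], %[t1], %[mem]\n\t"            // claim exclusive monitor
      "stlxp %w[fail], %[lo], %[hi], %[mem]\n\t"  // store-release exclusive
      "cbnz  %w[fail], 0b"                        // retry on monitor loss
      : [fail] "=&r" (fail), [t0] "=&r" (t0), [t1] "=&r" (t1),
        [mem] "+Q" (*p)
      : [lo] "r" (lo), [hi] "r" (hi)
      : "memory");
}
```

Note this only covers the store side; as the following comments explain, it is the load side where this scheme runs into trouble.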
As an update, it seems that llvm trunk recently switched to using `__atomic_*` function calls for `__int128` on all platforms (that I can test, anyway). I'm not sure how that decision was made, or whether there was any communication with GCC, but IMHO it would be nice to agree on using the atomic instructions on aarch64, as long as the architecture provides all the necessary ones.
Atomic loads and stores have to use the same policy for a single object; you can't have loads using one model and stores another.

Atomic loads also have to work even if the object is marked const, and C/C++ permit a non-const object pointer to be cast to a const object pointer. That means that any mechanism used for atomic loads has to work with const objects, in particular with objects that are immutable (where an attempted modification would trap).

The ARMv8 ARM (Architecture Reference Manual) says that the way to perform a 128-bit atomic read using `ldxp` is to read the value and then attempt to store it back using `stxp`; if the store succeeds, then the read was atomic, otherwise it was not and must be retried. This model doesn't work if the object is immutable, since the store will fault.

The consequence is that `ldxp` cannot be used for atomic 128-bit reads, and the consequence of that is that `stxp` cannot be used for 128-bit stores. Ergo, the only way to perform 128-bit atomic accesses is by using locks, so I think the existing code is correct.

I'll leave this open for the moment for further discussion, but I'm minded to reject as INVALID.
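To make the constraint concrete, here is a small example (illustrative, not from the original report): a constant-initialized `const` atomic that the compiler is free to place in read-only memory, on which `load()` must nevertheless work:

```
#include <atomic>

// Constant-initialized const object; the implementation may legitimately
// place it in a read-only section such as .rodata.
const std::atomic<__int128> c{42};

__int128 g()
{
  // load() is const-qualified and must succeed even if c is immutable.
  // An ldxp/stxp read-back loop would attempt a store here and fault.
  return c.load();
}
```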
Thanks for the explanation; I didn't realize that the load was the problem. Just curious (since I somehow can't find documentation about it): would `ldaxp` provide the right semantics without the corresponding store?
I don't think so. It's to do with whether the CPU is permitted to 'crack' the operation into multiple micro-ops that proceed independently down the pipeline. LDAXP could still be cracked in this way provided that from that CPU's perspective other memory operations from that core were not re-ordered with respect to that operation in an incompatible manner. However, external memory operations could still cause the separate micro-ops to see different values.
(In reply to Richard Earnshaw from comment #5)
> I don't think so. It's to do with whether the CPU is permitted to 'crack'
> the operation into multiple micro-ops that proceed independently down the
> pipeline. LDAXP could still be cracked in this way provided that from that
> CPU's perspective other memory operations from that core were not
> re-ordered with respect to that operation in an incompatible manner.
> However, external memory operations could still cause the separate
> micro-ops to see different values.

Makes sense (and I know of one microarchitecture which will do that for `casp`). So closing as invalid.

Note the locks used in libatomic are nowhere near fair and can sometimes leave one thread making no progress (but that is a separate issue, recorded as bug 59305).
Closing as invalid. Though for v8.4 or so it would be nice if there were 128-bit atomic loads.
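For reference, Armv8.4's FEAT_LSE2 later did guarantee single-copy atomicity for aligned `ldp`/`stp` within a 16-byte granule, so on such hardware a 128-bit atomic load needs no store at all and works on read-only memory. A minimal sketch, assuming FEAT_LSE2 is present (hypothetical function, relaxed ordering only):

```
#include <stdint.h>

// Hedged sketch: valid only on CPUs with FEAT_LSE2 (Armv8.4), where an
// aligned ldp within a 16-byte granule is single-copy atomic. Unlike the
// ldxp/stxp scheme, this never writes, so const/immutable objects are fine.
__int128 load128_relaxed(const __int128 *p)
{
  uint64_t lo, hi;
  asm volatile("ldp %[lo], %[hi], %[mem]"
               : [lo] "=r" (lo), [hi] "=r" (hi)
               : [mem] "Q" (*p));
  return (__int128)(((unsigned __int128)hi << 64) | lo);
}
```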
(In reply to Andrew Pinski from comment #7)
> Closing as invalid. Though for v8.4 or so it would be nice if there were
> 128-bit atomic loads.

That probably wouldn't help. It would require a new ABI before you could use them, since all code has to swap to that mechanism. Iff you could guarantee that an application had exactly one instance of __atomic_{load,store}_16 (i.e. that it isn't a private function in each DSO component), then you *might* be able to re-implement those routines using the new instructions, but otherwise you're stuck with the requirement to retain backwards compatibility.
(In reply to Richard Earnshaw from comment #8)
> That probably wouldn't help. It would require a new ABI before you could
> use them, since all code has to swap to that mechanism. Iff you could
> guarantee that an application had exactly one instance of
> __atomic_{load,store}_16 (i.e. that it isn't a private function in each
> DSO component), then you *might* be able to re-implement those routines
> using the new instructions, but otherwise you're stuck with the
> requirement to retain backwards compatibility.

On x86, they use ifuncs for this purpose inside libatomic. Basically the requirement is that only one libatomic can be used at a time.
(In reply to Andrew Pinski from comment #9)
> On x86, they use ifuncs for this purpose inside libatomic. Basically the
> requirement is that only one libatomic can be used at a time.

If we can guarantee that, then yes, we could do something similar if we had fully atomic 128-bit loads.
(In reply to Richard Earnshaw from comment #10)
> If we can guarantee that, then yes, we could do something similar if we
> had fully atomic 128-bit loads.

You have to guarantee that anyway because of the way libatomic is implemented: the lock used by the implementation has to be the same across all users of it.
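To illustrate the ifunc dispatch mechanism discussed above, here is a minimal sketch (hypothetical names throughout; libatomic's real implementation differs). The resolver runs once when the symbol is first bound and selects one implementation for the whole process:

```
#include <cstring>

extern "C" {

typedef void store16_fn(void *dst, const void *src);

// Stand-in for the lock-based path libatomic uses today.
static void store16_with_lock(void *dst, const void *src)
{
  std::memcpy(dst, src, 16);  // would be done under libatomic's lock
}

// Stand-in for a path using fully atomic 128-bit instructions.
static void store16_native(void *dst, const void *src)
{
  std::memcpy(dst, src, 16);  // would be a single-copy-atomic sequence
}

// The resolver is called by the dynamic linker before first use.
store16_fn *resolve_store16(void)
{
  bool cpu_has_atomic_128 = false;  // stand-in for a HWCAP/CPUID probe
  return cpu_has_atomic_128 ? store16_native : store16_with_lock;
}

// GCC/binutils ifunc: calls to my_store16 bind to the resolver's result.
void my_store16(void *dst, const void *src)
    __attribute__((ifunc("resolve_store16")));

}  // extern "C"
```

The caveat from the preceding comments still applies: this only works if exactly one copy of the dispatched symbol exists in the process, since mixing a lock-based copy with a native copy would break atomicity between them.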
*** Bug 110061 has been marked as a duplicate of this bug. ***