Bug 80878 - -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
Summary: -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 7.1.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Duplicates: 84522
Depends on:
Blocks:
 
Reported: 2017-05-25 10:01 UTC by admin_public
Modified: 2023-11-16 01:33 UTC
CC: 13 users

See Also:
Host: x86_64
Target: x86_64
Build: x86_64
Known to work: 6.2.0
Known to fail: 7.1.0
Last reconfirmed: 2018-03-29 00:00:00


Description admin_public 2017-05-25 10:01:05 UTC
Hi.

I've been building GCC compilers, from 4.1.2 up to 7.1.0, on a couple of platforms.  It's hard to do, and I'm really not good at it, so I thought it practically certain this bug was me messing things up rather than an actual bug - except that when I switch to 6.2.0, which I built in exactly the same way (I've written a script), the problem goes away.

The problem is that the build fails at link time with "undefined reference to `__atomic_compare_exchange_16'".

I have reduced this down to a short test program, thus:

#include <stdio.h>
#include <stdlib.h>

int main( void );

int main()
{
  __uint128_t
    compare = 1,
    exchange = 2,
    target = 1;

  printf( "target before = %llu\n", (int long long unsigned) target );

  __atomic_compare_exchange( &target, &compare, &exchange, 1, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST );

  printf( "target after = %llu\n", (int long long unsigned) target );

  return EXIT_SUCCESS;
}

I compile this with:

gcc -Wall -march=native -mtune=native -mcx16 test.c

With GCC 6.2.0, the output is:

./a.out 
target before = 1
target after = 2

With GCC 7.1.0, the output is:

/tmp/ccjMRvw1.o: In function `main':
test.c:(.text+0x75): undefined reference to `__atomic_compare_exchange_16'
collect2: error: ld returned 1 exit status

This is the same error 6.2.0 gives if I omit "-mcx16".
Comment 1 Andrew Pinski 2017-05-25 12:49:51 UTC
IIRC this was removed as the instruction cannot be used for read-only memory.
Comment 2 admin_public 2017-05-25 13:08:07 UTC
Am I right in understanding your comment to mean 128 bit CAS is no longer supported for x86_64?

I publish a library of lock-free data structures, liblfds.  It has some users, including AT&T, Red Hat and Xen.

Contiguous double-word compare-and-swap is necessary for a range of such data structures, and even though this means only aarch64, arm, x86_64 and x86, it is still of immense value, as these are common platforms.

In particular, all three of those users would find the data structures they are using no longer compile with 7.1.0.

This is not death-and-end-of-the-world, however.  The library provides an abstraction layer to mask platform differences, and so I would then need to add an additional abstraction layer for GCC 7.1.0 and later on x86_64, where I use inline assembly for 128-bit CAS (such an abstraction already exists, in fact, for early versions of GCC which lack -mcx16).

One minor note: the "-mcx16" option is specific to x86_64, and so its removal means no more double-word CAS on that platform.  However, aarch64, arm and x86 will continue to support double-word CAS, as they do so natively and without the need for a special argument to GCC.
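The inline-assembly fallback mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not liblfds' actual API: the function name is invented, the target must be 16-byte aligned, and the CPU must support the CX16 feature (virtually all x86_64 CPUs since around 2006).

```c
#include <stdbool.h>
#include <stdint.h>

/* dwcas128 is an illustrative name, not a real liblfds interface. */
static bool
dwcas128(volatile __uint128_t *dest, __uint128_t *expected, __uint128_t desired)
{
  bool ok;
  uint64_t exp_lo = (uint64_t)*expected;
  uint64_t exp_hi = (uint64_t)(*expected >> 64);

  /* cmpxchg16b compares RDX:RAX against the 16-byte destination; on a
     match it stores RCX:RBX and sets ZF, otherwise it loads the current
     value into RDX:RAX and clears ZF. */
  __asm__ __volatile__("lock cmpxchg16b %[dst]"
                       : [dst] "+m" (*dest), "+a" (exp_lo), "+d" (exp_hi),
                         "=@ccz" (ok)
                       : "b" ((uint64_t)desired),
                         "c" ((uint64_t)(desired >> 64))
                       : "memory");

  /* On failure, report the value actually observed, as the builtin does. */
  *expected = ((__uint128_t)exp_hi << 64) | exp_lo;
  return ok;
}
```

The `"=@ccz"` flag-output constraint (GCC 6 and later) lets the compiler use the ZF result directly instead of a separate setz.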
Comment 3 Jonathan Wakely 2017-05-25 13:13:03 UTC
libatomic provides a definition for __atomic_compare_exchange_16 when the native instruction isn't available.
Comment 4 admin_public 2017-05-25 13:45:51 UTC
I've had a look at the libatomic source code.  Obviously, it's problematic to be sure you're understanding a large code base correctly when you go to it for the first time looking for something specific, so forgive me if I am completely mistaken here!

I think I understand you Jonathan to mean that in the absence (by whatever means) of the native instruction, a replacement is provided, which has a different internal mechanism.

From what I can see in the libatomic code (and I may be completely wrong!), this internal mechanism under POSIX is a mutex.

One of the advantages of lock-free data structures is that, when properly written, they scale well.  This advantage will absolutely and most certainly no longer exist if the native instruction is replaced by an alternative, as there are, as far as I know, no alternatives on any platform that retain the scaling properties of the native instruction.

In other words, libatomic is absolutely no use to lock-free data structures.  This is not a fatal problem, as inline assembly can be used.

(Also, lock-free data structures do not sleep, where a mutex can, and that does change the behaviour of the code, for there are some places in some kernels where you are not permitted to sleep.)
Comment 5 Jonathan Wakely 2017-05-25 14:00:36 UTC
Yes, I didn't say it's lock-free, but the code can be compiled and linked.
Comment 6 Alexander Monakov 2017-05-26 13:20:36 UTC
There's a bit of a misunderstanding here: the -mcx16 option remains supported, and the compiler remains capable of issuing lock cmpxchg16b for __sync builtins, in particular for __sync_val_compare_and_swap. What changed in gcc-7 is that __atomic builtins that would previously get expanded to a sequence involving cmpxchg16b now always yield a library call on x86, but libatomic tries to support that efficiently by using cmpxchg16b internally, on CPUs that have it and on targets that support IFUNC.

No bug here, rather a (non-obvious imho) design change.
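The distinction comment 6 draws can be sketched as follows (x86_64; the function and variable names are illustrative). On gcc, the macro __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 is predefined when the 16-byte __sync builtin can be expanded inline, i.e. when -mcx16 is in effect, so the sketch still builds without the flag:

```c
/* target and cas_target_1_to_2 are illustrative names. */
static __uint128_t target = 1;

static __uint128_t
cas_target_1_to_2(void)
{
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
  /* With -mcx16 in effect, gcc 7+ still expands this to an inline
     lock cmpxchg16b; it returns the value observed before the swap. */
  return __sync_val_compare_and_swap(&target, (__uint128_t)1, (__uint128_t)2);
#else
  /* Without -mcx16 the 16-byte __sync builtin is unavailable; a plain
     (non-atomic) update keeps the sketch buildable for illustration. */
  __uint128_t old = target;
  target = 2;
  return old;
#endif
}
```

Compiling with `gcc -mcx16 -S` and inspecting the assembly should show the inline lock cmpxchg16b, whereas the equivalent __atomic_compare_exchange call on gcc 7+ becomes a call into libatomic.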
Comment 7 Jonathan Wakely 2017-05-26 14:02:43 UTC
Since this makes libatomic required for DCAS on x86_64 it should probably have been documented at https://gcc.gnu.org/gcc-7/changes.html
Comment 8 Alexander Monakov 2017-05-26 18:51:10 UTC
Well, at least it's not too late to update the compiler manual, so I've submitted a patch: https://gcc.gnu.org/ml/gcc-patches/2017-05/msg02080.html
Comment 9 andysem 2017-09-02 14:56:18 UTC
The docs (https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins) still say that `__atomic` builtins are intended to replace `__sync` builtins and should be preferred in new code. This is no longer true, as `__sync` builtins are now the only way to generate cmpxchg16b without having to write assembler code. Please update the docs accordingly.
Comment 10 Andrew Pinski 2018-02-25 18:48:03 UTC
*** Bug 84522 has been marked as a duplicate of this bug. ***
Comment 11 Florian Weimer 2018-03-29 13:52:50 UTC
We do have a bug here: libatomic selects CMPXCHG16B based on CPUID support.  If we want to support loads from read-only mappings, we cannot do that, and have to use locks unconditionally (for all 128-bit atomics, to achieve synchronization).

So we either need to fix libatomic to use locks consistently, or -mcx16 should enable the 128-bit CAS instruction (for loads/stores/CAS).

I believe most users who use the 128-bit atomics on x86-64 will want the lock-free instructions, and not the support for read-only mappings.

Furthermore, the read-only mapping case is most relevant to cross-process synchronization, and a process-local lock will not achieve synchronization there.
Comment 12 andysem 2018-03-29 14:31:11 UTC
Is read-only memory a valid use case for the __atomic intrinsics anyway? These intrinsics are primarily intended to implement std::atomic, but does the standard guarantee that these operations (primarily std::atomic::load()) do not issue writes to the memory?
Comment 13 toby 2018-03-29 15:16:19 UTC
On 29 Mar 2018, andysem at mail dot ru <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
>
>--- Comment #12 from andysem at mail dot ru ---
>Is read-only memory a valid use case for __atomic intrinsics anyway?
>These
>intrinsics are primarily targeted to implement std::atomic, but does
>the
>standard guarantee these operations (primarily, std::atomic::load()) do
>not
>issue writes to the memory?

On Intel, all CAS operations always write, even if the compare failed.
Comment 14 andysem 2018-03-29 18:23:38 UTC
> On Intel, all CAS operations always write, even if the compare failed.

I understand that. The question is whether this is allowed behavior for the std::atomic::load() operation according to the C++ standard.
Comment 15 crc 2018-03-29 18:28:28 UTC
On 29/03/18 19:23, andysem at mail dot ru wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
> 
> --- Comment #14 from andysem at mail dot ru ---
>> On Intel, all CAS operations always write, even if the compare failed.
> 
> I understand that. The question is whether this is allowed behavior for
> the std::atomic::load() operation according to the C++ standard.

Apologies.  I replied off the cuff from my phone; I realised afterwards.
Comment 16 Florian Weimer 2018-03-29 18:36:07 UTC
(In reply to andysem from comment #12)
> Is read-only memory a valid use case for __atomic intrinsics anyway? These
> intrinsics are primarily targeted to implement std::atomic,

I strongly disagree about that.  These intrinsics are used in many other contexts.

> but does the
> standard guarantee these operations (primarily, std::atomic::load()) do not
> issue writes to the memory?

std::atomic objects need to be placed in memory which allows CAS to work (or whatever is used for the loads).  On some architectures, there are more constraints than just read-only vs writable.  I don't know if libstdc++ ensures that in some way; due to the constexpr constructor, this could be tricky.
Comment 17 andysem 2018-03-29 18:40:32 UTC
I'll clarify why I think load() should be allowed to issue writes to the memory. According to [atomics.types.operations]/18 in N4713, compare_exchange_*() is a load operation if the comparison fails, yet we know cmpxchg (even the variants narrower than cmpxchg16b) always writes, so we must assume a load operation may write. I do not find a definition of a "load operation" in the standard, and [atomics.types.operations]/12 and 13 avoid this term, saying that load() "Atomically returns the value pointed to by this." Again, it doesn't say anything about writes to the memory.

So, if compare_exchange_*() is allowed to write on failure, why shouldn't load() be? Either compare_exchange_*() issuing writes is a bug (in which case a lock-free CAS can't be implemented on x86 at all) or writes in load() should be allowed and the change wrt. cmpxchg16b should be reverted.
Comment 18 Ruslan Nikolaev 2018-04-05 19:31:24 UTC
(In reply to andysem from comment #17)
> I'll clarify why I think load() should be allowed to issue writes on the
> memory. According to [atomics.types.operations]/18 in N4713,
> compare_exchange_*() is a load operation if the comparison fails, yet we
> know cmpxchg (even the ones more narrow than cmpxchg16b) always writes, so
> we must assume a load operation may write. I do not find a definition of a
> "load operation" in the standard and [atomics.types.operations]/12 and 13
> avoid this term, saying that load() "Atomically returns the value pointed to
> by this." Again, it doesn't say anything about writes to the memory.
> 
> So, if compare_exchange_*() is allowed to write on failure, why load()
> shouldn't be? Either compare_exchange_*() issuing writes is a bug (in which
> case a lock-free CAS can't be implemented on x86 at all) or writes in load()
> should be allowed and the change wrt. cmpxchg16b should be reverted.

I think there is way too much over-thinking about the read-only case for 128-bit atomics. The current solution is very confusing and not very well documented, at the very least. Correct me if I am wrong, but does the current solution guarantee address-freedom? If not, what is the motivation to support 128-bit read-only atomics? The only practical use case seems to be IPC where one process has read-only access. If that is not guaranteed for 128-bit, why even bother to support the read-only case, which is a) not guaranteed to be lock-free and b) works only within a single process, where it is easy to control read-only behavior.

I really prefer the way it was implemented in clang. It only redirects to the library call if -mcx16 is not specified. BTW, clang also provides a very nice implementation for AArch64, which GCC is also lacking.
Comment 19 rockeet 2020-01-29 06:27:09 UTC
Is there a way (command-line option, -DSOME_MACRO, ...) to make gcc issue cmpxchg16b for std::atomic<obj16b>.compare_exchange_*?
Comment 20 Avi Kivity 2020-04-18 17:14:57 UTC
Note that clang generates cmpxchg16b when the conditions are ripe:

https://godbolt.org/z/j9Whgh
Comment 21 Florian Weimer 2020-04-18 17:23:50 UTC
(In reply to Avi Kivity from comment #20)
> Note that clang generates cmpxchg16b when the conditions are ripe:
> 
> https://godbolt.org/z/j9Whgh

I believe this is a different, C++-specific issue. The C front end already emits cmpxchgq in this situation.
Comment 22 Avi Kivity 2020-04-18 17:31:55 UTC
Perhaps PR 84522 should be reopened and unmarked as a duplicate. While the reproducer there is a C API, it is the C equivalent of <atomic> (<stdatomic.h>).
Comment 23 Florian Weimer 2020-04-18 17:40:12 UTC
Ahh, I think this bug here is specific to __uint128 (with the C front end at least)

The C translation of the C++ reproducer from comment 20:

struct a
{
  long  _Alignas(16) x;
  long y;
};

_Bool
cmpxchg (struct a *data, struct a expected, struct a newval)
{
  return __atomic_compare_exchange_n (&data, &expected, &newval, 1,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

produces the atomic instruction.
Comment 24 Avi Kivity 2020-04-18 18:13:45 UTC
I'll file a new PR.
Comment 25 Avi Kivity 2020-04-18 18:18:29 UTC
PR 94649.
Comment 26 Florian Weimer 2020-04-18 18:41:20 UTC
(In reply to Florian Weimer from comment #23)
> Ahh, I think this bug here is specific to __uint128 (with the C front end at
> least)
> 
> The C translation of the C++ reproducer from comment 20:
> 
> struct a
> {
>   long  _Alignas(16) x;
>   long y;
> };
> 
> _Bool
> cmpxchg (struct a *data, struct a expected, struct a newval)
> {
>   return __atomic_compare_exchange_n (&data, &expected, &newval, 1,
>                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> }
> 
> produces the atomic instruction.

Eh, no, that code does something else.
Comment 27 Yongwei Wu 2021-03-14 03:21:58 UTC
Can anyone show a valid use case for a non-lock-free version of 128-bit atomic_compare_exchange?

I am trying to use it in a data structure intended to be lock-free. I am surprised to find that the C++ std::atomic::compare_exchange_weak does not result in lock-free code for a 128-bit struct intended for ABA-free CAS. As a result, the GCC-generated code is MUCH slower than the mutex-based version in my 8-thread contention test, defeating all its valid purposes. I am talking about a 10x difference. And the Clang-generated code is more than 200x faster in the same test.

Friends, being 200x worse in an important use case (lock-free, ABA-free data structures like queues and lists) is not funny at all.
Comment 28 Yongwei Wu 2021-03-14 03:46:25 UTC
OK, somewhat answering myself. I was not aware that a 128-bit atomic read has to be implemented using cmpxchg16b as well, thus defeating some non-CAS usage scenarios.

The natural question is: which usage scenario is more significant? Or is there a way to support both?

I still think lock-free data structures are too important to ignore.
Comment 29 Yongwei Wu 2021-03-16 16:02:18 UTC
As usual, test results are always elusive. I have to add yet another important piece of information. The very bad performance result does not occur on Linux, but only on macOS (Homebrew versions of GCC and libatomic).

So far, it seems to indicate that this is more a libatomic issue on macOS, which I traced to pthread_mutex_lock, instead of "lock cmpxchg16b" on Linux...
Comment 30 Niall Douglas 2021-05-06 17:50:44 UTC
I got bit by this GCC regression today at work. Consider https://godbolt.org/z/M9fd7nhdh where std::atomic<__int128> is compare exchanged with -march=sandybridge:

- On GCC 6.4 and earlier, this emits lock cmpxchg16b, as you would expect.

- From GCC 7 up to trunk (12?), this emits __atomic_compare_exchange_16.

- On clang, this emits lock cmpxchg16b, as you would expect.

This is clearly a regression. GCCs before 7 did the right thing. GCCs from 7 onwards do not. clang with libstdc++ does the right thing.

This isn't just an x64 thing, either. Consider https://godbolt.org/z/x6d5GE4o6 where GCC on ARM64 emits __atomic_compare_exchange_16, whereas clang on ARM64 emits ldaxp/stlxp, as you would expect.

Please mark this bug as a regression affecting all versions of GCC from 7 to trunk, and affecting all 128 bit atomic capable architectures not just x64.
Comment 31 Andrew Pinski 2021-05-06 19:32:51 UTC
(In reply to Niall Douglas from comment #30)
> I got bit by this GCC regression today at work. Consider
> https://godbolt.org/z/M9fd7nhdh where std::atomic<__int128> is compare
> exchanged with -march=sandybridge:
> 
> - On GCC 6.4 and earlier, this emits lock cmpxchg16b, as you would expect.
> 
> - From GCC 7 up to trunk (12?), this emits __atomic_compare_exchange_16.
> 
> - On clang, this emits lock cmpxchg16b, as you would expect.
> 
> This is clearly a regression. GCCs before 7 did the right thing. GCCs from 7
> onwards do not. clangs with libstdc++ do do the right thing.
> 
> This isn't just an x64 thing, either. Consider
> https://godbolt.org/z/x6d5GE4o6 where GCC on ARM64 emits
> __atomic_compare_exchange_16, whereas clang on ARM64 emits ldaxp/stlxp, as
> you would expect.
> 
> Please mark this bug as a regression affecting all versions of GCC from 7 to
> trunk, and affecting all 128 bit atomic capable architectures not just x64.

Again the problem is stuff like:
static const _Atomic __int128_t t = 2000;

__int128_t g(void)
{
  return t;
}

DOES NOT WORK if you use CAS (or ldaxp/stlxp).

So clang is broken really ....

Also, GCC for ARM64 emits calls for all compare-and-exchange operations because using the LSE instructions (from ARMv8.1-A) is useful.
Comment 32 liblfds admin 2021-05-06 20:53:42 UTC
(In reply to Andrew Pinski from comment #31)
> Again the problem is stuff like:
> static const _Atomic __int128_t t = 2000;
> 
> __int128_t g(void)
> {
>   return t;
> }
> 
> DOES NOT WORK if you use CAS (or ldaxp/stlxp).
> 
> So clang is broken really ....
> 
> Also GCC for ARM64 emits calls for all compare and exchange because using
> the LSE (from ARMv8.1-a) is useful.

It may be a case of selecting the lesser of two evils.

The problem for me, as the author of a lock-free data structure library, is that a mutex is not, repeat NOT, a replacement for a compare-exchange instruction.

This is because lock-free data structures possess the property of not sleeping.  Such data structures are used in kernels, at times and in places where sleeping is absolutely forbidden and will cause the kernel to panic.  Accordingly, replacing an atomic exchange with a mutex does *not* provide identical functionality - an atomic exchange works fine, a mutex makes the kernel panic.

To reiterate: I write a library of lock-free data structures, and on the face of it you would think I would be a prime user of libatomic, yet I *specifically* MUST avoid libatomic, and indeed have basically implemented my own, because of how libatomic behaves.

It's crazy that people writing lock-free data structures must specifically ensure and guarantee they absolutely do not touch libatomic.
Comment 33 Niall Douglas 2021-05-07 12:27:38 UTC
(In reply to Andrew Pinski from comment #31)
> 
> Again the problem is stuff like:
> static const _Atomic __int128_t t = 2000;
> 
> __int128_t g(void)
> {
>   return t;
> }
> 
> DOES NOT WORK if you use CAS (or ldaxp/stlxp).

I think we are talking about different things here. You are talking about calling convention. I'm talking about std::atomic<__int128>::compare_exchange_weak() i.e. that the specific member function compare_exchange_weak() is not generating cmpxchg16b if compiled with GCC, but does with clang.

Re: your original point, I cannot say anything about _Atomic. However, for std::atomic<__int128>, as __int128 is not an integral type it seems reasonable to me that its specialisation tell the compiler to not store it in read only memory. Mark it with attribute section, give it a non-trivial destructor, or whatever it needs.

Perhaps I ought to open a separate issue here, as my specific problem is that std::atomic<__int128>::compare_exchange_weak() is not using cmpxchg16b? Mr. Wakely, your thoughts?
Comment 34 Jonathan Wakely 2021-05-07 14:07:57 UTC
(In reply to Niall Douglas from comment #33)
> Re: your original point, I cannot say anything about _Atomic. However, for
> std::atomic<__int128>, as __int128 is not an integral type it seems

That depends on whether you use -std=c++NN or -std=gnu++NN.

> reasonable to me that its specialisation tell the compiler to not store it
> in read only memory. Mark it with attribute section, give it a non-trivial
> destructor, or whatever it needs.

std::atomic<T> requires T to have a trivial destructor, so the destructor is always trivial.

> Perhaps I ought to open a separate issue here, as my specific problem is
> that std::atomic<__int128>::compare_exchange_weak() is not using cmpxchg16b?

Isn't that covered by PR 94649?

std::atomic just calls the relevant __atomic built-in for all operations. What the built-in does is not up to libstdc++.
Comment 35 Niall Douglas 2021-05-07 14:44:30 UTC
(In reply to Jonathan Wakely from comment #34)

> > Perhaps I ought to open a separate issue here, as my specific problem is
> > that std::atomic<__int128>::compare_exchange_weak() is not using cmpxchg16b?
> 
> Isn't that covered by PR 94649?

That issue is definitely closer to mine, but still not the same. Still, I'll relocate this report from here to there. Thanks for pointing me at it.
Comment 36 LIU Hao 2022-11-03 04:21:59 UTC
(In reply to Andrew Pinski from comment #1)
> IIRC this was removed as the instruction cannot be used for read only memory.

That's not a valid argument. The first argument is a pointer to a non-const type, and whoever passes a read-only object bears the risk on their own.

As mentioned in previous posts, the double-word compare-and-swap operation is invaluable for many algorithms. The fact that GCC does not generate it, even when requested explicitly with `-mcx16`, is silly and unacceptable.
Comment 37 LIU Hao 2022-11-03 09:55:07 UTC
(In reply to Andrew Pinski from comment #31)
> Again the problem is stuff like:
> static const _Atomic __int128_t t = 2000;
> 
> __int128_t g(void)
> {
>   return t;
> }
> 
> DOES NOT WORK if you use CAS (or ldaxp/stlxp).
> 

Could this be done using MOVDQA instead? I haven't tested it, just asking out of curiosity.
Comment 38 Jakub Jelinek 2022-11-03 10:04:10 UTC
Please see PR 104688.  We got a response from Intel, where they guaranteed atomicity of certain 16-byte load instructions for Intel CPUs with AVX support.
AFAIK we didn't get a similar guarantee from AMD.
The current state is that, on the libatomic side, when ifuncs are possible we use those atomic loads etc. on Intel with AVX, and do what we used to do before for other CPUs.
We haven't changed what the compiler emits; I think we'd need to introduce some new option for it (a guarantee that the code will run only on Intel CPUs) and imply that from -march= for Intel CPUs (with AVX).  If AMD gave a similar guarantee, it would be much easier: we could just emit that whenever -mavx is enabled.
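The kind of 16-byte load described above can be sketched with inline assembly. This is an illustration, not a libatomic interface: the helper name is invented, and real code must first gate this on a CPUID check, since the atomicity statement only covers CPUs that enumerate AVX.

```c
#include <stdint.h>

/* load16 is an illustrative helper, not a real libatomic interface. */
static __uint128_t
load16(const volatile __uint128_t *p)
{
  __uint128_t v;
  /* movdqa performs one aligned 16-byte load; per the vendor statement
     discussed above it is atomic only on CPUs that enumerate AVX. */
  __asm__ __volatile__("movdqa %1, %%xmm0\n\t"
                       "movdqa %%xmm0, %0"
                       : "=m" (v)
                       : "m" (*p)
                       : "xmm0", "memory");
  return v;
}
```

Unlike the cmpxchg16b-based load, this never writes to *p, so it works on read-only mappings; the trade-off is that it is only known to be atomic on the CPUs covered by the guarantee.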
Comment 39 Florian Weimer 2022-11-03 10:32:09 UTC
(In reply to Jakub Jelinek from comment #38)
> Please see PR104688 .  We got a response from Intel, where they guaranteed
> atomicity of certain 16-byte load instructions for Intel CPUs with AVX
> support.
> AFAIK we didn't get similar guarantee from AMD.

I'm trying to work with AMD to get an official statement that covers older CPUs as well. I have a preliminary statement, but I hope to get to the point that we can say the rule is the same as for Intel (AVX support can act as a proxy).
Comment 40 admin_public 2022-11-03 11:16:37 UTC
On 03/11/2022 12:04, jakub at gcc dot gnu.org wrote:
> --- Comment #38 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Please see PR104688 .  We got a response from Intel, where they guaranteed
> atomicity of certain 16-byte load instructions for Intel CPUs with AVX support.

Now, it's been quite a long time since I've delved into lock-free, and I have reason to doubt my earlier understanding anyway - so I may be *completely* wrong - but, as I recall and as I understood it, the "usual" atomic operations (i.e. non-AVX) are essential in that they force the honouring of any previously issued read/write barriers, as they force a read from, and write to, memory (well, I say memory - I mean at least out to the cache coherency protocol).

Will AVX do the same?

> The current state is that on the libatomic side when ifuncs are possible we use
> those atomic loads etc. on Intel with AVX, and do what we used to do before for
> other CPUs.

Yes.  As I recall, this is the problem for me - if such lock-free support is not available, mutexes or some such are used instead, and this is absolutely *not* okay, because their properties are completely different; if I have a lock-free data structure and I'm using it in the kernel and I'm not allowed to sleep, I *can't* use a sleep-based locking mechanism.

Lock-free has unique properties, and when those properties are needed but not available, the only option is to fail to compile/build/run.
Comment 41 LIU Hao 2023-11-16 01:33:11 UTC
There should have been an option, long ago since GCC 7, which may be called

  -mcx16-just-emit-the-god-damn-cmpxchg16b-for-me-if-it-does-not-work-its-not-your-fault


`__sync_*` are not an option as 1) they do not pass back the old value and the zero flag in a single operation, 2) they do not accept 16-byte structs, and 3) they are not full barriers.