Bug 94649 - 16-byte aligned atomic_compare_exchange does not generate cmpxchg16b on x86_64
Summary: 16-byte aligned atomic_compare_exchange does not generate cmpxchg16b on x86_64
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-04-18 18:17 UTC by Avi Kivity
Modified: 2023-02-17 07:53 UTC
6 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Avi Kivity 2020-04-18 18:17:38 UTC
The code

#include <atomic>

struct alignas(16) a {
    long x;
    long y;
};

bool cmpxchg(std::atomic<a>& data, a expected, a newval) {
    return std::atomic_compare_exchange_weak(&data, &expected, newval);
}

compiled with -O3 -mcx16 generates a "lock cmpxchg16b" instruction with clang, but not with gcc, which instead emits a library call to __atomic_compare_exchange_16. This is a missed optimization.
Comment 1 Richard Biener 2020-04-20 07:03:33 UTC
It's also an ABI issue when code compiled with -mcx16 and without -mcx16 has to inter-operate.  So it might be a deliberate choice and not a missed optimization.
Comment 2 Avi Kivity 2020-04-20 11:37:49 UTC
Maybe we can have a new flag -mcx16all that assumes that all code using 16-byte CAS is compiled with the flag.
Comment 3 Yongwei Wu 2021-03-14 03:10:57 UTC
Is there really a valid use case for a non-lock-free version of 128-bit CAS?

I am using it in a lock-free data structure. The GCC-generated code is MUCH slower than the mutex-based version, defeating its purpose. I am talking about a 10x difference. And the Clang-generated code is more than 200x faster in my 8-thread contention test.

To me, the current GCC behaviour is not a missed optimization; it is a pessimization. I am really having a difficult time understanding the rationale of the current design.
Comment 4 Niall Douglas 2021-05-07 14:44:06 UTC
Relocating my issue from PR 80878 to here:

I got bit by this GCC regression today at work. Consider https://godbolt.org/z/M9fd7nhdh where std::atomic<__int128>::compare_exchange_weak() is called with option -march=sandybridge passed to the command line:

- On GCC 6.4 and earlier, this emits lock cmpxchg16b, as you would expect.

- From GCC 7 up to trunk (12?), this emits __atomic_compare_exchange_16.

- On clang, this emits lock cmpxchg16b, as you would expect.

This is clearly a regression. GCCs before 7 did the right thing. GCCs from 7 onwards do not. clang with libstdc++ does the right thing.

Please mark this bug as a regression affecting all versions of GCC from 7 to trunk.

--- cut ---

NOTE that unlike the original PR above where the struct is a UDT, I am talking here about std::atomic<__int128>::compare_exchange_weak(). It seems weird that __int128 is treated as a UDT when the CPU is perfectly capable of hardware CAS.

Common feedback from this and other PRs:

1. Changing this would break ABI

Firstly, I told GCC -march=sandybridge, and we know that libatomic will choose cmpxchg16b to implement __atomic_compare_exchange_16 because cpuid for sandybridge will say cmpxchg16b is supported. So, it's the same implementation for __atomic_compare_exchange_16, nothing breaks here.


2. static const std::atomic<__int128>::load() will segfault

std::atomic<__int128> could examine the macro environment (__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 et al) and, if only 128 bit compare and swap is available but 128 bit atomic loads are not, conditionally mark the type with an attribute section to prevent it being placed in a read-only data section.

That said, I don't consider static const std::atomic<__int128>::load() segfaulting important enough to special-case.


3. This was changed in GCC 7 because _Atomic is broken

_Atomic is indeed broken, but I am talking about std::atomic the C++ library type here. As Mr. Wakely said in another PR:

> std::atomic just calls the relevant __atomic built-in for all operations.
> What the built-in does is not up to libstdc++.

... to this I would say both yes and no. __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 is not defined if the architecture relies on software emulation (libatomic) to implement 128 bit CAS. So compare_exchange_X() for 16-byte types *could* examine the architecture macros and the presence of __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16, and inline some assembler for certain architectures as a QoI measure. That is not ABI breaking, because if __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 is 1, then libatomic will be choosing that same assembler in any case. Note that I refer to the CAS operation only; for load and store it is trivial to write CAS-based emulations, but those could just continue to call libatomic.

Ultimately I probably agree that because _Atomic is broken, the compiler is not the right thing to change here. But libstdc++'s std::atomic implementation is another matter.
Comment 5 Henning Baldersheim 2023-02-15 10:56:17 UTC
Are there any next steps here?
This is still an issue, at least with gcc-12. It would be nice to get rid of the hacks we have to avoid it.
Comment 6 Jakub Jelinek 2023-02-15 10:59:55 UTC
See PR104688
Comment 7 Henning Baldersheim 2023-02-17 07:53:53 UTC
Thanks, perhaps add 104688 to the see also list.