Bug 104688 - gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 12.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL: https://gcc.gnu.org/pipermail/gcc-hel...
Keywords: missed-optimization, patch
Depends on:
Blocks:
 
Reported: 2022-02-25 14:22 UTC by Xi Ruoyao
Modified: 2024-07-18 20:53 UTC
CC List: 18 users

See Also:
Host:
Target: x86_64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-11-14 00:00:00


Attachments
gcc12-pr104688.patch (1.99 KB, patch)
2022-02-25 16:36 UTC, Jakub Jelinek

Description Xi Ruoyao 2022-02-25 14:22:23 UTC
In Dec 2021, Intel updated the SDM and added the following content:

> Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
> - MOVAPD, MOVAPS, and MOVDQA.
> - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
> - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
> 
> (Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

(see Change 13, https://cdrdv2.intel.com/v1/dl/getContent/671294)

So we can use SSE on Intel CPUs with AVX, instead of a loop with LOCK CMPXCHG16B.

AMD makes no such guarantee (at least for now), so we still need LOCK CMPXCHG16B on older Intel CPUs and on AMD CPUs (old or new).
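
For illustration, a 16-byte atomic load under this guarantee can be a single aligned vector move.  A minimal sketch (not GCC's eventual implementation; assumes an AVX CPU covered by the quote above, a 16-byte-aligned pointer, and WB memory):

#include <immintrin.h>

static inline __m128i
load_16_atomic (const __m128i *src)
{
  __m128i ret;
  /* VEX.128-encoded aligned load; atomic per the SDM quote above.  */
  __asm__ ("vmovdqa %1, %0" : "=x" (ret) : "m" (*src));
  return ret;
}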
Comment 1 Jakub Jelinek 2022-02-25 14:30:48 UTC
So, shall we just handle it in libatomic by adding yet another ifunc-selected version of __atomic_load_16?
Or do we want to expand it inline again when some new -m* option is set, one enabled by default for -march= values corresponding to Intel CPUs with AVX?
Comment 2 Xi Ruoyao 2022-02-25 14:33:04 UTC
See option 4 of https://gcc.gnu.org/legacy-ml/gcc/2017-01/msg00167.html.
Comment 3 Florian Weimer 2022-02-25 14:34:19 UTC
I feel we should give AMD some time to comment here. If they can commit to supporting it like Intel did, that alters the design space somewhat.
Comment 4 Jakub Jelinek 2022-02-25 16:36:55 UTC
Created attachment 52517 [details]
gcc12-pr104688.patch

Untested patch handling it on the libatomic side only, so far.
I'm not sure what exactly to use for the 16-byte __atomic_store_n: for 8 bytes
we use a plain store for relaxed and XCHG for seq_cst (I haven't checked the other models),
but there is no 16-byte XCHG, so I'm using vmovdqa + mfence; I'm not sure that is fastest.
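
Roughly, the store side amounts to the following sketch (simplified, not the attachment verbatim; assumes a 16-byte-aligned dst on a covered CPU):

#include <immintrin.h>

static inline void
store_16_seq_cst (__m128i *dst, __m128i val)
{
  /* The aligned vector store itself is the atomic part...  */
  __asm__ __volatile__ ("vmovdqa %1, %0" : "=m" (*dst) : "x" (val) : "memory");
  /* ...and mfence supplies the StoreLoad ordering seq_cst needs,
     since there is no 16-byte xchg.  */
  __asm__ __volatile__ ("mfence" ::: "memory");
}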
Comment 5 Jakub Jelinek 2022-02-25 16:37:47 UTC
Of course, if AMD confirms the same, we could just revert the __libat_feat1_init change.
Comment 6 GCC Commits 2022-03-17 17:50:32 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:1d47c0512a265d4bb3ab9e56259fd1e4f4d42c75

commit r12-7689-g1d47c0512a265d4bb3ab9e56259fd1e4f4d42c75
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Mar 17 18:49:00 2022 +0100

    libatomic: Improve 16-byte atomics on Intel AVX [PR104688]
    
    As mentioned in the PR, the latest Intel SDM has added:
    "Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28])
    guarantee that the 16-byte memory operations performed by the following instructions will always be
    carried out atomically:
    • MOVAPD, MOVAPS, and MOVDQA.
    • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
    • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
    (Note that these instructions require the linear addresses of their memory operands to be 16-byte
    aligned.)"
    
    The following patch deals with it just on the libatomic library side so far;
    currently (since ~2017) we emit all the __atomic_* 16-byte builtins as
    library calls, and this is something that we can hopefully backport.
    
    The patch simply introduces yet another ifunc variant that takes priority
    over the pure CMPXCHG16B one, one that checks AVX and CMPXCHG16B bits and
    on non-Intel clears the AVX bit during detection for now (if AMD comes
    with the same guarantee, we could revert the config/x86/init.c hunk),
    which implements 16-byte atomic load as vmovdqa and 16-byte atomic store
    as vmovdqa followed by mfence.
    
    2022-03-17  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/104688
            * Makefile.am (IFUNC_OPTIONS): Change on x86_64 to -mcx16 -mcx16.
            (libatomic_la_LIBADD): Add $(addsuffix _16_2_.lo,$(SIZEOBJS)) for
            x86_64.
            * Makefile.in: Regenerated.
            * config/x86/host-config.h (IFUNC_COND_1): For x86_64 define to
            both AVX and CMPXCHG16B bits.
            (IFUNC_COND_2): Define.
            (IFUNC_NCOND): For x86_64 define to 2 * (N == 16).
            (MAYBE_HAVE_ATOMIC_CAS_16, MAYBE_HAVE_ATOMIC_EXCHANGE_16,
            MAYBE_HAVE_ATOMIC_LDST_16): Define to IFUNC_COND_2 rather than
            IFUNC_COND_1.
            (HAVE_ATOMIC_CAS_16): Redefine to 1 whenever IFUNC_ALT != 0.
            (HAVE_ATOMIC_LDST_16): Redefine to 1 whenever IFUNC_ALT == 1.
            (atomic_compare_exchange_n): Define whenever IFUNC_ALT != 0
            on x86_64 for N == 16.
            (__atomic_load_n, __atomic_store_n): Redefine whenever IFUNC_ALT == 1
            on x86_64 for N == 16.
            (atomic_load_n, atomic_store_n): New functions.
            * config/x86/init.c (__libat_feat1_init): On x86_64 clear bit_AVX
            if CPU vendor is not Intel.
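
In outline, the init.c vendor check described in the last hunk works like the following sketch (simplified, not the committed code; __get_cpuid, bit_AVX, bit_CMPXCHG16B, and the signature_INTEL_* constants come from GCC's <cpuid.h>).  At this point AVX is trusted only on Intel; later commits (see comments 16 and 39) extend it to AMD and Zhaoxin.

#include <cpuid.h>

static unsigned int
feat1_init (void)
{
  unsigned int eax, ebx, ecx, edx;
  unsigned int feat = 0;

  /* Leaf 0: vendor string in EBX/EDX/ECX.  */
  if (!__get_cpuid (0, &eax, &ebx, &ecx, &edx))
    return 0;
  int intel = (ebx == signature_INTEL_ebx
               && edx == signature_INTEL_edx
               && ecx == signature_INTEL_ecx);

  /* Leaf 1: feature bits in ECX (AVX is bit 28, CMPXCHG16B bit 13).  */
  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    feat = ecx;

  /* Only Intel documents the 16-byte guarantee at this point, so drop
     bit_AVX for other vendors; the ifunc resolver then picks the
     vmovdqa variant only when both bit_AVX and bit_CMPXCHG16B remain.  */
  if (!intel)
    feat &= ~bit_AVX;
  return feat;
}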
Comment 7 GCC Commits 2022-03-29 05:54:14 UTC
The releases/gcc-11 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:1861b9a9f13c64333306a2eb146b2da0a41d044f

commit r11-9729-g1861b9a9f13c64333306a2eb146b2da0a41d044f
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Mar 17 18:49:00 2022 +0100

    libatomic: Improve 16-byte atomics on Intel AVX [PR104688]
    
    As mentioned in the PR, the latest Intel SDM has added:
    "Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28])
    guarantee that the 16-byte memory operations performed by the following instructions will always be
    carried out atomically:
    • MOVAPD, MOVAPS, and MOVDQA.
    • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
    • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
    (Note that these instructions require the linear addresses of their memory operands to be 16-byte
    aligned.)"
    
    The following patch deals with it just on the libatomic library side so far;
    currently (since ~2017) we emit all the __atomic_* 16-byte builtins as
    library calls, and this is something that we can hopefully backport.
    
    The patch simply introduces yet another ifunc variant that takes priority
    over the pure CMPXCHG16B one, one that checks AVX and CMPXCHG16B bits and
    on non-Intel clears the AVX bit during detection for now (if AMD comes
    with the same guarantee, we could revert the config/x86/init.c hunk),
    which implements 16-byte atomic load as vmovdqa and 16-byte atomic store
    as vmovdqa followed by mfence.
    
    2022-03-17  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/104688
            * Makefile.am (IFUNC_OPTIONS): Change on x86_64 to -mcx16 -mcx16.
            (libatomic_la_LIBADD): Add $(addsuffix _16_2_.lo,$(SIZEOBJS)) for
            x86_64.
            * Makefile.in: Regenerated.
            * config/x86/host-config.h (IFUNC_COND_1): For x86_64 define to
            both AVX and CMPXCHG16B bits.
            (IFUNC_COND_2): Define.
            (IFUNC_NCOND): For x86_64 define to 2 * (N == 16).
            (MAYBE_HAVE_ATOMIC_CAS_16, MAYBE_HAVE_ATOMIC_EXCHANGE_16,
            MAYBE_HAVE_ATOMIC_LDST_16): Define to IFUNC_COND_2 rather than
            IFUNC_COND_1.
            (HAVE_ATOMIC_CAS_16): Redefine to 1 whenever IFUNC_ALT != 0.
            (HAVE_ATOMIC_LDST_16): Redefine to 1 whenever IFUNC_ALT == 1.
            (atomic_compare_exchange_n): Define whenever IFUNC_ALT != 0
            on x86_64 for N == 16.
            (__atomic_load_n, __atomic_store_n): Redefine whenever IFUNC_ALT == 1
            on x86_64 for N == 16.
            (atomic_load_n, atomic_store_n): New functions.
            * config/x86/init.c (__libat_feat1_init): On x86_64 clear bit_AVX
            if CPU vendor is not Intel.
    
    (cherry picked from commit 1d47c0512a265d4bb3ab9e56259fd1e4f4d42c75)
Comment 8 Xi Ruoyao 2022-04-05 12:30:38 UTC
Shall I close it as FIXED, or keep it open waiting for an AMD response?
Comment 9 Jakub Jelinek 2022-04-05 12:35:27 UTC
Besides the missing AMD response, it isn't fully fixed, because the change is on the libatomic side only.
So we still pay the cost of calling those functions (and often PLT cost too) and returning from them.
For GCC 13, we should add an option that optionally reverts the change to use library calls all the time, and enable that option by default for -march= Intel CPUs with AVX support, or something similar (perhaps only if AVX is also enabled).
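
To make the remaining cost concrete, a small example (assuming GCC 12 on x86-64; link with -latomic):

#include <stdatomic.h>

_Atomic __int128 v;

__int128
get (void)
{
  /* Not expanded inline: this compiles to a call to libatomic's
     __atomic_load_16 (typically through the PLT), even on CPUs where
     a single vmovdqa would suffice.  */
  return atomic_load (&v);
}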
Comment 10 GGanesh 2022-11-14 03:31:54 UTC
Apologies for the late response!

We will update the AMD APM manuals in the next revision.

For all AMD architectures:

Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

This means all 128-bit instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.

Can we extend this patch to AMD processors as well? If not, I plan to submit a patch for stage 1.
Comment 11 Sam James 2022-11-14 04:58:29 UTC
(In reply to GGanesh from comment #10)
> Can we extend this patch to AMD processors as well? If not, I plan to
> submit a patch for stage 1.

GCC 13 (as of today) is in stage 3 - see https://gcc.gnu.org/develop.html, but it may or may not still be possible to submit it (not my call).
Comment 12 Jakub Jelinek 2022-11-14 07:54:19 UTC
I've posted the patches (so far only lightly tested):
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606021.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606022.html
It is still Sunday in AoE (Anywhere on Earth), so we are still in stage 1 there.
Comment 13 Alexander Monakov 2022-11-14 09:24:58 UTC
Jakub, sorry if I've misunderstood the patches from a brief glance, but what ordering guarantees are you assuming for the AVX accesses? It should not be SEQ_CST. I think what the Intel manual is saying is that such accesses will not tear, but reordering follows the pre-existing x86-TSO rules (a load can complete before an earlier store is globally visible).
Comment 14 Jakub Jelinek 2022-11-14 09:29:30 UTC
For ordering guarantees I assume (already since the r12-7689 change) that
VMOVDQA behaves the same as MOVL/MOVQ.
This PR was about whether there is a guarantee that VMOVDQA will be an atomic load or store given a 128-bit-aligned address.
Comment 15 Alexander Monakov 2022-11-14 09:52:03 UTC
Ah, there will be an mfence after the vmovdqa when necessary for an atomic store, thanks (I missed that because the testcase doesn't scan for mfence).
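
For reference, the fence matters because of the classic x86-TSO store-buffering case; a sketch (run each function in its own thread):

#include <stdatomic.h>

_Atomic int x, y;
int r0, r1;

void
thread0 (void)
{
  atomic_store_explicit (&x, 1, memory_order_relaxed);  /* plain store */
  r0 = atomic_load_explicit (&y, memory_order_relaxed);
}

void
thread1 (void)
{
  atomic_store_explicit (&y, 1, memory_order_relaxed);
  r1 = atomic_load_explicit (&x, memory_order_relaxed);
}

/* With plain (fence-less) stores, r0 == 0 && r1 == 0 is observable on
   x86: each load may complete before the other thread's store drains
   from its store buffer.  seq_cst forbids that outcome, which is why
   the 16-byte seq_cst store is vmovdqa + mfence (for 8 bytes, xchg
   serves the same purpose).  */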
Comment 16 GCC Commits 2022-11-15 07:18:42 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:4a7a846687e076eae58ad3ea959245b2bf7fdc07

commit r13-4048-g4a7a846687e076eae58ad3ea959245b2bf7fdc07
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 15 08:14:45 2022 +0100

    libatomic: Handle AVX+CX16 AMD like Intel for 16b atomics [PR104688]
    
    We got a response from AMD in
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688#c10
    so the following patch starts treating AMD with AVX and CMPXCHG16B
    ISAs like Intel by using vmovdqa for atomic load/store in libatomic.
    We still don't have confirmation from Zhaoxin and VIA (anything else
    with CPUs featuring AVX and CX16?).
    
    2022-11-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/104688
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on AMD CPUs.
Comment 17 Jakub Jelinek 2022-11-15 07:20:14 UTC
Fixed for AMD on the library side too.
We need a statement from Zhaoxin and VIA for their CPUs.
Comment 18 GCC Commits 2022-11-20 23:11:55 UTC
The releases/gcc-12 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:86dea99d8525bf49d51636332d6be440e51b931a

commit r12-8920-g86dea99d8525bf49d51636332d6be440e51b931a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 15 08:14:45 2022 +0100

    libatomic: Handle AVX+CX16 AMD like Intel for 16b atomics [PR104688]
    
    We got a response from AMD in
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688#c10
    so the following patch starts treating AMD with AVX and CMPXCHG16B
    ISAs like Intel by using vmovdqa for atomic load/store in libatomic.
    We still don't have confirmation from Zhaoxin and VIA (anything else
    with CPUs featuring AVX and CX16?).
    
    2022-11-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/104688
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on AMD CPUs.
    
    (cherry picked from commit 4a7a846687e076eae58ad3ea959245b2bf7fdc07)
Comment 19 GCC Commits 2022-11-21 09:23:03 UTC
The releases/gcc-11 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:60880f3afc82f55b834643e449883dd5b6ad057a

commit r11-10385-g60880f3afc82f55b834643e449883dd5b6ad057a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 15 08:14:45 2022 +0100

    libatomic: Handle AVX+CX16 AMD like Intel for 16b atomics [PR104688]
    
    We got a response from AMD in
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688#c10
    so the following patch starts treating AMD with AVX and CMPXCHG16B
    ISAs like Intel by using vmovdqa for atomic load/store in libatomic.
    We still don't have confirmation from Zhaoxin and VIA (anything else
    with CPUs featuring AVX and CX16?).
    
    2022-11-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/104688
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on AMD CPUs.
    
    (cherry picked from commit 4a7a846687e076eae58ad3ea959245b2bf7fdc07)
Comment 20 Xi Ruoyao 2022-11-23 09:18:31 UTC
From Mayshao (Zhaoxin engineer):

"On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed memory is Write Back, but it's not guaranteed for other memory types."

Is it allowed to use VMOVDQA then?
Comment 21 Jakub Jelinek 2022-11-23 09:51:44 UTC
What about loads?  That is even more important than the stores.  While an atomic store can in the worst case be done through cmpxchg16b, even when that is slower, we can't use cmpxchg16b for an atomic load because we don't know whether the memory might be read-only.
As for Write Back only vs. other types, doesn't that match the
"for cacheable" in the AMD statement?
Comment 22 Xi Ruoyao 2022-11-23 10:23:03 UTC
(In reply to Jakub Jelinek from comment #21)
> What about loads?  That is even more important than the stores.  While an
> atomic store can in the worst case be done through cmpxchg16b, even when
> that is slower, we can't use cmpxchg16b for an atomic load because we
> don't know whether the memory might be read-only.

Loads are also atomic for WB.

> As for the Write Back only vs. other types, doesn't that match the
> " for cacheable" in the AMD statement?

If I read the manual correctly, Write Back, Write Through, and Write Protected are all "cacheable".  Mayshao told me VMOVDQA is atomic for WB, but not for UC and WC (those are not cacheable, so I think we don't need to care about them).  So what about WT and WP?
Comment 23 Peter Cordes 2022-11-28 18:35:22 UTC
(In reply to Xi Ruoyao from comment #20)
> "On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed
> memory is Write Back, but it's not guaranteed for other memory types."
 
VMOVDQA is still fine, I think; WB is the only memory type that's relevant for atomics, at least on the mainstream OSes we compile for.  It's not normally possible for user-space to allocate memory of other types.  Kernels normally use WB memory for their shared data, too.

You're correct that WT and WP are the other two cacheable memory types, and Zhaoxin's statement doesn't explicitly guarantee atomicity for those, unlike Intel and AMD.

But at least on Linux, I don't think there's a way for user-space to even ask for a page of WT or WP memory (or UC or WC).  Only WB memory is easily available without hacking the kernel.  As far as I know, this is true on other existing OSes.

WT = write-through: read caching, no write-allocate.  Write hits update the line and memory.
WP = write-protect: read caching, no write-allocate.  Writes go around the cache, evicting even on hit.
(https://stackoverflow.com/questions/65953033/whats-the-usecase-of-write-protected-pat-memory-type quotes the Intel definitions.)

Until recently, the main work on formalizing the x86-TSO memory model had only looked at WB memory.
A 2022 paper, "Extending Intel-x86 Consistency and Persistency: Formalising the Semantics of Intel-x86 Memory Types and Non-temporal Stores" (https://dl.acm.org/doi/pdf/10.1145/3498683), looked at the WT, UC, and WC memory types.
(The intro part describing memory types is quite readable, in plain English rather than formal symbols.  They only mention WP once, but tested litmus tests with readers and writers using any combination of the other memory types.)


Some commenters on my answer about when WT is ever used or useful confirmed that mainstream OSes don't give easy access to it.
https://stackoverflow.com/questions/61129142/when-use-write-through-cache-policy-for-pages/61130838#61130838
* Linux has never merged a patch to let user-space allocate WT pages.
* The Windows kernel reportedly doesn't have a mechanism to keep track of pages that should be WT or WP, so you won't find any.

I don't know about *BSD making it plausible for user-space to point an _Atomic int * at a page of WT or WP memory.  I'd guess not.

I don't know if there's anywhere we can document that _Atomic objects need to be in memory that's allocated in a "normal" way.  Probably hard to word without accidentally disallowing something that's fine.
Comment 24 Alexander Monakov 2022-11-28 18:46:35 UTC
(In reply to Peter Cordes from comment #23)
> But at least on Linux, I don't think there's a way for user-space to even
> ask for a page of WT or WP memory (or UC or WC).  Only WB memory is easily
> available without hacking the kernel.  As far as I know, this is true on
> other existing OSes.

I think it's possible to get UC/WC mappings via a graphics/compute API (e.g. OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device memory (and then CPU vendor cannot guarantee that 128b access won't tear because it might depend on downstream devices).
Comment 25 Peter Cordes 2022-11-28 19:03:07 UTC
(In reply to Alexander Monakov from comment #24)
> 
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> memory (and then CPU vendor cannot guarantee that 128b access won't tear
> because it might depend on downstream devices).


Even atomic_int doesn't work properly if you deref a pointer to WC memory.  WC doesn't have the same ordering guarantees, so it would break acquire/release semantics.
So we already don't support WC for this.

We do at least de-facto support atomics on UC memory because the ordering guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned load/store is guaranteed even for non-cacheable memory types since P5 Pentium (and on AMD).  (And lock cmpxchg16b is always atomic even on UC memory.)

But you're right that only Intel guarantees 16-byte VMOVDQA loads/stores are atomic even on UC memory.  So this change could break that very unwise corner case on AMD, which guarantees this only for cacheable loads/stores, and on Zhaoxin, which guarantees it only for WB.

But was anyone previously using 16-byte atomics on UC device memory?  Do we actually care about supporting that?  I'd guess no and no, so it's just a matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I *hope* people weren't using __atomic_store_n on device memory.  The underlying implementation was never guaranteed.
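
The non-lock-free reporting is easy to confirm (assuming GCC 7 or later on x86-64):

#include <stdio.h>

int
main (void)
{
  /* Prints 0: since GCC 7, 16-byte atomics are not advertised as
     always lock-free on x86-64, even with -mcx16.  */
  printf ("%d\n", (int) __atomic_always_lock_free (16, 0));
  /* Prints 1: 8-byte atomics remain inline and lock-free.  */
  printf ("%d\n", (int) __atomic_always_lock_free (8, 0));
  return 0;
}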
Comment 26 Alexander Monakov 2022-11-28 20:11:20 UTC
Sure, the right course of action seems to be to simply document that atomic types and built-ins are meant to be used on "common" (writeback) memory, and no guarantees can be given otherwise, because it would involve platform specifics (relaxed ordering of WC writes as you say; tearing by PCI bridges and device interfaces seems like another possible caveat).
Comment 27 Peter Cordes 2022-11-28 20:47:26 UTC
(In reply to Alexander Monakov from comment #26)
> Sure, the right course of action seems to be to simply document that atomic
> types and built-ins are meant to be used on "common" (writeback) memory

Agreed.  Where in the manual should this go?  Maybe a new subsection of the chapter about __atomic builtins where we document per-ISA requirements for them to actually work?

e.g. x86 memory-type requirements, and that ARM assumes all cores are in the same inner-shareable cache-coherency domain, so barriers are dmb ish rather than dmb sy, and so on.
I guess we might want to avoid documenting the actual asm implementation strategies in the main manual, because that would imply it's supported to make assumptions based on them.

Putting it near the __atomic docs might make it easier for readers to notice that the list of requirements exists, vs. scattering them into different pages for different ISAs.  And we don't currently have any section in the manual about per-ISA quirks or requirements, just about command-line options, builtins, and attributes that are per-ISA, so there's no existing page where this could get tacked on.

This would also be a place where we can document that __atomic ops are address-free when they're lock-free, and thus usable on shared memory between processes.  ISO C++ says that *should* be the case for std::atomic<T>, but doesn't standardize the existence of multiple processes.

To avoid undue worry, documentation about this should probably start by saying that normal programs (running under mainstream OSes) don't have to worry about it or do anything special.
Comment 28 Florian Weimer 2022-11-29 08:11:25 UTC
(In reply to Peter Cordes from comment #27)
> (In reply to Alexander Monakov from comment #26)
> > Sure, the right course of action seems to be to simply document that atomic
> > types and built-ins are meant to be used on "common" (writeback) memory
> 
> Agreed.  Where in the manual should this go?  Maybe a new subsection of the
> chapter about __atomic builtins where we document per-ISA requirements for
> them to actually work?

Maybe this belongs in the ABI manual? For example, the POWER ABI says that memcpy needs to work on device memory. Documenting the required memory types for atomics seems along the same lines.

The rules are also potentially different for different targets sharing the same processor architecture.
Comment 29 Segher Boessenkool 2023-02-15 12:27:06 UTC
(In reply to Florian Weimer from comment #28)
> Maybe this belongs in the ABI manual? For example, the POWER ABI says that
> memcpy needs to work on device memory.

Huh?!

Where do you see this?  The way you state it, it is trivially impossible to
implement, so if we really say that, it needs fixing ASAP.
Comment 30 Florian Weimer 2023-02-15 12:46:56 UTC
(In reply to Segher Boessenkool from comment #29)
> (In reply to Florian Weimer from comment #28)
> > Maybe this belongs in the ABI manual? For example, the POWER ABI says that
> > memcpy needs to work on device memory.
> 
> Huh?!
> 
> Where do you see this?  The way you state it it is trivially impossible to
> implement, so if we really say that it needs fixing asap.

I thought I had an explicit documented reference somewhere, but for now all we have is an undocumented requirement (so this is not a good example in the context of this bug at all):

[PATCH] powerpc: Use aligned stores in memset
<https://inbox.sourceware.org/libc-alpha/1503033107-20047-1-git-send-email-raji@linux.vnet.ibm.com/>

(There's also a CPU quirk in this area, but I think this wasn't about that.)
Comment 31 Segher Boessenkool 2023-02-15 16:03:20 UTC
Yes, there was a user who incorrectly used memcpy on non-memory memory.

This is not valid, and never has been.
Comment 32 Andrew Pinski 2023-02-15 16:07:46 UTC
(In reply to Segher Boessenkool from comment #31)
> Yes, there was a user who incorrectly used memcpy on non-memory memory.
From what I remember (it was also reported for aarch64 at one point), one of the graphics libraries would memcpy from normal memory to GPU memory (over PCIe), and memcpy will sometimes use unaligned accesses, which cause a fault on the GPU memory.
Comment 33 Segher Boessenkool 2023-02-15 16:09:43 UTC
Yes, exactly.  It was the X server I think?  I try to forget such horrors :-)
Comment 34 Mayshao-oc 2024-07-10 05:51:18 UTC
(In reply to Jakub Jelinek from comment #17)
> Fixed for AMD on the library side too.
> We need a statement from Zhaoxin and VIA for their CPUs.

Sorry for the late reply.
On Zhaoxin processors, we guarantee that VMOVDQA is an atomic load or store given a 128-bit-aligned address, provided that the memory type is WB.
Can we extend this patch to Zhaoxin processors as well?
Comment 35 Uroš Bizjak 2024-07-10 06:33:30 UTC
(In reply to Mayshao-oc from comment #34)
> Can we extend this patch to Zhaoxin processors as well?

Just post the enablement patch to the gcc-patches@ mailing list.
Comment 36 Richard Henderson 2024-07-10 16:57:19 UTC
(In reply to Mayshao-oc from comment #34)
> (In reply to Jakub Jelinek from comment #17)
> > Fixed for AMD on the library side too.
> > We need a statement from Zhaoxin and VIA for their CPUs.
> 
> Sorry for the late reply.
> On Zhaoxin processors, we guarantee that VMOVDQA is an atomic load or store
> given a 128-bit-aligned address, provided that the memory type is WB.
> Can we extend this patch to Zhaoxin processors as well?

Is VMOVDQU atomic on Zhaoxin processors, provided the address is aligned?

In QEMU, we make use of this additional guarantee from AMD.
We also reference this gcc bugzilla entry for documentation.  :-)
Comment 37 Mayshao-oc 2024-07-15 11:12:34 UTC
vmovdqu is also atomic on Zhaoxin processors if it meets three requirements:
1. the address of its memory operand is 16-byte aligned;
2. the vmovdqu is VEX.128-encoded, not VEX.256;
3. the memory type of the address is WB.
Comment 39 GCC Commits 2024-07-18 20:45:18 UTC
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:

https://gcc.gnu.org/g:9846b0916c1a9b9f3e9df4657670ef4419617134

commit r15-2147-g9846b0916c1a9b9f3e9df4657670ef4419617134
Author: mayshao <mayshao-oc@zhaoxin.com>
Date:   Thu Jul 18 22:43:00 2024 +0200

    libatomic: Handle AVX+CX16 ZHAOXIN like Intel for 16b atomic [PR104688]
    
            PR target/104688
    
    libatomic/ChangeLog:
    
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on ZHAOXIN CPUs.