Bug 80835 - Reading a member of an atomic<struct> can load just that member, not the whole struct
Summary: Reading a member of an atomic<struct> can load just that member, not the whole struct
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-05-20 08:30 UTC by Peter Cordes
Modified: 2017-05-22 20:29 UTC
2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Peter Cordes 2017-05-20 08:30:50 UTC
For std::atomic<pointer_and_counter> or similar small struct, accessing just one member with foo.load().m compiles to an atomic load of the entire thing.  This is very slow for a 16B object on x86 (lock cmpxchg16b).  It's also not free for 8-byte objects in 32-bit code (bouncing through SSE).

I think it's safe to optimize partial accesses to an atomic object into narrow loads.  We need to be sure:

* it's still atomic
* it still Synchronizes With release-stores of the whole object (i.e. that other threads' ops become globally visible to us in the right order)
* it can't lead to any of this thread's operations becoming globally visible in the wrong order

For larger objects (non-lock-free), obviously once you hold the lock you can copy as much or as little as you want, but I think libatomic would need new entry-points to support small copies out of larger objects.  Acquiring the same lock that the writer used gives us acquire semantics for the load.
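
A rough sketch of what such an entry point could do (hypothetical: atomic_load_partial and the lock table below are not real libatomic symbols; this just assumes readers and writers hash the object's address to the same internal lock):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <mutex>

// Stand-in for libatomic's internal per-address lock table (hypothetical).
static std::mutex partial_locks[64];
static std::mutex &lock_for(const void *obj) {
    return partial_locks[(reinterpret_cast<uintptr_t>(obj) >> 4) % 64];
}

// Copy `len` bytes at byte offset `offset` out of a non-lock-free atomic object.
// Taking the same lock a writer of the whole object takes means the copy gets
// acquire semantics relative to that writer's release (its unlock).
void atomic_load_partial(const void *obj, size_t offset, size_t len, void *dst) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    memcpy(dst, static_cast<const char *>(obj) + offset, len);
}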

----------

For lock-free objects, foo.load(mo_relaxed).m can be a narrow load if a load with that size & alignment is still atomic.  There are no ordering requirements for the load itself, except that it's still ordered by fences and other acq/rel operations (which is the case).

mo_acquire loads need to Synchronize With release-stores of the whole object.  AFAIK, all architectures that use MESI-style coherent caches synchronize based on cache lines, not object addresses.  So an acquire-load Synchronizes With a release-store if they both touch the same cache line, even if they use different addresses and sizes.  Thus it's safe to use a narrow acquire-load on part of an object, even if the object spans two cache lines but the load only touches one.  (But note that for x86, cmpxchg16b requires a 16B-aligned memory operand, so this would only be possible with an 8B or smaller object, in which case we'd need lock cmpxchg loads and xchg stores, and those would be even slower than usual because of the cache-line split.)

For platforms other than x86, there may be issues if using synchronization primitives that only synchronize a particular address, but I'm not familiar with them.
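
Concretely, the transformation being asked for in the lock-free case amounts to something like this (a hand-written model of the desired compiler output, not anything gcc does today; the counted_ptr struct matches the simplified example further down, and the reinterpret_cast stands in for a compiler-internal access that plain source code couldn't legally write):

#include <atomic>
#include <stdint.h>

struct node;
struct alignas(2 * sizeof(void*)) counted_ptr {
    node *ptr;
    uintptr_t count;
};

// Narrow atomic load of just the .ptr member, at the member's address, keeping
// the requested memory order, instead of an atomic load of the whole 16 bytes.
node *load_ptr_narrow(std::atomic<counted_ptr> *p) {
    counted_ptr *inner = reinterpret_cast<counted_ptr *>(p);   // compiler-internal view
    return __atomic_load_n(&inner->ptr, __ATOMIC_ACQUIRE);
}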

------------

You can use a union of an atomic<struct> and a struct of atomic<> members to implement efficient read-only access to a single member with current gcc/clang; the fact that this works shows the optimization is safe in practice.  See http://stackoverflow.com/questions/38984153/implement-aba-counter-with-c11-cas/38991835#38991835.  It would be nice if gcc would do the optimization for us on targets where it really is safe.

 If someone uses a union to store to part of an atomic object, the resulting UB is their problem.  Lock-free atomic stores must always store the whole object.  It usually doesn't make sense to store less than the whole object anyway, and C++ std::atomic has no way to get an lvalue referring to a single member of an atomic<struct>.
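
A minimal sketch of that union trick (relying, as the SO answer does, on gcc/clang tolerating reads of an inactive union member; ISO C++ doesn't bless it, and all stores and RMWs must still go through the whole-object member):

#include <atomic>
#include <stdint.h>

struct node;
struct alignas(2 * sizeof(void*)) counted_ptr {
    node *ptr;
    uintptr_t count;
};

union counted_ptr_alias {
    std::atomic<counted_ptr> whole;             // all stores and CAS go through this
    struct { std::atomic<node*> ptr; } parts;   // read-only view of the first member
    counted_ptr_alias() : whole(counted_ptr{nullptr, 0}) {}
};

node *load_ptr(counted_ptr_alias *p) {
    // 8-byte acquire load instead of a lock cmpxchg16b load of the whole struct.
    return p->parts.ptr.load(std::memory_order_acquire);
}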

---------

One worry I had was Intel's x86 manual saying:

> 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

> The Intel-64 memory-ordering model allows a load to be reordered with an
> earlier store to a different location. However, loads are not reordered with
> stores to the same location.

We know (from  http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it) that there can be reordering when the load and store addresses don't match, even if they overlap.

But it seems Intel only means within a single thread, i.e. that a thread observes its own operations to happen in program order.  Store-forwarding allows a store/reload to become globally visible load-first.

But I think if the store fully overlaps the load (the opposite of what the SO question is about), there's no way you can observe that reordering, because all the load data comes from the store.  That just means the store and reload are adjacent to each other in the total order, with no way for a store from another thread to sneak in between them.

See discussion in comments for more about this: 
 http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it/35910141?noredirect=1#comment75186782_35910141


-----

A simplified example of loading one member of a pointer+counter struct:

// https://godbolt.org/g/k063wH for the code with a union, from the SO Q&A.

// https://godbolt.org/g/hNoJzj for this simplified code:
#include <atomic>
#include <stdint.h>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;    // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

gcc8 -mx32 -O3 -march=haswell is able to make pretty good code:

        movq    (%edi), %rax
        movl    %eax, %eax
        ret

I didn't think it was necessary to zero-extend return values at all, unless there's something special about pointer return values in the x32 ABI.  But if we do need to, it would be more efficient to avoid a mov same,same, because Intel Haswell always needs an execution unit in that case: it can only eliminate MOV in the rename stage between two different architectural registers.


From gcc8-snapshot -m64 -O3 -march=haswell:

        subq    $8, %rsp
        movl    $2, %esi
        call    __atomic_load_16
        addq    $8, %rsp
        ret

IDK why this doesn't tail-call optimize to just mov $2,%esi / jmp.

With gcc6.x and earlier, the cmpxchg16b is inlined.  (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html changed that for gcc7).

Efficient read-mostly access with occasional cmpxchg16b updates of the pointer+counter is a potential use-case for atomic 16B objects.  Maybe something to keep in mind when deciding whether to change from the current strategy...
Comment 1 Peter Cordes 2017-05-20 19:08:02 UTC
Related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490: the lack of a way for CPUs to advertise when 16B SSE loads/stores are atomic, and the fact that cmpxchg16b segfaults on memory mapped with mmap(PROT_READ), which is one of the reasons for not inlining cmpxchg16b (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).

I tried to add a CC to Torvald Riegel, since a read-mostly pointer + ABA-counter may be a relevant use-case for 16-byte atomic objects, where the reads only want the current pointer value.

Using locking for 16B objects would break this use-case.
Comment 2 Jonathan Wakely 2017-05-22 12:09:05 UTC
You've reported this against libstdc++ but there's no way the library code can possibly transform p->load(std::memory_order_acquire).ptr into loading a subobject. std::atomic::load has no idea what you plan to do with the returned value.
Comment 3 Peter Cordes 2017-05-22 15:32:55 UTC
(In reply to Jonathan Wakely from comment #2)
> You've reported this against libstdc++

I had to take a guess at the right component, based on a couple other std::atomic bugs I looked at.  Apparently I picked wrong, if libstdc++ really is just the headers and not compiler internals at all.  Can someone please mark the appropriate component?  I mostly just look at compiler asm output, not the real internals.
Comment 4 Peter Cordes 2017-05-22 19:27:49 UTC
Thanks for correcting my mistake in tagging this bug, but this got me thinking it's not just a C++ issue.

This also applies to GNU C __atomic_load_n(), and ISO C11 stdatomic code like

#include <stdatomic.h>
#include <stdint.h>
uint32_t load(atomic_uint_fast64_t *p) {  // https://godbolt.org/g/CXuiPO
  return *p;
}

With -m32, it's definitely useful to only do a 32-bit load.  With -m64, that might let us fold the load as a memory operand, like `add (%rdi), %eax` or something.

In 64-bit code, it just compiles to movq (%edi),%rax; ret.  But if we'd wanted the high half, it would have been a load and shift instead of just a 32-bit load.  Or shift+zero-extend if we'd wanted bytes 1 through 4 with (*p) >> 8.  We know that a 64-bit load of *p doesn't cross a cache-line boundary (otherwise it wouldn't be atomic), so neither will any load that's fully contained by it.  
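
A C++ rendering of the high-half case described above (the C11 version with atomic_uint_fast64_t is analogous; this just shows the access pattern, not current gcc output):

#include <atomic>
#include <stdint.h>

uint32_t low_half(std::atomic<uint64_t> *p) {
    return (uint32_t)p->load();          // low half: already cheap, the truncation is free
}

uint32_t high_half(std::atomic<uint64_t> *p) {
    // Today this is a full 64-bit atomic load plus a shift; a plain 4-byte load
    // of the upper half would be just as atomic (it can't cross a cache line if
    // the 8-byte load didn't) and would stay correctly ordered.
    return (uint32_t)(p->load() >> 32);
}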

This optimization should be disabled for `volatile`, if that's supposed to make stdatomic usable for MMIO registers.

------------------

It's common to use a load as setup for a CAS.  For cmpxchg8b/cmpxchg16b, it's most efficient to do that setup load as two separate narrower loads.  gcc even does this for us with code like *p |= 3, but we can't get that behaviour if we write a CAS loop ourselves.  This makes the un-contended case slower, even if the cost is just an xmm->int delay for an 8-byte load rather than a function call + CMPXCHG16B.


void cas_compiler(atomic_uint_fast64_t *p) {
  *p |= 3;  // separate 32-bit loads before loop
}
        #gcc8 snapshot 20170522 -m32, and clang does the same
        ... intro, including pushing 4 regs, two of which aren't used :(
        movl    (%esi), %eax    #* p, tmp90
        movl    4(%esi), %edx   #,
        ... (loop with MOV, OR, and CMPXCHG8B)


void cas_explicit(atomic_uint_fast64_t *p) {
  // AFAIK, it would be data-race UB to cast to regular uint64_t* for a non-atomic load
  uint_fast64_t expect = atomic_load_explicit(p, memory_order_relaxed);
  _Bool done = 0;
  do {
    uint_fast64_t desired = expect | 3;
    done = atomic_compare_exchange_weak(p, &expect, desired);
  } while (!done);
}
        ... similar setup, but also reserve some stack space
        movq    (%esi), %xmm0   #* p, tmp102
        movq    %xmm0, (%esp)   # tmp102,
        movl    (%esp), %eax    #, expect
        movl    4(%esp), %edx   #, expect
        ...  (then the same loop)

I think it's legal to split an atomic load into two halves if the value can't escape and is only feeding the old/new args of a CAS:

If the cmpxchg8/16b succeeds, that means the "expected" value really was in memory at that point.  Seeing it earlier than a full-width load would have (because of tearing between two previous values) is indistinguishable from a memory ordering that a full-width atomic load could legally have observed.  We could have got the same result if everything in this thread had happened after the store that made the value seen by cmpxchg8/16b globally visible.  So this also requires that there are no other synchronization points between the load and the CAS.
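
A hand-written model of that split (using the GNU __atomic builtins; this models what the compiler could legally emit, not portable source code: accessing halves of the atomic object through a cast is outside the C/C++ memory model, and the half ordering assumes little-endian x86):

#include <atomic>
#include <stdint.h>

// Model of compiler output for *p |= 3 when the full-width atomic load is
// expensive (cmpxchg8b with -m32, or cmpxchg16b for 16-byte objects): build
// the CAS "expected" value from two cheap narrow loads.  A torn "expected"
// either matches what is really in memory at CAS time (indistinguishable from
// a legal ordering, as argued above) or makes the first attempt fail, and the
// failing CAS reloads the real value atomically.
void or3(std::atomic<uint64_t> *p) {
    uint32_t *halves = reinterpret_cast<uint32_t *>(p);   // compiler-internal view
    uint64_t expected = (uint64_t)__atomic_load_n(&halves[1], __ATOMIC_RELAXED) << 32
                      | __atomic_load_n(&halves[0], __ATOMIC_RELAXED);
    while (!p->compare_exchange_weak(expected, expected | 3)) {
        // on failure, `expected` now holds the value the CAS actually saw
    }
}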


For 16-byte objects in 64-bit code, this saves a CMPXCHG16B-load, so it's about half the cost in the no-contention case.

For 8-byte objects with -m32, the saving is smaller, but probably not totally irrelevant in the un-contended case.  An SSE2 load and xmm->int transfer (via ALU or store/reload) might be the same order of magnitude in cost as lock cmpxchg8b on AMD Bulldozer.  As far as the latency chain for operations on a single atomic variable goes, lock cmpxchg8b has ~42 cycle latency on Piledriver (according to Agner Fog), while an extra xmm->int step adds about 8c of latency beyond loading directly into integer regs.  (The throughput costs of lock cmpxchg8b are vastly higher, though: 18 m-ops instead of 3 for movd/pextrd.)
Comment 5 Peter Cordes 2017-05-22 20:29:15 UTC
Previous godbolt link was supposed to be: https://godbolt.org/g/78kIAl
which includes the CAS functions.