This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug libstdc++/80835] New: Reading a member of an atomic<struct> can load just that member, not the whole struct


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835

            Bug ID: 80835
           Summary: Reading a member of an atomic<struct> can load just
                    that member, not the whole struct
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

For std::atomic<pointer_and_counter> or a similar small struct, accessing just
one member with foo.load().m compiles to an atomic load of the entire object.
This is very slow for a 16B object on x86 (lock cmpxchg16b).  It's also not
free for 8-byte objects in 32-bit code (bouncing through SSE).

I think it's safe to optimize partial accesses to an atomic object into narrow
loads.  We need to be sure that:

* it's still atomic
* it still Synchronizes With release-stores of the whole object (i.e. that
other threads' ops become globally visible to us in the right order)
* it can't lead to any of this thread's operations becoming globally visible in
the wrong order

For larger objects (non-lock-free), obviously once you hold the lock you can
copy as much or as little as you want, but I think libatomic would need new
entry-points to support small copies out of larger objects.  Acquiring the same
lock that the writer used gives us acquire semantics for the load.
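
As a rough sketch only (the name, signature, and semantics below are invented
here for illustration; libatomic does not currently provide anything like
this), such an entry point might look like:

#include <cstddef>

// Hypothetical partial-load entry point: copy LEN bytes starting at OFFSET
// out of the SIZE-byte non-lock-free atomic object at SRC into RET, taking
// the same lock __atomic_load would take, so the partial load still gets
// acquire semantics from the lock.  (Invented for illustration only.)
extern "C" void __atomic_load_partial (std::size_t size, const void *src,
                                       std::size_t offset, std::size_t len,
                                       void *ret, int memorder);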

----------

For lock-free objects, foo.load(mo_relaxed).m can be a narrow load if a load
with that size & alignment is still atomic.  There are no ordering requirements
for the load itself, except that it's still ordered by fences and other acq/rel
operations (which is the case).

mo_acquire loads need to Synchronize With release-stores of the whole object.
AFAIK, all architectures that use MESI-style coherent caches synchronize based
on cache lines, not object addresses.  So an acquire-load Synchronizes With a
release-store if they both touch the same cache line, even if they use
different addresses and sizes.  Thus it's safe to use a narrow acquire-load on
part of an object, even if the object spans two cache lines but the load only
touches one.  (But note that for x86, cmpxchg16b requires a 16B-aligned memory
operand, so a cache-line-split object would only be possible for an 8B or
smaller object, in which case we'd need lock cmpxchg loads and xchg stores, and
those would be even slower than usual because of the cl-split.)
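
As a concrete sketch of the pattern being discussed (the struct and function
names here are mine, for illustration only): the writer release-stores the
whole 16B object and the reader only wants the pointer member.

#include <atomic>
#include <cstdint>

struct node;
struct pair16 {
    node *ptr;
    std::uintptr_t seq;
};

std::atomic<pair16> g;

void publish(node *n, std::uintptr_t s) {
  g.store({n, s}, std::memory_order_release);  // release-store of the whole object
}

node *consume_ptr() {
  // As compiled today, this is a full 16-byte atomic acquire load.  The
  // proposed optimization would shrink it to an 8-byte acquire load of
  // just .ptr, which on MESI-style hardware still Synchronizes With
  // publish() because both accesses touch the same cache line.
  return g.load(std::memory_order_acquire).ptr;
}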

For platforms other than x86, there may be issues if using synchronization
primitives that only synchronize a particular address, but I'm not familiar
with them.

------------

You can use a union of an atomic<struct> and a struct of atomic<> members to
implement efficient read-only access to a single member with current gcc/clang;
that works because this optimization is safe in practice.  See
http://stackoverflow.com/questions/38984153/implement-aba-counter-with-c11-cas/38991835#38991835.
It would be nice if gcc would do the optimization for us on targets where it
really is safe.
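
A minimal sketch of that union trick (names are mine; it relies on gcc/clang
treating std::atomic<T> as layout-compatible with T and tolerating reads
through the inactive union member, neither of which ISO C++ guarantees):

#include <atomic>
#include <cstdint>

struct node;
struct alignas(2 * sizeof(void*)) counted_ptr {
    node *ptr;
    std::uintptr_t count;
};

union counted_ptr_union {
    std::atomic<counted_ptr> whole;       // 16B loads/stores, cmpxchg16b updates
    struct {
        std::atomic<node*> ptr;           // cheap 8B read of just the pointer
        std::atomic<std::uintptr_t> count;
    } parts;
};

node *load_ptr_only(counted_ptr_union *p) {
  // Plain 8-byte acquire load instead of a 16-byte lock cmpxchg16b load.
  return p->parts.ptr.load(std::memory_order_acquire);
}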

 If someone uses a union to store to part of an atomic object, the resulting UB
is their problem.  Lock-free atomic stores must always store the whole object. 
It usually doesn't make sense to store less than the whole object anyway, and
C++ std::atomic has no way to get an lvalue referring to a single member of an
atomic<struct>.

---------

One worry I had was Intel's x86 manual saying:

> 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

> The Intel-64 memory-ordering model allows a load to be reordered with an
> earlier store to a different location. However, loads are not reordered with
> stores to the same location.

We know (from 
http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it)
that there can be reordering when the load and store addresses don't match,
even if they overlap.

But it seems Intel only means within a single thread, i.e. that a thread
observes its own operations to happen in program order.  Store-forwarding
allows a store/reload to become globally visible load-first.

But I think if the store fully overlaps the load (the opposite of what the SO
question is about), there's no way you can observe that reordering, because all
the load data comes from the store.  That just means the store and reload are
adjacent to each other in the total order, with no way for a store from another
thread to sneak in between them.

See discussion in comments for more about this: 

http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it/35910141?noredirect=1#comment75186782_35910141


-----

A simplified example of loading one member of a pointer+counter struct:

// https://godbolt.org/g/k063wH for the code with a union, from the SO Q&A.

// https://godbolt.org/g/hNoJzj for this simplified code:
#include <atomic>
#include <stdint.h>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;    // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

gcc8 with -mx32 -O3 -march=haswell is able to make pretty good code:

        movq    (%edi), %rax
        movl    %eax, %eax
        ret

I didn't think it was necessary to zero-extend return values at all.  Unless
there's something special about pointer return values in the x32 ABI.  But if
we do need to, it would be more efficient to avoid mov same,same, because Intel
Haswell always needs an execution unit in that case.  It can only eliminate MOV
in the rename stage between two different architectural registers.


From gcc8-snapshot -m64 -O3 -march=haswell:

        subq    $8, %rsp
        movl    $2, %esi
        call    __atomic_load_16
        addq    $8, %rsp
        ret

IDK why this doesn't tail-call optimize to just mov $2,%esi / jmp.

With gcc6.x and earlier, the cmpxchg16b is inlined. 
(https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html changed that for
gcc7).

Efficient read-mostly access with occasional cmpxchg16b updates of the
pointer+counter is a potential use-case for atomic 16B objects.  Maybe
something to keep in mind when deciding whether to change from the current
strategy...
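
For reference, a sketch of that read-mostly pattern, reusing the counted_ptr
and node declarations from the example above (function names are mine):

// Reader: only wants the pointer.  Ideally an 8-byte acquire load of .ptr;
// today it's a full 16-byte atomic load (__atomic_load_16 on gcc7/8).
node *reader(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

// Occasional writer: cmpxchg16b updates pointer and counter together, so a
// reused pointer value can be distinguished by its counter (the ABA counter
// from the SO Q&A linked above).
bool try_update(std::atomic<counted_ptr> *p, node *desired) {
  counted_ptr expected = p->load(std::memory_order_relaxed);
  counted_ptr next = { desired, expected.count + 1 };
  return p->compare_exchange_weak(expected, next,
                                  std::memory_order_acq_rel,
                                  std::memory_order_relaxed);
}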
