[Bug target/65146] alignment of _Atomic structure member is not correct


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
Created attachment 42125
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42125&action=edit
C11 / pthreads test case for tearing of atomic_llong.  compile with -m32
-pthread

This is a real bug.  See attached C11 testcase for atomic_llong tearing of
loads/stores in practice on real x86 hardware with gcc -m32.  (It also compiles
as C++11 to show the difference).

One thread writes 0 and -1 alternating, the other thread reads and checks that
the value is either 0 or -1.  (Or with `double`, 0 and -1.1, or with a
two-member struct, {-1,-1}).

Compile with gcc -m32 -march=native -O3 -pthread
atomic_llong-underaligned-in-struct.c  && ./a.out

offset of x in AtomicStruct = 60. addr=0x565a80bc.  lockfree() = 1
    sizeof(test_type) = 8.  test_type = long long
alignof(AtomicStruct) = 4, alignof(atomic_ttype) = 4, alignof(test_type) = 4
found tearing: tmp = 0x000000ffffffff

If there's a problem, the whole program uses under a millisecond of CPU time,
probably most of it on startup and printing.  (e.g. perf counters show
machine_clears.memory_ordering = 72, so it didn't spend long in the read loop
with the writer going.)  I get the object to cross a cache-line boundary when
the type is under-aligned by using `struct AtomicStruct { char filler[57];
atomic_ttype x; };` and putting alignas(64) on the struct.
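
For reference, here is a stripped-down sketch of that setup (not the attached
testcase, just the same idea; the thread layout and names like `writer` are
mine).  Build with gcc -m32 -march=native -O3 -pthread:

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <pthread.h>

typedef _Atomic long long atomic_ttype;

struct AtomicStruct {
    char filler[57];   /* with only 4B alignment, x lands at offset 60... */
    atomic_ttype x;    /* ...and straddles the 64-byte cache-line boundary */
};

static _Alignas(64) struct AtomicStruct s;

static void *writer(void *arg)
{
    (void)arg;
    for (;;) {         /* alternate two values whose 32-bit halves differ */
        atomic_store_explicit(&s.x, 0, memory_order_relaxed);
        atomic_store_explicit(&s.x, -1LL, memory_order_relaxed);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    printf("offset of x in AtomicStruct = %zu.  lockfree() = %d\n",
           offsetof(struct AtomicStruct, x), atomic_is_lock_free(&s.x));

    for (long i = 0; i < 100000000; i++) {
        long long tmp = atomic_load_explicit(&s.x, memory_order_relaxed);
        if (tmp != 0 && tmp != -1LL) {   /* a torn value mixes the halves */
            printf("found tearing: tmp = 0x%016llx\n", (unsigned long long)tmp);
            return 1;
        }
    }
    puts("no tearing observed");
    return 0;
}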


In the x86 32-bit System V ABI, gcc under-aligns an atomic_llong so it can be
split across the boundary between two cache lines.  This makes its loads and
stores non-atomic on every SMP system (on a uniprocessor it doesn't matter, of
course).  Some AMD CPUs are potentially non-atomic even when only an 8B or 16B
boundary is crossed
(https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic).

Here are some ways to fix this for the i386 System V ABI.  (I think the Windows
32-bit ABI aligns long long and double to 64b, but an _Atomic struct still
potentially needs more than its default alignment to be efficiently lock-free).

1. Change the C stdatomic ABI to match the C++ std::atomic ABI, requiring
lock-free atomic objects to be naturally aligned.  (But don't change anything
for non-atomic objects.  The i386 SysV ABI only aligns 64-bit long long to 4B,
and we do need to preserve that.)
2. Use lock cmpxchg8b to implement .load() and .store() instead of SSE or x87
load or store instructions on all 8-byte objects, like for 16-byte objects in
x86-64 with cmpxchg16b.
3. Make 64-bit objects check their alignment before using lock-free sequences,
and otherwise fall back to locking (or lock cmpxchg8b); see the sketch after
this list.  The check can be optimized out in cases where the compiler can
prove that an object is 8B-aligned, e.g. an object with static storage, since
we get to align it (unless we're linking with code compiled by an old gcc that
under-aligned static 64b objects).
4. Make 64-bit objects never lock-free in the i386 SysV ABI.
5. Option 4 + define a new 32-bit ABI that doesn't suck.  (pass args in
registers, use SSE FP, etc., and align 64-bit types to 64-bit.)  Not realistic
because nobody cares enough about 32-bit code outside of Windows, so the small
benefit wouldn't justify the pain of having 2 incompatible ABIs.
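
A rough sketch of what the option-3 dispatch could look like (illustrative
only, not GCC's actual libatomic code; `load8` is a made-up name):

#include <stdint.h>

long long load8(long long *p)
{
    if (((uintptr_t)p & 7) == 0) {
        /* 8B-aligned: the object can't straddle a cache line, so hand it to
           the compiler's atomic-load builtin */
        return __atomic_load_n(p, __ATOMIC_SEQ_CST);
    }
    /* possibly split across a cache line: CAS(*p, 0, 0) compares with 0 and
       conditionally stores 0 back; either way it returns the old value
       atomically via lock cmpxchg8b.  Slow, and it writes, so the object
       must be writable. */
    return __sync_val_compare_and_swap(p, 0, 0);
}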

Clang is already doing option 1, so gcc and clang currently produce
incompatible struct layouts for structs with a C11 _Atomic member.
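
For example (the numbers in the comment are my expectation, following from a
4-byte-aligned _Atomic long long under gcc -m32 vs. natural 8-byte alignment
under clang -m32):

#include <stdio.h>
#include <stddef.h>

struct S { char c; _Atomic long long x; };

int main(void)
{
    /* expected: gcc -m32 (this bug):   offset 4, size 12
                 clang -m32 (option 1): offset 8, size 16 */
    printf("offsetof(struct S, x) = %zu, sizeof(struct S) = %zu\n",
           offsetof(struct S, x), sizeof(struct S));
    return 0;
}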

Option 1 is by far the best long-term option for performance and simplicity. 
(Not counting option 5).  

Option 2 will work, but is always horrible for performance with pure loads or
non-seq-cst stores, even with aligned objects.

lock cmpxchg8b is atomic even when it crosses a cache-line or page boundary
(like all x86 atomic RMW operations using the lock prefix), or whatever smaller
boundary limits atomicity for regular loads/stores.  This is **catastrophic**
for performance, though, because instead of just internally locking a line of
L1D cache (by delaying responses to requests to invalidate its copy), the CPU
has to make sure the change to both cache lines propagates all the way to
memory, I think.  (x86 locked instructions are atomic with respect to I/O and
DMA observers, not just other CPUs, so it can't just keep both cache lines
locked.)  On my Skylake i7-6700k, it's literally a 132x slowdown for a single
thread doing `lock add`, aligned vs. crossing a cache-line boundary.

These penalties will happen by chance for more 8B objects on AMD hardware if
crossing a 16B or 32B boundary really is non-atomic for regular loads/stores,
instead of only having a penalty at 64B boundaries.

00000000004000e0 <_start.loop>:
  4000e0:       f0 48 83 47 3f 01               lock add QWORD PTR [rdi+0x3f],0x1   ## rdi is page-aligned
  4000e6:       f0 48 83 47 7f 01               lock add QWORD PTR [rdi+0x7f],0x1
  4000ec:       f0 48 83 87 bf 00 00 00 01      lock add QWORD PTR [rdi+0xbf],0x1
  4000f5:       f0 48 83 87 ff 00 00 00 01      lock add QWORD PTR [rdi+0xff],0x1
  4000fe:       ff cd                           dec    ebp
  400100:       7f de                           jg     4000e0 <_start.loop>

This runs in 29221 ms (113.961G clock cycles) for 10M loop iterations.  That's
~1119 cycles per instruction (including the loop branch).

The equivalent loop with offsets of +0x40 etc (so every object in a separate
single cache line) ran in 220ms (0.8605G clock cycles).

The equivalent loop with movq [rdi+0x40], xmm0 runs in 10 ms (0.0402G cycles),
and half that for loads.  (That's not counting the time to unpack back to
integer registers for long long, but it's a fair comparison for double.)  So
that's another factor of ~20 for a pure load / pure store vs. an aligned lock
add (which costs about the same as lock cmpxchg8b).

Thus the worst-case penalty (misaligned lock cmpxchg8b emulation of load/store
vs. aligned SSE2 simple load or release-store) is about a factor of 3000.
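
(Combining the measurements above: 113.961G / 0.8605G ≈ 132x for the
cache-line split, times 0.8605G / 0.0402G ≈ 21x for aligned lock add vs. a
plain movq store, gives roughly 2800x, i.e. on the order of 3000.)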

------

Something has regressed since this bug was filed: alignof(atomic_llong) is now
only 4, so this is a potential problem even outside of structs.  The same goes
for _Atomic double or _Atomic struct { int a,b; }: these 8-byte objects need 8B
alignment to work correctly as lock-free objects.
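
A quick way to check (a small program I'm adding; the interesting results are
with -m32):

#include <stdio.h>

typedef struct { int a, b; } pair;

int main(void)
{
    /* all three are 8-byte types; lock-free atomics want them 8B-aligned */
    printf("_Alignof(_Atomic long long) = %zu\n", _Alignof(_Atomic long long));
    printf("_Alignof(_Atomic double)    = %zu\n", _Alignof(_Atomic double));
    printf("_Alignof(_Atomic pair)      = %zu\n", _Alignof(_Atomic pair));
    return 0;
}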

C++11 std::atomic previously had the same bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147), filed at the same time,
but it got fixed correctly so lock-free std::atomic<> objects always have
natural alignment.
