[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

Sun Aug 20 20:46:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN

gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact.  -m64
just calls __atomic_load_16.

gcc7.2 -O3 -mx32 output:
follow_nounion(std::atomic<counted_ptr>*):
        movq    (%edi), %rax
        movl    %eax, %eax
        ret

vs.

gcc7.1 -O3 -mx32 output:
follow_nounion(std::atomic<counted_ptr>*):
        movq    (%edi), %rcx
        xorl    %edx, %edx
        movzbl  %ch, %eax
        movb    %cl, %dl
        movq    %rcx, %rsi
        movb    %al, %dh
        andl    $16711680, %esi
        andl    $4278190080, %ecx
        movzwl  %dx, %eax
        orq     %rsi, %rax
        orq     %rcx, %rax
        ret
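
To make the comparison concrete: the gcc7.1 sequence only rebuilds the low 32
bits of the load one piece at a time, which is what the single movl %eax, %eax
in the gcc7.2 output does. Here is a rough C++ model of the two sequences. It
is my own sketch, not anything taken from gcc, and the names split_merge,
zero_extend, and check_samples are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Model of the gcc7.1 -mx32 sequence: rebuild the low 32 bits of the
// 64-bit load piece by piece.  Hypothetical helper, for illustration only.
inline std::uint64_t split_merge(std::uint64_t rcx) {
    std::uint64_t dl  = rcx & 0xff;          // movb   %cl, %dl
    std::uint64_t dh  = (rcx >> 8) & 0xff;   // movzbl %ch, %eax ; movb %al, %dh
    std::uint64_t rsi = rcx & 16711680u;     // andl   $16711680, %esi    (0x00ff0000)
    std::uint64_t ecx = rcx & 4278190080u;   // andl   $4278190080, %ecx  (0xff000000)
    std::uint64_t rax = dl | (dh << 8);      // movzwl %dx, %eax
    rax |= rsi;                              // orq    %rsi, %rax
    rax |= ecx;                              // orq    %rcx, %rax
    return rax;
}

// Model of the gcc7.2 -mx32 sequence: zero-extend the low 32 bits.
inline std::uint64_t zero_extend(std::uint64_t rax) {
    return static_cast<std::uint32_t>(rax);  // movl   %eax, %eax
}

// Spot-check that both models agree on a few arbitrary 64-bit inputs.
inline void check_samples() {
    const std::uint64_t samples[] = {
        0ULL, 0xffULL, 0xff00ULL, 0xdeadbeefULL,
        0x1122334455667788ULL, 0xffffffffffffffffULL,
    };
    for (std::uint64_t s : samples) {
        assert(split_merge(s) == zero_extend(s));
    }
}
```

The four masked pieces cover bits 0-7, 8-15, 16-23, and 24-31 with no overlap,
so OR-ing them back together is the identity on the low 32 bits.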

-------

gcc7.2 -O3 -m64 just forwards its arg to __atomic_load_16 and then returns:

follow_nounion(std::atomic<counted_ptr>*):
        subq    $8, %rsp
        movl    $2, %esi
        call    __atomic_load_16
        addq    $8, %rsp
        ret

Unfortunately, it doesn't turn this into a tail call:

        movl    $2, %esi
        jmp     __atomic_load_16

presumably because it hasn't realized early enough that it takes zero
instructions to extract the 8-byte low half of the 16-byte __atomic_load_16
return value.
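
To illustrate the "zero instructions" point: in the x86-64 System V calling
convention a 16-byte integer value is returned in RDX:RAX, so its low 8 bytes
are already in RAX, which is also where the caller's own 8-byte return value
goes. A minimal sketch with a stand-in for __atomic_load_16 follows. The names
pair16, load16, low_half, and check_low_half are made up, and load16 is a
plain non-atomic copy, not the libatomic function.

```cpp
#include <cassert>
#include <cstdint>

// Two 8-byte integer members: under the x86-64 System V ABI this is
// returned with `lo` in RAX and `hi` in RDX.  Hypothetical type.
struct pair16 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Stand-in for a 16-byte load (NOT atomic; illustration only).
pair16 load16(const pair16* p) {
    return *p;
}

// Returns the low half of the 16-byte result.  The value is already in
// the return register when load16 returns, so no extraction is needed.
std::uint64_t low_half(const pair16* p) {
    return load16(p).lo;
}

// Small sanity check of the behavior described above.
void check_low_half() {
    pair16 v{0x1111222233334444ULL, 0x5555666677778888ULL};
    assert(low_half(&v) == v.lo);
}
```

Nothing has to happen between the call and the ret in low_half, so a jmp would
be a valid way to compile it. Whether a given compiler actually does that is a
separate question.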