This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Bug c++/77485] New: Missed dead store elimination when returning constexpr generated data


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77485

            Bug ID: 77485
           Summary: Missed dead store elimination when returning constexpr
                    generated data
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: petschy at gmail dot com
  Target Milestone: ---

Created attachment 39563
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39563&action=edit
C++ source

A string is generated at compile time with constexpr function, then it is
returned by value, in a struct having a char array. There are two versions of
the generator: foo() generates into a larger array, foo_sized() first
calculates the needed size, then generates into an array with the exactly
needed size.

The char array must be initialized due to constexpr, and these zero stores are
not all eliminated when overwritten with the actual generated characters. BUT,
only if the array in the struct is larger than the generated data:

foo_sized():
Dump of assembler code for function foo_sized():
   0x00000000004005e0 <+0>:     movdqa 0xa8(%rip),%xmm0        # 0x4006b0
   0x00000000004005e8 <+8>:     mov    %rdi,%rax
   0x00000000004005eb <+11>:    movups %xmm0,(%rdi)
   0x00000000004005ee <+14>:    movdqa 0xaa(%rip),%xmm0        # 0x4006c0
   0x00000000004005f6 <+22>:    movups %xmm0,0x10(%rdi)
   0x00000000004005fa <+26>:    retq   
0x4006b0 : "0 1 2 3 4 5 6 7 8 9 10 11 12 13"

gcc 6.2.1 and 7.0.0 generate the exact same code. Since the size of the buffer
equals to the number of generated characters (32), two 16 byte load+stores are
used.


foo() with gcc 7.0:
Dump of assembler code for function foo():
   0x0000000000400560 <+0>:     mov    %rdi,%rdx                ; rdi = rdx =
0x7fffffffe1b0
   0x0000000000400563 <+3>:     movq   $0x0,0xc0(%rdi)          ; zero out the
last 8 bytes of the buffer
   0x000000000040056e <+14>:    lea    0x8(%rdi),%rdi           ; rdi += 8 =>
0x7fffffffe1b8
   0x0000000000400572 <+18>:    mov    %rdx,%rcx                ; rcx =
0x7fffffffe1b0
   0x0000000000400575 <+21>:    movdqa 0x113(%rip),%xmm0        # 0x4006b0
   0x000000000040057d <+29>:    and    $0xfffffffffffffff8,%rdi ; already
aligned
   0x0000000000400581 <+33>:    xor    %eax,%eax
   0x0000000000400583 <+35>:    sub    %rdi,%rcx                ; rcx = -8
   0x0000000000400586 <+38>:    add    $0xc8,%ecx               ; rcx = 0xc0,
0xc8 is the size of the buffer, ie 200
   0x000000000040058c <+44>:    shr    $0x3,%ecx                ; rcx = 0x18
   0x000000000040058f <+47>:    rep stos %rax,%es:(%rdi)        ; rdi is buf+8,
ecx is (size-8)/8, so will zero out the char buffer from index 8 to 0xbf
   0x0000000000400592 <+50>:    movups %xmm0,(%rdx)
   0x0000000000400595 <+53>:    movb   $0x38,0x10(%rdx)
   0x0000000000400599 <+57>:    movb   $0x20,0x11(%rdx)
   0x000000000040059d <+61>:    mov    %rdx,%rax
   0x00000000004005a0 <+64>:    movb   $0x39,0x12(%rdx)
   0x00000000004005a4 <+68>:    movb   $0x20,0x13(%rdx)
   0x00000000004005a8 <+72>:    movb   $0x31,0x14(%rdx)
   0x00000000004005ac <+76>:    movb   $0x30,0x15(%rdx)
   0x00000000004005b0 <+80>:    movb   $0x20,0x16(%rdx)
   0x00000000004005b4 <+84>:    movb   $0x31,0x17(%rdx)
   0x00000000004005b8 <+88>:    movb   $0x31,0x18(%rdx)
   0x00000000004005bc <+92>:    movb   $0x20,0x19(%rdx)
   0x00000000004005c0 <+96>:    movb   $0x31,0x1a(%rdx)
   0x00000000004005c4 <+100>:   movb   $0x32,0x1b(%rdx)
   0x00000000004005c8 <+104>:   movb   $0x20,0x1c(%rdx)
   0x00000000004005cc <+108>:   movb   $0x31,0x1d(%rdx)
   0x00000000004005d0 <+112>:   movb   $0x33,0x1e(%rdx)
   0x00000000004005d4 <+116>:   retq  
0x4006b0 : "0 1 2 3 4 5 6 7 8 9 10 11 12 13"
This is the same data used in foo_sized().

I wrote the register values at the end of each line of interest. The issues I
found:

0) Not all byte stores are merged. The first 16 bytes are copied w/ xmm0, but
the rest initialized byte-wise. I was told in bug 77456 that there is work in
progress solving that (see [1]), but I'm a bit confused, since it seems that
some merging does occur here already using xmm0, but the second 16 bytes are
not merged for some reason. The last byte of the second 16 byte pack is zero,
so it might be that that zero write is eliminated (no "movb 0x00, 0x1f(%rdx)"
in the disasm), due to the previous zero fill, and the remaining 15 bytes
written byte wise as it's not an exact fit for xmm0, but I'm just speculating
here. Code generated with gcc 6.2.1 does not use xmm0, all bytes are stored
with movb; the zero fill is the same, and the zero store to 0x1f is missing,
too.

1) The zero fill of the buffer does not take into account that the beginning of
the buffer is overwritten with the characters:
   - first it wipes the last 8 bytes
   - then aligns the buffer address to 8, and zero fills 192 bytes
   - I guess the zero fill is done this way so that it works efficiently even
if the buffer address is not aligned to 8
   - then stores the first 16 bytes w/ xmm0, and the rest w/ byte wise stores

The optimal code would only zero fill the buffer starting at index 32, with the
needed fiddling for the unaligned case, then the 32 bytes copied with xmm0,
potentially unaligned.

The platform used was Debian Jessie AMD64, the gcc versions:

$ g++-6.2.1 -v
Using built-in specs.
COLLECT_GCC=g++-6.2.1
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/6.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --disable-multilib
--program-suffix=-6.2.1 --disable-bootstrap CFLAGS='-O2 -march=native'
CXXFLAGS='-O2 -march=native'
Thread model: posix
gcc version 6.2.1 20160831 (GCC) 
git b823cdd4ccc1499a674e3863ce875c7459207727

g++-7.0.0 -v 
Using built-in specs.
COLLECT_GCC=g++-7.0.0
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/7.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --disable-multilib
--program-suffix=-7.0.0 --disable-bootstrap CFLAGS='-O2 -march=native'
CXXFLAGS='-O2 -march=native'
Thread model: posix
gcc version 7.0.0 20160831 (experimental) (GCC)
git 14c36b15d931bf299bbc214707b903d0af124449


[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]