Bug 111166

Summary: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
Product: gcc
Reporter: Catelyn <gnu_bugzilla_gcc>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: normal
CC: gabravier, guojiufu, rguenth, sayle, sjames
Priority: P3
Keywords: missed-optimization
Version: 13.2.1
Target Milestone: ---
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93897
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Target: x86_64
Last reconfirmed: 2023-08-28 00:00:00
Bug Depends on: 101926, 105617
Attachments:
  preprocessed file that triggers the bug, as requested
  preprocessed file containing the benchmark code I used

Description Catelyn 2023-08-26 18:29:00 UTC
Created attachment 55799 [details]
preprocessed file that triggers the bug, as requested

GCC version: gcc version 13.2.1 20230801 (GCC)

Target: x86_64-pc-linux-gnu

Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,objc,obj-c++ --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --with-build-config=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror

Command used: gcc -v -save-temps weird_gcc_behaviour.c -o weird_gcc_behaviour.s -S -O3 -mtune=generic -march=x86-64
(same behaviour is observed with -O2)

The command gives no output to stdout or stderr, and returns with exit code 0

When compiling the function `turn_into_struct`, a simple function that packs four 32-bit unsigned integer arguments into a struct holding four such integers and passes it along to `do_smth_with_4_u32`, the generated assembly at -O2 or -O3 contains a couple of vector operations (`punpckldq` and `punpcklqdq`) as well as a spill onto the stack. This does not seem like a good idea to me, performance-wise.
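
For reference, here is a minimal sketch of what the reproducer looks like, reconstructed from the dumps quoted below (the names `u32`, `quad_u32`, `turn_into_struct`, and `do_smth_with_4_u32` appear in those dumps; the exact attached source may differ):

#include <stdint.h>

typedef uint32_t u32;

struct quad_u32 {
    u32 a, b, c, d;
};

/* External callee; the 16-byte struct is passed by value, which on the
   SysV x86-64 ABI means it travels in the %rdi/%rsi register pair. */
extern uint64_t do_smth_with_4_u32(struct quad_u32 q);

uint64_t turn_into_struct(u32 a, u32 b, u32 c, u32 d)
{
    struct quad_u32 q = { a, b, c, d };
    return do_smth_with_4_u32(q);
}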

When compiled at -Os it instead uses `salq`, `movl` (to ensure the upper 32 bits are cleared), and `orq` to pack the data together, avoiding memory altogether, which intuitively seems to me like a significantly faster implementation since it touches neither the SSE unit nor memory.
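
In other words, the scalar version just builds each 64-bit half of the outgoing struct with a shift and an or. A rough C equivalent of what the -Os code does (the helper name is illustrative, and this assumes the little-endian struct layout where `a`/`b` end up in %rdi and `c`/`d` in %rsi):

#include <stdint.h>

/* movl clears the upper 32 bits (zero-extension), salq moves the high
   half into place, orq combines the two halves. */
static inline uint64_t pack_2_u32(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* The struct {a, b, c, d} is then passed as pack_2_u32(a, b) in %rdi
   and pack_2_u32(c, d) in %rsi. */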
Comment 1 Richard Biener 2023-08-28 07:37:21 UTC
Confirmed.  We vectorize the scalar code:

> ./cc1 -quiet t.i -O2 -fopt-info-vec -fdump-tree-slp-details
weird_gcc_behaviour.c:15:41: optimized: basic block part vectorized using 16 byte vectors

generating

uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
  u32 * vectp.4;
  vector(4) unsigned int * vectp.3;
  struct quad_u32 D.2865;
  uint64_t _11;
  vector(4) unsigned int _13;

  <bb 2> [local count: 1073741824]:
  _13 = {a_2(D), b_4(D), c_6(D), d_8(D)};
  MEM <vector(4) unsigned int> [(unsigned int *)&D.2865] = _13;
  _11 = do_smth_with_4_u32 (D.2865);
  D.2865 ={v} {CLOBBER(eol)};
  return _11;

and

weird_gcc_behaviour.c:15:41: note: Cost model analysis:
a_2(D) 1 times scalar_store costs 12 in body
b_4(D) 1 times scalar_store costs 12 in body
c_6(D) 1 times scalar_store costs 12 in body
d_8(D) 1 times scalar_store costs 12 in body
a_2(D) 1 times vector_store costs 12 in body
node 0x5d35f70 1 times vec_construct costs 36 in prologue
weird_gcc_behaviour.c:15:41: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 48
weird_gcc_behaviour.c:15:41: note: Basic block will be vectorized using SLP

We are choosing the vector side at the same cost because we assume it would
win on code size.  Practically, a vector store instead of a scalar store is
also good for store forwarding.

We get

turn_into_struct:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm1
        movd    %esi, %xmm4
        movd    %edx, %xmm0
        movd    %ecx, %xmm3
        punpckldq       %xmm4, %xmm1
        punpckldq       %xmm3, %xmm0
        movdqa  %xmm1, %xmm2
        punpcklqdq      %xmm0, %xmm2
        movaps  %xmm2, -24(%rsp)
        movq    -24(%rsp), %rdi
        movq    -16(%rsp), %rsi
        jmp     do_smth_with_4_u32

instead of (-fno-tree-vectorize)

turn_into_struct:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        movl    %ecx, %r8d
        movl    %edi, %ecx
        movl    %edx, %r9d
        movabsq $-4294967296, %r10
        movq    %rax, %rdi
        xorl    %edx, %edx
        salq    $32, %r8
        andq    %r10, %rdi
        orq     %rcx, %rdi
        movq    %rsi, %rcx
        salq    $32, %rcx
        movl    %edi, %esi
        orq     %rcx, %rsi
        movq    %rdx, %rcx
        andq    %r10, %rcx
        movq    %rsi, %rdi
        orq     %r9, %rcx
        movl    %ecx, %ecx
        orq     %r8, %rcx
        movq    %rcx, %rsi
        jmp     do_smth_with_4_u32

and our guess on code size is correct (47 bytes for the vector version,
67 for the scalar one).  The latency of the scalar code is also quite a
bit higher.  The spilling should be OK; the store should forward nicely.

Unless you can come up with an actual benchmark showing the vector code is
slower, I'd say it isn't.  Given it's smaller, it should also win on the
icache side if not executed frequently.

So - not a bug?

The spilling could be avoided by using movq, movhlps + movq, but this is
call handling, so it is possibly difficult to achieve.
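
For illustration, a hedged intrinsics sketch of that register-only extraction (keeping the vector build but reading the two 64-bit halves straight out of the xmm register instead of going through the stack; the helper name is made up, and the final instruction selection is up to the compiler):

#include <stdint.h>
#include <emmintrin.h>

/* Build the vector, then move its two 64-bit halves directly to GPRs:
   movq for the low half, a high-half shuffle (movhlps/punpckhqdq) plus
   movq for the upper half - no stack spill. */
static inline void quad_u32_to_gprs(uint32_t a, uint32_t b, uint32_t c,
                                    uint32_t d, uint64_t *lo, uint64_t *hi)
{
    __m128i v = _mm_set_epi32((int)d, (int)c, (int)b, (int)a);
    *lo = (uint64_t)_mm_cvtsi128_si64(v);
    *hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
}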
Comment 2 Catelyn 2023-08-28 11:49:07 UTC
Created attachment 55807 [details]
preprocessed file containing the benchmark code I used

I compiled this code (though with includes for clock, CLOCKS_PER_SEC, time_t, printf, and <stdint.h>) to an object file and linked it against the bug-triggering file (compiled with -Os, -O2, and -O3 to test all those options) in order to measure the speed of the generated implementations.
Comment 3 Catelyn 2023-08-28 11:52:32 UTC
(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not.  Given it's smaller it should win on the icache
> side if not executed frequently as well.

I'm not an expert in benchmarking C, so my benchmark may be incorrect, but: I compiled the same (attached, preprocessed) file with -O2, -O3, and -Os into object files, and compiled a benchmarking file into an object as well (to avoid variance caused by the benchmarking file being compiled at different optimization levels). I added a very simple implementation for `do_smth_with_4_u32` and ran the `turn_into_struct` function in a hot loop with varying (pre-generated) input data, storing the results in an array. I timed the hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and end, then added up the computed results to ensure all three programs produce the same answer.

On my machine (Ryzen 9 5900X) the -Os version takes ~0.36 s, while the -O2 and -O3 versions take ~0.43 s and ~0.42 s respectively.

I tried both -O2 and -O3 to get a slightly better view of the typical variance between runs; their times are very similar, but the -Os version is a decent amount faster (around 16%, which I'd assume is significant).

I've attached the preprocessed benchmark file as well, which I compiled with -mtune=generic and -march=x86-64 to match the code under test.
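
For context, a hedged sketch of the kind of harness described above (array sizes, iteration counts, and the trivial `do_smth_with_4_u32` body are illustrative; the actual attached benchmark may differ):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef uint32_t u32;
struct quad_u32 { u32 a, b, c, d; };

/* Defined in the object built from the bug-triggering file (-Os/-O2/-O3). */
extern uint64_t turn_into_struct(u32 a, u32 b, u32 c, u32 d);

/* Very simple callee, kept in the benchmark object so it is compiled only
   once, at a fixed optimization level. */
uint64_t do_smth_with_4_u32(struct quad_u32 q)
{
    return (uint64_t)q.a + q.b + q.c + q.d;
}

#define N    (1u << 16)
#define REPS 4096u

int main(void)
{
    static u32 in[N][4];
    static uint64_t out[N];
    uint64_t sum = 0;

    /* Pre-generate varying inputs so the call can't be folded away. */
    for (u32 i = 0; i < N; i++)
        for (u32 j = 0; j < 4; j++)
            in[i][j] = (u32)rand();

    float start = (float)clock() / CLOCKS_PER_SEC;
    for (u32 r = 0; r < REPS; r++)
        for (u32 i = 0; i < N; i++)
            out[i] = turn_into_struct(in[i][0], in[i][1], in[i][2], in[i][3]);
    float end = (float)clock() / CLOCKS_PER_SEC;

    /* Sum the stored results so all three builds can be checked for the
       same answer. */
    for (u32 i = 0; i < N; i++)
        sum += out[i];

    printf("%.3f s, checksum %llu\n", end - start, (unsigned long long)sum);
    return 0;
}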
Comment 4 Richard Biener 2023-08-28 12:37:22 UTC
Your benchmark confirms the vectorized variant is slower; on a 7900X it's
both the memory round trip and the gpr->xmm moves causing it.  perf shows

       |    turn_into_struct():
     1 |      movd       %edi,%xmm1
     3 |      movd       %esi,%xmm4
     4 |      movd       %edx,%xmm0
    95 |      movd       %ecx,%xmm3
     6 |      punpckldq  %xmm4,%xmm1
     2 |      punpckldq  %xmm3,%xmm0
     1 |      movdqa     %xmm1,%xmm2
       |      punpcklqdq %xmm0,%xmm2
     5 |      movaps     %xmm2,-0x18(%rsp)
    63 |      mov        -0x18(%rsp),%rdi
    70 |      mov        -0x10(%rsp),%rsi
    47 |      jmp        400630 <do_smth_with_4_u32>

note the situation is difficult to rectify - ideally the vectorizer
would see that we require two 64bit register pieces but it doesn't - it sees
we store into memory.

I'll note the non-vectorized code is also far from optimal.  clang
produces the following, which is faster by more than the delta by which
the vectorized version is slower compared to GCC's scalar variant.

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL
Comment 5 Catelyn 2023-08-28 12:51:15 UTC
(In reply to Richard Biener from comment #4)
> note the situation is difficult to rectify - ideally the vectorizer
> would see that we require two 64bit register pieces but it doesn't - it sees
> we store into memory.

Right, I figured that might be what was going on, given some of the related issues: the vectorizer miscalculating the cost beforehand.

> I'll note the non-vectorized code is also far from optimal.  clang
> produces the following which is faster by more of the delta that
> the vectorized version is slower compared to the scalar GCC variant.

I did notice that the GCC -Os and clang -O3 versions were different; I didn't realize the difference was that large. Interesting.
Comment 6 Richard Biener 2023-08-28 12:53:33 UTC
Roger was working on TImode incoming(?) argument code generation; this is
TImode outgoing argument code generation, where for the 32-bit parts we
produce

    7: NOTE_INSN_BASIC_BLOCK 2
    2: r84:SI=di:SI
    3: r85:SI=si:SI
    4: r86:SI=dx:SI
    5: r87:SI=cx:SI
    6: NOTE_INSN_FUNCTION_BEG
    9: r88:DI=zero_extend(r84:SI)
   10: r89:DI=r82:TI#0
   11: r91:DI=0xffffffff00000000
   12: {r90:DI=r89:DI&r91:DI;clobber flags:CC;}
   13: {r92:DI=r90:DI|r88:DI;clobber flags:CC;}
   14: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r92:DI)
   15: r93:DI=zero_extend(r85:SI)
   16: {r94:DI=r93:DI<<0x20;clobber flags:CC;}
   17: r95:DI=r82:TI#0
   18: r96:DI=zero_extend(r95:DI#0)
   19: {r97:DI=r96:DI|r94:DI;clobber flags:CC;}
   20: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r97:DI)
   21: r98:DI=zero_extend(r86:SI)
   22: r99:DI=r82:TI#8
   23: r101:DI=0xffffffff00000000
   24: {r100:DI=r99:DI&r101:DI;clobber flags:CC;}
   25: {r102:DI=r100:DI|r98:DI;clobber flags:CC;}
   26: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r102:DI)<<0x40
   27: r103:DI=zero_extend(r87:SI)
   28: {r104:DI=r103:DI<<0x20;clobber flags:CC;}
   29: r105:DI=r82:TI#8
   30: r106:DI=zero_extend(r105:DI#0)
   31: {r107:DI=r106:DI|r104:DI;clobber flags:CC;}
   32: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r107:DI)<<0x40
   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8
   35: di:DI=r108:DI
   36: si:DI=r109:DI
   37: ax:DI=call [`do_smth_with_4_u32'] argc:0

and we fail to dissect "backwards" from the

   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8

subregs.  Possibly one issue is that we re-use r82.  The dual-use of r82
at the end also poses issues as combine tries to match things like

(parallel [ 
        (set (reg:DI 108 [ D.2865 ])
            (subreg:DI (reg:TI 82 [ D.2865 ]) 0))
        (set (reg:TI 82 [ D.2865 ])
            (ior:TI (and:TI (reg:TI 82 [ D.2865 ])
                    (const_wide_int 0x0ffffffffffffffff))
                (ashift:TI (zero_extend:TI (reg:DI 107))
                    (const_int 64 [0x40]))))
    ])      

but fails to "rename" r82 to split the parallel.

At RTL expansion time we store to D.2865, whose DECL_RTL is r82:TI, so we
can hardly fix it there.  Only a later pass could figure out that each of
the insns fully defines the reg.

Jiufu Guo is working to improve what we choose for DECL_RTL, but only for
incoming params / the outgoing return value.  This is a case where we
could, with -fno-tree-vectorize, improve the DECL_RTL for an automatic var
and choose not TImode but something like (concat:TI reg:DI reg:DI).