Created attachment 55799 [details]
preprocessed file that triggers the bug, as requested

GCC version: gcc version 13.2.1 20230801 (GCC)
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,objc,obj-c++ --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --with-build-config=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror

Command used:
gcc -v -save-temps weird_gcc_behaviour.c -o weird_gcc_behaviour.s -S -O3 -mtune=generic -march=x86-64
(the same behaviour is observed with -O2)

The command gives no output to stdout or stderr and returns with exit code 0.

When compiling the function `turn_into_struct`, a simple function that packs four 32-bit unsigned integer arguments into a struct holding four such integers and passes that struct along to `do_smth_with_4_u32`, at -O2 or -O3 the generated assembly contains a couple of vector operations (`punpckldq` and `punpcklqdq`) as well as a spill onto the stack. This does not seem like a good idea to me, performance-wise.

When compiled at -Os it instead uses `salq`, `movl` (to ensure the upper 32 bits are cleared) and `orq` to pack the data together, avoiding memory altogether, which (intuitively, to me) seems like a significantly faster implementation, as it doesn't need to touch SSE or memory.
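For reference, a minimal sketch of the kind of code in the attached file; the struct layout, field names and typedefs here are only illustrative, the attachment has the actual code:

/* Illustrative sketch of the reproducer; field names are assumptions. */
#include <stdint.h>

typedef uint32_t u32;

struct quad_u32 {
    u32 x, y, z, w;
};

/* Defined elsewhere, so the call cannot be inlined away. */
uint64_t do_smth_with_4_u32(struct quad_u32 q);

uint64_t turn_into_struct(u32 a, u32 b, u32 c, u32 d)
{
    /* Four scalar 32-bit stores that the SLP vectorizer turns into a
       single 16-byte vector store at -O2/-O3.  */
    struct quad_u32 q = { a, b, c, d };
    return do_smth_with_4_u32(q);
}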
Confirmed.  We vectorize the scalar code:

> ./cc1 -quiet t.i -O2 -fopt-info-vec -fdump-tree-slp-details
weird_gcc_behaviour.c:15:41: optimized: basic block part vectorized using 16 byte vectors

generating

uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
  u32 * vectp.4;
  vector(4) unsigned int * vectp.3;
  struct quad_u32 D.2865;
  uint64_t _11;
  vector(4) unsigned int _13;

  <bb 2> [local count: 1073741824]:
  _13 = {a_2(D), b_4(D), c_6(D), d_8(D)};
  MEM <vector(4) unsigned int> [(unsigned int *)&D.2865] = _13;
  _11 = do_smth_with_4_u32 (D.2865);
  D.2865 ={v} {CLOBBER(eol)};
  return _11;

and

weird_gcc_behaviour.c:15:41: note: Cost model analysis:
a_2(D) 1 times scalar_store costs 12 in body
b_4(D) 1 times scalar_store costs 12 in body
c_6(D) 1 times scalar_store costs 12 in body
d_8(D) 1 times scalar_store costs 12 in body
a_2(D) 1 times vector_store costs 12 in body
node 0x5d35f70 1 times vec_construct costs 36 in prologue
weird_gcc_behaviour.c:15:41: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 48
weird_gcc_behaviour.c:15:41: note: Basic block will be vectorized using SLP

we are choosing the vector side at the same cost because we assume it would win on code size.  Practically a vector store instead of a scalar store is also good for store forwarding.  We get

turn_into_struct:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm1
        movd    %esi, %xmm4
        movd    %edx, %xmm0
        movd    %ecx, %xmm3
        punpckldq       %xmm4, %xmm1
        punpckldq       %xmm3, %xmm0
        movdqa  %xmm1, %xmm2
        punpcklqdq      %xmm0, %xmm2
        movaps  %xmm2, -24(%rsp)
        movq    -24(%rsp), %rdi
        movq    -16(%rsp), %rsi
        jmp     do_smth_with_4_u32

instead of (-fno-tree-vectorize)

turn_into_struct:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        movl    %ecx, %r8d
        movl    %edi, %ecx
        movl    %edx, %r9d
        movabsq $-4294967296, %r10
        movq    %rax, %rdi
        xorl    %edx, %edx
        salq    $32, %r8
        andq    %r10, %rdi
        orq     %rcx, %rdi
        movq    %rsi, %rcx
        salq    $32, %rcx
        movl    %edi, %esi
        orq     %rcx, %rsi
        movq    %rdx, %rcx
        andq    %r10, %rcx
        movq    %rsi, %rdi
        orq     %r9, %rcx
        movl    %ecx, %ecx
        orq     %r8, %rcx
        movq    %rcx, %rsi
        jmp     do_smth_with_4_u32

and our guess for code size is correct (47 bytes for the vector variant, 67 for the scalar one).  The latency of the scalar code is also quite a bit bigger.  The spilling should be OK, the store should forward nicely.

Unless you can come up with an actual benchmark showing the vector code is slower I'd say it's not.  Given it's smaller it should win on the icache side if not executed frequently as well.  So - not a bug?

The spilling could be avoided by using movq, movhlps + movq, but it's call handling so possibly difficult to achieve.
Created attachment 55807 [details]
preprocessed file containing the benchmark code I used

I compiled this code (although using includes for clock, CLOCKS_PER_SEC, time_t, printf, and <stdint.h>) to an object file and linked it with the bug-triggering file (compiled with -Os, -O2 and -O3 to test all of those options), to measure the speed of the generated implementations in the bug-triggering file.
(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not.  Given it's smaller it should win on the icache
> side if not executed frequently as well.

I'm not an expert in benchmarking C, so my benchmark may be flawed, but here is what I did: I compiled the same (attached preprocessed) file with -O2, -O3, and -Os into object files, and compiled a benchmarking file into an object as well (to avoid variance caused by the benchmarking file itself being compiled at different optimization levels). I added a very simple implementation of `do_smth_with_4_u32` and ran the `turn_into_struct` function in a hot loop with varying (pre-generated) input data, storing the results in an array. I timed the hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and end, then added up the computed results to ensure all three programs produce the same value.

On my machine (Ryzen 9 5900X) the -Os version takes ~0.36s, while the -O2 and -O3 versions take ~0.43s and ~0.42s. I tried both -O2 and -O3 to get a slightly better view of the typical variance between program runs; their times are very similar, but the -Os version is a decent amount faster (around 16%, which I'd assume is significant).

I've added the preprocessed benchmark file as well, which I compiled with -mtune=generic and -march=x86-64 to match the file under test.
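Roughly, the harness looks like the following sketch; the iteration counts, the PRNG and the trivial `do_smth_with_4_u32` body here are only illustrative, the attached file has the exact code:

/* Sketch of the benchmark; constants and the callee body are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t u32;
struct quad_u32 { u32 x, y, z, w; };

/* Defined in the file under test, compiled separately at -Os/-O2/-O3. */
uint64_t turn_into_struct(u32 a, u32 b, u32 c, u32 d);

/* Deliberately trivial callee so the packing code dominates. */
uint64_t do_smth_with_4_u32(struct quad_u32 q)
{
    return (uint64_t)q.x + q.y + q.z + q.w;
}

#define N    (1u << 16)
#define REPS 10000u

int main(void)
{
    static u32 in[N][4];
    static uint64_t out[N];
    uint64_t sum = 0;

    /* Pre-generate input data with a cheap LCG. */
    u32 state = 12345;
    for (unsigned i = 0; i < N; i++)
        for (unsigned j = 0; j < 4; j++)
            in[i][j] = state = state * 1664525u + 1013904223u;

    float t0 = (float)clock() / CLOCKS_PER_SEC;
    for (unsigned r = 0; r < REPS; r++)
        for (unsigned i = 0; i < N; i++)
            out[i] = turn_into_struct(in[i][0], in[i][1], in[i][2], in[i][3]);
    float t1 = (float)clock() / CLOCKS_PER_SEC;

    /* Checksum so all three builds can be compared for identical results. */
    for (unsigned i = 0; i < N; i++)
        sum += out[i];

    printf("time: %fs, checksum: %llu\n", t1 - t0, (unsigned long long)sum);
    return 0;
}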
Your benchmark confirms the vectorized variant is slower; on a 7900X it's both the memory roundtrip and the gpr->xmm moves causing it.  perf shows

     | turn_into_struct():
   1 |   movd       %edi,%xmm1
   3 |   movd       %esi,%xmm4
   4 |   movd       %edx,%xmm0
  95 |   movd       %ecx,%xmm3
   6 |   punpckldq  %xmm4,%xmm1
   2 |   punpckldq  %xmm3,%xmm0
   1 |   movdqa     %xmm1,%xmm2
     |   punpcklqdq %xmm0,%xmm2
   5 |   movaps     %xmm2,-0x18(%rsp)
  63 |   mov        -0x18(%rsp),%rdi
  70 |   mov        -0x10(%rsp),%rsi
  47 |   jmp        400630 <do_smth_with_4_u32>

Note the situation is difficult to rectify - ideally the vectorizer would see that we require two 64-bit register pieces, but it doesn't - it sees we store into memory.

I'll note the non-vectorized code is also far from optimal.  clang produces the following, which is faster by more than the delta by which the vectorized version is slower compared to the scalar GCC variant.

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL
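For illustration, the clang sequence amounts to packing the struct's two eightbytes by hand (the x86-64 SysV ABI passes the 16-byte struct in the %rdi/%rsi register pair); a sketch of that scalar lowering, with a made-up helper name:

/* Sketch only; pack_u32_pair is a hypothetical helper, not part of the
   reproducer.  Each call maps to the movl (zero upper half) + shlq + orq
   pattern in the clang output above. */
#include <stdint.h>

static inline uint64_t pack_u32_pair(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* The two argument registers for the struct would then be
   pack_u32_pair(a, b) and pack_u32_pair(c, d). */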
(In reply to Richard Biener from comment #4)
> Note the situation is difficult to rectify - ideally the vectorizer
> would see that we require two 64-bit register pieces, but it doesn't -
> it sees we store into memory.

Right, I figured that might have been what was going on, given some of the related issues: the vectorizer incorrectly estimating the cost beforehand.

> I'll note the non-vectorized code is also far from optimal.  clang
> produces the following, which is faster by more than the delta by which
> the vectorized version is slower compared to the scalar GCC variant.

I did notice that the GCC -Os and clang -O3 versions were different; I didn't realize it was by that much. Interesting.
Roger was working on TImode incoming(?) argument code generation; this is TImode outgoing argument code generation, where we produce, for the 32-bit parts,

    7: NOTE_INSN_BASIC_BLOCK 2
    2: r84:SI=di:SI
    3: r85:SI=si:SI
    4: r86:SI=dx:SI
    5: r87:SI=cx:SI
    6: NOTE_INSN_FUNCTION_BEG
    9: r88:DI=zero_extend(r84:SI)
   10: r89:DI=r82:TI#0
   11: r91:DI=0xffffffff00000000
   12: {r90:DI=r89:DI&r91:DI;clobber flags:CC;}
   13: {r92:DI=r90:DI|r88:DI;clobber flags:CC;}
   14: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r92:DI)
   15: r93:DI=zero_extend(r85:SI)
   16: {r94:DI=r93:DI<<0x20;clobber flags:CC;}
   17: r95:DI=r82:TI#0
   18: r96:DI=zero_extend(r95:DI#0)
   19: {r97:DI=r96:DI|r94:DI;clobber flags:CC;}
   20: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r97:DI)
   21: r98:DI=zero_extend(r86:SI)
   22: r99:DI=r82:TI#8
   23: r101:DI=0xffffffff00000000
   24: {r100:DI=r99:DI&r101:DI;clobber flags:CC;}
   25: {r102:DI=r100:DI|r98:DI;clobber flags:CC;}
   26: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r102:DI)<<0x40
   27: r103:DI=zero_extend(r87:SI)
   28: {r104:DI=r103:DI<<0x20;clobber flags:CC;}
   29: r105:DI=r82:TI#8
   30: r106:DI=zero_extend(r105:DI#0)
   31: {r107:DI=r106:DI|r104:DI;clobber flags:CC;}
   32: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r107:DI)<<0x40
   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8
   35: di:DI=r108:DI
   36: si:DI=r109:DI
   37: ax:DI=call [`do_smth_with_4_u32'] argc:0

and we fail to dissect "backwards" from the

   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8

subregs.  Possibly one issue is that we re-use r82.  The dual use of r82 at the end also poses issues, as combine tries to match things like

(parallel [
        (set (reg:DI 108 [ D.2865 ])
            (subreg:DI (reg:TI 82 [ D.2865 ]) 0))
        (set (reg:TI 82 [ D.2865 ])
            (ior:TI (and:TI (reg:TI 82 [ D.2865 ])
                    (const_wide_int 0x0ffffffffffffffff))
                (ashift:TI (zero_extend:TI (reg:DI 107))
                    (const_int 64 [0x40]))))
    ])

but fails to "rename" r82 to split the parallel.

At RTL expansion time we store to D.2865, whose DECL_RTL is r82:TI, so we can hardly fix it there.  Only a later pass could figure out that each of these insns fully defines the reg.

Jiufu Guo is working on improving what we choose for DECL_RTL, but for incoming params / outgoing return.  This is a case where we could, with -fno-tree-vectorize, improve DECL_RTL for an automatic var and choose not TImode but something like a (concat:TI reg:DI reg:DI).