Summary: | gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) | |
---|---|---|---
Product: | gcc | Reporter: | Catelyn <gnu_bugzilla_gcc>
Component: | target | Assignee: | Not yet assigned to anyone <unassigned>
Status: | NEW | |
Severity: | normal | CC: | gabravier, guojiufu, rguenth, sayle, sjames
Priority: | P3 | Keywords: | missed-optimization
Version: | 13.2.1 | |
Target Milestone: | --- | |
See Also: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93897, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 | |
Host: | | Target: | x86_64
Build: | | Known to work: |
Known to fail: | | Last reconfirmed: | 2023-08-28 00:00:00
Bug Depends on: | 101926, 105617 | |
Bug Blocks: | | |
Attachments: | preprocessed file that triggers the bug, as requested; preprocessed file containing the benchmark code I used | |
Description
Catelyn
2023-08-26 18:29:00 UTC
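The attached preprocessed file is the authoritative test case and is not reproduced in this excerpt. As a hypothetical reconstruction, the code under discussion is roughly of this shape (function and struct names are taken from the dumps in the comments below; the member names and the exact definitions in the attachment may differ), compiled with -O2 for x86_64:

```c
#include <stdint.h>

typedef uint32_t u32;

/* Member names are assumed; only the "four u32 in a 16-byte struct"
   shape matters for the codegen discussed below. */
struct quad_u32
{
  u32 a, b, c, d;
};

extern uint64_t do_smth_with_4_u32 (struct quad_u32 q);

/* Packs four 32-bit values into the struct and tail-calls the consumer. */
uint64_t
turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
  struct quad_u32 q = { a, b, c, d };
  return do_smth_with_4_u32 (q);
}
```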
Comment 1
Richard Biener

Confirmed. We vectorize the scalar code:
> ./cc1 -quiet t.i -O2 -fopt-info-vec -fdump-tree-slp-details
weird_gcc_behaviour.c:15:41: optimized: basic block part vectorized using 16 byte vectors
generating
uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
u32 * vectp.4;
vector(4) unsigned int * vectp.3;
struct quad_u32 D.2865;
uint64_t _11;
vector(4) unsigned int _13;
<bb 2> [local count: 1073741824]:
_13 = {a_2(D), b_4(D), c_6(D), d_8(D)};
MEM <vector(4) unsigned int> [(unsigned int *)&D.2865] = _13;
_11 = do_smth_with_4_u32 (D.2865);
D.2865 ={v} {CLOBBER(eol)};
return _11;
}

and
weird_gcc_behaviour.c:15:41: note: Cost model analysis:
a_2(D) 1 times scalar_store costs 12 in body
b_4(D) 1 times scalar_store costs 12 in body
c_6(D) 1 times scalar_store costs 12 in body
d_8(D) 1 times scalar_store costs 12 in body
a_2(D) 1 times vector_store costs 12 in body
node 0x5d35f70 1 times vec_construct costs 36 in prologue
weird_gcc_behaviour.c:15:41: note: Cost model analysis for part in loop 0:
Vector cost: 48
Scalar cost: 48
weird_gcc_behaviour.c:15:41: note: Basic block will be vectorized using SLP
we are choosing the vector side at the same cost (scalar: four scalar_store
at 12 each = 48; vector: vec_construct 36 + vector_store 12 = 48) because we
assume it would win on code size. Practically a vector store instead of four
scalar stores is also good for store forwarding.
We get
turn_into_struct:
.LFB0:
.cfi_startproc
movd %edi, %xmm1
movd %esi, %xmm4
movd %edx, %xmm0
movd %ecx, %xmm3
punpckldq %xmm4, %xmm1
punpckldq %xmm3, %xmm0
movdqa %xmm1, %xmm2
punpcklqdq %xmm0, %xmm2
movaps %xmm2, -24(%rsp)
movq -24(%rsp), %rdi
movq -16(%rsp), %rsi
jmp do_smth_with_4_u32
instead of (-fno-tree-vectorize)
turn_into_struct:
.LFB0:
.cfi_startproc
xorl %eax, %eax
movl %ecx, %r8d
movl %edi, %ecx
movl %edx, %r9d
movabsq $-4294967296, %r10
movq %rax, %rdi
xorl %edx, %edx
salq $32, %r8
andq %r10, %rdi
orq %rcx, %rdi
movq %rsi, %rcx
salq $32, %rcx
movl %edi, %esi
orq %rcx, %rsi
movq %rdx, %rcx
andq %r10, %rcx
movq %rsi, %rdi
orq %r9, %rcx
movl %ecx, %ecx
orq %r8, %rcx
movq %rcx, %rsi
jmp do_smth_with_4_u32
and our guess for code size is correct (47 bytes for the vector code, 67 for
the scalar code). The latency of the scalar code is also quite a bit higher.
The spilling should be OK, the store should forward nicely.
Unless you can come up with an actual benchmark showing the vector code is
slower I'd say it's not. Given it's smaller it should win on the icache
side if not executed frequently as well.
So - not a bug?
The spilling could be avoided by using movq, movhlps + movq, but it's
call handling so possibly difficult to achieve.
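A hypothetical intrinsics sketch of that movq / movhlps + movq idea (not from the report), using punpckhqdq via _mm_unpackhi_epi64 in place of movhlps to stay in the integer domain; it reads both 64-bit halves of the vector directly instead of going through a stack slot:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Illustrative only: extract the low and high qword of an XMM register
   into general-purpose registers without a spill/reload. */
static inline void
xmm_halves_to_u64 (__m128i v, uint64_t *lo, uint64_t *hi)
{
  *lo = (uint64_t) _mm_cvtsi128_si64 (v);                          /* movq %xmm, %gpr */
  *hi = (uint64_t) _mm_cvtsi128_si64 (_mm_unpackhi_epi64 (v, v));  /* high qword */
}
```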
Catelyn

Created attachment 55807 [details]
preprocessed file containing the benchmark code I used

I compiled this code (although using includes for clock, CLOCKS_PER_SEC, time_t, printf, and <stdint.h>) to an object and linked it with the bug-triggering file (compiled with -Os, -O2 and -O3 to test all those options), to measure the speed of the generated implementations of the bug-triggering file.
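The attached file is the authoritative benchmark; as a rough, hypothetical sketch of the kind of harness described here and detailed in the next comment (trivial do_smth_with_4_u32, pre-generated inputs, a hot loop timed with clock(), and a checksum so all builds can be compared), one might write:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef uint32_t u32;
struct quad_u32 { u32 a, b, c, d; };   /* member names assumed */

/* defined in the separately compiled file under test */
extern uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d);

/* very simple callee, so the call itself stays cheap */
uint64_t
do_smth_with_4_u32 (struct quad_u32 q)
{
  return (uint64_t) q.a + q.b + q.c + q.d;
}

#define N (1u << 20)

static u32 in[N];
static uint64_t out[N / 4];

int
main (void)
{
  for (unsigned i = 0; i < N; i++)
    in[i] = (u32) rand ();             /* pre-generated, varying inputs */

  float t0 = (float) clock () / CLOCKS_PER_SEC;
  for (unsigned rep = 0; rep < 100; rep++)
    for (unsigned i = 0; i < N; i += 4)
      out[i / 4] = turn_into_struct (in[i], in[i + 1], in[i + 2], in[i + 3]);
  float t1 = (float) clock () / CLOCKS_PER_SEC;

  uint64_t sum = 0;                    /* check that all builds agree */
  for (unsigned i = 0; i < N / 4; i++)
    sum += out[i];

  printf ("time %.3fs checksum %llu\n", t1 - t0, (unsigned long long) sum);
  return 0;
}
```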
Catelyn

(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not. Given it's smaller it should win on the icache
> side if not executed frequently as well.

I'm not an expert in benchmarking C, so my benchmark may be incorrect, but I compiled the same (attached preprocessed) file with -O2, -O3, and -Os into an object file, and then compiled a benchmarking file into an object as well (to avoid variance caused by the benchmarking file being compiled with different optimization levels). I added a very simple implementation for `do_smth_with_4_u32` and ran the `turn_into_struct` function in a hot loop, with varying (pre-generated) input data, storing the result in an array. I timed this hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and end, then added up the calculated results to ensure all three programs get the same result.

On my machine (Ryzen 9 5900X) the -Os version takes ~.36s, while the -O2 and -O3 versions take ~.43 and ~.42 seconds. I tried both -O2 and -O3 to get a slightly better view of the typical variance between program runs; their times are very similar, but the -Os version is a decent amount faster (around 16%, which I'd assume is significant).

I've added the preprocessed benchmark file as well, which I then compiled with -mtune=generic and -march=x86-64 to match the system under test.

Comment 4
Richard Biener

Your benchmark confirms the vectorized variant is slower; on a 7900X it's both the memory roundtrip and the gpr->xmm move causing it. perf shows

     | turn_into_struct():
   1 |   movd   %edi,%xmm1
   3 |   movd   %esi,%xmm4
   4 |   movd   %edx,%xmm0
  95 |   movd   %ecx,%xmm3
   6 |   punpckldq %xmm4,%xmm1
   2 |   punpckldq %xmm3,%xmm0
   1 |   movdqa %xmm1,%xmm2
     |   punpcklqdq %xmm0,%xmm2
   5 |   movaps %xmm2,-0x18(%rsp)
  63 |   mov    -0x18(%rsp),%rdi
  70 |   mov    -0x10(%rsp),%rsi
  47 |   jmp    400630 <do_smth_with_4_u32>

note the situation is difficult to rectify - ideally the vectorizer would see that we require two 64bit register pieces but it doesn't - it sees we store into memory.

I'll note the non-vectorized code is also far from optimal. clang produces the following, which is faster by more than the delta by which the vectorized version is slower compared to the scalar GCC variant.

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL

Catelyn

(In reply to Richard Biener from comment #4)
> note the situation is difficult to rectify - ideally the vectorizer
> would see that we require two 64bit register pieces but it doesn't - it sees
> we store into memory.

Right, I figured that might have been what was going on, given some of the related issues - the vectorizer incorrectly calculating the cost beforehand.

> I'll note the non-vectorized code is also far from optimal. clang
> produces the following, which is faster by more than the delta by which
> the vectorized version is slower compared to the scalar GCC variant.

I did notice that the GCC -Os and clang -O3 versions were different; I didn't realize that it was by that much, interesting.
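For reference, a small hypothetical C helper (not from the report) mirroring the shlq/orq sequence in clang's output above: assuming struct quad_u32 lays its members out in the order a, b, c, d, these are the two 64-bit values the SysV x86-64 ABI passes in %rdi and %rsi for the by-value struct.

```c
#include <stdint.h>

typedef uint32_t u32;

/* Illustrative only: which 64-bit values end up in the two argument
   registers when the 16-byte struct is passed by value. */
static inline void
quad_u32_to_abi_regs (u32 a, u32 b, u32 c, u32 d, uint64_t *rdi, uint64_t *rsi)
{
  *rdi = ((uint64_t) b << 32) | a;   /* low  eightbyte: members a and b */
  *rsi = ((uint64_t) d << 32) | c;   /* high eightbyte: members c and d */
}
```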
Richard Biener

Roger was working on TImode incoming(?) argument code generation; this is TImode outgoing argument code generation, where for the 32bit parts we produce

    7: NOTE_INSN_BASIC_BLOCK 2
    2: r84:SI=di:SI
    3: r85:SI=si:SI
    4: r86:SI=dx:SI
    5: r87:SI=cx:SI
    6: NOTE_INSN_FUNCTION_BEG
    9: r88:DI=zero_extend(r84:SI)
   10: r89:DI=r82:TI#0
   11: r91:DI=0xffffffff00000000
   12: {r90:DI=r89:DI&r91:DI;clobber flags:CC;}
   13: {r92:DI=r90:DI|r88:DI;clobber flags:CC;}
   14: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r92:DI)
   15: r93:DI=zero_extend(r85:SI)
   16: {r94:DI=r93:DI<<0x20;clobber flags:CC;}
   17: r95:DI=r82:TI#0
   18: r96:DI=zero_extend(r95:DI#0)
   19: {r97:DI=r96:DI|r94:DI;clobber flags:CC;}
   20: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r97:DI)
   21: r98:DI=zero_extend(r86:SI)
   22: r99:DI=r82:TI#8
   23: r101:DI=0xffffffff00000000
   24: {r100:DI=r99:DI&r101:DI;clobber flags:CC;}
   25: {r102:DI=r100:DI|r98:DI;clobber flags:CC;}
   26: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r102:DI)<<0x40
   27: r103:DI=zero_extend(r87:SI)
   28: {r104:DI=r103:DI<<0x20;clobber flags:CC;}
   29: r105:DI=r82:TI#8
   30: r106:DI=zero_extend(r105:DI#0)
   31: {r107:DI=r106:DI|r104:DI;clobber flags:CC;}
   32: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r107:DI)<<0x40
   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8
   35: di:DI=r108:DI
   36: si:DI=r109:DI
   37: ax:DI=call [`do_smth_with_4_u32'] argc:0

and we fail to dissect "backwards" from the

   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8

subregs. Possibly one issue is that we re-use r82. The dual use of r82 at the end also poses issues, as combine tries to match things like

(parallel [
    (set (reg:DI 108 [ D.2865 ])
         (subreg:DI (reg:TI 82 [ D.2865 ]) 0))
    (set (reg:TI 82 [ D.2865 ])
         (ior:TI (and:TI (reg:TI 82 [ D.2865 ])
                         (const_wide_int 0x0ffffffffffffffff))
                 (ashift:TI (zero_extend:TI (reg:DI 107))
                            (const_int 64 [0x40]))))
  ])

but fails to "rename" r82 to split the parallel.

At RTL expansion time we store to D.2865, whose DECL_RTL is r82:TI, so we can hardly fix it there. Only a later pass could figure out that each of the insns fully defines the reg. Jiufu Guo is working to improve what we choose for DECL_RTL, but for incoming params / outgoing return. This is a case where we could, with -fno-tree-vectorize, improve DECL_RTL for an automatic var and choose not TImode but something like a (concat:TI reg:DI reg:DI).
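As a purely illustrative, hypothetical C-level analogue (not code from the report), GCC's unsigned __int128 can stand in for the TImode pseudo r82 to show how the four SImode insertions and the two DImode subreg reads (insns 33/34) relate:

```c
#include <stdint.h>

typedef uint32_t u32;

/* Illustrative only: mirrors the insn stream above, with the unsigned
   __int128 value playing the role of r82:TI (the DECL_RTL of D.2865). */
void
split_quad (u32 a, u32 b, u32 c, u32 d, uint64_t *lo, uint64_t *hi)
{
  unsigned __int128 r82 = 0;
  r82 |= (unsigned __int128) a;            /* insns 9-14:  bits 0..31   */
  r82 |= (unsigned __int128) b << 32;      /* insns 15-20: bits 32..63  */
  r82 |= (unsigned __int128) c << 64;      /* insns 21-26: bits 64..95  */
  r82 |= (unsigned __int128) d << 96;      /* insns 27-32: bits 96..127 */
  *lo = (uint64_t) r82;                    /* insn 33: r108:DI = r82:TI#0 */
  *hi = (uint64_t) (r82 >> 64);            /* insn 34: r109:DI = r82:TI#8 */
}
```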