LNT has detected an 18% regression of the SPECFP 2006 benchmark 470.lbm when it is compiled with -Ofast -march=native on a Zen2 machine:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=421.240.0&plot.1=301.240.0&

...and similarly a 6% regression when it is compiled with just -Ofast and run on the same machine:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&

I have bisected both on another Zen2 machine to commit r12-897-gde56f95afaaa22 (Run pass_sink_code once more before store_merging).

A Zen1 machine has also seen a similar -march=native regression in the same time frame:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&

Zen1 with -march=generic seems to be unaffected, which is also the case for the Intel machines we track.

Although lbm has been known to show weird regressions caused entirely by code layout, where the compiler was not really at fault, the fact that both generic code-gen and Zen1 are affected indicates this is not the case here.
This may be related to PR102008; see the comment I made there.

Martin, maybe you can try moving the late sink pass to before the last phiopt pass.
Verified that 470.lbm doesn't show the regression on Power8 with -Ofast: runtime is 141 sec for r12-897, and without that patch it is 142 sec.
(In reply to Richard Biener from comment #1)
> Martin, maybe you can try moving the late sink pass to before the last
> phiopt pass.

If you mean the following, then unfortunately it has not helped.

diff --git a/gcc/passes.def b/gcc/passes.def
index d7a1f8c97a6..5eb70cd2cd8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -347,10 +347,10 @@ along with GCC; see the file COPYING3.  If not see
       /* After late CD DCE we rewrite no longer addressed locals into SSA
          form if possible.  */
       NEXT_PASS (pass_forwprop);
+      NEXT_PASS (pass_sink_code);
       NEXT_PASS (pass_phiopt, false /* early_p */);
       NEXT_PASS (pass_fold_builtins);
       NEXT_PASS (pass_optimize_widening_mul);
-      NEXT_PASS (pass_sink_code);
       NEXT_PASS (pass_store_merging);
       NEXT_PASS (pass_tail_calls);
       /* If DCE is not run before checking for uninitialized uses,

...I'll have a very brief look at what is actually happening, just so that I have more reasons to believe this is not a code placement issue again.
(In reply to Martin Jambor from comment #3)
> ...I'll have a very brief look at what is actually happening, just so that I
> have more reasons to believe this is not a code placement issue again.

The hot function is at the same address when compiled by both revisions, and the newer version looks sufficiently different. I even tried sprinkling it with nops and that did not help. I am not saying we are not bumping against some micro-architectural peculiarity, but it does not seem to be a code placement issue.
Analysis is missing but the regression persists. On Haswell I do not see any effect. I do suspect it's about cmov vs. non-cmov, but without a profile and a look at the affected assembly that's a wild guess.
Created attachment 52296 [details]
perf annotate before and after the revision
I see a lot more GPR <-> XMM moves in the 'after' case:

  1035 :   401c8b:  vaddsd %xmm1,%xmm0,%xmm0
  1953 :   401c8f:  vmovq  %rcx,%xmm1
   305 :   401c94:  vaddsd %xmm8,%xmm1,%xmm1
  3076 :   401c99:  vmovq  %xmm0,%r14
   590 :   401c9e:  vmovq  %r11,%xmm0
   267 :   401ca3:  vmovq  %xmm1,%r8
   136 :   401ca8:  vmovq  %rdx,%xmm1
   448 :   401cad:  vaddsd %xmm1,%xmm0,%xmm1
  1703 :   401cb1:  vmovq  %xmm1,%r9    (*)
   834 :   401cb6:  vmovq  %r8,%xmm1
  1719 :   401cbb:  vmovq  %r9,%xmm0    (*)
  2782 :   401cc0:  vaddsd %xmm0,%xmm1,%xmm1
 22135 :   401cc4:  vmovsd %xmm1,%xmm1,%xmm0
  1261 :   401cc8:  vmovq  %r14,%xmm1
   646 :   401ccd:  vaddsd %xmm0,%xmm1,%xmm0
 18136 :   401cd1:  vaddsd %xmm2,%xmm5,%xmm1
   629 :   401cd5:  vmovq  %xmm1,%r8
   142 :   401cda:  vaddsd %xmm6,%xmm3,%xmm1
   177 :   401cde:  vmovq  %xmm0,%r14
   288 :   401ce3:  vmovq  %xmm1,%r9
   177 :   401ce8:  vmovq  %r8,%xmm1
   174 :   401ced:  vmovq  %r9,%xmm0

Those look like RA / spilling artifacts - IIRC Hongtao has been posting patches in this area, to regcprop I think? The above is definitely bad; for example (*) seems to swap %xmm0 and %xmm1 via %r9.

The function is LBM_performStreamCollide. The sinking pass does nothing wrong; it moves the unconditionally executed

-  _948 = _861 + _867;
-  _957 = _944 + _948;
-  _912 = _861 + _873;
...
-  _981 = _853 + _865;
-  _989 = _977 + _981;
-  _916 = _853 + _857;
-  _924 = _912 + _916;

into a conditionally executed block. But that increases register pressure by 5 FP regs (if I counted correctly) in that area. So this would be the usual issue of GIMPLE transforms not being register-pressure aware. -fschedule-insns -fsched-pressure seems to be able to somewhat mitigate this (though I think EBB scheduling cannot undo such movement).

In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-        (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0  S8 A64])) "lbm.c":241:5 141 {*movdf_internal}
-     (expr_list:REG_EQUAL (const_double:DF 9.939744999999999830464503247640095651149749755859375e-1 [0x0.fe751ce28ed5fp+0])
-        (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
         (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
     (nil))

which seems like we could have reloaded %xmm5 from .LC10 directly. But the spilling to GPRs is present already after LRA, and cprop_hardreg doesn't do anything bad either.

The differences can be seen on trunk with -Ofast -march=znver2 [-fdisable-tree-sink2]. We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC, and the interesting thing is that when I disable them I do see some spilling to the stack but also quite some re-materialized constants (loads from .LC*, as seen from the opportunity above).

It might be interesting to benchmark with -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and to find a way to make the costs such that IRA/LRA prefer re-materialization of constants from the constant pool over spilling to GPRs (if that's possible at all - Vlad?)
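For illustration, here is a minimal, hypothetical C sketch (not the LBM source) of the shape of the problem: sums whose only use is inside a conditional block get sunk into that block, so their individual addends have to stay live across the branch instead of the far fewer combined sums.

/* Hypothetical sketch of the register-pressure effect of sinking
   described above; whether pass_sink_code actually fires on this toy
   depends on the exact flags and pass pipeline.
   Unsunk, only s0/s1/s2 (3 FP values) are live across the branch.
   After sinking them into the if (), their six addends are live
   across it instead, i.e. pressure goes up by 3 FP regs - the same
   kind of increase (+5 in LBM) counted above.  */
double
sink_pressure (double a, double b, double c, double d,
               double e, double h, int cond)
{
  double s0 = a + b;   /* only used inside the conditional block */
  double s1 = c + d;   /* likewise */
  double s2 = e + h;   /* likewise */
  double r = 1.0;
  if (cond)
    r = s0 * s1 + s2;
  return r;
}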
So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then this improves to 114 seconds, with sink2 disabled I get 108 seconds and with the tune-ctrl ontop I get 113 seconds. Note that Zen2 is quite special in that it has the ability to handle load/store from the stack by mapping it to a register, effectively making them zero latency (zen3 lost this ability). So while moves between GPRs and XMM might not be bad anymore _spilling_ to a GPR (and I suppose XMM, too) is still a bad idea and the stack should be preferred. Not sure if it's possible to do that though. Doing the same experiment as above on a Zen3 machine would be nice, too.
  1703 :   401cb1:  vmovq  %xmm1,%r9    (*)
   834 :   401cb6:  vmovq  %r8,%xmm1
  1719 :   401cbb:  vmovq  %r9,%xmm0    (*)

It looks like %r9 is dead after the second (*), and this can be optimized to

  1703 :   401cb1:  vmovq  %xmm1,%xmm0
   834 :   401cb6:  vmovq  %r8,%xmm1
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds; when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec this
> improves to 114 seconds, with sink2 disabled I get 108 seconds, and with
> the tune-ctrl on top of that I get 113 seconds.
>
> Note that Zen2 is quite special in that it has the ability to handle
> loads/stores from the stack by mapping them to a register, effectively
> making them zero latency (Zen3 lost this ability).
>
> So while moves between GPRs and XMMs might not be bad anymore, _spilling_
> to a GPR (and I suppose to an XMM, too) is still a bad idea and the stack
> should be preferred.

According to znver2_cost the cost of sse_to_integer is a little bit less than fp_store; maybe increasing the sse_to_integer cost (to more than fp_store) can help the RA choose memory instead of a GPR.
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds; when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec this
> improves to 114 seconds, with sink2 disabled I get 108 seconds, and with
> the tune-ctrl on top of that I get 113 seconds.

With -Ofast -march=znver2 -fschedule-insns -fsched-pressure I get 113 seconds.
(In reply to Hongtao.liu from comment #10)
> (In reply to Richard Biener from comment #8)
> > So w/ -Ofast -march=znver2 I get a runtime of 130 seconds; when I add
> > -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec this
> > improves to 114 seconds, with sink2 disabled I get 108 seconds, and with
> > the tune-ctrl on top of that I get 113 seconds.
> >
> > Note that Zen2 is quite special in that it has the ability to handle
> > loads/stores from the stack by mapping them to a register, effectively
> > making them zero latency (Zen3 lost this ability).
> >
> > So while moves between GPRs and XMMs might not be bad anymore, _spilling_
> > to a GPR (and I suppose to an XMM, too) is still a bad idea and the stack
> > should be preferred.
>
> According to znver2_cost the cost of sse_to_integer is a little bit less
> than fp_store; maybe increasing the sse_to_integer cost (to more than
> fp_store) can help the RA choose memory instead of a GPR.

That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm, but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above, Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special, but modeling that doesn't seem to be necessary.

We seem to have store costs of 8 and load costs of 6, I'll try bumping the gpr<->xmm move cost to 8.
> > According to znver2_cost the cost of sse_to_integer is a little bit less
> > than fp_store; maybe increasing the sse_to_integer cost (to more than
> > fp_store) can help the RA choose memory instead of a GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm, but
> GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above,
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat
> special, but modeling that doesn't seem to be necessary.
>
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.

I was simply following latencies here, so indeed the reg<->mem bypass is not really modelled. I recall doing a few experiments, which were kind of inconclusive.
(In reply to Hongtao.liu from comment #9)
>   1703 :   401cb1:  vmovq  %xmm1,%r9    (*)
>    834 :   401cb6:  vmovq  %r8,%xmm1
>   1719 :   401cbb:  vmovq  %r9,%xmm0    (*)
>
> It looks like %r9 is dead after the second (*), and this can be optimized to
>
>   1703 :   401cb1:  vmovq  %xmm1,%xmm0
>    834 :   401cb6:  vmovq  %r8,%xmm1

Yep, we also have code like

-       movabsq $0x3ff03db8fde2ef4e, %r8
...
-       vmovq   %r8, %xmm11

or

        movq    .LC11(%rip), %rax
        vmovq   %rax, %xmm14

which is extremely odd to see ... (I didn't check how we arrive at that)

When I do

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
                                            in 32,64,128,256 and 512-bit.  */
   {8, 8, 8, 8, 16},                     /* cost of storing SSE registers
                                            in 32,64,128,256 and 512-bit.  */
-  6, 6,                                 /* SSE->integer and integer->SSE
+  8, 8,                                 /* SSE->integer and integer->SSE
                                            moves.  */
   8, 8,                         /* mask->integer and integer->mask moves */
   {6, 6, 6},                            /* cost of loading mask register

performance improves from 128 seconds to 115 seconds. The result is a lot more stack spilling in the code, but there are still cases like

        movq    .LC8(%rip), %rax
        vmovsd  .LC13(%rip), %xmm6
        vmovsd  .LC16(%rip), %xmm11
        vmovsd  .LC14(%rip), %xmm3
        vmovsd  .LC12(%rip), %xmm14
        vmovq   %rax, %xmm2
        vmovq   %rax, %xmm0
        movq    .LC9(%rip), %rax

See how we load .LC8 to %rax just to move it to xmm2 and xmm0, instead of at least moving xmm2 to xmm0 (maybe that's now up to cprop_hardreg) or loading directly to xmm0.

In the end register pressure is the main issue, but how we deal with it is bad. It's likely caused by a combination of PRE & hoisting & sinking, which together exploit

  if( ((*((unsigned int*) ((void*) (&((((srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
       (0)*(1*(100))+(0)*(1*(100))*(1*(100))))+(i)]))))))) & (ACCEL))) {
    ux = 0.005;
    uy = 0.002;
    uz = 0.000;
  }

which makes the following computations partly compile-time resolvable.

I still think the above code generation issues need to be analyzed, and we should figure out why we emit this weird code under register pressure. I'll attach a testcase that has the function in question split out for easier analysis.
Created attachment 52300 [details]
LBM_performStreamCollide testcase

This is the relevant function.
> Yep, we also have code like
>
> -       movabsq $0x3ff03db8fde2ef4e, %r8
> ...
> -       vmovq   %r8, %xmm11

It is loading a random constant to xmm11. Since reg<->xmm moves are relatively cheap, it looks OK to me that we generate this. Is it faster to load the constant from memory?

>         movq    .LC11(%rip), %rax
>         vmovq   %rax, %xmm14

This is odd indeed, and it is even more odd that we do both a movabs and a memory load... The i386 backend plays some games with allowing some constants in SSE instructions (to allow simplification and combining) and splits them out to memory later. It may be a consequence of this.
So in .reload we have (with unpatched trunk)

  401: NOTE_INSN_BASIC_BLOCK 6
  462: ax:DF=[`*.LC0']
      REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
  407: xmm2:DF=ax:DF
  463: ax:DF=[`*.LC0']
      REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
  408: xmm4:DF=ax:DF

why??!  We can load .LC0 into xmm4 directly.  IRA sees

  401: NOTE_INSN_BASIC_BLOCK 6
  407: r118:DF=r482:DF
  408: r119:DF=r482:DF

Now, I cannot really decipher the IRA or LRA dumps, but my guess would be that inheritance (causing us to load from .LC0) interferes badly with register class assignment?

         Changing pseudo 482 in operand 1 of insn 407 on equiv 9.850689999999999724167309977929107844829559326171875e-1
...
          alt=21,overall=9,losers=1,rld_nregs=1
         Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
      Creating newreg=525, assigning class GENERAL_REGS to r525
  407: r118:DF=r525:DF
    Inserting insn reload before:
  462: r525:DF=[`*.LC0']
      REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1

We should have preferred alt 14, I think - (0) v (1) m - but that has

          alt=14,overall=13,losers=1,rld_nregs=0
            0 Spill pseudo into memory: reject+=3
            Using memory insn operand 0: reject+=3
            0 Non input pseudo reload: reject++
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++
          alt=15,overall=28,losers=3 -- refuse
            0 Costly set: reject++
          alt=16: Bad operand -- refuse
            0 Costly set: reject++
            1 Costly loser: reject++
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++
          alt=17,overall=17,losers=2 -- refuse
            0 Costly set: reject++
            1 Spill Non-pseudo into memory: reject+=3
            Using memory insn operand 1: reject+=3
            1 Non input pseudo reload: reject++
          alt=18,overall=14,losers=1 -- refuse
            0 Spill pseudo into memory: reject+=3
            Using memory insn operand 0: reject+=3
            0 Non input pseudo reload: reject++
            1 Costly loser: reject++
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++
          alt=19,overall=29,losers=3 -- refuse
            0 Non-prefered reload: reject+=600
            0 Non input pseudo reload: reject++
          alt=20,overall=607,losers=1 -- refuse
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++

I'm not sure I can decipher the reasoning, but it doesn't seem to anticipate the cost of reloading the GPR in the alternative it chooses?

Vlad?
For the case of LBM, what also helps is disabling PRE or using PGO (which sees the useless PRE), given that the path on which the expressions become partially compile-time computable is never taken at runtime. In theory we could isolate that path completely via -ftracer, but the pass pipeline setup doesn't look optimal for that.
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > > According to znver2_cost the cost of sse_to_integer is a little bit less
> > > than fp_store; maybe increasing the sse_to_integer cost (to more than
> > > fp_store) can help the RA choose memory instead of a GPR.
> >
> > That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm, but
> > GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above,
> > Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat
> > special, but modeling that doesn't seem to be necessary.
> >
> > We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> > gpr<->xmm move cost to 8.
>
> I was simply following latencies here, so indeed the reg<->mem bypass is
> not really modelled. I recall doing a few experiments, which were kind of
> inconclusive.

Yes, I think xmm->gpr->xmm vs. xmm->mem->xmm isn't really the issue here; it's mem->gpr->xmm vs. mem->xmm with all the constant pool remats.

Agner lists a latency of 3 for gpr<->xmm and a latency of 4 for mem<->xmm, but then there is forwarding (and the clever renaming trick) which likely makes xmm->mem->xmm cheaper than 4 + 4, while xmm->gpr->xmm will really be 3 + 3 latency. gpr<->xmm also seems to be more resource constrained.

In any case, for moving xmm to gpr it doesn't make sense to go through memory, but it doesn't seem worthwhile to spill to xmm or gpr when we only use gpr / xmm later. That is leaving aside the odd and bogus code we generate for the .LC0 re-materialization, which we should fix - and fixing it will likely fix LBM.
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> > Yep, we also have code like
> >
> > -       movabsq $0x3ff03db8fde2ef4e, %r8
> > ...
> > -       vmovq   %r8, %xmm11
>
> It is loading a random constant to xmm11. Since reg<->xmm moves are
> relatively cheap, it looks OK to me that we generate this. Is it faster
> to load the constant from memory?

I would say so. It saves code size and also uop space, unless the two can magically fuse to an immediate-to-%xmm move (I doubt that).

> >         movq    .LC11(%rip), %rax
> >         vmovq   %rax, %xmm14
>
> This is odd indeed, and it is even more odd that we do both a movabs and
> a memory load... The i386 backend plays some games with allowing some
> constants in SSE instructions (to allow simplification and combining) and
> splits them out to memory later. It may be a consequence of this.

I've pasted the LRA dump pieces I think are relevant, but I don't understand them. The constant load isn't visible originally but is introduced by LRA, so that may be the key to the mystery here.
> I would say so. It saves code size and also uop space, unless the two
> can magically fuse to an immediate-to-%xmm move (I doubt that).

I made a simple benchmark:

double a=10;
int main()
{
  long int i;
  double sum = 0, val1, val2, val3, val4;
  for (i=0;i<1000000000;i++)
  {
#if 1
#if 1
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
    asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
    asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
    asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
    asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
    sum+=val1+val2+val3+val4;
  }
  return sum;
}

and indeed the third variant runs in 1.2s while the first two both take 2.4s on my Zen2 laptop.
Is this related to PR 104059? Can you try: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html
(In reply to H.J. Lu from comment #22)
> Is this related to PR 104059? Can you try:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html

I verified that hardreg cprop does nothing on the testcase, so no.
For

        vmovq   %rdi, %xmm7     # 503   [c=4 l=4]  *movdf_internal/21
..
        vmulsd  %xmm7, %xmm4, %xmm5     # 320   [c=12 l=4]  *fop_df_comm/2
..
        movabsq $0x3fef85af6c69b5a6, %rdi       # 409   [c=5 l=10]  *movdf_internal/11

and

(insn 320 319 322 8 (set (reg:DF 441)
        (mult:DF (reg:DF 166 [ _323 ])
            (reg:DF 249 [ _900 ]))) "../test.c":87:218 1072 {*fop_df_comm}
     (expr_list:REG_DEAD (reg:DF 249 [ _900 ])
        (nil)))

the RA allocates rdi for r249 because the cost of a general reg is cheaper than mem:

  a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964 BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964 CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:19546,19546 FP_SECOND_REG:19546,19546 FLOAT_REGS:19546,19546 SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:19546,19546 FLOAT_INT_REGS:19546,19546 INT_SSE_REGS:19546,19546 FLOAT_INT_SSE_REGS:19546,19546 MEM:6294,6294

  r249: preferred SSE_REGS, alternative GENERAL_REGS, allocno INT_SSE_REGS

  Disposition:
    66:r249 l1     5

With -mtune=alderlake the cost of general regs for r249 is more expensive than mem, the RA allocates mem for it, and then no more movabsq/vmovq is needed:

  a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964 BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964 CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:14908,14908 FP_SECOND_REG:14908,14908 FLOAT_REGS:14908,14908 SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:14908,14908 FLOAT_INT_REGS:14908,14908 INT_SSE_REGS:14908,14908 FLOAT_INT_SSE_REGS:14908,14908 MEM:5632,5632

  r249: preferred SSE_REGS, alternative NO_REGS, allocno SSE_REGS

    66:r249 l1   mem

        vmulsd  -80(%rsp), %xmm2, %xmm3 # 320   [c=29 l=6]  *fop_df_comm/2

Guess we need to let the RA know that the mem cost is cheaper than GPR for r249.
> Guess we need to let the RA know that the mem cost is cheaper than GPR for
> r249.

Reduce the sse_store cost?
(In reply to Richard Biener from comment #7)
> make the costs such that IRA/LRA prefer re-materialization of constants
> from the constant pool over spilling to GPRs (if that's possible at all -
> Vlad?)

LRA rematerialization cannot rematerialize a constant value from the constant pool. It can only rematerialize the value of an expression consisting of other pseudos (currently assigned to hard regs) and constants.

I guess the rematerialization pass could be extended to work for constants from the constant memory pool. It is a pretty doable project, as opposed to rematerialization of arbitrary memory, which would require a lot of analysis (including aliasing) and complicated cost/benefit calculation. Maybe somebody could pick this project up.
(In reply to Richard Biener from comment #17)
> So in .reload we have (with unpatched trunk)
>
>   401: NOTE_INSN_BASIC_BLOCK 6
>   462: ax:DF=[`*.LC0']
>       REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>   407: xmm2:DF=ax:DF
>   463: ax:DF=[`*.LC0']
>       REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>   408: xmm4:DF=ax:DF
>
> why??!  We can load .LC0 into xmm4 directly.  IRA sees
>
>   401: NOTE_INSN_BASIC_BLOCK 6
>   407: r118:DF=r482:DF
>   408: r119:DF=r482:DF
>
> Now, I cannot really decipher the IRA or LRA dumps, but my guess would be
> that inheritance (causing us to load from .LC0) interferes badly with
> register class assignment?
>
>          Changing pseudo 482 in operand 1 of insn 407 on equiv
> 9.850689999999999724167309977929107844829559326171875e-1
> ...
>           alt=21,overall=9,losers=1,rld_nregs=1
>          Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
>       Creating newreg=525, assigning class GENERAL_REGS to r525
>   407: r118:DF=r525:DF
>     Inserting insn reload before:
>   462: r525:DF=[`*.LC0']
>       REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>
> We should have preferred alt 14, I think - (0) v (1) m - but that has
>
>           alt=14,overall=13,losers=1,rld_nregs=0
>             0 Spill pseudo into memory: reject+=3
>             Using memory insn operand 0: reject+=3
>             0 Non input pseudo reload: reject++
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>           alt=15,overall=28,losers=3 -- refuse
>             0 Costly set: reject++
>           alt=16: Bad operand -- refuse
>             0 Costly set: reject++
>             1 Costly loser: reject++
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>           alt=17,overall=17,losers=2 -- refuse
>             0 Costly set: reject++
>             1 Spill Non-pseudo into memory: reject+=3
>             Using memory insn operand 1: reject+=3
>             1 Non input pseudo reload: reject++
>           alt=18,overall=14,losers=1 -- refuse
>             0 Spill pseudo into memory: reject+=3
>             Using memory insn operand 0: reject+=3
>             0 Non input pseudo reload: reject++
>             1 Costly loser: reject++
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>           alt=19,overall=29,losers=3 -- refuse
>             0 Non-prefered reload: reject+=600
>             0 Non input pseudo reload: reject++
>           alt=20,overall=607,losers=1 -- refuse
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>
> I'm not sure I can decipher the reasoning, but it doesn't seem to anticipate
> the cost of reloading the GPR in the alternative it chooses?
>
> Vlad?

All this diagnostic output is just a description of the voodoo from the old reload pass. LRA chooses alternatives the same way the old reload pass did (I doubt any other approach would avoid breaking all existing targets); the old reload pass simply does not report its decisions in a dump.

The LRA code choosing insn alternatives (lra-constraints.cc::process_alt_operands), like the old reload pass, does not use any memory or register move costs. Instead, the alternative is chosen by heuristics and insn constraint hints (like ? and !). The only case where these costs are used is when we have reg:=reg and the register move cost for it is 2; in that case LRA (reload) does not bother to check the insn constraints.
Could somebody benchmark the following patch on Zen2 470.lbm?

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..76619aca8eb 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
 		     (x, lra_get_allocno_class (i)) == NO_REGS))
 		|| contains_symbol_ref_p (x))))
 	  ira_reg_equiv[i].defined_p = false;
-	if (contains_reg_p (x, false, true))
+	if (contains_reg_p (x, false, true)
+	    || (CONST_DOUBLE_P (x)
+		&& maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
 	  ira_reg_equiv[i].profitable_p = false;
 	if (get_equiv (reg) != reg)
 	  bitmap_ior_into (equiv_insn_bitmap, &lra_reg_info[i].insn_bitmap);

If it improves the performance, I'll commit this patch.

The expander unconditionally uses the constant pool for double constants; I think the analogous treatment could be done for equiv double constants in LRA.

To my knowledge, only x86_64 permits 64-bit constants as immediates for moves into general regs. As double fp operations are not done in general regs in most cases, the values then have to be moved into fp regs, and this is costly, as Jan wrote. So it makes sense to prohibit using equiv double constant values in LRA unconditionally. If in the future we have a target which can move a double immediate into fp regs, we can introduce a target hook to deal with equiv double constants, but right now I think there is no need for a new hook.
(In reply to Vladimir Makarov from comment #28)
> Could somebody benchmark the following patch on Zen2 470.lbm?
>
> diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
> index 9cee17479ba..76619aca8eb 100644
> --- a/gcc/lra-constraints.cc
> +++ b/gcc/lra-constraints.cc
> @@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
>  		     (x, lra_get_allocno_class (i)) == NO_REGS))
>  		|| contains_symbol_ref_p (x))))
>  	  ira_reg_equiv[i].defined_p = false;
> -	if (contains_reg_p (x, false, true))
> +	if (contains_reg_p (x, false, true)
> +	    || (CONST_DOUBLE_P (x)
> +		&& maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
>  	  ira_reg_equiv[i].profitable_p = false;
>  	if (get_equiv (reg) != reg)
>  	  bitmap_ior_into (equiv_insn_bitmap, &lra_reg_info[i].insn_bitmap);
>
> If it improves the performance, I'll commit this patch.

Code generation changes quite a bit; with the patch the offending function is 16 bytes larger. I see no large immediate moves to GPRs anymore, but there is still a lot of spilling of XMMs to GPRs. Performance is unchanged by the patch:

470.lbm         13740        128       107 S    13740        128       107 S
470.lbm         13740        128       107 *    13740        128       107 S
470.lbm         13740        128       107 S    13740        128       107 *

I've used

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..a0ec608c056 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5084,9 @@ lra_constraints (bool first_p)
 		     (x, lra_get_allocno_class (i)) == NO_REGS))
 		|| contains_symbol_ref_p (x))))
 	  ira_reg_equiv[i].defined_p = false;
-	if (contains_reg_p (x, false, true))
+	if (contains_reg_p (x, false, true)
+	    || (CONST_DOUBLE_P (x)
+		&& maybe_ge (GET_MODE_SIZE (GET_MODE (x)), UNITS_PER_WORD)))
 	  ira_reg_equiv[i].profitable_p = false;
 	if (get_equiv (reg) != reg)
 	  bitmap_ior_into (equiv_insn_bitmap, &lra_reg_info[i].insn_bitmap);

note UNITS_PER_WORD vs. the literal 8.

Without knowing much of the code, I wonder if we can check whether the move will be to a reg in GENERAL_REGS? That is, do we know whether there are (besides some special constants like zero) immediate moves to the destination register class?

That said, given the result on LBM I'd not change this at this point.

Honza wanted to look at the move pattern to try to mitigate the GPR spilling of XMMs.

I do think that we need to take costs into account at some point and get rid of the reload-style hand-waving with !?* in the move patterns.
(In reply to Richard Biener from comment #29)
> (In reply to Vladimir Makarov from comment #28)
> > Could somebody benchmark the following patch on Zen2 470.lbm?
>
> Code generation changes quite a bit; with the patch the offending function
> is 16 bytes larger. I see no large immediate moves to GPRs anymore, but
> there is still a lot of spilling of XMMs to GPRs. Performance is unchanged
> by the patch:
>
> 470.lbm         13740        128       107 S    13740        128       107 S
> 470.lbm         13740        128       107 *    13740        128       107 S
> 470.lbm         13740        128       107 S    13740        128       107 *

Thank you very much for testing the patch, Richard. The results mean no go for the patch to me.

> Without knowing much of the code, I wonder if we can check whether the move
> will be to a reg in GENERAL_REGS? That is, do we know whether there are
> (besides some special constants like zero) immediate moves to the
> destination register class?

There is no such info from the target code. Ideally we would need the cost of loading a *particular* immediate value into a register class, on the same cost basis as loads/stores. Still, to use this info efficiently, choosing alternatives should be based on costs, not on hints and machine-independent general heuristics (as now).

> That said, given the result on LBM I'd not change this at this point.
>
> Honza wanted to look at the move pattern to try to mitigate the
> GPR spilling of XMMs.
>
> I do think that we need to take costs into account at some point and get
> rid of the reload-style hand-waving with !?* in the move patterns.

In general I agree with the direction, but it will be quite hard to do. I know this well from my experience changing the register class cost calculation algorithm in IRA (the experimental code can be found on the ira-select branch). I expect a huge number of test failures and some benchmark performance degradation on practically any target, and a big involvement of target maintainers to fix them. Although it is possible to try to do this one target at a time.
For the testcase I still see 58 GPR <-> XMM vmovq moves at -Ofast -march=znver2, resulting from spilling of XMMs. And there are still cases like

        movq    .LC0(%rip), %rax
...
        vmovq   %rax, %xmm2
        vmovq   %rax, %xmm4

Honza - you said you wanted to have a look here. As I understand Vlad, costing isn't taken into account when choosing alternatives for reloading.
So the bad "head" can be fixed via

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index c74edd1aaef..8f9f26e0a82 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3580,9 +3580,9 @@
 ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 and 7.
 (define_insn "*movdf_internal"
   [(set (match_operand:DF 0 "nonimmediate_operand"
-    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r ,v,r  ,o ,r  ,m")
+    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,!r,!v,r  ,o ,r  ,m")
 	(match_operand:DF 1 "general_operand"
-    "Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r ,roF,rF,rmF,rC"))]
+    "Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,!v,!r,roF,rF,rmF,rC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
    && (lra_in_progress || reload_completed
        || !CONST_DOUBLE_P (operands[1])

which adds ! to the r<->v alternatives. That should eventually be done by duplicating the alternatives and enabling one set via some enable attribute based on a tunable. I see those alternatives already have

   (set (attr "preferred_for_speed")
     (cond [(eq_attr "alternative" "3,4")
	      (symbol_ref "TARGET_INTEGER_DFMODE_MOVES")
	    (eq_attr "alternative" "20")
	      (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
	    (eq_attr "alternative" "21")
	      (symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
	   ]
	   (symbol_ref "true")))

but I am not sure why it's preferred_for_speed here - shouldn't that be enabled for size if !TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC and otherwise disabled? Not sure if combining enabled and preferred_for_speed is reasonably possible, but we have a preferred_for_size attribute here.

The diff with ! added is quite short; I've yet to measure any effect on LBM:

--- streamcollide.s.orig	2022-04-25 11:37:01.638733951 +0200
+++ streamcollide.s2	2022-04-25 11:35:54.885849296 +0200
@@ -33,28 +33,24 @@
 	.p2align 4
 	.p2align 3
 .L12:
-	movq	.LC0(%rip), %rax
-	vmovsd	.LC4(%rip), %xmm6
+	vmovsd	.LC0(%rip), %xmm2
+	vmovsd	.LC1(%rip), %xmm13
+	movabsq	$0x3ff01878b7a1c25d, %rax
 	movabsq	$0x3fef85af6c69b5a6, %rdi
+	vmovsd	.LC2(%rip), %xmm12
+	vmovsd	.LC3(%rip), %xmm14
 	movabsq	$0x3ff03db8fde2ef4e, %r8
+	movabsq	$0x3fefcea39c51dabe, %r9
+	vmovsd	.LC4(%rip), %xmm6
 	vmovsd	.LC5(%rip), %xmm7
 	movq	.LC8(%rip), %r11
-	movabsq	$0x3fefcea39c51dabe, %r9
 	movq	.LC6(%rip), %rdx
 	movq	.LC7(%rip), %rcx
-	vmovq	%rax, %xmm2
-	vmovq	%rax, %xmm4
-	movq	.LC1(%rip), %rax
 	movq	%r11, %rsi
 	movq	%r11, %r12
-	vmovq	%rax, %xmm13
-	vmovq	%rax, %xmm8
-	movq	.LC2(%rip), %rax
-	vmovq	%rax, %xmm12
-	vmovq	%rax, %xmm5
-	movq	.LC3(%rip), %rax
-	vmovq	%rax, %xmm14
-	movabsq	$0x3ff01878b7a1c25d, %rax
+	vmovsd	%xmm2, %xmm2, %xmm4
+	vmovsd	%xmm13, %xmm13, %xmm8
+	vmovsd	%xmm12, %xmm12, %xmm5
 	vmovsd	%xmm14, -16(%rsp)
 .L5:
 	vmulsd	.LC9(%rip), %xmm0, %xmm3
(In reply to Richard Biener from comment #32)
> The diff with ! added is quite short; I've yet to measure any effect on LBM:
>
> --- streamcollide.s.orig	2022-04-25 11:37:01.638733951 +0200
> +++ streamcollide.s2	2022-04-25 11:35:54.885849296 +0200
> @@ -33,28 +33,24 @@
>  	.p2align 4
>  	.p2align 3
>  .L12:
> -	movq	.LC0(%rip), %rax
> -	vmovsd	.LC4(%rip), %xmm6
> +	vmovsd	.LC0(%rip), %xmm2
> +	vmovsd	.LC1(%rip), %xmm13
> +	movabsq	$0x3ff01878b7a1c25d, %rax
>  	movabsq	$0x3fef85af6c69b5a6, %rdi
> +	vmovsd	.LC2(%rip), %xmm12
> +	vmovsd	.LC3(%rip), %xmm14
>  	movabsq	$0x3ff03db8fde2ef4e, %r8
> +	movabsq	$0x3fefcea39c51dabe, %r9
> +	vmovsd	.LC4(%rip), %xmm6
>  	vmovsd	.LC5(%rip), %xmm7
>  	movq	.LC8(%rip), %r11
> -	movabsq	$0x3fefcea39c51dabe, %r9
>  	movq	.LC6(%rip), %rdx
>  	movq	.LC7(%rip), %rcx
> -	vmovq	%rax, %xmm2
> -	vmovq	%rax, %xmm4
> -	movq	.LC1(%rip), %rax
>  	movq	%r11, %rsi
>  	movq	%r11, %r12
> -	vmovq	%rax, %xmm13
> -	vmovq	%rax, %xmm8
> -	movq	.LC2(%rip), %rax
> -	vmovq	%rax, %xmm12
> -	vmovq	%rax, %xmm5
> -	movq	.LC3(%rip), %rax
> -	vmovq	%rax, %xmm14
> -	movabsq	$0x3ff01878b7a1c25d, %rax
> +	vmovsd	%xmm2, %xmm2, %xmm4
> +	vmovsd	%xmm13, %xmm13, %xmm8
> +	vmovsd	%xmm12, %xmm12, %xmm5
>  	vmovsd	%xmm14, -16(%rsp)
>  .L5:
>  	vmulsd	.LC9(%rip), %xmm0, %xmm3

Huh, and the net effect is that the + code is 9% _slower_!?
As noted, the effect of

  if(...) {
    ux = 0.005;
    uy = 0.002;
    uz = 0.000;
  }

is PRE of most(!) of the dependent instructions, creating

  # prephitmp_1099 = PHI <_1098(6), 6.49971724999999889149648879538290202617645263671875e-1(5)>
  # prephitmp_1111 = PHI <_1110(6), 1.089805708333333178483570691241766326129436492919921875e-1(5)>
...

We successfully coalesce the non-constant incoming register with the result, but have to emit copies for all the constants on the other edge, where we have quite a number of duplicate constants to deal with.

I've experimented with ensuring we get _full_ PRE of the dependent expressions by more aggressively re-associating (giving PHIs with a constant incoming operand on at least one edge a rank similar to constants, 1). This increases the number of PHIs further but reduces the follow-up computations more.

We still fail to simply tail-duplicate the merge block - another possibility to eventually save some of the overhead; our tail duplication code (gimple-ssa-split-paths.cc) doesn't handle this case since the diamond is not the one immediately preceding the loop exit/latch.

The result of "full PRE" is a little bit worse than the current state (so it's not a full solution here).
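To make the mechanism concrete, here is a minimal, hypothetical C sketch (not the LBM source) of the shape being described; whether PRE fires on this exact toy depends on the flags and pass pipeline:

/* Hypothetical sketch of the PRE effect described above.  On the
   'accel' path ux/uy are constants, so the value of u2 is fully
   computable at compile time there; GIMPLE PRE can then insert the
   computation into the non-constant predecessor, fold it to a
   constant (4.35e-05) on the constant predecessor, and replace the
   use after the merge with something like
     prephitmp = PHI <computed(non-accel edge), 4.35e-05(accel edge)>
   (in practice one PHI per intermediate operation).  Each such PHI
   needs a register, or a constant copy on the edge - the pressure
   increase seen in LBM.  */
double
pre_shape (double x, int accel)
{
  double ux = x, uy = x * 0.5;
  if (accel)
    {
      ux = 0.005;
      uy = 0.002;
    }
  double u2 = 1.5 * (ux * ux + uy * uy);
  return u2;
}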
(In reply to Richard Biener from comment #34)
> As noted, the effect of
>
>   if(...) {
>     ux = 0.005;
>     uy = 0.002;
>     uz = 0.000;
>   }
>
> is PRE of most(!) of the dependent instructions, creating
>
>   # prephitmp_1099 = PHI <_1098(6), 6.49971724999999889149648879538290202617645263671875e-1(5)>
>   # prephitmp_1111 = PHI <_1110(6), 1.089805708333333178483570691241766326129436492919921875e-1(5)>
> ...
>
> We successfully coalesce the non-constant incoming register with the result,
> but have to emit copies for all the constants on the other edge, where we
> have quite a number of duplicate constants to deal with.
>
> I've experimented with ensuring we get _full_ PRE of the dependent
> expressions by more aggressively re-associating (giving PHIs with a constant
> incoming operand on at least one edge a rank similar to constants, 1). This
> increases the number of PHIs further but reduces the follow-up computations
> more.
>
> We still fail to simply tail-duplicate the merge block - another possibility
> to eventually save some of the overhead; our tail duplication code
> (gimple-ssa-split-paths.cc) doesn't handle this case since the diamond is
> not the one immediately preceding the loop exit/latch.
>
> The result of "full PRE" is a little bit worse than the current state (so
> it's not a full solution here).

Btw, looking at coverage, the constant case is only an unimportant fraction of the runtime, so the register-pressure increase from the PRE dominates (but the branch is predicted to be 50/50):

3562383000:  241:	if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
  55296000:  242:		ux = 0.005;
  55296000:  243:		uy = 0.002;
  55296000:  244:		uz = 0.000;
         -:  245:	}

We can also see that PGO notices this, and with it we do _not_ perform the PRE.

So the root cause is nothing we can fix for GCC 12; tuning to avoid spilling to GPRs can recover parts of the regression but will definitely have effects elsewhere. Re-targeting to GCC 13.
GCC 13.1 is being released, retargeting bugs to GCC 13.2.
GCC 13.2 is being released, retargeting bugs to GCC 13.3.
GCC 13.3 is being released, retargeting bugs to GCC 13.4.