https://gcc.opensuse.org/gcc-old/SPEC/CINT/sb-czerny-head-64-2006/456_hmmer_big.png shows two recent regressions, one around June 13th and one around Jul 3rd.
Jul 1st regression: last known good is r272839, first known bad r272955. June 13th regression: last known good is r272230, first known bad r272280.
Bisecting the Jul 1st regression now.
The Jul 1st regression looks like it is r272922 (mine :/). Still need to bisect the older one.
For the Jul 1st regression the DOM change causes us to recognize MIN/MAX patterns way earlier (they are exposed by conditional store elimination). While before the change we had to if-convert 5 loops of P7Viterbi, after the change only one remains. This naturally affects the scalar fallback.

perf shows (base is r272921, peak is r272922)

  41.37%    611991  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
  29.84%    441484  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi
  15.32%    226646  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi.cold
   6.74%     99771  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi.cold

so the cold part is way faster but somehow the hot part is quite a bit slower. I suspect profile changes in the end; the sums are 668130 vs. 711762 samples. -fopt-info-loop doesn't show any changes, and vectorizer cost-model estimates (and thus runtime niter checks) are the same.

Note the Jul 1st regression is in the range where PR90911 was present, which should be the observed Jun 13th regression (r272239).

perf points to a sequence of two cmovs being slow compared to one cmov and a conditional branch.

Fast:

       │        dc[k] = dc[k-1] + tpdd[k-1];
  7.42 │ fd0:   mov    -0x80(%rbp),%rbx
  0.08 │        add    -0x8(%rbx,%rax,4),%ecx
       │        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  0.24 │        mov    -0xa0(%rbp),%rbx
       │        dc[k] = dc[k-1] + tpdd[k-1];
  0.07 │        mov    %ecx,-0x4(%rdx,%rax,4)
       │        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  7.41 │        mov    -0x8(%r15,%rax,4),%esi
  2.69 │        add    -0x8(%rbx,%rax,4),%esi
  3.31 │        cmp    %ecx,%esi
  2.85 │        cmovge %esi,%ecx
 12.17 │        mov    %ecx,-0x4(%rdx,%rax,4)
       │        if (dc[k] < -INFTY) dc[k] = -INFTY;
  7.43 │        cmp    $0xc521974f,%ecx
       │        jge    1060
  0.01 │        movl   $0xc521974f,-0x4(%rdx,%rax,4)
       │        for (k = 1; k <= M; k++) {
  0.02 │        mov    %eax,%esi
  0.02 │        inc    %rax
  0.01 │        cmp    %rax,%rdi
       │        je     106e
       │        if (dc[k] < -INFTY) dc[k] = -INFTY;
  0.01 │        mov    $0xc521974f,%ecx
  0.01 │        jmp    fd0
  ...
  0.00 │1060:   mov    %eax,%esi
  0.03 │        inc    %rax
  0.00 │        cmp    %rdi,%rax
  0.00 │        jne    fd0
  0.06 │106e:   cmp    -0x64(%rbp),%esi
       │        jg     1508

slow:

       │        dc[k] = dc[k-1] + tpdd[k-1];
  0.06 │13b0:   add    -0x8(%r9,%rcx,4),%eax
  0.07 │        mov    %eax,-0x4(%r13,%rcx,4)
       │        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  6.81 │        mov    -0x8(%r8,%rcx,4),%esi
  0.04 │        add    -0x8(%rdx,%rcx,4),%esi
  1.46 │        cmp    %eax,%esi
  0.29 │        cmovge %esi,%eax
       │        for (k = 1; k <= M; k++) {
 13.43 │        mov    %ecx,%esi
  0.05 │        cmp    $0xc521974f,%eax
  6.78 │        cmovl  %ebx,%eax
 13.53 │        mov    %eax,-0x4(%r13,%rcx,4)
  6.81 │        inc    %rcx
  0.03 │        cmp    %rcx,%rdi
       │        jne    13b0

where on GIMPLE we have mostly MAX_EXPR <..., -987654321> for the slow case and a conditional statement in the fast case. Of course(?) a jump should be well predicted (and can be speculated [correctly]) while the data dependence on the earlier access is not.

Thus in the end this is an RTL expansion / if-conversion or backend costing issue. The following heuristic^Whack fixes the regression for me:

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 272922)
+++ gcc/expr.c  (working copy)
@@ -9159,7 +9159,9 @@ expand_expr_real_2 (sepops ops, rtx targ
        }

       /* Use a conditional move if possible.  */
-      if (can_conditionally_move_p (mode))
+      if ((TREE_CODE (treeop1) != INTEGER_CST
+           || !wi::eq_p (wi::to_widest (treeop1), -987654321))
+          && can_conditionally_move_p (mode))
        {
          rtx insn;

Index: gcc/ifcvt.c
===================================================================
--- gcc/ifcvt.c (revision 272922)
+++ gcc/ifcvt.c (working copy)
@@ -3570,6 +3570,13 @@ noce_process_if_block (struct noce_if_in
      which do a good enough job these days.  */
     return FALSE;

+  rtx_insn *earliest;
+  cond = noce_get_alt_condition (if_info, if_info->a, &earliest);
+  if (cond
+      && CONST_INT_P (XEXP (cond, 1))
+      && INTVAL (XEXP (cond, 1)) == -987654321)
+    return FALSE;
+
   if (noce_try_move (if_info))
     goto success;
   if (noce_try_ifelse_collapse (if_info))
One x86 option is to peephole

(insn 635 632 637 17 (set (reg:CCGC 17 flags)
        (compare:CCGC (reg:SI 0 ax [1735])
            (const_int -987654321 [0xffffffffc521974f]))) 11 {*cmpsi_1}
     (nil))
(insn 637 635 638 17 (set (reg:SI 0 ax [1737])
        (if_then_else:SI (ge (reg:CCGC 17 flags)
                (const_int 0 [0]))
            (reg:SI 0 ax [1735])
            (reg:SI 3 bx [2395]))) 947 {*movsicc_noc}
     (expr_list:REG_DEAD (reg:CCGC 17 flags)
        (expr_list:REG_EQUAL (if_then_else:SI (ge (reg:CCGC 17 flags)
                    (const_int 0 [0]))
                (reg:SI 0 ax [1735])
                (const_int -987654321 [0xffffffffc521974f]))
            (nil))))

into a compare-and-jump sequence. Benchmarking on a more recent microarchitecture is probably also in order.
So pretty much undo all ifcvt into conditional moves (aka pretend the target doesn't have cmov) or something else? The problem is that cmov isn't unconditionally bad or unconditionally good and moving in either direction usually results in significant regressions. cmov is good if the branch will be often badly predicted, so when the branch prediction is close to 50%, and is bad when it is usually well predicted.
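To make the prediction argument concrete, here is a minimal C sketch (not from the benchmark sources; the function names are made up) of the two situations being contrasted:

/* Hypothetical illustration of the cmov-vs-branch trade-off described above.  */

int clamp_rarely_taken (int x)
{
  /* In 456.hmmer the value almost never drops below -987654321, so a branch
     here is predicted correctly nearly all the time; a cmov only adds the
     result to the loop's data-dependence chain.  */
  if (x < -987654321)
    x = -987654321;
  return x;
}

int max_unpredictable (int a, int b)
{
  /* When a > b is close to a coin flip, a branch mispredicts often and a
     cmov (or a real min/max instruction) is the better choice.  */
  return a > b ? a : b;
}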
This is PR 56309, again and again.
(In reply to Jakub Jelinek from comment #6)
> So pretty much undo all ifcvt into conditional moves (aka pretend the target
> doesn't have cmov) or something else?
> The problem is that cmov isn't unconditionally bad or unconditionally good
> and moving in either direction usually results in significant regressions.
> cmov is good if the branch will be often badly predicted, so when the branch
> prediction is close to 50%, and is bad when it is usually well predicted.

I'd only do it for MAX<small-constant, x> and MIN<large-constant, x>. There's also another conditional move a few instructions up; not sure if cmov density would be another thing to key on.

If target costing generally favors the cmp + if-then-else then we may also simply advertise [us]{min,max} patterns here?

That said, RTL expansion doesn't look at cost but only asks can_conditionally_move_p (mode); hopefully if-conversion considers actual rtx cost.

Other ideas welcome of course.
(In reply to Richard Biener from comment #8)
> I'd only do it for MAX<small-constant, x> and MIN<large-constant, x>. There's
> also another conditional move a few instructions up; not sure if cmov density
> would be another thing to key on.

And if I read the referenced PR56309 correctly then the issue might be that for

        cmpl    %eax, %esi
        cmovge  %esi, %eax
        movl    %ecx, %esi
        cmpl    %ebx, %eax
        cmovl   %ebx, %eax

the second compare uses the cmov destination. But today's microarchitectures are too complex to really tell what the issue is (still, constraining a peephole quite strictly could be possible, on either or both of the patterns that appear here).
Note that on Haswell the conditional moves are two uops while on Broadwell and up they are only one uop (overall loop 16 uops vs. 18 uops). IACA doesn't show any particular issue (the iterations should neatly interleave without inter-iteration dependences), but it says the throughput bottleneck is dependency chains (not sure if it models conditional moves very well).
One could in some awkward way also rewrite the loop to use integer SSE (so we have access to min/max), relying on zero-filling scalar moves into %xmm and then using vector integer operations. We need to split the loads from the arithmetic for that, of course. With that IACA is happier (14 uops for Haswell, throughput 3.5 cycles vs. 4 before). But this is quite aggressive "STV".

        vmovd   %eax, %xmm10
        vmovd   %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
#       addl    -8(%r9,%rcx,4), %eax
        vmovd   -8(%r9,%rcx,4), %xmm13
        vpaddd  %xmm13, %xmm10, %xmm10
#       movl    %eax, -4(%r13,%rcx,4)
        vmovd   %xmm10, -4(%r13,%rcx,4)
#       movl    -8(%r8,%rcx,4), %esi
        vmovd   -8(%r8,%rcx,4), %xmm11
#       addl    -8(%rdx,%rcx,4), %esi
        vmovd   -8(%rdx,%rcx,4), %xmm13
        vpaddd  %xmm13, %xmm11, %xmm11
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
#       movl    %eax, -4(%r13,%rcx,4)
        vmovd   %xmm10, -4(%r13,%rcx,4)
        incq    %rcx
        cmpq    %rcx, %rdi
        jne     .L34

An interesting fact is that doing this _improves_ 456.hmmer by 9% beyond fixing the original regression...!

Note I carefully avoided crossing the integer/vector boundary and thus didn't try to just do the max operation in the vector domain. At least on AMD CPUs moving data between integer and FP/vector is slow (IIRC Intel doesn't care). With AVX512 can we even do vpadd %eax, %xmm0, %xmm0 (don't care if %eax is splat or just in lane zero)? IIRC there was support for 'scalar' operands on some ops.

The above experiment also clearly shows integer max/min operations are desperately missing, and cmp + cmov or cmp + branch + move isn't a good substitute.

Now - isn't STV supposed to handle exactly cases like this? Well, it seems it only looks for TImode operations.

Note ICC when tuning for Haswell produces exactly the code we are producing now (which is slow):

..B1.80:                        # Preds ..B1.80 ..B1.79
                                # Execution count [1.25e+01]
        movl      4(%r15,%rdi,4), %eax                          #146.10
        movl      4(%rsi,%rdi,4), %edx                          #147.12
        addl      4(%r13,%rdi,4), %eax                          #146.19
        addl      4(%r11,%rdi,4), %edx                          #147.20
        cmpl      %edx, %eax                                    #149.2
        cmovge    %eax, %edx                                    #149.2
        addl      4(%rbx,%rdi,4), %edx                          #148.2
        cmpl      $-987654321, %edx                             #149.2
        cmovl     %r14d, %edx                                   #149.2
        movl      %edx, 4(%r8,%rdi,4)                           #149.26
        incq      %rdi                                          #133.5
        cmpq      %rcx, %rdi                                    #133.5
        jb        ..B1.80       # Prob 82%                      #133.5

It seems to use cmov quite aggressively everywhere...
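As a concrete illustration of the lane-zero idea above, a hand-written intrinsics version of the inner loop could look as follows. This is an illustrative sketch only - the point of the PR is to have STV generate such code automatically; the function name is made up and SSE4.1 is assumed for pmaxsd (compile with -msse4.1):

#include <smmintrin.h>   /* SSE4.1 intrinsics, for _mm_max_epi32 (pmaxsd) */

void foo_sse (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  __m128i ninfty = _mm_cvtsi32_si128 (-987654321);
  __m128i d = _mm_cvtsi32_si128 (dc[0]);      /* running dc[k-1] kept in lane 0 */
  for (int k = 1; k <= M; k++)
    {
      /* dc[k] = dc[k-1] + tpdd[k-1] */
      d = _mm_add_epi32 (d, _mm_cvtsi32_si128 (tpdd[k-1]));
      /* sc = mc[k-1] + tpmd[k-1]; dc[k] = max (dc[k], sc) */
      __m128i sc = _mm_add_epi32 (_mm_cvtsi32_si128 (mc[k-1]),
                                  _mm_cvtsi32_si128 (tpmd[k-1]));
      d = _mm_max_epi32 (d, sc);
      /* if (dc[k] < -987654321) dc[k] = -987654321, i.e. max with -INFTY */
      d = _mm_max_epi32 (d, ninfty);
      dc[k] = _mm_cvtsi128_si32 (d);
    }
}

All values stay in the XMM domain across iterations; only the final store leaves it, which is the property that made the hand-edited assembly fast.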
On Skylake (Coffee Lake actually) with the same binary (built for Haswell), the improvement is down to 5%.

On Haswell, when I just replace the second conditional move like

        vmovd   %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
        ...
        cmpl    %eax, %esi
        cmovge  %esi, %eax
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vmovd   %eax, %xmm10
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)
        ...

this doesn't make a difference. Replacing both, like

        movl    -8(%r8,%rcx,4), %esi
        addl    -8(%rdx,%rcx,4), %esi
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vmovd   %eax, %xmm10
        vmovd   %esi, %xmm11
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)

makes runtime improve to within 1% of fixing the regression. I guess that's the best an insn-localized "fix" (provide a smax pattern for SImode) would get to here.

As expected, on Zen this localized "fix" is a loss (an additional 11% regression, tested with the same Haswell-tuned binary) while the full SSE variant is also a lot faster - 35% compared to the code with two cmovs and 17% compared to the good r272921 code. So it seems important to avoid crossing the gpr/xmm domain here. When I go through the stack for all three GPR<->XMM moves the situation gets even (much) worse.

Overall this means enhancing STV to seed itself on conditional moves that match max/min _and_ that can avoid GPR <-> XMM moves would be quite a big win here.
Testcase for the loop in question:

void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  int sc;
  int k;
  for (k = 1; k <= M; k++)
    {
      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -987654321)
        dc[k] = -987654321;
    }
}
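Written with the two conditionals folded into max operations (which is what the slow-case GIMPLE effectively contains, per the MAX_EXPR <..., -987654321> noted above), the same loop body is equivalent to the following hand-written C; this is the shape an integer smax pattern would have to catch:

/* Equivalent formulation with explicit max operations (hand-written
   equivalent for illustration, not a compiler dump).  */
static inline int max_int (int a, int b) { return a > b ? a : b; }

void foo_max (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  for (int k = 1; k <= M; k++)
    {
      dc[k] = max_int (dc[k-1] + tpdd[k-1], mc[k-1] + tpmd[k-1]);
      dc[k] = max_int (dc[k], -987654321);
    }
}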
With

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 273567)
+++ gcc/config/i386/i386.md     (working copy)
@@ -17681,6 +17681,23 @@
    (set_attr "type" "sseadd")
    (set_attr "mode" "<MODE>")])

+(define_expand "smaxsi3"
+  [(set (match_operand:SI 0 "register_operand")
+        (smax:SI
+          (match_operand:SI 1 "register_operand")
+          (match_operand:SI 2 "register_operand")))]
+  ""
+{
+  rtx vop1 = gen_reg_rtx (V4SImode);
+  rtx vop2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_vec_setv4si_0 (vop1, CONST0_RTX (V4SImode), operands[1]));
+  emit_insn (gen_vec_setv4si_0 (vop2, CONST0_RTX (V4SImode), operands[2]));
+  rtx tem = gen_reg_rtx (V4SImode);
+  emit_insn (gen_smaxv4si3 (tem, vop1, vop2));
+  emit_move_insn (operands[0], lowpart_subreg (SImode, tem, V4SImode));
+  DONE;
+})
+
 ;; These versions of the min/max patterns implement exactly the operations
 ;;   min = (op1 < op2 ? op1 : op2)
 ;;   max = (!(op1 < op2) ? op1 : op2)

we generate

.L3:
        addl    (%rdx,%r8,4), %r9d
        movl    (%rcx,%r8,4), %eax
        addl    (%rsi,%r8,4), %eax
        vmovd   %r9d, %xmm1
        vmovd   %eax, %xmm0
        movq    %r8, %rax
        vpmaxsd %xmm1, %xmm0, %xmm0
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0
        vpmaxsd %xmm2, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%r8,4)
        vmovd   %xmm0, %r9d
        incq    %r8
        cmpq    %rax, %r10
        jne     .L3

so we manage to catch the store as well but somehow

(insn:TI 35 27 37 4 (set (reg:V4SI 20 xmm0 [114])
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 20 xmm0 [107]))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) 2740 {vec_setv4si_0}
     (nil))

fails to be elided. Maybe vec_setv4si_0 isn't the optimal representation choice. Ah, of course the zeros might end up invalidated by the earlier max... we can't say in RTL that we actually do not care about the upper bits - can we?

Anyhow, while the above would fix the regression on Haswell we'd degrade on Zen, and in more isolated cmov cases it's clearly not going to be a win either.
So another idea would be to provide [us]{min,max} patterns for integer modes that split after reload into a compare&cmov or jumpy sequence if allocated using GPR regs, but also allow SSE reg alternatives which would fit into existing SSE code if we'd allow SI-WITH-SSE similar to how we do TARGET_MMX_WITH_SSE operations. We already seem to have movsi patterns {r,m}<->v so that part is done; for the testcase that would leave addsi3 plus appropriate costing of the min/max case.

The smaxsi3 "splitter" (well ;)) is

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 273592)
+++ gcc/config/i386/i386.md     (working copy)
@@ -1881,6 +1881,27 @@ (define_expand "mov<mode>"
   ""
   "ix86_expand_move (<MODE>mode, operands); DONE;")

+(define_insn "smaxsi3"
+  [(set (match_operand:SI 0 "register_operand" "=r,x")
+        (smax:SI (match_operand:SI 1 "register_operand" "%0,x")
+                 (match_operand:SI 2 "register_operand" "r,x")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_AVX2"
+{
+  switch (get_attr_type (insn))
+    {
+    case TYPE_SSEADD:
+      return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
+    case TYPE_ICMOV:
+      return "cmpl\t{%2, %0|%0, %2}\n"
+             "cmovl\t{%2, %0|%0, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "icmov,sseadd")])
+
 (define_insn "*mov<mode>_xor"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
         (match_operand:SWI48 1 "const0_operand"))

with that we get the elision of the zeroing between the vpmaxsd, but even -mtune=bdver2 doesn't disparage the cross-unit moves enough for the RA to choose the first alternative. Huh. But it then goes through the stack...

.L3:
        vmovd   %xmm0, %r8d
        addl    (%rdx,%rax,4), %r8d
        movl    %r8d, 4(%rdi,%rax,4)
        movl    (%rcx,%rax,4), %r9d
        addl    (%rsi,%rax,4), %r9d
        movl    %r9d, -4(%rsp)
        vmovd   -4(%rsp), %xmm2
        movl    %r8d, -4(%rsp)
        movq    %rax, %r8
        vmovd   -4(%rsp), %xmm3
        vpmaxsd %xmm3, %xmm2, %xmm0
        vpmaxsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%rax,4)
        incq    %rax
        cmpq    %r8, %r10
        jne     .L3

not sure why RA selected the 2nd alternative at all... but yes, we'd want to slightly prefer it. But I expected the inter-unit moves to push us to the first alternative.

As you can see we can automatically handle stores of SImode in SSE regs, and I verified we can also handle loads by slightly altering the testcase. That leaves implementing the addsi3 alternative for SSE regs, like the following quite incomplete hack

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md     (revision 273592)
+++ gcc/config/i386/i386.md     (working copy)
@@ -5368,10 +5389,10 @@ (define_insn_and_split "*add<dwi>3_doubl
 })

 (define_insn "*add<mode>_1"
-  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r,v")
        (plus:SWI48
-         (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
-         (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
+         (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r,v")
+         (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le,v")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
 {
@@ -5390,6 +5411,9 @@ (define_insn "*add<mode>_1"
          return "dec{<imodesuffix>}\t%0";
        }

+    case TYPE_SSEADD:
+      return "vpaddd\t{%2, %1, %0|%0, %1, %2}";
+
     default:
       /* For most processors, ADD is faster than LEA.  This alternative
          was added to use ADD as much as possible.  */
@@ -5406,6 +5430,8 @@ (define_insn "*add<mode>_1"
   [(set (attr "type")
      (cond [(eq_attr "alternative" "3")
               (const_string "lea")
+           (eq_attr "alternative" "4")
+             (const_string "sseadd")
            (match_operand:SWI48 2 "incdec_operand")
              (const_string "incdec")
           ]

but somehow that doesn't trigger for me.
Ah, because x86_64_general_operand allows memory but the v alternative does not, and reloading that is apparently more expensive than not doing that and reloading the general reg later. Fun. Changing that to x86_64_nonmemory_operand makes the whole thing work nearly fully (for this testcase, breaking everything else of course); there's one gpr op remaining again because we get memory, this time in the first operand which I kept as nonimmediate_operand. Not sure how we make RA happier to reload a memory operand for the v,v,v alternative without doing that elsewhere.

        movl    $-987654321, %r10d
        vmovd   (%rdi), %xmm0
        leal    -1(%r8), %r9d
        xorl    %eax, %eax
        vmovd   %r10d, %xmm1
        .p2align 4,,10
        .p2align 3
.L3:
        vmovd   (%rdx,%rax,4), %xmm2
        vpaddd  %xmm2, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%rax,4)
        movl    (%rcx,%rax,4), %r8d
        addl    (%rsi,%rax,4), %r8d
        vmovd   %r8d, %xmm3
        movq    %rax, %r8
        vpmaxsd %xmm0, %xmm3, %xmm0
        vpmaxsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, 4(%rdi,%rax,4)
        addq    $1, %rax
        cmpq    %r9, %r8
        jne     .L3
We're going to need a fix for Power as well. We see a 12% drop in 456.hmmer around July 2. We lose vectorization and end up with a couple of isel's in the loop (equivalent to x86 cmov).
Richi corrected me -- this is not vectorization, but use of SSE on lane zero to do scalar computation.
So how does this cause 12% degradation (20% by some other measurements) on power9? Pretty much everything is the *opposite* way around for us: we do have cheap conditional moves, we do prefer integer registers.
On Wed, 7 Aug 2019, segher at gcc dot gnu.org wrote:
> So how does this cause 12% degradation (20% by some other measurements)
> on power9? Pretty much everything is the *opposite* way around for us:
> we do have cheap conditional moves, we do prefer integer registers.

Might be that speculating the jump (which is very very well predicted) is still a lot faster than the conditional move.
Author: rguenth
Date: Wed Aug 14 08:31:54 2019
New Revision: 274422

URL: https://gcc.gnu.org/viewcvs?rev=274422&root=gcc&view=rev
Log:
2019-08-14  Richard Biener  <rguenther@suse.de>

        PR target/91154
        * config/i386/i386-features.c
        (dimode_scalar_chain::compute_convert_gain): Compute and dump
        individual instruction gain.  Fix reg-reg copy GRP cost.  Use
        ix86_cost->sse_op for vector instruction costs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
Author: rguenth
Date: Wed Aug 14 12:04:05 2019
New Revision: 274481

URL: https://gcc.gnu.org/viewcvs?rev=274481&root=gcc&view=rev
Log:
2019-08-14  Richard Biener  <rguenther@suse.de>
            Uroš Bizjak  <ubizjak@gmail.com>

        PR target/91154
        * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
        mode arguments.
        (scalar_chain::smode): New member.
        (scalar_chain::vmode): Likewise.
        (dimode_scalar_chain): Rename to...
        (general_scalar_chain): ... this.
        (general_scalar_chain::general_scalar_chain): Take mode arguments.
        (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
        base with TImode and V1TImode.
        * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
        (general_scalar_chain::vector_const_cost): Adjust for SImode chains.
        (general_scalar_chain::compute_convert_gain): Likewise.  Add
        {S,U}{MIN,MAX} support.
        (general_scalar_chain::replace_with_subreg): Use vmode/smode.
        (general_scalar_chain::make_vector_copies): Likewise.  Handle
        non-DImode chains appropriately.
        (general_scalar_chain::convert_reg): Likewise.
        (general_scalar_chain::convert_op): Likewise.
        (general_scalar_chain::convert_insn): Likewise.  Add
        fatal_insn_not_found if the result is not recognized.
        (convertible_comparison_p): Pass in the scalar mode and use that.
        (general_scalar_to_vector_candidate_p): Likewise.  Rename from
        dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
        (scalar_to_vector_candidate_p): Remove by inlining into single
        caller.
        (general_remove_non_convertible_regs): Rename from
        dimode_remove_non_convertible_regs.
        (remove_non_convertible_regs): Remove by inlining into single
        caller.
        (convert_scalars_to_vector): Handle SImode and DImode chains
        in addition to TImode chains.
        * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
        (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
        (*<maxmin>di3_doubleword): Likewise.

        * gcc.target/i386/pr91154.c: New testcase.
        * gcc.target/i386/minmax-3.c: Likewise.
        * gcc.target/i386/minmax-4.c: Likewise.
        * gcc.target/i386/minmax-5.c: Likewise.
        * gcc.target/i386/minmax-6.c: Likewise.
        * gcc.target/i386/minmax-1.c: Add -mno-stv.
        * gcc.target/i386/minmax-2.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/minmax-3.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-4.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-5.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-6.c
    trunk/gcc/testsuite/gcc.target/i386/pr91154.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
    trunk/gcc/config/i386/i386-features.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/minmax-1.c
    trunk/gcc/testsuite/gcc.target/i386/minmax-2.c
Fixed for {x86_64,i?86}-*-* (hopefully).
It looks like the patch introduced a (small?) runtime regression of 5% in SPEC2000 300.twolf on Haswell [1]. Maybe worth looking at. [1] https://gcc.opensuse.org/gcc-old/SPEC/CINT/sb-czerny-head-64/300_twolf_recent_big.png
On Sat, 17 Aug 2019, ubizjak at gmail dot com wrote:
> It looks like the patch introduced a (small?) runtime regression of 5% in
> SPEC2000 300.twolf on Haswell [1]. Maybe worth looking at.

Biggest changes when benchmarking -mno-stv (base) against -mstv (peak):

     7.28%   37789  twolf_peak.none  twolf_peak.none  [.] ucxx2
     4.21%   25709  twolf_base.none  twolf_base.none  [.] ucxx2
     3.72%   22584  twolf_base.none  twolf_base.none  [.] new_dbox
     2.48%   22364  twolf_peak.none  twolf_peak.none  [.] new_dbox
     1.49%    8270  twolf_base.none  twolf_base.none  [.] sub_penal
     1.12%    7576  twolf_peak.none  twolf_peak.none  [.] sub_penal
     1.36%    9314  twolf_peak.none  twolf_peak.none  [.] old_assgnto_new2
     1.11%    5257  twolf_base.none  twolf_base.none  [.] old_assgnto_new2

and in ucxx2 I see

  0.17 │       mov    %eax,(%rsp)
  3.55 │       vpmins (%rsp),%xmm0,%xmm1
       │       test   %eax,%eax
  0.22 │       vmovd  %xmm1,%r8d
  0.80 │       cmovs  %esi,%r8d

This is from code like

  a1LoBin = Trybin/binWidth < 0 ? 0 : (Trybin>numBins ? numBins : Trybin)

with only the inner one recognized as MIN because 'numBins' is only ever loaded conditionally and we don't speculate it. So we expand from

  _41 = _40 / binWidth.15_36;
  if (_41 >= 0)
    goto <bb 5>; [59.00%]
  else
    goto <bb 6>; [41.00%]

bb5:
  numBins.26_42 = numBins;
  iftmp.19_315 = MIN_EXPR <_41, numBins.26_42>;

bb6:
  # iftmp.19_267 = PHI <iftmp.19_315(5), 0(4)>

ending up with

        movl    %r9d, %eax
        cltd
        idivl   %ecx
        movl    %eax, (%rsp)
        vpminsd (%rsp), %xmm0, %xmm1
        testl   %eax, %eax
        vmovd   %xmm1, %r11d
        cmovs   %esi, %r11d

and STV converting single-instruction 'chains':

Collected chain #40...
  insns: 381
  defs to convert: r463, r465
Computing gain for chain #40...
  Instruction gain 8 for   381: {r465:SI=smin(r463:SI,[`numBins']);clobber flags:CC;}
      REG_DEAD r463:SI
      REG_UNUSED flags:CC
  Instruction conversion gain: 8
  Registers conversion cost: 4
  Total gain: 4
Converting chain #40...

To me the "spill" to (%rsp) looks suspicious, and even more so the vector(!) memory use in vpminsd. RA could have used

        movd    %eax, %xmm1
        vpminsd %xmm1, %xmm0, %xmm1

no? IRA allocates the pseudo to memory. Testcase:

extern int numBins;
extern int binOffst;
extern int binWidth;
extern int Trybin;
void foo (int);

void bar (int aleft, int axcenter)
{
  int a1LoBin = (((Trybin=((axcenter + aleft)-binOffst)/binWidth)<0)
                 ? 0 : ((Trybin>numBins) ? numBins : Trybin));
  foo (a1LoBin);
}

STV had emitted

(insn 10 9 38 2 (parallel [
            (set (reg:SI 93)
                (div:SI (reg:SI 92)
...
(insn 38 10 12 2 (set (subreg:V4SI (reg:SI 98) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 93))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) "t.c":9:56 -1
     (nil))
...
(insn 39 31 32 2 (set (reg:SI 99)
        (mem/c:SI (symbol_ref:DI ("numBins") [flags 0x40]  <var_decl 0x7f9a4bfdbab0 numBins>) [1 numBins+0 S4 A32])) "t.c":9:75 -1
     (nil))
(insn 32 39 37 2 (set (subreg:V4SI (reg:SI 95) 0)
        (smin:V4SI (subreg:V4SI (reg:SI 98) 0)
            (subreg:V4SI (reg:SI 99) 0))) "t.c":9:75 3657 {*sse4_1_sminv4si3}
     (nil))

but then combine eliminated the copy...

Trying 38 -> 32:
   38: r98:SI#0=vec_merge(vec_duplicate(r93:SI),const_vector,0x1)
   32: r95:SI#0=smin(r98:SI#0,r99:SI#0)
      REG_DEAD r99:SI
      REG_DEAD r98:SI
Successfully matched this instruction:
(set (subreg:V4SI (reg:SI 95) 0)
    (smin:V4SI (subreg:V4SI (reg:SI 93) 0)
        (subreg:V4SI (reg:SI 99 [ numBins ]) 0)))
allowing combination of insns 38 and 32
original costs 4 + 40 = 44
replacement cost 40

...running into the issue I tried to fix with making the live-range split copy more "explicit" ...

So it looks like STV "forgot" to convert 'reg:SI 99', which is a memory load it split out. But even when fixing that, combine now forwards both ops, eliminating the LR split again :/  Disabling combine results in the expected variant (not going through the stack):

        idivl   binWidth(%rip)
        vmovd   %eax, %xmm0
        testl   %eax, %eax
        movl    %eax, Trybin(%rip)
        vpminsd %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        cmovns  %eax, %edi
        jmp     foo

so while

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c     (revision 274666)
+++ gcc/config/i386/i386-features.c     (working copy)
@@ -910,7 +910,9 @@ general_scalar_chain::convert_op (rtx *o
     {
       rtx tmp = gen_reg_rtx (GET_MODE (*op));

-      emit_insn_before (gen_move_insn (tmp, *op), insn);
+      emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0),
+                                     gen_gpr_to_xmm_move_src (vmode, *op)),
+                        insn);
       *op = gen_rtx_SUBREG (vmode, tmp, 0);

       if (dump_file)

will help to fence off regprop, it doesn't help against combine's power :/ (or IRA's failure to split live-ranges)
This is the power of simplify_subreg, I guess. We're lucky it doesn't do this to arbitrary arithmetic.

So we need to really change all defs we introduce to vector modes instead of making our life easy and using paradoxical subregs all over the place.
(In reply to rguenther@suse.de from comment #25)
> and STV converting single-instruction 'chains':
>
> Collected chain #40...
>   insns: 381
>   defs to convert: r463, r465
> Computing gain for chain #40...
>   Instruction gain 8 for   381: {r465:SI=smin(r463:SI,[`numBins']);clobber flags:CC;}
>   Instruction conversion gain: 8
>   Registers conversion cost: 4
>   Total gain: 4
> Converting chain #40...

Is this in the STV1 pass? This (pre-combine) pass should be enabled only for TImode conversion, a semi-hack where 64-bit targets convert memory accesses to TImode. General STV should not be run before combine.

> To me the "spill" to (%rsp) looks suspicious, and even more so the vector(!)
> memory use in vpminsd. RA could have used
>
>         movd    %eax, %xmm1
>         vpminsd %xmm1, %xmm0, %xmm1
>
> no? IRA allocates the pseudo to memory.

This is how IRA handles subregs. Please note that the memory is correctly aligned, so the vector load does not trip an alignment trap. However, on x86 this approach triggers a store-forwarding stall.
(In reply to Richard Biener from comment #26)
> This is the power of simplify_subreg, I guess. We're lucky it doesn't do
> this to arbitrary arithmetic.
>
> So we need to really change all defs we introduce to vector modes instead of
> making our life easy and using paradoxical subregs all over the place.

No, IMO IRA should be "fixed" to avoid the stack temporary and (based on some cost metric) use a direct move for paradoxical subregs.
(In reply to Uroš Bizjak from comment #27)
> Is this in the STV1 pass? This (pre-combine) pass should be enabled only for
> TImode conversion, a semi-hack where 64-bit targets convert memory accesses
> to TImode. General STV should not be run before combine.

Yes, this is STV1. My patch to enable SImode and DImode chains didn't change where the pass runs or enable the 2nd run, out of compile-time concerns.

Indeed, changing this fixes the issue. I'm going to benchmark it on 300.twolf.
(In reply to Richard Biener from comment #29)
> Indeed, changing this fixes the issue. I'm going to benchmark it on 300.twolf.

Hmm, it regresses gcc.target/i386/minmax-6.c though, and thus cactusADM (IIRC).
(In reply to Uroš Bizjak from comment #28)
> No, IMO IRA should be "fixed" to avoid the stack temporary and (based on some
> cost metric) use a direct move for paradoxical subregs.

The problem is

      /* Moves between SSE and integer units are expensive.  */
      if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))

        /* ??? By keeping returned value relatively high, we limit the number
           of moves between integer and SSE registers for all targets.
           Additionally, high value prevents problem with x86_modes_tieable_p(),
           where integer modes in SSE registers are not tieable
           because of missing QImode and HImode moves to, from or between
           MMX/SSE registers.  */
        return MAX (8, SSE_CLASS_P (class1)
                    ? ix86_cost->hard_register.sse_to_integer
                    : ix86_cost->hard_register.integer_to_sse);

The minimum cost of moves between SSE and integer units is 8.
(In reply to H.J. Lu from comment #31)
> The minimum cost of moves between SSE and integer units is 8.

I guess this should be reviewed. This is from reload times; nowadays we never actually disable sse <-> int moves, we use preferred_for_{speed,size} attributes. Also, at least for TARGET_MMX_WITH_SSE targets, we can change the penalty for MMX moves.

These sorts of changes should be backed by runtime benchmark results.

Thanks for the heads-up, but let's take this cost issue elsewhere.
(In reply to Uroš Bizjak from comment #32)
> Thanks for the heads-up, but let's take this cost issue elsewhere.

I think it's related, but since we were asked to keep this PR open also for other targets I've created PR91498 for the 300.twolf regression and this RA/costing issue.
Some of the new testcases FAIL on 32-bit x86:

+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pmaxsd 1
+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pmaxud 1
+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pminsd 1
+FAIL: gcc.target/i386/minmax-4.c scan-assembler-times pminud 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpmaxsd 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpmaxud 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpminsd 1
+FAIL: gcc.target/i386/minmax-5.c scan-assembler-times vpminud 1
+FAIL: gcc.target/i386/minmax-6.c scan-assembler pmaxsd

This group is only seen on i386-pc-solaris2.11, while ...

+FAIL: gcc.target/i386/pr91154.c scan-assembler-not cmov
+FAIL: gcc.target/i386/pr91154.c scan-assembler-times paddd 2
+FAIL: gcc.target/i386/pr91154.c scan-assembler-times pmaxsd 2

... for this one there are also a couple of gcc-testresults reports on x86_64-pc-linux-gnu.
Between 20190819 (r274678) and 20190820 (r274749), a new failure was introduced:

+FAIL: gcc.target/i386/minmax-6.c scan-assembler-not rsp

Seen on both i386-pc-solaris2.11 with -m64 and x86_64-pc-linux-gnu (oldest report for r274708).
(In reply to Richard Biener from comment #30)
> Hmm, it regresses gcc.target/i386/minmax-6.c though, and thus cactusADM (IIRC).

I was looking a bit into the minmax-6.c failure. Apparently, despite the spill/fill, cactusADM doesn't regress [1]. By avoiding subregs, spill and fill are both in V4SImode, so store forwarding mitigates the effects of the memory access. Thus the current testsuite failure is benign, and the scan-asm-not should be adjusted to only check for an SImode spill.

OTOH, a spill in SImode (32-bit) and a fill in V4SImode (128-bit), which is the consequence of using a V4SImode paradoxical subreg of an SImode value, disables store forwarding, and a big runtime regression in a hot loop is observed.

Based on this observation, your approach of using special patterns to convert SImode to V4SImode is the correct approach, and I retract my patch that reintroduces subregs instead.

[1] https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/436_cactusADM_recent_big.png
On September 1, 2019 12:05:52 PM GMT+02:00, ubizjak at gmail dot com wrote:
> I was looking a bit into the minmax-6.c failure. Apparently, despite the
> spill/fill, cactusADM doesn't regress [1]. By avoiding subregs, spill and
> fill are both in V4SImode, so store forwarding mitigates the effects of the
> memory access. Thus the current testsuite failure is benign, and the
> scan-asm-not should be adjusted to only check for an SImode spill.
>
> OTOH, a spill in SImode (32-bit) and a fill in V4SImode (128-bit), which is
> the consequence of using a V4SImode paradoxical subreg of an SImode value,
> disables store forwarding, and a big runtime regression in a hot loop is
> observed.

Yeah, that might very well be the original issue. Still, reg<->xmm moves should be better, so not sure if we should adjust the testcase. Maybe we can make two of them and assert we don't see the SImode spill with any tuning?
Just for the record, the same can be seen on -march=znver1: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=248.180.0&plot.1=27.180.0&
So this bug turned into a mess of random benchmarks. The 456.hmmer regression is fixed. Please open new bugs for any other things found. cactusADM jumps up and down on my tester and shows no clear regression.