gcc.target/i386/pr45685.c with -march=cascadelake shows missing RTL if-conversion. The cruical GIMPLE difference is _36 = _3 > 0; iftmp.0_13 = _36 ? 1 : -1; prephitmp_31 = ABS_EXPR <_3>; prephitmp_32 = _36 ? -1 : 1; prephitmp_33 = _36 ? 4294967295 : 1; prephitmp_35 = _36 ? 1 : 4294967295; ... _29 = prephitmp_31 != _42; val_12 = _29 ? prephitmp_32 : iftmp.0_13; prephitmp_37 = _29 ? prephitmp_33 : prephitmp_35; vs. iftmp.0_13 = _3 > 0 ? 1 : -1; prephitmp_31 = ABS_EXPR <_3>; prephitmp_32 = _3 > 0 ? -1 : 1; prephitmp_33 = _3 > 0 ? 4294967295 : 1; prephitmp_35 = _3 > 0 ? 1 : 4294967295; ... val_12 = i.1_6 == prephitmp_31 ? iftmp.0_13 : prephitmp_32; prephitmp_37 = i.1_6 != prephitmp_31 ? prephitmp_33 : prephitmp_35; where the split out condition is now CSEd and the multi-use makes us not TER the comparison. Previously we got two compare & jump sequences while now we get the compare computing a QImode value and the then two compare & jump sequences. While without -march=cascadelake we do get the desired number of cmovs the generated code is still worse. The testcase is unfortunately a bit obfuscated due to the many if-conversions taking place. Smaller GIMPLE testcases do not exhibit jumpy RTL expansion. void __GIMPLE(ssa, startwith("optimized")) foo (long *p, long a, long b, long c, long d, long e, long f) { _Bool _2; long _3; long _8; __BB(2): _2 = a_1(D) < b_10(D); _3 = _2 ? c_4(D) : d_5(D); _8 = _2 ? f_6(D) : e_7(D); __MEM <long> (p_9(D)) = _3; __MEM <long> (p_9(D) + 4) = _8; return; } #if __GNUC__ < 13 void __GIMPLE(ssa, startwith("optimized")) bar (long *p, long a, long b, long c, long d, long e, long f) { long _3; long _8; __BB(2): _3 = a_1(D) < b_10(D) ? c_4(D) : d_5(D); _8 = a_1(D) >= b_10(D) ? e_7(D) : f_6(D); __MEM <long> (p_9(D)) = _3; __MEM <long> (p_9(D) + 4) = _8; return; } #endif
Btw, I'm hoping for a smaller/simpler testcase to appear - and yes, something like this was expected I guess (but also latent since the new IL was valid before as well).
(In reply to Richard Biener from comment #0) > > void __GIMPLE(ssa, startwith("optimized")) > foo (long *p, long a, long b, long c, long d, long e, long f) > { > _Bool _2; > long _3; > long _8; > > __BB(2): > _2 = a_1(D) < b_10(D); > _3 = _2 ? c_4(D) : d_5(D); > _8 = _2 ? f_6(D) : e_7(D); > __MEM <long> (p_9(D)) = _3; > __MEM <long> (p_9(D) + 4) = _8; > return; > } > > #if __GNUC__ < 13 > void __GIMPLE(ssa, startwith("optimized")) > bar (long *p, long a, long b, long c, long d, long e, long f) > { > long _3; > long _8; > > __BB(2): > _3 = a_1(D) < b_10(D) ? c_4(D) : d_5(D); > _8 = a_1(D) >= b_10(D) ? e_7(D) : f_6(D); > __MEM <long> (p_9(D)) = _3; > __MEM <long> (p_9(D) + 4) = _8; > return; > } > #endif So with the above we do have (on the GCC 12 branch) foo: .LFB0: .cfi_startproc cmpq %rdx, %rsi setl %al testb %al, %al cmovne 8(%rsp), %r9 cmove %r8, %rcx movq %rcx, (%rdi) movq %r9, 4(%rdi) ret vs. bar: .LFB1: .cfi_startproc cmpq %rdx, %rsi cmovl 8(%rsp), %r9 cmovge %r8, %rcx movq %rcx, (%rdi) movq %r9, 4(%rdi) ret which shows an optimization difference when the condition is split out. It also shows the condition is split out anyway in the end, it's just the RTL representation with CCnn modes doesn't nicely match up the GIMPLE so we somehow need to improve the plumbing here. But CCnn get clobbered quite a lot and so any clever "CSE"ing of the (different!) modes involved is going to require code motion or CCnn spilling (bad).
So I guess the trick might be to notice that ... ;; _2 = a_1(D) < b_10(D); (insn 12 11 13 (set (reg:CCGC 17 flags) (compare:CCGC (reg/v:DI 86 [ a ]) (reg/v:DI 87 [ b ]))) -1 (nil)) (insn 13 12 14 (set (reg:QI 92) (lt:QI (reg:CCGC 17 flags) (const_int 0 [0]))) -1 (nil)) (insn 14 13 0 (set (reg:QI 82 [ _2 ]) (reg:QI 92)) -1 (nil)) ... we expand to CCGCmode here ... _3 replace with --> _3 = _2 ? c_4(D) : d_5(D); ;; *p_9(D) = _3; (insn 15 14 16 (set (reg:CCZ 17 flags) (compare:CCZ (reg:QI 82 [ _2 ]) (const_int 0 [0]))) "t2.c":12:7 -1 (nil)) (insn 16 15 17 (set (reg:DI 93) (if_then_else:DI (ne (reg:CCZ 17 flags) (const_int 0 [0])) (reg/v:DI 88 [ c ]) (reg/v:DI 89 [ d ]))) "t2.c":12:7 -1 (nil)) ... and thus want to use a CCGCmode based compare here as well? We can of course force-forward ("un-CSE") the condition during RTL expansion. But the question would be what's the best approach to deal with the situation so that followup RTL passes have a chance to optimize the redundant compares?
GCC 13.1 is being released, retargeting bugs to GCC 13.2.
I'm testing a patch.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:cbe5f6859a73b2acf203bd7d13f9fb245d63cbd4 commit r14-2620-gcbe5f6859a73b2acf203bd7d13f9fb245d63cbd4 Author: Richard Biener <rguenther@suse.de> Date: Tue Jul 18 10:02:52 2023 +0200 middle-end/105715 - missed RTL if-conversion with COND_EXPR expansion When the COND_EXPR condition operand was split out to a separate stmt it became subject to CSE with other condition evaluations. This unfortunately leads to TER no longer applying and in turn RTL expansion of COND_EXPRs no longer seeing the condition and thus failing to try conditional move expansion. This can be seen with gcc.target/i386/pr45685.c when built with -march=cascadelake which then FAILs to produce the expected number of cmovs. It can also be seen when we create more COND_EXPRs early like for instruction selection of MIN/MAX operations that map to IEEE a > b ? a : b expression semantics. PR middle-end/105715 * gimple-isel.cc (gimple_expand_vec_exprs): Merge into... (pass_gimple_isel::execute): ... this. Duplicate comparison defs of COND_EXPRs.
Fixed on trunk.
GCC 13.2 is being released, retargeting bugs to GCC 13.3.
GCC 13.3 is being released, retargeting bugs to GCC 13.4.
GCC 13.4 is being released, retargeting bugs to GCC 13.5.
Fixed for GCC 14.