[Bug target/51244] SH Target: Inefficient conditional branch
kkojima at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Nov 22 23:36:00 GMT 2011
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244
--- Comment #1 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-11-22 22:33:43 UTC ---
> return (a != b || a != c) ? b : c;
The test_func_0_NG and test_func_1_NG cases are related to the target
implementation of cstoresi4.
The middle end expands a complex conditional jump into cstores and
simple conditional jumps.  For the expression a != b, SH's cstoresi4
implementation uses sh.c:sh_emit_compare_and_set, which generates
cmp/eq and a movnegt insn, because we have no cmp/ne insn.  Then we
get the sequence
	mov	#-1,rn
	negc	rn,rm
	tst	#255,rm
which is essentially T_reg = T_reg.  Usually combine catches such
situations, but negc might be too complex for combine.
For this case, replacing the current movnegt expander with an insn,
a splitter and a peephole something like
(define_insn "movnegt"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
	(plus:SI (reg:SI T_REG) (const_int -1)))
   (clobber (match_scratch:SI 1 "=&r"))
   (clobber (reg:SI T_REG))]
  ""
  "#"
  [(set_attr "length" "4")])

(define_split
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
	(plus:SI (reg:SI T_REG) (const_int -1)))
   (clobber (match_scratch:SI 1 "=&r"))
   (clobber (reg:SI T_REG))]
  "reload_completed"
  [(set (match_dup 1) (const_int -1))
   (parallel [(set (match_dup 0)
		   (neg:SI (plus:SI (reg:SI T_REG)
				    (match_dup 1))))
	      (set (reg:SI T_REG)
		   (ne:SI (ior:SI (reg:SI T_REG) (match_dup 1))
			  (const_int 0)))])]
  "")

(define_peephole2
  [(set (match_operand:SI 1 "" "") (const_int -1))
   (parallel [(set (match_operand:SI 0 "" "")
		   (neg:SI (plus:SI (reg:SI T_REG)
				    (match_dup 1))))
	      (set (reg:SI T_REG)
		   (ne:SI (ior:SI (reg:SI T_REG) (match_dup 1))
			  (const_int 0)))])
   (set (reg:SI T_REG)
	(eq:SI (match_operand:QI 3 "" "") (const_int 0)))]
  "REGNO (operands[3]) == REGNO (operands[0])
   && peep2_reg_dead_p (3, operands[0])
   && peep2_reg_dead_p (3, operands[1])"
  [(const_int 0)]
  "")
the above useless sequence could be removed, though we would miss
the chance for the -1 to be CSE-ed when the cstore value is
used.  This will cause slightly worse code for a loop like
int
foo (int *a, int x, int n)
{
  int i;
  int count = 0;
  for (i = 0; i < n; i++)
    count += (*(a + i) != x);
  return count;
}
though it may be relatively rare.
BTW, OT: (a != b || a != c) ? b : c could be reduced to just b, I think.
> return a >= 0 && b >= 0 ? c : d;
x >= 0 is expanded into a sequence like

  ra = not x
  rb = -31
  rc = ra >> (neg rb)
  T = (rc == 0)
  conditional jump
and combine tries to simplify it.  combine successfully simplifies
b >= 0 into shll and bt, but fails to simplify a >= 0.
It seems that combine doesn't do constant propagation well and
misses the constant -31.  In this case, a peephole like
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest" "")
	(not:SI (match_operand:SI 1 "arith_reg_operand" "")))
   (set (match_operand:SI 2 "arith_reg_dest" "") (const_int -31))
   (set (match_operand:SI 3 "arith_reg_dest" "")
	(lshiftrt:SI (match_dup 0) (neg:SI (match_dup 2))))
   (set (reg:SI T_REG)
	(eq:SI (match_operand:QI 4 "arith_reg_operand" "")
	       (const_int 0)))
   (set (pc)
	(if_then_else (match_operator 5 "comparison_operator"
			[(reg:SI T_REG) (const_int 0)])
		      (label_ref (match_operand 6 "" ""))
		      (pc)))]
  "REGNO (operands[3]) == REGNO (operands[4])
   && peep2_reg_dead_p (4, operands[0])
   && (peep2_reg_dead_p (4, operands[3])
       || rtx_equal_p (operands[2], operands[3]))
   && peep2_regno_dead_p (5, T_REG)"
  [(set (match_dup 2) (const_int -31))
   (set (reg:SI T_REG) (ge:SI (match_dup 1) (const_int 0)))
   (set (pc)
	(if_then_else (match_op_dup 7 [(reg:SI T_REG) (const_int 0)])
		      (label_ref (match_dup 6))
		      (pc)))]
  "
{
  operands[7] = gen_rtx_fmt_ee (reverse_condition (GET_CODE (operands[5])),
				GET_MODE (operands[5]),
				XEXP (operands[5], 0), XEXP (operands[5], 1));
}")
will be a workaround. It isn't ideal, but better than nothing.
> return a == b ? test_sub0 (a, b) : test_sub1 (a, b);
> return a != b ? test_sub0 (a, b) : test_sub1 (a, b);
This case is interesting.  At -Os, the two calls are converted into
one computed goto.  A bit surprisingly, the conversion is done
as a side effect of the combine-stack-adjustments pass.  That pass
calls

  cleanup_cfg (flag_crossjumping ? CLEANUP_CROSSJUMP : 0);

and the cross-jumping optimization merges the two calls.
With -Os -fno-delayed-branch, the OK case is compiled to
test_func_3_OK:
	mov	r4,r1
	cmp/eq	r5,r1
	mov.l	.L4,r0
	bf	.L3
	mov	r1,r5
	mov.l	.L5,r0
	bra	.L3
	nop
.L3:
	jmp	@r0
	nop
and the NG case
test_func_3_NG:
	mov	r4,r1
	cmp/eq	r5,r1
	bt	.L2
	mov.l	.L4,r0
	bra	.L3
	nop
.L2:
	mov.l	.L5,r0
	mov	r1,r5
.L3:
	jmp	@r0
	nop
Yep, the former is lucky.  I guess that the latter requires
basic block reordering for further simplification, though
I've found the comment
  /* Don't reorder blocks when optimizing for size because extra jump insns may
     be created; also barrier may create extra padding.
     More correctly we should have a block reordering mode that tried to
     minimize the combined size of all the jumps.  This would more or less
     automatically remove extra jumps, but would also try to use more short
     jumps instead of long jumps.  */
  if (!optimize_function_for_speed_p (cfun))
    return false;
in bb-reorder.c.