[Bug target/51244] SH Target: Inefficient conditional branch
kkojima at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Nov 22 23:36:00 GMT 2011
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244
--- Comment #1 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2011-11-22 22:33:43 UTC ---
> return (a != b || a != c) ? b : c;
The test_func_0_NG and test_func_1_NG cases are related to the target
implementation of cstoresi4.
The middle end expands a complex conditional jump into cstores and
simple conditional jumps.  For the expression a != b, SH's cstoresi4
implementation uses sh.c:sh_emit_compare_and_set, which generates
cmp/eq and a movnegt insn, because we have no cmp/ne insn.  Then we
get the sequence
	mov	#-1,rn
	negc	rn,rm
	tst	#255,rm
which is essentially T_reg = T_reg.  Usually combine catches such
situations, but negc might be too complex for combine.
For this case, replacing the current movnegt expander with an insn,
a splitter and a peephole something like
(define_insn "movnegt"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
	(plus:SI (reg:SI T_REG) (const_int -1)))
   (clobber (match_scratch:SI 1 "=&r"))
   (clobber (reg:SI T_REG))]
  ""
  "#"
  [(set_attr "length" "4")])

(define_split
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
	(plus:SI (reg:SI T_REG) (const_int -1)))
   (clobber (match_scratch:SI 1 "=&r"))
   (clobber (reg:SI T_REG))]
  "reload_completed"
  [(set (match_dup 1) (const_int -1))
   (parallel [(set (match_dup 0)
		   (neg:SI (plus:SI (reg:SI T_REG)
				    (match_dup 1))))
	      (set (reg:SI T_REG)
		   (ne:SI (ior:SI (reg:SI T_REG) (match_dup 1))
			  (const_int 0)))])]
  "")

(define_peephole2
  [(set (match_operand:SI 1 "" "") (const_int -1))
   (parallel [(set (match_operand:SI 0 "" "")
		   (neg:SI (plus:SI (reg:SI T_REG)
				    (match_dup 1))))
	      (set (reg:SI T_REG)
		   (ne:SI (ior:SI (reg:SI T_REG) (match_dup 1))
			  (const_int 0)))])
   (set (reg:SI T_REG)
	(eq:SI (match_operand:QI 3 "" "") (const_int 0)))]
  "REGNO (operands[3]) == REGNO (operands[0])
   && peep2_reg_dead_p (3, operands[0])
   && peep2_reg_dead_p (3, operands[1])"
  [(const_int 0)]
  "")
the above useless sequence could be removed, though we would miss
the chance for the -1 to be CSE-ed when the cstore value is
used.  This will cause slightly worse code for a loop like
int
foo (int *a, int x, int n)
{
  int i;
  int count = 0;
  for (i = 0; i < n; i++)
    count += (*(a + i) != x);
  return count;
}
though it may be relatively rare.
BTW, OT: (a != b || a != c) ? b : c could be reduced to just b, I think.
> return a >= 0 && b >= 0 ? c : d;
x >= 0 is expanded into a sequence like

  ra = not x
  rb = -31
  rc = ra >> (neg rb)
  T = (rc == 0)
  conditional jump
and combine tries to simplify it.  combine successfully simplifies
b >= 0 into shll and bt, but fails to simplify a >= 0.
It seems that combine doesn't do constant propagation well and
misses the constant -31.  In this case, a peephole like
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest" "")
	(not:SI (match_operand:SI 1 "arith_reg_operand" "")))
   (set (match_operand:SI 2 "arith_reg_dest" "") (const_int -31))
   (set (match_operand:SI 3 "arith_reg_dest" "")
	(lshiftrt:SI (match_dup 0) (neg:SI (match_dup 2))))
   (set (reg:SI T_REG)
	(eq:SI (match_operand:QI 4 "arith_reg_operand" "")
	       (const_int 0)))
   (set (pc)
	(if_then_else (match_operator 5 "comparison_operator"
			[(reg:SI T_REG) (const_int 0)])
		      (label_ref (match_operand 6 "" ""))
		      (pc)))]
  "REGNO (operands[3]) == REGNO (operands[4])
   && peep2_reg_dead_p (4, operands[0])
   && (peep2_reg_dead_p (4, operands[3])
       || rtx_equal_p (operands[2], operands[3]))
   && peep2_regno_dead_p (5, T_REG)"
  [(set (match_dup 2) (const_int -31))
   (set (reg:SI T_REG) (ge:SI (match_dup 1) (const_int 0)))
   (set (pc)
	(if_then_else (match_op_dup 7 [(reg:SI T_REG) (const_int 0)])
		      (label_ref (match_dup 6))
		      (pc)))]
  "
{
  operands[7] = gen_rtx_fmt_ee (reverse_condition (GET_CODE (operands[5])),
				GET_MODE (operands[5]),
				XEXP (operands[5], 0), XEXP (operands[5], 1));
}")
will be a workaround. It isn't ideal, but better than nothing.
> return a == b ? test_sub0 (a, b) : test_sub1 (a, b);
> return a != b ? test_sub0 (a, b) : test_sub1 (a, b);
This case is interesting.  At -Os, the two calls are converted into
one computed goto.  A bit surprisingly, the conversion is done
as a side effect of the combine-stack-adjustments pass.  That pass
calls

  cleanup_cfg (flag_crossjumping ? CLEANUP_CROSSJUMP : 0);

and the cross-jumping optimization merges the two calls.
With -Os -fno-delayed-branch, the OK case is compiled to
test_func_3_OK:
	mov	r4,r1
	cmp/eq	r5,r1
	mov.l	.L4,r0
	bf	.L3
	mov	r1,r5
	mov.l	.L5,r0
	bra	.L3
	nop
.L3:
	jmp	@r0
	nop
and the NG case
test_func_3_NG:
	mov	r4,r1
	cmp/eq	r5,r1
	bt	.L2
	mov.l	.L4,r0
	bra	.L3
	nop
.L2:
	mov.l	.L5,r0
	mov	r1,r5
.L3:
	jmp	@r0
	nop
Yep, the former is lucky.  I guess that the latter requires
basic block reordering for further simplification, though
I've found the comment
  /* Don't reorder blocks when optimizing for size because extra jump insns may
     be created; also barrier may create extra padding.
     More correctly we should have a block reordering mode that tried to
     minimize the combined size of all the jumps.  This would more or less
     automatically remove extra jumps, but would also try to use more short
     jumps instead of long jumps.  */
  if (!optimize_function_for_speed_p (cfun))
    return false;
in bb-reorder.c.