[Bug target/64793] [SH] missed delay slot

Sat Feb 14 19:53:00 GMT 2015

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64793

Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kkojima at gcc dot gnu.org

--- Comment #1 from Oleg Endo <olegendo at gcc dot gnu.org> ---
This is caused by the fake annulled conditional true branches.
Applying this:

Index: gcc/config/sh/sh.md
===================================================================
--- gcc/config/sh/sh.md    (revision 220708)
+++ gcc/config/sh/sh.md    (working copy)
@@ -593,20 +593,9 @@
   [(and (eq_attr "in_delay_slot" "yes")
     (eq_attr "type" "!pstore,prget")) (nil) (nil)])

-;; Say that we have annulled true branches, since this gives smaller and
-;; faster code when branches are predicted as not taken.
-
-;; ??? The non-annulled condition should really be "in_delay_slot",
-;; but insns that can be filled in non-annulled get priority over insns
-;; that can only be filled in anulled.
-
 (define_delay
-  (and (eq_attr "type" "cbranch")
-       (match_test "TARGET_SH2"))
-  ;; SH2e has a hardware bug that pretty much prohibits the use of
-  ;; annulled delay slots.
-  [(eq_attr "cond_delay_slot" "yes") (and (eq_attr "cond_delay_slot" "yes")
-                      (not (eq_attr "cpu" "sh2e"))) (nil)])
+  (and (eq_attr "type" "cbranch") (match_test "TARGET_SH2"))
+  [(eq_attr "cond_delay_slot" "yes") (nil) (nil)])


 ;; -------------------------------------------------------------------------
 ;; SImode signed integer comparisons


results in the expected code:

        mov     r5,r0
        mov.b   @(r0,r4),r1
        mov     r1,r0
        cmp/eq  #92,r0
        bt      .L3
        rts
        mov     r7,r0
        .align 1
.L3:
        rts
        mov     r6,r0

The downside is that code size increases on average.  CSiBE shows a total
increase of

   3371399 -> 3372451    +1052 / +0.031204 %

even though there are also individual code size decreases.
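
As a quick sanity check of the CSiBE summary line, the delta and percentage
can be recomputed from the two totals.  This is only a sketch, not part of
CSiBE or the patch; the variable names are made up for illustration.

```python
# Totals copied from the CSiBE summary above (bytes).
size_before = 3371399
size_after = 3372451

delta = size_after - size_before
percent = 100.0 * delta / size_before

# Same layout as the summary line quoted above.
print(f"{size_before} -> {size_after}    {delta:+d} / {percent:+.6f} %")
```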

The patch also seems to catch more cases where a cbranch delay slot was
previously missed:

blocksort.c (fallbackSort):

before:
.L275:
        cmp/pl  r3
        bf      .L23
        mov.l   @(28,r15),r4
        mov     #0,r0
        mov.l   @(16,r15),r2

after:
.L275:
        cmp/pl  r3
        bf/s    .L23
        mov     #0,r0
        mov.l   @(28,r15),r4
        mov.l   @(16,r15),r2


The code size increase is caused by duplicated insns such as:

before:
        bf      .L315
        ...
        bf      .L315
        ...
        bf      .L315
        ...
.L315:
        cmp/hi  r13,r12
        bra     .L308
        movt    r0

after:
        bf/s    .L322
        cmp/hi  r13,r12
        ...
        bf/s    .L322
        cmp/hi  r13,r12
        ...
        bf/s    .L322
        cmp/hi  r13,r12
        ...
.L322:
        bra     .L307
        movt    r0
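
The size effect of this duplication can be estimated with a small model.
This is a rough sketch under two assumptions that are not stated in the
listing itself: every insn is 2 bytes (the fixed 16-bit SH encoding), and the
insn copied into the delay slots is removed from the branch target, as the
before/after listing suggests.  The function name is hypothetical.

```python
INSN_BYTES = 2  # assumption: fixed 16-bit SH instruction encoding


def size_delta_bytes(num_branches, insns_removed_at_target=1):
    """Net size change when one target insn is copied into the delay slot
    of each of num_branches conditional branches."""
    copies_added = num_branches
    return (copies_added - insns_removed_at_target) * INSN_BYTES


# Three bf -> bf/s sites sharing one target, as in the listing above.
print(size_delta_bytes(3))
```

With a single branch to the target the model gives no growth; each additional
branch that shares the target adds one more copy of the insn.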


In a similar way, the builtin strcmp code results in sequences such as:

        bt/s    .L67
        sett
        mov.b   @r1+,r2
        tst     r2,r2
        bt/s    .L67
        sett

The sh_optimize_sett_clrt pass does not eliminate the sett insn because T does
not have the same value on all paths, and thus the sett gets copied into the
delay slots.
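
The condition described above can be illustrated with a toy check.  This is
not the actual sh_optimize_sett_clrt implementation, only a sketch of the
rule "a sett is redundant only if T is already known to be 1 on every
incoming path"; the function name and the list encoding are made up here.

```python
def sett_is_redundant(t_on_incoming_paths):
    """True only if T is known to be 1 on every path reaching the sett.

    Each entry is 0, 1, or None (value of T unknown on that path).
    """
    return bool(t_on_incoming_paths) and all(
        t == 1 for t in t_on_incoming_paths
    )


# Hypothetical label reached by two taken 'bt' branches (T = 1) and one
# other path where T is 0: the sett has to stay.
print(sett_is_redundant([1, 1, 0]))

# If only the taken 'bt' branches reached the label, it could be dropped.
print(sett_is_redundant([1, 1]))
```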


There's an old comment from r9888:

;; Say that we have annulled true branches, since this gives smaller and
;; faster code when branches are predicted as not taken.

I don't know what this comment is based on.  Branch prediction was added on
SH4A, which was a long time after that comment was written.  Maybe it refers
to the fact that conditional branches are faster on SH when they are not taken.
Public SH2 documentation states that a taken (bf/s, bt/s) costs 2 cycles and a
taken (bt, bf) costs 3 cycles.  In both cases the branch insns take 1 cycle if
they don't branch.  Looking at other documentation (ST40-300, SH4A), it seems
that using the delay-slot variants has a higher chance of executing the branch
and delay-slot insn in parallel.
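
To put the SH2 numbers above side by side, here is a minimal sketch of the
expected cost of the branch insn alone as a function of how often it is
taken.  It uses only the cycle counts quoted above and deliberately ignores
the delay-slot insn itself, pipeline stalls, and SH4A branch prediction; the
names are hypothetical.

```python
# SH2 cycle counts as quoted above.
TAKEN_PLAIN = 3    # bt / bf, branch taken
TAKEN_DELAYED = 2  # bt/s / bf/s, branch taken
NOT_TAKEN = 1      # either form, branch not taken


def expected_branch_cycles(p_taken, delayed):
    """Average cycles of the branch insn for a given taken probability."""
    taken = TAKEN_DELAYED if delayed else TAKEN_PLAIN
    return p_taken * taken + (1.0 - p_taken) * NOT_TAKEN


for p in (0.0, 0.5, 1.0):
    plain = expected_branch_cycles(p, delayed=False)
    slot = expected_branch_cycles(p, delayed=True)
    print(f"p_taken={p:.1f}  bt/bf={plain:.2f}  bt/s,bf/s={slot:.2f}")
```

In this model the two forms cost the same when the branch is never taken, and
the delay-slot form gains up to one cycle as the taken probability rises.
Whether that gain survives depends on whether the slot insn is useful work on
both paths, which the model does not capture.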

Kaz, if you have some time, could you please do a CSiBE runtime comparison
with/without the patch above?  I'm tempted to apply the patch above and drop
the fake annulled delay slot insns.