[Bug target/64793] [SH] missed delay slot
olegendo at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Sat Feb 14 19:53:00 GMT 2015
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64793
Oleg Endo <olegendo at gcc dot gnu.org> changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |kkojima at gcc dot gnu.org
--- Comment #1 from Oleg Endo <olegendo at gcc dot gnu.org> ---
This is caused by the fake annulled conditional true branches.
Applying this:
Index: gcc/config/sh/sh.md
===================================================================
--- gcc/config/sh/sh.md (revision 220708)
+++ gcc/config/sh/sh.md (working copy)
@@ -593,20 +593,9 @@
[(and (eq_attr "in_delay_slot" "yes")
(eq_attr "type" "!pstore,prget")) (nil) (nil)])
-;; Say that we have annulled true branches, since this gives smaller and
-;; faster code when branches are predicted as not taken.
-
-;; ??? The non-annulled condition should really be "in_delay_slot",
-;; but insns that can be filled in non-annulled get priority over insns
-;; that can only be filled in anulled.
-
(define_delay
- (and (eq_attr "type" "cbranch")
- (match_test "TARGET_SH2"))
- ;; SH2e has a hardware bug that pretty much prohibits the use of
- ;; annulled delay slots.
- [(eq_attr "cond_delay_slot" "yes") (and (eq_attr "cond_delay_slot" "yes")
- (not (eq_attr "cpu" "sh2e"))) (nil)])
+ (and (eq_attr "type" "cbranch") (match_test "TARGET_SH2"))
+ [(eq_attr "cond_delay_slot" "yes") (nil) (nil)])
;; -------------------------------------------------------------------------
;; SImode signed integer comparisons
results in the expected code:
mov r5,r0
mov.b @(r0,r4),r1
mov r1,r0
cmp/eq #92,r0
bt .L3
rts
mov r7,r0
.align 1
.L3:
rts
mov r6,r0
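For reference, the assembly above matches a function of roughly this shape. This is a guessed reproducer, not the actual testcase from the PR; on SH the first four arguments arrive in r4..r7, so s -> r4, i -> r5, a -> r6, b -> r7, and 92 is ASCII '\'.

```c
/* Hypothetical reproducer inferred from the assembly above (the real
   testcase is attached to the PR).  The cmp/eq #92,r0 tests the loaded
   byte against 92; bt .L3 then returns a (r6), the fall-through path
   returns b (r7) with the mov filled into the rts delay slot.  */
int
check_byte (const char *s, int i, int a, int b)
{
  return s[i] == 92 ? a : b;
}
```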
The downside is that code size increases on average. CSiBE shows a total
increase of
3371399 -> 3372451 bytes (+1052 / +0.031204 %)
even though there are also individual code size decreases.
It also seems that this catches more cases of cbranches with previously
missed delay slots:
blocksort.c (fallbackSort):
before:
.L275:
cmp/pl r3
bf .L23
mov.l @(28,r15),r4
mov #0,r0
mov.l @(16,r15),r2
after:
.L275:
cmp/pl r3
bf/s .L23
mov #0,r0
mov.l @(28,r15),r4
mov.l @(16,r15),r2
The code size increase is caused by duplicated insns such as:
before:
bf .L315
...
bf .L315
...
bf .L315
...
.L315:
cmp/hi r13,r12
bra .L308
movt r0
after:
bf/s .L322
cmp/hi r13,r12
...
bf/s .L322
cmp/hi r13,r12
...
bf/s .L322
cmp/hi r13,r12
...
.L322:
bra .L307
movt r0
In a similar way, the builtin strcmp code results in sequences such as:
bt/s .L67
sett
mov.b @r1+,r2
tst r2,r2
bt/s .L67
sett
The sh_optimize_sett_clrt pass does not eliminate the sett insn because the
T bit does not have the same value on all paths, and thus the sett gets
copied into the delay slots.
There's an old comment from r9888:
;; Say that we have annulled true branches, since this gives smaller and
;; faster code when branches are predicted as not taken.
I don't know what this comment is based on. Branch prediction was added with
the SH4A, a long time after that comment was written. Maybe it refers to the
fact that conditional branches on SH are faster when they are not taken.
Public SH2 documentation states that (bf/s, bt/s) take 2 cycles and (bt, bf)
take 3 cycles when the branch is taken; in both cases the branch insns take
1 cycle if they don't branch. Looking at other documentation (ST40-300,
SH4A), it seems that using the delay-slot variants has a higher chance of
executing the branch and the delay-slot insn in parallel.
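Taking those published SH2 numbers at face value, a back-of-the-envelope
expected-cycle comparison (my own arithmetic, not from any SH manual) shows
the delay-slot variants are never slower for the branch itself:

```c
/* Expected branch cost in cycles as a function of the taken-probability p,
   using the SH2 figures quoted above: bt/bf = 3 cycles taken, bt/s and
   bf/s = 2 cycles taken, and 1 cycle for either form when not taken.  */
static double
plain_branch_cost (double p)
{
  return 3.0 * p + 1.0 * (1.0 - p);
}

static double
delayed_branch_cost (double p)
{
  return 2.0 * p + 1.0 * (1.0 - p);
}
```

So the bt/s and bf/s forms win one cycle whenever the branch is taken and
cost the same when it is not, which fits the observation that only the
annulled-branch fakery made the plain forms look preferable.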
Kaz, if you have some time, could you please do a CSiBE runtime comparison
with and without the patch above? I'm tempted to apply it and drop the fake
annulled delay slot insns.