[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)
pinskia at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Oct 2 18:33:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |pinskia at gcc dot gnu.org
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Dmitrij Pochepko from comment #2)
> aarch64 won't be necessarily faster with such fix.
> 531.deepsjeng_r on ThunderX2 shows about 0.5% slower numbers with 31-clz(a).
This sounds like we only pass 0 or 1 to this function in deepsjeng_r?
Have you figured out the values that deepsjeng_r uses for these loops?
Because 31-clz would be:
clz w0, w0
mov w1, 31
sub w0, w1, w0
--- CUT ---
While the loop version would be:
asr w1, w0, 1
mov w0, 0
cbz w1, .L3
.p2align 2
.L5:
add w0, w0, 1
asr w1, w1, 1
cbnz w1, .L5
.L3:
If the first branch is predicted taken and is actually taken (that is, the
loop is skipped), the loop version would be a few cycles faster than the
clz-based one. That would also mean the input value in w0 is either 0 or 1.
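For reference, the two sequences above correspond roughly to the C below. This
is a sketch reconstructed from the assembly, not code taken from deepsjeng_r;
the function names are hypothetical, int is assumed to be 32 bits, and the
input is assumed to be positive (a negative value would never shift down to 0,
and __builtin_clz is undefined for 0).

```c
#include <assert.h>

/* Loop form: count how many times a can be shifted right before it
   becomes 0. For a == 0 or a == 1 the loop body never runs and the
   result is 0, which is the "skip the loop" case discussed above. */
static int log2_loop(int a)
{
    int b = 0;
    while (a >>= 1)
        b++;
    return b;
}

/* Closed form: index of the highest set bit. Assumes a > 0 and a
   32-bit int; __builtin_clz(0) is undefined. */
static int log2_clz(int a)
{
    return 31 - __builtin_clz((unsigned)a);
}

/* Hypothetical helper: check that both forms agree on 1..limit. */
static void check_equivalence(int limit)
{
    for (int a = 1; a <= limit; a++)
        assert(log2_loop(a) == log2_clz(a));
}
```

With an input of 1, log2_loop returns without entering the loop, while
log2_clz always executes the clz/sub pair, which is the difference in question.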
Did you analyze why it was worse for TX2?