[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

Wed Oct 2 18:33:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Dmitrij Pochepko from comment #2)
> aarch64 won't be necessarily faster with such fix.
> 531.deepsjeng_r on ThunderX2 shows about 0.5% slower numbers with 31-clz(a).

This sounds like we only pass 0 or 1 to this function in deepsjeng_r?
Have you figured out the values that deepsjeng_r uses for these loops?

Because 31-clz would be:
        clz     w0, w0
        mov     w1, 31
        sub     w0, w1, w0
--- CUT ---
While the loop version would be:
        asr     w1, w0, 1
        mov     w0, 0
        cbz     w1, .L3
        .p2align 2
.L5:
        add     w0, w0, 1
        asr     w1, w1, 1
        cbnz    w1, .L5
.L3:

If the first branch (the cbz) is predicted as taken and is actually taken,
that is, the loop is skipped, the loop version would be a few cycles faster
than the clz-based one.  That would also mean the input value in w0 is either
0 or 1.
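
For reference, a rough C sketch of the two forms being compared. This is
reconstructed from the assembly above and is not necessarily the exact source
from the testcase or from deepsjeng_r; the function names are made up for
illustration. It assumes a 32-bit int and a non-negative argument (for a
negative argument the arithmetic shift never reaches 0), and it uses the GCC
builtin __builtin_clz, which is undefined for 0 and so needs a guard:

```c
#include <limits.h>

/* Loop form: counts how many right shifts it takes for a to reach 0.
   Assumes a >= 0.  For a == 0 or a == 1 the first shift already gives 0,
   so the loop body never runs and the result is 0.  */
static int count_loop (int a)
{
  int b = 0;
  while (a >>= 1)
    b++;
  return b;
}

/* clz form: 31 - clz(a), assuming a 32-bit int and a >= 0.
   __builtin_clz (0) is undefined, so 0 is handled separately to match
   the loop form.  */
static int count_clz (int a)
{
  if (a == 0)
    return 0;
  return 31 - __builtin_clz ((unsigned int) a);
}
```

With inputs of only 0 or 1, count_loop returns immediately after the first
shift, which corresponds to the cbz being taken in the loop version above.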

Did you analyze why it was worse on TX2?