[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<<n in opposite sides of a branch

Thu Jun 9 07:33:26 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904

            Bug ID: 105904
           Summary: Predicated mov r0, #1 with opposite conditions could
                    be hoisted, between 1 and 1<<n in opposite sides of a
                    branch
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: arm-*-*

#include <bit>  // using the libstdc++ header
unsigned roundup(unsigned x){
    return std::bit_ceil(x);
}

https://godbolt.org/z/Px1fvWaex

GCC's version is somewhat clunky, with MOV r0, #1 on both "sides":

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1            @@ here
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1            @@ here
        bx      lr

Even without spotting the other optimizations that clang finds, we can combine
the two into a single unconditional MOV r0, #1.  But only if we avoid setting
flags, so in Thumb-2 it requires the 4-byte encoding, not the 2-byte MOVS.
Still, it's one fewer instruction to execute.

This is not totally trivial: it requires seeing that the MOV can be moved
across the conditional LSL.  So it's really a matter of folding the 1s between
1<<n and 1 on opposite sides of an if-converted branch.  The result would look
like this:

        cmp     r0, #1
        ittt    hi
        addhi   r3, r0, #-1
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        mov     r0, #1            @@ now unconditional
        it      hi
        lslhi   r0, r0, r3
        bx      lr
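
At the source level, the fold can be sketched roughly as below.  This is only
an illustration of the two shapes, not libstdc++'s actual implementation: the
function names are made up, it assumes a 32-bit unsigned and the GCC/clang
builtin __builtin_clz, and it has the same precondition as std::bit_ceil
(the result must be representable, so x <= 0x80000000).

```cpp
// Sketch only: hypothetical names, 32-bit unsigned assumed,
// precondition x <= 0x80000000 (otherwise the shift count would be 32).

// Shape matching GCC's current output: the constant 1 appears on both sides.
unsigned roundup_both_sides(unsigned x) {
    if (x > 1) {
        // x - 1 >= 1 here, so __builtin_clz has a defined result.
        unsigned n = 32 - __builtin_clz(x - 1);
        return 1u << n;             // movhi r0, #1 ; lslhi r0, r0, r3
    }
    return 1u;                      // movls r0, #1
}

// Shape after hoisting: one unconditional 1, only the shift is conditional.
unsigned roundup_hoisted(unsigned x) {
    unsigned r = 1u;                // mov r0, #1 (unconditional)
    if (x > 1) {
        unsigned n = 32 - __builtin_clz(x - 1);
        r <<= n;                    // lslhi r0, r0, r3
    }
    return r;
}
```

Both functions compute the same value; the second just makes it explicit that
1 is the starting value on either path.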

clang makes rather nice asm for ARMv7 -mcpu=cortex-a53, as discussed in
PR104773, which covers a different missed optimization in the same asm.

roundup(unsigned int):                @@ clang's version.
        subs    r0, r0, #1
        clz     r0, r0
        rsb     r1, r0, #32         @ 32-clz
        mov     r0, #1
        lslhi   r0, r0, r1          @ using flags set by SUBS
        bx      lr                  @ 1<<(32-clz) or just 1

Folding the mov r0, #1 from both sides is only a couple of steps away from
making the clz and rsb unconditional and keeping only the LSL conditional.
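
For illustration, the shape of clang's version above could be written at the
source level roughly as follows.  Again this is a sketch under assumptions:
clz32 is a hypothetical helper that models the ARM CLZ instruction (which
returns 32 for an input of 0, where __builtin_clz is undefined), unsigned is
assumed to be 32 bits, and the precondition x <= 0x80000000 still applies.

```cpp
// Hypothetical helper modelling ARM CLZ: defined for 0, where it returns 32.
static unsigned clz32(unsigned v) {
    return v ? static_cast<unsigned>(__builtin_clz(v)) : 32u;
}

// Sketch of clang's shape: everything unconditional except the final shift.
unsigned roundup_clang_shape(unsigned x) {
    unsigned t = x - 1;             // subs: also provides the x > 1 (HI) condition
    unsigned n = 32 - clz32(t);     // clz + rsb, computed even when unused
    unsigned r = 1u;                // mov r0, #1
    if (x > 1) {
        r <<= n;                    // lslhi: n is in 1..31 on this path
    }
    return r;
}
```

For x = 0 or x = 1 the computed n (32 or 0) is simply never used, which is why
the clz and rsb are safe to execute unconditionally.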