Bug 94687

Summary: PPC vector fails to optimize shift (used bits)
Product: gcc
Reporter: Shawn Landden <shawn>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: normal
CC: segher
Priority: P3
Keywords: missed-optimization
Version: 8.0
Target Milestone: ---
Host:
Target: powerpc
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-04-30 00:00:00

Description Shawn Landden 2020-04-21 10:32:28 UTC
https://godbolt.org/z/ZyTG9b


#include <altivec.h>

typedef vector unsigned __int128 block;
typedef vector unsigned long long vector2_u64;

block swap_with_shift(block num) {
    return num << 64 | num >> 64;
}

block swap_without_shift(block num) {
    vector unsigned long long ret;
    ret[0] = ((vector2_u64)num)[1];
    ret[1] = ((vector2_u64)num)[0];
    return (block)ret;
}

typedef unsigned __int128 u128;

u128 swap_scalar(u128 in) {
    return in << 64 | in >> 64;
}

swap_with_shift:
        xxpermdi 34,34,34,2
        addi 9,1,-16
        stxvd2x 34,0,9
        ld 8,-8(1)
        ld 9,-16(1)
        mtvsrd 1,8
        mtvsrd 0,9
        xxpermdi 34,0,1,0
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
swap_without_shift:
        xxpermdi 34,34,34,2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
swap_scalar:
        mr 9,3
        mr 3,4
        mr 4,9
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
Comment 1 Segher Boessenkool 2020-04-30 17:16:26 UTC
Confirmed.

At combine time we start with

insn_cost 4 for    25: r130:V1TI=%2:V1TI
      REG_DEAD %2:V1TI
insn_cost 4 for    20: r129:V1TI=r130:V1TI
      REG_DEAD r130:V1TI
insn_cost 4 for    21: r127:DI=r129:V1TI#0
insn_cost 4 for    22: r128:DI=r129:V1TI#8
      REG_DEAD r129:V1TI
insn_cost 4 for    24: r123:TI=0
insn_cost 4 for     7: r123:TI#8=r127:DI
      REG_DEAD r127:DI
insn_cost 4 for     8: r123:TI#0=r128:DI
      REG_DEAD r128:DI
insn_cost 4 for     9: r122:V1TI=r123:TI#0
      REG_DEAD r123:TI
insn_cost 4 for    14: %2:V1TI=r122:V1TI
      REG_DEAD r122:V1TI
insn_cost 0 for    15: use %2:V1TI

and the subregs on the lhs (insns 7 and 8) cannot be combined with anything.

2-to-2 combine won't handle 20+21 (and then 20+22) because 20 is already a
register move.  It would probably combine fine if the subreg lhs problem were
fixed, though.
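
For readers less used to RTL dumps, the following is a rough C model of insns 21, 22, 24, 7 and 8 above, one statement per insn. It is only a sketch: it assumes the subreg byte offsets (#0, #8) can be modelled as memory-order byte offsets into a 16-byte object, and the names swap_like_rtl and make_u128 are hypothetical helpers, not anything in GCC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;

static_assert(sizeof(u128) == 16, "this sketch assumes a 16-byte __int128");

/* Hypothetical helper: build a 128-bit value from two 64-bit halves. */
static u128 make_u128(uint64_t hi, uint64_t lo) {
    return ((u128)hi << 64) | lo;
}

/* One statement per insn; register names follow the dump. */
static u128 swap_like_rtl(u128 r129) {
    uint64_t r127, r128;
    u128 r123;

    memcpy(&r127, (const unsigned char *)&r129 + 0, 8); /* insn 21: r127:DI=r129#0 */
    memcpy(&r128, (const unsigned char *)&r129 + 8, 8); /* insn 22: r128:DI=r129#8 */
    r123 = 0;                                           /* insn 24: r123:TI=0      */
    memcpy((unsigned char *)&r123 + 8, &r127, 8);       /* insn  7: r123#8=r127    */
    memcpy((unsigned char *)&r123 + 0, &r128, 8);       /* insn  8: r123#0=r128    */
    return r123;
}
```

The two partial writes in the model (insns 7 and 8) are the subreg-lhs stores that combine cannot merge. Together with the clear in insn 24 they amount to a plain exchange of the two 64-bit halves.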
Comment 2 Shawn Landden 2020-06-16 11:40:23 UTC
LLVM fixed this by lowering to a vector shuffle: https://dev.gnupg.org/D501
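
As an illustration of what "lowering to a vector shuffle" means at the source level, here is a minimal, target-independent sketch using the GCC/Clang generic vector extensions rather than <altivec.h>. It is not the LLVM change itself. The helper names (swap_lanes, rotate64_via_shuffle, rotate64_via_shift, make_u128) are hypothetical, and the code assumes a compiler that provides unsigned __int128 and either __builtin_shuffle (GCC) or __builtin_shufflevector (Clang).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;
typedef uint64_t v2u64 __attribute__((vector_size(16)));

static_assert(sizeof(v2u64) == sizeof(u128), "vector and scalar must both be 16 bytes");

/* Hypothetical helper: build a 128-bit value from two 64-bit halves. */
static u128 make_u128(uint64_t hi, uint64_t lo) {
    return ((u128)hi << 64) | lo;
}

/* Exchange the two 64-bit lanes with a single shuffle. */
static v2u64 swap_lanes(v2u64 v) {
#if defined(__clang__)
    return __builtin_shufflevector(v, v, 1, 0);
#else
    const v2u64 mask = {1, 0};
    return __builtin_shuffle(v, mask);
#endif
}

/* The shift/or form from the report (same expression as swap_scalar). */
static u128 rotate64_via_shift(u128 in) {
    return in << 64 | in >> 64;
}

/* The same operation expressed as a lane swap. */
static u128 rotate64_via_shuffle(u128 in) {
    v2u64 v;
    memcpy(&v, &in, sizeof v);   /* reinterpret the 128 bits as two lanes */
    v = swap_lanes(v);
    memcpy(&in, &v, sizeof in);  /* and back to a scalar */
    return in;
}
```

Because the operation exchanges the two halves symmetrically, the result does not depend on which lane holds the low half, so the sketch behaves the same on little- and big-endian targets.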