https://godbolt.org/z/ZyTG9b

#include <altivec.h>

typedef vector unsigned __int128 block;
typedef vector unsigned long long vector2_u64;

block swap_with_shift(block num) {
    return num << 64 | num >> 64;
}

block swap_without_shift(block num) {
    vector unsigned long long ret;
    ret[0] = ((vector2_u64)num)[1];
    ret[1] = ((vector2_u64)num)[0];
    return (block)ret;
}

typedef unsigned __int128 u128;

u128 swap_scalar(u128 in) {
    return in << 64 | in >> 64;
}

swap_with_shift:
        xxpermdi 34,34,34,2
        addi 9,1,-16
        stxvd2x 34,0,9
        ld 8,-8(1)
        ld 9,-16(1)
        mtvsrd 1,8
        mtvsrd 0,9
        xxpermdi 34,0,1,0
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
swap_without_shift:
        xxpermdi 34,34,34,2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
swap_scalar:
        mr 9,3
        mr 3,4
        mr 4,9
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
Confirmed. At combine time we start with

insn_cost 4 for    25: r130:V1TI=%2:V1TI
      REG_DEAD %2:V1TI
insn_cost 4 for    20: r129:V1TI=r130:V1TI
      REG_DEAD r130:V1TI
insn_cost 4 for    21: r127:DI=r129:V1TI#0
insn_cost 4 for    22: r128:DI=r129:V1TI#8
      REG_DEAD r129:V1TI
insn_cost 4 for    24: r123:TI=0
insn_cost 4 for     7: r123:TI#8=r127:DI
      REG_DEAD r127:DI
insn_cost 4 for     8: r123:TI#0=r128:DI
      REG_DEAD r128:DI
insn_cost 4 for     9: r122:V1TI=r123:TI#0
      REG_DEAD r123:TI
insn_cost 4 for    14: %2:V1TI=r122:V1TI
      REG_DEAD r122:V1TI
insn_cost 0 for    15: use %2:V1TI

and those subregs at the lhs (insns 7 and 8) cannot combine with
anything. 2-to-2 combine won't handle 20+21 (and then, 20+22) because
20 is a register move already. It would probably combine fine if that
subreg lhs problem was fixed though.
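
For readers who don't read RTL dumps often, here is a rough C rendering of what insns 21, 22, 24, 7 and 8 do. This is only a sketch: the function names are made up for illustration, the memcpy calls stand in for the subreg reads and writes at byte offsets 0 and 8, and it uses the scalar unsigned __int128 type rather than the AltiVec vector type so that it builds on any 64-bit GCC or Clang target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;

/* Hypothetical helper: mirrors the dump one insn at a time. */
static u128 swap_via_halves(u128 in)
{
    uint64_t half0, half8;                 /* play the role of r127 and r128 */
    u128 out = 0;                          /* insn 24: r123:TI = 0           */
    const unsigned char *src = (const unsigned char *)&in;
    unsigned char *dst = (unsigned char *)&out;

    memcpy(&half0, src + 0, 8);            /* insn 21: r127 = r129#0 */
    memcpy(&half8, src + 8, 8);            /* insn 22: r128 = r129#8 */
    memcpy(dst + 8, &half0, 8);            /* insn  7: r123#8 = r127 */
    memcpy(dst + 0, &half8, 8);            /* insn  8: r123#0 = r128 */
    return out;
}

/* Hypothetical self-check: the piecewise copy equals a rotate by 64. */
static void check_swap(u128 x)
{
    assert(swap_via_halves(x) == (x << 64 | x >> 64));
}
```

The two partial writes into the same 128-bit destination correspond to the lhs subregs in insns 7 and 8, which are the ones combine cannot merge with anything.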
LLVM fixed this by lowering to a vector shuffle: https://dev.gnupg.org/D501
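
To show what "lowering to a vector shuffle" means at the source level, here is a minimal sketch using the GCC/Clang generic vector extensions instead of the AltiVec types. It is not the LLVM patch itself; the names v2u64, swap_doublewords and check_shuffle are hypothetical. The rotate by 64 is written as a single two-element shuffle with indices {1, 0}, which is the form a backend can presumably match to one permute instruction such as xxpermdi.

```c
#include <assert.h>
#include <stdint.h>

/* Two 64-bit lanes in one 128-bit vector (generic vector extension). */
typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Hypothetical helper: swap the two doublewords with one shuffle. */
static v2u64 swap_doublewords(v2u64 v)
{
#if defined(__clang__)
    /* Clang builtin: result lane 0 takes lane 1, lane 1 takes lane 0. */
    return __builtin_shufflevector(v, v, 1, 0);
#else
    /* GCC builtin: the mask holds the source lane for each result lane. */
    const v2u64 mask = { 1, 0 };
    return __builtin_shuffle(v, mask);
#endif
}

/* Hypothetical self-check: the shuffle swaps the two lanes. */
static void check_shuffle(uint64_t a, uint64_t b)
{
    v2u64 v = { a, b };
    v2u64 r = swap_doublewords(v);
    assert(r[0] == b && r[1] == a);
}
```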