[PATCH] rs6000: Support doubleword swaps removal in rot64 load store [PR100085]

Fri Jun 4 01:40:58 GMT 2021

On 2021/6/4 04:31, Segher Boessenkool wrote:
> On Thu, Jun 03, 2021 at 02:49:15PM +0800, Xionghu Luo wrote:
>> If the rotate is removed in simplify-rtx like below:
>>
>> +++ b/gcc/simplify-rtx.c
>> @@ -3830,10 +3830,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>>       case ROTATE:
>>         if (trueop1 == CONST0_RTX (mode))
>>          return op0;
>> +
>> +      if (GET_CODE (trueop0) == ROTATE && trueop1 == GEN_INT (64)
>> +         && CONST_INT_P (XEXP (trueop0, 1))
>> +         && INTVAL (XEXP (trueop0, 1)) == 64)
>> +       return XEXP (trueop0, 0);
> 
> (The hardcoded 64 needs improving -- but this is just a proof of concept,
> I'll assume :-) )
> 
>> Combine still fails to merge the two instructions:
>>
>> Trying 6 -> 7:
>>      6: r120:KF#0=r125:KF#0<-<0x40
>>        REG_DEAD r125:KF
>>      7: [sfp:DI+r123:DI]=r120:KF#0<-<0x40
>>        REG_DEAD r120:KF
>> Successfully matched this instruction:
>> (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
>>              (reg:DI 123)) [1  S16 A128])
>>      (subreg:V1TI (reg:KF 125) 0))
>> rejecting combination of insns 6 and 7
>> original costs 4 + 4 = 8
>> replacement cost 12
> 
> So what instructions were these?  Why did the store cost 4 but the new
> one costs 12?

They come from this test case, which converts __float128 to vector __int128:

typedef union
{
  __float128 vf1;
  vector __int128 vi128;
  __int128 i128;
} VF_128;

vector __int128
foo1 (__float128 f128)
{
  VF_128 vunion;

  vunion.vf1 = f128;
  return vunion.vi128;
}

Without this patch, the RTL in combine is:

(insn 6 3 17 2 (set (subreg:V1TI (reg:KF 120 [ f128 ]) 0)
        (rotate:V1TI (subreg:V1TI (reg:KF 125) 0)
            (const_int 64 [0x40]))) "pr100085.c":258:14 1113 {*vsx_le_permute_v1ti}
     (expr_list:REG_DEAD (reg:KF 125)
        (nil)))
(insn 17 6 7 2 (set (reg:DI 123)
        (const_int 32 [0x20])) "pr100085.c":258:14 636 {*movdi_internal64}
     (nil))
(insn 7 17 19 2 (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
                (reg:DI 123)) [1  S16 A128])
        (rotate:V1TI (subreg:V1TI (reg:KF 120 [ f128 ]) 0)
            (const_int 64 [0x40]))) "pr100085.c":258:14 1113 {*vsx_le_permute_v1ti}
     (expr_list:REG_DEAD (reg:KF 120 [ f128 ])
        (nil)))
(note 19 7 14 2 NOTE_INSN_DELETED)
(insn 14 19 15 2 (set (reg/i:V1TI 66 %v2)
        (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
                (reg:DI 123)) [1  S16 A128])) "pr100085.c":260:1 1119 {*vsx_le_perm_load_v1ti}
     (expr_list:REG_DEAD (reg:DI 123)
        (nil)))
(insn 15 14 0 2 (use (reg/i:V1TI 66 %v2)) "pr100085.c":260:1 -1
     (nil))
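
As a rough illustration of why the two rotates in insn 6 and insn 7 are
redundant when taken together: rotating a 128-bit value by 64 swaps its two
doublewords, so doing it twice gives back the original value.  The sketch
below is only a toy model in plain C; the type u128_model and the helpers
rotate64 and double_rotate_is_identity are hypothetical names, not anything
from GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit value as two 64-bit doublewords.  */
typedef struct
{
  uint64_t hi;
  uint64_t lo;
} u128_model;

/* Rotate left by 64 bits.  For a 128-bit value this just swaps the
   two doublewords, which is what (rotate:V1TI x (const_int 64)) does.  */
static u128_model
rotate64 (u128_model v)
{
  u128_model r = { v.lo, v.hi };
  return r;
}

/* Return nonzero if rotating V by 64 twice yields V again.  */
static int
double_rotate_is_identity (u128_model v)
{
  u128_model once = rotate64 (v);
  u128_model twice = rotate64 (once);

  /* A single rotate really does swap the halves.  */
  assert (once.hi == v.lo && once.lo == v.hi);

  return twice.hi == v.hi && twice.lo == v.lo;
}
```

This is the identity that both the simplify-rtx experiment and the swap pass
rely on: rotate (rotate (x, 64), 64) == x for a 128-bit x.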

Insn 6 and insn 7 are both vsx_le_permute_v1ti instructions, each with a cost
of 4.  (Both are VSX- and LE-specific, as Bill said; the swap pass tries to
remove them when that is legal.)  If the rotates are removed in simplify-rtx.c
(simplify_context::simplify_binary_operation_1) as in my last reply, combine
tries to merge them into vsx_le_perm_store_v1ti, whose insn cost is 12, and
hits "rejecting combination".  All of them are in V1TI mode.
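
To make the arithmetic in the dump concrete, here is a toy model of the
comparison behind "rejecting combination".  It is not the actual combine code;
the function name combination_accepted and the simple rule "accept unless the
replacement costs more than the originals together" are assumptions made for
illustration, chosen to match the numbers shown above.

```c
#include <stdbool.h>

/* Toy model, not GCC code: the merged insn is accepted unless its
   cost exceeds the summed cost of the two insns it replaces.  */
static bool
combination_accepted (int cost_first, int cost_second, int replacement_cost)
{
  int original_cost = cost_first + cost_second;
  return replacement_cost <= original_cost;
}
```

With the values from the dump, combination_accepted (4, 4, 12) is false, which
corresponds to the rejection, while combination_accepted (4, 4, 8) is true,
which corresponds to the experiment of lowering the store cost to 8.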

> 
>> By hacking the vsx_le_perm_store_v1ti INSN_COST from 12 to 8,
> 
> It should be the same cost as the other store!

The cost of vsx_le_permute_v1ti is defined as 4 in vsx.md:

;; Little endian word swapping for 128-bit types that are either scalars or the
;; special V1TI container class, which it is not appropriate to use vec_select
;; for the type.
(define_insn "*vsx_le_permute_<mode>"
  [(set (match_operand:VSX_TI 0 "nonimmediate_operand" "=wa,wa,Z,&r,&r,Q")
	(rotate:VSX_TI
	 (match_operand:VSX_TI 1 "input_operand" "wa,Z,wa,r,Q,r")
	 (const_int 64)))]
  "!BYTES_BIG_ENDIAN && TARGET_VSX && !TARGET_P9_VECTOR"
  "@
   xxpermdi %x0,%x1,%x1,2
   lxvd2x %x0,%y1
   stxvd2x %x1,%y0
   mr %0,%L1\;mr %L0,%1
   ld%U1%X1 %0,%L1\;ld%U1%X1 %L0,%1
   std%U0%X0 %L1,%0\;std%U0%X0 %1,%L0"
  [(set_attr "length" "*,*,*,8,8,8")
   (set_attr "type" "vecperm,vecload,vecstore,*,load,store")])

> 
>> it could merge the instructions:
>>
>>      21: r125:KF=%v2:KF
>>        REG_DEAD %v2:KF
>>      2: NOTE_INSN_DELETED
>>      3: NOTE_INSN_FUNCTION_BEG
>>      6: NOTE_INSN_DELETED
>>     17: r123:DI=0x20
>>      7: [sfp:DI+r123:DI]=r125:KF#0
>>        REG_DEAD r125:KF
>>     19: NOTE_INSN_DELETED
>>     14: %v2:V1TI=[sfp:DI+r123:DI]
>>        REG_DEAD r123:DI
>>     15: use %v2:V1TI
>>
>> The split1 pass that follows will still split it, because there is no dse
>> pass in between to remove the memory operations on the stack.  Removing the
>> rotate in the swap pass does not have this problem, since it runs before
>> dse and there is no split pass between them:
> 
> Sure, but none of that is the point.  I asked if we did this for TImode
> properly, and maybe we do, but:
> 
>>     22: r126:V1TI=r125:KF#0<-<0x40
>>     23: [sfp:DI+r123:DI]=r126:V1TI<-<0x40
> 
> ... this is V1TI mode.
> 
> 
> Segher
> 

-- 
Thanks,
Xionghu