[PATCH] rs6000: Fix incorrect RTL for Power LE when removing the UNSPECS [PR106069]

Tue Aug 9 03:01:05 GMT 2022

Hi Xionghu,

Thanks for the fix.

on 2022/8/8 11:42, Xionghu Luo wrote:
> The native RTL expression for vec_mrghw should be same for BE and LE as
> they are register and endian-independent.  So both BE and LE need
> generate exactly same RTL with index [0 4 1 5] when expanding vec_mrghw
> with vec_select and vec_concat.
> 
> (set (reg:V4SI 141) (vec_select:V4SI (vec_concat:V8SI
> 		   (subreg:V4SI (reg:V16QI 139) 0)
> 		   (subreg:V4SI (reg:V16QI 140) 0))
> 		   [const_int 0 4 1 5]))
> 
> Then combine pass could do the nested vec_select optimization
> in simplify-rtx.c:simplify_binary_operation_1 also on both BE and LE:
> 
> 21: r150:V4SI=vec_select(vec_concat(r141:V4SI,r146:V4SI),parallel [0 4 1 5])
> 24: {r151:SI=vec_select(r150:V4SI,parallel [const_int 3]);}
> 
> =>
> 
> 21: r150:V4SI=vec_select(vec_concat(r141:V4SI,r146:V4SI),parallel)
> 24: {r151:SI=vec_select(r146:V4SI,parallel [const_int 1]);}
> 
> The endianness check need only once at ASM generation finally.
> ASM would be better due to nested vec_select simplified to simple scalar
> load.
> 
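
For reference, the folding described above can be modelled in a few lines
of C++.  This is only a rough sketch of the element arithmetic, not the
simplify-rtx code itself; the names V4, V8, Sel, Src and
fold_nested_select are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using V4 = std::array<int, 4>;
using V8 = std::array<int, 8>;
using Sel = std::array<std::size_t, 4>;

// vec_concat: the elements of a followed by the elements of b.
V8 vec_concat(const V4 &a, const V4 &b) {
  V8 r{};
  for (std::size_t i = 0; i < 4; ++i) {
    r[i] = a[i];
    r[i + 4] = b[i];
  }
  return r;
}

// vec_select: pick the elements of v named by sel, in order.
V4 vec_select(const V8 &v, const Sel &sel) {
  V4 r{};
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = v[sel[i]];
  return r;
}

// Hypothetical helper: which concat operand (0 or 1) and which element of
// it does a one-element outer vec_select end up reading?
struct Src {
  int operand;
  std::size_t elt;
};

Src fold_nested_select(const Sel &inner, std::size_t outer) {
  std::size_t idx = inner[outer];
  return idx < 4 ? Src{0, idx} : Src{1, idx - 4};
}

// The example from the description: element 3 of the [0 4 1 5] merge of
// r141 and r146 is element 1 of r146.
bool check_description_example() {
  const V4 r141{10, 11, 12, 13};
  const V4 r146{20, 21, 22, 23};
  const Sel merge_hi{0, 4, 1, 5};
  const V4 r150 = vec_select(vec_concat(r141, r146), merge_hi);
  const Src s = fold_nested_select(merge_hi, 3);
  return s.operand == 1 && s.elt == 1 && r150[3] == r146[1];
}
```

Nothing in this arithmetic depends on endianness, which is why the same
RTL can be folded on both BE and LE.
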
> Regression tested pass for Power8{LE,BE}{32,64} and Power{9,10}LE{32,64}

Sorry, there was no -m32 testing for LE.  I noticed the attachment in that
PR didn't include the test case (though the changelog has it), so I
re-tested it, and nothing changed.  :)

> Linux(Thanks to Kewen), OK for master?  Or should we revert r12-4496 to
> restore to the UNSPEC implementation?
> 

I have some concerns about the changed "altivec_*_direct" patterns.  IMHO
the suffix "_direct" normally indicates that the define_insn maps directly
to the corresponding hw insn.  With this change, altivec_vmrghb_direct, for
example, can map to either vmrghb or vmrglb, which looks misleading.  Maybe
we can add corresponding _direct_le and _direct_be versions, which both map
to the same insn but have different RTL patterns.  Looking forward to
Segher's and David's suggestions.

> gcc/ChangeLog:
> 	PR target/106069
> 	* config/rs6000/altivec.md (altivec_vmrghb): Emit same native
> 	RTL for BE and LE.
> 	(altivec_vmrghh): Likewise.
> 	(altivec_vmrghw): Likewise.
> 	(*altivec_vmrghsf): Adjust.
> 	(altivec_vmrglb): Likewise.
> 	(altivec_vmrglh): Likewise.
> 	(altivec_vmrglw): Likewise.
> 	(*altivec_vmrglsf): Adjust.
> 	(altivec_vmrghb_direct): Emit different ASM for BE and LE.
> 	(altivec_vmrghh_direct): Likewise.
> 	(altivec_vmrghw_direct_<mode>): Likewise.
> 	(altivec_vmrglb_direct): Likewise.
> 	(altivec_vmrglh_direct): Likewise.
> 	(altivec_vmrglw_direct_<mode>): Likewise.
> 	(vec_widen_smult_hi_v16qi): Adjust.
> 	(vec_widen_smult_lo_v16qi): Adjust.
> 	(vec_widen_umult_hi_v16qi): Adjust.
> 	(vec_widen_umult_lo_v16qi): Adjust.
> 	(vec_widen_smult_hi_v8hi): Adjust.
> 	(vec_widen_smult_lo_v8hi): Adjust.
> 	(vec_widen_umult_hi_v8hi): Adjust.
> 	(vec_widen_umult_lo_v8hi): Adjust.
> 	* config/rs6000/rs6000.cc (altivec_expand_vec_perm_const): Emit same
> 	native RTL for BE and LE.
> 	* config/rs6000/vsx.md (vsx_xxmrghw_<mode>): Likewise.
> 	(vsx_xxmrglw_<mode>): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 	PR target/106069
> 	* gcc.target/powerpc/pr106069.C: New test.
> 
> Signed-off-by: Xionghu Luo <xionghuluo@tencent.com>
> ---
>  gcc/config/rs6000/altivec.md                | 122 ++++++++++++--------
>  gcc/config/rs6000/rs6000.cc                 |  36 +++---
>  gcc/config/rs6000/vsx.md                    |  16 +--
>  gcc/testsuite/gcc.target/powerpc/pr106069.C | 118 +++++++++++++++++++
>  4 files changed, 209 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr106069.C
> 
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index 2c4940f2e21..8d9c0109559 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -1144,11 +1144,7 @@ (define_expand "altivec_vmrghb"
>     (use (match_operand:V16QI 2 "register_operand"))]
>    "TARGET_ALTIVEC"
>  {
> -  rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? gen_altivec_vmrghb_direct
> -						: gen_altivec_vmrglb_direct;
> -  if (!BYTES_BIG_ENDIAN)
> -    std::swap (operands[1], operands[2]);
> -  emit_insn (fun (operands[0], operands[1], operands[2]));
> +  emit_insn (gen_altivec_vmrghb_direct (operands[0], operands[1], operands[2]));
>    DONE;
>  })
>  
> @@ -1167,7 +1163,12 @@ (define_insn "altivec_vmrghb_direct"
>  		     (const_int 6) (const_int 22)
>  		     (const_int 7) (const_int 23)])))]
>    "TARGET_ALTIVEC"
> -  "vmrghb %0,%1,%2"
> +  {
> +     if (BYTES_BIG_ENDIAN)
> +      return "vmrghb %0,%1,%2";
> +    else
> +      return "vmrglb %0,%2,%1";
> + }
>    [(set_attr "type" "vecperm")])
>  
>  (define_expand "altivec_vmrghh"
> @@ -1176,11 +1177,7 @@ (define_expand "altivec_vmrghh"
>     (use (match_operand:V8HI 2 "register_operand"))]
>    "TARGET_ALTIVEC"
>  {
> -  rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? gen_altivec_vmrghh_direct
> -						: gen_altivec_vmrglh_direct;
> -  if (!BYTES_BIG_ENDIAN)
> -    std::swap (operands[1], operands[2]);
> -  emit_insn (fun (operands[0], operands[1], operands[2]));
> +  emit_insn (gen_altivec_vmrghh_direct (operands[0], operands[1], operands[2]));
>    DONE;
>  })
>  
> @@ -1195,7 +1192,12 @@ (define_insn "altivec_vmrghh_direct"
>  		     (const_int 2) (const_int 10)
>  		     (const_int 3) (const_int 11)])))]
>    "TARGET_ALTIVEC"
> -  "vmrghh %0,%1,%2"
> +  {
> +     if (BYTES_BIG_ENDIAN)
> +      return "vmrghh %0,%1,%2";
> +    else
> +      return "vmrglh %0,%2,%1";
> + }
>    [(set_attr "type" "vecperm")])
>  
>  (define_expand "altivec_vmrghw"
> @@ -1204,12 +1206,8 @@ (define_expand "altivec_vmrghw"
>     (use (match_operand:V4SI 2 "register_operand"))]
>    "VECTOR_MEM_ALTIVEC_P (V4SImode)"
>  {
> -  rtx (*fun) (rtx, rtx, rtx);
> -  fun = BYTES_BIG_ENDIAN ? gen_altivec_vmrghw_direct_v4si
> -			 : gen_altivec_vmrglw_direct_v4si;
> -  if (!BYTES_BIG_ENDIAN)
> -    std::swap (operands[1], operands[2]);
> -  emit_insn (fun (operands[0], operands[1], operands[2]));
> +  emit_insn (
> +    gen_altivec_vmrghw_direct_v4si (operands[0], operands[1], operands[2]));
>    DONE;
>  })
>  
[snip]
>    [(set_attr "type" "vecperm")])
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106069.C b/gcc/testsuite/gcc.target/powerpc/pr106069.C
> new file mode 100644
> index 00000000000..56219a74692
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106069.C

Since this is a C++ test case, it should be placed in gcc/testsuite/g++.target/powerpc/.

> @@ -0,0 +1,118 @@
> +/* { dg-do run } */

This case requires altivec, so it needs something like:

/* { dg-require-effective-target vmx_hw } */
/* { dg-options "-maltivec" } */

BR,
Kewen

> +
> +extern "C" void *
> +memcpy (void *, const void *, unsigned long);
> +typedef __attribute__ ((altivec (vector__))) unsigned native_simd_type;
> +
> +union
> +{
> +  native_simd_type V;
> +  int R[4];
> +} store_le_vec;
> +
> +struct S
> +{
> +  S () = default;
> +  S (unsigned B0)
> +  {
> +    native_simd_type val{B0};
> +    m_simd = val;
> +  }
> +  void store_le (unsigned int out[])
> +  {
> +    store_le_vec.V = m_simd;
> +    unsigned int x0 = store_le_vec.R[0];
> +    memcpy (out, &x0, 1);
> +  }
> +  S rotl (unsigned int r)
> +  {
> +    native_simd_type rot{r};
> +    return __builtin_vec_rl (m_simd, rot);
> +  }
> +  void operator+= (S other)
> +  {
> +    m_simd = __builtin_vec_add (m_simd, other.m_simd);
> +  }
> +  void operator^= (S other)
> +  {
> +    m_simd = __builtin_vec_xor (m_simd, other.m_simd);
> +  }
> +  static void transpose (S &B0, S B1, S B2, S B3)
> +  {
> +    native_simd_type T0 = __builtin_vec_mergeh (B0.m_simd, B2.m_simd);
> +    native_simd_type T1 = __builtin_vec_mergeh (B1.m_simd, B3.m_simd);
> +    native_simd_type T2 = __builtin_vec_mergel (B0.m_simd, B2.m_simd);
> +    native_simd_type T3 = __builtin_vec_mergel (B1.m_simd, B3.m_simd);
> +    B0 = __builtin_vec_mergeh (T0, T1);
> +    B3 = __builtin_vec_mergel (T2, T3);
> +  }
> +  S (native_simd_type x) : m_simd (x) {}
> +  native_simd_type m_simd;
> +};
> +
> +void
> +foo (unsigned int output[], unsigned state[])
> +{
> +  S R00 = state[0];
> +  S R01 = state[0];
> +  S R02 = state[2];
> +  S R03 = state[0];
> +  S R05 = state[5];
> +  S R06 = state[6];
> +  S R07 = state[7];
> +  S R08 = state[8];
> +  S R09 = state[9];
> +  S R10 = state[10];
> +  S R11 = state[11];
> +  S R12 = state[12];
> +  S R13 = state[13];
> +  S R14 = state[4];
> +  S R15 = state[15];
> +  for (int r = 0; r != 10; ++r)
> +    {
> +      R09 += R13;
> +      R11 += R15;
> +      R05 ^= R09;
> +      R06 ^= R10;
> +      R07 ^= R11;
> +      R07 = R07.rotl (7);
> +      R00 += R05;
> +      R01 += R06;
> +      R02 += R07;
> +      R15 ^= R00;
> +      R12 ^= R01;
> +      R13 ^= R02;
> +      R00 += R05;
> +      R01 += R06;
> +      R02 += R07;
> +      R15 ^= R00;
> +      R12 = R12.rotl (8);
> +      R13 = R13.rotl (8);
> +      R10 += R15;
> +      R11 += R12;
> +      R08 += R13;
> +      R09 += R14;
> +      R05 ^= R10;
> +      R06 ^= R11;
> +      R07 ^= R08;
> +      R05 = R05.rotl (7);
> +      R06 = R06.rotl (7);
> +      R07 = R07.rotl (7);
> +    }
> +  R00 += state[0];
> +  S::transpose (R00, R01, R02, R03);
> +  R00.store_le (output);
> +}
> +
> +unsigned int res[1];
> +unsigned main_state[]{1634760805, 60878,      2036477234, 6,
> +		      0,	  825562964,  1471091955, 1346092787,
> +		      506976774,  4197066702, 518848283,  118491664,
> +		      0,	  0,	      0,	  0};
> +int
> +main ()
> +{
> +  foo (res, main_state);
> +  if (res[0] != 0x41fcef98)
> +    __builtin_abort ();
> +}