[Bug rtl-optimization/84753] New: GCC does not fold xxswapd followed by vperm

Wed Mar 7 21:19:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

            Bug ID: 84753
           Summary: GCC does not fold xxswapd followed by vperm
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: noloader at gmail dot com
  Target Milestone: ---

I'm working on GCC112 from the compile farm. It is ppc64-le machine. It has
both GCC 4.8.5 and GCC 7.2.0 installed. The issue is present on both.

We are trying to recover missing 1 to 2 cpb performance when using Power8 SHA
built-ins. Part of the code to load a message into the message schedule looks
like so:

   uint8_t msg[64] = {...};
   __vector unsigned char mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};

   __vector unsigned int t = vec_vsx_ld(0, msg);
   t = vec_perm(t, t, mask);

When I compile at -O3 and disassemble it, I see:

    100008bc:   99 26 20 7c     lxvd2x  vs33,0,r4
    ...
    100008d0:   57 0a 21 f0     xxswapd vs33,vs33
    100008d8:   2b 08 21 10     vperm   v1,v1,v1,v0

Calling xxswapd followed by vperm seems to be a lot like calling shuffle_epi32
followed by shuffle_epi8 on an x86 machine. It feels like the two permutes
should be folded into one.

On x86 I would manually fold the two shuffles. On PPC I cannot because xxswapd
is generated as part of the load, and then I call vperm. I have not figured out
how to avoid the xxswapd. (I even tried to issue my own xxswapd to cancel out
the one being generated by the compiler).

**********

Here's a minimal case, but the optimizer is removing the code of interest. The
real code suffers it, and it can be found at
https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-p8.cxx .

$ cat test.cxx
#include <stdint.h>
#if defined(__ALTIVEC__)
# include <altivec.h>
# undef vector
# undef pixel
# undef bool
#endif

typedef __vector unsigned char uint8x16_p8;
typedef __vector unsigned int  uint32x4_p8;

// Unaligned load
template <class T> static inline
uint32x4_p8 VectorLoad32x4u(const T* data, int offset)
{
  return vec_vsx_ld(offset, (uint32_t*)data);
}

// Unaligned store
template <class T> static inline
void VectorStore32x4u(const uint32x4_p8 val, T* data, int offset)
{
  vec_vsx_st(val, offset, (uint32_t*)data);
}

static inline
uint32x4_p8 VectorPermute32x4(const uint32x4_p8 val, const uint8x16_p8 mask)
{
  return (uint32x4_p8)vec_perm(val, val, mask);
}

int main(int argc, char* argv[])
{
  uint8_t M[64];
  uint32_t W[64];

  uint8_t* m = M;
  uint32_t* w = W;

  const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
  for (unsigned int i=0; i<16; i+=4, m+=4, w+=4)
    VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(m, 0), mask), w, 0);

  return 0;
}