[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

Sat Mar 10 23:35:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #7 from Jeffrey Walton <noloader at gmail dot com> ---
(In reply to Bill Schmidt from comment #4)
> ...
> 
> The best performance will be achieved by writing this loop entirely using
> inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
> swaps), thus in "big-endian element order" (element 0 in the high-order
> position of the register).  Because of the big-endian nature of vshasigmaw,
> this is always going to be the best approach.

Thanks Bill.

We are working on your lxvd2x suggestion using inline assembly.

Related, see "GCC vec_xl_be replacement using inline assembly",
https://stackoverflow.com/q/49215090/608639.

-----

I'm not sure if I am doing something wrong, or this is a new issue:

$ cat test.cxx
...

typedef __vector unsigned int  uint32x4_p8;

uint32x4_p8 VEC_XL_BE(const uint8_t* data, int offset)
{
#if defined(__xlc__) || defined(__xlC__)
  return (uint32x4_p8)vec_xl_be(offset, (uint8_t*)data);
#else
  uint32x4_p8 res;
  __asm(" lxvd2x  %x0, %1, %2    \n\t"
        : "=wa" (res)
        : "g" (data), "g" (offset));
  return res;

#endif
}

When I use VEC_XL_BE in real life it results in:

$ g++ -DTEST_MAIN -g3 -O3 -mcpu=power8 sha256-p8.cxx -o sha256-p8.exe
/home/noloader/tmp/ccbDnfFr.s: Assembler messages:
/home/noloader/tmp/ccbDnfFr.s:758: Error: operand out of range (32 is not
between 0 and 31)
/home/noloader/tmp/ccbDnfFr.s:983: Error: operand out of range (48 is not
between 0 and 31)