[Bug rtl-optimization/84753] New: GCC does not fold xxswapd followed by vperm
noloader at gmail dot com
gcc-bugzilla@gcc.gnu.org
Wed Mar 7 21:19:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753
Bug ID: 84753
Summary: GCC does not fold xxswapd followed by vperm
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: noloader at gmail dot com
Target Milestone: ---
I'm working on GCC112 from the compile farm. It is a ppc64le machine with
both GCC 4.8.5 and GCC 7.2.0 installed. The issue is present with both.
We are trying to recover 1 to 2 cycles-per-byte (cpb) of missing performance
when using the Power8 SHA built-ins. Part of the code that loads a message
into the message schedule looks like this:
uint8_t msg[64] = {...};
__vector unsigned char mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
__vector unsigned int t = vec_vsx_ld(0, msg);
t = vec_perm(t, t, mask);
When I compile at -O3 and disassemble it, I see:
100008bc: 99 26 20 7c lxvd2x vs33,0,r4
...
100008d0: 57 0a 21 f0 xxswapd vs33,vs33
100008d8: 2b 08 21 10 vperm v1,v1,v1,v0
Calling xxswapd followed by vperm looks a lot like calling _mm_shuffle_epi32
followed by _mm_shuffle_epi8 on an x86 machine. It feels like the two permutes
should be folded into one.
On x86 I would fold the two shuffles by hand. On PPC I cannot, because the
xxswapd is generated as part of the load, and then I call vec_perm. I have not
figured out how to avoid the xxswapd. (I even tried issuing my own xxswapd to
cancel out the one generated by the compiler.)
**********
Here's a minimal case, though the optimizer removes the code of interest. The
real code, which suffers from the issue, can be found at
https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-p8.cxx .
$ cat test.cxx
#include <stdint.h>
#if defined(__ALTIVEC__)
# include <altivec.h>
# undef vector
# undef pixel
# undef bool
#endif
typedef __vector unsigned char uint8x16_p8;
typedef __vector unsigned int uint32x4_p8;
// Unaligned load
template <class T> static inline
uint32x4_p8 VectorLoad32x4u(const T* data, int offset)
{
    return vec_vsx_ld(offset, (uint32_t*)data);
}

// Unaligned store
template <class T> static inline
void VectorStore32x4u(const uint32x4_p8 val, T* data, int offset)
{
    vec_vsx_st(val, offset, (uint32_t*)data);
}

static inline
uint32x4_p8 VectorPermute32x4(const uint32x4_p8 val, const uint8x16_p8 mask)
{
    return (uint32x4_p8)vec_perm(val, val, mask);
}

int main(int argc, char* argv[])
{
    uint8_t M[64];
    uint32_t W[64];
    uint8_t* m = M;
    uint32_t* w = W;
    const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    // Each iteration loads and stores 16 bytes (4 words).
    for (unsigned int i=0; i<16; i+=4, m+=16, w+=4)
        VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(m, 0), mask), w, 0);
    return 0;
}