g++ relies too much on slow AVX vinsertf128 on Haswell
Tim Prince via gcc-help
gcc-help@gcc.gnu.org
Mon Feb 27 12:02:00 GMT 2017
On 2/27/2017 12:26 AM, Yifei wrote:
> Hi there,
> I have a class around __m256d representing a 3-d vector. Yet for comparison I also have a legacy array version made of double[3].
> The AVX version looks like this:
> struct vector {
> __m256d V;
> // ctor
> vector operator+(const vector& rhs) const {
> return {V[0] + rhs.V[0] ... V[2] + rhs.V[2]};
> }
> vector abs() const {
> // clear sign bit vandpd
> }
> };
> I compile with -O2 -march=native. For the vector version, g++ emits a lot of vextractf128 and vinsertf128, and, ugh, I don't understand why that's necessary. But so far, okay. For the scalar version, g++ simply emits vmovsd and vaddsd.
>
> The thing is, the scalar version seems to work better than the AVX vector version, so I'm turning back to the scalar one.
> I tried explicit intrinsics and reinterpret_cast (the latter is even worse: a lot of slow instructions and vunpcklpd, around 6x slower). Both failed: g++ persists in blindly relying on vextractf128, which is incredibly slow and performs around 2x worse than the scalar version. I even tried inline assembly, but I don't really know how to manually allocate stack space in assembly.
>
> I also tried vaddpd directly, yet it's slightly slower than the scalar version as well, possibly because of slightly more memory access, but I'm not sure.
>
If you are using unaligned data and targeting a CPU older than Haswell,
performance problems with 256-bit memory access and need for vinsertf128
and vextractf128 are expected.
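
For illustration only, here is a minimal sketch of a 32-byte-aligned
3-d vector that adds whole registers instead of individual lanes. It
assumes the GCC/Clang vector extensions; the names v4d, v4i and vec3
are hypothetical and not taken from the original class. Built with AVX
enabled, the addition should reduce to a single vaddpd and the abs() to
a vandpd, although the exact code still depends on compiler version and
tuning.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical names; not the original poster's class.
// GCC/Clang vector extension: four lanes packed in one 32-byte value.
typedef double v4d __attribute__((vector_size(32)));
typedef std::int64_t v4i __attribute__((vector_size(32)));

struct alignas(32) vec3 {
    v4d V;  // lanes 0..2 hold x, y, z; lane 3 is padding

    // Add the whole register at once rather than lane by lane,
    // so no per-element extract/insert is requested.
    vec3 operator+(const vec3& rhs) const { return vec3{V + rhs.V}; }

    // Clear the sign bit of every lane with a bitwise AND.
    vec3 abs() const {
        const std::int64_t m = 0x7fffffffffffffffLL;
        const v4i mask = {m, m, m, m};
        v4i bits;
        std::memcpy(&bits, &V, sizeof bits);  // reinterpret the bits safely
        bits = bits & mask;
        vec3 out;
        std::memcpy(&out.V, &bits, sizeof bits);
        return out;
    }
};

static_assert(alignof(vec3) == 32, "vec3 should be 32-byte aligned");
```

Keeping the padding lane at zero avoids carrying garbage through the
arithmetic; whether this beats the scalar double[3] version has to be
measured on the target machine.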
--
Tim Prince