g++: Suboptimal code generation for simple wrapper class around vector data type

Mon Mar 22 14:34:22 GMT 2021

Hi,

The attached test case is the (slightly simplified) hot loop from a
library for spherical harmonic transforms.
The code is explicitly vectorized, and I try to use simple wrapper
classes around the primitive vector types (like __m256d) to simplify
operations such as initialization with a scalar.

However, it seems that using the wrapper type inside the critical loop
causes g++ to produce suboptimal code. This can be seen by running

g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc

and inspecting the generated assembly code (I'm using gcc 10.2.1).
The version where I use the wrapper type even in the hot loop (i.e.
"foo<Tvsimple, 2>") has a few unnecessary "vmovapd" instructions before
the end of the loop body, which are missing in the version where I cast
to __m256d before doing the heavy computation (i.e. "foo<__m256d,2>").
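
For readers without the attachment, the setup described above can be
sketched roughly as follows. This is NOT the code from testcase.cc: the
wrapper body, the function name "kernel" and the check function
"wrapper_matches_raw" are made up for illustration, and the GCC/Clang
vector extension type "Tvraw" stands in for __m256d so that the sketch
also builds without AVX flags. It only shows the pattern of one
templated loop instantiated once with the wrapper and once with the raw
vector type; it is not claimed to reproduce the extra "vmovapd"
instructions.

```cpp
#include <cstddef>

using std::size_t;

// Stand-in for __m256d (assumption): four packed doubles via the GCC/Clang
// vector extension, so the sketch also builds without AVX flags.
typedef double Tvraw __attribute__((vector_size(32)));

// Hypothetical minimal wrapper; the real Tvsimple in testcase.cc may differ.
struct Tvsimple
  {
  Tvraw v;

  Tvsimple() = default;
  explicit Tvsimple(const Tvraw &v_) : v(v_) {}
  // Broadcast one scalar into all four lanes.
  explicit Tvsimple(double val) : v(Tvraw{val, val, val, val}) {}

  Tvsimple operator*(const Tvsimple &o) const { return Tvsimple(v*o.v); }
  Tvsimple &operator+=(const Tvsimple &o) { v += o.v; return *this; }
  };

// Multiply-accumulate loop, generic over the vector type Tv (hypothetical,
// not the original foo). nv is the number of independent accumulators.
template<typename Tv, size_t nv>
void kernel(const Tv *a, const Tv *b, Tv *acc, size_t n)
  {
  Tv sum[nv];
  for (size_t j=0; j<nv; ++j) sum[j] = acc[j];
  for (size_t i=0; i<n; ++i)
    for (size_t j=0; j<nv; ++j)
      sum[j] += a[i*nv+j] * b[i*nv+j];
  for (size_t j=0; j<nv; ++j) acc[j] = sum[j];
  }

// Explicit instantiations, so both variants show up in the -S output.
template void kernel<Tvsimple, 2>(const Tvsimple *, const Tvsimple *,
                                  Tvsimple *, size_t);
template void kernel<Tvraw, 2>(const Tvraw *, const Tvraw *,
                               Tvraw *, size_t);

// Runs both instantiations on the same data and compares every lane.
bool wrapper_matches_raw()
  {
  constexpr size_t nv = 2, n = 8;
  Tvraw ra[n*nv], rb[n*nv], racc[nv];
  Tvsimple wa[n*nv], wb[n*nv], wacc[nv];
  for (size_t i=0; i<n*nv; ++i)
    {
    const double x = double(i);
    ra[i] = Tvraw{x, 2*x, 3*x, 4*x};
    rb[i] = Tvraw{0.5, 0.25, 2.0, 1.0};
    wa[i] = Tvsimple(ra[i]);
    wb[i] = Tvsimple(rb[i]);
    }
  for (size_t j=0; j<nv; ++j)
    {
    racc[j] = Tvraw{0.0, 0.0, 0.0, 0.0};
    wacc[j] = Tvsimple(0.0);
    }
  kernel<Tvraw, nv>(ra, rb, racc, n);
  kernel<Tvsimple, nv>(wa, wb, wacc, n);
  for (size_t j=0; j<nv; ++j)
    for (int k=0; k<4; ++k)
      if (racc[j][k] != wacc[j].v[k]) return false;
  return true;
  }
```

Compiling such a file with the command line given above and comparing
the loop bodies of the two instantiations is the kind of inspection I
mean. Both variants compute the same values, so any difference in the
generated code would come from the wrapper alone.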

My suspicion is that the "Tvsimple" type is somehow not completely POD
and that this prevents g++ from optimizing more aggressively. On the
other hand, clang++ produces identical code for both versions, which is
comparable in speed to the faster version generated by g++.
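
One way to test the POD suspicion at compile time is with the standard
type traits. The sketch below uses made-up classes, not the ones from
testcase.cc, and again uses a vector extension type "Tvraw" as a
stand-in for __m256d. Since C++11, "POD" means trivial plus standard
layout, so those properties are checked separately. The second class
shows the one thing I would expect to change the result: a
user-provided default constructor makes the type non-trivial, although
it stays trivially copyable.

```cpp
#include <type_traits>

// Stand-in for __m256d (assumption), usable without AVX flags.
typedef double Tvraw __attribute__((vector_size(32)));

// Hypothetical wrapper with a defaulted default constructor.
struct Tvsimple
  {
  Tvraw v;
  Tvsimple() = default;
  explicit Tvsimple(double val) : v(Tvraw{val, val, val, val}) {}
  };

// Hypothetical variant with a user-provided default constructor.
struct Tvuserctor
  {
  Tvraw v;
  Tvuserctor() {}
  explicit Tvuserctor(double val) : v(Tvraw{val, val, val, val}) {}
  };

// The defaulted variant satisfies all three properties.
static_assert(std::is_trivially_copyable<Tvsimple>::value,
              "Tvsimple should be trivially copyable");
static_assert(std::is_standard_layout<Tvsimple>::value,
              "Tvsimple should be standard layout");
static_assert(std::is_trivial<Tvsimple>::value,
              "Tvsimple should be trivial");

// The user-provided constructor removes triviality only.
static_assert(std::is_trivially_copyable<Tvuserctor>::value,
              "Tvuserctor should still be trivially copyable");
static_assert(!std::is_trivial<Tvuserctor>::value,
              "Tvuserctor should not be trivial");
```

If the real "Tvsimple" passes the same checks, the type is POD in the
language sense. Whether that property is what g++'s optimizer actually
keys on is a separate question, which these checks cannot answer.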

Is g++ missing an opportunity to optimize here? If so, is there a way to
alter the "Tvsimple" class so that it doesn't stop g++ from optimizing?

Thanks,
  Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testcase.cc
Type: text/x-c++src
Size: 2513 bytes
Desc: not available
URL: <https://gcc.gnu.org/pipermail/gcc-help/attachments/20210322/0747131c/attachment.bin>