Hi, I realized that GCC fails to vectorize simple loops if there are indirect loads (or stores) and the index used for the indirect access doesn't match a very small subset of possible integer data types. I'm targeting AVX-512. This is the MWE (only an indirect load, but a direct store): ============================== #include <cstdint> using loop_t = uint32_t; using idx_t = uint32_t; void loop(double * const __restrict__ dst, double const * const __restrict__ src, idx_t const * const __restrict__ idx, loop_t const begin, loop_t const end) { for (loop_t i = begin; i < end; ++i) { dst[i] = 42.0 * src[idx[i]]; } } ============================== See: https://godbolt.org/z/Ps-sOv This only vectorizes if idx_t is int32_t, int64_t, or uint64_t. My suspicion is this goes back to the gather/scatter instructions of AVX-512 that come in two flavors: with 32 and 64 bit signed integers for the indices. Unsigned 64 bit probably works (on a 64 bit architecture) because it looks like it's just treated as a signed 64 bit value, which probably is due to (from the documentation): "... The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored. ..." Unfortunately, for int16_t, uint16_t, and uint32_t, this does not vectorize. Although the 32 bit version of gather/scatter could be used -- with proper zero padding -- for int16_t and uint16_t. Likewise, the 64 bit version could be used with indices of type uint32_t. Although the code example only uses idx[i] for loading, it appears to be the exact same issue when using idx[i] for storing (meaning: when scatter would be required). Are there any plans to get this working? Or did I maybe miss something and this should already work? Many thanks in advance Florian
There were recent updates to scatter/gather support for AVX512 and GCC 9. I'm not sure that generic widening is done though.
I think I haven't changed anything on this. AVX2 and AVX512{F,VL} gathers and scatters only do signed widening themselves. Thus, if the index is uint32_t and -m64, the only hope would be to perform vectorization conversion of it to uint64_t/int64_t vector first and then use gathers or scatters with V*DImode indexes.
Thank you for your very quick replies! I'm aware of 88464 (I think this is the recent work you are referring to?), but this had no effect on the index data type issue I was describing. Even if the gathers/scatters do widening themselves, my code example is not vectorized when using int16_t. Probably because no gather/scatter is created by GCC in the first place? As for uint32_t with -m64 (= unsgined int on x86-64, and sadly the problem that we are facing): I'm aware that manually transforming the index array from uint32_t to int64_t is a solution, but one that comes at a cost for us. Looking at clang, they use "vpmovzxdq" when loading the data. Which is the only difference to the int64_t/uint64_t version, which uses a different load. Are there any plans for GCC to make these "unfitting" index data types work with AVX-512 gathers/scatters?
Hi Jakub, Richard, I hope you both had a good start into 2019. I'm still wondering if there are any plans to make arbitrary index data types work with gather/scatter? If there are no such plans at the moment, we will work around this issue on our side.
Certainly not in the immediate future, GCC 9 is now in stage4 and will remains in regression bugfixing mode until usually mid April. Even after that, I'll be busy with OpenMP 5, but somebody else could implement it too.
Thanks Jakub. That's good information to have. We would certainly be willing to help since this is something that we would really like GCC to be able to handle. Does it make sense for us, as developers that have never been involved in GCC development, to have a look if you give as some pointers on where to look?
(In reply to Florian Schornbaum from comment #6) > Thanks Jakub. That's good information to have. > > We would certainly be willing to help since this is something that we would > really like GCC to be able to handle. > > Does it make sense for us, as developers that have never been involved in > GCC development, to have a look if you give as some pointers on where to > look? Since Mentor and thus CodeSourcery is now part of the Siemens family you could even get support for this from various GCC maintainers in-house ;)
They are definitely a good source to ask. We'll try to get in contact with them and see if we can get help/insight. Thanks for all your input so far!
I notice this testcase can be vectorized w/ gcc version 12.0.0 20210805 (experimental) (GCC) Guess it's related to Richi's https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576527.html
It is fixed by r12-2733.
OK, that probably was an unintended side-effect of now doing /* Include the conversion if it is widening and we're using the IFN path or the target can handle the converted from offset or the current size is not already the same as the data vector element size. */ if ((TYPE_PRECISION (TREE_TYPE (op0)) < TYPE_PRECISION (TREE_TYPE (off))) && (use_ifn_p || (DR_IS_READ (dr) ? (targetm.vectorize.builtin_gather && targetm.vectorize.builtin_gather (vectype, TREE_TYPE (op0), scale)) : (targetm.vectorize.builtin_scatter && targetm.vectorize.builtin_scatter (vectype, TREE_TYPE (op0), scale))) || !operand_equal_p (TYPE_SIZE (TREE_TYPE (off)), TYPE_SIZE (TREE_TYPE (vectype)), 0))) { off = op0; offtype = TREE_TYPE (off); STRIP_NOPS (off); continue; } that is we no longer try to consume the conversion because with the conversion source the gather is not supported and the offset is also already of the size of the data. We should probably add this testcase to make sure any other heuristic improvements in the above code doesn't break it again.
For some reason, -march=x86-64 -mx32 and -march=x86-64 -m32 -mfpmath=sse won't vectorize the loop.
Here is the equivalent C code: --- #include <stdint.h> #define loop_t uint32_t #define idx_t uint32_t void loop(double * const __restrict__ dst, double const * const __restrict__ src, idx_t const * const __restrict__ idx, loop_t const begin, loop_t const end) { for (loop_t i = begin; i < end; ++i) dst[i] = 42.0 * src[idx[i]]; } ---
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>: https://gcc.gnu.org/g:6866f4819ad8e6e62fef2177520f9fb217dfa353 commit r12-2795-g6866f4819ad8e6e62fef2177520f9fb217dfa353 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sat Aug 7 07:29:04 2021 -0700 Add tests for PR tree-optimization/88531 PR tree-optimization/88531 * gcc.target/i386/pr88531-1a.c: New test. * gcc.target/i386/pr88531-1b.c: Likewise. * gcc.target/i386/pr88531-1c.c: Likewise. * gcc.target/i386/pr88531-2a.c: Likewise. * gcc.target/i386/pr88531-2b.c: Likewise. * gcc.target/i386/pr88531-2c.c: Likewise.
Fixed for GCC 12.