Bug 88531 - Index data types when targeting AVX-512 vectorization with gather/scatter
Summary: Index data types when targeting AVX-512 vectorization with gather/scatter
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.2.0
: P3 enhancement
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2018-12-17 12:09 UTC by Florian Schornbaum
Modified: 2024-10-11 00:37 UTC (History)
3 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-12-17 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Florian Schornbaum 2018-12-17 12:09:07 UTC
Hi,

I realized that GCC fails to vectorize simple loops if there are indirect loads (or stores) and the index used for the indirect access doesn't match a very small subset of possible integer data types. I'm targeting AVX-512. This is the MWE (only an indirect load, but a direct store):

==============================
#include <cstdint>

using loop_t = uint32_t;
using idx_t = uint32_t;

void loop(double * const __restrict__ dst,
          double const * const __restrict__ src,
          idx_t const * const __restrict__ idx,
          loop_t const begin,
          loop_t const end)
{
    for (loop_t i = begin; i < end; ++i)
    {
        dst[i] = 42.0 * src[idx[i]];
    }
}
==============================
See: https://godbolt.org/z/Ps-sOv

This only vectorizes if idx_t is int32_t, int64_t, or uint64_t.

My suspicion is this goes back to the gather/scatter instructions of AVX-512 that come in two flavors: with 32 and 64 bit signed integers for the indices.
Unsigned 64 bit probably works (on a 64 bit architecture) because it looks like it's just treated as a signed 64 bit value, which probably is due to (from the documentation):
"... The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored. ..."

Unfortunately, for int16_t, uint16_t, and uint32_t, this does not vectorize. Although the 32 bit version of gather/scatter could be used -- with proper zero padding -- for int16_t and uint16_t. Likewise, the 64 bit version could be used with indices of type uint32_t.

Although the code example only uses idx[i] for loading, it appears to be the exact same issue when using idx[i] for storing (meaning: when scatter would be required).

Are there any plans to get this working?
Or did I maybe miss something and this should already work?

Many thanks in advance

Florian
Comment 1 Richard Biener 2018-12-17 12:23:39 UTC
There were recent updates to scatter/gather support for AVX512 and GCC 9.  I'm not sure that generic widening is done though.
Comment 2 Jakub Jelinek 2018-12-17 12:32:24 UTC
I think I haven't changed anything on this.  AVX2 and AVX512{F,VL} gathers and scatters only do signed widening themselves.  Thus, if the index is uint32_t and -m64, the only hope would be to perform vectorization conversion of it to uint64_t/int64_t vector first and then use gathers or scatters with V*DImode indexes.
Comment 3 Florian Schornbaum 2018-12-17 15:47:54 UTC
Thank you for your very quick replies!

I'm aware of 88464 (I think this is the recent work you are referring to?), but this had no effect on the index data type issue I was describing.

Even if the gathers/scatters do widening themselves, my code example is not vectorized when using int16_t. Probably because no gather/scatter is created by GCC in the first place?

As for uint32_t with -m64 (= unsgined int on x86-64, and sadly the problem that we are facing): I'm aware that manually transforming the index array from uint32_t to int64_t is a solution, but one that comes at a cost for us.
Looking at clang, they use "vpmovzxdq" when loading the data. Which is the only difference to the int64_t/uint64_t version, which uses a different load.

Are there any plans for GCC to make these "unfitting" index data types work with AVX-512 gathers/scatters?
Comment 4 Florian Schornbaum 2019-01-21 13:05:15 UTC
Hi Jakub, Richard,

I hope you both had a good start into 2019.

I'm still wondering if there are any plans to make arbitrary index data types work with gather/scatter?

If there are no such plans at the moment, we will work around this issue on our side.
Comment 5 Jakub Jelinek 2019-01-21 13:27:21 UTC
Certainly not in the immediate future, GCC 9 is now in stage4 and will remains in regression bugfixing mode until usually mid April.  Even after that, I'll be busy with OpenMP 5, but somebody else could implement it too.
Comment 6 Florian Schornbaum 2019-01-23 10:02:32 UTC
Thanks Jakub. That's good information to have.

We would certainly be willing to help since this is something that we would really like GCC to be able to handle.

Does it make sense for us, as developers that have never been involved in GCC development, to have a look if you give as some pointers on where to look?
Comment 7 Richard Biener 2019-01-23 10:06:07 UTC
(In reply to Florian Schornbaum from comment #6)
> Thanks Jakub. That's good information to have.
> 
> We would certainly be willing to help since this is something that we would
> really like GCC to be able to handle.
> 
> Does it make sense for us, as developers that have never been involved in
> GCC development, to have a look if you give as some pointers on where to
> look?

Since Mentor and thus CodeSourcery is now part of the Siemens family you
could even get support for this from various GCC maintainers in-house ;)
Comment 8 Florian Schornbaum 2019-01-23 13:55:52 UTC
They are definitely a good source to ask.
We'll try to get in contact with them and see if we can get help/insight.

Thanks for all your input so far!
Comment 9 Hongtao.liu 2021-08-06 07:23:54 UTC
I notice this testcase can be vectorized w/ gcc version 12.0.0 20210805 (experimental) (GCC) 

Guess it's related to Richi's https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576527.html
Comment 10 H.J. Lu 2021-08-06 13:01:24 UTC
It is fixed by r12-2733.
Comment 11 Richard Biener 2021-08-06 13:29:11 UTC
OK, that probably was an unintended side-effect of now doing

          /* Include the conversion if it is widening and we're using
             the IFN path or the target can handle the converted from
             offset or the current size is not already the same as the
             data vector element size.  */
          if ((TYPE_PRECISION (TREE_TYPE (op0))
               < TYPE_PRECISION (TREE_TYPE (off)))
              && (use_ifn_p
                  || (DR_IS_READ (dr)
                      ? (targetm.vectorize.builtin_gather
                         && targetm.vectorize.builtin_gather (vectype,
                                                              TREE_TYPE (op0),
                                                              scale))
                      : (targetm.vectorize.builtin_scatter
                         && targetm.vectorize.builtin_scatter (vectype,
                                                               TREE_TYPE (op0),
                                                               scale)))
                  || !operand_equal_p (TYPE_SIZE (TREE_TYPE (off)),
                                       TYPE_SIZE (TREE_TYPE (vectype)), 0)))
            {
              off = op0;
              offtype = TREE_TYPE (off);
              STRIP_NOPS (off);
              continue;
            }

that is we no longer try to consume the conversion because with the conversion
source the gather is not supported and the offset is also already of the
size of the data.

We should probably add this testcase to make sure any other heuristic improvements in the above code doesn't break it again.
Comment 12 H.J. Lu 2021-08-06 13:29:39 UTC
For some reason,

-march=x86-64 -mx32

and

-march=x86-64 -m32 -mfpmath=sse

won't vectorize the loop.
Comment 13 H.J. Lu 2021-08-06 13:31:54 UTC
Here is the equivalent C code:

---
#include <stdint.h>

#define loop_t uint32_t
#define idx_t uint32_t

void loop(double * const __restrict__ dst,
          double const * const __restrict__ src,
          idx_t const * const __restrict__ idx,
          loop_t const begin,
          loop_t const end)
{
  for (loop_t i = begin; i < end; ++i)
    dst[i] = 42.0 * src[idx[i]];
}
---
Comment 14 GCC Commits 2021-08-07 14:35:16 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:6866f4819ad8e6e62fef2177520f9fb217dfa353

commit r12-2795-g6866f4819ad8e6e62fef2177520f9fb217dfa353
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sat Aug 7 07:29:04 2021 -0700

    Add tests for PR tree-optimization/88531
    
            PR tree-optimization/88531
            * gcc.target/i386/pr88531-1a.c: New test.
            * gcc.target/i386/pr88531-1b.c: Likewise.
            * gcc.target/i386/pr88531-1c.c: Likewise.
            * gcc.target/i386/pr88531-2a.c: Likewise.
            * gcc.target/i386/pr88531-2b.c: Likewise.
            * gcc.target/i386/pr88531-2c.c: Likewise.
Comment 15 H.J. Lu 2021-08-07 14:42:50 UTC
Fixed for GCC 12.