[PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512

Richard Biener richard.guenther@gmail.com
Thu Aug 27 13:07:59 GMT 2020


On Thu, Aug 27, 2020 at 2:25 PM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> full vector loads?
>
> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean
> actually larger .rodata.  If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.

IIRC I once implemented this (re-using vector constant components
for non-vector pool entries) but it was quite hackish and never merged
it seems.

Richard.

>         Jakub
>


More information about the Gcc-patches mailing list