[PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
Richard Biener
richard.guenther@gmail.com
Thu Aug 27 13:07:59 GMT 2020
On Thu, Aug 27, 2020 at 2:25 PM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > + subrtx_ptr_iterator::array_type array;
> > + FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > + {
> > + rtx *loc = *iter;
> > + rtx x = *loc;
> > + rtx broadcast_mem, vec_dup, constant, first;
> > + machine_mode mode;
> > + if (GET_CODE (x) != MEM
>
> MEM_P
>
> > + || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > + || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > + continue;
> > +
> > + mode = GET_MODE (x);
> > + if (!VECTOR_MODE_P (mode))
> > + return;
> > +
> > + constant = get_pool_constant (XEXP (x, 0));
> > + first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > + broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > + vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > + *loc = vec_dup;
> > + INSN_CODE (insn) = -1;
> > + /* Revert change if there's no corresponding pattern. */
> > + if (recog_memoized (insn) < 0)
> > + {
> > + *loc = x;
> > + recog_memoized (insn);
> > + }
>
> The usual way of doing this would be through
> validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)? Or are avx/avx2 broadcast slower than
> full vector loads?
>
> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean
> actually larger .rodata. If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.
IIRC I once implemented this (re-using vector constant components
for non-vector pool entries), but it was quite hackish and, it seems,
was never merged.
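
For illustration only, here is a minimal standalone sketch of the .rodata
trade-off raised above. It is not GCC code: struct pool_vec, enum strategy
and all three functions are hypothetical names. It assumes that an unused
full-vector pool entry really is omitted from the output (the open question
in this thread) and that a broadcast load costs no more than a full vector
load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a constant-pool vector entry; not a GCC type.  */
struct pool_vec
{
  const int64_t *elts;   /* element values */
  size_t nelts;          /* element count, e.g. 8 for a {1to8} broadcast */
  bool already_emitted;  /* full vector is already in .rodata */
};

enum strategy
{
  KEEP_FULL_LOAD,          /* leave the full vector load alone */
  BROADCAST_NEW_SCALAR,    /* create a new scalar pool entry */
  BROADCAST_FROM_EXISTING  /* broadcast element 0 of the existing entry */
};

/* Return true if every element equals the first one.  */
static bool
is_uniform (const struct pool_vec *v)
{
  assert (v->nelts > 0);
  for (size_t i = 1; i < v->nelts; i++)
    if (v->elts[i] != v->elts[0])
      return false;
  return true;
}

/* Pick a strategy: only uniform vectors can be broadcast, and an entry
   that is already emitted is reused rather than duplicated.  */
static enum strategy
choose_strategy (const struct pool_vec *v)
{
  if (!is_uniform (v))
    return KEEP_FULL_LOAD;
  return v->already_emitted ? BROADCAST_FROM_EXISTING : BROADCAST_NEW_SCALAR;
}

/* Change in .rodata size, in bytes, relative to keeping the full load.
   Assumes an unreferenced full-vector entry is dropped from the output.  */
static long
rodata_delta (const struct pool_vec *v, enum strategy s)
{
  long elt = (long) sizeof v->elts[0];
  if (s != BROADCAST_NEW_SCALAR)
    return 0;               /* nothing added, nothing removed */
  if (v->already_emitted)
    return elt;             /* full vector stays, scalar is added */
  return elt - (long) v->nelts * elt;  /* scalar replaces full vector */
}
```

In this model, a new scalar entry shrinks .rodata only while the full
vector has not been emitted yet. Once it has been emitted, broadcasting
from its first element is the only variant that adds no bytes.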
Richard.
> Jakub
>