[PATCH]middle-end Add an RPO pass after successful vectorization
Richard Biener
rguenther@suse.de
Tue Nov 2 15:29:00 GMT 2021
On Tue, 2 Nov 2021, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, November 2, 2021 2:24 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > vectorization
> >
> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > Following my current SVE predicate optimization series a problem has
> > > presented itself in that the way vector masks are generated for masked
> > > operations relies on CSE to share masks efficiently.
> > >
> > > The issue however is that masking is done using the & operand and & is
> > > associative and so reassoc decides to reassociate the masked operations.
> >
> > But it does this for the purpose of canonicalization and thus CSE.
>
> Yes, but it turns something like
>
> (a & b) & mask into a & (b & mask).
>
> When (a & b) is used somewhere else you now lose the CSE. So it's actually hurting
> In this case.
OK, so that's a known "issue" with reassoc, it doesn't consider global
CSE opportunities and I guess it pushes 'mask' to leaf if it is loop
carried.
> >
> > > This makes CSE then unable to CSE an unmasked and a masked operation
> > > leading to duplicate operations being performed.
> > >
> > > To counter this we want to add an RPO pass over the vectorized loop
> > > body when vectorization succeeds. This makes it then no longer
> > > reliant on the RTL level CSE.
> > >
> > > I have not added a testcase for this as it requires the changes in my
> > > patch series, however the entire series relies on this patch to work
> > > so all the tests there cover it.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
> > > no issues.
> > >
> > > Ok for master?
> >
> > You are running VN over _all_ loop bodies rather only those vectorized.
> > We loop over vectorized loops earlier for optimizing masked store sequences.
> > I suppose you could hook in there. I'll also notice that we have
> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
> > So I don't understand why it doesn't work to CSE later.
> >
>
> Atm, say you have the conditions a > b, and a > b & a > c
>
> We generate
>
> mask1 = (a > b) & loop_mask
> mask2 = (a > b & a > c) & loop_mask
>
> with the intention that mask1 can be re-used in mask2.
>
> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
>
> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
> the mask1 and mask2. It doesn't generate incorrect code, just inefficient.
>
> > for (i = 1; i < number_of_loops (cfun); i++)
> > {
> > loop_vec_info loop_vinfo;
> > bool has_mask_store;
> >
> > loop = get_loop (cfun, i);
> > if (!loop || !loop->aux)
> > continue;
> > loop_vinfo = (loop_vec_info) loop->aux;
> > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > delete loop_vinfo;
> > if (has_mask_store
> > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> > optimize_mask_stores (loop);
> > loop->aux = NULL;
> > }
> >
>
> Ah thanks, I'll make the changes.
Note I think that full-blown CSE is a bit overkill just to counter
a deficient reassoc (or VN). At least it is supposed to be "cheap"
and can be conditionalized on loop masks being used as well.
> Thanks,
> Tamar
>
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN
> > upon
> > > successful vectorization.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> > >
> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c
> > 660
> > > af4b6d32dc51 100644
> > > --- a/gcc/tree-vectorizer.c
> > > +++ b/gcc/tree-vectorizer.c
> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see
> > > #include "gimple-pretty-print.h"
> > > #include "opt-problem.h"
> > > #include "internal-fn.h"
> > > -
> > > +#include "tree-ssa-sccvn.h"
> > >
> > > /* Loop or bb location, with hotness information. */
> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@
> > > vectorize_loops (void)
> > > ??? Also while we try hard to update loop-closed SSA form we fail
> > > to properly do this in some corner-cases (see PR56286). */
> > > rewrite_into_loop_closed_ssa (NULL,
> > > TODO_update_ssa_only_virtuals);
> > > +
> > > + for (i = 1; i < number_of_loops (cfun); i++)
> > > + {
> > > + loop = get_loop (cfun, i);
> > > + if (!loop || !single_exit (loop))
> > > + continue;
> > > +
> > > + bitmap exit_bbs;
> > > + /* Perform local CSE, this esp. helps because we emit code for
> > > + predicates that need to be shared for optimal predicate usage.
> > > + However reassoc will re-order them and prevent CSE from working
> > > + as it should. CSE only the loop body, not the entry. */
> > > + exit_bbs = BITMAP_ALLOC (NULL);
> > > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > + bitmap_set_bit (exit_bbs, loop->latch->index);
> > > +
> > > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > > +
> > > + BITMAP_FREE (exit_bbs);
> > > + }
> > > +
> > > return TODO_cleanup_cfg;
> > > }
> > >
> > >
> > >
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
More information about the Gcc-patches
mailing list