This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Masked load/store vectorization (take 6)
- From: Richard Biener <rguenther@suse.de>
- To: Jakub Jelinek <jakub@redhat.com>
- Cc: Sergey Ostanevich <sergos.gnu@gmail.com>, Richard Henderson <rth@redhat.com>, gcc-patches@gcc.gnu.org
- Date: Fri, 6 Dec 2013 14:32:06 +0100 (CET)
- Subject: Re: [PATCH] Masked load/store vectorization (take 6)
On Fri, 6 Dec 2013, Jakub Jelinek wrote:
> On Fri, Dec 06, 2013 at 01:49:50PM +0100, Richard Biener wrote:
> > Comments inline (this patch is scarily large for this stage ...)
>
> Thanks.
>
> > > +(define_expand "maskload<mode>"
> > > + [(set (match_operand:V48_AVX2 0 "register_operand")
> > > + (unspec:V48_AVX2
> > > + [(match_operand:<sseintvecmode> 2 "register_operand")
> > > + (match_operand:V48_AVX2 1 "memory_operand")]
> > > + UNSPEC_MASKMOV))]
> > > + "TARGET_AVX")
> > > +
> > > +(define_expand "maskstore<mode>"
> > > + [(set (match_operand:V48_AVX2 0 "memory_operand")
> > > + (unspec:V48_AVX2
> > > + [(match_operand:<sseintvecmode> 2 "register_operand")
> > > + (match_operand:V48_AVX2 1 "register_operand")
> > > + (match_dup 0)]
> > > + UNSPEC_MASKMOV))]
> > > + "TARGET_AVX")
> > > +
> > > (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
> > > [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
> > > (unspec:AVX256MODE2P
> >
> > x86 maintainers should comment here (ick - unspecs)
>
> Well, the unspecs are preexisting (right now used by intrinsics only), I'm
> just adding expanders that will expand to those instructions.
>
> > > @@ -4386,16 +4396,35 @@ get_references_in_stmt (gimple stmt, vec
> > > {
> > > unsigned i, n;
> > >
> > > - op0 = gimple_call_lhs_ptr (stmt);
> > > + ref.is_read = false;
> > > + if (gimple_call_internal_p (stmt))
> > > + switch (gimple_call_internal_fn (stmt))
> > > + {
> > > + case IFN_MASK_LOAD:
> > > + ref.is_read = true;
> > > + case IFN_MASK_STORE:
> > > + ref.ref = build2 (MEM_REF,
> > > + ref.is_read
> > > + ? TREE_TYPE (gimple_call_lhs (stmt))
> > > + : TREE_TYPE (gimple_call_arg (stmt, 3)),
> > > + gimple_call_arg (stmt, 0),
> > > + gimple_call_arg (stmt, 1));
> > > + references->safe_push (ref);
> >
> > This may not be a canonical MEM_REF AFAIK, so you should
> > use fold_build2 here (if the address is &a.b the .b needs folding
>
> Ok, will try that.
>
> > into the offset). I assume the 2nd arg is always constant and
> > thus doesn't change pointer-type during propagations?
>
> Yes, it is, always created by
> ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> (and for vectorized IFN_MASK_* copied over from the non-vectorized
> IFN_MASK_* call).
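I.e., something along these lines (just a sketch, arguments taken from the
hunk above, not the actual patch):

  /* fold_build2 folds an &a.b address into the MEM_REF offset, giving a
     canonical reference.  */
  ref.ref = fold_build2 (MEM_REF,
                         ref.is_read
                         ? TREE_TYPE (gimple_call_lhs (stmt))
                         : TREE_TYPE (gimple_call_arg (stmt, 3)),
                         gimple_call_arg (stmt, 0),
                         gimple_call_arg (stmt, 1));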
>
> > > @@ -4464,7 +4493,7 @@ graphite_find_data_references_in_stmt (l
> > >
> > > FOR_EACH_VEC_ELT (references, i, ref)
> > > {
> > > - dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
> > > + dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
> > > gcc_assert (dr != NULL);
> > > datarefs->safe_push (dr);
> > > }
> >
> > Interesting that you succeeded in removing the indirection
> > on ref.pos ... I remember trying that twice at least and
> > failing ;)
> >
> > You can install that as cleanup now if you split it out (so hopefully
> > no users creep back that make removing it impossible).
>
> Ok, will do.
>
> > > + /* Check whether this is a load or store. */
> > > + lhs = gimple_assign_lhs (stmt);
> > > + if (TREE_CODE (lhs) != SSA_NAME)
> > > + {
> >
> > gimple_store_p ()?
>
> Likely.
>
> > > + if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
> > > + return false;
> > > + op = maskstore_optab;
> > > + ref = lhs;
> > > + }
> > > + else if (gimple_assign_load_p (stmt))
> > > + {
> > > + op = maskload_optab;
> > > + ref = gimple_assign_rhs1 (stmt);
> > > + }
> > > + else
> > > + return false;
> > > +
> > > + /* And whether REF isn't a MEM_REF with non-addressable decl. */
> > > + if (TREE_CODE (ref) == MEM_REF
> > > + && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
> > > + && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
> > > + && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
> > > + return false;
> >
> > I think that's overly conservative and not conservative enough. Just
> > use may_be_nonaddressable_p () (even though the implementation can
> > need some TLC) and make sure to set TREE_ADDRESSABLE when you
> > end up taking its address.
>
> Will try.
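Roughly (sketch only, assuming may_be_nonaddressable_p is exported where
you need it):

  if (may_be_nonaddressable_p (ref))
    return false;

and later, at the point where the address of REF is actually taken:

  tree base = get_base_address (ref);
  if (base && DECL_P (base))
    TREE_ADDRESSABLE (base) = 1;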
>
> > Please factor out the target bits into a predicate in optabs.c
> > so you can reduce the amount of includes here. You can eventually
> > re-use that from the vectorization parts.
>
> Okay.
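Something like the following in optabs.c, say (name and exact signature
purely illustrative):

  /* Return true if the target has a masked load (IS_LOAD) or masked
     store optab for vector mode MODE.  */
  bool
  can_vec_mask_load_store_p (enum machine_mode mode, bool is_load)
  {
    optab op = is_load ? maskload_optab : maskstore_optab;
    return optab_handler (op, mode) != CODE_FOR_nothing;
  }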
>
> > > @@ -1404,7 +1530,8 @@ insert_gimplified_predicates (loop_p loo
> > > basic_block bb = ifc_bbs[i];
> > > gimple_seq stmts;
> > >
> > > - if (!is_predicated (bb))
> > > + if (!is_predicated (bb)
> > > + || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >
> > isn't that redundant now?
>
> Will try (and read the corresponding threads and IRC about that).
>
> > > for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > - if ((stmt = gsi_stmt (gsi))
> > > - && gimple_assign_single_p (stmt)
> > > - && gimple_vdef (stmt))
> > > + if ((stmt = gsi_stmt (gsi)) == NULL
> >
> > I don't think gsi_stmt can be NULL
>
> It can be if gsi_end_p is true, but that is apparently handled in the loop
> It was preexisting code anyway, but will change.
>
> > > + will be then if-converted, the new copy of the loop will not,
> > > + and the LOOP_VECTORIZED internal call will be guarding which
> > > + loop to execute. The vectorizer pass will fold this
> > > + internal call into either true or false. */
> > >
> > > static bool
> > > +version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
> > > +{
> >
> > What's the do_outer parameter?
>
> That is whether it should version also the outer loop.
> Though, perhaps if we solely use the loop versioning + IFN_LOOP_VECTORIZED
> for MASK_{LOAD,STORE} and perhaps MASK_CALL, we could avoid that and simply
> live with not vectorizing those in nested loops, perhaps that occurs rarely
> enough and will not be a vectorization regression. The reason I have added
> it was when the versioning was used always for if-conversion, then without
> it we have regressed quite some tests. Perhaps the outer loop versioning
> is too costly for too unlikely case. So, do you prefer to drop that, or
> keep?
If it's easy to rip out, it looks like that would simplify the patch at this
stage, which is good.
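For reference, the versioning leaves roughly this IL shape (schematic only,
not actual dump output):

  _1 = LOOP_VECTORIZED (<num of if-converted copy>, <num of original copy>);
  if (_1 != 0)
    goto <if-converted copy>;  /* candidate for vectorization */
  else
    goto <original copy>;      /* scalar fallback */
  /* The vectorizer folds the LOOP_VECTORIZED call to true or false,
     and cfg cleanup then removes the dead copy.  */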
> >
> > This needs a comment with explaining what code you create.
>
> Ok.
>
> > >
> > > - return changed;
> > > + if (todo && version_outer_loop)
> > > + {
> > > + if (todo & TODO_update_ssa_only_virtuals)
> > > + {
> > > + update_ssa (TODO_update_ssa_only_virtuals);
> > > + todo &= ~TODO_update_ssa_only_virtuals;
> > > + }
> >
> > Btw I hate that we do update_ssa multiple times per pass per
> > function. That makes us possibly worse than O(N^2) as update_ssa computes
> > the IDF of the whole function.
> >
> > This is something your patch introduces (it's only rewriting the
> > virtuals, not the incremental SSA update by BB copying).
>
> See above. If you don't like the 2x loop versioning, it can be dropped.
> That said, every loop versioning has one update_ssa anyway, so this
> isn't making the complexity any worse, other than constant factor.
Yeah. On my TODO list is still cheaper loop versioning (including
getting rid of the slpeel_* stuff and having only a single SESE
copying machinery).
> > > + maskt = gimple_call_arg (stmt, 2);
> > > + lhs = gimple_call_lhs (stmt);
> > > + type = TREE_TYPE (lhs);
> > > + rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> > > + gimple_call_arg (stmt, 1));
> >
> > That's possibly not canonical again, use fold_build2.
>
> Ok, will try.
>
> > > + gimple g = loop_vectorized_call;
> > > + tree lhs = gimple_call_lhs (g);
> > > + gimple_stmt_iterator gsi = gsi_for_stmt (g);
> > > + gimplify_and_update_call_from_tree (&gsi, boolean_true_node);
> >
> > plain update_call_from_tree should also work here, boolean_true_node
> > is already gimple.
>
> Will test.
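I.e. just (sketch of the simpler variant):

  gimple_stmt_iterator gsi = gsi_for_stmt (g);
  update_call_from_tree (&gsi, boolean_true_node);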
> >
> > > + gsi_next (&gsi);
> > > + if (!gsi_end_p (gsi))
> > > + {
> > > + g = gsi_stmt (gsi);
> > > + if (gimple_code (g) == GIMPLE_COND
> > > + && gimple_cond_lhs (g) == lhs)
> > > + {
> > > + gimple_cond_set_lhs (g, boolean_true_node);
> >
> > or simply replace all immediate uses of 'lhs' by boolean_true_node
> > and remove the loop_vectorized call?
>
> That would work, though e.g. removing of the call is something
> any DCE probably handles well and vectorizer relies on tons of DCE anyway.
Well, but you handle forwarding into the COND_EXPR here anyway
(probably so that cfgcleanup removes it), so simply forward it
to all uses.
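Sketch of what I mean (illustrative only):

  imm_use_iterator iter;
  use_operand_p use_p;
  gimple use_stmt;

  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
    {
      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
        SET_USE (use_p, boolean_true_node);
      update_stmt (use_stmt);
    }
  /* DCE/cfg cleanup will then get rid of the dead LOOP_VECTORIZED call
     and the now-constant condition.  */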
> > See above. And factor this out into a function. Also move this
> > to the cleanup loop below.
>
> Ok.
Thanks.
Richard.