This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [SVE] PR86753


On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 26 Aug 2019 at 14:48, Richard Biener <richard.guenther@gmail.com> wrote:
> >>
> >> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
> >> <prathamesh.kulkarni@linaro.org> wrote:
> >> >
> >> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> > >
> >> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> >> > > > <richard.sandiford@arm.com> wrote:
> >> > > >>
> >> > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> >> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> >> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
> >> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> >> > > >> >> Alternatively you could ask the VN state to generate the stmt for
> >> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> >> > > >> >> work).  One complication might be availability if you don't value-number
> >> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> >> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> >> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
> >> > > >> >> should keep this mode in mind when designing the one working on
> >> > > >> >> an existing BB.  Note as you write it it depends on visiting the
> >> > > >> >> stmts in proper order - is that guaranteed when for example
> >> > > >> >> vectorizing SLP?
> >> > > >> > Hi,
> >> > > >> > Indeed, I wrote the function with the assumption that stmts would
> >> > > >> > be visited in proper order.
> >> > > >> > This doesn't affect SLP currently, because the call to vn_visit_stmt
> >> > > >> > in vect_transform_stmt is conditional on cond_to_vec_mask, which is
> >> > > >> > only allocated inside vect_transform_loop.
> >> > > >> > But I agree we could make it more general.
> >> > > >> > AFAIU, the idea of constraining VN to a single block was to avoid
> >> > > >> > using defs from non-dominating scalar stmts during outer-loop
> >> > > >> > vectorization.
> >> > > >>
> >> > > >> Maybe we could do the numbering in a separate walk immediately before
> >> > > >> the transform phase instead.
> >> > > > Sorry, I didn't understand.  Do you mean we should do dom-based VN
> >> > > > just before the transform phase, or run full VN?
> >> > >
> >> > > No, I just meant that we could do a separate walk of the contents
> >> > > of the basic block:
> >> > >
> >> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> >> > > >      {
> >> > > >        basic_block bb = bbs[i];
> >> > > >        stmt_vec_info stmt_info;
> >> > > > +      vn_bb_init (bb);
> >> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> >> > > >
> >> > >
> >> > > ...here, rather than doing it on the fly during vect_transform_stmt
> >> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> >> > > others don't have to pay the compile-time penalty.  (Same for
> >> > > cond_to_vec_mask itself really.)
> >> > Hi,
> >> > Does the attached patch look OK?
> >> > In the patch, I put the call to vn_visit_stmt in the bb loop in
> >> > vect_transform_loop, to avoid replicating the logic for processing
> >> > phis and stmts.
> >> > AFAIU, vect_transform_loop_stmt is only called from the bb loop, so
> >> > the compile-time penalty for checking cond_to_vec_mask should be
> >> > pretty small?
> >> > If this is not OK, I will walk the bb immediately before the bb loop.
> >>
> >> So if I understand correctly you never have vectorizable COND_EXPRs
> >> in SLP mode?  Because we vectorize all SLP chains before entering
> >> the loop in vect_transform_loop where you VN existing scalar(!) stmts.
>
> On the "!": the idea behind the patch is to find cases in which a
> scalar condition is used in both a statement that needs to be masked
> for correctness reasons and a statement that we can choose to mask
> if we want to.  It also tries (opportunistically) to match the ?: order
> with other conditions.
>
> That's why it's operating on scalar values rather than vector values.
> In principle it could be done as a subpass before vectorisation rather
> than on the fly, when there aren't any vector stmts around.
>
> >> Then all this new hash-table stuff should not be needed since this
> >> is what VN should provide you with.  You of course need to visit
> >> generated condition stmts.  And condition support is weak
> >> in VN due to it possibly having two operations in a single stmt.
> >> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
> >> why you even need it given you are doing your own hashing?
> > Well, we thought of using VN for comparing operands in cases where
> > operand_equal_p would not work.  Actually, VN seems not to be required
> > for the test-cases in the PR, because both conditions are _4 != 0
> > (in _35 = _4 != 0 and in the cond_expr), which operand_equal_p
> > matches.
>
> Right, that's why I was suggesting in the earlier thread that we
> treat value numbering as a follow-on.  But...
>
> > Input to vectorizer is:
> >
> >  <bb 3> [local count: 1063004407]:
> >   # i_20 = PHI <i_16(7), 0(15)>
> >   # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
> >   _1 = (long unsigned int) i_20;
> >   _2 = _1 * 4;
> >   _3 = y_11(D) + _2;
> >   _4 = *_3;
> >   _5 = z_12(D) + _2;
> >   _35 = _4 != 0;
> >   iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
> >   iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;
> >
> > In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask
> > in cond_to_vec_mask, and in vectorizable_condition, we look up
> > (ne_expr, _4, 0), which does not require VN since the operands are
> > the same.
> >
> > Initially, I was trying to change the generated vectorized code:
> >
> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > vect_iftmp.11_47, vect_cst__49>;
> >
> > where both conditions are equivalent because vect_cst__42 and
> > vect_cst__48 are zero vectors, but operand_equal_p failed to catch
> > that.
> >
> > Sorry, I mixed up scalar and vector stmts later on.
> > I wonder if we still need VN then?  Perhaps there are other cases
> > where operands of scalar conditions are equivalent but don't match
> > with operand_equal_p?
> > In the attached patch, changing operator== to compare using
> > operand_equal_p works for the tests.
>
> ...those are only very simple cases :-) The problem is that ifcvt
> doesn't leave the code free of redundancies, and pattern stmt generation
> can create redundancies too.  Of course, we could fix those instead.
>
> E.g. for:
>
> void
> f (short *restrict x, short *restrict a, short *restrict b, short *restrict c)
> {
>   for (int i = 0; i < 100; ++i)
>     if (a[i] >= 1)
>       x[i] = b[i] >= 1 ? a[i] : c[i];
> }
>
> ifcvt produces:
>
>   <bb 3> [local count: 1063004407]:
>   # i_34 = PHI <i_30(10), 0(21)>
>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
>   _1 = (long unsigned int) i_34;
>   _2 = _1 * 2;
>   _3 = a_23(D) + _2;
>   _4 = *_3;
>   _7 = b_24(D) + _2;
>   _49 = _4 > 0;
>   _8 = .MASK_LOAD (_7, 16B, _49);
>   _12 = _4 > 0;
>   _13 = _8 > 0;
>   _9 = _12 & _13;
>   _10 = _4 > 0;
>   _11 = _8 > 0;
>   _27 = ~_11;
>   _15 = _10 & _27;
>   _14 = c_25(D) + _2;
>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
>   _17 = x_28(D) + _2;
>   _50 = _4 > 0;
>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
>   i_30 = i_34 + 1;
>   ivtmp_6 = ivtmp_5 - 1;
>   if (ivtmp_6 != 0)
>     goto <bb 10>; [98.99%]
>   else
>     goto <bb 9>; [1.01%]
>
>   <bb 10> [local count: 1052266994]:
>   goto <bb 3>; [100.00%]
>
> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).

Huh.  if-conversion does

  /* Now all statements are if-convertible.  Combine all the basic
     blocks into one huge basic block doing the if-conversion
     on-the-fly.  */
  combine_blocks (loop);

  /* Delete dead predicate computations.  */
  ifcvt_local_dce (loop->header);

  /* Perform local CSE, this esp. helps the vectorizer analysis if loads
     and stores are involved.  CSE only the loop body, not the entry
     PHIs, those are to be kept in sync with the non-if-converted copy.
     ???  We'll still keep dead stores though.  */
  exit_bbs = BITMAP_ALLOC (NULL);
  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
  bitmap_set_bit (exit_bbs, loop->latch->index);
  todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);

which should remove those redundant _4 > 0 checks.  In fact when I
run this on x86_64 with -mavx512bw I see

  <bb 3> [local count: 1063004407]:
  # i_25 = PHI <i_20(9), 0(21)>
  # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
  _1 = (long unsigned int) i_25;
  _2 = _1 * 2;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = b_15(D) + _2;
  _49 = _4 > 0;
  _6 = .MASK_LOAD (_5, 16B, _49);
  _22 = _6 > 0;
  _28 = ~_22;
  _29 = _28 & _49;
  _7 = c_16(D) + _2;
  iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
  iftmp.0_10 = _29 ? iftmp.0_17 : _4;
  _8 = x_18(D) + _2;
  .MASK_STORE (_8, 16B, _49, iftmp.0_10);
  i_20 = i_25 + 1;
  ivtmp_12 = ivtmp_24 - 1;
  if (ivtmp_12 != 0)

after if-conversion (that should be the case already on the GCC 9 branch).

> Looking through the definition of an SSA name means that we can cope
> with these redundancies for single comparisons, but not for comparisons
> combined through & and |.  This hurts most when trying to decide whether
> to invert comparisons.
>
> But maybe value numbering won't cope with that either :-)

Not sure, it definitely doesn't do arbitrary re-association
(but commutative ops are handled).  So it cannot prove
a & (b & c) == (c & a) & b, but it can prove a & b == b & a.

Richard.

>
> Thanks,
> Richard

