This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Outer vs. inner loop ifcvt (PR tree-optimization/78899)


On Mon, 9 Jan 2017, Jakub Jelinek wrote:

> On Sat, Jan 07, 2017 at 12:46:26PM +0100, Richard Biener wrote:
> > >The following patch tweaks tree-if-conv.c so that when it wants to
> > >version
> > >an outer loop, it actually transforms:
> > >	      loop1
> > >		loop2
> > >into:
> > >	      if (LOOP_VECTORIZED (1, 3))
> > >		{
> > >		  loop1
> > >		    loop2
> > >		}
> > >	      else
> > >		loop3 (copy of loop1)
> > >		  if (LOOP_VECTORIZED (4, 5))
> > >		    loop4 (copy of loop2)
> > >		  else
> > >		    loop5 (copy of loop4)
> > 
> > Huh.  Why isn't the else case equal to the if case for the vectorizer?  That is, we have the inner loop if-converted And thus for the if case either outer or inner loop vectorization should be possible.
> > 
> > So - why doesn't it work that way?
> 
> The patch is a consequence of the r242520 changes:
> http://gcc.gnu.org/ml/gcc-patches/2016-11/msg01541.html
> Previously, we had
> 	loop1
> 	  loop2
> transformed into:
> 	loop1
> 	  if (LOOP_VECTORIZED (2, 3))
> 	    loop2
> 	  else
> 	    loop3 (copy of loop2)
> (if actually we find if-conversion of loop2 desirable, there are masked
> loads/stores etc.).  loop2 is if-converted after the versioning.
> This works well if the inner loop is vectorized, doesn't work at all
> (prevents it) outer loop vectorization.
> Now, with the r242520, it is transformed into:
> 	if (LOOP_VECTORIZED (1, 3))
> 	  {
> 	    loop1
> 	      loop2
> 	  }
> 	else
> 	  loop3 (copy of loop1)
> 	    loop4 (copy of loop2)
> if if-conversion thinks outer loop vectorization might be successful.
> In this case, loop2 is if-converted.  This works well if the outer loop
> versioning is subsequently successful, doesn't work at all if it is
> unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are
> vectorizing, we use loop2 itself as its scalar loop (so it will contain
> MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't
> vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we
> effectively are vectorizing loop2 in dead code, loop3/loop4 will be used
> instead (loop3 is marked as dont_vectorize, so we don't try to vectorize it,
> loop4 isn't, so might be vectorized, but only if no if-conversion is needed
> for it (but tree-if-conversion determined it is needed)).
> With my patch, we have instead:
> 	      if (LOOP_VECTORIZED (1, 3))
> 		{
> 		  loop1
> 		    loop2
> 		}
> 	      else
> 		loop3 (copy of loop1)
> 		  if (LOOP_VECTORIZED (4, 5))
> 		    loop4 (copy of loop2)
> 		  else
> 		    loop5 (copy of loop4)
> loop2 and loop4 are if-converted, so either outer loop vectorization of
> loop1 is successful, then we use loop1/loop2 as the vectorized loop
> and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful,
> then we use non-vectorized loop3, either with successfully vectorized
> loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues,
> versioning for alignment etc.), or we fail to vectorize anything and
> end up with scalar loop3/loop5.

But that causes even more versioning (plus redundant if-conversion).

> Without the patch I've posted, there are some remaining options, but those
> will mean big amount of work in the loop manipulation code etc.
> 
> One option is to keep r242520 in, then the problem is that:
> 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when
> outer loop vectorization failed, if there is still possible inner loop
> vectorization (not that difficult)

Yeah, that's something r242520 missed to address it seems.  We've
expected this works as expected...  heh.  Testsuite coverage is low
here it seems.

> 2) we'd need to use loop4 as the scalar_loop for the vectorization of
> loop2, but that loop is not adjacent to the vectorized loop, so we'd need
> to somehow transform all the SSA_NAMEs that might be affected by that
> different placement (as if all the loop3 PHIs were loop1 PHIs instead,
> and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as
> if those were those in loop2 instead); this is the hard part I'm not really
> enthusiastic to write

We use the special scalar_loop always if we have the loop_vectorized
guard, right?  Which is of course good.  And required for masked 
loads/stores.

I see how this looks somewhat unfortunate.

> Another option is to revert r242520, and then do something for the outer loop
> vectorization.  Right now we expect certain fixed form (5 basic blocks in
> the outer loop, lots of assumptions about the cfg of that, dunno where
> everywhere it is hardcoded).  We'd need to allow also 7+ basic block form,
> where one of the extra loops is just if LOOP_VECTORIZED (x, y), then there
> is if-converted single basic block loop x, and then perhaps many basic
> blocks loop y.  Lots of the vectorization code expects ->inner to be THE
> inner loop, that would no longer be the case, it would be either the scalar
> or vector loop, etc.

Yeah, this is why we went the r242520 route...

> To me my patch at least for GCC7 looks like far less work than both other
> option would require.  I know it has the drawback that compared to the other
> options there is more loop copying that - for this regard, the option to
> revert r242520 and deal with it would be the most effective, only a single
> loop is versioned, with r242520 and without my patch it is 2 loops versioned
> and with my patch 3 loops versioned.

Certainly handling r242520 in a different way looks best then (w/o
r242520 the followup to always version loops in if-conversion shows
a lot of vect.exp regressions).

> There is one thing my patch doesn't do but should for efficiency, if loop1
> (outer loop) is not successfully outer-loop vectorized, then we should mark
> loop2 (its inner loop) as dont_vectorize if the outer loop has been
> LOOP_VECTORIZED guarded.  Then the gcc.dg/gomp/pr68128-1.c change
> wouldn't be actually needed.

Yes.

Are you willing to try re-doing r242520?

Thanks,
Richard.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]