This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Outer vs. inner loop ifcvt (PR tree-optimization/78899)


On Sat, Jan 07, 2017 at 12:46:26PM +0100, Richard Biener wrote:
> >The following patch tweaks tree-if-conv.c so that when it wants to
> >version
> >an outer loop, it actually transforms:
> >	      loop1
> >		loop2
> >into:
> >	      if (LOOP_VECTORIZED (1, 3))
> >		{
> >		  loop1
> >		    loop2
> >		}
> >	      else
> >		loop3 (copy of loop1)
> >		  if (LOOP_VECTORIZED (4, 5))
> >		    loop4 (copy of loop2)
> >		  else
> >		    loop5 (copy of loop4)
> 
> Huh.  Why isn't the else case equal to the if case for the vectorizer?  That is, we have the inner loop if-converted And thus for the if case either outer or inner loop vectorization should be possible.
> 
> So - why doesn't it work that way?

The patch is a consequence of the r242520 changes:
http://gcc.gnu.org/ml/gcc-patches/2016-11/msg01541.html
Previously, we had
	loop1
	  loop2
transformed into:
	loop1
	  if (LOOP_VECTORIZED (2, 3))
	    loop2
	  else
	    loop3 (copy of loop2)
(if actually we find if-conversion of loop2 desirable, there are masked
loads/stores etc.).  loop2 is if-converted after the versioning.
This works well if the inner loop is vectorized, doesn't work at all
(prevents it) outer loop vectorization.
Now, with the r242520, it is transformed into:
	if (LOOP_VECTORIZED (1, 3))
	  {
	    loop1
	      loop2
	  }
	else
	  loop3 (copy of loop1)
	    loop4 (copy of loop2)
if if-conversion thinks outer loop vectorization might be successful.
In this case, loop2 is if-converted.  This works well if the outer loop
versioning is subsequently successful, doesn't work at all if it is
unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are
vectorizing, we use loop2 itself as its scalar loop (so it will contain
MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't
vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we
effectively are vectorizing loop2 in dead code, loop3/loop4 will be used
instead (loop3 is marked as dont_vectorize, so we don't try to vectorize it,
loop4 isn't, so might be vectorized, but only if no if-conversion is needed
for it (but tree-if-conversion determined it is needed)).
With my patch, we have instead:
	      if (LOOP_VECTORIZED (1, 3))
		{
		  loop1
		    loop2
		}
	      else
		loop3 (copy of loop1)
		  if (LOOP_VECTORIZED (4, 5))
		    loop4 (copy of loop2)
		  else
		    loop5 (copy of loop4)
loop2 and loop4 are if-converted, so either outer loop vectorization of
loop1 is successful, then we use loop1/loop2 as the vectorized loop
and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful,
then we use non-vectorized loop3, either with successfully vectorized
loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues,
versioning for alignment etc.), or we fail to vectorize anything and
end up with scalar loop3/loop5.

Without the patch I've posted, there are some remaining options, but those
will mean big amount of work in the loop manipulation code etc.

One option is to keep r242520 in, then the problem is that:
1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when
outer loop vectorization failed, if there is still possible inner loop
vectorization (not that difficult)
2) we'd need to use loop4 as the scalar_loop for the vectorization of
loop2, but that loop is not adjacent to the vectorized loop, so we'd need
to somehow transform all the SSA_NAMEs that might be affected by that
different placement (as if all the loop3 PHIs were loop1 PHIs instead,
and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as
if those were those in loop2 instead); this is the hard part I'm not really
enthusiastic to write

Another option is to revert r242520, and then do something for the outer loop
vectorization.  Right now we expect certain fixed form (5 basic blocks in
the outer loop, lots of assumptions about the cfg of that, dunno where
everywhere it is hardcoded).  We'd need to allow also 7+ basic block form,
where one of the extra loops is just if LOOP_VECTORIZED (x, y), then there
is if-converted single basic block loop x, and then perhaps many basic
blocks loop y.  Lots of the vectorization code expects ->inner to be THE
inner loop, that would no longer be the case, it would be either the scalar
or vector loop, etc.

To me my patch at least for GCC7 looks like far less work than both other
option would require.  I know it has the drawback that compared to the other
options there is more loop copying that - for this regard, the option to
revert r242520 and deal with it would be the most effective, only a single
loop is versioned, with r242520 and without my patch it is 2 loops versioned
and with my patch 3 loops versioned.

There is one thing my patch doesn't do but should for efficiency, if loop1
(outer loop) is not successfully outer-loop vectorized, then we should mark
loop2 (its inner loop) as dont_vectorize if the outer loop has been
LOOP_VECTORIZED guarded.  Then the gcc.dg/gomp/pr68128-1.c change
wouldn't be actually needed.

	Jakub


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]