This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Outer vs. inner loop ifcvt (PR tree-optimization/78899)


On Mon, Jan 09, 2017 at 10:08:24AM +0100, Richard Biener wrote:
> > if if-conversion thinks outer loop vectorization might be successful.
> > In this case, loop2 is if-converted.  This works well if the outer loop
> > versioning is subsequently successful, doesn't work at all if it is
> > unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are
> > vectorizing, we use loop2 itself as its scalar loop (so it will contain
> > MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't
> > vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we
> > effectively are vectorizing loop2 in dead code, loop3/loop4 will be used
> > instead (loop3 is marked as dont_vectorize, so we don't try to vectorize it,
> > loop4 isn't, so might be vectorized, but only if no if-conversion is needed
> > for it (but tree-if-conversion determined it is needed)).
> > With my patch, we have instead:
> > 	      if (LOOP_VECTORIZED (1, 3))
> > 		{
> > 		  loop1
> > 		    loop2
> > 		}
> > 	      else
> > 		loop3 (copy of loop1)
> > 		  if (LOOP_VECTORIZED (4, 5))
> > 		    loop4 (copy of loop2)
> > 		  else
> > 		    loop5 (copy of loop4)
> > loop2 and loop4 are if-converted, so either outer loop vectorization of
> > loop1 is successful, then we use loop1/loop2 as the vectorized loop
> > and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful,
> > then we use non-vectorized loop3, either with successfully vectorized
> > loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues,
> > versioning for alignment etc.), or we fail to vectorize anything and
> > end up with scalar loop3/loop5.
> 
> But that causes even more versioning (plus redundant if-conversion).

Yes, one more loop (i.e. 2->3 loops versioned).

> > One option is to keep r242520 in, then the problem is that:
> > 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when
> > outer loop vectorization failed, if there is still possible inner loop
> > vectorization (not that difficult)
> 
> Yeah, that's something r242520 missed to address it seems.  We've
> expected this works as expected...  heh.  Testsuite coverage is low
> here it seems.

The insufficient testsuite coverage is what is the bigest problem I think.
As I said, this part isn't that hard and I could do it.

> > 2) we'd need to use loop4 as the scalar_loop for the vectorization of
> > loop2, but that loop is not adjacent to the vectorized loop, so we'd need
> > to somehow transform all the SSA_NAMEs that might be affected by that
> > different placement (as if all the loop3 PHIs were loop1 PHIs instead,
> > and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as
> > if those were those in loop2 instead); this is the hard part I'm not really
> > enthusiastic to write
> 
> We use the special scalar_loop always if we have the loop_vectorized
> guard, right?  Which is of course good.  And required for masked 
> loads/stores.
> 
> I see how this looks somewhat unfortunate.

Yes, the scalar loop is used in many places, every time we need something
that will not be vectorized, we use that (because, we don't have and don't
want scalar MASK_LOAD/MASK_STORE, or without -ftree-loop-if-convert now also
the COND_EXPRs all around).

> Certainly handling r242520 in a different way looks best then (w/o
> r242520 the followup to always version loops in if-conversion shows
> a lot of vect.exp regressions).
> 
> > There is one thing my patch doesn't do but should for efficiency, if loop1
> > (outer loop) is not successfully outer-loop vectorized, then we should mark
> > loop2 (its inner loop) as dont_vectorize if the outer loop has been
> > LOOP_VECTORIZED guarded.  Then the gcc.dg/gomp/pr68128-1.c change
> > wouldn't be actually needed.
> 
> Yes.
> 
> Are you willing to try re-doing r242520?

Given the amount of time it took me to debug even this version of the patch
(wasted more than a day on the tree-vect-loop-manip.c change, the ICEs on
pr71077 testcase looked very cryptic), I'm afraid
I don't know the SSA_NAME renaming/vect manipulation good enough not to
spend a week or more on that and I probably need to spend that time on other
PRs instead, we are over 50 P1-P3s behind e.g. GCC5 schedule (comparing
from Jan 8th GCC5 status report).
If you think you'd be able to handle it faster or if somebody else (Bill
as the author of r242520) is willing to handle this part, I can help with
the tree-vectorizer.c part.
I think the minimum number of functions that need changing is
slpeel_tree_duplicate_loop_to_edge_cfg
vect_loop_versioning
If the tree-vectorizer.c part supplies loop4 as the LOOP_VINFO_SCALAR_LOOP
for loop2, then the check whether this needs to extra more complicated
handling could be whether the vectorized loop's outer loop is not equal to
scalar_loop's outer loop.

	Jakub


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]