This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: Gimple loop splitting

From: Michael Matz <matz at suse dot de>
To: Jeff Law <law at redhat dot com>
Cc: gcc-patches at gcc dot gnu dot org
Date: Mon, 16 Nov 2015 17:05:57 +0100 (CET)
Subject: Re: Gimple loop splitting
Authentication-results: sourceware.org; auth=none
References: <alpine dot LSU dot 2 dot 20 dot 1511121734040 dot 11029 at wotan dot suse dot de> <5645083A dot 5070607 at redhat dot com>
Hi,

On Thu, 12 Nov 2015, Jeff Law wrote:

> > this new pass implements loop iteration space splitting for loops that
> > contain a conditional that's always true for one part of the iteration
> > space and false for the other, i.e. such situations:
> FWIW, Ajit suggested the same transformation earlier this year.  During that
> discussion Richi indicated that for hmmer this transformation would enable
> vectorization.

It's a prerequisite indeed, but not enough in itself.  The next problem 
will be that only parts of access chains inside the hot loop are 
vectorizable, but for that those parts need to be disambiguated.  ICC is 
doing that by a massive chain of conditionals testing non-overlapping of 
the respective arrays at runtime.  Our vectorizer could also do that 
(possibly by increasing the allowed number of conditionals), but the next 
problem then is that one of these (then indeed separated) parts is not 
vectorizable by our vectorizer: it's a 'a[i] = f(a[i-1])' dependency that 
can't yet be handled by us.  If the separation of parts would be done by 
loop distribution that would be fine (we'd have separate loops for the 
parts, some of them vectorizable), but our loop distribution can't do 
runtime disambiguation, only our vectorizer.

hmmer is actually quite interesting because it's a fairly isolated hot 
loop posing quite challenging problems for us :)

> 
>   It does increase code size, when the loop body contains
> > also unconditional code (that one is duplicated), so we only transform hot
> > loops.
> 
> Probably ought to be disabled when we're not optimizing for speed as well.

That should be dealt with by '!optimize_loop_for_size_p (loop)'.

> > I've regstrapped this pass enabled with -O2 on x86-64-linux, without
> > regressions.  I've also checked cpu2006 (the non-fortran part) for
> > correctness, not yet for performance.  In the end it should probably only
> > be enabled for -O3+ (although if the whole loop body is conditional it
> > makes sense to also have it with -O2 because code growth is very small
> > then).
> 
> Very curious on the performance side, so if you could get some #s on that,
> it'd be greatly appreciated.

My test machine misbehaved over the weekend, but as soon as I have them 
I'll update here.

> > testsuite/
> > 	* gcc.dg/loop-split.c: New test.
> 
> Please clean up the #if 0/#if 1 code in the new tests.

Actually I'd prefer if that test contains the by-hand code and the TRACE 
stuff as well, I'd only change the #if 0 into some #if BYHAND or so ...

> You might also want to clean out the TRACE stuff.  Essentially the tests 
> look like you just dropped in a test you'd been running by hand until 
> now :-)

... the reason being, that bugs in the splitter are somewhat unwieldy to 
debug by just staring at the dumps, you only get a checksum mismatch, so 
TRACE=1 is for finding out which of the params and loops is actually 
miscompiled, TRACE=2 for finding the specific iteration that's broken, and 
the #if0 code for putting that situation into a non-macroized and smaller 
function than dotest.  (That's actually how I've run the testcase after I 
had it basically working, extending dotest with a couple more lines, aka 
example loop sitations, adjusting the checksum, and then making a face and 
scratching my head and mucking with the TRACE and #if0 macros :) ).

> I don't see any negative tests -- ie tests that should not be split due 
> to boundary conditions.  Do you have any from development?

Good point, I had some but only ones where I was able to extend the 
splitters to cover them.  I'll think of some that really shouldn't be 
split.

> > Index: tree-ssa-loop-unswitch.c
> > ===================================================================
> > --- tree-ssa-loop-unswitch.c	(revision 229763)
> > +++ tree-ssa-loop-unswitch.c	(working copy)
> Given the amount of new code, unless there's a strong need, I'd prefer this
> transformation to be implemented in its own file.

Okay.

> > +/* Give an induction variable GUARD_IV, and its affine descriptor IV,
> > +   find the loop phi node in LOOP defining it directly, or create
> > +   such phi node.  Return that phi node.  */
> > +
> > +static gphi *
> > +find_or_create_guard_phi (struct loop *loop, tree guard_iv, affine_iv *
> > /*iv*/)
> > +{
> > +  gimple *def = SSA_NAME_DEF_STMT (guard_iv);
> > +  gphi *phi;
> > +  if ((phi = dyn_cast <gphi *> (def))
> > +      && gimple_bb (phi) == loop->header)
> > +    return phi;
> > +
> > +  /* XXX Create the PHI instead.  */
> > +  return NULL;
> 
> So right now we just punt if we need to create the PHI?  Does that 
> happen with any kind of regularity in practice?

Only with such situations:

  for (int i = start; i < end; i++) {
    if (i + offset < bound)
      ...
  }

Here the condition-IV is not directly defined by a PHI node.  If it 
happens often I don't know, I guess the usual situation is testing the 
control IV directly.  The deficiency is not hard to fix.

> > +static bool
> > +split_loop (struct loop *loop, struct tree_niter_desc *niter)
> This should probably be broken up a bit more.  It's loooong as-is.
> 
> Without looking at how much stuff would have to be passed around, 
> diddling the exit edge of the first loop, phi updates for the 2nd loop, 
> fix iteration space of 2nd loop, exit block fixup might be a good 
> initial cut at breaking this down into something of manageable size.

Thanks, I'll do that.

> > +	initialize_original_copy_tables ();
> > +	basic_block cond_bb;
> > +	struct loop *floop = loop_version (loop, cond, &cond_bb,
> > +					   REG_BR_PROB_BASE, REG_BR_PROB_BASE,
> > +					   REG_BR_PROB_BASE, false);
> > +	gcc_assert (floop);
> > +	update_ssa (TODO_update_ssa);
> > +
> > +	/* Now diddle the exit edge of the first loop (floop->join in the
> > +	   above) to either go to the common exit (join) or to the second
> > +	   loop, depending on if there are still iterations left, or not.
> > +	   We split the floop exit edge and insert a copy of the
> > +	   original exit expression into the new block, that either
> > +	   skips the second loop or goes to it.  */
> 
> So after diddling, haven't we mucked up the dominator tree and the SSA 
> graph? You're iterating over each PHI in two loop headers and fixing the 
> SSA graph by hand AFAICT.  But ISTM the dominator tree is still mucked 
> up, right?

I think I convinced myself on paper that the dominator tree is correct due 
to our helpers doing the right thing (loop_version() for the initial 
loop copying and split_edge for the above diddling).  Let's see if I can 
paint some ASCII art.  So, after loop_version (which updates dom) we 
have:

               .------if (cond)-------.
               v                      v
             pre1                   pre2
              |                      |
             h1<----.               h2<----.
              |     |                |     |
          .--ex1    |        .------ex2    |
          |    \    |        |        \    |
          |    l1---'        |        l2---'
          |                  |
          |                  |
          '--X--------->join<'


At this point dominators are all correct (due to loop_version updating 
them), in particular dom(pre1)==dom(pre2)==if(cond).  Now we split 
ex1->join at X, and split_edge also updates them (trivially), but we 
insert a new edge from split_bb to pre2.  There are no paths from region2 
into region1, and anything in region2 except pre2 is still dominated by 
pre2 (or something further down), so if anything changes, then dom(pre2).


               .------if (cond)----.
               v                   |
             pre1                  |
              |                    |
             h1<----.              |
              |     |              |
             ex1    |              |
              | \   |              |
              |  l1-'              |
              v                    |
          .-split-----------.      |
          |                 v      |
          |               pre2<----'
          |                |
          |               h2<----.
          |                |     |
          |               ex2    |
          |                | \   |
          |                | l2--'
          |              .-'
          '------>join<--'

But there's a path directly to pre2, skipping whole region1, so dom(pre2) 
must be still if(cond), as originally.  Also dom(join) doesn't change, 
because what was first a normal diamond between 
if(cond),region1,region2,join now is a meddled diamond with paths from 
region1 to region2, but not back, so the dominator of the join block still 
is the if(cond) block.

This is all true if the internal structure of region1/region2 is sensible, 
and single_exit() regions are such.  Even multiple exits to something 
behind join wouldn't change this, but we don't even have to think about 
this.

In addition, anything not updating dominators correctly would scream 
loudly in the verifier.

The SSA tree is correct after loop_version() and split_edge.  The new edge 
split_bb->pre2 needs the adjustments in that loop over loop PHI nodes.  
That walk must catch everything, if it wouldn't then that would mean a use 
in region2 that's defined in region1, that wasn't originally dominated by 
the def (and hence must have been a loop-carried value and hence be 
defined in the loop header PHI block).

> Overall I think this looks real good.  THe biggest problem IMHO is 
> breaking down that monster function a bit.  I'm a bit concerned by the 
> dominator tree state.  Worst case is we have to rebuild the dominators 
> before ensuring we're LCSSA form, and even that doesn't seem too bad.

Actually keeping LCSSA form correct is doable as well, but needs another 
loop over one or the other PHI nodes.  I punted for now and called 
rewrite_into_loop_closed_ssa_1, which actually isn't too expensive for a 
single loop.

> As I mentioned, it may actually be the case that we're OK on the 
> dominator tree, kindof by accident more than design.

I'm pretty sure it is correct, and it is so by design :)

Thanks for the feedback, I'll update the patch accordingly.


Ciao,
Michael.
Follow-Ups:
- Re: Gimple loop splitting
  - From: Jeff Law
References:
- Gimple loop splitting
  - From: Michael Matz
- Re: Gimple loop splitting
  - From: Jeff Law
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]