This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Merge the "ISL optimizer" from the graphite branch


On 07/03/2012 02:24 PM, Richard Guenther wrote:
On Tue, 3 Jul 2012, Tobias Grosser wrote:

On 07/03/2012 01:56 PM, Richard Guenther wrote:
On Tue, 3 Jul 2012, Tobias Grosser wrote:

On 07/03/2012 01:15 PM, Richard Guenther wrote:

This merges the last bit from the graphite ISL branch - an integrated optimizer based on ISL. To quote Tobias:

"The isl scheduling optimizer implements the scheduling algorithm first developed in Pluto [1]. Pluto has shown significant speedups and is nowadays even implemented in the IBM XL-C compiler. The implementation of this pass is a first draft and was copied largely from Polly. We still need to adapt the code to the gcc coding style and we need to tune the isl scheduler. At the moment we get reasonable compile times (at most a 2x-3x slowdown) and first speedups. We now need to tune the compile time and start to investigate which optimizations and heuristics need to be tuned in our reimplementation.

[1] http://pluto-compiler.sourceforge.net/"

Micha kindly did the code adaptation to gcc coding style and I renamed the flag to -floop-nest-optimize (from -floop-optimize-isl).  We both agree that such an integrated LNO is the way to go, superseding the individual graphite transforms we have now.  We might even be able to drop the separate blocking & strip-mining transforms we have right now in favor of this?

Thanks Micha for adapting the style to gcc.


I would like to point out that this pass is still very experimental and not tuned at all. Specifically, it was only tested on polybench with one specific set of flags. Even there we did not get only speedups; due to missing heuristics some benchmarks also got large slowdowns. When using it on even slightly different benchmarks or with slightly different flags, infinite compile times or large performance regressions may show up! This optimizer may obviously also contain bugs that lead to miscompilations.

Also, the loop nest optimizer will not be very effective as long as PRE and LICM are scheduled before graphite.

I have noticed the change disabling those on the graphite-isl branch, but I fail to see how we can avoid handling PREd/LIMd loops from within the polyhedral optimizations. In fact the user may even have performed PRE or LIM at the source level, so the place to address this issue is surely within graphite/ISL.

You can still handle those loops; however, PRE/LIM will introduce a lot of additional dependences, which will block transformations. Such dependences can be removed e.g. with array expansion or by undoing some of the PRE/LIM transformations, but this is a difficult problem, especially if you don't want to increase the used memory too much.

I do not see any benefit from PRE and LIM before graphite, as at the very best their transformations will be reverted (if such an optimization is ever written). However, scheduling PRE and LIM after graphite makes perfect sense, especially as there will probably be a lot of new opportunities.

Moving such passes behind graphite obviously does not solve the case of manual PRE and LIM done by the user. However, it will allow us to optimize the non-manually-PREd code a lot better.

I suppose it might make sense to not perform GRAPHITE stuff from within our main loop optimization pipeline.

I have no idea. In general all passes that help scalar evolution and that canonicalize the code are good to execute before graphite.


It may make sense to create a new block of loop optimizations with the preparing transformations + graphite, before the normal loop optimizations. This makes sense in general, as graphite only does high level optimizations and the resulting code should anyway have normal low-level loop optimizations applied afterwards. Also, this would keep the schedule identical if graphite is not enabled, and would only add compile time overhead with graphite enabled.
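A hedged sketch of what such a separate block might look like in the pass list, following the NEXT_PASS style used elsewhere in passes.c (the exact selection and ordering of preparing passes here is illustrative, not a tested proposal):

```
      /* Hypothetical: a graphite block with its preparing passes,
         scheduled before the normal loop optimization pipeline.
         Gated on graphite being enabled, so the default pipeline
         and its compile time are unchanged otherwise.  */
      NEXT_PASS (pass_tree_loop_init);
      NEXT_PASS (pass_scev_cprop);     /* remove scalar deps from outside the loops */
      NEXT_PASS (pass_tree_unswitch);  /* hoist conditions graphite cannot model */
      NEXT_PASS (pass_graphite);
      NEXT_PASS (pass_tree_loop_done);
      /* ... followed by the existing low-level loop optimizations ... */
```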

Are there any passes that benefit
GRAPHITE that currently run before it from inside loop?

       NEXT_PASS (pass_tree_loop);
         {
           struct opt_pass **p = &pass_tree_loop.pass.sub;
           NEXT_PASS (pass_tree_loop_init);
           NEXT_PASS (pass_lim);
           NEXT_PASS (pass_copy_prop);
           NEXT_PASS (pass_dce_loop);
           NEXT_PASS (pass_tree_unswitch);
           NEXT_PASS (pass_scev_cprop);
           NEXT_PASS (pass_record_bounds);
           NEXT_PASS (pass_check_data_deps);
           NEXT_PASS (pass_loop_distribution);
           NEXT_PASS (pass_copy_prop);
           NEXT_PASS (pass_graphite);

I suppose scev_cprop helps because it removes scalar dependences
from outside of the loops, unswitch might help because graphite
does not handle conditional code (? then if-conversion would help, too).

Sounds reasonable. I did not investigate this yet.


I would say invariant motion from LIM does not hurt either; it is probably store motion that does (it introduces scalar dependences from outside of the loop).

We've already moved the PRE-like predictive commoning after loop
optimizations, and PRE has several measures to not introduce
dependences that would confuse vectorization.  The vectorizer only
handles scalar reductions, so it requires store-motion done by LIM.
The vectorizer meanwhile handles invariant loads just fine though
(I'll look if we can handle stores to invariant locations, too).

I think that needs some further investigation. It may make sense to try e.g. 2mm from polybench and look for a pass sequence that allows the isl optimizer to jump in. Unfortunately I don't have time to try it myself, but I am obviously available in case you need help here. (You need to take aliasing into account and probably want to use C99 arrays.)


Re-ordering passes is always something tedious though ;)

Yes. The idea I mentioned above should not change the default pass ordering.


Cheers
Tobi

