[PATCH] Merge the "ISL optimizer" from the graphite branch

Jack Howarth howarth@bromo.med.uc.edu
Thu Jul 5 15:42:00 GMT 2012


On Tue, Jul 03, 2012 at 02:45:29PM +0200, Tobias Grosser wrote:
> On 07/03/2012 02:24 PM, Richard Guenther wrote:
>> On Tue, 3 Jul 2012, Tobias Grosser wrote:
>>
>>> On 07/03/2012 01:56 PM, Richard Guenther wrote:
>>>> On Tue, 3 Jul 2012, Tobias Grosser wrote:
>>>>
>>>>> On 07/03/2012 01:15 PM, Richard Guenther wrote:
>>>>>>
>>>>>> This merges the last bit from the graphite ISL branch - an
>>>>>> integrated optimizer based on ISL.  To quote Tobias:
>>>>>>
>>>>>> "The isl scheduling optimizer implements the scheduling algorithm
>>>>>> first developed in Pluto [1]. Pluto has shown significant speedups
>>>>>> and is nowadays even implemented in the IBM XL-C compiler. The
>>>>>> implementation of this pass is a first draft and was copied largely
>>>>>> from Polly. We still need to adapt the code to the gcc coding style
>>>>>> and we need to tune the isl scheduler. At the moment we get
>>>>>> reasonable compile times (at most a 2x-3x slowdown) and first
>>>>>> speedups. We now need to tune the compile time and start to
>>>>>> investigate which optimizations and heuristics need to be tuned in
>>>>>> our reimplementation.
>>>>>>
>>>>>> [1] http://pluto-compiler.sourceforge.net/"
>>>>>>
>>>>>> Micha kindly did the code adaption to gcc coding style and I renamed
>>>>>> the flag to -floop-nest-optimize (from -floop-optimize-isl).  We
>>>>>> both agree that such an integrated LNO is the way to go, superseding
>>>>>> the individual graphite transforms we have now.  We might even be
>>>>>> able to drop the separate blocking & strip-mining transforms we have
>>>>>> right now in favor of this?
>>>>>
>>>>> Thanks Micha for adapting the style to gcc.
>>>>>
>>>>> I would like to point out that this pass is still very experimental
>>>>> and not tuned at all. Specifically, it was only tested on polybench
>>>>> with one specific set of flags. Even there we did not get only
>>>>> speedups; due to missing heuristics, some benchmarks also showed
>>>>> large slowdowns. When using it on even slightly different benchmarks
>>>>> or with slightly different flags, infinite compile time or large
>>>>> performance regressions may show up! This optimizer may obviously
>>>>> also contain bugs that lead to miscompiles.
>>>>>
>>>>> Also, the loop nest optimizer will not be very effective as long as
>>>>> pre and licm are scheduled before graphite.
>>>>
>>>> I have noticed the change to disable those on the graphite-isl branch,
>>>> but I fail to see how we can avoid handling PREd/LIMd loops from
>>>> within polyhedral optimizations.  In fact, even the user may have
>>>> performed PRE or LIM at the source level, so the place to address this
>>>> issue is surely within graphite/ISL.
>>>
>>> You can still handle those loops; however, PRE and LIM will introduce
>>> a lot of additional dependences, which will block transformations. Such
>>> dependences can be removed, e.g. with array expansion or by undoing
>>> some of the PRE/LIM transformations, but this is a difficult problem,
>>> especially if you don't want to increase memory usage too much.
>>>
>>> I do not see any benefit from PRE and LIM before graphite, as at the
>>> very best their transformations will be reverted (if such an
>>> optimization is ever written). However, scheduling PRE and LIM after
>>> graphite makes perfect sense, especially as there are probably a lot
>>> of new opportunities.
>>>
>>> Moving such passes behind graphite obviously does not solve the case
>>> of manual PRE and LIM done by the user. However, it will allow us to
>>> optimize the code that was not manually PREd a lot better.
>>
>> I suppose it might make sense to not perform GRAPHITE stuff from within
>> our main loop optimization pipeline.
>
> I have no idea. In general all passes that help scalar evolution and  
> that canonicalize the code are good to execute before graphite.
>
> It may make sense to create a new block of loop optimizations with the  
> preparing transformations + graphite, before the normal loop  
> optimizations. This makes sense in general, as graphite only does  
> high-level optimizations and the resulting code should anyway have the  
> normal low-level loop optimizations applied afterwards. Also, this  
> would keep the schedule identical if graphite is not enabled, and would  
> only add compile-time overhead with graphite enabled.
>
>> Are there any passes that benefit
>> GRAPHITE that currently run before it from inside loop?
>>
>>        NEXT_PASS (pass_tree_loop);
>>          {
>>            struct opt_pass **p = &pass_tree_loop.pass.sub;
>>            NEXT_PASS (pass_tree_loop_init);
>>            NEXT_PASS (pass_lim);
>>            NEXT_PASS (pass_copy_prop);
>>            NEXT_PASS (pass_dce_loop);
>>            NEXT_PASS (pass_tree_unswitch);
>>            NEXT_PASS (pass_scev_cprop);
>>            NEXT_PASS (pass_record_bounds);
>>            NEXT_PASS (pass_check_data_deps);
>>            NEXT_PASS (pass_loop_distribution);
>>            NEXT_PASS (pass_copy_prop);
>>            NEXT_PASS (pass_graphite);
>>
>> I suppose scev_cprop helps because it removes scalar dependences
>> from outside of the loops; unswitch might help because graphite
>> does not handle conditional code (then if-conversion would help, too?).
>
> Sounds reasonable. I did not investigate this yet.
>
>> I would say invariant motion from LIM does not hurt either, it is probably
>> store-motion that does (it introduces scalar dependences from outside
>> of the loop).
>>
>> We've already moved the PRE-like predictive commoning after loop
>> optimizations, and PRE has several measures to not introduce
>> dependences that would confuse vectorization.  The vectorizer only
>> handles scalar reductions, so it requires store-motion done by LIM.
>> The vectorizer meanwhile handles invariant loads just fine though
>> (I'll look if we can handle stores to invariant locations, too).
>
> I think that needs some further investigation. It may make sense to try  
> e.g. 2mm from polybench and look for a pass sequence that allows the  
> isl optimizer to jump in. Unfortunately I don't have time to try it  
> myself, but I am obviously available in case you need help here. (You  
> need to take aliasing into account, and you probably want to use C99  
> arrays.)
>
>> Re-ordering passes is always something tedious though ;)
>
> Yes. The idea I mentioned above should not change the default pass ordering.
>
> Cheers
> Tobi

   On a slightly different topic, I thought there was a plan at one point to make
-ftree-loop-linear the default for -O3 when gcc was built with graphite support.
Has this idea been abandoned?
                      Jack


