Loop fusion.

Tue Apr 24 12:58:00 GMT 2018

On Mon, Apr 23, 2018 at 8:35 PM, Toon Moene <toon@moene.org> wrote:
> On 04/23/2018 01:00 PM, Richard Biener wrote:
>
>> On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote:
>>>
>>> A few days ago there was a rant on the Fortran Standardization
>>> Committee's
>>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>>>
>>> An example picked at random from our weather forecasting code:
>>>
>>>      ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>>>      ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>>>      ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>>>      ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>>>
>>> The reaction from one of the members of the committee (about "their"
>>> compiler):
>>>
>>> 'And multiple consecutive array statements with the same shape are
>>> “fused”
>>> exactly so that the compiler can generate good cache use. This sort of
>>> optimization is pretty low hanging fruit.'
>>>
>>> As far as I can see loop fusion as a stand-alone optimization is not
>>> supported as yet, although some mention is made in the context of
>>> graphite.
>>>
>>> Is this something that should be pursued ?
>>
>>
>> In principle GRAPHITE can handle loop fusion but yes, standalone fusion
>> is sth useful.
>>
>> Note that while it looks "obvious" in the above source fragment the IL
>> that is presented to optimizers may make things a lot less "low-hanging".
>
>
> Well, the loops are generated by the front end, so I *assume* they are
> basically the same ...

The issue will be boiler-plate code like duplicated loop header checks.
That said, it's perfectly understandable that other Fortran compilers have
high-level loop opts deeply embedded within their frontends...

> Probably the largest problem to address is the heuristic for preventing
> register pressure going through the roof ...

Yes.  As Bin said loop fusion and fission are closely related and should
share their cost modeling - they both have the goal to optimize the
number of input and output memory streams and their re-use within
the constraints of the CPU architecture (which includes number of
available registers).

Note that the difficulty with fusion compared to fission is that our
dependence analysis framework cannot handle the case of sibling
loops (as required by fusion).  So we have to either apply some
"tricks" to use the current framework or move over to more capable
dependence testing infrastructure like that of ISL.

Richard.

>
> --
> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news