Add a loop versioning pass

Richard Biener rguenther@suse.de
Thu Dec 13 16:08:00 GMT 2018


On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford <richard.sandiford@arm.com> wrote:
>Richard Biener <richard.guenther@gmail.com> writes:
>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>> Also repeated the performance testing (but haven't yet tried an
>>> LTO variant; will do that over the weekend).
>>
>> Any results?
>
>Sorry, I should've remembered that finding time to run tests is easy,
>finding time to analyse them is hard.
>
>Speed-wise, the impact of the patch for LTO is similar to without,
>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>-flto=jobserver.
>
>Size-wise, there are three tests that grow by >=2% on x86_64:
>
>549.fotonik3d_r: 5.5%
>548.exchange2_r: 29.5%
>554.roms_r: 39.6%

Uh. With LTO we might have a reasonable guessed profile and you do have a optimize_loop_nest_for_speed guard on the transform? 

How does compile time fare with the above benchmarks?

>The roms increase seems fair enough given the benefit, since the
>whole program uses a similar style for handling arrays.
>
>fotonik3d is a mixed bag.  Some versioning conditions are from
>things like:
>
>  subroutine foo(a)
>    real :: a(:,:,:)
>    a(1,:,:) = ...
>  end subroutine
>
>where we version for the middle dimension having a stride of 1.
>This could be eliminated if we had more semantic information.
>
>Other conditions are for things like:
>
>  subroutine foo(a)
>    real :: a(:,:,:)
>    a(:,1,:) = ...
>  end subroutine
>
>where the pass really does help, but unfortunately those loops
>aren't hot and might not even be used at all.
>
>Some opportunities are cases in which we avoid gather loads, such as
>in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
>But mp feeds a gather load itself and these assignments aren't a
>critical part of the loop.
>
>(I'm testing on an AVX-only system, so these are synthesised gathers
>rather than real gathers.)
>
>The growth in 548.exchange2_r comes from reasonable changes to cold
>code.
>The test spends almost all its time in __brute_force_MOD_digits_2,
>which
>contains the same code before and after the patch, but which accounts
>for less than 1% of .text before the patch.
>
>> I've skimmed over the updated patch and it looks
>> a lot better now.
>>
>> +bool
>> +loop_versioning
>> +::find_per_loop_multiplication (address_info &address,
>address_term_info &term)
>> +{
>>
>> is that what coding convention allows?
>
>Not sure we've settled on something, so I improvised.
>
>> For grepping I'd then say we should do
>>
>> bool loop_versioning::
>> find_per_loop_multiplication (...)
>>
>> ;)
>
>Sounds good to me.
>
>> Anywhere else we you use
>>
>> loop_versioning::foo
>>
>> so please stick to that.
>
>Yeah, I used that when I could avoid an argument spilling over 80
>chars,
>but I think I missed that the above function now fits into that
>category,
>Will double-check the others.
>
>> Otherwise OK.
>
>Thanks :-)
>
>> I think I don't see a testcase where we could version both loops in a
>nest
>> so I'm not sure whether the transform works fine when you are only
>> updating SSA form at the very end of the pass.
>
>You mean something like:
>
>  real :: foo(:,:), bar(:)
>
>  do i=1,n
>    do j=1,n
>      foo(i,j) = ...
>    end do
>    bar(i) = ..
>  end do
>
>?  I can add a test if so.

Please. 

>At the moment the pass only looks for direct versioning opportunities
>in inner loops, so the assignment to bar wouldn't be treated as a
>versioning opportunity.  

Ah, OK. 

We should still hoist the version check
>for foo outside both loops, which is tested by loop_versioning_4.f90
>for cases in which the outer loop doesn't have its own array
>arithmetic, but isn't tested for cases like the above.
>
>> There may also be some subtle issues with substitute_and_fold being
>> applied to non-up-to-date SSA form given it folds stmts looking at
>> (single-use!) SSA edges.  The single-use guard might be what saves
>you
>> here (SSA uses in the copies are not yet updated to point to the
>> copied DEFs).
>
>OK.  I was hoping that because we only apply substitute_and_fold
>to new code, there would be no problem with uses elsewhere.

Might be, yes.

>Would it be safer to:
>
>  - version all loops we want to version
>  - update SSA explicitly
>  - apply substitute and fold to all "new" loops

That would be definitely less fishy. But you need to get at the actual 'new' SSA names for the replacements as I guess they'll be rewritten? Or maybe those are not. 

>?  Could we then get away with returning a 0 TODO at the end?

Yes. 

Richard. 

>Thanks,
>Richard



More information about the Gcc-patches mailing list