Add a loop versioning pass
Richard Biener
rguenther@suse.de
Thu Dec 13 16:08:00 GMT 2018
On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford <richard.sandiford@arm.com> wrote:
>Richard Biener <richard.guenther@gmail.com> writes:
>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>> Also repeated the performance testing (but haven't yet tried an
>>> LTO variant; will do that over the weekend).
>>
>> Any results?
>
>Sorry, I should've remembered that finding time to run tests is easy,
>finding time to analyse them is hard.
>
>Speed-wise, the impact of the patch for LTO is similar to without,
>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>-flto=jobserver.
>
>Size-wise, there are three tests that grow by >=2% on x86_64:
>
>549.fotonik3d_r: 5.5%
>548.exchange2_r: 29.5%
>554.roms_r: 39.6%
Uh. With LTO we might have a reasonable guessed profile and you do have a optimize_loop_nest_for_speed guard on the transform?
How does compile time fare with the above benchmarks?
>The roms increase seems fair enough given the benefit, since the
>whole program uses a similar style for handling arrays.
>
>fotonik3d is a mixed bag. Some versioning conditions are from
>things like:
>
> subroutine foo(a)
> real :: a(:,:,:)
> a(1,:,:) = ...
> end subroutine
>
>where we version for the middle dimension having a stride of 1.
>This could be eliminated if we had more semantic information.
>
>Other conditions are for things like:
>
> subroutine foo(a)
> real :: a(:,:,:)
> a(:,1,:) = ...
> end subroutine
>
>where the pass really does help, but unfortunately those loops
>aren't hot and might not even be used at all.
>
>Some opportunities are cases in which we avoid gather loads, such as
>in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
>But mp feeds a gather load itself and these assignments aren't a
>critical part of the loop.
>
>(I'm testing on an AVX-only system, so these are synthesised gathers
>rather than real gathers.)
>
>The growth in 548.exchange2_r comes from reasonable changes to cold
>code.
>The test spends almost all its time in __brute_force_MOD_digits_2,
>which
>contains the same code before and after the patch, but which accounts
>for less than 1% of .text before the patch.
>
>> I've skimmed over the updated patch and it looks
>> a lot better now.
>>
>> +bool
>> +loop_versioning
>> +::find_per_loop_multiplication (address_info &address,
>address_term_info &term)
>> +{
>>
>> is that what coding convention allows?
>
>Not sure we've settled on something, so I improvised.
>
>> For grepping I'd then say we should do
>>
>> bool loop_versioning::
>> find_per_loop_multiplication (...)
>>
>> ;)
>
>Sounds good to me.
>
>> Anywhere else we you use
>>
>> loop_versioning::foo
>>
>> so please stick to that.
>
>Yeah, I used that when I could avoid an argument spilling over 80
>chars,
>but I think I missed that the above function now fits into that
>category,
>Will double-check the others.
>
>> Otherwise OK.
>
>Thanks :-)
>
>> I think I don't see a testcase where we could version both loops in a
>nest
>> so I'm not sure whether the transform works fine when you are only
>> updating SSA form at the very end of the pass.
>
>You mean something like:
>
> real :: foo(:,:), bar(:)
>
> do i=1,n
> do j=1,n
> foo(i,j) = ...
> end do
> bar(i) = ..
> end do
>
>? I can add a test if so.
Please.
>At the moment the pass only looks for direct versioning opportunities
>in inner loops, so the assignment to bar wouldn't be treated as a
>versioning opportunity.
Ah, OK.
We should still hoist the version check
>for foo outside both loops, which is tested by loop_versioning_4.f90
>for cases in which the outer loop doesn't have its own array
>arithmetic, but isn't tested for cases like the above.
>
>> There may also be some subtle issues with substitute_and_fold being
>> applied to non-up-to-date SSA form given it folds stmts looking at
>> (single-use!) SSA edges. The single-use guard might be what saves
>you
>> here (SSA uses in the copies are not yet updated to point to the
>> copied DEFs).
>
>OK. I was hoping that because we only apply substitute_and_fold
>to new code, there would be no problem with uses elsewhere.
Might be, yes.
>Would it be safer to:
>
> - version all loops we want to version
> - update SSA explicitly
> - apply substitute and fold to all "new" loops
That would be definitely less fishy. But you need to get at the actual 'new' SSA names for the replacements as I guess they'll be rewritten? Or maybe those are not.
>? Could we then get away with returning a 0 TODO at the end?
Yes.
Richard.
>Thanks,
>Richard
More information about the Gcc-patches
mailing list