This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH GCC]More conservative interchanging small loops with const initialized simple reduction
On Tue, Dec 12, 2017 at 1:42 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small loops with a constant-initialized simple reduction.
>>>>> The reason is that undoing such a reduction introduces a new data reference and a cond_expr, which could cost too much
>>>>> in a small loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch. Is it OK if testing passes?
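For readers outside the thread, the transformation being costed can be sketched in plain C. This is a hypothetical example in the spirit of pr62178.c, not the actual testcase: after interchange, a constant-initialized reduction has to be "undone", turning a register accumulator into a memory reference plus a ?: guarded on the new outer IV.

```c
#define N 4
static int a[N][N], b[N], c[N];

/* Before interchange: a simple reduction whose accumulator starts
   from a constant and lives in a register across the inner loop.  */
void reduc_before(void) {
  for (int i = 0; i < N; i++) {
    int sum = 0;                      /* constant-initialized reduction */
    for (int k = 0; k < N; k++)
      sum += a[k][i] * b[k];
    c[i] = sum;
  }
}

/* After interchange the reduction must be undone: c[i] becomes a real
   data reference (a load and a store per iteration), and the constant
   init turns into a ?: on the new outer IV.  */
void reduc_after(void) {
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++) {
      int t = (k == 0) ? 0 : c[i];    /* new cond_expr + new load */
      c[i] = t + a[k][i] * b[k];
    }
}
```

Both versions compute the same c[], but the second carries the extra data reference and conditional that the patch is trying to cost.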
>>>>
>>>> Shouldn't we do this even for non-constant-initialized simple
>>>> reductions? Because for any simple reduction we add two DRs that
>>>> are not innermost; for constant-initialized ones we add an
>>>> additional cond-expr. So ...
>>>>
>>>> + /* Conservatively skip interchange in cases only have few data references
>>>> + and constant initialized simple reduction since it introduces new data
>>>> + reference as well as ?: operation. */
>>>> + if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> + return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc, simply
>>>> loop over m_reductions and classify them in this function
>>>> accordingly? I think we want to cost non-constant-init reductions
>>>> as well. The ?: can eventually count as another DR for cost
>>>> purposes.
>>> The number of non-constant-init reductions can still be carried in
>>> struct loop_cand? I am not sure what the advantage of an additional
>>> loop over m_reductions would be when it gathers the same information.
>>> Perhaps the increase in stmts should be counted like:
>>> num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> The question is which number this should be compared against. (We may
>>> need to shift num_new_inv_drs to the other side because of the
>>> unsigned wrapping issue.)
>>>
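The wrapping issue mentioned above is easy to demonstrate: the DR counters are unsigned, so a subtraction on the left-hand side can wrap to a huge value and make the comparison misfire. A minimal standalone sketch (variable names are illustrative, not GCC's):

```c
/* If new_inv exceeds the other terms, the subtraction wraps below
   zero in unsigned arithmetic and the comparison fires spuriously.  */
int compare_naive(unsigned old_inv, unsigned reduc, unsigned new_inv,
                  unsigned n_drs) {
  return old_inv + reduc * 2 - new_inv >= n_drs;
}

/* Moving new_inv to the right-hand side keeps every term
   non-negative, so no wrapping can occur.  */
int compare_safe(unsigned old_inv, unsigned reduc, unsigned new_inv,
                 unsigned n_drs) {
  return old_inv + reduc * 2 >= n_drs + new_inv;
}
```

With old_inv = 1, reduc = 0, new_inv = 3 and n_drs = 4, the naive form computes 1 - 3 as a value near UINT_MAX and returns true, while the safe form correctly returns false.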
>>>>
>>>> It looks like we do count the existing DRs for the reduction? Is that
>>>> why you arrive
>>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might have misunderstood here. num_old_inv_drs tracks the
>>> number of invariant references with regard to the inner loop, rather
>>> than the outer loop. The same goes for num_new_inv_drs, which counts
>>> references that become invariant with regard to the (new) inner loop
>>> after interchange. This invariance information is always known from
>>> the data reference, right?
>>> As for the DRs for reductions, we know they are invariant because we
>>> set their inner loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition,
>>>> and thus vectorization can hoist the mask generation out of the
>>>> vectorized loop, which means it boils down to cheap operations. My
>>>> gut feeling is that just looking at the number of memory references
>>>> isn't a good indicator of profitability, as the regular stmt
>>>> workload has a big impact on the profitability of vectorization.
>>> It's not specific to vectorization. The newly generated code also
>>> costs too much in small loops even without vectorization. But yes,
>>> the number of mem_refs may be too inaccurate; maybe we should check
>>> against num_stmts.
>>
>> Not specific to vectorization, but the interchange may pay off only
>> when vectorizing a loop. Would the loop in loop-interchange-5.c still
>> be interchanged if we removed the multiplication and just kept
>> c[i][j] = c[i][j] + b[k][j];
>> ? That is, why is the constant init so special? Even for non-constant
>> init we're changing two outer loop DRs to two non-consecutive inner
>> loop DRs.
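For reference, a nest of the shape under discussion (a guess at the form of loop-interchange-5.c, not the verbatim testcase) looks like this; interchanging the j and k loops turns the strided b[k][j] access into a stride-1 one, at the price of loading and storing c[i][j] on every inner iteration:

```c
#define N 8
static int a[N][N], b[N][N], c[N][N];

/* Before interchange: the innermost k-loop reduces into c[i][j],
   and b[k][j] is accessed with stride N.  */
void nest_before(void) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* After interchanging j and k: b[k][j] and c[i][j] are walked with
   stride 1 in the inner loop (good for vectorization), but c[i][j]
   has become a load plus a store per iteration.  */
void nest_after(void) {
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

With integer elements the two orders accumulate the same values, which is what makes the interchange legal; the cost question in the thread is whether the extra inner-loop data references are worth it for small trip counts.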
> Hi Richard,
> This is the updated patch, taking stmt cost into consideration.
>
> First, the stmt cost (derived from the number of stmts) of each loop
> is recorded. Then the stmt cost of the outer loop is adjusted:
> decreased by the number of IVs, increased by the number of
> constant-initialized simple reductions.
> Lastly we compare the stmt cost of the inner and outer loops and give
> up on interchange if the outer loop has too many stmts.
>
> Test gcc.target/aarch64/pr62178.c is fixed with this patch. Bootstrapped
> and tested on x86_64 and AArch64. Any comment?
This looks good now!
Ok for trunk.
Thanks,
Richard.
> Thanks,
> bin
> 2017-12-12 Bin Cheng <bin.cheng@arm.com>
>
> * gimple-loop-interchange.cc (STMT_COST_RATIO): New macro.
> (loop_cand::m_num_stmts, loop_cand::m_const_init_reduc): New members.
> (loop_cand::loop_cand): Initialize above members.
> (loop_cand::supported_operations): Delete.
> (loop_cand::can_interchange_p): Inline above function.
> (loop_cand::classify_simple_reduction): Record number of constant
> initialized simple reductions.
> (should_interchange_loops): New parameters. Check stmt cost of loops
> to be interchanged.
> (tree_loop_interchange::interchange): Prepare stmt cost of outer loop.
> Update call to should_interchange_loops.
> (should_interchange_loop_nest): Update call to
> should_interchange_loops.
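Putting the description above together, the new gate can be sketched roughly as follows. STMT_COST_RATIO and the stmt counters are named in the ChangeLog, but the ratio value, the signature, and the exact comparison here are guesses for illustration, not GCC's actual code in gimple-loop-interchange.cc:

```c
#include <stdbool.h>

/* Assumed value for illustration only, not necessarily GCC's.  */
#define STMT_COST_RATIO 3

/* Hypothetical sketch of the stmt-cost gate.  The outer loop's cost
   is adjusted downward for IV updates (which disappear after
   interchange) and upward for constant-initialized simple reductions
   (each adds a new data reference and a ?:), then compared against
   the inner loop's cost.  */
bool should_interchange(unsigned i_num_stmts, unsigned o_num_stmts,
                        unsigned o_num_ivs,
                        unsigned o_num_const_init_reduc) {
  unsigned adjusted = o_num_stmts - o_num_ivs
                      + 2 * o_num_const_init_reduc;
  /* Give up when the outer loop carries too many stmts relative to
     the inner loop: interchange would drag them all into the new
     inner loop body.  */
  return adjusted * STMT_COST_RATIO <= i_num_stmts;
}
```

For example, a heavy inner loop with a trivial outer loop passes the gate, while an outer loop whose adjusted cost rivals the inner loop's is rejected.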