This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH GCC][13/13] Distribute loop with loop versioning under runtime alias check


On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>> Hi,
>>>>> Rebased V3 for changes in previous patches.  Bootstrapped and tested on
>>>>> x86_64 and aarch64.
>>>>
>>>> Why is ldist-12.c no longer distributed?  Your comment says it doesn't expose
>>>> more "parallelism", but the point is to reduce memory bandwidth requirements,
>>>> which it clearly does.
>>>>
>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the wording
>>>> of "parallelism" still confuses me.
>>>>
>>>> Can you elaborate on that?  Now onto the patch:
>>> Given we don't model data locality or memory bandwidth, whether
>>> distribution enables loops that can be executed in parallel becomes the
>>> major criterion for distribution.  BTW, I think a good memory stream
>>> optimization model shouldn't consider small loops like the ones in
>>> ldist-12.c, etc., appropriate for distribution.
>>
>> True.  But what does "parallel" mean here?  ldist-13.c, if partitioned into
>> two loops, can be executed "in parallel".
> So if a loop by itself can be vectorized (or, so to speak, can be executed
> in parallel), we tend not to distribute it into small ones.  But there
> is one exception: if the distributed small loops are recognized
> as builtin functions, we still distribute the loop.  I assume it's generally
> better to call builtin memory functions than to have GCC vectorize the loop?

Yes.
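
[Editorial illustration -- a minimal sketch, not taken from the patch or the
testsuite.  The "builtin exception" means a loop like the following is still
split even though it is vectorizable as a whole, because one partition turns
into a library call:

  /* Hypothetical example.  Loop distribution splits this into two
     partitions: the first is recognized as memset and emitted as a
     builtin call, the second remains a loop the vectorizer handles.  */
  void
  f (int *restrict a, int *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;          /* partition 1: becomes memset (a, 0, ...)  */
        b[i] = b[i] + 1;   /* partition 2: stays a vectorizable loop  */
      }
  }
]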

>>
>>>>
>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>> +   same loop header.  The major goal is to separate statements which may
>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>> +   in the following steps:
>>>>
>>>> This misses the goal of being a memory stream optimization, not only a
>>>> vectorization enabler.  Distributing a loop can also reduce register pressure.
>>> I will revise the comment, but as explained, enabling more
>>> vectorization is the major criterion for distribution to some extent
>>> now.
>>
>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
> Let's see if any performance drop is reported against this patch,
> and let's see if we can create a cost model for it.

Fine.
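
[Editor's note: to make the pass comment above concrete, a hand-written
before/after sketch (hypothetical code, not a testsuite case) of separating
a vectorizable statement from one that blocks vectorization:

  /* Before distribution: the recurrence on d blocks vectorization
     of the whole loop.  */
  void
  f_before (int *a, int *b, int *c, int *d, int n)
  {
    for (int i = 1; i < n; i++)
      {
        a[i] = b[i] + c[i];       /* vectorizable on its own  */
        d[i] = d[i - 1] + a[i];   /* cross-iteration recurrence  */
      }
  }

  /* After distribution: the first loop can be vectorized while the
     second stays scalar; each loop also touches fewer memory streams.  */
  void
  f_after (int *a, int *b, int *c, int *d, int n)
  {
    for (int i = 1; i < n; i++)
      a[i] = b[i] + c[i];
    for (int i = 1; i < n; i++)
      d[i] = d[i - 1] + a[i];
  }
]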

>>
>>>>
>>>> You introduce ldist_alias_id in struct loop (probably in 01/n, which I haven't
>>>> looked into yet).  If you don't use it there, please introduce it separately.
>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>
>>>>
>>>> +             /* Be conservative.  If the data references are not well analyzed,
>>>> +                or the two data references have the same base address and
>>>> +                offset, add a dependence and consider them to alias each other.
>>>> +                In other words, the dependence cannot be resolved by a
>>>> +                runtime alias check.  */
>>>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>> +                 || res == 0)
>>>>
>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>> a specific case?
>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>> Unfortunately, it was factored out of the vectorizer's usage and doesn't fit
>>> here straightforwardly.  Shall I try to further generalize the
>>> interface in an independent patch preceding this one?
>>
>> That would be nice.
>>
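
[Editor's note: for reference, the runtime alias check being discussed
conceptually has this shape (a source-level sketch; the pass emits the
equivalent test in GIMPLE, and the exact bound computation depends on the
data references' base, offset and step):

  /* Sketch: version the loop on a runtime test that the two accessed
     segments do not overlap; only the non-aliasing version gets
     distributed (and later vectorized).  */
  if (a + n <= b || b + n <= a)
    {
      /* distributed copy: a[] and b[] proven not to overlap  */
    }
  else
    {
      /* original loop, kept for the possibly-aliasing case  */
    }
]
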
>>>>
>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>> +  if (flag_tree_loop_vectorize)
>>>> +    {
>>>>
>>>> so at this point I'd condition the whole runtime-alias-check generation
>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>> that here, but in other places you disable versioning w/o flag_tree_loop_vectorize.
>>>> That looks somewhat inconsistent...
>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>> +
>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>> +  if (alias_ddrs->length () > 0)
>>> +    return true;
>>> +
>>> +  /* Don't version the loop with a call to IFN_LOOP_DIST_ALIAS if the
>>> +     vectorizer is not enabled, because no other pass can fold it.  */
>>> +  if (!flag_tree_loop_vectorize)
>>> +    return false;
>>> +
>>>
>>> It means we also version loops even if the runtime alias check is
>>> unnecessary.  I did this because we lack a cost model and the current
>>> distribution may result in too many distributions.  If that's the case,
>>> at least the vectorizer will remove the distributed version of the loop
>>> and fall back to the original one.  Hmm, shall I change it to the code below:
>>
>> Hmm, ok - that really ties things to vectorization apart from the
>> builtin recognition.  So what happens if the vectorizer can vectorize
>> both the distributed and non-distributed loops?
> Hmm, there are no such cases.  Provided no builtins are
> recognized, we wouldn't distribute a loop that can be
> vectorized by itself.  Does this make sense?  Of course, a cost model for
> memory behavior could change this and is wanted.

So in which cases _do_ we split loops then?  "more parallelism" -- but what
does that mean exactly?  Is there any testcase that shows the desired
splits for vectorization?
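
[Editor's note: for orientation, the versioning scheme under discussion looks
roughly like the following C-like pseudocode of the GIMPLE; the argument list
of the internal function is simplified here:

  /* Pseudocode sketch.  The IFN_LOOP_DIST_ALIAS call guards the
     distributed copy; the vectorizer later folds it to true (keep the
     distributed loops) or false (fall back to the original loop),
     which is why this versioning requires flag_tree_loop_vectorize.  */
  if (IFN_LOOP_DIST_ALIAS (alias_check_cond, loop_id))
    {
      /* distributed loops  */
    }
  else
    {
      /* original, undistributed loop  */
    }
]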

>>
>>> +
>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>> +  if (alias_ddrs->length () == 0)
>>> +    return false;
>>> +
>>> +  /* Don't version the loop with a call to IFN_LOOP_DIST_ALIAS if the
>>> +     vectorizer is not enabled, because no other pass can fold it.  */
>>> +  if (!flag_tree_loop_vectorize)
>>> +    return false;
>>> +
>>>
>>> Then I can remove the check you mentioned in function
>>> version_loop_by_alias_check?
>>
>> Yeah, I guess that would be easier to understand.  Need to update
>> the comment above the alias_ddrs check though.
> Yes, the logic here is complicated.  On one hand, I want to be
> conservative by versioning with IFN_LOOP_DIST_ALIAS so that the vectorizer
> can undo all "unnecessary" distributions before memory behavior is
> modeled.
>
>
>>
>>>>
>>>> +  /* Don't version loop if any partition is builtin.  */
>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>> +    {
>>>> +      if (partition->kind != PKIND_NORMAL)
>>>> +       break;
>>>> +    }
>>>>
>>>> Why's that?  Do you handle the case where only a subset of partitions
>>> One reason is that I generally consider distributed builtin functions a
>>> win, so the distribution won't be canceled later in the vectorizer.  Another
>>> reason is that if all distributed loops are recognized as builtins, we
>>> can't really version with the current logic, because the
>>> IFN_LOOP_DIST_ALIAS call won't be folded in the vectorizer.
>>
>> Ah, ok.  I guess the maze was too twisted for me to see what
>> version_for_distribution_p does ;)
>>
>>>> require a runtime alias check to be generated?  Thus from one loop you
>>>> generate three loops, with only two of them versioned?  The
>>>> complication would be that both the runtime-alias and non-alias
>>>> versions would be "transformed".  Or do we rely on recursively
>>>> distributing the distribution result, so that if we have partitions that
>>>> can be handled w/o a runtime alias check we fuse the remaining partitions
>>>> and recurse on those?
>>> No, this is not precisely handled now; the pass will version the whole
>>> loop once.  Though I think it's not very difficult to do two-stage
>>> distribution, I am not sure how useful it would be.
>>
>> Not sure either.
>>
>> So, to understand: you're looking at loop distribution as a vectorization
>> enabler and pattern detector.  I think that is reasonable without a better
>> cost model.
>>
>> Note that I think the state before your patches had the sensible cost model
>> of fusing very conservatively and just producing the maximum distribution,
>> with the idea that the looping overhead itself is cheap.  Note that with
>> a more "maximal" distribution the vectorizer also gets the chance to
>> do "partial vectorization" in case profitability differs.  Of course the
>> setup cost may offset that in the case where all loops end up vectorized...
> Ideally, we would have a cost model for memory behavior in distribution.  If
> we know distribution is beneficial, we can simply
> distribute the loop; otherwise we pass the distribution cost (including memory
> cost as well as runtime alias check cost) as an argument to
> IFN_LOOP_DIST_ALIAS, so the vectorizer can weigh the cost/benefit
> together with vectorization.

Yes.  The old cost model wasn't really one, thus loop distribution was never
enabled by default.

>>
>> So in reality the vectorizer itself should "distribute away"
>> non-vectorizable parts of the loop ... :/
> Do we really want to put more and more things into the vectorizer? :)
> Michael's opinion (IIUC, about introducing different loop
> optimizations as building blocks for a loop optimization infrastructure)
> is quite inspiring to me.  In this way, we can at least make each
> optimization conservative.  In general, it's much worse to do an
> aggressive (unnecessary) transformation than to simply miss the
> opportunity?

Heh, but with an increasing number of "different loop optimizations" we
miss an overall goal.  Either each pass knows what the others
would do (vectorize or predcom, for example), or we need a
meta-engine driving the building blocks, with the building blocks
divided into an analysis-plus-costing phase and a transform phase (and if
two transforms shall apply, their analyses may not be invalidated
by an earlier one).  Integrating the simple building blocks into one
framework is what is sexy about GRAPHITE ...

For example, predcom might be more beneficial than vectorization
with v2df vectors.

Richard.

>
> Thanks,
> bin
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> bin

