This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check


On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>> Rebased V3 for changes in previous patches.  Bootstap and test on
>>> x86_64 and aarch64.
>>
>> why is ldist-12.c no longer distributed?  your comment says it doesn't expose
>> more "parallelism" but the point is to reduce memory bandwith requirements
>> which it clearly does.
>>
>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the wording
>> of "parallelism" still confuses me.
>>
>> Can you elaborate on that.  Now onto the patch:
> Given we don't model data locality or memory bandwidth, whether
> distribution enables loops that can be executed paralleled becomes the
> major criteria for distribution.  BTW, I think a good memory stream
> optimization model shouldn't consider small loops as in ldist-12.c,
> etc., appropriate for distribution.

True.  But what means "parallel" here?  ldist-13.c if partitioned into two loops
can be executed "in parallel"

>>
>> +   Loop distribution is the dual of loop fusion.  It separates statements
>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>> +   same loop header.  The major goal is to separate statements which may
>> +   be vectorized from those that can't.  This pass implements distribution
>> +   in the following steps:
>>
>> misses the goal of being a memory stream optimization, not only a vectorization
>> enabler.  distributing a loop can also reduce register pressure.
> I will revise the comment, but as explained, enabling more
> vectorization is the major criteria for distribution to some extend
> now.

Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.

>>
>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>> didn't look
>> into yet).  If you don't use that please introduce it separately.
> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>
>>
>> +             /* Be conservative.  If data references are not well analyzed,
>> +                or the two data references have the same base address and
>> +                offset, add dependence and consider it alias to each other.
>> +                In other words, the dependence can not be resolved by
>> +                runtime alias check.  */
>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>> +                 || res == 0)
>>
>> ISTR a helper that computes whether we can handle a runtime alias check for
>> a specific case?
> I guess you mean runtime_alias_check_p that I factored out previously?
>  Unfortunately, it's factored out vectorizer's usage and doesn't fit
> here straightforwardly.  Shall I try to further generalize the
> interface as independence patch to this one?

That would be nice.

>>
>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>> +  if (flag_tree_loop_vectorize)
>> +    {
>>
>> so at this point I'd condition the whole runtime-alias check generating
>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>> That looks somewhat inconsistent...
> It is a bit complicated.  In function version_for_distribution_p, we have
> +
> +  /* Need to version loop if runtime alias check is necessary.  */
> +  if (alias_ddrs->length () > 0)
> +    return true;
> +
> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
> +     vectorizer is not enable because no other pass can fold it.  */
> +  if (!flag_tree_loop_vectorize)
> +    return false;
> +
>
> It means we also versioning loops even if runtime alias check is
> unnecessary.  I did this because we lack cost model and current
> distribution may result in too many distribution?  If that's the case,
> at least vectorizer will remove distributed version loop and fall back
> to the original one.  Hmm, shall I change it into below code:

Hmm, ok - that really ties things to vectorization apart from the
builtin recognition.  So what happens if the vectorizer can vectorize
both the distributed and non-distributed loops?

> +
> +  /* Need to version loop if runtime alias check is necessary.  */
> +  if (alias_ddrs->length () == 0)
> +    return false;
> +
> +  /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
> +     vectorizer is not enable because no other pass can fold it.  */
> +  if (!flag_tree_loop_vectorize)
> +    return false;
> +
>
> Then I can remove the check you mentioned in function
> version_loop_by_alias_check?

Yeah, I guess that would be easier to understand.  Need to update
the comment above the alias_ddrs check though.

>>
>> +  /* Don't version loop if any partition is builtin.  */
>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>> +    {
>> +      if (partition->kind != PKIND_NORMAL)
>> +       break;
>> +    }
>>
>> why's that?  Do you handle the case where only a subset of partitions
> One reason is I generally consider distributed builtin functions as a
> win, thus distribution won't be canceled later in vectorizer.  Another
> reason is if all distributed loops are recognized as builtins, we
> can't really version with current logic because the
> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.

Ah, ok.  I guess the maze was too twisted for me to see what
version_for_distribution_p
does ;)

>> require a runtime alias check to be generated?  Thus from a loop
>> generate three loops, only two of them being versioned?  The
>> complication would be that both the runtime alias and non-alias
>> versions would be "transformed".  Or do we rely on recursively
>> distributing the distribution result, thus if we have partitions that
>> can be handled w/o runtime alias check fuse the remaining partitions
>> and recurse on those?
> No, this is not precisely handled now, the pass will version the whole
> loop once.  Though I think it's not very difficult to do two stages
> distribution, I am not sure how useful it is.

Not sure either.

So to understand you're looking at loop-distribution as vectorization enabler
and pattern detector.  I think that is reasonable without a better cost model.

Note that I think the state before your patches had the sensible cost-model
that it fused very conservatively and just produced the maximum distribution
with the idea that the looping overhead itself is cheap.  Note that with
a more "maximum" distribution the vectorizer also gets the chance to
do "partial vectorization" in case profitability is different.  Of course the
setup cost may offset that in the case all loops end up vectorized...

So in reality the vectorizer itself should "distribute away"
non-vectorizable parts
of the loop ... :/

Thanks,
Richard.

> Thanks,
> bin


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]