This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check


On Mon, Jul 17, 2017 at 1:09 PM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 17 July 2017 at 12:06, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
>> <christophe.lyon@linaro.org> wrote:
>>> Hi Bin,
>>>
>>> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>> Rebased V3 for changes in previous patches.  Bootstap and test on
>>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>>
>>>>>>>>>> why is ldist-12.c no longer distributed?  Your comment says it doesn't expose
>>>>>>>>>> more "parallelism", but the point is to reduce memory bandwidth requirements,
>>>>>>>>>> which it clearly does.
>>>>>>>>>>
>>>>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the wording
>>>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>>>
>>>>>>>>>> Can you elaborate on that.  Now onto the patch:
>>>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>>>> distribution enables loops that can be executed in parallel becomes the
>>>>>>>>> major criterion for distribution.  BTW, I think a good memory stream
>>>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>>>> etc., appropriate for distribution.
>>>>>>>>
>>>>>>>> True.  But what means "parallel" here?  ldist-13.c if partitioned into two loops
>>>>>>>> can be executed "in parallel"
>>>>>>> So if a loop by itself can be vectorized (or, so to speak, executed
>>>>>>> in parallel), we tend not to distribute it into small ones.  But there
>>>>>>> is one exception: if the distributed small loops are recognized
>>>>>>> as builtin functions, we still distribute.  I assume it's generally
>>>>>>> better to call builtin memory functions than to vectorize the loop in GCC?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates statements
>>>>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>>>> +   same loop header.  The major goal is to separate statements which may
>>>>>>>>>> +   be vectorized from those that can't.  This pass implements distribution
>>>>>>>>>> +   in the following steps:
>>>>>>>>>>
>>>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>>>> enabler.  distributing a loop can also reduce register pressure.
>>>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>>>> vectorization is the major criterion for distribution to some extent
>>>>>>>>> now.
>>>>>>>>
>>>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>>>> Let's see if any performance drop will be reported against this patch.
>>>>>>> Let's see if we can create a cost model for it.
>>>>>>
>>>>>> Fine.
>>>>> I will run some benchmarks to see if there is breakage.
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>>>> didn't look
>>>>>>>>>> into yet).  If you don't use that please introduce it separately.
>>>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +             /* Be conservative.  If data references are not well analyzed,
>>>>>>>>>> +                or the two data references have the same base address and
>>>>>>>>>> +                offset, add the dependence and consider them to alias each
>>>>>>>>>> +                other.  In other words, the dependence cannot be resolved
>>>>>>>>>> +                by a runtime alias check.  */
>>>>>>>>>> +             if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>>>> +                 || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>>>> +                 || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>>>> +                 || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>>>> +                 || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>>>> +                 || res == 0)
>>>>>>>>>>
>>>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>>>> a specific case?
>>>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>>>>  Unfortunately, it was factored out for the vectorizer's usage and doesn't fit
>>>>>>>>> here straightforwardly.  Shall I try to further generalize the
>>>>>>>>> interface as an independent patch from this one?
>>>>>>>>
>>>>>>>> That would be nice.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +  /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS.  */
>>>>>>>>>> +  if (flag_tree_loop_vectorize)
>>>>>>>>>> +    {
>>>>>>>>>>
>>>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>>>> on flag_tree_loop_vectorize.  You seem to support versioning w/o
>>>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>>>> That looks somewhat inconsistent...
>>>>>>>>> It is a bit complicated.  In function version_for_distribution_p, we have
>>>>>>>>> +
>>>>>>>>> +  /* Need to version loop if runtime alias check is necessary.  */
>>>>>>>>> +  if (alias_ddrs->length () > 0)
>>>>>>>>> +    return true;
>>>>>>>>> +
>>>>>>>>> +  /* Don't version the loop with a call to IFN_LOOP_DIST_ALIAS if the
>>>>>>>>> +     vectorizer is not enabled, because no other pass can fold it.  */
>>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>>> +    return false;
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> It means we also version loops even if the runtime alias check is
>>>>>>>>> unnecessary.  I did this because we lack a cost model and current
>>>>>>>>> distribution may result in too many distributions.  If that's the case,
>>>>>>>>> at least the vectorizer will remove the distributed version of the loop
>>>>>>>>> and fall back to the original one.  Hmm, shall I change it into the code below:
>>>>>>>>
>>>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>>>> builtin recognition.  So what happens if the vectorizer can vectorize
>>>>>>>> both the distributed and non-distributed loops?
>>>>>>> Hmm, there are no such cases.  Provided no builtins are
>>>>>>> recognized, we wouldn't distribute a loop if it can be vectorized by
>>>>>>> itself.  Does this make sense?  Of course, a cost model for memory
>>>>>>> behavior can change this behavior and is wanted.
>>>>>>
>>>>>> So which cases _do_ we split loops then?  "more parallelism" -- but what
>>>>>> does that mean exactly?  Is there any testcase that shows the desired
>>>>>> splits for vectorization?
>>>>> At least one of the distributed loops can be executed in parallel while the
>>>>> original loop can't.
>>>>> Not many, but ldist-26.c is added by one of the patches.
>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +  /* No need to version the loop if runtime alias check is unnecessary.  */
>>>>>>>>> +  if (alias_ddrs->length () == 0)
>>>>>>>>> +    return false;
>>>>>>>>> +
>>>>>>>>> +  /* Don't version the loop with a call to IFN_LOOP_DIST_ALIAS if the
>>>>>>>>> +     vectorizer is not enabled, because no other pass can fold it.  */
>>>>>>>>> +  if (!flag_tree_loop_vectorize)
>>>>>>>>> +    return false;
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Then I can remove the check you mentioned in function
>>>>>>>>> version_loop_by_alias_check?
>>>>>>>>
>>>>>>>> Yeah, I guess that would be easier to understand.  Need to update
>>>>>>>> the comment above the alias_ddrs check though.
>>>>>>> Yes, the logic here is complicated.  On one hand, I want to be
>>>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that the vectorizer
>>>>>>> can undo all "unnecessary" distributions before memory behavior is
>>>>>>> modeled.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +  /* Don't version loop if any partition is builtin.  */
>>>>>>>>>> +  for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>>>> +    {
>>>>>>>>>> +      if (partition->kind != PKIND_NORMAL)
>>>>>>>>>> +       break;
>>>>>>>>>> +    }
>>>>>>>>>>
>>>>>>>>>> why's that?  Do you handle the case where only a subset of partitions
>>>>>>>>> One reason is that I generally consider distributed builtin functions a
>>>>>>>>> win, so the distribution won't be canceled later in the vectorizer.  Another
>>>>>>>>> reason is that if all distributed loops are recognized as builtins, we
>>>>>>>>> can't really version with the current logic, because the
>>>>>>>>> IFN_LOOP_DIST_ALIAS call won't be folded in the vectorizer.
>>>>>>>>
>>>>>>>> Ah, ok.  I guess the maze was too twisted for me to see what
>>>>>>>> version_for_distribution_p
>>>>>>>> does ;)
>>>>>>>>
>>>>>>>>>> require a runtime alias check to be generated?  Thus from a loop
>>>>>>>>>> generate three loops, only two of them being versioned?  The
>>>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>>>> versions would be "transformed".  Or do we rely on recursively
>>>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>>>> and recurse on those?
>>>>>>>>> No, this is not precisely handled now; the pass will version the whole
>>>>>>>>> loop once.  Though I think it's not very difficult to do two-stage
>>>>>>>>> distribution, I am not sure how useful it is.
>>>>>>>>
>>>>>>>> Not sure either.
>>>>>>>>
>>>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>>>> and pattern detector.  I think that is reasonable without a better cost model.
>>>>>>>>
>>>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>>>> with the idea that the looping overhead itself is cheap.  Note that with
>>>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>>>> do "partial vectorization" in case profitability is different.  Of course the
>>>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>>>> Ideally, we would have a cost model for memory behavior in distribution.  If
>>>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>>>> distribute; otherwise we pass the distribution cost (including memory
>>>>>>> cost as well as runtime alias check cost) as an argument to
>>>>>>> IFN_LOOP_DIST_ALIAS, so the vectorizer can measure the cost/benefit
>>>>>>> together with vectorization.
>>>>>>
>>>>>> Yes.  The old cost model wasn't really one; thus loop distribution was never
>>>>>> enabled by default.
>>>>>>
>>>>
>>>> Hi,
>>>> This is the updated patch.  It makes the below changes as well as renaming
>>>> ldist_alias_id to orig_loop_num.
>>>> 1) Simplifies the relation with flag_tree_loop_vectorize.  Now it only
>>>> versions the loop if a runtime alias check is needed.
>>>> 2) Records the new versioned loop as the original loop in order to simplify
>>>> the dominance-updating routine.  This also makes sense since the versioned
>>>> loop is the same as the original loop.
>>>>
>>>> No change for the ChangeLog entry.  Bootstrapped and tested.  Is it OK?
>>>>
>>>
>>> I've noticed that this patch introduces regressions on armeb:
>>> FAIL:    gcc.dg/torture/pr52028.c   -O3 -fomit-frame-pointer
>>> -funroll-loops -fpeel-loops -ftracer -finline-functions  execution
>>> test
>>> FAIL:    gcc.dg/torture/pr52028.c   -O3 -g  execution test
>>>
>>> For instance for
>>> armeb-none-linux-gnueabihf
>>> --with-cpu=cortex-a9
>>> --with-fpu=neon-fp16
>> Hi Christophe,
>> I am having difficulty reproducing this one.  According to your test
>> results at
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/report-build-info.html
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/248356/armeb-none-linux-gnueabihf/fail-gcc-rh60-armeb-none-linux-gnueabihf-arm-cortex-a9-neon-fp16.txt
>>
>> it started at least from r248356?  Also, I didn't see any difference in the
>> generated assembly between r248356/r248342 with my local toolchain.
>> Maybe that's because I used different configuration options?
>>
>
> Hi Bin,
>
> Sorry I should have shared more details.
> I noticed the regression when you committed this patch (r249994):
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/249994/report-build-info.html
>
> Checking my validation logs, it was also present
> between r248356 and r249990 (your patch 9/13, "Simply cost model merges
> partitions with the same references")
> and has been failing since r249994 (this patch).
>
> Was it "fixed" by accident by r249990 and latent until r249994 ?
Ah, thanks very much for the information.  Most likely it was latent
because of the temporarily conservative loop distribution.  I filed
PR81472 to track it.

Thanks,
bin

