This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
- From: "Bin.Cheng" <amker dot cheng at gmail dot com>
- To: Christophe Lyon <christophe dot lyon at linaro dot org>
- Cc: Richard Biener <richard dot guenther at gmail dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Mon, 10 Jul 2017 10:39:20 +0100
- Subject: Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
- Authentication-results: sourceware.org; auth=none
- References: <VI1PR0802MB2176F36122DDF66429B0FB3AE7CD0@VI1PR0802MB2176.eurprd08.prod.outlook.com> <CAHFci29yG3+oHwefL44Dx6whLCkYn=yCp-Eg2nMKjSv+ETs0Xg@mail.gmail.com> <CAHFci2_AMQFc857ai1KnabtO4iJcHjksx4ZXATa+vhLOgfj+wA@mail.gmail.com> <CAFiYyc2vH25ZhrWFuJyvqwqHda8SggCP2m66YG8Lc8+43vXDBw@mail.gmail.com> <CAHFci2_TzjRQFecSkckJEX0cBAoNwSpsas=n3oUrqdFujpmU5w@mail.gmail.com> <CAFiYyc3VvAahs1KtNRPiTFAd=cCoy9fvUbVWYkkwErFxXnByMQ@mail.gmail.com> <CAHFci29NWPwbEd_s9+E1fHRuG2kHVSjMcoYwerOk0KarbwdKpg@mail.gmail.com> <CAFiYyc1Beu=yCvpOdgRfvH6tO=42Puh0eXZJBR0hwak7fW-8FA@mail.gmail.com> <CAHFci2-Tbnku=UNMOQtPmyGBXwekLfYcTexvDX5XRv7GMuy4mQ@mail.gmail.com> <CAHFci28gbSZYSiAXCfRbTi=m7wYB9syWc5LEKF5ssAP4eL48NA@mail.gmail.com> <CAKdteObgf-6OGGwELRTfn-5LwF2NZtnwKDXpgzXYca0mmaCTLw@mail.gmail.com>
On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> Hi Bin,
>
> On 30 June 2017 at 12:43, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>> Rebased V3 for changes in previous patches. Bootstap and test on
>>>>>>>>> x86_64 and aarch64.
>>>>>>>>
>>>>>>>> why is ldist-12.c no longer distributed? your comment says it doesn't expose
>>>>>>>> more "parallelism" but the point is to reduce memory bandwith requirements
>>>>>>>> which it clearly does.
>>>>>>>>
>>>>>>>> Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>
>>>>>>>> Can you elaborate on that. Now onto the patch:
>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>> major criteria for distribution. BTW, I think a good memory stream
>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>> etc., appropriate for distribution.
>>>>>>
>>>>>> True. But what means "parallel" here? ldist-13.c if partitioned into two loops
>>>>>> can be executed "in parallel"
>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>> paralleled), we tend to no distribute it into small ones. But there
>>>>> is one exception here, if the distributed small loops are recognized
>>>>> as builtin functions, we still distribute it. I assume it's generally
>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>
>>>> Yes.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> + Loop distribution is the dual of loop fusion. It separates statements
>>>>>>>> + of a loop (or loop nest) into multiple loops (or loop nests) with the
>>>>>>>> + same loop header. The major goal is to separate statements which may
>>>>>>>> + be vectorized from those that can't. This pass implements distribution
>>>>>>>> + in the following steps:
>>>>>>>>
>>>>>>>> misses the goal of being a memory stream optimization, not only a vectorization
>>>>>>>> enabler. distributing a loop can also reduce register pressure.
>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>> now.
>>>>>>
>>>>>> Yes, I agree -- originally it was written to optimize the stream benchmark IIRC.
>>>>> Let's see if any performance drop will be reported against this patch.
>>>>> Let's see if we can create a cost model for it.
>>>>
>>>> Fine.
>>> I will run some benchmarks to see if there is breakage.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> You introduce ldist_alias_id in struct loop (probably in 01/n which I
>>>>>>>> didn't look
>>>>>>>> into yet). If you don't use that please introduce it separately.
>>>>>>> Hmm, yes it is introduced in patch [01/n] and set in this patch.
>>>>>>>
>>>>>>>>
>>>>>>>> + /* Be conservative. If data references are not well analyzed,
>>>>>>>> + or the two data references have the same base address and
>>>>>>>> + offset, add dependence and consider it alias to each other.
>>>>>>>> + In other words, the dependence can not be resolved by
>>>>>>>> + runtime alias check. */
>>>>>>>> + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
>>>>>>>> + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
>>>>>>>> + || !DR_INIT (dr1) || !DR_INIT (dr2)
>>>>>>>> + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
>>>>>>>> + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
>>>>>>>> + || res == 0)
>>>>>>>>
>>>>>>>> ISTR a helper that computes whether we can handle a runtime alias check for
>>>>>>>> a specific case?
>>>>>>> I guess you mean runtime_alias_check_p that I factored out previously?
>>>>>>> Unfortunately, it's factored out vectorizer's usage and doesn't fit
>>>>>>> here straightforwardly. Shall I try to further generalize the
>>>>>>> interface as independence patch to this one?
>>>>>>
>>>>>> That would be nice.
>>>>>>
>>>>>>>>
>>>>>>>> + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
>>>>>>>> + if (flag_tree_loop_vectorize)
>>>>>>>> + {
>>>>>>>>
>>>>>>>> so at this point I'd condition the whole runtime-alias check generating
>>>>>>>> on flag_tree_loop_vectorize. You seem to support versioning w/o
>>>>>>>> that here but in other places disable versioning w/o flag_tree_loop_vectorize.
>>>>>>>> That looks somewhat inconsistent...
>>>>>>> It is a bit complicated. In function version_for_distribution_p, we have
>>>>>>> +
>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>> + if (alias_ddrs->length () > 0)
>>>>>>> + return true;
>>>>>>> +
>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>> + return false;
>>>>>>> +
>>>>>>>
>>>>>>> It means we also versioning loops even if runtime alias check is
>>>>>>> unnecessary. I did this because we lack cost model and current
>>>>>>> distribution may result in too many distribution? If that's the case,
>>>>>>> at least vectorizer will remove distributed version loop and fall back
>>>>>>> to the original one. Hmm, shall I change it into below code:
>>>>>>
>>>>>> Hmm, ok - that really ties things to vectorization apart from the
>>>>>> builtin recognition. So what happens if the vectorizer can vectorize
>>>>>> both the distributed and non-distributed loops?
>>>>> Hmm, there is no such cases. Under condition no builtins is
>>>>> recognized, we wouldn't distribute loop if by itself can be
>>>>> vectorized. Does this make sense? Of course, cost model for memory
>>>>> behavior can change this behavior and is wanted.
>>>>
>>>> So which cases _do_ we split loops then? "more parallelism" -- but what
>>>> does that mean exactly? Is there any testcase that shows the desired
>>>> splits for vectorization?
>>> At least one of distributed loop can be executed paralleled while the
>>> original loop can't.
>>> Not many, but ldist-26.c is added by one of patch.
>>>>
>>>>>>
>>>>>>> +
>>>>>>> + /* Need to version loop if runtime alias check is necessary. */
>>>>>>> + if (alias_ddrs->length () == 0)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> + /* Don't version the loop with call to IFN_LOOP_DIST_ALIAS if
>>>>>>> + vectorizer is not enable because no other pass can fold it. */
>>>>>>> + if (!flag_tree_loop_vectorize)
>>>>>>> + return false;
>>>>>>> +
>>>>>>>
>>>>>>> Then I can remove the check you mentioned in function
>>>>>>> version_loop_by_alias_check?
>>>>>>
>>>>>> Yeah, I guess that would be easier to understand. Need to update
>>>>>> the comment above the alias_ddrs check though.
>>>>> Yes the logic here is complicated. On one hand, I want to be
>>>>> conservative by versioning with IFN_LOOP_DIST_ALIAS so that vectorizer
>>>>> can undo all "unnecessary" distribution before memory behavior is
>>>>> modeled.
>>>>>
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> + /* Don't version loop if any partition is builtin. */
>>>>>>>> + for (i = 0; partitions->iterate (i, &partition); ++i)
>>>>>>>> + {
>>>>>>>> + if (partition->kind != PKIND_NORMAL)
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>>
>>>>>>>> why's that? Do you handle the case where only a subset of partitions
>>>>>>> One reason is I generally consider distributed builtin functions as a
>>>>>>> win, thus distribution won't be canceled later in vectorizer. Another
>>>>>>> reason is if all distributed loops are recognized as builtins, we
>>>>>>> can't really version with current logic because the
>>>>>>> IFN_LOOP_DIST_ALIAS won't be folded in vectorizer.
>>>>>>
>>>>>> Ah, ok. I guess the maze was too twisted for me to see what
>>>>>> version_for_distribution_p
>>>>>> does ;)
>>>>>>
>>>>>>>> require a runtime alias check to be generated? Thus from a loop
>>>>>>>> generate three loops, only two of them being versioned? The
>>>>>>>> complication would be that both the runtime alias and non-alias
>>>>>>>> versions would be "transformed". Or do we rely on recursively
>>>>>>>> distributing the distribution result, thus if we have partitions that
>>>>>>>> can be handled w/o runtime alias check fuse the remaining partitions
>>>>>>>> and recurse on those?
>>>>>>> No, this is not precisely handled now, the pass will version the whole
>>>>>>> loop once. Though I think it's not very difficult to do two stages
>>>>>>> distribution, I am not sure how useful it is.
>>>>>>
>>>>>> Not sure either.
>>>>>>
>>>>>> So to understand you're looking at loop-distribution as vectorization enabler
>>>>>> and pattern detector. I think that is reasonable without a better cost model.
>>>>>>
>>>>>> Note that I think the state before your patches had the sensible cost-model
>>>>>> that it fused very conservatively and just produced the maximum distribution
>>>>>> with the idea that the looping overhead itself is cheap. Note that with
>>>>>> a more "maximum" distribution the vectorizer also gets the chance to
>>>>>> do "partial vectorization" in case profitability is different. Of course the
>>>>>> setup cost may offset that in the case all loops end up vectorized...
>>>>> Ideally, we have cost model for memory behavior in distribution. If
>>>>> we know distribution is beneficial in loop distribution, we can simply
>>>>> distribute it; otherwise we pass distribution cost (including memory
>>>>> cost as well as runtime alias check cost as an argument to
>>>>> IFN_LOOP_DIST_ALIAS), thus vectorizer can measure the cost/benefit
>>>>> together with vectorization.
>>>>
>>>> Yes. The old cost model wasn't really one thus loop distribution was never
>>>> enabled by default.
>>>>
>>
>> Hi,
>> This is updated patch. It makes below changes as well as renaming
>> ldist_alias_id to orig_loop_num.
>> 1) Simplifies relation with flag_tree_loop_vectorization. Now it only
>> versions loop if runtime alias check is needed.
>> 2) Record the new versioned loop as original loop in order to simplify
>> dominance working routine. It also makes sense since versioned loop
>> is the one same with the original loop.
>>
>> No change for ChangeLog entry. Bootstrap and test. Is it OK?
>>
>
> I've noticed that this patch introduces regressions on armeb:
> FAIL: gcc.dg/torture/pr52028.c -O3 -fomit-frame-pointer
> -funroll-loops -fpeel-loops -ftracer -finline-functions execution
> test
> FAIL: gcc.dg/torture/pr52028.c -O3 -g execution test
Thanks for reporting this, I will create a PR and investigate.
Thanks,
bin
>
> For instance for
> armeb-none-linux-gnueabihf
> --with-cpu=cortex-a9
> --with-fpu=neon-fp16
>
> Christophe
>
>> Thanks,
>> bin