This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: PR81635: Use chrecs to help find related data refs
On Thu, Aug 17, 2017 at 12:35 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> "Bin.Cheng" <amker.cheng@gmail.com> writes:
>> On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> "Bin.Cheng" <amker.cheng@gmail.com> writes:
>>>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford
>>>> <richard.sandiford@linaro.org> wrote:
>>>>> "Bin.Cheng" <amker.cheng@gmail.com> writes:
>>>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
>>>>>> <richard.sandiford@linaro.org> wrote:
>>>>>>> The first loop in the testcase regressed after my recent changes to
>>>>>>> dr_analyze_innermost. Previously we would treat "i" as an iv even
>>>>>>> for bb analysis and end up with:
>>>>>>>
>>>>>>> DR_BASE_ADDRESS: p or q
>>>>>>> DR_OFFSET: 0
>>>>>>> DR_INIT: 0 or 4
>>>>>>> DR_STEP: 16
>>>>>>>
>>>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have:
>>>>>>>
>>>>>>> DR_BASE_ADDRESS: p or q
>>>>>>> DR_OFFSET: (intptr_t) i
>>>>>>> DR_INIT: 0 or 4
>>>>>>> DR_STEP: 0
>>>>>>>
>>>>>>> This is also what we'd like to have for the unsigned "i", but the
>>>>>>> problem is that strip_constant_offset thinks that the "i + 1" in
>>>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
>>>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
>>>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
>>>>>>> longer seem to be related to the [i] ones.
>>>>>>
>>>>>> I didn't read the change in detail, so sorry if I misunderstood the
>>>>>> issue. I made changes in scev to better fold type conversions using
>>>>>> various sources of information, for example VRP, niters, undefined
>>>>>> overflow behavior, etc. In theory this information should be available
>>>>>> to other optimizers without querying scev. For the mentioned test, VRP
>>>>>> should compute an accurate range for "i" so that we can fold
>>>>>> (intptr_t) (i + 1) without worrying about overflow. Note we don't do
>>>>>> this in generic folding because (intptr_t) (i) + 1 could be more
>>>>>> expensive (especially in the case of (T)(i + j)), or because the CST
>>>>>> part is in bigger precision after the conversion.
>>>>>> But such folding is wanted in several places, e.g. IVOPTs. To provide
>>>>>> such an interface, we changed tree-affine and made it fold
>>>>>> aggressively. I am curious whether it's possible to use aff_tree to
>>>>>> implement strip_constant_offset here, since aggressive folding is
>>>>>> wanted. After all, using an additional chrec looks a little
>>>>>> heavyweight for the simple test.
>>>>>
>>>>> Yeah, using aff_tree does work here when the bounds are constant.
>>>>> It doesn't look like it works for things like:
>>>>>
>>>>> double p[1000];
>>>>> double q[1000];
>>>>>
>>>>> void
>>>>> f4 (unsigned int n)
>>>>> {
>>>>>   for (unsigned int i = 0; i < n; i += 4)
>>>>>     {
>>>>>       double a = q[i] + p[i];
>>>>>       double b = q[i + 1] + p[i + 1];
>>>>>       q[i] = a;
>>>>>       q[i + 1] = b;
>>>>>     }
>>>>> }
>>>>>
>>>>> though, where the bounds on the global arrays guarantee that [i + 1] can't
>>>>> overflow, even though "n" is unconstrained. The patch as posted handles
>>>>> this case too.
>>>> BTW, is this a missed optimization in value range analysis? The range
>>>> information for i should flow like: array boundary -> niters ->
>>>> scev/vrp. I think that's what niters/scev do in their analysis.
>>>
>>> Yeah, maybe :-) It looks like the problem is that when SLP runs,
>>> the previous VRP pass came before loop header copying, so the (single)
>>> header has to cope with the n == 0 case. Thus we get:
>> Ah, there are several passes between vrp and pass_ch; I'm not sure
>> whether any of them depends on vrp intensively. I would suggest
>> reordering the two passes, or adding a standalone VRP interface that
>> updates the information for the loop region after the header is
>> copied. This is a non-trivial issue that needs to be fixed. The
>> niters analyzer relies heavily on simplify_using_initial_conditions
>> to get the same information, which in my opinion should be provided
>> by VRP. Though this won't be able to obsolete
>> simplify_using_initial_conditions, because VRP is weak in symbolic
>> ranges...
>>
>>>
>>> Visiting statement:
>>> i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>;
>>> Intersecting
>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements)
>>> and
>>> [0, 0]
>>> to
>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements)
>>> Intersecting
>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements)
>>> and
>>> [0, 1000]
>>> to
>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements)
>>> Found new range for i_15: [0, 0]
>>>
>>> Visiting statement:
>>> _3 = i_15 + 1;
>>> Match-and-simplified i_15 + 1 to 1
>>> Intersecting
>>> [1, 1]
>>> and
>>> [0, +INF]
>>> to
>>> [1, 1]
>>> Found new range for _3: [1, 1]
>>>
>>> (where _3 is the index we care about), followed by:
>>>
>>> Visiting statement:
>>> i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>;
>>> Intersecting
>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements)
>>> and
>>> [0, 4]
>>> to
>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements)
>>> Intersecting
>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements)
>>> and
>>> [0, 1000]
>>> to
>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements)
>>> Found new range for i_15: [0, n_9(D) + 4294967295]
>>>
>>> Visiting statement:
>>> _3 = i_15 + 1;
>>> Intersecting
>>> [0, +INF]
>>> and
>>> [0, +INF]
>>> to
>>> [0, +INF]
>>> Found new range for _3: [0, +INF]
>>>
>>> I guess in this case it would be better to intersect the i_15 ranges
>>> to [0, 1000] rather than [0, n_9(D) + 4294967295].
>>>
>>> It does work if another VRP pass runs after CH. But even then,
>>> is it a good idea to rely on the range info being kept up-to-date
>>> all the way through to SLP? A lot happens in between.
>> To some extent, yes. Now I understand that SCEV uses niters
>> information to prove no_overflow. Niters analysis does infer better
>> information from array boundaries, while VRP fails to do that. I
>> don't worry much about the gap between the vrp pass and slp; it's
>> basically the same as with niters. Both pieces of information are
>> analyzed at one point and meant to be used by following passes.
>> Also, isn't it uncommon for vrp information to become invalid, given
>> that we are in SSA form?
>
> I'm not worried so much about vrp information becoming invalid on
> an SSA name that existed when VRP was run. It's more a question
> of what happens about SSA names that get introduced after VRP,
> e.g. by things like dom, reassoc, PRE, etc.
For the induction variables in question, these passes shouldn't
aggressively introduce new variables, I think.
>
>> Now that data address is not analyzed against loop, VRP would be the
>> only information we can use unless adding back scev analysis. IMHO,
>> the patch is doing so in another way than before. It requires
>> additional chrec data structure. I remember the previous patch
>> enables more slp vectorization, is it because of "step" information
>> from scev?
>
> Do you mean that:
>
> 2017-07-03 Richard Sandiford <richard.sandiford@linaro.org>
>
> * tree-data-ref.c (dr_analyze_innermost): Replace the "nest"
> parameter with a "loop" parameter and use it instead of the
> loop containing DR_STMT. Don't check simple_iv when doing
> BB analysis. Describe the two analysis modes in the comment.
>
> enabled more SLP vectorisation in bb-slp-pr65935.c? That was due
> to us not doing IV analysis for BB vectorisation, and ensuring that
> the step was always zero.
Which means the vectorizer code can handle an offset that hasn't been
IV-analyzed, but can't handle the analyzed form?
>
>> In this patch, step information is simply discarded. I am
>> wondering if possible to always analyze scev within innermost loop for
>> slp while discards step information.
>
> Well, we don't calculate a step for bb analysis (i.e. it's not the case
> that the patch calculates step information and then simply discards it).
> I don't see how that would work. For bb analysis, the DR_OFFSET + DR_INIT
> has to give the offset for every execution of the block, not just the
> first iteration of the containing loop. So if we get back a nonzero
> step, we have to do something with it.
Yeah.
>
> But:
>
> (a) the old simple_iv analysis is more expensive than simply calling
> analyze_scev, so I don't think this is a win in terms of complexity.
>
> (b) for bb analysis, there's nothing particularly special about the
> innermost loop. It makes more sense to analyse it in the innermost
> loop for which the offset is invariant, as shown by the second
> testcase in the patch.
>
> (c) The patch helps with loop vectorisation too, since analysing the
> starting DR_OFFSET in the context of the containing loop can help
> in a similar way as analysing the full offset does for SLP.
I have to admit I am not very keen on this method. It complicates the
structure as well as the code, mostly because dr_init is now split into
two different fields, one of which is lazily computed.
For example:
> @@ -2974,12 +2974,12 @@ vect_vfa_segment_size (struct data_refer
> vect_no_alias_p (struct data_reference *a, struct data_reference *b,
> tree segment_length_a, tree segment_length_b)
> {
> - gcc_assert (TREE_CODE (DR_INIT (a)) == INTEGER_CST
> - && TREE_CODE (DR_INIT (b)) == INTEGER_CST);
> - if (tree_int_cst_equal (DR_INIT (a), DR_INIT (b)))
> + gcc_assert (TREE_CODE (DR_CHREC_INIT (a)) == INTEGER_CST
> + && TREE_CODE (DR_CHREC_INIT (b)) == INTEGER_CST);
> + if (tree_int_cst_equal (DR_CHREC_INIT (a), DR_CHREC_INIT (b)))
> return false;
>
> - tree seg_a_min = DR_INIT (a);
> + tree seg_a_min = DR_CHREC_INIT (a);
> tree seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_min),
> seg_a_min, segment_length_a);
> /* For negative step, we need to adjust address range by TYPE_SIZE_UNIT
> @@ -2990,10 +2990,10 @@ vect_no_alias_p (struct data_reference *
> tree unit_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (a)));
> seg_a_min = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_max),
> seg_a_max, unit_size);
> - seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_INIT (a)),
> - DR_INIT (a), unit_size);
> + seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_CHREC_INIT (a)),
> + DR_CHREC_INIT (a), unit_size);
> }
> - tree seg_b_min = DR_INIT (b);
> + tree seg_b_min = DR_CHREC_INIT (b);
> tree seg_b_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_b_min),
> seg_b_min, segment_length_b);
> if (tree_int_cst_compare (DR_STEP (b), size_zero_node) < 0)
Uses of DR_INIT are simply replaced by DR_CHREC_INIT. Is it safe to do
so in the case of a non-zero DR_INIT? It worries me that I may need to
think twice before referring to DR_INIT, because it's not clear when
DR_OFFSET is split and DR_CHREC_INIT becomes non-zero. That may simply
be because I am too dumb to handle this. I will leave this to richi.
Thanks,
bin
>
> Thanks,
> Richard
>
>>
>> Thanks,
>> bin
>>>
>>> FWIW, the old simple_iv check that I removed for bb data-ref analysis
>>> relies on SCEV analysis too, so I don't think this is more expensive
>>> than what we had before.
>>>
>>> Thanks,
>>> Richard