[PATCH] Check type size for doloop iv on BITS_PER_WORD [PR61837]

Tue Jul 13 08:16:22 GMT 2021

On 2021-07-13 15:09, Richard Biener wrote:
> On Tue, 13 Jul 2021, guojiufu wrote:
> 
>> On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
>> > On 2021-07-12 22:46, Richard Biener wrote:
>> >> On Mon, 12 Jul 2021, guojiufu wrote:
>> >>
>> >>> On 2021-07-12 18:02, Richard Biener wrote:
>> >>> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >>> >
>> >>> >> On 2021-07-12 16:57, Richard Biener wrote:
>> >>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>> >>> >> >
>> >>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>> >>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>> >>> >> >> >
>> >>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>> >>> >> >> >> > I wonder if there's a way to query the target what modes the
>> >>> >> >> >> > doloop
>> >>> >> >> >> > pattern can handle (not being too familiar with the doloop
>> >>> >> >> >> > code).
>> >>> >> >> >>
>> >>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>> >>> >> >> >> perhaps?  Although that is a define_expand, not a define_insn, so
>> >>> >> >> >> it
>> >>> >> >> >> is
>> >>> >> >> >> hard to introspect.
>> >>> >> >> >>
>> >>> >> >> >> > Why do you need to do any checks besides the new type being
>> >>> >> >> >> > able to
>> >>> >> >> >> > represent all IV values?  The original doloop IV will never
>> >>> >> >> >> > wrap
>> >>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>> >>> >> >> >> > become
>> >>> >> >> >> > zero ... I suppose the doloop might still do the correct thing
>> >>> >> >> >> > here
>> >>> >> >> >> > but it also still will with a IV with larger type).
>> >>> >> >>
>> >>> >> >> The issue comes from U*_MAX (original short MAX), as you said: on
>> >>> >> >> which
>> >>> >> >> niter + 1 becomes zero.  And because the step for doloop is -1;
>> >>> >> >> then, on
>> >>> >> >> larger type 'zero - 1' will be a very large number on larger type
>> >>> >> >> (e.g. 0xff...ff); but on the original short type 'zero - 1' is a
>> >>> >> >> small
>> >>> >> >> value
>> >>> >> >> (e.g. "0xff").
>> >>> >> >
>> >>> >> > But for the larger type the small type MAX + 1 fits and does not
>> >>> >> > yield
>> >>> >> > zero so it should still work exactly as before, no?  Of course you
>> >>> >> > have to compute the + 1 in the larger type.
>> >>> >> >
>> >>> >> You are right, if compute the "+ 1" in the larger type it is ok, as
>> >>> >> below
>> >>> >> code:
>> >>> >> ```
>> >>> >>    /* Use type in word size may fast.  */
>> >>> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>> >>> >>       {
>> >>> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >>> >>         niter = fold_convert (ntype, niter);
>> >>> >>       }
>> >>> >>
>> >>> >>     tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>> >>> >>                              build_int_cst (ntype, 1));
>> >>> >>
>> >>> >>
>> >>> >>     add_candidate (data, base, build_int_cst (ntype, -1), true, NULL,
>> >>> >> NULL,
>> >>> >> true);
>> >>> >> ```
>> >>> >> The issue of this is, this code generates more stmt for doloop.xxx:
>> >>> >>   _12 = (unsigned int) xx(D);
>> >>> >>   _10 = _12 + 4294967295;
>> >>> >>   _24 = (long unsigned int) _10;
>> >>> >>   doloop.6_8 = _24 + 1;
>> >>> >>
>> >>> >> if use previous patch, "+ 1" on original type, then the stmts will
>> >>> >> looks
>> >>> >> like:
>> >>> >>   _12 = (unsigned int) xx(D);
>> >>> >>   doloop.6_8 = (long unsigned int) _12;
>> >>> >>
>> >>> >> This is the reason for checking
>> >>> >>    wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>> >>> >
>> >>> > But this then only works when there's an upper bound on the number
>> >>> > of iterations.  Note you should not use TYPE_MAX_VALUE here but
>> >>> > you can instead use
>> >>> >
>> >>> >      wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>> >>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>> >>>
>> >>> Ok, Thanks!
>> >>> I remember you mentioned that:
>> >>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN
>> >>> (ntype)),
>> >>> TYPE_SIGN (ntype))
>> >>> would be better than
>> >>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>> >>>
>> >>> It seems that:
>> >>> "TYPE_MAX_VALUE (ntype)" is "NUMERICAL_TYPE_CHECK
>> >>> (NODE)->type_non_common.maxval"
>> >>> which do a numerical-check and return the field of maxval.  And then call
>> >>> to
>> >>> wi::to_widest
>> >>>
>> >>> The other code "widest_int::from (wi::max_value (..,..),..)" calls
>> >>> wi::max_value
>> >>> and widest_int::from.
>> >>>
>> >>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>> >>
>> >> TYPE_MAX_VALUE can be "suprising", it does not necessarily match the
>> >> underlying modes precision.  At some point we've tried to eliminate
>> >> most of its uses, not sure what the situation/position is right now.
>> > Ok, get it, thanks.
>> > I will use "widest_int::from (wi::max_value (..,..),..)".
>> >
>> >>
>> >>> > I think the -1 above comes from number of latch iterations vs. header
>> >>> > entries - it's a common source for this kind of issues.  range analysis
>> >>> > might be able to prove that we can still merge the two adds even with
>> >>> > the intermediate extension.
>> >>> Yes, as you mentioned here, it relates to number of latch iterations
>> >>> For loop looks like : while (l < n) or for (i = 0; i < n; i++)
>> >>> This kind of loop, the niter is used to be 'n - 1' after transformed
>> >>> into 'do-while' form.
>> > For this kind of loop, the max value for the number of iteration "n - 1"
>> > would be "max_value_type(n) - 1" which is wi::ltu than max_value_type.
>> > This kind of loop is already common, and we could use wi::ltu (max,
>> > max_value_type)
>> > to check.
>> >
>> > For loop looks like:
>> >   do ;
>> >   while (n-- > 0); /* while  (n-- > low); */
>> >
>> > The niter_desc->max will wi::eq to max_value_type, and niter would be "n",
>> > and then doloop.xx is 'n+1'.
>> >
>> >>> I would see how to merge these two adds safely at this point
>> >>> when generating doloop iv. (maybe range info, thanks!
>> >>>
>> 
>> For some cases, it may not easy to merge -1 +1 pair:
>> (from doloop-1.c)
>> ```
>> int __attribute__ ((noinline))
>> test (unsigned char n)
>> {
>>   do ;
>>   while (--n > 0);
>>   return n;
>> }
>> 
>> int main ()
>> {
>>   unsigned char z = 0;
>> 
>>   return test (z);
>> }
>> ```
>> 
>> For this loop, niter_desc->max is 255, and niter is 'n - 1', then
>> doloop.xx is "'n - 1' + 1",  'n - 1' should be compute at uchar type
>> (QImode).
>> It is ok to compute '- 1' and "+ 1" under the original shorter type.
>> And it is ok if computing "n - 1" under original type and computing
>> "+ 1" under larger mode.
>> But it is not ok to compute both "- 1" and "+ 1" in large mode.
>> 
>> As discussed previously:  when 'n - 1' is U*_MAX (original short MAX),
>> '(n - 1) + 1' becomes zero.  And the step for doloop is -1; then,
>> on the larger type 'zero - 1' will be a very large number.
>> 
>> It is ok: Original
>>   unsigned char doloop.5;
>>   # doloop.5_4 = PHI <n_2(D)(2), doloop.5_6(5)>
>>   doloop.5_6 = doloop.5_4 - 1;
>> 
>> it would be error if:
>>   doloop.5_7 = (long unsigned int) n_2(D);
>>   # doloop.5_4 = PHI <doloop.5_7(2), doloop.5_6(5)>
>>   doloop.5_6 = doloop.5_4 - 1;
>> 
>> 
>> >>> >
>> >>> > Is this pre-loop extra add really offsetting the in-loop doloop
>> >>> > improvements?
>> >>> I'm not catching this question too much, sorry.  I guess your concern
>> >>> is if the "+1" is an offset: it may not, "+1" may be just that doloop.xx
>> >>> is decreasing niter until 0 (all number >0).
>> >>> If misunderstand,  thanks for point out.
>> >>
>> >> I'm questioning the argument that not being able to eliminate the +1-1
>> >> pair effects the overall cost improvement for the doloop IV type change
>> >> which hopefully is _not_ just loop IV setup cost but improving
>> >> performance in the loop body itself?
>> > Yes, I think so.  It would affect doloop IV setup cost, it may also affect
>> > loop itself on some targets, if the target prefers a changed type:
>> > using one jump-counter instruction for wider type, but cmp/jump instructions
>> > for other types.
>> > If the loop is nested in outer-loop, eliminating +1-1 would improving
>> > outer-loop performance.
> 
> Yes.  So we're asking targets whether using a different IV mode for
> the doloop is going to improve performance for the doloop related
> instructions, for other uses IVOPTs usual cost model will apply.
> Now that you only add one IV candidate for doloop it's a on/off
> decision here?
Yes, there is one doloop IV, so if there is a preferred mode on the 
target,
then try to use the mode for the doloop IV.
I was once thinking of adding more doloop IVs and use the cost model to
select the best one.  While it seems not necessary: it may not get a 
better
decision than update the doloop IV mode.

> 
> Still the setup cost should still make the inner loop faster, you
> can keep the < TYPE_MAX check and perform the -1 adjustment in the
> original type but I think it's not a reason to not perform the
> doloop IV promotion if we cannot do that.

Ok, during refine patch, I would add code to perform:
-1 in the original type and +1 in promoted type if "-1"/"+1" pair is
not suitable to compute together in the preferred mode.

Thanks for your suggestions!

BR,
Jiufu

> 
>> There is a patch that could mitigate "-1 +1 pair" in rtl part.
>> https://gcc.gnu.org/g:8a15faa730f99100f6f3ed12663563356ec5a2c0
>> 
>> With that patch, one minor thing is doloop.xxx is using shorter type 
>> on some
>> cases, then need subreg to access it.  I have this patch is trying to 
>> use
>> better
>> type and avoid to use subreg.
> 
> Yes, that's understood.
> 
> Richard.
> 
>> BR,
>> 
>> Jiufu Guo.
>> 
>> >
>> > BR,
>> > Jiufu Guo.
>> >
>> >>
>> >> Richard.
>> >>
>> >>> >
>> >>> >> >> >>
>> >>> >> >> >> doloop_valid_p guarantees it is simple and doesn't wrap.
>> >>> >> >> >>
>> >>> >> >> >> > I'd have expected sth like
>> >>> >> >> >> >
>> >>> >> >> >> >    ntype = lang_hooks.types.type_for_mode (word_mode,
>> >>> >> >> >> > TYPE_UNSIGNED
>> >>> >> >> >> > (ntype));
>> >>> >> >> >> >
>> >>> >> >> >> > thus the decision made using a mode - which is also why I
>> >>> >> >> >> > wonder
>> >>> >> >> >> > if there's a way to query the target for this.  As you say,
>> >>> >> >> >> > it _may_ be fast, so better check (somehow).
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> I was also thinking of using hooks like type_for_size/type_for_mode.
>> >>> >> >>     /* Use type in word size may fast.  */
>> >>> >> >>     if (TYPE_PRECISION (ntype) < BITS_PER_WORD
>> >>> >> >>         && Wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE
>> >>> >> >> (ntype))))
>> >>> >> >>       {
>> >>> >> >>         ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>> >>> >> >>         base = fold_convert (ntype, base);
>> >>> >> >>       }
>> >>> >> >>
>> >>> >> >> As you pointed out, this does not query the mode from targets.
>> >>> >> >> As Segher pointed out "doloop_end" checks unsupported mode, while it
>> >>> >> >> seems
>> >>> >> >> not easy to use it in tree-ssa-loop-ivopts.c.
>> >>> >> >> For implementations of doloop_end, tartgets like rs6000/aarch64/ia64
>> >>> >> >> requires
>> >>> >> >> Pmode/DImode; while there are other targets that work on other
>> >>> >> >> 'mode'
>> >>> >> >> (e.g.
>> >>> >> >> SI).
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> In doloop_optimize, there is code:
>> >>> >> >>
>> >>> >> >> ```
>> >>> >> >>     mode = desc->mode;
>> >>> >> >> .....
>> >>> >> >>     doloop_reg = gen_reg_rtx (mode);
>> >>> >> >>     rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >>> >> >> start_label);
>> >>> >> >>
>> >>> >> >>     word_mode_size = GET_MODE_PRECISION (word_mode);
>> >>> >> >>     word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1)
>> >>> >> >>     -
>> >>> >> >> 1;
>> >>> >> >>     if (! doloop_seq
>> >>> >> >>         && mode != word_mode
>> >>> >> >>         /* Before trying mode different from the one in that # of
>> >>> >> >> iterations is
>> >>> >> >>            computed, we must be sure that the number of iterations
>> >>> >> >> fits
>> >>> >> >> into
>> >>> >> >>            the new mode.  */
>> >>> >> >>         && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >>> >> >>             || wi::leu_p (iterations_max, word_mode_max)))
>> >>> >> >>       {
>> >>> >> >>         if (word_mode_size > GET_MODE_PRECISION (mode))
>> >>> >> >>           count = simplify_gen_unary (ZERO_EXTEND, word_mode, count,
>> >>> >> >>         mode);
>> >>> >> >>         else
>> >>> >> >>           count = lowpart_subreg (word_mode, count, mode);
>> >>> >> >>         PUT_MODE (doloop_reg, word_mode);
>> >>> >> >>         doloop_seq = targetm.gen_doloop_end (doloop_reg,
>> >>> >> >>         start_label);
>> >>> >> >>       }
>> >>> >> >>     if (! doloop_seq)
>> >>> >> >>       {
>> >>> >> >>         if (dump_file)
>> >>> >> >>           fprintf (dump_file,
>> >>> >> >>                    "Doloop: Target unwilling to use doloop
>> >>> >> >>         pattern!\n");
>> >>> >> >>         return false;
>> >>> >> >>       }
>> >>> >> >> ```
>> >>> >> >> The above code first tries the mode of niter_desc by call
>> >>> >> >> targetm.gen_doloop_end
>> >>> >> >> to see if the target can generate doloop insns, if fail, then try to
>> >>> >> >> use
>> >>> >> >> 'word_mode' against gen_doloop_end.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> >>
>> >>> >> >> >> Almost all targets just use Pmode, but there is no such guarantee
>> >>> >> >> >> I
>> >>> >> >> >> think, and esp. some targets that do not have machine insns for
>> >>> >> >> >> this
>> >>> >> >> >> (but want to generate different code for this anyway) can do
>> >>> >> >> >> pretty
>> >>> >> >> >> much
>> >>> >> >> >> anything.
>> >>> >> >> >>
>> >>> >> >> >> Maybe using just Pmode here is good enough though?
>> >>> >> >> >
>> >>> >> >> > I think Pmode is a particularly bad choice and I'd prefer
>> >>> >> >> > word_mode
>> >>> >> >> > if we go for any hardcoded mode.  s390x for example seems to
>> >>> >> >> > handle
>> >>> >> >> > both SImode and DImode (but names the helper gen_doloop_si64
>> >>> >> >> > for SImode?!).  But indeed it looks like somehow querying
>> >>> >> >> > doloop_end
>> >>> >> >> > is going to be difficult since the expander doesn't have any mode,
>> >>> >> >> > so we'd have to actually try emit RTL here.
>> >>> >> >>
>> >>> >> >> Instead of using hardcode mode, maybe we could add a hook for
>> >>> >> >> targets to
>> >>> >> >> return
>> >>> >> >> the preferred mode.
>> >>> >> >
>> >>> >> > That's a possiblity of course.  Like the following which just shows
>> >>> >> > the
>> >>> >> > default implementation then (pass in current mode, return a more
>> >>> >> > preferred
>> >>> >> > mode or the mode itself)
>> >>> >> >
>> >>> >> > enum machine_mode
>> >>> >> > prefered_doloop_mode (enum machine_mode mode)
>> >>> >> > {
>> >>> >> >   return mode;
>> >>> >> > }
>> >>> >> >
>> >>> >> Yes, thanks!
>> >>> >>
>> >>> >> Checking current do_loop_end in targets, in general, when do_loop_end
>> >>> >> requires
>> >>> >> SI mode, the target is defining Pmode as SImode and word_mode (from
>> >>> >> BITS_PER_WORD
>> >>> >> which defaults from UNITS_PER_WORD) is also defined to align with SI
>> >>> >> mode.
>> >>> >> When do_loop_end requires DI mode, the target is defining Pmode as
>> >>> >> DImode
>> >>> >> and word_mode/UNITS_PER_WORD is also defined to align with DI mode.
>> >>> >>
>> >>> >> So, if aggressively, then by default we may just return word_mode.
>> >>> >
>> >>> > Note we still have to check whether the prefered mode is valid
>> >>> > (passing in TImode but then returning DImode would be wrong).
>> >>> Yes, some code like in doloop_optimize (rtl code)
>> >>> ```
>> >>> mode != word_mode
>> >>> && (word_mode_size >= GET_MODE_PRECISION (mode)
>> >>>     || wi::leu_p (iterations_max, word_mode_max))
>> >>> ```
>> >>>
>> >>> Thanks again for your comments!
>> >>>
>> >>> BR,
>> >>> Jiufu Guo
>> >>> >
>> >>> > Richard.
>> >>> >
>> >>> >> BR,
>> >>> >>
>> >>> >> Jiufu Guo.
>> >>> >>
>> >>> >>
>> >>> >> >>
>> >>> >> >> Thanks for those valuable comments!
>> >>> >> >>
>> >>> >> >> Jiufu Guo
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> >
>> >>> >> >> > Richard.
>> >>> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>>