[RFC] avoid type conversion through loop versioning
guojiufu
guojiufu@linux.ibm.com
Fri Mar 26 08:19:38 GMT 2021
On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> >> > wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> >> also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> such cases, similar to how niter analysis computes its 'assumptions'
>> >> >> >> >> condition. That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any. Ideally
>> >> >> >> >> such
>> >> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> >> transform.
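
(As an aside, a made-up C example of what such a niter 'assumption' could
look like, purely as an illustration and not GCC internals: here the trip
count n / 2 is only valid under the runtime assumption that n is even.)

  /* Illustration only: the trip count n / 2 needs the runtime assumption
     (n % 2) == 0; with an odd n the exit test i != n is never true and
     the loop does not terminate.  */
  int sum_every_other (int *a, unsigned n)
  {
    int s = 0;
    for (unsigned i = 0; i != n; i += 2)
      s += a[i];
    return s;
  }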
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization
>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> >> > Find out what type sizes the loop uses, first try to demote
>> >> >> >> > computations
>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>> >> >> >> > something
>> >> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> >> > narrower
>> >> >> >> > type), then pick up the widest type size still in use in the loop
>> >> >> >> > (ok,
>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> >> > currently
>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>> >> >> >> > that
>> >> >> >> > could
>> >> >> >> > be promoted to that type size. We do partially something like that
>> >> >> >> > during
>> >> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >> >
>> >> >> >> > Jakub
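
(A made-up example of such a demotion candidate, with hypothetical names:
the multiply is done in int only to have the result truncated back to
short at the store, so the whole computation could be carried out in short
and use twice as many vector lanes.)

  /* Hypothetical illustration of a demotion opportunity: src[i] is
     promoted to int for the multiply, but the result is immediately
     demoted to short, so the computation could be narrowed to short.  */
  void scale_by_3 (short *dst, const short *src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = (short) (src[i] * 3);
  }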
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> >> in
>> >> >> >> some cases.
>> >> >> >> While one of the direct ideas of using the '64bit type' is to
>> >> >> >> eliminate
>> >> >> >> conversions,
>> >> >> >> even for some cases which are not easy to optimize through
>> >> >> >> ifconv/vectorization,
>> >> >> >> for example:
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (a[i]>1e-3)
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> Or should we only do versioning on the type for this kind of loop? Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront which means we'll pay quite a big code size cost
>> >> >> > for unknown questionable gain. What's the particular optimization
>> >> >>
>> >> >> Right. Code size increase is a big pain for large loops. If the gain
>> >> >> is not significant, this optimization may not be profitable.
>> >> >>
>> >> >> > in the above cases? Note that for example for
>> >> >> >
>> >> >> > unsigned int i = 0;
>> >> >> > while (a[i]>1e-3)
>> >> >> > i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> >> > There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", so the upper bound
>> >> >> is checkable. Otherwise, the optimization to avoid the zext is not applicable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T) which is done in a
>> >> >> > larger
>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> >> > operation
>> >> >> > that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate,
>> >> >> which is the direct goal of the optimization.
>> >> >>
>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>> >> >> code size increase is small enough, is a question that needs a
>> >> >> trade-off.
>> >> >> It may only be acceptable if the loop is very small; then eliminating the
>> >> >> 'zext' would help save runtime, and the code size increase may not be big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off. The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agree, I understand your point. The concern is that some
>> >> micro-architectures do not handle this very well yet. I tested the above
>> >> example code:
>> >> unsigned i = 0;
>> >> while (a[i] > 1e-3 && i < n)
>> >> i++;
>> >> there is a ~30% performance improvement when using "long i" instead of
>> >> "unsigned i" on ppc64le and x86. It seems those instructions are not
>> >> optimized well on some platforms. So, I'm wondering whether we need to
>> >> do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> > unsigned i = 0;
>> > while (a[i] > 1e-3 && i < n)
>> > i++;
>> > return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>> > produces
>> >
>> > .L5:
>> > lfd %f0,0(%r9)
>> > addi %r3,%r3,1
>> > addi %r9,%r9,8
>> > rldicl %r3,%r3,0,32
>> > fcmpu %cr0,%f0,%f12
>> > bnglr %cr0
>> > bdnz .L5
>> >
>> > which looks pretty good to me, I suppose the rldicl is the
>> > zero-extension but the IVs are already 64bit and the
>> > zero-extension should be sunk to the loop exit instead.
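
(Roughly what sinking the zero-extension to the loop exit amounts to at the
source level, assuming an LP64 target where 'long' is 64 bits; foo_sunk is
just an illustrative name.)

  /* Illustrative sketch: i starts at 0 and is bounded by n, so it never
     wraps and can be kept in a 64-bit variable inside the loop.  */
  unsigned foo_sunk (double *a, unsigned n)
  {
    long ln = n;           /* n is widened once, before the loop */
    long i = 0;
    while (a[i] > 1e-3 && i < ln)
      i++;
    return (unsigned) i;   /* single truncation at the loop exit */
  }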
>>
>> Thanks so much for your quick reply!
>>
>> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
>> which also improves performance.
>>
>> :) my test code is a little different: the argument n's type is long
>> (sorry).
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>> unsigned i = 0;
>> while (a[i] > 1e-3 && i < n)
>> i++;
>> return i;
>> }
>
> You can't elide the zero-extend here since n can be bigger
> than UINT_MAX and thus i has to wrap.
>
>> 111: L111:
>> 39: pc={(%7:CC>=0)?simple_return:pc}
>> 102: %3:SI=%3:SI+0x1
>> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
>> 104: %3:DI=zero_extend(%3:SI)
>> 105: %7:CC=cmp(%3:DI,%4:DI)
>> 106: %0:DF=[%10:DI+%9:DI]
>> 107: %0:CCFP=cmp(%0:DF,%12:DF)
>> 109: pc={(%0:CCFP>0)?L111:pc}
>>
>> .L14:
>> bgelr 7
>> addi 3,3,1
>> rldic 9,3,3,29
>> rldicl 3,3,0,32
>> cmpd 7,3,4
>> lfdx 0,10,9
>> fcmpu 0,0,12
>> bgt 0,.L14
>>
>> -------------- or for code:
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
>> unsigned i)
>> {
>> while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is
>> sinkable
>> i++;
>> return i;
>> }
>
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
>
>> 99: L99:
>> 69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>> 36: L36:
>> 23: %5:SI=%5:SI+0x1
>> 25: %9:DI=%5:DI<<0x3&0x7fffffff8
>> 24: %5:DI=zero_extend(%5:SI)
>> 29: %0:DF=[%3:DI+%9:DI]
>> 30: %0:CCFP=cmp(%0:DF,%12:DF)
>> 31: pc={(%0:CCFP>0)?L99:pc}
>>
>>
>> .L12:
>> bdz .L2
>> .L5:
>> addi 5,5,1
>> rldic 9,5,3,29
>> rldicl 5,5,0,32
>> lfdx 0,3,9
>> fcmpu 0,0,12
>> bgt 0,.L12
>> .L2:
>>
>> Any suggestions about how to avoid the 'zext' in the above cases? My
>> previous idea
>> was versioning :).
>
> I don't think there's another good way. Not sure if I'd call it
> versioning, I suppose for this kind of possibly wrapping IV it
> would be iteration space splitting? So maybe it could be integrated
> into the loop split pass somehow? Let's look at
>
> unsigned foo (double *a, unsigned n, unsigned i)
> {
> while (a[i] > 1e-3 && i != n)
> i++;
> return i;
> }
>
> we'd split it as
>
> while (a[i] > 1e-3 && i > n)
> i++;
> while (a[i] > 1e-3 && i < n)
> i++;
>
And then, we can continue to transform it to:
while (a[i] > 1e-3 && i > n && i <= UINT_MAX)
i++;
while (a[i] > 1e-3 && i < n)
i++;
===>
if (i > n)
while (a[i] > 1e-3 && i <= UINT_MAX)
i++;
while (a[i] > 1e-3 && i < n)
i++;
===> (transform to the form below; this may be similar to sinking the zext out of the loop)
long il = i; long nl = n;
if (il > nl)
{
while (a[il] > 1e-3 && il <= UINT_MAX)
il++;
il = il > UINT_MAX ? 0 : il; // il = zext (il#0)
}
while (a[il] > 1e-3 && il < nl)
il++;
i = il;
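
Spelling the last form out as a compilable sketch (LP64 assumed so 'long'
is 64 bits; foo_split is just an illustrative name), which should behave
like the original "while (a[i] > 1e-3 && i != n) i++;" loop:

#include <limits.h>

/* Sketch only, not a finished implementation: the IV is promoted to a
   64-bit 'long' so it cannot wrap, and the unsigned wrap of the original
   IV is modeled once, between the two loops.  */
unsigned foo_split (double *a, unsigned n, unsigned i)
{
  long il = i, nl = n;
  if (il > nl)
    {
      /* i > n part: il only grows, so il > nl stays true until il steps
         past UINT_MAX.  The range test comes first so we never touch an
         element the original loop would not have touched.  */
      while (il <= (long) UINT_MAX && a[il] > 1e-3)
        il++;
      il = il > (long) UINT_MAX ? 0 : il;   /* model the wrap back to 0 */
    }
  /* i < n (or i == n) part: no zero-extension needed inside the loop.  */
  while (a[il] > 1e-3 && il < nl)
    il++;
  return (unsigned) il;
}
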
Thanks a lot for your great help!
BR.
Jiufu Guo.
> but I still see zero-extends here on x86.
>
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway. There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
>
> Not sure if this makes sense..
>
> Richard.
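
(For the 'long n' variant, a hedged sketch of splitting the iteration
space at MIN (n, UINT_MAX): it assumes the loop is known to terminate
before the 32-bit IV would wrap, which is exactly the condition a runtime
check in a split/versioned copy would have to establish; foo_long is an
illustrative name and LP64 is assumed.)

#include <limits.h>

/* Sketch only, assuming the loop terminates before i would wrap (if it
   does not, the original loop never terminates and no split helps).  The
   bound is clamped to the unsigned range so the 64-bit IV stays valid.  */
unsigned foo_long (double *a, long n)
{
  long bound = n <= (long) UINT_MAX ? n : (long) UINT_MAX + 1;
  long il = 0;
  while (a[il] > 1e-3 && il < bound)
    il++;
  return (unsigned) il;
}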
>
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>> >> > under which its solution is valid is a very good and useful thing to do.
>> >>
>> >> Thanks! Enhancing SCEV could help in a few cases, especially when other
>> >> optimizations
>> >> are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.
>> >>
>> >>
>>
>>