[RFC] avoid type conversion through versioning loop

guojiufu guojiufu@linux.ibm.com
Thu Mar 25 13:47:01 GMT 2021


On 2021-03-25 21:32, guojiufu via Gcc wrote:
> On 2021-03-25 16:35, Richard Biener wrote:
>> On Wed, 24 Mar 2021, guojiufu wrote:
>> 
>>> On 2021-03-24 20:33, Richard Biener wrote:
>>> > On Wed, 24 Mar 2021, guojiufu wrote:
>>> >
>>> >> On 2021-03-24 15:55, Richard Biener wrote:
>>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>>> >> >>
>>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>>> >> >> >> > wrote:
>>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>>> >> >> >> >> also
>>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>>> >> >> >> >> cases, similar to how niter analysis computes its 'assumptions'
>>> >> >> >> >> condition.  That allows the versioning to be done when there's an
>>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>>> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally
>>> >> >> >> >> such
>>> >> >> >> >> transform would then choose IVs in their transformed copy that
>>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>>> >> >> >> >> transform.
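
To make sure I follow the suggestion, here is a small hand-written C
example of the kind of condition I understand SCEV could record (the
function and the wording of the assumption are mine, not existing GCC
behaviour):

/* With 'unsigned int i' the access a[i] is effectively &a + zext (i) * 8.
   That address only has the affine evolution {&a, +, 8} when 'i' does not
   wrap, i.e. under the assumption n <= UINT_MAX.  Recording that condition
   instead of giving up would let a later transform (e.g. vectorization)
   version or guard on it only when it actually pays off.  */
double sum (double *a, long n)
{
  double s = 0.0;
  for (unsigned int i = 0; i < n; i++)
    s += a[i];
  return s;
}
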
>>> >> >> >> >
>>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>>> >> >> >> > pass, either early during vectorization or separately before
>>> >> >> >> > vectorization
>>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>>> >> >> >> > Find out what type sizes the loop uses, first try to demote
>>> >> >> >> > computations
>>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>>> >> >> >> > something
>>> >> >> >> > is computed in a wider type only to have the result demoted to
>>> >> >> >> > narrower
>>> >> >> >> > type), then pick up the widest type size still in use in the loop
>>> >> >> >> > (ok,
>>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>>> >> >> >> > currently
>>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>>> >> >> >> > that
>>> >> >> >> > could
>>> >> >> >> > be promoted to that type size.  We do partially something like that
>>> >> >> >> > during
>>> >> >> >> > vect patterns for bool types, but not other types I think.
>>> >> >> >> >
>>> >> >> >> >       Jakub
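
As I read the idea, it would apply to loops like this hand-written one
(the example is mine, just for illustration): C's integer promotion does
the arithmetic in 'int' even though only the low 16 bits survive the
store, so the computation is a demotion candidate for the vectorized copy.

/* Demotion candidate: 'src[i]' is promoted to 'int', so the multiply and
   add are done in 32 bits, but the store back to 'short' keeps only the
   low 16 bits; the same result could be computed in 16-bit lanes.  */
void scale (short *dst, const short *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 2 + 1;
}
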
>>> >> >> >>
>>> >> >> >> Thanks for the suggestions!
>>> >> >> >>
>>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>>> >> >> >> in
>>> >> >> >> some cases.
>>> >> >> >> One of the direct goals of using a '64-bit type' is to eliminate
>>> >> >> >> conversions, even for cases which are not easy to optimize through
>>> >> >> >> ifconv/vectorization, for example:
>>> >> >> >>
>>> >> >> >>    unsigned int i = 0;
>>> >> >> >>    while (a[i]>1e-3)
>>> >> >> >>      i++;
>>> >> >> >>
>>> >> >> >>    unsigned int i = 0;
>>> >> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
>>> >> >> >>      i++;
>>> >> >> >>
>>> >> >> >> Or should we only do versioning on the type for this kind of loop? Any suggestions?
>>> >> >> >
>>> >> >> > But the "optimization" resulting from such versioning is hard to
>>> >> >> > determine upfront which means we'll pay quite a big code size cost
>>> >> >> > for unknown questionable gain.  What's the particular optimization
>>> >> >>
>>> >> >> Right.  The code size increase is a big pain for large loops.  If the
>>> >> >> gain is not significant, this optimization may not be positive.
>>> >> >>
>>> >> >> > in the above cases?  Note that for example for
>>> >> >> >
>>> >> >> >     unsigned int i = 0;
>>> >> >> >     while (a[i]>1e-3)
>>> >> >> >        i++;
>>> >> >> >
>>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>>> >> >> > There's
>>> >> >>
>>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)"; the upper bound
>>> >> >> is checkable.  Otherwise, the optimization to avoid the zext is not applicable.
>>> >> >>
>>> >> >> > the address computation that is i * sizeof (T) which is done in a
>>> >> >> > larger
>>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>>> >> >> > operation
>>> >> >> > that is 'slow' for you?
>>> >> >>
>>> >> >> This is the point: "zext(i)" is the instruction that I want to
>>> >> >> eliminate,
>>> >> >> which is the direct goal of the optimization.
>>> >> >>
>>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>>> >> >> code size increase is small enough, is a question that needs a
>>> >> >> trade-off.  It may only be acceptable if the loop is very small; then
>>> >> >> eliminating the 'zext' would help to save runtime, and the code size
>>> >> >> increase may not be big.
>>> >> >
>>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>>> >> > make versioning a good trade-off.  The micro-architecture should better
>>> >> > not make that totally slow (I'd expect an extra latency comparable to
>>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>>> >>
>>> >> Agree, I understand your point.  The concern is that some
>>> >> micro-architectures do not handle this very well yet.  I tested the
>>> >> above example code:
>>> >>   unsigned i = 0;
>>> >>   while (a[i] > 1e-3 && i < n)
>>> >>     i++;
>>> >> there is a ~30% performance improvement when using "long i" instead of
>>> >> "unsigned i" on ppc64le and x86.  It seems those instructions are not
>>> >> well optimized on some platforms.  So, I'm wondering whether we need to
>>> >> handle this in GCC.
>>> >
>>> > On x86 I see indexed addressing modes being used which should be fine.
>>> > Compilable testcase:
>>> >
>>> > unsigned foo (double *a, unsigned n)
>>> > {
>>> >   unsigned i = 0;
>>> >   while (a[i] > 1e-3 && i < n)
>>> >     i++;
>>> >   return i;
>>> > }
>>> >
>>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>>> > produces
>>> >
>>> > .L5:
>>> >         lfd %f0,0(%r9)
>>> >         addi %r3,%r3,1
>>> >         addi %r9,%r9,8
>>> >         rldicl %r3,%r3,0,32
>>> >         fcmpu %cr0,%f0,%f12
>>> >         bnglr %cr0
>>> >         bdnz .L5
>>> >
>>> > which looks pretty good to me, I suppose the rldicl is the
>>> > zero-extension but the IVs are already 64bit and the
>>> > zero-extension should be sunk to the loop exit instead.
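
At the source level, sinking the extension roughly corresponds to this
hand-written variant (only a sketch of the intended effect, not what the
compiler emits; 'foo_sunk' is my name for it):

unsigned foo_sunk (double *a, unsigned n)
{
  unsigned long i = 0;      /* 64-bit IV used for the compare and address */
  while (a[i] > 1e-3 && i < n)
    i++;
  return (unsigned) i;      /* the only narrowing happens at the exit */
}
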
>>> 
>>> Thanks so much for your quick reply!
>>> 
>>> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
>>> which also improves performance.
>>> 
>>> :) my test code is a little different:  argument n's type is long
>>> (sorry).
>>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>>> {
>>>   unsigned i = 0;
>>>   while (a[i] > 1e-3 && i < n)
>>>     i++;
>>>   return i;
>>> }
>> 
>> You can't elide the zero-extend here since n can be bigger
>> than MAX_UINT and thus i has to wrap.
>> 
>>>   111: L111:
>>>   39: pc={(%7:CC>=0)?simple_return:pc}
>>>   102: %3:SI=%3:SI+0x1
>>>   103: %9:DI=%3:DI<<0x3&0x7fffffff8
>>>   104: %3:DI=zero_extend(%3:SI)
>>>   105: %7:CC=cmp(%3:DI,%4:DI)
>>>   106: %0:DF=[%10:DI+%9:DI]
>>>   107: %0:CCFP=cmp(%0:DF,%12:DF)
>>>   109: pc={(%0:CCFP>0)?L111:pc}
>>> 
>>> .L14:
>>>         bgelr 7
>>>         addi 3,3,1
>>>         rldic 9,3,3,29
>>>         rldicl 3,3,0,32
>>>         cmpd 7,3,4
>>>         lfdx 0,10,9
>>>         fcmpu 0,0,12
>>>         bgt 0,.L14
>>> 
>>> -------------- or for code:
>>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n, unsigned i)
>>> {
>>>   while (a[i] > 1e-3 && i != n) // if using "&& i < n" the zext is sinkable
>>>     i++;
>>>   return i;
>>> }
>> 
>> With i != n we cannot compute the number of iterations because i can
>> wrap, thus we cannot elide the zero-extension.
> 
> Yes, because 'i' could wrap, the 'zext' cannot be removed from the loop
> body.
> 
>> 
>>>    99: L99:
>>>    69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>>> scratch;clobber scratch;}
>>>    36: L36:
>>>    23: %5:SI=%5:SI+0x1
>>>    25: %9:DI=%5:DI<<0x3&0x7fffffff8
>>>    24: %5:DI=zero_extend(%5:SI)
>>>    29: %0:DF=[%3:DI+%9:DI]
>>>    30: %0:CCFP=cmp(%0:DF,%12:DF)
>>>    31: pc={(%0:CCFP>0)?L99:pc}
>>> 
>>> 
>>> .L12:
>>>         bdz .L2
>>> .L5:
>>>         addi 5,5,1
>>>         rldic 9,5,3,29
>>>         rldicl 5,5,0,32
>>>         lfdx 0,3,9
>>>         fcmpu 0,0,12
>>>         bgt 0,.L12
>>> .L2:
>>> 
>>> Any suggestions about how to avoid the 'zext' in the above cases?  My
>>> previous idea was versioning :).
>> 
>> I don't think there's another good way.  Not sure if I'd call it
>> versioning, I suppose for these kinds of possibly wrapping IVs it
>> would be iteration space splitting?  So maybe it could be integrated
>> into the loop split pass somehow?  Let's look at
>> 
>> unsigned foo (double *a, unsigned n, unsigned i)
>> {
>>    while (a[i] > 1e-3 && i != n)
>>      i++;
>>    return i;
>> }
>> 
>> we'd split it as
>> 
>>   while (a[i] > 1e-3 && i > n)
>>     i++;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
> Or:
> if (i < n)
>   while (a[i] > 1e-3 && i < n)   // i steps up to n, does not wrap
>     i++;
> else if (i > n)
>   while (a[i] > 1e-3 && i > n)   // or "while (a[i] > 1e-3 && i <= UINT_MAX)"
>     i++;
> 
   There is a typo; it should be:
else if (i > n)
   while (a[i] > 1e-3 && i != n) // i wraps.
     i++;

As in your split loops:
while (a[i] > 1e-3 && i > n) // i may step to UINT_MAX and then wrap to 0
   i++;
while (a[i] > 1e-3 && i < n) // will not wrap
   i++;

> These two new loops may be able to elide the 'zext'.
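
For reference, the split written out by hand as a complete function
(a sketch only; 'foo_split' is my name, not compiler output):

/* The sequential split applied to the earlier testcase.  The second loop
   cannot wrap; in the first loop 'i' runs up to UINT_MAX and the wrap to
   0 ends it.  Whether the zero-extension can really be elided in each
   part is the question discussed here.  */
unsigned foo_split (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i > n)   /* ends when i wraps to 0 or a[] fails */
    i++;
  while (a[i] > 1e-3 && i < n)   /* i cannot wrap here */
    i++;
  return i;
}
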
> 
>> 
>> but I still see zero-extends here on x86.
>> 
>> For the case with 'long n' the issue is that the loop might not even
>> terminate and thus i wraps anyway.  There one could split the
>> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
> 
> Yes, if 'long n' > UINT_MAX, 'i' would wrap, so the 'zext' is kept in the
> loop body.
> If the compiler can analyze that 'i' will not wrap, then the 'zext' may
> be moved out of the loop.
> 
> if (n <= UINT_MAX)
>   the loop runs at most n times, so 'i' does not wrap
> else
>   the loop may be infinite, or say 'unsigned i < n' is always true
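
To make that concrete, a hand-written sketch of splitting at
MIN (n, UINT_MAX) at the source level ('foo_long_split' is an
illustrative name; it assumes an LP64 target where 'long' is wider
than 'unsigned'):

#include <limits.h>

unsigned foo_long_split (double *a, long n)
{
  unsigned i = 0;
  unsigned bound = n <= 0 ? 0u
                 : n > (long) UINT_MAX ? UINT_MAX
                 : (unsigned) n;
  while (a[i] > 1e-3 && i < bound)   /* i cannot wrap here */
    i++;
  if (n > (long) UINT_MAX)           /* 'i < n' is always true in this case */
    while (a[i] > 1e-3)              /* i wraps, as in the original loop */
      i++;
  return i;
}
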
> 
> 
> Thanks for your always helpful suggestions!
> 
> 
> BR.
> Jiufu Guo.
> 
> 
>> 
>> Not sure if this makes sense..
> 
> 
>> 
>> Richard.
>> 
>>> >
>>> >> >
>>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>>> >> > under which its solution is valid is a very good and useful thing to do.
>>> >>
>>> >> Thanks! Enhance SCEV could help a few cases, especially when other
>>> >> optimizations
>>> >> are enabled.
>>> >>
>>> >> Thanks again for your suggestions!
>>> >>
>>> >> BR.
>>> >> Jiufu Guo.
>>> >>
>>> >> >
>>> >> > Richard.
>>> >> >
>>> >> >> Thanks again for your very helpful comments!
>>> >> >>
>>> >> >> BR.
>>> >> >> Jiufu Guo.
>>> >> >>
>>> >> >> >
>>> >> >> > Richard.
>>> >> >> >
>>> >> >> >> BR.
>>> >> >> >> Jiufu Guo.
>>> >>
>>> >>
>>> 
>>> 

