[RFC] avoid type conversion through loop versioning
guojiufu
guojiufu@linux.ibm.com
Fri Mar 26 08:19:38 GMT 2021
On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> >> > wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> >> also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> such cases, similar to how niter analysis computes its 'assumptions'
>> >> >> >> >> condition. That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any. Ideally
>> >> >> >> >> such
>> >> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> >> transform.
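
(As an aside, a made-up C example of what such a niter 'assumption' could
look like, purely as an illustration and not GCC internals: here the trip
count n / 2 is only valid under the runtime assumption that n is even.)

  /* Illustration only: the trip count n / 2 needs the runtime assumption
     (n % 2) == 0; with an odd n the exit test i != n is never true and
     the loop does not terminate.  */
  int sum_every_other (int *a, unsigned n)
  {
    int s = 0;
    for (unsigned i = 0; i != n; i += 2)
      s += a[i];
    return s;
  }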
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization
>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> >> > Find out what type sizes the loop uses, first try to demote
>> >> >> >> > computations
>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>> >> >> >> > something
>> >> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> >> > narrower
>> >> >> >> > type), then pick up the widest type size still in use in the loop
>> >> >> >> > (ok,
>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> >> > currently
>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>> >> >> >> > that
>> >> >> >> > could
>> >> >> >> > be promoted to that type size. We do partially something like that
>> >> >> >> > during
>> >> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >> >
>> >> >> >> > Jakub
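
(A made-up example of such a demotion candidate, with hypothetical names:
the multiply is done in int only to have the result truncated back to
short at the store, so the whole computation could be carried out in short
and use twice as many vector lanes.)

  /* Hypothetical illustration of a demotion opportunity: src[i] is
     promoted to int for the multiply, but the result is immediately
     demoted to short, so the computation could be narrowed to short.  */
  void scale_by_3 (short *dst, const short *src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = (short) (src[i] * 3);
  }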
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> >> in
>> >> >> >> some cases.
>> >> >> >> While one of the direct ideas of using the '64bit type' is to
>> >> >> >> eliminate
>> >> >> >> conversions,
>> >> >> >> even for some cases which are not easy to optimize through
>> >> >> >> ifconv/vectorization,
>> >> >> >> for example:
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (a[i]>1e-3)
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> Or should we only do versioning on the type for this kind of loop? Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront which means we'll pay quite a big code size cost
>> >> >> > for unknown questionable gain. What's the particular optimization
>> >> >>
>> >> >> Right. Code size increase is a big pain for large loops. If the gain
>> >> >> is not significant, this optimization may not be profitable.
>> >> >>
>> >> >> > in the above cases? Note that for example for
>> >> >> >
>> >> >> > unsigned int i = 0;
>> >> >> > while (a[i]>1e-3)
>> >> >> > i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> >> > There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", so the upper bound
>> >> >> is checkable. Otherwise, the optimization to avoid the zext is not applicable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T) which is done in a
>> >> >> > larger
>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> >> > operation
>> >> >> > that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate,
>> >> >> which is the direct goal of the optimization.
>> >> >>
>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>> >> >> code size increase is small enough, is a question that needs a
>> >> >> trade-off.
>> >> >> It may only be acceptable if the loop is very small; then eliminating the
>> >> >> 'zext' would help save runtime, and the code size increase may not be big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off. The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agree, I understand your point. The concern is that some
>> >> micro-architectures do not handle this very well yet. I tested the above
>> >> example code:
>> >> unsigned i = 0;
>> >> while (a[i] > 1e-3 && i < n)
>> >> i++;
>> >> there is a ~30% performance improvement when using "long i" instead of
>> >> "unsigned i" on ppc64le and x86. It seems those instructions are not
>> >> optimized well on some platforms. So, I'm wondering whether we need to
>> >> do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> > unsigned i = 0;
>> > while (a[i] > 1e-3 && i < n)
>> > i++;
>> > return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>> > produces
>> >
>> > .L5:
>> > lfd %f0,0(%r9)
>> > addi %r3,%r3,1
>> > addi %r9,%r9,8
>> > rldicl %r3,%r3,0,32
>> > fcmpu %cr0,%f0,%f12
>> > bnglr %cr0
>> > bdnz .L5
>> >
>> > which looks pretty good to me, I suppose the rldicl is the
>> > zero-extension but the IVs are already 64bit and the
>> > zero-extension should be sunk to the loop exit instead.
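
(Roughly what sinking the zero-extension to the loop exit amounts to at the
source level, assuming an LP64 target where 'long' is 64 bits; foo_sunk is
just an illustrative name.)

  /* Illustrative sketch: i starts at 0 and is bounded by n, so it never
     wraps and can be kept in a 64-bit variable inside the loop.  */
  unsigned foo_sunk (double *a, unsigned n)
  {
    long ln = n;           /* n is widened once, before the loop */
    long i = 0;
    while (a[i] > 1e-3 && i < ln)
      i++;
    return (unsigned) i;   /* single truncation at the loop exit */
  }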
>>
>> Thanks so much for your quick reply!
>>
>> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
>> which also improves performance.
>>
>> :) my test code is a little different: the argument n's type is long
>> (sorry).
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>> unsigned i = 0;
>> while (a[i] > 1e-3 && i < n)
>> i++;
>> return i;
>> }
>
> You can't elide the zero-extend here since n can be bigger
> than UINT_MAX and thus i has to wrap.
>
>> 111: L111:
>> 39: pc={(%7:CC>=0)?simple_return:pc}
>> 102: %3:SI=%3:SI+0x1
>> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
>> 104: %3:DI=zero_extend(%3:SI)
>> 105: %7:CC=cmp(%3:DI,%4:DI)
>> 106: %0:DF=[%10:DI+%9:DI]
>> 107: %0:CCFP=cmp(%0:DF,%12:DF)
>> 109: pc={(%0:CCFP>0)?L111:pc}
>>
>> .L14:
>> bgelr 7
>> addi 3,3,1
>> rldic 9,3,3,29
>> rldicl 3,3,0,32
>> cmpd 7,3,4
>> lfdx 0,10,9
>> fcmpu 0,0,12
>> bgt 0,.L14
>>
>> -------------- or for code:
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
>> unsigned i)
>> {
>> while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is
>> sinkable
>> i++;
>> return i;
>> }
>
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
>
>> 99: L99:
>> 69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>> 36: L36:
>> 23: %5:SI=%5:SI+0x1
>> 25: %9:DI=%5:DI<<0x3&0x7fffffff8
>> 24: %5:DI=zero_extend(%5:SI)
>> 29: %0:DF=[%3:DI+%9:DI]
>> 30: %0:CCFP=cmp(%0:DF,%12:DF)
>> 31: pc={(%0:CCFP>0)?L99:pc}
>>
>>
>> .L12:
>> bdz .L2
>> .L5:
>> addi 5,5,1
>> rldic 9,5,3,29
>> rldicl 5,5,0,32
>> lfdx 0,3,9
>> fcmpu 0,0,12
>> bgt 0,.L12
>> .L2:
>>
>> Any suggestions about how to avoid the 'zext' in the above cases? My
>> previous idea
>> was versioning :).
>
> I don't think there's another good way. Not sure if I'd call it
> versioning, I suppose for this kind of possibly wrapping IV it
> would be iteration space splitting? So maybe it could be integrated
> into the loop split pass somehow? Let's look at
>
> unsigned foo (double *a, unsigned n, unsigned i)
> {
> while (a[i] > 1e-3 && i != n)
> i++;
> return i;
> }
>
> we'd split it as
>
> while (a[i] > 1e-3 && i > n)
> i++;
> while (a[i] > 1e-3 && i < n)
> i++;
>
And then, we can continue to transform it to:
while (a[i] > 1e-3 && i > n && i <= UINT_MAX)
i++;
while (a[i] > 1e-3 && i < n)
i++;
===>
if (i > n)
while (a[i] > 1e-3 && i <= UINT_MAX)
i++;
while (a[i] > 1e-3 && i < n)
i++;
===> (transform to the form below; this may be similar to sinking the zext out of the loop)
long il = i; long nl = n;
if (il > nl)
{
while (a[il] > 1e-3 && il <= UINT_MAX)
il++;
il = il > UINT_MAX ? 0 : il; // il = zext (il#0)
}
while (a[il] > 1e-3 && il < nl)
il++;
i = il;
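
Spelling the last form out as a compilable sketch (LP64 assumed so 'long'
is 64 bits; foo_split is just an illustrative name), which should behave
like the original "while (a[i] > 1e-3 && i != n) i++;" loop:

#include <limits.h>

/* Sketch only, not a finished implementation: the IV is promoted to a
   64-bit 'long' so it cannot wrap, and the unsigned wrap of the original
   IV is modeled once, between the two loops.  */
unsigned foo_split (double *a, unsigned n, unsigned i)
{
  long il = i, nl = n;
  if (il > nl)
    {
      /* i > n part: il only grows, so il > nl stays true until il steps
         past UINT_MAX.  The range test comes first so we never touch an
         element the original loop would not have touched.  */
      while (il <= (long) UINT_MAX && a[il] > 1e-3)
        il++;
      il = il > (long) UINT_MAX ? 0 : il;   /* model the wrap back to 0 */
    }
  /* i < n (or i == n) part: no zero-extension needed inside the loop.  */
  while (a[il] > 1e-3 && il < nl)
    il++;
  return (unsigned) il;
}
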
Thanks a lot for your great help!
BR.
Jiufu Guo.
> but I still see zero-extends here on x86.
>
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway. There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
>
> Not sure if this makes sense..
>
> Richard.
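
(For the 'long n' variant, a hedged sketch of splitting the iteration
space at MIN (n, UINT_MAX): it assumes the loop is known to terminate
before the 32-bit IV would wrap, which is exactly the condition a runtime
check in a split/versioned copy would have to establish; foo_long is an
illustrative name and LP64 is assumed.)

#include <limits.h>

/* Sketch only, assuming the loop terminates before i would wrap (if it
   does not, the original loop never terminates and no split helps).  The
   bound is clamped to the unsigned range so the 64-bit IV stays valid.  */
unsigned foo_long (double *a, long n)
{
  long bound = n <= (long) UINT_MAX ? n : (long) UINT_MAX + 1;
  long il = 0;
  while (a[il] > 1e-3 && il < bound)
    il++;
  return (unsigned) il;
}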
>
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>> >> > under which its solution is valid is a very good and useful thing to do.
>> >>
>> >> Thanks! Enhancing SCEV could help in a few cases, especially when other
>> >> optimizations
>> >> are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.
>> >>
>> >>
>>
>>