[PATCH take 2] ivopts: Improve code generated for very simple loops.

Tue Nov 23 12:46:07 GMT 2021

On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Richard,
> Many thanks for the patch review.
>
> On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote:
> > On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle
> > <roger@nextmovesoftware.com> wrote:
> > >
> > > This patch tidies up the code that GCC generates for simple loops, by
> > > selecting/generating a simpler loop bound expression in ivopts.
> > > Previously:
> > >
> > > void v1 (unsigned long *in, unsigned long *out, unsigned int n) {
> > >   int i;
> > >   for (i = 0; i < n; i++) {
> > >     out[i] = in[i];
> > >   }
> > > }
> > >
> > > on x86_64 generated:
> > > v1:     testl   %edx, %edx
> > >         je      .L1
> > >         movl    %edx, %edx
> > >         xorl    %eax, %eax
> > > .L3:    movq    (%rdi,%rax,8), %rcx
> > >         movq    %rcx, (%rsi,%rax,8)
> > >         addq    $1, %rax
> > >         cmpq    %rax, %rdx
> > >         jne     .L3
> > > .L1:    ret
> > >
> > > and now instead generates:
> > > v1:     testl   %edx, %edx
> > >         je      .L1
> > >         movl    %edx, %edx
> > >         xorl    %eax, %eax
> > >         leaq    0(,%rdx,8), %rcx
> > > .L3:    movq    (%rdi,%rax), %rdx
> > >         movq    %rdx, (%rsi,%rax)
> > >         addq    $8, %rax
> > >         cmpq    %rax, %rcx
> > >         jne     .L3
> > > .L1:    ret
> >
> > Is that actually better?  IIRC the addressing modes are both complex and we
> > now have an extra lea?
>
>
> Technically the induction variable elimination is removing two multiplies by 8
> (or left shifts) from the body of the loop and replacing them with a single
> multiply by 8 prior to the loop, which is exactly the induction variable
> optimization that ivopts is designed to do.  It's true that with x86's complex
> addressing modes these "multiplications" are free, but ivopts is run on all
> targets including those without indexed addressing, and even on x86 there's
> a benefit from shorter instruction encodings.
>
> > For this case I see we generate
> >   _15 = n_10(D) + 4294967295;
> >   _8 = (unsigned long) _15;
> >   _7 = _8 + 1;
> >
> > where n is unsigned int so if we know that n is not zero we can simplify the
> > addition and conveniently the loop header test provides this guarantee.
> > IIRC there were some attempts to enhance match.pd for some cases of such
> > expressions.
>
> Exactly.  The loop optimizers are generating the expression (x-1)+1 and
> assuming that the middle-end will simplify this to x.  Unfortunately, the
> change of modes (designed to prevent overflow) actually makes this
> impossible to simplify, due to concerns about overflow.
>
> > +  /* If AFTER_ADJUST is required, the code below generates the equivalent
> > +   * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
> > +   * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
> > +   * SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
> > +   * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
> > +   * class for common idioms that we know are safe.  */
> >
> > No '* ' each line.
> Doh!  Thanks.  Sometimes I hate vi.
>
> > I wonder if the non-overflowing can be captured by
> >     integer_onep (iv->step)
> >     && max_stmt_executions (loop, &max)
>
> Unfortunately, max_stmt_executions is intended to return the wide_int count
> of iterations, either the known constant value or a profile-based estimate, while
> the optimizations I'm proposing work with symbolic values from variable iteration
> counts.  When the iteration count is a constant, fold-const is already able to
> simplify (x-1)+1 to an integer constant even with type conversions.
>
> > if we then do (niter + 1) * step instead of niter*step + step would that do the
> > same?
>
> Yes.  I've extended the scope of the patch to now also handle loops of the
> form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we
> now use (end-beg) as the incremented niter, so the invariant expression now
> becomes (end-beg)*4 instead of the currently generated:  ((end-beg)-1)*4 + 4.
>
> I'm assuming that the niter*step + step is by design (for the general case),
> so I'm only tweaking the (common) corner cases, where it's easy to see that
> it's safe to substitute a simpler expression.  For more general affine recurrences
> in complex loop nests, niter*step + step may be preferred.
>
> > That said - what the change does is actually ensure that we CSE niter + 1
> > with the bound of the simplified exit test?
>
> Not quite, this simply provides a simplified expression for "niter + 1" that
> takes advantage of the implicit range information we have.  You're right
> in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L
> more generally.   In this case, I suspect it's just a historical artifact that
> much of the literature on (iv) loop optimizations dates back to Fortran
> and Algol where arrays were 1-based rather than 0-based, so decades
> later gcc 11 on x86_64 sometimes requires an extra inc/dec instruction
> for purely historical reasons.
>
> > The patch doesn't add any testcase.
>
> The three new attached tests check that the critical invariants have a
> simpler form, and hopefully shouldn't be affected by whether the optimizer
> and/or backend costs actually decide to perform this iv substitution or not.

The testcases might depend on lp64 though, did you test them with -m32?
IMHO it's fine to require lp64 here.

I'm a bit unsure about adding this special-casing in cand_value_at in
general - it does seem that we're doing sth wrong elsewhere - either
by not simplifying even though enough knowledge is there or by
throwing away knowledge earlier (during niter analysis?).

Anyway, the patch does look quite safe - can you show some
statistics in how many time there's extra simplification this way
during say bootstrap?

Thanks,
Richard.

> > Thanks,
> > Richard.
>
>
> This revised patch has been tested on x86_64-pc-linux-gnu with a
> make bootstrap and make -k check with no new failures.
>
> 2021-11-18  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * tree-ssa-loop-ivopts.c (cand_value_at): Take a class
>         tree_niter_desc* argument instead of just a tree for NITER.
>         If we require the iv candidate value at the end of the final
>         loop iteration, try using the original loop bound as the
>         NITER for sufficiently simple loops.
>         (may_eliminate_iv): Update (only) call to cand_value_at.
>
> gcc/testsuite
>         * gcc.dg/tree-ssa/ivopts-5.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-6.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-7.c: New test case.
>         * gcc.dg/wrapped-binop-simplify.c: Update expected test result.
>
>
> Many thanks in advance (fingers-crossed this is still suitable/grandfathered-in
> at the start of Stage 3).
>
> Roger
> --
>