[PATCH take 2] ivopts: Improve code generated for very simple loops.

Roger Sayle roger@nextmovesoftware.com
Thu Nov 18 15:18:02 GMT 2021

Hi Richard,
Many thanks for the patch review.

On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote:
> On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> >
> > This patch tidies up the code that GCC generates for simple loops, by
> > selecting/generating a simpler loop bound expression in ivopts.
> > Previously:
> >
> > void v1 (unsigned long *in, unsigned long *out, unsigned int n) {
> >   int i;
> >   for (i = 0; i < n; i++) {
> >     out[i] = in[i];
> >   }
> > }
> >
> > on x86_64 generated:
> > v1:     testl   %edx, %edx
> >         je      .L1
> >         movl    %edx, %edx
> >         xorl    %eax, %eax
> > .L3:    movq    (%rdi,%rax,8), %rcx
> >         movq    %rcx, (%rsi,%rax,8)
> >         addq    $1, %rax
> >         cmpq    %rax, %rdx
> >         jne     .L3
> > .L1:    ret
> >
> > and now instead generates:
> > v1:     testl   %edx, %edx
> >         je      .L1
> >         movl    %edx, %edx
> >         xorl    %eax, %eax
> >         leaq    0(,%rdx,8), %rcx
> > .L3:    movq    (%rdi,%rax), %rdx
> >         movq    %rdx, (%rsi,%rax)
> >         addq    $8, %rax
> >         cmpq    %rax, %rcx
> >         jne     .L3
> > .L1:    ret
> Is that actually better?  IIRC the addressing modes are both complex and we
> now have an extra lea?

Technically the induction variable elimination is removing two multiplies by 8
(or left shifts) from the body of the loop and replacing them with a single
multiply by 8 prior to the loop, which is exactly the induction variable 
optimization that ivopts is designed to do.  It's true that with x86's complex
addressing modes these "multiplications" are free, but ivopts is run on all
targets including those without indexed addressing, and even on x86 there's
a benefit from shorter instruction encodings.

> For this case I see we generate
>   _15 = n_10(D) + 4294967295;
>   _8 = (unsigned long) _15;
>   _7 = _8 + 1;
> where n is unsigned int so if we know that n is not zero we can simplify the
> addition and conveniently the loop header test provides this guarantee.
> IIRC there were some attempts to enhance match.pd for some cases of such
> expressions.

Exactly.  The loop optimizers are generating the expression (x-1)+1 and
assuming that the middle-end will simplify this to x.  Unfortunately, the
change of modes (designed to prevent overflow) actually makes this
impossible to simplify, due to concerns about overflow.

> +  /* If AFTER_ADJUST is required, the code below generates the equivalent
> +   * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
> +   * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
> +   * SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
> +   * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
> +   * class for common idioms that we know are safe.  */
> No '* ' each line.
Doh!  Thanks.  Sometimes I hate vi.

> I wonder if the non-overflowing can be captured by
>     integer_onep (iv->step)
>     && max_stmt_executions (loop, &max)

Unfortunately, max_stmt_executions is intended to return the wide_int count
of iterations, either the known constant value or a profile-based estimate, while
the optimizations I'm proposing work with symbolic values from variable iteration
counts.  When the iteration count is a constant, fold-const is already able to
simplify (x-1)+1 to an integer constant even with type conversions.

> if we then do (niter + 1) * step instead of niter*step + step would that do the
> same?

Yes.  I've extended the scope of the patch to now also handle loops of the
form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we
now use (end-beg) as the incremented niter, so the invariant expression now
becomes (end-beg)*4 instead of the currently generated:  ((end-beg)-1)*4 + 4.

I'm assuming that the niter*step + step is by design (for the general case),
so I'm only tweaking the (common) corner cases, where it's easy to see that
it's safe to substitute a simpler expression.  For more general affine recurrences
in complex loop nests, niter*step + step may be preferred.

> That said - what the change does is actually ensure that we CSE niter + 1
> with the bound of the simplified exit test?

Not quite, this simply provides a simplified expression for "niter + 1" that
takes advantage of the implicit range information we have.  You're right
in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L
more generally.   In this case, I suspect it's just a historical artifact that
much of the literature on (iv) loop optimizations dates back to Fortran
and Algol where arrays were 1-based rather than 0-based, so decades
later gcc 11 on x86_64 sometimes requires an extra inc/dec instruction
for purely historical reasons.

> The patch doesn't add any testcase.

The three new attached tests check that the critical invariants have a
simpler form, and hopefully shouldn't be affected by whether the optimizer
and/or backend costs actually decide to perform this iv substitution or not.

> Thanks,
> Richard.

This revised patch has been tested on x86_64-pc-linux-gnu with a
make bootstrap and make -k check with no new failures.

2021-11-18  Roger Sayle  <roger@nextmovesoftware.com>

	* tree-ssa-loop-ivopts.c (cand_value_at): Take a class
	tree_niter_desc* argument instead of just a tree for NITER.
	If we require the iv candidate value at the end of the final
	loop iteration, try using the original loop bound as the
	NITER for sufficiently simple loops.
	(may_eliminate_iv): Update (only) call to cand_value_at.

	* gcc.dg/tree-ssa/ivopts-5.c: New test case.
	* gcc.dg/tree-ssa/ivopts-6.c: New test case.
	* gcc.dg/tree-ssa/ivopts-7.c: New test case.
	* gcc.dg/wrapped-binop-simplify.c: Update expected test result.

Many thanks in advance (fingers-crossed this is still suitable/grandfathered-in
at the start of Stage 3).


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patchn2b.txt
URL: <https://gcc.gnu.org/pipermail/gcc-patches/attachments/20211118/71330b1f/attachment.txt>

More information about the Gcc-patches mailing list