[PATCH take 2] ivopts: Improve code generated for very simple loops.
Richard Biener
richard.guenther@gmail.com
Tue Nov 23 12:46:07 GMT 2021
On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Richard,
> Many thanks for the patch review.
>
> On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote:
> > On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle
> > <roger@nextmovesoftware.com> wrote:
> > >
> > > This patch tidies up the code that GCC generates for simple loops, by
> > > selecting/generating a simpler loop bound expression in ivopts.
> > > Previously:
> > >
> > > void v1 (unsigned long *in, unsigned long *out, unsigned int n) {
> > > int i;
> > > for (i = 0; i < n; i++) {
> > > out[i] = in[i];
> > > }
> > > }
> > >
> > > on x86_64 generated:
> > > v1: testl %edx, %edx
> > > je .L1
> > > movl %edx, %edx
> > > xorl %eax, %eax
> > > .L3: movq (%rdi,%rax,8), %rcx
> > > movq %rcx, (%rsi,%rax,8)
> > > addq $1, %rax
> > > cmpq %rax, %rdx
> > > jne .L3
> > > .L1: ret
> > >
> > > and now instead generates:
> > > v1: testl %edx, %edx
> > > je .L1
> > > movl %edx, %edx
> > > xorl %eax, %eax
> > > leaq 0(,%rdx,8), %rcx
> > > .L3: movq (%rdi,%rax), %rdx
> > > movq %rdx, (%rsi,%rax)
> > > addq $8, %rax
> > > cmpq %rax, %rcx
> > > jne .L3
> > > .L1: ret
> >
> > Is that actually better? IIRC the addressing modes are both complex and we
> > now have an extra lea?
>
>
> Technically the induction variable elimination is removing two multiplies by 8
> (or left shifts) from the body of the loop and replacing them with a single
> multiply by 8 prior to the loop, which is exactly the induction variable
> optimization that ivopts is designed to do. It's true that with x86's complex
> addressing modes these "multiplications" are free, but ivopts is run on all
> targets including those without indexed addressing, and even on x86 there's
> a benefit from shorter instruction encodings.
>
> > For this case I see we generate
> > _15 = n_10(D) + 4294967295;
> > _8 = (unsigned long) _15;
> > _7 = _8 + 1;
> >
> > where n is unsigned int so if we know that n is not zero we can simplify the
> > addition and conveniently the loop header test provides this guarantee.
> > IIRC there were some attempts to enhance match.pd for some cases of such
> > expressions.
>
> Exactly. The loop optimizers are generating the expression (x-1)+1 and
> assuming that the middle-end will simplify this to x. Unfortunately, the
> change of modes (designed to prevent overflow) actually makes this
> impossible to simplify, due to concerns about overflow.
>
> > + /* If AFTER_ADJUST is required, the code below generates the equivalent
> > + * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
> > + * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
> > + * SSA_NAME - 1. Unfortunately, guaranteeing that adding 1 to NITER
> > + * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
> > + * class for common idioms that we know are safe. */
> >
> > No '* ' each line.
> Doh! Thanks. Sometimes I hate vi.
>
> > I wonder if the non-overflowing can be captured by
> > integer_onep (iv->step)
> > && max_stmt_executions (loop, &max)
>
> Unfortunately, max_stmt_executions is intended to return the wide_int count
> of iterations, either the known constant value or a profile-based estimate, while
> the optimizations I'm proposing work with symbolic values from variable iteration
> counts. When the iteration count is a constant, fold-const is already able to
> simplify (x-1)+1 to an integer constant even with type conversions.
>
> > if we then do (niter + 1) * step instead of niter*step + step would that do the
> > same?
>
> Yes. I've extended the scope of the patch to now also handle loops of the
> form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we
> now use (end-beg) as the incremented niter, so the invariant expression now
> becomes (end-beg)*4 instead of the currently generated: ((end-beg)-1)*4 + 4.
>
> I'm assuming that the niter*step + step is by design (for the general case),
> so I'm only tweaking the (common) corner cases, where it's easy to see that
> it's safe to substitute a simpler expression. For more general affine recurrences
> in complex loop nests, niter*step + step may be preferred.
>
> > That said - what the change does is actually ensure that we CSE niter + 1
> > with the bound of the simplified exit test?
>
> Not quite, this simply provides a simplified expression for "niter + 1" that
> takes advantage of the implicit range information we have. You're right
> in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L
> more generally. In this case, I suspect it's just a historical artifact that
> much of the literature on (iv) loop optimizations dates back to Fortran
> and Algol where arrays were 1-based rather than 0-based, so decades
> later gcc 11 on x86_64 sometimes requires an extra inc/dec instruction
> for purely historical reasons.
>
> > The patch doesn't add any testcase.
>
> The three new attached tests check that the critical invariants have a
> simpler form, and hopefully shouldn't be affected by whether the optimizer
> and/or backend costs actually decide to perform this iv substitution or not.
The testcases might depend on lp64 though, did you test them with -m32?
IMHO it's fine to require lp64 here.
I'm a bit unsure about adding this special-casing in cand_value_at in
general - it does seem that we're doing sth wrong elsewhere - either
by not simplifying even though enough knowledge is there or by
throwing away knowledge earlier (during niter analysis?).
Anyway, the patch does look quite safe - can you show some
statistics in how many time there's extra simplification this way
during say bootstrap?
Thanks,
Richard.
> > Thanks,
> > Richard.
>
>
> This revised patch has been tested on x86_64-pc-linux-gnu with a
> make bootstrap and make -k check with no new failures.
>
> 2021-11-18 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> * tree-ssa-loop-ivopts.c (cand_value_at): Take a class
> tree_niter_desc* argument instead of just a tree for NITER.
> If we require the iv candidate value at the end of the final
> loop iteration, try using the original loop bound as the
> NITER for sufficiently simple loops.
> (may_eliminate_iv): Update (only) call to cand_value_at.
>
> gcc/testsuite
> * gcc.dg/tree-ssa/ivopts-5.c: New test case.
> * gcc.dg/tree-ssa/ivopts-6.c: New test case.
> * gcc.dg/tree-ssa/ivopts-7.c: New test case.
> * gcc.dg/wrapped-binop-simplify.c: Update expected test result.
>
>
> Many thanks in advance (fingers-crossed this is still suitable/grandfathered-in
> at the start of Stage 3).
>
> Roger
> --
>
More information about the Gcc-patches
mailing list