[PATCH take 3] ivopts: Improve code generated for very simple loops.

Richard Biener richard.guenther@gmail.com
Fri Nov 26 07:36:51 GMT 2021


On Thu, Nov 25, 2021 at 7:17 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> On Tue, Nov 23, 2021 at 12:46PM Richard Biener < richard.guenther@gmail.com> wrote:
> > On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > > The patch doesn't add any testcase.
> > >
> > > The three new attached tests check that the critical invariants have a
> > > simpler form, and hopefully shouldn't be affected by whether the
> > > optimizer and/or backend costs actually decide to perform this iv substitution
> > or not.
> >
> > The testcases might depend on lp64 though, did you test them with -m32?
> > IMHO it's fine to require lp64 here.
>
> Great catch.  You're right that when the loop index has the same precision as the
> target's pointer, that fold is (already) able to simplify the ((EXPR)-1)+1, so that with
> -m32 my new tests ivopts-[567].c fail.  I've added "require lp64" to those tests, but
> I've also added two more tests, using char and unsigned char for the loop expression,
> which are optimized on both ilp32 and lp64.
>
> For example, with -O2 -m32, we see the following improvements in ivopts-8.c:
> diff ivopts-8.old.s ivopts-8.new.s
> 14,16c14,15
> <       subl    $1, %ecx
> <       movzbl  %cl, %ecx
> <       leal    4(%eax,%ecx,4), %ecx
> ---
> >       movsbl  %cl, %ecx
> >       leal    (%eax,%ecx,4), %ecx
>
> This might also explain why GCC currently generates sub-optimal code.  Back when
> ivopts was written, most folks were on i686, so the generated code was optimal.
> But with the transition to x86_64, the code is correct, just slightly less efficient.
>
> > I'm a bit unsure about adding this special-casing in cand_value_at in general - it
> > does seem that we're doing sth wrong elsewhere - either by not simplifying even
> > though enough knowledge is there or by throwing away knowledge earlier
> > (during niter analysis?).
>
> I agree this approach is a bit ugly.  Conceptually, an alternative might be to avoid
> throwing away knowledge earlier, during niter analysis, by adding an extra tree field
> to the tree_niter_desc structure, so that it returns both niter0 (the iteration count
> at the top of the loop) and niter1 (the iteration count at the bottom of the loop),
> so that later passes (cand_value_at) can use the tree that's relevant.

Yes, I also thought of this but I wasn't sure we always have that.  I also
wouldn't think of it as too ugly, but well ... it would definitely be useful
elsewhere.  Btw, it's generally the number of executions of the latch vs.
the number of executions of the header - currently what niter analysis
computes is the number of executions of the latch.  There are loops where
the number of iterations of the header is not representable in the IV.

>  Alas, this too is
> ugly, and inefficient as we're creating/folding trees that may never be used/useful.
> A compromise might be to add an enum field describing how the niter was
> calculated to tree_niter_desc, and this can be inspected/used by cand_value_at.
> The current patch figures this out by examining the other fields already in
> tree_niter_desc.
>
>
> > Anyway, the patch does look quite safe - can you show some statistics in how
> > many time there's extra simplification this way during say bootstrap?
>
> Certainly.  During stage2 and stage3 of a bootstrap on x86_64-pc-linux-gnu,
> cand_value_at is called 500657 times.  The majority of calls,
> 447607 (89.4%), request the value at the end of the loop (after_adjust),
> while 53050 (10.6%) request the value at the start of the loop.
> 102437 calls (20.5%) are optimized by clause 1 [0..N loops]
> 27939 calls (5.6%) are optimized by clause 2 [beg..end loops]

Thanks for the detailed data.

> Looking for opportunities to improve things further, I see that
> 319608 calls (63.8%) have a LT_EXPR exit test.
> 160965 calls (32.2%) have a NE_EXPR exit test.
> 20084 calls (4.0%) have a GT_EXPR exit test.
> so handling descending loops wouldn’t be a big win.
> I'll investigate whether (constant) step sizes other than 1 are
> (i) sufficiently common and (ii) benefit from improved folding.
>
>
> This revised patch has been test on x86_64-pc-linux-gnu with a
> make bootstrap and make -k check, both with and without
> --target-board=unix{-m32}, with no new failures.
> Ok for mainline?

OK.

Thanks,
Richard.

> 2021-11-25  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * tree-ssa-loop-ivopts.c (cand_value_at): Take a class
>         tree_niter_desc* argument instead of just a tree for NITER.
>         If we require the iv candidate value at the end of the final
>         loop iteration, try using the original loop bound as the
>         NITER for sufficiently simple loops.
>         (may_eliminate_iv): Update (only) call to cand_value_at.
>
> gcc/testsuite
>         * gcc.dg/wrapped-binop-simplify.c: Update expected test result.
>         * gcc.dg/tree-ssa/ivopts-5.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-6.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-7.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-8.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-9.c: New test case.
>
>
> Roger
> --
>


More information about the Gcc-patches mailing list