This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Unroller with branch and count patch
- From: Zdenek Dvorak <rakdver at atrey dot karlin dot mff dot cuni dot cz>
- To: Mircea Namolaru <NAMOLARU at il dot ibm dot com>
- Cc: Dale Johannesen <dalej at apple dot com>,David Edelsohn <dje at makai dot watson dot ibm dot com>, gcc-patches at gcc dot gnu dot org,Andrew Pinski <pinskia at physics dot uc dot edu>,Ulrich Weigand <weigand at i1 dot informatik dot uni-erlangen dot de>
- Date: Thu, 19 Feb 2004 14:00:25 +0100
- Subject: Re: Unroller with branch and count patch
- References: <20040218140311.GA16486@atrey.karlin.mff.cuni.cz> <OF6B59F8F2.692A36E1-ONC2256E3F.002FFFBE-42256E3F.0045AEEA@il.ibm.com>
Hello,
> > I don't think it is a good idea to include this in mainline (for one
> > reason, it does not apply any more -- simple loop analysis was rewritten
> > recently and moved to loop-iv.c); tomorrow I am going to send the
> > rewrite of the doloop optimization pass, thus making this completely
> > useless.
>
> You imply that the doloop optimization will be performed after
> the unrolling. But would it not be preferable to do it before the
> unrolling?
>
> Performing the unrolling after the doloop optimization will give slightly
> better code, as the doloop optimization is also performed on the
> iterations generated before the unrolled loop. So for this region you have
> the usual doloop optimization improvements. The register pressure is
> decreased if the count register is a special register (think of the case
> of a loop with the exit condition i < N where N is no longer needed across
This won't work. You must still be able to determine the number of
iterations for doloop optimization, so you won't save anything,
especially since
> this region). Also a compare is discarded and the count register controls
> the execution of the loop so you get better scheduling (think of the case
> i = i + 1; cmp cond = i < N; if-then-else cond; which in our case is
> i = i + 1; branch-and-count and can be executed in a single cycle).
... the exit checks are eliminated from the peeled copies; so you
do not gain anything by this, either.
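
To make the point about the iteration count concrete, here is a minimal
source-level sketch. It is only an illustration, not what GCC does on RTL;
the function names sum_compare and sum_countdown are made up for this
example, and the decrement-and-test at the bottom of the second loop merely
stands in for a branch-and-count instruction. Note that the count has to be
computed before the loop is entered, which is why the analysis of the number
of iterations cannot be skipped.

```c
#include <stddef.h>

/* Original form: the exit test compares i against n on every iteration. */
long sum_compare(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Count-down form (hypothetical illustration): the iteration count is
   computed once before the loop and a dedicated counter controls the exit. */
long sum_countdown(const int *a, size_t n)
{
    long s = 0;
    size_t count = n;   /* number of iterations, must be known up front */
    size_t i = 0;

    if (count == 0)     /* guard: the do-while body runs at least once */
        return s;
    do {
        s += a[i];
        i++;
    } while (--count != 0);   /* stands in for branch-and-count */
    return s;
}
```

Both functions return the same sum for any n; the second one just moves the
exit decision from the compare against n to the counter.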
> Performing the doloop optimization before the unrolling gives you a
> cleaner design.
I do not think so. As your own patch proves, you need to clutter the
unrolling code by a lot of strange (and basically unrelated) junk.
> Usually the unrolling invalidates much of the loop information.
No, it does not. We still know everything we knew before (in some
cases even more, since by peeling some of the iterations we already know
that the number of iterations is not "negative").
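
As a rough source-level sketch of why the information survives (again only
an illustration under assumed names, with an unroll factor of 4 chosen
arbitrarily, and not the actual RTL the unroller produces): the peeled copies
carry no exit tests, and after them the remaining iteration count of the
unrolled loop is known exactly, so a count-down loop can still be formed
afterwards.

```c
#include <stddef.h>

/* Hypothetical illustration: unroll by 4, peeling n % 4 iterations first. */
long sum_unrolled4(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* Peeled copies: entered at the right point, no exit tests inside. */
    switch (n % 4) {
    case 3: s += a[i++]; /* fall through */
    case 2: s += a[i++]; /* fall through */
    case 1: s += a[i++]; /* fall through */
    case 0: break;
    }

    /* The remaining iteration count is still known exactly: n / 4. */
    size_t count = n / 4;
    while (count != 0) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
        i += 4;
        count--;        /* a counter like this is what doloop can use */
    }
    return s;
}
```

For n = 7, three elements are handled by the peeled copies and the unrolled
body runs exactly once.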
> If doloop optimization is performed first, the iv information is still
> correct and you could exploit this for other optimizations if wanted. The
> doloop optimization is independent from unrolling, you don't need to care
> about what loop information is invalidated by unrolling and ways to update
> it. And it gives you more freedom of where to place the doloop
> optimization.
>
> Considering that with our patch the unrolling is able to work with
> branch and count, why do you think that performing doloop before unrolling
> is preferable?
It is easier, the code to handle it is cleaner and I do not see a reason
why not to.
> > Considering 3.4, could you please send some performance numbers? I would
> > be especially interested in seeing differences between
> >
> > -funroll-loops -fbranch-count-reg without the patch
> > -funroll-loops -fno-branch-count-reg without the patch
> >
> > and
> >
> > -funroll-loops -fbranch-count-reg with the patch.
> >
> > on some benchmark.
>
> For the option -funroll-loops -fbranch-count-reg, the patch gains more
> than 4% improvement overall CFSPEC2000 (f77, c) on Power4 with 3
> benchmarks showing around 10% improvement (wupwise, swim, art).
And against -funroll-loops -fno-branch-count-reg? It is quite possible
that the gains are mostly due to unrolling (which is prevented by the doloop
optimization), and that the gains obtained by the doloop optimization
are mostly negligible, so it would be nice to have numbers to either
prove or disprove this.
Zdenek