This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Unroller with branch and count patch


Hello,

> > > You imply that the doloop optimization will be performed after
> > > the unrolling. But would it not be preferable to do it before the
> > > unrolling?
> > >
> > > Performing the unrolling after the doloop optimization will give
> > > slightly better code, as the doloop optimization is also performed
> > > on the iterations generated before the unrolled loop. So for this
> > > region you have the usual doloop optimization improvements. The
> > > register pressure is decreased if the count register is a special
> > > register (think of the case of a loop with the exit condition
> > > i < N where N is no longer needed across
> 
> > this won't work.  You must still be able to determine the number of
> > iterations for doloop optimization, so you won't spare anything,
> > especially since
> 
> No. With our patch, the initialization of the count register is done 
> before the peeled copies.

It can be done this way when doloop is done after unrolling as well.
I chose not to, since the count register often is not special, and
tying it up with the iteration count would not be a win then.

This is really a target-dependent decision that does not have much to do
with where the doloop optimization is done.

> Afterward the number of iterations is no longer needed (the branch
> and count is used at the end of the peeled copies). But indeed,
> without our patch, the number of iterations will be needed at the end
> of the peeled copies. So with our patch there is no need for an
> additional register to keep the number of iterations across the peeled
> copies (assuming that the count register is a special register).

Okay, this may save one load & store of the register (outside of the
loop) on such an architecture, assuming that the bound of the loop
really is unused anywhere else (one load if it is).  Does not seem like
a great issue, but I will still think about it.
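
To make the point about the peeled copies concrete, here is a rough C
sketch of the shape being discussed. It is not taken from the patch: the
function names are hypothetical, `count` merely models a special count
register, and the peeled copies are written as a small loop where the
compiler would emit up to three straight-line copies. The thing to
notice is that `n` is read exactly once, before the peeled copies.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: ordinary loop with the exit test i < n. */
long sum_plain(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Sketch of a loop unrolled by 4 where the counter is set up before
   the peeled copies.  `count` stands in for the count register
   (an assumption made for illustration only). */
long sum_counted_unrolled(const int *a, size_t n)
{
    assert(a != NULL || n == 0);

    long s = 0;
    const int *p = a;
    size_t count = n;              /* the only read of n */

    /* Peeled copies: consume iterations until count is a multiple
       of 4.  Only count is consulted here, never n. */
    while (count % 4 != 0) {
        s += *p++;
        count--;
    }

    /* Unrolled body: count now holds the number of trips through the
       4-way unrolled loop; each trip ends with a decrement and a
       branch on nonzero, modelling branch-and-count. */
    for (count /= 4; count != 0; count--) {
        s += p[0];
        s += p[1];
        s += p[2];
        s += p[3];
        p += 4;
    }
    return s;
}
```

Whether keeping the bound live in `count` is a win depends on the
target, which is the point made above: on a machine without a dedicated
count register it just occupies an ordinary register.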

> >> this region). Also a compare is discarded and the count register
> >> controls the execution of the loop so you get better scheduling
> >> (think of the case
> >> i = i + 1; cmp cond = i < N; if-then-else cond; which in our case is
> >> i = i + 1; branch-and-count and can be executed in a single cycle).
> 
> >... the exit checks are eliminated from the peeled copies; so you
> > do not gain anything by this, either.
> 
> I am speaking about the exit check that still remains at the end of
> the peeled copies.
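
For readers following the exchange above, the two loop-closing
sequences being compared can be sketched in C roughly as follows. This
is only an illustration at source level with hypothetical function
names; the real transformation happens on RTL, and whether the counted
form maps to a single branch-and-count instruction depends on the
target.

```c
#include <assert.h>
#include <stddef.h>

/* Compare form: increment, compare against the bound, branch. */
void scale_compare(int *a, size_t n, int k)
{
    assert(a != NULL || n == 0);
    if (n == 0)
        return;
    size_t i = 0;
    do {
        a[i] *= k;
        i = i + 1;              /* i = i + 1               */
    } while (i < n);            /* cmp cond = i < N; branch */
}

/* Counted form: the bound is copied into a counter once, and the
   loop closes with a decrement-and-branch-if-nonzero, which models a
   branch-and-count instruction.  n is not needed inside the loop. */
void scale_count(int *a, size_t n, int k)
{
    assert(a != NULL || n == 0);
    if (n == 0)
        return;
    size_t count = n;
    int *p = a;
    do {
        *p *= k;
        p++;
    } while (--count != 0);     /* branch-and-count */
}
```

Both functions perform the same work; the difference is only in what
the loop needs to keep live in order to decide when to exit.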
> 
> >> Performing the doloop optimization before the unrolling gives you a
> >> cleaner design.
> 
> > I do not think so.  As your own patch proves, you need to clutter the
> > unrolling code by a lot of strange (and basically unrelated) junk.
> 
> Indeed it complicates the unroller code in order to handle
> branch and count. On the good side, the unroller doesn't need to update
> some loop information for the sake of the (following) doloop
> optimization alone.

It is a good idea to update it by itself, just in order not to lose
information unnecessarily and not to have to recompute it.  This is not
directly related to doloop optimization -- if the doloop optimizer was
run first, it would just have to preserve the information about the
number of iterations in the same way.

> This creates a dependency between the doloop code and the unroller.
> If the doloop optimization is changed and some more loop information is
> needed, you have to go to the unroller and update this information
> after the unrolling.
>
> This was the intent of the cleaner design: no dependency between the loop
> unroller code and the doloop optimization code.

But if you need to create quite a few hacks to prevent this dependency,
with relatively little gain, I don't think it is worth that.  Give me a
clean solution, and I will probably be more inclined to agree with your
arguments.

> >> Usually the unrolling invalidates much of the loop information.
> 
> > No it does not -- we still know everything we have known before (in some
> > cases even more, since by peeling some of the iterations we already know
> > that the number of iterations is not "negative").
> 
> After the unrolling the induction variables are no longer induction
> variables (as they have multiple definitions).

So what? They still behave as absolutely normal induction variables,
and loop-iv.c handles them perfectly.
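
A small C sketch of what "multiple definitions" looks like after
unrolling by 2 may help here. The function names are hypothetical and
this says nothing about how loop-iv.c is implemented; it only shows that
`i`, although assigned twice per trip, still advances by a fixed step
per trip, which is the property an induction variable analysis cares
about.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: i has a single definition per iteration. */
long weighted_sum(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += (long)i * a[i];
    return s;
}

/* Unrolled by 2: i now has two definitions inside the loop body,
   but over one trip it still moves by the constant 2. */
long weighted_sum_unrolled(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s = 0;
    size_t i = 0;
    while (i + 1 < n) {
        s += (long)i * a[i];
        i = i + 1;              /* first definition of i  */
        s += (long)i * a[i];
        i = i + 1;              /* second definition of i */
    }
    if (i < n)                  /* leftover iteration when n is odd */
        s += (long)i * a[i];
    return s;
}
```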

> Maybe you have all the
> information but you have a loop not suitable for loop optimization.
> And if someone comes up with some optimization that needs branch and
> count and induction variables?

No problem in either order.

> > > > Considering 3.4, could you please send some performance numbers?
> > > > I would be especially interested in seeing differences between
> > > > 
> > > > -funroll-loops -fbranch-count-reg without the patch
> > > > -funroll-loops -fno-branch-count-reg without the patch
> > > > 
> > > > and
> > > >
> > > > -funroll-loops -fbranch-count-reg with the patch.
> > > >
> > > > on some benchmark.
> > >
> > > For the option -funroll-loops -fbranch-count-reg, the patch gains more
> > > than 4% improvement overall on CFSPEC2000 (f77, c) on Power4, with 3
> > > benchmarks showing around 10% improvement (wupwise, swim, art).
> 
> > and against -funroll-loops -fno-branch-count-reg?  It is quite possible
> > the gains are mostly due to unrolling (that is prevented by the doloop
> > optimization), and that the gains obtained by the doloop optimization
> > are mostly negligible, so it would be nice to have numbers to either
> > prove or disprove this.
> 
> Our patch makes possible the unrolling of branch and count loops, so the
> unrolling gains are now possible also for such loops. This is what we 
> have measured. For the moment, we don't have any other results available.

So please get them (it's just one run of specint, setting it up won't
take much time).  If you can show that doloop optimization of unrolled
loops is worthwhile, I will again be more inclined to consider the changes
to unrolling a necessary evil.

Zdenek

