[Bug target/29256] [4.9/5/6 regression] loop performance regression
amker at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Aug 12 07:34:00 GMT 2015
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
--- Comment #58 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #56)
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller. It's impossible to make good
> > decisions about unroll factors that early. But your second approach sounds
> > quite promising to me.
>
> I would be willing to soften this statement. I think that an early unroller
> might well be a profitable approach for most systems with large caches and
> so forth, where if the unrolling heuristics are not completely accurate we
> are still likely to make a reasonably good decision. However, I would
> expect to see ports with limited caches/memory to want more accurate control
> over unrolling decisions. So I could see allowing ports to select between a
> GIMPLE unroller and an RTL unroller (I doubt anybody would want both).
Thanks for the comments.
As David suggested, we can try to implement a relatively conservative unroller
and make sure it's a win in most of the cases it unrolls, even if some
opportunities are missed. We could then enable it at -O3/-Ofast, which I think
would be welcome since we currently don't have a general unroller enabled by
default.
>
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
>
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...
(In reply to rguenther@suse.de from comment #57)
> On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
> >
> > --- Comment #56 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> > (In reply to Bill Schmidt from comment #53)
> > > I'm not a fan of a tree-level unroller. It's impossible to make good
> > > decisions about unroll factors that early. But your second approach sounds
> > > quite promising to me.
> >
> > I would be willing to soften this statement. I think that an early unroller
> > might well be a profitable approach for most systems with large caches and so
> > forth, where if the unrolling heuristics are not completely accurate we are
> > still likely to make a reasonably good decision. However, I would expect to
> > see ports with limited caches/memory to want more accurate control over
> > unrolling decisions. So I could see allowing ports to select between a GIMPLE
> > unroller and an RTL unroller (I doubt anybody would want both).
> >
> > In general it seems like PowerPC could benefit from more aggressive unrolling
> > much of the time, provided we can also solve the related IVOPTS problems that
> > cause too much register spill.
> >
> > I may have an interest in working on a GIMPLE unroller, depending on how
> > quickly I can complete or shed some other projects...
>
> I think that a separate unrolling on GIMPLE would be a hard sell
> due to the lack of a good cost model. _But_ doing unrolling as part
> of another transform like we are doing now makes sense. So does
> eventually moving parts of an RTL pass involving unrolling to
> GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
> to RTL).
About the cost model: is it possible to introduce a cache information model in
GCC? I don't think it's a difficult problem, and it could be a starting point
for possible cache-sensitive optimizations in the future. Another general
question: what kinds of cost do we need in a good unroller, besides the cache
and branch ones?
>
> Note that the RTL unroller is not enabled by default by any optimization
> level and note that unfortunately the RTL unroller shares flags with
> the GIMPLE level complete peeling (where it mainly controls cost
> modeling). Oh, but it's enabled with -fprofile-use.
>
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops). Note that these flags have
> secondary effects as well:
>
> toplev.c: flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c: flag_rename_registers = flag_unroll_loops || flag_peel_loops;