This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: GCC 3.5 Plan, Take 2


On Monday 16 August 2004 04:40, Andrew Pinski wrote:
> On Aug 15, 2004, at 6:47 PM, Mark Mitchell wrote:
> > We're already better than GCC 3.4 on a lot of these axes.  Right now,
> > we're probably losing a bit on compile-time, especially with
> > optimization enabled, and most people seem to think code generation is
> > about a wash.
>
> I don't think you can claim that compile time is slower with
> optimization enabled.
> I will say that there is a bug report that was slow for all of GCC up
> until tree-ssa; see PR 2692, where compile time has improved.  The
> tree-ssa work has improved many different things.  There are other
> examples where expand was taking a huge amount of time because inlining
> happened before expand; now we actually do many optimizations before
> expand, making expand less important.  What is even more impressive is
> that we added more optimization passes and the overall compile-time
> speed is about the same as before the tree-ssa merge.

But overall it is slower than any GCC ever released before.  That's clearly
demonstrated by the occasional MICO timings and by Diego's testers:
http://people.redhat.com/dnovillo/spec2000/gcc/individual-build-secs_elapsed.html


> Most of the current compile time problems with the tree-ssa
> optimizations are
> that they are O(n^2)

Not true.  There are almost no quadratic bottlenecks in tree-ssa, and there
are no inherently O(n^2) algorithms in it (it's *hard* to write an SSA
algorithm with non-linear behavior! ;-)
The only bottleneck I'm aware of is in DOM, the PR you mentioned.

Most of the current compile-time problems come from us not being able to
turn off expensive, or even cheap, RTL passes, so in effect we have only
added a whole new set of passes on top.  What can you expect?  It's going
to slow down.


> see PR 15524 for an example of where one problem is.  Steven and I had
> hoped the rewrite of jump threading would improve the situation here,
> but it seems we were wrong.

Actually, that's only part of the problem.  The quadratic-ness of that test
case is now less of a problem than the other one I've already mailed you
and bje about.  For everyone else:

From tree-ssa-dom.c:

  2089832: 2551:  for (e = bb->succ; e; e = e->succ_next)
        -: 2552:    {
(...)
        -: 2584:          /* If the hint is valid (!= phi_num_args), see if it points
        -: 2585:             us to the desired phi alternative.  */
  2687602: 2586:          if (hint != phi_num_args && PHI_ARG_EDGE (phi, hint) == e)
        -: 2587:            ;
        -: 2588:          else
        -: 2589:            {
        -: 2590:              /* The hint was either invalid or did not point to the
        -: 2591:                 correct phi alternative.  Search all the alternatives
        -: 2592:                 for the correct one.  Update the hint.  */
317517478: 2593:              for (i = 0; i < phi_num_args; i++)
317517478: 2594:                if (PHI_ARG_EDGE (phi, i) == e)
316924800: 2595:                  break;
   592678: 2596:              hint = i;
        -: 2597:            }


... and this loop's mirror in tree-flow-inline.h:

        -:  407:/* Return the phi index number for an edge.  */
        -:  408:static inline int
        -:  409:phi_arg_from_edge (tree phi, edge e)
  5360704:  410:{
  5360704:  411:  int i;
        -:  412:#if defined ENABLE_CHECKING
        -:  413:  if (!phi || TREE_CODE (phi) != PHI_NODE)
        -:  414:    abort();
        -:  415:#endif
        -:  416:
322669434:  417:  for (i = 0; i < PHI_NUM_ARGS (phi); i++)
322669434:  418:    if (PHI_ARG_EDGE (phi, i) == e)
        -:  419:      return i;
        -:  420:
        -:  421:  return -1;
  5360704:  422:}

It might not be typical code (insn-attrtab), but it shows that some of the
data structures we use really are suboptimal: more than 1.2 billion
run-time conditional branches to look up 8 million PHI arguments!  We
spend an awful lot of time there, wading through memory.  This happens in
DOM, but also in every other tree SSA pass.
Most of this will go away when the edge-vector branch is ready, hopefully
in time for GCC 3.5.

(bje, http://gcc.gnu.org/ml/gcc/2004-08/msg00685.html -- don't forget to
send an email about your project to Mark!)


> Also, the reason code generation is a wash is that the tree
> optimizations do not do much more than the current generation of RTL
> optimizers, except for SRA, which is the single biggest win for most
> C++ programs.  The other big wins are DCE done right, and done before
> sibling-call finding happens.

The kinds of optimizations we do on trees may on paper not be very
different from the ones we do on RTL, but in practice they often do a lot
more than their RTL counterparts.

Just look at the GVN-PRE problem I reported the other day.  Look at jump
threading: we do many more jump threads than we did on RTL alone.  The
named return value optimization is now in the middle end, and as you
mentioned, dead store and dead code elimination do a better job.  For
sibling calls, last time I looked we still had a few regressions wrt.
sib/tail calls, but overall we should catch more of them.

This does result in better code in many instances.  Look at SPEC's mcf
and parser on Diego's SPEC pages:
http://people.redhat.com/dnovillo/spec2000/gcc/individual-run-ratio.html
Since the merge of the tree-ssa branch there's a big jump there in the
right direction: up to the level of icc.

Also note that we may actually be generating much better code on non-x86
than older GCCs, but nobody is tracking performance on, say, ppc, ia64,
or MIPS, in public.

Overall, the code we generate has not improved (not for SPEC, anyway).
Sometimes we generate worse code, so we need to find and understand where
and why we do that.  I showed one case of increased register pressure,
and there are probably more cases like that.  Another is still just poor
RTL generation at times.  Other people probably know more reasons.


> I also see that we can find uninitialized variables a lot better, and
> IMA optimization is done with the correct aliasing sets, and ...

...and debugging information is worse, and stack slot allocation is
borked, and we have no clear picture of the memory footprint, and...

Gr.
Steven


