Speed up DOM -- pr15524

Wed Sep 29 13:53:00 GMT 2004

> 
> This is an attempt to fix up the jump threading code so that we
> don't consume inordinate amounts of cpu time and memory for
> pr15524.
> 
> The first problem is in cases where the number of outgoing 
> edges is large and the number of unique threaded edge targets
> is large can result in extremely poor compile time and memory
> allocation behavior.
> 
> When we duplicate blocks we duplicate all the statements and all the
> outgoing edges, we then eliminate the unnecessary control statement
> at the end of the block and delete all but one of the edges.  This
> kept the code simple and avoided duplicating a lot of
> duplicate_block and its children.
> 
> You can see how this gets bloody expensive when presented with a
> testcase like we've got in this PR.  In fact, most of the time is
> spent in the edge removal code and the GC system (ggc_free).  Not
> to mention that we allocate a ton of memory.

How did you measured this?  I oprofiled this testcase about month ago
and didn't see ggc_free on the top (only ggc_alloc).  Perhaps something
changed with the edge vector changes, I will try again on the yesterday
code.
> 
> Note carefully that for the overwhelming majority of real code this
> isn't a problem because the number of unique threaded destinations
> for any basic block is typically small (usually 1-2).   However,
> in those cases we do get an ever-so-slight reduction in the
> amount of memory we consume by avoiding creation of useless nodes.
> 
> 
> The testcase in the PR fails to compile on the machine I've got
> available for testing - it runs out of virtual memory.  So I
> stripped down the testcase by using CL3 rather than CL4 to
> instantiate the code.  The relevant info:
> 
>  {GC 53460k -> 9824k} {GC 12859k -> 1340k}
> [ ... ]
>  dominator optimization:   3.23 (50%)  
> [ ... ]
> TOTAL                 :   6.44
> 
> As we can see we generated quite a bit of garbage and burned an
> amazing amount of time attributed to DOM (via the edge removal
> code and ggc_free).
> 
> With my change we get:
>  {GC 5329k -> 1340k}
> [ ... ]
>  dominator optimization:   0.45 (13%)
>  TOTAL                 :   3.51
> [ ... ]
> e.
> 
> In fact, with my change we can compile the full testcase.  DOM
> still sucks up 18% of the compilation time.  But that's a huge
> improvement and #3 behind tree CFG cleanup and the branch
> prediction code.
> 
> I think there's some additional work that can be done for this
> testcase, but this patch is the biggie.

The problem of branch prediction code is the fact that loop header
unsharing creates nest of roughly 6000 loops and the propagation pass is
inherently quadratic within the loop nest.
Killing old loop optimizer cuts this easilly to half, I wonder if I can
make it to propagate only across fixed number of innermost loops as the
outer loops will likely get frequency of zero anyway.

Also can your patch be resoponsible for 0.6% memory consumption increase
on combine.c tester reported tonight?  It is not big deal, but would be
nice to keep track of things...

Also what is making tree CFG cleanup that slow?

Honza