Ping! [PATCH]: Pool allocate et_forest structures

Fri Jan 3 23:11:00 GMT 2003

On Friday, January 3, 2003, at 05:15  PM, Jan Hubicka wrote:

>> Been a few weeks.
>>
>> Just to note, this patch was bootstrapped on i686-pc-linux-gnu and
>> powerpc-apple-darwin6.3, and as noted in a different message, improves
>> overall compile times on some testcases i have with large numbers of
>> BB's by 10-20%.
> Where the speedup comes from?  I understand that the use obstacks is
> dirtier than using specialized allocator, but the result should be 
> about
> the same, as the obstack is essentially used to manage such a pool...
>
et_forest nodes and occurrences are not obstack allocated, it's 
malloc'd.
I'll just quote an earlier message as to why *this* patch does well:
"
VTune tells me that on a bunch of  my test cases, calculate_value is
responsible for the most cache misses (both in 2nd and 1st level
misses in total number, and in loads retired which missed 1st and 2nd
level cache).
This patch improves the speed not by changing how calculate_value work,
but instead by improving et-forest cache locality
...

  It works, too, and cuts 10% off the runtime of gcc on one of my test
cases, shaving 10 seconds off branch prediction.
VTune tells me all it does is halve the number of 2nd level cache
read misses that occur in calculate_value.
"
That's why *this* patch is a good idea.
As for why not using the flow obstack for bb's and edge's does a ton 
better, on test cases i have, the top cache missers (besides 
calculate_value) are:
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/flow.c:4278
(count_or_remove_death_notes, the sbitmap test of block->index)
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/ifcvt.c:2311
(find_if_edge, the for (cur_edge = else_bb->pred))
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/flow.c:4274
(count_or_remove_death_notes, the FOR_EACH_BB_REVERSE)
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/reload.c:6407
(find_equiv_reg, the GET_CODE (p) == INSN)

This gave me the idea that not all of our bb's and edges are even close 
to on the same page.
So how do pools do better?
With the pool, we guarantee at least x objects are contiguous, where x 
is set by whoever creates the pool (you can choose the number of 
objects it allocates at a time). This is because we allocate x of them 
at a time. We don't do this on the obstack, so whether we get 
contiguous objects depends on allocation order and how many we end up 
freeing/needing, *plus*, in the case of bb's and edges, whether 
anything else got allocated on the flow_obstack in the meantime.  It's 
essentially pot luck, and it doesn't turn out good as you get large 
numbers of bb's.
The things we don't want to cache miss on are RTL, trees, bb's, and 
edges. We are always manipulating these things.

You could of course, make the allocations contiguous by dedicating it's 
own obstack to bb's, allocating a bunch of bb's at a time, and just 
throwing the rest on the free list.  However, you could just instead 
reuse the code i've written that nicely does this, all abstracted and 
easily usable.  Otherwise, you'd have to do the same thing for edges, 
et_forest things, bb's, etc.

As for performance, the numbers don't lie.
VTune tells me the basic block/edge pool allocation patch knocks the 
FOR_BB_REVERSE and the "(find_if_edge, the for (cur_edge = 
else_bb->pred))" out of the top of cache misses, and improves runtime 
on large numbers of bb's.

> Honza