Ping! [PATCH]: Pool allocate et_forest structures
Daniel Berlin
dberlin@dberlin.org
Fri Jan 3 23:11:00 GMT 2003
On Friday, January 3, 2003, at 05:15 PM, Jan Hubicka wrote:
>> Been a few weeks.
>>
>> Just to note, this patch was bootstrapped on i686-pc-linux-gnu and
>> powerpc-apple-darwin6.3, and as noted in a different message, improves
>> overall compile times on some testcases i have with large numbers of
>> BB's by 10-20%.
> Where the speedup comes from? I understand that the use obstacks is
> dirtier than using specialized allocator, but the result should be
> about
> the same, as the obstack is essentially used to manage such a pool...
>
et_forest nodes and occurrences are not obstack allocated, it's
malloc'd.
I'll just quote an earlier message as to why *this* patch does well:
"
VTune tells me that on a bunch of my test cases, calculate_value is
responsible for the most cache misses (both in 2nd and 1st level
misses in total number, and in loads retired which missed 1st and 2nd
level cache).
This patch improves the speed not by changing how calculate_value work,
but instead by improving et-forest cache locality
...
It works, too, and cuts 10% off the runtime of gcc on one of my test
cases, shaving 10 seconds off branch prediction.
VTune tells me all it does is halve the number of 2nd level cache
read misses that occur in calculate_value.
"
That's why *this* patch is a good idea.
As for why not using the flow obstack for bb's and edge's does a ton
better, on test cases i have, the top cache missers (besides
calculate_value) are:
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/flow.c:4278
(count_or_remove_death_notes, the sbitmap test of block->index)
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/ifcvt.c:2311
(find_if_edge, the for (cur_edge = else_bb->pred))
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/flow.c:4274
(count_or_remove_death_notes, the FOR_EACH_BB_REVERSE)
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/reload.c:6407
(find_equiv_reg, the GET_CODE (p) == INSN)
This gave me the idea that not all of our bb's and edges are even close
to on the same page.
So how do pools do better?
With the pool, we guarantee at least x objects are contiguous, where x
is set by whoever creates the pool (you can choose the number of
objects it allocates at a time). This is because we allocate x of them
at a time. We don't do this on the obstack, so whether we get
contiguous objects depends on allocation order and how many we end up
freeing/needing, *plus*, in the case of bb's and edges, whether
anything else got allocated on the flow_obstack in the meantime. It's
essentially pot luck, and it doesn't turn out good as you get large
numbers of bb's.
The things we don't want to cache miss on are RTL, trees, bb's, and
edges. We are always manipulating these things.
You could of course, make the allocations contiguous by dedicating it's
own obstack to bb's, allocating a bunch of bb's at a time, and just
throwing the rest on the free list. However, you could just instead
reuse the code i've written that nicely does this, all abstracted and
easily usable. Otherwise, you'd have to do the same thing for edges,
et_forest things, bb's, etc.
As for performance, the numbers don't lie.
VTune tells me the basic block/edge pool allocation patch knocks the
FOR_BB_REVERSE and the "(find_if_edge, the for (cur_edge =
else_bb->pred))" out of the top of cache misses, and improves runtime
on large numbers of bb's.
> Honza
More information about the Gcc-patches
mailing list