This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Ping! [PATCH]: Pool allocate et_forest structures


On Friday, January 3, 2003, at 05:15  PM, Jan Hubicka wrote:

>> Been a few weeks.
>>
>> Just to note, this patch was bootstrapped on i686-pc-linux-gnu and
>> powerpc-apple-darwin6.3, and as noted in a different message, improves
>> overall compile times by 10-20% on some test cases I have with large
>> numbers of BB's.
>
> Where does the speedup come from?  I understand that using obstacks is
> dirtier than using a specialized allocator, but the result should be about
> the same, since the obstack is essentially used to manage such a pool...

et_forest nodes and occurrences are not obstack allocated; they are malloc'd.
I'll just quote an earlier message as to why *this* patch does well:
"
VTune tells me that on a bunch of my test cases, calculate_value is
responsible for the most cache misses (both in 2nd and 1st level
misses in total number, and in loads retired which missed 1st and 2nd
level cache).
This patch improves the speed not by changing how calculate_value works,
but by improving et-forest cache locality
...

It works, too, and cuts 10% off the runtime of gcc on one of my test
cases, shaving 10 seconds off branch prediction.
VTune tells me all it does is halve the number of 2nd level cache
read misses that occur in calculate_value.
"
That's why *this* patch is a good idea.
As for why not using the flow obstack for bb's and edges does a ton better: on test cases I have, the top cache missers (besides calculate_value) are:
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/flow.c:4278
(count_or_remove_death_notes, the sbitmap test of block->index)
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/ifcvt.c:2311
(find_if_edge, the for (cur_edge = else_bb->pred))
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/flow.c:4274
(count_or_remove_death_notes, the FOR_EACH_BB_REVERSE)
/gccbuild/pch-branch/gcc/build/gcc/../../gcc/reload.c:6407
(find_equiv_reg, the GET_CODE (p) == INSN)

This gave me the idea that not all of our bb's and edges are even close to being on the same page.
So how do pools do better?
With the pool, we guarantee at least x objects are contiguous, where x is set by whoever creates the pool (you choose how many objects it allocates at a time), because we allocate x of them at once. We don't do this on the obstack, so whether we get contiguous objects depends on allocation order and how many we end up freeing/needing, *plus*, in the case of bb's and edges, whether anything else got allocated on the flow_obstack in the meantime. It's essentially pot luck, and it doesn't turn out well once you get large numbers of bb's.
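To make the contiguity point concrete, here is a toy sketch of the idea in C (a hypothetical illustration, not the actual alloc-pool code from the patch): each time the pool runs dry it mallocs a block of num_per_block objects, so objects handed out between grows sit next to each other in memory, and freed objects go on a free list for reuse.

/* Toy pool sketch (not the patch's alloc-pool code): objects are carved
   out of malloc'd blocks of num_per_block at a time, so at least that
   many are contiguous; freed objects go on a free list and get reused
   before another block is allocated.  */

#include <stdio.h>
#include <stdlib.h>

struct pool_elt { struct pool_elt *next; };

struct block_pool
{
  size_t elt_size;        /* size of one object, >= sizeof (struct pool_elt) */
  size_t num_per_block;   /* objects allocated per malloc'd block */
  struct pool_elt *free_list;
};

static void
block_pool_grow (struct block_pool *p)
{
  /* One malloc covers num_per_block contiguous objects.  */
  char *block = (char *) malloc (p->elt_size * p->num_per_block);
  size_t i;

  if (block == NULL)
    abort ();
  for (i = 0; i < p->num_per_block; i++)
    {
      struct pool_elt *e = (struct pool_elt *) (block + i * p->elt_size);
      e->next = p->free_list;
      p->free_list = e;
    }
}

static void *
block_pool_alloc (struct block_pool *p)
{
  struct pool_elt *e;

  if (p->free_list == NULL)
    block_pool_grow (p);
  e = p->free_list;
  p->free_list = e->next;
  return e;
}

static void
block_pool_free (struct block_pool *p, void *ptr)
{
  struct pool_elt *e = (struct pool_elt *) ptr;

  e->next = p->free_list;
  p->free_list = e;
}

int
main (void)
{
  /* Pretend each "bb" is 64 bytes and we want 128 of them per block.  */
  struct block_pool bb_pool = { 64, 128, NULL };
  void *a = block_pool_alloc (&bb_pool);
  void *b = block_pool_alloc (&bb_pool);

  /* a and b come from the same 128-object block, so they stay close in
     memory no matter what else was malloc'd in between.  */
  printf ("%p %p\n", a, b);
  block_pool_free (&bb_pool, b);
  block_pool_free (&bb_pool, a);
  return 0;
}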
The things we don't want to cache miss on are RTL, trees, bb's, and edges. We are always manipulating these things.

You could, of course, make the allocations contiguous by dedicating a separate obstack to bb's, allocating a bunch of bb's at a time, and just throwing the rest on the free list. However, you could instead reuse the code I've written, which does this nicely, all abstracted and easily usable. Otherwise, you'd have to do the same thing for edges, et_forest structures, bb's, etc.
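For reference, here is a hedged sketch of what reusing the abstracted pool code might look like inside the tree, assuming the create_alloc_pool / pool_alloc / pool_free / free_alloc_pool interface from alloc-pool.h; the pool names, the per-block count, and the alloc_bb/release_bb wrappers are made up for illustration, not taken from the actual patch.

/* Illustrative use of the alloc-pool interface; names and sizes here are
   examples, not copied from the patch.  */
#include "config.h"
#include "system.h"
#include "alloc-pool.h"
#include "basic-block.h"

static alloc_pool bb_pool;
static alloc_pool edge_pool;

static void
init_cfg_pools (void)
{
  /* The last argument is how many objects each block holds, i.e. how
     many are guaranteed contiguous; 200 is an arbitrary example.  */
  bb_pool = create_alloc_pool ("basic blocks",
                               sizeof (struct basic_block_def), 200);
  edge_pool = create_alloc_pool ("edges", sizeof (struct edge_def), 200);
}

static basic_block
alloc_bb (void)
{
  return (basic_block) pool_alloc (bb_pool);
}

static void
release_bb (basic_block bb)
{
  pool_free (bb_pool, bb);
}

static void
free_cfg_pools (void)
{
  /* Returns every block in one shot, like blowing away an obstack.  */
  free_alloc_pool (bb_pool);
  free_alloc_pool (edge_pool);
}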

As for performance, the numbers don't lie.
VTune tells me the basic block/edge pool allocation patch knocks the FOR_EACH_BB_REVERSE loop and the find_if_edge "for (cur_edge = else_bb->pred)" loop out of the top of the cache-miss list, and improves runtime on test cases with large numbers of bb's.



> Honza





