alloc_page is slow because it transverses a linked list that is sometimes huge.
It takes about 5% out of a 10s sample (around 20s into compiling) when compiling code from bug
One way of fixing this is to have an array for that list that can grow, this will most of the problems
of transversing an linked list.
The code in alloc_page is remarkably stupid, but the linked list traversal
should only occur when trying to allocate an object larger than a single
page, which is believed to happen almost never. Can you please find out
whether it's really spending time in that list, or if some other area is
the cause of the problem?
I'll take responsibility for this bug; I've been meaning to get us per-order
freelists for a long time.
Subject: Re: [Bug other/10944] alloc_page in ggc-page.c is slow
Yes it is spending 90% of the time in alloc_page in that list traversal.
I used Shikari (part of the CHUD tools) on Mac OS X for profiling the
10000 samples ever 1ms (10s in total).
I can find out the backtrace if you want it.
On Friday, May 23, 2003, at 17:52 US/Eastern, firstname.lastname@example.org wrote:
> The code in alloc_page is remarkably stupid, but the linked list
> should only occur when trying to allocate an object larger than a
> page, which is believed to happen almost never. Can you please find
> whether it's really spending time in that list, or if some other area
> the cause of the problem?
> I'll take responsibility for this bug; I've been meaning to get us
> freelists for a long time.
The backtraces are spread all over so that it is not just one place.
Here is a sample output of shikari based on 75168 samples each one 1ms apart
samples function program
1680 alloc_page cc1plus
based on asm:
Samples asm file:line
2 lwz r0,4(r3) ggc-page.c:709
1375 cmpw cr7,r0,r28 ggc-page.c:709
beq cr7,+224 ggc-page.c:709
6 mr r2,r3 ggc-page.c:708
6 lwz r3,0(r3) ggc-page.c:708
266 cmpwi cr6,r3,0 ggc-page.c:708
bne cr6,-24 ggc-page.c:708
samples line code
707 /* Check the list of free pages for one we can use. */
278 708 for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp)
1377 709 if (p->bytes == entry_size)
Created attachment 4924 [details]
Discussed this with Zack, and the patch "is essentially what [he] had in mind."
Zack suggested there are still a few things that need to be done, however:
1) two-level free page caching:
- after one collection, put all freepages on the per-order lists;
- after the next collection, put all thepages that are still on
the per-order lists on the global list;
- and after a third collection release the pages still on the global
list tothe OS.
This probably takes some fiddling with the page_entry struct. Also
we'll have to see how this works in relation with page groups
2) scavenge pages from other per-order lists rather than go back to the
OS for more.
(Personally I think this won't be necessary if (1) is implemented properly)
Created attachment 4928 [details]
Proof-of-concept patch (2)
The new patch implements suggestion 1 from last nights list.
Andrew, you identified the loop over free_pages as a time consumer, but now that
it is gone, I get virtually no measurable time improvements (10ths of seconds on
>1min total compilation time) for the test case for PR8361. Which (not
surprisingly) once again shows that GC is slow because with each call
ggc_collect we need to do more and more marking...
Are you absolutely, positively sure that this loop is a performance hog?
Perhaps you can try the patch, it bootstrappes (c,objc,c++) and showed no
regressions on i686. So far I haven't found convincing evidence that it
improves performance significantly except for the most pathological (artificial)
Sigh, back to waiting for a new GC strategy...
I am going to push this (only because I have not have time to test the patch) and there is
already a new GC strategy (for 3.4) but does not work with everything.
I have experimented with my patch on three different targets (i686,x86_64, and
hpux) and have never seen any improvements with it even though it kills the
supposed hot spot in the code. I'm marking this as WAITING and I'll close it
soonish, unless someone can test this on the reporter's platform (ppc-darwin)
and show that it does help on that target.
Steven can you wait like three weeks before closing because I go on vaction tommorrow for about
two weeks and I cannot test the patch untill after I get back?
Andrew, did you get a chance to test the patch for this?
I do not have time to test this, and it now this patch does not matter as Honza's work with the
compiler's memory and compile time stuff and Steven did not get any benifit from it.