|Summary:||alloc_page in ggc-page.c is slow|
|Product:||gcc||Reporter:||Andrew Pinski <pinskia>|
|Component:||other||Assignee:||Zack Weinberg <zackw>|
|Severity:||normal||CC:||aaronl, gcc-bugs, steven|
|Build:||powerpc-apple-darwin6.6||Known to work:|
|Known to fail:||Last reconfirmed:||2003-11-30 08:38:20|
|Bug Depends on:|
Proof-of-concept patch (2)
Description Andrew Pinski 2003-05-22 19:23:32 UTC
alloc_page is slow because it transverses a linked list that is sometimes huge. It takes about 5% out of a 10s sample (around 20s into compiling) when compiling code from bug 8361 <http://gcc.gnu.org/PR8361>. One way of fixing this is to have an array for that list that can grow, this will most of the problems of transversing an linked list.
Comment 1 Zack Weinberg 2003-05-23 21:52:36 UTC
The code in alloc_page is remarkably stupid, but the linked list traversal should only occur when trying to allocate an object larger than a single page, which is believed to happen almost never. Can you please find out whether it's really spending time in that list, or if some other area is the cause of the problem? I'll take responsibility for this bug; I've been meaning to get us per-order freelists for a long time.
Comment 2 Andrew Pinski 2003-05-23 22:29:39 UTC
Subject: Re: [Bug other/10944] alloc_page in ggc-page.c is slow Yes it is spending 90% of the time in alloc_page in that list traversal. I used Shikari (part of the CHUD tools) on Mac OS X for profiling the complier 10000 samples ever 1ms (10s in total). I can find out the backtrace if you want it. Thanks, Andrew Pinski On Friday, May 23, 2003, at 17:52 US/Eastern, email@example.com wrote: > > The code in alloc_page is remarkably stupid, but the linked list > traversal > should only occur when trying to allocate an object larger than a > single > page, which is believed to happen almost never. Can you please find > out > whether it's really spending time in that list, or if some other area > is > the cause of the problem? > > I'll take responsibility for this bug; I've been meaning to get us > per-order > freelists for a long time.
Comment 3 Andrew Pinski 2003-05-23 22:40:18 UTC
The backtraces are spread all over so that it is not just one place.
Comment 4 Andrew Pinski 2003-05-23 23:19:05 UTC
Here is a sample output of shikari based on 75168 samples each one 1ms apart samples function program 1680 alloc_page cc1plus based on asm: Samples asm file:line 2 lwz r0,4(r3) ggc-page.c:709 1375 cmpw cr7,r0,r28 ggc-page.c:709 beq cr7,+224 ggc-page.c:709 6 mr r2,r3 ggc-page.c:708 6 lwz r3,0(r3) ggc-page.c:708 266 cmpwi cr6,r3,0 ggc-page.c:708 bne cr6,-24 ggc-page.c:708 samples line code 707 /* Check the list of free pages for one we can use. */ 278 708 for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp) 1377 709 if (p->bytes == entry_size) 710 break;
Comment 5 Steven Bosscher 2003-10-12 20:30:44 UTC
Created attachment 4924 [details] Proof-of-concept patch
Comment 6 Steven Bosscher 2003-10-12 20:34:54 UTC
Discussed this with Zack, and the patch "is essentially what [he] had in mind." Zack suggested there are still a few things that need to be done, however: 1) two-level free page caching: - after one collection, put all freepages on the per-order lists; - after the next collection, put all thepages that are still on the per-order lists on the global list; - and after a third collection release the pages still on the global list tothe OS. This probably takes some fiddling with the page_entry struct. Also we'll have to see how this works in relation with page groups 2) scavenge pages from other per-order lists rather than go back to the OS for more. (Personally I think this won't be necessary if (1) is implemented properly)
Comment 7 Steven Bosscher 2003-10-13 14:30:57 UTC
Created attachment 4928 [details] Proof-of-concept patch (2)
Comment 8 Steven Bosscher 2003-10-13 14:37:11 UTC
The new patch implements suggestion 1 from last nights list. Andrew, you identified the loop over free_pages as a time consumer, but now that it is gone, I get virtually no measurable time improvements (10ths of seconds on >1min total compilation time) for the test case for PR8361. Which (not surprisingly) once again shows that GC is slow because with each call ggc_collect we need to do more and more marking... Are you absolutely, positively sure that this loop is a performance hog? Perhaps you can try the patch, it bootstrappes (c,objc,c++) and showed no regressions on i686. So far I haven't found convincing evidence that it improves performance significantly except for the most pathological (artificial) cases. Sigh, back to waiting for a new GC strategy...
Comment 9 Andrew Pinski 2003-11-30 08:38:19 UTC
I am going to push this (only because I have not have time to test the patch) and there is already a new GC strategy (for 3.4) but does not work with everything.
Comment 10 Steven Bosscher 2003-12-20 14:10:47 UTC
I have experimented with my patch on three different targets (i686,x86_64, and hpux) and have never seen any improvements with it even though it kills the supposed hot spot in the code. I'm marking this as WAITING and I'll close it soonish, unless someone can test this on the reporter's platform (ppc-darwin) and show that it does help on that target.
Comment 11 Andrew Pinski 2003-12-23 01:10:12 UTC
Steven can you wait like three weeks before closing because I go on vaction tommorrow for about two weeks and I cannot test the patch untill after I get back?
Comment 12 Dara Hazeghi 2004-01-17 22:59:07 UTC
Andrew, did you get a chance to test the patch for this?
Comment 13 Andrew Pinski 2004-02-06 01:44:38 UTC
I do not have time to test this, and it now this patch does not matter as Honza's work with the compiler's memory and compile time stuff and Steven did not get any benifit from it.