Please make jump threading related bugs block this meta bug.
An update on the new jump threading selection code. All the base infrastructure is in place (namely the code to avoid creating irreducible regions). Benchmarking the new selection code has turned up one issue with eon that I'm trying to sort out right now. jeff
The EON issue mentioned in my notes from last week is an instability in how IV opts selects which induction variables to use. Zdenek has a patch which helps increase the stability of the IV selection code and produce better debugging information. That patch eliminates the huge EON regression, but there are still some bad things happening. Here's some hard data from SPEC2000:

                Mainline + Zdenek's patch    + Threading changes
                Runtime   Ratio              Runtime   Ratio
164.gzip          158      887*                158      887*
175.vpr           264      530*                271      517*
176.gcc           130      846*                133      829*
181.mcf           337      534*                353      510*
186.crafty        116      864*                115      869*
197.parser        252      715*                255      706*
252.eon           166      783*                155      838*
253.perlbmk       190      947*                169     1063*
254.gap           117      939*                119      924*
255.vortex        183     1039*                182     1043*
256.bzip2         235      639*                230      652*
300.twolf         433      692*                447      671*
----------------------------------------------------------------
                           768                          772

Which is rather disappointing, with the obvious exception of eon and perlbmk, which improve enough to offset all the other consistent losses.

I was still concerned about the instability of IV opts skewing the results (either in a positive or negative way), so I did runs with IV opts disabled:

                Mainline no IV opts          + Threading Improvements
                Runtime   Ratio              Runtime   Ratio
164.gzip          157      894*                157      892*
175.vpr           263      533*                264      531*
176.gcc           127      866*                126      872*
181.mcf           329      547*                327      550*
186.crafty        114      877*                115      873*
197.parser        250      719*                249      723*
252.eon           173      752*                159      815*
253.perlbmk       172     1047*                150     1201*
254.gap           117      941*                118      934*
255.vortex        182     1046*                180     1055*
256.bzip2         225      667*                224      670*
300.twolf         429      699*                429      699*
-----------------------------------------------------------
                           781                          796

Which looks *much* better. There are still some regressions, but they are relatively small and are offset by similar small improvements. We still get big speedups for perl and eon. From that I can conclude that there's still some bad interaction between IV opts and the new threading code.
Either the IV selection instability is still causing significant issues, or there's some other effect at work that I haven't discovered yet. I'm going to spend a little time trying to figure out what interaction between the new threading code and IV opts is still causing us heartburn before moving forward with the threading changes. [ I'll note that according to this test, we generally seem to get better code without IV opts. Which is probably a good sign that the IV opts code needs some serious tuning. ] Jeff
Created attachment 8434 [details] Current changes for jump thread selection code
Some notes on recent poking and prodding. The big perl speedup is consistent on my P4 -- but perl shows no significant change on my AMD box. Perl spends ~50% of its time in one routine (regexec) and, surprise, that's the routine where profiling shows the great improvements. Unfortunately, the profiling data hasn't pinpointed _why_ that routine is running so much faster.

If one is to believe the oprofile data, the huge reduction in cycles actually occurs in the function's header block. But that block isn't substantially different between the versions compiled with and without the threading updates. And it does not appear that threading has turned any of the recursive calls into simple loops. Looking at the profile results from different P4 counters hasn't provided any additional insight yet. Sigh.

Anyway, I'll continue poking at Perl -- I'd really like to understand the huge improvements before installing the patch.
Just more notes on the huge perl speedup with the threading changes...

For reasons yet unknown, we're seeing a lot less L2 cache traffic when perl is compiled with the threading changes. The decreased L2 traffic also corresponds to the areas of code which I had previously identified as showing the most significant runtime improvements. What I don't have an answer for yet is _why_ we're seeing the huge decrease in L2 traffic.

We're seeing a huge jump in MEMORY_CANCEL events, which, if I understand the docs correctly, will cause an increase in cache activity as we are either low on store buffers or we have conflicts due to 64K aliasing. [ Both sub-events are way down with the threading changes. ]

Attempts to perturb the size of the stack for the recursive routine in question to avoid the 64K aliasing problem haven't met with any success yet. We are seeing variables spilled into different stack locations, so that _could_ be the ultimate cause of the cache behavior. These things would definitely classify as second-order effects.

We're also seeing a measurable decrease in ITLB misses; however, I'm not convinced that the decrease is significant enough to cause the kind of performance improvement we're seeing. Anyway, I'm still plugging away...
Subject: Re: [meta-bug] Jump threading related bugs

On Wed, 2005-04-06 at 00:25 +0000, law at redhat dot com wrote:
> Just more notes on the huge perl speedup with the threading changes...

Have you tried on non-x86 to see if the results bear out on other platforms?
Subject: Re: [meta-bug] Jump threading related bugs

On Wed, 2005-04-06 at 02:05 +0000, dberlin at dberlin dot org wrote:
> Have you tried on non-x86 to see if the results bear out on other
> platforms?

They don't even bear out on other x86 platforms -- my older P3 and AMD boxes don't show the same kind of big improvement (they show small improvements). However, both of my P4s show big improvements.

jeff
(In reply to comment #7) > They don't even bear-out on other x86 platforms -- my older P3 and AMD > boxes don't show the same kind of big improvement (they show a small > improvements). However, both of my P4s show big improvements. Huh, both of those targets are still x86. So what about on say Power4 or PowerPC 970?
Subject: Re: [meta-bug] Jump threading related bugs

On Wed, 2005-04-06 at 06:38 +0000, pinskia at gcc dot gnu dot org wrote:
> Huh, both of those targets are still x86.

And my point was that the improvement, from everything I've managed to gather so far, is either specific to the P4 or possibly specific to the size of the cache on the P4.

> So what about on say Power4 or PowerPC 970?

I don't have either of those. My non-x86 boxes are all, err, old. My PPC I think is a 233MHz PPC750. It'd probably take more than a day to get things built and running on that ancient box. I only use it to diagnose PPC specific problems.

jeff
More info. It appears that threading one specific jump is responsible for triggering the big speedup, and it could cause the kind of effects we're seeing. Basically, we're threading a conditional branch to a loop exit test back to the top of the loop. This has the effect of creating nested loops, which in turn causes the register allocators to make different choices about which values should be kept in registers and which end up on the stack (and at what offset each object appears on the stack). That could cause the kind of decrease in L2 activity I'm seeing, particularly given the recursive nature of the function in question. I've got a few more tests to run before I claim this to be the cause of the huge improvement, but this is the best theory that fits the data I've seen so far.
OK. I'm pretty sure the perl improvements are really just an artifact of changes in which objects get spilled onto the stack and the offsets of each particular object. With a small amount of work, I can twiddle register allocation (and thus spilling behavior) in such a way as to make perl built without the threading changes run significantly faster than perl built with the threading changes. [ Which is the exact opposite of the runtime behavior if we do not twiddle the register allocation priorities. ]

While it's certainly not proof, I'm confident that the reason performance is swinging back and forth is cache effects in the stack, particularly for spill slots in one critical function. Anything which changes which objects are spilled onto the stack, or the offsets of those objects on the stack, is likely to cause wild swings in performance, at least on P4s.

The net result is, IMHO, that we should ignore Perl's results when benchmarking the new threading changes on P4s, and possibly other platforms. This may also hold true when benchmarking other changes which could potentially affect register allocation (which, of course, is just about everything).
An update. The jump threading changes are blocked pending resolution of a semi-latent bug in reload which is exposed by the combination of the jump threading changes plus the recent merges from TCB. I'm testing a patch which resolves the semi-latent bug and hope to remove that blocker by COB today. jeff
The reload bug has been "resolved", or at least it's in a state where we can move forward with the jump threading changes.

I've finally got the dynamic branching data, which shows pretty much what I expected -- we execute fewer jumps :-) What I've done is use GCC's arc-profiling and branch prediction support to get a picture of the runtime branching behavior of a set of programs (SPECINT2000). Specifically, we're looking at how many conditional branches were executed (regardless of whether they were taken or not taken). We measure this with and without the jump threading changes. This gives us a highly accurate measure of how effective the jump threading changes are at reducing the number of branches a program executes at runtime (it does not measure secondary effects like exposing new expression redundancies).

For SPECINT2000 we see a reduction in conditional branches of 4%. Yes, the threading changes manage to eliminate 4 out of every 100 branches executed at runtime across a fairly large benchmark suite. One benchmark (gzip) saw an incredible 21% reduction in runtime branches executed, though sadly gzip doesn't show a measurable runtime improvement (most likely because there were few, if any, secondary optimization opportunities exposed by jump threading, and/or the branches it eliminated were easily predicted by the hardware and thus weren't causing considerable branch mis-prediction penalties).

These changes are also now showing a net improvement in compile time.
18832 has no jump threading issues left, says Jeff.
In bug 55860 Jeff said he reviews all of these each release, so ASSIGNED to him.
(In reply to Eric Gallager from comment #15) > In bug 55860 Jeff said he reviews all of these each release, so ASSIGNED to > him. ...and status actually changed to ASSIGNED this time.