Bug 19794 (jumpthreading) - [meta-bug] Jump threading related bugs
Summary: [meta-bug] Jump threading related bugs
Status: ASSIGNED
Alias: jumpthreading
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: unknown
Importance: P2 enhancement
Target Milestone: ---
Assignee: Jeffrey A. Law
URL:
Keywords: meta-bug
Depends on: 26731 32306 42646 50346 55860 56574 67194 69196 73550 81958 85186 102844 103680 13875 15221 15352 16538 17116 18046 18076 18576 19516 19804 19938 19940 21417 21559 21829 21883 23622 23901 37060 54345 54742 58455 58698 61428 68548 70879 77445 78496 81165 84078
Blocks:
Reported: 2005-02-06 17:52 UTC by Kazu Hirata
Modified: 2021-12-13 13:58 UTC
CC List: 7 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-03-05 03:10:28


Attachments
Current changes for jump thread selection code (5.61 KB, patch)
2005-03-22 17:26 UTC, Jeffrey A. Law

Description Kazu Hirata 2005-02-06 17:52:59 UTC
Please make jump threading related bugs block this meta bug.
Comment 1 Jeffrey A. Law 2005-03-16 19:24:58 UTC
An update on the new jump threading selection code.

All the base infrastructure is in place (namely the code to avoid creating
irreducible regions).  Benchmarking the new selection code has turned up one
issue with eon that I'm trying to sort out right now.

jeff
Comment 2 Jeffrey A. Law 2005-03-22 17:12:22 UTC
The EON issue mentioned in my notes from last week is an instability in how IV
opts selects which induction variables to use.  Zdenek has a patch which helps
increase the stability of the IV selection code and produce better debugging
information.  That patch eliminates the huge EON regression, but there are
still some bad things happening.


Here's some hard data from SPEC2000:

                    Mainline + Zdenek's patch    + Threading changes:

                          Runtime    Ratio         Runtime    Ratio
   164.gzip                 158       887*           158       887*
   175.vpr                  264       530*           271       517*
   176.gcc                  130       846*           133       829*   
   181.mcf                  337       534*           353       510*
   186.crafty               116       864*           115       869*
   197.parser               252       715*           255       706*
   252.eon                  166       783*           155       838*
   253.perlbmk              190       947*           169      1063*
   254.gap                  117       939*           119       924*
   255.vortex               183      1039*           182      1043*
   256.bzip2                235       639*           230       652*
   300.twolf                433       692*           447       671*
   ----------------------------------------------------------------
                                      768                      772


Which is rather disappointing, with the obvious exception of eon and
perlbmk which improve enough to offset all the other consistent losses.
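
The two bottom-line figures in the table (768 and 772) are consistent with a
geometric mean of the per-benchmark ratios, which is how SPEC composite scores
are normally formed.  A minimal sketch that recomputes them, assuming the
summary row really is a geometric mean (the list and function names are
illustrative, not part of any SPEC tooling):

```python
import math

# SPEC2000 ratios copied from the table above, in the same benchmark order.
mainline = [887, 530, 846, 534, 864, 715, 783, 947, 939, 1039, 639, 692]
threaded = [887, 517, 829, 510, 869, 706, 838, 1063, 924, 1043, 652, 671]


def geomean(ratios):
    """Geometric mean: exp of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


print(f"mainline + Zdenek's patch: {geomean(mainline):.1f}")
print(f"+ threading changes:       {geomean(threaded):.1f}")
```

Both results land within one point of the reported values, so the net gain is
roughly half a percent, carried almost entirely by eon and perlbmk.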

I was still concerned about the instability of IV opts skewing the results
(either in a positive or negative way).  So I did runs with IV opts disabled.

                    Mainline no IV opts     + Threading Improvements

                   Runtime    Ratio            Runtime   Ratio
   164.gzip          157       894*             157       892*
   175.vpr           263       533*             264       531*
   176.gcc           127       866*             126       872*
   181.mcf           329       547*             327       550*
   186.crafty        114       877*             115       873*
   197.parser        250       719*             249       723*
   252.eon           173       752*             159       815*
   253.perlbmk       172      1047*             150      1201*
   254.gap           117       941*             118       934*
   255.vortex        182      1046*             180      1055*
   256.bzip2         225       667*             224       670*
   300.twolf         429       699*             429       699*
   -----------------------------------------------------------
                               781                        796

Which looks *much* better.  There are still some regressions, but they are
relatively small and are offset by similar small improvements.  We still get
big speedups for perl and eon.
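
To put numbers on "relatively small", the per-benchmark change in ratio for the
no-IV-opts runs can be tabulated directly from the figures above.  This is only
a sketch; the 2% cut-off used to separate "large" from "small" changes is an
arbitrary choice made for illustration:

```python
# Ratios from the "no IV opts" table above: (mainline, with threading).
ratios = {
    "164.gzip": (894, 892),
    "175.vpr": (533, 531),
    "176.gcc": (866, 872),
    "181.mcf": (547, 550),
    "186.crafty": (877, 873),
    "197.parser": (719, 723),
    "252.eon": (752, 815),
    "253.perlbmk": (1047, 1201),
    "254.gap": (941, 934),
    "255.vortex": (1046, 1055),
    "256.bzip2": (667, 670),
    "300.twolf": (699, 699),
}


def pct_change(before, after):
    """Relative change in SPEC ratio, in percent (positive is faster)."""
    return 100.0 * (after - before) / before


changes = {name: pct_change(b, a) for name, (b, a) in ratios.items()}
# Benchmarks whose ratio moves by more than the (arbitrary) 2% threshold.
large = [name for name, pct in changes.items() if abs(pct) > 2.0]
worst_regression = min(changes.values())

for name, pct in changes.items():
    print(f"{name:12s} {pct:+6.2f}%")
```

Only eon and perlbmk move by more than the threshold, and every regression
stays under one percent.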

From that I can conclude that there's still some bad interaction between
IV opts and the new threading code.  Either the IV selection instability
is still causing significant issues, or there is some other effect I haven't
discovered yet.

I'm going to spend a little time trying to figure out what interaction between
the new threading code and IV opts is still causing us heartburn before moving
forward with the threading changes.

[ I'll note that according to this test, we generally seem to get better code
  without IV opts.  Which is probably a good sign that the IV opts code needs
  some serious tuning. ]

Jeff
Comment 3 Jeffrey A. Law 2005-03-22 17:26:10 UTC
Created attachment 8434
Current changes for jump thread selection code
Comment 4 Jeffrey A. Law 2005-04-02 01:31:36 UTC
Some notes on recent poking and prodding.

The big perl speedup is consistent on my P4 -- but perl shows no significant
change on my AMD box.  Perl spends ~50% of its time in one routine (regexec)
and, surprise, that's the routine where profiling shows the great improvements.

Unfortunately, the profiling data hasn't pinpointed _why_ that routine is
running so much faster.  If one is to believe the oprofile data, the
huge reduction in cycles actually occurs in the function's header block.
But it's not substantially different between the version compiled with
and without the threading updates.  And it does not appear that threading has
turned any of the recursive calls into simple loops.  Looking at the profile
results from different P4 counters hasn't provided any additional insight
yet.  Sigh.

Anyway, I'll continue poking at Perl -- I'd really like to understand the 
huge improvements before installing the patch. 
Comment 5 Jeffrey A. Law 2005-04-06 00:25:50 UTC
Just more notes on the huge perl speedup with the threading changes...

For reasons yet unknown, we're seeing a lot less L2 cache traffic when perl
is compiled with the threading changes.  The decreased L2 traffic also
corresponds to the areas of code which I had previously identified as
showing the most significant runtime improvements.

What I don't have an answer for yet is _why_ we're seeing the huge decrease
in L2 traffic.  We're seeing a huge drop in MEMORY_CANCEL events, which, if
I understand the docs correctly, cause an increase in cache activity because
we are either low on store buffers or have conflicts due to 64k aliasing.
[ Both sub-events are way down with the threading changes. ]
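
To make the 64k aliasing concern concrete for a recursive routine: if every
activation uses the same frame size, a given spill slot in two activations
sits at addresses that differ by (recursion distance x frame size), and the
two can alias when that difference is a multiple of 64 KiB.  The sketch below
computes the smallest recursion distance at which that happens.  It is a
simplification that ignores cache-line granularity, and the frame sizes are
made-up inputs, not measurements from perl:

```python
import math

ALIAS_STRIDE = 64 * 1024  # 64 KiB; assumed aliasing stride


def first_aliasing_depth(frame_size):
    """Smallest d > 0 such that d * frame_size is a multiple of ALIAS_STRIDE."""
    return ALIAS_STRIDE // math.gcd(frame_size, ALIAS_STRIDE)


for size in (256, 272, 1024):  # hypothetical frame sizes in bytes
    print(size, first_aliasing_depth(size))
```

A frame size that is a large power of two collides after relatively few
levels of recursion, while a size with only a small power-of-two factor pushes
the first collision much deeper, which is why perturbing the frame size is a
natural experiment.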

Attempts to perturb the size of the stack for the recursive routine in question
to avoid the 64k aliasing problem haven't met with any success yet.  We are
seeing variables spilled into different stack locations, so that _could_ be
the ultimate cause of the cache behavior.  These things would definitely
classify as second-order effects.

We're also seeing a measurable decrease in ITLB misses; however, I'm not
convinced that the decrease is significant enough to cause the kind of
performance improvement we're seeing.

Anyway, I'm still plugging away...
Comment 6 Daniel Berlin 2005-04-06 02:05:13 UTC
On Wed, 2005-04-06 at 00:25 +0000, law at redhat dot com wrote:
> Just more notes on the huge perl speedup with the threading changes...

Have you tried on non-x86 to see if the results bear out on other
platforms?


Comment 7 Jeffrey A. Law 2005-04-06 03:21:51 UTC
On Wed, 2005-04-06 at 02:05 +0000, dberlin at dberlin dot org wrote:
> Have you tried on non-x86 to see if the results bear out on other
> platforms?
They don't even bear out on other x86 platforms -- my older P3 and AMD
boxes don't show the same kind of big improvement (they show small
improvements).  However, both of my P4s show big improvements.


jeff

Comment 8 Andrew Pinski 2005-04-06 06:38:42 UTC
(In reply to comment #7)
> They don't even bear-out on other x86 platforms -- my older P3 and AMD
> boxes don't show the same kind of big improvement (they show a small
> improvements).  However, both of my P4s show big improvements.

Huh, both of those targets are still x86. So what about, say, Power4 or PowerPC 970?
Comment 9 Jeffrey A. Law 2005-04-06 17:41:30 UTC
On Wed, 2005-04-06 at 06:38 +0000, pinskia at gcc dot gnu dot org wrote:
> (In reply to comment #7)
> > They don't even bear-out on other x86 platforms -- my older P3 and AMD
> > boxes don't show the same kind of big improvement (they show a small
> > improvements).  However, both of my P4s show big improvements.
> 
> Huh, both of those targets are still x86.
And my point was that the improvement from everything I've managed to
gather so far is either specific to the P4 or possibly specific to the
size of the cache on the P4.




>  So what about on say Power4 or PowerPC 970?
I don't have either of those.  My non-x86 boxes are all, err, old.  My
PPC I think is a 233MHz PPC750.  It'd probably take more than a day to
get things built and running on that ancient box.  I only use it to
diagnose PPC specific problems.

jeff


Comment 10 Jeffrey A. Law 2005-04-06 19:21:26 UTC
More info.

It appears that threading one specific jump is responsible for triggering
the big speedup.  And it could cause the kind of effects we're seeing.

Basically we're threading a conditional branch to a loop exit test back to the
top of the loop.  This has the effect of creating nested loops.  This in turn
causes the register allocators to make different choices with regard to which
values are kept in registers and which end up on the stack (and at
what offsets each object appears on the stack).

That could cause the kind of decrease in L2 activity I'm seeing, particularly
with the recursive nature of the function in question.  I've got a few more
tests to run before I claim this to be the cause of the huge improvement.
But this is the best theory which fits the data I've seen so far.
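
A toy model of the transformation described above may help.  It assumes a
hypothetical six-block CFG (the block names are invented for illustration and
do not come from perl or from GCC's dumps) in which "body" ends in a
conditional branch to the loop exit test.  Redirecting that edge straight to
the loop header gives the header a second back edge, and a header with two
back edges is commonly modelled as two nested loops sharing that header:

```python
def thread_edge(cfg, src, old_dst, new_dst):
    """Return a copy of cfg with the edge src -> old_dst redirected to new_dst."""
    new_cfg = {block: list(succs) for block, succs in cfg.items()}
    new_cfg[src] = [new_dst if s == old_dst else s for s in new_cfg[src]]
    return new_cfg


def back_edge_sources(cfg, header):
    """Blocks inside the loop that branch to header.

    In this toy CFG every edge into the header except the one from
    'entry' is a back edge, so no dominator computation is needed.
    """
    return sorted(b for b, succs in cfg.items()
                  if header in succs and b != "entry")


cfg = {
    "entry": ["header"],
    "header": ["body"],
    "body": ["tail", "exit_test"],    # conditional branch
    "tail": ["exit_test"],
    "exit_test": ["header", "exit"],  # loop exit test
    "exit": [],
}

# On the body -> exit_test path the exit test is assumed to be known to
# continue the loop, so the edge can go directly back to the header.
threaded = thread_edge(cfg, "body", "exit_test", "header")

print(back_edge_sources(cfg, "header"))
print(back_edge_sources(threaded, "header"))
```

The change in loop structure is the only difference between the two graphs,
yet it is enough to alter what later passes treat as "inner loop" values when
assigning registers and spill slots.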
Comment 11 Jeffrey A. Law 2005-04-08 18:19:26 UTC
OK. I'm pretty sure the perl improvements are really just an artifact of
changes in what objects get spilled onto the stack and the offsets of each
particular object.

I can with a small amount of work twiddle register allocation (and thus
spilling behavior) in such a way as to make perl built without the threading
changes run significantly faster than perl built with the threading changes.
[ Which is the exact opposite of the runtime behavior if we do not twiddle
  the register allocation priorities. ]

While it's certainly not proof, I'm confident that the reason performance is
swinging back and forth is due to cache effects in the stack, particularly
for spill slots in one critical function.  Anything which changes which objects
are spilled on the stack, or the offsets of those objects on the stack is
likely to cause wild swings in performance, at least on P4s.

The net result is, IMHO, we should ignore Perl's results when benchmarking
the new threading changes on P4s, and possibly other platforms.  This may
also hold true when benchmarking other changes which could potentially affect
register allocation (which, of course, is just about everything).
Comment 12 Jeffrey A. Law 2005-04-19 17:59:15 UTC
An update.  The jump threading changes are blocked pending resolution of
a semi-latent bug in reload which is exposed by the combination of the
jump threading changes plus the recent merges from TCB.

I'm testing a patch which resolves the semi-latent bug and hope to remove
that blocker by COB today.

jeff
Comment 13 Jeffrey A. Law 2005-04-22 18:30:08 UTC
The reload bug has been "resolved" or at least it's in a state where we can
move forward with the jump threading changes.

I've finally got the dynamic branching data which shows pretty much what I
expected -- we execute fewer jumps :-)

What I've done is use GCC's arc-profiling and branch prediction support to
give us a picture of the runtime branching behavior of a set of programs
(SPECINT2000).  Specifically we're looking at how many conditional branches were
executed (regardless of whether they were taken or not taken).  We
measure this with and without the jump threading changes.  This gives us a
highly accurate measure of how effective the jump threading changes are at
reducing the number of branches a program executes at runtime (it does not
measure the secondary effects like exposing new expression redundancies).
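
As a source-level analogy for what is being counted (not GCC's actual
implementation, and not how the arc profiler works), the sketch below runs the
same computation in an "unthreaded" form, where a second test re-checks a flag
whose value is already known on each path out of the first test, and in a
"threaded" form where each path jumps straight to the code the second test
would have selected.  Loop-control branches are ignored to keep it short, and
all function names are hypothetical:

```python
def unthreaded(values):
    """Two conditional branches per item; the second is redundant."""
    total = branches = 0
    for v in values:
        branches += 1          # first test: v > 0
        if v > 0:
            flag = True
        else:
            flag = False
        branches += 1          # second test: outcome already determined
        if flag:
            total += v
    return total, branches


def threaded(values):
    """Same result with the second test threaded away."""
    total = branches = 0
    for v in values:
        branches += 1          # only the first test remains
        if v > 0:
            total += v
    return total, branches


def reduction_pct(before, after):
    """Percentage of executed conditional branches eliminated."""
    return 100.0 * (before - after) / before


data = [3, -1, 4, -5]
total_a, branches_a = unthreaded(data)
total_b, branches_b = threaded(data)
print(total_a == total_b, branches_a, branches_b,
      reduction_pct(branches_a, branches_b))
```

The same reduction_pct calculation applied to whole-program counts is what
yields a figure such as "4 out of every 100 branches".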

For SPECINT2000 we see a reduction in conditional branches of 4%.  Yes, the
threading changes manage to eliminate 4 out of every 100 branches executed
at runtime across a fairly large benchmark suite.  One benchmark (gzip) saw
an incredible 21% reduction in runtime branches executed, though sadly gzip
doesn't show a measurable runtime improvement (most likely because there were
few, if any secondary optimization opportunities exposed by jump threading
and/or the branches it eliminated were easily predicted by the hardware and
thus weren't causing considerable branch mis-prediction penalties).

These changes are also now showing a net improvement in compile-time.
Comment 14 Steven Bosscher 2005-04-23 16:58:46 UTC
18832 has no jump threading issues left, says Jeff. 
Comment 15 Eric Gallager 2018-06-29 15:50:45 UTC
In bug 55860 Jeff said he reviews all of these each release, so ASSIGNED to him.
Comment 16 Eric Gallager 2018-06-29 15:52:05 UTC
(In reply to Eric Gallager from comment #15)
> In bug 55860 Jeff said he reviews all of these each release, so ASSIGNED to
> him.

...and status actually changed to ASSIGNED this time.