This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



SSA jump threading/bypassing -- an update


I've been largely quiet for a while as I've been evaluating how
to best deal with jump bypassing/threading.

As you may recall, Kenny presented a new algorithm that I'll call
"jump bypassing" at the GCC summit.  The basic idea was to avoid
the insane contortions currently done by jump threading in the
dominator optimizer to keep the SSA graph up-to-date (I'll use
"jump threading" to refer to the old method in tree-ssa-dom.c).

While the algorithm looked quite promising, I was concerned from
the outset that it was going to be too conservative in its
behavior and that it would fail to optimize many cases that were
currently handled by the dominator optimizer's jump threading
capabilities.

However, even with those concerns I was really hoping to be able
to make use of the jump bypassing algorithm.

My theory was that the jump bypassing algorithm would optimize the
cases most likely to expose secondary optimization opportunities
to the rest of the SSA optimizers, while the old jump threading
code would run immediately before the out-of-ssa translation,
where it was not expected to expose further opportunities.

With that in mind I set forth, implemented the jump bypassing
algorithm, restricted jump threading to run just before out-of-ssa,
and started evaluating the result by measuring how well each
scheme eliminated branches (I primarily measured dynamic branching
behavior, but from time to time also looked at the number of
static branches eliminated).

I was rather disappointed to find that the mixed use of jump bypassing
and jump threading performed significantly worse in terms of 
eliminating dynamic branches than the old jump threading code.

The significant increase in dynamic branches suggested that there
were cases where the old jump threading code exposed significant
secondary optimization opportunities to the rest of the SSA path,
including exposing additional jump threading opportunities.

It was relatively easy to use the dynamic branching data gathered
by -fprofile-arcs/-fbranch-probabilities and the analyze_brprob
script to identify specific functions with significant increases
in the number of dynamic branches executed.  In fact, it was possible
to pinpoint exactly what edges in the CFG were not optimized as 
well as they could/should have been.

Armed with that data I've been able to classify the problems with
jump bypassing into three rough areas: CFG shape, reliance upon
condition domination, and propagation of information through
PHIs at the leaves of the dominance tree.

First, jump bypassing is far too conservative in terms of the 
CFG shapes it allows.  A few examples:



     ORIG                     DESIRED

       0                        0
      /|                       / \
     2 |                      2   5
      \|                      |
       3                      4
      / \
     4   5


  Assume blocks 0 & 3 test the same predicate.  Jump bypassing will
  fail to optimize this case as the successors of 0 do not dominate
  predecessors of block 3.  This is trivially fixed by changing the
  algorithm to use edge dominance rather than block dominance when
  testing the shape of the CFG.

  Specifically, each incoming edge into the join point (block 3 in
  this example) must be dominated by an outgoing edge from the
  first predicate (block 0 in this example).
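
  To make the distinction concrete, here is a brute-force Python
  sketch (my own model, not GCC code) of that edge-dominance test on
  the ORIG CFG above.  It is exponential, but fine for illustration
  on an acyclic CFG this small:

```python
# The ORIG CFG above: edges 0->2, 0->3, 2->3, 3->4, 3->5.
EDGES = [(0, 2), (0, 3), (2, 3), (3, 4), (3, 5)]
ENTRY = 0

def all_paths(entry, edges):
    """Enumerate every entry-to-exit path in an acyclic CFG."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    def walk(path):
        if path[-1] not in succ:          # no successors: exit block
            yield path
        else:
            for s in succ[path[-1]]:
                yield from walk(path + [s])
    return list(walk([entry]))

def edge_dominates(e1, e2, entry, edges):
    """True iff every path through edge e2 also passes through edge e1."""
    def uses(path, e):
        return e in list(zip(path, path[1:]))
    through_e2 = [p for p in all_paths(entry, edges) if uses(p, e2)]
    return bool(through_e2) and all(uses(p, e1) for p in through_e2)

# Each incoming edge of the join (block 3) must be dominated by some
# outgoing edge of the first predicate (block 0):
incoming = [(0, 3), (2, 3)]
outgoing = [(0, 2), (0, 3)]
ok = all(any(edge_dominates(o, i, ENTRY, EDGES) for o in outgoing)
         for i in incoming)
# ok is True here, while the block-dominance test fails: no successor
# of block 0 dominates block 0 itself, a predecessor of block 3.
```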


  Another example of CFG shapes the original jump bypassing 
  algorithm does not handle:


      ORIG                       DESIRED

       0                         0
     /   \                     /   \
    1     2                   1     2
   / \   / \                 / \   / \
  3   4 5   6               3   4 5   6
   \  | |  /                 \ /   \ /
    \ | | /                   8     9
     \| |/
       7
      / \
     8   9
  
  [ Assume the same predicate in block 0 and block 7. ]

  Again, this is relatively easy to fix by extending jump bypassing
  to handle more than two incoming edges into the join block (block 7
  in this example).  PHIs in block 7 need to turn into PHIs in blocks
  8 & 9 (instead of copies).
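
  The PHI redistribution can be sketched as follows (a hypothetical
  Python model, not GCC code): each argument of the join block's PHI
  follows its incoming edge to whichever copy of the join that edge
  now reaches.

```python
def split_phi(phi_args, routing):
    """phi_args: {pred_block: phi_argument}; routing: {pred_block: copy_id}.
    Returns {copy_id: {pred_block: phi_argument}} -- the smaller PHI
    for each duplicated copy of the join block."""
    out = {}
    for pred, val in phi_args.items():
        out.setdefault(routing[pred], {})[pred] = val
    return out

# The example above: join block 7 with predecessors 3, 4, 5, 6; edges
# from 3 & 4 are rerouted to the new join 8, edges from 5 & 6 to 9.
phis = split_phi({3: "a3", 4: "a4", 5: "a5", 6: "a6"},
                 {3: 8, 4: 8, 5: 9, 6: 9})
# phis == {8: {3: "a3", 4: "a4"}, 9: {5: "a5", 6: "a6"}}
```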


  A third example of CFG shapes not handled by the jump bypassing
  algorithm:



      ORIG                        DESIRED

       0                          0
      / \                        / \
     1   2                      1   2
    / \ / \                    / \ / \
   3   4   5                  3   4   5
    \  |  /                   |   |   |
     \ | /                    |   6   |
       6                      |  / \  |
      / \                      \/   \/
     7   8                     7    8

  Assume that blocks 0 & 6 have the same predicate.  The problem in
  this CFG is the edge from 4-6 is not dominated by an outgoing edge
  from block 0.  Believe it or not, CFGs like this are quite common.
  This is somewhat harder to fix, but not terrible.  [ Note depending
  on the contents of block 6, you may need to insert assignments on
  the edges 3->7 and 5->8. ]

  Finally, the jump bypassing algorithm doesn't address at all
  how to handle the case where the block with the second predicate
  does not dominate its successor blocks, i.e.:

      0
     / \
    1   2
    |  / \
    | 3   4
    |  \ /
    |   5
    |  / \
    +>6   7


   Consider if/how the bypassing algorithm handles the case where
   blocks 2 & 5 test the same predicate.


   It's worth noting that the existing jump threading code doesn't
   care about CFG shape at all and handles all these cases right now.


The second class of weakness in the jump bypassing algorithm is its
reliance upon condition domination; i.e., the jump bypassing
algorithm only optimizes when it encounters an instance of a
predicate that is dominated by an earlier instance of the same
predicate.  This causes jump bypassing to miss a large class
of optimization opportunities, including those with nice secondary
effects.

 For example, consider a CFG such as:



          0
         / \
        1   2
         \ /
          3
         / \
        4   5

 
  With the following statements:
 
  block 0
    if (a1 == 1)  goto block 1 else goto block 2

  block 1
    z2 = 0;
    goto block 3

  block 2
    z3 = unknown;
    goto block 3

  block 3
    z4 = PHI (z2, z3)
    if (z4 == 0) goto block 4 else goto block 5


  Note how the predicates in block 0 and block 3 are different.  The
  jump bypassing algorithm will do nothing in this case.    That's
  unfortunate since we should be able to thread the edge 1->3 to
  block 4.


  These kinds of situations are amazingly common and easily handled
  by the jump threading code in the dominator optimizer.  
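
  This case needs no dominating instance of the same predicate, just
  constant PHI arguments.  A minimal Python sketch (names and
  representation are mine, not GCC's) of the substitution test the
  old jump threader effectively performs:

```python
def thread_through_phi(phi_args, cond, true_dest, false_dest):
    """phi_args maps each predecessor block to the PHI argument on its
    edge (None when the value is unknown, like z3 above).  If
    substituting an argument folds the condition to a constant, that
    edge threads directly to the corresponding destination.
    Returns {pred_block: threaded_destination}."""
    threaded = {}
    for pred, val in phi_args.items():
        if val is None:
            continue                    # value unknown along this edge
        threaded[pred] = true_dest if cond(val) else false_dest
    return threaded

# Block 3 of the example: z4 = PHI(z2, z3); if (z4 == 0) goto 4 else 5.
# z2 is the constant 0 on the edge from block 1; z3 is unknown.
result = thread_through_phi({1: 0, 2: None}, lambda z: z == 0, 4, 5)
# result == {1: 4}: the edge 1->3 threads past the test to block 4.
```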


  
The third area of weakness is the jump bypassing algorithm's
inability to propagate information at the leaves of the dominator
tree.  Consider:

  block 0
    #VUSE (memtag1)
    z4 = foo->bar->com
    if (z4 == 0)  goto block 1 else goto block 2

  block 1
    some statements which do not do an aliased write
    goto block 3

  block 2
    #memtag2 = V_MAY_DEF (memtag1)
    fubar ();
    goto block 3

  block 3
    memtag3 = PHI (memtag1, memtag2)

    #VUSE (memtag3)
    z5 = foo->bar->com;
    if (z5 == 0) goto block 4 else goto block 5


  This is a rather interesting (and common) situation.  The existing
  jump threader handles this already -- if block 3 is reached from
  block 1, then we know that z5 will have the value 0, and we
  can thread the edge 1->3 to block 4 -- avoiding a memory load, a
  test, and a conditional branch.  Jump bypassing might be extendable
  to handle this.
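
  A toy Python model (my own stand-in, not GCC code) of why that
  threading is safe and what it saves: `load()` plays the role of the
  VUSE'd load of foo->bar->com, and `fubar()` plays the role of the
  call with the V_MAY_DEF.

```python
class Mem:
    """Hypothetical stand-in for the memory reached via foo->bar->com."""
    def __init__(self, val):
        self.val, self.loads = val, 0
    def load(self):               # models the #VUSE'd load
        self.loads += 1
        return self.val

def fubar(mem):                   # models the call with a V_MAY_DEF
    mem.val = 99                  # aliased write: clobbers the value

def orig(mem):
    """Blocks 0-3 of the example, with both loads and both tests."""
    if mem.load() == 0:           # block 0
        pass                      # block 1: no aliased write
    else:
        fubar(mem)                # block 2
    return 4 if mem.load() == 0 else 5    # block 3: reload and retest

def threaded(mem):
    """After threading 1->3 to block 4: the then-path skips the reload."""
    if mem.load() == 0:
        return 4                  # edge 1->3 goes straight to block 4
    fubar(mem)
    return 4 if mem.load() == 0 else 5

m, n = Mem(0), Mem(0)
assert orig(m) == threaded(n) == 4
assert (m.loads, n.loads) == (2, 1)   # one load and one test saved
```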

And a reminder, I'm showing the simplest examples necessary to
show the weaknesses.  Clearly these trivial examples could be
handled with more work.  However, handling them in the general
case adds more and more complexity to the jump bypassing
algorithm.  In fact, doing so starts to make the jump
bypassing algorithm look more and more like the jump threading
algorithm in tree-ssa-dom.c, except that the SSA update mechanisms
are different.

I found it was relatively common for one or more of the above cases
to apply during the first or second pass of the old jump threading
code, which in turn caused certain statements to then be discovered
as redundant, which in turn allowed those statements to be eliminated
by DCE, which in turn exposed yet more jump threading opportunities
in a later pass of jump threading.

It's somewhat ironic: I've known for some time that we would
benefit from alternating DOM and DCE for precisely the same reasons.
And in fact, one class of cases where the RTL optimizers are finding
threading opportunities would be resolved by alternating DOM & DCE
until they reached a steady state.  Of course, that would be rather
expensive....
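
The driver for such an alternation is simple enough to sketch (a toy
Python illustration, not GCC code); the interesting part is only that
each pass can unlock further progress for the other:

```python
def run_to_fixpoint(ir, passes):
    """Apply each pass in turn; stop when a full round changes nothing."""
    while True:
        changed = False
        for opt in passes:
            new_ir = opt(ir)
            changed |= new_ir != ir
            ir = new_ir
        if not changed:
            return ir

# Toy stand-ins for DOM and DCE: each can only strip one leading
# character, so neither finishes alone, but alternating them to a
# fixed point does -- mimicking DOM exposing work for DCE and back.
strip_a = lambda s: s[1:] if s.startswith("a") else s
strip_b = lambda s: s[1:] if s.startswith("b") else s

result = run_to_fixpoint("abab", [strip_a, strip_b])   # -> ""
```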


However, I should also note that there is a class of cases that are
caught by the jump bypassing algorithm, but which are not caught
by jump threading.  Specifically those where the block with the
second conditional has statements with side effects that must be
preserved.  These are not that common, but they do occur from time
to time.  The block/statement copying nature of the bypassing update
algorithm allows the bypassing version to handle these cases
gracefully.

Anyway, this is already quite long and the afternoon is nearly 
done.  I'll have more to say on Monday...



Jeff

