[Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

Thu Jan 14 08:55:00 GMT 2016

On 01/13/2016 01:10 AM, Jeff Law wrote:
>
> I'm going to focus on adpcm for the moment, in particular adpcm_coder.
> It appears the key blocks are:
Looking at adpcm_decoder we have the same idiom as in adpcm_coder:

   if (bufferstep_79 != 0)
     goto <bb 6>;
   else
     goto <bb 7>;
;;    succ:       6 [50.0%]  (TRUE_VALUE,EXECUTABLE)
;;                7 [50.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 6, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 5, next block 7, flags: (NEW, REACHABLE)
;;    pred:       5 [50.0%]  (TRUE_VALUE,EXECUTABLE)
   delta_31 = inputbuffer_65 & 15;
   goto <bb 8>;
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 7, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 6, next block 8, flags: (NEW, REACHABLE)
;;    pred:       5 [50.0%]  (FALSE_VALUE,EXECUTABLE)
   inp_32 = inp_70 + 1;
   _33 = *inp_70;
   inputbuffer_34 = (int) _33;
   _35 = inputbuffer_34 >> 4;
   delta_36 = _35 & 15;
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 8, loop depth 1, count 0, freq 9100, maybe hot
;;    prev block 7, next block 9, flags: (NEW, REACHABLE)
;;    pred:       6 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                7 [100.0%]  (FALLTHRU,EXECUTABLE)
   # inp_2 = PHI <inp_70(6), inp_32(7)>
   # delta_5 = PHI <delta_31(6), delta_36(7)>
   # inputbuffer_16 = PHI <inputbuffer_65(6), inputbuffer_34(7)>
   _83 = bufferstep_79 ^ 1;
   _1 = _83 & 1;
   _39 = indexTable[delta_5];
   index_40 = _39 + index_81;
   _84 = MIN_EXPR <index_40, 88>;
   index_62 = MAX_EXPR <_84, 0>;
   sign_41 = delta_5 & 8;
   vpdiff_42 = step_71 >> 3;
   _43 = delta_5 & 4;
   if (_43 != 0)
     goto <bb 9>;
   else
     goto <bb 10>;

The difference is there's all kinds of code between BB8 and the latch. 
So it doesn't trigger path splitting, which in turn doesn't expose the 
CSE/DCE opportunity for adpcm_decoder.  We're likely leaving some 
performance on the table for this code.

This brings up a deeper question.   Are we approaching path splitting in 
a fundamentally backward way?

Right now path splitting is guided by finding the right shape in the CFG 
immediately preceding the latch block of a loop.  Plus some heuristics 
I'm working on to predict when the duplication is more likely to lead to 
CSE/DCE opportunities.  But really it's driven by finding the right CFG 
shape before the latch block.

Should path splitting really be driven by partitions of PHI arguments 
and other incoming values (such as the implicit sets from conditionals). 
  I've been pondering this for a long time as it's a natural extension 
of what we do in the erroneous path isolation pass.

ie at any join point in the CFG, look for partitions of the arguments in 
any PHI nodes and split paths based on that.

So for example if we had a PHI like

x5 = phi (x4, 0, 0, 1, 1)

Look at how x5 is used.  If propagation of any of those PHI argument 
values would result in simplifications at the use points of x5, then we 
should consider splitting off all the paths with the beneficial PHI 
argument.

So in the case above, assume that the value 0 results in 
simplifications.  We'd create two paths.  One for the paths where x5 
would get the value zero, another for everything else.

This is *a lot* like the path isolation done erroneous paths in terms of 
the CFG & SSA manipulations.  Essentially in both cases we want to 
isolate a path so that we can manipulate it based on known inputs 
without affecting paths with uninteresting inputs.  All that really 
changes is how we drive the selection of paths to isolate and whether or 
not we insert the __builtin_trap or not.

Anyway, going back to adpcm_decode, we do end up splitting this path:

  # vpdiff_12 = PHI <vpdiff_11(12), vpdiff_50(13)>
   if (sign_41 != 0)
     goto <bb 15>;
   else
     goto <bb 16>;
;;    succ:       15
;;                16

;;   basic block 15, loop depth 1
;;    pred:       14
   valpred_51 = valpred_76 - vpdiff_12;
   goto <bb 17>;
;;    succ:       17

;;   basic block 16, loop depth 1
;;    pred:       14
   valpred_52 = vpdiff_12 + valpred_76;
;;    succ:       17

;;   basic block 17, loop depth 1
;;    pred:       15
;;                16
   # valpred_7 = PHI <valpred_51(15), valpred_52(16)>
   _85 = MAX_EXPR <valpred_7, -32768>;
   valpred_13 = MIN_EXPR <_85, 32767>;
   step_53 = stepsizeTable[index_62];
   outp_54 = outp_69 + 2;
   _55 = (short int) valpred_13;
   MEM[base: outp_54, offset: -2B] = _55;
   if (outp_54 != _74)
     goto <bb 20>;
   else
     goto <bb 18>;

This doesn't result in anything particularly interesting/good AFAICT. 
We propagate valpred_51/52 into the use in the MAX_EXPR in the duplicate 
paths, but that doesn't allow any further simplification.

Ajit, can you confirm which of adpcm_code or adpcm_decode where path 
splitting is showing a gain?  I suspect it's the former but would like 
to make sure so that I can adjust the heuristics properly.

jeff