[PATCH v3] Fix incomplete computation in fill_always_executed_in_1

Xionghu Luo luoxhu@linux.ibm.com
Tue Aug 31 07:43:35 GMT 2021



On 2021/8/30 17:19, Richard Biener wrote:
>>>>      bitmap_set_bit (work_set, loop->header->index);
>>>> +      unsigned bb_index;
>>>>    -      for (i = 0; i < loop->num_nodes; i++)
>>>> -	{
>>>> -	  edge_iterator ei;
>>>> -	  bb = bbs[i];
>>>> +      unsigned array_size = last_basic_block_for_fn (cfun) + 1;
>>>> +      int *bbd = XNEWVEC (int, array_size);
>>>> +      bbd = XDUPVEC (int, bbi, array_size);
>>> I don't think you need to copy 'bbi' but you can re-use the
>>> state from the outer loop processing.  Did you run into any
>>> issues with that?
>> Yes.  For example, after adding a small if-else block to ssa-lim-19.c,
>> the block "x->j += tem * i;" (bb 6) is always executed for loop 2.  When
>> fill_always_executed_in_1 is called for loop 1, bbi[6] is decreased from
>> 2 to 1 to 0; if fill_always_executed_in_1 is then called for loop 2, the
>> value has not been reset, so bb 6 won't be marked ALWAYS_EXECUTED, which
>> is wrong.
>>
>>
>> struct X { int i; int j; int k;};
>>
>> void foo(struct X *x, int n, int l, int m)
>> {
>>   for (int j = 0; j < l; j++)  // loop 1
>>     {
>>       for (int i = 0; i < n; ++i)  // loop 2
>>         {
>>           if (m)
>>             x->j++;
>>           else
>>             x->j = m+n+l;
>>
>>           int *p = &x->j;   // bb 6
>>           int tem = *p;
>>           x->j += tem * i;
>>         }
>>       int *r = &x->k;
>>       int tem2 = *r;
>>       x->k += tem2 * j;
>>     }
>> }
> Hmm, but if the outer loop processing reaches bb 6 then
> it should have set it ALWAYS_EXECUTED in loop 1 already?

But bb 6 is NOT ALWAYS_EXECUTED for loop 1; it is only ALWAYS_EXECUTED for
loop 2, since reaching it requires n > 0.  Please refer to the attached file
ssa-lim-19.c.138t.lim2.

;;
;; Loop 1
;;  header 8, latch 12
;;  depth 1, outer 0
;;  nodes: 8 12 7 6 4 5 3 13 11
;;
;; Loop 2
;;  header 3, latch 13
;;  depth 2, outer 1
;;  nodes: 3 13 6 4 5
;; 2 succs { 10 9 }
;; 10 succs { 8 }
;; 11 succs { 3 }
;; 3 succs { 4 5 }
;; 4 succs { 6 }
;; 5 succs { 6 }
;; 6 succs { 13 7 }
;; 13 succs { 3 }
;; 7 succs { 12 9 }
;; 12 succs { 8 }
;; 8 succs { 11 7 }
;; 9 succs { 1 }

always executed: bb->index:8, loop->num: 1
always executed: bb->index:7, loop->num: 1
always executed: bb->index:3, loop->num: 2
always executed: bb->index:6, loop->num: 2

     8<---------------
   /  \              |
  11   \             |
  /     \            |
  3<---  \           | 
 /\    |  \          |
 4 5   |   \         |
 \/    |    \        |
  6    |     \       |
  |-->13      \      |
  |----------> 7     |
               /\    |
              9 12---

(gdb) x /15x bbd
0x1354c9b0: 0x00000000  0x00000000  0x00000001  0x00000001
0x1354c9c0: 0x00000001  0x00000001  0x00000002  0x00000002
0x1354c9d0: 0x00000001  0x00000002  0x00000001  0x00000001
0x1354c9e0: 0x00000001  0x00000001  0x00000000

For loop 1, our algorithm walks through 8->11->3->4->5->6->7 and exits at
edge 7->9.  Afterwards bbd contains:

(gdb) x /15x bbd
0x1354c9b0: 0x00000000  0x00000000  0x00000001  0x00000000
0x1354c9c0: 0x00000000  0x00000000  0x00000000  0x00000000
0x1354c9d0: 0x00000001  0x00000002  0x00000001  0x00000000
0x1354c9e0: 0x00000001  0x00000000  0x00000000

If we don't reset bbd from the incoming-edge counts by memcpy, bbd[3],
bbd[4], bbd[5] and bbd[6] are now 0 for loop 2, so fill_always_executed_in_1
can't set ALWAYS_EXECUTED correctly for loop 2 at bb 3 and bb 6.
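To make the failure mode concrete, here is a minimal standalone model of the
counter-driven worklist walk (just a sketch, not the GCC code; walk_loop,
example_cfg and example_counts are made-up names, and the dominance /
ALWAYS_EXECUTED checks are omitted):

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <vector>

// Forward (non-back) edges of the CFG from the dump above; the back edges
// 12->8 and 13->3 are excluded, as mark_dfs_back_edges would flag them.
using Graph = std::map<int, std::vector<int>>;

inline Graph example_cfg ()
{
  return { {2, {10, 9}}, {10, {8}}, {8, {11, 7}}, {11, {3}}, {3, {4, 5}},
	   {4, {6}}, {5, {6}}, {6, {13, 7}}, {7, {12, 9}}, {12, {}},
	   {13, {}}, {9, {}} };
}

// Incoming non-back edge count per block, as in the first gdb dump.
inline std::vector<int> example_counts ()
{
  std::vector<int> c (14, 0);
  c[3] = 1; c[4] = 1; c[5] = 1; c[6] = 2; c[7] = 2; c[8] = 1;
  c[9] = 2; c[10] = 1; c[11] = 1; c[12] = 1; c[13] = 1;
  return c;
}

// Greedy walk from `header` restricted to `loop_blocks`, consuming the
// shared `count` array in place: a successor joins the worklist only when
// its counter drops to zero, i.e. all its in-loop predecessors were seen.
inline std::set<int> walk_loop (const Graph &g, std::vector<int> &count,
				int header, const std::set<int> &loop_blocks)
{
  std::set<int> visited;
  std::queue<int> work;
  work.push (header);
  while (!work.empty ())
    {
      int bb = work.front ();
      work.pop ();
      visited.insert (bb);
      auto it = g.find (bb);
      if (it == g.end ())
	continue;
      for (int succ : it->second)
	if (loop_blocks.count (succ) && --count[succ] == 0)
	  work.push (succ);
    }
  return visited;
}
```

Walking loop 1 from header 8 consumes the counters of bbs 3..7; a second
walk for loop 2 with the same array never gets past header 3, while a fresh
copy reaches bb 6 again.  That is exactly why the per-loop XDUPVEC (or an
equivalent reset) is needed.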

> 
>>
>>>    
>>>> +      while (!bitmap_empty_p (work_set))
>>>> +	{
>>>> +	  bb_index = bitmap_first_set_bit (work_set);
>>>> +	  bitmap_clear_bit (work_set, bb_index);
>>>> +	  bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
>>>>    	  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>> -	    last = bb;
>>>> -
>>>> +	    SET_ALWAYS_EXECUTED_IN (bb, loop);
>>>>       if (bitmap_bit_p (contains_call, bb->index))
>>>>         break;
>>> I think you want to continue; here (process remaining worklist
>>> but not continue greedy walking this block)
>> Same as above: if we use 'continue' instead of 'break', the algorithm
>> doesn't work either.  If the inner loop contains a jump to the outermost
>> loop, the blocks after the jump block will incorrectly be marked
>> ALWAYS_EXECUTED.
>>
>>>> -
>>>> +	  edge_iterator ei;
>>>>       FOR_EACH_EDGE (e, ei, bb->succs)
>>>>    	    {
>>>> -	      /* If there is an exit from this BB.  */
>>>>           if (!flow_bb_inside_loop_p (loop, e->dest))
>>>>      break;
>>> in particular this should keep the outer 'bbi' valid to re-use.
>>>
>>> But again, you want 'continue;' the greedy walk to other edges.
>>> If that's not valid (I'd need to think about this) then with
>>> your patch whether we process an edge depends on the order
>>> of the edge visit so you'd have to walk successors twice,
>>> once to determine whether we can greedily walk any of it
>>> and once to actually do the greedy walk.
>>>
>>> So thinking about it an exit edge is like a not returning call
>>> and thus we indeed should not process any outgoing edges of
>>> this block.
>>>
>>>> +
>>>>           /* Or we enter a possibly non-finite loop.  */
>>>>           if (flow_loop_nested_p (bb->loop_father,
>>>>              e->dest->loop_father)
>>>>        && ! finite_loop_p (e->dest->loop_father))
>>>>      break;
>>> I think this is no longer necessary?  In any case it would
>>> again be 'continue;', see also above.
>>>
>>> I'm missing skipping of backedges in the walk.
>>>
>>>> +
>>>> +	      if (bbd[e->dest->index] == 1)
>>>> +	      {
>>>> +		bitmap_set_bit (work_set, e->dest->index);
>>>> +		bbd[e->dest->index]--;
>>>> +	      }
>>>> +	      else if (bbd[e->dest->index] > 1)
>>>> +		bbd[e->dest->index]--;
>>> maybe just
>>>
>>>               bbd[e->dest->index]--;
>>>               if (bbd[e->dest->index] == 0)
>>>                 bitmap_set_bit (work_set, e->dest->index);
>> They are not equivalent.  The former won't decrement when
>> bbd[e->dest->index] is 0, but the latter will.
> we should never be visiting in this case, so something indeed doesn't
> work as I expected.

I've verified this won't happen once bb_incoming_edges is duplicated
in fill_always_executed_in_1, so the shorter form could replace it.
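For the record, a tiny standalone illustration of the difference (made-up
helper names, not GCC code); both variants agree whenever the counter is
positive on entry and only diverge on a stale zero:

```cpp
// Patch form: never touches a counter that is already 0.
// Returns true when the successor should be added to the worklist.
inline bool update_guarded (int &c)
{
  if (c == 1)
    {
      --c;
      return true;
    }
  else if (c > 1)
    --c;
  return false;
}

// Reviewer's shorter form: decrements unconditionally, so a stale 0
// (left over from a previous loop's walk) is driven negative.
inline bool update_plain (int &c)
{
  --c;
  return c == 0;
}
```

Once bb_incoming_edges is duplicated per loop, a visited successor always
has a positive counter, so the shorter form becomes safe to use.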

> 
>>>>    	    }
>>>> +
>>>>       if (e)
>>>>         break;
>>> continue; processing the worklist
>>>
>>>>    
>>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop,
>>>> @@ sbitmap contains_call)
>>>>          to disprove this if possible).  */
>>>>       if (bb->flags & BB_IRREDUCIBLE_LOOP)
>>>>         break;
>>> continue; processing the worklist.  I think this has to
>>> come early, before processing successors, next to
>>> the contains_call processing.  We could think of ignoring
>>> entering of irreducible regions by tracking exits from them,
>>> but I'm not sure we're identifying the regions in a good
>>> enough way to allow this.
>>>
>>> Likewise possibly infinite inner loops can be processed
>>> when we track exits from those which would be sth like
>>>
>>>       /* When this is an exit from an inner loop that we cannot
>>>          prove as finite do not follow this edge.  */
>>>       if (bb->loop_father != loop
>>>           && loop_exit_edge_p (bb->loop_father, e)
>>>           && ! finite_loop_p (bb->loop_father))
>>>         continue;
>>>
>>>> -
>>>> -	  if (!flow_bb_inside_loop_p (inn_loop, bb))
>>>> -	    break;
>>>> -
>>>> -	  if (bb->loop_father->header == bb)
>>>> -	    {
>>>> -	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>> -		break;
>>>> -
>>>> -	      /* In a loop that is always entered we may proceed anyway.
>>>> -		 But record that we entered it and stop once we leave it.  */
>>>> -	      inn_loop = bb->loop_father;
>>>> -	    }
>>>> -	}
>>>> -
>>>> -      while (1)
>>>> -	{
>>>> -	  SET_ALWAYS_EXECUTED_IN (last, loop);
>>>> -	  if (last == loop->header)
>>>> -	    break;
>>>> -	  last = get_immediate_dominator (CDI_DOMINATORS, last);
>>>>    	}
>>>> -
>>>> -      free (bbs);
>>>> +      free (bbd);
>>>>        }
>>>>    
>>>>      for (loop = loop->inner; loop; loop = loop->next)
>>>> -    fill_always_executed_in_1 (loop, contains_call);
>>>> +    fill_always_executed_in_1 (loop, contains_call, bbi);
>>>>    }
>>>>    
>>>>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
>>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
>>>>     bitmap_set_bit (contains_call, bb->index);
>>>>        }
>>>>    +  mark_dfs_back_edges ();
>>> Please put this to loop_invariant_motion_in_fun right before
>>> the call to fill_always_executed_in.
>> Done.
>>
>>>> +  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
>>>> +  int *bb_incoming_edges = XNEWVEC (int, array_size);
>>> There's XCNEWVEC that also zeros the array.
>> OK, thanks.
>>
>>>> +  memset (bb_incoming_edges, 0, array_size * sizeof (int));
>>>> +  edge_iterator ei;
>>>> +  edge e;
>>>> +
>>>> +  FOR_EACH_BB_FN (bb, cfun)
>>>> +    {
>>>> +      int len = bb->preds->length ();
>>>> +      FOR_EACH_EDGE (e, ei, bb->preds)
>>>> +	if (e->flags & EDGE_DFS_BACK)
>>>> +	  len--;
>>>> +      bb_incoming_edges[bb->index] = len;
>>>> +    }
>>> it would be nice to combine this with the preceding loop over
>>> all blocks.
>> Done.
>>
>>>> +  static unsigned long diff = 0;
>>>> +  struct timeval tv;
>>>> +  gettimeofday (&tv, NULL);
>>>> +  unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
>>>> +
>>>> +  //for (loop : loops_list (cfun, 0))
>>>>      for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
>>>> -    fill_always_executed_in_1 (loop, contains_call);
>>>> +    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
>>>> +
>>>> +  gettimeofday (&tv, NULL);
>>>> +  unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
>>>> +  diff += end - begin;
>>>> +  //printf ("%s,diff:%ld\n", current_function_name (), diff);
>>>> +  free (bb_incoming_edges);
>>> Eh, looks very much like work in progress.  From the O() complexity
>>> point duplicating bb_incoming_edges for each loop makes it quadratic
>>> again but as said I think we don't need this duplication.
>> Sorry, I didn't quite follow your comments above about reusing
>> bb_incoming_edges.  We search the bbs that dominate loop->latch loop by
>> loop, independently; if the outer loop's walk changes the inner loops'
>> bb_incoming_edges, won't it make the inner loops' ALWAYS_EXECUTED
>> computation incorrect?
> We're only supposed to greedily walk into always executed blocks.  I see
> this probably breaks down for conditionally executed inner loops like
> 
>    for ()
>     {
>      if ()
>        for ();
>      A;
>     }
> 
> where we want to process the inner loop separately but we still want
> to make sure A is marked as always executed in the outer loop.

Yes, the problem is that the ALWAYS_EXECUTED blocks are NOT contiguous in
the CFG, so we have to check every bb and maintain bb_incoming_edges in
each loop.  If any block jumps out to an outer loop, we have to 'break'
instead of 'continue', as the following bbs are no longer ALWAYS_EXECUTED.

> 
>> It seems that duplicating bb_incoming_edges doesn't increase the
>> complexity, but the recursive call does?
> Since the bb_incoming_edges array has a size of N and we copy it
> number_of_loop times (which is O(N) bound) the copying effectively
> has O(N^2) cost.  So yes, it's the iteration over all loops,
> whether by recursion or iteration.


Actually, the copying is outside of the greedy CFG walk loop.  I'd guess the
library memcpy of bb_incoming_edges gets some optimization?  Though that may
only help when most of the values are the same, as with copy_by_pieces.

And I modified the time-consuming case with many bbs to simulate extreme
usage, and didn't observe quadratic behavior from it: build time increases
almost linearly for both the base and the V3 patch, while the V3 patch
generates a more accurate ALWAYS_EXECUTED result.  The function ratios in
the perf profile also look reasonable.  At least, it looks like an
improvement compared with the base code.

A function with 10K bbs already costs more than 8 minutes to build, so it
seems duplicating a bb_incoming_edges array much smaller than
1000*1000*1000 entries won't be a performance bottleneck?  (Even with most
values differing, the copying would only really slow things down once the
bb_incoming_edges array gets as large as 1000*1000*1000 entries; if so,
that's a really crazy automatic program generator...)
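The back-of-the-envelope model behind Richard's O(N^2) concern (a sketch;
copy_cost is a made-up name): the greedy walk itself is linear, but one
N-entry memcpy per loop adds number_of_loops * N work, which turns
quadratic only when the loop count scales with N:

```cpp
// Int moves spent purely on duplicating the N-entry bb_incoming_edges
// array once per loop; the walks themselves are not counted here.
inline long long copy_cost (long long n_blocks, long long n_loops)
{
  return n_blocks * n_loops;
}
```

At 10K blocks and ~1K loops this is on the order of ten million int moves,
small next to the 8-minute build above; it only dominates for
pathologically large, loop-dense functions.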

Base:
number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
7850+15615                 123.99                  13.52 ( 11%)   2.16 ( 52%)  15.68 ( 11%)    60M ( 21%)
15338+30591                208.98                  24.55 ( 12%)   4.04 ( 52%)  28.57 ( 12%)   121M ( 21%)
30314+60543                511.54                  48.96 ( 10%)   7.70 ( 52%)  56.90 ( 10%)   241M ( 21%)

V3 patch:
number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
7850+15615                 121.18                  12.26 ( 10%)   1.99 ( 49%)  14.42 ( 11%)    60M ( 21%)
15338+30591                204.99                  24.72 ( 12%)   3.98 ( 52%)  28.85 ( 13%)   121M ( 21%)
30314+60543                500.76                  48.26 ( 10%)   8.27 ( 51%)  57.12 ( 11%)   241M ( 21%)


Samples: 455K of event 'cycles', Event count (approx.): 426772700334
Overhead  Command  Shared Object     Symbol
   7.19%  cc1      cc1               [.] bitmap_set_bit
   5.86%  cc1      cc1               [.] get_ref_base_and_extent
   4.08%  cc1      cc1               [.] get_continuation_for_phi
   3.48%  cc1      cc1               [.] hash_table<vn_ssa_aux_hasher, false, xcallocator>::find
   3.41%  cc1      cc1               [.] dominated_by_p
   3.28%  cc1      cc1               [.] compute_transp
   2.98%  cc1      cc1               [.] bitmap_set_range
   2.60%  cc1      cc1               [.] bitmap_list_insert_element_after
   2.44%  cc1      cc1               [.] stmt_may_clobber_ref_p_1
   2.27%  cc1      cc1               [.] refs_may_alias_p_2
   2.26%  cc1      cc1               [.] hash_table<vn_reference_hasher, false, xcallocator>::fi
   2.26%  cc1      cc1               [.] df_rd_transfer_function
   2.07%  cc1      cc1               [.] bitmap_ior_into
   1.84%  cc1      cc1               [.] find_base_term
   1.82%  cc1      cc1               [.] get_alias_set
   1.79%  cc1      cc1               [.] tree_strip_nop_conversions
   1.73%  cc1      cc1               [.] dfs_enumerate_from
   1.73%  cc1      cc1               [.] reference_alias_ptr_type_1
   1.53%  cc1      cc1               [.] alias_sets_conflict_p
   1.30%  cc1      cc1               [.] c_get_alias_set
   1.16%  cc1      cc1               [.] bitmap_and_into
   1.12%  cc1      cc1               [.] indirect_ref_may_alias_decl_p
   1.04%  cc1      cc1               [.] et_splay
   0.99%  cc1      libc-2.27.so      [.] __memset_power8
   0.95%  cc1      cc1               [.] bitmap_ior_and_compl
   0.86%  cc1      cc1               [.] bitmap_copy
   0.78%  cc1      cc1               [.] ao_ref_from_mem

> 
> I wonder if we can more cleverly handle this by picking up
> conditionally executed inner loops on the fly for example
> by recording a 'outermost unconditionally executed loop'
> for each loop header we visit and using that like
> SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
> 
> So make this a single greedy CFG walk starting at the outermost
> loops.  Loop exits will need to be processed specially then.

I'm not sure: do you mean we only check loop headers and exits?  As said
above, the ALWAYS_EXECUTED block chains are not contiguous; there may be
blocks in the middle of the loop that are ALWAYS_EXECUTED, or that jump out
of the loop.  But would there be only two chains?  One from loop->header,
and another from loop->exit->src, for loops with a single exit?

-- 
Thanks,
Xionghu
-------------- next part --------------

;; Function foo (foo, funcdef_no=0, decl_uid=3306, cgraph_uid=1, symbol_order=0)

Created preheader block for loop 1
Created preheader block for loop 2
;; 3 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 10 11 3 4 5 6 13 7 12 8 9
;;
;; Loop 1
;;  header 8, latch 12
;;  depth 1, outer 0
;;  nodes: 8 12 7 6 4 5 3 13 11
;;
;; Loop 2
;;  header 3, latch 13
;;  depth 2, outer 1
;;  nodes: 3 13 6 4 5
;; 2 succs { 10 9 }
;; 10 succs { 8 }
;; 11 succs { 3 }
;; 3 succs { 4 5 }
;; 4 succs { 6 }
;; 5 succs { 6 }
;; 6 succs { 13 7 }
;; 13 succs { 3 }
;; 7 succs { 12 9 }
;; 12 succs { 8 }
;; 8 succs { 11 7 }
;; 9 succs { 1 }
Memory reference 1: x_17(D)->j
Memory reference 2: MEM[(int *)x_17(D) + 8B]
always executed: bb->index:8, loop->num: 1
Found loop 2 to be finite: upper bound found.
always executed: bb->index:7, loop->num: 1
always executed: bb->index:3, loop->num: 2
always executed: bb->index:6, loop->num: 2
Basic block 8 (loop 1 -- depth 1):

if (n_16(D) > 0)
  invariant up to level 1, cost 20.

Basic block 11 (loop 1 -- depth 1):

Basic block 3 (loop 2 -- depth 2):

if (m_21(D) != 0)
  invariant up to level 1, cost 20.

Basic block 5 (loop 2 -- depth 2):

_4 = l_15(D) + n_16(D);
  invariant up to level 1, cost 1.

Basic block 4 (loop 2 -- depth 2):

Basic block 6 (loop 2 -- depth 2):

Basic block 7 (loop 1 -- depth 1):

Basic block 12 (loop 1 -- depth 1):

Basic block 13 (loop 2 -- depth 2):

Querying dependency of refs 2 and 1: independent.
Querying RAW dependencies of ref 2 in loop 2: independent
Querying RAW dependencies of ref 2 in loop 1: independent
Querying dependency of refs 2 and 1: independent.
Querying SM WAR dependencies of ref 2 in loop 2: independent
Querying SM WAR dependencies of ref 2 in loop 1: independent
Executing store motion of MEM[(int *)x_17(D) + 8B] from loop 1
Querying RAW dependencies of ref 1 in loop 2: independent
Querying SM WAR dependencies of ref 1 in loop 2: independent
Executing store motion of MEM[(int *)x_17(D) + 4B] from loop 2
Moving statement
x__lsm.5 = MEM[(int *)x_17(D) + 4B];
(cost 0) out of loop 2.

Moving statement
x__lsm.4 = MEM[(int *)x_17(D) + 8B];
(cost 0) out of loop 1.


Updating SSA:
creating PHI node in block #8 for x__lsm.4
creating PHI node in block #3 for x__lsm.5
Registering new PHI nodes in block #0
Registering new PHI nodes in block #2
Registering new PHI nodes in block #10
Updating SSA information for statement x__lsm.4 = MEM[(int *)x_17(D) + 8B];
Registering new PHI nodes in block #8
Registering new PHI nodes in block #11
Updating SSA information for statement x__lsm.5 = MEM[(int *)x_17(D) + 4B];
Registering new PHI nodes in block #3
Registering new PHI nodes in block #5
Registering new PHI nodes in block #4
Updating SSA information for statement _1 = x__lsm.5;
Registering new PHI nodes in block #6
Updating SSA information for statement x__lsm.5 = cstore_28;
Updating SSA information for statement tem_24 = x__lsm.5;
Updating SSA information for statement x__lsm.5 = _6;
Registering new PHI nodes in block #15
Updating SSA information for statement MEM[(int *)x_17(D) + 4B] = x__lsm.5;
Registering new PHI nodes in block #13
Registering new PHI nodes in block #7
Updating SSA information for statement tem2_18 = x__lsm.4;
Updating SSA information for statement x__lsm.4 = _8;
Registering new PHI nodes in block #14
Updating SSA information for statement MEM[(int *)x_17(D) + 8B] = x__lsm.4;
Registering new PHI nodes in block #12
Registering new PHI nodes in block #9
Updating SSA information for statement return;

Symbols to be put in SSA form
{ D.3326 D.3331 D.3332 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 16
Number of blocks to update: 15 ( 94%)
Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15


;; Created LCSSA PHI: x__lsm.5_36 = PHI <x__lsm.5_10(6)>
;; Created LCSSA PHI: x__lsm.4_37 = PHI <x__lsm.4_33(7)>

Updating SSA:
Registering new PHI nodes in block #8
Registering new PHI nodes in block #11
Registering new PHI nodes in block #3
Registering new PHI nodes in block #5
Registering new PHI nodes in block #4
Registering new PHI nodes in block #6
Updating SSA information for statement x__lsm.5_10 = _6;
Registering new PHI nodes in block #15
Updating SSA information for statement MEM[(int *)x_17(D) + 4B] = x__lsm.5_10;
Registering new PHI nodes in block #13
Registering new PHI nodes in block #7
Updating SSA information for statement x__lsm.4_33 = _8;
Registering new PHI nodes in block #14
Updating SSA information for statement MEM[(int *)x_17(D) + 8B] = x__lsm.4_33;
Registering new PHI nodes in block #12

SSA replacement table
N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j

x__lsm.5_36 -> { x__lsm.5_10 }
x__lsm.4_37 -> { x__lsm.4_33 }
Incremental SSA update started at block: 8
Number of blocks in CFG: 16
Number of blocks to update: 8 ( 50%)
Affected blocks: 3 6 7 8 12 13 14 15


void foo (struct X * x, int n, int l, int m)
{
  int x__lsm.5;
  int x__lsm.4;
  int tem;
  int i;
  int tem2;
  int j;
  int _1;
  int _2;
  int _4;
  int _5;
  int _6;
  int _7;
  int _8;
  int cstore_28;

  <bb 2> [local count: 14598063]:
  if (l_15(D) > 0)
    goto <bb 10>; [89.00%]
  else
    goto <bb 9>; [11.00%]

  <bb 10> [local count: 12992276]:
  x__lsm.4_13 = MEM[(int *)x_17(D) + 8B];
  goto <bb 8>; [100.00%]

  <bb 11> [local count: 105119324]:
  x__lsm.5_9 = MEM[(int *)x_17(D) + 4B];

  <bb 3> [local count: 955630225]:
  # i_30 = PHI <i_26(13), 0(11)>
  # x__lsm.5_32 = PHI <x__lsm.5_10(13), x__lsm.5_9(11)>
  if (m_21(D) != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 477815112]:
  _1 = x__lsm.5_32;
  _2 = _1 + 1;
  goto <bb 6>; [100.00%]

  <bb 5> [local count: 477815112]:
  _4 = l_15(D) + n_16(D);

  <bb 6> [local count: 955630225]:
  # cstore_28 = PHI <_2(4), _4(5)>
  x__lsm.5_12 = cstore_28;
  tem_24 = x__lsm.5_12;
  _5 = tem_24 * i_30;
  _6 = _5 + tem_24;
  x__lsm.5_10 = _6;
  i_26 = i_30 + 1;
  if (n_16(D) > i_26)
    goto <bb 13>; [89.00%]
  else
    goto <bb 15>; [11.00%]

  <bb 13> [local count: 850510901]:
  goto <bb 3>; [100.00%]

  <bb 15> [local count: 105119324]:
  # x__lsm.5_36 = PHI <x__lsm.5_10(6)>
  MEM[(int *)x_17(D) + 4B] = x__lsm.5_36;

  <bb 7> [local count: 118111600]:
  tem2_18 = x__lsm.4_11;
  _7 = tem2_18 * j_31;
  _8 = _7 + tem2_18;
  x__lsm.4_33 = _8;
  j_20 = j_31 + 1;
  if (l_15(D) > j_20)
    goto <bb 12>; [89.00%]
  else
    goto <bb 14>; [11.00%]

  <bb 12> [local count: 105119324]:

  <bb 8> [local count: 118111600]:
  # j_31 = PHI <j_20(12), 0(10)>
  # x__lsm.4_11 = PHI <x__lsm.4_33(12), x__lsm.4_13(10)>
  if (n_16(D) > 0)
    goto <bb 11>; [89.00%]
  else
    goto <bb 7>; [11.00%]

  <bb 14> [local count: 12992276]:
  # x__lsm.4_37 = PHI <x__lsm.4_33(7)>
  MEM[(int *)x_17(D) + 8B] = x__lsm.4_37;

  <bb 9> [local count: 14598063]:
  return;

}



