[RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

Mon Nov 16 17:57:00 GMT 2015

Sorry I missed out some of the points in earlier mail which is given below.

-----Original Message-----
From: Ajit Kumar Agarwal 
Sent: Monday, November 16, 2015 11:07 PM
To: 'Jeff Law'; GCC Patches
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

-----Original Message-----
From: Jeff Law [mailto:law@redhat.com]
Sent: Friday, November 13, 2015 11:44 AM
To: Ajit Kumar Agarwal; GCC Patches
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

On 10/07/2015 10:32 PM, Ajit Kumar Agarwal wrote:

>
> 0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch
>
>
>  From f164fd80953f3cffd96a492c8424c83290cd43cc Mon Sep 17 00:00:00
> 2001
> From: Ajit Kumar Agarwal<ajitkum@xilix.com>
> Date: Wed, 7 Oct 2015 20:50:40 +0200
> Subject: [PATCH] [RFC, Patch]: Optimized changes in the register used inside
>   loop for LICM and IVOPTS.
>
> Changes are done in the Loop Invariant(LICM) at RTL level and also the 
> Induction variable optimization based on SSA representation. The 
> current logic used in LICM for register used inside the loops is 
> changed. The Live Out of the loop latch node and the Live in of the 
> destination of the exit nodes is used to set the Loops Liveness at the exit of the Loop.
> The register used is the number of live variables at the exit of the 
> Loop calculated above.
>
> For Induction variable optimization on tree SSA representation, the 
> register used logic is based on the number of phi nodes at the loop 
> header to represent the liveness at the loop.  Current Logic used only 
> the number of phi nodes at the loop header.  Changes are made to 
> represent the phi operands also live at the loop. Thus number of phi 
> operands also gets incremented in the number of registers used.
>
> ChangeLog:
> 2015-10-09  Ajit Agarwal<ajitkum@xilinx.com>
>
> 	* loop-invariant.c (compute_loop_liveness): New.
> 	(determine_regs_used): New.
> 	(find_invariants_to_move): Use of determine_regs_used.
> 	* tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
> 	arguments for register used.
>>I think Bin rejected the tree-ssa-loop-ivopts change.  However, the loop-invariant change is still pending, right?

>
> Signed-off-by:Ajit Agarwalajitkum@xilinx.com
> ---
>   gcc/loop-invariant.c       | 72 +++++++++++++++++++++++++++++++++++++---------
>   gcc/tree-ssa-loop-ivopts.c |  4 +--
>   2 files changed, 60 insertions(+), 16 deletions(-)
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index 
> 52c8ae8..e4291c9 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1413,6 +1413,19 @@ set_move_mark (unsigned invno, int gain)
>       }
>   }
>
> +static int
> +determine_regs_used()
> +{
> +  unsigned int j;
> +  unsigned int reg_used = 2;
> +  bitmap_iterator bi;
> +
> +  EXECUTE_IF_SET_IN_BITMAP (&LOOP_DATA (curr_loop)->regs_live, 0, j, bi)
> +    (reg_used) ++;
> +
> +  return reg_used;
> +}
>>Isn't this just bitmap_count_bits (regs_live) + 2?

> @@ -2055,9 +2057,43 @@ calculate_loop_reg_pressure (void)
>       }
>   }
>
> -
> +static void
> +calculate_loop_liveness (void)
>>Needs a function comment.

I will incorporate the above comments.
> +{
> +  basic_block bb;
> +  struct loop *loop;
>
> -/* Move the invariants out of the loops.  */
> +  FOR_EACH_LOOP (loop, 0)
> +    if (loop->aux == NULL)
> +      {
> +        loop->aux = xcalloc (1, sizeof (struct loop_data));
> +        bitmap_initialize (&LOOP_DATA (loop)->regs_live, &reg_obstack);
> +     }
> +
> +  FOR_EACH_BB_FN (bb, cfun)
>>Why loop over blocks here?  Why not just iterate through all the loops 
>>in the loop structure.  Order isn't particularly important AFAICT for 
>>this code.

Iterating over the Loop structure is enough. We don't need iterating over the basic blocks.

> +       {
> +         int  i;
> +         edge e;
> +         vec<edge> edges;
> +         edges = get_loop_exit_edges (loop);
> +         FOR_EACH_VEC_ELT (edges, i, e)
> +         {
> +           bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_OUT(e->src));
> +           bitmap_ior_into (&LOOP_DATA (loop)->regs_live, 
> + DF_LR_IN(e->dest));
>>Space before the open-paren in the previous two lines DF_LR_OUT 
>>(e->src) and FD_LR_INT (e->dest))

I will incorporate this.

> +         }
> +      }
> +  }
> +}
> +
> +/* Move the invariants  ut of the loops.  */
>>Looks like you introduced a typo.

>>I'd like to see testcases which show the change in # regs used 
>>computation helping generate better code.

We need to measure the test case with the scenario where the new variable created for loop invariant increases the register pressure 
and the cost with respect to reg_used and new_regs increases that lead to spill and fetch and drop the invariant movement.

Getting that environment in the test case seems to be little difficult. But I will try to identify the testcases extracted from the Benchmarks 
 where we got the gains by the above method.

>>And  I'd also like to see some background information on why you think 
>>this is a more accurate measure for the number of registers used in 
>>the loop.  regs_used AFAICT is supposed to be an estimate of the 
>>registers live around the loop.  So ISTM that you get that value by 
>>live-out set on the backedge of the loop.  I guess you get somethign 
>>similar by looking at the exit edge's source block's live-out set.  
>>But I don't see any value  in including stuff live at the block outside the loop.

>>It also seems fairly non-intuitive.  Get the block's latch and use its 
>>live-out set.  That seems more intuitive.

We are interested in registers pressure that arises with code motion optimization like Loop Invariant. The register pressure is based out 
of the number of interfering live ranges at a given point. If that exceeds the number of available registers, there may be chance of spilling 
that live range. Again the spilling is based on the coloring scheme where the Briggs Allocator place such live ranges into the stack during 
simplification phase instead of just spilling. This might enables the chances of getting the register and reduces the spill and fetch.

Again some of the coloring scheme choose the live range that has high register pressure at the highest priority for coloring and removes 
Such nodes from interference graph and place it on the stack at the highest priority. This may give a better changes of covering the Interference 
graph colorable.

In the region based allocator the loops forms a region and accordingly the allocno are assigned accumulating the live range info from inner loops to outerloops.
The number of liveness and the register pressure contributes significantly  for the loops forming the regions. 

Based on the above background and considering all the above points the number of liveness of the loop is a better approximation of the register pressure and
 should be considered for the register used to do the cost analysis.

Getting the loop exit edges and the source of the exit edge is the Loop latch node. This will take care of Live out of the latch node.
Moreover adding the Live-in of the destination edge of the loop exit edges will add  the liveness after the loops.  I also tried using only liveout of the loop 
latch nodes but did not see much gains using only  liveout of the loop latch node.

Thanks & Regards
Ajit