trivial RTL fast_dce speedup

Richard Guenther rguenther@suse.de
Fri Jun 4 11:46:00 GMT 2010


On Fri, 4 Jun 2010, Jan Hubicka wrote:

> Hi,
> with my local patch to partially inline bitmap_bit_p and bitmap_set
> so one can see better where the time is going (they are good on collecting
> cache misses right now) I get the following profile on cc1 build with LTO
> 
> 54122     3.7557  lto1                     htab_find_slot_with_hash
> 31179     2.1636  lto1                     bitmap_ior_into
> 20588     1.4287  lto1                     fast_dce.463190.constprop.674
> 20432     1.4178  lto1                     bitmap_set_bit_1
> 19827     1.3759  lto1                     df_note_compute.40936
> 18915     1.3126  lto1                     bitmap_ior_and_compl
> 18269     1.2677  lto1                     df_worklist_dataflow
> 16676     1.1572  lto1                     bitmap_copy
> 16560     1.1492  lto1                     bitmap_and
> 14233     0.9877  lto1                     bitmap_bit_p_1
> 13698     0.9506  lto1                     bitmap_and_into
> 12530     0.8695  lto1                     htab_find_with_hash
> 12373     0.8586  lto1                     record_reg_classes.102576.constprop.811.3366
> 12307     0.8540  lto1                     ggc_alloc_stat
> 11962     0.8301  lto1                     bitmap_clear_bit
> 10863     0.7538  lto1                     vt_find_locations.357140.3452
> 10762     0.7468  lto1                     execute_function_todo.149789.4241
> 10495     0.7283  lto1                     sorted_array_from_bitmap_set.306751.4292
> 10307     0.7152  lto1                     walk_tree_1.constprop.778
> 9854      0.6838  lto1                     lto_input_tree
> 
> The following patch reduce fast_dce.463190.constprop.674 contribution
> by about 1/3rd
> 
> 17105     1.2949  bitmap_ior_and_compl
> 16745     1.2676  df_worklist_dataflow
> 14965     1.1329  bitmap_copy
> 14910     1.1287  bitmap_and
> 14352     1.0865  fast_dce.463190.constprop.674
> 13234     1.0018  bitmap_bit_p_1
> 12599     0.9538  bitmap_and_into
> 12022     0.9101  htab_find_with_hash
> 
> The change is simple, we do not scan trivially needed instructions if they are needed
> because of liveness and avoid multiple acceses to marked array.
> Currently WHOPR cc1 build is 1m46 of real time and 10m57 of cpu time for me.
> 
> Without bitmap inlining patch (included for anyone who is interested) the time
> is dominated by bitmap_bit_p and bitmap_set.  I think it is mostly cache
> mispredict since all the time goes in both profiles to first header check.
> Will try to cure some of those by aviding some indirection in df datastructures
> by embedding bitmap_header structures so we get similar layout to what former
> flow.c had.
> 
> Also the large time spent in fast_dce seems bogus.  There can be two reasons
> for this.  First the algorihm is not really fast - it is a dataflow algorithm
> that in each iteration invoke dataflow, so its time is D(CFG)*O(dataflow solver
> of CFG).  THere is DU/UD based DCE too that does not have this wrost case
> issue.  Yet this DCE is executed only post reload and I would not expect too
> many loop carried dead regs to be there, so it should work fast.  THe reason
> for overly large time spent there might be also invocation from crossjumping.
> Will experiment with this later today.
> 
> Bootstrapped/regtested x86_64-linux, OK?

Ok.

Thanks,
Richard.

> Honza
> 
> 	* dce.c (dce_process_block): Do not re-scan already marked instructions.
> 
> Index: dce.c
> ===================================================================
> --- dce.c	(revision 160215)
> +++ dce.c	(working copy)
> @@ -904,27 +904,26 @@
>    FOR_BB_INSNS_REVERSE (bb, insn)
>      if (INSN_P (insn))
>        {
> -	bool needed = false;
> +	bool needed = marked_insn_p (insn);
>  
>  	/* The insn is needed if there is someone who uses the output.  */
> -	for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
> -	  if (bitmap_bit_p (local_live, DF_REF_REGNO (*def_rec))
> -	      || bitmap_bit_p (au, DF_REF_REGNO (*def_rec)))
> -	    {
> -	      needed = true;
> -	      break;
> -	    }
> +	if (!needed)
> +	  for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
> +	    if (bitmap_bit_p (local_live, DF_REF_REGNO (*def_rec))
> +		|| bitmap_bit_p (au, DF_REF_REGNO (*def_rec)))
> +	      {
> +		needed = true;
> +		mark_insn (insn, true);
> +		break;
> +	      }
>  
> -	if (needed)
> -	  mark_insn (insn, true);
> -
>  	/* No matter if the instruction is needed or not, we remove
>  	   any regno in the defs from the live set.  */
>  	df_simulate_defs (insn, local_live);
>  
>  	/* On the other hand, we do not allow the dead uses to set
>  	   anything in local_live.  */
> -	if (marked_insn_p (insn))
> +	if (needed)
>  	  df_simulate_uses (insn, local_live);
>        }
>  
> 
> 

-- 
Richard Guenther <rguenther@suse.de>
Novell / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 - GF: Markus Rex



More information about the Gcc-patches mailing list