This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Ping: IRA-based register pressure calculation for RTL loop invariant motion


Richard Guenther wrote:
On Sat, Oct 17, 2009 at 5:34 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
Richard Guenther wrote:
On Wed, Oct 14, 2009 at 6:27 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:

Zdenek Dvorak wrote:

Hi,


+      if (i < ira_reg_class_cover_size)
+       size_cost = comp_cost + 10;
+      else
+       size_cost = 0;


Including comp_cost in size_cost makes no sense (this would prevent us
from moving even very costly invariants out of the loop if we run out
of registers).



That is exactly what I intended.  As I wrote above, I tried many
heuristics with different parameters that decide whether to move a
loop invariant depending on spill cost and loop invariant cost.  But
they do not work well, at least for x86/x86_64 and power6.  I have
some speculation about why.  x86/x86_64 processors are out-of-order
(OOO) these days, and the cost of an expensive invariant is hidden
because the invariant usually has a lot of freedom to be executed out
of order.  For power6, the long latency is hidden by insn scheduling.
It is hard for me to find a processor where it would be important.
Another reason is that it is very hard to evaluate spill cost
accurately at this stage.  So I decided not to use a combination of
register pressure and invariant cost in my approach.

Could you please add this reasoning to the comment?  Another reason why
preventing the invariant motion does not hurt might be that all
expensive invariants were already moved out of the loop by PRE and the
gimple invariant motion pass.
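
A minimal sketch of the effect being discussed.  It assumes the gain of
moving an invariant is computed as comp_cost minus size_cost; that
formula and the function names invariant_size_cost and invariant_gain
are assumptions for illustration, not taken from the patch.  Only the
"comp_cost + 10" / "0" choice comes from the quoted hunk.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the quoted hunk.
   comp_cost:   estimated cost of computing the invariant inside the loop.
   out_of_regs: nonzero if moving the invariant would exceed the
                registers available in some cover class.  */
int
invariant_size_cost (int comp_cost, int out_of_regs)
{
  assert (comp_cost >= 0);
  return out_of_regs ? comp_cost + 10 : 0;
}

/* Assumption: gain = cost saved by hoisting - cost of the extra
   register pressure.  A non-positive gain means "do not move".  */
int
invariant_gain (int comp_cost, int out_of_regs)
{
  return comp_cost - invariant_size_cost (comp_cost, out_of_regs);
}
```

Under these assumptions the gain is -10 whenever registers run out, no
matter how large comp_cost is, which is the behaviour Zdenek points out
and Vladimir says is intended.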


+      for (i = 0; i < ira_reg_class_cover_size; i++)
+       {
+         cover_class = ira_reg_class_cover[i];
+         if ((int) new_regs[cover_class]
+             + (int) regs_needed[cover_class]
+             + LOOP_DATA (curr_loop)->max_reg_pressure[cover_class]
+             + IRA_LOOP_RESERVED_REGS
+             - ira_available_class_regs[cover_class] > 0)
+           break;
+       }

It might be clearer to write this as
... > ira_available_class_regs[cover_class] instead of
... - ira_available_class_regs[cover_class] > 0.  Otherwise, the patch
is OK.
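
The suggested rewrite can be sketched in a self-contained form as
follows.  The array parameters, N_CLASSES, the function name
first_overflowing_class and the value chosen for RESERVED_REGS are
hypothetical stand-ins for the GCC data structures in the hunk; this is
an illustration of the comparison, not the committed code.

```c
#include <assert.h>

#define N_CLASSES 2      /* hypothetical number of cover classes */
#define RESERVED_REGS 2  /* stands in for IRA_LOOP_RESERVED_REGS; value assumed */

/* Return the index of the first class whose register demand would
   exceed the available registers, or N_CLASSES if every class fits.  */
int
first_overflowing_class (const int new_regs[N_CLASSES],
                         const int regs_needed[N_CLASSES],
                         const int max_pressure[N_CLASSES],
                         const int available[N_CLASSES])
{
  int i;

  for (i = 0; i < N_CLASSES; i++)
    {
      int demand = new_regs[i] + regs_needed[i]
                   + max_pressure[i] + RESERVED_REGS;

      /* Both spellings of the condition agree for these small values.  */
      assert ((demand > available[i]) == (demand - available[i] > 0));

      if (demand > available[i])
        break;
    }
  return i;
}
```

A result smaller than N_CLASSES corresponds to the
"i < ira_reg_class_cover_size" test in the first hunk: some class ran
out of registers.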


Zdenek, thanks for the additional comments.  I incorporated them into the
patch just before committing.  Here is the affected patch part:

I think this consistently regressed both compile-time and runtime for
Polyhedron on x86_64.  For Itanium the story isn't clear, but effects
are seen there as well (it's also the only target where I see
above-noise effects on SPEC 2000 - significant ups and downs).


 Yes, it is an expensive optimization (at least 3 additional passes
through RTL insns: one for calculating register pressure and two very
expensive passes for finding register classes for pseudos).  It is
clearly seen in the SPEC compilation time graphs at

http://vmakarov.fedorapeople.org/spec

for the last 2 benchmarking runs.  Therefore I proposed it only for -O3.

Overall SPEC2000 scores are practically the same on x86/x86_64.

As for the Polyhedron benchmarks, here are my results on Core i7:

first:  -ffast-math -funroll-loops -O3 -fno-ira-loop-pressure
second: -ffast-math -funroll-loops -O3 -fira-loop-pressure

x86:
Geometric Mean Execution Time =      12.84 seconds
Geometric Mean Execution Time =      12.82 seconds

x86_64:
Geometric Mean Execution Time =       9.89 seconds
Geometric Mean Execution Time =       9.91 seconds

On power6:
first:  -mtune=power6 -ffast-math -funroll-loops -O3 -fno-ira-loop-pressure
second: -mtune=power6 -ffast-math -funroll-loops -O3 -fira-loop-pressure

Geometric Mean Execution Time =      19.22 seconds
Geometric Mean Execution Time =      19.04 seconds
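
To put the geometric means above in relative terms, here is a small
helper (the name percent_change is made up for this illustration) that
computes the percentage change from the first to the second
configuration; a negative value means -fira-loop-pressure was faster.

```c
#include <assert.h>

/* Percentage change in execution time going from `before`
   (-fno-ira-loop-pressure) to `after` (-fira-loop-pressure).
   Negative means the second configuration ran faster.  */
double
percent_change (double before, double after)
{
  assert (before > 0.0);
  return (after - before) / before * 100.0;
}
```

With the numbers reported above this gives roughly -0.16% for x86
(12.84 -> 12.82), +0.20% for x86_64 (9.89 -> 9.91) and -0.94% for
power6 (19.22 -> 19.04).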

 As I wrote earlier, the loops that benefit from this optimization are
those with register pressure lower (but not much lower) than the number
of registers.  For x86/x86_64, practically all loops have pressure
higher than the number of registers.  For such loops, evaluating
invariant cost vs. spill cost would be important.  But at this stage,
spill cost is impossible to evaluate accurately.  So using the old and
new loop invariant motion criteria on processors similar to x86/x86_64
will give different results for particular tests (some better, some
worse), but the overall score will be practically the same.

 Probably it makes no sense to use IRA-based register pressure
calculation for all targets (including x86/x86_64), but for power it is
a clear win, as seen from Polyhedron and as I reported for SPEC2000.

So we could switch it off by default for -O3. What do you think about this
solution, Richard?

I think we could switch it on by default at -O3 for a selected group of
targets.  Itanium overall also improves with the new heuristics.  That
would make it power and Itanium.

The patch is below.  Ok to commit?

Did you try restricting the heuristics to certain register classes,
like SSE registers on x86_64?

No, I did not try.  I am not sure it is worth doing.


2009-10-19 Vladimir Makarov <vmakarov@redhat.com>


* doc/invoke.texi (fira-loop-pressure): Update default value.
* opts.c (decode_options): Remove default value setting for
flag_ira_loop_pressure.
* config/ia64/ia64.c (ia64_override_options): Set
flag_ira_loop_pressure up for -O3.
* config/rs6000/rs6000.c (rs6000_override_options): Ditto.


Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 152770)
+++ doc/invoke.texi	(working copy)
@@ -5720,8 +5720,7 @@ invoking @option{-O2} on programs that u
 Optimize yet more.  @option{-O3} turns on all optimizations specified
 by @option{-O2} and also turns on the @option{-finline-functions},
 @option{-funswitch-loops}, @option{-fpredictive-commoning},
-@option{-fgcse-after-reload}, @option{-ftree-vectorize} and
-@option{-fira-loop-pressure} options.
+@option{-fgcse-after-reload} and @option{-ftree-vectorize} options.
 
 @item -O0
 @opindex O0
@@ -6222,9 +6221,10 @@ architectures with big regular register 
 @opindex fira-loop-pressure
 Use IRA to evaluate register pressure in loops for decision to move
 loop invariants.  Usage of this option usually results in generation
-of faster and smaller code but can slow compiler down.
+of faster and smaller code on machines with big register files (>= 32
+registers) but it can slow compiler down.
 
-This option is enabled at level @option{-O3}.
+This option is enabled at level @option{-O3} for some targets.
 
 @item -fno-ira-share-save-slots
 @opindex fno-ira-share-save-slots
Index: opts.c
===================================================================
--- opts.c	(revision 152770)
+++ opts.c	(working copy)
@@ -917,7 +917,6 @@ decode_options (unsigned int argc, const
   flag_ipa_cp_clone = opt3;
   if (flag_ipa_cp_clone)
     flag_ipa_cp = 1;
-  flag_ira_loop_pressure = opt3;
 
   /* Just -O1/-O0 optimizations.  */
   opt1_max = (optimize <= 1);
Index: config/ia64/ia64.c
===================================================================
--- config/ia64/ia64.c	(revision 152769)
+++ config/ia64/ia64.c	(working copy)
@@ -5496,6 +5496,14 @@ ia64_override_options (void)
   if (TARGET_AUTO_PIC)
     target_flags |= MASK_CONST_GP;
 
+  /* Numerous experiment shows that IRA based loop pressure
+     calculation works better for RTL loop invariant motion on targets
+     with enough (>= 32) registers.  It is an expensive optimization.
+     So it is on only for peak performance.  */
+  if (optimize >= 3)
+    flag_ira_loop_pressure = 1;
+
+
   ia64_flag_schedule_insns2 = flag_schedule_insns_after_reload;
   flag_schedule_insns_after_reload = 0;
 
Index: config/rs6000/rs6000.c
===================================================================
--- config/rs6000/rs6000.c	(revision 152769)
+++ config/rs6000/rs6000.c	(working copy)
@@ -2266,6 +2266,13 @@ rs6000_override_options (const char *def
 		     | MASK_POPCNTD | MASK_VSX | MASK_ISEL | MASK_NO_UPDATE)
   };
 
+  /* Numerous experiment shows that IRA based loop pressure
+     calculation works better for RTL loop invariant motion on targets
+     with enough (>= 32) registers.  It is an expensive optimization.
+     So it is on only for peak performance.  */
+  if (optimize >= 3)
+    flag_ira_loop_pressure = 1;
+
   /* Set the pointer size.  */
   if (TARGET_64BIT)
     {
