This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug tree-optimization/81554] [8 Regression] 25% performance regression in Himeno benchmark

From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Wed, 26 Jul 2017 10:20:35 +0000
Subject: [Bug tree-optimization/81554] [8 Regression] 25% performance regression in Himeno benchmark
Auto-submitted: auto-generated
References: <bug-81554-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 41833
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41833&action=edit
patch

With the attached patch we do this transform late, but it doesn't help
performance (the numbers are noisy).  Before:

 Score based on Pentium III 600MHz using Fortran 77: 19.465005
 Score based on Pentium III 600MHz using Fortran 77: 19.558720
 Score based on Pentium III 600MHz using Fortran 77: 19.546069
 Score based on Pentium III 600MHz using Fortran 77: 19.572887
 Score based on Pentium III 600MHz using Fortran 77: 19.528043
 Score based on Pentium III 600MHz using Fortran 77: 19.477979
 Score based on Pentium III 600MHz using Fortran 77: 19.534370
 Score based on Pentium III 600MHz using Fortran 77: 19.562271
 Score based on Pentium III 600MHz using Fortran 77: 19.495751
 Score based on Pentium III 600MHz using Fortran 77: 19.542132

After:

 Score based on Pentium III 600MHz using Fortran 77: 19.436746
 Score based on Pentium III 600MHz using Fortran 77: 19.510495
 Score based on Pentium III 600MHz using Fortran 77: 19.479649
 Score based on Pentium III 600MHz using Fortran 77: 19.470079
 Score based on Pentium III 600MHz using Fortran 77: 19.470537
 Score based on Pentium III 600MHz using Fortran 77: 19.539023
 Score based on Pentium III 600MHz using Fortran 77: 19.421880
 Score based on Pentium III 600MHz using Fortran 77: 19.504202
 Score based on Pentium III 600MHz using Fortran 77: 19.545846
 Score based on Pentium III 600MHz using Fortran 77: 19.571152

Either the transform is required before the loop optimizations, or
flag_wrapv pessimizes things.  I suppose some additional pass
re-shuffling would be in order, like moving the block late_gimple_start,
reassoc, strength_reduction to after vrp, phi_only_cprop so that VRP has
the chance to compute good !flag_wrapv ranges late.  That results in

 Score based on Pentium III 600MHz using Fortran 77: 19.076637
 Score based on Pentium III 600MHz using Fortran 77: 19.141776
 Score based on Pentium III 600MHz using Fortran 77: 19.078936
 Score based on Pentium III 600MHz using Fortran 77: 19.146834
 Score based on Pentium III 600MHz using Fortran 77: 19.098964
 Score based on Pentium III 600MHz using Fortran 77: 19.098782
 Score based on Pentium III 600MHz using Fortran 77: 19.127632
 Score based on Pentium III 600MHz using Fortran 77: 19.095203
 Score based on Pentium III 600MHz using Fortran 77: 19.111919
 Score based on Pentium III 600MHz using Fortran 77: 18.993788

which looks even worse ;)  (All of the above is with just -O3 on a Broadwell
system.)  I guess reassoc is necessary for DOM to do a good CSE job.  OTOH
tracer and path splitting should enable more reassoc/SLSR, so they should
run before it (but they shouldn't care about flag_wrapv).
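
For readers wondering why reassoc cares about overflow semantics at all,
here is a small sketch.  It emulates 32-bit wrapping arithmetic in Python
(the helper names wrap32, add_wrapv and overflows are made up for this
illustration and are not GCC functions).  Reassociating a + (b + c) into
(a + b) + c can introduce an intermediate overflow that the original
expression did not have; with wrapping semantics the final value is
still the same, whereas with undefined overflow the rewrite would not be
valid as is.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def wrap32(x):
    """Reduce x to a 32-bit two's-complement value (wrapping semantics)."""
    return (x - INT_MIN) % 2**32 + INT_MIN

def add_wrapv(a, b):
    """Signed 32-bit addition that wraps on overflow."""
    return wrap32(a + b)

def overflows(a, b):
    """True if a + b does not fit in a signed 32-bit integer."""
    return not (INT_MIN <= a + b <= INT_MAX)

a, b, c = INT_MAX, 1, -1

# Original association: b + c == 0, then a + 0; no step overflows.
original = add_wrapv(a, add_wrapv(b, c))

# Reassociated form: a + b overflows, but wrapping brings the final
# result back to the same value.
reassociated = add_wrapv(add_wrapv(a, b), c)

print("intermediate a + b overflows:", overflows(a, b))
print("results agree under wrapping:", original == reassociated)
```

This only illustrates the reassoc side of the trade-off; it says nothing
about how much range information VRP loses once wrapping is assumed.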

Thus if we do

      NEXT_PASS (pass_sprintf_length, true);
      NEXT_PASS (pass_split_paths);
      NEXT_PASS (pass_tracer);
      NEXT_PASS (pass_thread_jumps);
      NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
      /* The only const/copy propagation opportunities left after
         DOM and VRP should be due to degenerate PHI nodes.  So rather than
         run the full propagators, run a specialized pass which
         only examines PHIs to discover const/copy propagation
         opportunities.  */
      NEXT_PASS (pass_phi_only_cprop);
      /* Dumbing down to -fwrapv for reassoc to work and forwprop 
         folding not hindered by undefined overflow disabling transforms.
         Matches semantics of RTL.  */
      NEXT_PASS (pass_late_gimple_start);
      NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
      NEXT_PASS (pass_strength_reduction);
      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
      /* The only const/copy propagation opportunities left after
         DOM and VRP should be due to degenerate PHI nodes.  So rather than
         run the full propagators, run a specialized pass which
         only examines PHIs to discover const/copy propagation
         opportunities.  */
      NEXT_PASS (pass_phi_only_cprop);
      NEXT_PASS (pass_strlen);
      NEXT_PASS (pass_thread_jumps);
      NEXT_PASS (pass_dse);

we end up with

 Score based on Pentium III 600MHz using Fortran 77: 19.467136
 Score based on Pentium III 600MHz using Fortran 77: 19.489240
 Score based on Pentium III 600MHz using Fortran 77: 19.413257
 Score based on Pentium III 600MHz using Fortran 77: 19.285549
 Score based on Pentium III 600MHz using Fortran 77: 19.352476
 Score based on Pentium III 600MHz using Fortran 77: 19.487067
 Score based on Pentium III 600MHz using Fortran 77: 19.513724
 Score based on Pentium III 600MHz using Fortran 77: 19.515330
 Score based on Pentium III 600MHz using Fortran 77: 19.523810
 Score based on Pentium III 600MHz using Fortran 77: 19.518709
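
Since the runs are noisy, a quick way to compare the four sets of scores
is to look at their means and standard deviations.  A minimal sketch in
Python follows; the scores are copied from this message, while the labels
"reordered block" and "final ordering" are just names picked here for the
third and fourth configurations.

```python
from statistics import mean, stdev

# Scores copied from the runs quoted in this message (higher is better).
runs = {
    "before": [19.465005, 19.558720, 19.546069, 19.572887, 19.528043,
               19.477979, 19.534370, 19.562271, 19.495751, 19.542132],
    "after": [19.436746, 19.510495, 19.479649, 19.470079, 19.470537,
              19.539023, 19.421880, 19.504202, 19.545846, 19.571152],
    # Hypothetical label: late_gimple_start/reassoc/slsr moved after vrp.
    "reordered block": [19.076637, 19.141776, 19.078936, 19.146834,
                        19.098964, 19.098782, 19.127632, 19.095203,
                        19.111919, 18.993788],
    # Hypothetical label: the pass list shown just above these scores.
    "final ordering": [19.467136, 19.489240, 19.413257, 19.285549,
                       19.352476, 19.487067, 19.513724, 19.515330,
                       19.523810, 19.518709],
}

def summarize(scores):
    """Return (mean, sample standard deviation) for one set of runs."""
    return mean(scores), stdev(scores)

baseline_mean, _ = summarize(runs["before"])
for name, scores in runs.items():
    m, s = summarize(scores)
    # Relative change of the mean against the unpatched "before" runs.
    delta = 100.0 * (m - baseline_mean) / baseline_mean
    print(f"{name:16s} mean={m:.3f} stdev={s:.3f} delta={delta:+.2f}%")
```

The means come out at roughly 19.53, 19.49, 19.10 and 19.46, so only
the "reordered block" variant stands clearly apart from the others.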

Anyway, some more detailed analysis is required here [note I didn't try to
reproduce the slowdown].  Pass shuffling is always "interesting"...

References:
- [Bug tree-optimization/81554] New: [8 Regression] 25% performance regression in Himeno benchmark
  - From: kristerw at gcc dot gnu.org
