This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/81554] [8 Regression] 25% performance regression in Himeno benchmark
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 26 Jul 2017 10:20:35 +0000
- Subject: [Bug tree-optimization/81554] [8 Regression] 25% performance regression in Himeno benchmark
- Auto-submitted: auto-generated
- References: <bug-81554-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 41833
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41833&action=edit
patch
With the attached patch we do this transform late, but it doesn't help
performance (the numbers are noisy). Before:
Score based on Pentium III 600MHz using Fortran 77: 19.465005
Score based on Pentium III 600MHz using Fortran 77: 19.558720
Score based on Pentium III 600MHz using Fortran 77: 19.546069
Score based on Pentium III 600MHz using Fortran 77: 19.572887
Score based on Pentium III 600MHz using Fortran 77: 19.528043
Score based on Pentium III 600MHz using Fortran 77: 19.477979
Score based on Pentium III 600MHz using Fortran 77: 19.534370
Score based on Pentium III 600MHz using Fortran 77: 19.562271
Score based on Pentium III 600MHz using Fortran 77: 19.495751
Score based on Pentium III 600MHz using Fortran 77: 19.542132
After:
Score based on Pentium III 600MHz using Fortran 77: 19.436746
Score based on Pentium III 600MHz using Fortran 77: 19.510495
Score based on Pentium III 600MHz using Fortran 77: 19.479649
Score based on Pentium III 600MHz using Fortran 77: 19.470079
Score based on Pentium III 600MHz using Fortran 77: 19.470537
Score based on Pentium III 600MHz using Fortran 77: 19.539023
Score based on Pentium III 600MHz using Fortran 77: 19.421880
Score based on Pentium III 600MHz using Fortran 77: 19.504202
Score based on Pentium III 600MHz using Fortran 77: 19.545846
Score based on Pentium III 600MHz using Fortran 77: 19.571152
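One way to check the "noisy" claim is to compare the difference in means against the run-to-run spread. A minimal sketch in Python, using only the twenty scores quoted above; the variable names and the choice of the sample standard deviation as the noise measure are assumptions for illustration, not part of the benchmark or the patch:

```python
from statistics import mean, stdev

# Himeno scores copied from the "Before" and "After" runs above.
before = [19.465005, 19.558720, 19.546069, 19.572887, 19.528043,
          19.477979, 19.534370, 19.562271, 19.495751, 19.542132]
after = [19.436746, 19.510495, 19.479649, 19.470079, 19.470537,
         19.539023, 19.421880, 19.504202, 19.545846, 19.571152]

mean_before, mean_after = mean(before), mean(after)
# Sample standard deviation as a rough measure of run-to-run noise
# (an assumed, simple criterion; not a formal significance test).
noise = max(stdev(before), stdev(after))
delta = mean_after - mean_before

print(f"before: {mean_before:.4f}  after: {mean_after:.4f}")
print(f"delta: {delta:+.4f}  noise (max stdev): {noise:.4f}")
```

With these numbers the difference in means comes out smaller than the larger of the two standard deviations, so ten runs per configuration do not separate the two builds.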
Either the transform is required before the loop optimizations, or
flag_wrapv pessimizes things. I suppose some additional pass
re-shuffling would be in order, like moving the block late_gimple_start,
reassoc, strength_reduction to after vrp, phi_only_cprop so that VRP has
the chance to compute good !flag_wrapv ranges late. That results in
Score based on Pentium III 600MHz using Fortran 77: 19.076637
Score based on Pentium III 600MHz using Fortran 77: 19.141776
Score based on Pentium III 600MHz using Fortran 77: 19.078936
Score based on Pentium III 600MHz using Fortran 77: 19.146834
Score based on Pentium III 600MHz using Fortran 77: 19.098964
Score based on Pentium III 600MHz using Fortran 77: 19.098782
Score based on Pentium III 600MHz using Fortran 77: 19.127632
Score based on Pentium III 600MHz using Fortran 77: 19.095203
Score based on Pentium III 600MHz using Fortran 77: 19.111919
Score based on Pentium III 600MHz using Fortran 77: 18.993788
which thus looks even worse ;) (All of the above is with just -O3 on a
Broadwell system.) I guess reassoc is necessary for DOM to do a good CSE
job. OTOH tracer and path splitting should enable more reassoc/SLSR, so
they should run before those passes (but they shouldn't care about
flag_wrapv).
Thus if we do
  NEXT_PASS (pass_sprintf_length, true);
  NEXT_PASS (pass_split_paths);
  NEXT_PASS (pass_tracer);
  NEXT_PASS (pass_thread_jumps);
  NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
  /* The only const/copy propagation opportunities left after
     DOM and VRP should be due to degenerate PHI nodes. So rather than
     run the full propagators, run a specialized pass which
     only examines PHIs to discover const/copy propagation
     opportunities. */
  NEXT_PASS (pass_phi_only_cprop);
  /* Dumbing down to -fwrapv for reassoc to work and forwprop
     folding not hindered by undefined overflow disabling transforms.
     Matches semantics of RTL. */
  NEXT_PASS (pass_late_gimple_start);
  NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
  NEXT_PASS (pass_strength_reduction);
  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
  /* The only const/copy propagation opportunities left after
     DOM and VRP should be due to degenerate PHI nodes. So rather than
     run the full propagators, run a specialized pass which
     only examines PHIs to discover const/copy propagation
     opportunities. */
  NEXT_PASS (pass_phi_only_cprop);
  NEXT_PASS (pass_strlen);
  NEXT_PASS (pass_thread_jumps);
  NEXT_PASS (pass_dse);
we end up with
Score based on Pentium III 600MHz using Fortran 77: 19.467136
Score based on Pentium III 600MHz using Fortran 77: 19.489240
Score based on Pentium III 600MHz using Fortran 77: 19.413257
Score based on Pentium III 600MHz using Fortran 77: 19.285549
Score based on Pentium III 600MHz using Fortran 77: 19.352476
Score based on Pentium III 600MHz using Fortran 77: 19.487067
Score based on Pentium III 600MHz using Fortran 77: 19.513724
Score based on Pentium III 600MHz using Fortran 77: 19.515330
Score based on Pentium III 600MHz using Fortran 77: 19.523810
Score based on Pentium III 600MHz using Fortran 77: 19.518709
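To put the two re-shuffled orderings side by side with the unpatched baseline, the mean score of each can be expressed as a percentage change. A rough sketch in Python, using only scores quoted in this comment; the dictionary labels are illustrative names, not GCC terminology:

```python
from statistics import mean

# Scores copied from this comment; the labels are illustrative only.
runs = {
    "baseline (before patch)": [
        19.465005, 19.558720, 19.546069, 19.572887, 19.528043,
        19.477979, 19.534370, 19.562271, 19.495751, 19.542132],
    "block moved after vrp, phi_only_cprop": [
        19.076637, 19.141776, 19.078936, 19.146834, 19.098964,
        19.098782, 19.127632, 19.095203, 19.111919, 18.993788],
    "ordering from the pass list above": [
        19.467136, 19.489240, 19.413257, 19.285549, 19.352476,
        19.487067, 19.513724, 19.515330, 19.523810, 19.518709],
}

baseline = mean(runs["baseline (before patch)"])

# Map each label to (mean score, percent change relative to baseline).
summary = {}
for label, scores in runs.items():
    avg = mean(scores)
    summary[label] = (avg, 100.0 * (avg - baseline) / baseline)
    print(f"{label:40s} mean {avg:.4f}  change {summary[label][1]:+.2f}%")
```

On these numbers the first re-shuffle is roughly 2% below the baseline, which matches the "looks even worse" remark, while the second ordering is back within a fraction of a percent of it.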
Anyway, some more detailed analysis is required here [note I didn't try to
reproduce the slowdown]. Pass shuffling is always "interesting"...