Bug 39141 - overzealous unrolling (peeling) destroys code locality
Summary: overzealous unrolling (peeling) destroys code locality
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.4.0
Importance: P3 normal
Target Milestone: ---
Assignee: Jorn Wolfgang Rennecke
URL:
Keywords: missed-optimization, patch
Depends on:
Blocks: 39363
Reported: 2009-02-09 16:18 UTC by Jorn Wolfgang Rennecke
Modified: 2024-03-10 05:07 UTC (History)
CC List: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-03-04 19:08:52


Description Jorn Wolfgang Rennecke 2009-02-09 16:18:16 UTC
I see a 50% cycle increase for EEMBC idctrn01 going from gcc 4.2.1 to gcc 4.4.
There are two issues: overzealous unrolling, and constant propagation in the
unrolled loops.
While both issues can be avoided by reducing the parameter
"max-completely-peeled-insns" to 200, this is not really satisfactory: that
is a rather fragile parameter setting, it is not really related to the
problem, and it holds other code back, like the EEMBC viterb00 benchmark, which
loses 5% performance with that setting.

There are a number of loops which are completely unrolled,
thus pushing their containing loop size above the size of the instruction
cache.  There is no point in doing such unrolling, since it is more expensive
to fill the instruction cache with an unrolled loop than to execute a rolled loop.
I have implemented a heuristic to estimate the size of the outer loop (or
function in absence of an outer loop) assuming that its inner loops will be
unrolled in accordance with PARAM_MAX_UNROLL_TIMES, PARAM_MAX_UNROLLED_INSNS
and PARAM_MAX_COMPLETELY_PEEL_TIMES, and if that size exceeds a threshold
(a new parameter), complete unrolling is inhibited.
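The heuristic described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the submitted patch: the
struct loop_model type, the function names and the numeric limits are all
hypothetical, and the model only covers complete peeling (it ignores
PARAM_MAX_UNROLL_TIMES and PARAM_MAX_UNROLLED_INSNS for brevity).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified loop description; NOT GCC's struct loop.  */
struct loop_model {
    unsigned body_insns;             /* insns in the body, excluding inner loops */
    unsigned iterations;             /* constant trip count, 0 if unknown */
    const struct loop_model *inner;  /* array of directly nested loops */
    size_t n_inner;                  /* number of entries in INNER */
};

/* Illustrative limits only; the names mirror the params, the values are
   assumptions and MAX_OUTER_INSNS stands in for the new parameter.  */
enum {
    MAX_COMPLETELY_PEEL_TIMES = 16,
    MAX_COMPLETELY_PEELED_INSNS = 400,
    MAX_OUTER_INSNS = 2000
};

static unsigned long estimate_size(const struct loop_model *loop);

/* Size LOOP would have if complete peeling were applied where the
   per-loop limits permit it; otherwise its rolled size.  */
static unsigned long size_after_peeling(const struct loop_model *loop)
{
    unsigned long rolled = estimate_size(loop);
    if (loop->iterations == 0 || loop->iterations > MAX_COMPLETELY_PEEL_TIMES)
        return rolled;
    unsigned long peeled = rolled * loop->iterations;
    return peeled <= MAX_COMPLETELY_PEELED_INSNS ? peeled : rolled;
}

/* Estimated size of LOOP (or of a function modelled as a loop with
   unknown trip count), assuming its inner loops get peeled.  */
static unsigned long estimate_size(const struct loop_model *loop)
{
    assert(loop != NULL);
    unsigned long size = loop->body_insns;
    for (size_t i = 0; i < loop->n_inner; i++)
        size += size_after_peeling(&loop->inner[i]);
    return size;
}

/* Nonzero if complete unrolling inside OUTER should still be allowed.  */
static int allow_complete_unrolling(const struct loop_model *outer)
{
    return estimate_size(outer) <= MAX_OUTER_INSNS;
}
```

With these assumed limits, an outer loop of 50 insns containing eight inner
loops of 40 insns and 8 iterations each is estimated at 50 + 8 * 320 = 2610
insns, so complete unrolling would be inhibited for it.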

This change has reduced the idctrn01 regression to 13.7% while leaving the
other EEMBC benchmarks alone.
I expect that I can address the remaining performance regression by
inhibiting inappropriate constant propagation - there are 798 addresses
of the form absolute address+offset in the main benchmark assembly file,
each of which translates into one instruction word too many, in total
20% of the text size of that module.
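
To illustrate the kind of address I mean, here is a small hand-written
example; the table coeff and both functions are made up for illustration and
are not taken from the benchmark. Once a loop over a global array is
completely peeled, every access can be constant-propagated to a fixed
symbol+offset address instead of going through a base register, and on a
target where an absolute address needs an extra instruction word that costs
one word per access. Whether this happens depends on the target and options.

```c
#include <stdint.h>

/* Hypothetical global table standing in for benchmark data.  */
static int16_t coeff[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Rolled form: each access is base register + running index.  */
int sum_rolled(void)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += coeff[i];
    return sum;
}

/* What complete peeling plus constant propagation effectively yields:
   eight accesses, each naming a constant address coeff+0, coeff+2, ...  */
int sum_peeled(void)
{
    return coeff[0] + coeff[1] + coeff[2] + coeff[3]
         + coeff[4] + coeff[5] + coeff[6] + coeff[7];
}
```

Both functions compute the same value; only the addressing differs.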

I can post the unroller heuristic patch as soon as we have the confirmation
from the FSF that they have filed the Copyright Assignment we gave them
last year.
Comment 1 Jorn Wolfgang Rennecke 2009-03-04 19:08:52 UTC
patch submitted
Comment 2 Jorn Wolfgang Rennecke 2009-03-05 00:33:57 UTC
The patch is here:
http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00231.html
Comment 3 Jorn Wolfgang Rennecke 2010-01-27 20:04:11 UTC
Subject: Bug 39141

Author: amylaar
Date: Wed Jan 27 20:03:57 2010
New Revision: 156301

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=156301
Log:
	PR tree-optimization/39141
	* tree-ssa-loop-manip.c (gimple_can_duplicate_loop_to_header_edge):
	New function.
	* tree-ssa-loop-ivcanon.c (enum unroll_level): New value
	UL_ESTIMATE_GROWTH.
	(try_unroll_loop_completely): Handle UL_ESTIMATE_GROWTH.
	(canonicalize_loop_induction_variables): Likewise.
	(tree_unroll_loops_completely): Don't completely unroll loops where
	the outer loop/function is larger than
	PARAM_MAX_COMPLETELY_PEELED_OUTER_INSNS, or will/would become thus
	due to unrolling.
	* cfgloop.h (enum li_flags): New value LI_REALLY_FROM_INNERMOST.
	(fel_init): Handle LI_REALLY_FROM_INNERMOST.
	* tree-flow.h (gimple_can_duplicate_loop_to_header_edge): Declare.
	* params.def (PARAM_MAX_COMPLETELY_PEELED_OUTER_INSNS): New parameter.

Added:
    branches/mpost-opt-imp-20100127/gcc/ChangeLog.mpost
Modified:
    branches/mpost-opt-imp-20100127/gcc/cfgloop.h
    branches/mpost-opt-imp-20100127/gcc/params.def
    branches/mpost-opt-imp-20100127/gcc/tree-flow.h
    branches/mpost-opt-imp-20100127/gcc/tree-ssa-loop-ivcanon.c
    branches/mpost-opt-imp-20100127/gcc/tree-ssa-loop-manip.c