This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/60172] ARM performance regression from trunk at 207239
- From: "rguenther at suse dot de" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 20 Feb 2014 10:02:05 +0000
- Subject: [Bug tree-optimization/60172] ARM performance regression from trunk at 207239
- Auto-submitted: auto-generated
- References: <bug-60172-4 at http dot gcc dot gnu dot org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 19 Feb 2014, steven at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
>
> Steven Bosscher <steven at gcc dot gnu.org> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |steven at gcc dot gnu.org
>
> --- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> ---
> (In reply to Joey Ye from comment #11)
>
> Sometimes it helps to use -fdump-rtl-slim. Matter of taste but I find
> that much easier to interpret than LISP-like RTL dumps.
>
> Annotated "good expansion":
> ;; _41 = _42 * 4;
> 20: r126=r131<<2
>
> ;; _40 = _2 + _41;
> 21: r136=r130+r119 // r136=Arr_2_Par_Ref+r119
> 22: r125=r136+r126 // r125=Arr_2_Par_Ref+r119+r131<<2
>
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 29: r139=r130+r119 // r139=Arr_2_Par_Ref+r119
> 30: r140=r139+r126 // r140=Arr_2_Par_Ref+r119+r131<<2 (==r125)
> 31: r141=r140+1000 // r141=Arr_2_Par_Ref+r119+r131<<2+1000 (==r125+1000)
> 32: [r141+20]=r124
>
> In this case, the RHS for the SETs of r140 and r125 are lexically
> identical for value numbering, so the job for CSE is easy.
>
>
> Annotated "bad expansion":
> ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> 22: r138=r128+r121
> 23: r127=r132+r138 // r127=Arr_2_Par_Ref+r128+r121
>
> ;; _32 = _20 + 1000;
> 29: r124=r121+1000
>
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 32: r141=r132+r124 // r141=Arr_2_Par_Ref+r121+1000
> 33: r142=r141+r128 // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)
>                    // (==r138+1000)
> 34: [r142+20]=r126
>
> Here, the "+1000" confuses CSE. The sets of r127 and r142 have a common
> sub-expression as value, but none of the sub-expressions are lexically
> identical. RTL CSE has limited ability to look through sub-expressions
> to identify "same value" sub-expressions (anchors, base regs, etc.) but
> apparently this case is too complex for it to handle.

So expansion generates "better" code (a single insn covering the
two adds), caused by expanding a chain of two regular PLUS_EXPRs
rather than a chain of two POINTER_PLUS_EXPRs.

That's of course unfortunate, but I can't see how this is anything
other than a missed optimization in CSE ...
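
To make the "lexically identical" point concrete, here is a minimal sketch of a purely lexical value-numbering pass run over the two annotated expansions quoted above. It is a toy model, not GCC's RTL CSE: the function name lexical_cse and the tuple encoding of insns are assumptions made for illustration, and it ignores commutativity, anchors and base registers entirely.

```python
# Toy lexical CSE over slim-RTL-style insns (illustration only, not GCC code).
# Each insn is encoded as (dest, op, src1, src2); this encoding is hypothetical.

def lexical_cse(insns):
    """Return {dest: earlier_reg} for every insn whose RHS was seen before."""
    available = {}  # RHS key -> register already holding that value
    copy_of = {}    # register -> equivalent earlier register
    hits = {}
    for dest, op, a, b in insns:
        # Substitute operands by known-equal earlier registers first.
        a = copy_of.get(a, a)
        b = copy_of.get(b, b)
        key = (op, a, b)
        if key in available:
            copy_of[dest] = available[key]
            hits[dest] = available[key]
        else:
            available[key] = dest
    return hits

# Insns 20-31 of the "good expansion".
GOOD = [
    ("r126", "<<", "r131", 2),
    ("r136", "+", "r130", "r119"),
    ("r125", "+", "r136", "r126"),
    ("r139", "+", "r130", "r119"),
    ("r140", "+", "r139", "r126"),
    ("r141", "+", "r140", 1000),
]

# Insns 22-33 of the "bad expansion".
BAD = [
    ("r138", "+", "r128", "r121"),
    ("r127", "+", "r132", "r138"),
    ("r124", "+", "r121", 1000),
    ("r141", "+", "r132", "r124"),
    ("r142", "+", "r141", "r128"),
]

good_hits = lexical_cse(GOOD)
bad_hits = lexical_cse(BAD)
print("good:", good_hits)
print("bad: ", bad_hits)
```

In this model the good sequence finds r139 equal to r136 and r140 equal to r125, while the bad sequence finds nothing, because the +1000 is folded in before the r128 term and no right-hand side repeats.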

On the GIMPLE level before expansion we have

  _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);
  _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));

thus a similar issue: a missed CSE due to bad association (and due to
not having a CSE pass after forwprop exposed the opportunity).
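
The association problem can be sketched as follows: if both sums are flattened and put into one canonical order with the constant folded out, _51 is visibly _40 + 1000. This is a hedged illustration only. It treats every operation as plain integer addition (no POINTER_PLUS_EXPR distinction, no overflow concerns), and the helpers flatten and canonical are hypothetical names, not GCC's reassoc implementation.

```python
# Toy re-association of nested sums (illustration only, not GCC's reassoc pass).
# A tuple stands for the sum of its elements; strings are SSA names, ints are
# constants.

def flatten(expr):
    """Flatten a nested sum into a flat list of leaf terms."""
    if isinstance(expr, tuple):
        terms = []
        for sub in expr:
            terms.extend(flatten(sub))
        return terms
    return [expr]

def canonical(expr):
    """Return (sorted symbolic terms, folded constant) for a nested sum."""
    terms = flatten(expr)
    symbols = sorted(t for t in terms if isinstance(t, str))
    const = sum(t for t in terms if isinstance(t, int))
    return symbols, const

BASE = "Arr_2_Par_Ref_22(D)"
e40 = (BASE, ("_41", "pretmp_20"))           # _40
e51 = (BASE, ("_41", ("pretmp_20", 1000)))   # _51

sym40, c40 = canonical(e40)
sym51, c51 = canonical(e51)

# Same symbolic part means _51 can be rewritten as _40 plus a constant.
if sym40 == sym51:
    print("_51 = _40 +", c51 - c40)
```

As written, the two expressions differ only in where the 1000 sits in the tree, which is why a purely structural comparison misses the shared part.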

Unfortunately we expose the opportunity only by late complete unrolling,
because early unrolling says

  size: 7-2, last_iteration: 3-0
  Loop size: 7
  Estimated size after unrolling: 8
  Not unrolling loop 1: size would grow.

and you can't make it unroll that loop (outer loops are only ever
unrolled early if doing so doesn't increase code-size).

Now the order is late unroll - reassoc - DOM - forwprop,
exactly the wrong way around to eventually catch the CSE opportunity
at the GIMPLE level; it would need to be late unroll - forwprop -
reassoc - DOM.

Richard.