54717 – [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

Bug 54717 - [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

Summary: [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	4.8.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-09-26 12:53 UTC by Sergey Ostanevich
Modified:	2012-12-06 16:51 UTC (History)
CC List:	5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2012-09-26 00:00:00

Attachments
bzipped tar archive of a reduced test (313.39 KB, application/octet-stream) 2012-10-02 20:23 UTC, Dominique d'Humieres	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sergey Ostanevich 2012-09-26 12:53:40 UTC

commit 024fee2c369096e6fe6cde620243df5843893004
Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Sep 13 12:43:58 2012 +0000

    2012-09-13  Richard Guenther  <rguenther@suse.de>

        * tree-ssa-sccvn.h (enum vn_kind): New.
        (vn_get_stmt_kind): Likewise.
        * tree-ssa-sccvn.c (vn_get_stmt_kind): New function, adjust
        ADDR_EXPR handling.
        (visit_use): Use it.
        * tree-ssa-pre.c (compute_avail): Likewise, simplify further.

        * gcc.dg/tree-ssa/ssa-fre-37.c: New testcase.


    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@191253 138bc75d-0d04-0410-961f-82ee72b054a4

caused a 20% degradation on polyhedron's "rnflow"

commit 780bedc1ccae5ae85fb99afed8a1ac1cc598121b
Geometric Mean Execution Time =      18.28 seconds

commit 024fee2c369096e6fe6cde620243df5843893004
Geometric Mean Execution Time =      24.82 seconds


compilation options used:
gfortran -march=native -ffast-math -funroll-loops -O3 -ftree-vectorize %n.f90 -static -o %n

Comment 1 Uroš Bizjak 2012-09-26 13:17:15 UTC

Adding CCs.

Comment 2 Richard Biener 2012-09-26 14:17:04 UTC

What's "-march=native" to you?  Any help in reduction appreciated.

Comment 3 Sergey Ostanevich 2012-09-26 15:11:38 UTC

adding -### gives (in part of options)

 /export/users/syostane/pb11/gcc120914/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/f951 air.f90 "-march=corei7" -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx --param "l1-cache-size=32" --param "l1-cache-line-size=64" --param "l2-cache-size=12288" "-mtune=corei7" -quiet -dumpbase air.f90 -auxbase air -fintrinsic-modules-path /export/users/syostane/pb11/gcc120914/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/finclude -o /tmp/ccmW82c1.s

Comment 4 Dominique d'Humieres 2012-09-26 15:41:05 UTC

The slowdown is mostly hidden by  -fno-tree-loop-if-convert.

Comment 5 Sergey Ostanevich 2012-09-26 20:07:26 UTC

for 093t.pre I see the following missing in cptrf2 function, first is good, second is degraded:

***************
*** 8947,8966 ****
    goto <bb 35>;

    <bb 93>:
-   pretmp_325 = (integer(kind=8)) ival2_80;
-   pretmp_326 = pretmp_325 + -1;
-   pretmp_327 = *xxtrt_25(D)[pretmp_326];

    <bb 27>:
    # ival2_136 = PHI <ival2_62(93), ival2_144(97)>
    # ival2_140 = PHI <ival2_80(93), ival2_146(97)>
-   # prephitmp_328 = PHI <pretmp_327(93), prephitmp_290(97)>
    _137 = (integer(kind=8)) ival2_136;
    _138 = _137 + -1;
    _139 = *xxtrt_25(D)[_138];
    _141 = (integer(kind=8)) ival2_140;
    _142 = _141 + -1;
!   _143 = prephitmp_328;
    if (_139 < _143)
      goto <bb 28>;
    else
--- 8838,8853 ----
    goto <bb 35>;

    <bb 93>:

    <bb 27>:
    # ival2_136 = PHI <ival2_62(93), ival2_144(97)>
    # ival2_140 = PHI <ival2_80(93), ival2_146(97)>
    _137 = (integer(kind=8)) ival2_136;
    _138 = _137 + -1;
    _139 = *xxtrt_25(D)[_138];
    _141 = (integer(kind=8)) ival2_140;
    _142 = _141 + -1;
!   _143 = *xxtrt_25(D)[_142];
    if (_139 < _143)
      goto <bb 28>;
    else
***************

but more surprising to me is that first diff is in 020t.inline_param1

***************
*** 16790,16794 ****
    calls:
      dtrti2/26 function not considered for inlining
!       loop depth: 0 freq:1000 size: 9 time: 18 callee size:82 stack:28
      dtrsm/21 function not considered for inlining
        loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
--- 16790,16794 ----
    calls:
      dtrti2/26 function not considered for inlining
!       loop depth: 0 freq:1000 size: 9 time: 18 callee size:81 stack:28
      dtrsm/21 function not considered for inlining
        loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
***************

Comment 6 Richard Biener 2012-09-27 09:28:04 UTC

(In reply to comment #4)
> The slowdown is mostly hidden by  -fno-tree-loop-if-convert.

I would say this means we have more vectorization opportunities after the
patch.  Opportunities that might end up being not profitable.

Sergey, are those differences you quote the only differences?

Comment 7 Richard Biener 2012-09-27 10:43:00 UTC

I can reproduce the slowdown.  Code differences appear first in early FRE,
good ones like:

-  _84 = &*a_56(D)[_83];
+  _84 = _75;

which was the intention of the patch (and that is also likely the
reason for the inliner code size/time estimate changes).

It would be nice to get a smaller testcase for the PRE change you quote.

Unfortunately the big slowdown does not reproduce with -fno-inline which makes
it harder to track down.

The real differences do appear in PRE, some of the kind you quote and
some where we perform more PRE like:

@@ -19695,11 +19720,13 @@
   <bb 289>:
   pretmp_ = stride.258_ * _;
   pretmp_ = offset.259_ + pretmp_;
+  pretmp_ = stride.258_ * _;
+  pretmp_ = offset.259_ + pretmp_;
 
   <bb 123>:
   # i_ = PHI <1(289), i_(292)>
-  _ = stride.258_ * _;
-  _ = _ + offset.259_;
+  _ = pretmp_;
+  _ = pretmp_;

Aside from that the differences you quote result in less if-conversion
applied:

   # ival2_ = PHI <ival2_(39), ival2_(41)>
   # ival2_ = PHI <ival2_(39), ival2_(41)>
-  # prephitmp_ = PHI <pretmp_(39), prephitmp_(41)>
   _ = (integer(kind=8)) ival2_;
   _ = _ + -1;
   _ = *xxtrt_(D)[_];
-  ival2_ = _ < prephitmp_ ? ival2_ : ival2_;
-  prephitmp_ = MIN_EXPR <_, prephitmp_>;
+  _ = (integer(kind=8)) ival2_;
+  _ = _ + -1;
+  _ = *xxtrt_(D)[_];
+  ival2_ = _ < _ ? ival2_ : ival2_;

but that does not result in any extra or missed vectorization.

Btw, dropping to -O2 also fixes the regression.

So, it's not at all clear what we are chasing here (the PRE seems to be
a partial antic expression).

Comment 8 Dominique d'Humieres 2012-10-02 20:23:42 UTC

Created attachment 28333 [details]
bzipped tar archive of a reduced test

The tar archive contains the files
cptrf2_inl_1.f90  rnflow.in  rnflow_red.f90  rnfprm.h
and can be used as in

[macbook] dbg_rnflow/pr54717% gfc -c -Ofast -funroll-loops rnflow_red.f90
[macbook] dbg_rnflow/pr54717% gfc -c -O2 cptrf2_inl_1.f90
[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o
[macbook] dbg_rnflow/pr54717% time a.out > /dev/null
21.036u 0.051s 0:21.09 99.9%	0+0k 0+0io 0pf+0w
[macbook] dbg_rnflow/pr54717% gfc -c -O2 -ftree-loop-if-convert cptrf2_inl_1.f90
[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o
[macbook] dbg_rnflow/pr54717% time a.out > /dev/null
27.150u 0.051s 0:27.20 100.0%	0+0k 0+0io 0pf+0w

This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert gives a slow executable without involving inlining or vectorization.

Comment 9 Sergey Ostanevich 2012-10-08 08:55:25 UTC

Thanks for the reduced test, Dominique!

I see that vectorized did not manage to generate MIN after the change. Also, it is looks pretty similar to what I posted at first: there was no prephitmp created for the xxtrt_[]


> ival2_15 = _85 < prephitmp_266 ? ival2_10 : iva
> prephitmp_237 = MIN_EXPR <_85, prephitmp_266>;
-----------------------
< _86 = (integer(kind=8)) ival2_14;
< _87 = _86 + -1;
< _88 = *xxtrt_46(D)[_87];
< ival2_15 = _85 < _88 ? ival2_10 : ival2_14;

I suspect that one of the iterator you removed - possibly VEC_iterate - made more traverse than that you created?

I also double check that for the reduced test MIN did not generated and not appears in assembly. PMU measurements (Vtune) confirms that BBLOCKs missing min contributes the difference in clocks.

Comment 10 Uroš Bizjak 2012-11-13 18:39:28 UTC

(In reply to comment #8)

> This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert
> gives a slow executable without involving inlining or vectorization.

Dup of PR53346 ?

Comment 11 Dominique d'Humieres 2012-11-13 18:54:40 UTC

> Dup of PR53346 ?

May be! Both PRs seem also related to pr54073.

Comment 12 Sergey Ostanevich 2012-11-14 18:56:22 UTC

Actually, it is not. 
I found that PRE did not collected a memory access within the loop that caused later missing vectorization. Here is dump before (good one) and after the commit (bad one)

    <bb 88>:
    pretmp_263 = (integer(kind=8)) ival2_82;
    pretmp_264 = pretmp_263 + -1;
    pretmp_265 = *xxtrt_46(D)[pretmp_264];

    <bb 28>:
    # ival2_10 = PHI <ival2_63(88), ival2_89(92)>
    # ival2_14 = PHI <ival2_82(88), ival2_15(92)>
    # prephitmp_266 = PHI <pretmp_265(88), prephitmp_237(92)>
    _83 = (integer(kind=8)) ival2_10;
    _84 = _83 + -1;
    _85 = *xxtrt_46(D)[_84];
    _86 = (integer(kind=8)) ival2_14;
    _87 = _86 + -1;
    _88 = prephitmp_266;
    if (_85 < _88)
      goto <bb 29>;
    else
      goto <bb 90>;

    <bb 90>:
    goto <bb 30>;

    <bb 29>:

    <bb 30>:
    # ival2_15 = PHI <ival2_14(90), ival2_10(29)>
    # prephitmp_237 = PHI <_88(90), _85(29)>
    ival2_89 = ival2_10 + -1;
    if (ival2_10 == ipos1_12)
      goto <bb 91>;
    else
      goto <bb 92>;

   <bb 92>:
   goto <bb 28>;
---------------------------------
    <bb 88>:

    <bb 28>:
    # ival2_10 = PHI <ival2_63(88), ival2_89(92)>
   # ival2_14 = PHI <ival2_82(88), ival2_15(92)>
    _83 = (integer(kind=8)) ival2_10;
    _84 = _83 + -1;
    _85 = *xxtrt_46(D)[_84];
    _86 = (integer(kind=8)) ival2_14;
    _87 = _86 + -1;
    _88 = *xxtrt_46(D)[_87];
    if (_85 < _88)
      goto <bb 29>;
    else
      goto <bb 90>;

    <bb 90>:
    goto <bb 30>;

    <bb 29>:

    <bb 30>:
    # ival2_15 = PHI <ival2_14(90), ival2_10(29)>
    ival2_89 = ival2_10 + -1;
    if (ival2_10 == ipos1_12)
      goto <bb 91>;
    else
      goto <bb 92>;

   <bb 92>:
   goto <bb 28>;
-------------------------

So for the loop that starting at bb 28 you can see the xxtrt_46 access was not put into pretemp. Possible reason is exactly as it was mentioned by Richard - there were extra candidates collected and this one become less anticipatable

Skipping partial partial redundancy for expression                    
{array_ref<pretmp_8,0,4>,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)   
   not partially anticipated on any to be optimized for speed edges    
  -----------------------------------------------------------------------
Found partial partial redundancy for expression
 {array_ref<pretmp_8,0,4>,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
Created phi prephitmp_237 = PHI <_88(90), _85(29)>
 in block 30

Comment 13 Jan Hubicka 2012-11-14 19:43:00 UTC

> So for the loop that starting at bb 28 you can see the xxtrt_46 access was not
> put into pretemp. Possible reason is exactly as it was mentioned by Richard -
> there were extra candidates collected and this one become less anticipatable
> 
> Skipping partial partial redundancy for expression                    
> {array_ref<pretmp_8,0,4>,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)   
>    not partially anticipated on any to be optimized for speed edges    
>   -----------------------------------------------------------------------
> Found partial partial redundancy for expression
>  {array_ref<pretmp_8,0,4>,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
> Created phi prephitmp_237 = PHI <_88(90), _85(29)>
>  in block 30

Hmm, interesting, what is the edge resonsible?
I would expect it to be the loopback edge and its frequency is:
;;   basic block 28, loop depth 0, count 0, freq 1998, maybe hot
;;    prev block 92, next block 94, flags: (NEW, REACHABLE)
;;    pred:       92 [100.0%, 180]  (FALLTHRU)
;;                96 [100.0%, 1818]  (FALLTHRU,DFS_BACK)
  # ival2_136 = PHI <ival2_62(92), ival2_144(96)>
  # ival2_140 = PHI <ival2_80(92), ival2_146(96)>
  _137 = (integer(kind=8)) ival2_136;
  _138 = _137 + -1;
  _139 = *xxtrt_25(D)[_138];
  _141 = (integer(kind=8)) ival2_140;
  _142 = _141 + -1;
  _143 = *xxtrt_25(D)[_142];
  if (_139 < _143)
    goto <bb 29>; 
  else            
    goto <bb 94>;

1818 that should be still hot.  Or isn't the heuristic backwards? I.e. I would expect
the partial anticipance to sit on edge 92->28 (with freq 180) where we need to insert
the computation to get the other path hot.

Honza

Comment 14 Jan Hubicka 2012-11-14 20:11:17 UTC

Hmm, the optimize_edge_for_speed never returns false here. The problem is that patch assumes that interesting successors of block with partial anticipance are blocks with partial anticipance. The anticipance however could be full and it seems that full anticipance do not imply partial one
Index: tree-ssa-pre.c
===================================================================
*** tree-ssa-pre.c      (revision 193503)
--- tree-ssa-pre.c      (working copy)
*************** do_partial_partial_insertion (basic_bloc
*** 3525,3531 ****
                 may cause regressions on the speed path.  */
              FOR_EACH_EDGE (succ, ei, block->succs)
                {
!                 if (bitmap_set_contains_value (PA_IN (succ->dest), val))
                    {
                      if (optimize_edge_for_speed_p (succ))
                        do_insertion = true;
--- 3525,3532 ----
                 may cause regressions on the speed path.  */
              FOR_EACH_EDGE (succ, ei, block->succs)
                {
!                 if (bitmap_set_contains_value (PA_IN (succ->dest), val)
!                     || bitmap_set_contains_value (ANTIC_IN (succ->dest), val))
                    {
                      if (optimize_edge_for_speed_p (succ))
                        do_insertion = true;

Comment 15 Jan Hubicka 2012-11-15 10:27:49 UTC

Path posted at http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01222.html
Can we figure out why the vectorization still does not happen?

Comment 16 Jan Hubicka 2012-11-15 10:52:13 UTC

OK, 4.7 vectorize two loops in the function in cptrf2

loop at ../a.f90:3538

      if (nxtr < 4) then
         kerr = 1
         do ixtr = 1, nxtr - 1
           ixtrt (ixtr) = ixtr + 1
         enddo
         goto 9000
      endif


and 

loop at ../a.f90:3530
 

         ixtrt = 0


The second loop is recognized as memset by mainline, so it remains to figure out what is wrong with the first loop.  It is unrolled:

Analyzing # of iterations of loop 9
  exit condition [1, + , 1](no_overflow) != ival2_27 + -1
  bounds on difference of bases: 0 ... 1
  result:
    # of iterations (unsigned int) ival2_27 + 4294967294, bounded by 1
Loop 9 iterates at most 1 times.
Estimating sizes for loop 9
 BB: 8, after_exit: 0
  size:   0 _38 = (integer(kind=8)) ixtr_12;
   Induction variable computation will be folded away.
  size:   1 _39 = _38 + -1;
   Induction variable computation will be folded away.
  size:   1 ixtr_40 = ixtr_12 + 1;
   Induction variable computation will be folded away.
  size:   1 *ixtrt_33(D)[_39] = ixtr_40;
  size:   2 if (ixtr_12 == _37)
   Exit condition will be eliminated in last copy.
 BB: 79, after_exit: 1
size: 5-2, last_iteration: 5-4
  Loop size: 5
  Estimated size after unrolling: 2
Unrolled loop 9 completely (duplicated 1 times).

I do not quite see why it iterates at most once, but if seems to work. So I would say that it is good idea to unroll rather than vectorize.

Is the slowdown still reproducing with my patch?

Comment 17 Dominique d'Humieres 2012-11-15 15:07:33 UTC

> Is the slowdown still reproducing with my patch?

Most of it (if not all) is gone with the patch: 
23.96s with '-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto' compared to 
23.37s with '-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert'.

Comment 18 Jan Hubicka 2012-11-16 10:37:30 UTC

Author: hubicka
Date: Fri Nov 16 10:37:25 2012
New Revision: 193553

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193553
Log:
	PR tree-optimization/54717
	* tree-ssa-pre.c (do_partial_partial_insertion): Consider also edges
	with ANTIC_IN.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c

Comment 19 Richard Biener 2012-12-06 16:51:11 UTC

Fixed.