Bug 60418 - [4.9 Regression] 435.gromacs in SPEC CPU 2006 is miscompiled
Summary: [4.9 Regression] 435.gromacs in SPEC CPU 2006 is miscompiled
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.9.0
: P3 normal
Target Milestone: 4.9.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-04 22:37 UTC by H.J. Lu
Modified: 2014-03-13 09:40 UTC (History)
2 users (show)

See Also:
Host:
Target: x32
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
The assembly difference of good and bad executables (1.48 KB, text/plain)
2014-03-05 19:10 UTC, H.J. Lu
Details
gcc49-pr60418.patch (778 bytes, patch)
2014-03-11 17:46 UTC, Jakub Jelinek
Details | Diff
gcc49-pr60418.patch (884 bytes, patch)
2014-03-12 09:13 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description H.J. Lu 2014-03-04 22:37:11 UTC
Since r208165, 435.gromacs in SPEC CPU 2006 is miscompiled on x32 with

-mx32 -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver
-fuse-linker-plugin

The failure is

  Running 435.gromacs ref peak lto default

*** Miscompare of gromacs.out; for details see
    /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.grom
acs/run/run_peak_ref_lto.0000/gromacs.out.mis

cat /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.gromacs/run/run_peak_ref_lto.0000/gromacs.out.mis
0002:  3.07684e+02
       3.03594e+02

The result of 3.03594e+02 is outside of tolerance for 3.07684e+02.
I added a static counter in tree_forwarder_block_p for the number of
times returned true above and return false if it is between getenv("from")
and getenv("to").  I noticed that as more basic blocks get removed,
the deviation is getting bigger. One removed basic block has

;; basic block 3, loop depth 0, count 0, freq 225, maybe hot 
;;  prev block 2, next block 4, flags: (NEW, REACHABLE)
;;  pred:       2
;; 
;;  succ:       4

and its successor has

;; basic block 4, loop depth 1, count 0, freq 2500, maybe hot 
;;  prev block 3, next block 5, flags: (NEW)
;;  pred:       3
;;              6
;; starting at line 722 
# gimple_phi <i_429, 0(3), [bondfree.c : 726:24] i_27(6)>

Is is safe to remove basic block 3?
Comment 1 H.J. Lu 2014-03-04 23:31:06 UTC
Do we need to verify that basic block DEST has a single predecessor?
This patch works for me:

diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c
index 926d300..0ed15df 100644
--- a/gcc/tree-cfgcleanup.c
+++ b/gcc/tree-cfgcleanup.c
@@ -316,6 +316,8 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted)
       /* Protect loop preheaders and latches if requested.  */
       if (dest->loop_father->header == dest)
   {
+    if (single_pred_p (dest))
+      {
         if (bb->loop_father == dest->loop_father)
      {
        if (loops_state_satisfies_p (LOOPS_HAVE_SIMPLE_LATCHES))
@@ -329,6 +331,7 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted)
      }
         else if (bb->loop_father == loop_outer (dest->loop_father))
      return !loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS);
+      }
     /* Always preserve other edges into loop headers that are
        not simple latches or preheaders.  */
     return false;
Comment 2 Richard Biener 2014-03-05 09:32:11 UTC
Hmm, what's the target triplet of a x32-only target?(In reply to H.J. Lu from comment #0)
> Since r208165, 435.gromacs in SPEC CPU 2006 is miscompiled on x32 with
> 
> -mx32 -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver
> -fuse-linker-plugin
> 
> The failure is
> 
>   Running 435.gromacs ref peak lto default
> 
> *** Miscompare of gromacs.out; for details see
>    
> /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.grom
> acs/run/run_peak_ref_lto.0000/gromacs.out.mis
> 
> cat
> /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.
> gromacs/run/run_peak_ref_lto.0000/gromacs.out.mis
> 0002:  3.07684e+02
>        3.03594e+02
> 
> The result of 3.03594e+02 is outside of tolerance for 3.07684e+02.
> I added a static counter in tree_forwarder_block_p for the number of
> times returned true above and return false if it is between getenv("from")
> and getenv("to").  I noticed that as more basic blocks get removed,
> the deviation is getting bigger. One removed basic block has
> 
> ;; basic block 3, loop depth 0, count 0, freq 225, maybe hot 
> ;;  prev block 2, next block 4, flags: (NEW, REACHABLE)
> ;;  pred:       2
> ;; 
> ;;  succ:       4
> 
> and its successor has
> 
> ;; basic block 4, loop depth 1, count 0, freq 2500, maybe hot 
> ;;  prev block 3, next block 5, flags: (NEW)
> ;;  pred:       3
> ;;              6
> ;; starting at line 722 
> # gimple_phi <i_429, 0(3), [bondfree.c : 726:24] i_27(6)>
> 
> Is is safe to remove basic block 3?

Sure it is.  This is a loop pre-header.
Comment 3 Richard Biener 2014-03-05 09:33:55 UTC
(In reply to H.J. Lu from comment #1)
> Do we need to verify that basic block DEST has a single predecessor?

No, that's not necessary.  In fact that disables all pre-header
removal (because the loop header always has the loop entry and at least
one latch)
Comment 4 Richard Biener 2014-03-05 09:36:15 UTC
(In reply to H.J. Lu from comment #0)
> Since r208165, 435.gromacs in SPEC CPU 2006 is miscompiled on x32 with
> 
> -mx32 -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver
> -fuse-linker-plugin
> 
> The failure is
> 
>   Running 435.gromacs ref peak lto default
> 
> *** Miscompare of gromacs.out; for details see
>    
> /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.grom
> acs/run/run_peak_ref_lto.0000/gromacs.out.mis
> 
> cat
> /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.
> gromacs/run/run_peak_ref_lto.0000/gromacs.out.mis
> 0002:  3.07684e+02
>        3.03594e+02
> 
> The result of 3.03594e+02 is outside of tolerance for 3.07684e+02.
> I added a static counter in tree_forwarder_block_p for the number of
> times returned true above and return false if it is between getenv("from")
> and getenv("to").  I noticed that as more basic blocks get removed,
> the deviation is getting bigger. One removed basic block has

That's odd.  What's the code generation difference in the end
for the minimal range {to, from} that still shows a difference
of the result?
Comment 5 H.J. Lu 2014-03-05 19:10:53 UTC
Created attachment 32280 [details]
The assembly difference of good and bad executables

Here is the assembly difference of good and bad executables due
the removal of the single loop preheader.
Comment 6 H.J. Lu 2014-03-05 19:18:06 UTC
(In reply to Richard Biener from comment #4)
> 
> That's odd.  What's the code generation difference in the end
> for the minimal range {to, from} that still shows a difference
> of the result?

In this particular case, loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS)
returns false:

(gdb) bt
#0  tree_forwarder_block_p (bb=0x7f69144186e8, phi_wanted=false)
    at ../../../../gcc/gcc/tree-cfgcleanup.c:357
#1  0x0000000000a9a07c in cleanup_tree_cfg_bb (bb=0x7f69144186e8)
    at ../../../../gcc/gcc/tree-cfgcleanup.c:666
#2  0x0000000000a9a19c in cleanup_tree_cfg_1 ()
    at ../../../../gcc/gcc/tree-cfgcleanup.c:709
#3  0x0000000000a9a2d4 in cleanup_tree_cfg_noloop ()
    at ../../../../gcc/gcc/tree-cfgcleanup.c:765
#4  0x0000000000a9a3e0 in cleanup_tree_cfg ()
    at ../../../../gcc/gcc/tree-cfgcleanup.c:820
#5  0x0000000000967f20 in execute_function_todo (data=0x834)
    at ../../../../gcc/gcc/passes.c:1811
#6  0x000000000096732a in do_per_function (
    callback=0x967ee1 <execute_function_todo(void*)>, data=0x834)
    at ../../../../gcc/gcc/passes.c:1574
#7  0x0000000000968176 in execute_todo (flags=2100)
    at ../../../../gcc/gcc/passes.c:1887
#8  0x0000000000968c05 in execute_one_pass (pass=0x29469b0)
    at ../../../../gcc/gcc/passes.c:2243
#9  0x0000000000968d8f in execute_pass_list (pass=0x29469b0)
    at ../../../../gcc/gcc/passes.c:2282
#10 0x0000000000968dc0 in execute_pass_list (pass=0x2945ed0)
    at ../../../../gcc/gcc/passes.c:2283
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) f 5
#5  0x0000000000967f20 in execute_function_todo (data=0x834)
    at ../../../../gcc/gcc/passes.c:1811
1811	      cleanup_tree_cfg ();
(gdb) f 6
#6  0x000000000096732a in do_per_function (
    callback=0x967ee1 <execute_function_todo(void*)>, data=0x834)
    at ../../../../gcc/gcc/passes.c:1574
1574	    callback (data);
(gdb) f 8
#8  0x0000000000968c05 in execute_one_pass (pass=0x29469b0)
    at ../../../../gcc/gcc/passes.c:2243
2243	  execute_todo (todo_after | pass->todo_flags_finish);
(gdb) p *pass
$4 = {<pass_data> = {type = GIMPLE_PASS, name = 0x1300470 "phicprop", 
    optinfo_flags = 0, has_gate = true, has_execute = true, 
    tv_id = TV_TREE_PHI_CPROP, properties_required = 40, 
    properties_provided = 0, properties_destroyed = 0, 
    todo_flags_start = 524288, todo_flags_finish = 2100}, 
  _vptr.opt_pass = 0x13003d0 <vtable for (anonymous namespace)::pass_phi_only_cprop+16>, sub = 0x0, next = 0x2946a10, static_pass_number = 85, 
  graph_dump_initialized = false, m_ctxt = 0x29214e0}

Is this OK for the phicprop pass to remove loop pre-headers when
loop_optimizer_init isn't called?
Comment 7 H.J. Lu 2014-03-05 20:08:42 UTC
current_loops is setup by

(gdb) bt
#0  set_loops_for_fn (fn=0x7ffff15901f8, loops=0x7fffef079e20)
    at ../../../../gcc/gcc/function.h:747
#1  0x00000000008fe35e in input_cfg (ib=0x7fffffffdbc0, data_in=0x1fc4a80, 
    fn=0x7ffff15901f8, count_materialization_scale=10000)
    at ../../../../gcc/gcc/lto-streamer-in.c:687
#2  0x00000000008ff088 in input_function (fn_decl=0x7ffff1a6d800, 
    data_in=0x1fc4a80, ib=0x7fffffffdbb0, ib_cfg=0x7fffffffdbc0)
    at ../../../../gcc/gcc/lto-streamer-in.c:935
#3  0x00000000008ff7ca in lto_read_body (file_data=0x7ffff1866000, 
    node=0x7ffff10468f8, data=0x2439810 "\003", 
    section_type=LTO_section_function_body)
    at ../../../../gcc/gcc/lto-streamer-in.c:1081
#4  0x00000000008ffa55 in lto_input_function_body (file_data=0x7ffff1866000, 
    node=0x7ffff10468f8, data=0x2439810 "\003")
    at ../../../../gcc/gcc/lto-streamer-in.c:1123
#5  0x000000000063ea32 in cgraph_get_body (node=0x7ffff10468f8)
    at ../../../../gcc/gcc/cgraph.c:3012
#6  0x00000000006466ad in expand_function (node=0x7ffff10468f8)
    at ../../../../gcc/gcc/cgraphunit.c:1746
#7  0x0000000000646d5e in expand_all_functions ()
    at ../../../../gcc/gcc/cgraphunit.c:1908
#8  0x000000000064772a in compile () at ../../../../gcc/gcc/cgraphunit.c:2252
#9  0x000000000059a615 in lto_main () at ../../../../gcc/gcc/lto/lto.c:3427
---Type <return> to continue, or q <return> to quit---
#10 0x0000000000a53c2c in compile_file () at ../../../../gcc/gcc/toplev.c:548
#11 0x0000000000a55dc3 in do_compile () at ../../../../gcc/gcc/toplev.c:1914
#12 0x0000000000a55f2e in toplev_main (argc=22, argv=0x1ac9810)
    at ../../../../gcc/gcc/toplev.c:1990
#13 0x00000000011ddec4 in main (argc=22, argv=0x7fffffffdef8)
    at ../../../../gcc/gcc/main.c:36

if (current_loops)
{
}

is entered without calling loop_optimizer_init first.
Comment 8 H.J. Lu 2014-03-05 20:45:21 UTC
With LTO, current_loops->state may not be set and
loops_state_satisfies_p (xxxx) may always be false.
Comment 9 H.J. Lu 2014-03-06 18:44:53 UTC
One pass removes a preheader in

  if (_47 > 0)
    goto <bb 3>;
  else
    goto <bb 8>;

  <bb 3>:

  <bb 4>:
  # n_20 = PHI <1(3), n_269(7)>

and becomes:

  if (_47 > 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 3>:
  # n_20 = PHI <1(2), n_269(6)>

The next pass adds the preheader back:

  if (_47 > 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 3>:
  # n_264 = PHI <1(2)> 

  <bb 8>:
  # n_20 = PHI <n_264(3), n_269(9)>

which changes the loop behavior and generates different results.
Comment 10 H.J. Lu 2014-03-06 21:56:41 UTC
Sources have many FP loops contains codes like:

rsq11             = dx11*dx11+dy11*dy11+dz11*dz11;

When they are compiled with

-O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver
-fuse-linker-plugin

LTO input IRs contain statements like

  powmult_241 = dy11_71 * dy11_71;
  powmult_240 = dz11_72 * dz11_72;
  _699 = powmult_240 + powmult_80;
  rsq11_77 = _699 + powmult_241;

During the final LTO link, lto1 repeatedly removes loop a preheader
in one pass and adds it back in the next pass.  Each removal/add
changes the statements to

  powmult_213 = dy11_71 * dy11_71;
  _75 = powmult_213 + powmult_80;
  powmult_244 = dz11_72 * dz11_72;
  rsq11_77 = _75 + powmult_244;

Each such re-order may change the FP result slightly.  They
can accumulate to such a degree that the end result is
outside of tolerance.
Comment 11 H.J. Lu 2014-03-06 23:29:15 UTC
sort_by_operand_rank in reassoc.cin has

  /* Lastly, make sure the versions that are the same go next to each
     other.  We use SSA_NAME_VERSION because it's stable.  */
  if ((oeb->rank - oea->rank == 0)
      && TREE_CODE (oea->op) == SSA_NAME
      && TREE_CODE (oeb->op) == SSA_NAME)
    {    
      if (SSA_NAME_VERSION (oeb->op) != SSA_NAME_VERSION (oea->op))
        return SSA_NAME_VERSION (oeb->op) - SSA_NAME_VERSION (oea->op);
      else 
        return oeb->id - oea->id;
    }    

It will treat

  powmult_80 = dx11_70 * dx11_70;
  powmult_241 = dy11_71 * dy11_71;
  _75 = powmult_80 + powmult_241;
  powmult_240 = dz11_72 * dz11_72;
  rsq11_77 = _75 + powmult_240;

where powmult_240 is between powmult_80 and powmult_241, differently from

  powmult_80 = dx11_70 * dx11_70;
  powmult_213 = dy11_71 * dy11_71;
  _75 = powmult_80 + powmult_213;
  powmult_244 = dz11_72 * dz11_72;
  rsq11_77 = _75 + powmult_244;

where powmult_244 > powmult_80 and powmult_213.
Comment 12 Richard Biener 2014-03-07 09:14:52 UTC
(In reply to H.J. Lu from comment #8)
> With LTO, current_loops->state may not be set and
> loops_state_satisfies_p (xxxx) may always be false.

It's zero.  It probably should be LOOPS_MAY_HAVE_MULTIPLE_LATCHES, but that
doesn't change anything ;)
Comment 13 Richard Biener 2014-03-07 09:18:03 UTC
(In reply to H.J. Lu from comment #10)
> Sources have many FP loops contains codes like:
> 
> rsq11             = dx11*dx11+dy11*dy11+dz11*dz11;
> 
> When they are compiled with
> 
> -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver
> -fuse-linker-plugin
> 
> LTO input IRs contain statements like
> 
>   powmult_241 = dy11_71 * dy11_71;
>   powmult_240 = dz11_72 * dz11_72;
>   _699 = powmult_240 + powmult_80;
>   rsq11_77 = _699 + powmult_241;
> 
> During the final LTO link, lto1 repeatedly removes loop a preheader
> in one pass and adds it back in the next pass.  Each removal/add
> changes the statements to
> 
>   powmult_213 = dy11_71 * dy11_71;
>   _75 = powmult_213 + powmult_80;
>   powmult_244 = dz11_72 * dz11_72;
>   rsq11_77 = _75 + powmult_244;
> 
> Each such re-order may change the FP result slightly.  They
> can accumulate to such a degree that the end result is
> outside of tolerance.

Huh, adding a pre-header should _never_ do sth like that.  Can you produce
a small testcase that exhibits these kind of changes with adding/removing
a preheader?
Comment 14 H.J. Lu 2014-03-07 21:57:44 UTC
(In reply to Richard Biener from comment #13)
> 
> Huh, adding a pre-header should _never_ do sth like that.  Can you produce
> a small testcase that exhibits these kind of changes with adding/removing
> a preheader?

I don't have a small testcase and it only shows up with -mx32.  Here
is what I see.  The diffs in 064t.copyprop3 are

diff -up good/gromacs.x.064t.copyprop3 bad/gromacs.x.064t.copyprop3
--- good/gromacs.x.064t.copyprop3	2014-03-06 13:14:22.897298915 -0800
+++ bad/gromacs.x.064t.copyprop3	2014-03-06 13:22:00.221999369 -0800
@@ -22774,6 +22774,7 @@ inl3300 (int & restrict nri, int[0:] * r
     goto <bb 8>;
 
   <bb 3>:
+  n_213 = 1;
 
   <bb 4>:
   # n_8 = PHI <1(3), n_218(7)>

n_213 is unused.  There are many diffs in SSA_NAME_VERSION in 066t.vrp1.
Diffs in 082t.reassoc1 are

diff -up good/gromacs.x.082t.reassoc1 bad/gromacs.x.082t.reassoc1
--- good/gromacs.x.082t.reassoc1	2014-03-06 13:14:22.948297990 -0800
+++ bad/gromacs.x.082t.reassoc1	2014-03-06 13:22:00.273998425 -0800
@@ -24250,11 +24250,11 @@ inl3300 (int & restrict nri, int[0:] * r
   float _201;
   float _202;
   int _206;
+  float _208;
   float _210;
   float _211;
   float _215;
   float _216;
-  float _242;
 
   <bb 2>:
   _19 = *nri_18(D);
@@ -24403,8 +24403,8 @@ inl3300 (int & restrict nri, int[0:] * r
   ff_152 = _155 + fp_147;
   vnb12_153 = vv_149 * c12_116;
   fijr_154 = ff_152 * c12_116;
-  _242 = vnb12_153 + vnb6_134;
-  vnbtot_156 = _242 + vnbtot_11;
+  _208 = vnb12_153 + vnb6_134;
+  vnbtot_156 = _208 + vnbtot_11;
   _157 = fijd_135 + fijc_109;
   _158 = _157 + fijr_154;
   _159 = ((_158));

When it gets to 089t.sincos, diffs are

diff -up good/gromacs.x.089t.sincos bad/gromacs.x.089t.sincos
--- good/gromacs.x.089t.sincos	2014-03-06 13:14:22.967297646 -0800
+++ bad/gromacs.x.089t.sincos	2014-03-06 13:22:00.291998098 -0800
@@ -24252,14 +24252,14 @@ inl3300 (int & restrict nri, int[0:] * r
   float _201;
   float _202;
   int _206;
+  float _208;
   float _210;
   float _211;
+  float powmult_213;
   float _215;
   float _216;
-  float powmult_238;
-  float powmult_240;
-  float powmult_241;
-  float _242;
+  float powmult_243;
+  float powmult_244;

powmult_xxx are created by make_temp_ssa_name (type, NULL, "powmult")
in tree-ssa-math-opts.c for

rsq11             = dx11*dx11+dy11*dy11+dz11*dz11;

Different powmult_xxx have nothing to do with each others. The orders of
powmult SSA_NAME_VERSION are different.  As the result, sort_by_operand_rank
sorts them differently and diffs in 125t.reassoc2 are

   powmult_80 = dx11_70 * dx11_70;
-  powmult_241 = dy11_71 * dy11_71;
-  powmult_240 = dz11_72 * dz11_72;
-  _699 = powmult_240 + powmult_80;
-  rsq11_77 = _699 + powmult_241;
+  powmult_213 = dy11_71 * dy11_71;
+  _75 = powmult_213 + powmult_80;
+  powmult_244 = dz11_72 * dz11_72;
+  rsq11_77 = _75 + powmult_244;

This is

rsq11             = dx11*dx11+dz11*dz11+dy11*dy11;

vs

rsq11             = dx11*dx11+dy11*dy11+dz11*dz11;

For FP operations, they may generate slightly different results.
It shows the reassoc pass is unstable, depending on FREE_SSANAMES when
make_temp_ssa_name is called.
Comment 15 Jakub Jelinek 2014-03-07 22:12:48 UTC
Please look at PR59025, 435.gromacs is apparently very sensitive to any changes in reassociation orders, not just on x32.
Comment 16 H.J. Lu 2014-03-07 23:12:47 UTC
This patch:

diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
index 2fc8220..56160bd 100644
--- a/gcc/tree-ssanames.c
+++ b/gcc/tree-ssanames.c
@@ -136,7 +136,7 @@ make_ssa_name_fn (struct function *fn, tree var, gimple stmt)
 	      || (TYPE_P (var) && is_gimple_reg_type (var)));
 
   /* If our free list has an element, then use it.  */
-  if (!vec_safe_is_empty (FREE_SSANAMES (fn)))
+  if (!vec_safe_is_empty (FREE_SSANAMES (fn)) && stmt)
     {
       t = FREE_SSANAMES (fn)->pop ();
       if (GATHER_STATISTICS)

avoids putting temporaries free SSANAMES list so that sort_by_operand_rank
is more stable.
Comment 17 rguenther@suse.de 2014-03-08 09:49:28 UTC
On 3/7/14 10:57 PM, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418
> 
> --- Comment #14 from H.J. Lu <hjl.tools at gmail dot com> ---
> (In reply to Richard Biener from comment #13)
>>
>> Huh, adding a pre-header should _never_ do sth like that.  Can you produce
>> a small testcase that exhibits these kind of changes with adding/removing
>> a preheader?
> 
> I don't have a small testcase and it only shows up with -mx32.  Here
> is what I see.  The diffs in 064t.copyprop3 are
> 
> diff -up good/gromacs.x.064t.copyprop3 bad/gromacs.x.064t.copyprop3
> --- good/gromacs.x.064t.copyprop3    2014-03-06 13:14:22.897298915 -0800
> +++ bad/gromacs.x.064t.copyprop3    2014-03-06 13:22:00.221999369 -0800
> @@ -22774,6 +22774,7 @@ inl3300 (int & restrict nri, int[0:] * r
>      goto <bb 8>;
> 
>    <bb 3>:
> +  n_213 = 1;
> 
>    <bb 4>:
>    # n_8 = PHI <1(3), n_218(7)>
> 
> n_213 is unused.  There are many diffs in SSA_NAME_VERSION in 066t.vrp1.
> Diffs in 082t.reassoc1 are
> 
> diff -up good/gromacs.x.082t.reassoc1 bad/gromacs.x.082t.reassoc1
> --- good/gromacs.x.082t.reassoc1    2014-03-06 13:14:22.948297990 -0800
> +++ bad/gromacs.x.082t.reassoc1    2014-03-06 13:22:00.273998425 -0800
> @@ -24250,11 +24250,11 @@ inl3300 (int & restrict nri, int[0:] * r
>    float _201;
>    float _202;
>    int _206;
> +  float _208;
>    float _210;
>    float _211;
>    float _215;
>    float _216;
> -  float _242;
> 
>    <bb 2>:
>    _19 = *nri_18(D);
> @@ -24403,8 +24403,8 @@ inl3300 (int & restrict nri, int[0:] * r
>    ff_152 = _155 + fp_147;
>    vnb12_153 = vv_149 * c12_116;
>    fijr_154 = ff_152 * c12_116;
> -  _242 = vnb12_153 + vnb6_134;
> -  vnbtot_156 = _242 + vnbtot_11;
> +  _208 = vnb12_153 + vnb6_134;
> +  vnbtot_156 = _208 + vnbtot_11;
>    _157 = fijd_135 + fijc_109;
>    _158 = _157 + fijr_154;
>    _159 = ((_158));
> 
> When it gets to 089t.sincos, diffs are
> 
> diff -up good/gromacs.x.089t.sincos bad/gromacs.x.089t.sincos
> --- good/gromacs.x.089t.sincos    2014-03-06 13:14:22.967297646 -0800
> +++ bad/gromacs.x.089t.sincos    2014-03-06 13:22:00.291998098 -0800
> @@ -24252,14 +24252,14 @@ inl3300 (int & restrict nri, int[0:] * r
>    float _201;
>    float _202;
>    int _206;
> +  float _208;
>    float _210;
>    float _211;
> +  float powmult_213;
>    float _215;
>    float _216;
> -  float powmult_238;
> -  float powmult_240;
> -  float powmult_241;
> -  float _242;
> +  float powmult_243;
> +  float powmult_244;
> 
> powmult_xxx are created by make_temp_ssa_name (type, NULL, "powmult")
> in tree-ssa-math-opts.c for
> 
> rsq11             = dx11*dx11+dy11*dy11+dz11*dz11;
> 
> Different powmult_xxx have nothing to do with each others. The orders of
> powmult SSA_NAME_VERSION are different.  As the result, sort_by_operand_rank
> sorts them differently and diffs in 125t.reassoc2 are
> 
>    powmult_80 = dx11_70 * dx11_70;
> -  powmult_241 = dy11_71 * dy11_71;
> -  powmult_240 = dz11_72 * dz11_72;
> -  _699 = powmult_240 + powmult_80;
> -  rsq11_77 = _699 + powmult_241;
> +  powmult_213 = dy11_71 * dy11_71;
> +  _75 = powmult_213 + powmult_80;
> +  powmult_244 = dz11_72 * dz11_72;
> +  rsq11_77 = _75 + powmult_244;
> 
> This is
> 
> rsq11             = dx11*dx11+dz11*dz11+dy11*dy11;
> 
> vs
> 
> rsq11             = dx11*dx11+dy11*dy11+dz11*dz11;
> 
> For FP operations, they may generate slightly different results.
> It shows the reassoc pass is unstable, depending on FREE_SSANAMES when
> make_temp_ssa_name is called.

The question is why that all is dependent on whether we add/remove
preheaders.  That shouldn't create any new SSA names or remove
existing ones.

Yes, maybe sort_by_operand_rank shouldn't sort by SSA_NAME_VERSION but
by existing gimple_uid of the definition stmt (so try to preserve
stmt order).  But that's a completely unrelated thing.
Comment 18 Jakub Jelinek 2014-03-10 10:28:50 UTC
Shouldn't we just prefer the original IL if possible?  That is not SSA_NAME_VERSION, but not gimple_uid of the stmt definition either.
If you have:
  _4 = something;
  _5 = somethingelse;
  _6 = somethingdifferent;
  _7 = _6 + _4;
  _8 = _7 + _5;
then both SSA_NAME_VERSION and gimple_uid of def_stmt sorting would result in
  _9 = _4 + _5;
  _8 = _9 + _6;
wouldn't it?  But what do we gain by reassociating this (perhaps it can help value numbering and CSE if you have differently ordered sequences, but other than that this seems to be unnecessary reshufling and especially for floating point values and -ffast-math unnecessary source of extra ulps).
So perhaps we want to sort by gimple uid of the first use among the insns we are looking at (and take into account also the operand number)?
Comment 19 rguenther@suse.de 2014-03-10 13:27:03 UTC
On Mon, 10 Mar 2014, jakub at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418
> 
> --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Shouldn't we just prefer the original IL if possible?  That is not
> SSA_NAME_VERSION, but not gimple_uid of the stmt definition either.
> If you have:
>   _4 = something;
>   _5 = somethingelse;
>   _6 = somethingdifferent;
>   _7 = _6 + _4;
>   _8 = _7 + _5;
> then both SSA_NAME_VERSION and gimple_uid of def_stmt sorting would result in
>   _9 = _4 + _5;
>   _8 = _9 + _6;
> wouldn't it?  But what do we gain by reassociating this (perhaps it can help
> value numbering and CSE if you have differently ordered sequences, but other
> than that this seems to be unnecessary reshufling and especially for floating
> point values and -ffast-math unnecessary source of extra ulps).
> So perhaps we want to sort by gimple uid of the first use among the insns we
> are looking at (and take into account also the operand number)?

Yes, it wants to canonicalize to get more CSE as followup.  So sorting 
after SSA_NAME_VERSION (or its definition UID) does make sense ...

Looking at the first use is also possible, but what _is_ the first use?

 _4 = something
 if (foo)
   _5 = _4 + 1;
 else
   _6 = _4 + 2;
Comment 20 H.J. Lu 2014-03-10 15:46:27 UTC
(In reply to Richard Biener from comment #13)
> Huh, adding a pre-header should _never_ do sth like that.  Can you produce
> a small testcase that exhibits these kind of changes with adding/removing
> a preheader?

copyprop2 pass removed a preheader and cunrolli pass added it back:

  <bb 3>:
  # n_213 = PHI <1(2)> 

  <bb 8>:
  # n_8 = PHI <n_213(3), n_218(9)>

copyprop3 pass optimized it to

  <bb 3>:
  n_213 = 1;

  <bb 4>:
  # n_8 = PHI <1(3), n_218(7)>

Then the unused n_213 disappeared in reassoc1 pass and n_213 was
put on FREE_SSANAMES.
Comment 21 Jakub Jelinek 2014-03-10 16:36:18 UTC
Can you try if sorting on gimple_uid would help this or not?  I think it would be something like:
--- gcc/tree-ssa-reassoc.c.jj	2014-02-19 06:59:35.000000000 +0100
+++ gcc/tree-ssa-reassoc.c	2014-03-10 17:26:06.707683626 +0100
@@ -506,11 +506,17 @@ sort_by_operand_rank (const void *pa, co
     }
 
   /* Lastly, make sure the versions that are the same go next to each
-     other.  We use SSA_NAME_VERSION because it's stable.  */
+     other.  Prefer gimple_uid of def stmt, fall back to SSA_NAME_VERSION
+     if more stmts have the same uid.  */
   if ((oeb->rank - oea->rank == 0)
       && TREE_CODE (oea->op) == SSA_NAME
       && TREE_CODE (oeb->op) == SSA_NAME)
     {
+      unsigned int uida = gimple_uid (SSA_NAME_DEF_STMT (oea->op));
+      unsigned int uidb = gimple_uid (SSA_NAME_DEF_STMT (oeb->op));
+      if (uida && uidb && uida != uidb)
+	return uidb - uida;
+
       if (SSA_NAME_VERSION (oeb->op) != SSA_NAME_VERSION (oea->op))
 	return SSA_NAME_VERSION (oeb->op) - SSA_NAME_VERSION (oea->op);
       else

(make check RUNTESTFLAGS='--target_board=unix\{-m32,-m64\} dg.exp=*reassoc* tree-ssa.exp=*reassoc*'
with it still passes, haven't tested it more than that).
Comment 22 H.J. Lu 2014-03-10 17:01:01 UTC
(In reply to Jakub Jelinek from comment #21)
> Can you try if sorting on gimple_uid would help this or not?  I think it
> would be something like:

Yes, it works.
Comment 23 rguenther@suse.de 2014-03-11 08:49:14 UTC
On Mon, 10 Mar 2014, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418
> 
> --- Comment #20 from H.J. Lu <hjl.tools at gmail dot com> ---
> (In reply to Richard Biener from comment #13)
> > Huh, adding a pre-header should _never_ do sth like that.  Can you produce
> > a small testcase that exhibits these kind of changes with adding/removing
> > a preheader?
> 
> copyprop2 pass removed a preheader and cunrolli pass added it back:
> 
>   <bb 3>:
>   # n_213 = PHI <1(2)> 
> 
>   <bb 8>:
>   # n_8 = PHI <n_213(3), n_218(9)>

Not sure how it manages to add this kind of PHI node for just
adding a preheader?  That looks odd to me.

> copyprop3 pass optimized it to
> 
>   <bb 3>:
>   n_213 = 1;
> 
>   <bb 4>:
>   # n_8 = PHI <1(3), n_218(7)>
>
> Then the unused n_213 disappeared in reassoc1 pass and n_213 was
> put on FREE_SSANAMES.

By dce1 probably.

That's all expected optimizations and not really dependent on
preheader add/remove.

Richard.
Comment 24 H.J. Lu 2014-03-11 15:06:14 UTC
(In reply to rguenther@suse.de from comment #23)
> On Mon, 10 Mar 2014, hjl.tools at gmail dot com wrote:
> 
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418
> > 
> > --- Comment #20 from H.J. Lu <hjl.tools at gmail dot com> ---
> > (In reply to Richard Biener from comment #13)
> > > Huh, adding a pre-header should _never_ do sth like that.  Can you produce
> > > a small testcase that exhibits these kind of changes with adding/removing
> > > a preheader?
> > 
> > copyprop2 pass removed a preheader and cunrolli pass added it back:
> > 
> >   <bb 3>:
> >   # n_213 = PHI <1(2)> 
> > 
> >   <bb 8>:
> >   # n_8 = PHI <n_213(3), n_218(9)>
> 
> Not sure how it manages to add this kind of PHI node for just
> adding a preheader?  That looks odd to me.
> 

(gdb) bt
#0  make_ssa_name_fn (fn=0x7f039cc63d20, var=0x7f039d127ed8, 
    stmt=0x7f039c095800) at ../../../../gcc/gcc/tree-ssanames.c:133
#1  0x0000000000c66392 in copy_ssa_name_fn (fn=0x7f039cc63d20, 
    name=0x7f039cacdcf0, stmt=0x7f039c095800)
    at ../../../../gcc/gcc/tree-ssanames.c:459
#2  0x0000000000a7e389 in copy_ssa_name (var=0x7f039cacdcf0, 
    stmt=0x7f039c095800) at ../../../../gcc/gcc/tree-ssanames.h:118
#3  0x0000000000a8d874 in gimple_make_forwarder_block (fallthru=0x7f039d777f18)
    at ../../../../gcc/gcc/tree-cfg.c:5291
#4  0x000000000061a663 in make_forwarder_block (bb=0x7f039d0c7a90, 
    redirect_edge_p=0x627199 <mfb_keep_just(edge_def*)>, new_bb_cbk=0x0)
    at ../../../../gcc/gcc/cfghooks.c:912
#5  0x0000000000627416 in create_preheader (loop=0x7f039d25e280, flags=1)
    at ../../../../gcc/gcc/cfgloopmanip.c:1552
#6  0x000000000062756a in create_preheaders (flags=1)
    at ../../../../gcc/gcc/cfgloopmanip.c:1602
#7  0x00000000008b76ee in apply_loop_flags (flags=15)
    at ../../../../gcc/gcc/loop-init.c:63
#8  0x00000000008b780d in loop_optimizer_init (flags=15)
    at ../../../../gcc/gcc/loop-init.c:121
#9  0x0000000000b9ef15 in tree_complete_unroll_inner ()
    at ../../../../gcc/gcc/tree-ssa-loop-ivcanon.c:1377
#10 0x0000000000b9efcf in (anonymous namespace)::pass_complete_unrolli::execute
---Type <return> to continue, or q <return> to quit---
    (this=0x28d2110) at ../../../../gcc/gcc/tree-ssa-loop-ivcanon.c:1422
#11 0x0000000000968b76 in execute_one_pass (pass=0x28d2110)
    at ../../../../gcc/gcc/passes.c:2229
#12 0x0000000000968d8f in execute_pass_list (pass=0x28d2110)
    at ../../../../gcc/gcc/passes.c:2282
#13 0x0000000000968dc0 in execute_pass_list (pass=0x28d1ed0)
    at ../../../../gcc/gcc/passes.c:2283
#14 0x000000000064676a in expand_function (node=0x7f039dba5668)
    at ../../../../gcc/gcc/cgraphunit.c:1774
#15 0x0000000000646d5e in expand_all_functions ()
    at ../../../../gcc/gcc/cgraphunit.c:1908
#16 0x000000000064772a in compile () at ../../../../gcc/gcc/cgraphunit.c:2252
#17 0x000000000059a615 in lto_main () at ../../../../gcc/gcc/lto/lto.c:3427
#18 0x0000000000a53c2c in compile_file () at ../../../../gcc/gcc/toplev.c:548
#19 0x0000000000a55dc3 in do_compile () at ../../../../gcc/gcc/toplev.c:1914
#20 0x0000000000a55f2e in toplev_main (argc=24, argv=0x28ac850)
    at ../../../../gcc/gcc/toplev.c:1990
#21 0x00000000011dde12 in main (argc=24, argv=0x7fffdea91ed8)
    at ../../../../gcc/gcc/main.c:36
(gdb)
Comment 25 Jakub Jelinek 2014-03-11 17:46:56 UTC
Created attachment 32334 [details]
gcc49-pr60418.patch

Unfortunately the patch fails to bootstrap due to -fcompare-debug failures, so tried this instead, but it also fails due to -fcompare-debug failures.
Comment 26 H.J. Lu 2014-03-11 21:03:48 UTC
This patch avoids removing preheader when optimizing since
preheader will be added back later even if it is removed:

diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c
index 926d300..af5b24b 100644
--- a/gcc/tree-cfgcleanup.c
+++ b/gcc/tree-cfgcleanup.c
@@ -328,7 +328,10 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted)
                  (LOOPS_MAY_HAVE_MULTIPLE_LATCHES));
       }
     else if (bb->loop_father == loop_outer (dest->loop_father))
-      return !loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS);
+      /* Preserve preheader when optimizing since it will be added
+         back later.  */
+      return !(optimize
+          || loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS));
     /* Always preserve other edges into loop headers that are
        not simple latches or preheaders.  */
     return false;
Comment 27 Jakub Jelinek 2014-03-12 08:53:57 UTC
  basic_block bbb = gimple_bb (SSA_NAME_DEF_STMT (oea->op));
in the patch contains a pasto, which I guess can explain most if not all the -fcompare-debug failures.  That said, thinking more about it, there still could be -fcompare-debug differences, because debug stmts also get uids and sometimes when inserting stmts the previous stmt uid is used, sometimes next stmt uid, so we probably should use reassoc_stmt_dominates_stmt_p.
Comment 28 rguenther@suse.de 2014-03-12 08:54:48 UTC
On Tue, 11 Mar 2014, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418
> 
> --- Comment #26 from H.J. Lu <hjl.tools at gmail dot com> ---
> This patch avoids removing preheader when optimizing since
> preheader will be added back later even if it is removed:
> 
> diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c
> index 926d300..af5b24b 100644
> --- a/gcc/tree-cfgcleanup.c
> +++ b/gcc/tree-cfgcleanup.c
> @@ -328,7 +328,10 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted)
>                   (LOOPS_MAY_HAVE_MULTIPLE_LATCHES));
>        }
>      else if (bb->loop_father == loop_outer (dest->loop_father))
> -      return !loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS);
> +      /* Preserve preheader when optimizing since it will be added
> +         back later.  */
> +      return !(optimize
> +          || loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS));
>      /* Always preserve other edges into loop headers that are
>         not simple latches or preheaders.  */
>      return false;

HJ, these patches don't make sense - this patch will regress the
testcase the code above was introduced for.
Comment 29 Jakub Jelinek 2014-03-12 09:13:09 UTC
Created attachment 32338 [details]
gcc49-pr60418.patch

New version of the patch, going to bootstrap/regtest it momentarily.  H.J., could you please check if this also fixes the gromacs -mx32 issue?
Comment 30 Jakub Jelinek 2014-03-12 13:01:09 UTC
The #c29 patch now passed bootstrap/regtest on x86_64-linux and i686-linux.
So, if it also fixes 435.gromacs on -mx32, I'll submit it to gcc-patches.
Comment 31 H.J. Lu 2014-03-12 15:29:27 UTC
(In reply to Jakub Jelinek from comment #29)
> Created attachment 32338 [details]
> gcc49-pr60418.patch
> 
> New version of the patch, going to bootstrap/regtest it momentarily.  H.J.,
> could you please check if this also fixes the gromacs -mx32 issue?

Yes, it works with -mx32. Thanks.
Comment 32 Jakub Jelinek 2014-03-13 09:39:00 UTC
Author: jakub
Date: Thu Mar 13 09:38:28 2014
New Revision: 208535

URL: http://gcc.gnu.org/viewcvs?rev=208535&root=gcc&view=rev
Log:
	PR tree-optimization/59025
	PR middle-end/60418
	* tree-ssa-reassoc.c (sort_by_operand_rank): For SSA_NAMEs with the
	same rank, sort by bb_rank and gimple_uid of SSA_NAME_DEF_STMT first.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-reassoc.c
Comment 33 Jakub Jelinek 2014-03-13 09:40:10 UTC
Hopefully fixed now.