Since r208165, 435.gromacs in SPEC CPU 2006 is miscompiled on x32 with -mx32 -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin The failure is Running 435.gromacs ref peak lto default *** Miscompare of gromacs.out; for details see /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.grom acs/run/run_peak_ref_lto.0000/gromacs.out.mis cat /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.gromacs/run/run_peak_ref_lto.0000/gromacs.out.mis 0002: 3.07684e+02 3.03594e+02 The result of 3.03594e+02 is outside of tolerance for 3.07684e+02. I added a static counter in tree_forwarder_block_p for the number of times returned true above and return false if it is between getenv("from") and getenv("to"). I noticed that as more basic blocks get removed, the deviation is getting bigger. One removed basic block has ;; basic block 3, loop depth 0, count 0, freq 225, maybe hot ;; prev block 2, next block 4, flags: (NEW, REACHABLE) ;; pred: 2 ;; ;; succ: 4 and its successor has ;; basic block 4, loop depth 1, count 0, freq 2500, maybe hot ;; prev block 3, next block 5, flags: (NEW) ;; pred: 3 ;; 6 ;; starting at line 722 # gimple_phi <i_429, 0(3), [bondfree.c : 726:24] i_27(6)> Is is safe to remove basic block 3?
Do we need to verify that basic block DEST has a single predecessor? This patch works for me: diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c index 926d300..0ed15df 100644 --- a/gcc/tree-cfgcleanup.c +++ b/gcc/tree-cfgcleanup.c @@ -316,6 +316,8 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted) /* Protect loop preheaders and latches if requested. */ if (dest->loop_father->header == dest) { + if (single_pred_p (dest)) + { if (bb->loop_father == dest->loop_father) { if (loops_state_satisfies_p (LOOPS_HAVE_SIMPLE_LATCHES)) @@ -329,6 +331,7 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted) } else if (bb->loop_father == loop_outer (dest->loop_father)) return !loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS); + } /* Always preserve other edges into loop headers that are not simple latches or preheaders. */ return false;
Hmm, what's the target triplet of a x32-only target?(In reply to H.J. Lu from comment #0) > Since r208165, 435.gromacs in SPEC CPU 2006 is miscompiled on x32 with > > -mx32 -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver > -fuse-linker-plugin > > The failure is > > Running 435.gromacs ref peak lto default > > *** Miscompare of gromacs.out; for details see > > /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.grom > acs/run/run_peak_ref_lto.0000/gromacs.out.mis > > cat > /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435. > gromacs/run/run_peak_ref_lto.0000/gromacs.out.mis > 0002: 3.07684e+02 > 3.03594e+02 > > The result of 3.03594e+02 is outside of tolerance for 3.07684e+02. > I added a static counter in tree_forwarder_block_p for the number of > times returned true above and return false if it is between getenv("from") > and getenv("to"). I noticed that as more basic blocks get removed, > the deviation is getting bigger. One removed basic block has > > ;; basic block 3, loop depth 0, count 0, freq 225, maybe hot > ;; prev block 2, next block 4, flags: (NEW, REACHABLE) > ;; pred: 2 > ;; > ;; succ: 4 > > and its successor has > > ;; basic block 4, loop depth 1, count 0, freq 2500, maybe hot > ;; prev block 3, next block 5, flags: (NEW) > ;; pred: 3 > ;; 6 > ;; starting at line 722 > # gimple_phi <i_429, 0(3), [bondfree.c : 726:24] i_27(6)> > > Is is safe to remove basic block 3? Sure it is. This is a loop pre-header.
(In reply to H.J. Lu from comment #1) > Do we need to verify that basic block DEST has a single predecessor? No, that's not necessary. In fact that disables all pre-header removal (because the loop header always has the loop entry and at least one latch)
(In reply to H.J. Lu from comment #0) > Since r208165, 435.gromacs in SPEC CPU 2006 is miscompiled on x32 with > > -mx32 -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver > -fuse-linker-plugin > > The failure is > > Running 435.gromacs ref peak lto default > > *** Miscompare of gromacs.out; for details see > > /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435.grom > acs/run/run_peak_ref_lto.0000/gromacs.out.mis > > cat > /export/project/git/gcc-regression/spec/2006/spec/benchspec/CPU2006/435. > gromacs/run/run_peak_ref_lto.0000/gromacs.out.mis > 0002: 3.07684e+02 > 3.03594e+02 > > The result of 3.03594e+02 is outside of tolerance for 3.07684e+02. > I added a static counter in tree_forwarder_block_p for the number of > times returned true above and return false if it is between getenv("from") > and getenv("to"). I noticed that as more basic blocks get removed, > the deviation is getting bigger. One removed basic block has That's odd. What's the code generation difference in the end for the minimal range {to, from} that still shows a difference of the result?
Created attachment 32280 [details] The assembly difference of good and bad executables Here is the assembly difference of good and bad executables due the removal of the single loop preheader.
(In reply to Richard Biener from comment #4) > > That's odd. What's the code generation difference in the end > for the minimal range {to, from} that still shows a difference > of the result? In this particular case, loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS) returns false: (gdb) bt #0 tree_forwarder_block_p (bb=0x7f69144186e8, phi_wanted=false) at ../../../../gcc/gcc/tree-cfgcleanup.c:357 #1 0x0000000000a9a07c in cleanup_tree_cfg_bb (bb=0x7f69144186e8) at ../../../../gcc/gcc/tree-cfgcleanup.c:666 #2 0x0000000000a9a19c in cleanup_tree_cfg_1 () at ../../../../gcc/gcc/tree-cfgcleanup.c:709 #3 0x0000000000a9a2d4 in cleanup_tree_cfg_noloop () at ../../../../gcc/gcc/tree-cfgcleanup.c:765 #4 0x0000000000a9a3e0 in cleanup_tree_cfg () at ../../../../gcc/gcc/tree-cfgcleanup.c:820 #5 0x0000000000967f20 in execute_function_todo (data=0x834) at ../../../../gcc/gcc/passes.c:1811 #6 0x000000000096732a in do_per_function ( callback=0x967ee1 <execute_function_todo(void*)>, data=0x834) at ../../../../gcc/gcc/passes.c:1574 #7 0x0000000000968176 in execute_todo (flags=2100) at ../../../../gcc/gcc/passes.c:1887 #8 0x0000000000968c05 in execute_one_pass (pass=0x29469b0) at ../../../../gcc/gcc/passes.c:2243 #9 0x0000000000968d8f in execute_pass_list (pass=0x29469b0) at ../../../../gcc/gcc/passes.c:2282 #10 0x0000000000968dc0 in execute_pass_list (pass=0x2945ed0) at ../../../../gcc/gcc/passes.c:2283 ---Type <return> to continue, or q <return> to quit---q Quit (gdb) f 5 #5 0x0000000000967f20 in execute_function_todo (data=0x834) at ../../../../gcc/gcc/passes.c:1811 1811 cleanup_tree_cfg (); (gdb) f 6 #6 0x000000000096732a in do_per_function ( callback=0x967ee1 <execute_function_todo(void*)>, data=0x834) at ../../../../gcc/gcc/passes.c:1574 1574 callback (data); (gdb) f 8 #8 0x0000000000968c05 in execute_one_pass (pass=0x29469b0) at ../../../../gcc/gcc/passes.c:2243 2243 execute_todo (todo_after | pass->todo_flags_finish); (gdb) p *pass $4 = {<pass_data> = {type = GIMPLE_PASS, name = 0x1300470 "phicprop", optinfo_flags = 0, has_gate = true, has_execute = true, tv_id = TV_TREE_PHI_CPROP, properties_required = 40, properties_provided = 0, properties_destroyed = 0, todo_flags_start = 524288, todo_flags_finish = 2100}, _vptr.opt_pass = 0x13003d0 <vtable for (anonymous namespace)::pass_phi_only_cprop+16>, sub = 0x0, next = 0x2946a10, static_pass_number = 85, graph_dump_initialized = false, m_ctxt = 0x29214e0} Is this OK for the phicprop pass to remove loop pre-headers when loop_optimizer_init isn't called?
current_loops is setup by (gdb) bt #0 set_loops_for_fn (fn=0x7ffff15901f8, loops=0x7fffef079e20) at ../../../../gcc/gcc/function.h:747 #1 0x00000000008fe35e in input_cfg (ib=0x7fffffffdbc0, data_in=0x1fc4a80, fn=0x7ffff15901f8, count_materialization_scale=10000) at ../../../../gcc/gcc/lto-streamer-in.c:687 #2 0x00000000008ff088 in input_function (fn_decl=0x7ffff1a6d800, data_in=0x1fc4a80, ib=0x7fffffffdbb0, ib_cfg=0x7fffffffdbc0) at ../../../../gcc/gcc/lto-streamer-in.c:935 #3 0x00000000008ff7ca in lto_read_body (file_data=0x7ffff1866000, node=0x7ffff10468f8, data=0x2439810 "\003", section_type=LTO_section_function_body) at ../../../../gcc/gcc/lto-streamer-in.c:1081 #4 0x00000000008ffa55 in lto_input_function_body (file_data=0x7ffff1866000, node=0x7ffff10468f8, data=0x2439810 "\003") at ../../../../gcc/gcc/lto-streamer-in.c:1123 #5 0x000000000063ea32 in cgraph_get_body (node=0x7ffff10468f8) at ../../../../gcc/gcc/cgraph.c:3012 #6 0x00000000006466ad in expand_function (node=0x7ffff10468f8) at ../../../../gcc/gcc/cgraphunit.c:1746 #7 0x0000000000646d5e in expand_all_functions () at ../../../../gcc/gcc/cgraphunit.c:1908 #8 0x000000000064772a in compile () at ../../../../gcc/gcc/cgraphunit.c:2252 #9 0x000000000059a615 in lto_main () at ../../../../gcc/gcc/lto/lto.c:3427 ---Type <return> to continue, or q <return> to quit--- #10 0x0000000000a53c2c in compile_file () at ../../../../gcc/gcc/toplev.c:548 #11 0x0000000000a55dc3 in do_compile () at ../../../../gcc/gcc/toplev.c:1914 #12 0x0000000000a55f2e in toplev_main (argc=22, argv=0x1ac9810) at ../../../../gcc/gcc/toplev.c:1990 #13 0x00000000011ddec4 in main (argc=22, argv=0x7fffffffdef8) at ../../../../gcc/gcc/main.c:36 if (current_loops) { } is entered without calling loop_optimizer_init first.
With LTO, current_loops->state may not be set and loops_state_satisfies_p (xxxx) may always be false.
One pass removes a preheader in if (_47 > 0) goto <bb 3>; else goto <bb 8>; <bb 3>: <bb 4>: # n_20 = PHI <1(3), n_269(7)> and becomes: if (_47 > 0) goto <bb 3>; else goto <bb 7>; <bb 3>: # n_20 = PHI <1(2), n_269(6)> The next pass adds the preheader back: if (_47 > 0) goto <bb 3>; else goto <bb 7>; <bb 3>: # n_264 = PHI <1(2)> <bb 8>: # n_20 = PHI <n_264(3), n_269(9)> which changes the loop behavior and generates different results.
Sources have many FP loops contains codes like: rsq11 = dx11*dx11+dy11*dy11+dz11*dz11; When they are compiled with -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver -fuse-linker-plugin LTO input IRs contain statements like powmult_241 = dy11_71 * dy11_71; powmult_240 = dz11_72 * dz11_72; _699 = powmult_240 + powmult_80; rsq11_77 = _699 + powmult_241; During the final LTO link, lto1 repeatedly removes loop a preheader in one pass and adds it back in the next pass. Each removal/add changes the statements to powmult_213 = dy11_71 * dy11_71; _75 = powmult_213 + powmult_80; powmult_244 = dz11_72 * dz11_72; rsq11_77 = _75 + powmult_244; Each such re-order may change the FP result slightly. They can accumulate to such a degree that the end result is outside of tolerance.
sort_by_operand_rank in reassoc.cin has /* Lastly, make sure the versions that are the same go next to each other. We use SSA_NAME_VERSION because it's stable. */ if ((oeb->rank - oea->rank == 0) && TREE_CODE (oea->op) == SSA_NAME && TREE_CODE (oeb->op) == SSA_NAME) { if (SSA_NAME_VERSION (oeb->op) != SSA_NAME_VERSION (oea->op)) return SSA_NAME_VERSION (oeb->op) - SSA_NAME_VERSION (oea->op); else return oeb->id - oea->id; } It will treat powmult_80 = dx11_70 * dx11_70; powmult_241 = dy11_71 * dy11_71; _75 = powmult_80 + powmult_241; powmult_240 = dz11_72 * dz11_72; rsq11_77 = _75 + powmult_240; where powmult_240 is between powmult_80 and powmult_241, differently from powmult_80 = dx11_70 * dx11_70; powmult_213 = dy11_71 * dy11_71; _75 = powmult_80 + powmult_213; powmult_244 = dz11_72 * dz11_72; rsq11_77 = _75 + powmult_244; where powmult_244 > powmult_80 and powmult_213.
(In reply to H.J. Lu from comment #8) > With LTO, current_loops->state may not be set and > loops_state_satisfies_p (xxxx) may always be false. It's zero. It probably should be LOOPS_MAY_HAVE_MULTIPLE_LATCHES, but that doesn't change anything ;)
(In reply to H.J. Lu from comment #10) > Sources have many FP loops contains codes like: > > rsq11 = dx11*dx11+dy11*dy11+dz11*dz11; > > When they are compiled with > > -O3 -funroll-loops -ffast-math -fwhole-program -flto=jobserver > -fuse-linker-plugin > > LTO input IRs contain statements like > > powmult_241 = dy11_71 * dy11_71; > powmult_240 = dz11_72 * dz11_72; > _699 = powmult_240 + powmult_80; > rsq11_77 = _699 + powmult_241; > > During the final LTO link, lto1 repeatedly removes loop a preheader > in one pass and adds it back in the next pass. Each removal/add > changes the statements to > > powmult_213 = dy11_71 * dy11_71; > _75 = powmult_213 + powmult_80; > powmult_244 = dz11_72 * dz11_72; > rsq11_77 = _75 + powmult_244; > > Each such re-order may change the FP result slightly. They > can accumulate to such a degree that the end result is > outside of tolerance. Huh, adding a pre-header should _never_ do sth like that. Can you produce a small testcase that exhibits these kind of changes with adding/removing a preheader?
(In reply to Richard Biener from comment #13) > > Huh, adding a pre-header should _never_ do sth like that. Can you produce > a small testcase that exhibits these kind of changes with adding/removing > a preheader? I don't have a small testcase and it only shows up with -mx32. Here is what I see. The diffs in 064t.copyprop3 are diff -up good/gromacs.x.064t.copyprop3 bad/gromacs.x.064t.copyprop3 --- good/gromacs.x.064t.copyprop3 2014-03-06 13:14:22.897298915 -0800 +++ bad/gromacs.x.064t.copyprop3 2014-03-06 13:22:00.221999369 -0800 @@ -22774,6 +22774,7 @@ inl3300 (int & restrict nri, int[0:] * r goto <bb 8>; <bb 3>: + n_213 = 1; <bb 4>: # n_8 = PHI <1(3), n_218(7)> n_213 is unused. There are many diffs in SSA_NAME_VERSION in 066t.vrp1. Diffs in 082t.reassoc1 are diff -up good/gromacs.x.082t.reassoc1 bad/gromacs.x.082t.reassoc1 --- good/gromacs.x.082t.reassoc1 2014-03-06 13:14:22.948297990 -0800 +++ bad/gromacs.x.082t.reassoc1 2014-03-06 13:22:00.273998425 -0800 @@ -24250,11 +24250,11 @@ inl3300 (int & restrict nri, int[0:] * r float _201; float _202; int _206; + float _208; float _210; float _211; float _215; float _216; - float _242; <bb 2>: _19 = *nri_18(D); @@ -24403,8 +24403,8 @@ inl3300 (int & restrict nri, int[0:] * r ff_152 = _155 + fp_147; vnb12_153 = vv_149 * c12_116; fijr_154 = ff_152 * c12_116; - _242 = vnb12_153 + vnb6_134; - vnbtot_156 = _242 + vnbtot_11; + _208 = vnb12_153 + vnb6_134; + vnbtot_156 = _208 + vnbtot_11; _157 = fijd_135 + fijc_109; _158 = _157 + fijr_154; _159 = ((_158)); When it gets to 089t.sincos, diffs are diff -up good/gromacs.x.089t.sincos bad/gromacs.x.089t.sincos --- good/gromacs.x.089t.sincos 2014-03-06 13:14:22.967297646 -0800 +++ bad/gromacs.x.089t.sincos 2014-03-06 13:22:00.291998098 -0800 @@ -24252,14 +24252,14 @@ inl3300 (int & restrict nri, int[0:] * r float _201; float _202; int _206; + float _208; float _210; float _211; + float powmult_213; float _215; float _216; - float powmult_238; - float powmult_240; - float powmult_241; - float _242; + float powmult_243; + float powmult_244; powmult_xxx are created by make_temp_ssa_name (type, NULL, "powmult") in tree-ssa-math-opts.c for rsq11 = dx11*dx11+dy11*dy11+dz11*dz11; Different powmult_xxx have nothing to do with each others. The orders of powmult SSA_NAME_VERSION are different. As the result, sort_by_operand_rank sorts them differently and diffs in 125t.reassoc2 are powmult_80 = dx11_70 * dx11_70; - powmult_241 = dy11_71 * dy11_71; - powmult_240 = dz11_72 * dz11_72; - _699 = powmult_240 + powmult_80; - rsq11_77 = _699 + powmult_241; + powmult_213 = dy11_71 * dy11_71; + _75 = powmult_213 + powmult_80; + powmult_244 = dz11_72 * dz11_72; + rsq11_77 = _75 + powmult_244; This is rsq11 = dx11*dx11+dz11*dz11+dy11*dy11; vs rsq11 = dx11*dx11+dy11*dy11+dz11*dz11; For FP operations, they may generate slightly different results. It shows the reassoc pass is unstable, depending on FREE_SSANAMES when make_temp_ssa_name is called.
Please look at PR59025, 435.gromacs is apparently very sensitive to any changes in reassociation orders, not just on x32.
This patch: diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c index 2fc8220..56160bd 100644 --- a/gcc/tree-ssanames.c +++ b/gcc/tree-ssanames.c @@ -136,7 +136,7 @@ make_ssa_name_fn (struct function *fn, tree var, gimple stmt) || (TYPE_P (var) && is_gimple_reg_type (var))); /* If our free list has an element, then use it. */ - if (!vec_safe_is_empty (FREE_SSANAMES (fn))) + if (!vec_safe_is_empty (FREE_SSANAMES (fn)) && stmt) { t = FREE_SSANAMES (fn)->pop (); if (GATHER_STATISTICS) avoids putting temporaries free SSANAMES list so that sort_by_operand_rank is more stable.
On 3/7/14 10:57 PM, hjl.tools at gmail dot com wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418 > > --- Comment #14 from H.J. Lu <hjl.tools at gmail dot com> --- > (In reply to Richard Biener from comment #13) >> >> Huh, adding a pre-header should _never_ do sth like that. Can you produce >> a small testcase that exhibits these kind of changes with adding/removing >> a preheader? > > I don't have a small testcase and it only shows up with -mx32. Here > is what I see. The diffs in 064t.copyprop3 are > > diff -up good/gromacs.x.064t.copyprop3 bad/gromacs.x.064t.copyprop3 > --- good/gromacs.x.064t.copyprop3 2014-03-06 13:14:22.897298915 -0800 > +++ bad/gromacs.x.064t.copyprop3 2014-03-06 13:22:00.221999369 -0800 > @@ -22774,6 +22774,7 @@ inl3300 (int & restrict nri, int[0:] * r > goto <bb 8>; > > <bb 3>: > + n_213 = 1; > > <bb 4>: > # n_8 = PHI <1(3), n_218(7)> > > n_213 is unused. There are many diffs in SSA_NAME_VERSION in 066t.vrp1. > Diffs in 082t.reassoc1 are > > diff -up good/gromacs.x.082t.reassoc1 bad/gromacs.x.082t.reassoc1 > --- good/gromacs.x.082t.reassoc1 2014-03-06 13:14:22.948297990 -0800 > +++ bad/gromacs.x.082t.reassoc1 2014-03-06 13:22:00.273998425 -0800 > @@ -24250,11 +24250,11 @@ inl3300 (int & restrict nri, int[0:] * r > float _201; > float _202; > int _206; > + float _208; > float _210; > float _211; > float _215; > float _216; > - float _242; > > <bb 2>: > _19 = *nri_18(D); > @@ -24403,8 +24403,8 @@ inl3300 (int & restrict nri, int[0:] * r > ff_152 = _155 + fp_147; > vnb12_153 = vv_149 * c12_116; > fijr_154 = ff_152 * c12_116; > - _242 = vnb12_153 + vnb6_134; > - vnbtot_156 = _242 + vnbtot_11; > + _208 = vnb12_153 + vnb6_134; > + vnbtot_156 = _208 + vnbtot_11; > _157 = fijd_135 + fijc_109; > _158 = _157 + fijr_154; > _159 = ((_158)); > > When it gets to 089t.sincos, diffs are > > diff -up good/gromacs.x.089t.sincos bad/gromacs.x.089t.sincos > --- good/gromacs.x.089t.sincos 2014-03-06 13:14:22.967297646 -0800 > +++ bad/gromacs.x.089t.sincos 2014-03-06 13:22:00.291998098 -0800 > @@ -24252,14 +24252,14 @@ inl3300 (int & restrict nri, int[0:] * r > float _201; > float _202; > int _206; > + float _208; > float _210; > float _211; > + float powmult_213; > float _215; > float _216; > - float powmult_238; > - float powmult_240; > - float powmult_241; > - float _242; > + float powmult_243; > + float powmult_244; > > powmult_xxx are created by make_temp_ssa_name (type, NULL, "powmult") > in tree-ssa-math-opts.c for > > rsq11 = dx11*dx11+dy11*dy11+dz11*dz11; > > Different powmult_xxx have nothing to do with each others. The orders of > powmult SSA_NAME_VERSION are different. As the result, sort_by_operand_rank > sorts them differently and diffs in 125t.reassoc2 are > > powmult_80 = dx11_70 * dx11_70; > - powmult_241 = dy11_71 * dy11_71; > - powmult_240 = dz11_72 * dz11_72; > - _699 = powmult_240 + powmult_80; > - rsq11_77 = _699 + powmult_241; > + powmult_213 = dy11_71 * dy11_71; > + _75 = powmult_213 + powmult_80; > + powmult_244 = dz11_72 * dz11_72; > + rsq11_77 = _75 + powmult_244; > > This is > > rsq11 = dx11*dx11+dz11*dz11+dy11*dy11; > > vs > > rsq11 = dx11*dx11+dy11*dy11+dz11*dz11; > > For FP operations, they may generate slightly different results. > It shows the reassoc pass is unstable, depending on FREE_SSANAMES when > make_temp_ssa_name is called. The question is why that all is dependent on whether we add/remove preheaders. That shouldn't create any new SSA names or remove existing ones. Yes, maybe sort_by_operand_rank shouldn't sort by SSA_NAME_VERSION but by existing gimple_uid of the definition stmt (so try to preserve stmt order). But that's a completely unrelated thing.
Shouldn't we just prefer the original IL if possible? That is not SSA_NAME_VERSION, but not gimple_uid of the stmt definition either. If you have: _4 = something; _5 = somethingelse; _6 = somethingdifferent; _7 = _6 + _4; _8 = _7 + _5; then both SSA_NAME_VERSION and gimple_uid of def_stmt sorting would result in _9 = _4 + _5; _8 = _9 + _6; wouldn't it? But what do we gain by reassociating this (perhaps it can help value numbering and CSE if you have differently ordered sequences, but other than that this seems to be unnecessary reshufling and especially for floating point values and -ffast-math unnecessary source of extra ulps). So perhaps we want to sort by gimple uid of the first use among the insns we are looking at (and take into account also the operand number)?
On Mon, 10 Mar 2014, jakub at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418 > > --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> --- > Shouldn't we just prefer the original IL if possible? That is not > SSA_NAME_VERSION, but not gimple_uid of the stmt definition either. > If you have: > _4 = something; > _5 = somethingelse; > _6 = somethingdifferent; > _7 = _6 + _4; > _8 = _7 + _5; > then both SSA_NAME_VERSION and gimple_uid of def_stmt sorting would result in > _9 = _4 + _5; > _8 = _9 + _6; > wouldn't it? But what do we gain by reassociating this (perhaps it can help > value numbering and CSE if you have differently ordered sequences, but other > than that this seems to be unnecessary reshufling and especially for floating > point values and -ffast-math unnecessary source of extra ulps). > So perhaps we want to sort by gimple uid of the first use among the insns we > are looking at (and take into account also the operand number)? Yes, it wants to canonicalize to get more CSE as followup. So sorting after SSA_NAME_VERSION (or its definition UID) does make sense ... Looking at the first use is also possible, but what _is_ the first use? _4 = something if (foo) _5 = _4 + 1; else _6 = _4 + 2;
(In reply to Richard Biener from comment #13) > Huh, adding a pre-header should _never_ do sth like that. Can you produce > a small testcase that exhibits these kind of changes with adding/removing > a preheader? copyprop2 pass removed a preheader and cunrolli pass added it back: <bb 3>: # n_213 = PHI <1(2)> <bb 8>: # n_8 = PHI <n_213(3), n_218(9)> copyprop3 pass optimized it to <bb 3>: n_213 = 1; <bb 4>: # n_8 = PHI <1(3), n_218(7)> Then the unused n_213 disappeared in reassoc1 pass and n_213 was put on FREE_SSANAMES.
Can you try if sorting on gimple_uid would help this or not? I think it would be something like: --- gcc/tree-ssa-reassoc.c.jj 2014-02-19 06:59:35.000000000 +0100 +++ gcc/tree-ssa-reassoc.c 2014-03-10 17:26:06.707683626 +0100 @@ -506,11 +506,17 @@ sort_by_operand_rank (const void *pa, co } /* Lastly, make sure the versions that are the same go next to each - other. We use SSA_NAME_VERSION because it's stable. */ + other. Prefer gimple_uid of def stmt, fall back to SSA_NAME_VERSION + if more stmts have the same uid. */ if ((oeb->rank - oea->rank == 0) && TREE_CODE (oea->op) == SSA_NAME && TREE_CODE (oeb->op) == SSA_NAME) { + unsigned int uida = gimple_uid (SSA_NAME_DEF_STMT (oea->op)); + unsigned int uidb = gimple_uid (SSA_NAME_DEF_STMT (oeb->op)); + if (uida && uidb && uida != uidb) + return uidb - uida; + if (SSA_NAME_VERSION (oeb->op) != SSA_NAME_VERSION (oea->op)) return SSA_NAME_VERSION (oeb->op) - SSA_NAME_VERSION (oea->op); else (make check RUNTESTFLAGS='--target_board=unix\{-m32,-m64\} dg.exp=*reassoc* tree-ssa.exp=*reassoc*' with it still passes, haven't tested it more than that).
(In reply to Jakub Jelinek from comment #21) > Can you try if sorting on gimple_uid would help this or not? I think it > would be something like: Yes, it works.
On Mon, 10 Mar 2014, hjl.tools at gmail dot com wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418 > > --- Comment #20 from H.J. Lu <hjl.tools at gmail dot com> --- > (In reply to Richard Biener from comment #13) > > Huh, adding a pre-header should _never_ do sth like that. Can you produce > > a small testcase that exhibits these kind of changes with adding/removing > > a preheader? > > copyprop2 pass removed a preheader and cunrolli pass added it back: > > <bb 3>: > # n_213 = PHI <1(2)> > > <bb 8>: > # n_8 = PHI <n_213(3), n_218(9)> Not sure how it manages to add this kind of PHI node for just adding a preheader? That looks odd to me. > copyprop3 pass optimized it to > > <bb 3>: > n_213 = 1; > > <bb 4>: > # n_8 = PHI <1(3), n_218(7)> > > Then the unused n_213 disappeared in reassoc1 pass and n_213 was > put on FREE_SSANAMES. By dce1 probably. That's all expected optimizations and not really dependent on preheader add/remove. Richard.
(In reply to rguenther@suse.de from comment #23) > On Mon, 10 Mar 2014, hjl.tools at gmail dot com wrote: > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418 > > > > --- Comment #20 from H.J. Lu <hjl.tools at gmail dot com> --- > > (In reply to Richard Biener from comment #13) > > > Huh, adding a pre-header should _never_ do sth like that. Can you produce > > > a small testcase that exhibits these kind of changes with adding/removing > > > a preheader? > > > > copyprop2 pass removed a preheader and cunrolli pass added it back: > > > > <bb 3>: > > # n_213 = PHI <1(2)> > > > > <bb 8>: > > # n_8 = PHI <n_213(3), n_218(9)> > > Not sure how it manages to add this kind of PHI node for just > adding a preheader? That looks odd to me. > (gdb) bt #0 make_ssa_name_fn (fn=0x7f039cc63d20, var=0x7f039d127ed8, stmt=0x7f039c095800) at ../../../../gcc/gcc/tree-ssanames.c:133 #1 0x0000000000c66392 in copy_ssa_name_fn (fn=0x7f039cc63d20, name=0x7f039cacdcf0, stmt=0x7f039c095800) at ../../../../gcc/gcc/tree-ssanames.c:459 #2 0x0000000000a7e389 in copy_ssa_name (var=0x7f039cacdcf0, stmt=0x7f039c095800) at ../../../../gcc/gcc/tree-ssanames.h:118 #3 0x0000000000a8d874 in gimple_make_forwarder_block (fallthru=0x7f039d777f18) at ../../../../gcc/gcc/tree-cfg.c:5291 #4 0x000000000061a663 in make_forwarder_block (bb=0x7f039d0c7a90, redirect_edge_p=0x627199 <mfb_keep_just(edge_def*)>, new_bb_cbk=0x0) at ../../../../gcc/gcc/cfghooks.c:912 #5 0x0000000000627416 in create_preheader (loop=0x7f039d25e280, flags=1) at ../../../../gcc/gcc/cfgloopmanip.c:1552 #6 0x000000000062756a in create_preheaders (flags=1) at ../../../../gcc/gcc/cfgloopmanip.c:1602 #7 0x00000000008b76ee in apply_loop_flags (flags=15) at ../../../../gcc/gcc/loop-init.c:63 #8 0x00000000008b780d in loop_optimizer_init (flags=15) at ../../../../gcc/gcc/loop-init.c:121 #9 0x0000000000b9ef15 in tree_complete_unroll_inner () at ../../../../gcc/gcc/tree-ssa-loop-ivcanon.c:1377 #10 0x0000000000b9efcf in (anonymous namespace)::pass_complete_unrolli::execute ---Type <return> to continue, or q <return> to quit--- (this=0x28d2110) at ../../../../gcc/gcc/tree-ssa-loop-ivcanon.c:1422 #11 0x0000000000968b76 in execute_one_pass (pass=0x28d2110) at ../../../../gcc/gcc/passes.c:2229 #12 0x0000000000968d8f in execute_pass_list (pass=0x28d2110) at ../../../../gcc/gcc/passes.c:2282 #13 0x0000000000968dc0 in execute_pass_list (pass=0x28d1ed0) at ../../../../gcc/gcc/passes.c:2283 #14 0x000000000064676a in expand_function (node=0x7f039dba5668) at ../../../../gcc/gcc/cgraphunit.c:1774 #15 0x0000000000646d5e in expand_all_functions () at ../../../../gcc/gcc/cgraphunit.c:1908 #16 0x000000000064772a in compile () at ../../../../gcc/gcc/cgraphunit.c:2252 #17 0x000000000059a615 in lto_main () at ../../../../gcc/gcc/lto/lto.c:3427 #18 0x0000000000a53c2c in compile_file () at ../../../../gcc/gcc/toplev.c:548 #19 0x0000000000a55dc3 in do_compile () at ../../../../gcc/gcc/toplev.c:1914 #20 0x0000000000a55f2e in toplev_main (argc=24, argv=0x28ac850) at ../../../../gcc/gcc/toplev.c:1990 #21 0x00000000011dde12 in main (argc=24, argv=0x7fffdea91ed8) at ../../../../gcc/gcc/main.c:36 (gdb)
Created attachment 32334 [details] gcc49-pr60418.patch Unfortunately the patch fails to bootstrap due to -fcompare-debug failures, so tried this instead, but it also fails due to -fcompare-debug failures.
This patch avoids removing preheader when optimizing since preheader will be added back later even if it is removed: diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c index 926d300..af5b24b 100644 --- a/gcc/tree-cfgcleanup.c +++ b/gcc/tree-cfgcleanup.c @@ -328,7 +328,10 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted) (LOOPS_MAY_HAVE_MULTIPLE_LATCHES)); } else if (bb->loop_father == loop_outer (dest->loop_father)) - return !loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS); + /* Preserve preheader when optimizing since it will be added + back later. */ + return !(optimize + || loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS)); /* Always preserve other edges into loop headers that are not simple latches or preheaders. */ return false;
basic_block bbb = gimple_bb (SSA_NAME_DEF_STMT (oea->op)); in the patch contains a pasto, which I guess can explain most if not all the -fcompare-debug failures. That said, thinking more about it, there still could be -fcompare-debug differences, because debug stmts also get uids and sometimes when inserting stmts the previous stmt uid is used, sometimes next stmt uid, so we probably should use reassoc_stmt_dominates_stmt_p.
On Tue, 11 Mar 2014, hjl.tools at gmail dot com wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60418 > > --- Comment #26 from H.J. Lu <hjl.tools at gmail dot com> --- > This patch avoids removing preheader when optimizing since > preheader will be added back later even if it is removed: > > diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c > index 926d300..af5b24b 100644 > --- a/gcc/tree-cfgcleanup.c > +++ b/gcc/tree-cfgcleanup.c > @@ -328,7 +328,10 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted) > (LOOPS_MAY_HAVE_MULTIPLE_LATCHES)); > } > else if (bb->loop_father == loop_outer (dest->loop_father)) > - return !loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS); > + /* Preserve preheader when optimizing since it will be added > + back later. */ > + return !(optimize > + || loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS)); > /* Always preserve other edges into loop headers that are > not simple latches or preheaders. */ > return false; HJ, these patches don't make sense - this patch will regress the testcase the code above was introduced for.
Created attachment 32338 [details] gcc49-pr60418.patch New version of the patch, going to bootstrap/regtest it momentarily. H.J., could you please check if this also fixes the gromacs -mx32 issue?
The #c29 patch now passed bootstrap/regtest on x86_64-linux and i686-linux. So, if it also fixes 435.gromacs on -mx32, I'll submit it to gcc-patches.
(In reply to Jakub Jelinek from comment #29) > Created attachment 32338 [details] > gcc49-pr60418.patch > > New version of the patch, going to bootstrap/regtest it momentarily. H.J., > could you please check if this also fixes the gromacs -mx32 issue? Yes, it works with -mx32. Thanks.
Author: jakub Date: Thu Mar 13 09:38:28 2014 New Revision: 208535 URL: http://gcc.gnu.org/viewcvs?rev=208535&root=gcc&view=rev Log: PR tree-optimization/59025 PR middle-end/60418 * tree-ssa-reassoc.c (sort_by_operand_rank): For SSA_NAMEs with the same rank, sort by bb_rank and gimple_uid of SSA_NAME_DEF_STMT first. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-reassoc.c
Hopefully fixed now.