Following interesting case is reduced from a benchmark. It implements a FSM with a loop and switch. There is opportunity to eliminate the switch since all state transition is definite in compile time. Source program: --- int sum0, sum1, sum2, sum3; int foo(const char * str) { int state=0; const char *s=str; for (; *s; s++) { char c=*s; switch (state) { case 0: if (c == '+') state = 1; else if (c != '-') sum0+=c; break; case 1: if (c == '+') state = 2; else if (c == '-') state = 0; else sum1+=c; break; case 2: if (c == '+') state = 3; else if (c == '-') state = 1; else sum2+=c; break; case 3: if (c == '-') state = 2; else if (c != '+') sum3+=c; break; default: break; } } return state; } --- Say, instead of setting state=1 and loop back to switch head, it can be optimized to setting state=1, check loop end condition and jump directly to the label of case_1. Thus the overhead of switch (either if-then-else or jump table) is eliminated. On machine without sophisticate branch prediction, such an optimization is even more appealing. GCC trunk 4.8 doesn't have such a optimization. Expected tree output in source form: --- int sum0, sum1, sum2, sum3; int foo(const char * str) { int state=0; const char *s=str; char c=*s; if (!c) goto end; state_0: if (c == '+') { state = 1; if ((c=* (++s))!=0) goto state_1; // Check loop end condition and go directly to next state else goto end; } else if (c != '-') sum0+=c; if ((c=* (++s))!=0) goto state_0; goto end; state_1: if (c == '+') { state = 2; if ((c=* (++s))!=0) goto state_2; else goto end; } else if (c == '-') { state = 0; if ((c=* (++s))!=0) goto state_0; else goto end; } else sum1+=c; if ((c=* (++s))!=0) goto state_1; goto end; state_2: if (c == '+') { state = 3; if ((c=* (++s))!=0) goto state_3; else goto end; } else if (c == '-') { state = 1; if ((c=* (++s))!=0) goto state_1; else goto end; } else sum1+=c; if ((c=* (++s))!=0) goto state_2; goto end; state_3: if (c == '-') { state = 2; if ((c=* (++s))!=0) goto state_2; else goto end; } else if (c != '+') sum3+=c; if ((c=* (++s))!=0) goto state_3; end: return state; } ---
This is coremarks.
This is just (iterated) jump-threading I believe.
Current jump-threading is too conservative to thread this case. Following limits are what I observed by reading code: 1. It only thread around blocks that * Single entry * Multiple exit * No PHI nodes Even the simple case as forwarding block is excluded. 2. It only thread a block once. For blocks with more than one entries, it is possible to be threaded more than one time too.
Confirmed.
While analyzing cases of threading through backedge. GCC threads this case. But does not thread this case. int first; void thread_backedge (void) { int i = 0; do { if (first ==1) { foo (); first=0; } bla (); first =0; } while (i++ < 100); } int first; void thread_backedge (void) { int i = 0; do { if (first ==1) { foo (); first=0; } else { bla (); first =0; } } while (i++ < 100); }
(In reply to comment #5) > int first; > void thread_backedge (void) > { > int i = 0; > > do > { > if (first ==1) > { > foo (); > first=0; > } > else > { > bla (); > first =0; > } > > } while (i++ < 100); > } Can be jump threaded if (first ==1) { foo(); } else { L1: bla() } first =0 i++ if i<=99 goto L1 else return.
WRT the second example in c#5. See thread_across_edge where we refuse to thread across a DFS_EDGE_BACK when one of the arguments in the conditional is set in the block. This is the equivalency problem I mentioned in IRC. When you traverse the backedge, you have to be very careful because equivalences created when you traversed from outside the loop into the loop are no longer valid once you traverse the backedge. Or at least that's my best memory of the situation.
// A none loop case shows how minor changes impacts current jump thread behavior int foo(int state, int check) { switch (state) { case 0: state = 1; zoo_0(); break; case 1: default: state = 0; zoo_1(); break; } if (!check) return 0; //check++; // Uncomment it results will disable jump thread //check=foo(); // Uncomment it results will disable jump thread switch (state) { case 0: bar_0(); break; case 1: default: bar_1(); break; } return check; }
See http://gcc.gnu.org/ml/gcc/2013-05/msg00009.html for a dynamically loadable pass to do this optimization. It is not a finished product but it seems to work for coremark.
So I've got the base infrastructure in the jump threader to perform this optimization. The outstanding issues specific to the coremark optimization: 1. GCC now preserves loop nest information. This optimization scrambles the loop nest beyond all repair. So we have to arrange to throw away and rebuild the loop nest structures, which might be controversial... Leading to... 2. Heuristics -- We really don't want to throw away the loop nest information unless we're going to see a big win. So ISTM we want to apply this optimization only when we're going to be able to eliminate a multi-way branch in the loop. I've got prototypes for #1 and #2 already done, the problem is getting them tested with real code. I've only seen this code trigger (thread across the backedge to a threadable multi-way branch) *once* during a bootstrap. I've manually verified the sample code from this PR, but that's far from thorough testing. 3. Relax the restriction for threading across backedges. So far I've just been hacking that test out of my sources when I want to look at the coremark testcase. That's obviously not sufficient. The good news is we can thoroughly any changes to allow more threading opportunities to be found across loop backedges.
greta.yorsh no longer works for ARM. Your email will be forwarded to their line manager. Please do not reply to this email. If you need more information, please email real-postmaster@arm.com Thank you.
Fixed on trunk.
Created attachment 31308 [details] Dumps for less reduced testcase in comment 27 As of revision 205398, I'm not seeing this optimisation trigger when compiling the benchmark in question. I've attached the dumps from a less agressively reduced version of the testcase given in the intial report, which we don't currently thread. This testcase is more representative of the control structure in the benchmark code. In particular, we have the problematic scenario of two 'joiner' blocks in the thread path. Looking at the dumps for this testcase I think that we would need to spot threads like: (17, 23) incoming edge; (23, 4) joiner; (4, 5) joiner; (5, 8) back-edge; (8, 15) switch-statement; The testcase I am using is: --- int sum0, sum1, sum2, sum3; int foo(char * s, char** ret) { int state=0; char c; for (; *s && state != 4; s++) { c = *s; if (c == '*') { s++; break; } switch (state) { case 0: if (c == '+') state = 1; else if (c != '-') sum0+=c; break; case 1: if (c == '+') state = 2; else if (c == '-') state = 0; else sum1+=c; break; case 2: if (c == '+') state = 3; else if (c == '-') state = 1; else sum2+=c; break; case 3: if (c == '-') state = 2; else if (c == 'x') state = 4; break; default: break; } } *ret = s; return state; }
I've REOPENED this bug for the less-reduced testcase given in #27. If anyone has objections, or thinks it would be more appropriate, I can open a new bug.
So, any further plans for this issue?
Not at the moment. Focus is on bugfixing for 4.9, particularly regressions. This doesn't qualify.
The problem is that there is a performance regression on i686 for Coremark test
Without a testcase that is representative of the issue, there's nothing I can do.
(In reply to Igor Zamyatin from comment #31) > The problem is that there is a performance regression on i686 for Coremark > test If you can reproduce a testcase please file a new bug for this issue.
Done - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59597
Here is good expansion: ;; _41 = _42 * 4; (insn 20 19 0 (set (reg:SI 126 [ D.5038 ]) (ashift:SI (reg/v:SI 131 [ Int_1_Par_Val ]) (const_int 2 [0x2]))) -1 (nil)) ;; _40 = _2 + _41; (insn 21 20 22 (set (reg:SI 136 [ D.5035 ]) (plus:SI (reg/v/f:SI 130 [ Arr_2_Par_Ref ]) (reg:SI 119 [ D.5036 ]))) -1 (nil)) (insn 22 21 0 (set (reg/f:SI 125 [ D.5035 ]) (plus:SI (reg:SI 136 [ D.5035 ]) (reg:SI 126 [ D.5038 ]))) -1 (nil)) ;; MEM[(int[25] *)_51 + 20B] = _34; (insn 29 28 30 (set (reg:SI 139) (plus:SI (reg/v/f:SI 130 [ Arr_2_Par_Ref ]) (reg:SI 119 [ D.5036 ]))) Proc_8.c:23 -1 (nil)) (insn 30 29 31 (set (reg:SI 140) (plus:SI (reg:SI 139) (reg:SI 126 [ D.5038 ]))) Proc_8.c:23 -1 (nil)) (insn 31 30 32 (set (reg/f:SI 141) (plus:SI (reg:SI 140) (const_int 1000 [0x3e8]))) Proc_8.c:23 -1 (nil)) (insn 32 31 0 (set (mem:SI (plus:SI (reg/f:SI 141) (const_int 20 [0x14])) [2 MEM[(int[25] *)_51 + 20B]+0 S4 A32]) (reg:SI 124 [ D.5039 ])) Proc_8.c:23 -1 (nil)) After cse1 140 can be replaced by 125, thus lead a series of transformation make it much more efficient. Here is bad expansion: ;; _40 = Arr_2_Par_Ref_22(D) + _12; (insn 22 21 23 (set (reg:SI 138 [ D.5038 ]) (plus:SI (reg:SI 128 [ D.5038 ]) (reg:SI 121 [ D.5036 ]))) -1 (nil)) (insn 23 22 0 (set (reg/f:SI 127 [ D.5035 ]) (plus:SI (reg/v/f:SI 132 [ Arr_2_Par_Ref ]) (reg:SI 138 [ D.5038 ]))) -1 (nil)) ;; _32 = _20 + 1000; (insn 29 28 0 (set (reg:SI 124 [ D.5038 ]) (plus:SI (reg:SI 121 [ D.5036 ]) (const_int 1000 [0x3e8]))) Proc_8.c:23 -1 (nil)) ;; MEM[(int[25] *)_51 + 20B] = _34; (insn 32 31 33 (set (reg:SI 141) (plus:SI (reg/v/f:SI 132 [ Arr_2_Par_Ref ]) (reg:SI 124 [ D.5038 ]))) Proc_8.c:23 -1 (nil)) (insn 33 32 34 (set (reg/f:SI 142) (plus:SI (reg:SI 141) (reg:SI 128 [ D.5038 ]))) Proc_8.c:23 -1 (nil)) (insn 34 33 0 (set (mem:SI (plus:SI (reg/f:SI 142) (const_int 20 [0x14])) [2 MEM[(int[25] *)_51 + 20B]+0 S4 A32]) (reg:SI 126 [ D.5039 ])) Proc_8.c:23 -1 (nil)) Here cse doesn't happen, resulting in less optimal insns. Reason why cse doesn't happen is unclear yet.
Please ignore previous comment as it shouldn't be here.
Hi Jeff, can you fix this bug, or should I give it a try? I know some folks who would be happy to have this coremark perf bug fixed in trunk ;-) Thanks, Sebastian (In reply to jgreenhalgh from comment #27) > Created attachment 31308 [details] > Dumps for less reduced testcase in comment 27 > > As of revision 205398, I'm not seeing this optimisation trigger when > compiling the benchmark in question. > > I've attached the dumps from a less agressively reduced version of the > testcase given in the intial report, which we don't currently thread. > > This testcase is more representative of the control structure in the > benchmark code. In particular, we have the problematic scenario of two > 'joiner' blocks in the thread path. > > Looking at the dumps for this testcase I think that we would need to spot > threads like: > > (17, 23) incoming edge; (23, 4) joiner; (4, 5) joiner; (5, 8) back-edge; > (8, 15) switch-statement; > > The testcase I am using is: > > --- > > int sum0, sum1, sum2, sum3; > int foo(char * s, char** ret) > { > int state=0; > char c; > > for (; *s && state != 4; s++) > { > c = *s; > if (c == '*') > { > s++; > break; > } > switch (state) { > case 0: > if (c == '+') state = 1; > else if (c != '-') sum0+=c; > break; > case 1: > if (c == '+') state = 2; > else if (c == '-') state = 0; > else sum1+=c; > break; > case 2: > if (c == '+') state = 3; > else if (c == '-') state = 1; > else sum2+=c; > break; > case 3: > if (c == '-') state = 2; > else if (c == 'x') state = 4; > break; > default: > break; > } > } > *ret = s; > return state; > }
FYI: I am testing a new patch for this that adds a new pass to do this optimization. See https://gcc.gnu.org/ml/gcc-patches/2014-08/msg01228.html
Author: spop Date: Sat Dec 6 19:19:37 2014 New Revision: 218451 URL: https://gcc.gnu.org/viewcvs?rev=218451&root=gcc&view=rev Log: extend jump thread for finite state automata PR tree-optimization/54742 * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges calling duplicate_seme_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New test. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c Modified: trunk/gcc/ChangeLog trunk/gcc/doc/invoke.texi trunk/gcc/params.def trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-cfg.c trunk/gcc/tree-cfg.h trunk/gcc/tree-ssa-threadedge.c trunk/gcc/tree-ssa-threadupdate.c trunk/gcc/tree-ssa-threadupdate.h
Fixed in r218451.
Author: yroux Date: Wed Jan 14 10:22:48 2015 New Revision: 219584 URL: https://gcc.gnu.org/viewcvs?rev=219584&root=gcc&view=rev Log: gcc/ 2015-01-14 Yvan Roux <yvan.roux@linaro.org> Backport from trunk r218451. 2014-12-06 James Greenhalgh <james.greenhalgh@arm.com> Sebastian Pop <s.pop@samsung.com> Brian Rzycki <b.rzycki@samsung.com> PR tree-optimization/54742 * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges calling duplicate_seme_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD. gcc/testsuite/ 2015-01-14 Yvan Roux <yvan.roux@linaro.org> Backport from trunk r218451. 2014-12-06 James Greenhalgh <james.greenhalgh@arm.com> Sebastian Pop <s.pop@samsung.com> Brian Rzycki <b.rzycki@samsung.com> PR tree-optimization/54742 * gcc.dg/tree-ssa/ssa-dom-thread-6.c: New test. * gcc.dg/tree-ssa/ssa-dom-thread-7.c: New test. Added: branches/linaro/gcc-4_9-branch/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c branches/linaro/gcc-4_9-branch/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c Modified: branches/linaro/gcc-4_9-branch/gcc/ChangeLog.linaro branches/linaro/gcc-4_9-branch/gcc/doc/invoke.texi branches/linaro/gcc-4_9-branch/gcc/params.def branches/linaro/gcc-4_9-branch/gcc/testsuite/ChangeLog.linaro branches/linaro/gcc-4_9-branch/gcc/tree-cfg.c branches/linaro/gcc-4_9-branch/gcc/tree-cfg.h branches/linaro/gcc-4_9-branch/gcc/tree-ssa-threadedge.c branches/linaro/gcc-4_9-branch/gcc/tree-ssa-threadupdate.c branches/linaro/gcc-4_9-branch/gcc/tree-ssa-threadupdate.h