Created attachment 42815 [details]
exchange2_reproducer

The benchmark regression is clearly noticeable on Broadwell/Haswell with:
-m64 -Ofast -funroll-loops -march=core-avx2 -mfpmath=sse -flto -fopenmp

The attached reproducer demonstrates ~27% longer execution with:
-m64 -O[3|fast] -funroll-loops

There are 18 similar lines in the 648.exchange2_s source code whose execution time changed noticeably after r255267. They look like:

---
some_int_array(index1+1:index1+2, 1:3, index2) = some_int_array(index1+1:index1+2, 1:3, index2) - 10
---

"-fopt-info-loop-optimized" shows that r255266 unrolls each of these lines with 2 iterations and with 3 iterations. This seems reasonable, since the source shows that two rows and three columns are being modified. For a particular line, r255266 reports:

---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 22444250)
exchange2.fppized.f90:1135:0: note: loop with 3 iterations completely unrolled (header execution count 16831504)
---

With r255267 it goes another way:

---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 14963581)
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 11221564)
---
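For reference, the lowered shape of such a statement can be pictured in C. This is a hedged sketch, not gfortran's actual output; the array dimensions, function name, and the row/column loop order are assumptions made for illustration:

---
/* Illustrative sketch of the lowering of
     some_int_array(index1+1:index1+2, 1:3, index2) =
       some_int_array(index1+1:index1+2, 1:3, index2) - 10
   The section spans two rows and three columns, so the statement
   becomes a 2-iteration loop nested inside a 3-iteration loop,
   matching the trip counts -fopt-info-loop-optimized reports.
   Dimensions are made up; the real exchange2 arrays differ.  */
static void
update_section (int some_int_array[10][10][10], int index1, int index2)
{
  for (int col = 1; col <= 3; col++)                      /* 3 iterations */
    for (int row = index1 + 1; row <= index1 + 2; row++)  /* 2 iterations */
      some_int_array[index2][col][row] -= 10;
}
---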
I will investigate.
We no longer unroll the inner loops in cunrolli because unrolling there would leave us with exit checks. We fail to compute the number of iterations of the inner loop(s) (pre loop header copying):

<bb 5> [local count: 21065692]:
L.5:
_3 = _1 + 1;
_53 = (integer(kind=8)) _3;
_4 = _1 + 2;
_54 = (integer(kind=8)) _4;
_55 = (integer(kind=8)) i1_25;
_5 = _55 * 81;
_56 = _5 + -91;
$3 = <basic_block 0x7ffff689a478 (5)>
(gdb) p debug_bb_n (7)
<bb 7> [local count: 63197075]:
_6 = S.0_27 * 9;
_57 = _6 + _56;

<bb 8> [local count: 189610187]:
# S.1_28 = PHI <_53(7), S.1_59(9)>
if (S.1_28 > _54)
  goto <bb 10>; [33.33%]
else
  goto <bb 9>; [66.67%]
$1 = <basic_block 0x7ffff689a5b0 (8)>
(gdb) p debug_bb_n (9)
<bb 9> [local count: 126413112]:
_7 = S.1_28 + _57;
_8 = test_array[_7];
_9 = _8 + -10;
test_array[_7] = _9;
S.1_59 = S.1_28 + 1;
goto <bb 8>; [100.00%]

This one is a bit difficult, but the other one (though not as interesting?):

<bb 17> [local count: 119292717]:
L.14:
_14 = _1 + 1;
_69 = (integer(kind=8)) _14;
_15 = _1 + 2;
_70 = (integer(kind=8)) _15;
_71 = (integer(kind=8)) i2_26;
_16 = _71 * 81;
_72 = _16 + -91;

# S.4_31 = PHI <_69(19), S.4_75(21)>
if (S.4_31 > _70)
  goto <bb 22>; [33.33%]
else
  goto <bb 21>; [66.67%]

<bb 21> [local count: 715863674]:
_18 = S.4_31 + _73;
_19 = test_array[_18];
_20 = _19 + 10;
test_array[_18] = _20;
S.4_75 = S.4_31 + 1;
goto <bb 20>; [100.00%]

looks like it should be doable. And indeed it is: we are just "confused" by the maybe_zero test. IMHO we should allow a constant zero or N iteration count by performing the loop header copying alongside the unrolling (leaving the first exit test in place). Testing a patch to do that.
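To make the maybe_zero point concrete, here is a hedged C sketch (names invented, not GCC internals). lb and ub stand for _53 and _54 above, so ub == lb + 1 and the loop runs exactly twice once entered; the trouble is that the test-at-the-top form admits a zero-trip execution that niter analysis must account for, while the header-copied form does not:

---
/* Hedged illustration only.  lb/ub play the role of _53/_54 above.  */
static void
before_header_copy (long *a, long lb, long ub)
{
  /* Test at the top: niter analysis sees "ub - lb + 1 iterations,
     unless lb > ub already holds on entry" (the maybe_zero condition),
     so there is no plain constant trip count to hand to complete
     unrolling.  */
  for (long s = lb; !(s > ub); s++)
    a[s] -= 10;
}

static void
after_header_copy (long *a, long lb, long ub)
{
  /* The first exit test is peeled out and the loop rotated into
     do-while form.  Once the guard passes, the body runs exactly
     ub - lb + 1 == 2 times: a constant the unroller can use.  */
  if (!(lb > ub))
    {
      long s = lb;
      do
        a[s] -= 10;
      while (!(++s > ub));
    }
}
---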
Created attachment 42879 [details]
patch in testing

Patch I am testing. Performance evaluation appreciated.
Should be fixed.
Author: rguenth
Date: Thu Dec 14 14:32:24 2017
New Revision: 255635

URL: https://gcc.gnu.org/viewcvs?rev=255635&root=gcc&view=rev
Log:
2017-12-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/83326
	* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Add
	may_be_zero parameter and handle it by not marking the first
	peeled copy as not exiting the loop.
	(try_peel_loop): Likewise.
	(canonicalize_loop_induction_variables): Use
	number_of_iterations_exit to handle the case of constant
	or zero iterations and perform loop header copying on-the-fly.

	* gcc.dg/tree-ssa/pr81388-2.c: Adjust.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/testsuite/gcc.dg/tree-ssa/pr81388-2.c
	trunk/gcc/tree-ssa-loop-ivcanon.c
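Concretely, on the 2-iteration inner loop from the analysis above, the header-copied guard doubles as the first peeled copy's retained exit test and the unrolled bodies run straight-line behind it. A hedged sketch of the resulting shape, not actual ivcanon output:

---
/* Hedged sketch of the post-fix code shape (ub == lb + 1, as before).
   The maybe_zero guard produced by loop header copying is the one exit
   test that is kept; the two peeled body copies need no further tests.  */
static void
after_fix (long *a, long lb, long ub)
{
  if (!(lb > ub))        /* retained first exit test */
    {
      a[lb] -= 10;       /* peeled iteration 1 */
      a[lb + 1] -= 10;   /* peeled iteration 2 */
    }
}
---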
Thanks! I see a performance gain on 648.exchange2_s (~6% on Broadwell and ~3% on Skylake-X), restoring performance to the r255266 level (the Skylake-X regression was ~3%). The loops are again unrolled with 2 and 3 iterations. It's definitely fixed.