Created attachment 42815 [details]
exchange2_reproducer

The benchmark regression is clearly noticeable on Broadwell/Haswell with:
-m64 -Ofast -funroll-loops -march=core-avx2 -mfpmath=sse -flto -fopenmp

The attached reproducer demonstrates ~27% longer execution with:
-m64 -O[3|fast] -funroll-loops

There are 18 similar lines in the 648.exchange2_s source code whose execution time changed noticeably after r255267. They look like:

---
some_int_array(index1+1:index1+2, 1:3, index2) = some_int_array(index1+1:index1+2, 1:3, index2) - 10
---

"-fopt-info-loop-optimized" shows that r255266 unrolls each of these lines with 2 iterations and with 3 iterations. This seems reasonable, since the source shows that two rows and three columns are being modified. For a particular line, r255266 reports:

---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 22444250)
exchange2.fppized.f90:1135:0: note: loop with 3 iterations completely unrolled (header execution count 16831504)
---

With r255267 it goes another way:

---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 14963581)
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 11221564)
---
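For reference, the lowered shape of such a statement can be pictured in C. This is a hedged sketch, not gfortran's actual output; the array dimensions, function name, and the row/column loop order are assumptions made for illustration:

---
/* Illustrative sketch of the lowering of
     some_int_array(index1+1:index1+2, 1:3, index2) =
       some_int_array(index1+1:index1+2, 1:3, index2) - 10
   The section spans two rows and three columns, so the statement
   becomes a 2-iteration loop nested inside a 3-iteration loop,
   matching the trip counts -fopt-info-loop-optimized reports.
   Dimensions are made up; the real exchange2 arrays differ.  */
static void
update_section (int some_int_array[10][10][10], int index1, int index2)
{
  for (int col = 1; col <= 3; col++)                      /* 3 iterations */
    for (int row = index1 + 1; row <= index1 + 2; row++)  /* 2 iterations */
      some_int_array[index2][col][row] -= 10;
}
---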
I will investigate.
We no longer unroll the inner loops in cunrolli because unrolling there would leave us with exit checks. We fail to compute the number of iterations of the inner loop(s) (pre loop header copying):

<bb 5> [local count: 21065692]:
L.5:
_3 = _1 + 1;
_53 = (integer(kind=8)) _3;
_4 = _1 + 2;
_54 = (integer(kind=8)) _4;
_55 = (integer(kind=8)) i1_25;
_5 = _55 * 81;
_56 = _5 + -91;
$3 = <basic_block 0x7ffff689a478 (5)>
(gdb) p debug_bb_n (7)
<bb 7> [local count: 63197075]:
_6 = S.0_27 * 9;
_57 = _6 + _56;

<bb 8> [local count: 189610187]:
# S.1_28 = PHI <_53(7), S.1_59(9)>
if (S.1_28 > _54)
  goto <bb 10>; [33.33%]
else
  goto <bb 9>; [66.67%]
$1 = <basic_block 0x7ffff689a5b0 (8)>
(gdb) p debug_bb_n (9)
<bb 9> [local count: 126413112]:
_7 = S.1_28 + _57;
_8 = test_array[_7];
_9 = _8 + -10;
test_array[_7] = _9;
S.1_59 = S.1_28 + 1;
goto <bb 8>; [100.00%]

This one is a bit difficult, but the other one (though not as interesting?):

<bb 17> [local count: 119292717]:
L.14:
_14 = _1 + 1;
_69 = (integer(kind=8)) _14;
_15 = _1 + 2;
_70 = (integer(kind=8)) _15;
_71 = (integer(kind=8)) i2_26;
_16 = _71 * 81;
_72 = _16 + -91;

# S.4_31 = PHI <_69(19), S.4_75(21)>
if (S.4_31 > _70)
  goto <bb 22>; [33.33%]
else
  goto <bb 21>; [66.67%]

<bb 21> [local count: 715863674]:
_18 = S.4_31 + _73;
_19 = test_array[_18];
_20 = _19 + 10;
test_array[_18] = _20;
S.4_75 = S.4_31 + 1;
goto <bb 20>; [100.00%]

looks like it should be doable. And indeed it is: we are just "confused" by the maybe_zero test. IMHO we should allow a constant zero or N iteration count by performing the loop header copying alongside the unrolling (leaving the first exit test in place). Testing a patch to do that.
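To make the maybe_zero point concrete, here is a hedged C sketch (names invented, not GCC internals). lb and ub stand for _53 and _54 above, so ub == lb + 1 and the loop runs exactly twice once entered; the trouble is that the test-at-the-top form admits a zero-trip execution that niter analysis must account for, while the header-copied form does not:

---
/* Hedged illustration only.  lb/ub play the role of _53/_54 above.  */
static void
before_header_copy (long *a, long lb, long ub)
{
  /* Test at the top: niter analysis sees "ub - lb + 1 iterations,
     unless lb > ub already holds on entry" (the maybe_zero condition),
     so there is no plain constant trip count to hand to complete
     unrolling.  */
  for (long s = lb; !(s > ub); s++)
    a[s] -= 10;
}

static void
after_header_copy (long *a, long lb, long ub)
{
  /* The first exit test is peeled out and the loop rotated into
     do-while form.  Once the guard passes, the body runs exactly
     ub - lb + 1 == 2 times: a constant the unroller can use.  */
  if (!(lb > ub))
    {
      long s = lb;
      do
        a[s] -= 10;
      while (!(++s > ub));
    }
}
---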
Created attachment 42879 [details]
patch in testing

Patch I am testing. Performance evaluation appreciated.
Should be fixed.
Author: rguenth
Date: Thu Dec 14 14:32:24 2017
New Revision: 255635

URL: https://gcc.gnu.org/viewcvs?rev=255635&root=gcc&view=rev
Log:
2017-12-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/83326
	* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Add
	may_be_zero parameter and handle it by not marking the first
	peeled copy as not exiting the loop.
	(try_peel_loop): Likewise.
	(canonicalize_loop_induction_variables): Use
	number_of_iterations_exit to handle the case of constant
	or zero iterations and perform loop header copying on-the-fly.

	* gcc.dg/tree-ssa/pr81388-2.c: Adjust.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/testsuite/gcc.dg/tree-ssa/pr81388-2.c
	trunk/gcc/tree-ssa-loop-ivcanon.c
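Concretely, on the 2-iteration inner loop from the analysis above, the header-copied guard doubles as the first peeled copy's retained exit test and the unrolled bodies run straight-line behind it. A hedged sketch of the resulting shape, not actual ivcanon output:

---
/* Hedged sketch of the post-fix code shape (ub == lb + 1, as before).
   The maybe_zero guard produced by loop header copying is the one exit
   test that is kept; the two peeled body copies need no further tests.  */
static void
after_fix (long *a, long lb, long ub)
{
  if (!(lb > ub))        /* retained first exit test */
    {
      a[lb] -= 10;       /* peeled iteration 1 */
      a[lb + 1] -= 10;   /* peeled iteration 2 */
    }
}
---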
Thanks! I see a performance gain on 648.exchange2_s (~6% on Broadwell and ~3% on Skylake-X), restoring performance to the r255266 level (the Skylake-X regression was ~3%). The loops are again unrolled with 2 and 3 iterations. It's definitely fixed.