78634 – [7 Regression] 30% performance drop after r242832.

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	7.0

Importance:	P2 normal
Target Milestone:	7.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2016-12-01 14:49 UTC by Yuri Rumyantsev
Modified:	2017-02-06 13:55 UTC
CC List:	3 users

See Also:
Host:
Target:	i?86--
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
test-case to reproduce (2.99 KB, text/x-csrc) 2016-12-01 14:49 UTC, Yuri Rumyantsev

Description Yuri Rumyantsev 2016-12-01 14:49:06 UTC

Created attachment 40215
test-case to reproduce

We noticed a huge performance regression on x86 for one important benchmark (a reproducer is attached). It is caused by an additional if-conversion, which can be seen in the ce2 dump:
IF-THEN-ELSE-JOIN block found, pass 1, test 12, then 13, else 14, join 15
scanning new insn with uid = 163.
scanning new insn with uid = 164.
scanning new insn with uid = 165.
scanning new insn with uid = 166.
scanning new insn with uid = 167.
scanning new insn with uid = 168.
scanning new insn with uid = 169.
if-conversion succeeded through noce_try_cmove_arith
deleting insn with uid = 85.
deleting block 14
Removing jump 78.
deleting insn with uid = 78.
deleting insn with uid = 80.
deleting block 13
Merging block 15 into block 12...
changing bb of uid 87
changing bb of uid 88
  from 15 to 12
changing bb of uid 89
  from 15 to 12
Merged blocks 12 and 15.
Conversion succeeded on pass 1.
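
As a rough source-level illustration of the transformation the dump reports (this is not the attached test case, and the function names are hypothetical), an IF-THEN-ELSE-JOIN region in which each arm computes a value can be rewritten so that both arms are computed unconditionally and a conditional move selects the result. Whether a compiler actually emits a branch or a conditional move for either function depends on its cost model.

```c
#include <stddef.h>

/* Branchy form: test block, then block, else block, join block. */
int select_branchy(const int *a, const int *b, size_t i, int threshold)
{
    int r;
    if (a[i] > threshold)
        r = a[i] + 1;   /* "then" arm */
    else
        r = b[i] - 1;   /* "else" arm */
    return r;           /* join */
}

/* Shape after if-conversion: both arms are evaluated unconditionally,
   then a single select (a conditional move on x86) picks one. */
int select_converted(const int *a, const int *b, size_t i, int threshold)
{
    int then_val = a[i] + 1;
    int else_val = b[i] - 1;
    return (a[i] > threshold) ? then_val : else_val;
}
```

The converted form removes the branch but always pays for both arms, which can be a loss when the branch is well predicted.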

On an AVX2 machine we see:

time ./test1.1124.exe   // built by compiler before r242832.
 
real    0m0.577s
user    0m0.575s
sys     0m0.002s
time ./test1.1125.exe  // built by compiler after r242832.
 
real    0m0.888s
user    0m0.886s
sys     0m0.001s

Compiling it with the -Ofast option is sufficient to reproduce this on x86.
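
For reference, the two "real" times above can be turned into a slowdown figure as sketched below. The helper names are hypothetical. With 0.577s before and 0.888s after, the runtime ratio is roughly 1.54, which corresponds to a throughput drop of about 35%.

```c
/* Ratio of the runtime after the change to the runtime before it. */
double slowdown_factor(double before_s, double after_s)
{
    return after_s / before_s;
}

/* Fraction of throughput lost, where throughput is taken as 1/runtime. */
double throughput_drop(double before_s, double after_s)
{
    return 1.0 - before_s / after_s;
}
```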

Comment 1 Bernd Schmidt 2017-01-18 12:11:47 UTC

Patch and discussion here:

https://gcc.gnu.org/ml/gcc-patches/2016-12/msg00212.html

Comment 2 Bernd Schmidt 2017-01-23 16:18:05 UTC

Author: bernds
Date: Mon Jan 23 16:17:33 2017
New Revision: 244816

URL: https://gcc.gnu.org/viewcvs?rev=244816&root=gcc&view=rev
Log:
	PR rtl-optimization/78634
	* config/i386/i386.c (ix86_max_noce_ifcvt_seq_cost): New function.
	(TARGET_MAX_NOCE_IFCVT_SEQ_COST): Define.
	* ifcvt.c (noce_try_cmove): Add missing cost check.

testsuite/
	PR rtl-optimization/78634
	* gcc.target/i386/funcspec-11.c: Also pass -mtune=i686.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/ifcvt.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/funcspec-11.c

Comment 3 Bernd Schmidt 2017-01-23 16:28:14 UTC

Fixed.

Comment 4 Dominik Vogt 2017-02-03 15:50:53 UTC

This commit has broken a test case on s390x:

FAIL: gcc.target/s390/loc-1.c scan-assembler \tlocgrne\t%r2,%r4

The load-on-condition instruction is no longer used because the branch cost is very low on s390x (1).  Using -mbranch-cost=2 fixes the test failure.
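
The dependence on branch cost can be pictured with a toy model, sketched below. This is not the code in ifcvt.c or in any backend: the names, the cost scale, and the numbers are assumptions chosen only for illustration. The idea is that the branchless sequence is accepted only if its cost stays under a limit that grows with the branch cost, so a branch cost of 1 can reject a sequence that a branch cost of 2 accepts.

```c
#include <stdbool.h>

/* Hypothetical cost scale: cost units charged per instruction. */
#define TOY_COSTS_N_INSNS(n) ((n) * 4u)

/* Hypothetical limit: the most a branchless sequence may cost,
   proportional to how expensive the target considers a branch. */
unsigned toy_max_seq_cost(unsigned branch_cost, unsigned insns_per_branch)
{
    return branch_cost * TOY_COSTS_N_INSNS(insns_per_branch);
}

/* Accept the conversion only if the sequence fits under the limit. */
bool toy_ifcvt_profitable(unsigned seq_cost, unsigned branch_cost,
                          unsigned insns_per_branch)
{
    return seq_cost <= toy_max_seq_cost(branch_cost, insns_per_branch);
}
```

In this model, a sequence costing TOY_COSTS_N_INSNS(5) with 3 instructions allowed per unit of branch cost is rejected at branch cost 1 (limit 12) and accepted at branch cost 2 (limit 24). These are invented values, not the actual s390x costs.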

Comment 5 Bernd Schmidt 2017-02-06 12:49:37 UTC

I don't know the machine, but with a branch cost of 1 this seems like it might be expected. Do you think this is a testcase problem or something else?

Comment 6 Dominik Vogt 2017-02-06 13:55:15 UTC

It fails with -march=zEC12 but not with -march=z900.  It seems to be a tuning issue with the branch cost in the backend; a colleague is working on that and will have a patch at some point in the future.  So I think you can ignore this; it's something to be dealt with in the backend.