55286 – [4.7 Regression] Bytemark ASSIGNMENT 10% slower

Summary: [4.7 Regression] Bytemark ASSIGNMENT 10% slower

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	4.7.3

Importance:	P3 normal
Target Milestone:	4.8.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-11-12 13:14 UTC by wbrana
Modified:	2014-06-12 13:20 UTC
CC List:	0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:	4.7.4
Last reconfirmed:

Attachments
function Assignment without 175752 (12.97 KB, text/plain) 2012-11-15 16:12 UTC, wbrana
function Assignment with 175752 (12.93 KB, text/plain) 2012-11-15 16:16 UTC, wbrana
assign.c (890 bytes, text/plain) 2012-11-16 18:28 UTC, Jakub Jelinek
Gentoo patches 1 (31.78 KB, application/octet-stream) 2012-11-17 14:24 UTC, wbrana
Gentoo patches 2 (13.87 KB, application/octet-stream) 2012-11-17 14:25 UTC, wbrana
Gentoo patches 3 (2.78 KB, application/octet-stream) 2012-11-17 14:26 UTC, wbrana
build log from non-broken gcc (131.33 KB, application/octet-stream) 2012-11-17 14:29 UTC, wbrana
build log from broken gcc (73.32 KB, application/octet-stream) 2012-11-17 14:30 UTC, wbrana

Description wbrana 2012-11-12 13:14:30 UTC

gcc 4.6 branch
ASSIGNMENT          :          64.389  :     245.01  :      63.55
gcc 4.7 branch
ASSIGNMENT          :          57.737  :     219.70  :      56.98
gcc 4.7 branch without 175752
ASSIGNMENT          :          64.163  :     244.15  :      63.33
gcc 4.8 branch
ASSIGNMENT          :          61.751  :     234.97  :      60.95
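
The size of the regression can be checked from the first numeric column of each result line. The following is a minimal sketch, assuming that column is the nbench iterations-per-second figure (higher is better); the function name slowdown_percent is hypothetical and is not part of nbench or gcc.

```c
#include <assert.h>

/* Percentage by which `measured` falls short of `baseline`.
   Both values are assumed to be iterations per second (higher is better). */
double slowdown_percent(double baseline, double measured)
{
    assert(baseline > 0.0);
    return (baseline - measured) / baseline * 100.0;
}
```

With the 4.6 branch as baseline, slowdown_percent(64.389, 57.737) comes to roughly 10.3 for the 4.7 branch, and slowdown_percent(64.389, 61.751) to roughly 4.1 for the 4.8 branch.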

175752:

Date:   Fri Jul 1 10:00:25 2011 +0000

    2011-07-01  Kai Tietz  <ktietz@redhat.com>

            * tree-ssa-forwprop.c (simplify_bitwise_binary): Fix typo.

    2011-07-01  Kai Tietz  <ktietz@redhat.com>

            * gcc.dg/tree-ssa/bitwise-sink.c: New test.

Comment 1 Mikael Pettersson 2012-11-12 13:44:32 UTC

r175752 is a follow-up fix to r175589, so my guess is that it's the combination of the two that's causing the regression.

Can you construct a small test case that demonstrates the code quality regression from these two revisions?

Comment 2 wbrana 2012-11-15 16:12:57 UTC

Created attachment 28699
function Assignment without 175752

Comment 3 wbrana 2012-11-15 16:16:05 UTC

Created attachment 28700
function Assignment with 175752

According to gprof, Assignment is called:
1574 times without 175752
1449 times with 175752

Comment 4 wbrana 2012-11-15 17:01:22 UTC

Bytemark source code
http://www.tux.org/~mayer/linux/nbench-byte-2.2.3.tar.gz

Comment 5 Jakub Jelinek 2012-11-16 18:28:30 UTC

Created attachment 28712
assign.c

Here is Assignment extracted into a self-contained testcase. Does this also make a similar difference for you?  On which CPU?  Yes, there is a code generation difference with that commit; in *.optimized the difference seems to be (- vanilla, + with Kai's patch reverted):
@@ -192,13 +192,12 @@ Assignment (long int[101] * arraybase)
   sizetype _302;
   unsigned long _303;
   sizetype _306;
   long unsigned int pretmp_307;
   long unsigned int pretmp_308;
   long int[101] * pretmp_318;
-  unsigned long _322;
   short unsigned int ivtmp_334;
   unsigned long _350;
   unsigned int _351;
   long unsigned int patt_353;
   short unsigned int _354;
   unsigned long _355;
@@ -286,27 +285,26 @@ Assignment (long int[101] * arraybase)
   <bb 5>:
   # currentmin_72 = PHI <currentmin_402(4)>
   _356 = ivtmp.312_453 & 15;
   _350 = _356 >> 3;
   _355 = -_350;
   _354 = (short unsigned int) _355;
-  _322 = _355 & 1;
-  prolog_loop_niters.10_359 = (short unsigned int) _322;
+  prolog_loop_niters.10_359 = _354 & 1;
   if (prolog_loop_niters.10_359 == 0)
     goto <bb 7>;
   else
     goto <bb 6>;
 
   <bb 6>:
   _272 = MEM[base: pretmp_395, offset: 0B];
   _256 = _272 - currentmin_72;
   MEM[base: pretmp_395, offset: 0B] = _256;
 
   <bb 7>:
   # j_269 = PHI <1(6), 0(5)>
-  prolog_loop_adjusted_niters.11_124 = _355 & 1;
+  prolog_loop_adjusted_niters.11_124 = (sizetype) prolog_loop_niters.10_359;
   niters.12_129 = 101 - prolog_loop_niters.10_359;
   base_off.19_523 = prolog_loop_adjusted_niters.11_124 * 8;
   vect_p.20_524 = pretmp_395 + base_off.19_523;
   vect_cst_.23_528 = {currentmin_72, currentmin_72};
 
   <bb 8>:

This change happens very late (forwprop4) and nothing afterwards cleans it up: in the vanilla dump there is no DCE that would remove the dead assignment to _354, and no PRE/FRE that would replace the second _355 & 1 with _322.  Still, just zero-extending _359 is perhaps cheaper register pressure-wise.

That said, I can't find any measurable differences between the two.
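
The two forms in the diff above compute the same value; only the order of masking and narrowing differs. The following is a minimal C sketch of that equivalence, assuming a 64-bit unsigned long and a 16-bit unsigned short as in the dump. The function names are hypothetical and the code is an illustration of the GIMPLE, not code taken from assign.c or from gcc.

```c
#include <stdint.h>

/* Vanilla form: mask at 64 bits, then narrow to 16 bits.
   Mirrors: _322 = _355 & 1; prolog_loop_niters = (short unsigned int) _322; */
uint16_t niters_mask_then_narrow(uint64_t x)
{
    uint64_t masked = x & 1u;
    return (uint16_t)masked;
}

/* Form with the patch reverted: narrow to 16 bits, then mask.
   Mirrors: _354 = (short unsigned int) _355; prolog_loop_niters = _354 & 1; */
uint16_t niters_narrow_then_mask(uint64_t x)
{
    uint16_t narrowed = (uint16_t)x;
    return (uint16_t)(narrowed & 1u);
}
```

Because the mask only keeps bit 0, which survives the truncation, both functions return the same result for every input. The difference in the dump is therefore about leftover statements (the unused _354 and the repeated _355 & 1), not about the computed value.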

Comment 6 wbrana 2012-11-17 14:24:44 UTC

Created attachment 28715
Gentoo patches 1

Comment 7 wbrana 2012-11-17 14:25:23 UTC

Created attachment 28716
Gentoo patches 2

Comment 8 wbrana 2012-11-17 14:26:18 UTC

Created attachment 28717
Gentoo patches 3

Comment 9 wbrana 2012-11-17 14:29:20 UTC

Created attachment 28718
build log from non-broken gcc

Comment 10 wbrana 2012-11-17 14:30:22 UTC

Created attachment 28719
build log from broken gcc

Comment 11 wbrana 2012-11-17 14:52:44 UTC

It seems I was wrong. Reverting 175752 doesn't fix performance.
I also used the Gentoo patches together with the patch which reverts 175752.
I thought it wasn't possible, but it seems some of the Gentoo patches fix performance. Any idea which?

CPU Sandy Bridge
CFLAGS = -fomit-frame-pointer -Wall -O3 -funroll-loops -g0  -march=native -ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static

With the self-contained testcase there is almost no difference in run time between Gentoo-patched and vanilla gcc.

Comment 12 wbrana 2012-11-17 15:01:34 UTC

More exact CFLAGS:
-fomit-frame-pointer -Wall -O3 -funroll-loops -g0  -march=corei7
-ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static

Comment 13 wbrana 2012-11-30 20:23:40 UTC

It seems it is caused by 182844

182839 
ASSIGNMENT          :          64.374  :     244.96  :      63.54
182844
ASSIGNMENT          :          57.697  :     219.55  :      56.95

Author: irar <irar@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Jan 3 13:24:04 2012 +0000

            PR tree-optimization/51269
            * tree-vect-loop-manip.c (set_prologue_iterations): Make
            first_niters a pointer.
            (slpeel_tree_peel_loop_to_edge): Likewise.
            (vect_do_peeling_for_loop_bound): Update call to
            slpeel_tree_peel_loop_to_edge.
            (vect_gen_niters_for_prolog_loop): Don't compute
            wide_prolog_niters here.  Remove it from the parameters list.
            (vect_do_peeling_for_alignment): Update calls and compute
            wide_prolog_niters.
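
The ChangeLog entry above concerns where the prologue iteration count for alignment peeling is computed and widened. As a rough illustration of what that count is, the following C sketch transcribes the GIMPLE from comment 5 (_356, _350, _355 and the final & 1). It assumes 8-byte long elements and 16-byte vector alignment; the function names and constants are hypothetical and this is not the code in tree-vect-loop-manip.c.

```c
#include <stdint.h>

/* Number of scalar prologue iterations before the vector loop can start
   on a 16-byte boundary, for an array of 8-byte elements. */
unsigned prolog_iterations(uint64_t addr)
{
    uint64_t misaligned_bytes = addr & 15;              /* _356 = ivtmp & 15 */
    uint64_t misaligned_elems = misaligned_bytes >> 3;  /* _350 = _356 >> 3  */
    uint64_t negated = 0 - misaligned_elems;            /* _355 = -_350      */
    return (unsigned)(negated & 1);                     /* ... & 1           */
}

/* Byte offset of the first vectorized element (base_off in the dump). */
uint64_t vector_start_offset(uint64_t addr)
{
    return (uint64_t)prolog_iterations(addr) * 8;
}

/* Iterations left for the vector loop out of the 101-element row. */
unsigned remaining_iterations(uint64_t addr)
{
    return 101u - prolog_iterations(addr);
}
```

An address that is already 16-byte aligned needs no prologue iteration; an address that is 8 bytes past a 16-byte boundary needs exactly one, after which the vector loop starts 8 bytes further on.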

Comment 14 wbrana 2012-12-03 14:13:28 UTC

It seems to be fixed in 4.8 branch
ASSIGNMENT          :          64.311  :     244.72  :      63.47

Comment 15 Richard Biener 2013-04-11 07:59:12 UTC

GCC 4.7.3 is being released, adjusting target milestone.

Comment 16 Richard Biener 2014-06-12 13:20:27 UTC

Fixed for 4.8.0?