Bug 55286 - [4.7 Regression] Bytemark ASSIGNMENT 10% slower
Summary: [4.7 Regression] Bytemark ASSIGNMENT 10% slower
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.7.3
Importance: P3 normal
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-12 13:14 UTC by wbrana
Modified: 2014-06-12 13:20 UTC
CC List: 0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 4.7.4
Last reconfirmed:


Attachments
function Assignment without 175752 (12.97 KB, text/plain)
2012-11-15 16:12 UTC, wbrana
function Assignment with 175752 (12.93 KB, text/plain)
2012-11-15 16:16 UTC, wbrana
assign.c (890 bytes, text/plain)
2012-11-16 18:28 UTC, Jakub Jelinek
Gentoo patches 1 (31.78 KB, application/octet-stream)
2012-11-17 14:24 UTC, wbrana
Gentoo patches 2 (13.87 KB, application/octet-stream)
2012-11-17 14:25 UTC, wbrana
Gentoo patches 3 (2.78 KB, application/octet-stream)
2012-11-17 14:26 UTC, wbrana
build log from non-broken gcc (131.33 KB, application/octet-stream)
2012-11-17 14:29 UTC, wbrana
build log from broken gcc (73.32 KB, application/octet-stream)
2012-11-17 14:30 UTC, wbrana

Description wbrana 2012-11-12 13:14:30 UTC
Bytemark (nbench) results; columns are test : iterations/sec : old index : new index.
gcc 4.6 branch
ASSIGNMENT          :          64.389  :     245.01  :      63.55
gcc 4.7 branch
ASSIGNMENT          :          57.737  :     219.70  :      56.98
gcc 4.7 branch without 175752
ASSIGNMENT          :          64.163  :     244.15  :      63.33
gcc 4.8 branch
ASSIGNMENT          :          61.751  :     234.97  :      60.95

175752:

Date:   Fri Jul 1 10:00:25 2011 +0000

    2011-07-01  Kai Tietz  <ktietz@redhat.com>

            * tree-ssa-forwprop.c (simplify_bitwise_binary): Fix typo.

    2011-07-01  Kai Tietz  <ktietz@redhat.com>

            * gcc.dg/tree-ssa/bitwise-sink.c: New test.
Comment 1 Mikael Pettersson 2012-11-12 13:44:32 UTC
r175752 is a follow-up fix to r175589, so my guess is that it's the combination of the two that's causing the regression.

Can you construct a small test case that demonstrates the code quality regression from these two revisions?
Comment 2 wbrana 2012-11-15 16:12:57 UTC
Created attachment 28699
function Assignment without 175752
Comment 3 wbrana 2012-11-15 16:16:05 UTC
Created attachment 28700
function Assignment with 175752

According to gprof, Assignment is called:
1574 times without 175752
1449 times with 175752
Comment 4 wbrana 2012-11-15 17:01:22 UTC
Bytemark source code
http://www.tux.org/~mayer/linux/nbench-byte-2.2.3.tar.gz
Comment 5 Jakub Jelinek 2012-11-16 18:28:30 UTC
Created attachment 28712
assign.c

Assignment extracted into a self-contained testcase. Does this also make a similar difference for you? On which CPU? Yes, there is a code generation difference with that commit; in *.optimized the difference seems to be (- vanilla, + with Kai's patch reverted):
@@ -192,13 +192,12 @@ Assignment (long int[101] * arraybase)
   sizetype _302;
   unsigned long _303;
   sizetype _306;
   long unsigned int pretmp_307;
   long unsigned int pretmp_308;
   long int[101] * pretmp_318;
-  unsigned long _322;
   short unsigned int ivtmp_334;
   unsigned long _350;
   unsigned int _351;
   long unsigned int patt_353;
   short unsigned int _354;
   unsigned long _355;
@@ -286,27 +285,26 @@ Assignment (long int[101] * arraybase)
   <bb 5>:
   # currentmin_72 = PHI <currentmin_402(4)>
   _356 = ivtmp.312_453 & 15;
   _350 = _356 >> 3;
   _355 = -_350;
   _354 = (short unsigned int) _355;
-  _322 = _355 & 1;
-  prolog_loop_niters.10_359 = (short unsigned int) _322;
+  prolog_loop_niters.10_359 = _354 & 1;
   if (prolog_loop_niters.10_359 == 0)
     goto <bb 7>;
   else
     goto <bb 6>;
 
   <bb 6>:
   _272 = MEM[base: pretmp_395, offset: 0B];
   _256 = _272 - currentmin_72;
   MEM[base: pretmp_395, offset: 0B] = _256;
 
   <bb 7>:
   # j_269 = PHI <1(6), 0(5)>
-  prolog_loop_adjusted_niters.11_124 = _355 & 1;
+  prolog_loop_adjusted_niters.11_124 = (sizetype) prolog_loop_niters.10_359;
   niters.12_129 = 101 - prolog_loop_niters.10_359;
   base_off.19_523 = prolog_loop_adjusted_niters.11_124 * 8;
   vect_p.20_524 = pretmp_395 + base_off.19_523;
   vect_cst_.23_528 = {currentmin_72, currentmin_72};
 
   <bb 8>:

This change happens very late (forwprop4) and nothing afterwards cleans it up: there is no DCE etc. that would remove the dead assignment to _354, and there is no PRE/FRE to replace _355 & 1 in the second case with _322. Still, just zero-extending _359 is perhaps cheaper register pressure-wise.

That said, I can't find any measurable differences between the two.
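
To make the difference in the dump concrete, here is a minimal C sketch of the two forms of the prolog_loop_niters computation. The function names are hypothetical and only illustrate the GIMPLE above; this is not code from GCC or from the testcase. Because only the lowest bit is kept, truncating to a 16-bit type before or after the mask gives the same value, so the two dumps differ in instruction shape and not in result.

```c
/* Form in the vanilla dump: mask in the wide type, then truncate. */
static unsigned short mask_then_narrow(unsigned long x)
{
    unsigned long masked = x & 1;       /* _322 = _355 & 1 */
    return (unsigned short)masked;      /* (short unsigned int) _322 */
}

/* Form with the patch reverted: truncate first, then mask. */
static unsigned short narrow_then_mask(unsigned long x)
{
    unsigned short narrow = (unsigned short)x;  /* _354 = (short unsigned int) _355 */
    return (unsigned short)(narrow & 1);        /* _354 & 1 */
}
```

Both functions return 1 for odd inputs and 0 for even inputs, whatever the upper bits hold.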
Comment 6 wbrana 2012-11-17 14:24:44 UTC
Created attachment 28715
Gentoo patches 1
Comment 7 wbrana 2012-11-17 14:25:23 UTC
Created attachment 28716
Gentoo patches 2
Comment 8 wbrana 2012-11-17 14:26:18 UTC
Created attachment 28717
Gentoo patches 3
Comment 9 wbrana 2012-11-17 14:29:20 UTC
Created attachment 28718
build log from non-broken gcc
Comment 10 wbrana 2012-11-17 14:30:22 UTC
Created attachment 28719
build log from broken gcc
Comment 11 wbrana 2012-11-17 14:52:44 UTC
It seems I was wrong. Reverting 175752 doesn't fix performance.
I had also applied the Gentoo patches together with the patch that reverts 175752.
I thought it wasn't possible, but it seems one of the Gentoo patches fixes performance. Any idea which?

CPU Sandy Bridge
CFLAGS = -fomit-frame-pointer -Wall -O3 -funroll-loops -g0  -march=native -ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static

There is almost no difference in run time between Gentoo-patched and vanilla gcc with the self-contained testcase.
Comment 12 wbrana 2012-11-17 15:01:34 UTC
More exact CFLAGS:
-fomit-frame-pointer -Wall -O3 -funroll-loops -g0  -march=corei7
-ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static
Comment 13 wbrana 2012-11-30 20:23:40 UTC
It seems it is caused by 182844

182839 
ASSIGNMENT          :          64.374  :     244.96  :      63.54
182844
ASSIGNMENT          :          57.697  :     219.55  :      56.95

Author: irar <irar@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Jan 3 13:24:04 2012 +0000

            PR tree-optimization/51269
            * tree-vect-loop-manip.c (set_prologue_iterations): Make
            first_niters a pointer.
            (slpeel_tree_peel_loop_to_edge): Likewise.
            (vect_do_peeling_for_loop_bound): Update call to
            slpeel_tree_peel_loop_to_edge.
            (vect_gen_niters_for_prolog_loop): Don't compute
            wide_prolog_niters here.  Remove it from the parameters list.
            (vect_do_peeling_for_alignment): Update calls and compute
            wide_prolog_niters.
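
For readers unfamiliar with alignment peeling, the prologue iteration count can be sketched in plain C using the arithmetic visible in the *.optimized dump from comment 5 (16-byte vectors, 8-byte long elements). The function and constant names below are hypothetical, and the sketch assumes the address is already 8-byte aligned; it only mirrors that dump and is not the code in tree-vect-loop-manip.c.

```c
#include <stdint.h>

/* Constants read off the dump: mask 15, shift 3, final mask 1. */
enum { VECTOR_BYTES = 16, ELEM_SHIFT = 3, ELEMS_PER_VECTOR = 2 };

/* Scalar iterations to peel before the vectorized loop so that the
   first vector access lands on a 16-byte boundary. */
static unsigned short prolog_loop_niters(uintptr_t addr)
{
    uintptr_t misalign_bytes = addr & (VECTOR_BYTES - 1);     /* _356 = ivtmp & 15 */
    uintptr_t misalign_elems = misalign_bytes >> ELEM_SHIFT;  /* _350 = _356 >> 3 */
    uintptr_t negated = (uintptr_t)0 - misalign_elems;        /* _355 = -_350, wraps */
    return (unsigned short)(negated & (ELEMS_PER_VECTOR - 1)); /* ... & 1 */
}

/* Address where the vectorized loop starts: base + niters * 8. */
static uintptr_t vector_loop_start(uintptr_t addr)
{
    return addr + (uintptr_t)prolog_loop_niters(addr) * 8;    /* base_off = niters * 8 */
}
```

With two elements per vector the count is either 0 (already aligned) or 1 (one scalar iteration peeled), which matches the single peeled statement in <bb 6> of the dump.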
Comment 14 wbrana 2012-12-03 14:13:28 UTC
It seems to be fixed in 4.8 branch
ASSIGNMENT          :          64.311  :     244.72  :      63.47
Comment 15 Richard Biener 2013-04-11 07:59:12 UTC
GCC 4.7.3 is being released, adjusting target milestone.
Comment 16 Richard Biener 2014-06-12 13:20:27 UTC
Fixed for 4.8.0?