gcc 4.6 branch
ASSIGNMENT          :          64.389  :     245.01  :      63.55
gcc 4.7 branch
ASSIGNMENT          :          57.737  :     219.70  :      56.98
gcc 4.7 branch without 175752
ASSIGNMENT          :          64.163  :     244.15  :      63.33
gcc 4.8 branch
ASSIGNMENT          :          61.751  :     234.97  :      60.95

175752:
Date:   Fri Jul 1 10:00:25 2011 +0000

    2011-07-01  Kai Tietz  <ktietz@redhat.com>

            * tree-ssa-forwprop.c (simplify_bitwise_binary): Fix typo.

    2011-07-01  Kai Tietz  <ktietz@redhat.com>

            * gcc.dg/tree-ssa/bitwise-sink.c: New test.
r175752 is a follow-up fix to r175589, so my guess is that it's the combination of the two that's causing the regression. Can you construct a small test case that demonstrates the code quality regression from these two revisions?
Created attachment 28699 [details] function Assignment without 175752
Created attachment 28700 [details]
function Assignment with 175752

According to gprof, Assignment is called
1574 times without 175752
1449 times with 175752
Bytemark source code http://www.tux.org/~mayer/linux/nbench-byte-2.2.3.tar.gz
Created attachment 28712 [details]
assign.c

Assignment extracted into a self-contained testcase. Does this also make a similar difference for you? On which CPU?

Yes, there is a code generation difference with that commit. In *.optimized the difference seems to be (- vanilla, + with Kai's patch reverted):

@@ -192,13 +192,12 @@ Assignment (long int[101] * arraybase)
   sizetype _302;
   unsigned long _303;
   sizetype _306;
   long unsigned int pretmp_307;
   long unsigned int pretmp_308;
   long int[101] * pretmp_318;
-  unsigned long _322;
   short unsigned int ivtmp_334;
   unsigned long _350;
   unsigned int _351;
   long unsigned int patt_353;
   short unsigned int _354;
   unsigned long _355;
@@ -286,27 +285,26 @@ Assignment (long int[101] * arraybase)
   <bb 5>:
   # currentmin_72 = PHI <currentmin_402(4)>
   _356 = ivtmp.312_453 & 15;
   _350 = _356 >> 3;
   _355 = -_350;
   _354 = (short unsigned int) _355;
-  _322 = _355 & 1;
-  prolog_loop_niters.10_359 = (short unsigned int) _322;
+  prolog_loop_niters.10_359 = _354 & 1;
   if (prolog_loop_niters.10_359 == 0)
     goto <bb 7>;
   else
     goto <bb 6>;

   <bb 6>:
   _272 = MEM[base: pretmp_395, offset: 0B];
   _256 = _272 - currentmin_72;
   MEM[base: pretmp_395, offset: 0B] = _256;

   <bb 7>:
   # j_269 = PHI <1(6), 0(5)>
-  prolog_loop_adjusted_niters.11_124 = _355 & 1;
+  prolog_loop_adjusted_niters.11_124 = (sizetype) prolog_loop_niters.10_359;
   niters.12_129 = 101 - prolog_loop_niters.10_359;
   base_off.19_523 = prolog_loop_adjusted_niters.11_124 * 8;
   vect_p.20_524 = pretmp_395 + base_off.19_523;
   vect_cst_.23_528 = {currentmin_72, currentmin_72};

   <bb 8>:

This change happens very late (forwprop4) and nothing afterwards cleans it up (there is no DCE etc. that would remove the dead assignment to _354, and there is no PRE/FRE to replace _355 & 1 in the second case with _322). Still, just zero-extending _359 is perhaps cheaper register pressure-wise.

That said, I can't find any measurable differences between the two.
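For readers who do not read GIMPLE dumps daily, here is a rough C rendering of what the two variants in the diff compute. It is only a sketch: the function names are made up for illustration, it assumes 8-byte long and 16-byte vectors (as the `& 15`, `>> 3` and `* 8` in the dump suggest), and the loop is a hand-written approximation of alignment peeling, not the vectorizer's actual output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "-" side of the diff (vanilla): mask in the wide type, then narrow.
   Hypothetical name; mirrors _356, _350, _355, _322 in the dump. */
static unsigned short prolog_niters_mask_then_narrow(uintptr_t addr)
{
    unsigned long misalign = addr & 15;   /* _356: byte offset within 16 bytes */
    unsigned long elems = misalign >> 3;  /* _350: offset in 8-byte elements   */
    unsigned long neg = -elems;           /* _355: unsigned negation           */
    unsigned long masked = neg & 1;       /* _322                              */
    return (unsigned short) masked;       /* prolog_loop_niters                */
}

/* "+" side of the diff: narrow first (_354), then mask. */
static unsigned short prolog_niters_narrow_then_mask(uintptr_t addr)
{
    unsigned long neg = -((addr & 15) >> 3);       /* _355 */
    unsigned short narrow = (unsigned short) neg;  /* _354 */
    return narrow & 1;                             /* prolog_loop_niters */
}

/* Approximation of the peeled loop: a scalar prologue of 0 or 1 iterations
   to reach 16-byte alignment, a 2-wide main loop, then a scalar tail. */
static void subtract_min(long *row, size_t n, long currentmin)
{
    size_t peel = prolog_niters_narrow_then_mask((uintptr_t) row);
    size_t j = 0;

    if (peel > n)
        peel = n;
    for (; j < peel; j++)          /* prologue, <bb 6> in the dump */
        row[j] -= currentmin;
    for (; j + 2 <= n; j += 2) {   /* stands in for the vector loop */
        row[j] -= currentmin;
        row[j + 1] -= currentmin;
    }
    for (; j < n; j++)             /* leftover element, if any */
        row[j] -= currentmin;
}
```

Both helpers return the same value for every address, since only the lowest bit survives the mask and truncation keeps that bit. So the two dumps differ in which temporaries stay live, not in the result.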
Created attachment 28715 [details] Gentoo patches 1
Created attachment 28716 [details] Gentoo patches 2
Created attachment 28717 [details] Gentoo patches 3
Created attachment 28718 [details] build log from non-broken gcc
Created attachment 28719 [details] build log from broken gcc
It seems I was wrong. Reverting 175752 doesn't fix the performance. I had also applied the Gentoo patches together with the patch that reverts 175752. I didn't think it was possible, but it seems one of the Gentoo patches fixes the performance. Any idea which?

CPU: Sandy Bridge
CFLAGS = -fomit-frame-pointer -Wall -O3 -funroll-loops -g0 -march=native -ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static

There is almost no difference in run time between Gentoo-patched and vanilla gcc with the self-contained testcase.
More exact CFLAGS:
-fomit-frame-pointer -Wall -O3 -funroll-loops -g0 -march=corei7 -ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static
It seems it is caused by 182844

182839
ASSIGNMENT          :          64.374  :     244.96  :      63.54
182844
ASSIGNMENT          :          57.697  :     219.55  :      56.95

Author: irar <irar@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Jan 3 13:24:04 2012 +0000

        PR tree-optimization/51269
        * tree-vect-loop-manip.c (set_prologue_iterations): Make
        first_niters a pointer.
        (slpeel_tree_peel_loop_to_edge): Likewise.
        (vect_do_peeling_for_loop_bound): Update call to
        slpeel_tree_peel_loop_to_edge.
        (vect_gen_niters_for_prolog_loop): Don't compute
        wide_prolog_niters here.  Remove it from the parameters list.
        (vect_do_peeling_for_alignment): Update calls and compute
        wide_prolog_niters.
It seems to be fixed in 4.8 branch
ASSIGNMENT          :          64.311  :     244.72  :      63.47
GCC 4.7.3 is being released, adjusting target milestone.
Fixed for 4.8.0?