http://www.tux.org/~mayer/linux/nbench-byte-2.2.3.tar.gz

196263     ASSIGNMENT : 57.274 : 217.94 : 56.53
196260     ASSIGNMENT : 62.83  : 239.08 : 62.01
4.6 branch ASSIGNMENT : 64.311 : 244.72 : 63.47

196263: 2013-02-25  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56175
	* tree-ssa-forwprop.c (hoist_conversion_for_bitop_p): New
	predicate, split out from ...
	(simplify_bitwise_binary): ... here.  Also guard the conversion
	of (type) X op CST to (type) (X op ((type-x) CST)) with it.

	* gcc.dg/tree-ssa/forwprop-24.c: New testcase.

Flags: -O3 -g0 -march=corei7 -fomit-frame-pointer -funroll-loops -ffast-math
-fno-PIE -fno-exceptions -fno-stack-protector -static
CPU: Sandy Bridge
Can you create a testcase that pinpoints the different forwprop results?
Created attachment 29598 [details]
assign.c

With -O3 -march=corei7 -fomit-frame-pointer -funroll-loops -ffast-math
the difference in the *.optimized dump from r196262 to r196263 is just:

@@ -176,7 +176,6 @@ Assignment (long int[101] * x)
   short int[101][101] * pretmp_418;
   long int _429;
   long int _431;
-  unsigned long _432;
   long unsigned int patt_438;
   unsigned int _440;
   long unsigned int patt_441;
@@ -293,8 +292,7 @@ Assignment (long int[101] * x)
   _108 = _130 >> 3;
   _89 = -_108;
   _72 = (short unsigned int) _89;
-  _432 = _89 & 1;
-  prolog_loop_niters.59_193 = (short unsigned int) _432;
+  prolog_loop_niters.59_193 = _72 & 1;
   if (prolog_loop_niters.59_193 == 0)
     goto <bb 19>;
   else
@@ -307,7 +305,7 @@ Assignment (long int[101] * x)
 <bb 19>:
   # j_288 = PHI <1(18), 0(17)>
   # c_287 = PHI <c_141(18), 9223372036854775807(17)>
-  prolog_loop_adjusted_niters.60_357 = _89 & 1;
+  prolog_loop_adjusted_niters.60_357 = (sizetype) prolog_loop_niters.59_193;
   niters.61_359 = 101 - prolog_loop_niters.59_193;
   base_off.68_53 = prolog_loop_adjusted_niters.60_357 * 8;
   vect_p.69_48 = pretmp_386 + base_off.68_53;

From the bug report it isn't clear whether you were measuring -m32 or -m64
performance, but I guess the *.optimized dump change could just increase
register pressure and pessimize the loop RA or something.
On Wed, 6 Mar 2013, jakub at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56522
>
> --- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2013-03-06 15:06:41 UTC ---
> Created attachment 29598 [details]
>   --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29598
> assign.c
>
> With -O3 -march=corei7 -fomit-frame-pointer -funroll-loops -ffast-math
> the difference in the *.optimized dump from r196262 to r196263 is just:
> @@ -176,7 +176,6 @@ Assignment (long int[101] * x)
>    short int[101][101] * pretmp_418;
>    long int _429;
>    long int _431;
> -  unsigned long _432;
>    long unsigned int patt_438;
>    unsigned int _440;
>    long unsigned int patt_441;
> @@ -293,8 +292,7 @@ Assignment (long int[101] * x)
>    _108 = _130 >> 3;
>    _89 = -_108;
>    _72 = (short unsigned int) _89;
> -  _432 = _89 & 1;
> -  prolog_loop_niters.59_193 = (short unsigned int) _432;
> +  prolog_loop_niters.59_193 = _72 & 1;
>    if (prolog_loop_niters.59_193 == 0)
>      goto <bb 19>;
>    else
> @@ -307,7 +305,7 @@ Assignment (long int[101] * x)
>  <bb 19>:
>    # j_288 = PHI <1(18), 0(17)>
>    # c_287 = PHI <c_141(18), 9223372036854775807(17)>
> -  prolog_loop_adjusted_niters.60_357 = _89 & 1;
> +  prolog_loop_adjusted_niters.60_357 = (sizetype) prolog_loop_niters.59_193;
>    niters.61_359 = 101 - prolog_loop_niters.59_193;
>    base_off.68_53 = prolog_loop_adjusted_niters.60_357 * 8;
>    vect_p.69_48 = pretmp_386 + base_off.68_53;
>
> From the bug report it isn't clear whether you were measuring -m32 or -m64
> performance, but I guess the *.optimized dump change could just increase
> register pressure and pessimize the loop RA or something.

Yeah, I don't see anything wrong with the change otherwise.  Note that
forwprop's tree combiner doesn't seem to restrict itself to single-use
defs in all cases.
I compiled 196260 again in the same way, and nbench is now slow, which is
strange.

When I compile nbench with GCC built from the snapshot
ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20130224/, the result differs from
nbench compiled with GCC built from git at revision 196245
(http://gcc.gnu.org/ml/gcc/2013-02/msg00273.html):

nbench compiled with the gcc snapshot is fast
nbench compiled with the gcc revision is slow

The file nbench1.c.164t.optimized is identical with both gcc builds, but the
executables have different sizes despite using the same CFLAGS:

nbench compiled with the gcc revision is 1366219 bytes
nbench compiled with the gcc snapshot is 1205879 bytes
The weird results in comment 4 were caused by unexpected Gentoo patches
and/or a broken git checkout.  I made my own build, which doesn't contain any
Gentoo patches, and can still reproduce the 9% slowdown caused by 196263.

When I run the reduced test there is only a 1% slowdown.  The reduced test
case shows a similar difference on my PC as in comment 2.  I'm using -m64.
Created attachment 29622 [details] assign.c with main function
Created attachment 29623 [details] assign.c.164t.optimized.diff
Created attachment 29624 [details] nbench1.c.164t.optimized.diff
I don't see any substantial differences in code generation (register
allocation and some basic-block order differences appear), and I cannot
reproduce a slowdown.  Flags as you cited:

./xgcc -B. -o t t.c -O3 -march=corei7 -fomit-frame-pointer -funroll-loops
-ffast-math -static
I found a strange thing: the result depends on the linker.

There is a slowdown with "GNU ld (GNU Binutils) 2.23.1".
There is an improvement with "GNU gold (GNU Binutils 2.23.1) 1.11".
GNU ld (GNU Binutils) 2.23.1
192263 - slow
192260 - fast

GNU gold (GNU Binutils 2.23.1) 1.11
192263 - fast
192260 - slow

It is possible that the result also depends on the CPU model (core count,
cache size, etc.).
(In reply to comment #11)
> GNU ld (GNU Binutils) 2.23.1
> 192263 - slow
> 192260 - fast

I meant 196260 and 196263.
There is almost no difference with the reduced test case.  Assignment in
nbench can be tested with:

./nbench -cCOM.DAT

where the file COM.DAT has this content:

ALLSTATS=F
DONUMSORT=F
DOSTRINGSORT=F
DOBITFIELD=F
DOEMF=F
DOFOUR=F
DOASSIGN=T
DOIDEA=F
DOHUFF=F
DONNET=F
DOLU=F

Which CPU have you tested?
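The steps above can be wrapped in a small driver script — a sketch assuming an nbench binary built with the flags from the report sits in the current directory:

```shell
# Write the COM.DAT shown above, enabling only the Assignment benchmark.
cat > COM.DAT <<'EOF'
ALLSTATS=F
DONUMSORT=F
DOSTRINGSORT=F
DOBITFIELD=F
DOEMF=F
DOFOUR=F
DOASSIGN=T
DOIDEA=F
DOHUFF=F
DONNET=F
DOLU=F
EOF

# Run nbench with the custom config if a binary is present.
if [ -x ./nbench ]; then
    ./nbench -cCOM.DAT
fi
```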
(In reply to comment #13)
> There is almost no difference with the reduced test case.  Assignment in
> nbench can be tested with:
> ./nbench -cCOM.DAT
>
> Which CPU have you tested?

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) CPU 860 @ 2.80GHz
stepping	: 5

Note that there were _zero_ assembly differences with/without the patch apart
from using different register numbers and one single switched bb order.
I can see different results with different linkers - see above.  Your CPU is
a Nehalem quad core, but my CPU is a Sandy Bridge dual core, which has less
L1/L2/L3 cache.
On Tue, 12 Mar 2013, wbrana at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56522
>
> --- Comment #15 from wbrana <wbrana at gmail dot com> 2013-03-12 14:28:43 UTC ---
> I can see different results with different linkers - see above.

Might be alignment.

> Your CPU is a Nehalem quad core, but my CPU is a Sandy Bridge dual core,
> which has less L1/L2/L3 cache.

Well, the code is exactly the same, so I can't measure any difference.
I can switch to the gold linker starting with 4.8.