56522 – [4.8/4.9 Regression] Bytemark ASSIGNMENT 9% / 11% slower

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	4.8.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-03-04 15:41 UTC by wbrana
Modified:	2013-03-20 11:37 UTC
CC List:	2 users

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2013-03-05 00:00:00

Attachments
assign.c (704 bytes, text/plain) 2013-03-06 15:06 UTC, Jakub Jelinek
assign.c with main function (890 bytes, text/plain) 2013-03-08 14:22 UTC, wbrana
assign.c.164t.optimized.diff (453 bytes, text/plain) 2013-03-08 14:23 UTC, wbrana
nbench1.c.164t.optimized.diff (3.87 KB, text/plain) 2013-03-08 14:24 UTC, wbrana

Description wbrana 2013-03-04 15:41:22 UTC

http://www.tux.org/~mayer/linux/nbench-byte-2.2.3.tar.gz

nbench results (TEST : Iterations/sec. : Old Index : New Index):

r196263
ASSIGNMENT          :          57.274  :     217.94  :      56.53
r196260
ASSIGNMENT          :           62.83  :     239.08  :      62.01
4.6 branch
ASSIGNMENT          :          64.311  :     244.72  :      63.47

196263:

2013-02-25  Richard Biener  <rguenther@suse.de>

PR tree-optimization/56175
* tree-ssa-forwprop.c (hoist_conversion_for_bitop_p): New predicate,
split out from ...
(simplify_bitwise_binary): ... here.  Also guard the conversion
of (type) X op CST to (type) (X op ((type-x) CST)) with it.

* gcc.dg/tree-ssa/forwprop-24.c: New testcase.

CFLAGS: -O3 -g0  -march=corei7 -fomit-frame-pointer -funroll-loops -ffast-math -fno-PIE -fno-exceptions -fno-stack-protector -static
CPU: Sandy Bridge

Comment 1 Richard Biener 2013-03-05 10:04:11 UTC

Can you create a testcase that pinpoints the different forwprop results?

Comment 2 Jakub Jelinek 2013-03-06 15:06:41 UTC

Created attachment 29598
assign.c

With -O3 -march=corei7 -fomit-frame-pointer -funroll-loops -ffast-math
the difference in the *.optimized dump from r196262 to r196263 is just:
@@ -176,7 +176,6 @@ Assignment (long int[101] * x)
   short int[101][101] * pretmp_418;
   long int _429;
   long int _431;
-  unsigned long _432;
   long unsigned int patt_438;
   unsigned int _440;
   long unsigned int patt_441;
@@ -293,8 +292,7 @@ Assignment (long int[101] * x)
   _108 = _130 >> 3;
   _89 = -_108;
   _72 = (short unsigned int) _89;
-  _432 = _89 & 1;
-  prolog_loop_niters.59_193 = (short unsigned int) _432;
+  prolog_loop_niters.59_193 = _72 & 1;
   if (prolog_loop_niters.59_193 == 0)
     goto <bb 19>;
   else
@@ -307,7 +305,7 @@ Assignment (long int[101] * x)
   <bb 19>:
   # j_288 = PHI <1(18), 0(17)>
   # c_287 = PHI <c_141(18), 9223372036854775807(17)>
-  prolog_loop_adjusted_niters.60_357 = _89 & 1;
+  prolog_loop_adjusted_niters.60_357 = (sizetype) prolog_loop_niters.59_193;
   niters.61_359 = 101 - prolog_loop_niters.59_193;
   base_off.68_53 = prolog_loop_adjusted_niters.60_357 * 8;
   vect_p.69_48 = pretmp_386 + base_off.68_53;

From the bug report, it isn't clear whether you were measuring -m32 or -m64 performance, but I guess the *.optimized dump change could just increase register pressure and pessimize register allocation (RA) for the loop or something.
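
For readers following the dump, the two forms are value-equivalent: masking with 1 in the wide type and then narrowing gives the same result as narrowing first and masking in the narrow type. A minimal C sketch of that equivalence follows; the function names are hypothetical and do not come from assign.c, and the code only mirrors the shape of the GIMPLE statements above.

```c
/* Old form from the dump: mask in the wide type, then narrow.
   Mirrors "_432 = _89 & 1; niters = (short unsigned int) _432;" */
unsigned short mask_then_narrow(long x)
{
    unsigned long wide = (unsigned long)x & 1UL;
    return (unsigned short)wide;
}

/* New form from the dump: narrow first, then mask in the narrow type.
   Mirrors "_72 = (short unsigned int) _89; niters = _72 & 1;" */
unsigned short narrow_then_mask(long x)
{
    unsigned short narrow = (unsigned short)x;
    return (unsigned short)(narrow & 1u);
}

/* Returns 1 when both forms produce the same value for x. */
int forms_agree(long x)
{
    return mask_then_narrow(x) == narrow_then_mask(x);
}
```

Since only the low bit survives in either form, any difference in run time would have to come from later passes (for example register allocation), not from the computed values.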

Comment 3 rguenther@suse.de 2013-03-07 08:33:10 UTC

On Wed, 6 Mar 2013, jakub at gcc dot gnu.org wrote:

> From the bugreport, it isn't clear if you were measuring -m32 or -m64
> performance, but I guess the *.optimized dump change could just increase
> register pressure and pessimize the loop RA or something.

Yeah, I don't see anything wrong with the change otherwise.

Note that forwprop's tree combiner doesn't seem to restrict itself
to single-use defs in all cases.

Comment 4 wbrana 2013-03-07 18:35:10 UTC

I compiled r196260 again the same way and nbench is now slow, which is strange.

When I compile nbench with gcc built from the snapshot
ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20130224/
the result differs from nbench compiled with gcc built from GIT at revision 196245
http://gcc.gnu.org/ml/gcc/2013-02/msg00273.html
nbench compiled with the snapshot gcc is fast
nbench compiled with the GIT revision gcc is slow

The file nbench1.c.164t.optimized is identical with both gcc builds,
but the executables differ in size despite using the same CFLAGS:
nbench compiled with the GIT revision gcc has 1366219 bytes
nbench compiled with the snapshot gcc has 1205879 bytes

Comment 5 wbrana 2013-03-08 14:17:52 UTC

The weird results in comment 4 were caused by unexpected Gentoo patches and/or broken GIT.
I made my own build, which doesn't contain any Gentoo patches, and I can still reproduce the 9% slowdown caused by r196263.
When I run the reduced test case there is only a 1% slowdown.
The reduced test case shows a similar difference on my PC as in comment 2.
I'm using -m64.

Comment 6 wbrana 2013-03-08 14:22:03 UTC

Created attachment 29622
assign.c with main function

Comment 7 wbrana 2013-03-08 14:23:35 UTC

Created attachment 29623
assign.c.164t.optimized.diff

Comment 8 wbrana 2013-03-08 14:24:38 UTC

Created attachment 29624
nbench1.c.164t.optimized.diff

Comment 9 Richard Biener 2013-03-08 15:28:29 UTC

I don't see any substantial differences in code-generation (register allocation
and some basic-block order differences appear), and I cannot reproduce a
slowdown.

Flags as you cited:

./xgcc -B. -o t t.c -O3 -march=corei7 -fomit-frame-pointer -funroll-loops -ffast-math -static

Comment 10 wbrana 2013-03-08 17:27:49 UTC

I found a strange thing: the result depends on the linker.
There is a slowdown with "GNU ld (GNU Binutils) 2.23.1"
There is an improvement with "GNU gold (GNU Binutils 2.23.1) 1.11"

Comment 11 wbrana 2013-03-08 17:36:10 UTC

GNU ld (GNU Binutils) 2.23.1
192263 - slow
192260 - fast

GNU gold (GNU Binutils 2.23.1) 1.11
192263 - fast
192260 - slow

It is possible that the result also depends on the CPU model (core count, cache size, etc.)

Comment 12 wbrana 2013-03-08 17:41:09 UTC

(In reply to comment #11)
> GNU ld (GNU Binutils) 2.23.1
> 192263 - slow
> 192260 - fast
I meant 196260 and 196263

Comment 13 wbrana 2013-03-08 17:57:32 UTC

There is almost no difference with the reduced test case. The Assignment test in nbench can be run with:
./nbench -cCOM.DAT

where the file COM.DAT contains:

ALLSTATS=F
DONUMSORT=F
DOSTRINGSORT=F
DOBITFIELD=F
DOEMF=F
DOFOUR=F
DOASSIGN=T
DOIDEA=F
DOHUFF=F
DONNET=F
DOLU=F

Which CPU have you tested?

Comment 14 Richard Biener 2013-03-11 08:45:04 UTC

(In reply to comment #13)
> Which CPU have you tested?

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 30
model name      : Intel(R) Core(TM) CPU            860  @ 2.80GHz
stepping        : 5

Note that there were _zero_ assembly differences with/without the patch
apart from different register numbers and a single switched
bb order.

Comment 15 wbrana 2013-03-12 14:28:43 UTC

I can see different results with different linkers - see above.
Your CPU is a Nehalem quad core, but my CPU is a Sandy Bridge dual core, which has less L1/L2/L3 cache.

Comment 16 rguenther@suse.de 2013-03-12 14:33:52 UTC

On Tue, 12 Mar 2013, wbrana at gmail dot com wrote:

> I can see different results with different linkers - see above.

Might be alignment.

> Your CPU is Nehalem quad core, but my CPU is Sandy Bridge dual core, which have
> less L1/L2/L3 cache.

Well, the code is exactly the same, so I can't measure any difference.
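
One way to probe the alignment guess is to print where a hot function lands relative to a 16-, 32- or 64-byte boundary in the ld-linked and gold-linked binaries. This is only a sketch: `entry_offset` and `sample_kernel` are hypothetical names, and casting a function pointer to `uintptr_t` is not guaranteed by ISO C, although it is assumed to work with GCC on x86_64.

```c
#include <stdint.h>

/* Offset of a function's entry point within an alignment block.
   `block` must be a power of two (e.g. 16, 32 or 64).
   The function-pointer-to-integer cast is implementation-defined;
   it is assumed to behave as expected with GCC on x86_64. */
unsigned entry_offset(void (*fn)(void), unsigned block)
{
    return (unsigned)((uintptr_t)fn & (uintptr_t)(block - 1));
}

/* Hypothetical stand-in for the function of interest. */
void sample_kernel(void)
{
}
```

Printing `entry_offset(sample_kernel, 64)` from both binaries shows whether the two linkers place the function differently. It says nothing about the alignment of loops inside the function, which would need a look at the disassembly.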

Comment 17 wbrana 2013-03-20 11:37:05 UTC

I can switch to the gold linker as of 4.8.