Bug 33928 - [4.3/4.4/4.5/4.6/4.7 Regression] 30% performance slowdown in floating-point code caused by r118475
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.5.0
Importance: P2 normal
Target Milestone: 4.6.0
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on: 39839
Reported: 2007-10-28 01:46 UTC by lucier
Modified: 2011-04-05 12:02 UTC
CC: 12 users

Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Known to work: 4.6.0
Known to fail: 4.3.0, 4.5.2
Last reconfirmed: 2007-10-28 16:39:27


Attachments
- .i file for fft routine (80.74 KB, text/plain), 2007-10-28 01:49 UTC, lucier
- Assembly from 4.2.2 (2.80 KB, text/plain), 2007-10-28 15:41 UTC, lucier
- assembly from 4.3.0 (3.07 KB, text/plain), 2007-10-28 15:42 UTC, lucier
- assembly after replacing -O1 with -O2 (2.93 KB, text/plain), 2007-10-28 15:45 UTC, lucier
- assembly after replacing -O1 with -O2 (3.07 KB, text/plain), 2007-10-28 15:45 UTC, lucier
- .i file using a switch instead of computed gotos (80.29 KB, text/plain), 2007-11-12 21:51 UTC, lucier
- 4.2.2 assembly for code using switch (2.70 KB, text/plain), 2007-11-12 21:52 UTC, lucier
- 4.3.0 assembly for code using a switch (3.00 KB, text/plain), 2007-11-12 21:53 UTC, lucier
- Much shorter testcase (2.10 KB, text/plain), 2008-01-22 12:03 UTC, Uroš Bizjak
- asm with alias-oracle enabled FRE (3.06 KB, text/plain), 2008-01-22 13:06 UTC, Richard Biener
- direct.s generated by 4.4.0 (2.72 KB, text/plain), 2009-04-23 16:00 UTC, lucier
- svn diff of cse.c to fix the performance regression (3.43 KB, patch), 2009-05-06 03:50 UTC, lucier
- svn diff of cse.c to "fix" the performance regression (updated) (3.48 KB, patch), 2009-05-06 09:20 UTC, Paolo Bonzini
- usable testcase (2.46 KB, text/plain), 2009-05-06 09:31 UTC, Paolo Bonzini
- usable testcase (2.42 KB, text/plain), 2009-05-06 09:59 UTC, Paolo Bonzini
- time report related to comment 69, time for PR 31957 with no options (2.67 KB, text/plain), 2009-05-07 16:00 UTC, lucier
- time for 31957, with rename-registers (2.68 KB, text/plain), 2009-05-07 16:02 UTC, lucier
- time for 31957, with rename-registers no-move-loop-invariants (2.62 KB, text/plain), 2009-05-07 16:03 UTC, lucier
- time for 31957, with rename-registers no-move-loop-invariants forward-propagate (2.72 KB, text/plain), 2009-05-07 16:04 UTC, lucier
- speed up fwprop and enable it at -O1 (3.74 KB, patch), 2009-05-08 07:55 UTC, Paolo Bonzini
- Large test file for testing time and memory usage (871.88 KB, application/x-gzip), 2009-05-16 00:20 UTC, lucier
- Time and memory report for compiler.i (24.72 KB, text/plain), 2009-05-16 00:29 UTC, lucier
- patch I'm testing (18.46 KB, patch), 2009-06-08 08:40 UTC, Paolo Bonzini
- correct version (18.45 KB, patch), 2009-06-08 08:59 UTC, Paolo Bonzini
- time and memory report for compiler.i after Paolo's patch (25.27 KB, text/plain), 2009-06-08 18:19 UTC, lucier
- inner loop of direct.c with -fschedule-insns (561 bytes, text/plain), 2009-08-27 01:22 UTC, lucier
- inner loop of direct.c without -fschedule-insns (504 bytes, text/plain), 2009-08-27 01:22 UTC, lucier

Description lucier 2007-10-28 01:46:04 UTC
With these compile options

-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp

With this compiler:

euler-44% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2 --with-mpfr=/pkgs/gmp-4.2.2
Thread model: posix
gcc version 4.3.0 20071026 (experimental) [trunk revision 129664] (GCC) 

With the following routine compiled with gcc-4.2.2 you get

(time (direct-fft-recursive-4 a table))
    366 ms real time
    366 ms cpu time (366 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

while with today's mainline you get

(time (direct-fft-recursive-4 a table))
    448 ms real time
    448 ms cpu time (448 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

I've isolated that one routine and I'll add it at the end of an attachment; unfortunately there are a lot of declarations and global data that are difficult to winnow.

There is really only one main loop in the routine, the one that begins at ___L19_direct_2d_fft_2d_recursive_2d_4.  This loop was scheduled in 102 cycles (sched2) on 4.2.2 and in 134 cycles on mainline.
Comment 1 lucier 2007-10-28 01:49:04 UTC
Created attachment 14418 [details]
.i file for fft routine
Comment 2 Richard Biener 2007-10-28 12:05:13 UTC
Can you attach assembler files?  What happens if you use -O2?  Why do you need
-fno-strict-aliasing?  Does -fno-ivopts help?
Comment 3 lucier 2007-10-28 15:41:01 UTC
Created attachment 14423 [details]
Assembly from 4.2.2
Comment 4 lucier 2007-10-28 15:42:21 UTC
Created attachment 14424 [details]
assembly from 4.3.0

I had to remove the "static" from the declaration of direct-fft-recursive to get assembly.  (In the larger file the address of direct-fft-recursive is eventually put into an array.)
Comment 5 lucier 2007-10-28 15:45:10 UTC
Created attachment 14425 [details]
assembly after replacing -O1 with -O2
Comment 6 lucier 2007-10-28 15:45:52 UTC
Created attachment 14426 [details]
assembly after replacing -O1 with -O2
Comment 7 lucier 2007-10-28 16:05:35 UTC
time with -O2 instead of -O1:

with 4.2.2:

(time (direct-fft-recursive-4 a table))
    426 ms real time
    426 ms cpu time (425 user, 1 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

with 4.3.0:

(time (direct-fft-recursive-4 a table))
    433 ms real time
    433 ms cpu time (433 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

With -O1 -fno-ivopts:

with 4.2.2:

(time (direct-fft-recursive-4 a table))
    374 ms real time
    374 ms cpu time (374 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

with 4.3.0:

(time (direct-fft-recursive-4 a table))
    443 ms real time
    443 ms cpu time (443 user, 0 system)
    no collections
    64 bytes allocated
    1 minor fault
    no major faults

Why -fno-strict-aliasing: I don't need it for this particular routine, but the rest of the file is part of a bignum library that accesses the bignum digits as arrays of either 8-, 32-, or 64-bit unsigned ints, and it hasn't been rewritten to use unions of arrays.  (This is part of the runtime system of a Scheme implementation, and there are other places that just cast pointers to achieve low-level things.)
Comment 8 lucier 2007-10-28 16:08:56 UTC
Subject: Re:  33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code


On Oct 28, 2007, at 8:05 AM, rguenth at gcc dot gnu dot org wrote:

> ------- Comment #2 from rguenth at gcc dot gnu dot org  2007-10-28  
> 12:05 -------
> Can you attach assembler files?  What happens if you use -O2?  Why  
> do you need
> -fno-strict-aliasing?  Does -fno-ivopts help?

I think I've answered your questions in the attachments and comments  
to the PR.

Brad

Comment 9 Richard Biener 2007-10-28 16:38:16 UTC
The main difference I see is that 4.2 avoids re-use of %eax as index register:

.L34:
	movq	%r11, %rdi
	addq	8(%r10), %rdi
	movq	8(%r10), %rsi
	movq	8(%r10), %rdx
	movq	40(%r10), %rax
	leaq	4(%r11), %rbx
	addq	%rdi, %rsi
	leaq	4(%rdi), %r9
	movq	%rdi, -8(%r10)
	addq	%rsi, %rdx
	leaq	4(%rsi), %r8
	movq	%rsi, -24(%r10)
	leaq	4(%rdx), %rcx
	movq	%r9, -16(%r10)
	movq	%rdx, -40(%r10)
	movq	%r8, -32(%r10)
	addq	$7, %rax
	movq	%rcx, -48(%r10)
	movsd	(%rax,%rcx,2), %xmm12
	leaq	(%rbx,%rbx), %rcx
	movsd	(%rax,%rdx,2), %xmm3
	leaq	(%rax,%r11,2), %rdx
	addq	$8, %r11
	movsd	(%rax,%r8,2), %xmm14
	cmpq	%r11, %r13
	movsd	(%rax,%rsi,2), %xmm13
	movsd	(%rax,%r9,2), %xmm11
	movsd	(%rax,%rdi,2), %xmm10
	movsd	(%rax,%rcx), %xmm8
...

while 4.3 always re-loads %rax as index:

.L26:
	leaq	4(%rdi), %rdx
	movq	%rdi, %rax
	movq	%rdx, -8(%rsp)
	addq	(%r8), %rax
	movq	%rax, (%r9)
	addq	$4, %rax
	movq	%rax, (%rbp)
	movq	(%r9), %rax
	addq	(%r8), %rax
	movq	%rax, (%r10)
	addq	$4, %rax
	movq	%rax, (%rbx)
	movq	(%r10), %rax
	addq	(%r8), %rax
	movq	%rax, (%r11)
	movq	-64(%rsp), %rcx
	addq	$4, %rax
	movq	%rax, (%rcx)
	movq	(%rsi), %rdx
	movq	-8(%rsp), %rcx
	addq	$7, %rdx
	movsd	(%rdx,%rax,2), %xmm13
	movq	(%r11), %rax
	addq	%rcx, %rcx
	movsd	(%rdx,%rcx), %xmm8
	movsd	(%rdx,%rax,2), %xmm3
	movq	(%rbx), %rax
	movsd	(%rdx,%rax,2), %xmm14
	movq	(%r10), %rax
	movsd	(%rdx,%rax,2), %xmm12
	movq	(%rbp), %rax
	movsd	(%rdx,%rax,2), %xmm11
	movq	(%r9), %rax
	movsd	(%rdx,%rax,2), %xmm10
	movq	(%r12), %rax
	leaq	(%rdx,%rdi,2), %rdx
...

the root cause needs to be investigated still.
Comment 10 Richard Biener 2007-10-28 16:39:27 UTC
So, confirmed.
Comment 11 lucier 2007-11-12 21:50:21 UTC
I suspected that the slowdown had nothing to do with computed gotos, so I regenerated the C code using a switch instead of the computed gotos and got the following:

For that same copy of mainline,

gcc version 4.3.0 20071026 (experimental) [trunk revision 129664] (GCC)

I get:

(time (direct-fft-recursive-4 a table))
    470 ms real time
    470 ms cpu time (470 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

For 4.2.2:

(time (direct-fft-recursive-4 a table))
    384 ms real time
    384 ms cpu time (383 user, 1 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

So that's almost exactly the same slowdown as with computed gotos.

I changed the subject line to use 22% instead of 33% (I don't know how I got 33% before, perhaps I just mistyped it) and removed the phrase "with computed gotos".

I'll include the new .i and .s files as attachments.
Comment 12 lucier 2007-11-12 21:51:30 UTC
Created attachment 14534 [details]
.i file using a switch instead of computed gotos

This is the generated code with a switch instead of computed gotos.
Comment 13 lucier 2007-11-12 21:52:36 UTC
Created attachment 14535 [details]
4.2.2 assembly for code using switch.
Comment 14 lucier 2007-11-12 21:53:11 UTC
Created attachment 14536 [details]
4.3.0 assembly for code using a switch
Comment 15 Mark Mitchell 2007-11-27 05:53:12 UTC
I've marked this P1 because I'd like to see us start to explain these kinds of dramatic performance changes.  If we can explain the issue coherently, we may well decide that it's not important to fix it, but I think we ought to force ourselves to figure out what's going on.
Comment 16 Paolo Bonzini 2007-11-30 05:39:19 UTC
One suspect is fwprop.  Anyone can confirm?
Comment 17 lucier 2007-11-30 14:47:24 UTC
Subject: Re:  [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code

On Nov 30, 2007, at 12:39 AM, bonzini at gnu dot org wrote:

> One suspect is fwprop.  Anyone can confirm?

How does one turn off fwprop?  It doesn't seem to like "-fno-fwprop".
Comment 18 Paolo Bonzini 2007-11-30 14:58:45 UTC
It would be -fno-forward-propagate, but what I meant is that the changes *connected to* fwprop could be the culprit.  One has to look at dumps to understand if this is the case.

Maybe it would be possible to put an asm around the problematic basic block, so that one could track the number of instructions in that basic block over time.
Comment 19 lucier 2007-12-01 18:59:39 UTC
Subject: Re:  [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code


On Nov 30, 2007, at 9:58 AM, bonzini at gnu dot org wrote:

> -fno-forward-propagate

I don't know how to debug this, that's clear enough, but adding -fno-forward-propagate as an option doesn't change the code at all.
Comment 20 Richard Biener 2008-01-09 12:45:09 UTC
Can we have updated measurements please?  Also I don't think this bug should be P1.
Comment 21 lucier 2008-01-09 18:44:09 UTC
The assembler is identical to that in the third attachment and the time is basically the same (other things were going on at the same time):

(time (direct-fft-recursive-4 a table))
    465 ms real time
    466 ms cpu time (466 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

euler-86% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2 --with-mpfr=/pkgs/gmp-4.2.2 --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.3.0 20080109 (experimental) [trunk revision 131427] (GCC) 
Comment 22 Richard Biener 2008-01-12 17:56:38 UTC
I'm downgrading this to P2.
Comment 23 Uroš Bizjak 2008-01-21 19:21:40 UTC
It is not possible to create an executable from direct.i. My compilation fails:

(.text+0x20): undefined reference to `main'
/tmp/cc0VOLHm.o: In function `___H_direct_2d_fft_2d_recursive_2d_4':
_num.c:(.text+0xf1): undefined reference to `___gstate'
_num.c:(.text+0x18e): undefined reference to `___gstate'
_num.c:(.text+0x1c7): undefined reference to `___gstate'
_num.c:(.text+0x27b): undefined reference to `___gstate'
_num.c:(.text+0x2e0): undefined reference to `___gstate'
/tmp/cc0VOLHm.o:_num.c:(.text+0x6f0): more undefined references to `___gstate' follow

Could you attach the source that can be used to create the executable, or perhaps detailed instructions on how to create one from the sources you already posted?
Comment 24 lucier 2008-01-21 22:43:54 UTC
Subject: Re:  [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code


On Jan 21, 2008, at 2:21 PM, ubizjak at gmail dot com wrote:

> It is not possible to create an executable from direct.i.

That's correct, sorry.

> Could you attach the source that can be used to create the executable?

Here are instructions on how to build and test a modified version of  
Gambit, from which I derived direct.i.

Download the file

http://www.math.purdue.edu/~lucier/gcc/test-files/bugzilla/33928/gambc-v4_1_2.tgz

Build it with the following commands:

> tar zxf gambc-v4_1_2.tgz
> cd gambc-v4_1_2
> ./configure CC='/pkgs/gcc-mainline/bin/gcc -save-temps'
> make -j

If you want to recompile the source after reconfiguring, do

> make mostlyclean


not 'make clean', unfortunately.

Then test it with

> gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a a)))'

The output ends with something like

> (time (##bignum.make (##fixnum.quotient result-length  
> (##fixnum.quotient ##bignum.adigit-width ##bignum.fdigit-width)) #f  
> #f))
>     4 ms real time
>     5 ms cpu time (3 user, 2 system)
>     no collections
>     3962448 bytes allocated
>     968 minor faults
>     no major faults
> (time (##make-f64vector (##fixnum.* two^n 2)))
>     5 ms real time
>     5 ms cpu time (1 user, 4 system)
>     1 collection accounting for 5 ms real time (1 user, 4 system)
>     33554464 bytes allocated
>     59 minor faults
>     no major faults
> (time (make-w (##fixnum.- log-two^n 1)))
>     30 ms real time
>     31 ms cpu time (17 user, 14 system)
>     no collections
>     16810144 bytes allocated
>     4097 minor faults
>     no major faults
> (time (make-w-rac log-two^n))
>     28 ms real time
>     28 ms cpu time (16 user, 12 system)
>     no collections
>     16826272 bytes allocated
>     4097 minor faults
>     no major faults
> (time (bignum->f64vector-rac x a))
>     45 ms real time
>     45 ms cpu time (20 user, 25 system)
>     no collections
>     -16 bytes allocated
>     8192 minor faults
>     no major faults
> (time (componentwise-rac-multiply a rac-table))
>     26 ms real time
>     26 ms cpu time (26 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults
> (time (componentwise-complex-multiply a a))
>     24 ms real time
>     24 ms cpu time (24 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (inverse-fft-recursive-4 a table))
>     418 ms real time
>     418 ms cpu time (418 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults
> (time (componentwise-rac-multiply-conjugate a rac-table))
>     26 ms real time
>     26 ms cpu time (26 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (bignum<-f64vector-rac a result result-length))
>     108 ms real time
>     108 ms cpu time (108 user, 0 system)
>     no collections
>     112 bytes allocated
>     no minor faults
>     no major faults
> (time (* a a))
>     1170 ms real time
>     1170 ms cpu time (1105 user, 65 system)
>     1 collection accounting for 5 ms real time (1 user, 4 system)
>     71266896 bytes allocated
>     17413 minor faults
>     no major faults


The time for the routine in direct.i is the time reported for direct-fft-recursive-4:

> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults

The name of the routine in the .i and .s files is ___H_direct_2d_fft_2d_recursive_2d_4.

By the way, ___H_inverse_2d_fft_2d_recursive_2d_4 is a similar routine implementing the inverse fft, which, for some reason, runs faster than the direct (forward) fft.

Brad
Comment 25 Uroš Bizjak 2008-01-22 12:03:32 UTC
Created attachment 14996 [details]
Much shorter testcase.

This testcase was used to track down problems with the FRE pass.  Stay tuned for an analysis.
Comment 26 Andrew Pinski 2008-01-22 12:07:05 UTC
Really, I bet FRE is doing its job and the RA can't do its.
Comment 27 Uroš Bizjak 2008-01-22 12:20:36 UTC
As already noted by Richi in Comment #9, the difference is in usage of %rax.

gcc-4.2 generates:
	...
	addq	$7, %rax
	leaq	(%rax,%rbp,2), %r10
	leaq	(%rax,%rdx,2), %rdx
	leaq	(%rax,%rdi,2), %rdi
	movq	(%rcx), %rsi
	movq	(%r13), %rcx
	leaq	(%rax,%r9,2), %r9
	leaq	(%rax,%r8,2), %r8
	leaq	(%rax,%r14,2), %r11
	addq	$8, %rbp
	movsd	(%rdx), %xmm3
	leaq	(%rax,%rsi,2), %rsi
	leaq	(%rax,%rcx,2), %rcx
	...
	movsd	%xmm7, (%rcx)
	subsd	%xmm1, %xmm10
	addsd	%xmm1, %xmm0
	movsd	%xmm8, (%rsi)
	movsd	%xmm0, (%rdi)
	movapd	%xmm12, %xmm0
	subsd	%xmm3, %xmm12
	addsd	%xmm3, %xmm0
	movsd	%xmm0, (%r8)
	movsd	%xmm10, (%r9)
	movsd	%xmm12, (%rdx)
	jg	.L26

where gcc-4.3 limps along with:
	...
	leaq	7(%rax), %r9
	movq	%rbx, -64(%rsp)
	movq	-56(%rsp), %rcx
	addq	%r10, %r10
	movsd	7(%rax,%rdx), %xmm3
	movsd	(%r9,%rbx,2), %xmm8
	movq	(%r11), %rbx
	movsd	7(%rax,%r10), %xmm5
	addq	%r8, %r8
	addq	%rdi, %rdi
	movsd	7(%rax,%r8), %xmm12
	movsd	15(%rbx), %xmm2
	leaq	(%r9,%rbp,2), %r9
	movsd	7(%rbx), %xmm1
	...
	movsd	%xmm0, 7(%rax,%r9,2)
	movapd	%xmm10, %xmm0
	movsd	%xmm7, 7(%rax,%rcx)
	subsd	%xmm1, %xmm10
	addsd	%xmm1, %xmm0
	movsd	%xmm8, 7(%rax,%rsi)
	movsd	%xmm0, 7(%rax,%rdi)
	movapd	%xmm12, %xmm0
	subsd	%xmm3, %xmm12
	addsd	%xmm3, %xmm0
	movsd	%xmm0, 7(%rax,%r8)
	movsd	%xmm10, 7(%rax,%r10)
	movsd	%xmm12, 7(%rax,%rdx)
	jg	.L17

The difference is in the offsetted addresses. Looking at the tree dumps, it is obvious that the problem is in the FRE pass.

At the end of the loop (line 685+ in the _.034.fre dump), gcc-4.2 transforms every sequence of:

  D.2013_432 = ___fp_256 + 40B;
  D.2014_433 = *D.2013_432;
  D.2068_434 = (long int *) D.2014_433;
  D.2069_435 = D.2068_434 + 7B;
  D.2070_436 = (long int) D.2069_435;
  D.2094_437 = ___r3_35 << 1;
  D.2095_438 = D.2070_436 + D.2094_437;
  D.2096_439 = (double *) D.2095_438;
  *D.2096_439 = ___F64V53_431;
  D.2013_440 = ___fp_256 + 40B;
  D.2014_441 = *D.2013_440;
  D.2068_442 = (long int *) D.2014_441;
  D.2069_443 = D.2068_442 + 7B;
  D.2070_444 = (long int) D.2069_443;
  D.2091_445 = ___r4_257 << 1;
  D.2092_446 = D.2070_444 + D.2091_445;
  D.2093_447 = (double *) D.2092_446;
  *D.2093_447 = ___F64V52_430;
  D.2013_448 = ___fp_256 + 40B;
  D.2014_449 = *D.2013_448;
  D.2068_450 = (long int *) D.2014_449;
  D.2069_451 = D.2068_450 + 7B;
  D.2070_452 = (long int) D.2069_451;
  ...

into:

  D.2013_432 = D.2013_286;
  D.2014_433 = D.2014_287;
  D.2068_434 = D.2068_288;
  D.2069_435 = D.2069_289;
  D.2070_436 = D.2070_290;
  D.2094_437 = D.2094_366;
  D.2095_438 = D.2095_367;
  D.2096_439 = D.2096_368;
  *D.2096_439 = ___F64V53_431;
  D.2013_440 = D.2013_286;
  D.2014_441 = D.2014_287;
  D.2068_442 = D.2068_288;
  D.2069_443 = D.2069_289;
  D.2070_444 = D.2070_290;
  D.2091_445 = D.2091_357;
  D.2092_446 = D.2092_358;
  D.2093_447 = D.2093_359;
  *D.2093_447 = ___F64V52_430;
  D.2013_448 = D.2013_286;
  D.2014_449 = D.2014_287;
  D.2068_450 = D.2068_288;
  D.2069_451 = D.2069_289;
  D.2070_452 = D.2070_290;
  D.1994_453 = D.1994_258;
  D.2040_454 = D.2040_347;
  D.2041_455 = D.2041_348;
  D.2089_456 = D.2089_349;
  D.2090_457 = D.2090_350;
  ...

and this is optimized in further passes into:

  *D.2096 = ___F64V32 + ___F64V45;
  *D.2093 = ___F64V31 + ___F64V42;
  *D.2090 = ___F64V32 - ___F64V45;
  *D.2088 = ___F64V31 - ___F64V42;
  *D.2084 = ___F64V28 + ___F64V39;
  *D.2081 = ___F64V27 + ___F64V36;
  *D.2077 = ___F64V28 - ___F64V39;
  *D.2074 = ___F64V27 - ___F64V36;

However, for some reason gcc-4.3 transforms only _some_ instructions (line 708+ in _.085t.fre dump), creating:

  D.1683_428 = D.1683_282;
  D.1684_429 = D.1684_283;
  D.1738_430 = D.1738_284;
  D.1739_431 = D.1739_285;
  D.1740_432 = D.1740_286;
  D.1764_433 = D.1764_362;
  D.1765_434 = D.1765_363;
  D.1766_435 = D.1766_364;
  *D.1766_435 = ___F64V53_427;
  D.1683_436 = D.1683_282;
  D.1684_437 = *D.1683_436;
  D.1738_438 = (long unsigned int) D.1684_437;
  D.1739_439 = D.1738_438 + 7;
  D.1740_440 = (long int) D.1739_439;
  D.1761_441 = D.1761_353;
  D.1762_442 = D.1740_440 + D.1761_441;
  D.1763_443 = (double *) D.1762_442;
  *D.1763_443 = ___F64V52_426;
  D.1683_444 = D.1683_282;
  D.1684_445 = *D.1683_444;
  D.1738_446 = (long unsigned int) D.1684_445;
  D.1739_447 = D.1738_446 + 7;
  D.1740_448 = (long int) D.1739_447;
  ...

which leaves us with:

  *D.1766 = ___F64V32 + ___F64V45;
  *(double *) (D.1761 + (long int) ((long unsigned int) *pretmp.33 + 7)) = ___F64V31 + ___F64V42;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*temp.65 << 1)) = ___F64V32 - ___F64V45;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*D.1685 << 1)) = ___F64V31 - ___F64V42;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*temp.61 << 1)) = ___F64V28 + ___F64V39;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*pretmp.152 << 1)) = ___F64V27 + ___F64V36;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*pretmp.147 << 1)) = ___F64V28 - ___F64V39;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*___fp.47 << 1)) = ___F64V27 - ___F64V36;

and produces the suboptimal asm shown above.
Comment 28 Richard Biener 2008-01-22 12:38:06 UTC
This is an alias partitioning problem, with --param max-aliased-vops=10000 I
see the sequence optimized by FRE.  Or, with the alias-oracle patch for FRE
--param max-fields-for-field-sensitive=1 does the job as well.
Comment 29 Paolo Bonzini 2008-01-22 12:39:13 UTC
target independent
Comment 30 Uroš Bizjak 2008-01-22 12:52:12 UTC
Please note that for the original testcase (direct.i), even '-O2 --param max-aliased-vops=100000' doesn't generate the expected code.
Comment 31 Richard Biener 2008-01-22 13:06:14 UTC
Created attachment 14997 [details]
asm with alias-oracle enabled FRE

This is the asm produced from direct.i with -O2 --param max-fields-for-field-sensitive=1 (SFTs disabled, which is the goal for 4.4)
with the (ok, a modified) alias-oracle patch for FRE applied.
Comment 32 lucier 2008-05-30 16:01:23 UTC
I've decided to test the current ira branch with this problem.  I used the build instructions in comment 24.

With -fno-ira I get the same results as with 4.3.0 (no surprise there).

With -fira I get the time

(time (direct-fft-recursive-4 a table))
    422 ms real time
    421 ms cpu time (421 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

which is an improvement, and the code at the beginning of the loop is

.L7262:
        movq    %rdx, %rcx
        addq    (%rsi), %rcx
        leaq    4(%rdx), %r15
        movq    %rcx, (%rbx)
        addq    $4, %rcx
        movq    %rcx, (%rbp)
        movq    (%rbx), %rcx
        addq    (%rsi), %rcx
        movq    %rcx, (%rdi)
        addq    $4, %rcx
        movq    %rcx, (%r8)
        movq    (%rdi), %rcx
        addq    (%rsi), %rcx
        leaq    4(%rcx), %r10
        movq    %rcx, (%r9)
        movq    %r10, (%r13)
        movq    (%rax), %rcx
        addq    $7, %rcx
        movsd   (%rcx,%r10,2), %xmm4
        movq    (%r9), %r10
        leaq    (%rcx,%rdx,2), %r11
        addq    $8, %rdx
        movsd   (%r11), %xmm11
        movsd   (%rcx,%r10,2), %xmm5
        movq    (%r8), %r10 
        movsd   (%rcx,%r10,2), %xmm6
        movq    (%rdi), %r10
        movsd   (%rcx,%r10,2), %xmm7
        movq    (%rbp), %r10
        movsd   (%rcx,%r10,2), %xmm8
        movq    (%rbx), %r10
        movapd  %xmm8, %xmm14
        movsd   (%rcx,%r10,2), %xmm9
        leaq    (%r15,%r15), %r10
        movsd   (%rcx,%r10), %xmm10
        movq    (%r12), %rcx
        movapd  %xmm9, %xmm15
        movsd   15(%rcx), %xmm1
        movsd   7(%rcx), %xmm2
        movapd  %xmm1, %xmm13
        movsd   31(%rcx), %xmm3
        movapd  %xmm2, %xmm12

which is also an improvement, but it is still nowhere near the result for 4.2.2.

So, whatever is causing this problem, it appears the new register allocator isn't going to fix it.

The code generated by today's mainline (136210) isn't better than 4.3.0; the time is

(time (direct-fft-recursive-4 a table))
    469 ms real time
    469 ms cpu time (469 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

and the code is essentially the same as for 4.3.0.
Comment 33 Richard Biener 2008-06-06 14:58:11 UTC
4.3.1 is being released, adjusting target milestone.
Comment 34 lucier 2008-07-09 16:05:38 UTC
Problem still exists with

euler-18% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2/ --with-mpfr=/pkgs/gmp-4.2.2/ --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.4.0 20080708 (experimental) [trunk revision 137644] (GCC) 

Just checking whether recent changes happened to fix it.
Comment 35 Joseph S. Myers 2008-08-27 22:02:57 UTC
4.3.2 is released, changing milestones to 4.3.3.
Comment 36 lucier 2008-09-04 20:39:38 UTC
I don't really understand the status of this bug.

Before 4.3.0, it was P1, and Mark said he'd "like to see us start to explain these kinds of dramatic performance changes."

There was quite a bit of detective work that ended with "for some reason gcc-4.3 transforms only _some_ instructions (line 708+ in _.085t.fre dump) ...".

Richard opined that it was an "alias partitioning problem", but Uros noted that for the original code, as opposed to the reduced testcase, expanding some parameter to its maximum still doesn't fix the problem.

So (a) we don't know what the current code is doing wrong, and (b) we don't know why 4.2 got it right.

So I don't think Mark got what he wanted, and now it's P2, and each release the target release for fixing it gets pushed back.

I've been testing mainline on this bug sporadically, especially when an entry in gcc-patches mentions some words that also appear on this PR, to see if it's fixed.  I'm a bit concerned that the target of 4.3.* is becoming increasingly out of reach, as changes committed to that branch seem to be more and more conservative because it's a release branch.

I don't think the code for this bug is terribly atypical for machine-generated code; it would be nice to be able to remove this performance regression.  Unfortunately, I'm in no position to do so.
Comment 37 Richard Biener 2008-09-04 20:43:43 UTC
We have to admit that this bug is unlikely to get fixed in the 4.3 series. It still lacks proper analysis, as unfortunately the analysis done on the shorter testcase was not valid.  Analysis takes time, and honestly, at this point I would rather spend time fixing wrong-code or ice-on-valid bugs.
Comment 38 lucier 2008-09-04 20:49:29 UTC
OK, but I was moved to write because Jakub's latest 4.4 status report requests:

"Please concentrate now on fixing bugs, especially the performance regressions."

and this is a definite 4.3/4.4 performance regression from 4.2.  (How many of the P1 PRs are performance regressions?)
Comment 39 lucier 2008-12-06 16:37:46 UTC
I may have narrowed down the problem a bit.

With this compiler (revision 118491):

pythagoras-277% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061105 (experimental)

one gets (on a faster machine than previous reports)

(time (direct-fft-recursive-4 a table))
    133 ms real time
    140 ms cpu time (140 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

With this compiler (revision 118474):

pythagoras-24% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061104 (experimental)

one gets

(time (direct-fft-recursive-4 a table))
    116 ms real time
    108 ms cpu time (108 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

and you see the typical problem with assembly code from direct.i with the later compiler.

Paolo may have been right about fwprop; this patch was installed that day:

Author: bonzini
Date: Sat Nov  4 08:36:45 2006
New Revision: 118475

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=118475
Log:
2006-11-03  Paolo Bonzini  <bonzini@gnu.org>
            Steven Bosscher  <stevenb.gcc@gmail.com>

        * fwprop.c: New file.
        * Makefile.in: Add fwprop.o.
        * tree-pass.h (pass_rtl_fwprop, pass_rtl_fwprop_with_addr): New.
        * passes.c (init_optimization_passes): Schedule forward propagation.
        * rtlanal.c (loc_mentioned_in_p): Support NULL value of the second
        parameter.
        * timevar.def (TV_FWPROP): New.
        * common.opt (-fforward-propagate): New.
        * opts.c (decode_options): Enable forward propagation at -O2.
        * gcse.c (one_cprop_pass): Do not run local cprop unless touching jumps.
        * cse.c (fold_rtx_subreg, fold_rtx_mem, fold_rtx_mem_1, find_best_addr,
        canon_for_address, table_size): Remove.
        (new_basic_block, insert, remove_from_table): Remove references to
        table_size.
        (fold_rtx): Process SUBREGs and MEMs with equiv_constant, make
        simplification loop more straightforward by not calling fold_rtx
        recursively.
        (equiv_constant): Move here a small part of fold_rtx_subreg,
        do not call fold_rtx.  Call avoid_constant_pool_reference
        to process MEMs.
        * recog.c (canonicalize_change_group): New.
        * recog.h (canonicalize_change_group): New.

        * doc/invoke.texi (Optimization Options): Document fwprop.
        * doc/passes.texi (RTL passes): Document fwprop.


Added:
    trunk/gcc/fwprop.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/common.opt
    trunk/gcc/cse.c
    trunk/gcc/doc/invoke.texi
    trunk/gcc/doc/passes.texi
    trunk/gcc/gcse.c
    trunk/gcc/opts.c
    trunk/gcc/passes.c
    trunk/gcc/recog.c
    trunk/gcc/recog.h
    trunk/gcc/rtlanal.c
    trunk/gcc/timevar.def
    trunk/gcc/tree-pass.h
Comment 40 Paolo Bonzini 2008-12-07 02:55:20 UTC
IIUC this is a typical case in which CSE was fixing something that earlier passes messed up.  Unfortunately fwprop does (better) what CSE was meant to do, but does not do what I assumed was already done before CSE.

If the problem is aliasing/FRE, then I think Richi is the one who could fix it for good in the tree passes.  If there is more to it, however, I can take a look at why fwprop is generating the ugly code.
Comment 41 Richard Biener 2008-12-07 13:00:04 UTC
There's not much to be done for aliasing - everything points to global memory and thus aliases.  There may be some opportunities for offset-based disambiguations
via pointers, but I didn't investigate in detail.  Whoever wants someone to
work on specific details needs to provide way shorter testcases ;)
Comment 42 lucier 2008-12-07 19:39:18 UTC
Just a comment that -fforward-propagate isn't enabled at -O1 (the main optimization option in the test) while the cse code it replaces was enabled at -O1.  This is presumably why adding -fno-forward-propagate to the command line in the test a year ago didn't affect the generated code.

Adding -fno-forward-propagate to the command line of the test case with revision r118475 of gcc changes the generated code, but doesn't improve the problem code in the main loop.

Updated the title to report the performance hit on

Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz

as reported by /proc/cpuinfo
Comment 43 Richard Biener 2009-01-24 10:19:59 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 44 Paolo Bonzini 2009-02-13 16:05:29 UTC
A simplified (local, noncascading) fwprop not using UD chains would not be hard to do...  Basically, at -O1 walk the insns with FOR_EACH_BB/FOR_BB_INSNS instead of walking the uses, keep a (regno, insn) map of pseudos (cleared at the beginning of every basic block), and use that info instead of UD chains in use_killed_between...
Comment 45 lucier 2009-02-13 16:09:50 UTC
Subject: Re:  [4.3/4.4 Regression] 30%
 performance slowdown in floating-point code caused by  r118475

On Fri, 2009-02-13 at 16:05 +0000, bonzini at gnu dot org wrote:
> ------- Comment #44 from bonzini at gnu dot org  2009-02-13 16:05 -------
> A simplified (local, noncascading) fwprop not using UD chains would not be hard
> to do...  Basically, at -O1 use FOR_EACH_BB/FOR_EACH_BB_INSN instead of walking
> the uses, keep a (regno, insn) map of pseudos (cleared at the beginning of
> every basic block), and use that info instead of UD chains in
> use_killed_between...

As noted in comment 42, enabling FWPROP on this test case does not fix
the performance problem.

Comment 46 Paolo Bonzini 2009-02-13 16:32:01 UTC
Regarding your comment in bug 26854:

> address calculations are no longer optimized as much as they
> were before 

Sometimes, actually, they are optimized better.  It depends on the case.

Also, in comment #42 you talked about -O1, where fwprop is not enabled, so I am failing to understand whether this bug's problem is at the tree level or the RTL level.

My comment was related to something said in PR39517, i.e. that UD chains are very expensive and a reason why fwprop should not be enabled at -O1.  Following up on my comment: alternatively, fwprop could compute its own dataflow instead of using UD chains, since by design it only cares about uses with a single definition.  This looks much better.

You would use something like df_chain_create_bb and df_chain_create_bb_process_use, with code like the following (cf. df_chain_create_bb_process_use):

          /* Do not want to go through this for an uninitialized var.  */
          int count = DF_DEFS_COUNT (uregno);
          if (count)
            {
              if (top_flag == (DF_REF_FLAGS (use) & DF_REF_AT_TOP))
                {
                  unsigned int first_index = DF_DEFS_BEGIN (uregno);
                  unsigned int last_index = first_index + count - 1;

                  /* Uninitialized?  Skip this use.  */
                  bmp_iter_set_init (&bi, local_rd, first_index, &def_index);
                  if (!bmp_iter_set (&bi, &def_index) || def_index > last_index)
                    continue;

                  /* Advance; if there is no second def in range, this use
                     has a single definition and can be forward-propagated.  */
                  bmp_iter_next (&bi, &def_index);
                  if (!bmp_iter_set (&bi, &def_index) || def_index > last_index)
                    SET_BIT (can_fwprop, DF_REF_ID (use));
                }
            }

With this change there would be no reason not to run fwprop at -O1.
Comment 47 lucier 2009-02-13 17:22:55 UTC
Subject: Re:  [4.3/4.4 Regression] 30%
 performance slowdown in floating-point code caused by  r118475

On Fri, 2009-02-13 at 16:32 +0000, bonzini at gnu dot org wrote:
> 
> 
> ------- Comment #46 from bonzini at gnu dot org  2009-02-13 16:32 -------
> Regarding your comment in bug 26854:
> 
> > address calculations are no longer optimized as much as they
> > were before 
> 
> Sometimes, actually, they are optimized better.  It depends on the case.

Yes.  I don't see why the optimizations in CSE, which were relatively
cheap and which were effective for this case, needed to be disabled when
FWPROP was added without, evidently, understanding why FWPROP does not
do what CSE was already doing.

> In comment #42, also, you talked about -O1, where fwprop is not enabled.  So
> I'm failing to understand if the problem is at the tree or RTL level for this
> bug.

When I add -fforward-propagate to the command line, then the assembly
code changes in some ways, but the performance problem remains the same.

Brad

Comment 48 Paolo Bonzini 2009-02-13 20:09:57 UTC
Subject: Re:  [4.3/4.4 Regression] 30% 
	performance slowdown in floating-point code caused by r118475

> Yes.  I don't see why the optimizations in CSE, which were relatively
> cheap and which were effective for this case, needed to be disabled when
> FWPROP was added without, evidently, understanding why FWPROP does not
> do what CSE was already doing.

Just to mention it, fwprop saved 3% of compile time.  That's not
"cheap".  It was also tested with SPEC and Nullstone on several
architectures.
Comment 49 lucier 2009-04-23 15:58:45 UTC
With 4.4.0 and with mainline this code now runs in 280 ms instead of in 156 ms with 4.2.4.

Since 280/156 ≈ 1.79, I changed the subject line (the slowdown is now not completely caused by r118475).

I guess I'll post the assembly code generated by 4.4.0 in the next attachment.

Timings (best of three runs) for the last

(time (direct-fft-recursive-4 a table))

from

 gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a a)))'

With gcc-4.1.2:

    188 ms cpu time (188 user, 0 system)

With gcc-4.2.4

    156 ms cpu time (152 user, 4 system)

With gcc-4.3.3:

    180 ms cpu time (180 user, 0 system)

With gcc-4.4.0

    280 ms cpu time (280 user, 0 system)

With 4.5.0 20090423 (experimental) [trunk revision 146634]

    280 ms cpu time (280 user, 0 system)

Comment 50 lucier 2009-04-23 16:00:49 UTC
Created attachment 17685 [details]
direct.s generated by 4.4.0
Comment 51 lucier 2009-04-23 16:03:01 UTC
Forgot to mention, the main loop starts at .L2947.

This is on

model name	: Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz

Brad
Comment 52 lucier 2009-04-26 18:27:38 UTC
I narrowed the new performance regression down to code added around March 12, 2009, so I changed the subject line of this PR back to reflect only the performance regression caused by the code added 2006-11-03, and opened a new PR

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39914

to track the effects of the March 2009 code.
Comment 53 lucier 2009-05-06 03:43:34 UTC
I posted a possible fix to gcc-patches with the subject line

Possible fix for 30% performance regression in PR 33928

Here's the assembly for the main loop after the changes I proposed:

.L4230:
        movq    %r11, %rdi
        addq    8(%r10), %rdi
        movq    8(%r10), %rsi
        movq    8(%r10), %rdx
        movq    40(%r10), %rax
        leaq    4(%r11), %rbx
        addq    %rdi, %rsi
        leaq    4(%rdi), %r9
        movq    %rdi, -8(%r10)
        addq    %rsi, %rdx
        leaq    4(%rsi), %r8
        movq    %rsi, -24(%r10)
        leaq    4(%rdx), %rcx
        movq    %r9, -16(%r10)
        movq    %rdx, -40(%r10)
        movq    %r8, -32(%r10)
        addq    $7, %rax
        movq    %rcx, -48(%r10)
        movsd   (%rax,%rcx,2), %xmm12
        leaq    (%rbx,%rbx), %rcx
        movsd   (%rax,%rdx,2), %xmm3
        leaq    (%rax,%r11,2), %rdx
        addq    $8, %r11
        movsd   (%rax,%r8,2), %xmm14
        cmpq    %r11, %r13
        movsd   (%rax,%rsi,2), %xmm13
        movsd   (%rax,%r9,2), %xmm11
        movsd   (%rax,%rdi,2), %xmm10
        movsd   (%rax,%rcx), %xmm8
        movq    24(%r10), %rax
        movsd   (%rdx), %xmm7
        movsd   15(%rax), %xmm2
        movsd   7(%rax), %xmm1
        movapd  %xmm2, %xmm0
        movsd   31(%rax), %xmm9
        movapd  %xmm1, %xmm6
        mulsd   %xmm3, %xmm0
        movapd  %xmm1, %xmm4
        mulsd   %xmm12, %xmm6
        mulsd   %xmm3, %xmm4
        movapd  %xmm1, %xmm3
        mulsd   %xmm13, %xmm1
        mulsd   %xmm14, %xmm3
        addsd   %xmm0, %xmm6
        movapd  %xmm2, %xmm0
        movsd   23(%rax), %xmm5
        mulsd   %xmm12, %xmm0
        movapd  %xmm7, %xmm12
        subsd   %xmm0, %xmm4
        movapd  %xmm2, %xmm0
        mulsd   %xmm14, %xmm2
        movapd  %xmm8, %xmm14
        mulsd   %xmm13, %xmm0
        movapd  %xmm11, %xmm13
        addsd   %xmm6, %xmm11
        subsd   %xmm6, %xmm13
        subsd   %xmm2, %xmm1
        movapd  %xmm10, %xmm2
        addsd   %xmm0, %xmm3
        movapd  %xmm5, %xmm0
        subsd   %xmm4, %xmm2
        addsd   %xmm4, %xmm10
        subsd   %xmm1, %xmm12
        addsd   %xmm1, %xmm7
        movapd  %xmm9, %xmm1
        subsd   %xmm3, %xmm14
        mulsd   %xmm2, %xmm0
        xorpd   .LC5(%rip), %xmm1
        addsd   %xmm3, %xmm8
        movapd  %xmm1, %xmm3
        mulsd   %xmm2, %xmm1
        movapd  %xmm5, %xmm2
        mulsd   %xmm13, %xmm3
        mulsd   %xmm11, %xmm2
        addsd   %xmm0, %xmm3
        movapd  %xmm5, %xmm0
        mulsd   %xmm10, %xmm5
        mulsd   %xmm13, %xmm0
        subsd   %xmm0, %xmm1
        movapd  %xmm9, %xmm0
        mulsd   %xmm11, %xmm9
        mulsd   %xmm10, %xmm0
        subsd   %xmm9, %xmm5
        addsd   %xmm0, %xmm2
        movapd  %xmm7, %xmm0
        addsd   %xmm5, %xmm0
        subsd   %xmm5, %xmm7
        movsd   %xmm0, (%rdx)
        movapd  %xmm8, %xmm0
        movq    40(%r10), %rax
        subsd   %xmm2, %xmm8
        addsd   %xmm2, %xmm0
        movsd   %xmm0, 7(%rcx,%rax)
        movq    -8(%r10), %rdx
        movq    40(%r10), %rax
        movapd  %xmm12, %xmm0
        subsd   %xmm1, %xmm12
        movsd   %xmm7, 7(%rax,%rdx,2)
        movq    -16(%r10), %rdx
        movq    40(%r10), %rax
        addsd   %xmm1, %xmm0
        movsd   %xmm8, 7(%rax,%rdx,2)
        movq    -24(%r10), %rdx
        movq    40(%r10), %rax
        movsd   %xmm0, 7(%rax,%rdx,2)
        movapd  %xmm14, %xmm0
        movq    -32(%r10), %rdx
        movq    40(%r10), %rax
        subsd   %xmm3, %xmm14
        addsd   %xmm3, %xmm0
        movsd   %xmm0, 7(%rax,%rdx,2)
        movq    -40(%r10), %rdx
        movq    40(%r10), %rax
        movsd   %xmm12, 7(%rax,%rdx,2)
        movq    -48(%r10), %rdx
        movq    40(%r10), %rax
        movsd   %xmm14, 7(%rax,%rdx,2)
        jg      .L4230
        movq    %rbx, %r13
.L4228:
Comment 54 lucier 2009-05-06 03:50:14 UTC
Created attachment 17805 [details]
svn diff of cse.c to fix the performance regression

This partially reverts r118475 and adds code to call find_best_address for MEMs in fold_rtx.
Comment 55 Paolo Bonzini 2009-05-06 09:20:51 UTC
Created attachment 17807 [details]
svn diff of cse.c to "fix" the performance regression (updated)
Comment 56 Paolo Bonzini 2009-05-06 09:31:47 UTC
Created attachment 17808 [details]
usable testcase

Ok, I managed to produce reasonably readable source code (un-include the stdlib files, remove the unused Gambit stuff and the ___ prefixes, simplify some expressions), find the hot loops, annotate them with asm statements (see comment #18, 2007-11-30), and count the lengths of the loops.

                   4.2      4.5     4.5 + patch
LOOP 1            ~190     ~230    ~190
INNER LOOP 1.1    ~120     ~130    ~120
LOOP 2             33       36      31

I am thus obsoleting (almost) everything that was posted and is not relevant anymore.  Let's start from scratch with the new testcase.
Comment 57 Jakub Jelinek 2009-05-06 09:49:52 UTC
Why do you need any #include lines at all in the reduced testcase?  Compiles just fine even without them...
Comment 58 Paolo Bonzini 2009-05-06 09:56:53 UTC
Uhm, it's better to run unpatched 4.5 with -O1 -fforward-propagate to get a fair comparison.  Also, I was counting the loop headers, which are not part of the hot code.

                   4.2 -O1     4.5 -O1 -ffw-prop     4.5 + patch -O1
LOOP 1                181         201                   180
INNER LOOP 1.1        117         118                   113
LOOP 2                27           27                    26

This shows that you should compare running the code (you can use direct.i) with 4.2/-O1 and 4.5/-O1 -fforward-propagate.  This is very important, otherwise you're comparing apples to oranges.

fwprop is creating too much register pressure by creating offsets like these in the loop header:

        leaq    -8(%r12), %rsi
        leaq    8(%r12), %r10
        leaq    -16(%r12), %r9
        leaq    -24(%r12), %rbx
        leaq    -32(%r12), %rbp
        leaq    -40(%r12), %rdi
        leaq    -48(%r12), %r11
        leaq    40(%r12), %rdx

Then, the additional register pressure causes the bad scheduling we see in the generated assembly:

        movq    (%rdx), %rax
        movsd   (%rax,%r15,2), %xmm7
        movq    (%rdi), %r15
        movsd   (%rax,%r15,2), %xmm10
        movq    (%rbp), %r15
        movsd   (%rax,%r15,2), %xmm5
        movq    (%rbx), %r15
        movsd   (%rax,%r15,2), %xmm6
        movq    (%r9), %r15
        movsd   (%rax,%r15,2), %xmm15
        movq    (%rsi), %r15
        movsd   (%rax,%r15,2), %xmm11
Comment 59 Paolo Bonzini 2009-05-06 09:59:38 UTC
Created attachment 17809 [details]
usable testcase

Without includes as Jakub suggested.
Comment 60 Paolo Bonzini 2009-05-06 10:47:33 UTC
Actually those are created by -fmove-loop-invariants.  With -O1 -fforward-propagate -fno-move-loop-invariants I get:

                   4.5 -O1 -ffw-prop -fno-move-loop-inv
LOOP 1                183
INNER LOOP 1.1        116
LOOP 2                25

You should be able to get performance close to 4.2 or better with options "-O1 -fforward-propagate -fno-move-loop-invariants -fschedule-insns2".  If you do, this means two things:

1) That the bug is in the register pressure estimates of -fmove-loop-invariants (RTL loop-invariant motion), merely exposed by the fwprop patch.

2) That maybe you should start from -O2 and work backwards, eliminating optimizations that do not help you or cost too much compile time, instead of starting from -O1.
Comment 61 Jakub Jelinek 2009-05-06 13:05:48 UTC
Also see PR39871, maybe that's related (though on ARM).
Comment 62 Paolo Bonzini 2009-05-06 15:07:18 UTC
No, totally unrelated to PR39871
Comment 63 lucier 2009-05-06 19:57:49 UTC
Was the patch in comment 55 meant for me to bootstrap and test with today's mainline?  It crashes at the gcc_assert at

/* Subroutine of canon_reg.  Pass *XLOC through canon_reg, and validate
   the result if necessary.  INSN is as for canon_reg.  */

static void
validate_canon_reg (rtx *xloc, rtx insn)
{
  if (*xloc)
    {
      rtx new_rtx = canon_reg (*xloc, insn);

      /* If replacing pseudo with hard reg or vice versa, ensure the
         insn remains valid.  Likewise if the insn has MATCH_DUPs.  */
      gcc_assert (insn && new_rtx);
      validate_change (insn, xloc, new_rtx, 1);
    }
}

when building libgcc:

/tmp/lucier/gcc/objdirs/mainline/./gcc/xgcc -B/tmp/lucier/gcc/objdirs/mainline/./gcc/ -B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/bin/ -B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/lib/ -isystem /pkgs/gcc-mainline/x86_64-unknown-linux-gnu/include -isystem /pkgs/gcc-mainline/x86_64-unknown-linux-gnu/sys-include -g -O2 -m32 -O2  -g -O2 -DIN_GCC   -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wcast-qual -Wold-style-definition  -isystem ./include  -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED   -I. -I. -I../../.././gcc -I../../../../../mainline/libgcc -I../../../../../mainline/libgcc/. -I../../../../../mainline/libgcc/../gcc -I../../../../../mainline/libgcc/../include -I../../../../../mainline/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o _moddi3.o -MT _moddi3.o -MD -MP -MF _moddi3.dep -DL_moddi3 -c ../../../../../mainline/libgcc/../gcc/libgcc2.c \
          -fexceptions -fnon-call-exceptions -fvisibility=hidden -DHIDE_EXPORTS
../../../../../mainline/libgcc/../gcc/libgcc2.c: In function ‘__moddi3’:
../../../../../mainline/libgcc/../gcc/libgcc2.c:1121: internal compiler error: in validate_canon_reg, at cse.c:2730

Comment 64 lucier 2009-05-06 20:43:53 UTC
In answer to comment 60, here's the command line where I added -fforward-propagate -fno-move-loop-invariants:

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

here's the compiler:

/pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /tmp/lucier/gcc/mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline --enable-languages=c
Thread model: posix
gcc version 4.5.0 20090506 (experimental) [trunk revision 147199] (GCC) 

and the runtime didn't change (substantially)

    132 ms cpu time (132 user, 0 system)

and the loop looks pretty much just as bad (it's 117 instructions long, by my count):

.L2752:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rdi
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rbx
        movq    40(%rax), %rdx
        movq    %rbx, -48(%rax)
        movsd   7(%rdx,%rbx,2), %xmm9
        movq    -40(%rax), %rbx
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rbx,2), %xmm11
        movq    -32(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm5
        movq    -24(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm7
        movq    -16(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm14
        movq    -8(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm6
        leaq    (%rdi,%rdi), %rbx
        movsd   7(%rbx,%rdx), %xmm8
        movq    24(%rax), %rdx
        movapd  %xmm6, %xmm13
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm10
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm11, %xmm10
        mulsd   %xmm9, %xmm12
        mulsd   %xmm2, %xmm11
        mulsd   %xmm1, %xmm9
        movsd   23(%rdx), %xmm0
        addsd   %xmm12, %xmm10
        movapd  %xmm2, %xmm12
        mulsd   %xmm7, %xmm2
        subsd   %xmm9, %xmm11
        movapd  %xmm1, %xmm9
        mulsd   %xmm5, %xmm12
        mulsd   %xmm5, %xmm1
        movapd  %xmm8, %xmm5
        mulsd   %xmm7, %xmm9
        movapd  %xmm4, %xmm7
        subsd   %xmm11, %xmm13
        addsd   %xmm6, %xmm11
        movsd   .LC5(%rip), %xmm6
        subsd   %xmm1, %xmm2
        movapd  %xmm0, %xmm1
        addsd   %xmm12, %xmm9
        movapd  %xmm14, %xmm12
        xorpd   %xmm3, %xmm6
        subsd   %xmm10, %xmm12
        mulsd   %xmm13, %xmm1
        subsd   %xmm2, %xmm7
        addsd   %xmm4, %xmm2
        movapd  %xmm6, %xmm4
        addsd   %xmm14, %xmm10
        mulsd   %xmm13, %xmm6
        mulsd   %xmm12, %xmm4
        subsd   %xmm9, %xmm5
        mulsd   %xmm0, %xmm12
        addsd   %xmm8, %xmm9
        movapd  %xmm0, %xmm8
        mulsd   %xmm11, %xmm0
        addsd   %xmm1, %xmm4
        movapd  %xmm3, %xmm1
        mulsd   %xmm10, %xmm3
        subsd   %xmm12, %xmm6
        mulsd   %xmm11, %xmm1
        mulsd   %xmm10, %xmm8
        subsd   %xmm3, %xmm0
        addsd   %xmm1, %xmm8
        movapd  %xmm2, %xmm1
        addsd   %xmm0, %xmm1
        subsd   %xmm0, %xmm2
        movapd  %xmm7, %xmm0
        subsd   %xmm6, %xmm7
        addsd   %xmm6, %xmm0
        movsd   %xmm1, (%r8)
        movapd  %xmm9, %xmm1
        movq    40(%rax), %rdx
        subsd   %xmm8, %xmm9
        addsd   %xmm8, %xmm1
        movsd   %xmm1, 7(%rbx,%rdx)
        movq    -8(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm2, 7(%rdx,%rbx,2)
        movq    -16(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm9, 7(%rdx,%rbx,2)
        movq    -24(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movapd  %xmm5, %xmm0
        movq    -32(%rax), %rbx
        movq    40(%rax), %rdx
        subsd   %xmm4, %xmm5
        addsd   %xmm4, %xmm0
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movq    -40(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm7, 7(%rdx,%rbx,2)
        movq    -48(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm5, 7(%rdx,%rbx,2)
        jg      .L2752
        movq    %rdi, %r13
.L2751:
Comment 65 Paolo Bonzini 2009-05-07 05:03:40 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance
 slowdown in floating-point code caused by  r118475

lucier at math dot purdue dot edu wrote:
> ------- Comment #64 from lucier at math dot purdue dot edu  2009-05-06 20:43 -------
> In answer to comment 60, here's the command line where I added
> -fforward-propagate -fno-move-loop-invariants:

Hmm, can you try adding -frename-registers *or* -fweb (i.e. together
they get no benefit) too?

> and the loop looks pretty much just as bad (it's 117 instructions long, by my
> count):

116 actually: the movq here is outside the loop (that's how I made all
the instruction counts).

>         movsd   %xmm5, 7(%rdx,%rbx,2)
>         jg      .L2752
>         movq    %rdi, %r13
> .L2751:
Comment 66 lucier 2009-05-07 05:27:25 UTC
Adding -frename-registers gives a significant speedup (on this shared machine it is sometimes as fast as 4.1.2, i.e., it sometimes hits 108 ms instead of 132-140 ms); the command line with -fforward-propagate -fno-move-loop-invariants -frename-registers is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

and the loop is

.L2752:
	movq	%rcx, %r12
	addq	8(%rax), %r12
	leaq	4(%rcx), %rdi
	movq	%r12, -8(%rax)
	leaq	4(%r12), %r8
	addq	8(%rax), %r12
	movq	%r8, -16(%rax)
	movq	-8(%rax), %r8
	movq	-16(%rax), %rdx
	movq	%r12, -24(%rax)
	leaq	4(%r12), %rbx
	addq	8(%rax), %r12
	movq	-24(%rax), %r9
	movq	%rbx, -32(%rax)
	movq	24(%rax), %rbx
	movq	-32(%rax), %r10
	leaq	4(%r12), %r11
	movq	%r12, -40(%rax)
	movq	40(%rax), %r12
	movq	-40(%rax), %r14
	movq	%r11, -48(%rax)
	movsd	15(%rbx), %xmm1
	movsd	7(%rbx), %xmm2
	movsd	7(%r12,%r11,2), %xmm9
	movapd	%xmm1, %xmm3
	movsd	7(%r12,%r14,2), %xmm11
	leaq	7(%r12,%rcx,2), %r11
	movapd	%xmm2, %xmm10
	leaq	(%rdi,%rdi), %r14
	mulsd	%xmm11, %xmm3
	movapd	%xmm2, %xmm12
	mulsd	%xmm9, %xmm10
	addq	$8, %rcx
	mulsd	%xmm1, %xmm9
	cmpq	%rcx, %r13
	mulsd	%xmm2, %xmm11
	movsd	7(%r12,%r10,2), %xmm5
	movsd	7(%r12,%r9,2), %xmm7
	addsd	%xmm10, %xmm3
	movsd	7(%r12,%r8,2), %xmm6
	subsd	%xmm9, %xmm11
	mulsd	%xmm7, %xmm2
	movapd	%xmm1, %xmm9
	mulsd	%xmm5, %xmm1
	movapd	%xmm6, %xmm13
	movsd	7(%r12,%rdx,2), %xmm14
	mulsd	%xmm5, %xmm12
	mulsd	%xmm7, %xmm9
	subsd	%xmm11, %xmm13
	movsd	31(%rbx), %xmm0
	addsd	%xmm6, %xmm11
	movsd	.LC5(%rip), %xmm6
	subsd	%xmm1, %xmm2
	movsd	(%r11), %xmm4
	movapd	%xmm14, %xmm10
	xorpd	%xmm0, %xmm6
	addsd	%xmm12, %xmm9
	movsd	7(%r14,%r12), %xmm8
	subsd	%xmm3, %xmm10
	movapd	%xmm4, %xmm7
	addsd	%xmm14, %xmm3
	movsd	23(%rbx), %xmm15
	subsd	%xmm2, %xmm7
	movapd	%xmm8, %xmm5
	addsd	%xmm4, %xmm2
	movapd	%xmm6, %xmm4
	subsd	%xmm9, %xmm5
	movapd	%xmm15, %xmm14
	addsd	%xmm8, %xmm9
	mulsd	%xmm10, %xmm4
	movapd	%xmm15, %xmm8
	mulsd	%xmm15, %xmm10
	movapd	%xmm0, %xmm12
	mulsd	%xmm11, %xmm15
	mulsd	%xmm3, %xmm0
	movapd	%xmm7, %xmm1
	mulsd	%xmm13, %xmm6
	mulsd	%xmm3, %xmm8
	movapd	%xmm9, %xmm3
	mulsd	%xmm11, %xmm12
	subsd	%xmm0, %xmm15
	mulsd	%xmm13, %xmm14
	subsd	%xmm10, %xmm6
	movapd	%xmm2, %xmm10
	movapd	%xmm5, %xmm0
	addsd	%xmm12, %xmm8
	addsd	%xmm15, %xmm10
	subsd	%xmm15, %xmm2
	addsd	%xmm14, %xmm4
	addsd	%xmm8, %xmm3
	movsd	%xmm10, (%r11)
	movq	40(%rax), %r10
	subsd	%xmm8, %xmm9
	addsd	%xmm6, %xmm1
	addsd	%xmm4, %xmm0
	movsd	%xmm3, 7(%r14,%r10)
	movq	-8(%rax), %r9
	movq	40(%rax), %rdx
	subsd	%xmm6, %xmm7
	subsd	%xmm4, %xmm5
	movsd	%xmm2, 7(%rdx,%r9,2)
	movq	-16(%rax), %r8
	movq	40(%rax), %r12
	movsd	%xmm9, 7(%r12,%r8,2)
	movq	-24(%rax), %rbx
	movq	40(%rax), %r11
	movsd	%xmm1, 7(%r11,%rbx,2)
	movq	-32(%rax), %r14
	movq	40(%rax), %r10
	movsd	%xmm0, 7(%r10,%r14,2)
	movq	-40(%rax), %r9
	movq	40(%rax), %rdx
	movsd	%xmm7, 7(%rdx,%r9,2)
	movq	-48(%rax), %r8
	movq	40(%rax), %r12
	movsd	%xmm5, 7(%r12,%r8,2)
	jg	.L2752

Using -fweb instead of -frename-registers, i.e. -fforward-propagate -fno-move-loop-invariants -fweb, so that the compile line is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -fweb -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

the time is not as good (consistently 128 ms) and the loop is

.L2752:
	movq	%rcx, %rdx
	addq	8(%rax), %rdx
	leaq	4(%rcx), %rdi
	movq	%rdx, -8(%rax)
	leaq	4(%rdx), %rbx
	addq	8(%rax), %rdx
	movq	%rbx, -16(%rax)
	movq	%rdx, -24(%rax)
	leaq	4(%rdx), %rbx
	addq	8(%rax), %rdx
	movq	%rbx, -32(%rax)
	movq	%rdx, -40(%rax)
	leaq	4(%rdx), %rbx
	movq	40(%rax), %rdx
	movq	%rbx, -48(%rax)
	movsd	7(%rdx,%rbx,2), %xmm9
	movq	-40(%rax), %rbx
	leaq	7(%rdx,%rcx,2), %r8
	addq	$8, %rcx
	movsd	(%r8), %xmm4
	cmpq	%rcx, %r13
	movsd	7(%rdx,%rbx,2), %xmm11
	movq	-32(%rax), %rbx
	movsd	7(%rdx,%rbx,2), %xmm5
	movq	-24(%rax), %rbx
	movsd	7(%rdx,%rbx,2), %xmm7
	movq	-16(%rax), %rbx
	movsd	7(%rdx,%rbx,2), %xmm14
	movq	-8(%rax), %rbx
	movsd	7(%rdx,%rbx,2), %xmm6
	leaq	(%rdi,%rdi), %rbx
	movsd	7(%rbx,%rdx), %xmm8
	movq	24(%rax), %rdx
	movapd	%xmm6, %xmm13
	movsd	15(%rdx), %xmm1
	movsd	7(%rdx), %xmm2
	movapd	%xmm1, %xmm10
	movsd	31(%rdx), %xmm3
	movapd	%xmm2, %xmm12
	mulsd	%xmm11, %xmm10
	mulsd	%xmm9, %xmm12
	mulsd	%xmm2, %xmm11
	mulsd	%xmm1, %xmm9
	movsd	23(%rdx), %xmm0
	addsd	%xmm12, %xmm10
	movapd	%xmm2, %xmm12
	mulsd	%xmm7, %xmm2
	subsd	%xmm9, %xmm11
	movapd	%xmm1, %xmm9
	mulsd	%xmm5, %xmm12
	mulsd	%xmm5, %xmm1
	movapd	%xmm8, %xmm5
	mulsd	%xmm7, %xmm9
	movapd	%xmm4, %xmm7
	subsd	%xmm11, %xmm13
	addsd	%xmm6, %xmm11
	movsd	.LC5(%rip), %xmm6
	subsd	%xmm1, %xmm2
	movapd	%xmm0, %xmm1
	addsd	%xmm12, %xmm9
	movapd	%xmm14, %xmm12
	xorpd	%xmm3, %xmm6
	subsd	%xmm10, %xmm12
	mulsd	%xmm13, %xmm1
	subsd	%xmm2, %xmm7
	addsd	%xmm4, %xmm2
	movapd	%xmm6, %xmm4
	addsd	%xmm14, %xmm10
	mulsd	%xmm13, %xmm6
	mulsd	%xmm12, %xmm4
	subsd	%xmm9, %xmm5
	mulsd	%xmm0, %xmm12
	addsd	%xmm8, %xmm9
	movapd	%xmm0, %xmm8
	mulsd	%xmm11, %xmm0
	addsd	%xmm1, %xmm4
	movapd	%xmm3, %xmm1
	mulsd	%xmm10, %xmm3
	subsd	%xmm12, %xmm6
	mulsd	%xmm11, %xmm1
	mulsd	%xmm10, %xmm8
	subsd	%xmm3, %xmm0
	addsd	%xmm1, %xmm8
	movapd	%xmm2, %xmm1
	addsd	%xmm0, %xmm1
	subsd	%xmm0, %xmm2
	movapd	%xmm7, %xmm0
	subsd	%xmm6, %xmm7
	addsd	%xmm6, %xmm0
	movsd	%xmm1, (%r8)
	movapd	%xmm9, %xmm1
	movq	40(%rax), %rdx
	subsd	%xmm8, %xmm9
	addsd	%xmm8, %xmm1
	movsd	%xmm1, 7(%rbx,%rdx)
	movq	-8(%rax), %rbx
	movq	40(%rax), %rdx
	movsd	%xmm2, 7(%rdx,%rbx,2)
	movq	-16(%rax), %rbx
	movq	40(%rax), %rdx
	movsd	%xmm9, 7(%rdx,%rbx,2)
	movq	-24(%rax), %rbx
	movq	40(%rax), %rdx
	movsd	%xmm0, 7(%rdx,%rbx,2)
	movapd	%xmm5, %xmm0
	movq	-32(%rax), %rbx
	movq	40(%rax), %rdx
	subsd	%xmm4, %xmm5
	addsd	%xmm4, %xmm0
	movsd	%xmm0, 7(%rdx,%rbx,2)
	movq	-40(%rax), %rbx
	movq	40(%rax), %rdx
	movsd	%xmm7, 7(%rdx,%rbx,2)
	movq	-48(%rax), %rbx
	movq	40(%rax), %rdx
	movsd	%xmm5, 7(%rdx,%rbx,2)
	jg	.L2752

And I still count 117 instructions in the loop in comment 64 (whether that matters, I don't know).
Comment 67 Paolo Bonzini 2009-05-07 13:40:41 UTC
I'm thinking of enabling -frename-registers on x86: since x86 does not run the first scheduling pass, live ranges are shorter and the register allocator may reuse the same register over and over, leaving schedule-insns2 no freedom.

This would leave only the bug with RTL loop invariant motion.

Brad, you are the one who regularly produces "insane" testcases: can you measure the compile-time slowdown from -O1 to -O1 -frename-registers?  It is a local pass, so it should not cost much, but I'd rather check first (I'll check on a bootstrap instead).
Comment 68 Steven Bosscher 2009-05-07 15:40:47 UTC
Be careful with -frename-registers: it is quadratic in the size of a basic block.  For Bradley's test cases it will certainly cause a slowdown.

I have tried a rewrite of -frename-registers, but I keep running into trouble with the INDEX_REGS and BASE_REGS non-classes. Paolo, we could look at this stuff together if you want my help.
Comment 69 lucier 2009-05-07 15:57:54 UTC
    Well, adding -frename-registers by itself to -O1, without -fforward-propagate and -fno-move-loop-invariants, doesn't help (the loop is given below, along with the complete compile options); the time is

        140 ms cpu time (140 user, 0 system)

    and adding -frename-registers and -fno-move-loop-invariants without -fforward-propagate doesn't help either (that loop is also given below); it gets

        140 ms cpu time (140 user, 0 system)

    Adding all three gives a very consistent time this morning of

        120 ms cpu time (120 user, 0 system)

    which is the same as the 4.2.4 time without any of these options (this morning).

    But -fforward-propagate is not a viable option in general for this type of code; here are some times for the testcase from PR 31957 with various options on a 2.something GHz Xeon server:

    pythagoras-45% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report >& rename-report
    252.987u 9.592s 4:23.20 99.7%   0+0k 0+0io 0pf+0w
    pythagoras-46% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & no-rename-report
    249.875u 10.544s 4:21.73 99.4%  0+0k 0+0io 0pf+0w
    pythagoras-47% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & rename-no-move-loop-invariants-report
    246.663u 10.484s 4:18.30 99.5%  0+0k 0+0io 0pf+0w
    pythagoras-48% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -fforward-propagate -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & rename-no-move-loop-invariants-forward-propagate-report
    357.830u 28.417s 6:27.81 99.5%  0+0k 0+0io 11pf+0w

    With -fforward-propagate the memory required went up to at least 21GB.

    I'll attach the time reports for the various options, but the compiler wasn't configured to provide detailed memory reports.

    Brad


    Loop with -frename-registers

    /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers  -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c



            movq    %rdx, %r12
            addq    (%r11), %r12
            leaq    4(%rdx), %r14
            movq    %r12, (%rsi)
            addq    $4, %r12
            movq    %r12, (%r10)
            movq    (%r11), %rcx
            addq    (%rsi), %rcx
            movq    %rcx, (%rbx)
            addq    $4, %rcx
            movq    %rcx, (%r9)
            movq    (%r11), %r13
            addq    (%rbx), %r13
            movq    %r13, (%r8)
            addq    $4, %r13
            movq    %r13, (%r15)
            movq    (%rax), %rcx
            movq    (%r8), %r12
            addq    $7, %rcx
            movsd   (%rcx,%r12,2), %xmm10
            movq    (%rbx), %r12
            movsd   (%rcx,%r13,2), %xmm13
            movq    (%r9), %r13
            movsd   (%rcx,%r12,2), %xmm6
            movq    (%rsi), %r12
            movsd   (%rcx,%r13,2), %xmm5
            movq    (%r10), %r13
            movsd   (%rcx,%r12,2), %xmm9
            leaq    (%r14,%r14), %r12
            movsd   (%rcx,%r13,2), %xmm11
            leaq    (%rcx,%rdx,2), %r13
            movsd   (%rcx,%r12), %xmm3
            movq    24(%rdi), %rcx
            movsd   (%r13), %xmm4
            addq    $8, %rdx
            movsd   15(%rcx), %xmm14
            movsd   7(%rcx), %xmm15
            movapd  %xmm14, %xmm8
            movapd  %xmm14, %xmm7
            movapd  %xmm15, %xmm12
            mulsd   %xmm10, %xmm8
            mulsd   %xmm13, %xmm12
            mulsd   %xmm15, %xmm10
            mulsd   %xmm14, %xmm13
            movsd   31(%rcx), %xmm2
            addsd   %xmm8, %xmm12
            movapd  %xmm15, %xmm8
            mulsd   %xmm6, %xmm7
            mulsd   %xmm5, %xmm14
            subsd   %xmm13, %xmm10
            mulsd   %xmm5, %xmm8
            movapd  %xmm2, %xmm13
            mulsd   %xmm6, %xmm15
            movapd  %xmm4, %xmm6
            xorpd   .LC5(%rip), %xmm13
            movapd  %xmm3, %xmm5
            addsd   %xmm7, %xmm8
            movapd  %xmm11, %xmm7
            subsd   %xmm14, %xmm15
            movapd  %xmm9, %xmm14
            movsd   23(%rcx), %xmm0
            subsd   %xmm12, %xmm7
            subsd   %xmm10, %xmm14
            movapd  %xmm13, %xmm1
            addsd   %xmm11, %xmm12
            movapd  %xmm2, %xmm11
            subsd   %xmm15, %xmm6
            addsd   %xmm4, %xmm15
            movapd  %xmm0, %xmm4
            mulsd   %xmm7, %xmm1
            addsd   %xmm9, %xmm10
            mulsd   %xmm14, %xmm4
            subsd   %xmm8, %xmm5
            mulsd   %xmm0, %xmm7
            addsd   %xmm3, %xmm8
            mulsd   %xmm13, %xmm14
            movapd  %xmm15, %xmm9
            mulsd   %xmm10, %xmm11
            mulsd   %xmm0, %xmm10
            addsd   %xmm1, %xmm4
            movapd  %xmm8, %xmm3
            movapd  %xmm5, %xmm1
            subsd   %xmm7, %xmm14
            movapd  %xmm0, %xmm7
            mulsd   %xmm12, %xmm7
            addsd   %xmm4, %xmm1
            mulsd   %xmm2, %xmm12
            movapd  %xmm6, %xmm2
            subsd   %xmm14, %xmm6
            addsd   %xmm14, %xmm2
            addsd   %xmm11, %xmm7
            subsd   %xmm12, %xmm10
            subsd   %xmm4, %xmm5
            addsd   %xmm7, %xmm3
            addsd   %xmm10, %xmm9
            subsd   %xmm10, %xmm15
            subsd   %xmm7, %xmm8
            movsd   %xmm9, (%r13)
            movq    (%rax), %rcx
            movsd   %xmm3, 7(%r12,%rcx)
            movq    (%rsi), %r13
            movq    (%rax), %rcx
            movsd   %xmm15, 7(%rcx,%r13,2)
            movq    (%r10), %r12
            movq    (%rax), %r13
            movsd   %xmm8, 7(%r13,%r12,2)
            movq    (%rbx), %rcx
            movq    (%rax), %r13
            movsd   %xmm2, 7(%r13,%rcx,2)
            movq    (%r9), %r12
            movq    (%rax), %rcx
            movsd   %xmm1, 7(%rcx,%r12,2)
            movq    (%r8), %r13
            movq    (%rax), %rcx
            movsd   %xmm6, 7(%rcx,%r13,2)
            movq    (%r15), %r12
            movq    (%rax), %r13
            movsd   %xmm5, 7(%r13,%r12,2)
            cmpq    %rdx, -104(%rsp)
            jg      .L2941

    Loop with -frename-registers -fno-move-loop-invariants

    /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

    .L2755:
            leaq    8(%rax), %rdx
            movq    %rcx, %r13
            leaq    -16(%rax), %r9
            leaq    -8(%rax), %r10
            leaq    -24(%rax), %r8
            leaq    -32(%rax), %rdi
            addq    (%rdx), %r13
            leaq    4(%rcx), %r14
            leaq    4(%r13), %rsi
            movq    %r13, (%r10)
            movq    %rsi, (%r9)
            addq    (%rdx), %r13
            leaq    -40(%rax), %rsi
            leaq    4(%r13), %r11
            movq    %r13, (%r8)
            movq    %r11, (%rdi)
            addq    (%rdx), %r13
            leaq    -48(%rax), %r11
            leaq    40(%rax), %rdx
            movq    %r13, (%rsi)
            addq    $4, %r13
            movq    %r13, (%r11)
            movq    (%rdx), %rbx
            movq    (%rsi), %r12
            addq    $7, %rbx
            movsd   (%rbx,%r12,2), %xmm11
            movq    (%r8), %r12
            movsd   (%rbx,%r13,2), %xmm9
            movq    (%rdi), %r13
            movsd   (%rbx,%r12,2), %xmm7
            movq    (%r10), %r12
            movsd   (%rbx,%r13,2), %xmm5
            movq    (%r9), %r13
            movsd   (%rbx,%r12,2), %xmm6
            leaq    (%r14,%r14), %r12
            movsd   (%rbx,%r13,2), %xmm14
            leaq    (%rbx,%rcx,2), %r13
            movsd   (%rbx,%r12), %xmm8
            movq    24(%rax), %rbx
            movapd  %xmm6, %xmm13
            addq    $8, %rcx
            movsd   (%r13), %xmm4
            cmpq    %rcx, %r15
            movsd   15(%rbx), %xmm1
            movsd   7(%rbx), %xmm2
            movapd  %xmm1, %xmm3
            movsd   31(%rbx), %xmm0
            movapd  %xmm2, %xmm10
            mulsd   %xmm11, %xmm3
            movapd  %xmm2, %xmm12
            mulsd   %xmm9, %xmm10
            mulsd   %xmm2, %xmm11
            mulsd   %xmm1, %xmm9
            mulsd   %xmm7, %xmm2
            addsd   %xmm10, %xmm3
            mulsd   %xmm5, %xmm12
            movapd  %xmm14, %xmm10
            movsd   23(%rbx), %xmm15
            subsd   %xmm9, %xmm11
            movapd  %xmm1, %xmm9
            mulsd   %xmm5, %xmm1
            movapd  %xmm8, %xmm5
            mulsd   %xmm7, %xmm9
            subsd   %xmm3, %xmm10
            movapd  %xmm4, %xmm7
            subsd   %xmm11, %xmm13
            addsd   %xmm6, %xmm11
            movsd   .LC5(%rip), %xmm6
            subsd   %xmm1, %xmm2
            xorpd   %xmm0, %xmm6
            addsd   %xmm14, %xmm3
            addsd   %xmm12, %xmm9
            movapd  %xmm15, %xmm14
            movapd  %xmm0, %xmm12
            subsd   %xmm2, %xmm7
            mulsd   %xmm13, %xmm14
            addsd   %xmm4, %xmm2
            movapd  %xmm6, %xmm4
            subsd   %xmm9, %xmm5
            mulsd   %xmm3, %xmm0
            addsd   %xmm8, %xmm9
            mulsd   %xmm10, %xmm4
            movapd  %xmm15, %xmm8
            mulsd   %xmm15, %xmm10
            mulsd   %xmm11, %xmm15
            movapd  %xmm7, %xmm1
            mulsd   %xmm13, %xmm6
            mulsd   %xmm3, %xmm8
            movapd  %xmm9, %xmm3
            mulsd   %xmm11, %xmm12
            addsd   %xmm14, %xmm4
            subsd   %xmm0, %xmm15
            movapd  %xmm5, %xmm0
            subsd   %xmm10, %xmm6
            movapd  %xmm2, %xmm10
            addsd   %xmm12, %xmm8
            addsd   %xmm15, %xmm10
            subsd   %xmm15, %xmm2
            addsd   %xmm6, %xmm1
            addsd   %xmm8, %xmm3
            movsd   %xmm10, (%r13)
            movq    (%rdx), %rbx
            subsd   %xmm8, %xmm9
            addsd   %xmm4, %xmm0
            subsd   %xmm6, %xmm7
            movsd   %xmm3, 7(%r12,%rbx)
            movq    (%r10), %r10
            movq    (%rdx), %r13
            subsd   %xmm4, %xmm5
            movsd   %xmm2, 7(%r13,%r10,2)
            movq    (%r9), %rbx
            movq    (%rdx), %r12
            movsd   %xmm9, 7(%r12,%rbx,2)
            movq    (%r8), %r13
            movq    (%rdx), %r10
            movsd   %xmm1, 7(%r10,%r13,2)
            movq    (%rdi), %r9
            movq    (%rdx), %rbx
            movsd   %xmm0, 7(%rbx,%r9,2)
            movq    (%rsi), %rsi
            movq    (%rdx), %r8
            movsd   %xmm7, 7(%r8,%rsi,2)
            movq    (%r11), %rdi
            movq    (%rdx), %r12
            movsd   %xmm5, 7(%r12,%rdi,2)
            jg      .L2755

Comment 70 lucier 2009-05-07 16:00:39 UTC
Created attachment 17819 [details]
time report related to comment 69, time for PR 31957 with no options
Comment 71 lucier 2009-05-07 16:02:40 UTC
Created attachment 17820 [details]
time for 31957, with rename-registers
Comment 72 lucier 2009-05-07 16:03:38 UTC
Created attachment 17821 [details]
time for 31957, with rename-registers no-move-loop-invariants
Comment 73 lucier 2009-05-07 16:04:25 UTC
Created attachment 17822 [details]
time for 31957, with rename-registers no-move-loop-invariants forward-propagate
Comment 74 Paolo Bonzini 2009-05-07 16:21:01 UTC
Ok.  One step at a time. :-)  To recap, here is the situation:

- the CSE optimization you mention was *not* removed, it was moved to fwprop, so it does not run at -O1.

- once this was done, the way to go is to tune new optimizations, not to reintroduce old ones

- for example, fwprop in turn triggered a bad choice in loop invariant motion, for which a patch has been posted.  This patch will remove the need for -fno-move-loop-invariants on this testcase (this is a deficiency in LIM that is not specific to machine-generated code; OTOH the presence of many fp[N] accesses helps trigger it).

- that scheduling is necessary now and not in 4.2.x is probably just a matter of luck

- why renaming registers is necessary now and not in 4.2.x is still a mystery, but there is an explanation for why it helps (it prolongs live ranges, something that on non-x86 archs is done by the pre-regalloc scheduling pass)

- at least we have a set of options providing good performance on this testcase, and guidance towards better tuning of the various problematic optimizations

To conclude, nobody is underestimating the significance of this PR, it's just a matter of priorities.  Near the end of the release cycle, you tend to look at PRs with small testcases to minimize the time spent understanding the code; near the beginning, you hope that new features magically fix the PRs and concentrate on wrong-code bugs and so on.  Complex P2s such as this one unfortunately tend to stay in limbo.
Comment 75 lucier 2009-05-07 16:31:42 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475


On May 7, 2009, at 12:21 PM, bonzini at gnu dot org wrote:

> ------- Comment #74 from bonzini at gnu dot org  2009-05-07 16:21  
> -------
> Ok.  One step at a time. :-)  To recap, here is the situation:
>
> - that scheduling is necessary now and not in 4.2.x, probably is  
> just a matter
> of luck

If you mean -fschedule-insns2, it has always been part of the options  
list.

> - at least we have a set of options providing good performance on this
> testcase, and guidance towards better tuning of the various  
> problematic
> optimizations

OK, but -fforward-propagate is not viable in general for these  
machine-generated codes.

>
Brad
Comment 76 Paolo Bonzini 2009-05-07 16:37:53 UTC
It should be possible to modify fwprop to avoid the excessive memory usage (basically by doing its own dataflow instead of using UD chains).
Comment 77 Steven Bosscher 2009-05-07 17:50:33 UTC
Re. comment #75: Just the fact that an option is enabled in both releases doesn't mean the pass behind it is doing the same thing in both releases. What the scheduler does depends heavily on the code you feed it.  Sometimes it is pure (good or bad) luck that changes the behavior of a pass in the compiler.  The interactions between all the pieces are just very complicated (which is why, IMHO, retargetable-compiler engineering is so difficult: controlling the pipeline is undoable).

Re. comment #76:
Sad as it may be, I think this is the best short-term solution.
Alternatively we could re-work fwprop to work on regions and use the partial-CFG dataflow stuff, similar to what the RTL loop optimizers (like loop-invariant) do.  To be honest, I'd much prefer the latter, but the DIY-fwprop thing is probably easier in the short term.

Comment 78 Paolo Bonzini 2009-05-08 06:51:28 UTC
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 06:51:12 2009
New Revision: 147270

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147270
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
        * loop-invariant.c (struct use): Add addr_use_p.
        (struct def): Add n_addr_uses.
        (struct invariant): Add cheap_address.
        (create_new_invariant): Set cheap_address.
        (record_use): Accept df_ref.  Set addr_use_p and update n_addr_uses.
        (record_uses): Pass df_ref to record_use.
        (get_inv_cost): Do not add inv->cost to comp_cost for cheap addresses used
	only as such.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/loop-invariant.c

Comment 79 Paolo Bonzini 2009-05-08 07:18:05 UTC
I'm cobbling up the DIY dataflow patch, and it is anything but ugly, actually.
Comment 80 Paolo Bonzini 2009-05-08 07:51:59 UTC
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 07:51:46 2009
New Revision: 147274

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147274
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
        * loop-invariant.c (record_use): Fix && vs. || mishap.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/loop-invariant.c

Comment 81 Paolo Bonzini 2009-05-08 07:55:25 UTC
Created attachment 17825 [details]
speed up fwprop and enable it at -O1

Here is a patch I'm bootstrapping to remove fwprop's usage of UD chains.  It does not affect the assembly output at all; it just changes the data structure that is used.

compiler.i is probably too big for me, but I tried slatex.i and fwprop was ~2% of compilation time with this patch.
Comment 82 Paolo Bonzini 2009-05-08 09:41:19 UTC
Hm, looking at the time reports, the patch will save about 30-40% of the fwprop execution time and should fix the memory-hog problem, but it will still leave the ~70 seconds needed to compute reaching definitions.  I guess it's a step forward for -O2 but borderline for -O1.
Comment 83 Paolo Bonzini 2009-05-08 12:22:46 UTC
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 12:22:30 2009
New Revision: 147282

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147282
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	PR 26854
	* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
	process_uses, build_single_def_use_links): New.
	(update_df): Update use_def_ref.
	(forward_propagate_into): Use get_def_for_use instead of use-def
	chains.
	(fwprop_init): Call build_single_def_use_links and let it initialize
	dataflow.
	(fwprop_done): Free use_def_ref.
	(fwprop_addr): Eliminate duplicate call to df_set_flags.
	* df-problems.c (df_rd_simulate_artificial_defs_at_top, 
	df_rd_simulate_one_insn): New.
	(df_rd_bb_local_compute_process_def): Update head comment.
	(df_chain_create_bb): Use the new RD simulation functions.
	* df.h (df_rd_simulate_artificial_defs_at_top, 
	df_rd_simulate_one_insn): New.
	* opts.c (decode_options): Enable fwprop at -O1.
	* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-problems.c
    trunk/gcc/df.h
    trunk/gcc/doc/invoke.texi
    trunk/gcc/fwprop.c
    trunk/gcc/opts.c

Comment 84 Paolo Bonzini 2009-05-15 10:35:45 UTC
Ok, I am working on a patch to add a multiple-definitions DF problem and use it together with a domwalk to find the single definitions (instead of reaching definitions, which is the remaining slow part).  The new problem has a bitvector sized by the number of registers rather than the number of defs (i.e., sized like the bitvectors for liveness), which means it will be fast.  It is defined as follows:

MDkill (B) = regs that have a def in B
MDinit (B) = (union of MDkill (P) for every P : B \in DomFrontier (P)) \cap LRin (B)
MDin (B)   = MDinit (B) \cup (union of MDout (P) for every predecessor P of B)
MDout (B)  = MDin (B) - MDkill (B)
Comment 85 lucier 2009-05-16 00:20:42 UTC
Created attachment 17878 [details]
Large test file for testing time and memory usage

This is the file compiler.i used in the previous tests.
Comment 86 lucier 2009-05-16 00:29:25 UTC
Created attachment 17879 [details]
Time and memory report for compiler.i

This is the time and memory report after the hack from

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39301#c8

to make the statistic fields HOST_WIDEST_INTs.

Some interesting lines:

fwprop.c:178 (build_single_def_use_links)        8      8438189160           82240               0    1027496
df-problems.c:311 (df_rd_alloc)             155420      8433928200      8433870880      8433870880          0
df-problems.c:593 (df_rd_transfer_functio   909666     40718919320      6755812320      6755736840    2025096
Total                                     13171390     61130398320
Comment 87 lucier 2009-05-16 00:33:12 UTC
The compiler options for the previous report:

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I.  -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers
-fno-move-loop-invariants -fforward-propagate -DHAVE_CONFIG_H -D___PRIMAL
-D___LIBRARY -c compiler.i -ftime-report -fmem-report > &
rename-no-move-loop-invariants-forward-propagate-report-new
Comment 88 Paolo Bonzini 2009-06-08 08:40:27 UTC
Created attachment 17963 [details]
patch I'm testing

Here is a patch I'm testing that completes the rewrite of fwprop's dataflow.  This should make it much faster and less memory hungry.  It should also keep the generated code fast (with -frename-registers of course), if not it's a bug in the patch.
Comment 89 Paolo Bonzini 2009-06-08 08:59:13 UTC
Created attachment 17964 [details]
correct version

oops, the previous one didn't work at -O1 even though it bootstrapped :-)
Comment 90 Paolo Bonzini 2009-06-08 16:35:18 UTC
Yo, with the patch the time to compile compiler.i with the given options is 331s on my machine (with a checking compiler).  Fwprop takes only 1% (including computation of the new dataflow problem).  I'd estimate around 250s with your nonchecking build.  I'll split it and post it tomorrow.
Comment 91 lucier 2009-06-08 18:19:43 UTC
Created attachment 17968 [details]
time and memory report for compiler.i after Paolo's patch

The patch cut the total bitmaps used compiling compiler.i from > 60GB to 3GB; maximum memory (just from top) was 1631MB.
Comment 92 Paolo Bonzini 2009-06-12 14:50:54 UTC
In the meantime, something caused "tree incremental SSA" to jump from 10s to 26s.  Sob.
Comment 93 Richard Biener 2009-06-13 14:18:35 UTC
I would say that was the new SRA.
Comment 94 Martin Jambor 2009-06-14 04:43:50 UTC
(In reply to comment #92)
> In the meanwhile something caused "tree incremental SSA" to jump up from 10s to
> 26s.  Sob.
> 

(In reply to comment #93)
> I would say that was the new SRA.
> 

OK, I'll try to investigate.  Which of the various attachments to this
bug is the one to look at?

Martin
Comment 95 lucier 2009-06-14 14:59:11 UTC
The test case is compiler.i.gz
Comment 96 lucier 2009-06-14 15:02:28 UTC
Sorry, the gcc options are in comment 87 (the -fforward-propagate is now redundant), and without Paolo's recently proposed patch it requires about 9GB of memory to compile.

Comment 97 Paolo Bonzini 2009-06-15 15:14:17 UTC
Brad, could you try to time compiler.i with and without -ftime-report to see how much of the "tree stmt walking" timevar is just accounting overhead?
Comment 98 lucier 2009-06-15 16:11:57 UTC
I don't quite understand how you would like me to configure and run the test.

First, I've applied your patches to speed up computing DF to my tree; do you want them included in the test, or should I use a pristine mainline?

Second, when configuring mainline, should I include, or not include

1.  --enable-gather-detailed-mem-stats
2.  --enable-checking=release

After that, I think you just want to run two compiles with and without -ftime-report, is that right?  (Nothing about -fmem-report.)
Comment 99 paolo.bonzini@gmail.com 2009-06-15 16:20:51 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance
 slowdown in floating-point code caused by  r118475

> First, I've applied your patches to speed up computing DF to my tree; do you
> want them included in the test, or should I use a pristine mainline?

It doesn't matter, but yes, use them.

> Second, when configuring mainline, should I include, or not include
> 
> 1.  --enable-gather-detailed-mem-stats
> 2.  --enable-checking=release

Again it shouldn't matter, but use only --enable-checking=release.

> After that, I think you just want to run two compiles with and without
> -ftime-report, is that right?  (Nothing about -fmem-report.)

Yes, and the output of -ftime-report is not needed.  Just the "time 
./cc1 ..." output for the two.  Thanks!
Comment 100 Paolo Bonzini 2009-06-15 16:22:22 UTC
Just as a reminder for after the fwprop patches are committed: the problem in CFG cleanup is that the iterative fixing of dominators in remove_edge_and_dominated_blocks is very expensive.  We should probably make sure dominator information is not kept alive across some key cfgcleanup passes.
Comment 101 Paolo Bonzini 2009-06-15 16:26:39 UTC
Time for cleanup.  This bug is fixed on mainline, and likely WONTFIX on 4.3/4.4 (though it could in principle be fixed by backporting the fwprop patches to 4.4).  I'll add some pointers to PR26854 for the attachments related to compile-time problems.
Comment 102 lucier 2009-06-15 19:57:47 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] 30%
 performance slowdown in floating-point code caused by  r118475

On Mon, 2009-06-15 at 16:20 +0000, paolo dot bonzini at gmail dot com
wrote:

> Yes, and the output of -ftime-report is not needed.  Just the "time 
> ./cc1 ..." output for the two.  Thanks!

The two commands:

time /pkgs/gcc-mainline/bin/gcc -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -c compiler.i 
261.424u 1.184s 4:22.76 99.9%	0+0k 0+28456io 0pf+0w
time /pkgs/gcc-mainline/bin/gcc -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -c compiler.i -ftime-report 
263.424u 4.900s 4:28.68 99.8%	0+0k 0+28480io 0pf+0w


Comment 103 lucier 2009-06-15 20:21:05 UTC
Regarding comment #101 ...

With

heine:~/programs/gcc/objdirs/gsc-fft-tests/gambc-v4_1_2> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --disable-multilib --enable-checking=release
Thread model: posix
gcc version 4.5.0 20090608 (experimental) [trunk revision 148276] (GCC) 

(and including Paolo's patch to speed up DF), the routine in direct.c takes

    168 ms cpu time (168 user, 0 system)

As reported here

http://www.math.purdue.edu/~lucier/bugzilla/9/

with gcc-4.2.4, this routine takes 156 ms on the same machine.

Comment #9 gives the code that 4.2.4 generates at the start of the main loop;  the start of the main loop with the version of 4.5.0 I gave above is:

.L2938:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rbx
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rdi
        movq    40(%rax), %rdx
        movq    %rdi, -48(%rax)
        movsd   7(%rdx,%rdi,2), %xmm7
        movq    -40(%rax), %rdi
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rdi,2), %xmm10
        movq    -32(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm5
        movq    -24(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm6
        movq    -16(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm13
        movq    -8(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm11
        leaq    (%rbx,%rbx), %rdi
        movsd   7(%rdi,%rdx), %xmm9
        movq    24(%rax), %rdx
        movapd  %xmm11, %xmm14
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm8
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm10, %xmm8
        mulsd   %xmm7, %xmm12
        mulsd   %xmm2, %xmm10
        mulsd   %xmm1, %xmm7
        movsd   23(%rdx), %xmm0

So, to my mind, this is still a 4.5 regression: there is still a slowdown, and the code is still much less optimized by 4.5.0 than by 4.2.4.  168/156 ~ 1.08, so if you want to change the Summary of this bug to an 8% regression, or something else, that's fine, but I've changed this PR back to being a 4.5 regression.

I was not really thrilled when Richard marked PR 39157 as a duplicate of this PR.  To my mind, there are three more or less independent things---run time of Gambit-generated code, compile time of the code, and the space required to compile the code.  This PR is about run time; PR 39157 was about space needed by the compiler; PR 26854 is about compile time.  They seem to have all been mushed together.
Comment 104 Paolo Bonzini 2009-06-16 06:47:53 UTC
I understood that with -frename-registers the regression is fixed.  As I said, without a pre-regalloc scheduling pass and without register renaming, the scheduling quality you get is more or less random.
Comment 105 Paolo Bonzini 2009-06-16 07:01:43 UTC
Marking PR39157 as a duplicate of PR26854 is not exact (only the fwprop part is a duplicate, since the large compile times came from building large data structures; the CFG cleanup part is not really a duplicate), but I don't think it matters, because we have a patch for the fwprop issue anyway.
Comment 106 lucier 2009-06-16 07:24:45 UTC
This machine has 4 ms clock ticks, so we're down to a few ticks' difference with a benchmark of this size.  It's 156 ms with 4.2.4, 168 ms with 4.5.0, and 164 ms when -frename-registers is added to the command line.

It's not just scheduling; there are more memory accesses with 4.5.0.

With a problem roughly 10 times as large, the times are

4.2.4:  2912ms
4.5.0:  3204ms
4.5.0:  3120ms (adding -frename-registers)

So there's a 7% difference with -frename-registers.
Comment 107 Richard Biener 2009-08-04 12:28:30 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 108 lucier 2009-08-27 01:18:38 UTC
direct.c contains a direct FFT; I compiled the direct and inverse FFTs and ran them on arrays with 2^23 double-precision complex elements, using

heine:~/programs/gcc/objdirs/bench-mainline-on-fft> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline --enable-languages=c,c++ -enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC) 

The compile options were

/pkgs/gcc-mainline/bin/gcc -save-temps -c -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared -fschedule-insns

and the same without -fschedule-insns.

The runtime for the direct+inverse FFT with instruction scheduling was 1.264 seconds, and without -fschedule-insns it was 1.444 seconds, a 14% speedup from that one compiler option.  This is on a 2.33GHz Core 2 Quad machine.

I'll attach the inner loops of direct.c with and without -fschedule-insns.

I haven't been able to compile the complete Gambit runtime with -fschedule-insns on either x86-64 or ppc64; I've filed PR41164 and PR41176 for those two different failures.
Comment 109 lucier 2009-08-27 01:22:11 UTC
Created attachment 18432 [details]
inner loop of direct.c with -fschedule-insns
Comment 110 lucier 2009-08-27 01:22:58 UTC
Created attachment 18433 [details]
inner loop of direct.c without -fschedule-insns
Comment 111 lucier 2009-08-27 17:02:10 UTC
I can compile Gambit 4.1.2 with -fschedule-insns, except for the function noted in PR41164.

On

model name	: Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz

with

gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC) 

the times with -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    144 ms cpu time (144 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    136 ms cpu time (136 user, 0 system)

and the times without -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    168 ms cpu time (168 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    172 ms cpu time (172 user, 0 system)

That's a pretty big improvement.
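To put a number on "pretty big" (my own arithmetic from the ms figures above, combining the direct and inverse runs):

```python
# Combined direct+inverse FFT CPU times (ms) from the runs above.
with_sched = 144 + 136      # -fschedule-insns enabled
without_sched = 168 + 172   # -fschedule-insns disabled

# Speedup of the scheduled build, as a percentage of its own runtime.
speedup = (without_sched - with_sched) / with_sched * 100.0
print(f"{speedup:.0f}% faster with -fschedule-insns")  # prints "21% faster with -fschedule-insns"
```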
Comment 112 Peter Bergner 2009-10-03 01:39:35 UTC
Subject: Bug 33928

Author: bergner
Date: Sat Oct  3 01:39:14 2009
New Revision: 152430

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=152430
Log:
	Backport from mainline.

	2009-08-30  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (get_reg_use_in): Delete.
	(free_load_extend): New function.
	(forward_propagate_subreg): Use it.

	2009-08-23  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (try_fwprop_subst): Allow multiple sets.
	(get_reg_use_in): New function.
	(forward_propagate_subreg): Propagate through subreg of zero_extend
	or sign_extend.

	2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	PR 26854
	* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
	process_uses, build_single_def_use_links): New.
	(update_df): Update use_def_ref.
	(forward_propagate_into): Use get_def_for_use instead of use-def
	chains.
	(fwprop_init): Call build_single_def_use_links and let it initialize
	dataflow.
	(fwprop_done): Free use_def_ref.
	(fwprop_addr): Eliminate duplicate call to df_set_flags.
	* df-problems.c (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	(df_rd_bb_local_compute_process_def): Update head comment.
	(df_chain_create_bb): Use the new RD simulation functions.
	* df.h (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	* opts.c (decode_options): Enable fwprop at -O1.
	* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_3-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_3-branch/gcc/REVISION
    branches/ibm/gcc-4_3-branch/gcc/df-problems.c
    branches/ibm/gcc-4_3-branch/gcc/df.h
    branches/ibm/gcc-4_3-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_3-branch/gcc/fwprop.c
    branches/ibm/gcc-4_3-branch/gcc/opts.c

Comment 113 Peter Bergner 2010-04-29 14:34:59 UTC
Subject: Bug 33928

Author: bergner
Date: Thu Apr 29 14:34:35 2010
New Revision: 158902

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158902
Log:
	Backport from mainline.

	2009-08-30  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (get_reg_use_in): Delete.
	(free_load_extend): New function.
	(forward_propagate_subreg): Use it.

	2009-08-23  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (try_fwprop_subst): Allow multiple sets.
	(get_reg_use_in): New function.
	(forward_propagate_subreg): Propagate through subreg of zero_extend
	or sign_extend.

	2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	PR 26854
	* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
	process_uses, build_single_def_use_links): New.
	(update_df): Update use_def_ref.
	(forward_propagate_into): Use get_def_for_use instead of use-def
	chains.
	(fwprop_init): Call build_single_def_use_links and let it initialize
	dataflow.
	(fwprop_done): Free use_def_ref.
	(fwprop_addr): Eliminate duplicate call to df_set_flags.
	* df-problems.c (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	(df_rd_bb_local_compute_process_def): Update head comment.
	(df_chain_create_bb): Use the new RD simulation functions.
	* df.h (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	* opts.c (decode_options): Enable fwprop at -O1.
	* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_4-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_4-branch/gcc/df-problems.c
    branches/ibm/gcc-4_4-branch/gcc/df.h
    branches/ibm/gcc-4_4-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_4-branch/gcc/fwprop.c
    branches/ibm/gcc-4_4-branch/gcc/opts.c

Comment 114 Richard Biener 2010-05-22 18:11:46 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 115 Richard Biener 2011-03-04 11:58:13 UTC
Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
can't produce numbers for the upcoming 4.6 release.  Brad, can you do so
if you have time?

Thanks.

Btw, how difficult is it to set up continuous performance testing of Gambit?
Is Gambit reasonably self-contained (no external dependencies, command-line driven)?  I'm considering adding it to http://gcc.opensuse.org/c++bench/
I can probably get it built, but would appreciate hints on how to set up an
automated performance test.
Comment 116 lucier 2011-03-04 16:09:13 UTC
On Fri, 2011-03-04 at 11:59 +0000, rguenth at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
> 
> --- Comment #115 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-03-04 11:58:13 UTC ---
> Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
> can't produce numbers for the upcoming 4.6 release.  Brad, can you do so
> if you have time?

I'll work on it.

I just went through all the comments in this bug report to remind me of
the issues, of which there seem to be two.  The first is the runtime
performance of the direct FFT in direct.c, as discussed, e.g., in
comment 103

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928#c103

and the second is the compile-time performance.

I presume you want to know about the performance of the FFT code.  This
is a very specific benchmark (one routine) and would not be indicative
of general performance.

> Btw, how difficult is it to setup a continuous performance testing of Gambit?
> Is Gambit reasonably self-contained (no external dependenices,
> commandline-driven)?  I'm considering to add it to
> http://gcc.opensuse.org/c++bench/
> I probably can get it built but would appreciate hints on how to setup an
> automated performance test.
> 

It's completely self-contained and very portable, and benchmarking could be
automated.  It has a benchmark suite that measures the runtime and
compile-time performance of a number of programs.  Most are small, but some
are larger: compiling those used to take quite a few GB of memory and several
minutes or more of CPU time, so they are not benchmarked by default.  Would
you want to run these as extreme tests of the compiler?

I'll talk with Marc Feeley, the author of Gambit, about how to automate
the benchmarks; it will probably require just "make bench" with various
options if desired.

Brad
Comment 117 rguenther@suse.de 2011-03-04 16:14:55 UTC
On Fri, 4 Mar 2011, lucier at math dot purdue.edu wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
> 
> --- Comment #116 from lucier at math dot purdue.edu 2011-03-04 16:09:13 UTC ---
> On Fri, 2011-03-04 at 11:59 +0000, rguenth at gcc dot gnu.org wrote:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
> > 
> > --- Comment #115 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-03-04 11:58:13 UTC ---
> > Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
> > can't produce numbers for the upcoming 4.6 release.  Brad, can you do so
> > if you have time?
> 
> I'll work on it.

Thanks.

> I just went through all the comments in this bug report to remind me of
> the issues, of which there seem to be two.  The first is the runtime
> performance of the direct FFT in direct.c, as discussed, e.g., in
> comment 103
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928#c103
> 
> and the second is the compile-time performance.
> 
> I presume you want to know about the performance of the FFT code.  This
> is a very specific benchmark (one routine) and would not be indicative
> of general 

Yes, I want to know about runtime performance.

> > Btw, how difficult is it to setup a continuous performance testing of Gambit?
> > Is Gambit reasonably self-contained (no external dependenices,
> > commandline-driven)?  I'm considering to add it to
> > http://gcc.opensuse.org/c++bench/
> > I probably can get it built but would appreciate hints on how to setup an
> > automated performance test.
> > 
> 
> It's completely self-contained and very portable.  Benchmarking could be
> automated.  It has a benchmark suite that measures runtime and
> compile-time performance of a number of programs, most small, but some
> larger (so compilation used to take quite a few GB of memory and several
> minutes or more of CPU time; these are not benchmarked by default; would
> you want to run these as extreme tests of the compiler?).
> 
> I'll talk with Marc Feeley, the author of Gambit, about how to automate
> the benchmarks; it will probably require just "make bench" with various
> options if desired.

Ah, so it's not Gambit from TAMU (the game theory software) then ;)

Richard.
Comment 118 lucier 2011-03-10 18:50:12 UTC
On Fri, 2011-03-04 at 11:59 +0000, rguenth at gcc dot gnu.org wrote:

> Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
> can't produce numbers for the upcoming 4.6 release.  Brad, can you do so
> if you have time?

At 

http://www.math.purdue.edu/~lucier/bugzilla/14/

is a Readme file and a tarball; I think it should be easy to script a
runtime test for this PR from the instructions in the Readme file.

Later we'll devise a "make bench" for general Gambit benchmarking.

Brad
Comment 119 lucier 2011-03-10 19:55:54 UTC
It's nearly impossible to examine the assembly code responsible for the FFT in the package I set up in the previous comment.  If you want a runtime benchmark for this PR where you can easily examine the code, I'll have to do more work.
Comment 120 lucier 2011-03-10 22:00:22 UTC
At

http://www.math.purdue.edu/~lucier/bugzilla/15/

I've put a tarfile and instructions that allow one to build Gambit-C in
a way that splits out the FFT code into its own C function, so the
assembly code can be more easily examined.

Brad
Comment 121 lucier 2011-04-02 16:58:16 UTC
I'm inclined to close this as "Fixed" for 4.6.0.

I've taken the file mentioned in the previous comment and followed the instructions in the readme.  The times for a forward FFT of 2^25 complex doubles on a 2.4 GHz Intel Core i5 on x86_64-apple-darwin10.7.0 are as follows:

With the usual compiler options of

-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp

4.5.2:

    2433 ms cpu time (2427 user, 6 system)

4.6.0:

    2158 ms cpu time (2154 user, 4 system)

Adding -fschedule-insns -march=native to the above:

4.5.2:

    2067 ms cpu time (2060 user, 7 system)

4.6.0:

    2016 ms cpu time (2012 user, 4 system)

The assembly for the main loop looks much better.
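Summarizing the release comparison numerically (again my own arithmetic, assuming only the CPU times listed above):

```python
# Reported CPU times (ms) for the forward FFT of 2^25 complex doubles.
times = {
    "usual options": {"4.5.2": 2433, "4.6.0": 2158},
    "plus -fschedule-insns -march=native": {"4.5.2": 2067, "4.6.0": 2016},
}

for flags, t in times.items():
    gain = (t["4.5.2"] - t["4.6.0"]) / t["4.6.0"] * 100.0
    print(f"{flags}: 4.6.0 faster by {gain:.1f}%")
```

With these numbers, 4.6.0 comes out roughly 12.7% faster at the usual -O1 options and about 2.5% faster once -fschedule-insns -march=native is added, i.e. most of the remaining gap is closed by 4.6.0 even without extra flags.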
Comment 122 lucier 2011-04-02 17:05:10 UTC
Just to be clear, the command to run the test is

gsi/gsi -e '(define a (expt 3 100000000))(set! *bench-bignum-fft* #t)(define b (* a a))'
Comment 123 Richard Biener 2011-04-05 12:02:16 UTC
Fixed for GCC 4.6.