47167 – Performance regression in numerical code

Bug 47167 - Performance regression in numerical code

Summary: Performance regression in numerical code

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	other (show other bugs)
Version:	4.6.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2011-01-04 14:44 UTC by Martin Reinecke
Modified:	2011-01-20 10:36 UTC (History)
CC List:	4 users (show)

See Also:
Host:	x86_64-unknown-linux-gnu
Target:	x86_64-unknown-linux-gnu
Build:	x86_64-unknown-linux-gnu
Known to work:	4.5.1
Known to fail:	4.6.0
Last reconfirmed:

Attachments
test case (46.34 KB, application/gzip) 2011-01-04 14:44 UTC, Martin Reinecke	Details
shorter test case (4.75 KB, application/octet-stream) 2011-01-05 14:42 UTC, Martin Reinecke	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Reinecke 2011-01-04 14:44:53 UTC

Created attachment 22897 [details]
test case

When compiling the attached testcase on a machine with a Core 2 Duo E8500 CPU and 64bit Linux using

gcc -O2 -fomit-frame-pointer testcase.i -lm

the results with gcc 4.5.1 are

Testing map analysis accuracy.
lmax=2047, 0 iterations, spin=0

Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)

iteration 0:
wall time for alm2map: 8.811477s
wall time for map2alm: 9.204556s
component 0: rms 1.390734e-13, maxerr 1.582512e-12

However, with current trunk one obtains

Testing map analysis accuracy.
lmax=2047, 0 iterations, spin=0

Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)

iteration 0:
wall time for alm2map: 9.518667s
wall time for map2alm: 9.780509s
component 0: rms 1.390734e-13, maxerr 1.582512e-12

The numerical result is identical, but the code generated by the more recent compiler is noticeably slower.

Reducing the test case is unfortunately not trivial; the computational hot spots are located in pshtd_inner_loop() and Ylmgen_recalc_Ylm_sse2().

Please let me know if I can provide further information.

Comment 1 Martin Reinecke 2011-01-05 14:42:20 UTC

Created attachment 22904 [details]
shorter test case

More compact test case; the hot spot is marked with "CRITICAL LOOP".
Compile with "gcc -O2 -fomit-frame-pointer testcase2.c -lm" and
test using "time ./a.out".

Comment 2 Uroš Bizjak 2011-01-05 17:31:20 UTC

The only difference in the hot loop is the usage of two regs in the address:

4.5.1:

.L142:
	movapd	%xmm0, (%rcx)
	mulpd	%xmm6, %xmm2
	addq	$32, %rbx
	movapd	%xmm1, %xmm6
	mulpd	%xmm0, %xmm6
	movsd	(%rax), %xmm1
	movsd	8(%rax), %xmm3
	unpcklpd	%xmm1, %xmm1
	subpd	%xmm2, %xmm6
	unpcklpd	%xmm3, %xmm3
	mulpd	%xmm9, %xmm1
	mulpd	%xmm0, %xmm3
	movapd	%xmm6, 16(%rcx)
	addq	$32, %rcx
	movapd	%xmm1, %xmm0
	movsd	16(%rax), %xmm1
	mulpd	%xmm6, %xmm0
	unpcklpd	%xmm1, %xmm1
	movsd	24(%rax), %xmm2
	addq	$32, %rax
	cmpq	%rsi, %rbx
	unpcklpd	%xmm2, %xmm2
	subpd	%xmm3, %xmm0
	mulpd	%xmm9, %xmm1
	jne	.L142

4.6:

.L167:
	movapd	%xmm0, %xmm10
.L143:
	mulpd	%xmm2, %xmm6
	movapd	%xmm3, %xmm2
	movapd	%xmm10, (%rsi,%rcx)
	mulpd	%xmm10, %xmm2
	movsd	(%rdx), %xmm0
	movsd	8(%rdx), %xmm1
	subpd	%xmm6, %xmm2
	unpcklpd	%xmm0, %xmm0
	unpcklpd	%xmm1, %xmm1
	mulpd	%xmm11, %xmm0
	movapd	%xmm2, 16(%rsi,%rcx)
	mulpd	%xmm10, %xmm1
	addq	$32, %rcx
	mulpd	%xmm2, %xmm0
	movsd	16(%rdx), %xmm3
	movsd	24(%rdx), %xmm6
	addq	$32, %rdx
	cmpq	%rdi, %rcx
	unpcklpd	%xmm3, %xmm3
	unpcklpd	%xmm6, %xmm6
	subpd	%xmm1, %xmm0
	mulpd	%xmm11, %xmm3
	jne	.L167

Given the comment above ix86_address_cost:

/* Return cost of the memory address x.
   For i386, it is better to use a complex address than let gcc copy
   the address into a reg and make a new pseudo.  But not if the address
   requires to two regs - that would mean more pseudos with longer
   lifetimes.  */

this could be the reason for slowdown.

Comment 3 Uroš Bizjak 2011-01-05 19:30:49 UTC

> this could be the reason for slowdown.

Hm, not really.

But, by changing the generated assembly around loop entry:

$ diff -u testcase2.s testcase2_.s
--- testcase2.s	2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s	2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
 	addq	%r15, %rdx
 	addq	$1, %rdi
 	salq	$5, %rdi
+	movapd	%xmm10, %xmm0
 	jmp	.L143
 	.p2align 4,,10
 	.p2align 3
@@ -1687,7 +1688,7 @@
 	mulpd	%xmm2, %xmm6
 	movapd	%xmm3, %xmm2
 	movapd	%xmm10, (%rsi,%rcx)
-	mulpd	%xmm10, %xmm2
+	mulpd	%xmm0, %xmm2
 	movsd	(%rdx), %xmm0
 	movsd	8(%rdx), %xmm1
 	subpd	%xmm6, %xmm2

The slowdown is magically fixed:

$ gcc -lm testcase2_.s
$ time ./a.out

real	0m4.041s
user	0m4.034s
sys	0m0.003s

versus:

$ gcc -lm testcase2.s
$ time ./a.out

real	0m4.239s
user	0m4.234s
sys	0m0.001s

The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction. The functionality of the test didn't change due to existing "movapd    %xmm0, %xmm10" at the top of the loop and added extra "movapd	%xmm10, %xmm0" before the loop.

This all happens on SnB, the code is generated with -O2 only.

H.J., any ideas?

Comment 4 Uroš Bizjak 2011-01-05 19:48:58 UTC

Applying the same medicine to original test gets us from:

wall time for map2alm: 6.908527s

to

wall time for map2alm: 6.703142s

where 4.5.1 wins with:

wall time for map2alm: 6.651740s

Comment 5 H.J. Lu 2011-01-05 20:09:11 UTC

(In reply to comment #3)
> > this could be the reason for slowdown.
> 
....
> 
> $ gcc -lm testcase2.s
> $ time ./a.out
> 
> real    0m4.239s
> user    0m4.234s
> sys    0m0.001s
> 
> The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction.
> The functionality of the test didn't change due to existing "movapd    %xmm0,
> %xmm10" at the top of the loop and added extra "movapd    %xmm10, %xmm0" before
> the loop.
> 
> This all happens on SnB, the code is generated with -O2 only.
> 
> H.J., any ideas?

Some loop performance is very sensitive to code sizes.  This change

-    mulpd    %xmm10, %xmm2
+    mulpd    %xmm0, %xmm2

will impact loop size due to exta REX prefix.

Comment 6 Uroš Bizjak 2011-01-06 07:38:11 UTC

(In reply to comment #5)

> Some loop performance is very sensitive to code sizes.  This change
> 
> -    mulpd    %xmm10, %xmm2
> +    mulpd    %xmm0, %xmm2
> 
> will impact loop size due to exta REX prefix.

Adding nop (or several of them, FWIW) around changed mulpd insn does not affect the performance, so IMO it is not the loop layout that matters in this case.

Comment 7 Martin Reinecke 2011-01-19 14:16:18 UTC

OK, I located the problematic commit, at least on the 4.5 branch: it's revision number 167492 (fix for PR tree-optimization/46806).

Between revisions 167491 and 167492 the CPU time for the testcase2.c jumps from 4.7s to 5.4s.

Do you think that anything can be done about this regression?

Comment 8 Richard Biener 2011-01-19 16:13:31 UTC

Can you check if the following patch solves your problem?

Index: tree-ssa-copyrename.c
===================================================================
--- tree-ssa-copyrename.c       (revision 168987)
+++ tree-ssa-copyrename.c       (working copy)
@@ -226,8 +226,11 @@ copy_rename_partition_coalesce (var_map
       ign2 = false;
     }
 
-  /* Don't coalesce if the two variables are not of the same type.  */
-  if (TREE_TYPE (root1) != TREE_TYPE (root2))
+  /* Don't coalesce if the two variables are not compatible .  */
+  if (!types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2))
+      || ((TREE_CODE (TREE_TYPE (root1)) == ENUMERAL_TYPE
+          || TREE_CODE (TREE_TYPE (root2)) == ENUMERAL_TYPE)
+         && TREE_TYPE (root1) != TREE_TYPE (root2)))
     {
       if (debug)
        fprintf (debug, " : Different types.  No coalesce.\n");


The differences in GIMPLE of the patch do not explain the code-differences
though, so it might be just bad luck that the patch regressed things for
you.  I can see other unwanted differences though.

Comment 9 Martin Reinecke 2011-01-19 17:26:31 UTC

(In reply to comment #8)
> Can you check if the following patch solves your problem?

Yes, this patch gets performance back to normal on the 4.5 branch and on trunk. Great!

> The differences in GIMPLE of the patch do not explain the code-differences
> though, so it might be just bad luck that the patch regressed things for
> you.  I can see other unwanted differences though.

I would of course be happy if the code generation would be less "erratic", and if the nice performance I'm seeing does not depend on my luck ;)
So if I can do anything to help optimizing this kind of code more consistently, please let me know! Of course, I'm more into numerics than into compiler writing ...

Comment 10 Richard Biener 2011-01-20 10:33:18 UTC

Author: rguenth
Date: Thu Jan 20 10:33:15 2011
New Revision: 169050

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169050
Log:
2011-01-20  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/47167
	* tree-ssa-copyrename.c (copy_rename_partition_coalesce):
	Revert previous change, only avoid enumeral type changes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-copyrename.c

Comment 11 Richard Biener 2011-01-20 10:36:32 UTC

Author: rguenth
Date: Thu Jan 20 10:36:29 2011
New Revision: 169051

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169051
Log:
2011-01-20  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/47167
	* tree-ssa-copyrename.c (copy_rename_partition_coalesce):
	Revert previous change, only avoid enumeral type changes.

Modified:
    branches/gcc-4_5-branch/gcc/ChangeLog
    branches/gcc-4_5-branch/gcc/tree-ssa-copyrename.c

Comment 12 Richard Biener 2011-01-20 10:36:50 UTC

Fixed.