42027 – [4.5 Regression] Performance regression in convolution loop optimization

Bug 42027 - [4.5 Regression] Performance regression in convolution loop optimization

Summary: [4.5 Regression] Performance regression in convolution loop optimization

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.5.0

Importance:	P2 normal
Target Milestone:	4.5.0
Assignee:	Michael Matz

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2009-11-13 09:49 UTC by Nicolas BENOIT
Modified:	2009-12-20 13:11 UTC (History)
CC List:	3 users (show)

See Also:
Host:	i686-pc-linux-gnu
Target:	i686-pc-linux-gnu
Build:	i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:	2009-12-17 00:27:36

Attachments
Source file with a convolution loop pattern. (200 bytes, text/plain) 2009-11-13 09:51 UTC, Nicolas BENOIT	Details
Diff of the RTL expand dump between revisions 151079 and 151080 (1.61 KB, text/plain) 2009-12-16 12:55 UTC, Nicolas BENOIT	Details
Real fix (710 bytes, patch) 2009-12-17 01:42 UTC, Michael Matz	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Nicolas BENOIT 2009-11-13 09:49:39 UTC

GCC trunk rev. 154141 seems to handle less efficiently a convolution code than previous stable releases, it was also spotted in revision 153048.

Here are some average timings on an Intel E5320 clocked at 1.86 GHz with 4 MB of L2 cache, Debian GNU/Linux with a 2.6.26 kernel.

* with -O2 -march=native
GCC 4.3.2               8239 ms
GCC-4.4.2               8102 ms
GCC-snapshot-20091105   9347 ms
GCC-trunk-r154141       9343 ms

* with -O2
GCC 4.3.2               8128 ms
GCC-4.4.2               8158 ms
GCC-snapshot-20091105   9824 ms
GCC-trunk-r154141       9828 ms

* with -O1
GCC 4.3.2              20926 ms
GCC-4.4.2               8277 ms
GCC-snapshot-20091105   9369 ms
GCC-trunk-r154141       9375 ms

* with -O0
GCC 4.3.2              34061 ms
GCC-4.4.2              34241 ms
GCC-snapshot-20091105  34903 ms
GCC-trunk-r154141      34910 ms


GCC compiled with : configure --prefix=/export/home/nicolas/gcc/trunk-install --enable-languages=c --disable-multilib --disable-bootstrap --enable-checking=release

I haven't been able to track down the origin of the performance difference.


Note that data are not initialized in the attached code, as the slowdown is observed wether they are or not.

---BEGIN code---
#define N  1024*512
#define M  512
#define ITER 16

double in[N];
double H[M];
double vH[N];

int main ( int argc,
           char **argv )
{
  int i, j, k;

  for ( i=0; i<ITER; ++i )
    for ( j=0; j<N; ++j )
      for ( k=0; (k<M)&&(k<=j); ++k )
        vH[j] += H[k]*in[j-k];

  return (int) vH[argc];
}
---END code---

Comment 1 Nicolas BENOIT 2009-11-13 09:51:50 UTC

Created attachment 19010 [details]
Source file with a convolution loop pattern.

Comment 2 Richard Biener 2009-11-13 13:49:47 UTC

It looks like it is induction variable related.

Comment 3 Nicolas BENOIT 2009-11-26 15:08:54 UTC

Using integer instead of double, the performance difference is even more noticeable :

* with -O1
GCC 4.4.2               7475 ms
GCC-trunk-r154672       9390 ms

Comment 4 Nicolas BENOIT 2009-12-01 10:11:20 UTC

It seems that this regression first appeared with revision 151080

* with -O1
GCC-4.4.2          7.4 s
GCC-trunk-r151078  7.4 s
GCC-trunk-r151079  7.4 s
GCC-trunk-r151080  9.4 s
GCC-trunk-r151081  9.4 s
GCC-trunk-r151082  9.4 s

Changelog for revision 151080
http://gcc.gnu.org/ml/gcc-patches/2009-08/msg01336.html

009-08-25  Michael Matz  <matz@suse.de>

        * expr.h (jumpifnot_1, jumpif_1, do_jump_1): Declare.
        * dojump.c (do_jump_by_parts_greater): Take two operands instead of
        full expression.
        (do_jump_by_parts_equality, do_compare_and_jump): Ditto.
        (jumpifnot_1, jumpif_1): New wrappers for do_jump_1.
        (do_jump): Split out code for simple binary comparisons into ...
        (do_jump_1): ... this, taking the individual operands and code.
        Change callers to helper function above accordingly.
        * expr.c (expand_expr_real_1): Use jumpifnot_1 for simple binary
        comparisons.

Comment 5 Michael Matz 2009-12-01 11:24:11 UTC

Hmpf, something fishy is going on, as this patch should have been only
refactoring without influence on the generated code.  I'll look at it.

Comment 6 Michael Matz 2009-12-13 21:51:45 UTC

Subject: Bug 42027

Author: matz
Date: Sun Dec 13 21:51:34 2009
New Revision: 155196

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155196
Log:
	PR tree-optimization/42027
	* dojump.c (do_jump <TRUTH_AND_EXPR, TRUTH_OR_EXPR>): Go to
	TRUTH_ANDIF_EXPR resp. TRUTH_ORIF_EXPR expander, instead of
	falling through.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/dojump.c

Comment 7 Michael Matz 2009-12-13 21:53:39 UTC

Fixed.

Comment 8 Nicolas BENOIT 2009-12-16 10:34:54 UTC

I am confused, a performance regression is still noticeable:

* Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1 flag
GCC-4.4.2          7364 ms
GCC-trunk-r155286  9515 ms

* Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1 flag
GCC-4.4.1          5960 ms
GCC-trunk-r155286  7355 ms


Here is a diff on the assembly generated for the Intel E5320:

$ diff 442/convol.s r155286/convol.s
11c11
< 	subl	$8, %esp
---
> 	subl	$12, %esp
13d12
< 	movl	$H, %esi
17c16
< 	imull	(%esi,%eax,4), %ebx
---
> 	imull	H(,%eax,4), %ebx
22c21
< 	jg	.L10
---
> 	setle	%bl
24,25c23,25
< 	jle	.L3
< .L10:
---
> 	setle	-21(%ebp)
> 	testb	%bl, -21(%ebp)
> 	jne	.L3
28c28
< .L6:
---
> .L5:
31,32c31,32
< 	je	.L5
< .L8:
---
> 	je	.L4
> .L7:
34c34
< 	js	.L6
---
> 	js	.L5
40c40
< .L5:
---
> .L4:
43c43
< 	je	.L7
---
> 	je	.L6
46,47c46,47
< 	jmp	.L8
< .L7:
---
> 	jmp	.L7
> .L6:
50c50
< 	addl	$8, %esp
---
> 	addl	$12, %esp
60c60
< 	.ident	"GCC: (GNU) 4.4.2"
---
> 	.ident	"GCC: (GNU) 4.5.0 20091216 (experimental)"

Comment 9 Nicolas BENOIT 2009-12-16 11:06:40 UTC

Here is a unified diff which focuses on the inner-loop exit conditions.

--- 442/convol.s
+++ r155286/convol.s

 .L3:
 	movl	(%edx), %ebx
-	imull	(%esi,%eax,4), %ebx
+	imull	H(,%eax,4), %ebx
 	addl	%ebx, %ecx
 	addl	$1, %eax
 	subl	$4, %edx
 	cmpl	$511, %eax
-	jg	.L10
+	setle	%bl
 	cmpl	%edi, %eax
-	jle	.L3
-.L10:
+	setle	-21(%ebp)
+	testb	%bl, -21(%ebp)
+	jne	.L3
 	movl	-16(%ebp), %eax
 	movl	%ecx, vH(,%eax,4)
-.L6:
+.L5:


L3 corresponds to the inner loop body.

Comment 10 Richard Biener 2009-12-16 12:24:21 UTC

Which is the good version, the one with less or the one with more branches?

Comment 11 Nicolas BENOIT 2009-12-16 12:53:53 UTC

The fastest is the variant with more jumps (442/convol.s in the diff) generated by GCC-4.4.2.
In the one jump variant (r155286/convol.s in the diff), I guess it is the computing of both conditions before jumping which slows down the program.

Looking between revisions 151079 and 151080 (which is when the regression first appeared), the RTL expand dump shows a difference regarding branch prediction. I do not know if it is relevant:

--- r151079/convol.c.135r.expand	2009-12-01 14:10:55.000000000 +0100
+++ r151080/convol.c.135r.expand	2009-12-01 14:11:03.000000000 +0100
@@ -5,7 +5,6 @@
 ;; Generating RTL for gimple basic block 2
 
 ;; Generating RTL for gimple basic block 3
-Failed to add probability note
 
 ;; Generating RTL for gimple basic block 4
 
@@ -20,13 +19,6 @@
 ;; Generating RTL for gimple basic block 9
 
 ;; Generating RTL for gimple basic block 10
-Purged non-fallthru edges from bb 14
-Predictions for insn 61 bb 3
-  no prediction heuristics: 50.0%
-  combined heuristics: 50.0%
-Predictions for insn 65 bb 13
-  no prediction heuristics: 50.0%
-  combined heuristics: 50.0%

I am about to attach the whole diff file.

Comment 12 Nicolas BENOIT 2009-12-16 12:55:00 UTC

Created attachment 19321 [details]
Diff of the RTL expand dump between revisions 151079 and 151080

Comment 13 Richard Biener 2009-12-16 14:14:55 UTC

It's indeed expand that makes the difference when expanding

;; if (k <= 511 && k <= j != 0)

probably due to the way TER works now.

Comment 14 Michael Matz 2009-12-16 23:43:39 UTC

That's exactly what I fixed with my last patch.  If this still results in a
difference it's caused by difference in cheapness of branches.  I'll poke at
it.

Comment 15 Michael Matz 2009-12-17 00:27:36 UTC

Hmm, no, it's still my fault.  Something must have gone wrong with my brain
when I thought the bug was fixed, no idea how that happened.  The patch wasn't
complete.

Comment 16 Michael Matz 2009-12-17 01:42:00 UTC

Created attachment 19332 [details]
Real fix

Now, before I blow it again, would you be so kind to test this patch (on top
of some recent trunk, doesn't have to be the newest one, you don't need to
bootstrap) if it fixes the performance problem.  For me it does now, I swear :-)

Comment 17 Nicolas BENOIT 2009-12-17 09:32:11 UTC

(In reply to comment #16)
> Created an attachment (id=19332) [edit]
> Real fix
> 
> Now, before I blow it again, would you be so kind to test this patch (on top
> of some recent trunk, doesn't have to be the newest one, you don't need to
> bootstrap) if it fixes the performance problem.  For me it does now, I swear
> :-)
> 

Tested with trunk revision 155304, the regression is gone.

* Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1
flag
GCC-4.4.2          7364 ms
GCC-trunk-r155286  7360 ms

* Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1
flag
GCC-4.4.1          5968 ms
GCC-trunk-r155286  5963 ms

Comment 18 Nicolas BENOIT 2009-12-17 09:34:11 UTC

(In reply to comment #17)
> (In reply to comment #16)
> > Created an attachment (id=19332) [edit]
> > Real fix
> > 
> > Now, before I blow it again, would you be so kind to test this patch (on top
> > of some recent trunk, doesn't have to be the newest one, you don't need to
> > bootstrap) if it fixes the performance problem.  For me it does now, I swear
> > :-)
> > 
> 
> Tested with trunk revision 155304, the regression is gone.
> 
> * Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1
> flag
> GCC-4.4.2          7364 ms
> GCC-trunk-r155286  7360 ms
> 
> * Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1
> flag
> GCC-4.4.1          5968 ms
> GCC-trunk-r155286  5963 ms
> 

Oups, copy-pasted the GCC versions for the timings.
The correct versions are: 
* Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1
 flag
GCC-4.4.2                  7364 ms
GCC-trunk-r155304-patched  7360 ms

* Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1 flag
GCC-4.4.1                  5968 ms
GCC-trunk-r155304-patched  5963 ms

Comment 19 Michael Matz 2009-12-20 01:16:03 UTC

Subject: Bug 42027

Author: matz
Date: Sun Dec 20 01:15:46 2009
New Revision: 155367

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155367
Log:
	PR tree-optimization/42027
	* cfgexpand.c (expand_gimple_cond): Use jumpy sequence for &, &&, |
	and || if jumps are cheap.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c

Comment 20 Richard Biener 2009-12-20 13:11:59 UTC

Fixed.