22563 – [4.0 Regression] performance regression for gcc newer than 2.95

Summary: [4.0 Regression] performance regression for gcc newer than 2.95

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	4.0.0

Importance:	P2 minor
Target Milestone:	4.1.1
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	bitfield

Reported:	2005-07-19 19:13 UTC by Anthony Danalis
Modified:	2007-02-03 15:32 UTC
CC List:	3 users

See Also:
Host:
Target:	i686-linux
Build:
Known to work:	2.95.3 4.2.0 4.1.1
Known to fail:	3.3.3 3.0.4 3.2.3 3.4.0 4.1.0
Last reconfirmed:	2006-05-14 19:03:37

Attachments
reduced test (385 bytes, text/plain) 2005-07-19 19:20 UTC, Anthony Danalis

Description Anthony Danalis 2005-07-19 19:13:03 UTC

We ran bench++ to look for C++ samples that ran slower at -O3 with
gcc-[34].x than with gcc-2.95.  We're attaching one such case,
minimized as far as we can (so it might not be testing the same
thing as the original code).  It consists of a simple function that
accesses bitfields, called in a loop from main. 
gcc-3.4.3/gcc-4.0.0/gcc-4.1-20050627 all produce binaries that seem
to be ten times slower on this than those produced by gcc-2.95.3.
All the compilers happily inlined the function, which is fine.

Here's the code from the older compiler:

.L12:
        movb $86,%dl
        movb %dl,b_rec
        movb %dl,%al
        andb $7,%al
        cmpb $6,%al
        je .L14
        call abort
        .align 4
.L14:
        andb $240,%dl
        cmpb $80,%dl
        je .L11
        call abort
        .align 4
.L11:
        decl %ecx
        testl %ecx,%ecx
        jg .L12

And here's code from gcc-4.1-20050625:
        jmp     .L16
        .p2align 4,,7
.L27:
        andb    $-16, %dl
        cmpb    $80, %dl
        jne     .L25
        decl    %ebx
        je      .L26
.L16:
        movl    %ecx, %eax
        andl    $-8, %eax
        orl     $6, %eax
        movl    %eax, b_rec
        andb    $-9, b_rec
        movl    b_rec, %eax
        andl    $-241, %eax
        orl     $80, %eax
        movl    %eax, b_rec
        movl    %eax, %ecx
        movzbl  b_rec, %edx
        movb    %dl, %al
        andb    $7, %al
        cmpb    $6, %al
        je      .L27

We'll attach the preprocessed source.
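
For readers without the attachment, here is a minimal sketch of the kind of code the description refers to. It is not the attached source: the struct, field names, and field widths are assumptions inferred from the masks and constants in the listings above (low three bits compared against 6, high four bits compared against 80, and one bit cleared by `andb $-9`).

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical record: names and widths are guesses based on the masks in
// the assembly listings, not taken from the attached test case.
struct Rec {
    unsigned low  : 3;  // listings compare these bits against 6
    unsigned flag : 1;  // listings clear this bit with "andb $-9"
    unsigned high : 4;  // listings compare these bits against 80 (5 << 4)
};

Rec b_rec;  // global object, like the b_rec symbol in the listings

// Stores constants into the bit-fields, then reads them back and aborts
// on a mismatch, mirroring the "call abort" paths in the assembly.
inline void store_and_check() {
    b_rec.low = 6;
    b_rec.flag = 0;
    b_rec.high = 5;
    if (b_rec.low != 6) std::abort();
    if (b_rec.high != 5) std::abort();
}

// Calls the function in a loop, as the description says main does.
// Returns the number of calls made.
int run(int iterations) {
    assert(iterations >= 0);
    int calls = 0;
    for (int i = 0; i < iterations; ++i) {
        store_and_check();
        ++calls;
    }
    return calls;
}
```

Calling `run` with a large iteration count from `main` and timing the binary would approximate the measurement described above; the actual iteration count used in the report is not stated here.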

Comment 1 Anthony Danalis 2005-07-19 19:20:16 UTC

Created attachment 9307
reduced test

Is it bitchy to complain about a few nanoseconds of slowdown (per iteration)? :)

Comment 2 Andrew Pinski 2005-07-19 19:53:16 UTC

There are a couple of problems here. First, we don't move the store to b_rec outside of the loop. Doing that on the mainline, we remove the loop as it is now unswitchable and really just empty.

In fact that will not really be what you wanted, but hey, fast empty loops :).

The other issue is that we don't constant-propagate the constants because we have a BIT_FIELD_REF, which is most likely the cause of the original regression in the first place, though BIT_FIELD_REF was there in 2.95.3.

We can reduce your testcase down to stores really, but that might not help the original code (except for the fact this is just a benchmark, which is really useless).
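
To illustrate the constant-propagation point, the sketch below writes the read-modify-write steps from the 4.1 listing in the description as plain mask arithmetic. The function names are hypothetical. Whatever value the word held before, its low byte ends up as 86, which is the constant that 2.95 stores directly with `movb $86`, so both comparisons in the loop test known values.

```cpp
#include <cassert>
#include <cstdint>

// The three bit-field stores from the 4.1 listing, expressed on a 32-bit
// word. Hypothetical helper for illustration only.
std::uint32_t store_fields(std::uint32_t word) {
    word = (word & ~7u) | 6u;     // andl $-8   / orl $6
    word = word & ~8u;            // andb $-9
    word = (word & ~240u) | 80u;  // andl $-241 / orl $80
    return word;
}

// Both loop comparisons hold for every starting value, so a compiler that
// propagated the constants could drop them.
void check_fields(std::uint32_t word) {
    const std::uint32_t stored = store_fields(word);
    assert((stored & 7u) == 6u);     // cmpb $6
    assert((stored & 240u) == 80u);  // cmpb $80
}
```

The upper 24 bits pass through unchanged, which is why only the byte store survives once the masks are folded.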

Comment 3 Andrew Pinski 2005-07-22 21:12:37 UTC

Moving to 4.0.2 per Mark.

Comment 4 Anthony Danalis 2005-08-04 19:16:08 UTC

For the record, the reduced test case was derived from h000007.cpp of bench++.

Comment 5 Dan Nicolaescu 2005-08-25 02:49:56 UTC

This message:
http://gcc.gnu.org/ml/gcc/2005-08/msg00208.html

was asking for the reason for the slowdown for S000005e.

AFAICT the inner loop for the benchmark (in s000005e_test) gets compiled to: 

.L153:
        fstl    (%edx)
        leal    8(%edx), %eax
        fstl    (%eax)
        fstl    8(%eax)
        fstl    16(%eax)
        fstl    24(%eax)
        fstl    32(%eax)
        fstl    40(%eax)
        fstl    48(%eax)
        leal    56(%eax), %edx
        cmpl    %edx, %ecx
        jne     .L153

and to:

.L9:
        movl    $0, (%edx)
        movl    $1074266112, 4(%edx)
        movl    $0, 8(%edx)
        movl    $1074266112, 12(%edx)
        movl    $0, 16(%edx)
        movl    $1074266112, 20(%edx)
        movl    $0, 24(%edx)
        movl    $1074266112, 28(%edx)
        movl    $0, 32(%edx)
        movl    $1074266112, 36(%edx)
        movl    $0, 40(%edx)
        movl    $1074266112, 44(%edx)
        movl    $0, 48(%edx)
        movl    $1074266112, 52(%edx)
        movl    $0, 56(%edx)
        movl    $1074266112, 60(%edx)
        addl    $64, %edx
        cmpl    %edx, %ebx
        jne     .L9

by 4.1

The 4.1 code looks much worse...
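
A note on the immediates in the 4.1 listing: 1074266112 is 0x40080000, which matches the upper 32 bits of the IEEE 754 encoding of the double 3.0, with the lower 32 bits being 0. That the benchmark stores 3.0 is an inference from those constants, not something stated in this report. The sketch below, with hypothetical names, splits a double into the two 32-bit halves that the `movl` pairs write, assuming a 64-bit IEEE 754 `double`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The two 32-bit halves of a 64-bit double (hypothetical type).
struct Halves {
    std::uint32_t low;   // bits 0..31, stored at offset 0 on little-endian x86
    std::uint32_t high;  // bits 32..63, stored at offset 4
};

// Copies the bit pattern of a double into an integer and splits it.
Halves split_double(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return Halves{static_cast<std::uint32_t>(bits),
                  static_cast<std::uint32_t>(bits >> 32)};
}
```

Under that assumption, `assert(split_double(3.0).high == 1074266112u)` and `assert(split_double(3.0).low == 0u)` both hold, so each pair of integer stores writes one double in two halves where the other listing uses a single `fstl`.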

Comment 6 Andrew Pinski 2005-10-27 00:20:10 UTC

Hmm, this is truly all bit-field issues.

Comment 7 Mark Mitchell 2005-10-31 04:12:47 UTC

Leaving as P2.

I've seen reports of similar bitfield problems on a variety of problems.  This kind of code doesn't show up much in scientific computing, but it does show up in network applications, operating-system kernels, etc.

Comment 8 Jan Hubicka 2005-11-03 22:54:29 UTC

Actually, the 4.1 code in comment #5 should execute faster on a true i686. It is longer, though, and will trigger partial memory stalls on later chips.

Honza

Comment 9 Ian Lance Taylor 2006-02-16 02:08:47 UTC

FYI, this code looks OK to me on mainline, entering the loop at .L18:

.L29:
	andl	$-16, %edx
	cmpb	$80, %dl
	jne	.L27
	subl	$1, %ecx
	je	.L28
.L18:
	movl	$86, %edx
	movl	%edx, %eax
	andl	$7, %eax
	cmpb	$6, %al
	movb	$86, b_rec
	je	.L29
.L27:
	call	abort
.L28:

Still looks kind of sloppy in 4.1, though.

I haven't tried to figure out what fixed it on mainline.

Comment 10 Mark Mitchell 2006-02-24 00:25:59 UTC

This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.

Comment 11 Andrew Pinski 2006-04-06 01:33:46 UTC

(In reply to comment #9)
> FYI, this code looks OK to me on mainline, entering the loop at .L18:
Except for the fact that the bit-fields being looked at are constant :).

Comment 12 Ian Lance Taylor 2006-04-06 02:46:32 UTC

Yes, true.  I can get constant code by not producing BIT_FIELD_REF so early.  But that disables other worthy optimizations.  I don't have a coherent patch yet.

Comment 13 Jan Hubicka 2006-05-09 11:59:23 UTC

The simplified testcase seems to be solved now (on mainline and Athlon):
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ gcc-2.95 -O3 t.C -march=i686  
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ ./a.out
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.809s
user    0m1.798s
sys     0m0.000s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.841s
user    0m1.796s
sys     0m0.002s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ /aux/hubicka/egcs-mainline/bin/g++  -O3 t.C -static  -march=i686  
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.713s
user    0m1.676s
sys     0m0.003s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.719s
user    0m1.700s
sys     0m0.000s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ /aux/hubicka/egcs-mainline/bin/g++  -O3 t.C -static  -march=athlon
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out
real    0m1.353s
user    0m1.347s
sys     0m0.002s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ 

The assembly looks comparable to the 2.95 one (instruction-count-wise; the form is closer to 4.0):
.L29:
        andl    $-16, %edx
        cmpb    $80, %dl
        jne     .L27
        decl    %ecx
        je      .L28
.L18:
        movl    $86, %edx
        movb    $86, b_rec
        movl    %edx, %eax
        andl    $7, %eax
        cmpb    $6, %al
        je      .L29
.L27:
        call    abort


Since no direct testcase for the code in comment 5 is attached, can I ask if the problem persists with the generic model?  It looks like the benchmark was executed on a different core than i686 but compiled for i686.  With generic we should now assume partial memory stalls and thus avoid storing the destination in halves with integer moves.

Honza

Comment 14 Roger Sayle 2006-05-14 15:48:21 UTC

Subject: Bug 22563

Author: sayle
Date: Sun May 14 15:48:11 2006
New Revision: 113762

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113762
Log:

	PR rtl-optimization/22563
	* expmed.c (store_fixed_bit_field): When using AND and IOR to store
	a fixed width bitfield, always force the intermediates into pseudos.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expmed.c
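
For context, the AND-and-IOR pattern named in the log entry can be sketched at the source level as follows. This is not GCC's `store_fixed_bit_field`, which operates on RTL; `store_bits` and its parameters are hypothetical. Each intermediate is kept in its own local, which is only a loose analogy for forcing the intermediates into pseudo registers instead of operating on memory at every step.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: stores the low `width` bits of `value` into `word`
// starting at bit `pos`, leaving all other bits untouched.
std::uint32_t store_bits(std::uint32_t word, unsigned pos, unsigned width,
                         std::uint32_t value) {
    assert(width >= 1 && width < 32 && pos + width <= 32);
    const std::uint32_t mask = ((std::uint32_t{1} << width) - 1u) << pos;
    const std::uint32_t cleared = word & ~mask;           // AND: clear the field
    const std::uint32_t shifted = (value << pos) & mask;  // move new bits into place
    return cleared | shifted;                             // IOR: merge them in
}
```

Applying it three times with the field positions assumed from the listings, `store_bits(store_bits(store_bits(0, 0, 3, 6), 3, 1, 0), 4, 4, 5)`, gives 86, the byte value seen in the fixed code.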

Comment 15 Roger Sayle 2006-05-15 04:43:32 UTC

Subject: Bug 22563

Author: sayle
Date: Mon May 15 04:43:05 2006
New Revision: 113775

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113775
Log:

	PR rtl-optimization/22563
	Backports from mainline
	* expmed.c (store_fixed_bit_field): When using AND and IOR to store
	a fixed width bitfield, always force the intermediates into pseudos.
	Also check whether the bitsize is valid for the machine's "insv"
	instruction before moving the target into a pseudo for use with
	the insv.
	* config/i386/predicates.md (const8_operand): New predicate.
	* config/i386/i386.md (extv, extzv, insv): Use the new
	const8_operand predicate where appropriate.


Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/config/i386/predicates.md
    branches/gcc-4_1-branch/gcc/expmed.c

Comment 16 Roger Sayle 2006-05-16 01:17:21 UTC

Subject: Bug 22563

Author: sayle
Date: Tue May 16 01:17:13 2006
New Revision: 113810

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113810
Log:

	PR rtl-optimization/22563
	Backports from mainline
	* expmed.c (store_fixed_bit_field): When using AND and IOR to store
	a fixed width bitfield, always force the intermediates into pseudos.
	Also check whether the bitsize is valid for the machine's "insv"
	instruction before moving the target into a pseudo for use with
	the insv.
	* config/i386/predicates.md (const8_operand): New predicate.
	* config/i386/i386.md (extv, extzv, insv): Use the new
	const8_operand predicate where appropriate.


Modified:
    branches/gcc-4_0-branch/gcc/ChangeLog
    branches/gcc-4_0-branch/gcc/config/i386/i386.md
    branches/gcc-4_0-branch/gcc/config/i386/predicates.md
    branches/gcc-4_0-branch/gcc/expmed.c

Comment 17 Gabriel Dos Reis 2007-02-03 15:32:58 UTC

Fixed in GCC-4.1.1 and higher.