Bug#: 22563
Status: RESOLVED
Resolution: FIXED
Assigned To: Not yet assigned to anyone <unassigned@gcc.gnu.org>
Reporter: Anthony Danalis <danalis@cis.udel.edu>

Attachment Description Type Created Size Actions
pr22563.cc reduced test text/plain 2005-07-19 19:20 385 bytes Edit

Bug 22563 blocks: 19466



Description:   Last confirmed: 2006-05-14 19:03 Opened: 2005-07-19 19:13
We ran bench++ to look for C++ samples that run slower at -O3 with
gcc-[34].x than with gcc-2.95.  We're attaching one such case,
minimized as far as we can (so it might not be testing the same
thing as the original code).  It consists of a simple function that
accesses bitfields, called in a loop from main.
gcc-3.4.3, gcc-4.0.0, and gcc-4.1-20050627 all produce binaries that seem
to be ten times slower on this than those produced by gcc-2.95.3.
All the compilers happily inlined the function, which is fine.

Here's the code from the older compiler:

.L12:
        movb $86,%dl
        movb %dl,b_rec
        movb %dl,%al
        andb $7,%al
        cmpb $6,%al
        je .L14
        call abort
        .align 4
.L14:
        andb $240,%dl
        cmpb $80,%dl
        je .L11
        call abort
        .align 4
.L11:
        decl %ecx
        testl %ecx,%ecx
        jg .L12

And here's the code from gcc-4.1-20050625:

        jmp     .L16
        .p2align 4,,7
.L27:
        andb    $-16, %dl
        cmpb    $80, %dl
        jne     .L25
        decl    %ebx
        je      .L26
.L16:
        movl    %ecx, %eax
        andl    $-8, %eax
        orl     $6, %eax
        movl    %eax, b_rec
        andb    $-9, b_rec
        movl    b_rec, %eax
        andl    $-241, %eax
        orl     $80, %eax
        movl    %eax, b_rec
        movl    %eax, %ecx
        movzbl  b_rec, %edx
        movb    %dl, %al
        andb    $7, %al
        cmpb    $6, %al
        je      .L27

We'll attach the preprocessed source.
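
For readers without the attachment, here is a hypothetical sketch of the kind of code described above. It is reconstructed only from the constants visible in the listings and is not the attached pr22563.cc. The names BRec, set_and_check, and run are invented for illustration. The match with the stored byte 86 (a 3-bit field holding 6, one clear bit, a 4-bit field holding 5) assumes that the first bit-field occupies the least significant bits, as GCC does on x86.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical layout: three bit-fields packed into one byte.
struct BRec {
    unsigned char lo  : 3;  // bits 0..2
    unsigned char mid : 1;  // bit 3
    unsigned char hi  : 4;  // bits 4..7
};

BRec b_rec;  // a global, like the b_rec symbol in the listings

// Store constants into the bit-fields, then read two of them back.
inline void set_and_check() {
    b_rec.lo = 6;
    b_rec.mid = 0;
    b_rec.hi = 5;
    if (b_rec.lo != 6) std::abort();
    if (b_rec.hi != 5) std::abort();
}

// Call the function in a loop, as the report describes. Returns the
// number of iterations that completed without calling abort.
int run(int iterations) {
    assert(iterations >= 0);
    int done = 0;
    for (int i = 0; i < iterations; ++i) {
        set_and_check();
        ++done;
    }
    return done;
}
```

Since every stored value is a constant, an ideal optimizer could prove that neither abort call is reachable.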

------- Comment #1 From Anthony Danalis 2005-07-19 19:20 -------
Created an attachment (id=9307) [edit]
reduced test

Is it bitchy to complain about a slowdown of a few nanoseconds (per iteration)? :)

------- Comment #2 From Andrew Pinski 2005-07-19 19:53 -------
There are a couple of problems here. First, we don't move the store to b_rec
outside of the loop.  Doing that on the mainline, we remove the loop, as it is
now unswitchable and really just empty.

In fact, that is not really what you wanted, but hey, fast empty loops :).

The other issue is that we don't constant-propagate the constants because we
have a BIT_FIELD_REF, which is most likely the cause of the original regression
in the first place, though BIT_FIELD_REF was there in 2.95.3.
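
To make the missed constant propagation concrete: the stored byte is the constant 86, so both comparisons in the 2.95 listing can be decided at compile time. A minimal sketch in C++, using only the masks and constants from that listing (the function names are invented for illustration, and this says nothing about how GCC represents the expressions internally):

```cpp
#include <cassert>

constexpr unsigned stored = 86;  // movb $86,%dl

constexpr unsigned low3(unsigned v)  { return v & 7; }    // andb $7,%al
constexpr unsigned high4(unsigned v) { return v & 240; }  // andb $240,%dl

// Both tests fold to true, so neither call to abort is reachable.
static_assert(low3(stored) == 6, "cmpb $6,%al always compares equal");
static_assert(high4(stored) == 80, "cmpb $80,%dl always compares equal");

// Run-time version of the same two tests, for comparison.
bool checks_pass(unsigned v) {
    return low3(v) == 6 && high4(v) == 80;
}

void demo() {
    assert(checks_pass(stored));
}
```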

We can reduce your testcase down to just stores, but that might not help the
original code (except for the fact that this is just a benchmark, which is
really useless).

------- Comment #3 From Andrew Pinski 2005-07-22 21:12 -------
Moving to 4.0.2, per Mark.

------- Comment #4 From Anthony Danalis 2005-08-04 19:16 -------
For the record, the reduced test case was derived from h000007.cpp of bench++.

------- Comment #5 From dann@godzilla.ics.uci.edu 2005-08-25 02:49 -------
This message:
http://gcc.gnu.org/ml/gcc/2005-08/msg00208.html

was asking for the reason for the slowdown for S000005e

AFAICT the inner loop for the benchmark (in s000005e_test) gets compiled to: 

.L153:
        fstl    (%edx)
        leal    8(%edx), %eax
        fstl    (%eax)
        fstl    8(%eax)
        fstl    16(%eax)
        fstl    24(%eax)
        fstl    32(%eax)
        fstl    40(%eax)
        fstl    48(%eax)
        leal    56(%eax), %edx
        cmpl    %edx, %ecx
        jne     .L153

and to:

.L9:
        movl    $0, (%edx)
        movl    $1074266112, 4(%edx)
        movl    $0, 8(%edx)
        movl    $1074266112, 12(%edx)
        movl    $0, 16(%edx)
        movl    $1074266112, 20(%edx)
        movl    $0, 24(%edx)
        movl    $1074266112, 28(%edx)
        movl    $0, 32(%edx)
        movl    $1074266112, 36(%edx)
        movl    $0, 40(%edx)
        movl    $1074266112, 44(%edx)
        movl    $0, 48(%edx)
        movl    $1074266112, 52(%edx)
        movl    $0, 56(%edx)
        movl    $1074266112, 60(%edx)
        addl    $64, %edx
        cmpl    %edx, %ebx
        jne     .L9

by 4.1

The 4.1 code looks much worse...
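
The constants in the 4.1 listing can be decoded by hand: 1074266112 is 0x40080000, and paired with a zero low word it forms the IEEE 754 double 3.0. Each pair of movl instructions therefore appears to store one double in two 32-bit halves, where the other listing stores a whole double with fstl. A small sketch to check that arithmetic, assuming 64-bit IEEE 754 doubles and the x86 convention that the low word sits at the lower address (the helper name from_halves is made up here):

```cpp
#include <cstdint>
#include <cstring>

// Rebuild a double from the two 32-bit immediates stored at
// offsets 0 (low word) and 4 (high word).
double from_halves(std::uint32_t low, std::uint32_t high) {
    std::uint64_t bits = (static_cast<std::uint64_t>(high) << 32) | low;
    double d;
    static_assert(sizeof d == sizeof bits, "assumes a 64-bit double");
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

With the values from the listing, from_halves(0, 1074266112) evaluates to 3.0.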

------- Comment #6 From Andrew Pinski 2005-10-27 00:20 -------
Hmm, this is truly all bit-field issues.

------- Comment #7 From Mark Mitchell 2005-10-31 04:12 -------
Leaving as P2.

I've seen reports of similar bitfield problems on a variety of problems.  This
kind of code doesn't show up much in scientific computing, but it does show up
in network applications, operating-system kernels, etc.

------- Comment #8 From Jan Hubicka 2005-11-03 22:54 -------
Actually, the 4.1 code in comment #5 should execute faster on a true i686. It is
longer and will trigger partial memory stalls on later chips.

Honza

------- Comment #9 From Ian Lance Taylor 2006-02-16 02:08 -------
FYI, this code looks OK to me on mainline, entering the loop at .L18:

.L29:
        andl    $-16, %edx
        cmpb    $80, %dl
        jne     .L27
        subl    $1, %ecx
        je      .L28
.L18:
        movl    $86, %edx
        movl    %edx, %eax
        andl    $7, %eax
        cmpb    $6, %al
        movb    $86, b_rec
        je      .L29
.L27:
        call    abort
.L28:

Still looks kind of sloppy in 4.1, though.

I haven't tried to figure out what fixed it on mainline.

------- Comment #10 From Mark Mitchell 2006-02-24 00:25 -------
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.

------- Comment #11 From Andrew Pinski 2006-04-06 01:33 -------
(In reply to comment #9)
> FYI, this code looks OK to me on mainline, entering the loop at .L18:
Except for the fact the bit-fields that are looked at are constant :).

------- Comment #12 From Ian Lance Taylor 2006-04-06 02:46 -------
Yes, true.  I can get constant code by not producing BIT_FIELD_REF so early. 
But that disables other worthy optimizations.  I don't have a coherent patch
yet.

------- Comment #13 From Jan Hubicka 2006-05-09 11:59 -------
The simplified testcase seems to be solved now (on mainline and Athlon):
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ gcc-2.95 -O3 t.C -march=i686  
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ ./a.out
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.809s
user    0m1.798s
sys     0m0.000s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.841s
user    0m1.796s
sys     0m0.002s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ /aux/hubicka/egcs-mainline/bin/g++ -O3 t.C -static  -march=i686
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.713s
user    0m1.676s
sys     0m0.003s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.719s
user    0m1.700s
sys     0m0.000s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ /aux/hubicka/egcs-mainline/bin/g++ -O3 t.C -static  -march=athlon
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out
real    0m1.353s
user    0m1.347s
sys     0m0.002s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ 

The assembly looks comparable to the 2.95 one (instruction-count-wise; the form
is closer to 4.0):
.L29:
        andl    $-16, %edx
        cmpb    $80, %dl
        jne     .L27
        decl    %ecx
        je      .L28
.L18:
        movl    $86, %edx
        movb    $86, b_rec
        movl    %edx, %eax
        andl    $7, %eax
        cmpb    $6, %al
        je      .L29
.L27:
        call    abort


Since no direct testcase for the code in comment #5 is attached, can I ask
whether the problem persists with the generic model?  It looks like the
benchmark was executed on a different core than i686 but compiled for i686.
With generic we should now assume the partial memory stores and thus avoid the
integer moves by halves of the destination.

Honza

------- Comment #14 From Roger Sayle 2006-05-14 15:48 -------
Subject: Bug 22563

Author: sayle
Date: Sun May 14 15:48:11 2006
New Revision: 113762

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113762
Log:

        PR rtl-optimization/22563
        * expmed.c (store_fixed_bit_field): When using AND and IOR to store
        a fixed width bitfield, always force the intermediates into pseudos.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expmed.c
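
As background for the log entry above: storing a fixed-width bit-field with AND and IOR means clearing the field's bits with a mask and then OR-ing in the shifted value. The sketch below illustrates that general technique in C++. It is not the code in expmed.c, and the names store_field and store_all are hypothetical. Applied to the three fields implied by the 4.1 listing in the description (masks -8, -9, and -241), it yields the byte 86.

```cpp
#include <cassert>
#include <cstdint>

// Store `value` into the `width`-bit field that starts at bit `shift`.
std::uint32_t store_field(std::uint32_t word, unsigned shift,
                          unsigned width, std::uint32_t value) {
    assert(width > 0 && width < 32 && shift + width <= 32);
    std::uint32_t mask = ((std::uint32_t{1} << width) - 1) << shift;
    // AND clears the old field contents; IOR inserts the new ones.
    return (word & ~mask) | ((value << shift) & mask);
}

// The three stores from the listing, applied one after another.
std::uint32_t store_all(std::uint32_t word) {
    word = store_field(word, 0, 3, 6);  // andl $-8;   orl $6
    word = store_field(word, 3, 1, 0);  // andb $-9
    word = store_field(word, 4, 4, 5);  // andl $-241; orl $80
    return word;
}
```

Bits outside the three fields are left untouched, so only the low byte of the result is fixed at 86.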

------- Comment #15 From Roger Sayle 2006-05-15 04:43 -------
Subject: Bug 22563

Author: sayle
Date: Mon May 15 04:43:05 2006
New Revision: 113775

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113775
Log:

        PR rtl-optimization/22563
        Backports from mainline
        * expmed.c (store_fixed_bit_field): When using AND and IOR to store
        a fixed width bitfield, always force the intermediates into pseudos.
        Also check whether the bitsize is valid for the machine's "insv"
        instruction before moving the target into a pseudo for use with
        the insv.
        * config/i386/predicates.md (const8_operand): New predicate.
        * config/i386/i386.md (extv, extzv, insv): Use the new
        const8_operand predicate where appropriate.


Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/config/i386/predicates.md
    branches/gcc-4_1-branch/gcc/expmed.c

------- Comment #16 From Roger Sayle 2006-05-16 01:17 -------
Subject: Bug 22563

Author: sayle
Date: Tue May 16 01:17:13 2006
New Revision: 113810

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113810
Log:

        PR rtl-optimization/22563
        Backports from mainline
        * expmed.c (store_fixed_bit_field): When using AND and IOR to store
        a fixed width bitfield, always force the intermediates into pseudos.
        Also check whether the bitsize is valid for the machine's "insv"
        instruction before moving the target into a pseudo for use with
        the insv.
        * config/i386/predicates.md (const8_operand): New predicate.
        * config/i386/i386.md (extv, extzv, insv): Use the new
        const8_operand predicate where appropriate.


Modified:
    branches/gcc-4_0-branch/gcc/ChangeLog
    branches/gcc-4_0-branch/gcc/config/i386/i386.md
    branches/gcc-4_0-branch/gcc/config/i386/predicates.md
    branches/gcc-4_0-branch/gcc/expmed.c

------- Comment #17 From Gabriel Dos Reis 2007-02-03 15:32 -------
Fixed in GCC-4.1.1 and higher.
