We ran bench++ to look for C++ samples that ran slower at -O3 with gcc-[34].x than with gcc-2.95. We're attaching one such case, minimized as far as we can (so it might not be testing the same thing as the original code).

It consists of a simple function that accesses bitfields, called in a loop from main. gcc-3.4.3/gcc-4.0.0/gcc-4.1-20050627 all produce binaries that seem to be ten times slower on this than those produced by gcc-2.95.3.

All the compilers happily inlined the function, which is fine. Here's the code from the older compiler:

.L12:
        movb $86,%dl
        movb %dl,b_rec
        movb %dl,%al
        andb $7,%al
        cmpb $6,%al
        je .L14
        call abort
        .align 4
.L14:
        andb $240,%dl
        cmpb $80,%dl
        je .L11
        call abort
        .align 4
.L11:
        decl %ecx
        testl %ecx,%ecx
        jg .L12

And here's code from gcc-4.1-20050625:

        jmp .L16
        .p2align 4,,7
.L27:
        andb $-16, %dl
        cmpb $80, %dl
        jne .L25
        decl %ebx
        je .L26
.L16:
        movl %ecx, %eax
        andl $-8, %eax
        orl $6, %eax
        movl %eax, b_rec
        andb $-9, b_rec
        movl b_rec, %eax
        andl $-241, %eax
        orl $80, %eax
        movl %eax, b_rec
        movl %eax, %ecx
        movzbl b_rec, %edx
        movb %dl, %al
        andb $7, %al
        cmpb $6, %al
        je .L27

We'll attach the preprocessed source.
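For readers without the attachment, here is a minimal sketch of the kind of test the description implies. It is not the attached file: the struct layout, the field names (kind, flag, level) and the helper names are assumptions, inferred only from the constants in the listings (86 is 0x56, i.e. low three bits 6, bit 3 clear, high four bits 5) and from the two compare-and-abort checks.

```cpp
#include <cstdlib>

// Hypothetical record: field names and widths are assumptions chosen so that
// the stored byte matches the constant 86 (0x56) seen in the assembly.
struct Rec {
    unsigned kind  : 3;  // expected to hold 6
    unsigned flag  : 1;  // expected to hold 0
    unsigned level : 4;  // expected to hold 5
};

Rec b_rec;  // global object, named after the symbol in the listings

// Stores the bit-fields, then reads two of them back and aborts on mismatch,
// mirroring the two compare/abort pairs in the generated code.
inline void set_and_check() {
    b_rec.kind = 6;
    b_rec.flag = 0;
    b_rec.level = 5;
    if (b_rec.kind != 6) std::abort();
    if (b_rec.level != 5) std::abort();
}

// Calls the (inlinable) function in a counted loop and reports how many
// iterations completed.
int run(int iterations) {
    int done = 0;
    for (int i = iterations; i > 0; --i) {
        set_and_check();
        ++done;
    }
    return done;
}
```

Calling run() with a large count from main and timing the binary at -O3 would approximate the measurement described above; whether it reproduces the exact slowdown depends on the compiler version and target.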
Created attachment 9307
reduced test

Is it bitchy to complain about a few nanoseconds of slowdown (per iteration)? :)
There are a couple of problems here. First, we don't move the store to b_rec outside of the loop. Doing that on the mainline, we remove the loop, as it is now unswitchable and really just empty. In fact, that will not really be what you wanted, but hey, fast empty loops :).

The other issue is that we don't constant-propagate the constants, as we have a BIT_FIELD_REF, which is most likely the cause of the original regression in the first place, though BIT_FIELD_REF was there in 2.95.3.

We can reduce your testcase down to stores, really, but that might not help the original code (except for the fact that this is just a benchmark, which is really useless).
Moving to 4.0.2 pre Mark.
For the record the reduced test case was derived from h000007.cpp of bench++
This message:
http://gcc.gnu.org/ml/gcc/2005-08/msg00208.html
was asking for the reason for the slowdown for S000005e.

AFAICT the inner loop for the benchmark (in s000005e_test) gets compiled to:

.L153:
        fstl (%edx)
        leal 8(%edx), %eax
        fstl (%eax)
        fstl 8(%eax)
        fstl 16(%eax)
        fstl 24(%eax)
        fstl 32(%eax)
        fstl 40(%eax)
        fstl 48(%eax)
        leal 56(%eax), %edx
        cmpl %edx, %ecx
        jne .L153

and to:

.L9:
        movl $0, (%edx)
        movl $1074266112, 4(%edx)
        movl $0, 8(%edx)
        movl $1074266112, 12(%edx)
        movl $0, 16(%edx)
        movl $1074266112, 20(%edx)
        movl $0, 24(%edx)
        movl $1074266112, 28(%edx)
        movl $0, 32(%edx)
        movl $1074266112, 36(%edx)
        movl $0, 40(%edx)
        movl $1074266112, 44(%edx)
        movl $0, 48(%edx)
        movl $1074266112, 52(%edx)
        movl $0, 56(%edx)
        movl $1074266112, 60(%edx)
        addl $64, %edx
        cmpl %edx, %ebx
        jne .L9

by 4.1.

The 4.1 code looks much worse...
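As an aside on the second listing: the paired constants 0 and 1074266112 (0x40080000) are consistent with a double being stored as two 32-bit integer halves, and under IEEE-754 binary64 that bit pattern is 3.0. The benchmark source is not quoted in this report, so this is an inference from the constants. The helper below (split_double and Halves are names invented for illustration) is a small sketch for checking that kind of decomposition.

```cpp
#include <cstdint>
#include <cstring>

// Low and high 32-bit words of a 64-bit double's bit pattern.
struct Halves {
    std::uint32_t low;
    std::uint32_t high;
};

// Copies the object representation of `value` into a 64-bit integer and
// splits it. Assumes double is IEEE-754 binary64, which holds on x86.
inline Halves split_double(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects a 64-bit double");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return Halves{static_cast<std::uint32_t>(bits & 0xFFFFFFFFu),
                  static_cast<std::uint32_t>(bits >> 32)};
}
```

With this, split_double(3.0) yields a low word of 0 and a high word of 1074266112, matching the immediates stored at offsets 0 and 4 of each 8-byte slot.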
Hmm, this is truly all bit-field issues.
Leaving as P2. I've seen reports of similar bitfield problems on a variety of problems. This kind of code doesn't show up much in scientific computing, but it does show up in network applications, operating-system kernels, etc.
Actually, the 4.1 code in comment #5 should execute faster on a true i686. It is longer and will trigger partial memory stalls on later chips.

Honza
FYI, this code looks OK to me on mainline, entering the loop at .L18:

.L29:
        andl $-16, %edx
        cmpb $80, %dl
        jne .L27
        subl $1, %ecx
        je .L28
.L18:
        movl $86, %edx
        movl %edx, %eax
        andl $7, %eax
        cmpb $6, %al
        movb $86, b_rec
        je .L29
.L27:
        call abort
.L28:

Still looks kind of sloppy in 4.1, though. I haven't tried to figure out what fixed it on mainline.
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
(In reply to comment #9)
> FYI, this code looks OK to me on mainline, entering the loop at .L18:

Except for the fact the bit-fields that are looked at are constant :).
Yes, true. I can get constant code by not producing BIT_FIELD_REF so early. But that disables other worthy optimizations. I don't have a coherent patch yet.
The simplified testcase seems to be solved now (on mainline and Athlon):

hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ gcc-2.95 -O3 t.C -march=i686
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ ./a.out
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.809s
user    0m1.798s
sys     0m0.000s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.841s
user    0m1.796s
sys     0m0.002s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ /aux/hubicka/egcs-mainline/bin/g++ -O3 t.C -static -march=i686
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.713s
user    0m1.676s
sys     0m0.003s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.719s
user    0m1.700s
sys     0m0.000s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ /aux/hubicka/egcs-mainline/bin/g++ -O3 t.C -static -march=athlon
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$ time ./a.out

real    0m1.353s
user    0m1.347s
sys     0m0.002s
hubicka@kampanus:/aux/hubicka/gcc/build/gcc$

The assembly looks comparable to the 2.95 one (instruction-count-wise; the form is closer to 4.0):

.L29:
        andl $-16, %edx
        cmpb $80, %dl
        jne .L27
        decl %ecx
        je .L28
.L18:
        movl $86, %edx
        movb $86, b_rec
        movl %edx, %eax
        andl $7, %eax
        cmpb $6, %al
        je .L29
.L27:
        call abort

Since no direct testcase for the code in comment 5 is attached, can I ask if the problem persists with the generic model? It looks like the benchmark was executed on a different core than i686 but compiled for i686. With generic we should now assume the partial memory stores and thus avoid the integer moves by halves of destination.

Honza
Subject: Bug 22563

Author: sayle
Date: Sun May 14 15:48:11 2006
New Revision: 113762

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113762
Log:
        PR rtl-optimization/22563
        * expmed.c (store_fixed_bit_field): When using AND and IOR to store
        a fixed width bitfield, always force the intermediates into pseudos.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expmed.c
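For context, the "AND and IOR" store mentioned in the log is the usual read-modify-write idiom for a bit-field: clear the field's bits with an AND against the inverted mask, then OR in the shifted value. The sketch below shows the idiom at source level only. It is not the expmed.c code, and store_field/load_field are names made up for illustration. The masks it produces match the immediates in the first comment's listing (-8 is ~0x7, -241 is ~0xF0).

```cpp
#include <cstdint>

// Returns `word` with the `width`-bit field starting at bit `shift` replaced
// by `value`. Precondition for this sketch: width < 32 and shift + width <= 32.
inline std::uint32_t store_field(std::uint32_t word, unsigned shift,
                                 unsigned width, std::uint32_t value) {
    const std::uint32_t mask = ((std::uint32_t{1} << width) - 1u) << shift;
    // AND clears the old field contents; IOR inserts the new ones.
    return (word & ~mask) | ((value << shift) & mask);
}

// Extracts the `width`-bit field starting at bit `shift` (same precondition).
inline std::uint32_t load_field(std::uint32_t word, unsigned shift,
                                unsigned width) {
    return (word >> shift) & ((std::uint32_t{1} << width) - 1u);
}
```

Storing 6 into bits 0-2, 0 into bit 3 and 5 into bits 4-7 of a zero word gives 86, the constant that the good code sequences store directly into b_rec.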
Subject: Bug 22563

Author: sayle
Date: Mon May 15 04:43:05 2006
New Revision: 113775

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113775
Log:
        PR rtl-optimization/22563
        Backports from mainline
        * expmed.c (store_fixed_bit_field): When using AND and IOR to store
        a fixed width bitfield, always force the intermediates into pseudos.
        Also check whether the bitsize is valid for the machine's "insv"
        instruction before moving the target into a pseudo for use with
        the insv.
        * config/i386/predicates.md (const8_operand): New predicate.
        * config/i386/i386.md (extv, extzv, insv): Use the new
        const8_operand predicate where appropriate.

Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/config/i386/predicates.md
    branches/gcc-4_1-branch/gcc/expmed.c
Subject: Bug 22563

Author: sayle
Date: Tue May 16 01:17:13 2006
New Revision: 113810

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113810
Log:
        PR rtl-optimization/22563
        Backports from mainline
        * expmed.c (store_fixed_bit_field): When using AND and IOR to store
        a fixed width bitfield, always force the intermediates into pseudos.
        Also check whether the bitsize is valid for the machine's "insv"
        instruction before moving the target into a pseudo for use with
        the insv.
        * config/i386/predicates.md (const8_operand): New predicate.
        * config/i386/i386.md (extv, extzv, insv): Use the new
        const8_operand predicate where appropriate.

Modified:
    branches/gcc-4_0-branch/gcc/ChangeLog
    branches/gcc-4_0-branch/gcc/config/i386/i386.md
    branches/gcc-4_0-branch/gcc/config/i386/predicates.md
    branches/gcc-4_0-branch/gcc/expmed.c
Fixed in GCC-4.1.1 and higher.