Consider this tiny function:

void foo() {
  extern int a;
  if(++a) ++a;
}

It produces the following assembly with 3.4 (experimental) and 3.2 (Red Hat 9)
at -O2 and better:

        .file   "t.c"
        .text
        .globl  foo
        .type   foo, @function
foo:
        movl    a, %eax
        incl    %eax
        testl   %eax, %eax
        movl    %eax, a
        je      .L1
        incl    %eax
        movl    %eax, a
.L1:
        ret
        .size   foo, .-foo
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.5-tree-ssa 20030924 (merged 20030817)"

This is suboptimal. The optimal code should be:

foo:
        addl    $1,a
        je      .L1
        addl    $1,a
.L1:
        ret
Confirmed. Present mainline gives different assembly than the tree-ssa
results you showed, but the code is equally suboptimal.

One question, though: from the summary line, it seems as if you imply that
the problem is related to global variables. Do you have an example that this
doesn't happen with locals?

W.

Here's what I get with mainline (actually from the C++ FE, but that
shouldn't matter so much):

_Z3foov:
.LFB2:
        movl    a, %eax
        cmpl    $-1, %eax
        je      .L3
        addl    $2, %eax
        movl    %eax, a
        ret
        .p2align 4,,7
.L3:
        xorl    %eax, %eax
        movl    %eax, a
        ret
.LFE2:
        .size   _Z3foov, .-_Z3foov
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4 20030919 (experimental)"
Well I don't _know_ for sure if it is related to global variables, but it
sure seems so:

void foo(int a) {
  if(++a) ++a;
  printf ("%d",a);
}

        .file   "t.c"
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%d"
        .text
        .globl  foo
        .type   foo,@function
foo:
        movl    4(%esp), %eax
        incl    %eax
        je      .L2
        incl    %eax
.L2:
        pushl   %eax
        pushl   $.LC0
        call    printf
        popl    %eax
        popl    %edx
        ret
.Lfe1:
        .size   foo,.Lfe1-foo
        .ident  "GCC: (GNU) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)"

void foo() {
  int a;
  if(++a) ++a;
  printf ("%d",a);
}

produces something similar.
Thanks, makes sense. I tried myself, but using a function-local variable was
no good since it could be optimized away. Didn't think of a function
parameter, though. Thanks for clarifying this.

W.
Subject: Re: New: Suboptimal code with global variables

> void foo() {
>   extern int a;
>   if(++a) ++a;
> }
>
> This is suboptimal. The optimal code should be:
>
> foo:
>         addl    $1,a
>         je      .L1
>         addl    $1,a
> .L1:
>         ret

[I haven't written x86 assembler for a long time, but...]

That code isn't optimal either; something like

foo:
        addl    $1,a
        sbbl    $-1,a
        ret

would be even better :-)
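Read back as C, the add/sbb pair computes roughly the following (a sketch for
illustration only, it is not from the bug report, and the name foo_branchless
is made up): addl $1,a sets the carry flag exactly when a wraps from -1 to 0,
and sbbl $-1,a then adds 1 minus that carry, so the second increment is
skipped precisely when the first one produced zero.

void foo_branchless(void) {
  extern int a;
  int t = a + 1;        /* addl $1,a: carry set only when a was -1        */
  a = t + (t != 0);     /* sbbl $-1,a: adds 1 unless the carry was set    */
}

For a == -1 this stores 0; for any other value it stores a + 2, the same
effect as if(++a) ++a;.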
*** Bug 14552 has been marked as a duplicate of this bug. ***
*** Bug 16857 has been marked as a duplicate of this bug. ***
"GCC: (GNU) 4.2.0 20060210 (experimental)" produces this (at "-O2 -march=pentium4"): foo: movl a, %eax addl $1, %eax movl %eax, a testl %eax, %eax je .L4 addl $1, %eax movl %eax, a .L4: ret
I really think this should be fixed, otherwise gcc won't be able to follow
the exponentially decaying performance which it has so accurately followed
since 2.95 at least. To show more clearly how much speed we could lose by
fixing this, I was nice and benchmarked the code (a simple for loop running
100 times with the code inside, rdtsc-based timing outside, with a 1000
times executed loop surrounding it). Benchmarking was done on an 800MHz
Duron and a 500MHz Pentium 3; the first number is the number of cpu cycles
for the Duron, the second one for the P3.

First let me show you the optimal code by Steven Bosscher?

"addl $1,a\n"
"je .L1\n"
"addl $1,a\n"
".L1:\n"
//11.557 / 12.514

Now what gcc 3.4/3.2 generated:

"movl a, %%eax\n"
"incl %%eax\n"
"testl %%eax, %%eax\n"
"movl %%eax, a\n"
"je .L1\n"
"incl %%eax\n"
"movl %%eax, a\n"
".L1:\n"
//6.220 / 6.159

The code generated by mainline had 2 rets, so it didn't fit in my benchmark
loop.

The even better code by segher AT d12relay01 DOT megacenter.de.ibm.com:

"addl $1,a\n"
"sbbl $-1,a\n"
//11.755 / 15.111

One case which you must be careful not to generate, as it's almost twice as
fast as the one above while still being just 2 instructions, is:

"cmpl $-1,a\n"
"adcl $1,a\n"
//7.827 / 7.422

Another 2 slightly faster variants are:

"movl a, %%eax\n"
"cmpl $-1,%%eax\n"
"adcl $1,%%eax\n"
"movl %%eax,a\n"
//6.567 / 8.811

"movl a, %%eax\n"
"addl $1,%%eax\n"
"sbbl $-1,%%eax\n"
"movl %%eax,a\n"
//6.564 / 8.813

What a 14-year-old script kid would write, and what gcc would generate if
these were local variables:

"movl a, %%eax\n"
"incl %%eax\n"
"je .L1\n"
"incl %%eax\n"
".L1:\n"
"movl %%eax, a\n"
//6.162 / 5.426

What I would write (as the variable isn't used in my testcase):

"\n"
//2.155 / 2.410
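The benchmark source itself was not posted; purely as a sketch of the kind
of rdtsc harness described above (loop counts are taken from the description,
the measured fragment here is the two-instruction candidate from comment #0,
main/rdtsc are made-up names, and the direct reference to the symbol a in the
inline asm assumes ELF symbol naming):

#include <stdio.h>
#include <stdint.h>

int a;                                   /* the global the asm fragment updates */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t best = (uint64_t)-1;

    for (int outer = 0; outer < 1000; outer++) {     /* surrounding timing loop */
        uint64_t t0 = rdtsc();
        for (int i = 0; i < 100; i++) {              /* code under test, 100x   */
            __asm__ __volatile__ (
                "addl $1, a\n\t"
                "je   1f\n\t"
                "addl $1, a\n\t"
                "1:"
                : : : "cc", "memory");
        }
        uint64_t t1 = rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }

    printf("%llu cycles (best of 1000 runs)\n", (unsigned long long)best);
    return 0;
}

Taking the minimum over the outer runs filters out most scheduling and cache
noise, which matters when the measured fragment is only a handful of cycles.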
Re. comment #8:
"exponential decaying performance which it has so accurately followed since 2.95"

Can you back this up with numbers, or are you just trolling? If the latter,
please don't do that, you are insulting the work of a dedicated few. Maybe you
should help out instead of trolling, if you think you're so good. If you
continue to make this kind of unhelpful comment, I will ask to have you
blocked from our bugzilla.
This is, in fact, a rare case where RTL store motion does something useful.

With "-O2 -fomit-frame-pointer -march=pentium4" we produce:

        movl    a, %eax
        addl    $1, %eax
        movl    %eax, a
        testl   %eax, %eax
        je      .L4
        addl    $1, %eax
        movl    %eax, a
.L4:
        ret

but with "-O2 -fomit-frame-pointer -march=pentium4 -fgcse-sm" we get:

foo:
        movl    a, %eax
        addl    $1, %eax
        je      .L5
        addl    $1, %eax
        movl    %eax, a
        ret
.L5:
        movl    $0, a
        ret

which is much closer to what we want to get to eventually.

Looking at the first snippet, we shouldn't really need GCSE store motion for
this, because the store to a is not partially redundant. It is fully
redundant and could be sunk if we had a generic hoisting pass that can lift
and sink code. The current implementation of code hoisting in GCSE can only
lift code, not sink it.
(In reply to comment #9)
> Re. comment #8:
> "exponential decaying performance which it has so accurately followed since
> 2.95"
>
> Can you back this up with numbers, or are you just trolling? If the latter,
> please don't do that, you are insulting the work of a dedicated few. Maybe you
> should help out instead of trolling, if you think you're so good. If you
> continue to make this kind of unhelpful comment, I will ask to have you
> blocked from our bugzilla.

The benchmark was unhelpful?

Anyway, compiling dsputil.c from libavcodec takes:

gcc 2.95  0m26.530s
gcc 3.4   0m46.839s
gcc 4.0   1m 1.515s

(time /usr/bin/gcc-4.0 -O3 -g -DHAVE_AV_CONFIG_H -I.. -I'/home/michael/ffmpeg-write2/ffmpeg'/libavutil -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_GNU_SOURCE -c -o dsputil.o dsputil.c)

And runtime performance: just try the recommended way of writing asm/mmx code
for gcc 2.95 vs gcc 3/4.*; handwritten asm code is sometimes quite a bit
faster than what gcc creates from these intrinsics.

Sure, saying gcc gets exponentially slower in general isn't true, but in some
specific and common cases there is a big speed loss ...
This bug is just a dup of the tree-PRE doesn't work for globals bug (if we have one for that).
Re. comment #12: I don't see how PRE for globals would make a difference
here. In any case, AFAIU tree PRE for globals should now work, and we still
don't produce any better code. Today's output (4.4.0 20080921) is still the
same as that of comment #0.
Here is the .final_cleanup dump, fwiw:

;; Function foo (foo)

foo ()
{
  int a.1;

<bb 2>:
  a.1 = a + 1;
  a = a.1;
  if (a.1 != 0) goto <bb 3>; else goto <bb 4>;

<bb 3>:
  a = [plus_expr] a.1 + 1;

<bb 4>:
  return;

}

which translates (on Cygwin) to:

        .file   "t.c"
        .text
        .p2align 4,,15
        .globl  _foo
        .def    _foo;   .scl    2;      .type   32;     .endef
_foo:
        movl    _a, %edx
        leal    1(%edx), %eax
        movl    %eax, _a
        testl   %eax, %eax
        je      L3
        addl    $2, %edx
        movl    %edx, _a
L3:
        rep ret
GCC still doesn't get it right for this PR.

The .139t.optimized dump for trunk r156706:

;; Function foo (foo)

foo ()
{
  int a.1;
  int a.0;

<bb 2>:
  a.0_1 = a;
  a.1_2 = a.0_1 + 1;
  a = a.1_2;
  if (a.1_2 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:
  a.1_5 = a.1_2 + 1;
  a = a.1_5;

<bb 4>:
  return;

}

Assembly output on ix86 (with -O2 -fomit-frame-pointer):

foo:
        movl    a, %edx
        leal    1(%edx), %eax
        testl   %eax, %eax
        movl    %eax, a
        je      .L1
        addl    $2, %edx
        movl    %edx, a
.L1:
        rep ret

And on x86_64 (-O2):

foo:
.LFB0:
        .cfi_startproc
        movl    a(%rip), %edx
        leal    1(%rdx), %eax
        testl   %eax, %eax
        movl    %eax, a(%rip)
        je      .L1
        addl    $2, %edx
        movl    %edx, a(%rip)
.L1:
        rep ret
        .cfi_endproc
To optimize

void foo() {
  extern int a;
  if(++a) ++a;
}

we need to have partial dead store elimination, basically sink the store to
after the condition:

void bar() {
  extern int a;
  int tmp = a;
  if (++tmp)
    ++tmp;
  a = tmp;
}

bar:
        movl    a, %eax
        pushl   %ebp
        movl    %esp, %ebp
        movl    %eax, %edx
        addl    $1, %edx
        je      .L5
        leal    2(%eax), %edx
.L5:
        movl    %edx, a
        popl    %ebp
        ret

We'd then assemble this to similar code as the testcase from comment #2.

See also PR41490, though a working tree-ssa-sink would at most generate

void foobar() {
  extern int a;
  int tmp = a;
  if (++tmp)
    {
      ++tmp;
      a = tmp;
    }
  else
    a = tmp;
}

Still an improvement, as we'd propagate zero to the second store:

foobar:
        movl    a, %eax
        pushl   %ebp
        movl    %esp, %ebp
        cmpl    $-1, %eax
        jne     .L9
        movl    $0, a
        popl    %ebp
        ret
        .p2align 4,,7
        .p2align 3
.L9:
        addl    $2, %eax
        movl    %eax, a
        popl    %ebp
        ret
(In reply to comment #16)
> To optimize
>
> void foo() {
>   extern int a;
>   if(++a) ++a;
> }
>
> we need to have partial dead store elimination, basically sink the store to
> after the condition:
>
> void bar() {
>   extern int a;
>   int tmp = a;
>   if (++tmp)
>     ++tmp;
>   a = tmp;
> }
>
> bar:
>         movl    a, %eax
>         pushl   %ebp
>         movl    %esp, %ebp
>         movl    %eax, %edx
>         addl    $1, %edx
>         je      .L5
>         leal    2(%eax), %edx
> .L5:
>         movl    %edx, a
>         popl    %ebp
>         ret
>
> We'd then assemble this to similar code as the testcase from comment #2.
>
> See also PR41490, though a working tree-ssa-sink would at most generate
>
> void foobar() {
>   extern int a;
>   int tmp = a;
>   if (++tmp)
>     {
>       ++tmp;
>       a = tmp;
>     }
>   else
>     a = tmp;
> }
>
> Still an improvement, as we'd propagate zero to the second store:
>
> foobar:
>         movl    a, %eax
>         pushl   %ebp
>         movl    %esp, %ebp
>         cmpl    $-1, %eax
>         jne     .L9
>         movl    $0, a
>         popl    %ebp
>         ret
>         .p2align 4,,7
>         .p2align 3
> .L9:
>         addl    $2, %eax
>         movl    %eax, a
>         popl    %ebp
>         ret

It now does this (for 4.7).
This is what is produced (at least for 7.3.0):

        movl    a(%rip), %edx
        movl    $0, %eax
        leal    2(%rdx), %ecx
        cmpl    $-1, %edx
        cmovne  %ecx, %eax
        movl    %eax, a(%rip)
        ret
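Read back as C, the if-converted sequence above computes roughly the
following (a sketch for illustration only; foo_cmov is a made-up name, not
something from the thread):

void foo_cmov(void) {
  extern int a;
  int t   = a;                 /* movl  a(%rip), %edx           */
  int inc = t + 2;             /* leal  2(%rdx), %ecx           */
  a = (t != -1) ? inc : 0;     /* cmpl $-1 / cmovne / store     */
}

The branch is gone, at the cost of always performing the store, even when the
original source would have written the same value back.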
(In reply to Andrew Pinski from comment #18)
> This is what is produced (at least for 7.3.0):

Which has been produced since GCC 6; that is due to ifcvt.c changes.

On the trunk we can produce on the gimple now:

  a.1_1 = a;
  if (a.1_1 != -1)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870913]:
  _3 = a.1_1 + 2;

  <bb 4> [local count: 1073741824]:
  # _2 = PHI <0(2), _3(3)>
  a = _2;

Note the most optimal code is really:

        movl    a(%rip), %eax
        addl    $1, %eax
        adcl    $0, %eax
        movl    %eax, a(%rip)

which can be produced with LLVM with:

void foo1() {
  extern int a;
  a += 1;
  a += a==0;
}

GCC is close:

        movl    a(%rip), %eax
        xorl    %edx, %edx
        addl    $1, %eax
        sete    %dl
        addl    %edx, %eax
        movl    %eax, a(%rip)

But the sete and addl can be combined.
(In reply to Andrew Pinski from comment #19)
> (In reply to Andrew Pinski from comment #18)
> > This is what is produced (at least for 7.3.0):
>
> Which has been produced since GCC 6; that is due to ifcvt.c changes.
>
> On the trunk we can produce on the gimple now:
>
>   a.1_1 = a;
>   if (a.1_1 != -1)
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 3> [local count: 536870913]:
>   _3 = a.1_1 + 2;
>
>   <bb 4> [local count: 1073741824]:
>   # _2 = PHI <0(2), _3(3)>
>   a = _2;
>
> Note the most optimal code is really:
>
>         movl    a(%rip), %eax
>         addl    $1, %eax
>         adcl    $0, %eax
>         movl    %eax, a(%rip)

I see

        movl    a(%rip), %edx
        leal    2(%rdx), %eax
        cmpl    $-1, %edx
        movl    $0, %edx
        cmove   %edx, %eax
        movl    %eax, a(%rip)

with -O2, or

        movl    a(%rip), %edx
        xorl    %eax, %eax
        cmpl    $-1, %edx
        je      .L2
        leal    2(%rdx), %eax
.L2:
        movl    %eax, a(%rip)

with -Os.

The GIMPLE code makes it difficult to do better, I guess; it doesn't help
that we don't have any CC there.
Btw, the code for the global variable case is now exactly the same as for

int foo(int a) {
  if(++a) ++a;
  return a;
}

besides the load/store instead of the argument/return, so the bug isn't about
global variables anymore but about optimization obfuscating the

  if (++a) ++a;

sequence in a way that results in worse code than when literally translating
it stmt-at-a-time.