Consider this tiny function:

void foo() {
  extern int a;
  if(++a) ++a;
}

It produces the following assembly with 3.4 (experimental) and 3.2 (Red Hat 9)
at -O2 and better:

        .file   "t.c"
        .text
        .globl  foo
        .type   foo, @function
foo:
        movl    a, %eax
        incl    %eax
        testl   %eax, %eax
        movl    %eax, a
        je      .L1
        incl    %eax
        movl    %eax, a
.L1:
        ret
        .size   foo, .-foo
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.5-tree-ssa 20030924 (merged 20030817)"

This is suboptimal. The optimal code should be:

foo:
        addl    $1,a
        je      .L1
        addl    $1,a
.L1:
        ret
Confirmed. Present mainline gives different assembly than the tree-ssa
results you showed, but the code is equally suboptimal.

One question, though: from the summary line, it seems as if you imply that
the problem is related to global variables. Do you have an example that this
doesn't happen with locals?

W.

Here's what I get with mainline (actually from the C++ FE, but that
shouldn't matter so much):

_Z3foov:
.LFB2:
        movl    a, %eax
        cmpl    $-1, %eax
        je      .L3
        addl    $2, %eax
        movl    %eax, a
        ret
        .p2align 4,,7
.L3:
        xorl    %eax, %eax
        movl    %eax, a
        ret
.LFE2:
        .size   _Z3foov, .-_Z3foov
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4 20030919 (experimental)"
Well I don't _know_ for sure if it is related to global variables, but it
sure seems so:

void foo(int a) {
  if(++a) ++a;
  printf ("%d",a);
}

        .file   "t.c"
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%d"
        .text
        .globl  foo
        .type   foo,@function
foo:
        movl    4(%esp), %eax
        incl    %eax
        je      .L2
        incl    %eax
.L2:
        pushl   %eax
        pushl   $.LC0
        call    printf
        popl    %eax
        popl    %edx
        ret
.Lfe1:
        .size   foo,.Lfe1-foo
        .ident  "GCC: (GNU) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)"

void foo() {
  int a;
  if(++a) ++a;
  printf ("%d",a);
}

produces something similar.
Thanks, makes sense. I tried myself, but using a function-local variable was
no good since it could be optimized away. Didn't think of a function
parameter, though. Thanks for clarifying this.

W.
Subject: Re: New: Suboptimal code with global variables

> void foo() {
>   extern int a;
>   if(++a) ++a;
> }
>
> This is suboptimal. The optimal code should be:
>
> foo:
>         addl    $1,a
>         je      .L1
>         addl    $1,a
> .L1:
>         ret

[I haven't written x86 assembler for a long time, but...]

That code isn't optimal either; something like

foo:
        addl    $1,a
        sbbl    $-1,a
        ret

would be even better :-)
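Read back as C, the add/sbb pair computes roughly the following (a sketch for
illustration only, it is not from the bug report, and the name foo_branchless
is made up): addl $1,a sets the carry flag exactly when a wraps from -1 to 0,
and sbbl $-1,a then adds 1 minus that carry, so the second increment is
skipped precisely when the first one produced zero.

void foo_branchless(void) {
  extern int a;
  int t = a + 1;        /* addl $1,a: carry set only when a was -1        */
  a = t + (t != 0);     /* sbbl $-1,a: adds 1 unless the carry was set    */
}

For a == -1 this stores 0; for any other value it stores a + 2, the same
effect as if(++a) ++a;.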
*** Bug 14552 has been marked as a duplicate of this bug. ***
*** Bug 16857 has been marked as a duplicate of this bug. ***
"GCC: (GNU) 4.2.0 20060210 (experimental)" produces this (at "-O2 -march=pentium4"): foo: movl a, %eax addl $1, %eax movl %eax, a testl %eax, %eax je .L4 addl $1, %eax movl %eax, a .L4: ret
I really think this should be fixed, otherwise gcc won't be able to follow
the exponentially decaying performance which it has so accurately followed
since 2.95 at least. To show more clearly how much speed we could lose by
fixing this, I was nice and benchmarked the code (a simple for loop running
100 times with the code inside, rdtsc-based timing outside, with a 1000
times executed loop surrounding it). Benchmarking was done on an 800MHz
Duron and a 500MHz Pentium 3; the first number is the number of cpu cycles
for the Duron, the second one for the P3.

First let me show you the optimal code by Steven Bosscher?

"addl $1,a\n"
"je .L1\n"
"addl $1,a\n"
".L1:\n"
//11.557 / 12.514

Now what gcc 3.4/3.2 generated:

"movl a, %%eax\n"
"incl %%eax\n"
"testl %%eax, %%eax\n"
"movl %%eax, a\n"
"je .L1\n"
"incl %%eax\n"
"movl %%eax, a\n"
".L1:\n"
//6.220 / 6.159

The code generated by mainline had 2 rets, so it didn't fit in my benchmark
loop.

The even better code by segher AT d12relay01 DOT megacenter.de.ibm.com:

"addl $1,a\n"
"sbbl $-1,a\n"
//11.755 / 15.111

One case which you must be careful not to generate, as it's almost twice as
fast as the one above while still being just 2 instructions, is:

"cmpl $-1,a\n"
"adcl $1,a\n"
//7.827 / 7.422

Another 2 slightly faster variants are:

"movl a, %%eax\n"
"cmpl $-1,%%eax\n"
"adcl $1,%%eax\n"
"movl %%eax,a\n"
//6.567 / 8.811

"movl a, %%eax\n"
"addl $1,%%eax\n"
"sbbl $-1,%%eax\n"
"movl %%eax,a\n"
//6.564 / 8.813

What a 14-year-old script kid would write, and what gcc would generate if
these were local variables:

"movl a, %%eax\n"
"incl %%eax\n"
"je .L1\n"
"incl %%eax\n"
".L1:\n"
"movl %%eax, a\n"
//6.162 / 5.426

What I would write (as the variable isn't used in my testcase):

"\n"
//2.155 / 2.410
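The benchmark source itself was not posted; purely as a sketch of the kind
of rdtsc harness described above (loop counts are taken from the description,
the measured fragment here is the two-instruction candidate from comment #0,
main/rdtsc are made-up names, and the direct reference to the symbol a in the
inline asm assumes ELF symbol naming):

#include <stdio.h>
#include <stdint.h>

int a;                                   /* the global the asm fragment updates */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t best = (uint64_t)-1;

    for (int outer = 0; outer < 1000; outer++) {     /* surrounding timing loop */
        uint64_t t0 = rdtsc();
        for (int i = 0; i < 100; i++) {              /* code under test, 100x   */
            __asm__ __volatile__ (
                "addl $1, a\n\t"
                "je   1f\n\t"
                "addl $1, a\n\t"
                "1:"
                : : : "cc", "memory");
        }
        uint64_t t1 = rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }

    printf("%llu cycles (best of 1000 runs)\n", (unsigned long long)best);
    return 0;
}

Taking the minimum over the outer runs filters out most scheduling and cache
noise, which matters when the measured fragment is only a handful of cycles.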
Re. comment #8:
"exponential decaying performance which it has so accurately followed since 2.95"

Can you back this up with numbers, or are you just trolling? If the latter,
please don't do that, you are insulting the work of a dedicated few. Maybe you
should help out instead of trolling, if you think you're so good. If you
continue to make this kind of unhelpful comment, I will ask to have you
blocked from our bugzilla.
This is, in fact, a rare case where RTL store motion does something useful.

With "-O2 -fomit-frame-pointer -march=pentium4" we produce:

        movl    a, %eax
        addl    $1, %eax
        movl    %eax, a
        testl   %eax, %eax
        je      .L4
        addl    $1, %eax
        movl    %eax, a
.L4:
        ret

but with "-O2 -fomit-frame-pointer -march=pentium4 -fgcse-sm" we get:

foo:
        movl    a, %eax
        addl    $1, %eax
        je      .L5
        addl    $1, %eax
        movl    %eax, a
        ret
.L5:
        movl    $0, a
        ret

which is much closer to what we want to get to eventually.

Looking at the first snippet, we shouldn't really need GCSE store motion for
this, because the store to a is not partially redundant. It is fully
redundant and could be sunk if we had a generic hoisting pass that can lift
and sink code. The current implementation of code hoisting in GCSE can only
lift code, not sink it.
(In reply to comment #9)
> Re. comment #8:
> "exponential decaying performance which it has so accurately followed since
> 2.95"
>
> Can you back this up with numbers, or are you just trolling? If the latter,
> please don't do that, you are insulting the work of a dedicated few. Maybe you
> should help out instead of trolling, if you think you're so good. If you
> continue to make this kind of unhelpful comment, I will ask to have you
> blocked from our bugzilla.

The benchmark was unhelpful?

Anyway, compiling dsputil.c from libavcodec takes:

gcc 2.95  0m26.530s
gcc 3.4   0m46.839s
gcc 4.0   1m 1.515s

(time /usr/bin/gcc-4.0 -O3 -g -DHAVE_AV_CONFIG_H -I.. -I'/home/michael/ffmpeg-write2/ffmpeg'/libavutil -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_GNU_SOURCE -c -o dsputil.o dsputil.c)

And runtime performance: just try the recommended way of writing asm/mmx code
for gcc 2.95 vs gcc 3/4.*; handwritten asm code is sometimes quite a bit
faster than what gcc creates from these intrinsics.

Sure, saying gcc gets exponentially slower in general isn't true, but in some
specific and common cases there is a big speed loss ...
This bug is just a dup of the tree-PRE doesn't work for globals bug (if we have one for that).
Re. comment #12: I don't see how PRE for globals would make a difference
here. In any case, AFAIU tree PRE for globals should now work, and we still
don't produce any better code. Today's output (4.4.0 20080921) is still the
same as that of comment #0.
Here is the .final_cleanup dump, fwiw:

;; Function foo (foo)

foo ()
{
  int a.1;

<bb 2>:
  a.1 = a + 1;
  a = a.1;
  if (a.1 != 0) goto <bb 3>; else goto <bb 4>;

<bb 3>:
  a = [plus_expr] a.1 + 1;

<bb 4>:
  return;

}

which translates (on Cygwin) to:

        .file   "t.c"
        .text
        .p2align 4,,15
        .globl  _foo
        .def    _foo;   .scl    2;      .type   32;     .endef
_foo:
        movl    _a, %edx
        leal    1(%edx), %eax
        movl    %eax, _a
        testl   %eax, %eax
        je      L3
        addl    $2, %edx
        movl    %edx, _a
L3:
        rep ret
GCC still doesn't get it right for this PR.

The .139t.optimized dump for trunk r156706:

;; Function foo (foo)

foo ()
{
  int a.1;
  int a.0;

<bb 2>:
  a.0_1 = a;
  a.1_2 = a.0_1 + 1;
  a = a.1_2;
  if (a.1_2 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:
  a.1_5 = a.1_2 + 1;
  a = a.1_5;

<bb 4>:
  return;

}

Assembly output on ix86 (with -O2 -fomit-frame-pointer):

foo:
        movl    a, %edx
        leal    1(%edx), %eax
        testl   %eax, %eax
        movl    %eax, a
        je      .L1
        addl    $2, %edx
        movl    %edx, a
.L1:
        rep ret

And on x86_64 (-O2):

foo:
.LFB0:
        .cfi_startproc
        movl    a(%rip), %edx
        leal    1(%rdx), %eax
        testl   %eax, %eax
        movl    %eax, a(%rip)
        je      .L1
        addl    $2, %edx
        movl    %edx, a(%rip)
.L1:
        rep ret
        .cfi_endproc
To optimize

void foo() {
  extern int a;
  if(++a) ++a;
}

we need to have partial dead store elimination, basically sink the store to
after the condition:

void bar() {
  extern int a;
  int tmp = a;
  if (++tmp)
    ++tmp;
  a = tmp;
}

bar:
        movl    a, %eax
        pushl   %ebp
        movl    %esp, %ebp
        movl    %eax, %edx
        addl    $1, %edx
        je      .L5
        leal    2(%eax), %edx
.L5:
        movl    %edx, a
        popl    %ebp
        ret

We'd then assemble this to similar code as the testcase from comment #2.

See also PR41490, though a working tree-ssa-sink would at most generate

void foobar() {
  extern int a;
  int tmp = a;
  if (++tmp)
    {
      ++tmp;
      a = tmp;
    }
  else
    a = tmp;
}

Still an improvement, as we'd propagate zero to the second store:

foobar:
        movl    a, %eax
        pushl   %ebp
        movl    %esp, %ebp
        cmpl    $-1, %eax
        jne     .L9
        movl    $0, a
        popl    %ebp
        ret
        .p2align 4,,7
        .p2align 3
.L9:
        addl    $2, %eax
        movl    %eax, a
        popl    %ebp
        ret
(In reply to comment #16)
> To optimize
>
> void foo() {
>   extern int a;
>   if(++a) ++a;
> }
>
> we need to have partial dead store elimination, basically sink the store to
> after the condition:
>
> void bar() {
>   extern int a;
>   int tmp = a;
>   if (++tmp)
>     ++tmp;
>   a = tmp;
> }
>
> bar:
>         movl    a, %eax
>         pushl   %ebp
>         movl    %esp, %ebp
>         movl    %eax, %edx
>         addl    $1, %edx
>         je      .L5
>         leal    2(%eax), %edx
> .L5:
>         movl    %edx, a
>         popl    %ebp
>         ret
>
> We'd then assemble this to similar code as the testcase from comment #2.
>
> See also PR41490, though a working tree-ssa-sink would at most generate
>
> void foobar() {
>   extern int a;
>   int tmp = a;
>   if (++tmp)
>     {
>       ++tmp;
>       a = tmp;
>     }
>   else
>     a = tmp;
> }
>
> Still an improvement, as we'd propagate zero to the second store:
>
> foobar:
>         movl    a, %eax
>         pushl   %ebp
>         movl    %esp, %ebp
>         cmpl    $-1, %eax
>         jne     .L9
>         movl    $0, a
>         popl    %ebp
>         ret
>         .p2align 4,,7
>         .p2align 3
> .L9:
>         addl    $2, %eax
>         movl    %eax, a
>         popl    %ebp
>         ret

It now does this (for 4.7).
This is what is produced (at least for 7.3.0):

        movl    a(%rip), %edx
        movl    $0, %eax
        leal    2(%rdx), %ecx
        cmpl    $-1, %edx
        cmovne  %ecx, %eax
        movl    %eax, a(%rip)
        ret
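Read back as C, the if-converted sequence above computes roughly the
following (a sketch for illustration only; foo_cmov is a made-up name, not
something from the thread):

void foo_cmov(void) {
  extern int a;
  int t   = a;                 /* movl  a(%rip), %edx           */
  int inc = t + 2;             /* leal  2(%rdx), %ecx           */
  a = (t != -1) ? inc : 0;     /* cmpl $-1 / cmovne / store     */
}

The branch is gone, at the cost of always performing the store, even when the
original source would have written the same value back.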
(In reply to Andrew Pinski from comment #18)
> This is what is produced (at least for 7.3.0):

Which has been produced since GCC 6; that is due to ifcvt.c changes.

On the trunk we can produce on the gimple now:

  a.1_1 = a;
  if (a.1_1 != -1)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870913]:
  _3 = a.1_1 + 2;

  <bb 4> [local count: 1073741824]:
  # _2 = PHI <0(2), _3(3)>
  a = _2;

Note the most optimal code is really:

        movl    a(%rip), %eax
        addl    $1, %eax
        adcl    $0, %eax
        movl    %eax, a(%rip)

which can be produced with LLVM with:

void foo1() {
  extern int a;
  a += 1;
  a += a==0;
}

GCC is close:

        movl    a(%rip), %eax
        xorl    %edx, %edx
        addl    $1, %eax
        sete    %dl
        addl    %edx, %eax
        movl    %eax, a(%rip)

But the sete and addl can be combined.
(In reply to Andrew Pinski from comment #19)
> (In reply to Andrew Pinski from comment #18)
> > This is what is produced (at least for 7.3.0):
>
> Which has been produced since GCC 6; that is due to ifcvt.c changes.
>
> On the trunk we can produce on the gimple now:
>
>   a.1_1 = a;
>   if (a.1_1 != -1)
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 3> [local count: 536870913]:
>   _3 = a.1_1 + 2;
>
>   <bb 4> [local count: 1073741824]:
>   # _2 = PHI <0(2), _3(3)>
>   a = _2;
>
> Note the most optimal code is really:
>
>         movl    a(%rip), %eax
>         addl    $1, %eax
>         adcl    $0, %eax
>         movl    %eax, a(%rip)

I see

        movl    a(%rip), %edx
        leal    2(%rdx), %eax
        cmpl    $-1, %edx
        movl    $0, %edx
        cmove   %edx, %eax
        movl    %eax, a(%rip)

with -O2, or

        movl    a(%rip), %edx
        xorl    %eax, %eax
        cmpl    $-1, %edx
        je      .L2
        leal    2(%rdx), %eax
.L2:
        movl    %eax, a(%rip)

with -Os.

The GIMPLE code makes it difficult to do better, I guess; it doesn't help
that we don't have any CC there.
Btw, the code for the global variable case is now exactly the same as for

int foo(int a) {
  if(++a) ++a;
  return a;
}

besides the load/store instead of the argument/return, so the bug isn't about
global variables anymore but about optimization obfuscating the

  if (++a) ++a;

sequence in a way that results in worse code than when literally translating
it stmt-at-a-time.