This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: G++ could optimize ASM code more


Daniel Marschall <daniel-marschall@viathinksoft.de> writes:

> As I was optimizing my program, I found a few things which looked odd
> to me in the assembler code.

Thanks.  It's often best to report missed optimizations at
http://gcc.gnu.org/bugzilla/ .  They will tend to be forgotten on the
mailing list.


> I am on an AMD x86_64 box running Debian Squeeze, GCC: (Debian
> 4.4.5-8) 4.4.5.

Note that the current GCC release is 4.7.0.


> 		#ifdef addcast
> 		// Contains a cast to "unsigned long long int", which was not done by "-O3"
> 		// This cast makes the output 1 OP code shorter
> 		// imulq   %rdi, %rdx      # tmp80, tmp81
> 		// addq    %rdx, %rcx      # tmp81, c
> 		c += (unsigned long long int)a[idx_a] * a[idx_b];
> 		#else
> 		// Using "-O3", it produces 1 OP code which could be optimized away
> 		// imull   %edi, %edx      # tmp80, tmp81  <-- the compiler should use imulq instead of imull
> 		// movslq  %edx,%rdx       # tmp81, tmp82  <-- not necessary... BETTER: optimize away using imulq !
> 		// addq    %rdx, %rcx      # tmp82, c
> 		c += a[idx_a] * a[idx_b];
> 		#endif

This cast changes the meaning of the code, so it's not surprising that
you see different assembler instructions.  The first case above will do
the multiplication in the type "unsigned long long".  In the second case
the "unsigned char" values are zero-extended to int, and the
multiplication is done in the type "int".  Then the "int" result is
sign-extended to "unsigned long long" for the addition.
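
A minimal sketch of that difference in C.  The function names are
made up for illustration and are not part of the original program.
It only shows which type each multiplication is done in and that,
for unsigned char operands, both forms add the same value.

```c
#include <stddef.h>

/* Size of the product's type when both operands are unsigned char:
   the integer promotions convert each operand to int first. */
size_t size_of_plain_product(unsigned char x, unsigned char y)
{
    return sizeof(x * y);                      /* sizeof(int) */
}

/* Size of the product's type once one operand is cast. */
size_t size_of_cast_product(unsigned char x, unsigned char y)
{
    return sizeof((unsigned long long)x * y);  /* sizeof(unsigned long long) */
}

/* Uncast form from the quoted code: multiply in int, then convert the
   int result to unsigned long long for the addition. */
unsigned long long accumulate_plain(unsigned long long c,
                                    unsigned char x, unsigned char y)
{
    return c + x * y;
}

/* Cast form: the multiplication itself is done in unsigned long long. */
unsigned long long accumulate_cast(unsigned long long c,
                                   unsigned char x, unsigned char y)
{
    return c + (unsigned long long)x * y;
}
```

Calling these from a small main and printing the results is enough to
see that the product types differ while the sums agree.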

In this case it's true that the compiler could convert the code as you
suggest, based on the knowledge that the int values are always in the
range 0 to 255.  However, it's not clear to me that using imulq would be
better.  My copy of the Intel optimization manual suggests that imull
has slightly lower latency than imulq, so I think that in many cases
imull would be preferred.
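
The range argument can be checked by brute force.  The sketch below
assumes an 8-bit unsigned char and an int of at least 32 bits, as on
x86-64; the function names are hypothetical.  It says nothing about
which instruction is faster, only that the two forms agree for every
pair of operands.

```c
#include <limits.h>

/* Largest possible product of two unsigned char values.  With an 8-bit
   unsigned char this is 255 * 255 = 65025, which fits in a 32-bit int. */
int max_uchar_product(void)
{
    return UCHAR_MAX * UCHAR_MAX;
}

/* Count the operand pairs for which multiplying in int and multiplying
   in unsigned long long give different results. */
unsigned long count_mismatches(void)
{
    unsigned long mismatches = 0;

    for (int x = 0; x <= UCHAR_MAX; ++x) {
        for (int y = 0; y <= UCHAR_MAX; ++y) {
            unsigned char a = (unsigned char)x;
            unsigned char b = (unsigned char)y;
            unsigned long long narrow = a * b;                    /* int multiply */
            unsigned long long wide = (unsigned long long)a * b;  /* 64-bit multiply */

            if (narrow != wide)
                ++mismatches;
        }
    }
    return mismatches;
}
```

Because the int product is never negative and never overflows, the
sign extension cannot change the value, which is what would make the
transformation legal for the compiler.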


> Compiling the following program:
>
> #include <stdio.h>
> #include <strings.h>
> int main(void) {
>         volatile unsigned char a = 4;
>         volatile unsigned char b = 6;
>         volatile unsigned long long int c = a * b;
>         return c;
> }
>
> produces:
>
>         .file   "main.c"
>         .text
>         .p2align 4,,15
> .globl main
>         .type   main, @function
> main:
> .LFB16:
>         .cfi_startproc
>         .cfi_personality 0x3,__gxx_personality_v0
>         movb    $4, -1(%rsp)
>         movb    $6, -2(%rsp)
>         movzbl  -1(%rsp), %edx
>         movzbl  -2(%rsp), %eax
>         movzbl  %dl, %edx
>         movzbl  %al, %eax
>         imull   %edx, %eax
>         cltq
>         movq    %rax, -16(%rsp)     # REDUNDANT??
>         movq    -16(%rsp), %rax     # REDUNDANT??
>         ret
>         .cfi_endproc
> .LFE16:
>         .size   main, .-main
>         .ident  "GCC: (Debian 4.4.5-8) 4.4.5"
>         .section        .note.GNU-stack,"",@progbits
>
> AFAIK, the two movq statements are redundant. What do they do? They
> just do rax=rsp[-16] and rsp[-16]=rax . Or am I wrong?

Those movq instructions exist because you declared c as volatile.  A
volatile local variable must live on the stack.  The first instruction
stores the value into the local variable c.  The second retrieves the
value for the return statement.

In general, uses of volatile variables are not optimized.  That is
intentional and based on the definition of volatile in the language
standard.  So the bar is pretty high for arguing that an optimization
is missing for a volatile variable.
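
For comparison, here is a sketch of the same computation with and
without volatile.  The function names are hypothetical.  Both return
the same value.  With optimization enabled the compiler is free to
reduce the second function to a constant return, though the exact
output depends on the compiler version and flags.

```c
/* Same computation as the quoted program, with volatile locals: every
   access to a, b and c must actually be performed as a load or store. */
int product_with_volatile(void)
{
    volatile unsigned char a = 4;
    volatile unsigned char b = 6;
    volatile unsigned long long c = a * b;  /* store into c */
    return (int)c;                          /* load c back for the return */
}

/* Without volatile, nothing requires the variables to exist in memory,
   so the compiler may fold the whole computation at compile time. */
int product_without_volatile(void)
{
    unsigned char a = 4;
    unsigned char b = 6;
    unsigned long long c = a * b;
    return (int)c;
}
```

Compiling both with -O2 -S and comparing the generated assembler
should show the store and reload of c only in the first function.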

Ian

