This is the mail archive of the mailing list for the GCC project.

G++ could optimize ASM code more

Hello dear people,

As I was optimizing my program, I found a few things which looked odd to me in the assembler code.

I am on an AMD x86_64 box running Debian Squeeze, GCC: (Debian 4.4.5-8) 4.4.5.

I am using "-O3 -mtune=native" for optimization.



I was using a performance-critical for loop (very tight) which is iterated a few million times. Of course, it is important that this loop uses as few OP codes as possible, since every OP code makes a difference at that scale. Also, I was using OpenMP.

I marked the loop body with "nop" markers and checked whether its contents could be optimized further. I found that "-O3" did not optimize a special case: I needed to add a manual cast to make the loop 1 OP code shorter. Please look at the following short example, which reproduces the scenario:

(I do not know if this behavior can be reproduced without using OpenMP)

#include <stdlib.h>
#include <stdio.h>
#include <strings.h>

// Please toggle and compare ASM output ("nop" markers). Using "addcast" makes output 1 OP code shorter.
#define addcast

int main(void) {
	volatile int idx_a = 0;
	volatile int idx_b = 1;

	unsigned char *a;
	a = (unsigned char*)malloc(2);

	unsigned long long int c = 0;

	#pragma omp parallel for firstprivate(a) reduction(+:c)
	for (int i = 0; i < 2; ++i) {
		__asm__ ("nop"); // marker
#ifdef addcast
		// Contains a cast to "unsigned long long int", which was not done by "-O3"
		// This cast makes the output 1 OP code shorter
		// imulq %rdi, %rdx # tmp80, tmp81
		// addq %rdx, %rcx # tmp81, c
		c += (unsigned long long int)a[idx_a] * a[idx_b];
#else
		// Using "-O3", it produces 1 OP code which could be optimized away
		// imull %edi, %edx # tmp80, tmp81 <-- the compiler should use imulq instead of imull
		// movslq %edx,%rdx # tmp81, tmp82 <-- not necessary... BETTER: optimize away using imulq !
		// addq %rdx, %rcx # tmp82, c
		c += a[idx_a] * a[idx_b];
#endif
		__asm__ ("nop"); // marker
	}

	return c;
}
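
The claim that the extra sign extension is unnecessary can be checked at the value level, independently of OpenMP. The following is only a sketch under stated assumptions: it assumes an 8-bit "unsigned char" and a 32-bit "int" (as on x86-64), and the function names mul_promoted, mul_widened and count_mismatches are made up for illustration; they are not part of the original program. It compares the product as written without the cast (operands promoted to int, multiplied as int, result then converted) with the product computed after the manual cast.

```c
#include <limits.h>

/* Product as written without the cast: both operands are promoted to int,
   multiplied as int, and the int result is then converted to 64 bits. */
unsigned long long mul_promoted(unsigned char x, unsigned char y) {
    return x * y;
}

/* Product with the manual cast: the multiplication itself is done in 64 bits. */
unsigned long long mul_widened(unsigned char x, unsigned char y) {
    return (unsigned long long)x * y;
}

/* Count the operand pairs for which the two forms give different values. */
unsigned int count_mismatches(void) {
    unsigned int mismatches = 0;
    for (unsigned int x = 0; x <= UCHAR_MAX; ++x) {
        for (unsigned int y = 0; y <= UCHAR_MAX; ++y) {
            unsigned char cx = (unsigned char)x;
            unsigned char cy = (unsigned char)y;
            if (mul_promoted(cx, cy) != mul_widened(cx, cy)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Under the assumptions above, the largest possible product is 255 * 255 = 65025, which is non-negative and fits in an int, so count_mismatches() returns 0. This only shows that both forms compute the same values; it says nothing about whether GCC's optimizers can or should derive that range information.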



Compiling the following program:

#include <stdio.h>
#include <strings.h>
int main(void) {
        volatile unsigned char a = 4;
        volatile unsigned char b = 6;
        volatile unsigned long long int c = a * b;
        return c;
}

gives the following assembler output:

        .file   "main.c"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
        .cfi_startproc
        .cfi_personality 0x3,__gxx_personality_v0
        movb    $4, -1(%rsp)
        movb    $6, -2(%rsp)
        movzbl  -1(%rsp), %edx
        movzbl  -2(%rsp), %eax
        movzbl  %dl, %edx
        movzbl  %al, %eax
        imull   %edx, %eax
        movq    %rax, -16(%rsp)     # REDUNDANT??
        movq    -16(%rsp), %rax     # REDUNDANT??
        ret
        .cfi_endproc
        .size   main, .-main
        .ident  "GCC: (Debian 4.4.5-8) 4.4.5"
        .section        .note.GNU-stack,"",@progbits

AFAIK, the two movq statements are redundant. What do they do? They just do rsp[-16]=rax and then rax=rsp[-16]. Or am I wrong?
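
One possible explanation, offered as an assumption and not as a statement about GCC internals: in the small program above, "c" is itself declared volatile. The language rules require every access to a volatile object to be performed as written, so initializing "c" is a required store and "return c" is a required load, which matches the store/load pair in the listing. The sketch below puts the original computation next to a hypothetical variant in which only "c" is non-volatile; the function names product_volatile_result and product_plain_result are invented for illustration.

```c
/* Same computation as in the small program: c is volatile, so its
   initialization is a required store and reading it back is a required load. */
int product_volatile_result(void) {
    volatile unsigned char a = 4;
    volatile unsigned char b = 6;
    volatile unsigned long long int c = a * b;
    return (int)c;
}

/* Hypothetical variant: c is an ordinary variable, so the compiler is
   allowed to keep it in a register and never write it to the stack. */
int product_plain_result(void) {
    volatile unsigned char a = 4;
    volatile unsigned char b = 6;
    unsigned long long int c = a * b;
    return (int)c;
}
```

Both functions return 24. Compiling them with "-O3 -S" and comparing the output would show whether the store/load pair disappears in the second variant; that comparison has not been run here.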


Best regards and I hope I could help

Daniel Marschall
