This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

'tree-loop-vectorize' Optimization Question

From: Andrew Williams <anwilli5 at ncsu dot edu>
To: gcc-help at gcc dot gnu dot org
Date: Thu, 19 Feb 2015 00:48:08 -0500
Subject: 'tree-loop-vectorize' Optimization Question
Authentication-results: sourceware.org; auth=none

When I enable the tree-loop-vectorize optimization I'm seeing some
behavior that I don't understand...

Here is a test program that highlights the scenario:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

int main(){

    int a = 2;
    int *buffer = malloc(10000 * sizeof(*buffer));

    for (int i = 0; i < 10000; i++){
        if ((i % 1000) == 0){
            a = a * 15;
        }
        buffer[i] = a;
    }

    int fd = open("output", O_CREAT | O_WRONLY);
    write(fd, buffer, 10000 * sizeof(*buffer));
    close(fd);

    return 0;
}

When compiled with the option disabled (gcc -std=c99 -O3
-fno-tree-loop-vectorize test.c -S -masm=intel) the following code is
produced for the loop:

    mov    ebx, 2 ; ebx is a
    mov    edi, 274877907
...
    call    malloc
...
    mov    esi, eax ; esi is the buffer pointer
    xor    ecx, ecx ; ecx is i
...
.L3:
    mov    eax, ecx
    imul    edi
    mov    eax, ecx
    sar    eax, 31
    sar    edx, 6
    sub    edx, eax
    imul    edx, edx, 1000 ; Compute optimized modulus - see
hackersdelight.org/magic.htm
    cmp    ecx, edx
    jne    .L2
    mov    eax, ebx
    sal    eax, 4
    sub    eax, ebx
    mov    ebx, eax ; Compute optimized multiplication of a by 15
.L2:
    mov    DWORD PTR [esi+ecx*4], ebx
    add    ecx, 1
    cmp    ecx, 10000
    jne    .L3

When the tree-loop-vectorize option is enabled (gcc -std=c99 -O3
-ftree-loop-vectorize test.c -S -masm=intel), though, the following
code is generated:

    mov    ebx, 2
...
    call    malloc
...
    mov    edi, eax
    xor    ecx, ecx
...
.L2:
    mov    eax, 274877907
    mov    esi, ebx
    imul    ecx
    mov    eax, ecx
    sal    esi, 4
    sar    eax, 31
    sub    esi, ebx
    sar    edx, 6
    sub    edx, eax
    imul    edx, edx, 1000
    cmp    ecx, edx
    cmove    ebx, esi
    mov    DWORD PTR [edi+ecx*4], ebx
    add    ecx, 1
    cmp    ecx, 10000
    jne    .L2

The main difference here is that the 'a * 15' calculation is done
every loop iteration instead of every 1000th, but a is only assigned
the new value every 1000th time via the conditional move instruction.

This example isn't too crazy, but performance seems to worsen if you
are doing more complex operations inside the if block (try 'a =
a*a*a*a*a;') and/or if you switch to 64-bit data types (try changing
'a' and 'buffer' to use long long instead of int - the conditional
move gets replaced with two moves after the modulus check branch, and
you need more computational instructions to compensate for the
larger-than-register data-size arithmetic.)

Anyway, my question is why the optimization does this.  Why not keep
the modulus check at the top and bypass the calculations each loop
iteration?  It might be compelling to get rid of the branch in the
case above, but in the 64-bit data types case this benefit goes away.

This example program is a bit contrived, but I saw this behavior in
"real" code and was trying to understand it.  If anyone could share
some insight into this I'd greatly appreciate it!

-Andrew

One last thing to note - The assembly above was generated using gcc
version 4.9.2 20141224 (prerelease), the one that Arch Linux currently
has in it's software repo.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]