This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
'tree-loop-vectorize' Optimization Question
- From: Andrew Williams <anwilli5 at ncsu dot edu>
- To: gcc-help at gcc dot gnu dot org
- Date: Thu, 19 Feb 2015 00:48:08 -0500
- Subject: 'tree-loop-vectorize' Optimization Question
- Authentication-results: sourceware.org; auth=none
When I enable the tree-loop-vectorize optimization I'm seeing some
behavior that I don't understand...
Here is a test program that highlights the scenario:
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
int main(){
int a = 2;
int *buffer = malloc(10000 * sizeof(*buffer));
for (int i = 0; i < 10000; i++){
if ((i % 1000) == 0){
a = a * 15;
}
buffer[i] = a;
}
int fd = open("output", O_CREAT | O_WRONLY);
write(fd, buffer, 10000 * sizeof(*buffer));
close(fd);
return 0;
}
When compiled with the option disabled (gcc -std=c99 -O3
-fno-tree-loop-vectorize test.c -S -masm=intel) the following code is
produced for the loop:
mov ebx, 2 ; ebx is a
mov edi, 274877907
...
call malloc
...
mov esi, eax ; esi is the buffer pointer
xor ecx, ecx ; ecx is i
...
.L3:
mov eax, ecx
imul edi
mov eax, ecx
sar eax, 31
sar edx, 6
sub edx, eax
imul edx, edx, 1000 ; Compute optimized modulus - see
hackersdelight.org/magic.htm
cmp ecx, edx
jne .L2
mov eax, ebx
sal eax, 4
sub eax, ebx
mov ebx, eax ; Compute optimized multiplication of a by 15
.L2:
mov DWORD PTR [esi+ecx*4], ebx
add ecx, 1
cmp ecx, 10000
jne .L3
When the tree-loop-vectorize option is enabled (gcc -std=c99 -O3
-ftree-loop-vectorize test.c -S -masm=intel), though, the following
code is generated:
mov ebx, 2
...
call malloc
...
mov edi, eax
xor ecx, ecx
...
.L2:
mov eax, 274877907
mov esi, ebx
imul ecx
mov eax, ecx
sal esi, 4
sar eax, 31
sub esi, ebx
sar edx, 6
sub edx, eax
imul edx, edx, 1000
cmp ecx, edx
cmove ebx, esi
mov DWORD PTR [edi+ecx*4], ebx
add ecx, 1
cmp ecx, 10000
jne .L2
The main difference here is that the 'a * 15' calculation is done
every loop iteration instead of every 1000th, but a is only assigned
the new value every 1000th time via the conditional move instruction.
This example isn't too crazy, but performance seems to worsen if you
are doing more complex operations inside the if block (try 'a =
a*a*a*a*a;') and/or if you switch to 64-bit data types (try changing
'a' and 'buffer' to use long long instead of int - the conditional
move gets replaced with two moves after the modulus check branch, and
you need more computational instructions to compensate for the
larger-than-register data-size arithmetic.)
Anyway, my question is why the optimization does this. Why not keep
the modulus check at the top and bypass the calculations each loop
iteration? It might be compelling to get rid of the branch in the
case above, but in the 64-bit data types case this benefit goes away.
This example program is a bit contrived, but I saw this behavior in
"real" code and was trying to understand it. If anyone could share
some insight into this I'd greatly appreciate it!
-Andrew
One last thing to note - The assembly above was generated using gcc
version 4.9.2 20141224 (prerelease), the one that Arch Linux currently
has in it's software repo.