When I enable the tree-loop-vectorize optimization I'm seeing some behavior that I don't understand... Here is a minimized test case that highlights the scenario: typedef long unsigned int size_t; extern void *malloc (size_t __size); int main(){ unsigned int a = 2; unsigned int *buffer = malloc(10000 * sizeof(*buffer)); for (int i = 0; i < 10000; i++){ if ((i % 1000) == 0){ a = a * a * a * a * a; } buffer[i] = a; } return buffer[999]; } When compiled with the option disabled (xgcc -save-temps -m32 -Wall -Wextra -std=c99 -O3 -fno-tree-loop-vectorize -S -masm=intel test.c) the following code is produced: mov ebx, 2 ; a = 2 mov edi, 274877907 sub esp, 20 push 40000 call malloc add esp, 16 mov esi, eax ; buffer = malloc(...) xor ecx, ecx ; i = 0 .p2align 4,,10 .p2align 3 .L3: mov eax, ecx imul edi mov eax, ecx sar eax, 31 sar edx, 6 sub edx, eax imul edx, edx, 1000 cmp ecx, edx jne .L2 ; if ((i % 1000) == 0) { mov eax, ebx imul eax, ebx imul eax, eax imul ebx, eax ; a = a * a * a * a * a; } .L2: mov DWORD PTR [esi+ecx*4], ebx ; buffer[i] = a add ecx, 1 ; i++ cmp ecx, 10000 jne .L3 ; continue if i < 10000 ... When the tree-loop-vectorize option is enabled (xgcc -save-temps -m32 -Wall -Wextra -std=c99 -O3 -ftree-loop-vectorize test.c -S -masm=intel), though, the following code is generated: mov esi, 2 ; a = 2 sub esp, 20 push 40000 call malloc add esp, 16 mov edi, eax ; buffer = malloc(...) xor ecx, ecx ; i = 0 .p2align 4,,10 .p2align 3 .L2: mov ebx, esi mov eax, 274877907 imul ecx mov eax, ecx imul ebx, esi sar eax, 31 sar edx, 6 imul ebx, ebx sub edx, eax imul edx, edx, 1000 imul ebx, esi ; a = a * a * a * a * a; cmp ecx, edx cmove esi, ebx ; move new value if ((i % 1000) == 0) mov DWORD PTR [edi+ecx*4], esi ; buffer[i] = a add ecx, 1 ; i++ cmp ecx, 10000 jne .L2 ; continue if i < 10000 The main difference here is that the 'a * a * a * a * a' calculation is done every loop iteration instead of every 1000th, but a is only assigned the new value every 1000th time via the conditional move instruction. It seems inefficient to do this, and from basic testing the code compiled without the tree-loop-vectorize optimization seems to run faster on my machine. The "real" code that I derived this from has it worse - it uses 64-bit data types in a 32-bit binary, so there are several multiply instructions for each logical multiplication in the code, the stack gets used for storing some of the intermediate values, and after computing everything into some registers it just replaces those values with ones stored on the stack from previous iterations in the case where the modulus condition is not met. :( GCC version: Using built-in specs. COLLECT_GCC=xgcc Target: x86_64-unknown-linux-gnu Configured with: ./configure Thread model: posix gcc version 4.9.2 (GCC) I've also reproduced the issue on gcc 4.9.2 20141224 (prerelease) from an Arch Linux distro and gcc 4.5.2-8ubuntu4 from a Ubuntu distro. I'm happy to provide any other information needed. Thanks!
This is if-conversion at work making the code vectorizable. You can disable it with -fno-tree-loop-if-convert. It's not easy to assess profitablility here (well, locally it should use bb frequencies), as computing a*a*a*a*a always could be a win if vectorizing the rest of the loop compensates for the cost. In this case the optimal thing to do is to turn the loop into a loop nest. Not sure how you'd call this kind of loop transform - sth with unswitching IV-based conditions (it's not exactly splitting in this case). We want for (int j = 0; j < 10000/1000; j++){ int i; for (i = j*1000; i < 999; ++i) buffer[i] = a; a = a * a * a * a * a; buffer[i] = a; } so we can vectorize the inner loop and don't need to evaluate the conditional there.
So for GCC 7+, -ftree-vectorize vs -fno-tree-vectorize case is no longer different and there is no cmov in the code any more. Note I noticed ICX/LLVM convert this to cmov while ICC and MSVC do not.
This has been fixed in GCC 7+ which now performs if-conversion only when the result is vectorized.