65335 – Potential optimization issue with 'tree-loop-vectorize'

Bug 65335 - Potential optimization issue with 'tree-loop-vectorize'

Summary: Potential optimization issue with 'tree-loop-vectorize'

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.9.2

Importance:	P3 enhancement
Target Milestone:	7.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer
	Show dependency tree / graph

Reported:	2015-03-06 16:16 UTC by anwilli5
Modified:	2021-08-23 06:44 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-03-09 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description anwilli5 2015-03-06 16:16:31 UTC

When I enable the tree-loop-vectorize optimization I'm seeing some behavior that I don't understand...

Here is a minimized test case that highlights the scenario:


typedef long unsigned int size_t;
extern void *malloc (size_t __size);

int main(){

    unsigned int a = 2;
    unsigned int *buffer = malloc(10000 * sizeof(*buffer));

    for (int i = 0; i < 10000; i++){
        if ((i % 1000) == 0){
            a = a * a * a * a * a;
        }
        buffer[i] = a;
    }

    return buffer[999];
}


When compiled with the option disabled (xgcc -save-temps -m32 -Wall -Wextra -std=c99 -O3 -fno-tree-loop-vectorize -S -masm=intel test.c) the following code is produced:

	mov	ebx, 2           ; a = 2
	mov	edi, 274877907
	sub	esp, 20
	push	40000
	call	malloc
	add	esp, 16
	mov	esi, eax         ; buffer = malloc(...)
	xor	ecx, ecx         ; i = 0
	.p2align 4,,10
	.p2align 3
.L3:
	mov	eax, ecx
	imul	edi
	mov	eax, ecx
	sar	eax, 31
	sar	edx, 6
	sub	edx, eax
	imul	edx, edx, 1000 
	cmp	ecx, edx
	jne	.L2              ; if ((i % 1000) == 0) {
	mov	eax, ebx
	imul	eax, ebx
	imul	eax, eax
	imul	ebx, eax         ; a = a * a * a * a * a; }
.L2:
	mov	DWORD PTR [esi+ecx*4], ebx ; buffer[i] = a
	add	ecx, 1           ; i++
	cmp	ecx, 10000
	jne	.L3              ; continue if i < 10000
        ...

When the tree-loop-vectorize option is enabled (xgcc -save-temps -m32 -Wall -Wextra -std=c99 -O3 -ftree-loop-vectorize test.c -S -masm=intel), though, the following code is generated:

	mov	esi, 2           ; a = 2
	sub	esp, 20
	push	40000
	call	malloc
	add	esp, 16
	mov	edi, eax         ; buffer = malloc(...)
	xor	ecx, ecx         ; i = 0
	.p2align 4,,10
	.p2align 3
.L2:
	mov	ebx, esi
	mov	eax, 274877907
	imul	ecx
	mov	eax, ecx
	imul	ebx, esi
	sar	eax, 31
	sar	edx, 6
	imul	ebx, ebx
	sub	edx, eax
	imul	edx, edx, 1000
	imul	ebx, esi          ; a = a * a * a * a * a;
	cmp	ecx, edx
	cmove	esi, ebx          ; move new value if ((i % 1000) == 0)
	mov	DWORD PTR [edi+ecx*4], esi ; buffer[i] = a
	add	ecx, 1           ; i++
	cmp	ecx, 10000
	jne	.L2              ; continue if i < 10000


The main difference here is that the 'a * a * a * a * a' calculation is done every loop iteration instead of every 1000th, but a is only assigned the new value every 1000th time via the conditional move instruction.  It seems inefficient to do this, and from basic testing the code compiled without the tree-loop-vectorize optimization seems to run faster on my machine.

The "real" code that I derived this from has it worse - it uses 64-bit data types in a 32-bit binary, so there are several multiply instructions for each logical multiplication in the code, the stack gets used for storing some of the intermediate values, and after computing everything into some registers it just replaces those values with ones stored on the stack from previous iterations in the case where the modulus condition is not met.  :(

GCC version:

Using built-in specs.
COLLECT_GCC=xgcc
Target: x86_64-unknown-linux-gnu
Configured with: ./configure
Thread model: posix
gcc version 4.9.2 (GCC)

I've also reproduced the issue on gcc 4.9.2 20141224 (prerelease) from an Arch Linux distro and gcc 4.5.2-8ubuntu4 from a Ubuntu distro.

I'm happy to provide any other information needed.  Thanks!

Comment 1 Richard Biener 2015-03-09 12:06:33 UTC

This is if-conversion at work making the code vectorizable.  You can disable it with -fno-tree-loop-if-convert.

It's not easy to assess profitablility here (well, locally it should use
bb frequencies), as computing a*a*a*a*a always could be a win if vectorizing
the rest of the loop compensates for the cost.

In this case the optimal thing to do is to turn the loop into a loop nest.
Not sure how you'd call this kind of loop transform - sth with unswitching
IV-based conditions (it's not exactly splitting in this case).  We want

    for (int j = 0; j < 10000/1000; j++){
       int i;
       for (i = j*1000; i < 999; ++i)
         buffer[i] = a;
       a = a * a * a * a * a;
       buffer[i] = a;
    }

so we can vectorize the inner loop and don't need to evaluate the conditional
there.

Comment 2 Andrew Pinski 2021-08-21 18:59:50 UTC

So for GCC 7+,  -ftree-vectorize vs  -fno-tree-vectorize case is no longer different and there is no cmov in the code any more.

Note I noticed ICX/LLVM convert this to cmov while ICC and MSVC do not.

Comment 3 Richard Biener 2021-08-23 06:44:35 UTC

This has been fixed in GCC 7+ which now performs if-conversion only when the result is vectorized.