Bug 18562

Summary: SSE constant vector initialization produces dead constant values on stack
Status: RESOLVED FIXED
Product: gcc
Component: target
Version: 4.0.0
Target Milestone: 4.9.0
Target: i686-*-*
Severity: enhancement
Priority: P2
Keywords: missed-optimization, ssemmx
Reporter: uros
Assignee: Not yet assigned to anyone <unassigned>
CC: gcc-bugs, pinskia, steven
Last reconfirmed: 2006-09-18 03:13:39
Attachments: Patch which implements CONSTRUCTOR to VECTOR_CST

Description uros 2004-11-19 08:48:42 UTC
Compiling this testcase with '-O2 -msse' produces suboptimal code. 'val1' is
merged into a vector at compile time, but it is still stored on the stack. GCC
does not detect that the 'val1' value on the stack is sitting there unused.

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
	float val1 = 1.3f;
	float result[4];
	__m128 A;

	A = _mm_load1_ps(&val1);
	_mm_storeu_ps(result, A);

	printf("%f %f %f %f\n", result[0], result[1], result[2], result[3]);
	return 0;
}

This code is produced:
...
.LC2:                       <- merged vector
        .long   1067869798
        .long   0
        .long   0
        .long   0
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $72, %esp
        movss   .LC2, %xmm0           <- vector is loaded into %xmm0
        andl    $-16, %esp
        subl    $16, %esp
        movl    $0x3fa66666, -4(%ebp) <- 'val1' is put on stack here
        shufps  $0, %xmm0, %xmm0
        movl    $.LC1, (%esp)
        movups  %xmm0, -20(%ebp)
        flds    -8(%ebp)
        fstpl   28(%esp)
        flds    -12(%ebp)
        fstpl   20(%esp)
        flds    -16(%ebp)
        fstpl   12(%esp)
        flds    -20(%ebp)
        fstpl   4(%esp)
        call    printf
        xorl    %eax, %eax
        leave
        ret

An even worse situation arises with:

int main(void) {
	float val1[4] = {1.3f, 1.4f, 1.5f, 1.6f};
	float result[4];
	__m128 A;

	A = _mm_loadu_ps(val1);
	_mm_storeu_ps(result, A);

	printf("%f %f %f %f\n", result[0], result[1], result[2], result[3]);
	return 0;
}

The following asm code is produced:
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $72, %esp
        andl    $-16, %esp
        movl    $0x3fa66666, -16(%ebp)
        subl    $16, %esp
        movl    $0x3fb33333, -12(%ebp)
        movl    $0x3fc00000, -8(%ebp)
        movl    $0x3fcccccd, -4(%ebp)
        movups  -16(%ebp), %xmm0
        movups  %xmm0, -32(%ebp)
        flds    -20(%ebp)
        fstpl   28(%esp)
        flds    -24(%ebp)
        fstpl   20(%esp)
        flds    -28(%ebp)
        fstpl   12(%esp)
        flds    -32(%ebp)
        fstpl   4(%esp)
        movl    $.LC4, (%esp)
        call    printf
        xorl    %eax, %eax
        leave
        ret

The constant values are not merged into a vector at compile time; the vector is
built on the stack and then loaded into an %xmm register. The values on the stack
are again left unused after the vector initialization.
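
For reference, a merged constant vector would look roughly like this in the
assembly output (a hand-written sketch; the label name is illustrative, and the
.long values are the IEEE-754 encodings of 1.3f, 1.4f, 1.5f and 1.6f):

.LCx:
        .long   1067869798
        .long   1068708659
        .long   1069547520
        .long   1070386381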

Uros.
Comment 1 Andrew Pinski 2004-11-19 14:44:06 UTC
Confirmed.  This is either a target bug or an RTL optimization problem.  I say that
because the builtins are not expanded before reaching RTL.
Comment 2 uros 2004-11-19 15:41:35 UTC
If val1 is moved out of the main() function, the produced code is OK:

--cut here--
#include <xmmintrin.h>
#include <stdio.h>

float val1[4] = {1.3f, 1.4f, 1.5f, 1.6f};

int main(void) {
	float result[4];
	__m128 A;
...
--cut here--

val1:                          <- vector is merged this time
        .long   1067869798
        .long   1068708659
        .long   1069547520
        .long   1070386381
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%f %f %f %f\n"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $56, %esp
        andl    $-16, %esp
        movups  val1, %xmm0      <- loaded directly into xmm reg
        movups  %xmm0, -16(%ebp)
        subl    $16, %esp
        flds    -4(%ebp)
        ...
Comment 3 Richard Henderson 2005-01-11 23:58:39 UTC
Your first test case is fixed by the patch for PR13366.  We now get

        .align 16
.LC0:
        .long   1067869798
        .long   1067869798
        .long   1067869798
        .long   1067869798
...
        movaps  .LC0, %xmm0
        movups  %xmm0, 56(%esp)

I really don't know what you expected out of your second test case.  Perhaps
the problem is that we don't expose what movups means to the compiler.  I can
see that perhaps we could reuse MISALIGNED_INDIRECT_REF to represent this
(which as a side benefit would result in movhps+movlps instead of one movups,
which runs faster).  But I doubt that we have the machinery at the tree level
to copy-propagate the aggregate initialization from val1[4] -> result[4].  I'm
pretty sure that we have open enhancement requests for this already.
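
For illustration, the split form would be something like this (a sketch; the
address register is arbitrary):

        movlps  (%eax), %xmm0         # low 64 bits of the unaligned load
        movhps  8(%eax), %xmm0        # high 64 bits

instead of a single

        movups  (%eax), %xmm0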

So shall we mark this fixed?
Comment 4 Uroš Bizjak 2005-01-12 06:53:49 UTC
(In reply to comment #3)

In the second test case, the compiler should figure out that the whole val1[] array
is initialized with constants. In that case, the .LCx constant vector can be built
at compile time, and the load of the constants onto the stack can be replaced by a
"movaps .LCx, %xmm0 / movups %xmm0, -32(%ebp)" insn pair.
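
Roughly (a sketch, with the result slot at -32(%ebp) as in the code above and
.LCx standing for the merged constant vector):

        movaps  .LCx, %xmm0           <- constant vector built at compile time
        movups  %xmm0, -32(%ebp)      <- unaligned store into result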
Comment 5 Steven Bosscher 2005-01-12 09:24:44 UTC
If this is closed, let's be sure to find that enhancement request and close 
this as a dup. 
Comment 6 Steven Bosscher 2005-01-12 09:37:13 UTC
AMD64 has this at -O2: 
 
Test case 1: 
main: 
.LFB499: 
        subq    $24, %rsp 
.LCFI0: 
        movl    $.LC1, %edi 
        movl    $4, %eax 
        movaps  .LC0(%rip), %xmm0 
        movups  %xmm0, (%rsp) 
        cvtss2sd        12(%rsp), %xmm3 
        cvtss2sd        8(%rsp), %xmm2 
        cvtss2sd        4(%rsp), %xmm1 
        cvtss2sd        (%rsp), %xmm0 
        call    printf 
        xorl    %eax, %eax 
        addq    $24, %rsp 
        ret 
.LFE499: 
 
 
Test case 2: 
main: 
.LFB499: 
        subq    $40, %rsp 
.LCFI0: 
        movl    $.LC4, %edi 
        movl    $4, %eax 
        movl    $0x3fa66666, 16(%rsp) 
        movl    $0x3fb33333, 20(%rsp) 
        movl    $0x3fc00000, 24(%rsp) 
        movl    $0x3fcccccd, 28(%rsp) 
        movups  16(%rsp), %xmm0 
        movups  %xmm0, (%rsp) 
        cvtss2sd        12(%rsp), %xmm3 
        cvtss2sd        8(%rsp), %xmm2 
        cvtss2sd        4(%rsp), %xmm1 
        cvtss2sd        (%rsp), %xmm0 
        call    printf 
        xorl    %eax, %eax 
        addq    $40, %rsp 
        ret 
 
 
 
With -m32 -march=pentium4 -mtune=prescott I get this: 
 
Test case 1: 
main: 
        pushl   %ebp 
        movl    %esp, %ebp 
        subl    $56, %esp 
        andl    $-16, %esp 
        subl    $16, %esp 
        movaps  .LC0, %xmm0 
        movups  %xmm0, -16(%ebp) 
        flds    -4(%ebp) 
        fstpl   28(%esp) 
        flds    -8(%ebp) 
        fstpl   20(%esp) 
        flds    -12(%ebp) 
        fstpl   12(%esp) 
        flds    -16(%ebp) 
        fstpl   4(%esp) 
        movl    $.LC1, (%esp) 
        call    printf 
        xorl    %eax, %eax 
        leave 
        ret 
 
Test case 2 (which is basically the same as Uros' code): 
main: 
        pushl   %ebp 
        movl    %esp, %ebp 
        subl    $72, %esp 
        andl    $-16, %esp 
        subl    $16, %esp 
        movl    $0x3fa66666, -16(%ebp) 
        movl    $0x3fb33333, -12(%ebp) 
        movl    $0x3fc00000, -8(%ebp) 
        movl    $0x3fcccccd, -4(%ebp) 
        movups  -16(%ebp), %xmm0 
        movups  %xmm0, -32(%ebp) 
        flds    -20(%ebp) 
        fstpl   28(%esp) 
        flds    -24(%ebp) 
        fstpl   20(%esp) 
        flds    -28(%ebp) 
        fstpl   12(%esp) 
        flds    -32(%ebp) 
        fstpl   4(%esp) 
        movl    $.LC4, (%esp) 
        call    printf 
        xorl    %eax, %eax 
        leave 
        ret 
 
Comment 7 Uroš Bizjak 2005-01-12 10:54:35 UTC
Another testcase that I think should be optimized:
#include <xmmintrin.h>

__m128 test() {
	float val1[4] = {0.0f, 0.0f, 0.0f, 0.0f};

	return _mm_loadu_ps(val1);
}

This is currently compiled to:
test:
        pushl  %ebp
        movl   $0x00000000, %eax
        movl   %esp, %ebp
        subl   $16, %esp
        movl   %eax, -16(%ebp)
        movl   %eax, -12(%ebp)
        movl   %eax, -8(%ebp)
        movl   %eax, -4(%ebp)
        movups -16(%ebp), %xmm0
        leave
        ret

But I think GCC should produce something like:
test:
        pushl  %ebp
        xorps  %xmm0, %xmm0
        movl   %esp, %ebp
        subl   $16, %esp
(*)     movups %xmm0, -16(%ebp)
        leave
        ret

Perhaps even the store to stack is not necessary in this particular case.
Comment 8 Richard Henderson 2005-01-18 09:50:35 UTC
Found the tree-ssa aggregate copy-propagation PR.  Made this PR depend on it,
as this one has a different sort of test case.
Comment 9 Andrew Pinski 2005-09-14 06:51:50 UTC
Hmm, but vectors are not considered aggregates.
At the tree level, right before optimization, we have:
  A_6 = {1.2999999523162841796875e+0, 1.2999999523162841796875e+0, 
1.2999999523162841796875e+0, 1.2999999523162841796875e+0};
  #   result_26 = V_MAY_DEF <result_14>;
  __builtin_ia32_storeups (&result, A_6);

I would have thought that VECTOR_CST would be considered a constant and propagated into
__builtin_ia32_storeups, and that we would have folded __builtin_ia32_storeups at the tree level.

So I think there are two issues now. The first is that we don't constant-propagate the VECTOR_CST (if this is truly
a VECTOR_CST in store_ccp):
A_6 = {1.2999999523162841796875e+0, 1.2999999523162841796875e+0, 
1.2999999523162841796875e+0, 1.2999999523162841796875e+0};

Lattice value changed to VARYING.  Adding SSA edges to worklist.


And then we need an x86-specific fold for __builtin_ia32_storeups; after that it should just work.
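
Roughly, the hoped-for tree-level sequence would be (a sketch; the exact tree
forms are illustrative, not real dumps):

  # after propagating the VECTOR_CST into the call:
  __builtin_ia32_storeups (&result, { 1.2999999523162841796875e+0, ... });

  # after a target-specific fold of the builtin:
  result = { 1.2999999523162841796875e+0, ... };   <- plain unaligned vector store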
Comment 10 Andrew Pinski 2005-09-14 07:09:01 UTC
Actually the issue is that we don't change the constructor into a VECTOR_CST:
<constructor 0x41ec8720
        type <vector_type 0x41ebb9a0 __v4sf type <real_type 0x41e0fd90 float>
            sizes-gimplified BLK size <integer_cst 0x41e11a60 128> unit size <integer_cst 
0x41e11a80 16>
            align 128 symtab 0 alias set -1 nunits 4>
       >

which we should in this case.

And this happens in DOM or in CCP when replacing _mm_load1_ps with _mm_set1_ps (which really are
implemented the same way).
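
Conceptually (a sketch, not actual tree dumps), the difference is:

  A_6 = CONSTRUCTOR { 1.3f, 1.3f, 1.3f, 1.3f };   <- not a constant to CCP/DOM
  A_6 = VECTOR_CST  { 1.3f, 1.3f, 1.3f, 1.3f };   <- a constant, so it can be propagated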
Comment 11 Andrew Pinski 2005-09-14 07:43:51 UTC
Created attachment 9728 [details]
Patch which implements CONSTRUCTOR to VECTOR_CST

This actually helps on ppc-darwin with AltiVec too, when copying the
preprocessed source and changing _mm_storeu_ps to be just a prototype.
Comment 12 Andrew Pinski 2005-09-14 08:21:26 UTC
(In reply to comment #11)
> Created an attachment (id=9728)
> Patch which implements CONSTRUCTOR to VECTOR_CST
> 
> This actually helps on ppc-darwin with altivec too with copying the
> preprocessed source and changing _mm_storeu_ps to be just a prototype.

It does not help in any of these cases but will help in others.  The remaining issue after fixing this is
target-specific, tree-level folding of some of these builtins.
Comment 13 Andrew Pinski 2006-09-18 03:13:39 UTC
The problem here is the x86 back-end pushes constant vectors to the constant pool.
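
For example, even the all-zero vector from comment #7 would end up as a
constant-pool load, roughly (a sketch; the label name is illustrative):

        movaps  .LCn, %xmm0

rather than the cheaper

        xorps   %xmm0, %xmm0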

Note my patch will not work any more because it predates the new CONSTRUCTOR layout and does not take it into account.
Comment 14 Andrew Pinski 2021-07-26 02:43:29 UTC
Fixed fully in 4.9 and above.