| Summary: | SSE constant vector initialization produces dead constant values on stack | | |
|---|---|---|---|
| Product: | gcc | Reporter: | uros |
| Component: | target | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | RESOLVED FIXED | | |
| Severity: | enhancement | CC: | gcc-bugs, pinskia, steven |
| Priority: | P2 | Keywords: | missed-optimization, ssemmx |
| Version: | 4.0.0 | | |
| Target Milestone: | 4.9.0 | | |
| Host: | | Target: | i686-*-* |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2006-09-18 03:13:39 |
| Attachments: | Patch which implements CONSTRUCTOR to VECTOR_CST | | |
Description (uros, 2004-11-19 08:48:42 UTC)

Confirmed. This is either a target bug or an RTL optimization problem. The reason I say that is that the builtins are not expanded before reaching RTL. If val1 is moved out of the main() function, the produced code is OK:

```c
--cut here--
#include <xmmintrin.h>
#include <stdio.h>

float val1[4] = {1.3f, 1.4f, 1.5f, 1.6f};

int main(void)
{
    float result[4];
    __m128 A;
    ...
--cut here--
```

```asm
val1:                                   <- vector is merged this time
        .long   1067869798
        .long   1068708659
        .long   1069547520
        .long   1070386381
        .section .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%f %f %f %f\n"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $56, %esp
        andl    $-16, %esp
        movups  val1, %xmm0             <- loaded directly into xmm reg
        movups  %xmm0, -16(%ebp)
        subl    $16, %esp
        flds    -4(%ebp)
        ...
```

Your first test case is fixed by the patch for PR13366. We now get:

```asm
        .align 16
.LC0:
        .long   1067869798
        .long   1067869798
        .long   1067869798
        .long   1067869798
        ...
        movaps  .LC0, %xmm0
        movups  %xmm0, 56(%esp)
```

I really don't know what you expected out of your second test case. Perhaps the problem is that we don't expose what movups means to the compiler. I can see that perhaps we could reuse MISALIGNED_INDIRECT_REF to represent this (which, as a side benefit, would result in movhps+movlps instead of one movups, which runs faster). But I doubt that we have the machinery at the tree level to copy-propagate the aggregate initialization from val1[4] -> result[4]. I'm pretty sure we already have open enhancement requests for this. So shall we mark this fixed?

(In reply to comment #3)
In the second test case, the compiler should figure out that the whole val1[] array is initialized with constants. In that case, the .LCx constant vector can be built at compile time, and the load of constants to the stack can be implemented by a "movaps .LCx, %xmm0 / movups %xmm0, -32(%ebp)" insn pair.

If this is closed, let's be sure to find that enhancement request and close this as a dup.
AMD64 has this at -O2:

Test case 1:

```asm
main:
.LFB499:
        subq    $24, %rsp
.LCFI0:
        movl    $.LC1, %edi
        movl    $4, %eax
        movaps  .LC0(%rip), %xmm0
        movups  %xmm0, (%rsp)
        cvtss2sd        12(%rsp), %xmm3
        cvtss2sd        8(%rsp), %xmm2
        cvtss2sd        4(%rsp), %xmm1
        cvtss2sd        (%rsp), %xmm0
        call    printf
        xorl    %eax, %eax
        addq    $24, %rsp
        ret
.LFE499:
```

Test case 2:

```asm
main:
.LFB499:
        subq    $40, %rsp
.LCFI0:
        movl    $.LC4, %edi
        movl    $4, %eax
        movl    $0x3fa66666, 16(%rsp)
        movl    $0x3fb33333, 20(%rsp)
        movl    $0x3fc00000, 24(%rsp)
        movl    $0x3fcccccd, 28(%rsp)
        movups  16(%rsp), %xmm0
        movups  %xmm0, (%rsp)
        cvtss2sd        12(%rsp), %xmm3
        cvtss2sd        8(%rsp), %xmm2
        cvtss2sd        4(%rsp), %xmm1
        cvtss2sd        (%rsp), %xmm0
        call    printf
        xorl    %eax, %eax
        addq    $40, %rsp
        ret
```

With -m32 -march=pentium4 -mtune=prescott I get this:

Test case 1:

```asm
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $56, %esp
        andl    $-16, %esp
        subl    $16, %esp
        movaps  .LC0, %xmm0
        movups  %xmm0, -16(%ebp)
        flds    -4(%ebp)
        fstpl   28(%esp)
        flds    -8(%ebp)
        fstpl   20(%esp)
        flds    -12(%ebp)
        fstpl   12(%esp)
        flds    -16(%ebp)
        fstpl   4(%esp)
        movl    $.LC1, (%esp)
        call    printf
        xorl    %eax, %eax
        leave
        ret
```

Test case 2 (which is basically the same as Uros' code):

```asm
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $72, %esp
        andl    $-16, %esp
        subl    $16, %esp
        movl    $0x3fa66666, -16(%ebp)
        movl    $0x3fb33333, -12(%ebp)
        movl    $0x3fc00000, -8(%ebp)
        movl    $0x3fcccccd, -4(%ebp)
        movups  -16(%ebp), %xmm0
        movups  %xmm0, -32(%ebp)
        flds    -20(%ebp)
        fstpl   28(%esp)
        flds    -24(%ebp)
        fstpl   20(%esp)
        flds    -28(%ebp)
        fstpl   12(%esp)
        flds    -32(%ebp)
        fstpl   4(%esp)
        movl    $.LC4, (%esp)
        call    printf
        xorl    %eax, %eax
        leave
        ret
```

Another test case that I think should be optimized:

```c
#include <xmmintrin.h>

__m128 test()
{
    float val1[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    return _mm_loadu_ps(val1);
}
```

This is currently compiled to:

```asm
test:
        pushl   %ebp
        movl    $0x00000000, %eax
        movl    %esp, %ebp
        subl    $16, %esp
        movl    %eax, -16(%ebp)
        movl    %eax, -12(%ebp)
        movl    %eax, -8(%ebp)
        movl    %eax, -4(%ebp)
        movups  -16(%ebp), %xmm0
        leave
        ret
```

But I think gcc should produce something like:

```asm
test:
        pushl   %ebp
        xorps   %xmm0, %xmm0
        movl    %esp, %ebp
        subl    $16, %esp
        movups  %xmm0, -16(%ebp)       (*)
        leave
        ret
```

(*) Perhaps even the store to the stack is not necessary in this particular case.

Found the tree-ssa aggregate copy-propagation PR. Made this PR depend on it, as this has a different sort of test case.

Hmm, but vectors are not considered aggregates.

At the tree level, right before optimization, we have:

```
A_6 = {1.2999999523162841796875e+0, 1.2999999523162841796875e+0, 1.2999999523162841796875e+0, 1.2999999523162841796875e+0};
#   result_26 = V_MAY_DEF <result_14>;
__builtin_ia32_storeups (&result, A_6);
```

I would have thought that VECTOR_CST would be considered a constant and propagated into __builtin_ia32_storeups, and that we would have folded __builtin_ia32_storeups at the tree level. So I think there are two issues now. The first is that we don't constant-prop VECTOR_CST (if this is truly a VECTOR_CST in store_ccp):

```
A_6 = {1.2999999523162841796875e+0, 1.2999999523162841796875e+0, 1.2999999523162841796875e+0, 1.2999999523162841796875e+0};
Lattice value changed to VARYING.  Adding SSA edges to worklist.
```

And then we need a fold specific to x86 for __builtin_ia32_storeups; after that it should just work.

Actually, the issue is that we don't change the constructor into a VECTOR_CST:

```
<constructor 0x41ec8720
    type <vector_type 0x41ebb9a0 __v4sf
        type <real_type 0x41e0fd90 float>
        sizes-gimplified BLK
        size <integer_cst 0x41e11a60 128>
        unit size <integer_cst 0x41e11a80 16>
        align 128 symtab 0 alias set -1 nunits 4> >
```

which we should in this case. And this happens in DOM or in CCP when replacing _mm_set_ps1 with _mm_set1_ps (which really are implemented the same way).

Created attachment 9728 [details]
Patch which implements CONSTRUCTOR to VECTOR_CST
This actually helps on ppc-darwin with altivec too with copying the
preprocessed source and changing _mm_storeu_ps to be just a prototype.
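The expected xorps transformation for the zero-vector test case above can be checked behaviorally. A minimal sketch (the helper names `test_load`, `test_xor`, and `vectors_equal` are mine, not from the bug report) verifying that the zero-initialized array load and `_mm_setzero_ps` yield identical vectors:

```c
#include <xmmintrin.h>
#include <string.h>

/* The testcase from the report: load a zero-initialized stack array. */
__m128 test_load(void)
{
    float val1[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    return _mm_loadu_ps(val1);
}

/* What the optimized code should be equivalent to: a single xorps. */
__m128 test_xor(void)
{
    return _mm_setzero_ps();
}

/* Bitwise comparison of two vectors through memory. */
int vectors_equal(__m128 a, __m128 b)
{
    float fa[4], fb[4];
    _mm_storeu_ps(fa, a);
    _mm_storeu_ps(fb, b);
    return memcmp(fa, fb, sizeof fa) == 0;
}
```

Since both functions return all-zero bit patterns, the compiler is free to replace the four stores plus movups with a single `xorps %xmm0, %xmm0`, which is the optimization the report asks for.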
(In reply to comment #11)
> Created an attachment (id=9728)
> Patch which implements CONSTRUCTOR to VECTOR_CST
>
> This actually helps on ppc-darwin with altivec too with copying the
> preprocessed source and changing _mm_storeu_ps to be just a prototype.

It does not help in any of these cases, but it will help in others. The issue remaining after fixing this is target-specific tree-level folding of some of these builtins.

The problem here is that the x86 back end pushes constant vectors to the constant pool.

Note my patch will no longer apply as-is, because it predates the new CONSTRUCTOR layout.

Fixed fully in 4.9 and above.
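As a closing illustration of the remark above that `_mm_set_ps1` and `_mm_set1_ps` "really are implemented the same way": both splat one float across all four lanes, so they must produce bitwise-identical vectors. A quick behavioral check (the helper name is mine):

```c
#include <xmmintrin.h>
#include <string.h>

/* _mm_set_ps1 and _mm_set1_ps are two names for the same splat
   operation; confirm they produce identical bit patterns. */
int splat_variants_agree(float x)
{
    float a[4], b[4];
    _mm_storeu_ps(a, _mm_set_ps1(x));
    _mm_storeu_ps(b, _mm_set1_ps(x));
    return memcmp(a, b, sizeof a) == 0;
}
```

Because the two intrinsics are interchangeable, a CONSTRUCTOR-to-VECTOR_CST fold that recognizes one splat form covers the other for free.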