PASS: gcc.target/powerpc/pr60203.c (test for excess errors) PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stfd FAIL: gcc.target/powerpc/pr60203.c scan-assembler-not lfd PASS: gcc.target/powerpc/pr60203.c scan-assembler-not lxsdx PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stxsdx PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mfvsrd PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mtvsrd There are now two lfds (which trigger the error above) and lots of other ops in the generated code pack: xxpermdi 0,2,1,0 addi 9,1,-16 xxpermdi 0,0,0,2 stxvd2x 0,0,9 lfd 1,-16(1) lfd 2,-8(1) blr while previously there was just pack: blr Compiled with -mcpu=power8 -O3.
Is this little-endian only?
No, it fails on big endian, too.
As said I can't reproduce this with a cross. The function is optimized way before we get to vectorization. spawn /home/abuild/rguenther/obj-ppc64le-g/gcc/xgcc -B/home/abuild/rguenther/obj-ppc64le-g/gcc/ /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/powerpc/pr60203.c -fno-diagnostics-show-caret -fdiagnostics-color=never -mcpu=power8 -O3 -S -o pr60203.s^M PASS: gcc.target/powerpc/pr60203.c (test for excess errors) PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stfd PASS: gcc.target/powerpc/pr60203.c scan-assembler-not lfd PASS: gcc.target/powerpc/pr60203.c scan-assembler-not lxsdx PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stxsdx PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mfvsrd PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mtvsrd pack: blr this is for ppc64le. Already .fre1 has pack (double a, double aa) { double u$d$0; union u_ld u; <bb 2>: u.d[1] = aa_4(D); u ={v} {CLOBBER}; return a_2(D); } where u is later trivially deleted as dead. Cross from x86_64 configured as > ./xgcc -v Using built-in specs. COLLECT_GCC=./xgcc Target: powerpc64le-suse-linux Configured with: /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux target_alias=powerpc64le-suse-linux CFLAGS=-g CXXFLAGS=-g --enable-languages=c,c++,fortran,lto --no-create --no-recursion Thread model: posix gcc version 6.0.0 20151217 (experimental) [trunk revision 221942] (GCC)
Confirmed on powerpc64le: $ /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O3 -S -Wall -Wextra -Wpedantic -mcpu=power8 -o/dev/stdout /src/gcc/trunk/gcc/testsuite/gcc.target/powerpc/pr60203.c .file "pr60203.c" .machine power8 .abiversion 2 .section ".toc","aw" .section ".text" .align 2 .p2align 4,,15 .globl pack .type pack, @function pack: xxpermdi 0,2,1,0 addi 9,1,-16 xxpermdi 0,0,0,2 stxvd2x 0,0,9 lfd 1,-16(1) lfd 2,-8(1) blr .long 0 .byte 0,0,0,0,0,0,0,0 .size pack,.-pack .align 2 .p2align 4,,15 .globl unpack_0 .type unpack_0, @function unpack_0: blr .long 0 .byte 0,0,0,0,0,0,0,0 .size unpack_0,.-unpack_0 .align 2 .p2align 4,,15 .globl unpack_1 .type unpack_1, @function unpack_1: fmr 1,2 blr .long 0 .byte 0,0,0,0,0,0,0,0 .size unpack_1,.-unpack_1 .ident "GCC: (GNU) 6.0.0 20160113 (experimental)" .section .note.GNU-stack,"",@progbits
(In reply to Martin Sebor from comment #4) > Confirmed on powerpc64le: > > $ /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O3 -S -Wall -Wextra > -Wpedantic -mcpu=power8 -o/dev/stdout > /src/gcc/trunk/gcc/testsuite/gcc.target/powerpc/pr60203.c > .file "pr60203.c" > .machine power8 > .abiversion 2 > .section ".toc","aw" > .section ".text" > .align 2 > .p2align 4,,15 > .globl pack > .type pack, @function > pack: > xxpermdi 0,2,1,0 > addi 9,1,-16 > xxpermdi 0,0,0,2 > stxvd2x 0,0,9 > lfd 1,-16(1) > lfd 2,-8(1) > blr > .long 0 > .byte 0,0,0,0,0,0,0,0 > .size pack,.-pack > .align 2 > .p2align 4,,15 > .globl unpack_0 > .type unpack_0, @function > unpack_0: > blr > .long 0 > .byte 0,0,0,0,0,0,0,0 > .size unpack_0,.-unpack_0 > .align 2 > .p2align 4,,15 > .globl unpack_1 > .type unpack_1, @function > unpack_1: > fmr 1,2 > blr > .long 0 > .byte 0,0,0,0,0,0,0,0 > .size unpack_1,.-unpack_1 > .ident "GCC: (GNU) 6.0.0 20160113 (experimental)" > .section .note.GNU-stack,"",@progbits Can you properly bisect it then? Still can't reproduce with a cross from x86_64. Does on native .0341.fre properly contain just pack (double a, double aa) { double u$d$0; union u_ld u; <bb 2>: u.d[1] = aa_4(D); u ={v} {CLOBBER}; return a_2(D); } or do we have some odd host dependency in early opts until here?
I can only reproduce with a ppc64le cross, but not be cross, supposedly because my ppc64le cross is using auto-host.h which Marek sent me from a native configured compiler, while the be one does not. Maybe the difference is whether power8 instructions are supported by assembler or something similar. Looking at the be->le differences, it starts in esra: ... -Rejected (2230): not aggregate: a -Rejected (2231): not aggregate: aa -Candidate (2234): u -Created a replacement for u offset: 0, size: 64: u$d$0 +Rejected (2351): not aggregate: a +Rejected (2352): not aggregate: aa +Candidate (2355): u +! Disqualifying u - No scalar replacements to be created. ... pack (double a, double aa) { - double u$d$0; union u_ld u; long double _6; <bb 2>: - u$d$0_8 = a_2(D); + u.d[0] = a_2(D); u.d[1] = aa_4(D); - _6 = u$d$0_8; + _6 = u.ld; u ={v} {CLOBBER}; return _6; and if we don't SRA this, we don't optimize it away. Ah, actually, I can reproduce even with the ppc64be cross, if I use explicit -mlong-double-128.
(In reply to Jakub Jelinek from comment #6) > I can only reproduce with a ppc64le cross, but not be cross, supposedly > because my ppc64le cross is using auto-host.h which Marek sent me from a > native configured compiler, while the be one does not. Maybe the difference > is whether power8 instructions are supported by assembler or something > similar. > > Looking at the be->le differences, it starts in esra: > ... > -Rejected (2230): not aggregate: a > -Rejected (2231): not aggregate: aa > -Candidate (2234): u > -Created a replacement for u offset: 0, size: 64: u$d$0 > +Rejected (2351): not aggregate: a > +Rejected (2352): not aggregate: aa > +Candidate (2355): u > +! Disqualifying u - No scalar replacements to be created. > ... > pack (double a, double aa) > { > - double u$d$0; > union u_ld u; > long double _6; > > <bb 2>: > - u$d$0_8 = a_2(D); > + u.d[0] = a_2(D); > u.d[1] = aa_4(D); > - _6 = u$d$0_8; > + _6 = u.ld; > u ={v} {CLOBBER}; > return _6; > > and if we don't SRA this, we don't optimize it away. Ah, actually, I can > reproduce even with the ppc64be cross, if I use explicit -mlong-double-128. Ok, I can reproduce with -mlong-double-128 and with that the SLP vectorizer triggers and produces pack (double a, double aa) { vector(2) double * vectp.6; vector(2) double * vectp_u.5; union u_ld u; long double _6; vector(2) double vect_cst__8; <bb 2>: vect_cst__8 = {a_2(D), aa_4(D)}; MEM[(double *)&u] = vect_cst__8; _6 = u.ld; u ={v} {CLOBBER}; return _6; which then, as no FRE is running after vectorization, is not optimized to a register vector-to-long-double punning (not sure if that would help). SRA would help as well here. With -fno-tree-slp-vectorize it's optimized somewhere on RTL. Not sure why it is able to grok the more complicated two-stores-one-load but not the load-after-store. Before combine we have with the vector code (note 5 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) (insn 2 5 3 2 (set (reg/v:DF 158 [ a ]) (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg:DF 33 1 [ a ]) (nil))) (insn 3 2 4 2 (set (reg/v:DF 159 [ aa ]) (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg:DF 34 2 [ aa ]) (nil))) (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) (insn 7 4 9 2 (set (reg:V2DF 160) (vec_concat:V2DF (reg/v:DF 158 [ a ]) (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df} (expr_list:REG_DEAD (reg/v:DF 159 [ aa ]) (expr_list:REG_DEAD (reg/v:DF 158 [ a ]) (nil)))) (insn 9 7 13 2 (set (reg:TF 157 [ <retval> ]) (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm} (expr_list:REG_DEAD (reg:V2DF 160) (nil))) (insn 13 9 14 2 (set (reg/i:TF 33 1) (reg:TF 157 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm} (expr_list:REG_DEAD (reg:TF 157 [ <retval> ]) (nil))) and with the non-vector code (insn 2 5 3 2 (set (reg/v:DF 157 [ a ]) (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg:DF 33 1 [ a ]) (nil))) (insn 3 2 4 2 (set (reg/v:DF 158 [ aa ]) (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg:DF 34 2 [ aa ]) (nil))) (note 4 3 16 2 NOTE_INSN_FUNCTION_BEG) (insn 16 4 7 2 (set (reg/v:TI 155 [ u ]) (const_int 0 [0])) t.c:7 -1 (nil)) (insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0) (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg/v:DF 157 [ a ]) (nil))) (insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8) (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg/v:DF 158 [ aa ]) (nil))) (insn 9 8 13 2 (set (reg:TF 156 [ <retval> ]) (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm} (expr_list:REG_DEAD (reg/v:TI 155 [ u ]) (nil))) (insn 13 9 14 2 (set (reg/i:TF 33 1) (reg:TF 156 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm} (expr_list:REG_DEAD (reg:TF 156 [ <retval> ]) (nil))) so the difference is (insn 7 4 9 2 (set (reg:V2DF 160) (vec_concat:V2DF (reg/v:DF 158 [ a ]) (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df} (expr_list:REG_DEAD (reg/v:DF 159 [ aa ]) (expr_list:REG_DEAD (reg/v:DF 158 [ a ]) (nil)))) (insn 9 7 13 2 (set (reg:TF 157 [ <retval> ]) (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm} (expr_list:REG_DEAD (reg:V2DF 160) (nil))) vs. (insn 16 4 7 2 (set (reg/v:TI 155 [ u ]) (const_int 0 [0])) t.c:7 -1 (nil)) (insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0) (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg/v:DF 157 [ a ]) (nil))) (insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8) (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64} (expr_list:REG_DEAD (reg/v:DF 158 [ aa ]) (nil))) (insn 9 8 13 2 (set (reg:TF 156 [ <retval> ]) (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm} (expr_list:REG_DEAD (reg/v:TI 155 [ u ]) (nil))) combine forwards the argument reg setup into the latter but not the former but in the end the backend is probably confused by the VSX register use. I think this should be addressed at the target level as the user may choose to write this code by exchanging double d[2] in the unions testcase with v2df d and use GCCs vector extension.
Btw, the testcase seems to be really "special" given exact register overlap between return value and incoming args (if you'd look at the vectorizers choice to say this is profitable to vectorize). Making it fairer like with long double pack (double x, double a, double aa) { union u_ld u; u.d[0] = a; u.d[1] = aa; return u.ld; } produces without SLP pack: mfvsrd 10,2 fmr 2,3 mtvsrd 1,10 blr and with pack: xxpermdi 0,3,2,0 addi 9,1,-16 xxpermdi 0,0,0,2 stxvd2x 0,0,9 lfd 1,-16(1) lfd 2,-8(1) blr to that would be the thing to compare cost-wise. Currently we have t.c:9:11: note: Cost model analysis: Vector inside of basic block cost: 1 Vector prologue cost: 0 Vector epilogue cost: 0 Scalar cost of basic block: 2 so for some reason the vector build is not accounted for. Ah, I see why. Mine.
Author: rguenth Date: Fri Jan 15 11:49:43 2016 New Revision: 232415 URL: https://gcc.gnu.org/viewcvs?rev=232415&root=gcc&view=rev Log: 2016-01-15 Richard Biener <rguenther@suse.de> PR tree-optimization/68961 * tree-vect-slp.c (vect_analyze_slp_cost_1): Consider cost of invariants in stores again. * gcc.dg/vect/costmodel/x86_64/costmodel-pr68961.c: New testcase. Added: trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr68961.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-slp.c
Should be fixed now.
Note that the fix depends on "bogus" cost for the vector construction on x86_64. Currently it is two stmts (nunits / 2 + 1) but the vector can be constructed by a single unpcklpd stmt. The correct cost is nunits - 1. Similar to the PPC case the main issue is that the fact that incoming registers have an exact overlap with the return value registers is hidden from the GIMPLE IL: pack (double a, double aa) { struct x D.1756; <bb 2>: MEM[(struct x *)&D.1756] = a_2(D); MEM[(struct x *)&D.1756 + 8B] = aa_3(D); return D.1756; } Detecting the exact overlap is probably too hard but at least detecting that we don't return in memory and thus the store is not a store and that we return in two different regs and thus require two vector extractions should be possible.
Author: rguenth Date: Wed Jun 29 07:52:35 2016 New Revision: 237840 URL: https://gcc.gnu.org/viewcvs?rev=237840&root=gcc&view=rev Log: 2016-06-29 Richard Biener <rguenther@suse.de> PR rtl-optimization/68961 * simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT. Modified: trunk/gcc/ChangeLog trunk/gcc/simplify-rtx.c
Author: rguenth Date: Tue Jul 12 08:56:14 2016 New Revision: 238238 URL: https://gcc.gnu.org/viewcvs?rev=238238&root=gcc&view=rev Log: 2016-07-12 Richard Biener <rguenther@suse.de> PR rtl-optimization/68961 * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT to simplify to a non-constant. * gcc.target/i386/pr68961.c: New testcase. Added: trunk/gcc/testsuite/gcc.target/i386/pr68961.c Modified: trunk/gcc/ChangeLog trunk/gcc/fwprop.c trunk/gcc/testsuite/ChangeLog
Author: uros Date: Tue Jul 12 22:06:51 2016 New Revision: 238268 URL: https://gcc.gnu.org/viewcvs?rev=238268&root=gcc&view=rev Log: PR rtl-optimization/68961 * config/i386/sse.md (movsd/movhpd to movupd peephole2s): Add new peephole variant. Use sse_reg_operand predicates. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/sse.md