Bug 68961 - [6 regression] Test case gcc.target/powerpc/pr60203.c fails since r231674
Summary: [6 regression] Test case gcc.target/powerpc/pr60203.c fails since r231674
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 6.0
: P1 normal
Target Milestone: 6.0
Assignee: Richard Biener
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-12-17 16:34 UTC by Bill Seurer
Modified: 2016-07-12 22:07 UTC (History)
5 users (show)

See Also:
Host:
Target: powerpc64*-unknown-linux-gnu
Build:
Known to work:
Known to fail: 6.0
Last reconfirmed: 2016-01-13 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Bill Seurer 2015-12-17 16:34:10 UTC
PASS: gcc.target/powerpc/pr60203.c (test for excess errors)
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stfd
FAIL: gcc.target/powerpc/pr60203.c scan-assembler-not lfd
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not lxsdx
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stxsdx
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mfvsrd
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mtvsrd

There are now two lfds (which trigger the error above) and lots of other ops in the generated code

pack:
	xxpermdi 0,2,1,0
	addi 9,1,-16
	xxpermdi 0,0,0,2
	stxvd2x 0,0,9
	lfd 1,-16(1)
	lfd 2,-8(1)
	blr

while previously there was just

pack:
	blr

Compiled with  -mcpu=power8 -O3.
Comment 1 Bill Schmidt 2015-12-17 17:13:13 UTC
Is this little-endian only?
Comment 2 Bill Seurer 2015-12-17 18:26:48 UTC
No, it fails on big endian, too.
Comment 3 Richard Biener 2015-12-18 10:38:29 UTC
As said I can't reproduce this with a cross.  The function is optimized way before we get to vectorization.

spawn /home/abuild/rguenther/obj-ppc64le-g/gcc/xgcc -B/home/abuild/rguenther/obj-ppc64le-g/gcc/ /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/powerpc/pr60203.c -fno-diagnostics-show-caret -fdiagnostics-color=never -mcpu=power8 -O3 -S -o pr60203.s^M
PASS: gcc.target/powerpc/pr60203.c (test for excess errors)
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stfd
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not lfd
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not lxsdx
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not stxsdx
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mfvsrd
PASS: gcc.target/powerpc/pr60203.c scan-assembler-not mtvsrd

pack:
        blr

this is for ppc64le.  Already .fre1 has

pack (double a, double aa)
{
  double u$d$0;
  union u_ld u;

  <bb 2>:
  u.d[1] = aa_4(D);
  u ={v} {CLOBBER};
  return a_2(D);

}

where u is later trivially deleted as dead.

Cross from x86_64 configured as

> ./xgcc -v
Using built-in specs.
COLLECT_GCC=./xgcc
Target: powerpc64le-suse-linux
Configured with: /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux --enable-languages=c,c++,fortran : (reconfigured) /space/rguenther/src/svn/trunk3/configure --disable-bootstrap --target=powerpc64le-suse-linux target_alias=powerpc64le-suse-linux CFLAGS=-g CXXFLAGS=-g --enable-languages=c,c++,fortran,lto --no-create --no-recursion
Thread model: posix
gcc version 6.0.0 20151217 (experimental) [trunk revision 221942] (GCC)
Comment 4 Martin Sebor 2016-01-13 21:08:27 UTC
Confirmed on powerpc64le:

$ /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O3 -S -Wall -Wextra -Wpedantic -mcpu=power8 -o/dev/stdout /src/gcc/trunk/gcc/testsuite/gcc.target/powerpc/pr60203.c
	.file	"pr60203.c"
	.machine power8
	.abiversion 2
	.section	".toc","aw"
	.section	".text"
	.align 2
	.p2align 4,,15
	.globl pack
	.type	pack, @function
pack:
	xxpermdi 0,2,1,0
	addi 9,1,-16
	xxpermdi 0,0,0,2
	stxvd2x 0,0,9
	lfd 1,-16(1)
	lfd 2,-8(1)
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.size	pack,.-pack
	.align 2
	.p2align 4,,15
	.globl unpack_0
	.type	unpack_0, @function
unpack_0:
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.size	unpack_0,.-unpack_0
	.align 2
	.p2align 4,,15
	.globl unpack_1
	.type	unpack_1, @function
unpack_1:
	fmr 1,2
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.size	unpack_1,.-unpack_1
	.ident	"GCC: (GNU) 6.0.0 20160113 (experimental)"
	.section	.note.GNU-stack,"",@progbits
Comment 5 Richard Biener 2016-01-14 09:31:13 UTC
(In reply to Martin Sebor from comment #4)
> Confirmed on powerpc64le:
> 
> $ /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O3 -S -Wall -Wextra
> -Wpedantic -mcpu=power8 -o/dev/stdout
> /src/gcc/trunk/gcc/testsuite/gcc.target/powerpc/pr60203.c
> 	.file	"pr60203.c"
> 	.machine power8
> 	.abiversion 2
> 	.section	".toc","aw"
> 	.section	".text"
> 	.align 2
> 	.p2align 4,,15
> 	.globl pack
> 	.type	pack, @function
> pack:
> 	xxpermdi 0,2,1,0
> 	addi 9,1,-16
> 	xxpermdi 0,0,0,2
> 	stxvd2x 0,0,9
> 	lfd 1,-16(1)
> 	lfd 2,-8(1)
> 	blr
> 	.long 0
> 	.byte 0,0,0,0,0,0,0,0
> 	.size	pack,.-pack
> 	.align 2
> 	.p2align 4,,15
> 	.globl unpack_0
> 	.type	unpack_0, @function
> unpack_0:
> 	blr
> 	.long 0
> 	.byte 0,0,0,0,0,0,0,0
> 	.size	unpack_0,.-unpack_0
> 	.align 2
> 	.p2align 4,,15
> 	.globl unpack_1
> 	.type	unpack_1, @function
> unpack_1:
> 	fmr 1,2
> 	blr
> 	.long 0
> 	.byte 0,0,0,0,0,0,0,0
> 	.size	unpack_1,.-unpack_1
> 	.ident	"GCC: (GNU) 6.0.0 20160113 (experimental)"
> 	.section	.note.GNU-stack,"",@progbits

Can you properly bisect it then?  Still can't reproduce with a cross from x86_64.

Does on native .0341.fre properly contain just

pack (double a, double aa)
{
  double u$d$0;
  union u_ld u;

  <bb 2>:
  u.d[1] = aa_4(D);
  u ={v} {CLOBBER};
  return a_2(D);

}

or do we have some odd host dependency in early opts until here?
Comment 6 Jakub Jelinek 2016-01-14 15:45:49 UTC
I can only reproduce with a ppc64le cross, but not be cross, supposedly because my ppc64le cross is using auto-host.h which Marek sent me from a native configured compiler, while the be one does not.  Maybe the difference is whether power8 instructions are supported by assembler or something similar.

Looking at the be->le differences, it starts in esra:
...
-Rejected (2230): not aggregate: a
-Rejected (2231): not aggregate: aa
-Candidate (2234): u
-Created a replacement for u offset: 0, size: 64: u$d$0
+Rejected (2351): not aggregate: a
+Rejected (2352): not aggregate: aa
+Candidate (2355): u
+! Disqualifying u - No scalar replacements to be created.
...
 pack (double a, double aa)
 {
-  double u$d$0;
   union u_ld u;
   long double _6;
 
   <bb 2>:
-  u$d$0_8 = a_2(D);
+  u.d[0] = a_2(D);
   u.d[1] = aa_4(D);
-  _6 = u$d$0_8;
+  _6 = u.ld;
   u ={v} {CLOBBER};
   return _6;

and if we don't SRA this, we don't optimize it away.  Ah, actually, I can reproduce even with the ppc64be cross, if I use explicit -mlong-double-128.
Comment 7 Richard Biener 2016-01-15 08:33:09 UTC
(In reply to Jakub Jelinek from comment #6)
> I can only reproduce with a ppc64le cross, but not be cross, supposedly
> because my ppc64le cross is using auto-host.h which Marek sent me from a
> native configured compiler, while the be one does not.  Maybe the difference
> is whether power8 instructions are supported by assembler or something
> similar.
> 
> Looking at the be->le differences, it starts in esra:
> ...
> -Rejected (2230): not aggregate: a
> -Rejected (2231): not aggregate: aa
> -Candidate (2234): u
> -Created a replacement for u offset: 0, size: 64: u$d$0
> +Rejected (2351): not aggregate: a
> +Rejected (2352): not aggregate: aa
> +Candidate (2355): u
> +! Disqualifying u - No scalar replacements to be created.
> ...
>  pack (double a, double aa)
>  {
> -  double u$d$0;
>    union u_ld u;
>    long double _6;
>  
>    <bb 2>:
> -  u$d$0_8 = a_2(D);
> +  u.d[0] = a_2(D);
>    u.d[1] = aa_4(D);
> -  _6 = u$d$0_8;
> +  _6 = u.ld;
>    u ={v} {CLOBBER};
>    return _6;
> 
> and if we don't SRA this, we don't optimize it away.  Ah, actually, I can
> reproduce even with the ppc64be cross, if I use explicit -mlong-double-128.

Ok, I can reproduce with -mlong-double-128 and with that the SLP vectorizer
triggers and produces

pack (double a, double aa)
{
  vector(2) double * vectp.6;
  vector(2) double * vectp_u.5;
  union u_ld u;
  long double _6;
  vector(2) double vect_cst__8;

  <bb 2>:
  vect_cst__8 = {a_2(D), aa_4(D)};
  MEM[(double *)&u] = vect_cst__8;
  _6 = u.ld;
  u ={v} {CLOBBER};
  return _6;

which then, as no FRE is running after vectorization, is not optimized
to a register vector-to-long-double punning (not sure if that would help).
SRA would help as well here.

With -fno-tree-slp-vectorize it's optimized somewhere on RTL.  Not sure
why it is able to grok the more complicated two-stores-one-load but
not the load-after-store.  Before combine we have with the vector code

(note 5 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 (set (reg/v:DF 158 [ a ])
        (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 33 1 [ a ])
        (nil)))
(insn 3 2 4 2 (set (reg/v:DF 159 [ aa ])
        (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 34 2 [ aa ])
        (nil)))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 9 2 (set (reg:V2DF 160)
        (vec_concat:V2DF (reg/v:DF 158 [ a ])
            (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df}
     (expr_list:REG_DEAD (reg/v:DF 159 [ aa ])
        (expr_list:REG_DEAD (reg/v:DF 158 [ a ])
            (nil))))
(insn 9 7 13 2 (set (reg:TF 157 [ <retval> ])
        (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:V2DF 160)
        (nil)))
(insn 13 9 14 2 (set (reg/i:TF 33 1)
        (reg:TF 157 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:TF 157 [ <retval> ])
        (nil)))

and with the non-vector code

(insn 2 5 3 2 (set (reg/v:DF 157 [ a ])
        (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 33 1 [ a ])
        (nil)))
(insn 3 2 4 2 (set (reg/v:DF 158 [ aa ])
        (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 34 2 [ aa ])
        (nil)))
(note 4 3 16 2 NOTE_INSN_FUNCTION_BEG)
(insn 16 4 7 2 (set (reg/v:TI 155 [ u ])
        (const_int 0 [0])) t.c:7 -1
     (nil))
(insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0)
        (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 157 [ a ])
        (nil)))
(insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8)
        (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 158 [ aa ])
        (nil)))
(insn 9 8 13 2 (set (reg:TF 156 [ <retval> ])
        (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg/v:TI 155 [ u ])
        (nil)))
(insn 13 9 14 2 (set (reg/i:TF 33 1)
        (reg:TF 156 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:TF 156 [ <retval> ])
        (nil)))


so the difference is

(insn 7 4 9 2 (set (reg:V2DF 160)
        (vec_concat:V2DF (reg/v:DF 158 [ a ])
            (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df}
     (expr_list:REG_DEAD (reg/v:DF 159 [ aa ])
        (expr_list:REG_DEAD (reg/v:DF 158 [ a ])
            (nil))))
(insn 9 7 13 2 (set (reg:TF 157 [ <retval> ])
        (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:V2DF 160)
        (nil)))

vs.

(insn 16 4 7 2 (set (reg/v:TI 155 [ u ])
        (const_int 0 [0])) t.c:7 -1
     (nil))
(insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0)
        (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 157 [ a ])
        (nil)))
(insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8)
        (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 158 [ aa ])
        (nil)))
(insn 9 8 13 2 (set (reg:TF 156 [ <retval> ])
        (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg/v:TI 155 [ u ])
        (nil)))

combine forwards the argument reg setup into the latter but not the former
but in the end the backend is probably confused by the VSX register use.

I think this should be addressed at the target level as the user may choose
to write this code by exchanging double d[2] in the unions testcase with
v2df d and use GCCs vector extension.
Comment 8 Richard Biener 2016-01-15 09:28:11 UTC
Btw, the testcase seems to be really "special" given exact register overlap between return value and incoming args (if you'd look at the vectorizers choice
to say this is profitable to vectorize).  Making it fairer like with

long double
pack (double x, double a, double aa)
{
  union u_ld u;
  u.d[0] = a;
  u.d[1] = aa;
  return u.ld;
}

produces without SLP

pack:
        mfvsrd 10,2
        fmr 2,3
        mtvsrd 1,10
        blr

and with

pack:
        xxpermdi 0,3,2,0
        addi 9,1,-16
        xxpermdi 0,0,0,2
        stxvd2x 0,0,9
        lfd 1,-16(1)
        lfd 2,-8(1)
        blr

to that would be the thing to compare cost-wise.  Currently we have

t.c:9:11: note: Cost model analysis:
  Vector inside of basic block cost: 1
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 2

so for some reason the vector build is not accounted for.

Ah, I see why.  Mine.
Comment 9 Richard Biener 2016-01-15 11:50:15 UTC
Author: rguenth
Date: Fri Jan 15 11:49:43 2016
New Revision: 232415

URL: https://gcc.gnu.org/viewcvs?rev=232415&root=gcc&view=rev
Log:
2016-01-15  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/68961
	* tree-vect-slp.c (vect_analyze_slp_cost_1): Consider cost
	of invariants in stores again.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr68961.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr68961.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
Comment 10 Richard Biener 2016-01-15 11:53:25 UTC
Should be fixed now.
Comment 11 Richard Biener 2016-06-09 11:53:50 UTC
Note that the fix depends on "bogus" cost for the vector construction on x86_64.
Currently it is two stmts (nunits / 2 + 1) but the vector can be constructed
by a single unpcklpd stmt.  The correct cost is nunits - 1.

Similar to the PPC case the main issue is that the fact that incoming registers
have an exact overlap with the return value registers is hidden from the GIMPLE IL:

pack (double a, double aa)
{
  struct x D.1756;

  <bb 2>:
  MEM[(struct x *)&D.1756] = a_2(D);
  MEM[(struct x *)&D.1756 + 8B] = aa_3(D);
  return D.1756;
}

Detecting the exact overlap is probably too hard but at least detecting that
we don't return in memory and thus the store is not a store and that we return
in two different regs and thus require two vector extractions should be possible.
Comment 12 Richard Biener 2016-06-29 07:53:06 UTC
Author: rguenth
Date: Wed Jun 29 07:52:35 2016
New Revision: 237840

URL: https://gcc.gnu.org/viewcvs?rev=237840&root=gcc&view=rev
Log:
2016-06-29  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
Comment 13 Richard Biener 2016-07-12 08:56:47 UTC
Author: rguenth
Date: Tue Jul 12 08:56:14 2016
New Revision: 238238

URL: https://gcc.gnu.org/viewcvs?rev=238238&root=gcc&view=rev
Log:
2016-07-12  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
	to simplify to a non-constant.

	* gcc.target/i386/pr68961.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr68961.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fwprop.c
    trunk/gcc/testsuite/ChangeLog
Comment 14 uros 2016-07-12 22:07:23 UTC
Author: uros
Date: Tue Jul 12 22:06:51 2016
New Revision: 238268

URL: https://gcc.gnu.org/viewcvs?rev=238268&root=gcc&view=rev
Log:
	PR rtl-optimization/68961
	* config/i386/sse.md (movsd/movhpd to movupd peephole2s): Add new
	peephole variant.  Use sse_reg_operand predicates.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md