This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/81504] New: gcc-7 regression: vec_st in loop misoptimized
- From: "zoltan at hidvegi dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 21 Jul 2017 07:06:12 +0000
- Subject: [Bug target/81504] New: gcc-7 regression: vec_st in loop misoptimized
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81504
Bug ID: 81504
Summary: gcc-7 regression: vec_st in loop misoptimized
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: zoltan at hidvegi dot com
CC: wschmidt at gcc dot gnu.org
Target Milestone: ---
Target: powerpc64le-unknown-linux-gnu
Created attachment 41802
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41802&action=edit
gcc-7 -O2, vec_st in loop misoptimized
The attached code miscompiles with gcc-7 -O2, but gcc-6 produces correct code.
gcc-7 -O1 also generates good code. With gcc-7 -O2 idx is always incremented
before p[idx] is written using vec_st. I've tried to create a minimal testcase
to reproduce the problem; this is not real code. The background is that
I needed a pointer wrapper class for ppcle vectors because gcc by default never
uses the lvx / stvx instructions even when it knows an address is aligned; it
always wants to use lxvd2x / xxswapd, and generates tons of unnecessary xxswapd
instructions. I'm aware of attempts to optimize away swaps, but those don't
apply to my application, so I just want to use lvx without the swaps. This is
somehow related to vec_st, since if I use inline asm to generate stvx it works.
It also works if there is no builtin_constant_p check in the rotate_left macro.
It would be really nice if there were a way to disable lane swap optimizations
and allow gcc to use the aligned loads/stores when the address is known to be
aligned. On x86 gcc already knows when to use aligned vs. unaligned loads, so
it must be possible on ppc as well. My code can execute over 100 million vector
load and store instructions per second, so removing the swaps has a real
performance impact.
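The intended semantics can be sketched in scalar C++. This is only an illustration of the store ordering the report describes (the value must be stored at the current idx before idx is incremented) together with a rotate_left macro that dispatches on __builtin_constant_p, similar in shape to the one the report mentions. The real testcase is in attachment 41802 and uses vec_st on vector types; every name and definition below is an assumption, not code from the bug.

```cpp
#include <cstdint>

// Hypothetical variable-shift helper (assumed name, not from the bug).
static inline std::uint64_t rotate_left_var(std::uint64_t x, unsigned n) {
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

// Hypothetical rotate_left macro: a __builtin_constant_p check picks a
// pure-constant expression when the shift count is a compile-time constant,
// and falls back to the helper otherwise. The report says the presence of
// such a check is what exposes the miscompile.
#define rotate_left(x, n)                                                    \
    (__builtin_constant_p(n)                                                 \
         ? (((x) << ((n) & 63)) | ((x) >> ((64 - (n)) & 63)))                \
         : rotate_left_var((x), (n)))

// Scalar analogue of the loop: the store must hit p[idx] BEFORE idx is
// incremented. The reported bug is that at -O2 the pointer/index update
// is hoisted ahead of the first vec_st.
void fill(std::uint64_t *p, unsigned count) {
    unsigned idx = 0;
    for (unsigned i = 0; i < count; ++i) {
        p[idx] = rotate_left(std::uint64_t(1), i);  // store at current idx...
        ++idx;                                      // ...then advance
    }
}
```

In the miscompiled code the increment of the pointer effectively happens before the first store, so the first element is never written and every store lands one slot too high.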
Here is the gcc-7 assembly. Note that the addi 3,3,16 after the unconditional
branch to .L3 is executed before the first stvx 0,0,3, so the vector pointer is
incremented before it's ever written:
        sldi 9,5,4
        srdi 10,4,5
        add 3,3,9
        addi 9,10,1
        mtctr 9
        li 8,1
        b .L3
        .p2align 4,,15
.L2:
        stvx 0,0,3
        addi 5,5,1
.L3:
#APP
 # 20 "msary_bug.C" 1
        rotld 9,8,5
 # 0 "" 2
#NO_APP
        mtvsrd 32,9
        addi 3,3,16
        xxpermdi 32,32,32,0
        bdnz .L2
        sldi 10,10,5
        subf 4,10,4
        mtvsrd 32,4
        xxpermdi 32,32,32,0
        stvx 0,0,3
        blr