
[Bug target/81504] New: gcc-7 regression: vec_st in loop misoptimized


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81504

            Bug ID: 81504
           Summary: gcc-7 regression: vec_st in loop misoptimized
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zoltan at hidvegi dot com
                CC: wschmidt at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc64le-unknown-linux-gnu

Created attachment 41802
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41802&action=edit
gcc-7 -O2, vec_st in loop misoptimized

The attached code is miscompiled by gcc-7 at -O2; gcc-6 produces correct code,
and gcc-7 at -O1 also generates good code. With gcc-7 -O2, idx is always
incremented before p[idx] is written using vec_st. This is a minimal testcase
I created to reproduce the problem, not real code. The background is that I
needed a pointer wrapper class for ppcle vectors, because gcc by default never
uses the lvx / stvx instructions even when it knows an address is aligned; it
always wants to use lxvd2x / xxswapd and generates tons of unnecessary xxswapd
instructions. I'm aware of attempts to optimize away swaps, but those don't
apply to my application, so I just want to use lvx without the swaps. The bug
is somehow related to vec_st: if I use inline asm to generate stvx instead, it
works. It also works if there is no __builtin_constant_p check in the
rotate_left macro. It would be really nice if there were a way to disable the
lane-swap optimizations and allow gcc to use aligned loads/stores when the
address is known to be aligned. On x86 gcc already knows when to use aligned
vs. unaligned loads, so it must be possible on ppc as well. My code can
execute over 100 million vector load and store instructions per second, so
removing the swaps has a real performance impact.
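
The attachment itself is not reproduced here, so for illustration only, here
is a minimal sketch of the kind of code described above. The names, the store
loop, and the exact macro shape are my reconstruction, not the attached
testcase; the non-constant path of the macro uses inline asm rotld, which
matches the rotld visible in the generated assembly below:

#include <altivec.h>

/* Hypothetical sketch, not the attached testcase: a rotate_left macro
   with a __builtin_constant_p check.  The constant path is an ordinary
   shift/or rotate; the non-constant path emits rotld via inline asm. */
#define rotate_left(x, n)                                              \
    (__builtin_constant_p(n)                                           \
         ? (((x) << ((n) & 63)) | ((x) >> (-(n) & 63)))                \
         : ({ unsigned long r_;                                        \
              asm("rotld %0,%1,%2" : "=r"(r_) : "r"(x), "r"(n));       \
              r_; }))

/* The store loop: p[idx] must be written with vec_st before idx is
   incremented; the gcc-7 -O2 output below performs the increment first. */
void fill(vector unsigned long long *p, unsigned long n, unsigned long idx)
{
    for (unsigned long i = 0; i <= (n >> 5); ++i) {
        vector unsigned long long v =
            { rotate_left(1UL, idx), rotate_left(1UL, idx) };
        vec_st(v, 0, &p[idx]);  /* store first ...            */
        ++idx;                  /* ... then advance the index */
    }
}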

Here is the gcc-7 assembly. Note that the addi 3,3,16 after the unconditional
branch to .L3 is executed before the first stvx 0,0,3, so the vector pointer
is incremented before anything is ever stored through it:

        sldi 9,5,4
        srdi 10,4,5
        add 3,3,9
        addi 9,10,1
        mtctr 9
        li 8,1
        b .L3                   # first iteration enters at .L3, skipping the store
        .p2align 4,,15
.L2:
        stvx 0,0,3              # the store, only reached via the bdnz below
        addi 5,5,1
.L3:
#APP
 # 20 "msary_bug.C" 1
        rotld 9,8,5
 # 0 "" 2
#NO_APP
        mtvsrd 32,9
        addi 3,3,16             # pointer bumped here, before the first stvx executes
        xxpermdi 32,32,32,0
        bdnz .L2
        sldi 10,10,5
        subf 4,10,4
        mtvsrd 32,4
        xxpermdi 32,32,32,0
        stvx 0,0,3              # final store after the loop
        blr
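
As noted above, replacing vec_st with inline asm that issues stvx directly
makes the problem go away. A sketch of such a workaround, again my own
illustration rather than code from the attachment:

#include <altivec.h>

/* Hypothetical workaround sketch: force an stvx store via inline asm
   instead of vec_st.  The "v" constraint puts v in an AltiVec register;
   the memory clobber stops the compiler from reordering the store
   relative to other memory accesses. */
static inline void store_stvx(vector unsigned long long v,
                              vector unsigned long long *p)
{
    asm volatile("stvx %1,0,%0" : : "r"(p), "v"(v) : "memory");
}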
