[Bug target/65456] New: powerpc64le autovectorized copy loop missed optimization

Wed Mar 18 03:52:00 GMT 2015

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65456

            Bug ID: 65456
           Summary: powerpc64le autovectorized copy loop missed
                    optimization
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: anton at samba dot org

Created attachment 35049
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35049&action=edit
Testcase pulled from valgrind

The attached copy loop (out of valgrind) produces some pretty bad code, with a
permute, a register copy and two swaps for every 16 bytes copied:

     df8:       e4 06 9e 78     rldicr  r30,r4,0,59
     dfc:       e4 26 df 78     rldicr  r31,r6,4,59
     e00:       10 00 84 38     addi    r4,r4,16
     e04:       01 00 c6 38     addi    r6,r6,1
     e08:       99 f6 20 7c     lxvd2x  vs33,0,r30
     e0c:       57 0a 21 f0     xxswapd vs33,vs33
     e10:       2b 03 a1 11     vperm   v13,v1,v0,v12
     e14:       97 0c 01 f0     xxlor   vs32,vs33,vs33
     e18:       56 6a 0d f0     xxswapd vs0,vs45
     e1c:       98 4f 1f 7c     stxvd2x vs0,r31,r9
     e20:       d8 ff 00 42     bdnz    df8 <memmove+0x6e8>

Since we are using VSX storage ops, which can access unaligned addresses, we
should just align the source and do unaligned stores. That will remove the
vperm, and then the GCC pass that removes redundant swaps should kick in and
remove the xxswapd instructions too.
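
As a rough illustration of the loop shape suggested above, here is a minimal
C sketch. It is not the attached testcase and not what GCC emits; the name
copy_align_src is hypothetical, and it assumes a forward copy of
non-overlapping buffers. It peels bytes until the source is 16-byte aligned,
then moves 16 bytes per iteration with an aligned load and a possibly
unaligned store, so no permute is needed to realign the data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only: forward copy that aligns the
   source pointer and tolerates an unaligned destination.  Assumes the
   buffers do not overlap. */
void copy_align_src(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* Peel: copy single bytes until src reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)src & 15) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Main loop: src is now 16-byte aligned, dst may be unaligned.
       The fixed-size memcpy calls stand in for one vector load and one
       vector store; a compiler is free to lower them that way. */
    while (n >= 16) {
        unsigned char chunk[16];
        memcpy(chunk, src, 16);   /* aligned 16-byte load */
        memcpy(dst, chunk, 16);   /* possibly unaligned 16-byte store */
        src += 16;
        dst += 16;
        n -= 16;
    }

    /* Tail: copy the remaining 0..15 bytes one at a time. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

Whether the vectorizer can be taught to pick this peeling strategy for the
real testcase is the open question in this report; the sketch only shows the
intended memory access pattern.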