This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
vec_ld versus vec_vsx_ld on power8
- From: Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>
- To: timothee dot ewart at epfl dot ch
- Cc: gcc at gcc dot gnu dot org
- Date: Fri, 13 Mar 2015 10:16:10 -0500
- Subject: vec_ld versus vec_vsx_ld on power8
- Authentication-results: sourceware.org; auth=none
I'll discuss the loads here for simplicity; the situation for stores is
analogous.

There are a couple of differences between lvx and lxvd2x. The most
important one is that lxvd2x supports unaligned loads, while lvx does
not. You'll note that lvx will zero out the lower 4 bits of the
effective address in order to force an aligned load.
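
The address masking described above can be sketched in portable C. This is only a model of the address arithmetic, not of the instruction itself, and the function names are hypothetical:

```c
#include <stdint.h>

/* Model of the address lvx actually accesses: the low 4 bits of the
   effective address are cleared, giving a 16-byte-aligned address. */
uintptr_t lvx_effective_address(uintptr_t ea)
{
    return ea & ~(uintptr_t)0xF;
}

/* Returns 1 when lvx would load exactly the 16 bytes starting at ea,
   i.e. when ea is already 16-byte aligned. */
int lvx_loads_requested_bytes(uintptr_t ea)
{
    return lvx_effective_address(ea) == ea;
}
```

For example, an effective address of 0x1008 is treated as 0x1000, so lvx returns the aligned quadword that contains the requested address rather than the 16 bytes starting at it.
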
lxvd2x loads two doublewords into a vector register using big-endian
element order, regardless of whether the processor is running in
big-endian or little-endian mode. That is, the first doubleword from
memory goes into the high-order bits of the vector register, and the
second doubleword goes into the low-order bits. This is semantically
incorrect for little-endian, so the xxpermdi swaps the doublewords in
the register to correct for this.
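
As a rough illustration of the doubleword ordering, here is a small portable C model. The vreg_model type and the model_* names are hypothetical and only mimic the register layout described above; they are not the real instructions:

```c
#include <stdint.h>

/* A 128-bit vector register modeled as two doublewords. */
typedef struct {
    uint64_t hi;  /* high-order 64 bits of the register */
    uint64_t lo;  /* low-order 64 bits of the register  */
} vreg_model;

/* lxvd2x: the first doubleword from memory goes to the high-order
   bits, the second to the low-order bits, in either endian mode. */
vreg_model model_lxvd2x(const uint64_t mem[2])
{
    vreg_model r = { mem[0], mem[1] };
    return r;
}

/* xxpermdi used as a swap: exchange the two doublewords. */
vreg_model model_xxpermdi_swap(vreg_model r)
{
    vreg_model s = { r.lo, r.hi };
    return s;
}

/* Little-endian element numbering: element 0 is the low-order
   doubleword, element 1 the high-order doubleword. */
uint64_t le_element(vreg_model r, int i)
{
    return i == 0 ? r.lo : r.hi;
}
```

In this model, loading {10, 20} with model_lxvd2x leaves 20 in little-endian element 0; after model_xxpermdi_swap, element 0 is 10 and element 1 is 20, matching the order in memory.
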
At optimization level -O1 and higher, gcc will remove many of the xxpermdi
instructions that are added to correct for LE semantics. In many vector
computations, the lanes where the computations are performed do not
matter, so we don't have to perform the swaps.
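
To see why the swaps can be dropped for lane-insensitive work, consider an element-wise add. The sketch below is a portable C model with hypothetical names, and it assumes the store writes the high-order doubleword first, mirroring the load. It is not how gcc implements the optimization:

```c
/* Two doubles in a vector register, modeled by register position. */
typedef struct { double hi; double lo; } vreg2d;

/* Load and store in the fixed "first doubleword = high-order" order. */
vreg2d load_hi_first(const double mem[2])
{
    vreg2d r = { mem[0], mem[1] };
    return r;
}

void store_hi_first(vreg2d r, double mem[2])
{
    mem[0] = r.hi;
    mem[1] = r.lo;
}

vreg2d swap_dw(vreg2d r)
{
    vreg2d s = { r.lo, r.hi };
    return s;
}

/* Element-wise add: each lane is computed independently. */
vreg2d add2d(vreg2d a, vreg2d b)
{
    vreg2d r = { a.hi + b.hi, a.lo + b.lo };
    return r;
}

/* Naive sequence: swap after every load and before the store. */
void add_with_swaps(const double a[2], const double b[2], double out[2])
{
    vreg2d va = swap_dw(load_hi_first(a));
    vreg2d vb = swap_dw(load_hi_first(b));
    store_hi_first(swap_dw(add2d(va, vb)), out);
}

/* Optimized sequence: no swaps at all, same result in memory. */
void add_without_swaps(const double a[2], const double b[2], double out[2])
{
    store_hi_first(add2d(load_hi_first(a), load_hi_first(b)), out);
}
```

Both functions write the same values to out, because the lanes never interact. An operation that extracts or depends on one specific element would not have this property, and the swaps would have to stay.
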
For unaligned loads where we are unable to remove the swaps, the
lxvd2x/xxpermdi pair is still better than the alternative using lvx. An
unaligned load with lvx requires
a four-instruction sequence to load the two aligned quadwords that
contain the desired data, set up a permutation control vector, and
combine the desired pieces of the two aligned quadwords into a vector
register. This can be pipelined in a loop so that only one load occurs
per loop iteration, but that requires additional vector copies. The
four-instruction sequence takes longer and increases vector register
pressure more than an lxvd2x/xxpermdi.
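
The four-step sequence can be modeled in portable C as follows. The message above does not name the instructions beyond lvx; this sketch assumes the commonly used lvsl and vperm for the control vector and the combine step, uses big-endian byte numbering, and treats an index into a byte array as the effective address. All model_* names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } quad;

/* lvx: load the aligned quadword containing address ea. */
quad model_lvx(const uint8_t *mem, size_t ea)
{
    quad q;
    memcpy(q.b, mem + (ea & ~(size_t)0xF), 16);
    return q;
}

/* lvsl: build a permute control vector from the low 4 address bits. */
quad model_lvsl(size_t ea)
{
    quad m;
    for (int i = 0; i < 16; i++)
        m.b[i] = (uint8_t)((ea & 0xF) + i);
    return m;
}

/* vperm: pick 16 bytes out of the 32-byte concatenation a||b. */
quad model_vperm(quad a, quad b, quad control)
{
    uint8_t both[32];
    quad r;
    memcpy(both, a.b, 16);
    memcpy(both + 16, b.b, 16);
    for (int i = 0; i < 16; i++)
        r.b[i] = both[control.b[i] & 0x1F];
    return r;
}

/* Two aligned loads, one control vector, one combine. */
quad model_unaligned_load(const uint8_t *mem, size_t ea)
{
    quad first   = model_lvx(mem, ea);       /* quadword holding the start */
    quad second  = model_lvx(mem, ea + 15);  /* quadword holding the end   */
    quad control = model_lvsl(ea);
    return model_vperm(first, second, control);
}
```

Three vector values (two loads and the control vector) are live before the final combine, which is the register-pressure cost mentioned above.
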
When the data is known to be aligned, lvx performs the same as lxvd2x
if we are able to remove the permutes, and is preferable to lxvd2x if
not.
There are cases where we do not yet use lvx in lieu of lxvd2x when we
could do so and improve performance. For example, saving and restoring
of vector parameters in a function prolog and epilog does not yet always
use lvx. This is a performance opportunity we plan to improve in the
future.

A rule of thumb for your purposes is that if you can guarantee that you
are using aligned data, you should use vec_ld and vec_st, and otherwise
you should use vec_vsx_ld and vec_vsx_st. Depending on your
application, it may be worthwhile to copy your data into an aligned
buffer before performing vector calculations on it. GCC provides
attributes that will allow you to specify alignment on a 16-byte
boundary.
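
A minimal sketch of the aligned-buffer approach, using GCC's aligned attribute. The buffer name, its size, and the helper functions are hypothetical and chosen only for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_ELEMS 1024  /* arbitrary size for this example */

/* GCC attribute: place the buffer on a 16-byte boundary, so that
   loads from its start satisfy the alignment needed by vec_ld. */
float aligned_buf[BUF_ELEMS] __attribute__((aligned(16)));

/* Returns nonzero when p is 16-byte aligned. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}

/* Copy up to BUF_ELEMS floats from a possibly unaligned source into
   the aligned buffer; returns the number of elements copied. */
size_t copy_to_aligned(const float *src, size_t n)
{
    if (n > BUF_ELEMS)
        n = BUF_ELEMS;
    memcpy(aligned_buf, src, n * sizeof(float));
    return n;
}
```

After the copy, vector code can step through aligned_buf in 16-byte strides with vec_ld and vec_st. Whether the copy pays for itself depends on how often the data is reused.
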
Note that the above discussion presumes POWER8, which is the only POWER
hardware that currently supports little-endian distributions and
applications. Unaligned loads and stores were less efficient on earlier
processors, so the tradeoffs differ there.
I hope this is helpful!
Bill Schmidt, Ph.D.
IBM Linux Technology Center
> I have an issue/question using VMX/VSX on a Power8 processor on a little-endian system.
> Using intrinsic functions, if I perform an operation with vec_vsx_ld(...) - vec_vsx_st(), the compiler will add
> a permutation, and then perform the operations (memory correctly aligned)
> lxvd2x ...
> xxpermdi ...
> operations ...
> stxvd2x ...
> If I use vec_ld() - vec_st()
> operations ...
> Reading the ISA, I do not see a real difference between these 2 instructions (or I missed it)
> So my 3 questions are:
> Why do I have permutations?
> What is the cost of these permutations?
> What is the performance difference between vec_vsx_ld and vec_ld?