This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: lvx versus lxvd2x on power8

From: Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>
To: igor dot nunes at eldorado dot org dot br
Cc: gcc at gcc dot gnu dot org
Date: Tue, 11 Apr 2017 09:33:54 -0500
Subject: Re: lvx versus lxvd2x on power8

Hi Igor,

(Apologies for not threading this; I haven't received my digest for this
list yet.)

You wrote:

>I recently checked this old discussion about when/why to use lxvd2x instead of 
>lvsl/lvx/vperm/lvx to load elements from memory to vector: 
>https://gcc.gnu.org/ml/gcc/2015-03/msg00135.html

>I had the same question and was also concerned about how these approaches
>affect performance, so I created the following project to check which one is
>faster and how memory alignment influences the results:

>https://github.com/PPC64/load_vec_cmp

>This is simple code in which many loads (using both approaches) are executed
>in a simple loop in order to measure which implementation is slower. The
>project also considers alignment.

>As can be seen in this plot (https://raw.githubusercontent.com/igorsnunes/load_vec_cmp/master/doc/LoadVecCompare.png),
>an unaligned load using lxvd2x takes more time.

>The previous discussion (as far as I could see) concludes that lxvd2x performs
>better than lvsl/lvx/vperm/lvx in all cases. Is that correct? Is my analysis wrong?

>This issue concerns me, since lxvd2x is heavily used in compiled code.

One problem with your analysis is that you are forcing the use of the xxswapd
following the lxvd2x.  Although this is technically required for a load in
isolation to place elements in the correct lanes, in practice the compiler is
able to remove almost all of the xxswapd instructions during optimization.  Most
SIMD code does not care about which lanes are used for calculation, so long as
results in memory are placed properly.  For computations that do care, we can
often adjust the computations to still allow the swaps to be removed.  So your
analysis does not show anything about how code is produced in practice.
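
To illustrate why the swaps can usually be dropped, here is a minimal sketch
in portable C. It is a model, not real VSX code: the vec128 type and every
function name below are hypothetical, and the "swapped" load and store only
imitate the doubleword ordering of lxvd2x and stxvd2x on little-endian.
Because the add is lane-wise, the version without swaps stores exactly the
same bytes as the version with them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register: two 64-bit doublewords. */
typedef struct { uint64_t dw[2]; } vec128;

/* Model of xxswapd: exchange the two doublewords. */
static vec128 swap_dw(vec128 v) {
    vec128 r = {{ v.dw[1], v.dw[0] }};
    return r;
}

/* Model of a load that leaves the doublewords in swapped lanes. */
static vec128 load_swapped(const uint64_t *p) {
    vec128 r = {{ p[1], p[0] }};
    return r;
}

/* Model of the matching store, which swaps again on the way out. */
static void store_swapped(uint64_t *p, vec128 v) {
    p[0] = v.dw[1];
    p[1] = v.dw[0];
}

/* Lane-wise add: each lane only ever meets the same lane of the other input. */
static vec128 add_dw(vec128 a, vec128 b) {
    vec128 r = {{ a.dw[0] + b.dw[0], a.dw[1] + b.dw[1] }};
    return r;
}

/* Naive form: a swap after every load and before every store. */
void add_with_swaps(uint64_t *dst, const uint64_t *a, const uint64_t *b) {
    assert(dst != NULL && a != NULL && b != NULL);
    vec128 va = swap_dw(load_swapped(a));
    vec128 vb = swap_dw(load_swapped(b));
    store_swapped(dst, swap_dw(add_dw(va, vb)));
}

/* Optimized form: all three swaps removed, same result in memory. */
void add_without_swaps(uint64_t *dst, const uint64_t *a, const uint64_t *b) {
    assert(dst != NULL && a != NULL && b != NULL);
    store_swapped(dst, add_dw(load_swapped(a), load_swapped(b)));
}
```

Calling both functions on the same inputs and comparing the two destination
arrays shows they agree. An operation that does depend on lane position (for
example, extracting element 0) would not commute with the swap this way; that
is the kind of computation that has to be adjusted before the swaps can go.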

Another issue is that you're throwing away the results of the loads, which isn't
a particularly useful way to measure the latency costs of the instructions.
Typically with the pipelined lvx implementation, you will have an lvx feeding
the vperm feeding at least one use of the loaded value in each iteration of the
loop, while with lxvd2x and optimization you will only have an lxvd2x feeding
the use(s).  In most cases it is easier for the scheduler to cover latencies in
the latter sequence.
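
The difference in dependency chains can be sketched in portable C. This is
only a byte-level model of the two sequences, not AltiVec/VSX code; the
function names are hypothetical, and the permute is written as a plain loop
in big-endian-style byte order. The realigning path needs two aligned loads
and a permute before the value can be used, whereas the direct path is a
single load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of lvx: the low four address bits are ignored, so the load always
 * reads the enclosing 16-byte aligned block. */
static void load_aligned16(uint8_t out[16], const uint8_t *p) {
    const uint8_t *base = (const uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
    memcpy(out, base, 16);
}

/* Model of the lvsl/lvx/lvx/vperm sequence. The caller must guarantee that
 * both enclosing aligned blocks are readable. */
void load_realign(uint8_t out[16], const uint8_t *p) {
    uint8_t lo[16], hi[16];
    unsigned shift = (unsigned)((uintptr_t)p & 15); /* what lvsl encodes */

    load_aligned16(lo, p);       /* block holding the first byte */
    load_aligned16(hi, p + 15);  /* block holding the last byte  */

    /* Model of vperm: pick 16 bytes out of the 32-byte pair lo:hi. */
    for (unsigned i = 0; i < 16; i++) {
        unsigned idx = shift + i;
        assert(idx < 32);
        out[i] = (idx < 16) ? lo[idx] : hi[idx - 16];
    }
}

/* Model of a single unaligned load such as lxvd2x (lane order ignored). */
void load_unaligned(uint8_t out[16], const uint8_t *p) {
    memcpy(out, p, 16);
}
```

In a pipelined loop the hi block of one iteration is reused as the lo block
of the next, so only one aligned load is issued per iteration, but every use
of the loaded value still sits behind the permute.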

Finally, as a rule of thumb, these kinds of "loop kernels" are really bad for
predicting performance, particularly on POWER.

In the upcoming POWER9 processors, the swap issue goes away entirely, as we will
have true little-endian unaligned loads (the indexed-form lxvx to replace lxvd2x/
xxswapd, and the offset-form lxv to reduce register pressure).

Now, you will of course see slightly worse performance for an unaligned lxvd2x
than for an aligned lxvd2x.  This happens at specific crossing points where
the hardware has to work a bit harder.
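
As a rough aid for reasoning about such crossings, the following C sketch
tests whether an access of a given width straddles a power-of-two boundary.
The function name is hypothetical, and the boundary size is left as a
parameter because the crossing points themselves are not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the bytes [addr, addr + width) span two different
 * boundary-sized blocks. The boundary must be a power of two. */
bool crosses_boundary(uintptr_t addr, unsigned width, uintptr_t boundary) {
    assert(width > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    uintptr_t mask  = ~(boundary - 1);
    uintptr_t first = addr & mask;               /* block of the first byte */
    uintptr_t last  = (addr + width - 1) & mask; /* block of the last byte  */
    return first != last;
}
```

For example, with a 16-byte access and a 128-byte boundary (a value chosen
only for illustration), address 112 stays inside one block while address 113
spills into the next.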

I hate to just say "trust me" but I want you to understand that we have been
looking at these kinds of performance issues for several years.  This does
not mean that there are no cases where the pipelined lvx solution works better
for a particular loop, but if you let the compiler optimize it (or do similar
optimization in your own assembly code), lxvd2x is almost always better.

Thanks,
Bill
