Scatter/Gather vector operations

Tim Prince n8tm@aol.com
Sun Apr 8 14:02:00 GMT 2007


dzonatas@dzonux.net wrote:
> Tim Prince wrote:
>> Dzonatas wrote:
>>> Is this the only portable way to do a pack/unpack without asm()? 
>>> How do I set it up differently to trigger a pack/unpack optimization?
>>>
>>> Thank you.
>>>
>> If you're talking about optimization for a specific CPU, but you don't 
>> want to reveal which CPU that is, why even post this?
> No. I'm just trying to get an idea of what direction the future of such 
> code may take, as I also wonder what is the best format for now.
>> This code looks OK to me.  There isn't any special hardware support 
>> for this on commonly available CPUs, like Opteron or Xeon. Scalar 
>> moves should work as well as anything, and you are within the limits 
>> for efficient Write Combine buffering.  If you have problems, you 
>> won't get any help if you can't describe them more specifically.
>>
> The problem is bandwidth. Vector operations help greatly with that 
> alone, quite apart from the matrix math.
No, vector operations don't necessarily make any difference to the 
performance of bandwidth-limited operations.  In the example you give, 
all the memory operations appear to be sequential, and within the 
limit on parallel data streams supported by recent AMD and Intel CPUs.
> 
> Currently, there are immediate targets for SSE2 and Altivec enabled 
> architectures. I could probably write assembly code to overcome it with 
> instructions to unpack a vector and scatter the data that is specific 
> for SSE2/Altivec, but I don't want to aim that short. I would like to 
> avoid the assembly code if possible.
I don't see how Altivec has a future.
Problems with matrix transposition are somewhat related. Intel compilers 
can be persuaded (by setting #pragma vector always) to generate code 
which packs large-stride data into vector registers so as to optimize 
memory use on the store side for Intel CPUs.  This is counter-productive 
for AMD CPUs, which make different trade-offs between the cost of pack 
operations and a partial write-combine buffer flush.
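To make the pragma concrete, here is a minimal sketch of the 
transpose-like case: large-stride reads packed into unit-stride stores. 
The function name, sizes, and stride are illustrative, not from the 
original posting; icc honors the pragma, while gcc simply ignores the 
unknown pragma.

```c
/* Illustrative sketch: strided reads packed into contiguous writes.
   icc vectorizes this when told to; the pragma is a no-op elsewhere. */
void pack_stride(float *dst, const float *src, int n, int stride)
{
#pragma vector always
    for (int i = 0; i < n; i++)
        dst[i] = src[i * stride];  /* large-stride load, unit-stride store */
}
```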
I would not expect any advantage from grabbing your data in vector 
registers and unpacking: you don't change the memory access pattern. 
Even if you did, read operations don't have nearly the performance 
problems of write operations, where increasing the size of each data 
access could gain enough in memory performance to offset the cost of 
unpacking.
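For what it's worth, the pack/unpack you asked about is reachable from C 
without asm(), via the compiler intrinsics.  A minimal SSE2 sketch (the 
function name and data layout are made up for illustration):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Interleave the low halves of two arrays of 16-bit lanes --
   SSE2's "unpack" -- using intrinsics instead of inline asm. */
void unpack_lo16(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i lo = _mm_unpacklo_epi16(va, vb); /* a0,b0,a1,b1,a2,b2,a3,b3 */
    _mm_storeu_si128((__m128i *)out, lo);
}
```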
> 
> For example, is there a formal way to use a vector register as a pointer 
> to main memory to fetch that data into another vector register. I know 
> this is beyond the basic vector operations implemented now, but like:
> 
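There is no such instruction on SSE2 or Altivec: neither can use a 
vector register as a set of addresses, so what you describe -- a gather 
-- has to be spelled out as scalar loads today.  A sketch of what the 
compiler must emit (the function name and arguments are illustrative):

```c
/* A "gather" written the only way SSE2/Altivec allow: one scalar
   fetch per index lane.  A vector-of-addresses load would replace
   exactly this loop. */
void gather_f(float *dst, const float *src, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```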
Do you want the GPU of the future to look so much like the Convex or 
Cray vector machines of the past?  If you persuade enough people, it 
might happen.  It doesn't necessarily pay off if it does nothing but 
embed [un]packing in microcode.  SSE3 already has examples (e.g. 
horizontal add) which do nothing but clean up the asm code, which is of 
relatively little use when no one codes at that level.
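As an illustration of that point, the reduction that SSE3's horizontal 
add tidies up can already be written with plain SSE shuffles -- the same 
work, just messier asm.  A sketch (the helper name is mine):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of the four lanes of v, using only pre-SSE3
   operations; haddps merely condenses this instruction sequence. */
float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* b,a,d,c */
    __m128 sums = _mm_add_ps(v, shuf);   /* a+b, a+b, c+d, c+d */
    shuf = _mm_movehl_ps(shuf, sums);    /* bring c+d down to lane 0 */
    sums = _mm_add_ss(sums, shuf);       /* (a+b)+(c+d) in lane 0 */
    return _mm_cvtss_f32(sums);
}
```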


