what optimization can be expected?

Fri Apr 24 14:44:00 GMT 2009

Burlen Loring wrote:
> Tim Prince wrote:
>> burlen wrote:
>>
>>  
>>> Can the compiler automatically optimize loops with a non-unit stride
>>> using SSE?
>>>
>>> #include <cmath>
>>> #include <cstddef>
>>>
>>> // Euclidean norm of n vectors, each with nComp contiguous components.
>>> template <int nComp>
>>> void norm(double *result, double *data, size_t n)
>>> {
>>>  double *pDat=data;
>>>  double *pRes=result;
>>>
>>>  for (size_t i=0; i<n; ++i)
>>>  {
>>>    *pRes=pDat[0]*pDat[0];
>>>    for (int j=1; j<nComp; ++j)
>>>    {
>>>      *pRes+=pDat[j]*pDat[j];
>>>    }
>>>    *pRes=sqrt(*pRes);
>>>
>>>    pRes+=1;
>>>    pDat+=nComp;  // non-unit stride through the input
>>>  }
>>> }
>>>     
>>
>> Your inner loop appears to have unit stride, and might be optimized
>> easily if you didn't write it with potential aliases.  If you meant
>> inner_product(), why not use that?
>>   
> The inner loop does have unit stride, but it's usually short, between 1
> and 12 iterations, while the outer loop is usually long, in the tens to
> hundreds of thousands. That example is just one simple situation that I
> encounter. I want to understand how the compiler applies SSE
> optimization. What can g++ automatically optimize with SSE? Is this
> documented somewhere?
> 
> I want to write code in a way that takes advantage of g++'s
> capabilities. It's important for me to let g++ do the optimization
> because the code needs to be cross-platform.
> 

SSE vectorization works only on a stride-1 inner loop (apart from a few
limited cases with SSE4).  If that loop had a known constant trip count,
you might instruct the compiler to unroll it entirely, but that is not
possible according to your follow-up.
You still ought to get a significant optimization by writing the apparent
aliasing out of the loop (accumulating the sum in a local variable rather
than through *pRes), if in fact there is no such aliasing, as
inner_product() would do.  If the loop did vectorize, which g++ would do
only with -O3 -ffast-math, and that slowed it down on account of the
short loop length, you could turn the vectorization off again by removing
-ffast-math or by other means, such as -fno-tree-vectorize.
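
As a rough sketch of what writing the aliasing out of the loop could look
like: the names norm_local and norm_inner_product are made up for this
illustration, and the code assumes that result and data do not overlap.
The first version accumulates into a local variable; the second uses
std::inner_product from <numeric>.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>

// Hypothetical rewrite of norm() from the question. The running sum lives
// in a local variable, so the compiler does not have to assume that a
// store through result could change values later read through data.
template <int nComp>
void norm_local(double *result, const double *data, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const double *p = data + i * nComp;  // start of the i-th vector
        double sum = 0.0;
        for (int j = 0; j < nComp; ++j)
        {
            sum += p[j] * p[j];
        }
        result[i] = std::sqrt(sum);  // single store per output element
    }
}

// The same computation expressed with std::inner_product.
template <int nComp>
void norm_inner_product(double *result, const double *data, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const double *p = data + i * nComp;
        result[i] = std::sqrt(std::inner_product(p, p + nComp, p, 0.0));
    }
}
```

Called as, for example, norm_local<3>(out, in, n).  Whether either form
is actually vectorized depends on the compiler version and flags; recent
g++ releases can report what they vectorized with -fopt-info-vec.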
To optimize the outer loop, you would need to declare the operands as
double * __restrict__, if that is valid, and be willing to deal with the
different spellings each C++ compiler gives the restrict extension.
That may not make much difference, as long as you optimize the inner loop.
You haven't told us whether the operands are in fact aliased; if they are,
declaring them __restrict__ would let the compiler break your code by
attempting to optimize it.
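
One possible way to cope with the different spellings of the restrict
extension is a small macro.  This is only a sketch: RESTRICT and
norm_restrict are names invented here, not anything a compiler provides,
and the qualifier is only legitimate if result and data never overlap.

```cpp
#include <cmath>
#include <cstddef>

// RESTRICT is a hypothetical macro name chosen for this example.
#if defined(__GNUC__) || defined(__clang__)
#  define RESTRICT __restrict__   // g++ and clang++ spelling
#elif defined(_MSC_VER)
#  define RESTRICT __restrict     // Visual C++ spelling
#else
#  define RESTRICT                // unknown compiler: no promise made
#endif

// Assumes the caller guarantees that result and data do not alias.
// If they do overlap, the behavior is undefined.
template <int nComp>
void norm_restrict(double *RESTRICT result,
                   const double *RESTRICT data,
                   std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const double *p = data + i * nComp;  // start of the i-th vector
        double sum = 0.0;
        for (int j = 0; j < nComp; ++j)
        {
            sum += p[j] * p[j];
        }
        result[i] = std::sqrt(sum);
    }
}
```

On a compiler the macro does not recognize, RESTRICT expands to nothing,
so the code still compiles and only the optimization hint is lost.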