This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



Re: enabling SSE for 3-vector inner product


On 05/01/2010 09:58 PM, Brian Budge wrote:
Unless you modify your algorithm to perform many dot products
simultaneously, you're probably at your limit. Depending on what
you're trying to do, you might get close to a 4x speedup on the dot
products (conversely, you might not be able to do any better)

hi Brian


may I ask what you meant by "many dot products simultaneously"?

from my reading, there are only a limited number of xmm registers;
to do a single dot product, I need to do at least

__m128 a = _mm_loadu_ps(p1);
__m128 b = _mm_loadu_ps(p2);
__m128 r = _mm_dp_ps(a, b, 0x7f);
_mm_store_ss(&result1, r);

if I do many dot products, did you mean repeating the above
pseudo-code for different vectors, or running as many _mm_dp_ps
calls as possible on a fixed set of vectors?

In the inner loop of my code, I need to perform 12 inner
products in a row using 14 vectors. Will the load/store
overhead kill the speed-up?

thanks

Qianqian

Brian

On Fri, Apr 30, 2010 at 8:35 PM, Qianqian Fang <fangqq@gmail.com> wrote:
hi Marc

On 04/30/2010 06:31 AM, Marc Glisse wrote:
On Thu, 29 Apr 2010, Qianqian Fang wrote:

Shouldn't there be some magic here for alignment purposes?
thank you for pointing this out. I changed the definition to

typedef struct CPU_float4{
    float x,y,z,w;
} float4 __attribute__ ((aligned(16)));

but the run-time using SSE3 remains the same.
Is my above change correct?

now I am trying to use the SSE4.x DPPS instruction, but gcc gave
me an error. I don't know if I used it in the wrong format.
Did you try using the intrinsic _mm_dp_ps?
yes, I removed the asm and used _mm_dp_ps; it works now.
The code now looks like this:

inline float vec_dot(float3 *a, float3 *b){
        float dot;
        __m128 na, nb, res;
        na = _mm_loadu_ps((float*)a);   /* unaligned load of 4 floats */
        nb = _mm_loadu_ps((float*)b);
        res = _mm_dp_ps(na, nb, 0x7f);  /* 0x7f: multiply lanes 0-2, write sum to all lanes */
        _mm_store_ss(&dot, res);
        return dot;
}

sadly, using SSE4 only gave me a few percent (2~5%)
speed-up over the original C code. My profiling result
indicated the inner product took about 30% of my total
run time. Does this speedup make sense?

"dpps %%xmm0, %%xmm1, 0xF1 \n\t"
Maybe the order of the arguments is reversed in asm and it likes a $
before a constant (and it prefers fewer parentheses on the next line).

With gcc -S, I can see that the assembly is in fact "dpps 127, xmm1, xmm0",
so perhaps the operands were reversed in my previous version.


In any case, you shouldn't get a factor 2 compared to the SSE3 version, so
that won't be enough for you.
well, as I mentioned earlier, using SSE3 made my code 2.5x slower,
not faster. SSE4 is now 2~5% faster, but still not as significant
as I had hoped. I guess that's probably the best I can do with it, right?

thanks

Qianqian


