This is the mail archive of the mailing list for the GCC project.

Re: RFA: pervasive SSE codegen inefficiency

On Sep 14, 2005, at 9:50 PM, Andrew Pinski wrote:
On Sep 14, 2005, at 9:21 PM, Dale Johannesen wrote:
Consider the following SSE code
(-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2)

The first inner loop compiles to

paddq %xmm0, %xmm1

Good. The second compiles to

        movdqa  %xmm2, %xmm0
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1

when it could be using a single paddw. The basic problem is that
our approach defines __m128i to be V2DI even though all the operations
on the object are V4SI, so there are a lot of subregs that don't need
to generate code. I'd like to fix this, but am not sure how to go about it.

From the looks of it, this is more of a register allocation issue and has nothing to do with subregs at all, except for the subregs being there.

That's kind of an overstatement; obviously getting rid of the subregs would
solve the problem as you can see from the first function. I think you're right that

If we allocated 64 and 63 as the same register, it would have worked correctly.

(you mean 64 and 66) would fix this example; I'll look at that. Having a more
uniform representation for operations on __m128i objects would simplify things
all over the place, though.
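
For readers following along, the type mismatch under discussion can be sketched with GCC's generic vector extensions. This is not the test case from the original message, which is not reproduced here; the names v2di, v8hu, add_epi64 and add_epi16 are hypothetical stand-ins chosen for illustration. The sketch only shows why a 64-bit lane add needs no reinterpretation of a V2DI value, while a 16-bit lane add has to view the same 128 bits in a different mode on the way in and on the way out, which is where the subregs come from.

```c
/* Hypothetical stand-ins for the vector types involved (assumption:
   a compiler supporting GCC-style vector extensions, e.g. GCC or Clang). */
typedef long long      v2di __attribute__((vector_size(16))); /* 2 x 64-bit, like __m128i */
typedef unsigned short v8hu __attribute__((vector_size(16))); /* 8 x 16-bit lanes */

/* 64-bit lane add: the operation matches the declared type,
   so no reinterpreting cast is needed. */
static inline v2di add_epi64(v2di a, v2di b)
{
    return a + b;
}

/* 16-bit lane add: inputs and result are declared as 2 x 64-bit,
   but the add itself works on 8 x 16-bit lanes. Each cast only
   reinterprets the same 128 bits and should not need any code.
   Unsigned lanes are used so that wraparound is well defined. */
static inline v2di add_epi16(v2di a, v2di b)
{
    return (v2di)((v8hu)a + (v8hu)b);
}
```

In the second function the casts are the source-level counterpart of the subregs mentioned above: they change how the register is viewed, not its contents, so ideally the whole function body is a single paddw.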
