This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Re: arch-specific template code


On Sat, 8 Sep 2012, Ulrich Drepper wrote:

>> I believe every __builtin_shuffle that can be done in a single instruction
>> is already properly expanded on x86. For this 16 byte vector shuffle, it
>> uses pshufb. Is there a better instruction?
>
> The shuffle operations need a memory or [xy]mm register parameter. That's
> expensive to set up. The shuffling which doesn't rearrange bytes but just
> rotates them should use the equivalent of the
>
> _mm_srli_si128
>
> and
>
> _mm_slli_si128
>
> intrinsics.

I did a quick test, and psrldq+pslldq+por was slower than pshufb with a memory operand. True, the test isn't representative: in a program that uses more memory, the shuffle mask won't be in cache and the load can be much slower. I don't know how people usually determine what the best code is.

From your post it seems that psrldq+pslldq+por should always be preferred, even if in some cases it can be a loss. That makes sense; I am just more used to "obvious" optimizations. Is there a way to decide when a load becomes worth it, compared to a large number of pure logical instructions?
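
For concreteness, here is a rough sketch of the two sequences being compared, written with intrinsics rather than __builtin_shuffle. The rotation count of 3 bytes is an arbitrary choice for illustration (the thread does not give one), and the function names are made up. The target("ssse3") attribute is a GCC/Clang extension used here only so the pshufb variant builds without -mssse3. This shows that the two forms compute the same result; it says nothing about which one is faster.

```c
#include <assert.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2: _mm_srli_si128, _mm_slli_si128, _mm_or_si128 */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Rotate the 16 bytes of v down by 3 positions using pshufb.
   The control mask has to live in a register or in memory. */
__attribute__((target("ssse3")))
static __m128i rotate3_pshufb(__m128i v)
{
    /* Result byte i is taken from source byte (i + 3) % 16. */
    const __m128i mask = _mm_setr_epi8(3, 4, 5, 6, 7, 8, 9, 10,
                                       11, 12, 13, 14, 15, 0, 1, 2);
    return _mm_shuffle_epi8(v, mask);
}

/* Same rotation using psrldq + pslldq + por: only immediates, no mask. */
static __m128i rotate3_shifts(__m128i v)
{
    __m128i low  = _mm_srli_si128(v, 3);   /* bytes 3..15 move to 0..12  */
    __m128i high = _mm_slli_si128(v, 13);  /* bytes 0..2 move to 13..15  */
    return _mm_or_si128(low, high);
}

/* Scalar reference for the same rotation. */
static void rotate3_scalar(const unsigned char in[16], unsigned char out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[(i + 3) % 16];
}

/* Self-check: both vector variants must agree with the scalar reference. */
static void check_rotations(void)
{
    unsigned char in[16], a[16], b[16], ref[16];
    for (int i = 0; i < 16; i++)
        in[i] = (unsigned char)(i * 7 + 1);

    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)a, rotate3_pshufb(v));
    _mm_storeu_si128((__m128i *)b, rotate3_shifts(v));
    rotate3_scalar(in, ref);

    assert(memcmp(a, ref, 16) == 0);
    assert(memcmp(b, ref, 16) == 0);
}
```

The pshufb form is one instruction plus the mask operand; the shift form is three instructions with immediates only, which is the trade-off in question.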

With -Os, is the choice the same?

(libstdc++ is not really the right list anymore...)

--
Marc Glisse

