This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.


Re: arch-specific template code

From: Marc Glisse <marc dot glisse at inria dot fr>
To: Ulrich Drepper <drepper at gmail dot com>
Cc: libstdc++ at gcc dot gnu dot org
Date: Sat, 8 Sep 2012 21:17:45 +0200 (CEST)
Subject: Re: arch-specific template code
References: <CAOPLpQex7niqQ4VWoSheHpbb8-Jf9S+nULgxB3WNUp4r_xyriw@mail.gmail.com> <503BF427.1030300@oracle.com> <CAOPLpQdRUONpKZ9WfD1VP+YgmJVoJxp1L6OL=5hXH2LzjnZ8ww@mail.gmail.com> <alpine.DEB.2.02.1208281418540.20374@stedding.saclay.inria.fr> <CAOPLpQfgMoJjgwtZ6WMvpY41Vj0BD3SUQs3RNO1dbKQzp8KMaw@mail.gmail.com> <alpine.DEB.2.02.1208281716410.20374@stedding.saclay.inria.fr> <504B0C31.2040207@oracle.com> <alpine.DEB.2.02.1209081115290.3775@laptop-mg.saclay.inria.fr> <CAOPLpQfz+tKos-6SNqdbcoR7zedVAPcxbQpmH97y8ZOiCcwNXQ@mail.gmail.com> <alpine.DEB.2.02.1209081549090.3775@laptop-mg.saclay.inria.fr> <CAOPLpQd5zZxNL=fvJaTPeUOaEEh_XaPVzibiEXwS+X-Sc-jQ2Q@mail.gmail.com>

On Sat, 8 Sep 2012, Ulrich Drepper wrote:

>> I believe every __builtin_shuffle that can be done in a single instruction
>> is already properly expanded on x86. For this 16 byte vector shuffle, it
>> uses pshufb. Is there a better instruction?
>
> The shuffle operations need a memory or [xy]mm register parameter.
> That's expensive to set up.  The shuffling which doesn't rearrange
> bytes but just rotates them should use the equivalent of the
>
> _mm_srli_si128
>
> and
>
> _mm_slli_si128
>
> intrinsics.

I did a quick test, and psrldq+pslldq+por was slower than pshufb with a memory operand. True, the test isn't representative: in a program that uses more memory, the shuffle mask won't be in cache and the load can be much slower. I don't know how people usually determine what the best code is.
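To make the comparison concrete, here is a minimal sketch of the psrldq+pslldq+por variant, written with the SSE2 intrinsics named above. It is not the code from the quick test; the names rotate_bytes and rotate_bytes_ref are hypothetical, and the rotation direction (toward lower byte indices) is an assumption chosen for illustration. The shift count has to be a compile-time constant because both shift instructions take an immediate operand, hence the template parameter.

```cpp
#include <emmintrin.h>  // SSE2: _mm_srli_si128, _mm_slli_si128, _mm_or_si128
#include <cstdint>
#include <cstring>

// Hypothetical helper: rotate the 16 bytes of v by N positions toward
// lower indices, i.e. result[i] = v[(i + N) % 16] in memory order.
// Needs no shuffle mask in memory or in a register.
template <int N>
inline __m128i rotate_bytes(__m128i v)
{
    static_assert(N > 0 && N < 16, "N must be in 1..15");
    return _mm_or_si128(_mm_srli_si128(v, N),        // psrldq: bytes N..15 move down
                        _mm_slli_si128(v, 16 - N));  // pslldq: bytes 0..N-1 move up
}

// Scalar reference with the same semantics, useful for checking the
// vector version.
inline void rotate_bytes_ref(const std::uint8_t* in, std::uint8_t* out, int n)
{
    for (int i = 0; i < 16; ++i)
        out[i] = in[(i + n) % 16];
}
```

The pshufb alternative (_mm_shuffle_epi8, which requires SSSE3) does the same job in one instruction but needs a 16-byte control vector, which is the load being weighed against the three register-only instructions here.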

From your post it seems that psrldq+pslldq+por should always be preferred, even if in some cases it can be a loss. That makes sense, I am just more used to "obvious" optimizations. Is there a way to decide when a load becomes worth it, compared to a large number of pure logical instructions?

With -Os, is the choice the same?

(libstdc++ is not really the right list anymore...)

--
Marc Glisse

