This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH, ARM, RFC] Fix vect.exp failures for NEON in big-endian mode


On Fri, 1 Mar 2013 14:35:05 +0000
Paul Brook <paul@codesourcery.com> wrote:

> > It's not even necessary to use explicit shuffles -- NEON has
> > perfectly good instructions for loading/storing vectors in the
> > "right" order, in the form of vld1 & vst1. I'm afraid the solution
> > to this problem might have been staring us in the face for years,
> > which is simply to forbid vldr/vstr/vldm/vstm (the instructions
> > which lead to weird element permutations in BE mode) for
> > loading/storing NEON vectors altogether. That way the vectorizer
> > gets what it wants, the intrinsics can continue to use
> > __builtin_shuffle exactly as they are doing, and we get to remove
> > all the bits which fiddle vector element numbering in BE mode in
> > the ARM backend.
> > 
> > I can't exactly remember why we didn't do that to start with. I
> > think the problem was ABI-related, or to do with transferring NEON
> > vectors to/from ARM registers when it was necessary to do that...
> > I'm planning to do some archaeology to try to see if I can figure
> > out a definitive answer.
> 
> The ABI defined vector types (uint32x4_t etc) are defined to be in
> vldm/vstm order.

There's no conflict with the ABI-defined vector order -- the ABI
(looking at AAPCS, IHI 0042D) describes "containerized" vectors which
should be used to pass and return vector quantities at ABI boundaries,
but I couldn't find any further restrictions. Internally to a function,
we are still free to use vld1/vst1 vector ordering. Using
"containerized"/opaque transfers, the bit pattern of a vector in one
function (using vld1/vst1 ordering internally) will of course remain
unchanged when it is passed to another function that uses the same
ordering internally.
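
To make the ordering difference concrete, here is a small host-side C
model (a sketch, not target code) of where four 32-bit elements end up in
a Q register in big-endian mode. The type qreg and the function names are
made up for illustration. The model assumes that vldm loads each D
register as one 64-bit big-endian value, and that vld1.32 places memory
element i in lane i.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of one 128-bit Q register as four 32-bit lanes;
   lane[0] is the least significant 32 bits of the low D register.  */
typedef struct { uint32_t lane[4]; } qreg;

/* Read a 32-bit big-endian value from memory.  */
uint32_t load_be32 (const uint8_t *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
         | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Assumed model of "vld1.32 {q0}, [mem]" in big-endian mode:
   memory element i lands in lane i.  */
qreg model_vld1_32_be (const uint8_t *mem)
{
  qreg q;
  assert (mem != 0);
  for (int i = 0; i < 4; i++)
    q.lane[i] = load_be32 (mem + 4 * i);
  return q;
}

/* Assumed model of "vldm mem, {q0}" in big-endian mode: each D
   register is loaded as a single 64-bit big-endian value, so the
   word at the lower address becomes the upper (odd) lane.  */
qreg model_vldm_be (const uint8_t *mem)
{
  qreg q;
  assert (mem != 0);
  for (int d = 0; d < 2; d++)
    {
      q.lane[2 * d + 1] = load_be32 (mem + 8 * d);
      q.lane[2 * d] = load_be32 (mem + 8 * d + 4);
    }
  return q;
}
```

With memory holding the elements 10, 11, 12, 13, the vld1.32 model gives
lanes {10, 11, 12, 13} and the vldm model gives {11, 10, 13, 12}. Other
element sizes would permute differently; only the 32-bit case is modelled
here.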

Actually making that work (especially efficiently) with GCC is a
slightly different matter. Let's call vldm/vstm-ordered vectors
"containerized" format, and vld1/vst1-ordered vectors "array" format. We
need to introduce the concept of marshalling vector arguments from
array format to containerized format when passing them to a function,
and unmarshalling those vector arguments back the other way on function
entry. AFAICT, GCC does not have suitable infrastructure for
implementing such functionality at present: consider that e.g. vectors
passed by value on the stack should use containerized format, which
means the called function cannot simply dereference the stack pointer
to read the vector:

typedef int v4si __attribute__ ((vector_size (16)));

void foo (int dummy1, int dummy2, int dummy3, int dummy4, v4si myvec)
{
  v4si *myvec_ptr = &myvec;
  ...
}

Here the hypothetical "unmarshal" operation would need to do something
like:

  add r0, sp, #myvec_offset
  vldm r0, {q0}
  add r0, sp, #myvec_temp_offset
  vst1.32 {q0}, [r0]
  /* myvec_ptr points to myvec_temp_offset.  */

In many cases the marshal/unmarshal operations don't have to do
anything except use vldr/vstr/vldm/vstm or the core-register transfer
equivalents instead of vld1/vst1 for reading/writing vectors used as
arguments, so we generally don't have to incur any overhead like that,
though.
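
For the case where a real copy is needed, the effect of the conversion on
the memory image of a V4SI vector can be sketched in portable C. This is
only a model under the assumption that, in big-endian mode, the two
formats differ by a swap of the two 32-bit words inside each 64-bit half;
convert_v4si_layout_be is a hypothetical name, not an existing GCC or
NEON routine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert the memory image of a v4si between
   "containerized" (vldm/vstm) and "array" (vld1/vst1) layout on a
   big-endian target.  In this model the conversion swaps the two
   32-bit words inside each 64-bit half, so the same routine serves
   as both marshal and unmarshal.  */
void convert_v4si_layout_be (const uint32_t in[4], uint32_t out[4])
{
  assert (in != out);   /* The model copies; it does not work in place.  */
  for (int d = 0; d < 2; d++)
    {
      out[2 * d] = in[2 * d + 1];
      out[2 * d + 1] = in[2 * d];
    }
}
```

Applying the routine twice gives back the original words, which matches
the expectation that marshalling followed by unmarshalling leaves the
vector unchanged.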

I experimented with a patch which tried to do marshalling/unmarshalling
in RTL, using DImode/TImode for the containerized format (splitting
neon.md/*neon_mov<mode> into DImode/TImode versions for containerized
vectors, and V*mode versions for array-format vectors with only
vmov/vld1/vst1 alternatives, and tweaking several other target macros
etc. appropriately), but that didn't work very well, and I don't think it
could handle the case described above that requires a copy. (Several
optimisation passes are keen to form V*mode subregs of
DImode values, even if CANNOT_CHANGE_MODE_CLASS/MODES_TIEABLE_P are
tweaked. The hooks/macros controlling argument & function-return
promotion appear to get some of the way there to implementing the RTL
"solution", but evidently not far enough.)

So, I think the proper way of implementing this is probably at the tree
level -- maybe rewriting vector types in function argument lists to
"opaque" vectors, like e.g. rs6000 uses for some intrinsics, and
inserting machine-dependent operations for marshalling and
unmarshalling at appropriate points -- maybe still using DImode/TImode
to represent containerized (opaque) vectors at the RTL level, or maybe
introducing new machine modes if that doesn't work reliably.

The two main advantages of this approach over the status quo are:

1. Big-endian mode works as well as little-endian mode for NEON --
intrinsics, vectorization, the lot.

2. Even in little-endian mode, using vld1/vst1 predominantly over
vldr/vstr means that the alignment hints in those instructions can be
used more often, which might be a minor performance boost.

Would this be a sensible approach, or am I completely wrong? I'm not
sure if I can dedicate time to implementing it at the moment in any
case. Maybe someone within ARM (or Linaro) could take it up? ;-)

(Anyway, I still think it might be a good idea to apply the original
patch until such work is done, considering vectorization -- enabled at
-O3 -- is broken with NEON turned on in big-endian mode at the moment.)

Thanks,

Julian

