This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH, ARM, RFC] Fix vect.exp failures for NEON in big-endian mode


Julian Brown wrote:
On Tue, 5 Mar 2013 10:42:59 +0100
Richard Biener <richard.guenther@gmail.com> wrote:

On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com>
wrote:
I somehow missed the "Appendix A: Support for Advanced SIMD
Extensions" in the AAPCS document (it's not in the TOC!). It looks
like the builtin vector types are indeed defined to be stored in
memory in vldm/vstm order -- I think that means we're back to
square one.
There's still the possibility of making gcc "generic" vector types
different from the ABI specified types[1], but that feels like it's
probably a really bad idea.

Having a distinct set of types just for the vectorizer may be a
more viable option. IIRC the type selection hooks are more flexible
than when we first looked at this problem.

Paul

[1] e.g. int gcc __attribute__((vector_size(8)));  vs. int32x2_t
eabi;
I think int32x2_t should not be a GCC vector type (thus not have a
vector mode). The ABI specified types should map to an integer mode
of the right size instead.  The vectorizer would then still use
internal GCC vector types and modes and the backend needs to provide
instruction patterns that do the right thing with the element
ordering the vectorizer expects.

How are the int32x2_t types used?  I suppose they are arguments to
the intrinsics.  Which means that for _most_ operations element order
does not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply
use the equivalent of return (int32x2_t)((gcc_int32x2_t)x +
(gcc_int32x2_t)y). In intrinsics where order matters you'd insert
appropriate __builtin_shuffle()s.
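
As a concrete sketch of that cast-based idea (names and the use of a plain 64-bit integer to stand in for the opaque ABI type are my assumptions, not real arm_neon.h code), note that GCC's vector extension permits casts between a vector type and any other type of the same size:

```c
#include <stdint.h>

/* Assumption: model int32x2_t as a plain 64-bit integer, standing in
   for an opaque ABI type that has an integer mode.  */
typedef uint64_t int32x2_t;

/* A true GCC generic vector type, as the vectorizer would use it.  */
typedef int32_t gcc_int32x2_t __attribute__ ((vector_size (8)));

/* Lane order does not matter for element-wise addition, so a plain
   same-size bit-cast into and out of the generic vector type is safe.  */
static inline int32x2_t
plus32x2 (int32x2_t x, int32x2_t y)
{
  return (int32x2_t) ((gcc_int32x2_t) x + (gcc_int32x2_t) y);
}
```

The casts are pure reinterpretations, so no shuffles are needed here; only intrinsics where lane order is observable would pay for a __builtin_shuffle.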

Maybe there's no need to interpret the vector layout for any of the intrinsics -- just treat all inputs and outputs as opaque. (There are intrinsics for getting/setting lanes; IMO these shouldn't attempt to convert lane numbers at all, though they do at present.) Several intrinsics are currently implemented using __builtin_shuffle, e.g.:

__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vrev64_s8 (int8x8_t __a)
{
  return (int8x8_t) __builtin_shuffle (__a, (uint8x8_t) { 7, 6, 5, 4, 3, 2, 1, 0 });
}
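
For reference, __builtin_shuffle on a true GCC vector type selects elements by index from its first operand (result[i] = a[mask[i]]), so the reversal mask above behaves as in this standalone sketch using a generic vector type rather than the NEON ones:

```c
/* A generic GCC vector of eight bytes.  */
typedef unsigned char u8x8 __attribute__ ((vector_size (8)));

/* Reverse the eight lanes of a vector: result[i] = a[mask[i]].  */
static u8x8
rev8 (u8x8 a)
{
  return __builtin_shuffle (a, (u8x8) { 7, 6, 5, 4, 3, 2, 1, 0 });
}
```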

I'd imagine that if int8x8_t are not actual vector types, we could
invent extra builtins to convert them to and from such types to be able
to still do this kind of thing (in arm_neon.h, not necessarily for
direct use by users), i.e.:

typedef char gcc_int8x8_t __attribute__((vector_size(8)));

int8x8_t
vrev64_s8 (int8x8_t __a)
{
  gcc_int8x8_t tmp = __builtin_neon2generic (__a);
  tmp = __builtin_shuffle (tmp, (gcc_int8x8_t) { 7, 6, 5, 4, ... });
  return __builtin_generic2neon (tmp);
}

(On re-reading, that's basically the same as what you suggested, I
think.)

Oh, of course do the above only for big-endian mode ...

The other way around, mapping intrinsics and ABI vectors to vector
modes will have issues ... you'd have to guard all optab queries in
the middle-end to fail for arm big-endian as they expect instruction
patterns that deal with the GCC vector ordering.

Thus: model the backend after GCCs expectations and "fixup" the rest
by fixing the ABI types and intrinsics.

I think this plan will work fine -- it has the added advantage (which looks like a disadvantage, but really isn't) that generic vector operations like:

void foo (void)
{
  int8x8_t x = { 0, 1, 2, 3, 4, 5, 6, 7 };
}

will *not* work -- nor will e.g. subscripting ABI-defined vectors using
[]s. At the moment using these features can lead to surprising results.
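
For contrast, with a true GCC generic vector type both the initializer and [] subscripting are well-defined (a minimal standalone example, deliberately not using the ABI types):

```c
/* A generic GCC vector type -- not one of the ABI-defined NEON types.  */
typedef signed char gcc_int8x8_t __attribute__ ((vector_size (8)));

/* Element i of a generic GCC vector is always the i-th element in the
   vector extension's model, which is exactly the guarantee the
   ABI-defined types cannot make in big-endian mode.  */
static int
third_lane (void)
{
  gcc_int8x8_t x = { 0, 1, 2, 3, 4, 5, 6, 7 };
  return x[3];
}
```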

Unfortunately NEON's pretty complicated, and the ARM backend currently
uses vector modes quite heavily implementing it, so just using integer
modes for intrinsics is going to be tough. It might work to create a
shadow set of vector modes for use only by the intrinsics (O*mode for
"opaque" instead of V*mode, say), if the middle end won't barf at that.

I suspect the mid-end may not be too happy with opaque modes for vectors. I ran into some issues in the past while experimenting with large integer modes for vector register lists, when implementing permuted loads for AArch64 -- particularly in the area of subreg generation, where SUBREG_BYTE is computed from BITS_PER_WORD for all integer-class modes, without taking into account which registers values of that mode actually end up in. This can produce subreg byte offsets that are not aligned to a vector register boundary. Some background, and an example that exposed the issue:


For aarch64, I mirrored the approach that the arm/thumb backend employs and
defined 'large int' opaque modes to represent the register lists, i.e. OImode,
CImode and XImode, and defined the standard patterns that implement permuted
loads/stores - vec_store_lanes<INT_MODE><VEC_MODE> and
vec_load_lanes<INT_MODE><VEC_MODE>.

At the time, I remember this test case

typedef unsigned short V __attribute__((vector_size(32)));
typedef V VI;

V in = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
VI mask = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, };
V out = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

extern void bar(V);

int main()
{
  V r = __builtin_shuffle (in, mask);
  bar (r);
}

generated this RTL with my experimental compiler:

...
(insn 65 59 61 2 (set (reg:DI 178)
          (and:DI (ashift:DI (subreg:DI (reg:OI 74 [ mask.3 ]) 8)
                  (const_int 1 [0x1]))
              (const_int 30 [0x1e]))) vs.c:24 380 {*andim_ashiftdi_bfiz}
       (nil))
...

(insn 151 145 147 2 (set (reg:DI 256)
          (and:DI (ashift:DI (subreg:DI (reg:OI 74 [ mask.3 ]) 24)
                  (const_int 1 [0x1]))
              (const_int 30 [0x1e]))) vs.c:24 380 {*andim_ashiftdi_bfiz}
       (nil))

....

which is the short-value extraction out of the vectors. I ran into the situation where subregs were generated with byte offsets such that byte_offset % UNITS_PER_VREG != 0, i.e. subreg offsets that were not aligned to the vector register boundary. The above dump is from before the reload phase; during reload's subreg elimination, these subregs were converted to refer to the wrong part of the vector registers.

Though OImode is a large integer mode, we force these modes to live only in FPSIMD registers, for which UNITS_PER_VREG and the effective word size differ from the integer unit's: UNITS_PER_VREG is 16, and BITS_PER_WORD for the FPSIMD registers is effectively 128.

I discovered that the mid-end generates subregs using BITS_PER_WORD, with no check that the appropriate word size might depend on the mode the subreg is being generated for; there is an assumption that BITS_PER_WORD applies to all integer-class modes. In this case, because OImode is only allowed in FPSIMD regs, BITS_PER_WORD should effectively have been 128 -- in other words, mode-dependent. In general, shouldn't BITS_PER_WORD depend on the registers that a particular mode ultimately ends up in, as dictated by the target hook HARD_REGNO_MODE_OK? As far as I can see, expmed.c:store_bit_field_1 () hasn't changed much in this respect, so I suspect the issue remains.
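
The misaligned offsets in the RTL dump above (8 and 24) fall out of simple arithmetic: the mid-end steps through the 32-byte OImode value in word-sized chunks using the integer unit's word size. A small sketch, with the constants assumed from the description above rather than taken from the compiler:

```c
/* Assumed values mirroring the AArch64 scenario described above.  */
enum
{
  UNITS_PER_WORD = 8,   /* BITS_PER_WORD / 8 for the integer unit */
  UNITS_PER_VREG = 16,  /* one 128-bit FPSIMD register */
  OI_MODE_SIZE = 32     /* OImode spans two vector registers */
};

/* Is a word-sized SUBREG_BYTE offset aligned to an FPSIMD register
   boundary?  Stepping through OImode in UNITS_PER_WORD chunks visits
   offsets 0, 8, 16 and 24; offsets 8 and 24 split a register.  */
static int
subreg_byte_is_vreg_aligned (int byte)
{
  return byte % UNITS_PER_VREG == 0;
}
```

This is why the subregs at byte offsets 8 and 24 in the dump end up referring to the middle of a vector register once reload eliminates them.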

We don't have the same issue in the ARM backend because the basic unit of
register allocation is 32 bits for both the FP and integer units (arm.h:
#define ARM_NUM_INTS) and the FP unit is a register-packing architecture.

That was in the context of register lists, where a large opaque integer mode
represents more than one vector register. As you suggest, if we extend opaque
integer modes to represent a single vector register, then with SUBREGs being
generated in the mid-end independently of mode, I imagine this may cause pain for later phases (like reload's subreg elimination).


That said, I'm not an expert on how the mid-end handles opaque integer modes, and things might have improved in the area of SUBREG generation since my experiments.

Thanks,
Tejas Belagod
ARM.

