This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH 2/4] [AARCH64,NEON] Convert arm_neon.h to use new builtins for vld[234](q?)_lane_*

From: Charles Baylis <charles dot baylis at linaro dot org>
To: Tejas Belagod <tejas dot belagod at arm dot com>
Cc: Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
Date: Fri, 26 Sep 2014 02:16:28 +0100
Subject: Re: [PATCH 2/4] [AARCH64,NEON] Convert arm_neon.h to use new builtins for vld[234](q?)_lane_*
Authentication-results: sourceware.org; auth=none
References: <1411069109-31425-1-git-send-email-charles dot baylis at linaro dot org> <1411069109-31425-3-git-send-email-charles dot baylis at linaro dot org> <541C11BC dot 9050800 at arm dot com>

On 19 September 2014 12:21, Tejas Belagod <tejas.belagod@arm.com> wrote:
> The reason we avoided using type-punning using unions was that reload would
> get confused with potential subreg(mem) that could be introduced because of
> memory xfer caused by unions and large int modes. As a result, we would get
> incorrect or sub-optimal code. But this seems to have fixed itself. :-)
>
> Because this involves xfers between large int modes and
> CANNOT_CHANGE_MODE_CLASS has some impact on it, it would be good to test
> what impact your patch has with C_C_M_C removed, so that it will be easier
> to fix the fallout once we remove C_C_M_C eventually. To test this you will
> need Richard's patch set
> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01440.html.
>
> Same for your other 2 patches in this series(3,4).

I tried those patches, and altered aarch64_cannot_change_mode_class to
return false for all cases.

However, this does not avoid the unnecessary moves.

Taking a really simple test case:

#include <arm_neon.h>

int32x2x2_t xvld2_s32(int32_t *__a)
{
  int32x2x2_t ret;
  __builtin_aarch64_simd_oi __o;
  __o = __builtin_aarch64_ld2v2si ((const __builtin_aarch64_simd_si *) __a);
  ret.val[0] = (int32x2_t) __builtin_aarch64_get_dregoiv2si (__o, 0);
  ret.val[1] = (int32x2_t) __builtin_aarch64_get_dregoiv2si (__o, 1);
  return ret;
}

(disabling scheduling for clarity)
$ aarch64-oe-linux-gcc -O2 -S -o - simd.c -fno-schedule-insns
-fno-schedule-insns2
        ...
xvld2_s32:
        ld2     {v2.2s - v3.2s}, [x0]
        orr     v0.8b, v2.8b, v2.8b
        orr     v1.8b, v3.8b, v3.8b
        ret
        ...


The reason is apparent in the rtl dump from ira:
...
      Allocno a0r73 of FP_REGS(32) has 31 avail. regs  33-63, node:
33-63 (confl regs =  0-32 64 65)
...
(insn 2 4 3 2 (set (reg/v/f:DI 79 [ __a ])
        (reg:DI 0 x0 [ __a ])) simd.c:5 34 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 0 x0 [ __a ])
        (nil)))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 20 2 (set (reg/v:OI 73 [ __o ])
        (subreg:OI (vec_concat:V8SI (vec_concat:V4SI (unspec:V2SI [
                            (mem:TI (reg/v/f:DI 79 [ __a ]) [0  S16 A8])
                        ] UNSPEC_LD2)
                    (vec_duplicate:V2SI (const_int 0 [0])))
                (vec_concat:V4SI (unspec:V2SI [
                            (mem:TI (reg/v/f:DI 79 [ __a ]) [0  S16 A8])
                        ] UNSPEC_LD2)
                    (vec_duplicate:V2SI (const_int 0 [0])))) 0))
simd.c:8 2149 {aarch64_ld2v2si_dreg}
     (expr_list:REG_DEAD (reg/v/f:DI 79 [ __a ])
        (nil)))
(insn 20 6 21 2 (set (reg:V2SI 32 v0)
        (subreg:V2SI (reg/v:OI 73 [ __o ]) 0)) simd.c:12 778
{*aarch64_simd_movv2si}
     (nil))
(insn 21 20 22 2 (set (reg:V2SI 33 v1)
        (subreg:V2SI (reg/v:OI 73 [ __o ]) 16)) simd.c:12 778
{*aarch64_simd_movv2si}
     (expr_list:REG_DEAD (reg/v:OI 73 [ __o ])
        (nil)))
(insn 22 21 23 2 (use (reg:V2SI 32 v0)) simd.c:12 -1
     (nil))
(insn 23 22 0 2 (use (reg:V2SI 33 v1)) simd.c:12 -1
     (nil))

The register allocator considers r73 to conflict with v0, because they
are simultaneously live after insn 20. Without the 2nd use of v73 (eg
if the write to res.val[1] is replaced with vdup_n_s32(0) ) then the
allocator does do the right thing with the subreg and allocates v73 to
{v0,v1}.

I haven't read all of the old threads relating to Richard's patches
yet, but I don't see why they would affect this issue.

I don't think the register allocator is able to resolve this unless
the conversion between the __builtin_simd type and the int32x4x2_t
type is done as a single operation.

However, type-punning is not possible with the arrays of 64 bit
vectors, as the arrays are not the same size as the corresponding
__builtin_simd types, and any solution for those would probably help
with the q variants too. Maybe the solution is to pass the NEON
intrinsic types directly to the builtins? Is there a reason that it
wasn't done that way before?

Thanks
Charles

Follow-Ups:
- Re: [PATCH 2/4] [AARCH64,NEON] Convert arm_neon.h to use new builtins for vld[234](q?)_lane_*
  - From: Tejas Belagod

References:
- [PATCH 0/4] [AARCH64,NEON] Improve various NEON load/store intrinsics
  - From: Charles Baylis
- [PATCH 2/4] [AARCH64,NEON] Convert arm_neon.h to use new builtins for vld[234](q?)_lane_*
  - From: Charles Baylis
- Re: [PATCH 2/4] [AARCH64,NEON] Convert arm_neon.h to use new builtins for vld[234](q?)_lane_*
  - From: Tejas Belagod

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]