This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH][GCC][AArch64] optimize float immediate moves (1 /4) - infrastructure.


Ping
________________________________________
From: Tamar Christina
Sent: Monday, July 3, 2017 7:11:51 AM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (1 /4) - infrastructure.

Ping
________________________________________
From: gcc-patches-owner@gcc.gnu.org <gcc-patches-owner@gcc.gnu.org> on behalf of Tamar Christina <Tamar.Christina@arm.com>
Sent: Monday, June 26, 2017 11:49:42 AM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (1 /4) - infrastructure.

Hi All,

I've updated the patch accordingly.

This mostly involves removing the loop that creates the ival and removing the *2
code, instead defaulting to a 64-bit container and switching to 128 bits when
needed.
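In other words, the container width selection now boils down to something like
this (just a sketch of the idea, not the exact new hunk):

    /* Sketch: use a 64-bit vector container by default and only switch
       to a 128-bit container for 64-bit scalar modes (DI/DF), whose
       smallest vector container is 128 bits wide.  */
    int width = GET_MODE_BITSIZE (mode) == 64 ? 128 : 64;
    vmode = aarch64_simd_container_mode (mode, width);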

Regression tested on aarch64-none-linux-gnu with no regressions.

OK for trunk?

Thanks,
Tamar


gcc/
2017-06-26  Tamar Christina  <tamar.christina@arm.com>

        * config/aarch64/aarch64.c
        (aarch64_simd_container_mode): Add prototype.
        (aarch64_expand_mov_immediate): Add HI support.
        (aarch64_reinterpret_float_as_int, aarch64_float_const_rtx_p): New.
        (aarch64_can_const_movi_rtx_p): New.
        (aarch64_preferred_reload_class): Remove restrictions on using
        FP registers for certain SIMD operations.
        (aarch64_rtx_costs): Add new cost for CONST_DOUBLE moves.
        (aarch64_valid_floating_const): Add integer move validation.
        (aarch64_simd_imm_scalar_p): Remove.
        (aarch64_output_scalar_simd_mov_immediate): Generalize function.
        (aarch64_legitimate_constant_p): Expand list of supported cases.
        * config/aarch64/aarch64-protos.h
        (aarch64_float_const_rtx_p, aarch64_can_const_movi_rtx_p): New.
        (aarch64_reinterpret_float_as_int): New.
        (aarch64_simd_imm_scalar_p): Remove.
        * config/aarch64/predicates.md (aarch64_reg_or_fp_float): New.
        * config/aarch64/constraints.md (Uvi): New.
        (Dd): Split into Ds and new Dd.
        * config/aarch64/aarch64.md (*movsi_aarch64): Add SIMD mov case.
        (*movdi_aarch64): Add SIMD mov case.
________________________________________
From: Tamar Christina
Sent: Thursday, June 15, 2017 1:50:19 PM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: RE: [PATCH][GCC][AArch64] optimize float immediate moves (1 /4) - infrastructure.

>
> This patch is pretty huge, are there any opportunities to further split it to aid
> review?

Unfortunately, because I'm also changing some constraints, the patch has a bit of a dependency cycle.
If I were to break it up further, the individual patches wouldn't work on their own anymore. If that is
acceptable, I can split it up further.

> > +  ival = zext_hwi (res[needed - 1], 32);
> > +  for (int i = needed - 2; i >= 0; i--)
> > +    {
> > +      ival <<= 32;
> > +      ival |= zext_hwi (res[i], 32);
> > +    }
> > +
> > +  *intval = ival;
>
> ???
>
> Two cases here, needed is either 2 if GET_MODE_BITSIZE (mode) == 64, or it
> is 1 otherwise. So i starts at either -1 or 0. So this for loop either runs
> 0 or 1 times. What am I missing? I'm sure this is all an indirect way of
> writing:
>

Yes, the code was set up to be easily extended to also support 128-bit floats,
which was deprioritized. I'll just remove the loop.
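Without the loop the extraction becomes something like this (sketch only,
keeping the res[]/needed names from the patch):

    /* needed is 1 for 32-bit (and smaller) modes and 2 for 64-bit modes;
       the loop treated res[0] as the low half and res[needed - 1] as the
       high half, so unrolled this is simply:  */
    ival = zext_hwi (res[needed - 1], 32);
    if (needed > 1)
      ival = (ival << 32) | zext_hwi (res[0], 32);
    *intval = ival;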

> > +
> > +  /* Determine whether it's cheaper to write float constants as
> > +     mov/movk pairs over ldr/adrp pairs.  */
> > +  unsigned HOST_WIDE_INT ival;
> > +
> > +  if (GET_CODE (x) == CONST_DOUBLE
> > +      && SCALAR_FLOAT_MODE_P (mode)
> > +      && aarch64_reinterpret_float_as_int (x, &ival))
> > +    {
> > +      machine_mode imode = mode == HFmode ? SImode : int_mode_for_mode (mode);
> > +      int num_instr = aarch64_internal_mov_immediate
> > +                   (NULL_RTX, gen_int_mode (ival, imode), false, imode);
> > +      return num_instr < 3;
>
> Should this cost model be based on a magic number? Is it not the case that
> the decision should be based on the relative speeds of a memory access
> compared with mov/movk/fmov?
>

As far as I'm aware, the cost model is too simplistic to be able to express the
actual costs of mov/movk and movk/movk pairs. E.g. it doesn't take into account
the latency and throughput difference when the instructions occur in sequences/pairs.

This leads to it allowing a smaller subset through here than what would be beneficial.
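For a concrete (illustrative) example: 4294967296.0 (2^32) is not a valid fmov
immediate, but its bit pattern is 0x41f0000000000000, which a single movz
(movz x0, #0x41f0, lsl #48) can materialize before an fmov into the FP register.
aarch64_internal_mov_immediate then reports 1 instruction, below the magic 3, so
the constant is considered cheaper to build inline than to load via an adrp/ldr pair.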

> > +/* Return TRUE if rtx X is an immediate constant that fits in a single
> > +   MOVI immediate operation.  */
> > +bool
> > +aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode)
> > +{
> > +  if (!TARGET_SIMD)
> > +    return false;
> > +
> > +  machine_mode vmode, imode;
> > +  unsigned HOST_WIDE_INT ival;
> > +
> > +  /* Don't write float constants out to memory.  */
> > +  if (GET_CODE (x) == CONST_DOUBLE
> > +      && SCALAR_FLOAT_MODE_P (mode))
> > +    {
> > +      if (!aarch64_reinterpret_float_as_int (x, &ival))
> > +   return false;
> > +
> > +      imode = int_mode_for_mode (mode);
> > +    }
> > +  else if (GET_CODE (x) == CONST_INT
> > +      && SCALAR_INT_MODE_P (mode))
> > +    {
> > +       imode = mode;
> > +       ival = INTVAL (x);
> > +    }
> > +  else
> > +    return false;
> > +
> > +  unsigned width = GET_MODE_BITSIZE (mode) * 2;
>
> Why * 2? It isn't obvious to me from my understanding of movi why that
> would be better than just clamping to 64-bit?

The idea is to get the smallest vector mode for the given mode.
For SF that's V2SF and for DF it's V2DF, which is why the *2. Clamping to 64 bits
wouldn't work; there is no 64-bit DF vector mode as far as I'm aware.

The clamping is done for modes smaller than SF, e.g. HF, which is mapped to the
smallest option, V4HF, thanks to the clamping. Forcing everything to 128-bit vectors
would work, but I don't see the advantage of that.

For this particular function it doesn't matter much, as no code is generated. So
clamping to 128 bits would work, but when code is actually generated I don't see
why V4SF and V8HF would be better than V2SF and V4HF.

Alternatively, instead of reusing aarch64_simd_container_mode, I could create my own
mapping function that just does the mapping I expect. Would that be a better option?
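To illustrate what I mean, a hypothetical sketch (the helper name is made up):

    /* Hypothetical sketch: map a scalar floating-point mode to the
       smallest vector container usable for MOVI, i.e. what the *2 plus
       the 64-bit clamp is intended to produce.  */
    static machine_mode
    aarch64_smallest_simd_container (machine_mode mode)
    {
      switch (mode)
        {
        case HFmode: return V4HFmode;  /* 16 * 2 = 32, clamped up to 64.  */
        case SFmode: return V2SFmode;  /* 32 * 2 = 64.  */
        case DFmode: return V2DFmode;  /* 64 * 2 = 128.  */
        default: gcc_unreachable ();
        }
    }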

>
> > +  if (width < GET_MODE_BITSIZE (DFmode))
> > +    width = GET_MODE_BITSIZE (DFmode);
> > +
> > +  vmode = aarch64_simd_container_mode (imode, width);
> > +  rtx v_op = aarch64_simd_gen_const_vector_dup (vmode, ival);
> > +
> > +  return aarch64_simd_valid_immediate (v_op, vmode, false, NULL);
> > +}
> > +
> > +
> >  /* Return the fixed registers used for condition codes.  */
> >
> >  static bool
> > @@ -5758,12 +5860,6 @@ aarch64_preferred_reload_class (rtx x, reg_class_t regclass)
> >        return NO_REGS;
> >      }
> >
> > -  /* If it's an integer immediate that MOVI can't handle, then
> > -     FP_REGS is not an option, so we return NO_REGS instead.  */
> > -  if (CONST_INT_P (x) && reg_class_subset_p (regclass, FP_REGS)
> > -      && !aarch64_simd_imm_scalar_p (x, GET_MODE (x)))
> > -    return NO_REGS;
> > -
> >    /* Register eliminiation can result in a request for
> >       SP+constant->FP_REGS.  We cannot support such operations which
> >       use SP as source and an FP_REG as destination, so reject out
> > @@ -6674,26 +6770,44 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED,
> >        return true;
> >
> >      case CONST_DOUBLE:
> > +
> > +    /* First determine number of instructions to do the move
> > +       as an integer constant.  */
> > +   if (!aarch64_float_const_representable_p (x)
> > +       && !aarch64_can_const_movi_rtx_p (x, mode)
> > +       && aarch64_float_const_rtx_p (x))
> > +     {
> > +       unsigned HOST_WIDE_INT ival;
> > +       bool succeed = aarch64_reinterpret_float_as_int (x, &ival);
> > +       gcc_assert (succeed);
>
> Just:
>
>   gcc_assert (aarch64_reinterpret_float_as_int (x, &ival));
>
> There's not much extra information in the name "succeed", so no extra value
> in the variable assignment.

But if asserts are disabled, won't the call be optimized out? That's why I only assert
on the result (as an extra sanity check), since the call itself has a side effect.
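To spell out the pattern I'm trying to keep (same code, with a comment added):

    /* aarch64_reinterpret_float_as_int fills in ival as a side effect, so
       the call must stay outside the assert; only the check on its result
       is a checking-build sanity test.  */
    bool succeed = aarch64_reinterpret_float_as_int (x, &ival);
    gcc_assert (succeed);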

> >  char*
> > -aarch64_output_scalar_simd_mov_immediate (rtx immediate,
> > -                                     machine_mode mode)
> > +aarch64_output_scalar_simd_mov_immediate (rtx immediate, machine_mode mode)
> >  {
> > +
> > +  /* If a floating point number was passed and we desire to use it in an
> > +     integer mode do the conversion to integer.  */
> > +  if (CONST_DOUBLE_P (immediate) && GET_MODE_CLASS (mode) == MODE_INT)
> > +    {
> > +      unsigned HOST_WIDE_INT ival;
> > +      if (!aarch64_reinterpret_float_as_int (immediate, &ival))
> > +     gcc_unreachable ();
> > +      immediate = gen_int_mode (ival, mode);
> > +    }
> > +
> >    machine_mode vmode;
> > +  int width = GET_MODE_BITSIZE (mode) * 2;
>
> Dubious * 2 again!

Same reason as above; however, here code is actually generated.

>
> > +  if (width < 64)
> > +     width = 64;
> >
> >    gcc_assert (!VECTOR_MODE_P (mode));
> > -  vmode = aarch64_simd_container_mode (mode, 64);
> > +  vmode = aarch64_simd_container_mode (mode, width);
> >    rtx v_op = aarch64_simd_gen_const_vector_dup (vmode, INTVAL (immediate));
> > -  return aarch64_output_simd_mov_immediate (v_op, vmode, 64);
> > +  return aarch64_output_simd_mov_immediate (v_op, vmode, width);
> >  }
> >
> >  /* Split operands into moves from op[1] + op[2] into op[0].  */
>
> Thanks,
> James

