[PATCH][AArch64] Implement workaround for ARM Cortex-A53 erratum 835769

Richard Earnshaw <rearnsha@arm.com>
Fri Oct 10 15:40:00 GMT 2014


On 10/10/14 15:22, Kyrill Tkachov wrote:
> 
> On 10/10/14 15:18, Richard Earnshaw wrote:
>> On 10/10/14 11:53, Kyrill Tkachov wrote:
>>> Hi all,
>>>
>>>
>>> Some early revisions of the Cortex-A53 have an erratum (835769) whereby
>>> it is possible for a 64-bit multiply-accumulate instruction in
>>> AArch64 state to generate an incorrect result.  The details are quite
>>> complex and hard to determine statically, since branches in the code
>>> may exist in some circumstances, but all cases end with a memory
>>> (load, store, or prefetch) instruction followed immediately by the
>>> multiply-accumulate operation.
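>>>
>>> To make that concrete (a hypothetical example, not one taken from the
>>> erratum documentation), C code along these lines may well compile to a
>>> 64-bit load followed immediately by a madd:
>>>
>>>    long
>>>    f (long *p, long a, long b, long acc)
>>>    {
>>>      long tmp = *p;      /* Likely an ldr of a 64-bit value.  */
>>>      /* acc + a * b can become a madd scheduled right after the ldr;
>>>         note the madd itself does not read the register the ldr wrote.  */
>>>      return acc + a * b + tmp;
>>>    }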
>>>
>>> The safest workaround for this issue is to make the compiler avoid
>>> emitting multiply-accumulate instructions immediately after memory
>>> instructions, and the simplest way to do this is to insert a NOP.  A
>>> linker patching technique can also be used, by moving each potentially
>>> affected multiply-accumulate instruction into a patch region and
>>> replacing the original instruction with a branch to the patch.
>>>
>>> This patch achieves the compiler part by using the final prescan pass.
>>> The insn length attribute is updated accordingly using ADJUST_INSN_LENGTH
>>> so that conditional branches work properly.
>>>
>>> The fix is disabled by default and can be turned on by the
>>> -mfix-cortex-a53-835769 compiler option.
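>>>
>>> For example (a hypothetical invocation; only the new option itself
>>> comes from this patch):
>>>
>>>    gcc -O2 -mfix-cortex-a53-835769 -c foo.c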
>>>
>>> I'm attaching a trunk and a 4.9 version of the patch.
>>> The 4.9 version is different only in that rtx_insn* is replaced by rtx.
>>>
>>> Tested on aarch64-none-linux-gnu (and bootstrapped with the option) and
>>> built various large benchmarks with it.
>>>
>>> Ok?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2014-10-10  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>             Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
>>>
>>>        * config/aarch64/aarch64.h (FINAL_PRESCAN_INSN): Define.
>>>        (ADJUST_INSN_LENGTH): Define.
>>>        * config/aarch64/aarch64.opt (mfix-cortex-a53-835769): New option.
>>>        * config/aarch64/aarch64.c (is_mem_p): New function.
>>>        (is_memory_op): Likewise.
>>>        (aarch64_prev_real_insn): Likewise.
>>>        (is_madd_op): Likewise.
>>>        (dep_between_memop_and_curr): Likewise.
>>>        (aarch64_madd_needs_nop): Likewise.
>>>        (aarch64_final_prescan_insn): Likewise.
>>>        * doc/invoke.texi (AArch64 Options): Document -mfix-cortex-a53-835769
>>>        and -mno-fix-cortex-a53-835769 options.
>>>
>>>
>>> a53-workaround-new.patch
>>>
>>>
>>> commit 6ee2bec04fbce15b13b59b54ef4fe11aeda74ff4
>>> Author: Kyrill Tkachov <kyrylo.tkachov@arm.com>
>>> Date:   Wed Oct 8 12:48:34 2014 +0000
>>>
>>>      [AArch64] Add final prescan workaround for Cortex-A53 erratum.
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>>> index b5f53d2..c57a467 100644
>>> --- a/gcc/config/aarch64/aarch64-protos.h
>>> +++ b/gcc/config/aarch64/aarch64-protos.h
>>> @@ -308,6 +308,8 @@ aarch64_builtin_vectorized_function (tree fndecl,
>>>   
>>>   extern void aarch64_split_combinev16qi (rtx operands[3]);
>>>   extern void aarch64_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
>>> +extern bool aarch64_madd_needs_nop (rtx_insn *);
>>> +extern void aarch64_final_prescan_insn (rtx_insn *);
>>>   extern bool
>>>   aarch64_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
>>>   void aarch64_atomic_assign_expand_fenv (tree *, tree *, tree *);
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 5144c35..76a2480 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -7586,6 +7586,128 @@ aarch64_mangle_type (const_tree type)
>>>     return NULL;
>>>   }
>>>   
>>> +static int
>>> +is_mem_p (rtx *x, void *data ATTRIBUTE_UNUSED)
>>> +{
>>> +  return MEM_P (*x);
>>> +}
>>> +
>>> +static bool
>>> +is_memory_op (rtx_insn *mem_insn)
>>> +{
>>> +   rtx pattern = PATTERN (mem_insn);
>>> +   return for_each_rtx (&pattern, is_mem_p, NULL);
>>> +}
>>> +
>>> +/* Find the first rtx_insn before insn that will generate an assembly
>>> +   instruction.  */
>>> +
>>> +static rtx_insn *
>>> +aarch64_prev_real_insn (rtx_insn *insn)
>>> +{
>>> +  if (!insn)
>>> +    return NULL;
>>> +
>>> +  do
>>> +    {
>>> +      insn = prev_real_insn (insn);
>>> +    }
>>> +  while (insn && recog_memoized (insn) < 0);
>>> +
>>> +  return insn;
>>> +}
>>> +
>>> +static bool
>>> +is_madd_op (enum attr_type t1)
>>> +{
>>> +  unsigned int i;
>>> +  /* A number of these may be AArch32 only.  */
>>> +  enum attr_type mlatypes[] = {
>>> +    TYPE_MLA, TYPE_MLAS, TYPE_SMLAD, TYPE_SMLADX, TYPE_SMLAL, TYPE_SMLALD,
>>> +    TYPE_SMLALS, TYPE_SMLALXY, TYPE_SMLAWX, TYPE_SMLAWY, TYPE_SMLAXY,
>>> +    TYPE_SMMLA, TYPE_UMLAL, TYPE_UMLALS, TYPE_SMLSD, TYPE_SMLSDX, TYPE_SMLSLD
>>> +  };
>>> +
>>> +  for (i = 0; i < sizeof (mlatypes) / sizeof (enum attr_type); i++)
>>> +    {
>>> +      if (t1 == mlatypes[i])
>>> +	return true;
>>> +    }
>>> +
>>> +  return false;
>>> +}
>>> +
>>> +/* Check if there is a register dependency between a load and the insn
>>> +   for which we hold recog_data.  */
>>> +
>>> +static bool
>>> +dep_between_memop_and_curr (rtx memop)
>>> +{
>>> +  rtx load_reg;
>>> +  int opno;
>>> +
>>> +  if (!memop)
>>> +    return false;
>>> +
>>> +  if (!REG_P (SET_DEST (memop)))
>>> +    return false;
>>> +
>>> +  load_reg = SET_DEST (memop);
>>> +  for (opno = 0; opno < recog_data.n_operands; opno++)
>>> +    {
>>> +      rtx operand = recog_data.operand[opno];
>>> +      if (REG_P (operand)
>>> +          && reg_overlap_mentioned_p (load_reg, operand))
>>> +        return true;
>>> +
>>> +    }
>>> +  return false;
>>> +}
>>> +
>>> +bool
>>> +aarch64_madd_needs_nop (rtx_insn* insn)
>>> +{
>>> +  enum attr_type attr_type;
>>> +  rtx_insn *prev;
>>> +  rtx body;
>>> +
>>> +  if (!aarch64_fix_a53_err835769)
>>> +    return false;
>>> +
>>> +  if (recog_memoized (insn) < 0)
>>> +    return false;
>>> +
>>> +  attr_type = get_attr_type (insn);
>>> +  if (!is_madd_op (attr_type))
>>> +    return false;
>>> +
>>> +  prev = aarch64_prev_real_insn (insn);
>>> +  if (!prev)
>>> +    return false;
>>> +
>>> +  body = single_set (prev);
>>> +
>>> +  /* If the previous insn is a memory op and there is no dependency between
>>> +     it and the madd, emit a nop between them.  If we know the previous insn
>>> +     is a memory op but body is NULL, emit the nop to be safe; it's probably
>>> +     a load/store pair insn.  */
>>> +  if (is_memory_op (prev)
>>> +      && GET_MODE (recog_data.operand[0]) == DImode
>>> +      && (!dep_between_memop_and_curr (body)))
>>> +    return true;
>>> +
>>> +  return false;
>>> +}
>>> +
>>> +void
>>> +aarch64_final_prescan_insn (rtx_insn *insn)
>>> +{
>>> +  if (aarch64_madd_needs_nop (insn))
>>> +    fprintf (asm_out_file, "\tnop // between mem op and mult-accumulate\n");
>>> +}
>>> +
>>> +
>>>   /* Return the equivalent letter for size.  */
>>>   static char
>>>   sizetochar (int size)
>>> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>>> index db950da..e9e5fd8 100644
>>> --- a/gcc/config/aarch64/aarch64.h
>>> +++ b/gcc/config/aarch64/aarch64.h
>>> @@ -486,6 +486,15 @@ enum target_cpus
>>>     (TARGET_CPU_generic | (AARCH64_CPU_DEFAULT_FLAGS << 6))
>>>   #endif
>>>   
>>> +/* If inserting NOP before a mult-accumulate insn remember to adjust the
>>> +   length so that conditional branching code is updated appropriately.  */
>>> +#define ADJUST_INSN_LENGTH(insn, length)	\
>>> +  if (aarch64_madd_needs_nop (insn))		\
>>> +    length += 4;
>>> +
>>> +#define FINAL_PRESCAN_INSN(INSN, OPVEC, NOPERANDS)	\
>>> +    aarch64_final_prescan_insn (INSN);			\
>>> +
>>>   /* The processor for which instructions should be scheduled.  */
>>>   extern enum aarch64_processor aarch64_tune;
>>>   
>>> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>>> index f5a15b7..77deb2e 100644
>>> --- a/gcc/config/aarch64/aarch64.opt
>>> +++ b/gcc/config/aarch64/aarch64.opt
>>> @@ -67,6 +67,10 @@ mgeneral-regs-only
>>>   Target Report RejectNegative Mask(GENERAL_REGS_ONLY)
>>>   Generate code which uses only the general registers
>>>   
>>> +mfix-cortex-a53-835769
>>> +Target Report Var(aarch64_fix_a53_err835769) Init(0)
>>> +Workaround for ARM Cortex-A53 Erratum number 835769
>>> +
>>>   mlittle-endian
>>>   Target Report RejectNegative InverseMask(BIG_END)
>>>   Assume target CPU is configured as little endian
>>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>>> index 5fe7e15..acc9dd9 100644
>>> --- a/gcc/doc/invoke.texi
>>> +++ b/gcc/doc/invoke.texi
>>> @@ -489,6 +489,7 @@ Objective-C and Objective-C++ Dialects}.
>>>   -mstrict-align @gol
>>>   -momit-leaf-frame-pointer  -mno-omit-leaf-frame-pointer @gol
>>>   -mtls-dialect=desc  -mtls-dialect=traditional @gol
>>> +-mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
>>>   -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
>>>   
>>>   @emph{Adapteva Epiphany Options}
>>> @@ -11744,6 +11745,14 @@ of TLS variables.  This is the default.
>>>   Use traditional TLS as the thread-local storage mechanism for dynamic accesses
>>>   of TLS variables.
>>>   
>>> +@item -mfix-cortex-a53-835769
>>> +@itemx -mno-fix-cortex-a53-835769
>>> +@opindex -mfix-cortex-a53-835769
>>> +@opindex -mno-fix-cortex-a53-835769
>>> +Enable or disable the workaround for the ARM Cortex-A53 erratum number 835769.
>>> +This involves inserting a NOP instruction between memory instructions and
>>> +64-bit integer multiply-accumulate instructions.
>>> +
>>>   @item -march=@var{name}
>>>   @opindex march
>>>   Specify the name of the target architecture, optionally suffixed by one or
>>>
>>>
>>> a53-49-workaround.patch
>>>
>>>
>>> commit 48cc738e2c38dd372ae054d30964dd780478c4d1
>>> Author: Kyrill Tkachov <kyrylo.tkachov@arm.com>
>>> Date:   Tue Oct 7 12:42:01 2014 +0000
>>>
>>>      [AArch64] Add workaround for Cortex-A53 erratum 835769
>>>      
>>>      2014-10-07  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>                  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
>>>      
>>>      	* config/aarch64/aarch64.h (FINAL_PRESCAN_INSN): Define.
>>>      	(ADJUST_INSN_LENGTH): Define.
>>>      	* config/aarch64/aarch64.opt (mfix-cortex-a53-835769): New
>>>      	option.
>>>      	* config/aarch64/aarch64.c (is_mem_p): New function.
>>>      	(is_memory_op): Likewise.
>>>      	(aarch64_prev_real_insn): Likewise.
>>>      	(is_madd_op): Likewise.
>>>      	(dep_between_memop_and_curr): Likewise.
>>>      	(aarch64_madd_needs_nop): Likewise.
>>>      	(aarch64_final_prescan_insn): Likewise.
>>>      	* doc/invoke.texi (AArch64 Options): Document
>>>      	-mfix-cortex-a53-835769 and -mno-fix-cortex-a53-835769 options.
>>>
>>
>>> +
>>> +/* Check if there is a register dependency between a load and the insn
>>> +   for which we hold recog_data.  */
>>> +
>>> +static bool
>>> +dep_between_memop_and_curr (rtx memop)
>>> +{
>>> +  rtx load_reg;
>>> +  int opno;
>>> +
>>> +  if (!memop)
>>> +    return false;
>>> +
>>> +  if (!REG_P (SET_DEST (memop)))
>>> +    return false;
>>> +
>>> +  load_reg = SET_DEST (memop);
>>> +  for (opno = 0; opno < recog_data.n_operands; opno++)
>>> +    {
>>> +      rtx operand = recog_data.operand[opno];
>>> +      if (REG_P (operand)
>>> +          && reg_overlap_mentioned_p (load_reg, operand))
>>> +        return true;
>>> +
>>> +    }
>>> +  return false;
>>> +}
>>> +
>>> +bool
>>> +aarch64_madd_needs_nop (rtx insn)
>>> +{
>>> +  enum attr_type attr_type;
>>> +  rtx prev;
>>> +  rtx body;
>>> +
>>> +  if (!aarch64_fix_a53_err835769)
>>> +    return false;
>>> +
>>> +  if (recog_memoized (insn) < 0)
>>> +    return false;
>>> +
>>> +  attr_type = get_attr_type (insn);
>>> +  if (!is_madd_op (attr_type))
>>> +    return false;
>>> +
>>> +  prev = aarch64_prev_real_insn (insn);
>>> +  if (!prev)
>>> +    return false;
>>> +
>>> +  body = single_set (prev);
>>> +
>>> +  /* If the previous insn is a memory op and there is no dependency between
>>> +     it and the madd, emit a nop between them.  If we know the previous insn
>>> +     is a memory op but body is NULL, emit the nop to be safe; it's probably
>>> +     a load/store pair insn.  */
>>> +  if (is_memory_op (prev)
>>> +      && GET_MODE (recog_data.operand[0]) == DImode
>>> +      && (!dep_between_memop_and_curr (body)))
>>> +    return true;
>>> +
>> It seems a bit strange to me that we expect dep_between_memop_and_curr
>> to detect the non-single_set() case and act conservatively.  Better
>> would be to pull that test out to this level, giving us:
>>
>>
>>    body = single_set (prev);
>>    if (!body)
>>      /* Complex memory operation, possibly a load/store pair insn.  Just
>>         be conservative for now.  */
>>      return true;
> 
> Can do, but I think it should be:
> body = single_set (prev);
> if (!body && is_memory_op (prev))
>    return true;
> 
> since we want to do this only when we really do know it's a memory op.
> And restructure this a bit so that is_memory_op is called only once...
> 

Then it might be better to pull the is_memory_op() check above the
single_set() call entirely and bail out early if it's not a memory
operation.
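
Something along these lines, perhaps (a completely untested sketch, only
meant to illustrate the structure):

  bool
  aarch64_madd_needs_nop (rtx_insn *insn)
  {
    enum attr_type attr_type;
    rtx_insn *prev;
    rtx body;

    if (!aarch64_fix_a53_err835769)
      return false;

    if (recog_memoized (insn) < 0)
      return false;

    attr_type = get_attr_type (insn);
    if (!is_madd_op (attr_type))
      return false;

    prev = aarch64_prev_real_insn (insn);
    if (!prev)
      return false;

    /* Bail out early if the previous real insn is not a memory op.  */
    if (!is_memory_op (prev))
      return false;

    body = single_set (prev);
    /* Complex memory operation, possibly a load/store pair insn.  Just
       be conservative for now.  */
    if (!body)
      return true;

    return (GET_MODE (recog_data.operand[0]) == DImode
            && !dep_between_memop_and_curr (body));
  }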

R.



> Thanks,
> Kyrill
> 
> 
>>
>>    ...
>>
>> Now dep_between... can assert that it is passed a valid SET() rtx.
>>
>> R.
>>
> 



