PING [Patch][Middle-end] Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]
Richard Sandiford
richard.sandiford@arm.com
Tue Sep 22 17:06:30 GMT 2020
Qing Zhao <QING.ZHAO@ORACLE.COM> writes:
>> On Sep 17, 2020, at 11:27 AM, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>
>> Qing Zhao <QING.ZHAO@ORACLE.COM> writes:
>>>> On Sep 17, 2020, at 1:17 AM, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>>
>>>> Qing Zhao <QING.ZHAO@ORACLE.COM> writes:
>>>>> Segher and Richard,
>>>>>
>>>>> Now there are two major concerns from the discussion so far:
>>>>>
>>>>> 1. (From Richard): Inserting zero insns should be done after pass_thread_prologue_and_epilogue, since later passes (for example, pass_regrename) might introduce newly used caller-saved registers.
>>>>> So, we should do this at the beginning of pass_late_compilation (some targets wouldn’t cope with doing it later).
>>>>>
>>>>> 2. (From Segher): The inserted zero insns should stay together with the return; no other insns should move in between the zero insns and the return insn. Otherwise, a valid gadget could be formed.
>>>>>
>>>>> I think both of the above concerns are important and should be addressed for a correct implementation.
>>>>>
>>>>> In order to support 1, we cannot implement this in “targetm.gen_return()” and “targetm.gen_simple_return()”, since those hooks are called during pass_thread_prologue_and_epilogue, and at that time the use information is still not correct.
>>>>>
>>>>> In order to support 2, enhancing EPILOGUE_USES to include the zeroed registers is NOT enough to prevent the zero insns from moving around.
>>>>
>>>> Right. The purpose of EPILOGUE_USES was instead to stop the moves from
>>>> being deleted as dead.
>>>>
>>>>> More restrictions need to be added to these new zero insns. (I think that marking these new zeroed registers as “unspec_volatile” at the RTL level is necessary to prevent the insns from being deleted or moved around.)
>>>>>
>>>>>
>>>>> So, based on the above, I propose the following approach to resolve these two concerns:
>>>>>
>>>>> 1. Add 2 new target hooks:
>>>>> A. targetm.pro_epilogue_use (reg)
>>>>> This hook should return an UNSPEC_VOLATILE rtx that marks a register as being in use,
>>>>> to prevent register-setting instructions in the prologue and epilogue from being deleted.
>>>>>
>>>>> B. targetm.gen_zero_call_used_regs(need_zeroed_hardregs)
>>>>> This hook generates a sequence of zeroing insns that zero the registers specified in NEED_ZEROED_HARDREGS.
>>>>>
>>>>> A default handler for “gen_zero_call_used_regs” could be defined in the middle end; it would use mov insns to zero the registers and then use “targetm.pro_epilogue_use (reg)” to mark each zeroed register.
>>>>
>>>> This sounds like you're going back to using:
>>>>
>>>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>>>         (const_int 0 [0])) "t10.c":11:1 -1
>>>>      (nil))
>>>> (insn 19 18 20 2 (unspec_volatile [
>>>>             (reg:SI 1 dx)
>>>>         ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>>>      (nil))
>>>>
>>>> This also doesn't prevent the zeroing from being moved around. Like the
>>>> EPILOGUE_USES approach, it only prevents the clearing from being removed
>>>> as dead. I still think that the EPILOGUE_USES approach is the better
>>>> way of doing that.
>>>
>>> The following is what I see from i386.md: (I didn’t look at how “UNSPEC_volatile” is used in data flow analysis in GCC yet)
>>>
>>> ;; UNSPEC_VOLATILE is considered to use and clobber all hard registers and
>>> ;; all of memory. This blocks insns from being moved across this point.
>>
>> Heh, it looks like that comment dates back to 1994. :-)
>>
>> The comment is no longer correct though. I wasn't around at the time,
>> but I assume the comment was only locally true even then.
>>
>> If what the comment said was true, then something like:
>>
>> (define_insn "cld"
>>   [(unspec_volatile [(const_int 0)] UNSPECV_CLD)]
>>   ""
>>   "cld"
>>   [(set_attr "length" "1")
>>    (set_attr "length_immediate" "0")
>>    (set_attr "modrm" "0")])
>>
>> would invalidate the entire register file and so would require all values
>> to be spilt to the stack around the CLD.
>
> Okay, thanks for the info.
> Then, what’s the current definition of UNSPEC_VOLATILE?
I'm not sure it's written down anywhere TBH. rtl.texi just says:
@code{unspec_volatile} is used for volatile operations and operations
that may trap; @code{unspec} is used for other operations.
which seems like a cyclic definition: volatile expressions are defined
to be expressions that are volatile.
But IMO the semantics are that unspec_volatile patterns with a given
set of inputs and outputs act for dataflow purposes like volatile asms
with the same inputs and outputs. The semantics of asm volatile are
at least slightly more well-defined (if only by example); see extend.texi
for details. In particular:
Note that the compiler can move even @code{volatile asm} instructions relative
to other code, including across jump instructions. For example, on many
targets there is a system register that controls the rounding mode of
floating-point operations. Setting it with a @code{volatile asm} statement,
as in the following PowerPC example, does not work reliably.
@example
asm volatile("mtfsf 255, %0" : : "f" (fpenv));
sum = x + y;
@end example
The compiler may move the addition back before the @code{volatile asm}
statement. To make it work as expected, add an artificial dependency to
the @code{asm} by referencing a variable in the subsequent code, for
example:
@example
asm volatile ("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
sum = x + y;
@end example
which is very similar to the unspec_volatile case we're talking about.
To take an x86 example:
void
f (char *x)
{
asm volatile ("");
x[0] = 0;
asm volatile ("");
x[1] = 0;
asm volatile ("");
}
gets optimised to:
        xorl    %eax, %eax
        movw    %ax, (%rdi)
with the two stores being merged. The same thing is IMO valid for
unspec_volatile. In both cases, you would need some kind of memory
clobber to prevent the move and merge from happening.
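For instance, rewriting the example above with explicit memory clobbers (a quick, untested sketch) should keep the two stores separate, because each asm must then be assumed to read and write memory:

void
f (char *x)
{
  asm volatile ("" ::: "memory");
  x[0] = 0;
  asm volatile ("" ::: "memory");
  x[1] = 0;
  asm volatile ("" ::: "memory");
}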
>>> I am not very familiar with how unspec_volatile actually works in GCC’s data flow analysis. My understanding from the above is that RTL insns marked with UNSPEC_VOLATILE serve as a barrier that no other insns can move across. At the same time, since a marked RTL insn is considered to use and clobber all hard registers and memory, it cannot be deleted either.
>>
>> UNSPEC_VOLATILEs can't be deleted. And they can't be reordered relative
>> to other UNSPEC_VOLATILEs. But the problem with:
>>
>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>         (const_int 0 [0])) "t10.c":11:1 -1
>>      (nil))
>> (insn 19 18 20 2 (unspec_volatile [
>>             (reg:SI 1 dx)
>>         ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>      (nil))
>>
>> is that the volatile occurs *after* the zeroing instruction. So at best
>> it can stop insn 18 moving further down, to be closer to the return
>> instruction. There's nothing to stop insn 18 moving further up,
>> away from the return instruction, which AIUI is what you're trying
>> to prevent. E.g. suppose we had:
>>
>> (insn 17 … pop a register other than dx from the stack …)
>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>         (const_int 0 [0])) "t10.c":11:1 -1
>>      (nil))
>> (insn 19 18 20 2 (unspec_volatile [
>>             (reg:SI 1 dx)
>>         ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>      (nil))
>>
>> There is nothing to stop an rtl pass reordering that to:
>>
>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>         (const_int 0 [0])) "t10.c":11:1 -1
>>      (nil))
>> (insn 17 … pop a register other than dx from the stack …)
>> (insn 19 18 20 2 (unspec_volatile [
>>             (reg:SI 1 dx)
>>         ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>      (nil))
>
> Yes, agreed. And then the volatile marking insn should be put BEFORE the zeroing insn.
>
>>
>> There's also no dataflow reason why this couldn't be reordered to:
>>
>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>         (const_int 0 [0])) "t10.c":11:1 -1
>>      (nil))
>> (insn 19 18 20 2 (unspec_volatile [
>>             (reg:SI 1 dx)
>>         ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>      (nil))
>> (insn 17 … pop a register other than dx from the stack …)
>>
>
> This is the place where I don’t quite agree at the moment; maybe I still don’t quite understand “UNSPEC_VOLATILE”.
>
> I checked several places in GCC that handle “UNSPEC_VOLATILE”, for example, the routine “can_move_insns_across” in gcc/df-problems.c:
>
>   if (NONDEBUG_INSN_P (insn))
>     {
>       if (volatile_insn_p (PATTERN (insn)))
>         return false;
>
> From my reading of the code, when an insn is UNSPEC_VOLATILE, other insns will NOT be able to move across it.
>
> Then, for the above example, insn 17 should NOT be moved across insn 19 either.
>
> Let me know if I miss anything important.
The above is conservatively correct. But not all passes do it.
E.g. combine does have a similar approach:
  /* If INSN contains volatile references (specifically volatile MEMs),
     we cannot combine across any other volatile references.
     Even if INSN doesn't contain volatile references, any intervening
     volatile insn might affect machine state.  */

  is_volatile_p = volatile_refs_p (PATTERN (insn))
                  ? volatile_refs_p
                  : volatile_insn_p;
And like you say, the passes that use can_move_insns_across will be
conservative too. But not many passes use that function.
Passes like fwprop.c, postreload-gcse.c and ree.c do not (AFAIK) worry
about volatile asms or unspec_volatiles, and can move code across them.
And that's kind-of inevitable. Having an “everything barrier” makes
life very hard for global optimisation.
>> So…
>>
>>> So, I thought that “UNSPEC_volatile” should be stronger than “EPILOGUE_USES”, and that it can serve the purpose of preventing the zeroing insns from being deleted or moved.
>>
>> …both EPILOGUE_USES and UNSPEC_VOLATILE would be effective ways of
>> stopping insn 18 from being deleted. But an UNSPEC_VOLATILE after
>> the instruction would IMO be counterproductive: it would stop the
>> zeroing instructions that we want to be close to the return instruction
>> from moving “closer” to the return instruction, but it wouldn't do the
>> same for unrelated instructions. So if anything, the unspec_volatile
>> could increase the chances that something unrelated to the register
>> zeroing is moved later than the register zeroing. E.g. this could
>> happen when filling delayed branch slots.
>>
>>>> I don't think there's a foolproof way of preventing an unknown target
>>>> machine_reorg pass from moving the instructions around. But since we
>>>> don't have unknown machine_reorgs (at least not in-tree), I think
>>>> instead we should be prepared to patch machine_reorgs where necessary
>>>> to ensure that they do the right thing.
>>>>
>>>> If you want to increase the chances that machine_reorgs don't need to be
>>>> patched, you could either:
>>>>
>>>> (a) make the zeroing instructions themselves volatile, or
>>>> (b) insert a volatile reference to the register before (rather than
>>>> after) the zeroing instruction.
>>>>
>>>> IMO (b) is the way to go, because it avoids the need to define special
>>>> volatile move patterns for each type of register. (b) would be needed
>>>> on top of (rather than instead of) the EPILOGUE_USES thing.
>>>>
>>> Okay, will take approach b.
>>>
>>> But I still don’t quite understand why we need “EPILOGUE_USES”. What’s the additional benefit of EPILOGUE_USES?
>>
>> The asm for (b) goes before the instruction, so we'd have:
>>
>> (insn 17 … new asm …)
>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>         (const_int 0 [0])) "t10.c":11:1 -1
>>      (nil))
>> (insn 19 … return …)
>>
>> But something has to tell the df machinery that the value of edx
>> matters on return from the function, otherwise insn 18 could be
>> deleted as dead. Adding edx to EPILOGUE_USES provides that information
>> and stops the instruction from being deleted.
>
>
> In the above, insn 17 will be something like:
>
> (insn 17 ... (unspec_volatile [
>             (reg:SI 1 dx)
>         ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>      (nil))
In the example above, insn 17 would be an asm that clobbers dx
(instead of using dx).
> So, the reg edx is already marked via “UNSPEC_VOLATILE”; that should mean the value of edx already matters on return from the function. My understanding is that df should automatically pick up the “UNSPEC_VOLATILE” insn and its operands, so the “UNSPEC_VOLATILE” insn should serve the same purpose as putting “edx” in EPILOGUE_USES.
>
> Do I miss anything here?
The point is that any use of dx at insn 17 comes before the definition
in insn 18. So a use in insn 17 would keep alive any store to dx that
happened before insn 17. But it would not keep the store in insn 18 live,
since insn 18 executes later.
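As a rough C-level analogy (purely illustrative; use () here is just a stand-in for anything that reads dx):

extern void use (int);

void
analogy (void)
{
  int dx = 1;   /* kept live by the use () call below */
  use (dx);     /* like insn 17: a use of dx */
  dx = 0;       /* like insn 18: nothing later reads dx, so the store is still dead */
}               /* like insn 19: return */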
>>>> I don't think we need a new target-specific unspec_volatile code to do (b).
>>>> We can just use an automatically-generated volatile asm to clobber the
>>>> registers first. See e.g. how expand_asm_memory_blockage handles memory
>>>> scheduling barriers.
>>> /* Generate asm volatile("" : : : "memory") as the memory blockage.  */
>>>
>>> static void
>>> expand_asm_memory_blockage (void)
>>> {
>>>   rtx asm_op, clob;
>>>
>>>   asm_op = gen_rtx_ASM_OPERANDS (VOIDmode, "", "", 0,
>>>                                  rtvec_alloc (0), rtvec_alloc (0),
>>>                                  rtvec_alloc (0), UNKNOWN_LOCATION);
>>>   MEM_VOLATILE_P (asm_op) = 1;
>>>
>>>   clob = gen_rtx_SCRATCH (VOIDmode);
>>>   clob = gen_rtx_MEM (BLKmode, clob);
>>>   clob = gen_rtx_CLOBBER (VOIDmode, clob);
>>>
>>>   emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, asm_op, clob)));
>>> }
>>>
>>>
>>> As the following?
>>>
>>> /* Generate asm volatile("" : : : "regno") for REGNO.  */
>>>
>>> static void
>>> expand_asm_reg_volatile (machine_mode mode, unsigned int regno)
>>> {
>>>   rtx asm_op, clob;
>>>
>>>   asm_op = gen_rtx_ASM_OPERANDS (VOIDmode, "", "", 0,
>>>                                  rtvec_alloc (0), rtvec_alloc (0),
>>>                                  rtvec_alloc (0), UNKNOWN_LOCATION);
>>>   MEM_VOLATILE_P (asm_op) = 1;
>>>
>>>   clob = gen_rtx_REG (mode, regno);
>>>   clob = gen_rtx_CLOBBER (VOIDmode, clob);
>>>
>>>   emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, asm_op, clob)));
>>> }
>>>
>>> Is the above correct?
>>
>> Yeah, looks good. You should be able to clobber all the registers you
>> want to clear in one asm.
>
> How to do this?
Rather than create:

  gen_rtvec (2, asm_op, clob)

with just the asm and one clobber, you can create:

  gen_rtvec (N + 1, asm_op, clob1, …, clobN)

with N clobbers side-by-side.  When N is variable (as it probably would
be in your case), it's easier to use rtvec_alloc and fill in the fields
using RTVEC_ELT.  E.g.:

  rtvec v = rtvec_alloc (N + 1);
  RTVEC_ELT (v, 0) = asm_op;
  RTVEC_ELT (v, 1) = clob1;
  …
  RTVEC_ELT (v, N) = clobN;
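Putting that together with your expand_asm_reg_volatile sketch, a rough, untested version that clobbers every register in a HARD_REG_SET with a single asm might look like this (the function name and the use of reg_raw_mode to pick each register's mode are only illustrative choices):

/* Untested sketch: emit one volatile asm whose PARALLEL clobbers every
   register in NEED_ZEROED_HARDREGS.  */

static void
expand_asm_reg_clobbers (HARD_REG_SET need_zeroed_hardregs)
{
  rtx asm_op = gen_rtx_ASM_OPERANDS (VOIDmode, "", "", 0,
                                     rtvec_alloc (0), rtvec_alloc (0),
                                     rtvec_alloc (0), UNKNOWN_LOCATION);
  MEM_VOLATILE_P (asm_op) = 1;

  /* Count the registers to size the PARALLEL.  */
  unsigned int nregs = 0;
  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
    if (TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
      nregs++;

  rtvec v = rtvec_alloc (nregs + 1);
  RTVEC_ELT (v, 0) = asm_op;

  /* Add one (clobber (reg ...)) per register to be zeroed.  */
  unsigned int i = 1;
  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
    if (TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
      RTVEC_ELT (v, i++)
        = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (reg_raw_mode[regno], regno));

  emit_insn (gen_rtx_PARALLEL (VOIDmode, v));
}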
Thanks,
Richard