PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

Mon Sep 14 16:33:33 GMT 2020

Qing Zhao <QING.ZHAO@ORACLE.COM> writes:
>> Like I mentioned earlier though, passes that run after
>> pass_thread_prologue_and_epilogue can use call-clobbered registers that
>> weren't previously used.  For example, on x86_64, the function might
>> not use %r8 when the prologue, epilogue and returns are generated,
>> but pass_regrename might later introduce a new use of %r8.  AIUI,
>> the “used” version of the new command-line option is supposed to clear
>> %r8 in these circumstances, but it wouldn't do so if the data was
>> collected at the point that the return is generated.
>
> Thanks for the information.
>
>> 
>> That's why I think it's more robust to do this later (at the beginning
>> of pass_late_compilation) and insert the zeroing before returns that
>> already exist.
>
> Yes, looks like it’s not correct to insert the zeroing at the time when prologue, epilogue and return are generated.
> As I also checked, “return” might be also generated as late as pass “pass_delay_slots”,  So, shall we move the
> New pass as late as possible?

If we insert the zeroing before pass_delay_slots and describe the
result correctly, pass_delay_slots should do the right thing.

Describing the result correctly includes ensuring that the cleared
registers are treated as live on exit from the function, so that the
zeroing doesn't get deleted again, or skipped by pass_delay_slots.

> Can I put it immediately before “pass_final”? What’s the latest place
> I can put it?

Like you say here…

>>>> So IMO the pass should just search for all the
>>>> returns in a function and insert the zeroing instructions before
>>>> each one.
>>> 
>>> I was considering this approach too for some time, however, there is one major issue with this as 
>>> Segher mentioned, The middle end does not know some details on the registers, lacking such 
>>> detailed information might result incorrect code generation at middle end.
>>> 
>>> For example, on x86_64 target, when “return” with pop, the scratch register “ECX” will be 
>>> used for returning, then it’s incorrect to zero “ecx” before generating the return. Since middle end
>>> doesn't have such information, it cannot avoid to zero “ecx”. Therefore incorrect code might be 
>>> generated. 
>>> 
>>> Segher also mentioned that on Power, there are some scratch registers also are used for 
>>> Other purpose, clearing them before return is not correct. 
>> 
>> But the dataflow information has to be correct between
>> pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
>> any pass in that region could clobber the registers in the same way.
>
> You mean, the data flow information will be not correct after pass_free_cfg? 
>  “pass_delay_slots” is after “pass_free_cfg”,  and there might be new “return” generated in “pass_delay_slots”, 
> If we want to generate zeroing for the new “return” which was generated in “pass_delay_slots”, can we correctly to do so?

…the zeroing has to be done before pass_free_cfg, because the information
isn't reliable after that point.  I think it would make sense to do it
before pass_compute_alignments, because inserting the zeros will affect
alignment.

>> To get the registers that are live before the return, you can start with
>> the registers that are live out from the block that contains the return,
>> then “simulate” the return instruction backwards to get the set of
>> registers that are live before the return instruction
>> (see df_simulate_one_insn_backwards).
>
> Okay. 
> Currently, I am using the following to check whether a reg is live out the block that contains the return:
>
> /* Check whether the hard register REGNO is live at the exit block
>  * of the current routine.  */
> static bool
> is_live_reg_at_exit (unsigned int regno)
> {
>   edge e;
>   edge_iterator ei;
>
>   FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>     {
>       bitmap live_out = df_get_live_out (e->src);
>       if (REGNO_REG_SET_P (live_out, regno))
>         return true;
>     }
>
>   return false;
> }
>
> Is this correct?

df_get_live_out is the right way to get the set of live registers
on exit from a block.  But if we search for return instructions
and find a return instruction R, we should do:

  basic_block bb = BLOCK_FOR_INSN (R);
  auto_bitmap live_regs;
  bitmap_copy (regs, df_get_live_out (bb));
  df_simulate_one_insn_backwards (bb, R, live_regs);

and then use LIVE_REGS as the set of registers that are live before R,
and so can't be clobbered.

For extra safety, you could/should also check targetm.hard_regno_scratch_ok
to see whether there's a target-specific reason why a register can't
be clobbered.

>> In the x86_64 case you mention, the pattern is:
>> 
>> (define_insn "*simple_return_indirect_internal<mode>"
>>  [(simple_return)
>>   (use (match_operand:W 0 "register_operand" "r"))]
>>  "reload_completed"
>>  …)
>> 
>> This (use …) tells the df machinery that the instruction needs
>> operand 0 (= ecx).  The process above would therefore realise
>> that ecx can't be clobbered.
>
> Okay, I see.  The df will reflect this information, no need for special handling here. 
>
> However, for the cases on Power as Segher mentioned, there are also some scratch registers used for
> Other purpose, not sure whether we can correctly generate zeroing in middle-end for Power?

Segher would be better placed to answer that, but I think the process
above has to give a conservatively-accurate list of live registers.
If it misses a register, the other late rtl passes could clobber
that same register.

Thanks,
Richard