This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][GCC][AArch64] Updated stack-clash implementation supporting 64k probes. [patch (1/6)]


Hi Jeff,

Thanks for the review!

The 07/11/2018 18:30, Jeff Law wrote:
> On 07/11/2018 05:20 AM, Tamar Christina wrote:
> > Hi All,
> > 
> > This patch implements the use of the stack clash mitigation for aarch64.
> > In Aarch64 we expect both the probing interval and the guard size to be 64KB
> > and we enforce them to always be equal.
> > 
> > We also probe up by 1024 bytes in the general case when a probe is required.
> > 
> > AArch64 has the following probing conditions:
> > 
> >  1) Any allocation less than 63KB requires no probing.  An ABI defined safe
> >     buffer of 1Kbytes is used and a page size of 64k is assumed.
> > 
> >  2) Any allocations larger than 1 page size, is done in increments of page size
> >     and probed up by 1KB leaving the residuals.
> > 
> >  3a) Any residual for local arguments that is less than 63KB requires no probing.
> >      Essentially this is a sliding window.  The probing range determines the ABI
> >      safe buffer, and the amount to be probed up.
> > 
> >   b) Any residual for outgoing arguments that is less than 1KB requires no probing,
> >      However to maintain our invariant, anything above or equal to 1KB requires a probe.
> > 
> > Incrementally allocating less than the probing thresholds, e.g. recursive functions will
> > not be an issue as the storing of LR counts as a probe.
> > 
> > 
> >                             +-------------------+                                    
> >                             |  ABI SAFE REGION  |                                    
> >                   +------------------------------                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >  maximum amount   |         |                   |                                    
> >  not needing a    |         |                   |                                    
> >  probe            |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |        Probe offset when           
> >                   |         ---------------------------- probe is required           
> >                   |         |                   |                                    
> >                   +-------- +-------------------+ --------  Point of first probe     
> >                             |  ABI SAFE REGION  |                                    
> >                             ---------------------                                    
> >                             |                   |                                    
> >                             |                   |                                    
> >                             |                   |                                         
> > 
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > Target was tested with stack clash on and off by default.
> > 
> > Ok for trunk?
> > 
> > Thanks,
> > Tamar
> > 
> > gcc/
> > 2018-07-11  Jeff Law  <law@redhat.com>
> > 	    Richard Sandiford <richard.sandiford@linaro.org>
> > 	    Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	PR target/86486
> > 	* config/aarch64/aarch64.md (cmp<mode>,
> > 	probe_stack_range): Add k (SP) constraint.
> > 	* config/aarch64/aarch64.h (STACK_CLASH_CALLER_GUARD,
> > 	STACK_CLASH_MAX_UNROLL_PAGES): New.
> > 	* config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Emit
> > 	stack probes for stack clash.
> > 	(aarch64_allocate_and_probe_stack_space): New.
> > 	(aarch64_expand_prologue): Use it.
> > 	(aarch64_expand_epilogue): Likewise and update IP regs re-use criteria.
> > 	(aarch64_sub_sp): Add emit_move_imm optional param.
> > 
> > gcc/testsuite/
> > 2018-07-11  Jeff Law  <law@redhat.com>
> > 	    Richard Sandiford <richard.sandiford@linaro.org>
> > 	    Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	PR target/86486
> > 	* gcc.target/aarch64/stack-check-12.c: New.
> > 	* gcc.target/aarch64/stack-check-13.c: New.
> > 	* gcc.target/aarch64/stack-check-cfa-1.c: New.
> > 	* gcc.target/aarch64/stack-check-cfa-2.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-1.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-10.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-11.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-2.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-3.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-4.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-5.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-6.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-7.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-8.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-9.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue.h: New.
> > 	* lib/target-supports.exp
> > 	(check_effective_target_supports_stack_clash_protection): Add AArch64.
> > 
> > -- 
> > 
> > 
> > rb9150.patch
> > 
> > 
> > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> > index 976f9afae54c1c98c22a4d5002d8d94e33b3190a..1345f0eb171d05e2b833935c0a32f79c3db03f99 100644
> > --- a/gcc/config/aarch64/aarch64.h
> > +++ b/gcc/config/aarch64/aarch64.h
> > @@ -84,6 +84,14 @@
> >  
> >  #define LONG_DOUBLE_TYPE_SIZE	128
> >  
> > +/* This value is the amount of bytes a caller is allowed to drop the stack
> > +   before probing has to be done for stack clash protection.  */
> > +#define STACK_CLASH_CALLER_GUARD 1024
> > +
> > +/* This value controls how many pages we manually unroll the loop for when
> > +   generating stack clash probes.  */
> > +#define STACK_CLASH_MAX_UNROLL_PAGES 4
> Thanks for pulling these out as #defines rather than leaving the
> original magic #s in the code.
> 
> 
> > +
> >  /* The architecture reserves all bits of the address for hardware use,
> >     so the vbit must go into the delta field of pointers to member
> >     functions.  This is the same config as that in the AArch32
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index b88e7cac27ab76e01b9769563ec9077d2a81bd7b..7c5daf2c62eaeeb1f066d4f2828197b98d31f158 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -2810,10 +2810,11 @@ aarch64_add_sp (rtx temp1, rtx temp2, poly_int64 delta, bool emit_move_imm)
> >     if nonnull.  */
> >  
> >  static inline void
> > -aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p)
> > +aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p,
> > +		bool emit_move_imm = true)
> >  {
> >    aarch64_add_offset (Pmode, stack_pointer_rtx, stack_pointer_rtx, -delta,
> > -		      temp1, temp2, frame_related_p);
> > +		      temp1, temp2, frame_related_p, emit_move_imm);
> >  }
> >  
> >  /* Set DEST to (vec_series BASE STEP).  */
> > @@ -3973,13 +3974,29 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
> >    /* Loop.  */
> >    ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_lab);
> >  
> >    /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
> >    xops[0] = reg1;
> > -  xops[1] = GEN_INT (PROBE_INTERVAL);
> > +  if (flag_stack_clash_protection)
> > +    xops[1] = GEN_INT (stack_clash_probe_interval);
> > +  else
> > +    xops[1] = GEN_INT (PROBE_INTERVAL);
> > +
> >    output_asm_insn ("sub\t%0, %0, %1", xops);
> >  
> > -  /* Probe at TEST_ADDR.  */
> > -  output_asm_insn ("str\txzr, [%0]", xops);
> > +  /* If doing stack clash protection then we probe up by the ABI specified
> > +     amount.  We do this because we're dropping full pages at a time in the
> > +     loop.  But if we're doing non-stack clash probing, probe at SP 0.  */
> > +  if (flag_stack_clash_protection)
> > +    xops[1] = GEN_INT (STACK_CLASH_CALLER_GUARD);
> > +  else
> > +    xops[1] = CONST0_RTX (GET_MODE (xops[1]));
> So this is a bit different than the original, but probably just as safe.
>  The key is to make sure we hit the just allocated page and this should
> do so just fine.
> 
> 
> > @@ -4793,6 +4810,147 @@ aarch64_set_handled_components (sbitmap components)
> >        cfun->machine->reg_is_wrapped_separately[regno] = true;
> >  }
> >  
> > +/* Allocate POLY_SIZE bytes of stack space using TEMP1 and TEMP2 as scratch
> > +   registers.  If POLY_SIZE is not large enough to require a probe this function
> > +   will only adjust the stack.  When allocation the stack space
> > +   FRAME_RELATED_P is then used to indicated if the allocation is frame related.
> > +   FINAL_ADJUSTMENT_P indicates whether we are doing the outgoing arguments. In
> > +   this case we need to adjust the threshold to be any allocations larger than
> > +   the ABI defined buffer needs a probe such that the invariant is maintained.
> > +   */
> > +
> > +static void
> > +aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> > +					poly_int64 poly_size,
> > +					bool frame_related_p,
> > +					bool final_adjustment_p)
> It looks like you pushed some of the work that was originally done in
> aarch64_expand_prologue into this function.  It all looks reasonable to me.
> 
> > +
> > +  /* Round size to the nearest multiple of guard_size, and calculate the
> > +     residual as the difference between the original size and the rounded
> > +     size. */
> > +  HOST_WIDE_INT rounded_size = size & -guard_size;
> > +  HOST_WIDE_INT residual = size - rounded_size;
> > +
> > +  /* We can handle a small number of allocations/probes inline.  Otherwise
> > +     punt to a loop.  */
> > +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
> > +    {
> > +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
> > +	{
> > +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
> > +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
> > +					   STACK_CLASH_CALLER_GUARD));
> > +	}
> So the only concern with this code is that it'll be inefficient and
> possibly incorrect for probe sizes larger than ARITH_FACTOR.
> Ultimately, that's a case I don't think we really care that much about.
> I wouldn't lose sleep if the port clamped the requested probing interval
> at ARITH_FACTOR.
> 

I'm a bit confused here, the ARITH_FACTOR seems to have to do with the Ada
stack probing implementation, which isn't used by this new code aside
from the part that emits the actual probe when doing a variable or large
allocation in aarch64_output_probe_stack_range.

Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
probing intervals. 

> 
> > +    {
> > +      /* Compute the ending address.  */
> > +      aarch64_add_offset (Pmode, temp1, stack_pointer_rtx, -rounded_size,
> > +			  temp1, NULL, false, true);
> > +      rtx_insn *insn = get_last_insn ();
> > +
> > +      /* For the initial allocation, we don't have a frame pointer
> > +	 set up, so we always need CFI notes.  If we're doing the
> > +	 final allocation, then we may have a frame pointer, in which
> > +	 case it is the CFA, otherwise we need CFI notes.
> > +
> > +	 We can determine which allocation we are doing by looking at
> > +	 the value of FRAME_RELATED_P since the final allocations are not
> > +	 frame related.  */
> > +      if (frame_related_p)
> > +	{
> > +	  /* We want the CFA independent of the stack pointer for the
> > +	     duration of the loop.  */
> > +	  add_reg_note (insn, REG_CFA_DEF_CFA,
> > +			plus_constant (Pmode, temp1, rounded_size));
> > +	  RTX_FRAME_RELATED_P (insn) = 1;
> > +	}
> This looks cleaner than what I had :-)
> 
> One think I've noticed is you seem to be missing blockage insns after
> the allocation and probe.
> 
> That can be problematical in a couple cases.  First it can run afoul of
> combine-stack-adjustments.  Essentially that pass will combine a series
> of stack adjustments into a single insn so your unrolled probing turns
> into something like this:
> 
>   multi-page stack adjust
>   probe first page
>   probe second page
>   probe third page
>   and so on..
> 

This is something we actually do want, particularly in the case of 4KB pages
as the immediates would fit in the store.  It's one of the things we were
thinking about for future improvements.

> 
> That violates the design constraint that we never allocate more than a
> page at a time.

Would there be a big issue here if we didn't adhere to this constraint?

> 
> Second the scheduler can rearrange the original sequence and break
> dependencies resulting in similar code.  I see that we have the test for
> that in these changes.
> 
> I thought we had tests in the suite to check for both of these
> scenarios.  Though it's possible they were x86 specific tests and we
> factored that into the work we did on aarch64 without ever building
> aarch64 specific tests for this issue.
> 

I have tried to get it to do this today but couldn't actually get it to,
I was using 4KB pages and attempting an allocation of 4*4KB, but it just
kept all the adjustments and stores separate.

> 
> Do you happen to have a libc.so and ld.so compiled with this code?  I've
> got a scanner here which will look at a DSO and flag obviously invalid
> stack allocations.  If you've got a suitable libc.so and ld.so I can run
> them through the scanner.
> 
> 
> Along similar lines, have you run the glibc testsuite with this stuff
> enabled by default.  That proved useful to find codegen issues,
> particularly with the register inheritance in the epilogue.
> 

I'm running one now, I'll send the two binaries once I get the results back
from the run. Thanks!

> 
> 
> > @@ -4990,10 +5178,21 @@ aarch64_expand_epilogue (bool for_sibcall)
> >    /* A stack clash protection prologue may not have left IP0_REGNUM or
> >       IP1_REGNUM in a usable state.  The same is true for allocations
> >       with an SVE component, since we then need both temporary registers
> > -     for each allocation.  */
> > +     for each allocation.  For stack clash we are in a usable state if
> > +     the adjustment is less than GUARD_SIZE - GUARD_USED_BY_CALLER.  */
> > +  HOST_WIDE_INT guard_size
> > +    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
> > +  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
> > +
> > +  /* We can re-use the registers when the allocation amount is smaller than
> > +     guard_size - guard_used_by_caller because we won't be doing any probes
> > +     then.  In such situations the register should remain live with the correct
> > +     value.  */
> >    bool can_inherit_p = (initial_adjust.is_constant ()
> > -			&& final_adjust.is_constant ()
> > -			&& !flag_stack_clash_protection);
> > +			&& final_adjust.is_constant ())
> > +			&& (!flag_stack_clash_protection
> > +			     || known_lt (initial_adjust,
> > +					  guard_size - guard_used_by_caller));
> Probably better than my solution which always disabled inheritance when
> stack-clash-protection was enabled :-)
> 
> 
> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> > index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
> > --- a/gcc/config/aarch64/aarch64.md
> > +++ b/gcc/config/aarch64/aarch64.md
> > @@ -3072,7 +3072,7 @@
> >  
> >  (define_insn "cmp<mode>"
> >    [(set (reg:CC CC_REGNUM)
> > -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
> > +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
> >  		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
> >    ""
> >    "@
> I don't think I needed this, but I can certainly understand how it might
> be necessary.  THe only one I know we needed as the probe_stack_range
> constraint change.
> 

It's not strictly needed, but it allows us to do the comparison with the
stack pointer in the loop without needing to emit sp to a temporary first.
So it's just a tiny optimization. :)

Regards,
Tamar

> 
> 

-- 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]