[PATCH][GCC][AArch64] Updated stack-clash implementation supporting 64k probes. [patch (1/6)]

Jeff Law law@redhat.com
Fri Jul 13 16:46:00 GMT 2018


On 07/12/2018 11:39 AM, Tamar Christina wrote:
>>> +
>>> +  /* Round size down to the nearest multiple of guard_size, and calculate
>>> +     the residual as the difference between the original size and the rounded
>>> +     size. */
>>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
>>> +  HOST_WIDE_INT residual = size - rounded_size;
>>> +
>>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
>>> +     punt to a loop.  */
>>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
>>> +    {
>>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
>>> +	{
>>> +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
>>> +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
>>> +					   STACK_CLASH_CALLER_GUARD));
>>> +	}
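(For reference, the splitting arithmetic just quoted in a standalone form.
This is only a sketch; GUARD_SIZE, MAX_UNROLL_PAGES and the frame size are
made-up values here, not the port's actual constants:

  #include <stdio.h>

  #define GUARD_SIZE (64 * 1024)      /* 64KB probing interval */
  #define MAX_UNROLL_PAGES 4          /* assumed unroll limit */

  int
  main (void)
  {
    long size = 150000;               /* hypothetical frame size */

    /* Round down to a multiple of GUARD_SIZE; the mask trick is exact
       because GUARD_SIZE is a power of two.  */
    long rounded_size = size & -GUARD_SIZE;
    long residual = size - rounded_size;

    printf ("rounded_size = %ld, probes = %ld, residual = %ld\n",
            rounded_size, rounded_size / GUARD_SIZE, residual);

    /* A small number of pages gets inline probes; otherwise punt to a loop.  */
    if (rounded_size <= (long) MAX_UNROLL_PAGES * GUARD_SIZE)
      puts ("unrolled inline probes");
    else
      puts ("probing loop");
    return 0;
  }

With size == 150000 this prints rounded_size = 131072, probes = 2,
residual = 18928.)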
>> So the only concern with this code is that it'll be inefficient and
>> possibly incorrect for probe sizes larger than ARITH_FACTOR.
>> Ultimately, that's a case I don't think we really care that much about.
>> I wouldn't lose sleep if the port clamped the requested probing interval
>> at ARITH_FACTOR.
>>
> I'm a bit confused here; ARITH_FACTOR seems to have to do with the Ada
> stack probing implementation, which isn't used by this new code aside
> from the part that emits the actual probe when doing a variable or large
> allocation in aarch64_output_probe_stack_range.
> 
> Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
> probing intervals. 
It may have been a misunderstanding on my part.  My understanding is
that ARITH_FACTOR represents the largest immediate constant we could
handle in this code using a single insn.  Anything above ARITH_FACTOR
needed a scratch register and I couldn't convince myself that we had a
suitable scratch register available.
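If I have the encoding right, the relevant check is aarch64_uimm12_shift
in aarch64.c: an add/sub immediate is a 12-bit value, optionally shifted
left by 12 bits.  A standalone sketch of that predicate, illustrative only:

  #include <stdbool.h>

  static bool
  fits_addsub_immediate (unsigned long long val)
  {
    return (val & ~0xfffULL) == 0         /* imm12 */
           || (val & ~0xfff000ULL) == 0;  /* imm12, LSL #12 */
  }

So 4096 and 65536 both fit a single sub, while something like 0x12345 does
not and would first have to be built in a scratch register with mov/movk
before the subtraction.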

But I'm really not well versed on the aarch64 architecture or the
various implementation details in aarch64.c.  So I'm happy to defer to
you and others @ ARM on what's best to do here.


>> That can be problematic in a couple of cases.  First it can run afoul of
>> combine-stack-adjustments.  Essentially that pass will combine a series
>> of stack adjustments into a single insn so your unrolled probing turns
>> into something like this:
>>
>>   multi-page stack adjust
>>   probe first page
>>   probe second page
>>   probe third page
>>   and so on..
>>
> This is something we actually do want, particularly in the case of 4KB
> pages, where the probe offsets would fit in the store instruction's
> immediate field.  It's one of the things we were thinking about for
> future improvements.
> 
>> That violates the design constraint that we never allocate more than a
>> page at a time.
> Would there be a big issue here if we didn't adhere to this constraint?
Yes, because it enlarges a window for exploitation.  Consider signal
delivery occurring after the adjustment but before the probes.  The
signal handler could then be running on a clashed heap/stack.
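Concretely, the safe shape interleaves adjustments and probes.  With 4k
pages (a sketch, offsets illustrative):

  sub  sp, sp, #4096        allocate one page
  str  xzr, [sp, #1024]     probe it
  sub  sp, sp, #4096        allocate the next page
  str  xzr, [sp, #1024]     probe it

whereas the combined form

  sub  sp, sp, #8192        allocate both pages at once
  str  xzr, [sp, #5120]
  str  xzr, [sp, #1024]

leaves sp sitting two pages below the last probed address until the stores
land, and a signal delivered in that window runs its handler on stack that
nothing has validated.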

> 
>> Do you happen to have a libc.so and ld.so compiled with this code?  I've
>> got a scanner here which will look at a DSO and flag obviously invalid
>> stack allocations.  If you've got a suitable libc.so and ld.so I can run
>> them through the scanner.
>>
>>
>> Along similar lines, have you run the glibc testsuite with this stuff
>> enabled by default.  That proved useful to find codegen issues,
>> particularly with the register inheritance in the epilogue.
>>
> I'm running one now, I'll send the two binaries once I get the results back
> from the run. Thanks!
Great.  Looking forward to getting those .so files so I can throw them
into the scanner.

>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -3072,7 +3072,7 @@
>>>  
>>>  (define_insn "cmp<mode>"
>>>    [(set (reg:CC CC_REGNUM)
>>> -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
>>> +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
>>>  		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
>>>    ""
>>>    "@
>> I don't think I needed this, but I can certainly understand how it might
>> be necessary.  The only one I know we needed was the probe_stack_range
>> constraint change.
>>
> It's not strictly needed, but it allows us to do the comparison with the
> stack pointer in the loop without needing to emit sp to a temporary first.
> So it's just a tiny optimization. :)
Understood.  A similar tweak to cmp<mode> was made in Richard E's patches
for the Spectre v1 mitigation.  I'm certainly not objecting :-)
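
(For anyone following along, a sketch of what the "rk" alternative buys in
the probing loop, register numbers illustrative:

  without "rk":   mov  x9, sp       copy sp to a temporary first
                  cmp  x9, x16
  with "rk":      cmp  sp, x16      compare against sp directly

which saves an instruction and a temporary in the loop.)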


Jeff


