[PATCH][GCC][AArch64] Updated stack-clash implementation supporting 64k probes. [patch (1/6)]
Jeff Law
law@redhat.com
Fri Jul 13 16:46:00 GMT 2018
On 07/12/2018 11:39 AM, Tamar Christina wrote:
>>> +
>>> +  /* Round SIZE down to the nearest multiple of GUARD_SIZE, and calculate
>>> +     the residual as the difference between the original size and the
>>> +     rounded size.  */
>>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
>>> +  HOST_WIDE_INT residual = size - rounded_size;
>>> +
>>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
>>> +     punt to a loop.  */
>>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
>>> +    {
>>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
>>> +        {
>>> +          aarch64_sub_sp (NULL, temp2, guard_size, true);
>>> +          emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
>>> +                                           STACK_CLASH_CALLER_GUARD));
>>> +        }
>> So the only concern with this code is that it'll be inefficient and
>> possibly incorrect for probe sizes larger than ARITH_FACTOR.
>> Ultimately, that's a case I don't think we really care that much about.
>> I wouldn't lose sleep if the port clamped the requested probing interval
>> at ARITH_FACTOR.
>>
> I'm a bit confused here; ARITH_FACTOR seems to be tied to the Ada
> stack probing implementation, which isn't used by this new code aside
> from the part that emits the actual probe when doing a variable or large
> allocation in aarch64_output_probe_stack_range.
>
> Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
> probing intervals.
It may have been a misunderstanding on my part. My understanding is
that ARITH_FACTOR represents the largest immediate constant we could
handle in this code using a single insn. Anything above ARITH_FACTOR
needed a scratch register and I couldn't convince myself that we had a
suitable scratch register available.
But I'm really not well versed on the aarch64 architecture or the
various implementation details in aarch64.c. So I'm happy to defer to
you and others @ ARM on what's best to do here.
>> That can be problematic in a couple of cases. First it can run afoul of
>> combine-stack-adjustments. Essentially that pass will combine a series
>> of stack adjustments into a single insn so your unrolled probing turns
>> into something like this:
>>
>> multi-page stack adjust
>> probe first page
>> probe second page
>> probe third page
>> and so on..
>>
> This is something we actually do want, particularly in the case of 4KB pages
> as the immediates would fit in the store. It's one of the things we were
> thinking about for future improvements.
>
>> That violates the design constraint that we never allocate more than a
>> page at a time.
> Would there be a big issue here if we didn't adhere to this constraint?
Yes, because it enlarges a window for exploitation. Consider signal
delivery occurring after the adjustment but before the probes. The
signal handler could then be running on a clashed heap/stack.
>
>> Do you happen to have a libc.so and ld.so compiled with this code? I've
>> got a scanner here which will look at a DSO and flag obviously invalid
>> stack allocations. If you've got a suitable libc.so and ld.so I can run
>> them through the scanner.
>>
>>
>> Along similar lines, have you run the glibc testsuite with this stuff
>> enabled by default? That proved useful for finding codegen issues,
>> particularly with the register inheritance in the epilogue.
>>
> I'm running one now, I'll send the two binaries once I get the results back
> from the run. Thanks!
Great. Looking forward to getting those .so files so I can throw them
into the scanner.
>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -3072,7 +3072,7 @@
>>>
>>> (define_insn "cmp<mode>"
>>> [(set (reg:CC CC_REGNUM)
>>> - (compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
>>> + (compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
>>> (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
>>> ""
>>> "@
>> I don't think I needed this, but I can certainly understand how it might
>> be necessary. The only one I know we needed was the probe_stack_range
>> constraint change.
>>
> It's not strictly needed, but it allows us to do the comparison with the
> stack pointer in the loop without needing to emit sp to a temporary first.
> So it's just a tiny optimization. :)
Understood. A similar tweak for cmp<mode> was done by Richard E in his
patches for Spectre v1 mitigation. I'm certainly not objecting :-)
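For anyone following along, a sketch of why the "k" constraint (which matches the stack pointer on aarch64) helps here. These instruction sequences are illustrative, not taken from the patch:

```
        // Without "k" on operand 0, sp must go through a temporary:
        mov     x9, sp
        cmp     x9, x10
        // With "rk", the comparison can use sp directly
        // (cmp is an alias of subs with the zero register):
        cmp     sp, x10
```

Dropping the extra mov saves an instruction per loop iteration in the probing loop.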
Jeff