This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][GCC][AArch64] Add support for SVE stack clash probing [patch (2/7)]


Hi Richard,

The 09/11/2018 16:20, Richard Sandiford wrote:
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> > +
> >> > +  /* No probe leave.  */
> >> > +  ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab);
> >> > +  return "";
> >> 
> >> With the CFA stuff and constant load, I think this works out as:
> >> 
> >> ---------------------------------------------
> >> # 12 insns
> >> 	mov	r15, base
> >> 	mov	adjustment, N
> >> 1:
> >> 	cmp	adjustment, guard_size
> >> 	b.lt    2f
> >> 	sub	base, base, guard_size
> >> 	str	xzr, [base, limit]
> >> 	sub     adjustment, adjustment, guard_size
> >> 	b	1b
> >> 2:
> >> 	sub     base, base, adjustment
> >> 	cmp	adjustment, limit
> >> 	b.le	3f
> >> 	str	xzr, [base, limit]
> >> 3:
> >> ---------------------------------------------
> >> 
> >> What do you think about something like:
> >> 
> >> ---------------------------------------------
> >> # 10 insns
> >> 	mov	adjustment, N
> >> 	sub	r15, base, adjustment
> >> 	subs	adjustment, adjustment, min_probe_threshold
> >> 	b.lo	2f
> >> 1:
> >> 	add	base, x15, adjustment
> >> 	str	xzr, [base, 0]
> >> 	subs	adjustment, adjustment, 16
> >> 	and	adjustment, adjustment, ~(guard_size-1)
> >> 	b.hs	1b
> >> 2:
> >> 	mov	base, r15
> >> ---------------------------------------------
> >> 
> >> or (with different trade-offs):
> >> 
> >> ---------------------------------------------
> >> # 11 insns
> >> 	mov	adjustment, N
> >> 	sub	r15, base, adjustment
> >> 	subs	adjustment, adjustment, min_probe_threshold
> >> 	b.lo	2f
> >> 	# Might be 0, leading to a double probe
> >> 	and	r14, adjustment, guard_size-1
> >> 1:
> >> 	add	base, x15, adjustment
> >> 	str	xzr, [base, 0]
> >> 	subs	adjustment, adjustment, r14
> >> 	mov	r14, guard_size
> >> 	b.hs	1b
> >> 2:
> >> 	mov	base, r15
> >> ---------------------------------------------
> >> 
> >> or (longer, but with a simpler loop):
> >> 
> >> ---------------------------------------------
> >> # 12 insns
> >> 	mov	adjustment, N
> >> 	sub	r15, base, adjustment
> >> 	subs	adjustment, adjustment, min_probe_threshold
> >> 	b.lo	2f
> >> 	str	xzr, [base, -16]!
> >> 	sub	adjustment, adjustment, 32
> >> 	and	adjustment, adjustment, -(guard_size-1)
> >> 1:
> >> 	add	base, x15, adjustment
> >> 	str	xzr, [base, 0]
> >> 	subs	adjustment, adjustment, guard_size
> >> 	b.hs	1b
> >> 2:
> >> 	mov	base, r15
> >> ---------------------------------------------
> >> 
> >> with the CFA based on r15+offset?
> >> 
> >> These loops probe more often than necessary in some cases,
> >> but they only need a single branch in the common case that
> >> ADJUSTMENT <= MIN_PROBE_THRESHOLD.
> >
> > I haven't changed the loop yet because I'm a bit on the edge about
> > whether the implementation difficulties would outweigh the benefits.
> > We are planning on doing something smarter for SVE so optimizing these
> > loops only to replace them later may not be time well spent now.
> >
> > The problem is that to support both 4KB and 64KB pages, instructions such
> > as subs would require different immediates and shifts. Granted we technically
> > only support these two so I could hardcode the values, but that would mean
> > these functions are less general than the rest.
> 
> Because of the min_probe_threshold?  You could conservatively clamp it
> to the next lowest value that's in range, which we could do without
> having to hard-code specific values.  I think it would be better
> to do that even with the current code, since hard-coding 2048 with:
> 
>   /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD, in principle any power of two
>      larger than 1024B would work, but we need one that works for all supported
>      guard-sizes.  What we actually want to check is guard-size - 1KB, but this
>      immediate won't fit inside a cmp without requiring a tempory, so instead we
>      just accept a smaller immediate that doesn't, we may probe a bit more often
>      but that doesn't matter much on the long run.  */
> 
> seems a bit of a hack.
> 
> > If you think it would be worthwhile, I'd be happy to use one of these
> > loops instead.
> 
> Yeah, I still think we should do this unless we can commit to doing
> the optimised version by a specific date, and that date is soon enough
> that the optimisation could reasonably be backported to GCC 8.
> 

While implementing these loops I found them a bit hard to follow, or rather a bit
difficult to prove correct, to someone looking at the code it may not be trivially clear
what it does. I believe the main concern here is that the common case
isn't shortcutted? e.g. spills small enough not to require a probe. So how about

	mov	r15, base
	mov	adjustment, N
	cmp	adjustment, nearest(min_probe_threshold)
	b.lt	end
begin:
	sub	base, base, nearest(min_probe_threshold)
	str	xzr, [base, 0]
	subs	size, size, nearest(min_probe_threshold)
	b.hs	begin
end:
	sub	base, base, size

as an alternative? Which is 9 insn but also much simpler and follows the same semantics
as the other probing codes.  This has the downside that we probe a bit more often than we need to
but on the average case you'd likely not enter the loop more than once, so I'd expect in real world usage
the amount of probes to be the same as the previous code, since you'd have to spill a significant amount of SVE
vectors in order to enter the loop, let alone iterate.

This is still safe as the only invariant we have to hold is not to drop the SP by more than a page at a time,
doing less than a page it fine.

nearest just rounds down to the nearest value that fits in a 12-bit shifted immediate.


Thanks,
Tamar

> > @@ -4830,11 +4929,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> >  	}
> >      }
> > 
> > -  HOST_WIDE_INT size;
> > +  HOST_WIDE_INT size = 0;
> >    /* If SIZE is not large enough to require probing, just adjust the stack and
> >       exit.  */
> > -  if (!poly_size.is_constant (&size)
> > -      || known_lt (poly_size, min_probe_threshold)
> > +  if ((poly_size.is_constant (&size)
> > +       && known_lt (poly_size, min_probe_threshold))
> >        || !flag_stack_clash_protection)
> >      {
> >        aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
> 
> I still think we should remove this poly_size.is_constant, and instead:
> 
> > @@ -4842,9 +4941,64 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> >      }
>  
> >    if (dump_file)
> > -    fprintf (dump_file,
> > -	     "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
> > -	     ", probing will be required.\n", size);
> > +    {
> > +      if (poly_size.is_constant ())
> > +	fprintf (dump_file,
> > +		 "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC
> > +		 " bytes, probing will be required.\n", size);
> > +      else
> > +	{
> > +	  fprintf (dump_file, "Stack clash SVE prologue: ");
> > +	  print_dec (poly_size, dump_file);
> > +	  fprintf (dump_file, " bytes, dynamic probing will be required.\n");
> > +	}
> > +    }
> > +
> > +  /* Handle the SVE non-constant case first.  */
> > +  if (!poly_size.is_constant ())
> 
> ...use is_constant (&size) here, and put the dump messages for the
> constant and non-constant cases in their respective constant and
> non-constant blocks.  That way each use of "size" is directly protected
> by an is_constant call, and there's no need to initialise size to 0.
> 
> The non-constant case doesn't have the new special handling of
> final_adjustment_p, so I think the !is_constant block should assert
> !final_adjustment_p.
> 
> Thanks,
> Richard

-- 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]