[AArch64] Updated stack-clash implementation supporting 64k probes. [patch (1/6)]

Message ID 20180711112037.GA15738@arm.com
State New
Series
  • [AArch64] Updated stack-clash implementation supporting 64k probes. [patch (1/6)]

Commit Message

Tamar Christina July 11, 2018, 11:20 a.m.
Hi All,

This patch implements stack clash mitigation for AArch64.
On AArch64 we expect both the probing interval and the guard size to be 64KB,
and we enforce that they are always equal.

We also probe up by 1024 bytes in the general case when a probe is required.

AArch64 has the following probing conditions:

 1) Any allocation less than 63KB requires no probing.  An ABI-defined safe
    buffer of 1KB is used and a page size of 64KB is assumed.

 2) Any allocation larger than one page is done in increments of the page
    size and probed up by 1KB, leaving the residual.

 3a) Any residual for local allocations that is less than 63KB requires no
     probing.  Essentially this is a sliding window: the probing range
     determines the ABI safe buffer and the amount to probe up by.

  b) Any residual for outgoing arguments that is less than 1KB requires no
     probing.  However, to maintain our invariant, anything at or above 1KB
     requires a probe.

Incrementally allocating less than the probing threshold (e.g. in recursive
functions) is not an issue, as the store of LR counts as a probe.


                            +-------------------+                                    
                            |  ABI SAFE REGION  |                                    
                  +------------------------------                                    
                  |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |                                    
 maximum amount   |         |                   |                                    
 not needing a    |         |                   |                                    
 probe            |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |                                    
                  |         |                   |        Probe offset when           
                  |         ---------------------------- probe is required           
                  |         |                   |                                    
                  +-------- +-------------------+ --------  Point of first probe     
                            |  ABI SAFE REGION  |                                    
                            ---------------------                                    
                            |                   |                                    
                            |                   |                                    
                            |                   |                                         

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
The target was tested with stack clash both on and off by default.

Ok for trunk?

Thanks,
Tamar

gcc/
2018-07-11  Jeff Law  <law@redhat.com>
	    Richard Sandiford <richard.sandiford@linaro.org>
	    Tamar Christina  <tamar.christina@arm.com>

	PR target/86486
	* config/aarch64/aarch64.md (cmp<mode>,
	probe_stack_range): Add k (SP) constraint.
	* config/aarch64/aarch64.h (STACK_CLASH_CALLER_GUARD,
	STACK_CLASH_MAX_UNROLL_PAGES): New.
	* config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Emit
	stack probes for stack clash.
	(aarch64_allocate_and_probe_stack_space): New.
	(aarch64_expand_prologue): Use it.
	(aarch64_expand_epilogue): Likewise and update IP regs re-use criteria.
	(aarch64_sub_sp): Add emit_move_imm optional param.

gcc/testsuite/
2018-07-11  Jeff Law  <law@redhat.com>
	    Richard Sandiford <richard.sandiford@linaro.org>
	    Tamar Christina  <tamar.christina@arm.com>

	PR target/86486
	* gcc.target/aarch64/stack-check-12.c: New.
	* gcc.target/aarch64/stack-check-13.c: New.
	* gcc.target/aarch64/stack-check-cfa-1.c: New.
	* gcc.target/aarch64/stack-check-cfa-2.c: New.
	* gcc.target/aarch64/stack-check-prologue-1.c: New.
	* gcc.target/aarch64/stack-check-prologue-10.c: New.
	* gcc.target/aarch64/stack-check-prologue-11.c: New.
	* gcc.target/aarch64/stack-check-prologue-2.c: New.
	* gcc.target/aarch64/stack-check-prologue-3.c: New.
	* gcc.target/aarch64/stack-check-prologue-4.c: New.
	* gcc.target/aarch64/stack-check-prologue-5.c: New.
	* gcc.target/aarch64/stack-check-prologue-6.c: New.
	* gcc.target/aarch64/stack-check-prologue-7.c: New.
	* gcc.target/aarch64/stack-check-prologue-8.c: New.
	* gcc.target/aarch64/stack-check-prologue-9.c: New.
	* gcc.target/aarch64/stack-check-prologue.h: New.
	* lib/target-supports.exp
	(check_effective_target_supports_stack_clash_protection): Add AArch64.

--

Comments

Jeff Law July 11, 2018, 5 p.m. | #1
On 07/11/2018 05:20 AM, Tamar Christina wrote:
> Hi All,
> 
[ ... ]
I'm going to let the aarch64 maintainers ack/nack the aarch64 specific
bits.  I'll review them to see if there's anything obvious (since I am
familiar with the core issues and the original implementation).

I'm happy to own review work on the target independent chunks.

jeff
Jeff Law July 11, 2018, 5:30 p.m. | #2
On 07/11/2018 05:20 AM, Tamar Christina wrote:
> Hi All,
> 
> [ ... ]
> 
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 976f9afae54c1c98c22a4d5002d8d94e33b3190a..1345f0eb171d05e2b833935c0a32f79c3db03f99 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -84,6 +84,14 @@
>  
>  #define LONG_DOUBLE_TYPE_SIZE	128
>  
> +/* This value is the amount of bytes a caller is allowed to drop the stack
> +   before probing has to be done for stack clash protection.  */
> +#define STACK_CLASH_CALLER_GUARD 1024
> +
> +/* This value controls how many pages we manually unroll the loop for when
> +   generating stack clash probes.  */
> +#define STACK_CLASH_MAX_UNROLL_PAGES 4
Thanks for pulling these out as #defines rather than leaving the
original magic #s in the code.


> +
>  /* The architecture reserves all bits of the address for hardware use,
>     so the vbit must go into the delta field of pointers to member
>     functions.  This is the same config as that in the AArch32
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index b88e7cac27ab76e01b9769563ec9077d2a81bd7b..7c5daf2c62eaeeb1f066d4f2828197b98d31f158 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -2810,10 +2810,11 @@ aarch64_add_sp (rtx temp1, rtx temp2, poly_int64 delta, bool emit_move_imm)
>     if nonnull.  */
>  
>  static inline void
> -aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p)
> +aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p,
> +		bool emit_move_imm = true)
>  {
>    aarch64_add_offset (Pmode, stack_pointer_rtx, stack_pointer_rtx, -delta,
> -		      temp1, temp2, frame_related_p);
> +		      temp1, temp2, frame_related_p, emit_move_imm);
>  }
>  
>  /* Set DEST to (vec_series BASE STEP).  */
> @@ -3973,13 +3974,29 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
>    /* Loop.  */
>    ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_lab);
>  
>    /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
>    xops[0] = reg1;
> -  xops[1] = GEN_INT (PROBE_INTERVAL);
> +  if (flag_stack_clash_protection)
> +    xops[1] = GEN_INT (stack_clash_probe_interval);
> +  else
> +    xops[1] = GEN_INT (PROBE_INTERVAL);
> +
>    output_asm_insn ("sub\t%0, %0, %1", xops);
>  
> -  /* Probe at TEST_ADDR.  */
> -  output_asm_insn ("str\txzr, [%0]", xops);
> +  /* If doing stack clash protection then we probe up by the ABI specified
> +     amount.  We do this because we're dropping full pages at a time in the
> +     loop.  But if we're doing non-stack clash probing, probe at SP 0.  */
> +  if (flag_stack_clash_protection)
> +    xops[1] = GEN_INT (STACK_CLASH_CALLER_GUARD);
> +  else
> +    xops[1] = CONST0_RTX (GET_MODE (xops[1]));
So this is a bit different than the original, but probably just as safe.
The key is to make sure we hit the just-allocated page, and this should
do so just fine.


> @@ -4793,6 +4810,147 @@ aarch64_set_handled_components (sbitmap components)
>        cfun->machine->reg_is_wrapped_separately[regno] = true;
>  }
>  
> +/* Allocate POLY_SIZE bytes of stack space using TEMP1 and TEMP2 as scratch
> +   registers.  If POLY_SIZE is not large enough to require a probe this function
> +   will only adjust the stack.  When allocation the stack space
> +   FRAME_RELATED_P is then used to indicated if the allocation is frame related.
> +   FINAL_ADJUSTMENT_P indicates whether we are doing the outgoing arguments. In
> +   this case we need to adjust the threshold to be any allocations larger than
> +   the ABI defined buffer needs a probe such that the invariant is maintained.
> +   */
> +
> +static void
> +aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> +					poly_int64 poly_size,
> +					bool frame_related_p,
> +					bool final_adjustment_p)
It looks like you pushed some of the work that was originally done in
aarch64_expand_prologue into this function.  It all looks reasonable to me.

> +
> +  /* Round size to the nearest multiple of guard_size, and calculate the
> +     residual as the difference between the original size and the rounded
> +     size. */
> +  HOST_WIDE_INT rounded_size = size & -guard_size;
> +  HOST_WIDE_INT residual = size - rounded_size;
> +
> +  /* We can handle a small number of allocations/probes inline.  Otherwise
> +     punt to a loop.  */
> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
> +    {
> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
> +	{
> +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
> +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
> +					   STACK_CLASH_CALLER_GUARD));
> +	}
So the only concern with this code is that it'll be inefficient and
possibly incorrect for probe sizes larger than ARITH_FACTOR.
Ultimately, that's a case I don't think we really care that much about.
I wouldn't lose sleep if the port clamped the requested probing interval
at ARITH_FACTOR.


> +    {
> +      /* Compute the ending address.  */
> +      aarch64_add_offset (Pmode, temp1, stack_pointer_rtx, -rounded_size,
> +			  temp1, NULL, false, true);
> +      rtx_insn *insn = get_last_insn ();
> +
> +      /* For the initial allocation, we don't have a frame pointer
> +	 set up, so we always need CFI notes.  If we're doing the
> +	 final allocation, then we may have a frame pointer, in which
> +	 case it is the CFA, otherwise we need CFI notes.
> +
> +	 We can determine which allocation we are doing by looking at
> +	 the value of FRAME_RELATED_P since the final allocations are not
> +	 frame related.  */
> +      if (frame_related_p)
> +	{
> +	  /* We want the CFA independent of the stack pointer for the
> +	     duration of the loop.  */
> +	  add_reg_note (insn, REG_CFA_DEF_CFA,
> +			plus_constant (Pmode, temp1, rounded_size));
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	}
This looks cleaner than what I had :-)

One thing I've noticed is you seem to be missing blockage insns after
the allocation and probe.

That can be problematic in a couple of cases.  First, it can run afoul of
combine-stack-adjustments.  Essentially that pass will combine a series
of stack adjustments into a single insn, so your unrolled probing turns
into something like this:

  multi-page stack adjust
  probe first page
  probe second page
  probe third page
  and so on..


That violates the design constraint that we never allocate more than a
page at a time.

Second the scheduler can rearrange the original sequence and break
dependencies resulting in similar code.  I see that we have the test for
that in these changes.

I thought we had tests in the suite to check for both of these
scenarios.  Though it's possible they were x86 specific tests and we
factored that into the work we did on aarch64 without ever building
aarch64 specific tests for this issue.


Do you happen to have a libc.so and ld.so compiled with this code?  I've
got a scanner here which will look at a DSO and flag obviously invalid
stack allocations.  If you've got a suitable libc.so and ld.so I can run
them through the scanner.


Along similar lines, have you run the glibc testsuite with this stuff
enabled by default?  That proved useful for finding codegen issues,
particularly with the register inheritance in the epilogue.



> @@ -4990,10 +5178,21 @@ aarch64_expand_epilogue (bool for_sibcall)
>    /* A stack clash protection prologue may not have left IP0_REGNUM or
>       IP1_REGNUM in a usable state.  The same is true for allocations
>       with an SVE component, since we then need both temporary registers
> -     for each allocation.  */
> +     for each allocation.  For stack clash we are in a usable state if
> +     the adjustment is less than GUARD_SIZE - GUARD_USED_BY_CALLER.  */
> +  HOST_WIDE_INT guard_size
> +    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
> +  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
> +
> +  /* We can re-use the registers when the allocation amount is smaller than
> +     guard_size - guard_used_by_caller because we won't be doing any probes
> +     then.  In such situations the register should remain live with the correct
> +     value.  */
>    bool can_inherit_p = (initial_adjust.is_constant ()
> -			&& final_adjust.is_constant ()
> -			&& !flag_stack_clash_protection);
> +			&& final_adjust.is_constant ())
> +			&& (!flag_stack_clash_protection
> +			     || known_lt (initial_adjust,
> +					  guard_size - guard_used_by_caller));
Probably better than my solution which always disabled inheritance when
stack-clash-protection was enabled :-)


> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -3072,7 +3072,7 @@
>  
>  (define_insn "cmp<mode>"
>    [(set (reg:CC CC_REGNUM)
> -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
> +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
>  		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
>    ""
>    "@
I don't think I needed this, but I can certainly understand how it might
be necessary.  The only one I know we needed was the probe_stack_range
constraint change.
Tamar Christina July 12, 2018, 5:39 p.m. | #3
Hi Jeff,

Thanks for the review!

The 07/11/2018 18:30, Jeff Law wrote:
> On 07/11/2018 05:20 AM, Tamar Christina wrote:
> > Hi All,
> > 
> > This patch implements the use of the stack clash mitigation for aarch64.
> > In Aarch64 we expect both the probing interval and the guard size to be 64KB
> > and we enforce them to always be equal.
> > 
> > We also probe up by 1024 bytes in the general case when a probe is required.
> > 
> > AArch64 has the following probing conditions:
> > 
> >  1) Any allocation less than 63KB requires no probing.  An ABI defined safe
> >     buffer of 1Kbytes is used and a page size of 64k is assumed.
> > 
> >  2) Any allocations larger than 1 page size, is done in increments of page size
> >     and probed up by 1KB leaving the residuals.
> > 
> >  3a) Any residual for local arguments that is less than 63KB requires no probing.
> >      Essentially this is a sliding window.  The probing range determines the ABI
> >      safe buffer, and the amount to be probed up.
> > 
> >   b) Any residual for outgoing arguments that is less than 1KB requires no probing,
> >      However to maintain our invariant, anything above or equal to 1KB requires a probe.
> > 
> > Incrementally allocating less than the probing thresholds, e.g. recursive functions will
> > not be an issue as the storing of LR counts as a probe.
> > 
> > 
> >                             +-------------------+                                    
> >                             |  ABI SAFE REGION  |                                    
> >                   +------------------------------                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >  maximum amount   |         |                   |                                    
> >  not needing a    |         |                   |                                    
> >  probe            |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |                                    
> >                   |         |                   |        Probe offset when           
> >                   |         ---------------------------- probe is required           
> >                   |         |                   |                                    
> >                   +-------- +-------------------+ --------  Point of first probe     
> >                             |  ABI SAFE REGION  |                                    
> >                             ---------------------                                    
> >                             |                   |                                    
> >                             |                   |                                    
> >                             |                   |                                         
> > 
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > Target was tested with stack clash on and off by default.
> > 
> > Ok for trunk?
> > 
> > Thanks,
> > Tamar
> > 
> > gcc/
> > 2018-07-11  Jeff Law  <law@redhat.com>
> > 	    Richard Sandiford <richard.sandiford@linaro.org>
> > 	    Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	PR target/86486
> > 	* config/aarch64/aarch64.md (cmp<mode>,
> > 	probe_stack_range): Add k (SP) constraint.
> > 	* config/aarch64/aarch64.h (STACK_CLASH_CALLER_GUARD,
> > 	STACK_CLASH_MAX_UNROLL_PAGES): New.
> > 	* config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Emit
> > 	stack probes for stack clash.
> > 	(aarch64_allocate_and_probe_stack_space): New.
> > 	(aarch64_expand_prologue): Use it.
> > 	(aarch64_expand_epilogue): Likewise and update IP regs re-use criteria.
> > 	(aarch64_sub_sp): Add emit_move_imm optional param.
> > 
> > gcc/testsuite/
> > 2018-07-11  Jeff Law  <law@redhat.com>
> > 	    Richard Sandiford <richard.sandiford@linaro.org>
> > 	    Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	PR target/86486
> > 	* gcc.target/aarch64/stack-check-12.c: New.
> > 	* gcc.target/aarch64/stack-check-13.c: New.
> > 	* gcc.target/aarch64/stack-check-cfa-1.c: New.
> > 	* gcc.target/aarch64/stack-check-cfa-2.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-1.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-10.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-11.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-2.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-3.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-4.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-5.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-6.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-7.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-8.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue-9.c: New.
> > 	* gcc.target/aarch64/stack-check-prologue.h: New.
> > 	* lib/target-supports.exp
> > 	(check_effective_target_supports_stack_clash_protection): Add AArch64.
> > 
> > -- 
> > 
> > 
> > rb9150.patch
> > 
> > 
> > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> > index 976f9afae54c1c98c22a4d5002d8d94e33b3190a..1345f0eb171d05e2b833935c0a32f79c3db03f99 100644
> > --- a/gcc/config/aarch64/aarch64.h
> > +++ b/gcc/config/aarch64/aarch64.h
> > @@ -84,6 +84,14 @@
> >  
> >  #define LONG_DOUBLE_TYPE_SIZE	128
> >  
> > +/* This value is the amount of bytes a caller is allowed to drop the stack
> > +   before probing has to be done for stack clash protection.  */
> > +#define STACK_CLASH_CALLER_GUARD 1024
> > +
> > +/* This value controls how many pages we manually unroll the loop for when
> > +   generating stack clash probes.  */
> > +#define STACK_CLASH_MAX_UNROLL_PAGES 4
> Thanks for pulling these out as #defines rather than leaving the
> original magic #s in the code.
> 
> 
> > +
> >  /* The architecture reserves all bits of the address for hardware use,
> >     so the vbit must go into the delta field of pointers to member
> >     functions.  This is the same config as that in the AArch32
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index b88e7cac27ab76e01b9769563ec9077d2a81bd7b..7c5daf2c62eaeeb1f066d4f2828197b98d31f158 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -2810,10 +2810,11 @@ aarch64_add_sp (rtx temp1, rtx temp2, poly_int64 delta, bool emit_move_imm)
> >     if nonnull.  */
> >  
> >  static inline void
> > -aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p)
> > +aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p,
> > +		bool emit_move_imm = true)
> >  {
> >    aarch64_add_offset (Pmode, stack_pointer_rtx, stack_pointer_rtx, -delta,
> > -		      temp1, temp2, frame_related_p);
> > +		      temp1, temp2, frame_related_p, emit_move_imm);
> >  }
> >  
> >  /* Set DEST to (vec_series BASE STEP).  */
> > @@ -3973,13 +3974,29 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
> >    /* Loop.  */
> >    ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_lab);
> >  
> >    /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
> >    xops[0] = reg1;
> > -  xops[1] = GEN_INT (PROBE_INTERVAL);
> > +  if (flag_stack_clash_protection)
> > +    xops[1] = GEN_INT (stack_clash_probe_interval);
> > +  else
> > +    xops[1] = GEN_INT (PROBE_INTERVAL);
> > +
> >    output_asm_insn ("sub\t%0, %0, %1", xops);
> >  
> > -  /* Probe at TEST_ADDR.  */
> > -  output_asm_insn ("str\txzr, [%0]", xops);
> > +  /* If doing stack clash protection then we probe up by the ABI specified
> > +     amount.  We do this because we're dropping full pages at a time in the
> > +     loop.  But if we're doing non-stack clash probing, probe at SP 0.  */
> > +  if (flag_stack_clash_protection)
> > +    xops[1] = GEN_INT (STACK_CLASH_CALLER_GUARD);
> > +  else
> > +    xops[1] = CONST0_RTX (GET_MODE (xops[1]));
> So this is a bit different than the original, but probably just as safe.
>  The key is to make sure we hit the just allocated page and this should
> do so just fine.
> 
> 
> > @@ -4793,6 +4810,147 @@ aarch64_set_handled_components (sbitmap components)
> >        cfun->machine->reg_is_wrapped_separately[regno] = true;
> >  }
> >  
> > +/* Allocate POLY_SIZE bytes of stack space using TEMP1 and TEMP2 as scratch
> > +   registers.  If POLY_SIZE is not large enough to require a probe this function
> > +   will only adjust the stack.  When allocating the stack space
> > +   FRAME_RELATED_P is then used to indicate if the allocation is frame related.
> > +   FINAL_ADJUSTMENT_P indicates whether we are doing the outgoing arguments.  In
> > +   this case we need to adjust the threshold so that any allocation larger than
> > +   the ABI defined buffer needs a probe, such that the invariant is
> > +   maintained.  */
> > +
> > +static void
> > +aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> > +					poly_int64 poly_size,
> > +					bool frame_related_p,
> > +					bool final_adjustment_p)
> It looks like you pushed some of the work that was originally done in
> aarch64_expand_prologue into this function.  It all looks reasonable to me.
> 
> > +
> > +  /* Round size to the nearest multiple of guard_size, and calculate the
> > +     residual as the difference between the original size and the rounded
> > +     size. */
> > +  HOST_WIDE_INT rounded_size = size & -guard_size;
> > +  HOST_WIDE_INT residual = size - rounded_size;
> > +
> > +  /* We can handle a small number of allocations/probes inline.  Otherwise
> > +     punt to a loop.  */
> > +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
> > +    {
> > +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
> > +	{
> > +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
> > +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
> > +					   STACK_CLASH_CALLER_GUARD));
> > +	}
> So the only concern with this code is that it'll be inefficient and
> possibly incorrect for probe sizes larger than ARITH_FACTOR.
> Ultimately, that's a case I don't think we really care that much about.
> I wouldn't lose sleep if the port clamped the requested probing interval
> at ARITH_FACTOR.
> 

I'm a bit confused here, the ARITH_FACTOR seems to have to do with the Ada
stack probing implementation, which isn't used by this new code aside
from the part that emits the actual probe when doing a variable or large
allocation in aarch64_output_probe_stack_range.

Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
probing intervals. 
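
For concreteness, with a 64KB guard/probe interval the loop emitted by aarch64_output_probe_stack_range would look roughly like this (an illustrative sketch only; the register numbers, label name, and exact layout are assumed rather than taken from the patch):

```asm
.LPSRL0:                        // loop_lab
        sub     x9, x9, 65536   // TEST_ADDR -= probe interval (one 64KB page)
        str     xzr, [x9, 1024] // probe 1KB up into the page just allocated
        cmp     x9, x10         // TEST_ADDR == LAST_ADDR?
        b.ne    .LPSRL0
```

Both immediates encode directly: 65536 is 16 << 12 for sub (immediate), and 1024 fits the scaled unsigned str offset.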

> 
> > +    {
> > +      /* Compute the ending address.  */
> > +      aarch64_add_offset (Pmode, temp1, stack_pointer_rtx, -rounded_size,
> > +			  temp1, NULL, false, true);
> > +      rtx_insn *insn = get_last_insn ();
> > +
> > +      /* For the initial allocation, we don't have a frame pointer
> > +	 set up, so we always need CFI notes.  If we're doing the
> > +	 final allocation, then we may have a frame pointer, in which
> > +	 case it is the CFA, otherwise we need CFI notes.
> > +
> > +	 We can determine which allocation we are doing by looking at
> > +	 the value of FRAME_RELATED_P since the final allocations are not
> > +	 frame related.  */
> > +      if (frame_related_p)
> > +	{
> > +	  /* We want the CFA independent of the stack pointer for the
> > +	     duration of the loop.  */
> > +	  add_reg_note (insn, REG_CFA_DEF_CFA,
> > +			plus_constant (Pmode, temp1, rounded_size));
> > +	  RTX_FRAME_RELATED_P (insn) = 1;
> > +	}
> This looks cleaner than what I had :-)
> 
> One think I've noticed is you seem to be missing blockage insns after
> the allocation and probe.
> 
> That can be problematical in a couple cases.  First it can run afoul of
> combine-stack-adjustments.  Essentially that pass will combine a series
> of stack adjustments into a single insn so your unrolled probing turns
> into something like this:
> 
>   multi-page stack adjust
>   probe first page
>   probe second page
>   probe third page
>   and so on..
> 

This is something we actually do want, particularly in the case of 4KB pages
as the immediates would fit in the store.  It's one of the things we were
thinking about for future improvements.

> 
> That violates the design constraint that we never allocate more than a
> page at a time.

Would there be a big issue here if we didn't adhere to this constraint?

> 
> Second the scheduler can rearrange the original sequence and break
> dependencies resulting in similar code.  I see that we have the test for
> that in these changes.
> 
> I thought we had tests in the suite to check for both of these
> scenarios.  Though it's possible they were x86 specific tests and we
> factored that into the work we did on aarch64 without ever building
> aarch64 specific tests for this issue.
> 

I have tried to get it to do this today but couldn't actually trigger it;
I was using 4KB pages and attempting an allocation of 4*4KB, but it just
kept all the adjustments and stores separate.
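
For reference, this is the shape of test case I was using (a sketch; the guard-size setting of 12, i.e. 4KB pages, is the experimental configuration under discussion, compiled with -O2 -fstack-clash-protection --param stack-clash-protection-guard-size=12):

```c
/* Candidate test for the combine-stack-adjustments concern: a 4*4KB frame
   exercises the unrolled probing path, and should remain four separate
   sub/str pairs rather than being merged into one 16KB adjustment.  */
#include <string.h>

static int __attribute__ ((noinline))
consume (char *p)
{
  return p[0] + p[2 * 4096];
}

int
f (void)
{
  char buf[4 * 4096];		/* Four pages: the unrolled probe case.  */
  memset (buf, 1, sizeof buf);	/* Keep the allocation live.  */
  return consume (buf);
}
```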

> 
> Do you happen to have a libc.so and ld.so compiled with this code?  I've
> got a scanner here which will look at a DSO and flag obviously invalid
> stack allocations.  If you've got a suitable libc.so and ld.so I can run
> them through the scanner.
> 
> 
> Along similar lines, have you run the glibc testsuite with this stuff
> enabled by default.  That proved useful to find codegen issues,
> particularly with the register inheritance in the epilogue.
> 

I'm running one now, I'll send the two binaries once I get the results back
from the run. Thanks!

> 
> 
> > @@ -4990,10 +5178,21 @@ aarch64_expand_epilogue (bool for_sibcall)
> >    /* A stack clash protection prologue may not have left IP0_REGNUM or
> >       IP1_REGNUM in a usable state.  The same is true for allocations
> >       with an SVE component, since we then need both temporary registers
> > -     for each allocation.  */
> > +     for each allocation.  For stack clash we are in a usable state if
> > +     the adjustment is less than GUARD_SIZE - GUARD_USED_BY_CALLER.  */
> > +  HOST_WIDE_INT guard_size
> > +    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
> > +  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
> > +
> > +  /* We can re-use the registers when the allocation amount is smaller than
> > +     guard_size - guard_used_by_caller because we won't be doing any probes
> > +     then.  In such situations the register should remain live with the correct
> > +     value.  */
> >    bool can_inherit_p = (initial_adjust.is_constant ()
> > -			&& final_adjust.is_constant ()
> > -			&& !flag_stack_clash_protection);
> > +			&& final_adjust.is_constant ())
> > +			&& (!flag_stack_clash_protection
> > +			     || known_lt (initial_adjust,
> > +					  guard_size - guard_used_by_caller));
> Probably better than my solution which always disabled inheritance when
> stack-clash-protection was enabled :-)
> 
> 
> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> > index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
> > --- a/gcc/config/aarch64/aarch64.md
> > +++ b/gcc/config/aarch64/aarch64.md
> > @@ -3072,7 +3072,7 @@
> >  
> >  (define_insn "cmp<mode>"
> >    [(set (reg:CC CC_REGNUM)
> > -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
> > +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
> >  		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
> >    ""
> >    "@
> I don't think I needed this, but I can certainly understand how it might
> be necessary.  The only one I know we needed was the probe_stack_range
> constraint change.
> 

It's not strictly needed, but it allows us to do the comparison with the
stack pointer in the loop without needing to emit sp to a temporary first.
So it's just a tiny optimization. :)
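
To illustrate what the "k" alternative buys in the loop (a sketch; the register numbers are assumed):

```asm
// With "rk" on operand 0 of cmp<mode>, sp can be compared directly:
        cmp     sp, x10

// Without it, sp would first have to be copied into a general register:
        mov     x11, sp
        cmp     x11, x10
```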

Regards,
Tamar

> 
> 

--
Tamar Christina July 13, 2018, 4:35 p.m. | #4
Hi All,

I'm sending an updated patch which makes sure unwind tables are always disabled for tests that do sequence checks, so that they pass in all configurations.
There's no change to the cover letter or implementation.

Regards,
Tamar

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 976f9afae54c1c98c22a4d5002d8d94e33b3190a..1345f0eb171d05e2b833935c0a32f79c3db03f99 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -84,6 +84,14 @@
 
 #define LONG_DOUBLE_TYPE_SIZE	128
 
+/* This value is the amount of bytes a caller is allowed to drop the stack
+   before probing has to be done for stack clash protection.  */
+#define STACK_CLASH_CALLER_GUARD 1024
+
+/* This value controls how many pages we manually unroll the loop for when
+   generating stack clash probes.  */
+#define STACK_CLASH_MAX_UNROLL_PAGES 4
+
 /* The architecture reserves all bits of the address for hardware use,
    so the vbit must go into the delta field of pointers to member
    functions.  This is the same config as that in the AArch32
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b88e7cac27ab76e01b9769563ec9077d2a81bd7b..7c5daf2c62eaeeb1f066d4f2828197b98d31f158 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2810,10 +2810,11 @@ aarch64_add_sp (rtx temp1, rtx temp2, poly_int64 delta, bool emit_move_imm)
    if nonnull.  */
 
 static inline void
-aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p)
+aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p,
+		bool emit_move_imm = true)
 {
   aarch64_add_offset (Pmode, stack_pointer_rtx, stack_pointer_rtx, -delta,
-		      temp1, temp2, frame_related_p);
+		      temp1, temp2, frame_related_p, emit_move_imm);
 }
 
 /* Set DEST to (vec_series BASE STEP).  */
@@ -3973,13 +3974,29 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   /* Loop.  */
   ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_lab);
 
+  HOST_WIDE_INT stack_clash_probe_interval
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+
   /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
   xops[0] = reg1;
-  xops[1] = GEN_INT (PROBE_INTERVAL);
+  if (flag_stack_clash_protection)
+    xops[1] = GEN_INT (stack_clash_probe_interval);
+  else
+    xops[1] = GEN_INT (PROBE_INTERVAL);
+
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
-  /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  /* If doing stack clash protection then we probe up by the ABI specified
+     amount.  We do this because we're dropping full pages at a time in the
+     loop.  But if we're doing non-stack clash probing, probe at SP 0.  */
+  if (flag_stack_clash_protection)
+    xops[1] = GEN_INT (STACK_CLASH_CALLER_GUARD);
+  else
+    xops[1] = CONST0_RTX (GET_MODE (xops[1]));
+
+  /* Probe at TEST_ADDR.  If we're inside the loop it is always safe to probe
+     by this amount for each iteration.  */
+  output_asm_insn ("str\txzr, [%0, %1]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -4793,6 +4810,147 @@ aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate POLY_SIZE bytes of stack space using TEMP1 and TEMP2 as scratch
+   registers.  If POLY_SIZE is not large enough to require a probe this function
+   will only adjust the stack.  When allocating the stack space
+   FRAME_RELATED_P is then used to indicate if the allocation is frame related.
+   FINAL_ADJUSTMENT_P indicates whether we are doing the outgoing arguments.  In
+   this case we need to adjust the threshold so that any allocation larger than
+   the ABI defined buffer needs a probe, such that the invariant is
+   maintained.  */
+
+static void
+aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
+					poly_int64 poly_size,
+					bool frame_related_p,
+					bool final_adjustment_p)
+{
+  HOST_WIDE_INT guard_size
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+  HOST_WIDE_INT min_probe_threshold = final_adjustment_p
+				      ? guard_used_by_caller
+				      : guard_size - guard_used_by_caller;
+  poly_int64 frame_size = cfun->machine->frame.frame_size;
+
+  if (flag_stack_clash_protection && !final_adjustment_p)
+    {
+      poly_int64 initial_adjust = cfun->machine->frame.initial_adjust;
+      poly_int64 final_adjust = cfun->machine->frame.final_adjust;
+
+      if (known_eq (frame_size, 0))
+	dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, false);
+      else if (known_lt (initial_adjust, guard_size - guard_used_by_caller)
+	       && known_lt (final_adjust, guard_used_by_caller))
+	dump_stack_clash_frame_info (NO_PROBE_SMALL_FRAME, true);
+    }
+
+  /* If SIZE is not large enough to require probing, just adjust the stack and
+     exit.  */
+  if (known_lt (poly_size, min_probe_threshold)
+      || !flag_stack_clash_protection)
+    {
+      aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
+      return;
+    }
+
+  HOST_WIDE_INT size;
+  if (!poly_size.is_constant (&size))
+    {
+      sorry ("stack probes for SVE frames");
+      return;
+    }
+
+  if (dump_file)
+    fprintf (dump_file,
+	     "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
+	     ", probing will be required.\n", size);
+
+  /* Round size to the nearest multiple of guard_size, and calculate the
+     residual as the difference between the original size and the rounded
+     size. */
+  HOST_WIDE_INT rounded_size = size & -guard_size;
+  HOST_WIDE_INT residual = size - rounded_size;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
+	{
+	  aarch64_sub_sp (NULL, temp2, guard_size, true);
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   STACK_CLASH_CALLER_GUARD));
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else
+    {
+      /* Compute the ending address.  */
+      aarch64_add_offset (Pmode, temp1, stack_pointer_rtx, -rounded_size,
+			  temp1, NULL, false, true);
+      rtx_insn *insn = get_last_insn ();
+
+      /* For the initial allocation, we don't have a frame pointer
+	 set up, so we always need CFI notes.  If we're doing the
+	 final allocation, then we may have a frame pointer, in which
+	 case it is the CFA, otherwise we need CFI notes.
+
+	 We can determine which allocation we are doing by looking at
+	 the value of FRAME_RELATED_P since the final allocations are not
+	 frame related.  */
+      if (frame_related_p)
+	{
+	  /* We want the CFA independent of the stack pointer for the
+	     duration of the loop.  */
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			plus_constant (Pmode, temp1, rounded_size));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
+      /* This allocates and probes the stack.  Note that this re-uses some of
+	 the existing Ada stack protection code.  However we are guaranteed not
+	 to enter the non-loop or residual branches of that code.
+
+	 The non-loop part won't be entered because if our allocation amount
+	 doesn't require a loop, the case above would handle it.
+
+	 The residual amount won't be entered because TEMP1 is a multiple of
+	 the allocation size.  The residual will always be 0.  As such, the only
+	 part we are actually using from that code is the loop setup.  The
+	 actual probing is done in aarch64_output_probe_stack_range.  */
+      insn = emit_insn (gen_probe_stack_range (stack_pointer_rtx,
+					       stack_pointer_rtx, temp1));
+
+      /* Now reset the CFA register if needed.  */
+      if (frame_related_p)
+	{
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			plus_constant (Pmode, stack_pointer_rtx, rounded_size));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+
+  /* Handle any residuals.
+     Note that any residual must be probed.  */
+  if (residual)
+    {
+      aarch64_sub_sp (temp1, temp2, residual, frame_related_p);
+      if (residual >= min_probe_threshold)
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "Stack clash AArch64 prologue residuals: "
+		     HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required."
+		     "\n", residual);
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   STACK_CLASH_CALLER_GUARD));
+	}
+    }
+}
+
 /* Add a REG_CFA_EXPRESSION note to INSN to say that register REG
    is saved at BASE + OFFSET.  */
 
@@ -4839,7 +4997,24 @@ aarch64_add_cfa_expression (rtx_insn *insn, unsigned int reg,
 
    Dynamic stack allocations via alloca() decrease stack_pointer_rtx
    but leave frame_pointer_rtx and hard_frame_pointer_rtx
-   unchanged.  */
+   unchanged.
+
+   For stack-clash we assume the guard is at least 64KB.  Furthermore, we assume
+   that the caller has not pushed the stack pointer more than 1k into the guard.
+   A caller that pushes the stack pointer more than 1k into the guard is
+   considered invalid.
+
+   With those assumptions the callee can allocate up to 63KB of stack space
+   without probing.
+
+   When probing is needed, we emit a probe at the start of the prologue
+   and every PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE bytes thereafter.
+
+   We have to track how much space has been allocated; the only stores to
+   the stack that we track as implicit probes are the FP/LR stores.
+
+   For outgoing arguments we probe if the size is larger than 1024, so that
+   the ABI-specified buffer is maintained for the next callee.  */
 
 /* Generate the prologue instructions for entry into a function.
    Establish the stack frame by decreasing the stack pointer with a
@@ -4890,7 +5065,16 @@ aarch64_expand_prologue (void)
   rtx ip0_rtx = gen_rtx_REG (Pmode, IP0_REGNUM);
   rtx ip1_rtx = gen_rtx_REG (Pmode, IP1_REGNUM);
 
-  aarch64_sub_sp (ip0_rtx, ip1_rtx, initial_adjust, true);
+  /* In theory we should never have both an initial adjustment
+     and a callee save adjustment.  Verify that is the case since the
+     code below does not handle it for -fstack-clash-protection.  */
+  gcc_assert (known_eq (initial_adjust, 0) || callee_adjust == 0);
+
+  /* Will only probe if the initial adjustment is larger than the guard
+     less the amount of the guard reserved for use by the caller's
+     outgoing args.  */
+  aarch64_allocate_and_probe_stack_space (ip0_rtx, ip1_rtx, initial_adjust,
+					  true, false);
 
   if (callee_adjust != 0)
     aarch64_push_regs (reg1, reg2, callee_adjust);
@@ -4946,7 +5130,11 @@ aarch64_expand_prologue (void)
 			     callee_adjust != 0 || emit_frame_chain);
   aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
 			     callee_adjust != 0 || emit_frame_chain);
-  aarch64_sub_sp (ip1_rtx, ip0_rtx, final_adjust, !frame_pointer_needed);
+
+  /* We may need to probe the final adjustment if it is larger than the guard
+     that is assumed by the callee.  */
+  aarch64_allocate_and_probe_stack_space (ip1_rtx, ip0_rtx, final_adjust,
+					  !frame_pointer_needed, true);
 }
 
 /* Return TRUE if we can use a simple_return insn.
@@ -4990,10 +5178,21 @@ aarch64_expand_epilogue (bool for_sibcall)
   /* A stack clash protection prologue may not have left IP0_REGNUM or
      IP1_REGNUM in a usable state.  The same is true for allocations
      with an SVE component, since we then need both temporary registers
-     for each allocation.  */
+     for each allocation.  For stack clash we are in a usable state if
+     the adjustment is less than GUARD_SIZE - GUARD_USED_BY_CALLER.  */
+  HOST_WIDE_INT guard_size
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+
+  /* We can re-use the registers when the allocation amount is smaller than
+     guard_size - guard_used_by_caller because we won't be doing any probes
+     then.  In such situations the register should remain live with the correct
+     value.  */
   bool can_inherit_p = (initial_adjust.is_constant ()
-			&& final_adjust.is_constant ()
-			&& !flag_stack_clash_protection);
+			&& final_adjust.is_constant ())
+			&& (!flag_stack_clash_protection
+			     || known_lt (initial_adjust,
+					  guard_size - guard_used_by_caller));
 
   /* We need to add memory barrier to prevent read from deallocated stack.  */
   bool need_barrier_p
@@ -5021,8 +5220,10 @@ aarch64_expand_epilogue (bool for_sibcall)
 			hard_frame_pointer_rtx, -callee_offset,
 			ip1_rtx, ip0_rtx, callee_adjust == 0);
   else
-    aarch64_add_sp (ip1_rtx, ip0_rtx, final_adjust,
-		    !can_inherit_p || df_regs_ever_live_p (IP1_REGNUM));
+     /* The case where we need to re-use the register here is very rare, so
+	avoid the complicated condition and just always emit a move if the
+	immediate doesn't fit.  */
+     aarch64_add_sp (ip1_rtx, ip0_rtx, final_adjust, true);
 
   aarch64_restore_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
 				callee_adjust != 0, &cfi_ops);
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3072,7 +3072,7 @@
 
 (define_insn "cmp<mode>"
   [(set (reg:CC CC_REGNUM)
-	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
+	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
 		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
   ""
   "@
@@ -5930,7 +5930,7 @@
 )
 
 (define_insn "probe_stack_range"
-  [(set (match_operand:DI 0 "register_operand" "=r")
+  [(set (match_operand:DI 0 "register_operand" "=rk")
 	(unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")
 			     (match_operand:DI 2 "register_operand" "r")]
 			      UNSPECV_PROBE_STACK_RANGE))]
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-12.c b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..f4348d6745309a6642539f9141e2a6d36fe06c5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -fno-asynchronous-unwind-tables -fno-unwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+extern void arf (unsigned long int *, unsigned long int *);
+void
+frob ()
+{
+  unsigned long int num[10000];
+  unsigned long int den[10000];
+  arf (den, num);
+}
+
+/* This verifies that the scheduler did not break the dependencies
+   by adjusting the offsets within the probe and that the scheduler
+   did not reorder around the stack probes.  */
+/* { dg-final { scan-assembler-times {sub\tsp, sp, #65536\n(\tmov x[0-9]+, .+\n)?\tstr\txzr, \[sp, 1024\]} 2 } } */
+/* There is some residual allocation, but we don't care about that. Only that it's not probed.  */
+/* { dg-final { scan-assembler-times {str\txzr, } 2 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-13.c b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
new file mode 100644
index 0000000000000000000000000000000000000000..1fcbae6e3fc6c8423883542d16735e2a8ca0e013
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -fno-asynchronous-unwind-tables -fno-unwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define ARG32(X) X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X
+#define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X)
+void out1(ARG192(__int128));
+int t1(int);
+
+int t3(int x)
+{
+  if (x < 1000)
+    return t1 (x) + 1;
+
+  out1 (ARG192(1));
+  return 0;
+}
+
+
+
+/* This test creates a large (> 1k) outgoing argument area that needs
+   to be probed.  We don't test the exact size of the space or the
+   exact offset to make the test a little less sensitive to trivial
+   output changes.  */
+/* { dg-final { scan-assembler-times "sub\\tsp, sp, #....\\n\\tstr\\txzr, \\\[sp" 1 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..425d8a9a6d6dc95b5441088063d043765d7a2e91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128*1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 65536} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 131072} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..6c92a321d204bd0a8b7ddab043cc1e6a6c999869
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 1280*1024 + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa [0-9]+, 1310720} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 1311232} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 1310720} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..a79ce277859e6f49eb1e56ca44ebce14389f4209
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..f75b8b42bda6356b975d82565c448e000b3ed8e3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE (6 * 64 * 1024) + (1 * 63 * 1024) + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..00a894ae7b9d6a2be7424da3412fc7cf4d7a460a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE (6 * 64 * 1024) + (1 * 32 * 1024)
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..50fe086ad48831e02497fa1a91db9e194e157961
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 2 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..ca38bf9c4c5c4c29f03cfe26713efeb35e4b1449
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 63 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..30f6b1e146855fb1937a03e851c878f1dac78924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 63 * 1024 + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..e51d8189dc09b977cc0374e4a49cded908ca23cd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 64 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..e7d0550433aea9f37ed4f3cc0f33d526dae42ee5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 65 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..8c160fdc071e2f0db0bf9da908c365f514502062
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 127 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..a64c68330c668b51994f96547c3125b82b543072
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..4ab63d9ae31c4909acec648d288c7c5b6c0b7f54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 6 * 64 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h
new file mode 100644
index 0000000000000000000000000000000000000000..b7e06aedb81d7692ebd587b23d1065436b1c7218
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h
@@ -0,0 +1,5 @@
+int f_test (int x)
+{
+  char arr[SIZE];
+  return arr[x];
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3c2c62a58004677411ca380259043d5a9d484469..7ec350213e9225ad342d030afac30bd613851aa1 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9269,14 +9269,9 @@ proc check_effective_target_autoincdec { } {
 # 
 proc check_effective_target_supports_stack_clash_protection { } {
 
-   # Temporary until the target bits are fully ACK'd.
-#  if { [istarget aarch*-*-*] } {
-#	return 1
-#  }
-
     if { [istarget x86_64-*-*] || [istarget i?86-*-*] 
 	  || [istarget powerpc*-*-*] || [istarget rs6000*-*-*]
-	  || [istarget s390*-*-*] } {
	  || [istarget aarch64*-*-*] || [istarget s390*-*-*] } {
 	return 1
     }
   return 0
Jeff Law July 13, 2018, 4:46 p.m. | #5
On 07/12/2018 11:39 AM, Tamar Christina wrote:
>>> +
>>> +  /* Round size to the nearest multiple of guard_size, and calculate the
>>> +     residual as the difference between the original size and the rounded
>>> +     size. */
>>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
>>> +  HOST_WIDE_INT residual = size - rounded_size;
>>> +
>>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
>>> +     punt to a loop.  */
>>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
>>> +    {
>>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
>>> +	{
>>> +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
>>> +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
>>> +					   STACK_CLASH_CALLER_GUARD));
>>> +	}
>> So the only concern with this code is that it'll be inefficient and
>> possibly incorrect for probe sizes larger than ARITH_FACTOR.
>> Ultimately, that's a case I don't think we really care that much about.
>> I wouldn't lose sleep if the port clamped the requested probing interval
>> at ARITH_FACTOR.
>>
> I'm a bit confused here, the ARITH_FACTOR seems to have to do with the Ada
> stack probing implementation, which isn't used by this new code aside
> from the part that emits the actual probe when doing a variable or large
> allocation in aarch64_output_probe_stack_range.
> 
> Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
> probing intervals. 
It may have been a misunderstanding on my part.  My understanding is
that ARITH_FACTOR represents the largest immediate constant we could
handle in this code using a single insn.  Anything above ARITH_FACTOR
needed a scratch register and I couldn't convince myself that we had a
suitable scratch register available.

But I'm really not well versed on the aarch64 architecture or the
various implementation details in aarch64.c.  So I'm happy to defer to
you and others @ ARM on what's best to do here.


>> That can be problematical in a couple cases.  First it can run afoul of
>> combine-stack-adjustments.  Essentially that pass will combine a series
>> of stack adjustments into a single insn so your unrolled probing turns
>> into something like this:
>>
>>   multi-page stack adjust
>>   probe first page
>>   probe second page
>>   probe third page
>>   and so on..
>>
> This is something we actually do want, particularly in the case of 4KB pages
> as the immediates would fit in the store.  It's one of the things we were
> thinking about for future improvements.
> 
>> That violates the design constraint that we never allocate more than a
>> page at a time.
> Would there be a big issue here if we didn't adhere to this constraint?
Yes, because it enlarges a window for exploitation.  Consider signal
delivery occurring after the adjustment but before the probes.  The
signal handler could then be running on a clashed heap/stack.

> 
>> Do you happen to have a libc.so and ld.so compiled with this code?  I've
>> got a scanner here which will look at a DSO and flag obviously invalid
>> stack allocations.  If you've got a suitable libc.so and ld.so I can run
>> them through the scanner.
>>
>>
>> Along similar lines, have you run the glibc testsuite with this stuff
>> enabled by default.  That proved useful to find codegen issues,
>> particularly with the register inheritance in the epilogue.
>>
> I'm running one now, I'll send the two binaries once I get the results back
> from the run. Thanks!
Great.  Looking forward to getting those .so files so I can throw them
into the scanner.

>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -3072,7 +3072,7 @@
>>>  
>>>  (define_insn "cmp<mode>"
>>>    [(set (reg:CC CC_REGNUM)
>>> -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
>>> +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
>>>  		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
>>>    ""
>>>    "@
>> I don't think I needed this, but I can certainly understand how it might
>> be necessary.  The only one I know we needed was the probe_stack_range
>> constraint change.
>>
> It's not strictly needed, but it allows us to do the comparison with the
> stack pointer in the loop without needing to emit sp to a temporary first.
> So it's just a tiny optimization. :)
Understood.  A similar tweak for cmp<mode> was done in a patch from
Richard E in his patches for spectre v1 mitigation.  I'm certainly not
objecting :-)


Jeff
Tamar Christina July 16, 2018, 1:54 p.m. | #6
The 07/13/2018 17:46, Jeff Law wrote:
> On 07/12/2018 11:39 AM, Tamar Christina wrote:
> >>> +
> >>> +  /* Round size to the nearest multiple of guard_size, and calculate the
> >>> +     residual as the difference between the original size and the rounded
> >>> +     size. */
> >>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
> >>> +  HOST_WIDE_INT residual = size - rounded_size;
> >>> +
> >>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
> >>> +     punt to a loop.  */
> >>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
> >>> +    {
> >>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
> >>> +	{
> >>> +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
> >>> +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
> >>> +					   STACK_CLASH_CALLER_GUARD));
> >>> +	}
> >> So the only concern with this code is that it'll be inefficient and
> >> possibly incorrect for probe sizes larger than ARITH_FACTOR.
> >> Ultimately, that's a case I don't think we really care that much about.
> >> I wouldn't lose sleep if the port clamped the requested probing interval
> >> at ARITH_FACTOR.
> >>
> > I'm a bit confused here, the ARITH_FACTOR seems to have to do with the Ada
> > stack probing implementation, which isn't used by this new code aside
> > from the part that emits the actual probe when doing a variable or large
> > allocation in aarch64_output_probe_stack_range.
> > 
> > Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
> > probing intervals. 
> It may have been a misunderstanding on my part.  My understanding is
> that ARITH_FACTOR represents the largest immediate constant we could
> handle in this code using a single insn.  Anything above ARITH_FACTOR
> needed a scratch register and I couldn't convince myself that we had a
> suitable scratch register available.
> 
> But I'm really not well versed on the aarch64 architecture or the
> various implementation details in aarch64.c.  So I'm happy to defer to
> you and others @ ARM on what's best to do here.

Ah no, that 12-bit immediate for the str offset is unscaled.  Scaled, it's
effectively 15 bits for the 64-bit store case.  So the actual limit is 32760,
which is quite a bit larger than ARITH_FACTOR.

The value of STACK_CLASH_CALLER_GUARD is fixed in the back-end and can't be
changed, and if it's made too big will just give a compile error.

> 
> 
> >> That can be problematical in a couple cases.  First it can run afoul of
> >> combine-stack-adjustments.  Essentially that pass will combine a series
> >> of stack adjustments into a single insn so your unrolled probing turns
> >> into something like this:
> >>
> >>   multi-page stack adjust
> >>   probe first page
> >>   probe second page
> >>   probe third page
> >>   and so on..
> >>
> > This is something we actually do want, particularly in the case of 4KB pages
> > as the immediates would fit in the store.  It's one of the things we were
> > thinking about for future improvements.
> > 
> >> That violates the design constraint that we never allocate more than a
> >> page at a time.
> > Would there be a big issue here if we didn't adhere to this constraint?
> Yes, because it enlarges a window for exploitation.  Consider signal
> delivery occurring after the adjustment but before the probes.  The
> signal handler could then be running on a clashed heap/stack.
> 

Ah, you're quite right.. I didn't factor in asynchronous events during the stack
probing.  I have restored the barriers and added some documentation on why they're
needed.

> > 
> >> Do you happen to have a libc.so and ld.so compiled with this code?  I've
> >> got a scanner here which will look at a DSO and flag obviously invalid
> >> stack allocations.  If you've got a suitable libc.so and ld.so I can run
> >> them through the scanner.
> >>
> >>
> >> Along similar lines, have you run the glibc testsuite with this stuff
> >> enabled by default.  That proved useful to find codegen issues,
> >> particularly with the register inheritance in the epilogue.
> >>
> > I'm running one now, I'll send the two binaries once I get the results back
> > from the run. Thanks!
> Great.  Looking forward getting those  .so files I can can throw them
> into the scanner.

I have finished running the glibc testsuite and there were no new regressions.

Thanks,
Tamar

> 
> >>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> >>> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
> >>> --- a/gcc/config/aarch64/aarch64.md
> >>> +++ b/gcc/config/aarch64/aarch64.md
> >>> @@ -3072,7 +3072,7 @@
> >>>  
> >>>  (define_insn "cmp<mode>"
> >>>    [(set (reg:CC CC_REGNUM)
> >>> -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
> >>> +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
> >>>  		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
> >>>    ""
> >>>    "@
> >> I don't think I needed this, but I can certainly understand how it might
> >> be necessary.  The only one I know we needed was the probe_stack_range
> >> constraint change.
> >>
> > It's not strictly needed, but it allows us to do the comparison with the
> > stack pointer in the loop without needing to emit sp to a temporary first.
> > So it's just a tiny optimization. :)
> Understood.  A similar tweak for cmp<mode> was done in a patch from
> Richard E in his patches for spectre v1 mitigation.  I'm certainly not
> objecting :-)
> 
> 
> Jeff
> 

--
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 976f9afae54c1c98c22a4d5002d8d94e33b3190a..1345f0eb171d05e2b833935c0a32f79c3db03f99 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -84,6 +84,14 @@
 
 #define LONG_DOUBLE_TYPE_SIZE	128
 
+/* This value is the number of bytes a caller is allowed to drop the stack
+   before probing has to be done for stack clash protection.  */
+#define STACK_CLASH_CALLER_GUARD 1024
+
+/* This value controls how many pages we manually unroll the loop for when
+   generating stack clash probes.  */
+#define STACK_CLASH_MAX_UNROLL_PAGES 4
+
 /* The architecture reserves all bits of the address for hardware use,
    so the vbit must go into the delta field of pointers to member
    functions.  This is the same config as that in the AArch32
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b88e7cac27ab76e01b9769563ec9077d2a81bd7b..c115e8499b2d7a37143d45b50c6dcadf186cc783 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2810,10 +2810,11 @@ aarch64_add_sp (rtx temp1, rtx temp2, poly_int64 delta, bool emit_move_imm)
    if nonnull.  */
 
 static inline void
-aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p)
+aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p,
+		bool emit_move_imm = true)
 {
   aarch64_add_offset (Pmode, stack_pointer_rtx, stack_pointer_rtx, -delta,
-		      temp1, temp2, frame_related_p);
+		      temp1, temp2, frame_related_p, emit_move_imm);
 }
 
 /* Set DEST to (vec_series BASE STEP).  */
@@ -3973,13 +3974,29 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   /* Loop.  */
   ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_lab);
 
+  HOST_WIDE_INT stack_clash_probe_interval
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+
   /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
   xops[0] = reg1;
-  xops[1] = GEN_INT (PROBE_INTERVAL);
+  if (flag_stack_clash_protection)
+    xops[1] = GEN_INT (stack_clash_probe_interval);
+  else
+    xops[1] = GEN_INT (PROBE_INTERVAL);
+
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
-  /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  /* If doing stack clash protection then we probe up by the ABI specified
+     amount.  We do this because we're dropping full pages at a time in the
+     loop.  But if we're doing non-stack clash probing, probe at SP 0.  */
+  if (flag_stack_clash_protection)
+    xops[1] = GEN_INT (STACK_CLASH_CALLER_GUARD);
+  else
+    xops[1] = CONST0_RTX (GET_MODE (xops[1]));
+
+  /* Probe at TEST_ADDR.  If we're inside the loop it is always safe to probe
+     by this amount for each iteration.  */
+  output_asm_insn ("str\txzr, [%0, %1]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -4793,6 +4810,158 @@ aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate POLY_SIZE bytes of stack space using TEMP1 and TEMP2 as scratch
+   registers.  If POLY_SIZE is not large enough to require a probe this function
+   will only adjust the stack.  When allocating the stack space,
+   FRAME_RELATED_P is used to indicate whether the allocation is frame related.
+   FINAL_ADJUSTMENT_P indicates whether we are doing the outgoing arguments.  In
+   this case we lower the threshold so that any allocation larger than the ABI
+   defined buffer needs a probe, so that the invariant is maintained.
+
+   We emit barriers after each stack adjustment to prevent optimizations
+   from breaking the invariant that we never drop the stack more than a page.
+   This invariant is needed to make it easier to correctly handle asynchronous
+   events, e.g. if we were to allow the stack to be dropped by more than a page
+   and then have multiple probes up and we take a signal somewhere in between
+   then the signal handler doesn't know the state of the stack and can make no
+   assumptions about which pages have been probed.
+   */
+
+static void
+aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
+					poly_int64 poly_size,
+					bool frame_related_p,
+					bool final_adjustment_p)
+{
+  HOST_WIDE_INT guard_size
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+  HOST_WIDE_INT min_probe_threshold = final_adjustment_p
+				      ? guard_used_by_caller
+				      : guard_size - guard_used_by_caller;
+  poly_int64 frame_size = cfun->machine->frame.frame_size;
+
+  if (flag_stack_clash_protection && !final_adjustment_p)
+    {
+      poly_int64 initial_adjust = cfun->machine->frame.initial_adjust;
+      poly_int64 final_adjust = cfun->machine->frame.final_adjust;
+
+      if (known_eq (frame_size, 0))
+	dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, false);
+      else if (known_lt (initial_adjust, guard_size - guard_used_by_caller)
+	       && known_lt (final_adjust, guard_used_by_caller))
+	dump_stack_clash_frame_info (NO_PROBE_SMALL_FRAME, true);
+    }
+
+  /* If SIZE is not large enough to require probing, just adjust the stack and
+     exit.  */
+  if (known_lt (poly_size, min_probe_threshold)
+      || !flag_stack_clash_protection)
+    {
+      aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
+      return;
+    }
+
+  HOST_WIDE_INT size;
+  if (!poly_size.is_constant (&size))
+    {
+      sorry ("stack probes for SVE frames");
+      return;
+    }
+
+  if (dump_file)
+    fprintf (dump_file,
+	     "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
+	     ", probing will be required.\n", size);
+
+  /* Round size to the nearest multiple of guard_size, and calculate the
+     residual as the difference between the original size and the rounded
+     size. */
+  HOST_WIDE_INT rounded_size = size & -guard_size;
+  HOST_WIDE_INT residual = size - rounded_size;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
+	{
+	  aarch64_sub_sp (NULL, temp2, guard_size, true);
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   STACK_CLASH_CALLER_GUARD));
+	  emit_insn (gen_blockage ());
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else
+    {
+      /* Compute the ending address.  */
+      aarch64_add_offset (Pmode, temp1, stack_pointer_rtx, -rounded_size,
+			  temp1, NULL, false, true);
+      rtx_insn *insn = get_last_insn ();
+
+      /* For the initial allocation, we don't have a frame pointer
+	 set up, so we always need CFI notes.  If we're doing the
+	 final allocation, then we may have a frame pointer, in which
+	 case it is the CFA, otherwise we need CFI notes.
+
+	 We can determine which allocation we are doing by looking at
+	 the value of FRAME_RELATED_P since the final allocations are not
+	 frame related.  */
+      if (frame_related_p)
+	{
+	  /* We want the CFA independent of the stack pointer for the
+	     duration of the loop.  */
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			plus_constant (Pmode, temp1, rounded_size));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
+      /* This allocates and probes the stack.  Note that this re-uses some of
+	 the existing Ada stack protection code.  However we are guaranteed not
+	 to enter the non loop or residual branches of that code.
+
+	 The non-loop part won't be entered because if our allocation amount
+	 doesn't require a loop, the case above would handle it.
+
+	 The residual amount won't be entered because TEMP1 is a multiple of
+	 the allocation size.  The residual will always be 0.  As such, the only
+	 part we are actually using from that code is the loop setup.  The
+	 actual probing is done in aarch64_output_probe_stack_range.  */
+      insn = emit_insn (gen_probe_stack_range (stack_pointer_rtx,
+					       stack_pointer_rtx, temp1));
+
+      /* Now reset the CFA register if needed.  */
+      if (frame_related_p)
+	{
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			plus_constant (Pmode, stack_pointer_rtx, rounded_size));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
+      emit_insn (gen_blockage ());
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+
+  /* Handle any residuals.
+     Note that any residual must be probed.  */
+  if (residual)
+    {
+      aarch64_sub_sp (temp1, temp2, residual, frame_related_p);
+      if (residual >= min_probe_threshold)
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "Stack clash AArch64 prologue residuals: "
+		     HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required."
+		     "\n", residual);
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   STACK_CLASH_CALLER_GUARD));
+	  emit_insn (gen_blockage ());
+	}
+    }
+}
+
 /* Add a REG_CFA_EXPRESSION note to INSN to say that register REG
    is saved at BASE + OFFSET.  */
 
@@ -4839,7 +5008,24 @@ aarch64_add_cfa_expression (rtx_insn *insn, unsigned int reg,
 
    Dynamic stack allocations via alloca() decrease stack_pointer_rtx
    but leave frame_pointer_rtx and hard_frame_pointer_rtx
-   unchanged.  */
+   unchanged.
+
+   For stack-clash we assume the guard is at least 64KB.  Furthermore, we assume
+   that the caller has not pushed the stack pointer more than 1k into the guard.
+   A caller that pushes the stack pointer more than 1k into the guard is
+   considered invalid.
+
+   With those assumptions the callee can allocate up to 63KB of stack space
+   without probing.
+
+   When probing is needed, we emit a probe at the start of the prologue
+   and every PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE bytes thereafter.
+
+   We have to track how much space has been allocated and the only stores
+   to the stack we track as implicit probes are the FP/LR stores.
+
+   For outgoing arguments we probe if the size is larger than 1024, so that
+   the ABI specified buffer is maintained for the next callee.  */
 
 /* Generate the prologue instructions for entry into a function.
    Establish the stack frame by decreasing the stack pointer with a
@@ -4890,7 +5076,16 @@ aarch64_expand_prologue (void)
   rtx ip0_rtx = gen_rtx_REG (Pmode, IP0_REGNUM);
   rtx ip1_rtx = gen_rtx_REG (Pmode, IP1_REGNUM);
 
-  aarch64_sub_sp (ip0_rtx, ip1_rtx, initial_adjust, true);
+  /* In theory we should never have both an initial adjustment
+     and a callee save adjustment.  Verify that is the case since the
+     code below does not handle it for -fstack-clash-protection.  */
+  gcc_assert (known_eq (initial_adjust, 0) || callee_adjust == 0);
+
+  /* Will only probe if the initial adjustment is larger than the guard
+     less the amount of the guard reserved for use by the caller's
+     outgoing args.  */
+  aarch64_allocate_and_probe_stack_space (ip0_rtx, ip1_rtx, initial_adjust,
+					  true, false);
 
   if (callee_adjust != 0)
     aarch64_push_regs (reg1, reg2, callee_adjust);
@@ -4946,7 +5141,11 @@ aarch64_expand_prologue (void)
 			     callee_adjust != 0 || emit_frame_chain);
   aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
 			     callee_adjust != 0 || emit_frame_chain);
-  aarch64_sub_sp (ip1_rtx, ip0_rtx, final_adjust, !frame_pointer_needed);
+
+  /* We may need to probe the final adjustment if it is larger than the guard
+     that is assumed by the callee.  */
+  aarch64_allocate_and_probe_stack_space (ip1_rtx, ip0_rtx, final_adjust,
+					  !frame_pointer_needed, true);
 }
 
 /* Return TRUE if we can use a simple_return insn.
@@ -4990,10 +5189,21 @@ aarch64_expand_epilogue (bool for_sibcall)
   /* A stack clash protection prologue may not have left IP0_REGNUM or
      IP1_REGNUM in a usable state.  The same is true for allocations
      with an SVE component, since we then need both temporary registers
-     for each allocation.  */
+     for each allocation.  For stack clash we are in a usable state if
+     the adjustment is less than GUARD_SIZE - GUARD_USED_BY_CALLER.  */
+  HOST_WIDE_INT guard_size
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+
+  /* We can re-use the registers when the allocation amount is smaller than
+     guard_size - guard_used_by_caller because we won't be doing any probes
+     then.  In such situations the register should remain live with the correct
+     value.  */
   bool can_inherit_p = (initial_adjust.is_constant ()
-			&& final_adjust.is_constant ()
-			&& !flag_stack_clash_protection);
+			&& final_adjust.is_constant ())
+			&& (!flag_stack_clash_protection
+			     || known_lt (initial_adjust,
+					  guard_size - guard_used_by_caller));
 
   /* We need to add memory barrier to prevent read from deallocated stack.  */
   bool need_barrier_p
@@ -5021,8 +5231,10 @@ aarch64_expand_epilogue (bool for_sibcall)
 			hard_frame_pointer_rtx, -callee_offset,
 			ip1_rtx, ip0_rtx, callee_adjust == 0);
   else
-    aarch64_add_sp (ip1_rtx, ip0_rtx, final_adjust,
-		    !can_inherit_p || df_regs_ever_live_p (IP1_REGNUM));
+     /* The case where we need to re-use the register here is very rare, so
+	avoid the complicated condition and just always emit a move if the
+	immediate doesn't fit.  */
+     aarch64_add_sp (ip1_rtx, ip0_rtx, final_adjust, true);
 
   aarch64_restore_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
 				callee_adjust != 0, &cfi_ops);
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3072,7 +3072,7 @@
 
 (define_insn "cmp<mode>"
   [(set (reg:CC CC_REGNUM)
-	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
+	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
 		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
   ""
   "@
@@ -5930,7 +5930,7 @@
 )
 
 (define_insn "probe_stack_range"
-  [(set (match_operand:DI 0 "register_operand" "=r")
+  [(set (match_operand:DI 0 "register_operand" "=rk")
 	(unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")
 			     (match_operand:DI 2 "register_operand" "r")]
 			      UNSPECV_PROBE_STACK_RANGE))]
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-12.c b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..4e3abcbcef2eb216be9f0e01b4f1713c33f8b0b8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -fno-asynchronous-unwind-tables -fno-unwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+extern void arf (unsigned long int *, unsigned long int *);
+void
+frob ()
+{
+  unsigned long int num[10000];
+  unsigned long int den[10000];
+  arf (den, num);
+}
+
+/* This verifies that the scheduler did not break the dependencies
+   by adjusting the offsets within the probe and that the scheduler
+   did not reorder around the stack probes.  */
+/* { dg-final { scan-assembler-times {sub\tsp, sp, #65536\n\tstr\txzr, \[sp, 1024\]} 2 } } */
+/* There is some residual allocation, but we don't care about that. Only that it's not probed.  */
+/* { dg-final { scan-assembler-times {str\txzr, } 2 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-13.c b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
new file mode 100644
index 0000000000000000000000000000000000000000..1fcbae6e3fc6c8423883542d16735e2a8ca0e013
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -fno-asynchronous-unwind-tables -fno-unwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define ARG32(X) X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X
+#define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X)
+void out1(ARG192(__int128));
+int t1(int);
+
+int t3(int x)
+{
+  if (x < 1000)
+    return t1 (x) + 1;
+
+  out1 (ARG192(1));
+  return 0;
+}
+
+
+
+/* This test creates a large (> 1k) outgoing argument area that needs
+   to be probed.  We don't test the exact size of the space or the
+   exact offset to make the test a little less sensitive to trivial
+   output changes.  */
+/* { dg-final { scan-assembler-times "sub\\tsp, sp, #....\\n\\tstr\\txzr, \\\[sp" 1 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..425d8a9a6d6dc95b5441088063d043765d7a2e91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128*1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 65536} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 131072} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..6c92a321d204bd0a8b7ddab043cc1e6a6c999869
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 1280*1024 + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa [0-9]+, 1310720} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 1311232} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 1310720} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..a79ce277859e6f49eb1e56ca44ebce14389f4209
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..f75b8b42bda6356b975d82565c448e000b3ed8e3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE (6 * 64 * 1024) + (1 * 63 * 1024) + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..00a894ae7b9d6a2be7424da3412fc7cf4d7a460a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE (6 * 64 * 1024) + (1 * 32 * 1024)
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..50fe086ad48831e02497fa1a91db9e194e157961
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 2 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..ca38bf9c4c5c4c29f03cfe26713efeb35e4b1449
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 63 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..30f6b1e146855fb1937a03e851c878f1dac78924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 63 * 1024 + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..e51d8189dc09b977cc0374e4a49cded908ca23cd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 64 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..e7d0550433aea9f37ed4f3cc0f33d526dae42ee5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 65 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..8c160fdc071e2f0db0bf9da908c365f514502062
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 127 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..a64c68330c668b51994f96547c3125b82b543072
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..4ab63d9ae31c4909acec648d288c7c5b6c0b7f54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 6 * 64 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h
new file mode 100644
index 0000000000000000000000000000000000000000..b7e06aedb81d7692ebd587b23d1065436b1c7218
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h
@@ -0,0 +1,5 @@
+int f_test (int x)
+{
+  char arr[SIZE];
+  return arr[x];
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3c2c62a58004677411ca380259043d5a9d484469..7ec350213e9225ad342d030afac30bd613851aa1 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9269,14 +9269,9 @@ proc check_effective_target_autoincdec { } {
 # 
 proc check_effective_target_supports_stack_clash_protection { } {
 
-   # Temporary until the target bits are fully ACK'd.
-#  if { [istarget aarch*-*-*] } {
-#	return 1
-#  }
-
     if { [istarget x86_64-*-*] || [istarget i?86-*-*] 
 	  || [istarget powerpc*-*-*] || [istarget rs6000*-*-*]
-	  || [istarget s390*-*-*] } {
+	  || [istarget aarch64*-*-*] || [istarget s390*-*-*] } {
 	return 1
     }
   return 0
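[Editorial aside: the allocation strategy implemented by aarch64_allocate_and_probe_stack_space above can be sketched as follows. This is a simplified model, not GCC code — it assumes the 64KB guard and the 1KB STACK_CLASH_CALLER_GUARD used in the patch, and the function and action names are illustrative only.]

```python
# Sketch of the stack-clash allocation/probing decisions made in the
# prologue code above, for a constant-size (non-SVE) frame.
GUARD_SIZE = 64 * 1024          # probing interval == guard size (64KB)
CALLER_GUARD = 1024             # STACK_CLASH_CALLER_GUARD
MAX_UNROLL_PAGES = 4            # STACK_CLASH_MAX_UNROLL_PAGES

def probe_plan(size, final_adjustment=False):
    """Return the (sp_drop, probe_offset) actions the prologue would emit.

    probe_offset is None when no probe is needed; a ('loop', n) entry
    stands for the probe_stack_range loop over n bytes."""
    threshold = (CALLER_GUARD if final_adjustment
                 else GUARD_SIZE - CALLER_GUARD)
    if size < threshold:
        return [(size, None)]               # plain adjustment, no probe
    rounded = size & -GUARD_SIZE            # round down to guard multiple
    residual = size - rounded
    actions = []
    if rounded <= MAX_UNROLL_PAGES * GUARD_SIZE:
        # Unrolled: drop one page at a time, probing up by 1KB each time.
        for _ in range(rounded // GUARD_SIZE):
            actions.append((GUARD_SIZE, CALLER_GUARD))
    else:
        actions.append(("loop", rounded))   # punt to probe_stack_range
    if residual:
        probe = CALLER_GUARD if residual >= threshold else None
        actions.append((residual, probe))
    return actions
```

For example, a 128KB frame yields two unrolled page drops, each followed by a probe at [sp, 1024], matching the scan patterns in stack-check-12.c, while a 6-page frame falls through to the loop.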
Jeff Law July 16, 2018, 5:03 p.m. | #7
On 07/16/2018 07:54 AM, Tamar Christina wrote:
> The 07/13/2018 17:46, Jeff Law wrote:
>> On 07/12/2018 11:39 AM, Tamar Christina wrote:
>>>>> +
>>>>> +  /* Round size to the nearest multiple of guard_size, and calculate the
>>>>> +     residual as the difference between the original size and the rounded
>>>>> +     size. */
>>>>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
>>>>> +  HOST_WIDE_INT residual = size - rounded_size;
>>>>> +
>>>>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
>>>>> +     punt to a loop.  */
>>>>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
>>>>> +    {
>>>>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
>>>>> +	{
>>>>> +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
>>>>> +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
>>>>> +					   STACK_CLASH_CALLER_GUARD));
>>>>> +	}
>>>> So the only concern with this code is that it'll be inefficient and
>>>> possibly incorrect for probe sizes larger than ARITH_FACTOR.
>>>> Ultimately, that's a case I don't think we really care that much about.
>>>> I wouldn't lose sleep if the port clamped the requested probing interval
>>>> at ARITH_FACTOR.
>>>>
>>> I'm a bit confused here; the ARITH_FACTOR seems to have to do with the Ada
>>> stack probing implementation, which isn't used by this new code aside
>>> from the part that emits the actual probe when doing a variable or large
>>> allocation in aarch64_output_probe_stack_range.
>>>
>>> Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
>>> probing intervals. 
>> It may have been a misunderstanding on my part.  My understanding is
>> that ARITH_FACTOR represents the largest immediate constant we could
>> handle in this code using a single insn.  Anything above ARITH_FACTOR
>> needed a scratch register and I couldn't convince myself that we had a
>> suitable scratch register available.
>>
>> But I'm really not well versed on the aarch64 architecture or the
>> various implementation details in aarch64.c.  So I'm happy to defer to
>> you and others @ ARM on what's best to do here.
> 
> Ah no, that 12 bit immediate for str offset is unscaled. Scaled it's 15 bits for the 64bit store case.
> So the actual limit is 32760, so it's quite a bit larger than ARITH_FACTOR.
> 
> The value of STACK_CLASH_CALLER_GUARD is fixed in the back-end and can't be
> changed, and if it's made too big will just give a compile error.
ACK.  Thanks for explaining.
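[Editorial aside: the offset limit Tamar quotes can be sanity-checked from the A64 STR immediate encoding — a 12-bit unsigned field scaled by the access size, i.e. effectively 15 bits of byte offset for a 64-bit store. A quick sketch:]

```python
# Maximum positive offset encodable in "str Xt, [Xn, #imm]"
# (unsigned scaled immediate form): a 12-bit field scaled by the
# access size in bytes.
def max_scaled_offset(access_bytes):
    return ((1 << 12) - 1) * access_bytes

# For the 64-bit probe stores (str xzr, [sp, #off]) the limit is
# 4095 * 8 = 32760, comfortably above STACK_CLASH_CALLER_GUARD (1024).
limit = max_scaled_offset(8)
```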


> 
>>
>>
>>>> That can be problematical in a couple cases.  First it can run afoul of
>>>> combine-stack-adjustments.  Essentially that pass will combine a series
>>>> of stack adjustments into a single insn so your unrolled probing turns
>>>> into something like this:
>>>>
>>>>   multi-page stack adjust
>>>>   probe first page
>>>>   probe second page
>>>>   probe third page
>>>>   and so on..
>>>>
>>> This is something we actually do want, particularly in the case of 4KB pages
>>> as the immediates would fit in the store.  It's one of the things we were
>>> thinking about for future improvements.
>>>
>>>> That violates the design constraint that we never allocate more than a
>>>> page at a time.
>>> Would there be a big issue here if we didn't adhere to this constraint?
>> Yes, because it enlarges a window for exploitation.  Consider signal
>> delivery occurring after the adjustment but before the probes.  The
>> signal handler could then be running on a clashed heap/stack.
>>
> 
> Ah, you're quite right.. I didn't factor in asynchronous events during the stack
> probing.  I have restored the barriers and added some documentation on why they're
> needed.
Sounds good.


> 
>>>
>>>> Do you happen to have a libc.so and ld.so compiled with this code?  I've
>>>> got a scanner here which will look at a DSO and flag obviously invalid
>>>> stack allocations.  If you've got a suitable libc.so and ld.so I can run
>>>> them through the scanner.
>>>>
>>>>
>>>> Along similar lines, have you run the glibc testsuite with this stuff
>>>> enabled by default.  That proved useful to find codegen issues,
>>>> particularly with the register inheritance in the epilogue.
>>>>
>>> I'm running one now, I'll send the two binaries once I get the results back
>>> from the run. Thanks!
>> Great.  Looking forward to getting those .so files so I can throw them
>> into the scanner.
> 
> I have finished running the glibc testsuite and there were no new regressions.
Great.  And as mentioned privately, scanning looks reasonable.

So I've got no concerns at this point.  If the aarch64 maintainers are
OK with the changes this stuff can go in.

jeff

Patch

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 976f9afae54c1c98c22a4d5002d8d94e33b3190a..1345f0eb171d05e2b833935c0a32f79c3db03f99 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -84,6 +84,14 @@ 
 
 #define LONG_DOUBLE_TYPE_SIZE	128
 
+/* This value is the amount of bytes a caller is allowed to drop the stack
+   before probing has to be done for stack clash protection.  */
+#define STACK_CLASH_CALLER_GUARD 1024
+
+/* This value controls how many pages we manually unroll the loop for when
+   generating stack clash probes.  */
+#define STACK_CLASH_MAX_UNROLL_PAGES 4
+
 /* The architecture reserves all bits of the address for hardware use,
    so the vbit must go into the delta field of pointers to member
    functions.  This is the same config as that in the AArch32
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b88e7cac27ab76e01b9769563ec9077d2a81bd7b..7c5daf2c62eaeeb1f066d4f2828197b98d31f158 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2810,10 +2810,11 @@  aarch64_add_sp (rtx temp1, rtx temp2, poly_int64 delta, bool emit_move_imm)
    if nonnull.  */
 
 static inline void
-aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p)
+aarch64_sub_sp (rtx temp1, rtx temp2, poly_int64 delta, bool frame_related_p,
+		bool emit_move_imm = true)
 {
   aarch64_add_offset (Pmode, stack_pointer_rtx, stack_pointer_rtx, -delta,
-		      temp1, temp2, frame_related_p);
+		      temp1, temp2, frame_related_p, emit_move_imm);
 }
 
 /* Set DEST to (vec_series BASE STEP).  */
@@ -3973,13 +3974,29 @@  aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   /* Loop.  */
   ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_lab);
 
+  HOST_WIDE_INT stack_clash_probe_interval
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+
   /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
   xops[0] = reg1;
-  xops[1] = GEN_INT (PROBE_INTERVAL);
+  if (flag_stack_clash_protection)
+    xops[1] = GEN_INT (stack_clash_probe_interval);
+  else
+    xops[1] = GEN_INT (PROBE_INTERVAL);
+
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
-  /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  /* If doing stack clash protection then we probe up by the ABI specified
+     amount.  We do this because we're dropping full pages at a time in the
+     loop.  But for non-stack-clash probing, probe at offset 0 from SP.  */
+  if (flag_stack_clash_protection)
+    xops[1] = GEN_INT (STACK_CLASH_CALLER_GUARD);
+  else
+    xops[1] = CONST0_RTX (GET_MODE (xops[1]));
+
+  /* Probe at TEST_ADDR.  If we're inside the loop it is always safe to probe
+     by this amount for each iteration.  */
+  output_asm_insn ("str\txzr, [%0, %1]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -4793,6 +4810,147 @@  aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate POLY_SIZE bytes of stack space using TEMP1 and TEMP2 as scratch
+   registers.  If POLY_SIZE is not large enough to require a probe this function
+   will only adjust the stack.  When allocating the stack space
+   FRAME_RELATED_P is used to indicate whether the allocation is frame related.
+   FINAL_ADJUSTMENT_P indicates whether we are allocating the outgoing
+   arguments.  In that case we lower the probing threshold so that any
+   allocation larger than the ABI defined buffer needs a probe, maintaining
+   the invariant.  */
+
+static void
+aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
+					poly_int64 poly_size,
+					bool frame_related_p,
+					bool final_adjustment_p)
+{
+  HOST_WIDE_INT guard_size
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+  HOST_WIDE_INT min_probe_threshold = final_adjustment_p
+				      ? guard_used_by_caller
+				      : guard_size - guard_used_by_caller;
+  poly_int64 frame_size = cfun->machine->frame.frame_size;
+
+  if (flag_stack_clash_protection && !final_adjustment_p)
+    {
+      poly_int64 initial_adjust = cfun->machine->frame.initial_adjust;
+      poly_int64 final_adjust = cfun->machine->frame.final_adjust;
+
+      if (known_eq (frame_size, 0))
+	dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, false);
+      else if (known_lt (initial_adjust, guard_size - guard_used_by_caller)
+	       && known_lt (final_adjust, guard_used_by_caller))
+	dump_stack_clash_frame_info (NO_PROBE_SMALL_FRAME, true);
+    }
+
+  /* If SIZE is not large enough to require probing, just adjust the stack and
+     exit.  */
+  if (known_lt (poly_size, min_probe_threshold)
+      || !flag_stack_clash_protection)
+    {
+      aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
+      return;
+    }
+
+  HOST_WIDE_INT size;
+  if (!poly_size.is_constant (&size))
+    {
+      sorry ("stack probes for SVE frames");
+      return;
+    }
+
+  if (dump_file)
+    fprintf (dump_file,
+	     "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
+	     ", probing will be required.\n", size);
+
+  /* Round size down to a multiple of guard_size, and calculate the
+     residual as the difference between the original size and the
+     rounded size.  */
+  HOST_WIDE_INT rounded_size = size & -guard_size;
+  HOST_WIDE_INT residual = size - rounded_size;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
+	{
+	  aarch64_sub_sp (NULL, temp2, guard_size, true);
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   STACK_CLASH_CALLER_GUARD));
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else
+    {
+      /* Compute the ending address.  */
+      aarch64_add_offset (Pmode, temp1, stack_pointer_rtx, -rounded_size,
+			  temp1, NULL, false, true);
+      rtx_insn *insn = get_last_insn ();
+
+      /* For the initial allocation, we don't have a frame pointer
+	 set up, so we always need CFI notes.  If we're doing the
+	 final allocation, then we may have a frame pointer, in which
+	 case it is the CFA, otherwise we need CFI notes.
+
+	 We can determine which allocation we are doing by looking at
+	 the value of FRAME_RELATED_P since the final allocations are not
+	 frame related.  */
+      if (frame_related_p)
+	{
+	  /* We want the CFA independent of the stack pointer for the
+	     duration of the loop.  */
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			plus_constant (Pmode, temp1, rounded_size));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
+      /* This allocates and probes the stack.  Note that this re-uses some of
+	 the existing Ada stack protection code.  However we are guaranteed not
+	 to enter the non loop or residual branches of that code.
+
+	 The non-loop part won't be entered because if our allocation amount
+	 doesn't require a loop, the case above would handle it.
+
+	 The residual amount won't be entered because TEMP1 is a multiple of
+	 the allocation size.  The residual will always be 0.  As such, the only
+	 part we are actually using from that code is the loop setup.  The
+	 actual probing is done in aarch64_output_probe_stack_range.  */
+      insn = emit_insn (gen_probe_stack_range (stack_pointer_rtx,
+					       stack_pointer_rtx, temp1));
+
+      /* Now reset the CFA register if needed.  */
+      if (frame_related_p)
+	{
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			plus_constant (Pmode, stack_pointer_rtx, rounded_size));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+
+  /* Handle any residuals.
+     Note that any residual must be probed.  */
+  if (residual)
+    {
+      aarch64_sub_sp (temp1, temp2, residual, frame_related_p);
+      if (residual >= min_probe_threshold)
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "Stack clash AArch64 prologue residuals: "
+		     HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required."
+		     "\n", residual);
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   STACK_CLASH_CALLER_GUARD));
+	}
+    }
+}
+
 /* Add a REG_CFA_EXPRESSION note to INSN to say that register REG
    is saved at BASE + OFFSET.  */
 
@@ -4839,7 +4997,24 @@  aarch64_add_cfa_expression (rtx_insn *insn, unsigned int reg,
 
    Dynamic stack allocations via alloca() decrease stack_pointer_rtx
    but leave frame_pointer_rtx and hard_frame_pointer_rtx
-   unchanged.  */
+   unchanged.
+
+   For stack-clash we assume the guard is at least 64KB.  Furthermore, we assume
+   that the caller has not pushed the stack pointer more than 1k into the guard.
+   A caller that pushes the stack pointer more than 1k into the guard is
+   considered invalid.
+
+   With those assumptions the callee can allocate up to 63KB of stack space
+   without probing.
+
+   When probing is needed, we emit a probe at the start of the prologue
+   and every PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE bytes thereafter.
+
+   We have to track how much space has been allocated; the only stores to
+   the stack that we track as implicit probes are the FP/LR stores.
+
+   For outgoing arguments we probe if the amount is larger than 1024, so that
+   the ABI specified buffer is maintained for the next callee.  */
 
 /* Generate the prologue instructions for entry into a function.
    Establish the stack frame by decreasing the stack pointer with a
@@ -4890,7 +5065,16 @@  aarch64_expand_prologue (void)
   rtx ip0_rtx = gen_rtx_REG (Pmode, IP0_REGNUM);
   rtx ip1_rtx = gen_rtx_REG (Pmode, IP1_REGNUM);
 
-  aarch64_sub_sp (ip0_rtx, ip1_rtx, initial_adjust, true);
+  /* In theory we should never have both an initial adjustment
+     and a callee save adjustment.  Verify that is the case since the
+     code below does not handle it for -fstack-clash-protection.  */
+  gcc_assert (known_eq (initial_adjust, 0) || callee_adjust == 0);
+
+  /* Will only probe if the initial adjustment is larger than the guard
+     less the amount of the guard reserved for use by the caller's
+     outgoing args.  */
+  aarch64_allocate_and_probe_stack_space (ip0_rtx, ip1_rtx, initial_adjust,
+					  true, false);
 
   if (callee_adjust != 0)
     aarch64_push_regs (reg1, reg2, callee_adjust);
@@ -4946,7 +5130,11 @@  aarch64_expand_prologue (void)
 			     callee_adjust != 0 || emit_frame_chain);
   aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
 			     callee_adjust != 0 || emit_frame_chain);
-  aarch64_sub_sp (ip1_rtx, ip0_rtx, final_adjust, !frame_pointer_needed);
+
+  /* We may need to probe the final adjustment if it is larger than the guard
+     that is assumed by the callee.  */
+  aarch64_allocate_and_probe_stack_space (ip1_rtx, ip0_rtx, final_adjust,
+					  !frame_pointer_needed, true);
 }
 
 /* Return TRUE if we can use a simple_return insn.
@@ -4990,10 +5178,21 @@  aarch64_expand_epilogue (bool for_sibcall)
   /* A stack clash protection prologue may not have left IP0_REGNUM or
      IP1_REGNUM in a usable state.  The same is true for allocations
      with an SVE component, since we then need both temporary registers
-     for each allocation.  */
+     for each allocation.  For stack clash we are in a usable state if
+     the adjustment is less than GUARD_SIZE - GUARD_USED_BY_CALLER.  */
+  HOST_WIDE_INT guard_size
+    = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+  HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+
+  /* We can re-use the registers when the allocation amount is smaller than
+     guard_size - guard_used_by_caller because we won't be doing any probes
+     then.  In such situations the register should remain live with the correct
+     value.  */
   bool can_inherit_p = (initial_adjust.is_constant ()
-			&& final_adjust.is_constant ()
-			&& !flag_stack_clash_protection);
+			&& final_adjust.is_constant ())
+			&& (!flag_stack_clash_protection
+			     || known_lt (initial_adjust,
+					  guard_size - guard_used_by_caller));
 
   /* We need to add memory barrier to prevent read from deallocated stack.  */
   bool need_barrier_p
@@ -5021,8 +5220,10 @@  aarch64_expand_epilogue (bool for_sibcall)
 			hard_frame_pointer_rtx, -callee_offset,
 			ip1_rtx, ip0_rtx, callee_adjust == 0);
   else
-    aarch64_add_sp (ip1_rtx, ip0_rtx, final_adjust,
-		    !can_inherit_p || df_regs_ever_live_p (IP1_REGNUM));
+     /* The case where we need to re-use the register here is very rare, so
+	avoid the complicated condition and just always emit a move if the
+	immediate doesn't fit.  */
+     aarch64_add_sp (ip1_rtx, ip0_rtx, final_adjust, true);
 
   aarch64_restore_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
 				callee_adjust != 0, &cfi_ops);
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3072,7 +3072,7 @@ 
 
 (define_insn "cmp<mode>"
   [(set (reg:CC CC_REGNUM)
-	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
+	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
 		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
   ""
   "@
@@ -5930,7 +5930,7 @@ 
 )
 
 (define_insn "probe_stack_range"
-  [(set (match_operand:DI 0 "register_operand" "=r")
+  [(set (match_operand:DI 0 "register_operand" "=rk")
 	(unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")
 			     (match_operand:DI 2 "register_operand" "r")]
 			      UNSPECV_PROBE_STACK_RANGE))]
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-12.c b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..1afa435e9f18720285950ecacc5effdd7f26dc02
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+extern void arf (unsigned long int *, unsigned long int *);
+void
+frob ()
+{
+  unsigned long int num[10000];
+  unsigned long int den[10000];
+  arf (den, num);
+}
+
+/* This verifies that the scheduler did not break the dependencies
+   by adjusting the offsets within the probe and that the scheduler
+   did not reorder around the stack probes.  */
+/* { dg-final { scan-assembler-times {sub\tsp, sp, #65536\n(\tmov x[0-9]+, .+\n)?\tstr\txzr, \[sp, 1024\]} 2 } } */
+/* There is some residual allocation, but we don't care about that, only that it's not probed.  */
+/* { dg-final { scan-assembler-times {str\txzr, } 2 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-13.c b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
new file mode 100644
index 0000000000000000000000000000000000000000..dd122ef422d5a165ef340051f46515643ca075fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define ARG32(X) X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X
+#define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X)
+void out1(ARG192(__int128));
+int t1(int);
+
+int t3(int x)
+{
+  if (x < 1000)
+    return t1 (x) + 1;
+
+  out1 (ARG192(1));
+  return 0;
+}
+
+
+
+/* This test creates a large (> 1k) outgoing argument area that needs
+   to be probed.  We don't test the exact size of the space or the
+   exact offset to make the test a little less sensitive to trivial
+   output changes.  */
+/* { dg-final { scan-assembler-times "sub\\tsp, sp, #....\\n\\tstr\\txzr, \\\[sp" 1 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..425d8a9a6d6dc95b5441088063d043765d7a2e91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-1.c
@@ -0,0 +1,10 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128*1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 65536} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 131072} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..6c92a321d204bd0a8b7ddab043cc1e6a6c999869
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-2.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 1280*1024 + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa [0-9]+, 1310720} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 1311232} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 1310720} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_offset 0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..a79ce277859e6f49eb1e56ca44ebce14389f4209
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-1.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..f75b8b42bda6356b975d82565c448e000b3ed8e3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-10.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE (6 * 64 * 1024) + (1 * 63 * 1024) + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..00a894ae7b9d6a2be7424da3412fc7cf4d7a460a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-11.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE (6 * 64 * 1024) + (1 * 32 * 1024)
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..50fe086ad48831e02497fa1a91db9e194e157961
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-2.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 2 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..6dd94c3a78b427de4b7913f58e16059bda918272
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-3.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 63 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr,} 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..30f6b1e146855fb1937a03e851c878f1dac78924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-4.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 63 * 1024 + 512
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..e51d8189dc09b977cc0374e4a49cded908ca23cd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-5.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 64 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..e7d0550433aea9f37ed4f3cc0f33d526dae42ee5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-6.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 65 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..08af6c766e968e7f3adc3845a6aae6057d597665
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-7.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 127 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..a64c68330c668b51994f96547c3125b82b543072
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-8.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 128 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..4ab63d9ae31c4909acec648d288c7c5b6c0b7f54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-9.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define SIZE 6 * 64 * 1024
+#include "stack-check-prologue.h"
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 1024\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h
new file mode 100644
index 0000000000000000000000000000000000000000..b7e06aedb81d7692ebd587b23d1065436b1c7218
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue.h
@@ -0,0 +1,5 @@ 
+int f_test (int x)
+{
+  char arr[SIZE];
+  return arr[x];
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3c2c62a58004677411ca380259043d5a9d484469..7ec350213e9225ad342d030afac30bd613851aa1 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9269,14 +9269,9 @@  proc check_effective_target_autoincdec { } {
 # 
 proc check_effective_target_supports_stack_clash_protection { } {
 
-   # Temporary until the target bits are fully ACK'd.
-#  if { [istarget aarch*-*-*] } {
-#	return 1
-#  }
-
     if { [istarget x86_64-*-*] || [istarget i?86-*-*] 
 	  || [istarget powerpc*-*-*] || [istarget rs6000*-*-*]
-	  || [istarget s390*-*-*] } {
+	  || [istarget aarch64*-*-*] || [istarget s390*-*-*] } {
 	return 1
     }
   return 0