Message ID | 328b4297-f032-6bf0-9252-1d7ee7b50133@redhat.com |
---|---|
State | New |
Headers | show |
Series | [RFA,PR,target/84128] Fix restore of scratch register used in probing loops | expand |
> -fstack-check and -fstack-clash both potentially create a loop for stack > probing. In that case they both need a scratch register to hold the > loop upper bound. > > The code to allocate a scratch register first starts with the > caller-saved registers as they're zero cost. Then it'll use any callee > saved register that is actually saved. If neither is available (say > because all the caller-saved regs are used for parameter passing and > there are no callee saved register used), then the allocation routine > will push %eax on the stack and the deallocation routine will pop it off > to restore its value. > > Of course there is a *stack allocation* between those two points. So > the pop ends up restoring garbage into the register. There is a stack allocation only on Linux, there isn't any on Solaris, *BSD, Windows, LynxOS and other OSes where you can probe below the stack pointer. > Clearly this code had not be exercised. So I hacked up things so that > we always generated a probing loop and always assumed that we needed to > save/restore a scratch register and enabled stack clash protections by > default. Probably not on Linux indeed. But, since you modified the non-Linux path, you need to avoid repeating the same mistake and test on non-Linux native too.
On Wed, Jan 31, 2018 at 6:35 AM, Jeff Law <law@redhat.com> wrote: > Whee, fun, this appears to have been broken for a very long time, > possibly since the introduction of -fstack-check in 2010. Thankfully it > only affects 32 bit and only in relatively constrained circumstances. > > -fstack-check and -fstack-clash both potentially create a loop for stack > probing. In that case they both need a scratch register to hold the > loop upper bound. > > The code to allocate a scratch register first starts with the > caller-saved registers as they're zero cost. Then it'll use any callee > saved register that is actually saved. If neither is available (say > because all the caller-saved regs are used for parameter passing and > there are no callee saved register used), then the allocation routine > will push %eax on the stack and the deallocation routine will pop it off > to restore its value. > > Of course there is a *stack allocation* between those two points. So > the pop ends up restoring garbage into the register. > > Obviously the restore routine needs to use reg+d addressing to get to > the stack slot and the allocated space needs to be deallocated by the > epilogue. But sadly there's enough assertions sprinkled around that > prevent that from working as-is. > > So what this patch does is continue to use the push to allocate the > register. And it uses reg+d to restore the register after the probing > loops. The "trick" is that the space allocated by the push becomes part > of the normal stack frame after we restore the scratch register's value. > Ie, if we push a 4 byte register, then we reduce the size of the main > allocation request by 4 bytes. And everything just works. > > Clearly this code had not be exercised. So I hacked up things so that > we always generated a probing loop and always assumed that we needed to > save/restore a scratch register and enabled stack clash protections by > default. I bootstrapped that and compared testsuite runs against a run > which just had stack clash protection on by default. That did turn up > an issue, but it was with my testing hacks, not with this patch :-) > > > I (of course) also did the usual bootstrap and regression tests, using > x86_64 and i686. Hopefully this is the last iteration on the x86/x86_64 > stack clash bits :-) > > The one concern I have is do we need to tell the CFI machinery that > %eax's value was restored to its entry value? Can you or someone that knows CFI stuff please investigate this a bit? I'm not expert in this area, and I don't feel comfortable to approve the patch that has some known loose edges in the area I don't know that well. Uros.
On 01/31/2018 01:28 AM, Eric Botcazou wrote: >> -fstack-check and -fstack-clash both potentially create a loop for stack >> probing. In that case they both need a scratch register to hold the >> loop upper bound. >> >> The code to allocate a scratch register first starts with the >> caller-saved registers as they're zero cost. Then it'll use any callee >> saved register that is actually saved. If neither is available (say >> because all the caller-saved regs are used for parameter passing and >> there are no callee saved register used), then the allocation routine >> will push %eax on the stack and the deallocation routine will pop it off >> to restore its value. >> >> Of course there is a *stack allocation* between those two points. So >> the pop ends up restoring garbage into the register. > > There is a stack allocation only on Linux, there isn't any on Solaris, *BSD, > Windows, LynxOS and other OSes where you can probe below the stack pointer. Ack. Let's hold the patch until I can test *that* path. jeff
On 01/31/2018 01:28 AM, Eric Botcazou wrote: >> -fstack-check and -fstack-clash both potentially create a loop for stack >> probing. In that case they both need a scratch register to hold the >> loop upper bound. >> >> The code to allocate a scratch register first starts with the >> caller-saved registers as they're zero cost. Then it'll use any callee >> saved register that is actually saved. If neither is available (say >> because all the caller-saved regs are used for parameter passing and >> there are no callee saved register used), then the allocation routine >> will push %eax on the stack and the deallocation routine will pop it off >> to restore its value. >> >> Of course there is a *stack allocation* between those two points. So >> the pop ends up restoring garbage into the register. > > There is a stack allocation only on Linux, there isn't any on Solaris, *BSD, > Windows, LynxOS and other OSes where you can probe below the stack pointer. So I went back and restored the code for targets that don't define STACK_CHECK_MOVING_SP (everything but linux I believe). To test it I turned on stack checking by default, bootstrapped that as a baseline. Then I turned off STACK_CHECK_MOVING_SP, compared that to the baseline. Not surprisingly, things with large stacks failed during probing. I then hacked up the dejagnu bits to present just an 8k stack which made all the regressions which had stack usage dg markers turn into UNRESOLVED. There were still many tests that had to be checked manually because they didn't have stack usage dg markup. This is baseline #2. I then forced all prologue allocations to go through the loop code, bootstrapped that and compared it to baseline #2. Then I turned off STACK_CHECK_MOVING_SP and forced all prologue allocations to go through the loop code. Bootstrapped that and compared the results to the baseline to ensure we weren't regressing. Note some tests want to verify we don't have push/pops in their resulting assembly code or count the number of orl instructions. Those had to be hand-checked along the way as well. Par for the course. I'm confident I've got all 3 paths covered reasonably well now. I still need to check into the CFI concern before resubmitting. jeff ps. I also did light testing before restoring the code to handle !STACK_CHECK_MOVING_SP to verify that it indeed was wrong with my original patch.
On 01/31/2018 03:41 AM, Uros Bizjak wrote: >> >> The one concern I have is do we need to tell the CFI machinery that >> %eax's value was restored to its entry value? > > Can you or someone that knows CFI stuff please investigate this a bit? > I'm not expert in this area, and I don't feel comfortable to approve > the patch that has some known loose edges in the area I don't know > that well. Fair enough. I'm not terribly familiar with the CFI bits either, but I was able to work this out. I've looked at gcc-6, gcc-7 and the trunk. None of them record anything WRT %eax at the restore point. If the CFA is the stack pointer both record that the CFA offset changed, but nothing about %eax. So this patch doesn't change change the accuracy of the CFI info. One might be able to argue that we should record the restores, but that seems like an independent issue to me. I'll post an updated patch momentarily. jeff
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index fef34a1..93ce79c 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -12567,22 +12567,25 @@ get_scratch_register_on_entry (struct scratch_reg *sr) } } -/* Release a scratch register obtained from the preceding function. */ +/* Release a scratch register obtained from the preceding function. + + Note there will always be some kind of stack adjustment between + allocation and releasing the scratch register. So we can't just + pop the scratch register off the stack if we were forced to save it + (the stack pointer itself has a different value). + + Instead we're passed the offset into the stack where the value will + be found and the space becomes part of the local frame that is + deallocated by the epilogue. */ static void -release_scratch_register_on_entry (struct scratch_reg *sr) +release_scratch_register_on_entry (struct scratch_reg *sr, HOST_WIDE_INT offset) { if (sr->saved) { - struct machine_function *m = cfun->machine; - rtx x, insn = emit_insn (gen_pop (sr->reg)); - - /* The RTX_FRAME_RELATED_P mechanism doesn't know about pop. */ - RTX_FRAME_RELATED_P (insn) = 1; - x = gen_rtx_PLUS (Pmode, stack_pointer_rtx, GEN_INT (UNITS_PER_WORD)); - x = gen_rtx_SET (stack_pointer_rtx, x); - add_reg_note (insn, REG_FRAME_RELATED_EXPR, x); - m->fs.sp_offset -= UNITS_PER_WORD; + rtx x = gen_rtx_PLUS (Pmode, stack_pointer_rtx, GEN_INT (offset)); + x = gen_rtx_SET (sr->reg, gen_rtx_MEM (word_mode, x)); + emit_insn (x); } } @@ -12597,7 +12600,7 @@ release_scratch_register_on_entry (struct scratch_reg *sr) pushed on the stack. */ static void -ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size, +ix86_adjust_stack_and_probe_stack_clash (HOST_WIDE_INT size, const bool int_registers_saved) { struct machine_function *m = cfun->machine; @@ -12713,6 +12716,12 @@ ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size, struct scratch_reg sr; get_scratch_register_on_entry (&sr); + /* If we needed to save a register, then account for any space + that was pushed (we are not going to pop the register when + we do the restore). */ + if (sr.saved) + size -= UNITS_PER_WORD; + /* Step 1: round SIZE down to a multiple of the interval. */ HOST_WIDE_INT rounded_size = size & -probe_interval; @@ -12761,7 +12770,9 @@ ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size, m->fs.cfa_reg == stack_pointer_rtx); dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size); - release_scratch_register_on_entry (&sr); + /* This does not deallocate the space reserved for the scratch + register. That will be deallocated in the epilogue. */ + release_scratch_register_on_entry (&sr, size); } /* Make sure nothing is scheduled before we are done. */ @@ -12774,7 +12785,7 @@ ix86_adjust_stack_and_probe_stack_clash (const HOST_WIDE_INT size, pushed on the stack. */ static void -ix86_adjust_stack_and_probe (const HOST_WIDE_INT size, +ix86_adjust_stack_and_probe (HOST_WIDE_INT size, const bool int_registers_saved) { /* We skip the probe for the first interval + a small dope of 4 words and @@ -12847,6 +12858,11 @@ ix86_adjust_stack_and_probe (const HOST_WIDE_INT size, get_scratch_register_on_entry (&sr); + /* If we needed to save a register, then account for any space + that was pushed (we are not going to pop the register when + we do the restore). */ + if (sr.saved) + size -= UNITS_PER_WORD; /* Step 1: round SIZE to the previous multiple of the interval. */ @@ -12906,7 +12922,9 @@ ix86_adjust_stack_and_probe (const HOST_WIDE_INT size, (get_probe_interval () + dope)))); - release_scratch_register_on_entry (&sr); + /* This does not deallocate the space reserved for the scratch + register. That will be deallocated in the epilogue. */ + release_scratch_register_on_entry (&sr, size); } /* Even if the stack pointer isn't the CFA register, we need to correctly @@ -13015,6 +13033,11 @@ ix86_emit_probe_stack_range (HOST_WIDE_INT first, HOST_WIDE_INT size, get_scratch_register_on_entry (&sr); + /* If we needed to save a register, then account for any space + that was pushed (we are not going to pop the register when + we do the restore). */ + if (sr.saved) + size -= UNITS_PER_WORD; /* Step 1: round SIZE to the previous multiple of the interval. */ @@ -13055,7 +13078,9 @@ ix86_emit_probe_stack_range (HOST_WIDE_INT first, HOST_WIDE_INT size, sr.reg), rounded_size - size)); - release_scratch_register_on_entry (&sr); + /* This does not deallocate the space reserved for the scratch + register. That will be deallocated in the epilogue. */ + release_scratch_register_on_entry (&sr, size); } /* Make sure nothing is scheduled before we are done. */ diff --git a/gcc/testsuite/gcc.target/i386/pr84128.c b/gcc/testsuite/gcc.target/i386/pr84128.c new file mode 100644 index 0000000..a8323fd6 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr84128.c @@ -0,0 +1,30 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -march=i686 -mtune=generic -fstack-clash-protection" } */ +/* { dg-require-effective-target ia32 } */ + +__attribute__ ((noinline, noclone, weak, regparm (3))) +int +f1 (long arg0, int (*pf) (long, void *)) +{ + unsigned char buf[32768]; + return pf (arg0, buf); +} + +__attribute__ ((noinline, noclone, weak)) +int +f2 (long arg0, void *ignored) +{ + if (arg0 != 17) + __builtin_abort (); + return 19; +} + +int +main (void) +{ + if (f1 (17, f2) != 19) + __builtin_abort (); + return 0; +} + +