diff mbox

[RFC,v2,05/18] sched: add task flag for preempt IRQ tracking

Message ID 20160519231546.yvtqz5wacxvykvn2@treble (mailing list archive)
State Not Applicable
Headers show

Commit Message

Josh Poimboeuf May 19, 2016, 11:15 p.m. UTC
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs.  Currently it points to something stale (which the
> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function.  That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> 
> Let me try to understand.
> 
> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back.  Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> 
> We could change the entry code so that an interrupt / idtentry does:
> 
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> 
> or similar.  That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry".  Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem.  We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding.  This would be nice, and I think it could work.
> 
> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.
> But this is pretty simple and performance is already abysmal in most
> handlers.
> 
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.
> 
> I could be convinced either way.

Ok, I took a stab at this.  See the patch below.

In addition to annotating interrupt/exception pt_regs frames, I also
annotated all the syscall pt_regs, for consistency.

As you mentioned, it will affect performance a bit, but I think it will
be insignificant.

I think I like this approach better than putting the
interrupt/idtentry's in a special section, because this is much more
precise.  Especially now that I'm annotating pt_regs syscalls.

Also I think with a few minor changes we could implement your idea of
annotating the calls with what type of entry they are.  But I don't
think that's really needed, because the name of the interrupt/idtentry
is already on the stack trace.

Before:

  [<ffffffff8143c243>] dump_stack+0x85/0xc2
  [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
  [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
  [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
  [<ffffffff81887058>] async_page_fault+0x28/0x30
  [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
  [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
  [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
  [<ffffffff81285e32>] __vfs_read+0xe2/0x140
  [<ffffffff81286378>] vfs_read+0x98/0x140
  [<ffffffff812878c8>] SyS_read+0x58/0xc0
  [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd

After:

  [<ffffffff8143c243>] dump_stack+0x85/0xc2
  [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
  [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
  [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
  [<ffffffff81887422>] async_page_fault+0x32/0x40
  [<ffffffff81887861>] pt_regs+0x1/0x10
  [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
  [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
  [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
  [<ffffffff81285e32>] __vfs_read+0xe2/0x140
  [<ffffffff81286378>] vfs_read+0x98/0x140
  [<ffffffff812878c8>] SyS_read+0x58/0xc0
  [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
  [<ffffffff81887861>] pt_regs+0x1/0x10

Note this example is with today's unwinder.  It could be made smarter to
get the RIP from the pt_regs so the '?' could be removed from
copy_page_to_iter().

Thoughts?

Comments

Andy Lutomirski May 19, 2016, 11:39 p.m. UTC | #1
On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>>
>> I could be convinced either way.
>
> Ok, I took a stab at this.  See the patch below.
>
> In addition to annotating interrupt/exception pt_regs frames, I also
> annotated all the syscall pt_regs, for consistency.
>
> As you mentioned, it will affect performance a bit, but I think it will
> be insignificant.
>
> I think I like this approach better than putting the
> interrupt/idtentry's in a special section, because this is much more
> precise.  Especially now that I'm annotating pt_regs syscalls.
>
> Also I think with a few minor changes we could implement your idea of
> annotating the calls with what type of entry they are.  But I don't
> think that's really needed, because the name of the interrupt/idtentry
> is already on the stack trace.
>
> Before:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887058>] async_page_fault+0x28/0x30
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
>
> After:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887422>] async_page_fault+0x32/0x40
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>
> Note this example is with today's unwinder.  It could be made smarter to
> get the RIP from the pt_regs so the '?' could be removed from
> copy_page_to_iter().
>
> Thoughts?

I think we should do that.  The silly sample patch I sent you (or at
least that I think I sent you) did that, and it worked nicely.

>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 9a9e588..f54886a 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
>         .byte 0xf1
>         .endm
>
> +       /*
> +        * Create a stack frame for the saved pt_regs.  This allows frame
> +        * pointer based unwinders to find pt_regs on the stack.
> +        */
> +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> +#ifdef CONFIG_FRAME_POINTER
> +       pushq   \regs
> +       pushq   $pt_regs+1
> +       pushq   %rbp
> +       movq    %rsp, %rbp
> +#endif
> +       .endm

I don't love this part.  It's going to hurt performance, and, given
that we need to change the unwinder anyway to make it useful, let's
just emit a table somewhere in .rodata and use it directly.

> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>
> +       CREATE_PT_REGS_FRAME
>         /*
>          * This call instruction is handled specially in stub_ptregs_64.
>          * It might end up jumping to the slow path.  If it jumps, RAX
> @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
>         call    *sys_call_table(, %rax, 8)
>  .Lentry_SYSCALL_64_after_fastpath_call:
>
> +       REMOVE_PT_REGS_FRAME
> +
>         movq    %rax, RAX(%rsp)
>  1:

This one is particular is quite hot, so I'd much rather avoid it.
Andy Lutomirski May 23, 2016, 9:34 p.m. UTC | #2
On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>>
>> I could be convinced either way.
>
> Ok, I took a stab at this.  See the patch below.
>
> In addition to annotating interrupt/exception pt_regs frames, I also
> annotated all the syscall pt_regs, for consistency.
>
> As you mentioned, it will affect performance a bit, but I think it will
> be insignificant.
>
> I think I like this approach better than putting the
> interrupt/idtentry's in a special section, because this is much more
> precise.  Especially now that I'm annotating pt_regs syscalls.
>
> Also I think with a few minor changes we could implement your idea of
> annotating the calls with what type of entry they are.  But I don't
> think that's really needed, because the name of the interrupt/idtentry
> is already on the stack trace.
>
> Before:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887058>] async_page_fault+0x28/0x30
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
>
> After:
>
>   [<ffffffff8143c243>] dump_stack+0x85/0xc2
>   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
>   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
>   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
>   [<ffffffff81887422>] async_page_fault+0x32/0x40
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
>   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
>   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
>   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
>   [<ffffffff81286378>] vfs_read+0x98/0x140
>   [<ffffffff812878c8>] SyS_read+0x58/0xc0
>   [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
>   [<ffffffff81887861>] pt_regs+0x1/0x10
>
> Note this example is with today's unwinder.  It could be made smarter to
> get the RIP from the pt_regs so the '?' could be removed from
> copy_page_to_iter().
>
> Thoughts?

Maybe I'm coming around to liking this idea.

In an ideal world (DWARF support, high-quality unwinder, nice pretty
printer, etc), unwinding across a kernel exception would look like:

 - some_func
 - some_other_func
 - do_page_fault
 - page_fault

After page_fault, the next unwind step takes us to the faulting RIP
(faulting_func) and reports that all GPRs are known.  It should
probably learn this fact from DWARF if DWARF is available, instead of
directly decoding pt_regs, due to a few funny cases in which pt_regs
may be incomplete.  A nice pretty printer could now print all the
regs.

 - faulting_func
 - etc.

For this to work, we don't actually need the unwinder to explicitly
know where pt_regs is.

Food for thought, though: if user code does SYSENTER with TF set,
you'll end up with partial pt_regs.  There's nothing the kernel can do
about it.  DWARF will handle it without any fanfare, but I don't know
if it's going to cause trouble for you pre-DWARF.

I'm also not sure it makes sense to apply this before the unwinder
that can consume it is ready.  Maybe, if it would be consistent with
your plans, it would make sense to rewrite the unwinder first, then
apply this and teach live patching to use the new unwinder, and *then*
add DWARF support?


>
> +       /*
> +        * Create a stack frame for the saved pt_regs.  This allows frame
> +        * pointer based unwinders to find pt_regs on the stack.
> +        */
> +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> +#ifdef CONFIG_FRAME_POINTER
> +       pushq   \regs
> +       pushq   $pt_regs+1

Why the +1?

> +       pushq   %rbp
> +       movq    %rsp, %rbp
> +#endif
> +       .endm

I keep wanting this to be only two pushes and to fudge rbp to make it
work, but I don't see a good way.  But let's call it
CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
nested_frame or similar.

> +
> +       .macro CALL_HANDLER handler regs=%rsp
> +       CREATE_PT_REGS_FRAME \regs
> +       call    \handler
> +       REMOVE_PT_REGS_FRAME
> +       .endm

I think I'd rather open-code this everywhere.  It'll make it clearer
what's going on.

> @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>
> +       CREATE_PT_REGS_FRAME
>         /*
>          * This call instruction is handled specially in stub_ptregs_64.
>          * It might end up jumping to the slow path.  If it jumps, RAX
> @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
>         call    *sys_call_table(, %rax, 8)
>  .Lentry_SYSCALL_64_after_fastpath_call:
>
> +       REMOVE_PT_REGS_FRAME
> +

As discussed, let's get rid of this bit.

>         movq    %rax, RAX(%rsp)
>  1:
>
> @@ -238,14 +241,14 @@ entry_SYSCALL_64_fastpath:
>         ENABLE_INTERRUPTS(CLBR_NONE)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
> -       call    syscall_return_slowpath /* returns with IRQs disabled */
> +       CALL_HANDLER syscall_return_slowpath    /* returns with IRQs disabled */

and this.

>         jmp     return_from_SYSCALL_64
>
>  entry_SYSCALL64_slow_path:
>         /* IRQs are off. */
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
> -       call    do_syscall_64           /* returns with IRQs disabled */
> +       CALL_HANDLER do_syscall_64      /* returns with IRQs disabled */
>
>  return_from_SYSCALL_64:
>         RESTORE_EXTRA_REGS
> @@ -344,6 +347,7 @@ ENTRY(stub_ptregs_64)
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
>         popq    %rax
> +       REMOVE_PT_REGS_FRAME

This will be less mysterious if you open-code the macros.  Also, I
think you have to, some return_from_SYSCALL_64 needs to be directly
after the actual call instruction.  (But if you get rid of the hunks
above, I think this goes away too, so this may be moot.)

>  1:
> @@ -372,7 +376,7 @@ END(ptregs_\func)
>  ENTRY(ret_from_fork)
>         LOCK ; btr $TIF_FORK, TI_flags(%r8)
>
> -       call    schedule_tail                   /* rdi: 'prev' task parameter */
> +       CALL_HANDLER schedule_tail              /* rdi: 'prev' task parameter */
>

If you end up making the unwinder smart enough to notice that rsp is
just below pt_regs, then this can go away.  It's harmless, though.

>         testb   $3, CS(%rsp)                    /* from kernel_thread? */
>         jnz     1f
> @@ -385,8 +389,9 @@ ENTRY(ret_from_fork)
>          * parameter to be passed in RBP.  The called function is permitted
>          * to call do_execve and thereby jump to user mode.
>          */
> +       movq    RBX(%rsp), %rbx
>         movq    RBP(%rsp), %rdi
> -       call    *RBX(%rsp)
> +       CALL_HANDLER *%rbx

Does using a register like this actually save any code size?
Admittedly, it's a bit cleaner.

> +
> +/* fake function which allows stack unwinders to detect pt_regs frames */
> +#ifdef CONFIG_FRAME_POINTER
> +ENTRY(pt_regs)
> +       nop
> +       nop
> +ENDPROC(pt_regs)
> +#endif /* CONFIG_FRAME_POINTER */

Why is this two bytes long?  Is there some reason it has to be more
than one byte?

--Andy
Josh Poimboeuf May 24, 2016, 2:28 a.m. UTC | #3
On Mon, May 23, 2016 at 02:34:56PM -0700, Andy Lutomirski wrote:
> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs.  Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> >> > past the nested entry frame and keep going until it gets to the original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function.  That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back.  Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar.  That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry".  Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem.  We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding.  This would be nice, and I think it could work.
> >>
> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> offset from the C handler frame to pt_regs is a foregone conclusion.
> >> But this is pretty simple and performance is already abysmal in most
> >> handlers.
> >>
> >> There's an added benefit to using a separate section, though: we could
> >> also annotate the calls with what type of entry they were so the
> >> unwinder could print it out nicely.
> >>
> >> I could be convinced either way.
> >
> > Ok, I took a stab at this.  See the patch below.
> >
> > In addition to annotating interrupt/exception pt_regs frames, I also
> > annotated all the syscall pt_regs, for consistency.
> >
> > As you mentioned, it will affect performance a bit, but I think it will
> > be insignificant.
> >
> > I think I like this approach better than putting the
> > interrupt/idtentry's in a special section, because this is much more
> > precise.  Especially now that I'm annotating pt_regs syscalls.
> >
> > Also I think with a few minor changes we could implement your idea of
> > annotating the calls with what type of entry they are.  But I don't
> > think that's really needed, because the name of the interrupt/idtentry
> > is already on the stack trace.
> >
> > Before:
> >
> >   [<ffffffff8143c243>] dump_stack+0x85/0xc2
> >   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
> >   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
> >   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
> >   [<ffffffff81887058>] async_page_fault+0x28/0x30
> >   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
> >   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
> >   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
> >   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
> >   [<ffffffff81286378>] vfs_read+0x98/0x140
> >   [<ffffffff812878c8>] SyS_read+0x58/0xc0
> >   [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
> >
> > After:
> >
> >   [<ffffffff8143c243>] dump_stack+0x85/0xc2
> >   [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
> >   [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
> >   [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
> >   [<ffffffff81887422>] async_page_fault+0x32/0x40
> >   [<ffffffff81887861>] pt_regs+0x1/0x10
> >   [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
> >   [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
> >   [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
> >   [<ffffffff81285e32>] __vfs_read+0xe2/0x140
> >   [<ffffffff81286378>] vfs_read+0x98/0x140
> >   [<ffffffff812878c8>] SyS_read+0x58/0xc0
> >   [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
> >   [<ffffffff81887861>] pt_regs+0x1/0x10
> >
> > Note this example is with today's unwinder.  It could be made smarter to
> > get the RIP from the pt_regs so the '?' could be removed from
> > copy_page_to_iter().
> >
> > Thoughts?
> 
> Maybe I'm coming around to liking this idea.

Ok, good :-)

> In an ideal world (DWARF support, high-quality unwinder, nice pretty
> printer, etc), unwinding across a kernel exception would look like:
> 
>  - some_func
>  - some_other_func
>  - do_page_fault
>  - page_fault
> 
> After page_fault, the next unwind step takes us to the faulting RIP
> (faulting_func) and reports that all GPRs are known.  It should
> probably learn this fact from DWARF if DWARF is available, instead of
> directly decoding pt_regs, due to a few funny cases in which pt_regs
> may be incomplete.  A nice pretty printer could now print all the
> regs.
> 
>  - faulting_func
>  - etc.
> 
> For this to work, we don't actually need the unwinder to explicitly
> know where pt_regs is.

That's true (but only for DWARF).

> Food for thought, though: if user code does SYSENTER with TF set,
> you'll end up with partial pt_regs.  There's nothing the kernel can do
> about it.  DWARF will handle it without any fanfare, but I don't know
> if it's going to cause trouble for you pre-DWARF.

In this case it should see the stack pointer is past the pt_regs offset,
so it would just report it as an empty stack.

> I'm also not sure it makes sense to apply this before the unwinder
> that can consume it is ready.  Maybe, if it would be consistent with
> your plans, it would make sense to rewrite the unwinder first, then
> apply this and teach live patching to use the new unwinder, and *then*
> add DWARF support?

For the purposes of livepatch, the reliable unwinder needs to detect
whether an interrupt/exception pt_regs frame exists on a sleeping task
(or current).  This patch would allow us to do that.

So my preferred order of doing things would be:

1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
2) this patch for annotating pt_regs stack frames
3) reliable unwinder, similar to what I already posted, except it relies
   on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
   the new inactive task frame format of #1
4) livepatch consistency model which uses the reliable unwinder
5) rewrite unwinder, and port all users to the new interface
6) add DWARF unwinder

1-4 are pretty much already written, whereas 5 and 6 will take
considerably more work.

> > +       /*
> > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > +        * pointer based unwinders to find pt_regs on the stack.
> > +        */
> > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > +#ifdef CONFIG_FRAME_POINTER
> > +       pushq   \regs
> > +       pushq   $pt_regs+1
> 
> Why the +1?

Some unwinders like gdb are smart enough to report the function which
contains the instruction *before* the return address.  Without the +1,
they would show the wrong function.

> > +       pushq   %rbp
> > +       movq    %rsp, %rbp
> > +#endif
> > +       .endm
> 
> I keep wanting this to be only two pushes and to fudge rbp to make it
> work, but I don't see a good way.  But let's call it
> CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> nested_frame or similar.

Or, if we aren't going to annotate syscall pt_regs, we could give it a
more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?

> > +
> > +       .macro CALL_HANDLER handler regs=%rsp
> > +       CREATE_PT_REGS_FRAME \regs
> > +       call    \handler
> > +       REMOVE_PT_REGS_FRAME
> > +       .endm
> 
> I think I'd rather open-code this everywhere.  It'll make it clearer
> what's going on.

Ok.

> > @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
> >         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
> >         movq    %r10, %rcx
> >
> > +       CREATE_PT_REGS_FRAME
> >         /*
> >          * This call instruction is handled specially in stub_ptregs_64.
> >          * It might end up jumping to the slow path.  If it jumps, RAX
> > @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
> >         call    *sys_call_table(, %rax, 8)
> >  .Lentry_SYSCALL_64_after_fastpath_call:
> >
> > +       REMOVE_PT_REGS_FRAME
> > +
> 
> As discussed, let's get rid of this bit.

Yeah, it's fine with me to get rid of all the syscall stuff.

> 
> >         movq    %rax, RAX(%rsp)
> >  1:
> >
> > @@ -238,14 +241,14 @@ entry_SYSCALL_64_fastpath:
> >         ENABLE_INTERRUPTS(CLBR_NONE)
> >         SAVE_EXTRA_REGS
> >         movq    %rsp, %rdi
> > -       call    syscall_return_slowpath /* returns with IRQs disabled */
> > +       CALL_HANDLER syscall_return_slowpath    /* returns with IRQs disabled */
> 
> and this.

This will be gone...

> 
> >         jmp     return_from_SYSCALL_64
> >
> >  entry_SYSCALL64_slow_path:
> >         /* IRQs are off. */
> >         SAVE_EXTRA_REGS
> >         movq    %rsp, %rdi
> > -       call    do_syscall_64           /* returns with IRQs disabled */
> > +       CALL_HANDLER do_syscall_64      /* returns with IRQs disabled */
> >
> >  return_from_SYSCALL_64:
> >         RESTORE_EXTRA_REGS
> > @@ -344,6 +347,7 @@ ENTRY(stub_ptregs_64)
> >         DISABLE_INTERRUPTS(CLBR_NONE)
> >         TRACE_IRQS_OFF
> >         popq    %rax
> > +       REMOVE_PT_REGS_FRAME
> 
> This will be less mysterious if you open-code the macros.  Also, I
> think you have to, some return_from_SYSCALL_64 needs to be directly
> after the actual call instruction.  (But if you get rid of the hunks
> above, I think this goes away too, so this may be moot.)

and this...

> >  1:
> > @@ -372,7 +376,7 @@ END(ptregs_\func)
> >  ENTRY(ret_from_fork)
> >         LOCK ; btr $TIF_FORK, TI_flags(%r8)
> >
> > -       call    schedule_tail                   /* rdi: 'prev' task parameter */
> > +       CALL_HANDLER schedule_tail              /* rdi: 'prev' task parameter */
> >
> 
> If you end up making the unwinder smart enough to notice that rsp is
> just below pt_regs, then this can go away.  It's harmless, though.

and this...

> >         testb   $3, CS(%rsp)                    /* from kernel_thread? */
> >         jnz     1f
> > @@ -385,8 +389,9 @@ ENTRY(ret_from_fork)
> >          * parameter to be passed in RBP.  The called function is permitted
> >          * to call do_execve and thereby jump to user mode.
> >          */
> > +       movq    RBX(%rsp), %rbx
> >         movq    RBP(%rsp), %rdi
> > -       call    *RBX(%rsp)
> > +       CALL_HANDLER *%rbx
> 
> Does using a register like this actually save any code size?
> Admittedly, it's a bit cleaner.

and this.

(FWIW, I used a register because the assembler macro didn't seem to
support passing "*RBX(%rsp)" as an argument.)

> > +
> > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > +#ifdef CONFIG_FRAME_POINTER
> > +ENTRY(pt_regs)
> > +       nop
> > +       nop
> > +ENDPROC(pt_regs)
> > +#endif /* CONFIG_FRAME_POINTER */
> 
> Why is this two bytes long?  Is there some reason it has to be more
> than one byte?

Similar to above, this is related to the need to support various
unwinders.  Whether the unwinder displays the ret_addr or the
instruction preceding it, either way the instruction needs to be inside
the function for the function to be reported.
Andy Lutomirski May 24, 2016, 3:52 a.m. UTC | #4
On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> > Maybe I'm coming around to liking this idea.
>
> Ok, good :-)
>
> > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > printer, etc), unwinding across a kernel exception would look like:
> >
> >  - some_func
> >  - some_other_func
> >  - do_page_fault
> >  - page_fault
> >
> > After page_fault, the next unwind step takes us to the faulting RIP
> > (faulting_func) and reports that all GPRs are known.  It should
> > probably learn this fact from DWARF if DWARF is available, instead of
> > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > may be incomplete.  A nice pretty printer could now print all the
> > regs.
> >
> >  - faulting_func
> >  - etc.
> >
> > For this to work, we don't actually need the unwinder to explicitly
> > know where pt_regs is.
>
> That's true (but only for DWARF).
>
> > Food for thought, though: if user code does SYSENTER with TF set,
> > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > about it.  DWARF will handle it without any fanfare, but I don't know
> > if it's going to cause trouble for you pre-DWARF.
>
> In this case it should see the stack pointer is past the pt_regs offset,
> so it would just report it as an empty stack.

OK

>
> > I'm also not sure it makes sense to apply this before the unwinder
> > that can consume it is ready.  Maybe, if it would be consistent with
> > your plans, it would make sense to rewrite the unwinder first, then
> > apply this and teach live patching to use the new unwinder, and *then*
> > add DWARF support?
>
> For the purposes of livepatch, the reliable unwinder needs to detect
> whether an interrupt/exception pt_regs frame exists on a sleeping task
> (or current).  This patch would allow us to do that.
>
> So my preferred order of doing things would be:
>
> 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> 2) this patch for annotating pt_regs stack frames
> 3) reliable unwinder, similar to what I already posted, except it relies
>    on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>    the new inactive task frame format of #1
> 4) livepatch consistency model which uses the reliable unwinder
> 5) rewrite unwinder, and port all users to the new interface
> 6) add DWARF unwinder
>
> 1-4 are pretty much already written, whereas 5 and 6 will take
> considerably more work.

Fair enough.

>
> > > +       /*
> > > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > > +        * pointer based unwinders to find pt_regs on the stack.
> > > +        */
> > > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > +#ifdef CONFIG_FRAME_POINTER
> > > +       pushq   \regs
> > > +       pushq   $pt_regs+1
> >
> > Why the +1?
>
> Some unwinders like gdb are smart enough to report the function which
> contains the instruction *before* the return address.  Without the +1,
> they would show the wrong function.

Lovely.  Want to add a comment?

>
> > > +       pushq   %rbp
> > > +       movq    %rsp, %rbp
> > > +#endif
> > > +       .endm
> >
> > I keep wanting this to be only two pushes and to fudge rbp to make it
> > work, but I don't see a good way.  But let's call it
> > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > nested_frame or similar.
>
> Or, if we aren't going to annotate syscall pt_regs, we could give it a
> more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?

CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
too.  CREATE_PT_REGS_FRAME is probably fine.

> > > +
> > > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > > +#ifdef CONFIG_FRAME_POINTER
> > > +ENTRY(pt_regs)
> > > +       nop
> > > +       nop
> > > +ENDPROC(pt_regs)
> > > +#endif /* CONFIG_FRAME_POINTER */
> >
> > Why is this two bytes long?  Is there some reason it has to be more
> > than one byte?
>
> Similar to above, this is related to the need to support various
> unwinders.  Whether the unwinder displays the ret_addr or the
> instruction preceding it, either way the instruction needs to be inside
> the function for the function to be reported.

OK.

--Andy
diff mbox

Patch

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..f54886a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,32 @@  For 32-bit we have the following conventions - kernel is built with
 	.byte 0xf1
 	.endm
 
+	/*
+	 * Create a stack frame for the saved pt_regs.  This allows frame
+	 * pointer based unwinders to find pt_regs on the stack.
+	 */
+	.macro CREATE_PT_REGS_FRAME regs=%rsp
+#ifdef CONFIG_FRAME_POINTER
+	pushq	\regs
+	pushq	$pt_regs+1
+	pushq	%rbp
+	movq	%rsp, %rbp
+#endif
+	.endm
+
+	.macro REMOVE_PT_REGS_FRAME
+#ifdef CONFIG_FRAME_POINTER
+	popq	%rbp
+	addq	$0x10, %rsp
+#endif
+	.endm
+
+	.macro CALL_HANDLER handler regs=%rsp
+	CREATE_PT_REGS_FRAME \regs
+	call	\handler
+	REMOVE_PT_REGS_FRAME
+	.endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..8642984 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -199,6 +199,7 @@  entry_SYSCALL_64_fastpath:
 	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
 	movq	%r10, %rcx
 
+	CREATE_PT_REGS_FRAME
 	/*
 	 * This call instruction is handled specially in stub_ptregs_64.
 	 * It might end up jumping to the slow path.  If it jumps, RAX
@@ -207,6 +208,8 @@  entry_SYSCALL_64_fastpath:
 	call	*sys_call_table(, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
+	REMOVE_PT_REGS_FRAME
+
 	movq	%rax, RAX(%rsp)
 1:
 
@@ -238,14 +241,14 @@  entry_SYSCALL_64_fastpath:
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
 	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
+	CALL_HANDLER syscall_return_slowpath	/* returns with IRQs disabled */
 	jmp	return_from_SYSCALL_64
 
 entry_SYSCALL64_slow_path:
 	/* IRQs are off. */
 	SAVE_EXTRA_REGS
 	movq	%rsp, %rdi
-	call	do_syscall_64		/* returns with IRQs disabled */
+	CALL_HANDLER do_syscall_64	/* returns with IRQs disabled */
 
 return_from_SYSCALL_64:
 	RESTORE_EXTRA_REGS
@@ -344,6 +347,7 @@  ENTRY(stub_ptregs_64)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	popq	%rax
+	REMOVE_PT_REGS_FRAME
 	jmp	entry_SYSCALL64_slow_path
 
 1:
@@ -372,7 +376,7 @@  END(ptregs_\func)
 ENTRY(ret_from_fork)
 	LOCK ; btr $TIF_FORK, TI_flags(%r8)
 
-	call	schedule_tail			/* rdi: 'prev' task parameter */
+	CALL_HANDLER schedule_tail		/* rdi: 'prev' task parameter */
 
 	testb	$3, CS(%rsp)			/* from kernel_thread? */
 	jnz	1f
@@ -385,8 +389,9 @@  ENTRY(ret_from_fork)
 	 * parameter to be passed in RBP.  The called function is permitted
 	 * to call do_execve and thereby jump to user mode.
 	 */
+	movq	RBX(%rsp), %rbx
 	movq	RBP(%rsp), %rdi
-	call	*RBX(%rsp)
+	CALL_HANDLER *%rbx
 	movl	$0, RAX(%rsp)
 
 	/*
@@ -396,7 +401,7 @@  ENTRY(ret_from_fork)
 
 1:
 	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
+	CALL_HANDLER syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	SWAPGS
 	jmp	restore_regs_and_iret
@@ -468,7 +473,7 @@  END(irq_entries_start)
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
-	call	\func	/* rdi points to pt_regs */
+	CALL_HANDLER \func regs=%rdi
 	.endm
 
 	/*
@@ -495,7 +500,7 @@  ret_from_intr:
 	/* Interrupt came from user space */
 GLOBAL(retint_user)
 	mov	%rsp,%rdi
-	call	prepare_exit_to_usermode
+	CALL_HANDLER prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
 	SWAPGS
 	jmp	restore_regs_and_iret
@@ -509,7 +514,7 @@  retint_kernel:
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
-	call	preempt_schedule_irq
+	CALL_HANDLER preempt_schedule_irq
 	jmp	0b
 1:
 #endif
@@ -688,8 +693,6 @@  ENTRY(\sym)
 	.endif
 	.endif
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
-
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
 	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
@@ -701,7 +704,8 @@  ENTRY(\sym)
 	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
 	.endif
 
-	call	\do_sym
+	movq	%rsp, %rdi			/* pt_regs pointer */
+	CALL_HANDLER \do_sym
 
 	.if \shift_ist != -1
 	addq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
@@ -728,8 +732,6 @@  ENTRY(\sym)
 	call	sync_regs
 	movq	%rax, %rsp			/* switch stack */
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
-
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
 	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
@@ -737,7 +739,8 @@  ENTRY(\sym)
 	xorl	%esi, %esi			/* no error code */
 	.endif
 
-	call	\do_sym
+	movq	%rsp, %rdi			/* pt_regs pointer */
+	CALL_HANDLER \do_sym
 
 	jmp	error_exit			/* %ebx: no swapgs flag */
 	.endif
@@ -1174,7 +1177,7 @@  ENTRY(nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	do_nmi
+	CALL_HANDLER do_nmi
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1387,7 +1390,7 @@  end_repeat_nmi:
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	do_nmi
+	CALL_HANDLER do_nmi
 
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
@@ -1423,3 +1426,11 @@  ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+/* fake function which allows stack unwinders to detect pt_regs frames */
+#ifdef CONFIG_FRAME_POINTER
+ENTRY(pt_regs)
+	nop
+	nop
+ENDPROC(pt_regs)
+#endif /* CONFIG_FRAME_POINTER */