
sparc64: Exclude perf user callchain during critical sections

Message ID: 1497416757-9894-1-git-send-email-rob.gardner@oracle.com
State: Changes Requested
Delegated to: David Miller

Commit Message

Rob Gardner June 14, 2017, 5:05 a.m. UTC
This fixes another cause of random segfaults and bus errors that
may occur while running perf with the call graph (-g) option.
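
For example, system-wide call-graph sampling along the lines of

        perf record -a -g -F 997 -- <some workload>

exercises the affected path; the workload and sampling frequency here
are only illustrative.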

Critical sections beginning with spin_lock_irqsave() raise the
interrupt level to PIL_NORMAL_MAX (14) and intentionally do not block
performance counter interrupts, which arrive at PIL_NMI (15).  So perf
code must be very careful about what it does since it might execute in
the middle of one of these critical sections. In particular, the
perf_callchain_user() path is problematic because it accesses user
space and may cause TLB activity as well as faults as it unwinds the
user stack.
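
For reference, the two interrupt levels involved are defined in
arch/sparc/include/asm/pil.h:

        #define PIL_NORMAL_MAX  14
        #define PIL_NMI         15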

One particular critical section occurs in switch_mm:

        spin_lock_irqsave(&mm->context.lock, flags);
        ...
        load_secondary_context(mm);
        tsb_context_switch(mm);
        ...
        spin_unlock_irqrestore(&mm->context.lock, flags);

If a perf interrupt arrives in between load_secondary_context() and
tsb_context_switch(), then perf_callchain_user() could execute with
the context ID of one process, but with an active tsb for a different
process. When the user stack is accessed, it is very likely to
incur a TLB miss, since the h/w context ID has been changed. The TLB
will then be reloaded with a translation from the TSB for one process,
but using a context ID for another process. This exposes memory from
one process to another, and since it is a mapping for stack memory,
this usually causes the new process to crash quickly.

Some potential solutions are:

1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
This would certainly eliminate the problem, but it would also prevent
perf from having any visibility into code running in these critical
sections, and it seems clear that PIL_NORMAL_MAX is used for just
this reason.

2) Protect this particular critical section by masking all interrupts,
either by setting %pil to PIL_NMI or by clearing pstate.ie around the
calls to load_secondary_context() and tsb_context_switch(). This approach
has a few drawbacks:

 - It would only address this particular critical section, and would
   have to be repeated in several other known places. There might be
   other such critical sections that are not known.

 - It has a performance cost which would be incurred at every context
   switch, since it would require additional accesses to %pil or
   %pstate.

 - Turning off pstate.ie would require changing __tsb_context_switch(),
   which expects to be called with pstate.ie on.

3) Avoid user space MMU activity entirely in perf_callchain_user() by
implementing a new copy_from_user() function that accesses the user
stack via physical addresses. This works, but requires quite a bit of
new code to get it to perform reasonably, e.g., caching of translations,
etc.

4) Allow the perf interrupt to happen in existing critical sections as
it does now, but have perf code detect that this is happening, and
skip any user callchain processing. This approach was deemed best, as
the change is extremely localized and covers both known and unknown
instances of perf interrupting critical sections. Perf has interrupted
a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
interrupt.

Ordinarily, a function that has a pt_regs passed in can simply examine
(regs->tstate & TSTATE_PIL) to extract the interrupt level that was in
effect at the time of the perf interrupt.  However, when the perf
interrupt occurs while executing in the kernel, the pt_regs pointer is
replaced with task_pt_regs(current) in perf_callchain(), and so
perf_callchain_user() does not have access to the state of the machine
at the time of the perf interrupt. To work around this, we check
(regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
into the arch-independent part of perf, and temporarily change the
event attributes so that user callchain processing is avoided. Though
a user stack sample is not collected, this loss is not statistically
significant. Kernel call graph collection is not affected.
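
In the patch below this boils down to recovering the saved %pil from
the trap state and comparing it against PIL_NORMAL_MAX:

        /* TSTATE_PIL holds the %pil in effect when the trap was taken */
        irq_disabled = ((regs->tstate & TSTATE_PIL) >> 20) == PIL_NORMAL_MAX;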

Orabug: 25577560

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/sparc/kernel/perf_event.c |   12 +++++++++++-
 1 files changed, 11 insertions(+), 1 deletions(-)

Comments

David Miller June 14, 2017, 4:25 p.m. UTC | #1
From: Rob Gardner <rob.gardner@oracle.com>
Date: Tue, 13 Jun 2017 23:05:57 -0600

> One particular critical section occurs in switch_mm:
> 
>         spin_lock_irqsave(&mm->context.lock, flags);
>         ...
>         load_secondary_context(mm);
>         tsb_context_switch(mm);
>         ...
>         spin_unlock_irqrestore(&mm->context.lock, flags);
> 
> If a perf interrupt arrives in between load_secondary_context() and
> tsb_context_switch(), then perf_callchain_user() could execute with
> the context ID of one process, but with an active tsb for a different
> process. When the user stack is accessed, it is very likely to
> incur a TLB miss, since the h/w context ID has been changed. The TLB
> will then be reloaded with a translation from the TSB for one process,
> but using a context ID for another process. This exposes memory from
> one process to another, and since it is a mapping for stack memory,
> this usually causes the new process to crash quickly.

Ahhh, good catch.

> Some potential solutions are:
> 
> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
> This would certainly eliminate the problem, but it would also prevent
> perf from having any visibility into code running in these critical
> sections, and it seems clear that PIL_NORMAL_MAX is used for just
> this reason.
> 
> 2) Protect this particular critical section by masking all interrupts,
> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
> calls to load_secondary_context() and tsb_context_switch(). This approach
> has a few drawbacks:
> 
>  - It would only address this particular critical section, and would
>    have to be repeated in several other known places. There might be
>    other such critical sections that are not known.
> 
>  - It has a performance cost which would be incurred at every context
>    switch, since it would require additional accesses to %pil or
>    %pstate.
> 
>  - Turning off pstate.ie would require changing __tsb_context_switch(),
>    which expects to be called with pstate.ie on.
> 
> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
> implementing a new copy_from_user() function that accesses the user
> stack via physical addresses. This works, but requires quite a bit of
> new code to get it to perform reasonably, ie, caching of translations,
> etc.
> 
> 4) Allow the perf interrupt to happen in existing critical sections as
> it does now, but have perf code detect that this is happening, and
> skip any user callchain processing. This approach was deemed best, as
> the change is extremely localized and covers both known and unknown
> instances of perf interrupting critical sections. Perf has interrupted
> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
> interrupt.
> 
> Ordinarily, a function that has a pt_regs passed in can simply examine
> (reg->tstate & TSTATE_PIL) to extract the interrupt level that was in
> effect at the time of the perf interrupt.  However, when the perf
> interrupt occurs while executing in the kernel, the pt_regs pointer is
> replaced with task_pt_regs(current) in perf_callchain(), and so
> perf_callchain_user() does not have access to the state of the machine
> at the time of the perf interrupt. To work around this, we check
> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
> in to the arch independent part of perf, and temporarily change the
> event attributes so that user callchain processing is avoided. Though
> a user stack sample is not collected, this loss is not statistically
> significant. Kernel call graph collection is not affected.

I tried to figure out how x86 handles this, and gave up after some
time :-)  Can you figure it out?

PowerPC handles this by hard disabling IRQs from switch_mm until
switch_to completes.

I think excluding all local_irq_disabled() paths from having stack
backtraces is not good.  I'd rather see PIL_MAX get set from switch_mm
until switch_to completes.

Ok?

Rob Gardner June 14, 2017, 4:47 p.m. UTC | #2
On 06/14/2017 10:25 AM, David Miller wrote:
> From: Rob Gardner <rob.gardner@oracle.com>
> Date: Tue, 13 Jun 2017 23:05:57 -0600
>
>> One particular critical section occurs in switch_mm:
>>
>>          spin_lock_irqsave(&mm->context.lock, flags);
>>          ...
>>          load_secondary_context(mm);
>>          tsb_context_switch(mm);
>>          ...
>>          spin_unlock_irqrestore(&mm->context.lock, flags);
>>
>> If a perf interrupt arrives in between load_secondary_context() and
>> tsb_context_switch(), then perf_callchain_user() could execute with
>> the context ID of one process, but with an active tsb for a different
>> process. When the user stack is accessed, it is very likely to
>> incur a TLB miss, since the h/w context ID has been changed. The TLB
>> will then be reloaded with a translation from the TSB for one process,
>> but using a context ID for another process. This exposes memory from
>> one process to another, and since it is a mapping for stack memory,
>> this usually causes the new process to crash quickly.
> Ahhh, good catch.
>
>> Some potential solutions are:
>>
>> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
>> This would certainly eliminate the problem, but it would also prevent
>> perf from having any visibility into code running in these critical
>> sections, and it seems clear that PIL_NORMAL_MAX is used for just
>> this reason.
>>
>> 2) Protect this particular critical section by masking all interrupts,
>> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
>> calls to load_secondary_context() and tsb_context_switch(). This approach
>> has a few drawbacks:
>>
>>   - It would only address this particular critical section, and would
>>     have to be repeated in several other known places. There might be
>>     other such critical sections that are not known.
>>
>>   - It has a performance cost which would be incurred at every context
>>     switch, since it would require additional accesses to %pil or
>>     %pstate.
>>
>>   - Turning off pstate.ie would require changing __tsb_context_switch(),
>>     which expects to be called with pstate.ie on.
>>
>> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
>> implementing a new copy_from_user() function that accesses the user
>> stack via physical addresses. This works, but requires quite a bit of
>> new code to get it to perform reasonably, ie, caching of translations,
>> etc.
>>
>> 4) Allow the perf interrupt to happen in existing critical sections as
>> it does now, but have perf code detect that this is happening, and
>> skip any user callchain processing. This approach was deemed best, as
>> the change is extremely localized and covers both known and unknown
>> instances of perf interrupting critical sections. Perf has interrupted
>> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
>> interrupt.
>>
>> Ordinarily, a function that has a pt_regs passed in can simply examine
>> (reg->tstate & TSTATE_PIL) to extract the interrupt level that was in
>> effect at the time of the perf interrupt.  However, when the perf
>> interrupt occurs while executing in the kernel, the pt_regs pointer is
>> replaced with task_pt_regs(current) in perf_callchain(), and so
>> perf_callchain_user() does not have access to the state of the machine
>> at the time of the perf interrupt. To work around this, we check
>> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
>> in to the arch independent part of perf, and temporarily change the
>> event attributes so that user callchain processing is avoided. Though
>> a user stack sample is not collected, this loss is not statistically
>> significant. Kernel call graph collection is not affected.
> I tried to figure out how x86 handles this, and gave up after some
> time :-)  Can you figure it out?

I could look.

>
> PowerPC handles this by hard disabling IRQs from switch_mm until
> switch_to completes.
>
> I think excluding all local_irq_disabled() paths from having stack
> backtraces is not good.

Just to clarify, this does not exclude local_irq_disabled() paths from 
having stack backtraces. It only disables collection of *user* stack 
backtraces during these (very rare) times. Collection of *kernel* stack 
traces is not affected, so we still get the same visibility from perf 
for kernel call graphs. I do not believe we sacrifice anything here in 
terms of user stack collection since we'll only skip doing this when:
  1) we're already executing in the kernel when the perf interrupt 
arrives, and this is already (what should be) a rare event
  2) and the kernel is executing in a local_irq_disabled() path, which
should also be very rare


>   I'd rather see PIL_MAX get set from switch_mm
> until switch_to completes.
>
> Ok?
>

In light of the above, what is your preference?

Rob


David Miller June 14, 2017, 5:05 p.m. UTC | #3
From: Rob Gardner <rob.gardner@oracle.com>
Date: Wed, 14 Jun 2017 10:47:08 -0600

> On 06/14/2017 10:25 AM, David Miller wrote:
>> From: Rob Gardner <rob.gardner@oracle.com>
>> Date: Tue, 13 Jun 2017 23:05:57 -0600
>>
>>> One particular critical section occurs in switch_mm:
>>>
>>>          spin_lock_irqsave(&mm->context.lock, flags);
>>>          ...
>>>          load_secondary_context(mm);
>>>          tsb_context_switch(mm);
>>>          ...
>>>          spin_unlock_irqrestore(&mm->context.lock, flags);
>>>
>>> If a perf interrupt arrives in between load_secondary_context() and
>>> tsb_context_switch(), then perf_callchain_user() could execute with
>>> the context ID of one process, but with an active tsb for a different
>>> process. When the user stack is accessed, it is very likely to
>>> incur a TLB miss, since the h/w context ID has been changed. The TLB
>>> will then be reloaded with a translation from the TSB for one process,
>>> but using a context ID for another process. This exposes memory from
>>> one process to another, and since it is a mapping for stack memory,
>>> this usually causes the new process to crash quickly.
>> Ahhh, good catch.
>>
>>> Some potential solutions are:
>>>
>>> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
>>> This would certainly eliminate the problem, but it would also prevent
>>> perf from having any visibility into code running in these critical
>>> sections, and it seems clear that PIL_NORMAL_MAX is used for just
>>> this reason.
>>>
>>> 2) Protect this particular critical section by masking all interrupts,
>>> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
>>> calls to load_secondary_context() and tsb_context_switch(). This
>>> approach
>>> has a few drawbacks:
>>>
>>>   - It would only address this particular critical section, and would
>>>     have to be repeated in several other known places. There might be
>>>     other such critical sections that are not known.
>>>
>>>   - It has a performance cost which would be incurred at every context
>>>     switch, since it would require additional accesses to %pil or
>>>     %pstate.
>>>
>>>   - Turning off pstate.ie would require changing __tsb_context_switch(),
>>>     which expects to be called with pstate.ie on.
>>>
>>> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
>>> implementing a new copy_from_user() function that accesses the user
>>> stack via physical addresses. This works, but requires quite a bit of
>>> new code to get it to perform reasonably, ie, caching of translations,
>>> etc.
>>>
>>> 4) Allow the perf interrupt to happen in existing critical sections as
>>> it does now, but have perf code detect that this is happening, and
>>> skip any user callchain processing. This approach was deemed best, as
>>> the change is extremely localized and covers both known and unknown
>>> instances of perf interrupting critical sections. Perf has interrupted
>>> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
>>> interrupt.
>>>
>>> Ordinarily, a function that has a pt_regs passed in can simply examine
>>> (reg->tstate & TSTATE_PIL) to extract the interrupt level that was in
>>> effect at the time of the perf interrupt.  However, when the perf
>>> interrupt occurs while executing in the kernel, the pt_regs pointer is
>>> replaced with task_pt_regs(current) in perf_callchain(), and so
>>> perf_callchain_user() does not have access to the state of the machine
>>> at the time of the perf interrupt. To work around this, we check
>>> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
>>> in to the arch independent part of perf, and temporarily change the
>>> event attributes so that user callchain processing is avoided. Though
>>> a user stack sample is not collected, this loss is not statistically
>>> significant. Kernel call graph collection is not affected.
>> I tried to figure out how x86 handles this, and gave up after some
>> time :-)  Can you figure it out?
> 
> I could look.
> 
>>
>> PowerPC handles this by hard disabling IRQs from switch_mm until
>> switch_to completes.
>>
>> I think excluding all local_irq_disabled() paths from having stack
>> backtraces is not good.
> 
> Just to clarify, this does not exclude local_irq_disabled() paths from
> having stack backtraces. It only disables collection of *user* stack
> backtraces during these (very rare) times. Collection of *kernel*
> stack traces is not affected, so we still get the same visibility from
> perf for kernel call graphs. I do not believe we sacrifice anything
> here in terms of user stack collection since we'll only skip doing
> this when:
>  1) we're already executing in the kernel when the perf interrupt
>  arrives, and this is already (what should be) a rare event
>  2) and kernel is executing in a local_irq_disabled() path, also should
>  be very rare

I understood this from the beginning.

>>   I'd rather see PIL_MAX get set from switch_mm
>> until switch_to completes.
>>
>> Ok?
>>
> 
> In light of the above, what is your preference?

I still have the same position.

That lack of user backtrace is a loss of information.  And in order to
make it work we must have the case that the PGD/TSB/mmu-context/stack
all match up precisely.
Rob Gardner June 14, 2017, 5:24 p.m. UTC | #4
On 06/14/2017 11:05 AM, David Miller wrote:
> From: Rob Gardner <rob.gardner@oracle.com>
> Date: Wed, 14 Jun 2017 10:47:08 -0600
>
>> On 06/14/2017 10:25 AM, David Miller wrote:
>>> From: Rob Gardner <rob.gardner@oracle.com>
>>> Date: Tue, 13 Jun 2017 23:05:57 -0600
>>>
>>>> One particular critical section occurs in switch_mm:
>>>>
>>>>           spin_lock_irqsave(&mm->context.lock, flags);
>>>>           ...
>>>>           load_secondary_context(mm);
>>>>           tsb_context_switch(mm);
>>>>           ...
>>>>           spin_unlock_irqrestore(&mm->context.lock, flags);
>>>>
>>>> If a perf interrupt arrives in between load_secondary_context() and
>>>> tsb_context_switch(), then perf_callchain_user() could execute with
>>>> the context ID of one process, but with an active tsb for a different
>>>> process. When the user stack is accessed, it is very likely to
>>>> incur a TLB miss, since the h/w context ID has been changed. The TLB
>>>> will then be reloaded with a translation from the TSB for one process,
>>>> but using a context ID for another process. This exposes memory from
>>>> one process to another, and since it is a mapping for stack memory,
>>>> this usually causes the new process to crash quickly.
>>> Ahhh, good catch.
>>>
>>>> Some potential solutions are:
>>>>
>>>> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
>>>> This would certainly eliminate the problem, but it would also prevent
>>>> perf from having any visibility into code running in these critical
>>>> sections, and it seems clear that PIL_NORMAL_MAX is used for just
>>>> this reason.
>>>>
>>>> 2) Protect this particular critical section by masking all interrupts,
>>>> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
>>>> calls to load_secondary_context() and tsb_context_switch(). This
>>>> approach
>>>> has a few drawbacks:
>>>>
>>>>    - It would only address this particular critical section, and would
>>>>      have to be repeated in several other known places. There might be
>>>>      other such critical sections that are not known.
>>>>
>>>>    - It has a performance cost which would be incurred at every context
>>>>      switch, since it would require additional accesses to %pil or
>>>>      %pstate.
>>>>
>>>>    - Turning off pstate.ie would require changing __tsb_context_switch(),
>>>>      which expects to be called with pstate.ie on.
>>>>
>>>> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
>>>> implementing a new copy_from_user() function that accesses the user
>>>> stack via physical addresses. This works, but requires quite a bit of
>>>> new code to get it to perform reasonably, ie, caching of translations,
>>>> etc.
>>>>
>>>> 4) Allow the perf interrupt to happen in existing critical sections as
>>>> it does now, but have perf code detect that this is happening, and
>>>> skip any user callchain processing. This approach was deemed best, as
>>>> the change is extremely localized and covers both known and unknown
>>>> instances of perf interrupting critical sections. Perf has interrupted
>>>> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
>>>> interrupt.
>>>>
>>>> Ordinarily, a function that has a pt_regs passed in can simply examine
>>>> (reg->tstate & TSTATE_PIL) to extract the interrupt level that was in
>>>> effect at the time of the perf interrupt.  However, when the perf
>>>> interrupt occurs while executing in the kernel, the pt_regs pointer is
>>>> replaced with task_pt_regs(current) in perf_callchain(), and so
>>>> perf_callchain_user() does not have access to the state of the machine
>>>> at the time of the perf interrupt. To work around this, we check
>>>> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
>>>> in to the arch independent part of perf, and temporarily change the
>>>> event attributes so that user callchain processing is avoided. Though
>>>> a user stack sample is not collected, this loss is not statistically
>>>> significant. Kernel call graph collection is not affected.
>>> I tried to figure out how x86 handles this, and gave up after some
>>> time :-)  Can you figure it out?
>> I could look.
>>
>>> PowerPC handles this by hard disabling IRQs from switch_mm until
>>> switch_to completes.
>>>
>>> I think excluding all local_irq_disabled() paths from having stack
>>> backtraces is not good.
>> Just to clarify, this does not exclude local_irq_disabled() paths from
>> having stack backtraces. It only disables collection of *user* stack
>> backtraces during these (very rare) times. Collection of *kernel*
>> stack traces is not affected, so we still get the same visibility from
>> perf for kernel call graphs. I do not believe we sacrifice anything
>> here in terms of user stack collection since we'll only skip doing
>> this when:
>>   1) we're already executing in the kernel when the perf interrupt
>>   arrives, and this is already (what should be) a rare event
>>   2) and kernel is executing in a local_irq_disabled() path, also should
>>   be very rare
> I understood this from the beginning.
>
>>>    I'd rather see PIL_MAX get set from switch_mm
>>> until switch_to completes.
>>>
>>> Ok?
>>>
>> In light of the above, what is your preference?
> I still have the same position.
>
> That lack of user backtrace is a loss of information.  And in order to
> make it work we must have the case that the PGD/TSB/mmu-context/stack
> all match up precisely.

OK, thanks for the guidance. We'll work up a new patch.

Rob

David Miller June 14, 2017, 6:29 p.m. UTC | #5
From: Rob Gardner <rob.gardner@oracle.com>
Date: Wed, 14 Jun 2017 11:24:36 -0600

> OK, thanks for the guidance. We'll work up a new patch.

Thanks a lot for working on this.
Eric Saint Etienne June 16, 2017, 9:01 a.m. UTC | #6
> I tried to figure out how x86 handles this, and gave up after some
> time :-)  Can you figure it out?

I've had a look. What I found is that on x86,
event->attr.exclude_callchain_user is not set anywhere under arch/x86/*.

But perf unconditionally uses the NMI-specific version of copy_from_user()
to prevent perf_callchain_user() from generating a fault:

arch/x86/include/asm/perf_event.h:
#define arch_perf_out_copy_user copy_from_user_nmi

Where copy_from_user_nmi() is defined in arch/x86/lib/usercopy.c as follows:

	/*
	 * We rely on the nested NMI work to allow atomic faults from the NMI
	 * path; the nested NMI paths are careful to preserve CR2.
	 */
	unsigned long
	copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
	{
		unsigned long ret;

		if (__range_not_ok(from, n, TASK_SIZE))
			return 0;

		/*
		 * Even though this function is typically called from NMI/IRQ context
		 * disable pagefaults so that its behaviour is consistent even when
		 * called form other contexts.
		 */
		pagefault_disable();
		ret = __copy_from_user_inatomic(to, from, n);
		pagefault_enable();

		return ret;
	}
	EXPORT_SYMBOL_GPL(copy_from_user_nmi);

And the kernel also falls back (for other archs) to a version very
similar to copy_from_user_nmi(), where faults are disabled, in
kernel/events/internal.h:

	#ifndef arch_perf_out_copy_user
	#define arch_perf_out_copy_user arch_perf_out_copy_user

	static inline unsigned long
	arch_perf_out_copy_user(void *dst, const void *src, unsigned long n)
	{
		unsigned long ret;

		pagefault_disable();
		ret = __copy_from_user_inatomic(dst, src, n);
		pagefault_enable();

		return ret;
	}
	#endif

	DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

I believe that's how faults are prevented in perf.
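
As a minimal, self-contained sketch of that pattern (the helper name
read_user_word_nofault() is made up here; only pagefault_disable()/
__copy_from_user_inatomic() come from the code quoted above), a
fault-proof user read from interrupt context looks roughly like:

	static unsigned long read_user_word_nofault(const void __user *uaddr,
						    unsigned long *val)
	{
		unsigned long ret;

		/* With page faults disabled, an unmapped access makes the
		 * copy return non-zero instead of being handled. */
		pagefault_disable();
		ret = __copy_from_user_inatomic(val, uaddr, sizeof(*val));
		pagefault_enable();

		return ret;	/* non-zero: the copy did not complete */
	}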

-eric

Patch

diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 710f327..5295dda 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1602,6 +1602,7 @@  static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 	struct perf_sample_data data;
 	struct cpu_hw_events *cpuc;
 	struct pt_regs *regs;
+	bool irq_disabled;
 	int i;
 
 	if (!atomic_read(&active_events))
@@ -1630,6 +1631,7 @@  static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 	    sparc_pmu->num_pcrs == 1)
 		pcr_ops->write_pcr(0, cpuc->pcr[0]);
 
+	irq_disabled = ((regs->tstate & TSTATE_PIL) >> 20) == PIL_NORMAL_MAX;
 	for (i = 0; i < cpuc->n_events; i++) {
 		struct perf_event *event = cpuc->event[i];
 		int idx = cpuc->current_idx[i];
@@ -1649,8 +1651,16 @@  static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 		if (!sparc_perf_event_set_period(event, hwc, idx))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (unlikely(irq_disabled)) {
+			u64 saved_exclude = event->attr.exclude_callchain_user;
+
+			event->attr.exclude_callchain_user = 1;
+			if (perf_event_overflow(event, &data, regs))
+				sparc_pmu_stop(event, 0);
+			event->attr.exclude_callchain_user = saved_exclude;
+		} else if (perf_event_overflow(event, &data, regs)) {
 			sparc_pmu_stop(event, 0);
+		}
 	}
 
 	return NOTIFY_STOP;