diff mbox

[2/2] PPC64-HWBKPT: Implement hw-breakpoints for PPC64

Message ID 20100414034827.GC6571@in.ibm.com (mailing list archive)
State Superseded
Headers show

Commit Message

K.Prasad April 14, 2010, 3:48 a.m. UTC
Implement perf-events based hw-breakpoint interfaces for PPC64 processors.
These interfaces help arbitrate requests from various users and schedules
them as appropriate.

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig                     |    1 
 arch/powerpc/include/asm/cputable.h      |    4 
 arch/powerpc/include/asm/hw_breakpoint.h |   45 ++++
 arch/powerpc/include/asm/processor.h     |    8 
 arch/powerpc/kernel/Makefile             |    1 
 arch/powerpc/kernel/hw_breakpoint.c      |  324 +++++++++++++++++++++++++++++++
 arch/powerpc/kernel/machine_kexec_64.c   |    3 
 arch/powerpc/kernel/process.c            |    6 
 arch/powerpc/kernel/ptrace.c             |   81 +++++++
 arch/powerpc/lib/Makefile                |    1 
 include/linux/hw_breakpoint.h            |    1 
 11 files changed, 475 insertions(+)

Comments

Paul Mackerras May 3, 2010, 6:23 a.m. UTC | #1
On Wed, Apr 14, 2010 at 09:18:27AM +0530, K.Prasad wrote:

> Implement perf-events based hw-breakpoint interfaces for PPC64 processors.
> These interfaces help arbitrate requests from various users and schedules
> them as appropriate.

[snip]

> --- /dev/null
> +++ linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
> @@ -0,0 +1,45 @@
> +#ifndef _PPC_BOOK3S_64_HW_BREAKPOINT_H
> +#define _PPC_BOOK3S_64_HW_BREAKPOINT_H
> +
> +#ifdef	__KERNEL__
> +#define	__ARCH_HW_BREAKPOINT_H

This symbol doesn't seem to be referenced anywhere.  Is it really
necessary to define it?  I know x86 and sh do, but maybe it's a
leftover there.

> ===================================================================
> --- /dev/null
> +++ linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c

> +	switch (bp->attr.bp_type) {
> +	case HW_BREAKPOINT_R:
> +		info->type = DABR_DATA_READ;
> +		break;
> +	case HW_BREAKPOINT_W:
> +		info->type = DABR_DATA_WRITE;
> +		break;
> +	case HW_BREAKPOINT_R | HW_BREAKPOINT_W:
> +		info->type = (DABR_DATA_READ | DABR_DATA_WRITE);
> +		break;
> +	default:
> +		return ret;
> +	}

You have code like this in several places -- maybe this should be
encapsulated in a helper function.  Also, I think it could be written
more compactly.

> +	/*
> +	 * Return early after invoking user-callback function without restoring
> +	 * DABR if the breakpoint is from ptrace which always operates in
> +	 * one-shot mode
> +	 */
> +	if (is_ptrace_bp == true) {
> +		perf_bp_event(bp, regs);
> +		rc = NOTIFY_DONE;
> +		goto out;
> +	}

As far as I can see, returning NOTIFY_DONE won't tell do_dabr() that
you have handled the exception, and it will go on to do the
debugger_dabr_match() call and generate a signal to the current
process.  Is that what you want?  If so a comment would be useful.

Also, the " == true" part is completely redundant.  Normal kernel
style would be if (is_ptrace_bp) { ..., and similarly the
(is_ptrace_bp == false) above should be !is_ptrace_bp.

> +
> +	/*
> +	 * Do not emulate user-space instructions from kernel-space,
> +	 * instead single-step them.
> +	 */
> +	if (is_kernel == false) {
> +		current->thread.last_hit_ubp = bp;
> +		regs->msr |= MSR_SE;
> +		goto out;
> +	}

I'm a bit worried about what could happen from here on.  We go back
out to userspace and try to execute the load or store.  Besides
executing successfully and taking the trace interrupt, there are
several other possibilities:

- we could get an alignment interrupt
- we could get a page fault (page not present or protection violation)
- we could get a MMU hash table miss (unlikely since the low-level
  code calls hash_page before do_dabr, but possible since the HPTE
  could in principle have been evicted in the meantime).
- we could even get an illegal or privileged instruction trap --
  if some other thread had changed the instruction in the meantime.

The alignment interrupt case is mostly OK, except that it will call
emulate_single_step after emulating the load or store, which doesn't
do quite what we want -- unlike single_step_exception(), it doesn't
call notify_die(), so we won't get back into
single_step_dabr_instruction(), but instead it just sends a SIGTRAP.
That needs to be fixed, but note that single_step_exception() only
applies to "classic"/server processors at present because it hits the
MSR SE and BE bits directly rather than using clear_single_step().

The MMU hash table miss case looks to be OK -- we'll just return to
userspace and re-execute the instruction, with MSR[SE] still set.

The page fault case should be OK if it just results in inserting a PTE
(possibly after some context switches) or in the process being
terminated.  If, however, it results in a signal we could end up with
a stale value in current->thread.last_hit_ubp if the signal handler
doesn't return to the instruction (and there is no requirement for it
to do so).  If the process later gets single-stepped for any reason it
could get misinterpreted, and we could end up accessing freed memory
if the perf_event associated with bp has been closed and freed in the
meantime.

Similarly, if the instruction has been changed underneath us, we would
end up with current->thread.last_hit_ubp being stale.  We do need to
handle this case if only to ensure that we don't create a
vulnerability that could be used to attack the kernel.

> +
> +	stepped = emulate_step(regs, regs->nip);
> +	/* emulate_step() could not execute it, single-step them */

Note that at present, emulate_step() will never emulate a load or
store, so this will always return 0 unless someone has changed the
instruction underneath us.

> +	if (stepped == 0) {
> +		regs->msr |= MSR_SE;
> +		per_cpu(last_hit_bp, cpu) = bp;
> +		goto out;
> +	}

This is subject to most of the same possibilities as the user address
case.  We need to make sure that if we ever get the situation where
the instruction is never going to be executed then we clear
last_hit_bp for this cpu.  By the way, __get_cpu_var(x) is more
efficient than per_cpu(x, cpu) and is equivalent if cpu ==
smp_processor_id().

> +	/*
> +	 * Do not disable MSR_SE if the process was already in
> +	 * single-stepping mode. We cannot reliable detect single-step mode
> +	 * for kernel-space breakpoints, so this cannot work along with other
> +	 * debuggers (like KGDB, xmon) which may be single-stepping kernel code.
> +	 */
> +	if (!(user_bp && test_thread_flag(TIF_SINGLESTEP)))
> +		regs->msr &= ~MSR_SE;
> +
> +	/* Deliver signal to user-space */
> +	if (user_bp) {
> +		info.si_signo = SIGTRAP;
> +		info.si_errno = 0;
> +		info.si_code = TRAP_HWBKPT;
> +		info.si_addr = (void __user *)bp_info->address;
> +		force_sig_info(SIGTRAP, &info, current);
> +	}

Why are we delivering a signal to userspace here?  This seems
unnecessary to me.

Ensuring that we get control back reliably after executing the
instruction, or when we get to the point where the instruction will
never be executed, seems to be difficult to get right in all corner
cases.  It may be better to just emulate the load or store
unconditionally.  We don't currently have code to do that, but
fix_alignment() is pretty close to what we would need (it would need
to handle lbz/stb and it would need to use 2, 4 or 8-byte loads/stores
rather than 1-byte loads/stores in the aligned case).

Paul.
K.Prasad May 4, 2010, 8:33 p.m. UTC | #2
On Mon, May 03, 2010 at 04:23:30PM +1000, Paul Mackerras wrote:
> On Wed, Apr 14, 2010 at 09:18:27AM +0530, K.Prasad wrote:
> 
> > Implement perf-events based hw-breakpoint interfaces for PPC64 processors.
> > These interfaces help arbitrate requests from various users and schedules
> > them as appropriate.
> 
> [snip]
> 

Hi Paul,
	Thanks for the detailed review. Please find my replies inline.

> > --- /dev/null
> > +++ linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
> > @@ -0,0 +1,45 @@
> > +#ifndef _PPC_BOOK3S_64_HW_BREAKPOINT_H
> > +#define _PPC_BOOK3S_64_HW_BREAKPOINT_H
> > +
> > +#ifdef	__KERNEL__
> > +#define	__ARCH_HW_BREAKPOINT_H
> 
> This symbol doesn't seem to be referenced anywhere.  Is it really
> necessary to define it?  I know x86 and sh do, but maybe it's a
> leftover there.
> 

Yes, either I use _PPC_BOOK3S_64_HW_BREAKPOINT_H or
__ARCH_HW_BREAKPOINT_H to prevent recursive definitions, not both. I will
remove __ARCH_HW_BREAKPOINT_H.

> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
> 
> > +	switch (bp->attr.bp_type) {
> > +	case HW_BREAKPOINT_R:
> > +		info->type = DABR_DATA_READ;
> > +		break;
> > +	case HW_BREAKPOINT_W:
> > +		info->type = DABR_DATA_WRITE;
> > +		break;
> > +	case HW_BREAKPOINT_R | HW_BREAKPOINT_W:
> > +		info->type = (DABR_DATA_READ | DABR_DATA_WRITE);
> > +		break;
> > +	default:
> > +		return ret;
> > +	}
> 
> You have code like this in several places -- maybe this should be
> encapsulated in a helper function.  Also, I think it could be written
> more compactly.
> 

Actually this particular code-snippet differs slightly from two such
'similar' switch-cases found in ptrace_set_debugreg() - which I will
replace with a helper function. The above-cited switch-case statement
which maps the generic breakpoint types to arch-specific bit-encodings
are used only once (as above) and will not help code-reduction if moved
to a separate function.

> > +	/*
> > +	 * Return early after invoking user-callback function without restoring
> > +	 * DABR if the breakpoint is from ptrace which always operates in
> > +	 * one-shot mode
> > +	 */
> > +	if (is_ptrace_bp == true) {
> > +		perf_bp_event(bp, regs);
> > +		rc = NOTIFY_DONE;
> > +		goto out;
> > +	}
> 
> As far as I can see, returning NOTIFY_DONE won't tell do_dabr() that
> you have handled the exception, and it will go on to do the
> debugger_dabr_match() call and generate a signal to the current
> process.  Is that what you want?  If so a comment would be useful.
> 

Yes, the debugger_dabr_match() will return early without doing much and
the signal is generated for the ptrace-ed process. I will append the
comment to appear as below:

	/*
	 * Return early after invoking user-callback function without restoring
	 * DABR if the breakpoint is from ptrace which always operates in
	 * one-shot mode. The ptrace-ed process will receive the SIGTRAP signal
	 * generated in do_dabr().
	 */

> Also, the " == true" part is completely redundant.  Normal kernel
> style would be if (is_ptrace_bp) { ..., and similarly the
> (is_ptrace_bp == false) above should be !is_ptrace_bp.
>

It is an untidy habit that I've had, personally, for
increased readability! I will remove such redundancies from the code.

> > +
> > +	/*
> > +	 * Do not emulate user-space instructions from kernel-space,
> > +	 * instead single-step them.
> > +	 */
> > +	if (is_kernel == false) {
> > +		current->thread.last_hit_ubp = bp;
> > +		regs->msr |= MSR_SE;
> > +		goto out;
> > +	}
> 
> I'm a bit worried about what could happen from here on.  We go back
> out to userspace and try to execute the load or store.  Besides
> executing successfully and taking the trace interrupt, there are
> several other possibilities:
>

The aim, during design time, has been to reliably restore the breakpoint
values immediately after execution of the instruction or in the
worst-case, to cause loss of breakpoints rather than ignore the other
sources of exceptions. Please find more responses below.
 
> - we could get an alignment interrupt
> - we could get a page fault (page not present or protection violation)
> - we could get a MMU hash table miss (unlikely since the low-level
>   code calls hash_page before do_dabr, but possible since the HPTE
>   could in principle have been evicted in the meantime).
> - we could even get an illegal or privileged instruction trap --
>   if some other thread had changed the instruction in the meantime.
> 
> The alignment interrupt case is mostly OK, except that it will call
> emulate_single_step after emulating the load or store, which doesn't
> do quite what we want -- unlike single_step_exception(), it doesn't
> call notify_die(), so we won't get back into
> single_step_dabr_instruction(), but instead it just sends a SIGTRAP.
> That needs to be fixed, but note that single_step_exception() only
> applies to "classic"/server processors at present because it hits the
> MSR SE and BE bits directly rather than using clear_single_step().
> 

It is true that the breakpoint exceptions will go amiss following the
alignment exception, and be restored when the thread single-steps due
to other requests causing undesirable effects. (Borrowing from some of
the discussions I had with BenH, earlier) There can be two ways of
changing the implementation to counter it:

- To sense that the impending exception (alignment, page-fault,
  single-step) is a successor to a hw-breakpoint exception (and that
  restoration of debug register values is necessary), somewhere early in
  exceptions-64s.S and jump to a common handler, say
  do_single_step_dabr() which does a majority of
  single_step_dabr_instruction().
- To modify emulate_single_step() to also do a notify_die(DIE_SSTEP,...)
  in addition to its existing code. This would invoke
  single_step_dabr_instruction() where the breakpoints can be restored.

I would prefer the second method, given that it would be easier to implement.

> The MMU hash table miss case looks to be OK -- we'll just return to
> userspace and re-execute the instruction, with MSR[SE] still set.
> 
> The page fault case should be OK if it just results in inserting a PTE
> (possibly after some context switches) or in the process being
> terminated.  If, however, it results in a signal we could end up with
> a stale value in current->thread.last_hit_ubp if the signal handler
> doesn't return to the instruction (and there is no requirement for it
> to do so).  If the process later gets single-stepped for any reason it
> could get misinterpreted, and we could end up accessing freed memory
> if the perf_event associated with bp has been closed and freed in the
> meantime.
> 
> Similarly, if the instruction has been changed underneath us, we would
> end up with current->thread.last_hit_ubp being stale.  We do need to
> handle this case if only to ensure that we don't create a
> vulnerability that could be used to attack the kernel.
> 

I must admit that it is not clear to me when you say "doesn't return to
the instruction" and "instruction has been changed underneath". Are you
referring to the fact that the thread which generated breakpoints hits
could have moved out of the CPU due to a scheduler induced context
switch (which is an apparent cause for current->thread.last_hit_ubp to
turn stale) or is there a different source for such a change that I
don't realise?

Given that 'last_hit_ubp' is safely ensconced inside 'thread_struct',
the ill-effects of a possible pre-emption during a page-fault will be
suitably handled i.e. the pending single-step exception will be
generated on the processor to which 'current' is migrated to, and the
breakpoint will be set on the new processor.

However, the possibility that current->thread.last_hit_ubp points to a
perf_event structure that is unregistered and freed does exist, and I
did not foresee the risk. An arch-specific function that hooks onto
release_bp_slot() would be required to perform the cleanup. I will
submit modify the patch to that effect. Thanks for pointing it out.

> > +
> > +	stepped = emulate_step(regs, regs->nip);
> > +	/* emulate_step() could not execute it, single-step them */
> 
> Note that at present, emulate_step() will never emulate a load or
> store, so this will always return 0 unless someone has changed the
> instruction underneath us.
> 

In conjunction with what you've stated below, do you suggest that
emulate_step() be replaced with fix_alignment() which appears to be more
powerful at emulation (or carve out a helper function for fix_alignment()
that does only emulation and which can be invoked here)?

> > +	if (stepped == 0) {
> > +		regs->msr |= MSR_SE;
> > +		per_cpu(last_hit_bp, cpu) = bp;
> > +		goto out;
> > +	}
> 
> This is subject to most of the same possibilities as the user address
> case.  We need to make sure that if we ever get the situation where
> the instruction is never going to be executed then we clear
> last_hit_bp for this cpu.  By the way, __get_cpu_var(x) is more
> efficient than per_cpu(x, cpu) and is equivalent if cpu ==
> smp_processor_id().
> 

The possible kernel vulnerabilities (you mentioned above) can be
mitigated, by clearing 'last_hit_bp' upon unregistration of the
breakpoint request (through a hook in release_bp_slot()). However I'd
prefer to defer code that would increase the reliability of
hw-breakpoint restoration (upon a hit) to future patches.

Also since cpu == smp_processor_id() here, I'll use __get_cpu_var(x).

> > +	/*
> > +	 * Do not disable MSR_SE if the process was already in
> > +	 * single-stepping mode. We cannot reliable detect single-step mode
> > +	 * for kernel-space breakpoints, so this cannot work along with other
> > +	 * debuggers (like KGDB, xmon) which may be single-stepping kernel code.
> > +	 */
> > +	if (!(user_bp && test_thread_flag(TIF_SINGLESTEP)))
> > +		regs->msr &= ~MSR_SE;
> > +
> > +	/* Deliver signal to user-space */
> > +	if (user_bp) {
> > +		info.si_signo = SIGTRAP;
> > +		info.si_errno = 0;
> > +		info.si_code = TRAP_HWBKPT;
> > +		info.si_addr = (void __user *)bp_info->address;
> > +		force_sig_info(SIGTRAP, &info, current);
> > +	}
> 
> Why are we delivering a signal to userspace here?  This seems
> unnecessary to me.
>

I was under the wrong impression that such a behaviour existed in the
x86 version (which was true in some of its older versions), and hence
carried over here.

I will remove the signal delivery code. It is always possible for
the hw-breakpoint's callback to generate the signal, if desired.
 
> Ensuring that we get control back reliably after executing the
> instruction, or when we get to the point where the instruction will
> never be executed, seems to be difficult to get right in all corner
> cases.  It may be better to just emulate the load or store
> unconditionally.  We don't currently have code to do that, but
> fix_alignment() is pretty close to what we would need (it would need
> to handle lbz/stb and it would need to use 2, 4 or 8-byte loads/stores
> rather than 1-byte loads/stores in the aligned case).
> 

It has been pointed out to me before (Roland's mail Ref:linuxppc-dev
message-id: 20100119100335.3EB621DE@magilla.sf.frob.com) that there will
be too many corner cases that will be difficult to foresee, however your
above list appears to be exhaustive. While the alternatives to this being
a fallback to one-shot breakpoints (thereby leading to confusing
hw-breakpoint interface semantics), this is an attempt to generate
continuous and 'trigger-after-execute' (for non-ptrace requests)
breakpoint exceptions. I believe that, with the addressal of concerns
cited above, the resultant patchset would be one that achieves the
stated design goals with no loss to existing functionality.

I intend to send out another version of this patchset with fixes as
described in my replies above (unless I hear objections to it :-)).

Thanking you again for the very insightful comments!

-- K.Prasad
K.Prasad May 12, 2010, 3:34 a.m. UTC | #3
On Wed, May 05, 2010 at 02:03:03AM +0530, K.Prasad wrote:
> On Mon, May 03, 2010 at 04:23:30PM +1000, Paul Mackerras wrote:
> > On Wed, Apr 14, 2010 at 09:18:27AM +0530, K.Prasad wrote:
> > 
[snipped]
> 
> It has been pointed out to me before (Roland's mail Ref:linuxppc-dev
> message-id: 20100119100335.3EB621DE@magilla.sf.frob.com) that there will
> be too many corner cases that will be difficult to foresee, however your
> above list appears to be exhaustive. While the alternatives to this being
> a fallback to one-shot breakpoints (thereby leading to confusing
> hw-breakpoint interface semantics), this is an attempt to generate
> continuous and 'trigger-after-execute' (for non-ptrace requests)
> breakpoint exceptions. I believe that, with the addressal of concerns
> cited above, the resultant patchset would be one that achieves the
> stated design goals with no loss to existing functionality.
> 

Hi Paul,

> I intend to send out another version of this patchset with fixes as
> described in my replies above (unless I hear objections to it :-)).
> 

Meanwhile, a little sickness had kept me away from working on this
patchset. I have now posted a new version of the same here () which
contains changes as described above.

A few more changes to the patch is impending post merger
of Frederic's patches (which are now in -tip) into mainline (ref: commit
73266fc1df2f94cf72b3beba3eee3b88ed0b0664 to
777d0411cd1e384115985dac5ccd42031e3eee2b);
mainly due to the new ability for a per-task breakpoint to request
kernel-space breakpoints (the notion of kernel- vs user-bp would also
become obsolete, it is better to call them per-cpu vs per-task
breakpoints).

Also, I find that possibility of a kernel-thread specific breakpoint
(which can migrate across CPUs) has not been thought and implemented well
in this patch (will be much easier after merger of Frederic's patch). I
would prefer to have atleast some version of the patch included in
mainline before bringing in support for the same.

Thanks,
K.Prasad
Paul Mackerras May 12, 2010, 6:32 a.m. UTC | #4
On Wed, May 05, 2010 at 02:03:03AM +0530, K.Prasad wrote:

> It is true that the breakpoint exceptions will go amiss following the
> alignment exception, and be restored when the thread single-steps due
> to other requests causing undesirable effects. (Borrowing from some of
> the discussions I had with BenH, earlier) There can be two ways of
> changing the implementation to counter it:
> 
> - To sense that the impending exception (alignment, page-fault,
>   single-step) is a successor to a hw-breakpoint exception (and that
>   restoration of debug register values is necessary), somewhere early in
>   exceptions-64s.S and jump to a common handler, say
>   do_single_step_dabr() which does a majority of
>   single_step_dabr_instruction().
> - To modify emulate_single_step() to also do a notify_die(DIE_SSTEP,...)
>   in addition to its existing code. This would invoke
>   single_step_dabr_instruction() where the breakpoints can be restored.

I thought you would change the explicit regs->msr modification in
single_step_exception() to clear_single_step(), then just make
emulate_single_step() call single_step_exception().

> I must admit that it is not clear to me when you say "doesn't return to
> the instruction" and "instruction has been changed underneath". Are you

Suppose the address at which the data breakpoint has been unmapped,
and the process has a handler for the SIGSEGV signal.  When we try to
single-step the load or store, we will get a DSI (0x300) interrupt,
call into do_page_fault, and end up sending the process a SIGSEGV.
That will invoke the signal handler, which can then do anything it
likes.  It can do a blocking system call, it can longjmp() back into
its main event, or it can return from the signal handler.  Only in the
last case will it retry the load or store, and then only if the signal
handler hasn't modified the NIP value in the signal frame.  That's
what I mean by "doesn't return to the instruction".

The instruction could be changed underneath us if the program is
multi-threaded and another thread writes another instruction to the
instruction word where the load or store is.  Or it could use mmap()
to map some other page at the address of the load or store.  Either
way we could end up with a different instruction there.

> referring to the fact that the thread which generated breakpoints hits
> could have moved out of the CPU due to a scheduler induced context
> switch (which is an apparent cause for current->thread.last_hit_ubp to
> turn stale) or is there a different source for such a change that I
> don't realise?
> 
> Given that 'last_hit_ubp' is safely ensconced inside 'thread_struct',
> the ill-effects of a possible pre-emption during a page-fault will be
> suitably handled i.e. the pending single-step exception will be
> generated on the processor to which 'current' is migrated to, and the
> breakpoint will be set on the new processor.

If we do get a context switch, e.g. as a result of a page fault, and
then switch back to the task, it looks to me like we will end up with
MSR_SE and DABR both set.  I don't suppose that will actually cause
any real problem besides double-counting the hit.

> However, the possibility that current->thread.last_hit_ubp points to a
> perf_event structure that is unregistered and freed does exist, and I
> did not foresee the risk. An arch-specific function that hooks onto
> release_bp_slot() would be required to perform the cleanup. I will
> submit modify the patch to that effect. Thanks for pointing it out.

Yes, I think we need that.

> In conjunction with what you've stated below, do you suggest that
> emulate_step() be replaced with fix_alignment() which appears to be more
> powerful at emulation (or carve out a helper function for fix_alignment()
> that does only emulation and which can be invoked here)?

Something like that eventually, but not for a first pass.

Paul.
K.Prasad May 14, 2010, 6:55 a.m. UTC | #5
On Wed, May 12, 2010 at 04:32:47PM +1000, Paul Mackerras wrote:
> On Wed, May 05, 2010 at 02:03:03AM +0530, K.Prasad wrote:
> 
> > It is true that the breakpoint exceptions will go amiss following the
> > alignment exception, and be restored when the thread single-steps due
> > to other requests causing undesirable effects. (Borrowing from some of
> > the discussions I had with BenH, earlier) There can be two ways of
> > changing the implementation to counter it:
> > 
> > - To sense that the impending exception (alignment, page-fault,
> >   single-step) is a successor to a hw-breakpoint exception (and that
> >   restoration of debug register values is necessary), somewhere early in
> >   exceptions-64s.S and jump to a common handler, say
> >   do_single_step_dabr() which does a majority of
> >   single_step_dabr_instruction().
> > - To modify emulate_single_step() to also do a notify_die(DIE_SSTEP,...)
> >   in addition to its existing code. This would invoke
> >   single_step_dabr_instruction() where the breakpoints can be restored.
> 
> I thought you would change the explicit regs->msr modification in
> single_step_exception() to clear_single_step(), then just make
> emulate_single_step() call single_step_exception().
>

Okay. I will re-use single_step_exception() after modifications; it
appearsto have no in-kernel users for it. (single_step_exception() clears
MSR more than what clear_single_step() does, it shouldn't matter though).

> > I must admit that it is not clear to me when you say "doesn't return to
> > the instruction" and "instruction has been changed underneath". Are you
> 
> Suppose the address at which the data breakpoint has been unmapped,
> and the process has a handler for the SIGSEGV signal.  When we try to
> single-step the load or store, we will get a DSI (0x300) interrupt,
> call into do_page_fault, and end up sending the process a SIGSEGV.
> That will invoke the signal handler, which can then do anything it
> likes.  It can do a blocking system call, it can longjmp() back into
> its main event, or it can return from the signal handler.  Only in the
> last case will it retry the load or store, and then only if the signal
> handler hasn't modified the NIP value in the signal frame.  That's
> what I mean by "doesn't return to the instruction".
> 

At the outset, this seemed to be a scary thing to happen; but turns out
to be harmful only to the extent of generating a false hw-breakpoint
exception in certain cases. A case-by-case basis analysis reveals thus:

Consider an instruction stream i1, i2, i3, ... iN, where i1 has
finished execution and i2 is about to be executed but has generated a
DSI interrupt with the above-mentioned conditions i.e. DSISR indicates a
DABR match + Page-Table entry not found. Now according to do_hash_page
in exception-64s.S (as pasted below), do_page_fault() and do_dabr() are
invoked one after the other.

_STATIC(do_hash_page)
	std	r3,_DAR(r1)
	std	r4,_DSISR(r1)

	andis.	r0,r4,0xa410		/* weird error? */
	bne-	handle_page_fault	/* if not, try to insert a HPTE */
	andis.  r0,r4,DSISR_DABRMATCH@h
	bne-    handle_dabr_fault

Thus, when control returns to user-space to instruction 'i2', the
hw_breakpoint_handler() has completed execution, and a SIGSEGV is pending
to be delivered and single-stepping enabled MSR_SE is set. Upon return to
user-space, the handler for SIGSEGV is executed and it may perform one of
the following (as you stated previously):
(a) Make a blocking syscall, eventually yielding the CPU to a new thread
(b) Jump to a different instruction in user-space, say iN, and not complete
the execution of instruction i2 at all.
(c) Return to instruction i2 and complete the execution.

In case of (a), the context-switches should not affect the ability to
single-step the instruction when the thread is eventually scheduled to
run. The thread, when scheduled onto the CPU will complete signal
handling, return to execute instruction i2, cause single-step exception,
restore breakpoints and run smoothly thereafter.

In case of (b), the new instruction iN is single-stepped, the breakpoint
values are restored and the hw-breakpoint exception callback is invoked
after iN is executed. The user of this breakpoint i.e. the caller of
register_user_hw_breakpoint() who had placed a breakpoint on addressed
accessed by instruction i2 will be confused to find that an unrelated
instruction (which may not be a load/store) has caused the breakpoint.

If so desired, we may adopt the 'trigger-before-execute' semantics for
user-space breakpoints wherein the hw-breakpoint callback (through
perf_bp_event()) is invoked in hw_breakpoint_handler() itself. This
would indicate to the user that the impending instruction causes a DABR
'hit' but it may or may not be executed due to the role of
signal-handler or due to self-modifying code (as mentioned below).

Kindly let me know what you think about it.

(c) is the normal execution path we desire. The instruction i2 will be
safely single-stepped and breakpoints are restored.

> The instruction could be changed underneath us if the program is
> multi-threaded and another thread writes another instruction to the
> instruction word where the load or store is.  Or it could use mmap()
> to map some other page at the address of the load or store.  Either
> way we could end up with a different instruction there.
> 

If the instruction that originally caused the DABR exception is changed,
the new instruction in its place would still single-step to restore
breakpoint values. However the user of breakpoint interface will be
confused to find that the callback is invoked for an irrelevant
instruction.

It could be circumvented, to an extent, through the use of
trigger-before-execute semantics (as described before).

> > referring to the fact that the thread which generated breakpoints hits
> > could have moved out of the CPU due to a scheduler induced context
> > switch (which is an apparent cause for current->thread.last_hit_ubp to
> > turn stale) or is there a different source for such a change that I
> > don't realise?
> > 
> > Given that 'last_hit_ubp' is safely ensconced inside 'thread_struct',
> > the ill-effects of a possible pre-emption during a page-fault will be
> > suitably handled i.e. the pending single-step exception will be
> > generated on the processor to which 'current' is migrated to, and the
> > breakpoint will be set on the new processor.
> 
> If we do get a context switch, e.g. as a result of a page fault, and
> then switch back to the task, it looks to me like we will end up with
> MSR_SE and DABR both set.  I don't suppose that will actually cause
> any real problem besides double-counting the hit.
> 

Page fault exception will be handled before hw_breakpoint_handler(),
hence MSR_SE would not have been set if a context-switch happened in
pange-fault handling itself. I don't see a case where both MSR_SE and
DABR will be set together.

> > However, the possibility that current->thread.last_hit_ubp points to a
> > perf_event structure that is unregistered and freed does exist, and I
> > did not foresee the risk. An arch-specific function that hooks onto
> > release_bp_slot() would be required to perform the cleanup. I will
> > submit modify the patch to that effect. Thanks for pointing it out.
> 
> Yes, I think we need that.
> 

The same is implemented through arch_unregister_hw_breakpoint()
in version XVIII of the patch here: linuxppc-dev message-id:
20100512033315.GC6384@in.ibm.com.

> > In conjunction with what you've stated below, do you suggest that
> > emulate_step() be replaced with fix_alignment() which appears to be more
> > powerful at emulation (or carve out a helper function for fix_alignment()
> > that does only emulation and which can be invoked here)?
> 
> Something like that eventually, but not for a first pass.
> 
> Paul.

Thanks for the comments. Let me know if the analysis above is incorrect
or if I've failed to recognise any important issue that you pointed out.
I will send out a patch with changes to emulate_single_step() in the
next version of the patchset, if I don't hear any further comments.

Thanks,
K.Prasad
Paul Mackerras May 17, 2010, 12:32 p.m. UTC | #6
On Fri, May 14, 2010 at 12:25:31PM +0530, K.Prasad wrote:

> Okay. I will re-use single_step_exception() after modifications; it
> appearsto have no in-kernel users for it.

It's called from exceptions-64s.S, head_32.S and head_8xx.S in
arch/powerpc/kernel.

> > Suppose the address at which the data breakpoint has been unmapped,
> > and the process has a handler for the SIGSEGV signal.  When we try to
> > single-step the load or store, we will get a DSI (0x300) interrupt,
> > call into do_page_fault, and end up sending the process a SIGSEGV.
> > That will invoke the signal handler, which can then do anything it
> > likes.  It can do a blocking system call, it can longjmp() back into
> > its main event, or it can return from the signal handler.  Only in the
> > last case will it retry the load or store, and then only if the signal
> > handler hasn't modified the NIP value in the signal frame.  That's
> > what I mean by "doesn't return to the instruction".
> > 
> 
> At the outset, this seemed to be a scary thing to happen; but turns out
> to be harmful only to the extent of generating a false hw-breakpoint
> exception in certain cases. A case-by-case basis analysis reveals thus:
> 
> Consider an instruction stream i1, i2, i3, ... iN, where i1 has
> finished execution and i2 is about to be executed but has generated a
> DSI interrupt with the above-mentioned conditions i.e. DSISR indicates a
> DABR match + Page-Table entry not found. Now according to do_hash_page
> in exception-64s.S (as pasted below), do_page_fault() and do_dabr() are
> invoked one after the other.
> 
> _STATIC(do_hash_page)
> 	std	r3,_DAR(r1)
> 	std	r4,_DSISR(r1)
> 
> 	andis.	r0,r4,0xa410		/* weird error? */
> 	bne-	handle_page_fault	/* if not, try to insert a HPTE */
> 	andis.  r0,r4,DSISR_DABRMATCH@h
> 	bne-    handle_dabr_fault

Note that bne is not a procedure call; we'll actually get two DSIs in
this scenario.  But I don't think that matters.  Also note that the
branch to handle_page_fault here is not for the HPTE-not-found case;
it's for the unusual errors.  So we'll handle the HPTE insertion after
handling the DABR match.

> Thus, when control returns to user-space to instruction 'i2', the
> hw_breakpoint_handler() has completed execution, and a SIGSEGV is pending
> to be delivered and single-stepping enabled MSR_SE is set. Upon return to
> user-space, the handler for SIGSEGV is executed and it may perform one of
> the following (as you stated previously):
> (a) Make a blocking syscall, eventually yielding the CPU to a new thread
> (b) Jump to a different instruction in user-space, say iN, and not complete
> the execution of instruction i2 at all.
> (c) Return to instruction i2 and complete the execution.
> 
> In case of (a), the context-switches should not affect the ability to
> single-step the instruction when the thread is eventually scheduled to
> run. The thread, when scheduled onto the CPU will complete signal
> handling, return to execute instruction i2, cause single-step exception,
> restore breakpoints and run smoothly thereafter.

Right.  However, the thread is running the signal handler without the
DABR being set, which is unfortunate.

> In case of (b), the new instruction iN is single-stepped, the breakpoint
> values are restored and the hw-breakpoint exception callback is invoked
> after iN is executed. The user of this breakpoint i.e. the caller of
> register_user_hw_breakpoint() who had placed a breakpoint on addressed
> accessed by instruction i2 will be confused to find that an unrelated
> instruction (which may not be a load/store) has caused the breakpoint.

That's the case if the signal handler modifies the NIP value in the
register set saved on the stack and returns.  If the signal handler
instead simply jumps to instruction iN (e.g. with longjmp or
siglongjmp), we'll never get the single-step callback.

> If so desired, we may adopt the 'trigger-before-execute' semantics for
> user-space breakpoints wherein the hw-breakpoint callback (through
> perf_bp_event()) is invoked in hw_breakpoint_handler() itself. This
> would indicate to the user that the impending instruction causes a DABR
> 'hit' but it may or may not be executed due to the role of
> signal-handler or due to self-modifying code (as mentioned below).
> 
> Kindly let me know what you think about it.
> 
> (c) is the normal execution path we desire. The instruction i2 will be
> safely single-stepped and breakpoints are restored.
> 
> > The instruction could be changed underneath us if the program is
> > multi-threaded and another thread writes another instruction to the
> > instruction word where the load or store is.  Or it could use mmap()
> > to map some other page at the address of the load or store.  Either
> > way we could end up with a different instruction there.
> > 
> 
> If the instruction that originally caused the DABR exception is changed,
> the new instruction in its place would still single-step to restore
> breakpoint values. However the user of breakpoint interface will be
> confused to find that the callback is invoked for an irrelevant
> instruction.
> 
> It could be circumvented, to an extent, through the use of
> trigger-before-execute semantics (as described before).

I don't think we want to do trigger-before-execute.  Ideally what we
want to ensure at all times is that either DABR is set (enabled) or
MSR.SE is set, but not both.  To ensure that we'd have to modify the
signal delivery code and possibly other places.

> > If we do get a context switch, e.g. as a result of a page fault, and
> > then switch back to the task, it looks to me like we will end up with
> > MSR_SE and DABR both set.  I don't suppose that will actually cause
> > any real problem besides double-counting the hit.
> > 
> 
> Page fault exception will be handled before hw_breakpoint_handler(),
> hence MSR_SE would not have been set if a context-switch happened in
> pange-fault handling itself. I don't see a case where both MSR_SE and
> DABR will be set together.

Imagine this scenario: we get the DABR match, set MSR_SE and return to
the task.  In the meantime another higher-priority task has become
runnable and our need_resched flag is set, so we call schedule() on
the way back out to usermode.  The other task runs and then blocks and
our task gets scheduled again.  As part of the context switch,
arch_install_hw_breakpoint() will get called and will set DABR.  So
we'll return to usermode with both DABR and MSE_SE set.

> Thanks for the comments. Let me know if the analysis above is incorrect
> or if I've failed to recognise any important issue that you pointed out.
> I will send out a patch with changes to emulate_single_step() in the
> next version of the patchset, if I don't hear any further comments.

We haven't discussed at all the case where the breakpoint is a per-cpu
breakpoint or where it's a per-task breakpoint but the DABR match
occurs within the kernel -- which can happen, even for a user address,
in __get_user, __put_user, __copy_tofrom_user, etc.  If the access
there is to a bad address, we'll invoke the exception case in
bad_page_fault(), which looks to be another place where we need to
recognize that single-stepping won't succeed and reinstall the DABR
setting.  Do we count that as an event or not? - I'm not sure.

Paul.
K.Prasad May 20, 2010, 4:06 a.m. UTC | #7
On Mon, May 17, 2010 at 10:32:41PM +1000, Paul Mackerras wrote:
> On Fri, May 14, 2010 at 12:25:31PM +0530, K.Prasad wrote:
> 
> > Okay. I will re-use single_step_exception() after modifications; it
> > appearsto have no in-kernel users for it.
> 
> It's called from exceptions-64s.S, head_32.S and head_8xx.S in
> arch/powerpc/kernel.
> 
> > > Suppose the address at which the data breakpoint has been unmapped,
> > > and the process has a handler for the SIGSEGV signal.  When we try to
> > > single-step the load or store, we will get a DSI (0x300) interrupt,
> > > call into do_page_fault, and end up sending the process a SIGSEGV.
> > > That will invoke the signal handler, which can then do anything it
> > > likes.  It can do a blocking system call, it can longjmp() back into
> > > its main event, or it can return from the signal handler.  Only in the
> > > last case will it retry the load or store, and then only if the signal
> > > handler hasn't modified the NIP value in the signal frame.  That's
> > > what I mean by "doesn't return to the instruction".
> > > 
> > 
> > At the outset, this seemed to be a scary thing to happen; but turns out
> > to be harmful only to the extent of generating a false hw-breakpoint
> > exception in certain cases. A case-by-case basis analysis reveals thus:
> > 
> > Consider an instruction stream i1, i2, i3, ... iN, where i1 has
> > finished execution and i2 is about to be executed but has generated a
> > DSI interrupt with the above-mentioned conditions i.e. DSISR indicates a
> > DABR match + Page-Table entry not found. Now according to do_hash_page
> > in exception-64s.S (as pasted below), do_page_fault() and do_dabr() are
> > invoked one after the other.
> > 
> > _STATIC(do_hash_page)
> > 	std	r3,_DAR(r1)
> > 	std	r4,_DSISR(r1)
> > 
> > 	andis.	r0,r4,0xa410		/* weird error? */
> > 	bne-	handle_page_fault	/* if not, try to insert a HPTE */
> > 	andis.  r0,r4,DSISR_DABRMATCH@h
> > 	bne-    handle_dabr_fault
> 
> Note that bne is not a procedure call; we'll actually get two DSIs in
> this scenario.  But I don't think that matters.  Also note that the
> branch to handle_page_fault here is not for the HPTE-not-found case;
> it's for the unusual errors.  So we'll handle the HPTE insertion after
> handling the DABR match.
> 

Okay.

> > Thus, when control returns to user-space to instruction 'i2', the
> > hw_breakpoint_handler() has completed execution, and a SIGSEGV is pending
> > to be delivered and single-stepping enabled MSR_SE is set. Upon return to
> > user-space, the handler for SIGSEGV is executed and it may perform one of
> > the following (as you stated previously):
> > (a) Make a blocking syscall, eventually yielding the CPU to a new thread
> > (b) Jump to a different instruction in user-space, say iN, and not complete
> > the execution of instruction i2 at all.
> > (c) Return to instruction i2 and complete the execution.
> > 
> > In case of (a), the context-switches should not affect the ability to
> > single-step the instruction when the thread is eventually scheduled to
> > run. The thread, when scheduled onto the CPU will complete signal
> > handling, return to execute instruction i2, cause single-step exception,
> > restore breakpoints and run smoothly thereafter.
> 
> Right.  However, the thread is running the signal handler without the
> DABR being set, which is unfortunate.
> 

In order to keep the breakpoint active during signal handling, a
PowerPC specific signal handling code, say do_signal_pending() in
arch/powerpc/kernel/signal.c, must be tapped to check for/restore
the breakpoint for the process (if it exists).

Then, single_step_dabr_instruction() must be suitably modified to not
clear the current->thread.last_hit_ubp pointer if breakpoint has been
taken in a nested condition i.e. a breakpoint exception in signal-handler
which was preceded by a previous breakpoint exception in normal user-space
stack. A flag to denote such a condition would be required in
'struct thread_struct'.

I'm afraid if this is more complexity than we want to handle in the
first patchset. I agree that this will create a 'blind-spot' of code
which cannot not be monitored using breakpoints and may limit debugging
efforts (specially for memory corruption); however suspecting that signal
handlers (especially those that don't return to original instruction)
would be rare, I feel that this could be a 'feature' that can be brought
later-in. What do you think?

> > In case of (b), the new instruction iN is single-stepped, the breakpoint
> > values are restored and the hw-breakpoint exception callback is invoked
> > after iN is executed. The user of this breakpoint i.e. the caller of
> > register_user_hw_breakpoint() who had placed a breakpoint on addressed
> > accessed by instruction i2 will be confused to find that an unrelated
> > instruction (which may not be a load/store) has caused the breakpoint.
> 
> That's the case if the signal handler modifies the NIP value in the
> register set saved on the stack and returns.  If the signal handler
> instead simply jumps to instruction iN (e.g. with longjmp or
> siglongjmp), we'll never get the single-step callback.
> 

True. As described above, this will lead to unreliable restoration of
breakpoints. I'm not sure if this is a common occurrence in user-space,
but if rare, I think we should be able to tolerate this behaviour for now.

> > If so desired, we may adopt the 'trigger-before-execute' semantics for
> > user-space breakpoints wherein the hw-breakpoint callback (through
> > perf_bp_event()) is invoked in hw_breakpoint_handler() itself. This
> > would indicate to the user that the impending instruction causes a DABR
> > 'hit' but it may or may not be executed due to the role of
> > signal-handler or due to self-modifying code (as mentioned below).
> > 
> > Kindly let me know what you think about it.
> > 
> > (c) is the normal execution path we desire. The instruction i2 will be
> > safely single-stepped and breakpoints are restored.
> > 
> > > The instruction could be changed underneath us if the program is
> > > multi-threaded and another thread writes another instruction to the
> > > instruction word where the load or store is.  Or it could use mmap()
> > > to map some other page at the address of the load or store.  Either
> > > way we could end up with a different instruction there.
> > > 
> > 
> > If the instruction that originally caused the DABR exception is changed,
> > the new instruction in its place would still single-step to restore
> > breakpoint values. However the user of breakpoint interface will be
> > confused to find that the callback is invoked for an irrelevant
> > instruction.
> > 
> > It could be circumvented, to an extent, through the use of
> > trigger-before-execute semantics (as described before).
> 
> I don't think we want to do trigger-before-execute.  Ideally what we
> want to ensure at all times is that either DABR is set (enabled) or
> MSR.SE is set, but not both.  To ensure that we'd have to modify the
> signal delivery code and possibly other places.
> 

The case where both DABR and MSR_SE are set is when a page-fault
resulting in context-switch occurs before single-stepping, right? I have
responded to that below. The problem with signal-handling, as I
understand, is unreliable restoration of/lost breakpoints and I wish to
deal them in future patchsets. Let me know if my understanding is incorrect.

> > > If we do get a context switch, e.g. as a result of a page fault, and
> > > then switch back to the task, it looks to me like we will end up with
> > > MSR_SE and DABR both set.  I don't suppose that will actually cause
> > > any real problem besides double-counting the hit.
> > > 
> > 
> > Page fault exception will be handled before hw_breakpoint_handler(),
> > hence MSR_SE would not have been set if a context-switch happened in
> > pange-fault handling itself. I don't see a case where both MSR_SE and
> > DABR will be set together.
> 
> Imagine this scenario: we get the DABR match, set MSR_SE and return to
> the task.  In the meantime another higher-priority task has become
> runnable and our need_resched flag is set, so we call schedule() on
> the way back out to usermode.  The other task runs and then blocks and
> our task gets scheduled again.  As part of the context switch,
> arch_install_hw_breakpoint() will get called and will set DABR.  So
> we'll return to usermode with both DABR and MSE_SE set.
> 

I didn't foresee such a possibility. I think this can be handled by
putting a check in arch_install_hw_breakpoint() as shown below:

int arch_install_hw_breakpoint(struct perf_event *bp)
{
	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
	struct perf_event **slot = &__get_cpu_var(bp_per_reg);

	*slot = bp;
	if (!current->thread.last_hit_ubp)
		set_dabr(info->address | info->type | DABR_TRANSLATION);
	return 0;
}

This should prevent DABR and MSR_SE from being enabled at the same time.

> > Thanks for the comments. Let me know if the analysis above is incorrect
> > or if I've failed to recognise any important issue that you pointed out.
> > I will send out a patch with changes to emulate_single_step() in the
> > next version of the patchset, if I don't hear any further comments.
> 
> We haven't discussed at all the case where the breakpoint is a per-cpu
> breakpoint or where it's a per-task breakpoint but the DABR match
> occurs within the kernel -- which can happen, even for a user address,
> in __get_user, __put_user, __copy_tofrom_user, etc.  If the access
> there is to a bad address, we'll invoke the exception case in
> bad_page_fault(), which looks to be another place where we need to
> recognize that single-stepping won't succeed and reinstall the DABR
> setting.  Do we count that as an event or not? - I'm not sure.
>

I'll respond to this part in the subsequent mail.

Thanks,
K.Prasad
Paul Mackerras May 20, 2010, 1:10 p.m. UTC | #8
On Thu, May 20, 2010 at 09:36:03AM +0530, K.Prasad wrote:

> > Right.  However, the thread is running the signal handler without the
> > DABR being set, which is unfortunate.
> > 
> 
> In order to keep the breakpoint active during signal handling, a
> PowerPC specific signal handling code, say do_signal_pending() in
> arch/powerpc/kernel/signal.c, must be tapped to check for/restore
> the breakpoint for the process (if it exists).

What I would suggest is something slightly different: anything that
causes the thread to change where it's executing -- signal delivery,
modification of NIP with ptrace -- should cancel the single-step and
reinstall the breakpoint in the DABR.  In other words we just forget
that we hit the breakpoint, and rely on hitting it again if we ever
get back to that instruction.  I think that is by far the most
reliable approach.

That means that the hw-breakpoint code needs to export a function
called, say, thread_change_pc(), which is called whenever we're
changing a thread's userspace PC (NIP) value.  If the hw-breakpoint
code has put that thread into single-step mode, we cancel the
single-step and if the thread is current, set DABR.

> I'm afraid if this is more complexity than we want to handle in the
> first patchset. I agree that this will create a 'blind-spot' of code
> which cannot not be monitored using breakpoints and may limit debugging
> efforts (specially for memory corruption); however suspecting that signal
> handlers (especially those that don't return to original instruction)
> would be rare, I feel that this could be a 'feature' that can be brought
> later-in. What do you think?

I think the thread_change_pc() approach is quite a bit simpler, and I
think it's better to get this right at the outset rather than have it
cause bugs later on, when we've all forgotten all the curly
details. :)

> > Imagine this scenario: we get the DABR match, set MSR_SE and return to
> > the task.  In the meantime another higher-priority task has become
> > runnable and our need_resched flag is set, so we call schedule() on
> > the way back out to usermode.  The other task runs and then blocks and
> > our task gets scheduled again.  As part of the context switch,
> > arch_install_hw_breakpoint() will get called and will set DABR.  So
> > we'll return to usermode with both DABR and MSE_SE set.
> > 
> 
> I didn't foresee such a possibility. I think this can be handled by
> putting a check in arch_install_hw_breakpoint() as shown below:
> 
> int arch_install_hw_breakpoint(struct perf_event *bp)
> {
> 	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
> 	struct perf_event **slot = &__get_cpu_var(bp_per_reg);
> 
> 	*slot = bp;
> 	if (!current->thread.last_hit_ubp)
> 		set_dabr(info->address | info->type | DABR_TRANSLATION);
> 	return 0;
> }

Yes, basically, but don't we need to handle per-cpu breakpoints as
well?  That is, we only want the extra check if this breakpoint is a
per-task breakpoint.  Or am I not seeing enough context here?

Also, I have just posted extensions to emulate_single_step() to handle
loads and stores.  I think this should be enough to handle DABR
interrupts that occur in kernel mode, so we should never need to
actually single-step in that case -- if emulate_step fails we should
just print a warning, uninstall the breakpoint, and continue.  That
way we don't have to worry about all the interrupts and other
eventualities that could occur while single-stepping in the kernel.

For DABR interrupts that occur in user mode, I think the approach of
single-stepping together with calls to thread_change_pc() in the
signal delivery and ptrace code should handle all the cases, at least
for per-task breakpoints.  Per-cpu breakpoints that get hit in user
mode will be a bit trickier, but I assume we can restrict per-cpu
breakpoints to kernel addresses for now.  If you want help adding the
thread_change_pc() calls to signal delivery and ptrace, let me know.

Paul.
K.Prasad May 24, 2010, 3:49 p.m. UTC | #9
On Thu, May 20, 2010 at 11:10:03PM +1000, Paul Mackerras wrote:
> On Thu, May 20, 2010 at 09:36:03AM +0530, K.Prasad wrote:
> 
(Had this mail composed along with the patchset...but mail server issues
caused delay in sending this...)

Hi Paul,
	While we continue to discuss some of the design decisions, I
thought I'd ready up a patchset to capture the changes we agreed upon to
(lest we miss them). I have sent a new version of the patchset here:
linuxppc-dev message-id: 20100524040137.GA20873@in.ibm.com.

Please see more responses below.

> > > Right.  However, the thread is running the signal handler without the
> > > DABR being set, which is unfortunate.
> > > 
> > 
> > In order to keep the breakpoint active during signal handling, a
> > PowerPC specific signal handling code, say do_signal_pending() in
> > arch/powerpc/kernel/signal.c, must be tapped to check for/restore
> > the breakpoint for the process (if it exists).
> 
> What I would suggest is something slightly different: anything that
> causes the thread to change where it's executing -- signal delivery,
> modification of NIP with ptrace -- should cancel the single-step and
> reinstall the breakpoint in the DABR.  In other words we just forget
> that we hit the breakpoint, and rely on hitting it again if we ever
> get back to that instruction.  I think that is by far the most
> reliable approach.
> 
> That means that the hw-breakpoint code needs to export a function
> called, say, thread_change_pc(), which is called whenever we're
> changing a thread's userspace PC (NIP) value.  If the hw-breakpoint
> code has put that thread into single-step mode, we cancel the
> single-step and if the thread is current, set DABR.
> 

I have made changes to signal-handling code on the suggested lines (as
seen here: linuxppc-dev message-id:20100524040342.GE20873@in.ibm.com)
wherein the debug registers are populated before signal-delivery and
cleaned during signal-return.

However handling of nested interrupts, where second exception is taken
inside the signal handler is still flimsy and the system would experience
two hw-breakpoint exceptions. To overcome the same, we will need a flag in
'struct thread_struct' or perhaps in 'struct arch_hw_breakpoint' to
indicate a breakpoint previously taken in signal-handling context. Given
that the repurcussions of a double-hit are not dangerous, and unsure of
how an addition to thread_struct might be received, I've skipped those
changes for now.

> > I'm afraid if this is more complexity than we want to handle in the
> > first patchset. I agree that this will create a 'blind-spot' of code
> > which cannot not be monitored using breakpoints and may limit debugging
> > efforts (specially for memory corruption); however suspecting that signal
> > handlers (especially those that don't return to original instruction)
> > would be rare, I feel that this could be a 'feature' that can be brought
> > later-in. What do you think?
> 
> I think the thread_change_pc() approach is quite a bit simpler, and I
> think it's better to get this right at the outset rather than have it
> cause bugs later on, when we've all forgotten all the curly
> details. :)

Yes, the details are mostly captured in the latest patchset. Had to make
some 'bold' changes to overcome the issues though :-)

> 
> > > Imagine this scenario: we get the DABR match, set MSR_SE and return to
> > > the task.  In the meantime another higher-priority task has become
> > > runnable and our need_resched flag is set, so we call schedule() on
> > > the way back out to usermode.  The other task runs and then blocks and
> > > our task gets scheduled again.  As part of the context switch,
> > > arch_install_hw_breakpoint() will get called and will set DABR.  So
> > > we'll return to usermode with both DABR and MSE_SE set.
> > > 
> > 
> > I didn't foresee such a possibility. I think this can be handled by
> > putting a check in arch_install_hw_breakpoint() as shown below:
> > 
> > int arch_install_hw_breakpoint(struct perf_event *bp)
> > {
> > 	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
> > 	struct perf_event **slot = &__get_cpu_var(bp_per_reg);
> > 
> > 	*slot = bp;
> > 	if (!current->thread.last_hit_ubp)
> > 		set_dabr(info->address | info->type | DABR_TRANSLATION);
> > 	return 0;
> > }
> 
> Yes, basically, but don't we need to handle per-cpu breakpoints as
> well?  That is, we only want the extra check if this breakpoint is a
> per-task breakpoint.  Or am I not seeing enough context here?
>

Until version XVIII of the patchset, the antiquated notion of user- and
kernel-space still existed to some extent. Through changes made here
(reference: linuxppc-dev message-id: 20100524040418.GF20873@in.ibm.com)
per-task (whose pid > 0) and per-cpu breakpoints are suitably identified.

Now, per-task breakpoints can be one of the following
- User-space breakpoints bound to a process 'struct task_struct'.
- Kernel-space breakpoints bound to a process 'struct task_struct'.
- Kernel-thread breakpoint registered through
  register_user_hw_breakpoint() and still identified through 'struct
  task_struct' after cpu-migration.

The above type of breakpoints may be restricted to a given cpu, yet they
are identifiable through the 'task_struct'.

I am unable to think of a case where a cpu-bound breakpoint request
(essentially a kernel-space address request) without a process context
(i.e. no identifiable 'task_struct') could be scheduled and migrated
across CPUs.

Moving to version XX, the emulation of instructions during non-pertask
breakpoints does not provide a scope for pre-emption at all. Hence the
above check should suffice in either case.

> Also, I have just posted extensions to emulate_single_step() to handle
> loads and stores.  I think this should be enough to handle DABR
> interrupts that occur in kernel mode, so we should never need to
> actually single-step in that case -- if emulate_step fails we should
> just print a warning, uninstall the breakpoint, and continue.  That
> way we don't have to worry about all the interrupts and other
> eventualities that could occur while single-stepping in the kernel.
> 

It was beginning to respond to this section of the mail did I realise
that I failed to implement the suggestion to unregister breakpoint if
emulate_singlestep() failed; and hence the quick successive release of
version XX of the patchset :-)

It is true that if emulate_single_step() was powerful, it simplifies the
design to a large extent. However during my test (with the extensions to
emulate_single_step() applied), I found that the startup test for ftrace
(over ksym_tracer) failed as the breakpoint got unregistered (as a
consequence of emulate_single_step() failure).

I'm not sure if I missed something here...pasting the relevant disassembly
of code below.

0xc000000000138a2c <trace_selftest_startup_ksym+124>:	mr      r29,r3
0xc000000000138a30 <trace_selftest_startup_ksym+128>:	blt	cr7,0xc000000000138aac <trace_selftest_startup_ksym+252>
0xc000000000138a34 <trace_selftest_startup_ksym+132>:	lwz	r0,0(r28) <===== FAILED INSTRUCTION
0xc000000000138a38 <trace_selftest_startup_ksym+136>:	cmpwi   cr7,r0,0
0xc000000000138a3c <trace_selftest_startup_ksym+140>:	bne	cr7,0xc000000000138a48 <trace_selftest_startup_ksym+152>

> For DABR interrupts that occur in user mode, I think the approach of
> single-stepping together with calls to thread_change_pc() in the
> signal delivery and ptrace code should handle all the cases, at least
> for per-task breakpoints.  Per-cpu breakpoints that get hit in user
> mode will be a bit trickier, but I assume we can restrict per-cpu
> breakpoints to kernel addresses for now.  If you want help adding the
> thread_change_pc() calls to signal delivery and ptrace, let me know.

The newer patchsets should be taking care of such combinations i.e.
kernel-space address for a per-task/per-cpu breakpoint and user-space
breakpoint which is per-cpu. Implementations found in version XIX and
XX, handle the constraints in their own way (although version XX is much
simpler if emulate_single_step() mostly works).

In the absence of any adverse comments and given a powerful
emulate_single_step() which works for most instructions, I'd prefer to
push the version XX of the patchset for upstream integration.

Thanking you again for your help/review comments. Let me know your
thoughts on the newer patchset.

Thanks,
K.Prasad
diff mbox

Patch

Index: linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
===================================================================
--- /dev/null
+++ linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
@@ -0,0 +1,45 @@ 
+#ifndef _PPC_BOOK3S_64_HW_BREAKPOINT_H
+#define _PPC_BOOK3S_64_HW_BREAKPOINT_H
+
+#ifdef	__KERNEL__
+#define	__ARCH_HW_BREAKPOINT_H
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+
+struct arch_hw_breakpoint {
+	u8		len; /* length of the target data symbol */
+	int		type;
+	unsigned long	address;
+};
+
+#include <linux/kdebug.h>
+#include <asm/reg.h>
+#include <asm/system.h>
+
+struct perf_event;
+struct pmu;
+struct perf_sample_data;
+
+#define HW_BREAKPOINT_ALIGN 0x7
+/* Maximum permissible length of any HW Breakpoint */
+#define HW_BREAKPOINT_LEN 0x8
+
+extern int arch_validate_hwbkpt_settings(struct perf_event *bp,
+						struct task_struct *tsk);
+extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
+						unsigned long val, void *data);
+int arch_install_hw_breakpoint(struct perf_event *bp);
+void arch_uninstall_hw_breakpoint(struct perf_event *bp);
+void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);
+
+extern struct pmu perf_ops_bp;
+extern void ptrace_triggered(struct perf_event *bp, int nmi,
+			struct perf_sample_data *data, struct pt_regs *regs);
+static inline void hw_breakpoint_disable(void)
+{
+	set_dabr(0);
+}
+
+#endif	/* CONFIG_HAVE_HW_BREAKPOINT */
+#endif	/* __KERNEL__ */
+#endif	/* _PPC_BOOK3S_64_HW_BREAKPOINT_H */
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
===================================================================
--- /dev/null
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
@@ -0,0 +1,324 @@ 
+/*
+ * HW_breakpoint: a unified kernel/user-space hardware breakpoint facility,
+ * using the CPU's debug registers. Derived from
+ * "arch/x86/kernel/hw_breakpoint.c"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright 2009 IBM Corporation
+ * Author: K.Prasad <prasad@linux.vnet.ibm.com>
+ *
+ */
+
+#include <linux/hw_breakpoint.h>
+#include <linux/notifier.h>
+#include <linux/kprobes.h>
+#include <linux/percpu.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/smp.h>
+
+#include <asm/hw_breakpoint.h>
+#include <asm/processor.h>
+#include <asm/sstep.h>
+
+/*
+ * Store the 'bp' that caused the hw-breakpoint exception just before we
+ * single-step. Use it to distinguish a single-step exception (due to a
+ * previous hw-breakpoint exception) from a normal one
+ */
+static DEFINE_PER_CPU(struct perf_event *, last_hit_bp);
+
+/*
+ * Stores the breakpoints currently in use on each breakpoint address
+ * register for every cpu
+ */
+static DEFINE_PER_CPU(struct perf_event *, bp_per_reg);
+
+/*
+ * Install a perf counter breakpoint.
+ *
+ * We seek a free debug address register and use it for this
+ * breakpoint.
+ *
+ * Atomic: we hold the counter->ctx->lock and we only handle variables
+ * and registers local to this cpu.
+ */
+int arch_install_hw_breakpoint(struct perf_event *bp)
+{
+	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
+	struct perf_event **slot = &__get_cpu_var(bp_per_reg);
+
+	*slot = bp;
+	set_dabr(info->address | info->type | DABR_TRANSLATION);
+	return 0;
+}
+
+/*
+ * Uninstall the breakpoint contained in the given counter.
+ *
+ * First we search the debug address register it uses and then we disable
+ * it.
+ *
+ * Atomic: we hold the counter->ctx->lock and we only handle variables
+ * and registers local to this cpu.
+ */
+void arch_uninstall_hw_breakpoint(struct perf_event *bp)
+{
+	struct perf_event **slot = &__get_cpu_var(bp_per_reg);
+
+	if (*slot != bp) {
+		WARN_ONCE(1, "Can't find the breakpoint");
+		return;
+	}
+
+	*slot = NULL;
+	set_dabr(0);
+}
+
+/*
+ * Validate the arch-specific HW Breakpoint register settings
+ */
+int arch_validate_hwbkpt_settings(struct perf_event *bp,
+						struct task_struct *tsk)
+{
+	bool is_kernel;
+	int ret = -EINVAL;
+	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
+
+	if (!bp)
+		return ret;
+
+	switch (bp->attr.bp_type) {
+	case HW_BREAKPOINT_R:
+		info->type = DABR_DATA_READ;
+		break;
+	case HW_BREAKPOINT_W:
+		info->type = DABR_DATA_WRITE;
+		break;
+	case HW_BREAKPOINT_R | HW_BREAKPOINT_W:
+		info->type = (DABR_DATA_READ | DABR_DATA_WRITE);
+		break;
+	default:
+		return ret;
+	}
+
+	is_kernel = is_kernel_addr(bp->attr.bp_addr);
+	if ((tsk && is_kernel) || (!tsk && !is_kernel))
+		return -EINVAL;
+
+	info->address = bp->attr.bp_addr;
+	info->len = bp->attr.bp_len;
+
+	/*
+	 * Since breakpoint length can be a maximum of HW_BREAKPOINT_LEN(8)
+	 * and breakpoint addresses are aligned to nearest double-word
+	 * HW_BREAKPOINT_ALIGN by rounding off to the lower address, the
+	 * 'symbolsize' should satisfy the check below.
+	 */
+	if (info->len >
+	    (HW_BREAKPOINT_LEN - (info->address & HW_BREAKPOINT_ALIGN)))
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Handle debug exception notifications.
+ */
+int __kprobes hw_breakpoint_handler(struct die_args *args)
+{
+	bool is_kernel, is_ptrace_bp = false;
+	int rc = NOTIFY_STOP;
+	struct perf_event *bp;
+	struct pt_regs *regs = args->regs;
+	unsigned long dar = regs->dar;
+	int cpu, stepped = 1;
+	struct arch_hw_breakpoint *info;
+
+	/* Disable breakpoints during exception handling */
+	set_dabr(0);
+	cpu = smp_processor_id();
+	/*
+	 * The counter may be concurrently released but that can only
+	 * occur from a call_rcu() path. We can then safely fetch
+	 * the breakpoint, use its callback, touch its counter
+	 * while we are in an rcu_read_lock() path.
+	 */
+	rcu_read_lock();
+
+	bp = per_cpu(bp_per_reg, cpu);
+	if (!bp)
+		goto out;
+	info = counter_arch_bp(bp);
+	is_kernel = is_kernel_addr(bp->attr.bp_addr);
+	is_ptrace_bp = (bp->overflow_handler == ptrace_triggered) ?
+			true : false;
+
+	/*
+	 * Verify if dar lies within the address range occupied by the symbol
+	 * being watched to filter extraneous exceptions.
+	 */
+	if (!((bp->attr.bp_addr <= dar) &&
+	    (dar <= (bp->attr.bp_addr + bp->attr.bp_len))) &&
+	    (is_ptrace_bp == false))
+		/*
+		 * This exception is triggered not because of a memory access on
+		 * the monitored variable but in the double-word address range
+		 * in which it is contained. We will consume this exception,
+		 * considering it as 'noise'.
+		 */
+		goto restore_bp;
+
+	/*
+	 * Return early after invoking user-callback function without restoring
+	 * DABR if the breakpoint is from ptrace which always operates in
+	 * one-shot mode
+	 */
+	if (is_ptrace_bp == true) {
+		perf_bp_event(bp, regs);
+		rc = NOTIFY_DONE;
+		goto out;
+	}
+
+	/*
+	 * Do not emulate user-space instructions from kernel-space,
+	 * instead single-step them.
+	 */
+	if (is_kernel == false) {
+		current->thread.last_hit_ubp = bp;
+		regs->msr |= MSR_SE;
+		goto out;
+	}
+
+	stepped = emulate_step(regs, regs->nip);
+	/* emulate_step() could not execute it, single-step them */
+	if (stepped == 0) {
+		regs->msr |= MSR_SE;
+		per_cpu(last_hit_bp, cpu) = bp;
+		goto out;
+	}
+	/*
+	 * As a policy, the callback is invoked in a 'trigger-after-execute'
+	 * fashion
+	 */
+	perf_bp_event(bp, regs);
+
+restore_bp:
+	set_dabr(info->address | info->type | DABR_TRANSLATION);
+out:
+	rcu_read_unlock();
+	return rc;
+}
+
+/*
+ * Handle single-step exceptions following a DABR hit.
+ */
+int __kprobes single_step_dabr_instruction(struct die_args *args)
+{
+	struct pt_regs *regs = args->regs;
+	int cpu = smp_processor_id();
+	siginfo_t info;
+	struct perf_event *bp = NULL, *kernel_bp, *user_bp;
+	struct arch_hw_breakpoint *bp_info;
+
+	/*
+	 * Identify the cause of single-stepping and find the corresponding
+	 * breakpoint structure
+	 */
+	user_bp = current->thread.last_hit_ubp;
+	kernel_bp = per_cpu(last_hit_bp, cpu);
+	if (user_bp) {
+		bp = user_bp;
+		current->thread.last_hit_ubp = NULL;
+	} else if (kernel_bp) {
+		bp = kernel_bp;
+		per_cpu(last_hit_bp, cpu) = NULL;
+	}
+
+	/*
+	 * Check if we are single-stepping as a result of a
+	 * previous HW Breakpoint exception
+	 */
+	if (!bp)
+		return NOTIFY_DONE;
+
+	bp_info = counter_arch_bp(bp);
+
+	/*
+	 * We shall invoke the user-defined callback function in the single
+	 * stepping handler to confirm to 'trigger-after-execute' semantics
+	 */
+	perf_bp_event(bp, regs);
+
+	/*
+	 * Do not disable MSR_SE if the process was already in
+	 * single-stepping mode. We cannot reliable detect single-step mode
+	 * for kernel-space breakpoints, so this cannot work along with other
+	 * debuggers (like KGDB, xmon) which may be single-stepping kernel code.
+	 */
+	if (!(user_bp && test_thread_flag(TIF_SINGLESTEP)))
+		regs->msr &= ~MSR_SE;
+
+	/* Deliver signal to user-space */
+	if (user_bp) {
+		info.si_signo = SIGTRAP;
+		info.si_errno = 0;
+		info.si_code = TRAP_HWBKPT;
+		info.si_addr = (void __user *)bp_info->address;
+		force_sig_info(SIGTRAP, &info, current);
+	}
+
+	set_dabr(bp_info->address | bp_info->type | DABR_TRANSLATION);
+	return NOTIFY_STOP;
+}
+
+/*
+ * Handle debug exception notifications.
+ */
+int __kprobes hw_breakpoint_exceptions_notify(
+		struct notifier_block *unused, unsigned long val, void *data)
+{
+	int ret = NOTIFY_DONE;
+
+	switch (val) {
+	case DIE_DABR_MATCH:
+		ret = hw_breakpoint_handler(data);
+		break;
+	case DIE_SSTEP:
+		ret = single_step_dabr_instruction(data);
+		break;
+	}
+
+	return ret;
+}
+
+/*
+ * Release the user breakpoints used by ptrace
+ */
+void flush_ptrace_hw_breakpoint(struct task_struct *tsk)
+{
+	struct thread_struct *t = &tsk->thread;
+
+	unregister_hw_breakpoint(t->ptrace_bps[0]);
+	t->ptrace_bps[0] = NULL;
+}
+
+void hw_breakpoint_pmu_read(struct perf_event *bp)
+{
+	/* TODO */
+}
+
Index: linux-2.6.ppc64_test/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/Kconfig
+++ linux-2.6.ppc64_test/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@  config PPC
 	select HAVE_SYSCALL_WRAPPERS if PPC64
 	select GENERIC_ATOMIC64 if PPC32
 	select HAVE_PERF_EVENTS
+	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
 
 config EARLY_PRINTK
 	bool
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/Makefile
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/kernel/Makefile
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/Makefile
@@ -34,6 +34,7 @@  obj-y				+= vdso32/
 obj-$(CONFIG_PPC64)		+= setup_64.o sys_ppc32.o \
 				   signal_64.o ptrace32.o \
 				   paca.o nvram_64.o firmware.o
+obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
 obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_ppc970.o cpu_setup_pa6t.o
 obj64-$(CONFIG_RELOCATABLE)	+= reloc_64.o
 obj-$(CONFIG_PPC_BOOK3E_64)	+= exceptions-64e.o
Index: linux-2.6.ppc64_test/arch/powerpc/include/asm/processor.h
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/include/asm/processor.h
+++ linux-2.6.ppc64_test/arch/powerpc/include/asm/processor.h
@@ -209,6 +209,14 @@  struct thread_struct {
 #ifdef CONFIG_PPC64
 	unsigned long	start_tb;	/* Start purr when proc switched in */
 	unsigned long	accum_tb;	/* Total accumilated purr for process */
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	struct perf_event *ptrace_bps[HBP_NUM];
+	/*
+	 * Helps identify source of single-step exception and subsequent
+	 * hw-breakpoint enablement
+	 */
+	struct perf_event *last_hit_ubp;
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #endif
 	unsigned long	dabr;		/* Data address breakpoint register */
 #ifdef CONFIG_ALTIVEC
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/ptrace.c
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/kernel/ptrace.c
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/ptrace.c
@@ -32,6 +32,8 @@ 
 #ifdef CONFIG_PPC32
 #include <linux/module.h>
 #endif
+#include <linux/hw_breakpoint.h>
+#include <linux/perf_event.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -763,9 +765,34 @@  void user_disable_single_step(struct tas
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+void ptrace_triggered(struct perf_event *bp, int nmi,
+		      struct perf_sample_data *data, struct pt_regs *regs)
+{
+	struct perf_event_attr attr;
+
+	/*
+	 * Disable the breakpoint request here since ptrace has defined a
+	 * one-shot behaviour for breakpoint exceptions in PPC64.
+	 * The SIGTRAP signal is generated automatically for us in do_dabr().
+	 * We don't have to do anything about that here
+	 */
+	attr = bp->attr;
+	attr.disabled = true;
+	modify_user_hw_breakpoint(bp, &attr);
+}
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+
 int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 			       unsigned long data)
 {
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	int ret;
+	struct thread_struct *thread = &(task->thread);
+	struct perf_event *bp;
+	struct perf_event_attr attr;
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+
 	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
 	 *  For embedded processors we support one DAC and no IAC's at the
 	 *  moment.
@@ -793,6 +820,60 @@  int ptrace_set_debugreg(struct task_stru
 	/* Ensure breakpoint translation bit is set */
 	if (data && !(data & DABR_TRANSLATION))
 		return -EIO;
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	bp = thread->ptrace_bps[0];
+	if (data == 0) {
+		if (bp) {
+			unregister_hw_breakpoint(bp);
+			thread->ptrace_bps[0] = NULL;
+		}
+		return 0;
+	}
+	if (bp) {
+		attr = bp->attr;
+		attr.bp_addr = data & ~HW_BREAKPOINT_ALIGN;
+
+		switch (data & (DABR_DATA_WRITE | DABR_DATA_READ)) {
+		case DABR_DATA_READ:
+			attr.bp_type = HW_BREAKPOINT_R;
+			break;
+		case DABR_DATA_WRITE:
+			attr.bp_type = HW_BREAKPOINT_W;
+			break;
+		case (DABR_DATA_WRITE | DABR_DATA_READ):
+			attr.bp_type = HW_BREAKPOINT_R | HW_BREAKPOINT_W;
+			break;
+		}
+		ret =  modify_user_hw_breakpoint(bp, &attr);
+		if (ret)
+			return ret;
+		thread->ptrace_bps[0] = bp;
+		thread->dabr = data;
+		return 0;
+	}
+
+	/* Create a new breakpoint request if one doesn't exist already */
+	hw_breakpoint_init(&attr);
+	attr.bp_addr = data & ~HW_BREAKPOINT_ALIGN;
+	switch (data & (DABR_DATA_WRITE | DABR_DATA_READ)) {
+	case DABR_DATA_READ:
+		attr.bp_type = HW_BREAKPOINT_R;
+		break;
+	case DABR_DATA_WRITE:
+		attr.bp_type = HW_BREAKPOINT_W;
+		break;
+	case (DABR_DATA_WRITE | DABR_DATA_READ):
+		attr.bp_type = HW_BREAKPOINT_R | HW_BREAKPOINT_W;
+		break;
+	}
+	thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
+							ptrace_triggered, task);
+	if (IS_ERR(bp)) {
+		thread->ptrace_bps[0] = NULL;
+		return PTR_ERR(bp);
+	}
+
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 
 	/* Move contents to the DABR register */
 	task->thread.dabr = data;
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/kernel/process.c
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/process.c
@@ -459,8 +459,14 @@  struct task_struct *__switch_to(struct t
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 	switch_booke_debug_regs(&new->thread);
 #else
+/*
+ * For PPC_BOOK3S_64, we use the hw-breakpoint interfaces that would
+ * schedule DABR
+ */
+#ifndef CONFIG_HAVE_HW_BREAKPOINT
 	if (unlikely(__get_cpu_var(current_dabr) != new->thread.dabr))
 		set_dabr(new->thread.dabr);
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #endif
 
 
Index: linux-2.6.ppc64_test/arch/powerpc/include/asm/cputable.h
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/include/asm/cputable.h
+++ linux-2.6.ppc64_test/arch/powerpc/include/asm/cputable.h
@@ -511,6 +511,10 @@  static inline int cpu_has_feature(unsign
 		& feature);
 }
 
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#define HBP_NUM 1
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/machine_kexec_64.c
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/kernel/machine_kexec_64.c
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/machine_kexec_64.c
@@ -15,6 +15,7 @@ 
 #include <linux/thread_info.h>
 #include <linux/init_task.h>
 #include <linux/errno.h>
+#include <linux/hw_breakpoint.h>
 
 #include <asm/page.h>
 #include <asm/current.h>
@@ -161,6 +162,7 @@  void kexec_copy_flush(struct kimage *ima
  */
 static void kexec_smp_down(void *arg)
 {
+	hw_breakpoint_disable();
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 1);
 
@@ -174,6 +176,7 @@  static void kexec_prepare_cpus(void)
 	int my_cpu, i, notified=-1;
 
 	smp_call_function(kexec_smp_down, NULL, /* wait */0);
+	hw_breakpoint_disable();
 	my_cpu = get_cpu();
 
 	/* check the others cpus are now down (via paca hw cpu id == -1) */
Index: linux-2.6.ppc64_test/arch/powerpc/lib/Makefile
===================================================================
--- linux-2.6.ppc64_test.orig/arch/powerpc/lib/Makefile
+++ linux-2.6.ppc64_test/arch/powerpc/lib/Makefile
@@ -20,6 +20,7 @@  obj-$(CONFIG_PPC64)	+= copypage_64.o cop
 			   memcpy_64.o usercopy_64.o mem_64.o string.o
 obj-$(CONFIG_XMON)	+= sstep.o
 obj-$(CONFIG_KPROBES)	+= sstep.o
+obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= sstep.o
 
 ifeq ($(CONFIG_PPC64),y)
 obj-$(CONFIG_SMP)	+= locks.o
Index: linux-2.6.ppc64_test/include/linux/hw_breakpoint.h
===================================================================
--- linux-2.6.ppc64_test.orig/include/linux/hw_breakpoint.h
+++ linux-2.6.ppc64_test/include/linux/hw_breakpoint.h
@@ -120,6 +120,7 @@  static inline struct arch_hw_breakpoint 
 {
 	return NULL;
 }
+static inline void hw_breakpoint_disable(void) { }
 
 #endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #endif /* __KERNEL__ */