diff mbox

[1/3] powerpc/64s: fix handling of non-synchronous machine checks

Message ID 20170228020048.8862-2-npiggin@gmail.com (mailing list archive)
State Accepted
Headers show

Commit Message

Nicholas Piggin Feb. 28, 2017, 2 a.m. UTC
A synchronous machine check is an exception raised by the attempt to
execute the current instruction. If the error can't be corrected, it
can make sense to SIGBUS the currently running process.

In other cases, the error condition is not related to the current
instruction, so killing the current process is not the right thing to
do.

Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
practical change. It will be used to handle POWER9 asynchronous
machine checks.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/platforms/powernv/opal.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

Comments

Mahesh J Salgaonkar Feb. 28, 2017, 5:57 a.m. UTC | #1
On 02/28/2017 07:30 AM, Nicholas Piggin wrote:
> A synchronous machine check is an exception raised by the attempt to
> execute the current instruction. If the error can't be corrected, it
> can make sense to SIGBUS the currently running process.
> 
> In other cases, the error condition is not related to the current
> instruction, so killing the current process is not the right thing to
> do.
> 
> Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
> practical change. It will be used to handle POWER9 asynchronous
> machine checks.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/platforms/powernv/opal.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 86d9fde93c17..e0f856bfbfe8 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -395,7 +395,6 @@ static int opal_recover_mce(struct pt_regs *regs,
>  					struct machine_check_event *evt)
>  {
>  	int recovered = 0;
> -	uint64_t ea = get_mce_fault_addr(evt);
> 
>  	if (!(regs->msr & MSR_RI)) {
>  		/* If MSR_RI isn't set, we cannot recover */
> @@ -404,26 +403,18 @@ static int opal_recover_mce(struct pt_regs *regs,
>  	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>  		/* Platform corrected itself */
>  		recovered = 1;
> -	} else if (ea && !is_kernel_addr(ea)) {
> +	} else if (evt->severity == MCE_SEV_FATAL) {
> +		/* Fatal machine check */
> +		pr_err("Machine check interrupt is fatal\n");
> +		recovered = 0;

Setting recovered = 0 would trigger kernel panic. Should we panic the
kernel for asynchronous errors ?

> +	} else if ((evt->severity == MCE_SEV_ERROR_SYNC) &&
> +			(user_mode(regs) && !is_global_init(current))) {
>  		/*
> -		 * Faulting address is not in kernel text. We should be fine.
> -		 * We need to find which process uses this address.
>  		 * For now, kill the task if we have received exception when
>  		 * in userspace.
>  		 *
>  		 * TODO: Queue up this address for hwpoisioning later.
>  		 */
> -		if (user_mode(regs) && !is_global_init(current)) {
> -			_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> -			recovered = 1;
> -		} else
> -			recovered = 0;
> -	} else if (user_mode(regs) && !is_global_init(current) &&
> -		evt->severity == MCE_SEV_ERROR_SYNC) {
> -		/*
> -		 * If we have received a synchronous error when in userspace
> -		 * kill the task.
> -		 */
>  		_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
>  		recovered = 1;
>  	}
>
Nicholas Piggin Feb. 28, 2017, 8:43 a.m. UTC | #2
On Tue, 28 Feb 2017 11:27:29 +0530
Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> On 02/28/2017 07:30 AM, Nicholas Piggin wrote:
> > A synchronous machine check is an exception raised by the attempt to
> > execute the current instruction. If the error can't be corrected, it
> > can make sense to SIGBUS the currently running process.
> > 
> > In other cases, the error condition is not related to the current
> > instruction, so killing the current process is not the right thing to
> > do.
> > 
> > Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
> > practical change. It will be used to handle POWER9 asynchronous
> > machine checks.
> > 
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  arch/powerpc/platforms/powernv/opal.c | 21 ++++++---------------
> >  1 file changed, 6 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> > index 86d9fde93c17..e0f856bfbfe8 100644
> > --- a/arch/powerpc/platforms/powernv/opal.c
> > +++ b/arch/powerpc/platforms/powernv/opal.c
> > @@ -395,7 +395,6 @@ static int opal_recover_mce(struct pt_regs *regs,
> >  					struct machine_check_event *evt)
> >  {
> >  	int recovered = 0;
> > -	uint64_t ea = get_mce_fault_addr(evt);
> > 
> >  	if (!(regs->msr & MSR_RI)) {
> >  		/* If MSR_RI isn't set, we cannot recover */
> > @@ -404,26 +403,18 @@ static int opal_recover_mce(struct pt_regs *regs,
> >  	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
> >  		/* Platform corrected itself */
> >  		recovered = 1;
> > -	} else if (ea && !is_kernel_addr(ea)) {
> > +	} else if (evt->severity == MCE_SEV_FATAL) {
> > +		/* Fatal machine check */
> > +		pr_err("Machine check interrupt is fatal\n");
> > +		recovered = 0;  
> 
> Setting recovered = 0 would trigger kernel panic. Should we panic the
> kernel for asynchronous errors ?

If it's not recoverable, I don't see what other option we have. SRR0 is
meaningless for async machine checks. So it's much the same thing we do
as if we don't have a process to kill or were running in kernel when a
synchronous MCE occurred.

Thanks,
Nick
Michael Ellerman March 14, 2017, 11:45 a.m. UTC | #3
On Tue, 2017-02-28 at 02:00:46 UTC, Nicholas Piggin wrote:
> A synchronous machine check is an exception raised by the attempt to
> execute the current instruction. If the error can't be corrected, it
> can make sense to SIGBUS the currently running process.
> 
> In other cases, the error condition is not related to the current
> instruction, so killing the current process is not the right thing to
> do.
> 
> Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
> practical change. It will be used to handle POWER9 asynchronous
> machine checks.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/1363875bdb6317a2d0798284d7aaf3

cheers
diff mbox

Patch

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 86d9fde93c17..e0f856bfbfe8 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -395,7 +395,6 @@  static int opal_recover_mce(struct pt_regs *regs,
 					struct machine_check_event *evt)
 {
 	int recovered = 0;
-	uint64_t ea = get_mce_fault_addr(evt);
 
 	if (!(regs->msr & MSR_RI)) {
 		/* If MSR_RI isn't set, we cannot recover */
@@ -404,26 +403,18 @@  static int opal_recover_mce(struct pt_regs *regs,
 	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
 		/* Platform corrected itself */
 		recovered = 1;
-	} else if (ea && !is_kernel_addr(ea)) {
+	} else if (evt->severity == MCE_SEV_FATAL) {
+		/* Fatal machine check */
+		pr_err("Machine check interrupt is fatal\n");
+		recovered = 0;
+	} else if ((evt->severity == MCE_SEV_ERROR_SYNC) &&
+			(user_mode(regs) && !is_global_init(current))) {
 		/*
-		 * Faulting address is not in kernel text. We should be fine.
-		 * We need to find which process uses this address.
 		 * For now, kill the task if we have received exception when
 		 * in userspace.
 		 *
 		 * TODO: Queue up this address for hwpoisioning later.
 		 */
-		if (user_mode(regs) && !is_global_init(current)) {
-			_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
-			recovered = 1;
-		} else
-			recovered = 0;
-	} else if (user_mode(regs) && !is_global_init(current) &&
-		evt->severity == MCE_SEV_ERROR_SYNC) {
-		/*
-		 * If we have received a synchronous error when in userspace
-		 * kill the task.
-		 */
 		_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
 		recovered = 1;
 	}