Patchwork powerpc: Fix the corrupt r3 error during MCE handling.

Submitter: Mahesh Salgaonkar
Date: July 10, 2013, 1:02 p.m.
Message ID: <20130710130155.4993.61577.stgit@mars>
Permalink: /patch/258045/
State: Accepted, archived
Commit: ee1dd1e3dc774cf257012215d996e8e7e370c162

Comments

Mahesh Salgaonkar - July 10, 2013, 1:02 p.m.
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

During a machine check interrupt on the pseries platform, r3 generally points
to a memory region inside the RTAS (FWNMI) area. The kernel reports r3 as
corrupt because, when RTAS delivers the machine check exception, it passes the
address inside the FWNMI area with the topmost bit set. This patch fixes the
issue by masking the top two bits of r3 in the machine check exception handler.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/ras.c |    3 +++
 1 file changed, 3 insertions(+)
Aneesh Kumar K.V - July 10, 2013, 2:11 p.m.
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:

> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> During a machine check interrupt on the pseries platform, r3 generally points
> to a memory region inside the RTAS (FWNMI) area. The kernel reports r3 as
> corrupt because, when RTAS delivers the machine check exception, it passes the
> address inside the FWNMI area with the topmost bit set. This patch fixes the
> issue by masking the top two bits of r3 in the machine check exception handler.

I always got that error and used to wonder why FWNMI was reported as
corrupt. Is this an RTAS bug, or is it documented in PAPR?


>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/ras.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 7b3cbde..721c058 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -287,6 +287,9 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>  	unsigned long *savep;
>  	struct rtas_error_log *h, *errhdr = NULL;
>
> +	/* Mask top two bits */
> +	regs->gpr[3] &= ~(0x3UL << 62);
> +
>  	if (!VALID_FWNMI_BUFFER(regs->gpr[3])) {
>  		printk(KERN_ERR "FWNMI: corrupt r3 0x%016lx\n", regs->gpr[3]);
>  		return NULL;
>

-aneesh
Mahesh Salgaonkar - July 11, 2013, 4:34 a.m.
On 07/10/2013 07:41 PM, Aneesh Kumar K.V wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> 
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> During a machine check interrupt on the pseries platform, r3 generally points
>> to a memory region inside the RTAS (FWNMI) area. The kernel reports r3 as
>> corrupt because, when RTAS delivers the machine check exception, it passes the
>> address inside the FWNMI area with the topmost bit set. This patch fixes the
>> issue by masking the top two bits of r3 in the machine check exception handler.
> 
> I always got that error and used to wonder why FWNMI was reported as
> corrupt. Is this an RTAS bug, or is it documented in PAPR?

Nope. There is no mention of it in PAPR. It looks like a bug in RTAS.

> 
> 
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/ras.c |    3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index 7b3cbde..721c058 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -287,6 +287,9 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>>  	unsigned long *savep;
>>  	struct rtas_error_log *h, *errhdr = NULL;
>>
>> +	/* Mask top two bits */
>> +	regs->gpr[3] &= ~(0x3UL << 62);
>> +
>>  	if (!VALID_FWNMI_BUFFER(regs->gpr[3])) {
>>  		printk(KERN_ERR "FWNMI: corrupt r3 0x%016lx\n", regs->gpr[3]);
>>  		return NULL;
>>
> 
> -aneesh
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
Benjamin Herrenschmidt - July 11, 2013, 4:42 a.m.
On Thu, 2013-07-11 at 10:04 +0530, Mahesh Jagannath Salgaonkar wrote:
> > I always got that error and used to wonder why FWNMI was reported as
> > corrupt. Is this an RTAS bug, or is it documented in PAPR?
> 
> Nope. There is no mention of it in PAPR. It looks like a bug in RTAS.

Typically, the top bit in real mode means "ignore the HRMOR"... I think
it's indeed some old bug in RTAS.

Cheers,
Ben.
Anshuman Khandual - July 11, 2013, 5:54 a.m.
On 07/10/2013 06:32 PM, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> During a machine check interrupt on the pseries platform, r3 generally points
> to a memory region inside the RTAS (FWNMI) area. The kernel reports r3 as
> corrupt because, when RTAS delivers the machine check exception, it passes the
> address inside the FWNMI area with the topmost bit set. This patch fixes the
> issue by masking the top two bits of r3 in the machine check exception handler.
> 
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/ras.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 7b3cbde..721c058 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -287,6 +287,9 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>  	unsigned long *savep;
>  	struct rtas_error_log *h, *errhdr = NULL;
> 
> +	/* Mask top two bits */
> +	regs->gpr[3] &= ~(0x3UL << 62);

We need to replace this "62" with a shift macro that states the significance
of these top two address bits in real mode.
Aneesh Kumar K.V - July 15, 2013, 6:06 a.m.
Anshuman Khandual <khandual@linux.vnet.ibm.com> writes:

> On 07/10/2013 06:32 PM, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> 
>> During a machine check interrupt on the pseries platform, r3 generally points
>> to a memory region inside the RTAS (FWNMI) area. The kernel reports r3 as
>> corrupt because, when RTAS delivers the machine check exception, it passes the
>> address inside the FWNMI area with the topmost bit set. This patch fixes the
>> issue by masking the top two bits of r3 in the machine check exception handler.
>> 
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/ras.c |    3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index 7b3cbde..721c058 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -287,6 +287,9 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>>  	unsigned long *savep;
>>  	struct rtas_error_log *h, *errhdr = NULL;
>> 
>> +	/* Mask top two bits */
>> +	regs->gpr[3] &= ~(0x3UL << 62);
>
> We need to replace this "62" with a shift macro that states the significance
> of these top two address bits in real mode.

huh??

(gdb) p/t 0x3ull << 62
$4 = 1100000000000000000000000000000000000000000000000000000000000000

Why do you need an extra comment to explain 62? But yes, we could
mention in the comment that this is an RTAS bug.

-aneesh
Anshuman Khandual - July 15, 2013, 6:18 a.m.
On 07/15/2013 11:36 AM, Aneesh Kumar K.V wrote:
> Anshuman Khandual <khandual@linux.vnet.ibm.com> writes:
> 
>> On 07/10/2013 06:32 PM, Mahesh J Salgaonkar wrote:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> During a machine check interrupt on the pseries platform, r3 generally points
>>> to a memory region inside the RTAS (FWNMI) area. The kernel reports r3 as
>>> corrupt because, when RTAS delivers the machine check exception, it passes the
>>> address inside the FWNMI area with the topmost bit set. This patch fixes the
>>> issue by masking the top two bits of r3 in the machine check exception handler.
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/platforms/pseries/ras.c |    3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>>> index 7b3cbde..721c058 100644
>>> --- a/arch/powerpc/platforms/pseries/ras.c
>>> +++ b/arch/powerpc/platforms/pseries/ras.c
>>> @@ -287,6 +287,9 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
>>>  	unsigned long *savep;
>>>  	struct rtas_error_log *h, *errhdr = NULL;
>>>
>>> +	/* Mask top two bits */
>>> +	regs->gpr[3] &= ~(0x3UL << 62);
>>
>> We need to replace this "62" with a shift macro that states the significance
>> of these top two address bits in real mode.
> 
> huh??
> 
> (gdb) p/t 0x3ull << 62
> $4 = 1100000000000000000000000000000000000000000000000000000000000000
> 
> Why do you need an extra comment to explain 62? But yes, we could
> mention in the comment that this is an RTAS bug.

The "62" was just meant to point at the top two address bits in real mode. The
request for an extra comment was to spell out what the RTAS behaviour (or bug)
is with respect to those two bits, and how this fix deals with them.

Patch

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 7b3cbde..721c058 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -287,6 +287,9 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 	unsigned long *savep;
 	struct rtas_error_log *h, *errhdr = NULL;
 
+	/* Mask top two bits */
+	regs->gpr[3] &= ~(0x3UL << 62);
+
 	if (!VALID_FWNMI_BUFFER(regs->gpr[3])) {
 		printk(KERN_ERR "FWNMI: corrupt r3 0x%016lx\n", regs->gpr[3]);
 		return NULL;