diff mbox

[v2,11/11] ARC: [plat-eznps] Handle memory error as an exception

Message ID 1495954328-28736-12-git-send-email-noamca@mellanox.com
State New
Headers show

Commit Message

Noam Camus May 28, 2017, 6:52 a.m. UTC
From: Noam Camus <noamca@mellanox.com>

This commit adds the configuration CONFIG_EZNPS_MEM_ERROR.
If set, it will cause the kernel to handle user memory error
as a machine check exception.
It is required in order to align the NPS simulator memory
error handling to the one of the NPS400 real chip behavior.
We override weak symbole of mem_service to achieve that.

Signed-off-by: Elad Kanfi <eladkan@mellanox.com>
Signed-off-by: Noam Camus <noamca@mellanox.com>
---
 arch/arc/plat-eznps/Kconfig |   11 +++++++++++
 arch/arc/plat-eznps/entry.S |   14 ++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)

Comments

Vineet Gupta June 2, 2017, 7:04 p.m. UTC | #1
On 05/27/2017 11:52 PM, Noam Camus wrote:
> From: Noam Camus <noamca@mellanox.com>
> 
> This commit adds the configuration CONFIG_EZNPS_MEM_ERROR.
> If set, it will cause the kernel to handle user memory error
> as a machine check exception.
> It is required in order to align the NPS simulator memory
> error handling to the one of the NPS400 real chip behavior.
> We override weak symbole of mem_service to achieve that.
> 
> Signed-off-by: Elad Kanfi <eladkan@mellanox.com>
> Signed-off-by: Noam Camus <noamca@mellanox.com>
> ---
>   arch/arc/plat-eznps/Kconfig |   11 +++++++++++
>   arch/arc/plat-eznps/entry.S |   14 ++++++++++++++
>   2 files changed, 25 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/arc/plat-eznps/Kconfig b/arch/arc/plat-eznps/Kconfig
> index feaa471..c5f946c 100644
> --- a/arch/arc/plat-eznps/Kconfig
> +++ b/arch/arc/plat-eznps/Kconfig
> @@ -32,3 +32,14 @@ config EZNPS_MTM_EXT
>   	  any of them seem like CPU from Linux point of view.
>   	  All threads within same core share the execution unit of the
>   	  core and HW scheduler round robin between them.
> +
> +config EZNPS_MEM_ERROR
> +       bool "ARC-EZchip Memory error as an exception"
> +       depends on ARC_PLAT_EZNPS
> +       default n
> +       help
> +         On the real chip of the NPS, user memory errors are handled
> +         as a machine check exception, whereas on simulator platform
> +         for NPS, it handled as an interrupt level 2 (like legacy arc
> +         real chip architecture).This configuration will cause the kernel
> +         to handle memory error as a machine check exception.

Do you really need a Kconfig option here. AFAIKR you guys had some magic in 
platform code to determine whether running on sim or hw - can that be not used ?

> diff --git a/arch/arc/plat-eznps/entry.S b/arch/arc/plat-eznps/entry.S
> index 328261c..03e2892 100644
> --- a/arch/arc/plat-eznps/entry.S
> +++ b/arch/arc/plat-eznps/entry.S
> @@ -68,3 +68,17 @@ ENTRY(res_service)
>   
>   	j	stext
>   END(res_service)
> +
> +#if defined(CONFIG_EZNPS_MEM_ERROR)
> +ENTRY(mem_service)
> +	; SW workaround to cover up on a difference between
> +	; NPS real chip and simulator behaviors.
> +	; NPS real chip will activate a machine check exception
> +	; in case of memory error, while the simulator will
> +	; trigger a level 2 interrupt. Therefor this code section
> +	; should be reached only in simulation mode.
> +	; DEAD END: display Regs and HALT
> +
> +	j EV_MachineCheck
> +END(mem_service)
> +#endif


Just squash the weak symbol patch in here - not worth a separate patch !

-Vineet
Noam Camus June 4, 2017, 6:10 a.m. UTC | #2
> From: Vineet Gupta [mailto:Vineet.Gupta1@synopsys.com] 
> Sent: Friday, June 2, 2017 22:04 PM

>> diff --git a/arch/arc/plat-eznps/Kconfig b/arch/arc/plat-eznps/Kconfig 
>> index feaa471..c5f946c 100644
>> --- a/arch/arc/plat-eznps/Kconfig
>> +++ b/arch/arc/plat-eznps/Kconfig
>> @@ -32,3 +32,14 @@ config EZNPS_MTM_EXT
>>   	  any of them seem like CPU from Linux point of view.
>>   	  All threads within same core share the execution unit of the
>>   	  core and HW scheduler round robin between them.
>> +
>> +config EZNPS_MEM_ERROR
>> +       bool "ARC-EZchip Memory error as an exception"
>> +       depends on ARC_PLAT_EZNPS
>> +       default n
>> +       help
>> +         On the real chip of the NPS, user memory errors are handled
>> +         as a machine check exception, whereas on simulator platform
>> +         for NPS, it handled as an interrupt level 2 (like legacy arc
>> +         real chip architecture).This configuration will cause the kernel
>> +         to handle memory error as a machine check exception.

>Do you really need a Kconfig option here. AFAIKR you guys had some magic in platform code to determine whether running on sim or hw - can that be not used ?
We do not have this anymore, needed to create dedicated one here.

...

>> diff --git a/arch/arc/plat-eznps/entry.S b/arch/arc/plat-eznps/entry.S 
>> index 328261c..03e2892 100644
>> --- a/arch/arc/plat-eznps/entry.S
>> +++ b/arch/arc/plat-eznps/entry.S
>> @@ -68,3 +68,17 @@ ENTRY(res_service)
>>   
>>   	j	stext
>>   END(res_service)
>> +
>> +#if defined(CONFIG_EZNPS_MEM_ERROR)
>> +ENTRY(mem_service)
>> +	; SW workaround to cover up on a difference between
>> +	; NPS real chip and simulator behaviors.
>> +	; NPS real chip will activate a machine check exception
>> +	; in case of memory error, while the simulator will
>> +	; trigger a level 2 interrupt. Therefor this code section
>> +	; should be reached only in simulation mode.
>> +	; DEAD END: display Regs and HALT
>> +
>> +	j EV_MachineCheck
>> +END(mem_service)
>> +#endif


>Just squash the weak symbol patch in here - not worth a separate patch !
Ok , no problem.

-Noam
Vineet Gupta June 6, 2017, 10:10 p.m. UTC | #3
On 05/27/2017 11:52 PM, Noam Camus wrote:
> From: Noam Camus <noamca@mellanox.com>
> 
> This commit adds the configuration CONFIG_EZNPS_MEM_ERROR.
> If set, it will cause the kernel to handle user memory error
> as a machine check exception.
> It is required in order to align the NPS simulator memory
> error handling to the one of the NPS400 real chip behavior.
> We override weak symbole of mem_service to achieve that.
> 
> Signed-off-by: Elad Kanfi <eladkan@mellanox.com>
> Signed-off-by: Noam Camus <noamca@mellanox.com>
> ---
>   arch/arc/plat-eznps/Kconfig |   11 +++++++++++
>   arch/arc/plat-eznps/entry.S |   14 ++++++++++++++
>   2 files changed, 25 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/arc/plat-eznps/Kconfig b/arch/arc/plat-eznps/Kconfig
> index feaa471..c5f946c 100644
> --- a/arch/arc/plat-eznps/Kconfig
> +++ b/arch/arc/plat-eznps/Kconfig
> @@ -32,3 +32,14 @@ config EZNPS_MTM_EXT
>   	  any of them seem like CPU from Linux point of view.
>   	  All threads within same core share the execution unit of the
>   	  core and HW scheduler round robin between them.
> +
> +config EZNPS_MEM_ERROR
> +       bool "ARC-EZchip Memory error as an exception"
> +       depends on ARC_PLAT_EZNPS
> +       default n

So you set default to "n" - thus by default it works for the simulator not silicon ?


> +       help
> +         On the real chip of the NPS, user memory errors are handled
> +         as a machine check exception, whereas on simulator platform
> +         for NPS, it handled as an interrupt level 2 (like legacy arc
> +         real chip architecture).This configuration will cause the kernel
> +         to handle memory error as a machine check exception.
> diff --git a/arch/arc/plat-eznps/entry.S b/arch/arc/plat-eznps/entry.S
> index 328261c..03e2892 100644
> --- a/arch/arc/plat-eznps/entry.S
> +++ b/arch/arc/plat-eznps/entry.S
> @@ -68,3 +68,17 @@ ENTRY(res_service)
>   
>   	j	stext
>   END(res_service)
> +
> +#if defined(CONFIG_EZNPS_MEM_ERROR)
> +ENTRY(mem_service)
> +	; SW workaround to cover up on a difference between
> +	; NPS real chip and simulator behaviors.
> +	; NPS real chip will activate a machine check exception
> +	; in case of memory error, while the simulator will
> +	; trigger a level 2 interrupt. Therefor this code section
> +	; should be reached only in simulation mode.
> +	; DEAD END: display Regs and HALT
> +
> +	j EV_MachineCheck
> +END(mem_service)
> +#endif
>
Noam Camus June 7, 2017, 6:07 a.m. UTC | #4
>From: Vineet Gupta [mailto:Vineet.Gupta1@synopsys.com] 
>Sent: Wednesday, June 7, 2017 1:11 AM
...
>> +
>> +config EZNPS_MEM_ERROR
>> +       bool "ARC-EZchip Memory error as an exception"
>> +       depends on ARC_PLAT_EZNPS
>> +       default n

>So you set default to "n" - thus by default it works for the simulator not silicon ?
Correct, this way I "align" Sim environment to react as close as possible to how it work with silicon.

-Noam
Noam Camus June 7, 2017, 11:14 a.m. UTC | #5
> From: Noam Camus 
> Sent: Wednesday, June 7, 2017 9:08 AM
>To: 'Vineet Gupta' <Vineet.Gupta1@synopsys.com>; linux-snps-arc@lists.infradead.org
>Cc: linux-kernel@vger.kernel.org; Elad Kanfi <eladkan@mellanox.com>
>Subject: RE: [PATCH v2 11/11] ARC: [plat-eznps] Handle memory error as an exception

>>From: Vineet Gupta [mailto:Vineet.Gupta1@synopsys.com] 
>>Sent: Wednesday, June 7, 2017 1:11 AM
>...
>>> +
>>> +config EZNPS_MEM_ERROR
>>> +       bool "ARC-EZchip Memory error as an exception"
>>> +       depends on ARC_PLAT_EZNPS
>>> +       default n

>>So you set default to "n" - thus by default it works for the simulator not silicon ?
>Correct, this way I "align" Sim environment to react as close as possible to how it work with silicon.

Sorry, but It is not correct.
Default is for silicon where it is naturally emits machine check and unlike simulator do not need OS to redirect the ISR L2 to machine check handler.
Above motivation of "align" is true, but default is silicon and not sim, as I wrote in my original configuration help.
Please re-update this CONFIG help section.

Thanks
-Noam
Vineet Gupta June 7, 2017, 4:15 p.m. UTC | #6
On 06/07/2017 04:14 AM, Noam Camus wrote:
>>>> +config EZNPS_MEM_ERROR
>>>> +       bool "ARC-EZchip Memory error as an exception"
>>>> +       depends on ARC_PLAT_EZNPS
>>>> +       default n
>>> So you set default to "n" - thus by default it works for the simulator not silicon ?
>> Correct, this way I "align" Sim environment to react as close as possible to how it work with silicon.
> Sorry, but It is not correct.
> Default is for silicon where it is naturally emits machine check and unlike simulator do not need OS to redirect the ISR L2 to machine check handler.
> Above motivation of "align" is true, but default is silicon and not sim, as I wrote in my original configuration help.
> Please re-update this CONFIG help section.



So NPS *hardware* generates exception, jumps to vector mem_service(), which you 
redirect to the machine check handler - which simply panics.
But this redirection is under EZNPS_MEM_ERROR, which you have defaulted to "n". So 
how is the default working for hardware ? Doesn't it need to be "y"

BTW it seems your patch is wrong otherwise too. So the userspace bus error will go 
to machine check handler which currently just panic's. You really want to kill the 
user space process and continue, thus need to call do_memory_error()

-Vineet
Vineet Gupta June 8, 2017, 4:38 p.m. UTC | #7
On 06/07/2017 08:29 PM, Noam Camus wrote:
> *From:* Noam Camus
> *Sent:* Wednesday, June 7, 2017 8:06:17 PM
> *To:* Vineet Gupta; linux-snps-arc@lists.infradead.org
> *Cc:* linux-kernel@vger.kernel.org; Elad Kanfi
> *Subject:* Re: [PATCH v2 11/11] ARC: [plat-eznps] Handle memory error as an 
> exception
>
> *> From:*Vineet Gupta <Vineet.Gupta1@synopsys.com>
>
> *> Sent:* Wednesday, June 7, 2017 7:15 PM...
>
> > So NPS *hardware* generates exception, jumps to vector mem_service(), which you
> > redirect to the machine check handler - which simply panics.
> > But this redirection is under EZNPS_MEM_ERROR, which you have defaulted to 
> "n". So
> > how is the default working for hardware ? Doesn't it need to be "y"
>
> The NPS400 architects changed userspace bus error behavior to be machine check 
> instead of Interrupt level 2.
> The reason was that since we are dealing with imprecise exception.
> So memory request result will be back to core long time after bad instruction 
> was executed.
> In the meantime core be able to do HW schedule between threads and result may 
> hit another thread.
> The core do not keep information on each such bus transaction so it just 
> interfere current thread without knowing if it was the initiator of this bus 
> transaction.
> In such case we prefer to create machine check and end with PANIC.

Ok this make sense !

>
> With simulator we just turn this configuration on, so we redirect the Legacy 
> Synopsys L2 ISR from nSIM into machine check.
> This way we end up just like with silicon 😊

This doesn't make sense :-)
In simulation (where L2 interrupt is asserted), you need to handle it as such - 
say reading out the banked regs for L2 interrupt. What you are doing here is 
handling it like an exception which won't work . I really don't see the point of 
this "alignment" - hardware and simulation are different. simulation semantics are 
already supported by generic ARC code. And for silicon case, the existing vector 
woudl MachineCheck would work for both K and U. So I'm not sure what we are trying 
to achieve here !


>
>
> >BTW it seems your patch is wrong otherwise too. So the userspace bus error will go
> >to machine check handler which currently just panic's. You really want to kill the
> >user space process and continue, thus need to call do_memory_error()
> So I believe that we do correct thing here, when we deal with multi thread cores.

Sure, the imprecise handling of bus error is an issue - but we should atleat try 
to recover. By just panic'ing unconditionally, you are enabling a one liner user 
program to panic the system (granted in simulation only)

-Vineet
Vineet Gupta June 8, 2017, 7 p.m. UTC | #8
On 06/08/2017 11:23 AM, Noam Camus wrote:
>
>
> *> From:* Vineet Gupta <Vineet.Gupta1@synopsys.com>
> *> Sent:* Thursday, June 8, 2017 7:38 PM
>
> >>
> >> With simulator we just turn this configuration on, so we redirect the Legacy
> >> Synopsys L2 ISR from nSIM into machine check.
> >> This way we end up just like with silicon 😊
>
> >This doesn't make sense :-)
> >In simulation (where L2 interrupt is asserted), you need to handle it as such -
> >say reading out the banked regs for L2 interrupt. What you are doing here is
> >handling it like an exception which won't work . I really don't see the point of
> >this "alignment" - hardware and simulation are different. simulation semantics are
> >already supported by generic ARC code. And for silicon case, the existing vector
> >woudl MachineCheck would work for both K and U. So I'm not sure what we are trying
> >to achieve here !
> With EZsim we try to simulate NPS400 CTOP core and not ARC core, and as such we 
> strive to have similar echo system for both silicon and its simulator.

Right, but if you are using nSIM which generates L2 interrupt for user mode error 
- then it is already different from silicon and needs to handled as such.

> If we could, we would alter nSIM to behave just like our silicon.
> So in current situation where we lack doing so we suffice in single pretty small 
> adjustment in OS (platform specific code).

You are saying contradicting things here. Above u want EZSim to simulate CTOP 
(i.e. generate machinechk for U errors) but here you claim u use nSIM which will 
generates L2 intr.

So I'm still grossly confused.

What does EZSim (based on nSIM) do when bus error is triggered from User mode - 
does it raise (A) L2 interrupt or (B) MachineCheck ?

If it is (A) the the existing common code in ARC will work - mem_service() -> 
do_memory_error() -> panic()
if it is (B), again the common machinecheck handler will be called and will panic.

I don't see the need to mix both (A) and (B) i.e. use mem_service() which is a L2 
interrupt, but then handle it in MachieChekc which is for exceptions ? How is that 
supposed to work !

-Vineet
diff mbox

Patch

diff --git a/arch/arc/plat-eznps/Kconfig b/arch/arc/plat-eznps/Kconfig
index feaa471..c5f946c 100644
--- a/arch/arc/plat-eznps/Kconfig
+++ b/arch/arc/plat-eznps/Kconfig
@@ -32,3 +32,14 @@  config EZNPS_MTM_EXT
 	  any of them seem like CPU from Linux point of view.
 	  All threads within same core share the execution unit of the
 	  core and HW scheduler round robin between them.
+
+config EZNPS_MEM_ERROR
+       bool "ARC-EZchip Memory error as an exception"
+       depends on ARC_PLAT_EZNPS
+       default n
+       help
+         On the real chip of the NPS, user memory errors are handled
+         as a machine check exception, whereas on simulator platform
+         for NPS, it handled as an interrupt level 2 (like legacy arc
+         real chip architecture).This configuration will cause the kernel
+         to handle memory error as a machine check exception.
diff --git a/arch/arc/plat-eznps/entry.S b/arch/arc/plat-eznps/entry.S
index 328261c..03e2892 100644
--- a/arch/arc/plat-eznps/entry.S
+++ b/arch/arc/plat-eznps/entry.S
@@ -68,3 +68,17 @@  ENTRY(res_service)
 
 	j	stext
 END(res_service)
+
+#if defined(CONFIG_EZNPS_MEM_ERROR)
+ENTRY(mem_service)
+	; SW workaround to cover up on a difference between
+	; NPS real chip and simulator behaviors.
+	; NPS real chip will activate a machine check exception
+	; in case of memory error, while the simulator will
+	; trigger a level 2 interrupt. Therefor this code section
+	; should be reached only in simulation mode.
+	; DEAD END: display Regs and HALT
+
+	j EV_MachineCheck
+END(mem_service)
+#endif