Patchwork [V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx

login
register
mail settings
Submitter Hongtao Jia
Date March 4, 2013, 8:40 a.m.
Message ID <1362386425-20149-1-git-send-email-B38951@freescale.com>
Download mbox | patch
Permalink /patch/224648/
State Superseded
Delegated to: Kumar Gala
Headers show

Comments

Hongtao Jia - March 4, 2013, 8:40 a.m.
A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
goes down. when the link goes down, Non-posted transactions issued
via the ATMU requiring completion result in an instruction stall.
At the same time a machine-check exception is generated to the core
to allow further processing by the handler. We implements the handler
which skips the instruction caused the stall.

This patch depends on patch:
powerpc/85xx: Add platform_device declaration to fsl_pci.h

Signed-off-by: Zhao Chenhui <b35336@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Liu Shuo <soniccat.liu@gmail.com>
Signed-off-by: Jia Hongtao <B38951@freescale.com>
---
V4 changed:
* Rebased the patch on kernel v3.8.0-rc5

The link for V3:
http://patchwork.ozlabs.org/patch/171120/

 arch/powerpc/kernel/cpu_setup_fsl_booke.S |    2 +-
 arch/powerpc/kernel/traps.c               |    3 ++
 arch/powerpc/sysdev/fsl_pci.c             |   41 +++++++++++++++++++++++++++++
 arch/powerpc/sysdev/fsl_pci.h             |    6 ++++
 4 files changed, 51 insertions(+), 1 deletions(-)
Stuart Yoder - March 4, 2013, 4:16 p.m.
On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com> wrote:
> A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> goes down. when the link goes down, Non-posted transactions issued
> via the ATMU requiring completion result in an instruction stall.
> At the same time a machine-check exception is generated to the core
> to allow further processing by the handler. We implements the handler
> which skips the instruction caused the stall.

Can you explain at a high level how just skipping an instruction solves
anything?   If you just skip a load/store and continue like nothing is
wrong, isn't your system possibly in a really bad state.

And if the core is already hung, due to the PCI link going down, isn't
it too late?   How does skipping help?

Stuart
David Laight - March 4, 2013, 5:15 p.m.
> A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> goes down. when the link goes down, Non-posted transactions issued
> via the ATMU requiring completion result in an instruction stall.
> At the same time a machine-check exception is generated to the core
> to allow further processing by the handler. We implements the handler
> which skips the instruction caused the stall.

Just skipping the instruction doesn't seem a good idea.
But I suspect that re-initialising the PCI interface is also
almost impossible.

Does the mpc83xx have the same errata?
We've seen machine-check faults using the CSB bridge on an 83xx
doing a 'pio' access after a PEX_DMA transfer to certain target
addresses stalls - software gives up waiting for the dma.
The target is an fpga, nothing is mapped at those
addesses - but we'd expect to get ~0u back as happens on other
slave windows.

I also remember some problems with single word DMA.

	David
Hongtao Jia - March 5, 2013, 10:12 a.m.
> -----Original Message-----
> From: David Laight [mailto:David.Laight@ACULAB.COM]
> Sent: Tuesday, March 05, 2013 1:16 AM
> To: Jia Hongtao-B38951; linuxppc-dev@lists.ozlabs.org;
> galak@kernel.crashing.org
> Cc: Wood Scott-B07421
> Subject: RE: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> > A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> > goes down. when the link goes down, Non-posted transactions issued via
> > the ATMU requiring completion result in an instruction stall.
> > At the same time a machine-check exception is generated to the core to
> > allow further processing by the handler. We implements the handler
> > which skips the instruction caused the stall.
> 
> Just skipping the instruction doesn't seem a good idea.
> But I suspect that re-initialising the PCI interface is also almost
> impossible.

This *skipping* is the best way I thought for this errata.
It's not perfect but works.

-Hongtao.

> 
> Does the mpc83xx have the same errata?
> We've seen machine-check faults using the CSB bridge on an 83xx doing a
> 'pio' access after a PEX_DMA transfer to certain target addresses stalls
> - software gives up waiting for the dma.
> The target is an fpga, nothing is mapped at those addesses - but we'd
> expect to get ~0u back as happens on other slave windows.
> 
> I also remember some problems with single word DMA.
> 
> 	David
> 
>
Hongtao Jia - March 5, 2013, 10:12 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Tuesday, March 05, 2013 7:46 AM
> To: Stuart Yoder
> Cc: Jia Hongtao-B38951; linuxppc-dev@lists.ozlabs.org; Kumar Gala
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/04/2013 10:16:10 AM, Stuart Yoder wrote:
> > On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com>
> > wrote:
> > > A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> > > goes down. when the link goes down, Non-posted transactions issued
> > > via the ATMU requiring completion result in an instruction stall.
> > > At the same time a machine-check exception is generated to the core
> > > to allow further processing by the handler. We implements the
> > handler
> > > which skips the instruction caused the stall.
> >
> > Can you explain at a high level how just skipping an instruction
> > solves
> > anything?   If you just skip a load/store and continue like nothing is
> > wrong, isn't your system possibly in a really bad state.
> 
> If the instruction was a load, we probably at least want to fill the
> destination register with 0xffffffff or similar.

You discuss this with Liu Shuo about a year ago.
here is the log:

"
On 02/01/2012 02:18 AM, shuo.liu@freescale.com wrote:
> v3 : Skip the instruction only. Don't access the user space memory in 
>      mechine check. 

It may be the least bad option for now, but be aware that there's a
small chance that this will cause a leak of sensitive information (such
as a piece of a crypto key that happened to be sitting in the register
to be loaded into).

-Scott
"

-Hongtao.

> 
> > And if the core is already hung, due to the PCI link going down, isn't
> > it too late?   How does skipping help?
> 
> Maybe the machine check unhangs the core?
> 
> Is there an erratum number for this?
> 
> -Scott
Scott Wood - March 5, 2013, 6:47 p.m.
On 03/05/2013 04:12:30 AM, Jia Hongtao-B38951 wrote:
> 
> 
> > -----Original Message-----
> > From: Wood Scott-B07421
> > Sent: Tuesday, March 05, 2013 7:46 AM
> > To: Stuart Yoder
> > Cc: Jia Hongtao-B38951; linuxppc-dev@lists.ozlabs.org; Kumar Gala
> > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to  
> fix
> > PCIe erratum on mpc85xx
> >
> > On 03/04/2013 10:16:10 AM, Stuart Yoder wrote:
> > > On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com>
> > > wrote:
> > > > A PCIe erratum of mpc85xx may causes a core hang when a link of  
> PCIe
> > > > goes down. when the link goes down, Non-posted transactions  
> issued
> > > > via the ATMU requiring completion result in an instruction  
> stall.
> > > > At the same time a machine-check exception is generated to the  
> core
> > > > to allow further processing by the handler. We implements the
> > > handler
> > > > which skips the instruction caused the stall.
> > >
> > > Can you explain at a high level how just skipping an instruction
> > > solves
> > > anything?   If you just skip a load/store and continue like  
> nothing is
> > > wrong, isn't your system possibly in a really bad state.
> >
> > If the instruction was a load, we probably at least want to fill the
> > destination register with 0xffffffff or similar.
> 
> You discuss this with Liu Shuo about a year ago.
> here is the log:
> 
> "
> On 02/01/2012 02:18 AM, shuo.liu@freescale.com wrote:
> > v3 : Skip the instruction only. Don't access the user space memory  
> in
> >      mechine check.
> 
> It may be the least bad option for now, but be aware that there's a
> small chance that this will cause a leak of sensitive information  
> (such
> as a piece of a crypto key that happened to be sitting in the register
> to be loaded into).

Yes, that's (one reason) why you'd want to fill in a known value.  Note  
the "for now". :-)

-Scott
Hongtao Jia - March 6, 2013, 8:28 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Wednesday, March 06, 2013 2:48 AM
> To: Jia Hongtao-B38951
> Cc: Wood Scott-B07421; Stuart Yoder; linuxppc-dev@lists.ozlabs.org; Kumar
> Gala
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/05/2013 04:12:30 AM, Jia Hongtao-B38951 wrote:
> >
> >
> > > -----Original Message-----
> > > From: Wood Scott-B07421
> > > Sent: Tuesday, March 05, 2013 7:46 AM
> > > To: Stuart Yoder
> > > Cc: Jia Hongtao-B38951; linuxppc-dev@lists.ozlabs.org; Kumar Gala
> > > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to
> > fix
> > > PCIe erratum on mpc85xx
> > >
> > > On 03/04/2013 10:16:10 AM, Stuart Yoder wrote:
> > > > On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com>
> > > > wrote:
> > > > > A PCIe erratum of mpc85xx may causes a core hang when a link of
> > PCIe
> > > > > goes down. when the link goes down, Non-posted transactions
> > issued
> > > > > via the ATMU requiring completion result in an instruction
> > stall.
> > > > > At the same time a machine-check exception is generated to the
> > core
> > > > > to allow further processing by the handler. We implements the
> > > > handler
> > > > > which skips the instruction caused the stall.
> > > >
> > > > Can you explain at a high level how just skipping an instruction
> > > > solves
> > > > anything?   If you just skip a load/store and continue like
> > nothing is
> > > > wrong, isn't your system possibly in a really bad state.
> > >
> > > If the instruction was a load, we probably at least want to fill the
> > > destination register with 0xffffffff or similar.
> >
> > You discuss this with Liu Shuo about a year ago.
> > here is the log:
> >
> > "
> > On 02/01/2012 02:18 AM, shuo.liu@freescale.com wrote:
> > > v3 : Skip the instruction only. Don't access the user space memory
> > in
> > >      mechine check.
> >
> > It may be the least bad option for now, but be aware that there's a
> > small chance that this will cause a leak of sensitive information
> > (such as a piece of a crypto key that happened to be sitting in the
> > register to be loaded into).
> 
> Yes, that's (one reason) why you'd want to fill in a known value.  Note
> the "for now". :-)
> 
> -Scott

I think there is no overwhelming reason to fill the destination register
with 0xffffffff. 

There's a small chance that 0xffffffff is treated as regular data rather
than an error sign.

Also setting this register may influence the user space under certain
circumstance.

So I think just ignore the skipped instruction is an acceptable option for
this fix.

-Hongtao.
David Laight - March 6, 2013, 10:24 a.m.
> > Yes, that's (one reason) why you'd want to fill in a known value.  Note
> > the "for now". :-)
> >
> > -Scott
> 
> I think there is no overwhelming reason to fill the destination register
> with 0xffffffff.
> 
> There's a small chance that 0xffffffff is treated as regular data rather
> than an error sign.
> 
> Also setting this register may influence the user space under certain
> circumstance.
> 
> So I think just ignore the skipped instruction is an acceptable option for
> this fix.

The 'random' value is just as likely to affect the reader, but only
for some values - so you'll get almost impossible to repeat bugs.
If a fixed value (0 or ~0) has an adverse effect, at least it will
have the same every time.

Read errors are also likely to affect device drivers reading
status bits, since these are very likely 'write to clear' any
driver would have to be willing to process the 'dummy' value
in a manner that won't loop forever (especially in an ISR).

You don't need every access to be via a function that explicitly
(somehow) detects that the fault happened, but knowing that a
specific value might be caused by a dead PCIe bus, and being
able to find out whether that is true (to avoid looping forever)
is probably useful.

This is probably similar to what a driver needs to recover from
an external PCIe list being unplugged.

	David
Hongtao Jia - March 7, 2013, 8:06 a.m.
> -----Original Message-----
> From: David Laight [mailto:David.Laight@ACULAB.COM]
> Sent: Wednesday, March 06, 2013 6:24 PM
> To: Jia Hongtao-B38951; Wood Scott-B07421
> Cc: linuxppc-dev@lists.ozlabs.org; Stuart Yoder
> Subject: RE: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> > > Yes, that's (one reason) why you'd want to fill in a known value.
> > > Note the "for now". :-)
> > >
> > > -Scott
> >
> > I think there is no overwhelming reason to fill the destination
> > register with 0xffffffff.
> >
> > There's a small chance that 0xffffffff is treated as regular data
> > rather than an error sign.
> >
> > Also setting this register may influence the user space under certain
> > circumstance.
> >
> > So I think just ignore the skipped instruction is an acceptable option
> > for this fix.
> 
> The 'random' value is just as likely to affect the reader, but only for
> some values - so you'll get almost impossible to repeat bugs.
> If a fixed value (0 or ~0) has an adverse effect, at least it will have
> the same every time.
> 
> Read errors are also likely to affect device drivers reading status bits,
> since these are very likely 'write to clear' any driver would have to be
> willing to process the 'dummy' value in a manner that won't loop forever
> (especially in an ISR).
> 
> You don't need every access to be via a function that explicitly
> (somehow) detects that the fault happened, but knowing that a specific
> value might be caused by a dead PCIe bus, and being able to find out
> whether that is true (to avoid looping forever) is probably useful.
> 
> This is probably similar to what a driver needs to recover from an
> external PCIe list being unplugged.
> 
> 	David
> 
> 

In my understanding filling the register could warn the executing process
an error occurred in some cases. But no way to fix the wrong behavior caused
by the instruction lost. So let's say that filling the register may benefit
a little.

On the other side, we should not access to the addresses of unknown process
in Linux kernel. We must get the instruction before filling the register.
If the instruction is not in the cache we have to access to the unknown
addresses to get it. For system security I think this is strictly forbidden.

Here is the ideas from Scott:
"
> +	if (is_in_pci_mem_space(addr)) {
> +		inst = *(unsigned int *)regs->nip;

Be careful about taking a fault here.  A simple TLB miss should be safe
given that we shouldn't be accessing PCIe in the middle of exception
code, but what if the mapping has gone away (e.g. a userspace driver had
its code munmap()ed or swapped out)?  What if permissions allow execute
but not read (not sure if Linux will allow this, but the hardware does)?

What if it happened in a KVM guest?  You can't access guest addresses
directly.
"

Although I think filling the register have some advantages but it's should
be forbidden for security reason.

-Hongtao.
David Laight - March 7, 2013, 10:04 a.m.
> In my understanding filling the register could warn the executing process
> an error occurred in some cases. But no way to fix the wrong behavior caused
> by the instruction lost. So let's say that filling the register may benefit
> a little.

IIRC the only ppc instructions that should be accessing PCIe space
are simple memory reads/writes, locked exchanges probably don't work.
writes will be async - so we are talking about memory reads.

> On the other side, we should not access to the addresses of unknown process
> in Linux kernel. We must get the instruction before filling the register.
> If the instruction is not in the cache we have to access to the unknown
> addresses to get it. For system security I think this is strictly forbidden.

Eh???
The kernel fault handler will know whether the fault is from kernel
or userspace (in which case it must be the current process), and will
almost certainly already have code that looks at the faulting instruction
sequence.
If the faulting code address is 'user', then the normal functions for
reading user addresses have to be used.
If the faulting address is 'kernel' it might be in an ISR - restricting
what can be done - but the instruction is definitely readable.

> Although I think filling the register have some advantages but it's should
> be forbidden for security reason.

I'm sure security would say exactly the opposite.

	David
Scott Wood - March 7, 2013, 4:37 p.m.
On 03/07/2013 02:06:05 AM, Jia Hongtao-B38951 wrote:
> Here is the ideas from Scott:
> "
> > +	if (is_in_pci_mem_space(addr)) {
> > +		inst = *(unsigned int *)regs->nip;
> 
> Be careful about taking a fault here.  A simple TLB miss should be  
> safe
> given that we shouldn't be accessing PCIe in the middle of exception
> code, but what if the mapping has gone away (e.g. a userspace driver  
> had
> its code munmap()ed or swapped out)?  What if permissions allow  
> execute
> but not read (not sure if Linux will allow this, but the hardware  
> does)?
> 
> What if it happened in a KVM guest?  You can't access guest addresses
> directly.
> "

That means you need to be careful about how you read the instruction,  
not that you shouldn't do it at all.

-Scott
Hongtao Jia - March 8, 2013, 8:01 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Friday, March 08, 2013 12:38 AM
> To: Jia Hongtao-B38951
> Cc: David Laight; Wood Scott-B07421; linuxppc-dev@lists.ozlabs.org;
> Stuart Yoder
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/07/2013 02:06:05 AM, Jia Hongtao-B38951 wrote:
> > Here is the ideas from Scott:
> > "
> > > +	if (is_in_pci_mem_space(addr)) {
> > > +		inst = *(unsigned int *)regs->nip;
> >
> > Be careful about taking a fault here.  A simple TLB miss should be
> > safe given that we shouldn't be accessing PCIe in the middle of
> > exception code, but what if the mapping has gone away (e.g. a
> > userspace driver had its code munmap()ed or swapped out)?  What if
> > permissions allow execute but not read (not sure if Linux will allow
> > this, but the hardware does)?
> >
> > What if it happened in a KVM guest?  You can't access guest addresses
> > directly.
> > "
> 
> That means you need to be careful about how you read the instruction, not
> that you shouldn't do it at all.
> 
> -Scott

I agree.

Do you have a more secure way to get the instruction?
Or what should be done to avoid permission break issue?

Thanks.
-Hongtao
Hongtao Jia - March 12, 2013, 7:40 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Saturday, March 09, 2013 8:49 AM
> To: Jia Hongtao-B38951
> Cc: Wood Scott-B07421; David Laight; linuxppc-dev@lists.ozlabs.org;
> Stuart Yoder
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/08/2013 02:01:46 AM, Jia Hongtao-B38951 wrote:
> >
> >
> > > -----Original Message-----
> > > From: Wood Scott-B07421
> > > Sent: Friday, March 08, 2013 12:38 AM
> > > To: Jia Hongtao-B38951
> > > Cc: David Laight; Wood Scott-B07421; linuxppc-dev@lists.ozlabs.org;
> > > Stuart Yoder
> > > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to
> > fix
> > > PCIe erratum on mpc85xx
> > >
> > > On 03/07/2013 02:06:05 AM, Jia Hongtao-B38951 wrote:
> > > > Here is the ideas from Scott:
> > > > "
> > > > > +	if (is_in_pci_mem_space(addr)) {
> > > > > +		inst = *(unsigned int *)regs->nip;
> > > >
> > > > Be careful about taking a fault here.  A simple TLB miss should be
> > > > safe given that we shouldn't be accessing PCIe in the middle of
> > > > exception code, but what if the mapping has gone away (e.g. a
> > > > userspace driver had its code munmap()ed or swapped out)?  What if
> > > > permissions allow execute but not read (not sure if Linux will
> > allow
> > > > this, but the hardware does)?
> > > >
> > > > What if it happened in a KVM guest?  You can't access guest
> > addresses
> > > > directly.
> > > > "
> > >
> > > That means you need to be careful about how you read the
> > instruction, not
> > > that you shouldn't do it at all.
> > >
> > > -Scott
> >
> > I agree.
> >
> > Do you have a more secure way to get the instruction?
> > Or what should be done to avoid permission break issue?
> 
> probe_kernel_address() should take care of userspace issues.  As for
> KVM, if you see MSR_GS set, bail out and don't apply the workaround.
> Let KVM/QEMU deal with it as it wishes (e.g. reflect to the guest and
> let its machine check handler do the skipping).  On PR-mode KVM (e.g.
> on e500v2-based chips) there is no MSR_GS and it just looks like
> userspace code -- for now just pretend it is user mode.
> 
> -Scott

Hi Scott,

Is that OK if I use the following code?

	u32 inst;
	int ret;

	if (is_in_pci_mem_space(addr)) {
		if (!user_mode(regs)) {
			ret = probe_kernel_address(regs->nip, inst);

			if (!ret) {
				rd = get_rt(inst);
				regs->gpr[rd] = 0xffffffff;
			}
		}

		regs->nip += 4;
		return 1;
	}


Thanks.
-Hongtao.
David Laight - March 12, 2013, 9:47 a.m.
> Is that OK if I use the following code?
...
> 	if (is_in_pci_mem_space(addr)) {
> 		if (!user_mode(regs)) {
> 			ret = probe_kernel_address(regs->nip, inst);
> 
> 			if (!ret) {
> 				rd = get_rt(inst);
> 				regs->gpr[rd] = 0xffffffff;
> 			}
> 		}

Don't you need to check that the instruction is actually
a memory read?

I also know that there are people mapping PCIe addresses
directly into userspace for 'simple' access to things like
fpga devices.

I suspect that such devices are the ones likely to generate
the faulting cycles. So you probably do want to handle
faults from normal userspace addresses.

	David
Scott Wood - March 13, 2013, 4:37 p.m.
On 03/13/2013 04:40:40 AM, David Laight wrote:
> > Hmm, seems there's no probe_user_address() -- for userspace we
> > basically want the same thing minus the KERNEL_DS.  See
> > arch/powerpc/perf/callchain.c for an example.
> 
> Isn't that just copy_from_user() ?

Plus pagefault_disable/enable().

-Scott
Hongtao Jia - March 15, 2013, 2:47 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Thursday, March 14, 2013 12:38 AM
> To: David Laight
> Cc: Jia Hongtao-B38951; Wood Scott-B07421; linuxppc-dev@lists.ozlabs.org;
> Stuart Yoder
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/13/2013 04:40:40 AM, David Laight wrote:
> > > Hmm, seems there's no probe_user_address() -- for userspace we
> > > basically want the same thing minus the KERNEL_DS.  See
> > > arch/powerpc/perf/callchain.c for an example.
> >
> > Isn't that just copy_from_user() ?
> 
> Plus pagefault_disable/enable().
> 
> -Scott

pagefault_disable() is identical to preempt_disable(). So I think this
could not avoid other cpu to swap out the instruction we want to read back.
probe_kernel_address() also have the same issue.

Does this mean we can't just use pagefault_disable() to prevent from getting
the wrong instruction?

-Hongtao.
Hongtao Jia - March 29, 2013, 8:03 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Saturday, March 16, 2013 12:35 AM
> To: Jia Hongtao-B38951
> Cc: Wood Scott-B07421; David Laight; linuxppc-dev@lists.ozlabs.org;
> Stuart Yoder
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/14/2013 09:47:58 PM, Jia Hongtao-B38951 wrote:
> >
> > > -----Original Message-----
> > > From: Wood Scott-B07421
> > > Sent: Thursday, March 14, 2013 12:38 AM
> > > To: David Laight
> > > Cc: Jia Hongtao-B38951; Wood Scott-B07421;
> > linuxppc-dev@lists.ozlabs.org;
> > > Stuart Yoder
> > > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to
> > fix
> > > PCIe erratum on mpc85xx
> > >
> > > On 03/13/2013 04:40:40 AM, David Laight wrote:
> > > > > Hmm, seems there's no probe_user_address() -- for userspace we
> > > > > basically want the same thing minus the KERNEL_DS.  See
> > > > > arch/powerpc/perf/callchain.c for an example.
> > > >
> > > > Isn't that just copy_from_user() ?
> > >
> > > Plus pagefault_disable/enable().
> > >
> > > -Scott
> >
> > pagefault_disable() is identical to preempt_disable(). So I think this
> > could not avoid other cpu to swap out the instruction we want to read
> > back.
> > probe_kernel_address() also have the same issue.
> 
> That's not the point -- the point is to let the page fault handler know
> that it should go directly to bad_page_fault().  Do not pass
> handle_mm_fault().  Do not collect a page from disk.
> 
> Granted, we're already in atomic context which will have that effect
> due to being in the machine check handler, but it's better to be
> explicit about it and not depend on how pagefault_diasble() is
> implemented.
> 
> -Scott


Based on the comments I updated the machine check handler.

Changes from last version:
* Check MSR_GS state
* Check if the instruction is LD
* Handle the user space issue

The updated machine check handler is as following:

int fsl_pci_mcheck_exception(struct pt_regs *regs)
{
        unsigned int op, rd;
        u32 inst;
        int ret;
        phys_addr_t addr = 0;

        /* Let KVM/QEMU deal with the exception */
        if (regs->msr & MSR_GS)
                return 0;

#ifdef CONFIG_PHYS_64BIT
        addr = mfspr(SPRN_MCARU);
        addr <<= 32;
#endif
        addr += mfspr(SPRN_MCAR);

        if (is_in_pci_mem_space(addr)) {
                if (user_mode(regs)) {
                        pagefault_disable();
                        ret = copy_from_user(&(inst), (u32 __user *)regs->nip, sizeof(inst));
                        pagefault_enable();
                } else {
                        ret = probe_kernel_address(regs->nip, inst);
                }

                op = get_op(inst);
                /* Check if the instruction is LD */
                if (!ret && (op == 111010)) {
                        rd = get_rt(inst);
                        regs->gpr[rd] = 0xffffffff;
                }

                regs->nip += 4;
                return 1;
        }

        return 0;
}

BTW, I'm still not sure how to deal with LD instruction with update.

Any comments and suggestions are welcomed.

Thanks.
-Hongtao.
Scott Wood - March 29, 2013, 4:33 p.m.
On 03/29/2013 03:03:51 AM, Jia Hongtao-B38951 wrote:
> 
> 
> > -----Original Message-----
> > From: Wood Scott-B07421
> > Sent: Saturday, March 16, 2013 12:35 AM
> > To: Jia Hongtao-B38951
> > Cc: Wood Scott-B07421; David Laight; linuxppc-dev@lists.ozlabs.org;
> > Stuart Yoder
> > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to  
> fix
> > PCIe erratum on mpc85xx
> >
> > On 03/14/2013 09:47:58 PM, Jia Hongtao-B38951 wrote:
> > >
> > > > -----Original Message-----
> > > > From: Wood Scott-B07421
> > > > Sent: Thursday, March 14, 2013 12:38 AM
> > > > To: David Laight
> > > > Cc: Jia Hongtao-B38951; Wood Scott-B07421;
> > > linuxppc-dev@lists.ozlabs.org;
> > > > Stuart Yoder
> > > > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler  
> to
> > > fix
> > > > PCIe erratum on mpc85xx
> > > >
> > > > On 03/13/2013 04:40:40 AM, David Laight wrote:
> > > > > > Hmm, seems there's no probe_user_address() -- for userspace  
> we
> > > > > > basically want the same thing minus the KERNEL_DS.  See
> > > > > > arch/powerpc/perf/callchain.c for an example.
> > > > >
> > > > > Isn't that just copy_from_user() ?
> > > >
> > > > Plus pagefault_disable/enable().
> > > >
> > > > -Scott
> > >
> > > pagefault_disable() is identical to preempt_disable(). So I think  
> this
> > > could not avoid other cpu to swap out the instruction we want to  
> read
> > > back.
> > > probe_kernel_address() also have the same issue.
> >
> > That's not the point -- the point is to let the page fault handler  
> know
> > that it should go directly to bad_page_fault().  Do not pass
> > handle_mm_fault().  Do not collect a page from disk.
> >
> > Granted, we're already in atomic context which will have that effect
> > due to being in the machine check handler, but it's better to be
> > explicit about it and not depend on how pagefault_diasble() is
> > implemented.
> >
> > -Scott
> 
> 
> Based on the comments I updated the machine check handler.
> 
> Changes from last version:
> * Check MSR_GS state
> * Check if the instruction is LD
> * Handle the user space issue
> 
> The updated machine check handler is as following:
> 
> int fsl_pci_mcheck_exception(struct pt_regs *regs)
> {
>         unsigned int op, rd;
>         u32 inst;
>         int ret;
>         phys_addr_t addr = 0;
> 
>         /* Let KVM/QEMU deal with the exception */
>         if (regs->msr & MSR_GS)
>                 return 0;
> 
> #ifdef CONFIG_PHYS_64BIT
>         addr = mfspr(SPRN_MCARU);
>         addr <<= 32;
> #endif
>         addr += mfspr(SPRN_MCAR);
> 
>         if (is_in_pci_mem_space(addr)) {
>                 if (user_mode(regs)) {
>                         pagefault_disable();
>                         ret = copy_from_user(&(inst), (u32 __user  
> *)regs->nip, sizeof(inst));
>                         pagefault_enable();

You could use get_user() instead of copy_from_user().

>                 } else {
>                         ret = probe_kernel_address(regs->nip, inst);
>                 }
> 
>                 op = get_op(inst);
>                 /* Check if the instruction is LD */
>                 if (!ret && (op == 111010)) {
>                         rd = get_rt(inst);
>                         regs->gpr[rd] = 0xffffffff;
>                 }
> 
>                 regs->nip += 4;
>                 return 1;
>         }
> 
>         return 0;
> }
> 
> BTW, I'm still not sure how to deal with LD instruction with update.

You would need to do the update yourself.  Or just say that's a case  
you don't handle, and return 0.

Again, please check for the size of the load operation.

-Scott
Hongtao Jia - April 2, 2013, 9:28 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Saturday, March 30, 2013 12:34 AM
> To: Jia Hongtao-B38951
> Cc: Wood Scott-B07421; David Laight; linuxppc-dev@lists.ozlabs.org;
> Stuart Yoder
> Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix
> PCIe erratum on mpc85xx
> 
> On 03/29/2013 03:03:51 AM, Jia Hongtao-B38951 wrote:
> >
> >
> > > -----Original Message-----
> > > From: Wood Scott-B07421
> > > Sent: Saturday, March 16, 2013 12:35 AM
> > > To: Jia Hongtao-B38951
> > > Cc: Wood Scott-B07421; David Laight; linuxppc-dev@lists.ozlabs.org;
> > > Stuart Yoder
> > > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to
> > fix
> > > PCIe erratum on mpc85xx
> > >
> > > On 03/14/2013 09:47:58 PM, Jia Hongtao-B38951 wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Wood Scott-B07421
> > > > > Sent: Thursday, March 14, 2013 12:38 AM
> > > > > To: David Laight
> > > > > Cc: Jia Hongtao-B38951; Wood Scott-B07421;
> > > > linuxppc-dev@lists.ozlabs.org;
> > > > > Stuart Yoder
> > > > > Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler
> > to
> > > > fix
> > > > > PCIe erratum on mpc85xx
> > > > >
> > > > > On 03/13/2013 04:40:40 AM, David Laight wrote:
> > > > > > > Hmm, seems there's no probe_user_address() -- for userspace
> > we
> > > > > > > basically want the same thing minus the KERNEL_DS.  See
> > > > > > > arch/powerpc/perf/callchain.c for an example.
> > > > > >
> > > > > > Isn't that just copy_from_user() ?
> > > > >
> > > > > Plus pagefault_disable/enable().
> > > > >
> > > > > -Scott
> > > >
> > > > pagefault_disable() is identical to preempt_disable(). So I think
> > this
> > > > could not avoid other cpu to swap out the instruction we want to
> > read
> > > > back.
> > > > probe_kernel_address() also have the same issue.
> > >
> > > That's not the point -- the point is to let the page fault handler
> > know
> > > that it should go directly to bad_page_fault().  Do not pass
> > > handle_mm_fault().  Do not collect a page from disk.
> > >
> > > Granted, we're already in atomic context which will have that effect
> > > due to being in the machine check handler, but it's better to be
> > > explicit about it and not depend on how pagefault_diasble() is
> > > implemented.
> > >
> > > -Scott
> >
> >
> > Based on the comments I updated the machine check handler.
> >
> > Changes from last version:
> > * Check MSR_GS state
> > * Check if the instruction is LD
> > * Handle the user space issue
> >
> > The updated machine check handler is as following:
> >
> > int fsl_pci_mcheck_exception(struct pt_regs *regs) {
> >         unsigned int op, rd;
> >         u32 inst;
> >         int ret;
> >         phys_addr_t addr = 0;
> >
> >         /* Let KVM/QEMU deal with the exception */
> >         if (regs->msr & MSR_GS)
> >                 return 0;
> >
> > #ifdef CONFIG_PHYS_64BIT
> >         addr = mfspr(SPRN_MCARU);
> >         addr <<= 32;
> > #endif
> >         addr += mfspr(SPRN_MCAR);
> >
> >         if (is_in_pci_mem_space(addr)) {
> >                 if (user_mode(regs)) {
> >                         pagefault_disable();
> >                         ret = copy_from_user(&(inst), (u32 __user
> > *)regs->nip, sizeof(inst));
> >                         pagefault_enable();
> 
> You could use get_user() instead of copy_from_user().

Got it.

> 
> >                 } else {
> >                         ret = probe_kernel_address(regs->nip, inst);
> >                 }
> >
> >                 op = get_op(inst);
> >                 /* Check if the instruction is LD */
> >                 if (!ret && (op == 111010)) {
> >                         rd = get_rt(inst);
> >                         regs->gpr[rd] = 0xffffffff;
> >                 }
> >
> >                 regs->nip += 4;
> >                 return 1;
> >         }
> >
> >         return 0;
> > }
> >
> > BTW, I'm still not sure how to deal with LD instruction with update.
> 
> You would need to do the update yourself.  Or just say that's a case you
> don't handle, and return 0.
> 
> Again, please check for the size of the load operation.
> 
> -Scott

For informing error to the process that hold the stall instruction
we need to do:
1. Verify the instruction is load.
2. Fill the rd register with ~0UL.
3. Deal with the load instruction with update.

Here is the problems:
1. So many load instructions to handle. There are dozens of load instructions
   and most of them with different op code. Like:
   
   lbz: 1 0 0 0 1 0
   lhz: 1 0 1 0 0 0
   lwz: 1 0 0 0 0 0
   ld : 1 1 1 0 1 0
   ...

   Is there any available API for verifying the load instruction?

2. For different size of load operation could we just fill the rd register with
   ~0UL?

3. A load instruction with update could not just verified by op code. I'd like
   to leave it along. I think we could not fix but just inform the error by
   filling the rd with ~0UL. Could you explain why should we care about the update?


The main problem is how to verifying the load instruction. I wonder if there is
an easy way to do it. If not the code here will be so complicated.

Thanks.
-Hongtao.

Patch

diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index dcd8819..f1bde90 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -66,7 +66,7 @@  _GLOBAL(__setup_cpu_e500v2)
 	bl	__e500_icache_setup
 	bl	__e500_dcache_setup
 	bl	__setup_e500_ivors
-#ifdef CONFIG_FSL_RIO
+#if defined(CONFIG_FSL_RIO) || defined(CONFIG_FSL_PCI)
 	/* Ensure that RFXE is set */
 	mfspr	r3,SPRN_HID1
 	oris	r3,r3,HID1_RFXE@h
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a008cf5..dd275a4 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -59,6 +59,7 @@ 
 #include <asm/fadump.h>
 #include <asm/switch_to.h>
 #include <asm/debug.h>
+#include <sysdev/fsl_pci.h>
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -556,6 +557,8 @@  int machine_check_e500(struct pt_regs *regs)
 	if (reason & MCSR_BUS_RBERR) {
 		if (fsl_rio_mcheck_exception(regs))
 			return 1;
+		if (fsl_pci_mcheck_exception(regs))
+			return 1;
 	}
 
 	printk("Machine check in kernel mode.\n");
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 682084d..39a10b6 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -30,6 +30,7 @@ 
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
 #include <asm/machdep.h>
 #include <sysdev/fsl_soc.h>
 #include <sysdev/fsl_pci.h>
@@ -826,6 +827,46 @@  u64 fsl_pci_immrbar_base(struct pci_controller *hose)
 	return 0;
 }
 
+#ifdef CONFIG_E500
+static int is_in_pci_mem_space(phys_addr_t addr)
+{
+	struct pci_controller *hose;
+	struct resource *res;
+	int i;
+
+	list_for_each_entry(hose, &hose_list, list_node) {
+		if (!early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP))
+			continue;
+
+		for (i = 0; i < 3; i++) {
+			res = &hose->mem_resources[i];
+			if ((res->flags & IORESOURCE_MEM) &&
+				addr >= res->start && addr <= res->end)
+				return 1;
+		}
+	}
+	return 0;
+}
+
+int fsl_pci_mcheck_exception(struct pt_regs *regs)
+{
+	phys_addr_t addr = 0;
+
+#ifdef CONFIG_PHYS_64BIT
+	addr = mfspr(SPRN_MCARU);
+	addr <<= 32;
+#endif
+	addr += mfspr(SPRN_MCAR);
+
+	if (is_in_pci_mem_space(addr)) {
+		regs->nip += 4;
+		return 1;
+	}
+
+	return 0;
+}
+#endif
+
 #if defined(CONFIG_FSL_SOC_BOOKE) || defined(CONFIG_PPC_86xx)
 static const struct of_device_id pci_ids[] = {
 	{ .compatible = "fsl,mpc8540-pci", },
diff --git a/arch/powerpc/sysdev/fsl_pci.h b/arch/powerpc/sysdev/fsl_pci.h
index df66721..ee90a95 100644
--- a/arch/powerpc/sysdev/fsl_pci.h
+++ b/arch/powerpc/sysdev/fsl_pci.h
@@ -115,5 +115,11 @@  static inline int mpc85xx_pci_err_probe(struct platform_device *op)
 }
 #endif
 
+#ifdef CONFIG_FSL_PCI
+extern int fsl_pci_mcheck_exception(struct pt_regs *);
+#else
+static inline int fsl_pci_mcheck_exception(struct pt_regs *regs) {return 0; }
+#endif
+
 #endif /* __POWERPC_FSL_PCI_H */
 #endif /* __KERNEL__ */