Patchwork Support for PCI Express reset type in EEH

Submitter Mike Mason
Date July 15, 2009, 6:45 p.m.
Message ID <4A5E23D0.9020906@us.ibm.com>
Permalink /patch/29827/
State Superseded

Comments

Mike Mason - July 15, 2009, 6:45 p.m.
By default, EEH does what's known as a "hot reset" during error recovery of a PCI Express device.  We've found a case where the device needs a "fundamental reset" to recover properly.  The current PCI error recovery and EEH frameworks do not support this distinction.

The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev that indicates whether the device requires a fundamental reset during error recovery.  This bit can be checked by EEH to determine which reset type is required.

This patch supersedes the previously submitted patch that implemented a reset type callback.

Please review and let me know of any concerns.

Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
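The driver side of this scheme can be sketched in a few lines. This is a minimal userspace illustration, not kernel code: `struct pci_dev_sketch` is a stand-in for the relevant bits of `struct pci_dev`, and `example_probe()` is a hypothetical probe routine for a device that is assumed to need a fundamental reset. Only the field name `fndmntl_rst_rqd` comes from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct pci_dev; only the bits used here. */
struct pci_dev_sketch {
	unsigned int is_pcie:1;
	unsigned int fndmntl_rst_rqd:1;	/* field name taken from the patch */
};

/*
 * Hypothetical probe routine for a driver whose device is assumed to
 * recover only after a fundamental reset. It records that requirement
 * once, so the platform's error recovery code can check the bit later.
 */
static int example_probe(struct pci_dev_sketch *dev)
{
	assert(dev != NULL);

	/* Fundamental reset is a PCI Express concept; leave other devices alone. */
	if (dev->is_pcie)
		dev->fndmntl_rst_rqd = 1;

	return 0;
}
```

With this arrangement the driver states its requirement at probe time and does not need to be called back when an error occurs.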
Linas Vepstas - July 23, 2009, 2:44 p.m.
2009/7/15 Mike Mason <mmlnx@us.ibm.com>:
> By default, EEH does what's known as a "hot reset" during error recovery of
> a PCI Express device.  We've found a case where the device needs a
> "fundamental reset" to recover properly.  The current PCI error recovery and
> EEH frameworks do not support this distinction.
>
> The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev
> that indicates whether the device requires a fundamental reset during error
> recovery.  This bit can be checked by EEH to determine which reset type is
> required.
>
> This patch supersedes the previously submitted patch that implemented a
> reset type callback.
>
> Please review and let me know of any concerns.

I like this patch a *lot* better .. it is vastly simpler, more direct.


> diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
> --- a/include/linux/pci.h       2009-07-13 14:25:37.000000000 -0700
> +++ b/include/linux/pci.h       2009-07-15 10:25:37.000000000 -0700
> @@ -273,6 +273,7 @@ struct pci_dev {
>        unsigned int    ari_enabled:1;  /* ARI forwarding */
>        unsigned int    is_managed:1;
>        unsigned int    is_pcie:1;
> +       unsigned int    fndmntl_rst_rqd:1; /* Dev requires fundamental reset */
>        unsigned int    state_saved:1;
>        unsigned int    is_physfn:1;
>        unsigned int    is_virtfn:1;

As Ben points out, the name is awkward.  How about needs_freset ?

Since this affects the entire pci subsystem, it should be documented
properly.  The "pci error recovery" subsystem was designed to be
usable in other architectures, and so the error recovery docs should
take at least a paragraph to describe what this flag means, and when
it's supposed to be used.

Providing the docs patch together with the pci.h patch *only* would
probably simplify acceptance by the PCI community.

--linas
Richard Lary - July 23, 2009, 3:03 p.m.
Linas Vepstas <linasvepstas@gmail.com> wrote on 07/23/2009 07:44:33 AM:

> 2009/7/15 Mike Mason <mmlnx@us.ibm.com>:
> > By default, EEH does what's known as a "hot reset" during error recovery of
> > a PCI Express device.  We've found a case where the device needs a
> > "fundamental reset" to recover properly.  The current PCI error recovery and
> > EEH frameworks do not support this distinction.
> >
> > The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev
> > that indicates whether the device requires a fundamental reset during error
> > recovery.  This bit can be checked by EEH to determine which reset type is
> > required.
> >
> > This patch supersedes the previously submitted patch that implemented a
> > reset type callback.
> >
> > Please review and let me know of any concerns.
>
> I like this patch a *lot* better .. it is vastly simpler, more direct.
>
>
> > diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
> > --- a/include/linux/pci.h       2009-07-13 14:25:37.000000000 -0700
> > +++ b/include/linux/pci.h       2009-07-15 10:25:37.000000000 -0700
> > @@ -273,6 +273,7 @@ struct pci_dev {
> >        unsigned int    ari_enabled:1;  /* ARI forwarding */
> >        unsigned int    is_managed:1;
> >        unsigned int    is_pcie:1;
> > +       unsigned int    fndmntl_rst_rqd:1; /* Dev requires fundamental reset */
> >        unsigned int    state_saved:1;
> >        unsigned int    is_physfn:1;
> >        unsigned int    is_virtfn:1;
>
> As Ben points out, the name is awkward.  How about needs_freset ?

I have no problem changing the name.

> Since this affects the entire pci subsystem, it should be documented
> properly.  The "pci error recovery" subsystem was designed to be
> usable in other architectures, and so the error recovery docs should
> take at least a paragraph to describe what this flag means, and when
> its supposed to be used.

I will take a stab at updating the docs and post here for comment.

> Providing the docs patch together with the pci.h patch *only* would
> probably simplify acceptance by the PCI community.
>
> --linas
Richard Lary - July 24, 2009, 9:36 p.m.
Linas Vepstas <linasvepstas@gmail.com> wrote on 07/23/2009 07:44:33 AM:

> 2009/7/15 Mike Mason <mmlnx@us.ibm.com>:
> > By default, EEH does what's known as a "hot reset" during error recovery of
> > a PCI Express device.  We've found a case where the device needs a
> > "fundamental reset" to recover properly.  The current PCI error recovery and
> > EEH frameworks do not support this distinction.
> >
> > The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev
> > that indicates whether the device requires a fundamental reset during error
> > recovery.  This bit can be checked by EEH to determine which reset type is
> > required.
> >
> > This patch supersedes the previously submitted patch that implemented a
> > reset type callback.
> >
> > Please review and let me know of any concerns.
>
> I like this patch a *lot* better .. it is vastly simpler, more direct.
>
>
> > diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
> > --- a/include/linux/pci.h       2009-07-13 14:25:37.000000000 -0700
> > +++ b/include/linux/pci.h       2009-07-15 10:25:37.000000000 -0700
> > @@ -273,6 +273,7 @@ struct pci_dev {
> >        unsigned int    ari_enabled:1;  /* ARI forwarding */
> >        unsigned int    is_managed:1;
> >        unsigned int    is_pcie:1;
> > +       unsigned int    fndmntl_rst_rqd:1; /* Dev requires fundamental reset */
> >        unsigned int    state_saved:1;
> >        unsigned int    is_physfn:1;
> >        unsigned int    is_virtfn:1;
>
> As Ben points out, the name is awkward.  How about needs_freset ?

I am OK with name change.

> Since this affects the entire pci subsystem, it should be documented
> properly.  The "pci error recovery" subsystem was designed to be
> usable in other architectures, and so the error recovery docs should
> take at least a paragraph to describe what this flag means, and when
> its supposed to be used.

I will update the documentation, are you referring to
Documentation/powerpc/eeh-pci-error-recovery.txt
or some other documentation?

> Providing the docs patch together with the pci.h patch *only* would
> probably simplify acceptance by the PCI community.
>
> --linas
Linas Vepstas - July 25, 2009, 12:30 a.m.
2009/7/24 Richard Lary <rlary@us.ibm.com>:
> Linas Vepstas <linasvepstas@gmail.com> wrote on 07/23/2009 07:44:33 AM:
>
>> 2009/7/15 Mike Mason <mmlnx@us.ibm.com>:
>> > By default, EEH does what's known as a "hot reset" during error recovery of
>> > a PCI Express device.  We've found a case where the device needs a
>> > "fundamental reset" to recover properly.  The current PCI error recovery and
>> > EEH frameworks do not support this distinction.
>> >
>> > The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev
>> > that indicates whether the device requires a fundamental reset during error
>> > recovery.  This bit can be checked by EEH to determine which reset type is
>> > required.
>> >
>> > This patch supersedes the previously submitted patch that implemented a
>> > reset type callback.
>> >
>> > Please review and let me know of any concerns.
>>
>> I like this patch a *lot* better .. it is vastly simpler, more direct.
>>
>>
>> > diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
>> > --- a/include/linux/pci.h       2009-07-13 14:25:37.000000000 -0700
>> > +++ b/include/linux/pci.h       2009-07-15 10:25:37.000000000 -0700
>> > @@ -273,6 +273,7 @@ struct pci_dev {
>> >        unsigned int    ari_enabled:1;  /* ARI forwarding */
>> >        unsigned int    is_managed:1;
>> >        unsigned int    is_pcie:1;
>> > +       unsigned int    fndmntl_rst_rqd:1; /* Dev requires fundamental reset */
>> >        unsigned int    state_saved:1;
>> >        unsigned int    is_physfn:1;
>> >        unsigned int    is_virtfn:1;
>>
>> As Ben points out, the name is awkward.  How about needs_freset ?
>
> I am OK with name change.
>
>
>> Since this affects the entire pci subsystem, it should be documented
>> properly.  The "pci error recovery" subsystem was designed to be
>> usable in other architectures, and so the error recovery docs should
>> take at least a paragraph to describe what this flag means, and when
>> its supposed to be used.
>
> I will update the documentation, are you referring to
> Documentation/powerpc/eeh-pci-error-recovery.txt
> or some other documentation?

No, I'm thinking
Documentation/PCI/pci-error-recovery.txt

because the flag is not powerpc-specific.

--linas

>
>> Providing the docs patch together with the pci.h patch *only* would
>> probably simplify acceptance by the PCI community.
>>
>> --linas
>
Richard Lary - July 27, 2009, 2:29 p.m.
Linas Vepstas <linasvepstas@gmail.com> wrote on 07/24/2009 05:30:09 PM:

> 2009/7/24 Richard Lary <rlary@us.ibm.com>:
> > Linas Vepstas <linasvepstas@gmail.com> wrote on 07/23/2009 07:44:33 AM:
> >
> >> 2009/7/15 Mike Mason <mmlnx@us.ibm.com>:
> >> > By default, EEH does what's known as a "hot reset" during error recovery of
> >> > a PCI Express device.  We've found a case where the device needs a
> >> > "fundamental reset" to recover properly.  The current PCI error recovery and
> >> > EEH frameworks do not support this distinction.
> >> >
> >> > The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev
> >> > that indicates whether the device requires a fundamental reset during error
> >> > recovery.  This bit can be checked by EEH to determine which reset type is
> >> > required.
> >> >
> >> > This patch supersedes the previously submitted patch that implemented a
> >> > reset type callback.
> >> >
> >> > Please review and let me know of any concerns.
> >>
> >> I like this patch a *lot* better .. it is vastly simpler, more direct.
> >>
> >>
> >> > diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
> >> > --- a/include/linux/pci.h       2009-07-13 14:25:37.000000000 -0700
> >> > +++ b/include/linux/pci.h       2009-07-15 10:25:37.000000000 -0700
> >> > @@ -273,6 +273,7 @@ struct pci_dev {
> >> >        unsigned int    ari_enabled:1;  /* ARI forwarding */
> >> >        unsigned int    is_managed:1;
> >> >        unsigned int    is_pcie:1;
> >> > +       unsigned int    fndmntl_rst_rqd:1; /* Dev requires fundamental reset */
> >> >        unsigned int    state_saved:1;
> >> >        unsigned int    is_physfn:1;
> >> >        unsigned int    is_virtfn:1;
> >>
> >> As Ben points out, the name is awkward.  How about needs_freset ?
> >
> > I am OK with name change.
> >
> >
> >> Since this affects the entire pci subsystem, it should be documented
> >> properly.  The "pci error recovery" subsystem was designed to be
> >> usable in other architectures, and so the error recovery docs should
> >> take at least a paragraph to describe what this flag means, and when
> >> its supposed to be used.
> >
> > I will update the documentation, are you referring to
> > Documentation/powerpc/eeh-pci-error-recovery.txt
> > or some other documentation?
>
> No, I'm thinking
> Documentation/PCI/pci-error-recovery.txt
>
> because the flag is not powerpc-specific.

Got it, glad I asked...

-rich

> >
> >> Providing the docs patch together with the pci.h patch *only* would
> >> probably simplify acceptance by the PCI community.
> >>
> >> --linas
> >

Patch

diff -uNrp a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
--- a/arch/powerpc/kernel/pci_64.c	2009-07-13 14:25:24.000000000 -0700
+++ b/arch/powerpc/kernel/pci_64.c	2009-07-15 10:26:26.000000000 -0700
@@ -143,6 +143,7 @@  struct pci_dev *of_create_pci_dev(struct
 	dev->dev.bus = &pci_bus_type;
 	dev->devfn = devfn;
 	dev->multifunction = 0;		/* maybe a lie? */
+	dev->fndmntl_rst_rqd = 0;	/* no fundamental reset by default */
 
 	dev->vendor = get_int_prop(node, "vendor-id", 0xffff);
 	dev->device = get_int_prop(node, "device-id", 0xffff);
diff -uNrp a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
--- a/arch/powerpc/platforms/pseries/eeh.c	2009-06-09 20:05:27.000000000 -0700
+++ b/arch/powerpc/platforms/pseries/eeh.c	2009-07-15 10:29:04.000000000 -0700
@@ -744,7 +744,15 @@  int pcibios_set_pcie_reset_state(struct
 
 static void __rtas_set_slot_reset(struct pci_dn *pdn)
 {
-	rtas_pci_slot_reset (pdn, 1);
+	struct pci_dev *dev = pdn->pcidev;
+
+	/* Determine type of EEH reset required by device:
+	 * hot reset (1, the default) or fundamental reset (3)
+	 */
+	if (dev->fndmntl_rst_rqd)
+		rtas_pci_slot_reset(pdn, 3);
+	else
+		rtas_pci_slot_reset(pdn, 1);
 
 	/* The PCI bus requires that the reset be held high for at least
 	 * a 100 milliseconds. We wait a bit longer 'just in case'.  */
diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
--- a/include/linux/pci.h	2009-07-13 14:25:37.000000000 -0700
+++ b/include/linux/pci.h	2009-07-15 10:25:37.000000000 -0700
@@ -273,6 +273,7 @@  struct pci_dev {
 	unsigned int	ari_enabled:1;	/* ARI forwarding */
 	unsigned int	is_managed:1;
 	unsigned int	is_pcie:1;
+	unsigned int    fndmntl_rst_rqd:1; /* Dev requires fundamental reset */
 	unsigned int	state_saved:1;
 	unsigned int	is_physfn:1;
 	unsigned int	is_virtfn:1;
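
The selection logic that the eeh.c hunk adds can be exercised outside the kernel with a small mock. The sketch below assumes stand-in names throughout: `struct pci_dev_sketch`, the `SKETCH_RESET_*` constants, and `sketch_reset_state()` are hypothetical and exist only for illustration. The values 1 and 3 are the ones the patch passes to `rtas_pci_slot_reset()`. The NULL check is an added assumption; the patch itself dereferences `pdn->pcidev` unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* Reset codes as passed to rtas_pci_slot_reset() in the patch. */
enum {
	SKETCH_RESET_HOT = 1,
	SKETCH_RESET_FUNDAMENTAL = 3
};

/* Stand-in for struct pci_dev, reduced to the new flag. */
struct pci_dev_sketch {
	unsigned int fndmntl_rst_rqd:1;
};

/*
 * Return the reset code to request for a device.
 * Assumption: a missing device structure falls back to the default
 * hot reset rather than being dereferenced.
 */
static int sketch_reset_state(const struct pci_dev_sketch *dev)
{
	if (dev != NULL && dev->fndmntl_rst_rqd)
		return SKETCH_RESET_FUNDAMENTAL;

	return SKETCH_RESET_HOT;
}
```

Keeping the decision in one helper like this means the caller only has to pass the returned code on to the reset routine, and devices that never set the flag see no change in behavior.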