[v3,3/3] PCI: Avoid slot reset for Cavium cn8xxx root ports

Message ID 20170830142454.10971-4-jglauber@cavium.com
State Superseded
Series
  • Workaround for bus/slot reset on Cavium cn8xxx root ports

Commit Message

Jan Glauber Aug. 30, 2017, 2:24 p.m.
Root ports of cn8xxx do not function after a slot reset when used with
some e1000e and LSI HBA devices. Add a quirk to prevent slot reset on
these root ports.

Signed-off-by: Jan Glauber <jglauber@cavium.com>
---
 drivers/pci/quirks.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Alex Williamson Aug. 30, 2017, 2:40 p.m. | #1
On Wed, 30 Aug 2017 16:24:54 +0200
Jan Glauber <jglauber@cavium.com> wrote:

> Root ports of cn8xxx do not function after a slot reset when used with
> some e1000e and LSI HBA devices. Add a quirk to prevent slot reset on
> these root ports.
> 
> Signed-off-by: Jan Glauber <jglauber@cavium.com>
> ---
>  drivers/pci/quirks.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 85191b8..6679971 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -845,6 +845,22 @@ static void quirk_cavium_sriov_rnm_link(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa018, quirk_cavium_sriov_rnm_link);
>  #endif
>  
> +/*
> + * Root port on some Cavium CN8xxx chips do not successfully complete
> + * a bus reset when used with certain types of child devices. Config
> + * space access to the child may quit responding. Flag all devices under
> + * the secondary bus as non-resettable.
> + */
> +static void quirk_CN8xxx_secondary_bus(struct pci_dev *dev)
> +{
> +	struct pci_dev *pdev;
> +
> +	dev_warn(&dev->dev, "Cavium CN8xxx quirk detected; reset for devices on secondary bus disabled\n");
> +	list_for_each_entry(pdev, &dev->subordinate->devices, bus_list)
> +		pdev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_CN8xxx_secondary_bus);
> +
>  /*
>   * Some settings of MMRBC can lead to data corruption so block changes.
>   * See AMD 8131 HyperTransport PCI-X Tunnel Revision Guide


This doesn't seem reliable, doesn't the user just need to remove and
reprobe the slot and the device would re-appear without this flag set?
Thanks,

Alex
Jan Glauber Aug. 31, 2017, 9:40 a.m. | #2
On Wed, Aug 30, 2017 at 08:40:12AM -0600, Alex Williamson wrote:
> On Wed, 30 Aug 2017 16:24:54 +0200
> Jan Glauber <jglauber@cavium.com> wrote:
> 
> > Root ports of cn8xxx do not function after a slot reset when used with
> > some e1000e and LSI HBA devices. Add a quirk to prevent slot reset on
> > these root ports.
> > 
> > Signed-off-by: Jan Glauber <jglauber@cavium.com>
> > ---
> >  drivers/pci/quirks.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 85191b8..6679971 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -845,6 +845,22 @@ static void quirk_cavium_sriov_rnm_link(struct pci_dev *dev)
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa018, quirk_cavium_sriov_rnm_link);
> >  #endif
> >  
> > +/*
> > + * Root port on some Cavium CN8xxx chips do not successfully complete
> > + * a bus reset when used with certain types of child devices. Config
> > + * space access to the child may quit responding. Flag all devices under
> > + * the secondary bus as non-resettable.
> > + */
> > +static void quirk_CN8xxx_secondary_bus(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *pdev;
> > +
> > +	dev_warn(&dev->dev, "Cavium CN8xxx quirk detected; reset for devices on secondary bus disabled\n");
> > +	list_for_each_entry(pdev, &dev->subordinate->devices, bus_list)
> > +		pdev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_CN8xxx_secondary_bus);
> > +
> >  /*
> >   * Some settings of MMRBC can lead to data corruption so block changes.
> >   * See AMD 8131 HyperTransport PCI-X Tunnel Revision Guide
> 
> 
> This doesn't seem reliable, doesn't the user just need to remove and
> reprobe the slot and the device would re-appear without this flag set?

No, I tried before to disable the slot with "echo 0 > /sys/bus/pci/slots/3/power"
but that does not work as it is not supported.

I'm not familiar with the quirk types, would another one be better
suited here (even if we don't have the problem you described)?

thanks,
Jan


> Thanks,
> 
> Alex
Alex Williamson Aug. 31, 2017, 4:01 p.m. | #3
On Thu, 31 Aug 2017 11:40:52 +0200
Jan Glauber <jan.glauber@caviumnetworks.com> wrote:

> On Wed, Aug 30, 2017 at 08:40:12AM -0600, Alex Williamson wrote:
> > On Wed, 30 Aug 2017 16:24:54 +0200
> > Jan Glauber <jglauber@cavium.com> wrote:
> >   
> > > Root ports of cn8xxx do not function after a slot reset when used with
> > > some e1000e and LSI HBA devices. Add a quirk to prevent slot reset on
> > > these root ports.
> > > 
> > > Signed-off-by: Jan Glauber <jglauber@cavium.com>
> > > ---
> > >  drivers/pci/quirks.c | 16 ++++++++++++++++
> > >  1 file changed, 16 insertions(+)
> > > 
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 85191b8..6679971 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -845,6 +845,22 @@ static void quirk_cavium_sriov_rnm_link(struct pci_dev *dev)
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa018, quirk_cavium_sriov_rnm_link);
> > >  #endif
> > >  
> > > +/*
> > > + * Root port on some Cavium CN8xxx chips do not successfully complete
> > > + * a bus reset when used with certain types of child devices. Config
> > > + * space access to the child may quit responding. Flag all devices under
> > > + * the secondary bus as non-resettable.
> > > + */
> > > +static void quirk_CN8xxx_secondary_bus(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *pdev;
> > > +
> > > +	dev_warn(&dev->dev, "Cavium CN8xxx quirk detected; reset for devices on secondary bus disabled\n");
> > > +	list_for_each_entry(pdev, &dev->subordinate->devices, bus_list)
> > > +		pdev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > > +}
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_CN8xxx_secondary_bus);
> > > +
> > >  /*
> > >   * Some settings of MMRBC can lead to data corruption so block changes.
> > >   * See AMD 8131 HyperTransport PCI-X Tunnel Revision Guide  
> > 
> > 
> > This doesn't seem reliable, doesn't the user just need to remove and
> > reprobe the slot and the device would re-appear without this flag set?  
> 
> No, I tried before to disable the slot with "echo 0 > /sys/bus/pci/slots/3/power"
> but that does not work as it is not supported.
> 
> I'm not familiar with the quirk types, would another one be better
> suited here (even if we don't have the problem you described)?

The scenario I'm mentioning is to "echo 1 > /sys/bus/pci/devices/<some
device under the slot>/remove", then "echo <that device address> >
/sys/bus/pci/rescan".  This would break the ordering implicit in using
a fixup defined for the root port.  It seems like it'd make a lot more
sense to add a test on the parent bridge more similar to how the bus
reset works.  It's not the subordinate devices imposing the
no-bus-reset flag, it's the bridge device and the objects and code
should support and reflect that.  Thanks,

Alex
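
The ordering problem described in this reply can be illustrated with a small userspace model in C. This is only a sketch, not kernel code: the `model_*` names and the flag value are hypothetical, and it assumes that a re-enumerated child starts with cleared flags while the root port's fixup is not run again. It contrasts flagging the children (as the patch does) with flagging the bridge itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCI_DEV_FLAGS_NO_BUS_RESET; the value is illustrative. */
#define MODEL_NO_BUS_RESET (1u << 6)

struct model_dev {
	unsigned int dev_flags;
};

/* One root port (bridge) with a single child device below it. */
struct model_port {
	struct model_dev self;
	struct model_dev child;
};

/* Approach of the patch: the root port's fixup marks the children present at that time. */
static void model_port_fixup_flag_children(struct model_port *port)
{
	assert(port != NULL);
	port->child.dev_flags |= MODEL_NO_BUS_RESET;
}

/* Alternative raised in review: record the restriction on the bridge itself. */
static void model_port_fixup_flag_bridge(struct model_port *port)
{
	assert(port != NULL);
	port->self.dev_flags |= MODEL_NO_BUS_RESET;
}

/*
 * Model of "remove" followed by "rescan": the child comes back as a fresh
 * device object. Assumption: the root port's fixup does not run again.
 */
static void model_remove_and_rescan_child(struct model_port *port)
{
	assert(port != NULL);
	port->child = (struct model_dev){ 0 };
}

/* A reset is refused if either the bridge or the child carries the flag. */
static bool model_reset_blocked(const struct model_port *port)
{
	assert(port != NULL);
	return ((port->self.dev_flags | port->child.dev_flags) &
		MODEL_NO_BUS_RESET) != 0;
}
```

In this model the child-based flag is lost after a remove and rescan, while the bridge-based flag survives because the bridge object itself was never removed.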
Jan Glauber Sept. 7, 2017, 7:40 a.m. | #4
On Thu, Aug 31, 2017 at 10:01:30AM -0600, Alex Williamson wrote:
> On Thu, 31 Aug 2017 11:40:52 +0200
> Jan Glauber <jan.glauber@caviumnetworks.com> wrote:
> 
> > On Wed, Aug 30, 2017 at 08:40:12AM -0600, Alex Williamson wrote:
> > > On Wed, 30 Aug 2017 16:24:54 +0200
> > > Jan Glauber <jglauber@cavium.com> wrote:
> > >   
> > > > Root ports of cn8xxx do not function after a slot reset when used with
> > > > some e1000e and LSI HBA devices. Add a quirk to prevent slot reset on
> > > > these root ports.
> > > > 
> > > > Signed-off-by: Jan Glauber <jglauber@cavium.com>
> > > > ---
> > > >  drivers/pci/quirks.c | 16 ++++++++++++++++
> > > >  1 file changed, 16 insertions(+)
> > > > 
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 85191b8..6679971 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -845,6 +845,22 @@ static void quirk_cavium_sriov_rnm_link(struct pci_dev *dev)
> > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa018, quirk_cavium_sriov_rnm_link);
> > > >  #endif
> > > >  
> > > > +/*
> > > > + * Root port on some Cavium CN8xxx chips do not successfully complete
> > > > + * a bus reset when used with certain types of child devices. Config
> > > > + * space access to the child may quit responding. Flag all devices under
> > > > + * the secondary bus as non-resettable.
> > > > + */
> > > > +static void quirk_CN8xxx_secondary_bus(struct pci_dev *dev)
> > > > +{
> > > > +	struct pci_dev *pdev;
> > > > +
> > > > +	dev_warn(&dev->dev, "Cavium CN8xxx quirk detected; reset for devices on secondary bus disabled\n");
> > > > +	list_for_each_entry(pdev, &dev->subordinate->devices, bus_list)
> > > > +		pdev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > > > +}
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_CN8xxx_secondary_bus);
> > > > +
> > > >  /*
> > > >   * Some settings of MMRBC can lead to data corruption so block changes.
> > > >   * See AMD 8131 HyperTransport PCI-X Tunnel Revision Guide  
> > > 
> > > 
> > > This doesn't seem reliable, doesn't the user just need to remove and
> > > reprobe the slot and the device would re-appear without this flag set?  
> > 
> > No, I tried before to disable the slot with "echo 0 > /sys/bus/pci/slots/3/power"
> > but that does not work as it is not supported.
> > 
> > I'm not familiar with the quirk types, would another one be better
> > suited here (even if we don't have the problem you described)?
> 
> The scenario I'm mentioning is to "echo 1 > /sys/bus/pci/devices/<some
> device under the slot>/remove", then "echo <that device address> >
> /sys/bus/pci/rescan".  This would break the ordering implicit in using
> a fixup defined for the root port.  It seems like it'd make a lot more
> sense to add a test on the parent bridge more similar to how the bus
> reset works.  It's not the subordinate devices imposing the
> no-bus-reset flag, it's the bridge device and the objects and code
> should support and reflect that.  Thanks,

Doing "echo <that device address> > /sys/bus/pci/rescan" after the
remove did not work for me, but maybe the format of the device address
needs to be different. Anyway, the sequence
  echo 1 > /sys/bus/pci/devices/<some device under the slot>/remove
  echo 1 > /sys/bus/pci/rescan
still triggers the panic as you mentioned above.

I agree that the subordinate devices are not causing the issue, still
I need to make pci_slot_resetable() return false in our case.

So what if we add an additional check like:

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fdf65a6..389db4b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4389,6 +4389,9 @@ static bool pci_slot_resetable(struct pci_slot *slot)
 {
        struct pci_dev *dev;
 
+       if (slot->bus->self & PCI_DEV_FLAGS_NO_BUS_RESET)
+               return false;
+
        list_for_each_entry(dev, &slot->bus->devices, bus_list) {
                if (!dev->slot || dev->slot != slot)
                        continue;

--Jan
Jan Glauber Sept. 7, 2017, 7:49 a.m. | #5
On Thu, Sep 07, 2017 at 09:40:11AM +0200, Jan Glauber wrote:
> So what if we add an additional check like:
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index fdf65a6..389db4b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4389,6 +4389,9 @@ static bool pci_slot_resetable(struct pci_slot *slot)
>  {
>         struct pci_dev *dev;
>  
> +       if (slot->bus->self & PCI_DEV_FLAGS_NO_BUS_RESET)
> +               return false;
> +
>         list_for_each_entry(dev, &slot->bus->devices, bus_list) {
>                 if (!dev->slot || dev->slot != slot)
>                         continue;

Obviously I meant:
if (slot->bus->self->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)

--Jan
Alex Williamson Sept. 7, 2017, 4:52 p.m. | #6
On Thu, 7 Sep 2017 09:49:04 +0200
Jan Glauber <jan.glauber@caviumnetworks.com> wrote:

> On Thu, Sep 07, 2017 at 09:40:11AM +0200, Jan Glauber wrote:
> > So what if we add an additional check like:
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index fdf65a6..389db4b 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -4389,6 +4389,9 @@ static bool pci_slot_resetable(struct pci_slot *slot)
> >  {
> >         struct pci_dev *dev;
> >  
> > +       if (slot->bus->self & PCI_DEV_FLAGS_NO_BUS_RESET)
> > +               return false;
> > +
> >         list_for_each_entry(dev, &slot->bus->devices, bus_list) {
> >                 if (!dev->slot || dev->slot != slot)
> >                         continue;  
> 
> Obviously I meant:
> if (slot->bus->self->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)

Much better, perhaps even incorporate the bus->self check for good
measure... is it possible to have a slot on a root bus?  Taking
different approaches for bus vs slot reset should have been a giant red
flag that something is wrong.  Thanks,

Alex
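
The check discussed above, including a guard for a bus that has no parent bridge, might look roughly like the following userspace sketch in C. It is a simplified model under stated assumptions, not the kernel's `pci_slot_resetable()`: the `model_*` types are hypothetical stand-ins for `struct pci_dev`, `struct pci_bus` and `struct pci_slot`, the flag value is illustrative, a singly linked list replaces `list_for_each_entry()`, and any recursion into subordinate buses is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCI_DEV_FLAGS_NO_BUS_RESET; the value is illustrative. */
#define MODEL_NO_BUS_RESET (1u << 6)

struct model_slot;

struct model_dev {
	unsigned int dev_flags;
	struct model_slot *slot;   /* slot this device sits in, or NULL */
	struct model_dev *next;    /* next device on the same bus */
};

struct model_bus {
	struct model_dev *self;    /* bridge leading to this bus; NULL on a root bus */
	struct model_dev *devices; /* head of the device list */
};

struct model_slot {
	struct model_bus *bus;
};

static bool model_slot_resetable(const struct model_slot *slot)
{
	const struct model_dev *dev;

	assert(slot != NULL && slot->bus != NULL);

	/* Bridge-level check; the NULL test covers a slot on a root bus. */
	if (slot->bus->self &&
	    (slot->bus->self->dev_flags & MODEL_NO_BUS_RESET))
		return false;

	/* Per-device check, restricted to devices in this slot. */
	for (dev = slot->bus->devices; dev; dev = dev->next) {
		if (!dev->slot || dev->slot != slot)
			continue;
		if (dev->dev_flags & MODEL_NO_BUS_RESET)
			return false;
	}

	return true;
}
```

With the bridge flagged, the slot is reported as not resettable regardless of the flags on the devices below it, which mirrors how a bridge-level restriction would apply to both bus and slot reset.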

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 85191b8..6679971 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -845,6 +845,22 @@ static void quirk_cavium_sriov_rnm_link(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa018, quirk_cavium_sriov_rnm_link);
 #endif
 
+/*
+ * Root port on some Cavium CN8xxx chips do not successfully complete
+ * a bus reset when used with certain types of child devices. Config
+ * space access to the child may quit responding. Flag all devices under
+ * the secondary bus as non-resettable.
+ */
+static void quirk_CN8xxx_secondary_bus(struct pci_dev *dev)
+{
+	struct pci_dev *pdev;
+
+	dev_warn(&dev->dev, "Cavium CN8xxx quirk detected; reset for devices on secondary bus disabled\n");
+	list_for_each_entry(pdev, &dev->subordinate->devices, bus_list)
+		pdev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xa100, quirk_CN8xxx_secondary_bus);
+
 /*
  * Some settings of MMRBC can lead to data corruption so block changes.
  * See AMD 8131 HyperTransport PCI-X Tunnel Revision Guide