[4/4] PCI: quirk Atheros AR93xx to avoid bus reset

Message ID 20141121182427.26669.82417.stgit@gimli.home
State Accepted

Commit Message

Alex Williamson Nov. 21, 2014, 6:24 p.m. UTC
Reports against the TL-WDN4800 card indicate that PCI bus reset of
this Atheros device causes system lock-ups and resets.  I've also
been able to confirm this behavior on multiple systems.  The device
never returns from reset and attempts to access config space of the
device after reset result in hangs.  Blacklist bus reset for the
device to avoid this issue.

Reported-by: Andreas Hartmann <andihartmann@freenet.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Andreas Hartmann <andihartmann@freenet.de>
---

 drivers/pci/quirks.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)
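
For readers unfamiliar with PCI header fixups, the following is a minimal
user-space sketch of the mechanism the patch relies on: a table entry keyed
on vendor/device ID whose hook sets a flag on the matching device.  It is
not the kernel code.  Every mock_* and MOCK_* name is hypothetical, and the
bit position of the flag is illustrative.  Only the Atheros vendor ID
(0x168c), the device ID 0x0030 and the one-line hook body mirror the patch.

```c
/* User-space model only; these are not the kernel's types or macros. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ANY_ID            0xffffu    /* stands in for (u16)PCI_ANY_ID */
#define MOCK_VENDOR_ATHEROS    0x168cu    /* value of PCI_VENDOR_ID_ATHEROS */
#define MOCK_FLAG_NO_BUS_RESET (1u << 6)  /* illustrative bit position */

struct mock_pci_dev {
	uint16_t vendor;
	uint16_t device;
	unsigned int dev_flags;
};

struct mock_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct mock_pci_dev *dev);
};

/* Same one-line body as quirk_no_bus_reset() in the patch. */
static void mock_quirk_no_bus_reset(struct mock_pci_dev *dev)
{
	dev->dev_flags |= MOCK_FLAG_NO_BUS_RESET;
}

/* One table entry per DECLARE_PCI_FIXUP_HEADER() line being modelled. */
static const struct mock_fixup mock_header_fixups[] = {
	{ MOCK_VENDOR_ATHEROS, 0x0030, mock_quirk_no_bus_reset },
};

/* Run every hook whose vendor/device pair matches; return the hook count. */
static int mock_run_header_fixups(struct mock_pci_dev *dev)
{
	size_t n = sizeof(mock_header_fixups) / sizeof(mock_header_fixups[0]);
	size_t i;
	int ran = 0;

	assert(dev != NULL);
	for (i = 0; i < n; i++) {
		const struct mock_fixup *f = &mock_header_fixups[i];

		if ((f->vendor == MOCK_ANY_ID || f->vendor == dev->vendor) &&
		    (f->device == MOCK_ANY_ID || f->device == dev->device)) {
			f->hook(dev);
			ran++;
		}
	}
	return ran;
}
```

In this model a device reporting 0x168c:0x0030 ends up with the flag set
after mock_run_header_fixups(), while any other device is left untouched.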


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andreas Hartmann Dec. 26, 2014, 7:56 a.m. UTC | #1
Hello Bjorn,

I have been running this patch and the corresponding "[PATCH 3/4] PCI: Allow
device quirks to exclude bus reset" patch for a month now w/ kernel 3.14.x
and couldn't find any problem. Would it be possible to apply these patches
to the mainline kernel? Or even to the longterm 3.14 kernel?


Thanks.
kind regards,
Andreas Hartmann


Alex Williamson wrote:
> Reports against the TL-WDN4800 card indicate that PCI bus reset of
> this Atheros device cause system lock-ups and resets.  I've also
> been able to confirm this behavior on multiple systems.  The device
> never returns from reset and attempts to access config space of the
> device after reset result in hangs.  Blacklist bus reset for the
> device to avoid this issue.
> 
> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Tested-by: Andreas Hartmann <andihartmann@freenet.de>
> ---
> 
>  drivers/pci/quirks.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 561e10d..ebbd5b4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>  
> +static void quirk_no_bus_reset(struct pci_dev *dev)
> +{
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
> +
> +/*
> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
> + * throw a Link Down error on AER capable system and regardless of AER,
> + * config space of the device is never accessible again and typically
> + * causes the system to hang or reset when access is attempted.
> + * http://www.spinics.net/lists/linux-pci/msg34797.html
> + */
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> +
>  #ifdef CONFIG_ACPI
>  /*
>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
> 

Bjorn Helgaas Jan. 8, 2015, 4:07 p.m. UTC | #2
On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
> Reports against the TL-WDN4800 card indicate that PCI bus reset of
> this Atheros device cause system lock-ups and resets.  I've also
> been able to confirm this behavior on multiple systems.  The device
> never returns from reset and attempts to access config space of the
> device after reset result in hangs.  Blacklist bus reset for the
> device to avoid this issue.
> 
> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Tested-by: Andreas Hartmann <andihartmann@freenet.de>

If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").

If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
Do they fix a regression?  Is there a pointer to a bugzilla or problem
report about that issue?

I don't understand the connection between 425c1b223dac and
PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
resets.  Is that the wrong commit, or can you outline the connection for
me?

Bjorn

> ---
> 
>  drivers/pci/quirks.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 561e10d..ebbd5b4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>  
> +static void quirk_no_bus_reset(struct pci_dev *dev)
> +{
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
> +
> +/*
> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
> + * throw a Link Down error on AER capable system and regardless of AER,
> + * config space of the device is never accessible again and typically
> + * causes the system to hang or reset when access is attempted.
> + * http://www.spinics.net/lists/linux-pci/msg34797.html
> + */
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> +
>  #ifdef CONFIG_ACPI
>  /*
>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
> 
Alex Williamson Jan. 8, 2015, 7:30 p.m. UTC | #3
On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
> > Reports against the TL-WDN4800 card indicate that PCI bus reset of
> > this Atheros device cause system lock-ups and resets.  I've also
> > been able to confirm this behavior on multiple systems.  The device
> > never returns from reset and attempts to access config space of the
> > device after reset result in hangs.  Blacklist bus reset for the
> > device to avoid this issue.
> > 
> > Reported-by: Andreas Hartmann <andihartmann@freenet.de>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Tested-by: Andreas Hartmann <andihartmann@freenet.de>
> 
> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
> 
> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
> Do they fix a regression?  Is there a pointer to a bugzilla or problem
> report about that issue?
> 
> I don't understand the connection between 425c1b223dac and
> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
> resets.  Is that the wrong commit, or can you outline the connection for
> me?

TBH, I don't have a lot of faith in associating this to 425c1b223dac,
I'm not sure how Andreas' bisect landed there.  IME, this device cannot,
and has never been able to handle a bus reset.  A simple setpci
experiment on the commandline can confirm this.  What I think happened
is that with the PCI bus reset infrastructure we added, we switched QEMU
to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
just more prolific use of bus resets by userspace.

There's also no regression in 1 & 2, PM reset has never done anything
useful on those devices.  Thanks,

Alex
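
The thread does not show the setpci command itself.  As background, a
secondary bus reset is conventionally triggered by setting and then
clearing the Secondary Bus Reset bit (0x40) in the Bridge Control register
(config offset 0x3e) of the bridge upstream of the device.  The sketch
below only computes the two register values such an experiment would
write; it touches no hardware.  The sbr_* helper names are hypothetical.
Actually writing these values on a system with the affected card is
expected, per the reports above, to hang or reset the machine.

```c
/* Pure bit arithmetic; nothing here reads or writes config space. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRIDGE_CONTROL_OFFSET 0x3e   /* PCI_BRIDGE_CONTROL */
#define BRIDGE_CTL_BUS_RESET  0x40u  /* PCI_BRIDGE_CTL_BUS_RESET */

/* Value to write to start the reset: original contents with the bit set. */
static uint16_t sbr_assert_value(uint16_t bridge_ctl)
{
	return (uint16_t)(bridge_ctl | BRIDGE_CTL_BUS_RESET);
}

/* Value to write to end the reset: original contents with the bit clear. */
static uint16_t sbr_deassert_value(uint16_t bridge_ctl)
{
	return (uint16_t)(bridge_ctl & (uint16_t)~BRIDGE_CTL_BUS_RESET);
}

/*
 * Fill out[0] and out[1] with the two values to write, in order.  A real
 * experiment would also have to wait between the two writes and again
 * before touching the device; that timing is not modelled here.
 */
static void sbr_sequence(uint16_t bridge_ctl, uint16_t out[2])
{
	assert(out != NULL);
	out[0] = sbr_assert_value(bridge_ctl);
	out[1] = sbr_deassert_value(bridge_ctl);
}
```

All other bits of the register are preserved in both writes, so the
bridge's remaining settings are left as they were.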

> > ---
> > 
> >  drivers/pci/quirks.c |   14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 561e10d..ebbd5b4 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
> >  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
> >  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
> >  
> > +static void quirk_no_bus_reset(struct pci_dev *dev)
> > +{
> > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > +}
> > +
> > +/*
> > + * Atheros AR93xx chips do not behave after a bus reset.  The device will
> > + * throw a Link Down error on AER capable system and regardless of AER,
> > + * config space of the device is never accessible again and typically
> > + * causes the system to hang or reset when access is attempted.
> > + * http://www.spinics.net/lists/linux-pci/msg34797.html
> > + */
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> > +
> >  #ifdef CONFIG_ACPI
> >  /*
> >   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
> > 



Bjorn Helgaas Jan. 8, 2015, 11:10 p.m. UTC | #4
On Thu, Jan 8, 2015 at 1:30 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>> > Reports against the TL-WDN4800 card indicate that PCI bus reset of
>> > this Atheros device cause system lock-ups and resets.  I've also
>> > been able to confirm this behavior on multiple systems.  The device
>> > never returns from reset and attempts to access config space of the
>> > device after reset result in hangs.  Blacklist bus reset for the
>> > device to avoid this issue.
>> >
>> > Reported-by: Andreas Hartmann <andihartmann@freenet.de>
>> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> > Tested-by: Andreas Hartmann <andihartmann@freenet.de>
>>
>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>
>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>> report about that issue?
>>
>> I don't understand the connection between 425c1b223dac and
>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>> resets.  Is that the wrong commit, or can you outline the connection for
>> me?
>
> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
> I'm not sure how Andreas' bisect landed there.  IME, this device cannot,
> and has never been able to handle a bus reset.  A simple setpci
> experiment on the commandline can confirm this.  What I think happened
> is that with the PCI bus reset infrastructure we added, we switched QEMU
> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
> just more prolific use of bus resets by userspace.

Maybe it's not 425c1b223dac, but it seems pretty clear that it really
is a regression from the user point of view.  I'm sure Andreas was
using the same user-space for v3.13 and v3.14.

Testing 425c1b223dac^ (the predecessor of 425c1b223dac) and
425c1b223dac would confirm the bisect results.

If it really turns out that 425c1b223dac^ works and 425c1b223dac
fails, maybe we could learn something interesting by adding debug
output to the various reset flavors and comparing the output?

> There's also no regression in 1 & 2, PM reset has never done anything
> useful on those devices.  Thanks,

Since 1 & 2 don't fix a regression, I think what I should do is apply
3 & 4 to for-linus for v3.19, then apply 1 & 2 on top as v3.20
material.

Bjorn

>> > ---
>> >
>> >  drivers/pci/quirks.c |   14 ++++++++++++++
>> >  1 file changed, 14 insertions(+)
>> >
>> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>> > index 561e10d..ebbd5b4 100644
>> > --- a/drivers/pci/quirks.c
>> > +++ b/drivers/pci/quirks.c
>> > @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>> >  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>> >                            PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>> >
>> > +static void quirk_no_bus_reset(struct pci_dev *dev)
>> > +{
>> > +   dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>> > +}
>> > +
>> > +/*
>> > + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>> > + * throw a Link Down error on AER capable system and regardless of AER,
>> > + * config space of the device is never accessible again and typically
>> > + * causes the system to hang or reset when access is attempted.
>> > + * http://www.spinics.net/lists/linux-pci/msg34797.html
>> > + */
>> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>> > +
>> >  #ifdef CONFIG_ACPI
>> >  /*
>> >   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
>> >
>
>
>
Andreas Hartmann Jan. 12, 2015, 3:20 p.m. UTC | #5
Alex Williamson wrote:
> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
>>> this Atheros device cause system lock-ups and resets.  I've also
>>> been able to confirm this behavior on multiple systems.  The device
>>> never returns from reset and attempts to access config space of the
>>> device after reset result in hangs.  Blacklist bus reset for the
>>> device to avoid this issue.
>>>
>>> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> Tested-by: Andreas Hartmann <andihartmann@freenet.de>
>>
>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>
>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>> report about that issue?
>>
>> I don't understand the connection between 425c1b223dac and
>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>> resets.  Is that the wrong commit, or can you outline the connection for
>> me?
> 
> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
> I'm not sure how Andreas' bisect landed there. 

Because removing this patch made it work again :-)

See also:
http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984

Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
kernel, which hangs the machine at startup of the VM. The userland
(qemu) didn't change in between.

Therefore: from my point of view, it is a regression, because things
have been working < 2.14.

Besides that: there is no doubt that there is a problem with resetting
this card. But the difference between >= 3.14 and < 3.14 is that < 3.14
worked nevertheless. The patch
425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something
that I can't identify and don't know of. Therefore, the quirk patch is
definitely required, because things work completely fine again w/ this
patch.

"Working" means for me here: I was able to start (and use) the VM w/o
crashing the machine and this isn't possible w/ unpatched 2.14+ any
more. Yes, w/ 2.12, I wasn't able to restart the VM (it then crashed the
machine), but w/ 2.10 even this was possible.


> IME, this device cannot,
> and has never been able to handle a bus reset.  A simple setpci
> experiment on the commandline can confirm this.  What I think happened
> is that with the PCI bus reset infrastructure we added, we switched QEMU
> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
> just more prolific use of bus resets by userspace.
> 
> There's also no regression in 1 & 2, PM reset has never done anything
> useful on those devices.  Thanks,
> 
> Alex
> 
>>> ---
>>>
>>>  drivers/pci/quirks.c |   14 ++++++++++++++
>>>  1 file changed, 14 insertions(+)
>>>
>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>> index 561e10d..ebbd5b4 100644
>>> --- a/drivers/pci/quirks.c
>>> +++ b/drivers/pci/quirks.c
>>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>>>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>>>  
>>> +static void quirk_no_bus_reset(struct pci_dev *dev)
>>> +{
>>> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>>> +}
>>> +
>>> +/*
>>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>>> + * throw a Link Down error on AER capable system and regardless of AER,
>>> + * config space of the device is never accessible again and typically
>>> + * causes the system to hang or reset when access is attempted.
>>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
>>> + */
>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>>> +
>>>  #ifdef CONFIG_ACPI
>>>  /*
>>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
>>>
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Alex Williamson Jan. 12, 2015, 4:49 p.m. UTC | #6
On Mon, 2015-01-12 at 16:20 +0100, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
> >> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
> >>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
> >>> this Atheros device cause system lock-ups and resets.  I've also
> >>> been able to confirm this behavior on multiple systems.  The device
> >>> never returns from reset and attempts to access config space of the
> >>> device after reset result in hangs.  Blacklist bus reset for the
> >>> device to avoid this issue.
> >>>
> >>> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> Tested-by: Andreas Hartmann <andihartmann@freenet.de>
> >>
> >> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
> >> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
> >>
> >> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
> >> Do they fix a regression?  Is there a pointer to a bugzilla or problem
> >> report about that issue?
> >>
> >> I don't understand the connection between 425c1b223dac and
> >> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
> >> resets.  Is that the wrong commit, or can you outline the connection for
> >> me?
> > 
> > TBH, I don't have a lot of faith in associating this to 425c1b223dac,
> > I'm not sure how Andreas' bisect landed there. 
> 
> Because removing this patch made it working again :-)
> 
> And too:
> http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984
> 
> Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
> kernel, which hangs the machine at startup of the VM. The userland
> (qemu) didn't change in between.

s/2\./3\./

Ok, so what about VC save/restore (425c1b223dac) is the problem then?
When we tried to determine that, you found that if we continue from the
top of the save loop, everything works (ie. no VC state saved), but if
you continue after the variable declaration of the same loop (ie. still
no VC state saved), it breaks:

http://www.spinics.net/lists/linux-pci/msg36166.html

So, please forgive me if I don't have a whole lot of faith that
425c1b223dac is involved.

We also both independently determined that this particular device never
recovers from a PCI bus reset, even when done from userspace with setpci
and absolutely no save/restore wrappers.  Config space on the device is
never accessible after the reset.  Therefore, how could any sort of bus
reset with save/restore ever work for this device?

> Therefore: from my point of view, it is a regression, because things
> have been working < 2.14.
> 
> Besides that: It is undoubted, that there is a problem with resetting
> this card. But the difference between >= 3.14 and < 3.14 is, that < 3.14
> has been working nevertheless. The patch
> 425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something
> which I can't say and I don't know off. Therefore, the quirk-patch is
> definitely required, because things work completely fine again w/ this
> patch.
> 
> "Working" means for me here: I was able to start (and use) the VM w/o
> crashing the machine and this isn't possible w/ unpatched 2.14+ any
> more. Yes, w/ 2.12, I wasn't able to restart the VM (it then crashed the
> machine), but w/ 2.10 even this was possible.

What?!  So v3.12 still had a machine crash when assigning this device.
The vfio hot reset interface was added in v3.12, so v3.10 didn't have
any way to do a reset other than what pci_reset_function() decided to
do.  That all seems to associate the machine crash to the ability to do
a bus reset on the device.  I'm not sure why the behavior changed
between v3.14 and v3.12 (maybe the try-reset addition), but there's some
sort of pre-existing issue before we even got to 425c1b223dac.

I'm perfectly happy tagging this for stable, but it seems like a
hardware bug exposed by allowing userspace the ability to select a bus
reset.  Whether or not that's a kernel regression isn't exactly clear to
me ("new functionality exposes broken hardware, news at 11").  Thanks,

Alex
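
To illustrate why a "no bus reset" flag changes what a function reset ends
up doing, here is a small user-space model of a reset path that tries
methods in order and treats -ENOTTY as "this method is not available, try
the next one".  It is a sketch under assumptions, not the kernel's
pci_reset_function(): the mock_* names, the capability fields and the
reduced list of methods are all hypothetical, and the real code has more
reset flavors than shown.

```c
/* User-space model only; names and method list are illustrative. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_FLAG_NO_BUS_RESET (1u << 6)  /* illustrative bit position */

enum mock_reset_method {
	MOCK_RESET_NONE = 0,
	MOCK_RESET_FLR,
	MOCK_RESET_PM,
	MOCK_RESET_BUS
};

struct mock_pci_dev {
	unsigned int dev_flags;
	int supports_flr;       /* nonzero if function-level reset exists */
	int supports_pm_reset;  /* nonzero if D3hot->D0 resets the device */
};

static int mock_flr(const struct mock_pci_dev *dev)
{
	return dev->supports_flr ? 0 : -ENOTTY;
}

static int mock_pm_reset(const struct mock_pci_dev *dev)
{
	return dev->supports_pm_reset ? 0 : -ENOTTY;
}

/* With the quirk flag set, bus reset reports itself as unavailable. */
static int mock_bus_reset(const struct mock_pci_dev *dev)
{
	if (dev->dev_flags & MOCK_FLAG_NO_BUS_RESET)
		return -ENOTTY;
	return 0;
}

/* Try each method in turn; record which one (if any) handled the reset. */
static int mock_reset_function(const struct mock_pci_dev *dev,
			       enum mock_reset_method *used)
{
	int rc;

	assert(dev != NULL && used != NULL);
	*used = MOCK_RESET_NONE;

	rc = mock_flr(dev);
	if (rc != -ENOTTY) {
		*used = MOCK_RESET_FLR;
		return rc;
	}
	rc = mock_pm_reset(dev);
	if (rc != -ENOTTY) {
		*used = MOCK_RESET_PM;
		return rc;
	}
	rc = mock_bus_reset(dev);
	if (rc != -ENOTTY)
		*used = MOCK_RESET_BUS;
	return rc;
}
```

In this model a device with no FLR and no PM reset falls through to bus
reset unless the flag is set, in which case the caller gets -ENOTTY and no
reset is attempted at all.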

> > IME, this device cannot,
> > and has never been able to handle a bus reset.  A simple setpci
> > experiment on the commandline can confirm this.  What I think happened
> > is that with the PCI bus reset infrastructure we added, we switched QEMU
> > to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
> > just more prolific use of bus resets by userspace.
> > 
> > There's also no regression in 1 & 2, PM reset has never done anything
> > useful on those devices.  Thanks,
> > 
> > Alex
> > 
> >>> ---
> >>>
> >>>  drivers/pci/quirks.c |   14 ++++++++++++++
> >>>  1 file changed, 14 insertions(+)
> >>>
> >>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> >>> index 561e10d..ebbd5b4 100644
> >>> --- a/drivers/pci/quirks.c
> >>> +++ b/drivers/pci/quirks.c
> >>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
> >>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
> >>>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
> >>>  
> >>> +static void quirk_no_bus_reset(struct pci_dev *dev)
> >>> +{
> >>> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
> >>> + * throw a Link Down error on AER capable system and regardless of AER,
> >>> + * config space of the device is never accessible again and typically
> >>> + * causes the system to hang or reset when access is attempted.
> >>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
> >>> + */
> >>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> >>> +
> >>>  #ifdef CONFIG_ACPI
> >>>  /*
> >>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
> >>>
> > 
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 



Andreas Hartmann Jan. 12, 2015, 7:15 p.m. UTC | #7
Hello Alex!

Alex Williamson wrote:
> On Mon, 2015-01-12 at 16:20 +0100, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>>>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>>>>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
>>>>> this Atheros device cause system lock-ups and resets.  I've also
>>>>> been able to confirm this behavior on multiple systems.  The device
>>>>> never returns from reset and attempts to access config space of the
>>>>> device after reset result in hangs.  Blacklist bus reset for the
>>>>> device to avoid this issue.
>>>>>
>>>>> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> Tested-by: Andreas Hartmann <andihartmann@freenet.de>
>>>>
>>>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>>>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>>>
>>>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>>>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>>>> report about that issue?
>>>>
>>>> I don't understand the connection between 425c1b223dac and
>>>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>>>> resets.  Is that the wrong commit, or can you outline the connection for
>>>> me?
>>>
>>> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
>>> I'm not sure how Andreas' bisect landed there. 
>>
>> Because removing this patch made it working again :-)
>>
>> And too:
>> http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984
>>
>> Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
>> kernel, which hangs the machine at startup of the VM. The userland
>> (qemu) didn't change in between.
> 
> s/2\./3\./

Thanks :-) It seems I don't like the number 3 :-)

> Ok, so what about VC save/restore (425c1b223dac) is the problem then?
> When we tried to determine that, you found that if we continue from the
> top of the save loop, everything works (ie. no VC state saved), but if
> you continue after the variable declaration of the same loop (ie. still
> no VC state saved), it breaks:
> 
> http://www.spinics.net/lists/linux-pci/msg36166.html
> 
> So, please forgive me if I don't have a whole lot of faith that
> 425c1b223dac is involved.

It's hard for me, too. Really. It's kind of a mystery.

> We also both independently determined that this particular device never
> recovers from a PCI bus reset, even when done from userspace with setpci
> and absolutely no save/restore wrappers.

Yes.

>  Config space on the device is
> never accessible after the reset.

Yes.

>  Therefore, how could any sort of bus
> reset with save/restore ever work for this device?

I can't say. What I definitely can say is that I never had problems
running VMs w/ qemu until 3.14 came up. Do you think I'm lying? I
used 3.10 and 3.12 for a long time w/o (known!) problems (3.12 only on
first start of VM). Otherwise I would have been here long before :-))).

>> Therefore: from my point of view, it is a regression, because things
>> have been working < 2.14.
>>
>> Besides that: It is undoubted, that there is a problem with resetting
>> this card. But the difference between >= 3.14 and < 3.14 is, that < 3.14
>> has been working nevertheless. The patch
>> 425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something
>> which I can't say and I don't know off. Therefore, the quirk-patch is
>> definitely required, because things work completely fine again w/ this
>> patch.
>>
>> "Working" means for me here: I was able to start (and use) the VM w/o
>> crashing the machine and this isn't possible w/ unpatched 2.14+ any
>> more. Yes, w/ 2.12, I wasn't able to restart the VM (it then crashed the
>> machine), but w/ 2.10 even this was possible.
> 
> What?!  So v3.12 still had a machine crash when assigning this device.

Yes. If you *re*start the VM (for a long time I didn't know that at all
- I just discovered it during testing while analyzing the problem :-)).
The first start (after reboot) was not a problem. This was the usual use
case here :-)).

Believe me, I'm really convinced that this card does have a problem with
resets. I'm just wondering why it worked for me until 3.13. That's all.

> The vfio hot reset interface was added in v3.12, so v3.10 didn't have
> any way to do a reset other than what pci_reset_function() decided to
> do.  That all seems to associate the machine crash to the ability to do
> a bus reset on the device.  I'm not sure why the behavior changed
> between v3.14 and v3.12 (maybe the try-reset addition), but there's some
> sort of pre-existing issue before we even got to 425c1b223dac.

Most probably.

> I'm perfectly happy tagging this for stable,

Thanks!! I'm really very comfortable with your patch and your support!
Really! Thanks a lot! It's just odd to me that it partly worked (first
start of VM worked) w/ 3.12 and 3.13 and then suddenly not at all w/ 3.14.

Your change was just the one that got hit by accident - most probably it
could have been any other change, too. Sorry for that :-(. Therefore: kudos
for fixing the problem anyway. This is really not a matter of course at all!

> but it seems like a
> hardware bug exposed by allowing userspace the ability to select a bus
> reset.  Whether or not that's a kernel regression isn't exactly clear to
> me ("new functionality exposes broken hardware, news at 11").  Thanks,
> 
> Alex


Kind regards,
Andreas

>>> IME, this device cannot,
>>> and has never been able to handle a bus reset.  A simple setpci
>>> experiment on the commandline can confirm this.  What I think happened
>>> is that with the PCI bus reset infrastructure we added, we switched QEMU
>>> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
>>> just more prolific use of bus resets by userspace.
>>>
>>> There's also no regression in 1 & 2, PM reset has never done anything
>>> useful on those devices.  Thanks,
>>>
>>> Alex
>>>
>>>>> ---
>>>>>
>>>>>  drivers/pci/quirks.c |   14 ++++++++++++++
>>>>>  1 file changed, 14 insertions(+)
>>>>>
>>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>>> index 561e10d..ebbd5b4 100644
>>>>> --- a/drivers/pci/quirks.c
>>>>> +++ b/drivers/pci/quirks.c
>>>>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>>>>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>>>>>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>>>>>  
>>>>> +static void quirk_no_bus_reset(struct pci_dev *dev)
>>>>> +{
>>>>> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>>>>> + * throw a Link Down error on AER capable system and regardless of AER,
>>>>> + * config space of the device is never accessible again and typically
>>>>> + * causes the system to hang or reset when access is attempted.
>>>>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
>>>>> + */
>>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>>>>> +
>>>>>  #ifdef CONFIG_ACPI
>>>>>  /*
>>>>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
>>>>>
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 
> 
> 

Bjorn Helgaas Jan. 13, 2015, 12:37 a.m. UTC | #8
On Mon, Jan 12, 2015 at 1:15 PM, Andreas Hartmann
<andihartmann@freenet.de> wrote:
> Hello Alex!
>
> Alex Williamson wrote:
>> On Mon, 2015-01-12 at 16:20 +0100, Andreas Hartmann wrote:
>>> Alex Williamson wrote:
>>>> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>>>>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>>>>>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
>>>>>> this Atheros device cause system lock-ups and resets.  I've also
>>>>>> been able to confirm this behavior on multiple systems.  The device
>>>>>> never returns from reset and attempts to access config space of the
>>>>>> device after reset result in hangs.  Blacklist bus reset for the
>>>>>> device to avoid this issue.
>>>>>>
>>>>>> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
>>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>>> Tested-by: Andreas Hartmann <andihartmann@freenet.de>
>>>>>
>>>>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>>>>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>>>>
>>>>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>>>>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>>>>> report about that issue?
>>>>>
>>>>> I don't understand the connection between 425c1b223dac and
>>>>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>>>>> resets.  Is that the wrong commit, or can you outline the connection for
>>>>> me?
>>>>
>>>> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
>>>> I'm not sure how Andreas' bisect landed there.
>>>
>>> Because removing this patch made it work again :-)
>>>
>>> And too:
>>> http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984
>>>
>>> Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
>>> kernel, which hangs the machine at startup of the VM. The userland
>>> (qemu) didn't change in between.
>>
>> s/2\./3\./
>
> Thanks :-) It seems I don't like the number 3 :-)
>
>> Ok, so what about VC save/restore (425c1b223dac) is the problem then?
>> When we tried to determine that, you found that if we continue from the
>> top of the save loop, everything works (ie. no VC state saved), but if
>> you continue after the variable declaration of the same loop (ie. still
>> no VC state saved), it breaks:
>>
>> http://www.spinics.net/lists/linux-pci/msg36166.html
>>
>> So, please forgive me if I don't have a whole lot of faith that
>> 425c1b223dac is involved.
>
> It's hard for me, too. Really. It's kind of a mystery.
>
>> We also both independently determined that this particular device never
>> recovers from a PCI bus reset, even when done from userspace with setpci
>> and absolutely no save/restore wrappers.
>
> Yes.
>
>>  Config space on the device is
>> never accessible after the reset.
>
> Yes.
>
>>  Therefore, how could any sort of bus
>> reset with save/restore ever work for this device?
>
> I can't say. What I can definitely say is that I never had problems
> running VMs w/ qemu until 3.14 came along. Do you think I'm lying? I
> used 3.10 and 3.12 for a long time w/o (known!) problems (3.12 only on
> first start of VM). Otherwise I would have been here long before :-))).
>
>>> Therefore: from my point of view, it is a regression, because things
>>> have been working < 2.14.
>>>
>>> Besides that: It is undoubted, that there is a problem with resetting
>>> this card. But the difference between >= 3.14 and < 3.14 is, that < 3.14
>>> has been working nevertheless. The patch
>>> 425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something
>>> which I can't say and I don't know of. Therefore, the quirk-patch is
>>> definitely required, because things work completely fine again w/ this
>>> patch.
>>>
>>> "Working" means for me here: I was able to start (and use) the VM w/o
>>> crashing the machine and this isn't possible w/ unpatched 2.14+ any
>>> more. Yes, w/ 2.12, I wasn't able to restart the VM (it then crashed the
>>> machine), but w/ 2.10 even this was possible.
>>
>> What?!  So v3.12 still had a machine crash when assigning this device.
>
> Yes. If you *re*start the VM (for a long time, I didn't know that at all
> - I only discovered it during testing while analyzing the problem :-)).
> The first start (after reboot) was not a problem. This was the usual use
> case here :-)).
>
> Believe me, I'm really convinced that this card does have a problem with
> resets. I'm just wondering why it had worked for me until 3.13. That's all.
>
>> The vfio hot reset interface was added in v3.12, so v3.10 didn't have
>> any way to do a reset other than what pci_reset_function() decided to
>> do.  That all seems to associate the machine crash to the ability to do
>> a bus reset on the device.  I'm not sure why the behavior changed
>> between v3.14 and v3.12 (maybe the try-reset addition), but there's some
>> sort of pre-existing issue before we even got to 425c1b223dac.
>
> Most probably.
>
>> I'm perfectly happy tagging this for stable,
>
> Thanks!! I'm really very comfortable with your patch and your support!
> Really! Thanks a lot! It's just odd for me, why it partly worked (first
> start of VM worked) w/ 3.12 and 3.13 and 3.14 suddenly no more at all.
>
> You have been accidentally the sufferer - most probably it could have
> hit any other change, too. Sorry for that :-(. Therefore: kudos for
> anyway fixing the problem. This is really not a matter of course at all!

So we should be able to add instrumentation to the reset paths in
425c1b223dac and 425c1b223dac^ and see some difference in how those
paths are exercised.  Right?

It still feels like there's some magic we don't understand here, and
that niggles at me.

Bjorn

>> but it seems like a
>> hardware bug exposed by allowing userspace the ability to select a bus
>> reset.  Whether or not that's a kernel regression isn't exactly clear to
>> me ("new functionality exposes broken hardware, news at 11").  Thanks,
>>
>> Alex
>
>
> Kind regards,
> Andreas
>
>>>> IME, this device cannot,
>>>> and has never been able to handle a bus reset.  A simple setpci
>>>> experiment on the commandline can confirm this.  What I think happened
>>>> is that with the PCI bus reset infrastructure we added, we switched QEMU
>>>> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
>>>> just more prolific use of bus resets by userspace.
>>>>
>>>> There's also no regression in 1 & 2, PM reset has never done anything
>>>> useful on those devices.  Thanks,
>>>>
>>>> Alex
>>>>
>>>>>> ---
>>>>>>
>>>>>>  drivers/pci/quirks.c |   14 ++++++++++++++
>>>>>>  1 file changed, 14 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>>>> index 561e10d..ebbd5b4 100644
>>>>>> --- a/drivers/pci/quirks.c
>>>>>> +++ b/drivers/pci/quirks.c
>>>>>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>>>>>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>>>>>>                          PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>>>>>>
>>>>>> +static void quirk_no_bus_reset(struct pci_dev *dev)
>>>>>> +{
>>>>>> + dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>>>>>> + * throw a Link Down error on AER capable system and regardless of AER,
>>>>>> + * config space of the device is never accessible again and typically
>>>>>> + * causes the system to hang or reset when access is attempted.
>>>>>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
>>>>>> + */
>>>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>>>>>> +
>>>>>>  #ifdef CONFIG_ACPI
>>>>>>  /*
>>>>>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 561e10d..ebbd5b4 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3029,6 +3029,20 @@  static void quirk_no_pm_reset(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
 			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
 
+static void quirk_no_bus_reset(struct pci_dev *dev)
+{
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+}
+
+/*
+ * Atheros AR93xx chips do not behave after a bus reset.  The device will
+ * throw a Link Down error on AER capable system and regardless of AER,
+ * config space of the device is never accessible again and typically
+ * causes the system to hang or reset when access is attempted.
+ * http://www.spinics.net/lists/linux-pci/msg34797.html
+ */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
+
 #ifdef CONFIG_ACPI
 /*
  * Apple: Shutdown Cactus Ridge Thunderbolt controller.
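
Illustration (not part of the patch): a minimal userspace sketch of how a
flag like PCI_DEV_FLAGS_NO_BUS_RESET might be consumed.  The quirk above
only sets the flag; the check itself lives in the companion patch 3/4,
which is not shown here, so the helper below is an assumption about its
general shape, not the kernel's code.  All model_* names and the
MODEL_FLAG_NO_BUS_RESET value are hypothetical stand-ins.  0x168c is the
Atheros PCI vendor ID, and 0x0030 is the device ID taken from the quirk.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for PCI_DEV_FLAGS_NO_BUS_RESET (value is arbitrary). */
#define MODEL_FLAG_NO_BUS_RESET	(1u << 0)
#define MODEL_VENDOR_ATHEROS	0x168c	/* Atheros PCI vendor ID */
#define MODEL_DEVICE_AR93XX	0x0030	/* device ID used by the quirk */

/* Hypothetical, heavily reduced model of struct pci_dev. */
struct model_dev {
	unsigned short vendor;
	unsigned short device;
	unsigned int dev_flags;
};

/* Mirrors quirk_no_bus_reset(): mark the device as unsafe to bus-reset. */
static void model_quirk_no_bus_reset(struct model_dev *dev)
{
	dev->dev_flags |= MODEL_FLAG_NO_BUS_RESET;
}

/* Models the vendor/device match done by DECLARE_PCI_FIXUP_HEADER(). */
static void model_apply_header_fixups(struct model_dev *dev)
{
	if (dev->vendor == MODEL_VENDOR_ATHEROS &&
	    dev->device == MODEL_DEVICE_AR93XX)
		model_quirk_no_bus_reset(dev);
}

/*
 * A bus reset hits every device on the bus, so one flagged device is
 * enough to refuse it.  Returns 0 if allowed, -ENOTTY otherwise.
 */
static int model_bus_reset_allowed(const struct model_dev *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].dev_flags & MODEL_FLAG_NO_BUS_RESET)
			return -ENOTTY;
	return 0;
}
```

In this model, a bus holding the AR93xx reports -ENOTTY once the fixup has
run, so a caller would have to fall back to some other reset method or
skip the reset entirely instead of hanging the machine.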