diff mbox

Commit ef83b0781a73f (PCI: Remove from bus_list and release resources in pci_release_dev()) broke TBT hotplug

Message ID CAE9FiQUaof1tk151Wbu2OtpTYb+pvP16ZhLsmbn6LDdBqoVEgA@mail.gmail.com
State Not Applicable
Headers show

Commit Message

Yinghai Lu Jan. 30, 2014, 4:56 p.m. UTC
>> The latest mainline kernel "hangs" when Thunderbolt devices are
>> hot-unplugged to the system. I can't see any oops but after hot-unplug I'm
>> getting huge amounts of messages like:
>>
>> [  352.717001] pci 0000:02:00.0: PME# disabled
>> [  352.717011] pci 0000:02:00.0: PME# disabled
>> [  352.717021] pci 0000:02:00.0: PME# disabled
>> [  352.717032] pci 0000:02:00.0: PME# disabled
>> [  352.717041] pci 0000:02:00.0: PME# disabled
>> [  352.717051] pci 0000:02:00.0: PME# disabled
>> [  352.717061] pci 0000:02:00.0: PME# disabled
>> [  352.717070] pci 0000:02:00.0: PME# disabled
>> [  352.717083] pci 0000:02:00.0: PME# disabled
>> [  352.717094] pci 0000:02:00.0: PME# disabled
>> [  352.717104] pci 0000:02:00.0: PME# disabled
>> [  352.717113] pci 0000:02:00.0: PME# disabled
>> [  352.717124] pci 0000:02:00.0: PME# disabled
>> [  352.717133] pci 0000:02:00.0: PME# disabled
>> [  352.717143] pci 0000:02:00.0: PME# disabled
>> [  352.717153] pci 0000:02:00.0: PME# disabled
>> [  352.717162] pci 0000:02:00.0: PME# disabled
>
> that mean pci_stop_dev() get called again and again ?

please check if attached patch could help.

it should prevent possible reattaching driver.

---
 drivers/pci/remove.c |    1 +
 1 file changed, 1 insertion(+)

Comments

Yinghai Lu Jan. 30, 2014, 11:39 p.m. UTC | #1
On Thu, Jan 30, 2014 at 3:39 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Thursday, January 30, 2014 08:56:05 AM Yinghai Lu wrote:
>>
>> --047d7b5d2ea4eb937804f132eedf
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> >> The latest mainline kernel "hangs" when Thunderbolt devices are
>> >> hot-unplugged to the system. I can't see any oops but after hot-unplug I'm
>> >> getting huge amounts of messages like:
>> >>
>> >> [  352.717001] pci 0000:02:00.0: PME# disabled
>> >> [  352.717011] pci 0000:02:00.0: PME# disabled
>> >> [  352.717021] pci 0000:02:00.0: PME# disabled
>> >> [  352.717032] pci 0000:02:00.0: PME# disabled
>> >> [  352.717041] pci 0000:02:00.0: PME# disabled
>> >> [  352.717051] pci 0000:02:00.0: PME# disabled
>> >> [  352.717061] pci 0000:02:00.0: PME# disabled
>> >> [  352.717070] pci 0000:02:00.0: PME# disabled
>> >> [  352.717083] pci 0000:02:00.0: PME# disabled
>> >> [  352.717094] pci 0000:02:00.0: PME# disabled
>> >> [  352.717104] pci 0000:02:00.0: PME# disabled
>> >> [  352.717113] pci 0000:02:00.0: PME# disabled
>> >> [  352.717124] pci 0000:02:00.0: PME# disabled
>> >> [  352.717133] pci 0000:02:00.0: PME# disabled
>> >> [  352.717143] pci 0000:02:00.0: PME# disabled
>> >> [  352.717153] pci 0000:02:00.0: PME# disabled
>> >> [  352.717162] pci 0000:02:00.0: PME# disabled
>> >
>> > that mean pci_stop_dev() get called again and again ?
>>
>> please check if attached patch could help.
>
> Well, it looks like what happens is an endless loop in
> acpiphp_glue.c:disable_slot().
>
> dev_in_slot() returns the first device in the list, so
> pci_stop_and_remove_bus_device() is called for it, but it
> doesn't remove the device from bus->devices any more, so
> dev_in_slot() will return the same device next time and
> so on forever.
>
...
>
> So the above won't help in my opinion.
>
> I wonder, however, if this patch helps instead:
>
> https://patchwork.kernel.org/patch/3540701/
>
> I thought it would be 3.15 material, but it very well can go in earlier if
> it happens to address this particular problem.

Agree, that should fix the problem.

but please use list_for_each_entry_safe_reverse
instead.

please refer to pciehp changelog in

commit 29ed1f29b68a8395d5679b3c4e38352b617b3236
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Fri Jul 19 12:14:16 2013 -0700

    PCI: pciehp: Fix null pointer deref when hot-removing SR-IOV device

    Hot-removing a device with SR-IOV enabled causes a null pointer dereference
    in v3.9 and v3.10.

    This is a regression caused by ba518e3c17 ("PCI: pciehp: Iterate over all
    devices in slot, not functions 0-7").  When we iterate over the
    bus->devices list, we first remove the PF, which also removes all the VFs
    from the list.  Then the list iterator blows up because more than just the
    current entry was removed from the list.

    ac205b7bb7 ("PCI: make sriov work with hotplug remove") works around a
    similar problem in pci_stop_bus_devices() by iterating over the list in
    reverse, so the VFs are stopped and removed from the list first, before the
    PF.

    This patch changes pciehp_unconfigure_device() to iterate over the list in
    reverse, too.

    [bhelgaas: bugzilla, changelog]
    Reference: https://bugzilla.kernel.org/show_bug.cgi?id=60604
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 30, 2014, 11:39 p.m. UTC | #2
On Thursday, January 30, 2014 08:56:05 AM Yinghai Lu wrote:
> 
> --047d7b5d2ea4eb937804f132eedf
> Content-Type: text/plain; charset=ISO-8859-1
> 
> >> The latest mainline kernel "hangs" when Thunderbolt devices are
> >> hot-unplugged to the system. I can't see any oops but after hot-unplug I'm
> >> getting huge amounts of messages like:
> >>
> >> [  352.717001] pci 0000:02:00.0: PME# disabled
> >> [  352.717011] pci 0000:02:00.0: PME# disabled
> >> [  352.717021] pci 0000:02:00.0: PME# disabled
> >> [  352.717032] pci 0000:02:00.0: PME# disabled
> >> [  352.717041] pci 0000:02:00.0: PME# disabled
> >> [  352.717051] pci 0000:02:00.0: PME# disabled
> >> [  352.717061] pci 0000:02:00.0: PME# disabled
> >> [  352.717070] pci 0000:02:00.0: PME# disabled
> >> [  352.717083] pci 0000:02:00.0: PME# disabled
> >> [  352.717094] pci 0000:02:00.0: PME# disabled
> >> [  352.717104] pci 0000:02:00.0: PME# disabled
> >> [  352.717113] pci 0000:02:00.0: PME# disabled
> >> [  352.717124] pci 0000:02:00.0: PME# disabled
> >> [  352.717133] pci 0000:02:00.0: PME# disabled
> >> [  352.717143] pci 0000:02:00.0: PME# disabled
> >> [  352.717153] pci 0000:02:00.0: PME# disabled
> >> [  352.717162] pci 0000:02:00.0: PME# disabled
> >
> > that mean pci_stop_dev() get called again and again ?
> 
> please check if attached patch could help.

Well, it looks like what happens is an endless loop in
acpiphp_glue.c:disable_slot().

dev_in_slot() returns the first device in the list, so
pci_stop_and_remove_bus_device() is called for it, but it
doesn't remove the device from bus->devices any more, so
dev_in_slot() will return the same device next time and
so on forever.

> it should prevent possible reattaching driver.

If my analysis above is correct, this isn't related to attaching drivers
and rather to the way dev_in_slot() is designed.

> ---
>  drivers/pci/remove.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> Index: linux-2.6/drivers/pci/remove.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/remove.c
> +++ linux-2.6/drivers/pci/remove.c
> @@ -11,6 +11,7 @@ static void pci_stop_dev(struct pci_dev
>          pci_proc_detach_device(dev);
>          pci_remove_sysfs_dev_files(dev);
>          device_release_driver(&dev->dev);
> +        dev->match_driver = false;
>          dev->is_added = 0;
>      }

So the above won't help in my opinion.

I wonder, however, if this patch helps instead:

https://patchwork.kernel.org/patch/3540701/

I thought it would be 3.15 material, but it very well can go in earlier if
it happens to address this particular problem.

Mika, can you please try that?
Rafael J. Wysocki Jan. 30, 2014, 11:58 p.m. UTC | #3
On Friday, January 31, 2014 12:39:07 AM Rafael J. Wysocki wrote:
> On Thursday, January 30, 2014 08:56:05 AM Yinghai Lu wrote:
> > 
> > --047d7b5d2ea4eb937804f132eedf
> > Content-Type: text/plain; charset=ISO-8859-1
> > 
> > >> The latest mainline kernel "hangs" when Thunderbolt devices are
> > >> hot-unplugged to the system. I can't see any oops but after hot-unplug I'm
> > >> getting huge amounts of messages like:
> > >>
> > >> [  352.717001] pci 0000:02:00.0: PME# disabled
> > >> [  352.717011] pci 0000:02:00.0: PME# disabled
> > >> [  352.717021] pci 0000:02:00.0: PME# disabled
> > >> [  352.717032] pci 0000:02:00.0: PME# disabled
> > >> [  352.717041] pci 0000:02:00.0: PME# disabled
> > >> [  352.717051] pci 0000:02:00.0: PME# disabled
> > >> [  352.717061] pci 0000:02:00.0: PME# disabled
> > >> [  352.717070] pci 0000:02:00.0: PME# disabled
> > >> [  352.717083] pci 0000:02:00.0: PME# disabled
> > >> [  352.717094] pci 0000:02:00.0: PME# disabled
> > >> [  352.717104] pci 0000:02:00.0: PME# disabled
> > >> [  352.717113] pci 0000:02:00.0: PME# disabled
> > >> [  352.717124] pci 0000:02:00.0: PME# disabled
> > >> [  352.717133] pci 0000:02:00.0: PME# disabled
> > >> [  352.717143] pci 0000:02:00.0: PME# disabled
> > >> [  352.717153] pci 0000:02:00.0: PME# disabled
> > >> [  352.717162] pci 0000:02:00.0: PME# disabled
> > >
> > > that mean pci_stop_dev() get called again and again ?
> > 
> > please check if attached patch could help.
> 
> Well, it looks like what happens is an endless loop in
> acpiphp_glue.c:disable_slot().
> 
> dev_in_slot() returns the first device in the list, so
> pci_stop_and_remove_bus_device() is called for it, but it
> doesn't remove the device from bus->devices any more, so
> dev_in_slot() will return the same device next time and
> so on forever.
> 
> > it should prevent possible reattaching driver.
> 
> If my analysis above is correct, this isn't related to attaching drivers
> and rather to the way dev_in_slot() is designed.
> 
> > ---
> >  drivers/pci/remove.c |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > Index: linux-2.6/drivers/pci/remove.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/pci/remove.c
> > +++ linux-2.6/drivers/pci/remove.c
> > @@ -11,6 +11,7 @@ static void pci_stop_dev(struct pci_dev
> >          pci_proc_detach_device(dev);
> >          pci_remove_sysfs_dev_files(dev);
> >          device_release_driver(&dev->dev);
> > +        dev->match_driver = false;
> >          dev->is_added = 0;
> >      }
> 
> So the above won't help in my opinion.
> 
> I wonder, however, if this patch helps instead:
> 
> https://patchwork.kernel.org/patch/3540701/

And even if it does, I wonder what happens if someone walks bus->devices of
that bus after we've run pci_stop_and_remove_bus_device() for one of its
devices and before that device is released?  The device is pretty much dead
at this point, so won't stepping on it during the walk cause any problems to
happen?
Rafael J. Wysocki Jan. 30, 2014, 11:59 p.m. UTC | #4
On Thursday, January 30, 2014 03:39:02 PM Yinghai Lu wrote:
> On Thu, Jan 30, 2014 at 3:39 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > On Thursday, January 30, 2014 08:56:05 AM Yinghai Lu wrote:
> >>
> >> --047d7b5d2ea4eb937804f132eedf
> >> Content-Type: text/plain; charset=ISO-8859-1
> >>
> >> >> The latest mainline kernel "hangs" when Thunderbolt devices are
> >> >> hot-unplugged to the system. I can't see any oops but after hot-unplug I'm
> >> >> getting huge amounts of messages like:
> >> >>
> >> >> [  352.717001] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717011] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717021] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717032] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717041] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717051] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717061] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717070] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717083] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717094] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717104] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717113] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717124] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717133] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717143] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717153] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717162] pci 0000:02:00.0: PME# disabled
> >> >
> >> > that mean pci_stop_dev() get called again and again ?
> >>
> >> please check if attached patch could help.
> >
> > Well, it looks like what happens is an endless loop in
> > acpiphp_glue.c:disable_slot().
> >
> > dev_in_slot() returns the first device in the list, so
> > pci_stop_and_remove_bus_device() is called for it, but it
> > doesn't remove the device from bus->devices any more, so
> > dev_in_slot() will return the same device next time and
> > so on forever.
> >
> ...
> >
> > So the above won't help in my opinion.
> >
> > I wonder, however, if this patch helps instead:
> >
> > https://patchwork.kernel.org/patch/3540701/
> >
> > I thought it would be 3.15 material, but it very well can go in earlier if
> > it happens to address this particular problem.
> 
> Agree, that should fix the problem.
> 
> but please use list_for_each_entry_safe_reverse
> instead.

OK, I will.
Rafael J. Wysocki Jan. 31, 2014, 12:39 a.m. UTC | #5
On Thursday, January 30, 2014 03:39:02 PM Yinghai Lu wrote:
> On Thu, Jan 30, 2014 at 3:39 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > On Thursday, January 30, 2014 08:56:05 AM Yinghai Lu wrote:
> >>
> >> --047d7b5d2ea4eb937804f132eedf
> >> Content-Type: text/plain; charset=ISO-8859-1
> >>
> >> >> The latest mainline kernel "hangs" when Thunderbolt devices are
> >> >> hot-unplugged to the system. I can't see any oops but after hot-unplug I'm
> >> >> getting huge amounts of messages like:
> >> >>
> >> >> [  352.717001] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717011] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717021] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717032] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717041] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717051] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717061] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717070] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717083] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717094] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717104] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717113] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717124] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717133] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717143] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717153] pci 0000:02:00.0: PME# disabled
> >> >> [  352.717162] pci 0000:02:00.0: PME# disabled
> >> >
> >> > that mean pci_stop_dev() get called again and again ?
> >>
> >> please check if attached patch could help.
> >
> > Well, it looks like what happens is an endless loop in
> > acpiphp_glue.c:disable_slot().
> >
> > dev_in_slot() returns the first device in the list, so
> > pci_stop_and_remove_bus_device() is called for it, but it
> > doesn't remove the device from bus->devices any more, so
> > dev_in_slot() will return the same device next time and
> > so on forever.
> >
> ...
> >
> > So the above won't help in my opinion.
> >
> > I wonder, however, if this patch helps instead:
> >
> > https://patchwork.kernel.org/patch/3540701/
> >
> > I thought it would be 3.15 material, but it very well can go in earlier if
> > it happens to address this particular problem.
> 
> Agree, that should fix the problem.
> 
> but please use list_for_each_entry_safe_reverse
> instead.
> 
> please refer to pciehp changelog in
> 
> commit 29ed1f29b68a8395d5679b3c4e38352b617b3236
> Author: Yinghai Lu <yinghai@kernel.org>
> Date:   Fri Jul 19 12:14:16 2013 -0700
> 
>     PCI: pciehp: Fix null pointer deref when hot-removing SR-IOV device
> 
>     Hot-removing a device with SR-IOV enabled causes a null pointer dereference
>     in v3.9 and v3.10.
> 
>     This is a regression caused by ba518e3c17 ("PCI: pciehp: Iterate over all
>     devices in slot, not functions 0-7").  When we iterate over the
>     bus->devices list, we first remove the PF, which also removes all the VFs
>     from the list.  Then the list iterator blows up because more than just the
>     current entry was removed from the list.
> 
>     ac205b7bb7 ("PCI: make sriov work with hotplug remove") works around a
>     similar problem in pci_stop_bus_devices() by iterating over the list in
>     reverse, so the VFs are stopped and removed from the list first, before the
>     PF.
> 
>     This patch changes pciehp_unconfigure_device() to iterate over the list in
>     reverse, too.
> 
>     [bhelgaas: bugzilla, changelog]
>     Reference: https://bugzilla.kernel.org/show_bug.cgi?id=60604

So I gather all of bus->devices list walks during which devices may be removed
should be done in the reverse order, right?
Yinghai Lu Jan. 31, 2014, 1:38 a.m. UTC | #6
On Thu, Jan 30, 2014 at 4:39 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Thursday, January 30, 2014 03:39:02 PM Yinghai Lu wrote:
>
> So I gather all of bus->devices list walks during which devices may be removed
> should be done in the reverse order, right?

Yes.

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

---
 drivers/pci/remove.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -11,6 +11,7 @@  static void pci_stop_dev(struct pci_dev
 		pci_proc_detach_device(dev);
 		pci_remove_sysfs_dev_files(dev);
 		device_release_driver(&dev->dev);
+		dev->match_driver = false;
 		dev->is_added = 0;
 	}