Patchwork [v2,10/19] PCI: serialize hotplug operaitons triggered by the shpchp driver

login
register
mail settings
Submitter Jiang Liu
Date April 27, 2012, 3:16 p.m.
Message ID <1335539820-11232-11-git-send-email-jiang.liu@huawei.com>
Download mbox | patch
Permalink /patch/155514/
State Superseded
Headers show

Comments

Jiang Liu - April 27, 2012, 3:16 p.m.
From: Jiang Liu <jiang.liu@huawei.com>

Use PCI hotplug lock to serialize hotplug operations triggered by the
shpchp driver.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
---
 drivers/pci/hotplug/shpchp.h      |    2 ++
 drivers/pci/hotplug/shpchp_core.c |    3 ++-
 drivers/pci/hotplug/shpchp_ctrl.c |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 1 deletions(-)
Martin Mokrejs - May 7, 2012, 2:13 p.m.
Martin Mokrejs wrote:
> Hi Greg and Jiang,
>   I am confirming this issue still happens for me with my changed motherboard and express card slot
> of my Dell Vostro 3550 laptop. :( I trigger this when unplugging a FireWire express card. Not always
> but does happen. attached is a camera picture of the Oops.
> 
> 
> wq_worker_sleeping
> __schedule
> schedule
> do_exit
> oops_end
> no_context
> __bad_area_nosemaphore
> ? put_dec
> bad_area_nosemaphore
> do_page_fault
> ? pointer.clone
> ? vsnprintf
> page_fault
> ? sysfs_name_hash
> sysfs_find_dirent
> sysfs_hash_and_remove
> sysfs_remove_bin_file
> pci_remove_resource_files
> pci_remove_sysfs_dev_files
> pci_stop_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> ? pciehp_disable_slot
> pciehp_power_thread
> process_one_work
> worker_thread
> ? manage_workers.clone
> ? manage_workers.clone
> kthread
> kernel_thread_helper
> ? kthread_freezable_should_stop
> ? gs_change

On linux-3.4-rc5 ...

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH - May 7, 2012, 6:40 p.m.
On Mon, May 07, 2012 at 04:13:40PM +0200, Martin Mokrejs wrote:
> Martin Mokrejs wrote:
> > Hi Greg and Jiang,
> >   I am confirming this issue still happens for me with my changed motherboard and express card slot
> > of my Dell Vostro 3550 laptop. :( I trigger this when unplugging a FireWire express card. Not always
> > but does happen. attached is a camera picture of the Oops.
> > 
> > 
> > wq_worker_sleeping
> > __schedule
> > schedule
> > do_exit
> > oops_end
> > no_context
> > __bad_area_nosemaphore
> > ? put_dec
> > bad_area_nosemaphore
> > do_page_fault
> > ? pointer.clone
> > ? vsnprintf
> > page_fault
> > ? sysfs_name_hash
> > sysfs_find_dirent
> > sysfs_hash_and_remove
> > sysfs_remove_bin_file
> > pci_remove_resource_files

Bjorn, the resource stuff changed recently, while pci hotplug didn't
change at all.  Are we trying to remove files that aren't really there
anymore?

Or could we be removing the same file twice?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin Mokrejs - May 7, 2012, 7:23 p.m.
Greg Kroah-Hartman wrote:
> On Mon, May 07, 2012 at 04:13:40PM +0200, Martin Mokrejs wrote:
>> Martin Mokrejs wrote:
>>> Hi Greg and Jiang,
>>>   I am confirming this issue still happens for me with my changed motherboard and express card slot
>>> of my Dell Vostro 3550 laptop. :( I trigger this when unplugging a FireWire express card. Not always
>>> but does happen. attached is a camera picture of the Oops.
>>>
>>>
>>> wq_worker_sleeping
>>> __schedule
>>> schedule
>>> do_exit
>>> oops_end
>>> no_context
>>> __bad_area_nosemaphore
>>> ? put_dec
>>> bad_area_nosemaphore
>>> do_page_fault
>>> ? pointer.clone
>>> ? vsnprintf
>>> page_fault
>>> ? sysfs_name_hash
>>> sysfs_find_dirent
>>> sysfs_hash_and_remove
>>> sysfs_remove_bin_file
>>> pci_remove_resource_files
> 
> Bjorn, the resource stuff changed recently, while pci hotplug didn't
> change at all.  Are we trying to remove files that aren't really there
> anymore?
> 
> Or could we be removing the same file twice?

If you give me some debug patch I can provide you with dozens of inserts and reinserts
of the card and corresponding output (the normal scenario). Unfortunately not the Oops case,
that will not be synced to disk and my remote ttyUSB0 works only partially, only few lines
I ever managed to record. But still you could compare the stacktrace with the correct path?

It seems to me Oops happens when I just pull out the card out of the slot. Maybe the
expected way would be to push the card into the slot to release some lock and it would
be pulled out by itself. I fear this is the intended way to unplug a card.

I played with that a bit now it seems if I just pull out the card, the mechanical lock still
stays in "inserted" position. Next insertion of the card will fail (insert+slip out).
So, every second insertion attempt will succeed and my card will really be held in.

I think I see in logs that output slightly differs on every second ... re-insert of the
card. Although the card presence detection does not work if catches up on card insertion
(some mechanical lock flips). I think removal of the by card pushing it in at least ensures
the lock is flipped as well. Just my impression why something differs on every second
insertion of the card.

I just retested with 3.4-rc6 and this Oops still happens. If I did did not screw it up it
happens even when I carefully always push in the card to get it popped out (the maybe intended
and only allowed card removal?). So the above "every second" scenario does not really help.

Thanks,
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiang Liu - May 8, 2012, 12:01 a.m.
resend and cc linux-pci list.

On 05/06/2012 11:20 PM, Greg Kroah-Hartman wrote:
> A: No.
> Q: Should I include quotations after my reply?
>
> http://daringfireball.net/2007/07/on_top
Thanks for your kindly reminder, Greg!
I'm a newbie to Linux community and there are still many things for me to learn.

>> 	I'm afraid that we can't get the patch set merged in to v3.4. 
> The merge window for 3.4 was months ago, of course this can't be fixed
> there with your large patchset, sorry.
Actually the patch set is still in RFC stage, I forgot to mark it as RFC
last time, will pay attention to this in future.

>> Greg doesn't like the implementation to solve this issue, and asked me
>> to redo it in other ways, but I haven't found other way yet until now.
>> Still keeping punching my header for better solution now.
> I gave you hints on how to implement this properly if you want to, did
> they not work?
I think your have raised three main concerns.
1) Avoid busy wait.
2) Do not introduce a new style lock.
3) Enhance the PCI core instead of changing each PCI hotplug driver.

We could improve or avoid the first issue by replacing msleep() with wait queue.

For the second issue, we may have two candidate solutions: 
a) introduce a global lock for all PCI devices
b) introduce pci_branch_lock()/unlock() to lock/unlock a PCI subtree for hotplug
operations. 
According to my experience with other OS, the second solution is complex and
deadlock prune. It may become more complex on linux due to the device tree abstraction,
and pci_dev/pci_bus splitting. So I took the first solution for trade off. It
may cost more time for hotplug operations, but the implementation will be simpler
and won't cause regression to the PCI core.

If we adopt a global lock for all PCI devices solution, we do need the 
pci_hotplug_try_lock() interface to break deadlock cases. For example,
	Thread A				Thread B
1) try to remove a downstream PCIe port with
   PCIe native hotplug support
2) acquire the global lock
						3) PCIe hotplug event happens, a work
						   item is queued into the workqueue
						4) worker thread wake up and try to
						   handle the hotplug event
						5) try to acquire the global lock
6) try to destroy the PCIe device for the 
   PCIe native hotplug service
7) unbind the pciehp driver and flush the
   workqueue
8) wait for all queued work items to be finished

Then deadlock happens. Interface pci_hotplug_try_lock() is used to break the
deadlock in the worker thread. There are similar deadlock scenario in other
drivers.

For the third issue, it depends on the solution for the second issue.

Some patches in the v2 patch set doesn't depend on the solution for issue two.
So to make things simpler, I will try to split out those patches.

Thanks, Greg! Really appreciate your comments and suggestions! 

-- gerry
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH - May 8, 2012, 12:18 a.m.
On Tue, May 08, 2012 at 08:01:00AM +0800, Jiang Liu wrote:
> >> Greg doesn't like the implementation to solve this issue, and asked me
> >> to redo it in other ways, but I haven't found other way yet until now.
> >> Still keeping punching my header for better solution now.
> > I gave you hints on how to implement this properly if you want to, did
> > they not work?
> I think your have raised three main concerns.
> 1) Avoid busy wait.
> 2) Do not introduce a new style lock.
> 3) Enhance the PCI core instead of changing each PCI hotplug driver.
> 
> We could improve or avoid the first issue by replacing msleep() with wait queue.

No, wait, you should never have done #1 or #2 in the first place.  Don't
try to "solve" #1 any other way than just flat out not doing that at
all.

> For the second issue, we may have two candidate solutions: 
> a) introduce a global lock for all PCI devices

The driver core already has one, so why create another one?

> b) introduce pci_branch_lock()/unlock() to lock/unlock a PCI subtree for hotplug
> operations. 
> According to my experience with other OS, the second solution is complex and
> deadlock prune. It may become more complex on linux due to the device tree abstraction,
> and pci_dev/pci_bus splitting. So I took the first solution for trade off.

I agree, we don't want to lock branches, that is wrong as you have found
out as well.

> It may cost more time for hotplug operations, but the implementation
> will be simpler and won't cause regression to the PCI core.

There are no speed issues with hotplug pci operations, the spec itself
has wait times required in _seconds_, so we are fine here.

> If we adopt a global lock for all PCI devices solution, we do need the 
> pci_hotplug_try_lock() interface to break deadlock cases. For example,
> 	Thread A				Thread B
> 1) try to remove a downstream PCIe port with
>    PCIe native hotplug support
> 2) acquire the global lock
> 						3) PCIe hotplug event happens, a work
> 						   item is queued into the workqueue

Wait right here, how did a "pcie hotplug event happen"?

> 						4) worker thread wake up and try to
> 						   handle the hotplug event
> 						5) try to acquire the global lock
> 6) try to destroy the PCIe device for the 
>    PCIe native hotplug service
> 7) unbind the pciehp driver and flush the
>    workqueue
> 8) wait for all queued work items to be finished

Why would you want to do these last 2 options?

hotplug events should be "simple" in that they happen anytime, and as
you have pointed out, from different places, so just make them block
waiting for other events to complete.

> Then deadlock happens. Interface pci_hotplug_try_lock() is used to break the
> deadlock in the worker thread. There are similar deadlock scenario in other
> drivers.
> 
> For the third issue, it depends on the solution for the second issue.
> 
> Some patches in the v2 patch set doesn't depend on the solution for issue two.
> So to make things simpler, I will try to split out those patches.
> 
> Thanks, Greg! Really appreciate your comments and suggestions! 

Use a single lock, in the pci hotplug core, for this type of thing.
And, I'm pretty sure we have a lock already for this, in the pci core,
as part of the driver core, so you should be fine.

Don't try to make this more complex than it is.  Yes, it's going to take
some refactoring of the messy pci hotplug drivers (sorry, they are a
mess in places), but that will make this all be much simpler in the end,
hopefully.

good luck,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiang Liu - May 8, 2012, 2:13 p.m.
On 05/08/2012 08:18 AM, Greg Kroah-Hartman wrote:
> On Tue, May 08, 2012 at 08:01:00AM +0800, Jiang Liu wrote:
>>>> Greg doesn't like the implementation to solve this issue, and asked me
>>>> to redo it in other ways, but I haven't found other way yet until now.
>>>> Still keeping punching my header for better solution now.
>>> I gave you hints on how to implement this properly if you want to, did
>>> they not work?
>> I think your have raised three main concerns.
>> 1) Avoid busy wait.
>> 2) Do not introduce a new style lock.
>> 3) Enhance the PCI core instead of changing each PCI hotplug driver.
>>
>> We could improve or avoid the first issue by replacing msleep() with wait queue.
> 
> No, wait, you should never have done #1 or #2 in the first place.  Don't
> try to "solve" #1 any other way than just flat out not doing that at
> all.
Sure, I won't change the code until we have reached some agreements.

> 
>> For the second issue, we may have two candidate solutions: 
>> a) introduce a global lock for all PCI devices
> 
> The driver core already has one, so why create another one?
It's for performance and simplicity.

There's already a global rw_semaphore named pci_bus_sem in the PCI core, which
is mainly used to protect "children" and "devices" lists in struct pci_bus.
pci_bus_sem will be used in many hot paths, such as pci_get_slot() etc.
On the other handle, a PCI hotplug operation may last for several seconds,
or even minutes when removing PCI host bridge for the worst case. The new
global lock is introduced to avoid blocking the hot paths for seconds or minutes,

The pci_bus_sem is used as a fine grain lock to protect "children" and "devices"
lists. If we extend it to serialize whole PCI hotplug operations, we must carefully
tune the PCI core to avoid deadlock. The change may not be trivial too.

Actually we are just extending the existing pci_remove_rescan_mutex to 
serialize all PCI hotplug operations. Currently it's only used to serialize
PCI hotplug operations triggered by sysfs interfaces created by PCI core,
such as /sys/devices/pci/xxxxx/{remove,rescan}.

> 
>> It may cost more time for hotplug operations, but the implementation
>> will be simpler and won't cause regression to the PCI core.
> 
> There are no speed issues with hotplug pci operations, the spec itself
> has wait times required in _seconds_, so we are fine here.
Yes. So I follow the rule: optimize for hot paths. It's not a big issue
for hotplug operations to wait for a bit longer.
 
>> If we adopt a global lock for all PCI devices solution, we do need the 
>> pci_hotplug_try_lock() interface to break deadlock cases. For example,
>> 	Thread A				Thread B
>> 1) try to remove a downstream PCIe port with
>>    PCIe native hotplug support
>> 2) acquire the global lock
>> 						3) PCIe hotplug event happens, a work
>> 						   item is queued into the workqueue
> 
> Wait right here, how did a "pcie hotplug event happen"?
We have a system with one PCIe root port connected to an IO expansion box by PCIe cable.
The IO expansion box is essentially an MR-IOV switch, which supports multiple virtual
switches and can dynamically migrate downstream ports among virtual switches. Both the
root port on motherboard and downstream ports on IOBOX support PCIe hotplug.
The entirely system supports:
1) IOH hotplug
2) IOBOX hotplug
3) PCIe hotplug on the IOBOX downstream port
4) IOBOX downstream port migration among virtual switches.

So PCIe hotplug event may be triggered by physical button or port migration operations 
when hot-removing the IOH or IOBOX.

> 
>> 						4) worker thread wake up and try to
>> 						   handle the hotplug event
>> 						5) try to acquire the global lock
>> 6) try to destroy the PCIe device for the 
>>    PCIe native hotplug service
>> 7) unbind the pciehp driver and flush the
>>    workqueue
>> 8) wait for all queued work items to be finished
> 
> Why would you want to do these last 2 options?
> 
> hotplug events should be "simple" in that they happen anytime, and as
> you have pointed out, from different places, so just make them block
> waiting for other events to complete.
The step 7 and 8 are done by the pciehp driver, which is to avoid invalid memory
access after unloading pciehp driver. The real story is that, the wq must be
flushed when unbinding the pciehp driver, otherwise pending work item may cause
invalid memory access after the pciehp driver has been unloaded.

>> Then deadlock happens. Interface pci_hotplug_try_lock() is used to break the
>> deadlock in the worker thread. There are similar deadlock scenario in other
>> drivers.
>>
>> For the third issue, it depends on the solution for the second issue.
>>
>> Some patches in the v2 patch set doesn't depend on the solution for issue two.
>> So to make things simpler, I will try to split out those patches.
>>
>> Thanks, Greg! Really appreciate your comments and suggestions! 
> 
> Use a single lock, in the pci hotplug core, for this type of thing.
> And, I'm pretty sure we have a lock already for this, in the pci core,
> as part of the driver core, so you should be fine.
Please refer to the comments above, I fell that the existing pci_bus_sem is not
suitable for the task here. And we can't rely on lock mechanisms in the device
model framework too, that may interfere with bind/unbind process.

> 
> Don't try to make this more complex than it is.  Yes, it's going to take
> some refactoring of the messy pci hotplug drivers (sorry, they are a
> mess in places), but that will make this all be much simpler in the end,
> hopefully.
Actually I was just trying to find a simple way to fix the issue. I feel it's
more complex to solve this by enhancing the PCI core. But I will keep on thinking
to find other simpler ways for the task.

Thanks, Greg!
--gerry
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH - May 8, 2012, 2:48 p.m.
On Tue, May 08, 2012 at 10:13:07PM +0800, Jiang Liu wrote:
> On 05/08/2012 08:18 AM, Greg Kroah-Hartman wrote:
> > On Tue, May 08, 2012 at 08:01:00AM +0800, Jiang Liu wrote:
> >>>> Greg doesn't like the implementation to solve this issue, and asked me
> >>>> to redo it in other ways, but I haven't found other way yet until now.
> >>>> Still keeping punching my header for better solution now.
> >>> I gave you hints on how to implement this properly if you want to, did
> >>> they not work?
> >> I think your have raised three main concerns.
> >> 1) Avoid busy wait.
> >> 2) Do not introduce a new style lock.
> >> 3) Enhance the PCI core instead of changing each PCI hotplug driver.
> >>
> >> We could improve or avoid the first issue by replacing msleep() with wait queue.
> > 
> > No, wait, you should never have done #1 or #2 in the first place.  Don't
> > try to "solve" #1 any other way than just flat out not doing that at
> > all.
> Sure, I won't change the code until we have reached some agreements.

Hm, that's not the way kernel development works, you never "change"
code, you send patches and they either get accepted or not.

> >> For the second issue, we may have two candidate solutions: 
> >> a) introduce a global lock for all PCI devices
> > 
> > The driver core already has one, so why create another one?
> It's for performance and simplicity.
> 
> There's already a global rw_semaphore named pci_bus_sem in the PCI core, which
> is mainly used to protect "children" and "devices" lists in struct pci_bus.
> pci_bus_sem will be used in many hot paths, such as pci_get_slot() etc.

That's not a "hot patch" at all.

> On the other handle, a PCI hotplug operation may last for several seconds,
> or even minutes when removing PCI host bridge for the worst case. The new
> global lock is introduced to avoid blocking the hot paths for seconds or minutes,

What's wrong with blocking that?  Seriously, this isn't a hot path by
any stretch of the imagination.

> The pci_bus_sem is used as a fine grain lock to protect "children" and "devices"
> lists. If we extend it to serialize whole PCI hotplug operations, we must carefully
> tune the PCI core to avoid deadlock. The change may not be trivial too.

I think you will find it easier than anything else, as you really only
want one lock.

> Actually we are just extending the existing pci_remove_rescan_mutex to 
> serialize all PCI hotplug operations. Currently it's only used to serialize
> PCI hotplug operations triggered by sysfs interfaces created by PCI core,
> such as /sys/devices/pci/xxxxx/{remove,rescan}.
> 
> > 
> >> It may cost more time for hotplug operations, but the implementation
> >> will be simpler and won't cause regression to the PCI core.
> > 
> > There are no speed issues with hotplug pci operations, the spec itself
> > has wait times required in _seconds_, so we are fine here.
> Yes. So I follow the rule: optimize for hot paths. It's not a big issue
> for hotplug operations to wait for a bit longer.

Again, there are no hot paths here.  If there are, then something is
wrong in the drivers that are calling those functions.

> >> If we adopt a global lock for all PCI devices solution, we do need the 
> >> pci_hotplug_try_lock() interface to break deadlock cases. For example,
> >> 	Thread A				Thread B
> >> 1) try to remove a downstream PCIe port with
> >>    PCIe native hotplug support
> >> 2) acquire the global lock
> >> 						3) PCIe hotplug event happens, a work
> >> 						   item is queued into the workqueue
> > 
> > Wait right here, how did a "pcie hotplug event happen"?
> We have a system with one PCIe root port connected to an IO expansion box by PCIe cable.
> The IO expansion box is essentially an MR-IOV switch, which supports multiple virtual
> switches and can dynamically migrate downstream ports among virtual switches. Both the
> root port on motherboard and downstream ports on IOBOX support PCIe hotplug.
> The entirely system supports:
> 1) IOH hotplug
> 2) IOBOX hotplug
> 3) PCIe hotplug on the IOBOX downstream port
> 4) IOBOX downstream port migration among virtual switches.
> 
> So PCIe hotplug event may be triggered by physical button or port migration operations 
> when hot-removing the IOH or IOBOX.

Ok.

> >> 						4) worker thread wake up and try to
> >> 						   handle the hotplug event
> >> 						5) try to acquire the global lock
> >> 6) try to destroy the PCIe device for the 
> >>    PCIe native hotplug service
> >> 7) unbind the pciehp driver and flush the
> >>    workqueue
> >> 8) wait for all queued work items to be finished
> > 
> > Why would you want to do these last 2 options?
> > 
> > hotplug events should be "simple" in that they happen anytime, and as
> > you have pointed out, from different places, so just make them block
> > waiting for other events to complete.
> The step 7 and 8 are done by the pciehp driver, which is to avoid invalid memory
> access after unloading pciehp driver.

Huh?  Unloading it from what?

> The real story is that, the wq must be flushed when unbinding the
> pciehp driver, otherwise pending work item may cause invalid memory
> access after the pciehp driver has been unloaded.

That's fine, just don't do operations on devices that are now gone.

> >> Then deadlock happens. Interface pci_hotplug_try_lock() is used to break the
> >> deadlock in the worker thread. There are similar deadlock scenario in other
> >> drivers.
> >>
> >> For the third issue, it depends on the solution for the second issue.
> >>
> >> Some patches in the v2 patch set doesn't depend on the solution for issue two.
> >> So to make things simpler, I will try to split out those patches.
> >>
> >> Thanks, Greg! Really appreciate your comments and suggestions! 
> > 
> > Use a single lock, in the pci hotplug core, for this type of thing.
> > And, I'm pretty sure we have a lock already for this, in the pci core,
> > as part of the driver core, so you should be fine.
> Please refer to the comments above, I fell that the existing pci_bus_sem is not
> suitable for the task here. And we can't rely on lock mechanisms in the device
> model framework too, that may interfere with bind/unbind process.

No, you should also be locking on that, as it can cause events to
happen.

Just try using the one lock, if that doesn't work for some reason,
please show us why.

> > Don't try to make this more complex than it is.  Yes, it's going to take
> > some refactoring of the messy pci hotplug drivers (sorry, they are a
> > mess in places), but that will make this all be much simpler in the end,
> > hopefully.
> Actually I was just trying to find a simple way to fix the issue. I feel it's
> more complex to solve this by enhancing the PCI core. But I will keep on thinking
> to find other simpler ways for the task.

Remember, we are allowed to change the PCI core here, unlike other
operating systems :)

good luck,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiang Liu - May 8, 2012, 4:34 p.m.
On 05/08/2012 10:48 PM, Greg Kroah-Hartman wrote:
>> There's already a global rw_semaphore named pci_bus_sem in the PCI core, which
>> is mainly used to protect "children" and "devices" lists in struct pci_bus.
>> pci_bus_sem will be used in many hot paths, such as pci_get_slot() etc.
> 
> That's not a "hot patch" at all.
I mean it's much more hot than the hotplug path:)

>>>> 						4) worker thread wake up and try to
>>>> 						   handle the hotplug event
>>>> 						5) try to acquire the global lock
>>>> 6) try to destroy the PCIe device for the 
>>>>    PCIe native hotplug service
>>>> 7) unbind the pciehp driver and flush the
>>>>    workqueue
>>>> 8) wait for all queued work items to be finished
>>>
>>> Why would you want to do these last 2 options?
>>>
>>> hotplug events should be "simple" in that they happen anytime, and as
>>> you have pointed out, from different places, so just make them block
>>> waiting for other events to complete.
>> The step 7 and 8 are done by the pciehp driver, which is to avoid invalid memory
>> access after unloading pciehp driver.
> 
> Huh?  Unloading it from what?
> 
>> The real story is that, the wq must be flushed when unbinding the
>> pciehp driver, otherwise pending work item may cause invalid memory
>> access after the pciehp driver has been unloaded.
> 
> That's fine, just don't do operations on devices that are now gone.
pciehp driver's remove() method will flush the work queue to wait until all queued
task items to be finished. And remove() method may be called under several conditions:
1) hot-remove PCIe root ports/downstream ports with PCIe native hotplug capability
2) echo pci_device_path > /sys/bus/pci_express/drivers/pciehp/unbind 
3) rmmod pciehp

Step 7 and 8 is optional for case 1 and 2, but it's must for case 3 above.
And the remove() can't tell it's called for unbinding or unloading, so remove()
will always flush the work queue. 

>>> Use a single lock, in the pci hotplug core, for this type of thing.
>>> And, I'm pretty sure we have a lock already for this, in the pci core,
>>> as part of the driver core, so you should be fine.
>> Please refer to the comments above, I fell that the existing pci_bus_sem is not
>> suitable for the task here. And we can't rely on lock mechanisms in the device
>> model framework too, that may interfere with bind/unbind process.
> 
> No, you should also be locking on that, as it can cause events to
> happen.
> 
> Just try using the one lock, if that doesn't work for some reason,
> please show us why.

I will try to explain why I think a single lock solution maybe is not the best. 

Actually this patch set is mainly a preparation for PCI host bridge hotplug.
For PCI/PCIe device hotplug operations, it may last for several seconds.
For host bridge hotplug, it may last for much more longer or even indeterminable
for the worst case. Why indeterminable? It's due to a design limitation in the
DMAEngine subsystem, so it may take a very long time to reclaim an Intel IOAT device
(we are working on this issue too).

For the global pci_bus_sem, it's mainly used by:
drivers/pci/{bus.c, probe.c, remove.c, search.c, slot.c}
drivers/pci/pcie/{aspm.c, pme.c}
It's reasonable to use the same lock for PCI hotplug operations with bus.c, probe.c,
remove.c, and may be acceptable for search.c, slot.c too. But I think it may cause
trouble if hotplug share the same lock with aspm.c and pme.c, given the long period
the lock may be held by a PCI hotplug operation.

And currently there are calling paths as below:
1) user triggers hotplug event
2) acquire the down_write(&pci_hotplug_sem)
3) unbind aspm or pme driver
4) aspm/pme driver's remove() method invokes down_read(&pci_bus_sem)
If we use the global lock pci_bus_sem for PCI hotplug operations, we can't use 
down_read(&pci_bus_sem) in any code which may be called by PCI hotplug logic.
That actually changes the pci_bus_sem from a rw_semaphore into a mutex. I think
that penalty is too heavy here.

Thanks!
--gerry
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/pci/hotplug/shpchp.h b/drivers/pci/hotplug/shpchp.h
index 6691dc4..07f3b2d 100644
--- a/drivers/pci/hotplug/shpchp.h
+++ b/drivers/pci/hotplug/shpchp.h
@@ -157,6 +157,7 @@  struct controller {
 #define BLINKINGOFF_STATE		2
 #define POWERON_STATE			3
 #define POWEROFF_STATE			4
+#define SHUTDOWN_STATE			5
 
 /* Error messages */
 #define INTERLOCK_OPEN			0x00000002
@@ -179,6 +180,7 @@  extern u8 shpchp_handle_presence_change(u8 hp_slot, struct controller *ctrl);
 extern u8 shpchp_handle_power_fault(u8 hp_slot, struct controller *ctrl);
 extern int shpchp_configure_device(struct slot *p_slot);
 extern int shpchp_unconfigure_device(struct slot *p_slot);
+extern void shpchp_shutdown_slot(struct slot *slot);
 extern void cleanup_slots(struct controller *ctrl);
 extern void shpchp_queue_pushbutton_work(struct work_struct *work);
 extern int shpc_init( struct controller *ctrl, struct pci_dev *pdev);
diff --git a/drivers/pci/hotplug/shpchp_core.c b/drivers/pci/hotplug/shpchp_core.c
index 19762b3..71cc5f2 100644
--- a/drivers/pci/hotplug/shpchp_core.c
+++ b/drivers/pci/hotplug/shpchp_core.c
@@ -172,10 +172,11 @@  void cleanup_slots(struct controller *ctrl)
 
 	list_for_each_safe(tmp, next, &ctrl->slot_list) {
 		slot = list_entry(tmp, struct slot, slot_list);
-		list_del(&slot->slot_list);
 		flush_workqueue(shpchp_wq);
 		cancel_delayed_work_sync(&slot->work);
+		shpchp_shutdown_slot(slot);
 		flush_workqueue(shpchp_ordered_wq);
+		list_del(&slot->slot_list);
 		pci_hp_deregister(slot->hotplug_slot);
 	}
 }
diff --git a/drivers/pci/hotplug/shpchp_ctrl.c b/drivers/pci/hotplug/shpchp_ctrl.c
index b00b09b..7b8e3a6 100644
--- a/drivers/pci/hotplug/shpchp_ctrl.c
+++ b/drivers/pci/hotplug/shpchp_ctrl.c
@@ -31,6 +31,7 @@ 
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/delay.h>
 #include <linux/pci.h>
 #include "../pci.h"
 #include "shpchp.h"
@@ -406,6 +407,18 @@  static void shpchp_pushbutton_thread(struct work_struct *work)
 	struct pushbutton_work_info *info =
 		container_of(work, struct pushbutton_work_info, work);
 	struct slot *p_slot = info->p_slot;
+	bool shutdown;
+
+	/* Break possible deadlock by using pci_hotplug_try_enter() */
+	while (!pci_hotplug_try_enter()) {
+		mutex_lock(&p_slot->lock);
+		shutdown = p_slot->state == SHUTDOWN_STATE;
+		mutex_unlock(&p_slot->lock);
+		if (shutdown)
+			goto out;
+		else
+			mdelay(1);
+	}
 
 	mutex_lock(&p_slot->lock);
 	switch (p_slot->state) {
@@ -427,6 +440,9 @@  static void shpchp_pushbutton_thread(struct work_struct *work)
 	}
 	mutex_unlock(&p_slot->lock);
 
+	pci_hotplug_exit();
+
+out:
 	kfree(info);
 }
 
@@ -729,3 +745,19 @@  int shpchp_sysfs_disable_slot(struct slot *p_slot)
 
 	return retval;
 }
+
+void shpchp_shutdown_slot(struct slot *slot)
+{
+	u8 getstatus;
+
+	mutex_lock(&slot->lock);
+	slot->state = SHUTDOWN_STATE;
+	mutex_unlock(&slot->lock);
+
+	slot->hpc_ops->set_attention_status(slot, 0);
+	slot->hpc_ops->get_power_status(slot, &getstatus);
+	if (getstatus)
+		slot->hpc_ops->green_led_on(slot);
+	else
+		slot->hpc_ops->green_led_off(slot);
+}