diff mbox series

[2/2] PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors

Message ID 20211129121934.4963-2-hdegoede@redhat.com
State New
Headers show
Series [1/2] PCI: Add a pci_dev_depth() helper function | expand

Commit Message

Hans de Goede Nov. 29, 2021, 12:19 p.m. UTC
Use down_read_nested() and down_write_nested() when taking the
ctrl->reset_lock rw-sem, passing the PCI-device depth in the hierarchy
as lock subclass parameter. This fixes the following false-positive lockdep
report when unplugging a Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:

[   28.583853] pcieport 0000:06:01.0: pciehp: Slot(1): Link Down
[   28.583891] pcieport 0000:06:01.0: pciehp: Slot(1): Card not present
[   28.583995] pcieport 0000:09:04.0: can't change power state from D3cold to D0 (config space inaccessible)

[   28.584849] ============================================
[   28.584854] WARNING: possible recursive locking detected
[   28.584858] 5.16.0-rc2+ #621 Not tainted
[   28.584864] --------------------------------------------
[   28.584867] irq/124-pciehp/86 is trying to acquire lock:
[   28.584873] ffff8e5ac4299ef8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_check_presence+0x23/0x80
[   28.584904]
               but task is already holding lock:
[   28.584908] ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180
[   28.584929]
               other info that might help us debug this:
[   28.584933]  Possible unsafe locking scenario:

[   28.584936]        CPU0
[   28.584939]        ----
[   28.584942]   lock(&ctrl->reset_lock);
[   28.584949]   lock(&ctrl->reset_lock);
[   28.584955]
                *** DEADLOCK ***

[   28.584959]  May be due to missing lock nesting notation

[   28.584963] 3 locks held by irq/124-pciehp/86:
[   28.584970]  #0: ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180
[   28.584991]  #1: ffffffffa3b024e8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pciehp_unconfigure_device+0x31/0x110
[   28.585012]  #2: ffff8e5ac1ee2248 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x1c/0x40
[   28.585037]
               stack backtrace:
[   28.585042] CPU: 4 PID: 86 Comm: irq/124-pciehp Not tainted 5.16.0-rc2+ #621
[   28.585052] Hardware name: LENOVO 20U90SIT19/20U90SIT19, BIOS N2WET30W (1.20 ) 08/26/2021
[   28.585059] Call Trace:
[   28.585064]  <TASK>
[   28.585073]  dump_stack_lvl+0x59/0x73
[   28.585087]  __lock_acquire.cold+0xc5/0x2c6
[   28.585106]  ? find_held_lock+0x2b/0x80
[   28.585124]  lock_acquire+0xb5/0x2b0
[   28.585132]  ? pciehp_check_presence+0x23/0x80
[   28.585144]  ? lock_is_held_type+0xa8/0x120
[   28.585161]  down_read+0x3e/0x50
[   28.585172]  ? pciehp_check_presence+0x23/0x80
[   28.585183]  pciehp_check_presence+0x23/0x80
[   28.585194]  pciehp_runtime_resume+0x5c/0xa0
[   28.585206]  ? pci_msix_init+0x60/0x60
[   28.585214]  device_for_each_child+0x45/0x70
[   28.585227]  pcie_port_device_runtime_resume+0x20/0x30
[   28.585236]  pci_pm_runtime_resume+0xa7/0xc0
[   28.585246]  ? pci_pm_freeze_noirq+0x100/0x100
[   28.585257]  __rpm_callback+0x41/0x110
[   28.585271]  ? pci_pm_freeze_noirq+0x100/0x100
[   28.585281]  rpm_callback+0x59/0x70
[   28.585293]  rpm_resume+0x512/0x7b0
[   28.585309]  __pm_runtime_resume+0x4a/0x90
[   28.585322]  __device_release_driver+0x28/0x240
[   28.585338]  device_release_driver+0x26/0x40
[   28.585351]  pci_stop_bus_device+0x68/0x90
[   28.585363]  pci_stop_bus_device+0x2c/0x90
[   28.585373]  pci_stop_and_remove_bus_device+0xe/0x20
[   28.585384]  pciehp_unconfigure_device+0x6c/0x110
[   28.585396]  ? __pm_runtime_resume+0x58/0x90
[   28.585409]  pciehp_disable_slot+0x5b/0xe0
[   28.585421]  pciehp_handle_presence_or_link_change+0xc3/0x2f0
[   28.585436]  pciehp_ist+0x179/0x180
[   28.585449]  ? disable_irq_nosync+0x10/0x10
[   28.585460]  irq_thread_fn+0x1d/0x60
[   28.585470]  ? irq_thread+0x81/0x1a0
[   28.585480]  irq_thread+0xcb/0x1a0
[   28.585491]  ? irq_thread_fn+0x60/0x60
[   28.585502]  ? irq_thread_check_affinity+0xb0/0xb0
[   28.585514]  kthread+0x165/0x190
[   28.585522]  ? set_kthread_struct+0x40/0x40
[   28.585531]  ret_from_fork+0x1f/0x30
[   28.585554]  </TASK>
[   28.586512] xhci_hcd 0000:0a:00.0: remove, state 1
[   28.586538] usb usb4: USB disconnect, device number 1
[   28.586547] usb 4-2: USB disconnect, device number 2
[   28.586555] usb 4-2.1: USB disconnect, device number 3
[   28.586561] usb 4-2.1.2: USB disconnect, device number 5
[   28.587709] xhci_hcd 0000:0a:00.0: xHCI host controller not responding, assume dead
[   28.590021] usb 4-2.3: USB disconnect, device number 4
[   28.613814] xhci_hcd 0000:0a:00.0: WARN Can't disable streams for endpoint 0x1, streams are being disabled already
[   28.616865] xhci_hcd 0000:0a:00.0: USB bus 4 deregistered
[   28.617082] xhci_hcd 0000:0a:00.0: remove, state 1
[   28.617089] usb usb3: USB disconnect, device number 1
[   28.617092] usb 3-2: USB disconnect, device number 2
[   28.617094] usb 3-2.1: USB disconnect, device number 3
[   28.617096] usb 3-2.1.1: USB disconnect, device number 6
[   28.617098] usb 3-2.1.1.4: USB disconnect, device number 8
[   28.645354] usb 3-2.1.3: USB disconnect, device number 7
[   28.645357] usb 3-2.1.3.1: USB disconnect, device number 10
[   28.647489] usb 3-2.1.3.4: USB disconnect, device number 11
[   28.647494] usb 3-2.1.3.4.1: USB disconnect, device number 13
[   28.760411] usb 3-2.1.4: USB disconnect, device number 9
[   28.760414] usb 3-2.1.4.1: USB disconnect, device number 12
[   28.795513] usb 3-2.4: USB disconnect, device number 4
[   28.821464] usb 3-2.5: USB disconnect, device number 5
[   28.822850] xhci_hcd 0000:0a:00.0: Host halt failed, -19
[   28.822854] xhci_hcd 0000:0a:00.0: Host not accessible, reset failed.
[   28.823331] xhci_hcd 0000:0a:00.0: USB bus 3 deregistered
[   28.823589] pci 0000:0a:00.0: Removing from iommu group 15
[   28.823605] pci_bus 0000:0a: busn_res: [bus 0a] is released
[   28.823660] pci 0000:09:02.0: Removing from iommu group 15
[   28.823729] pci_bus 0000:0b: busn_res: [bus 0b-2c] is released
[   28.823773] pci 0000:09:04.0: Removing from iommu group 15
[   28.823782] pci_bus 0000:09: busn_res: [bus 09-2c] is released
[   28.823851] pci 0000:08:00.0: Removing from iommu group 15

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
 drivers/pci/hotplug/pciehp.h      | 1 +
 drivers/pci/hotplug/pciehp_core.c | 2 +-
 drivers/pci/hotplug/pciehp_hpc.c  | 7 ++++---
 3 files changed, 6 insertions(+), 4 deletions(-)

Comments

Lukas Wunner Nov. 29, 2021, 3:39 p.m. UTC | #1
On Mon, Nov 29, 2021 at 01:19:34PM +0100, Hans de Goede wrote:
> Use down_read_nested() and down_write_nested() when taking the
> ctrl->reset_lock rw-sem, passing the PCI-device depth in the hierarchy
> as lock subclass parameter. This fixes the following false-positive lockdep
> report when unplugging a Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:
[...]
> Signed-off-by: Hans de Goede <hdegoede@redhat.com>

That's exactly what I had in mind, thanks.

Reported-by: "Theodore Ts'o" <tytso@mit.edu>
Link: https://lore.kernel.org/linux-pci/20190402021933.GA2966@mit.edu/
Link: https://lore.kernel.org/linux-pci/de684a28-9038-8fc6-27ca-3f6f2f6400d7@redhat.com/
Reviewed-by: Lukas Wunner <lukas@wunner.de>


> Note the 2nd patch can probably use a Fixes: tag but I had no
> idea which commit to put there. Or maybe add a Cc: stable to
> both patches?

I'd just add a stable designation and let the stable maintainers decide
which versions to backport to.  The problem I see is the dependency on
the first patch in the series.  In theory there's a syntax to specify
such prerequisites (see "Option 3" in
Documentation/process/stable-kernel-rules.rst), but in practice,
my experience is that stable maintainers may ignore such prerequisite tags.
It might be simplest to just squash the two patches together.

If you do respin, it would be good to explain in the commit message why
one lockdep class is used per hierarchy level:  This is done to conserve
class keys because their number is limited and the complexity grows
quadratically with number of keys according to
Documentation/locking/lockdep-design.rst.

It would also be good to explain why the lockdep splat occurs and why
it's a false positive:  With Thunderbolt, hotplug ports are nested.  When
removing multiple devices in a daisy-chain, each hotplug port's reset_lock
may be acquired recursively.  It's never the same lock, so the lockdep
splat is a false positive.  Because locks at the same hierarchy level
are never acquired recursively, a per-level lockdep class is sufficient
to fix the lockdep splat.

Thanks,

Lukas
Lukas Wunner Nov. 29, 2021, 3:45 p.m. UTC | #2
On Mon, Nov 29, 2021 at 01:19:34PM +0100, Hans de Goede wrote:
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -106,6 +106,7 @@ struct controller {
>  
>  	struct hotplug_slot hotplug_slot;	/* hotplug core interface */
>  	struct rw_semaphore reset_lock;
> +	unsigned int depth;
>  	unsigned int ist_running;
>  	int request_result;
>  	wait_queue_head_t requester;

Could you amend the kernel-doc of the struct with a short explanation
of the attribute you're adding above?  Thanks!
Lukas Wunner Nov. 29, 2021, 6:59 p.m. UTC | #3
On Mon, Nov 29, 2021 at 04:39:43PM +0100, Lukas Wunner wrote:
> On Mon, Nov 29, 2021 at 01:19:34PM +0100, Hans de Goede wrote:
> > Use down_read_nested() and down_write_nested() when taking the
> > ctrl->reset_lock rw-sem, passing the PCI-device depth in the hierarchy
> > as lock subclass parameter. This fixes the following false-positive lockdep
> > report when unplugging a Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:
> [...]
> > Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> 
> That's exactly what I had in mind, thanks.

Hm, I found the notes that I took when I investigated Theodore's report:
Using a subclass may be problematic because it's limited to a value < 8
(MAX_LOCKDEP_SUBCLASSES).  If there's a hotplug port at a deeper level
than 8 in the PCI hierarchy (can easily happen, Thunderbolt daisy chain
allows up to 6 devices, each device contains a PCIe switch, so 2 levels per
device), look_up_lock_class() in kernel/locking/lockdep.c will emit a BUG
error message.

It may be necessary to call lockdep_register_key() for each level or
for each hotplug port and assign the lock with lockdep_set_class()
(or ..._and_name() and use the dev_name()).

It's these complications that made me put aside the problem back in the day.
My apologies for not remembering them earlier.

Thanks,

Lukas
Bjorn Helgaas Nov. 29, 2021, 11:45 p.m. UTC | #4
On Mon, Nov 29, 2021 at 01:19:34PM +0100, Hans de Goede wrote:
> Use down_read_nested() and down_write_nested() when taking the
> ctrl->reset_lock rw-sem, passing the PCI-device depth in the hierarchy
> as lock subclass parameter. This fixes the following false-positive lockdep
> report when unplugging a Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:
> 
> [   28.583853] pcieport 0000:06:01.0: pciehp: Slot(1): Link Down
> [   28.583891] pcieport 0000:06:01.0: pciehp: Slot(1): Card not present
> [   28.583995] pcieport 0000:09:04.0: can't change power state from D3cold to D0 (config space inaccessible)
> 
> [   28.584849] ============================================
> [   28.584854] WARNING: possible recursive locking detected
> [   28.584858] 5.16.0-rc2+ #621 Not tainted
> [   28.584864] --------------------------------------------
> [   28.584867] irq/124-pciehp/86 is trying to acquire lock:
> [   28.584873] ffff8e5ac4299ef8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_check_presence+0x23/0x80
> [   28.584904]
>                but task is already holding lock:
> [   28.584908] ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180
> [   28.584929]
>                other info that might help us debug this:
> [   28.584933]  Possible unsafe locking scenario:
> 
> [   28.584936]        CPU0
> [   28.584939]        ----
> [   28.584942]   lock(&ctrl->reset_lock);
> [   28.584949]   lock(&ctrl->reset_lock);
> [   28.584955]
>                 *** DEADLOCK ***
> 
> [   28.584959]  May be due to missing lock nesting notation
> 
> [   28.584963] 3 locks held by irq/124-pciehp/86:
> [   28.584970]  #0: ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180
> [   28.584991]  #1: ffffffffa3b024e8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pciehp_unconfigure_device+0x31/0x110
> [   28.585012]  #2: ffff8e5ac1ee2248 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x1c/0x40
> [   28.585037]
>                stack backtrace:
> [   28.585042] CPU: 4 PID: 86 Comm: irq/124-pciehp Not tainted 5.16.0-rc2+ #621
> [   28.585052] Hardware name: LENOVO 20U90SIT19/20U90SIT19, BIOS N2WET30W (1.20 ) 08/26/2021
> [   28.585059] Call Trace:
> [   28.585064]  <TASK>
> [   28.585073]  dump_stack_lvl+0x59/0x73
> [   28.585087]  __lock_acquire.cold+0xc5/0x2c6
> [   28.585106]  ? find_held_lock+0x2b/0x80
> [   28.585124]  lock_acquire+0xb5/0x2b0
> [   28.585132]  ? pciehp_check_presence+0x23/0x80
> [   28.585144]  ? lock_is_held_type+0xa8/0x120
> [   28.585161]  down_read+0x3e/0x50
> [   28.585172]  ? pciehp_check_presence+0x23/0x80
> [   28.585183]  pciehp_check_presence+0x23/0x80
> [   28.585194]  pciehp_runtime_resume+0x5c/0xa0
> [   28.585206]  ? pci_msix_init+0x60/0x60
> [   28.585214]  device_for_each_child+0x45/0x70
> [   28.585227]  pcie_port_device_runtime_resume+0x20/0x30
> [   28.585236]  pci_pm_runtime_resume+0xa7/0xc0
> [   28.585246]  ? pci_pm_freeze_noirq+0x100/0x100
> [   28.585257]  __rpm_callback+0x41/0x110
> [   28.585271]  ? pci_pm_freeze_noirq+0x100/0x100
> [   28.585281]  rpm_callback+0x59/0x70
> [   28.585293]  rpm_resume+0x512/0x7b0
> [   28.585309]  __pm_runtime_resume+0x4a/0x90
> [   28.585322]  __device_release_driver+0x28/0x240
> [   28.585338]  device_release_driver+0x26/0x40
> [   28.585351]  pci_stop_bus_device+0x68/0x90
> [   28.585363]  pci_stop_bus_device+0x2c/0x90
> [   28.585373]  pci_stop_and_remove_bus_device+0xe/0x20
> [   28.585384]  pciehp_unconfigure_device+0x6c/0x110
> [   28.585396]  ? __pm_runtime_resume+0x58/0x90
> [   28.585409]  pciehp_disable_slot+0x5b/0xe0
> [   28.585421]  pciehp_handle_presence_or_link_change+0xc3/0x2f0
> [   28.585436]  pciehp_ist+0x179/0x180
> [   28.585449]  ? disable_irq_nosync+0x10/0x10
> [   28.585460]  irq_thread_fn+0x1d/0x60
> [   28.585470]  ? irq_thread+0x81/0x1a0
> [   28.585480]  irq_thread+0xcb/0x1a0
> [   28.585491]  ? irq_thread_fn+0x60/0x60
> [   28.585502]  ? irq_thread_check_affinity+0xb0/0xb0
> [   28.585514]  kthread+0x165/0x190
> [   28.585522]  ? set_kthread_struct+0x40/0x40
> [   28.585531]  ret_from_fork+0x1f/0x30
> [   28.585554]  </TASK>
> [   28.586512] xhci_hcd 0000:0a:00.0: remove, state 1
> [   28.586538] usb usb4: USB disconnect, device number 1
> [   28.586547] usb 4-2: USB disconnect, device number 2
> [   28.586555] usb 4-2.1: USB disconnect, device number 3
> [   28.586561] usb 4-2.1.2: USB disconnect, device number 5
> [   28.587709] xhci_hcd 0000:0a:00.0: xHCI host controller not responding, assume dead
> [   28.590021] usb 4-2.3: USB disconnect, device number 4
> [   28.613814] xhci_hcd 0000:0a:00.0: WARN Can't disable streams for endpoint 0x1, streams are being disabled already
> [   28.616865] xhci_hcd 0000:0a:00.0: USB bus 4 deregistered
> [   28.617082] xhci_hcd 0000:0a:00.0: remove, state 1
> [   28.617089] usb usb3: USB disconnect, device number 1
> [   28.617092] usb 3-2: USB disconnect, device number 2
> [   28.617094] usb 3-2.1: USB disconnect, device number 3
> [   28.617096] usb 3-2.1.1: USB disconnect, device number 6
> [   28.617098] usb 3-2.1.1.4: USB disconnect, device number 8
> [   28.645354] usb 3-2.1.3: USB disconnect, device number 7
> [   28.645357] usb 3-2.1.3.1: USB disconnect, device number 10
> [   28.647489] usb 3-2.1.3.4: USB disconnect, device number 11
> [   28.647494] usb 3-2.1.3.4.1: USB disconnect, device number 13
> [   28.760411] usb 3-2.1.4: USB disconnect, device number 9
> [   28.760414] usb 3-2.1.4.1: USB disconnect, device number 12
> [   28.795513] usb 3-2.4: USB disconnect, device number 4
> [   28.821464] usb 3-2.5: USB disconnect, device number 5
> [   28.822850] xhci_hcd 0000:0a:00.0: Host halt failed, -19
> [   28.822854] xhci_hcd 0000:0a:00.0: Host not accessible, reset failed.
> [   28.823331] xhci_hcd 0000:0a:00.0: USB bus 3 deregistered
> [   28.823589] pci 0000:0a:00.0: Removing from iommu group 15
> [   28.823605] pci_bus 0000:0a: busn_res: [bus 0a] is released
> [   28.823660] pci 0000:09:02.0: Removing from iommu group 15
> [   28.823729] pci_bus 0000:0b: busn_res: [bus 0b-2c] is released
> [   28.823773] pci 0000:09:04.0: Removing from iommu group 15
> [   28.823782] pci_bus 0000:09: busn_res: [bus 09-2c] is released
> [   28.823851] pci 0000:08:00.0: Removing from iommu group 15

I think you're planning a v2 for Lukas comments.  When you do, can you
strip out the timestamps above, since they don't contribute to
understanding the problem.  It would also be helpful if you could
remove the irrelevant items from the stack trace and dmesg, e.g.,
"Can't disable streams", "busn_res", "Removing from iommu group",
"Host halt failed", maybe "USB disconnect", etc.  We mainly want to
see the call paths that cause the lockdep warning, and just enough of
the dmesg to help someone match their symptom to this fix.

Bjorn
Hans de Goede Nov. 30, 2021, 10:15 a.m. UTC | #5
Hi Lukas,

First of all thank you for the review and for all the remarks,
I agree with all of them, so I'll incorporate all of them in v2
of this patch-set including squashing the 2 patches together.

On 11/29/21 19:59, Lukas Wunner wrote:
> On Mon, Nov 29, 2021 at 04:39:43PM +0100, Lukas Wunner wrote:
>> On Mon, Nov 29, 2021 at 01:19:34PM +0100, Hans de Goede wrote:
>>> Use down_read_nested() and down_write_nested() when taking the
>>> ctrl->reset_lock rw-sem, passing the PCI-device depth in the hierarchy
>>> as lock subclass parameter. This fixes the following false-positive lockdep
>>> report when unplugging a Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:
>> [...]
>>> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
>>
>> That's exactly what I had in mind, thanks.
> 
> Hm, I found the notes that I took when I investigated Theodore's report:
> Using a subclass may be problematic because it's limited to a value < 8
> (MAX_LOCKDEP_SUBCLASSES).  If there's a hotplug port at a deeper level
> than 8 in the PCI hierarchy (can easily happen, Thunderbolt daisy chain
> allows up to 6 devices, each device contains a PCIe switch, so 2 levels per
> device), look_up_lock_class() in kernel/locking/lockdep.c will emit a BUG
> error message.
> 
> It may be necessary to call lockdep_register_key() for each level or
> for each hotplug port and assign the lock with lockdep_set_class()
> (or ..._and_name() and use the dev_name()).
> 
> It's these complications that made me put aside the problem back in the day.
> My apologies for not remembering them earlier.

No worries.

I've been looking at how to rework things and this should be pretty doable.

My plan is to have a global static array of lock_class_key-s in
drivers/pci/hotplug/pciehp_hpc.c, one per depth level (say 32 of them ?)
and then use lockdep_set_class() and that looks like it should be pretty
doable.

When using a static global array of lock_class_key-s there is no need to call
lockdep_register_key().

Also see:

drivers/usb/storage/usb.c

Which does something similar to avoid issues when a single usb-device has
multiple USB-storage interfaces.

Regards,

Hans
Lukas Wunner Nov. 30, 2021, 7:39 p.m. UTC | #6
On Tue, Nov 30, 2021 at 11:15:40AM +0100, Hans de Goede wrote:
> On 11/29/21 19:59, Lukas Wunner wrote:
> > Hm, I found the notes that I took when I investigated Theodore's report:
> > Using a subclass may be problematic because it's limited to a value < 8
> > (MAX_LOCKDEP_SUBCLASSES).  If there's a hotplug port at a deeper level
> > than 8 in the PCI hierarchy (can easily happen, Thunderbolt daisy chain
> > allows up to 6 devices, each device contains a PCIe switch, so 2 levels per
> > device), look_up_lock_class() in kernel/locking/lockdep.c will emit a BUG
> > error message.

Actually, after thinking about this some more it has occurred to me
that you should be able to make do with 8 subclasses if you change
the algorithm in patch [1/2] to something like this:

	while (bus->parent) {
-		depth++;
		bus = bus->parent;
+		if (bus->self->is_hotplug_bridge)
+			depth++;
	}

(It may be necessary to add a NULL pointer check for bus->self,
off the top of my hat I'm not sure if that's set for the root bus.)

Because with the patches as they are, the array of 8 subclasses is
sparsely populated, i.e. if a parent port is not a hotplug port,
a subclass is wasted.  You only care about hotplug ports (more
specifically those handled by pciehp) because those are the only
ones with a reset_lock.  The above change should result in a
densely populated array of subclasses.

Naturally, because this makes the function pciehp-specific,
it should be moved from include/linux/pci.h to pciehp_hpc.c.

With Thunderbolt you can have 6 devices plus the hotplug port in
the host controller.  And there may be an 8th hotplug port at the
Root Port if the host controller can be "unplugged" when it goes
to D3cold.  (That's handled by acpiphp, not pciehp, and I'm not
even sure if is_hotplug_bridge is set in that case.)
So 8 subclasses should be sufficient.

Aside from Thunderbolt, nested hotplug ports exist in data centers
with hot-removable NVMe slots in hot-removable chassis, but I highly
doubt those use cases have more than 8 levels of hotplug ports.

So that would probably be a pragmatic and minimalistic approach to this
problem.

Thanks,

Lukas
Hans de Goede Dec. 2, 2021, 11:52 a.m. UTC | #7
Hi,

On 11/30/21 20:39, Lukas Wunner wrote:
> On Tue, Nov 30, 2021 at 11:15:40AM +0100, Hans de Goede wrote:
>> On 11/29/21 19:59, Lukas Wunner wrote:
>>> Hm, I found the notes that I took when I investigated Theodore's report:
>>> Using a subclass may be problematic because it's limited to a value < 8
>>> (MAX_LOCKDEP_SUBCLASSES).  If there's a hotplug port at a deeper level
>>> than 8 in the PCI hierarchy (can easily happen, Thunderbolt daisy chain
>>> allows up to 6 devices, each device contains a PCIe switch, so 2 levels per
>>> device), look_up_lock_class() in kernel/locking/lockdep.c will emit a BUG
>>> error message.
> 
> Actually, after thinking about this some more it has occurred to me
> that you should be able to make do with 8 subclasses if you change
> the algorithm in patch [1/2] to something like this:
> 
> 	while (bus->parent) {
> -		depth++;
> 		bus = bus->parent;
> +		if (bus->self->is_hotplug_bridge)
> +			depth++;
> 	}
> 
> (It may be necessary to add a NULL pointer check for bus->self,
> off the top of my hat I'm not sure if that's set for the root bus.)
> 
> Because with the patches as they are, the array of 8 subclasses is
> sparsely populated, i.e. if a parent port is not a hotplug port,
> a subclass is wasted.  You only care about hotplug ports (more
> specifically those handled by pciehp) because those are the only
> ones with a reset_lock.  The above change should result in a
> densely populated array of subclasses.
> 
> Naturally, because this makes the function pciehp-specific,
> it should be moved from include/linux/pci.h to pciehp_hpc.c.
> 
> With Thunderbolt you can have 6 devices plus the hotplug port in
> the host controller.  And there may be an 8th hotplug port at the
> Root Port if the host controller can be "unplugged" when it goes
> to D3cold.  (That's handled by acpiphp, not pciehp, and I'm not
> even sure if is_hotplug_bridge is set in that case.)
> So 8 subclasses should be sufficient.
> 
> Aside from Thunderbolt, nested hotplug ports exist in data centers
> with hot-removable NVMe slots in hot-removable chassis, but I highly
> doubt those use cases have more than 8 levels of hotplug ports.
> 
> So that would probably be a pragmatic and minimalistic approach to this
> problem.

Ack. I've prepared and tested a v2 patch using this approach, this
v2 also addresses all your and Bjorn's other remarks.

I will send out the v2 patch right after this email.

Regards,

Hans
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 918dccbc74b6..7de17f32bf8c 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -106,6 +106,7 @@  struct controller {
 
 	struct hotplug_slot hotplug_slot;	/* hotplug core interface */
 	struct rw_semaphore reset_lock;
+	unsigned int depth;
 	unsigned int ist_running;
 	int request_result;
 	wait_queue_head_t requester;
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index f34114d45259..4042d87d539d 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -166,7 +166,7 @@  static void pciehp_check_presence(struct controller *ctrl)
 {
 	int occupied;
 
-	down_read(&ctrl->reset_lock);
+	down_read_nested(&ctrl->reset_lock, ctrl->depth);
 	mutex_lock(&ctrl->state_lock);
 
 	occupied = pciehp_card_present_or_link_active(ctrl);
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 83a0fa119cae..54643a95baad 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -583,7 +583,7 @@  static void pciehp_ignore_dpc_link_change(struct controller *ctrl,
 	 * the corresponding link change may have been ignored above.
 	 * Synthesize it to ensure that it is acted on.
 	 */
-	down_read(&ctrl->reset_lock);
+	down_read_nested(&ctrl->reset_lock, ctrl->depth);
 	if (!pciehp_check_link_active(ctrl))
 		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
 	up_read(&ctrl->reset_lock);
@@ -746,7 +746,7 @@  static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	 * Disable requests have higher priority than Presence Detect Changed
 	 * or Data Link Layer State Changed events.
 	 */
-	down_read(&ctrl->reset_lock);
+	down_read_nested(&ctrl->reset_lock, ctrl->depth);
 	if (events & DISABLE_SLOT)
 		pciehp_handle_disable_request(ctrl);
 	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
@@ -906,7 +906,7 @@  int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, bool probe)
 	if (probe)
 		return 0;
 
-	down_write(&ctrl->reset_lock);
+	down_write_nested(&ctrl->reset_lock, ctrl->depth);
 
 	if (!ATTN_BUTTN(ctrl)) {
 		ctrl_mask |= PCI_EXP_SLTCTL_PDCE;
@@ -975,6 +975,7 @@  struct controller *pcie_init(struct pcie_device *dev)
 		return NULL;
 
 	ctrl->pcie = dev;
+	ctrl->depth = pci_dev_depth(dev->port);
 	pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
 
 	if (pdev->hotplug_user_indicators)