Patchwork [repost,for-3.9] pci: avoid work_on_cpu for nested SRIOV probes

Submitter Michael S. Tsirkin
Date April 18, 2013, 8:33 a.m.
Message ID <20130418083347.GA16526@redhat.com>
Permalink /patch/237585/
State Not Applicable
Delegated to: David Miller

Comments

Michael S. Tsirkin - April 18, 2013, 8:33 a.m.
On Sun, Apr 14, 2013 at 06:43:39AM -0700, Tejun Heo wrote:
> On Sun, Apr 14, 2013 at 03:58:55PM +0300, Or Gerlitz wrote:
> > So the patch eliminated the lockdep warning for mlx4 nested probing
> > sequence, but introduced lockdep warning for
> > 00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC
> > Interrupt Controller (rev 22)
> 
> Oops, the patch in itself doesn't really change anything.  The caller
> should use a different subclass for the nested invocation, just like
> spin_lock_nested() and friends.  Sorry about not being clear.
> Michael, can you please help?
> 
> Thanks.
> 
> -- 
> tejun

So like this on top. Tejun, you didn't add your Signed-off-by and a patch
description; if this helps as we expect, they will be needed.

---->

pci: use work_on_cpu_nested for nested SRIOV

Since 3.9-rc1 the mlx4 driver started triggering a lockdep warning.

The issue is that a driver, in its probe function, calls
pci_sriov_enable, so a PF device probe causes a VF probe (AKA a nested
probe).  Each probe in pci_device_probe is (normally) run through
work_on_cpu (this is to get the right NUMA node for memory allocated by
the driver).  In turn, work_on_cpu does this internally:

        schedule_work_on(cpu, &wfc.work);
        flush_work(&wfc.work);

So if you are running a probe on CPU1 and it causes another
probe on the same CPU, the nested probe will try to flush the
workqueue from inside that same workqueue, which triggers
a lockdep warning.

Nested probing might be tricky to get right generally.

But for pci_sriov_enable, the situation is actually very simple:
VFs almost never use the same driver as the PF so the warning
is bogus there.

This is hardly elegant, as it might shut up some real warnings if a buggy
driver actually probes itself in a nested way, but it looks to me like an
appropriate quick fix for 3.9.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
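The self-flush deadlock and the fix's run-inline escape hatch can be sketched in a userspace analogy. This is plain Python, not kernel code: a single-worker thread pool stands in for the per-CPU workqueue, the recorded worker thread id stands in for the patch's `cpu != raw_smp_processor_id()` check, and all names here (`run_on_worker`, `probe_pf`, `probe_vf`) are hypothetical illustrations, not functions from the patch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

worker_ident = None

def _record_worker():
    # Remember which thread the single worker is (analogy for knowing
    # which CPU the probe work runs on).
    global worker_ident
    worker_ident = threading.get_ident()

executor = ThreadPoolExecutor(max_workers=1, initializer=_record_worker)

def run_on_worker(fn, *args):
    """Run fn on the worker, like work_on_cpu(): schedule then wait.

    If we already *are* the worker (a nested call), run inline instead,
    mirroring the patched pci_call_probe(): waiting on our own queue
    from inside it would never complete.
    """
    if threading.get_ident() == worker_ident:
        return fn(*args)  # nested call: avoid the self-flush deadlock
    return executor.submit(fn, *args).result()  # schedule + flush

def probe_vf():
    return "vf probed"

def probe_pf():
    # PF probe enables SR-IOV, which triggers a nested VF probe.
    return run_on_worker(probe_vf)
```

With the ident check, `run_on_worker(probe_pf)` completes and returns `"vf probed"`; without it, the nested `.result()` would wait forever on a queue whose only worker is itself busy waiting.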
Michael S. Tsirkin - April 18, 2013, 8:48 a.m.
On Thu, Apr 18, 2013 at 12:40:09PM +0300, Jack Morgenstein wrote:
> On Thursday 18 April 2013 11:33, Michael S. Tsirkin wrote:
> > But for pci_sriov_enable, the situation is actually very simple:
> > VFs almost never use the same driver as the PF so the warning
> > is bogus there.
> > 
> What about the case where the VF driver IS the same as the PF driver?

Then it can deadlock, e.g. if the driver takes a global mutex.  But it's an
internal driver issue then; you can trigger a deadlock through hardware
too, e.g. if VF initialization blocks until the PF is fully initialized.
I think that's not the case for Mellanox, is it?
This is what I refer to: it would be nice to fix nested probing in general,
but it seems disabling the warning is the best we can do for 3.9 since
it causes false positives.
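The failure mode Michael describes, where the VF uses the same driver as the PF and that driver holds a global mutex across probe, can be sketched in plain Python (not kernel code; `probe_lock` and `probe` are hypothetical stand-ins for a driver-global mutex and the driver's probe path):

```python
import threading

probe_lock = threading.Lock()   # stand-in for a driver-global mutex

def probe(is_vf=False):
    # A real non-reentrant mutex would block forever on the nested
    # acquisition; a short timeout makes the deadlock observable.
    if not probe_lock.acquire(timeout=0.1):
        return "deadlock"
    try:
        if not is_vf:
            # PF probe enables SR-IOV -> nested VF probe, same driver,
            # while the lock is still held.
            return probe(is_vf=True)
        return "ok"
    finally:
        probe_lock.release()
```

A standalone VF probe succeeds (`probe(is_vf=True)` returns `"ok"`), but a PF probe that re-enters the same driver returns `"deadlock"`: the nested call blocks on a lock its own call chain holds, which is exactly the class of bug the lockdep warning exists to catch.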
Jack Morgenstein - April 18, 2013, 9:40 a.m.
On Thursday 18 April 2013 11:33, Michael S. Tsirkin wrote:
> But for pci_sriov_enable, the situation is actually very simple:
> VFs almost never use the same driver as the PF so the warning
> is bogus there.
> 
What about the case where the VF driver IS the same as the PF driver?
Jack Morgenstein - April 18, 2013, 9:57 a.m.
On Thursday 18 April 2013 11:48, Michael S. Tsirkin wrote:
> On Thu, Apr 18, 2013 at 12:40:09PM +0300, Jack Morgenstein wrote:
> > On Thursday 18 April 2013 11:33, Michael S. Tsirkin wrote:
> > > But for pci_sriov_enable, the situation is actually very simple:
> > > VFs almost never use the same driver as the PF so the warning
> > > is bogus there.
> > > 
> > What about the case where the VF driver IS the same as the PF driver?
> 
> Then it can deadlock, e.g. if the driver takes a global mutex.  But it's an
> internal driver issue then; you can trigger a deadlock through hardware
> too, e.g. if VF initialization blocks until PF is fully initialized.
> I think it's not the case for Mellanox, is it?

Correct, the Mellanox driver does not deadlock.

> This is what I refer to: would be nice to fix nested probing in general
> but it seems disabling the warning is the best we can do for 3.9 since
> it causes false positives.
> 
> 
Michael S. Tsirkin - April 18, 2013, 1:54 p.m.
On Thu, Apr 18, 2013 at 05:49:20PM +0300, Or Gerlitz wrote:
> On 18/04/2013 11:33, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 06:43:39AM -0700, Tejun Heo wrote:
> >>On Sun, Apr 14, 2013 at 03:58:55PM +0300, Or Gerlitz wrote:
> >>>So the patch eliminated the lockdep warning for mlx4 nested probing
> >>>sequence, but introduced lockdep warning for
> >>>00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC
> >>>Interrupt Controller (rev 22)
> >>Oops, the patch in itself doesn't really change anything.  The caller
> >>should use a different subclass for the nested invocation, just like
> >>spin_lock_nested() and friends.  Sorry about not being clear.
> >>Michael, can you please help?
> >>
> >>Thanks.
> >>
> >>-- 
> >>tejun
> >So like this on top. Tejun, you didn't add your S.O.B and patch
> >description, if this helps as we expect they will be needed.
> >
> >---->
> >
> >pci: use work_on_cpu_nested for nested SRIOV
> >
> >Since 3.9-rc1 the mlx4 driver started triggering a lockdep warning.
> >
> >The issue is that a driver, in its probe function, calls
> >pci_sriov_enable so a PF device probe causes VF probe (AKA nested
> >probe).  Each probe in pci_device_probe which is (normally) run through
> >work_on_cpu (this is to get the right numa node for memory allocated by
> >the driver).  In turn work_on_cpu does this internally:
> >
> >         schedule_work_on(cpu, &wfc.work);
> >         flush_work(&wfc.work);
> >
> >So if you are running probe on CPU1, and cause another
> >probe on the same CPU, this will try to flush
> >workqueue from inside same workqueue which triggers
> >a lockdep warning.
> >
> >Nested probing might be tricky to get right generally.
> >
> >But for pci_sriov_enable, the situation is actually very simple:
> >VFs almost never use the same driver as the PF so the warning
> >is bogus there.
> >
> >This is hardly elegant as it might shut up some real warnings if a buggy
> >driver actually probes itself in a nested way, but looks to me like an
> >appropriate quick fix for 3.9.
> >
> >Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> >---
> >diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> >index 1fa1e48..9c836ef 100644
> >--- a/drivers/pci/pci-driver.c
> >+++ b/drivers/pci/pci-driver.c
> >@@ -286,9 +286,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> >  		int cpu;
> >  		get_online_cpus();
> >-		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
> >-		if (cpu < nr_cpu_ids)
> >-			error = work_on_cpu(cpu, local_pci_probe, &ddi);
> >+		cpu = cpumask_first_and(cpumask_of_node(node), cpu_online_mask);
> >+		if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
> >+			error = work_on_cpu_nested(cpu, local_pci_probe, &ddi);
> 
> As you wrote to me later, SINGLE_DEPTH_NESTING is missing here as
> the last param to work_on_cpu_nested
> >  		else
> >  			error = local_pci_probe(&ddi);
> >  		put_online_cpus();
> 
> So now I used Tejun's patch and Michael's patch on top of net.git
> as of commit 2e0cbf2cc2c9371f0aa198857d799175ffe231a6
> ("net: mvmdio: add select PHYLIB") from April 13 -- and I still see
> this... so we're not there yet
> 
> =====================================
> [ BUG: bad unlock balance detected! ]
> 3.9.0-rc6+ #56 Not tainted
> -------------------------------------
> swapper/0/1 is trying to release lock ((&wfc.work)) at:
> [<ffffffff81220167>] pci_device_probe+0x117/0x120
> but there are no more locks to release!
> 
> other info that might help us debug this:
> 2 locks held by swapper/0/1:
>  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff812da443>]
> __driver_attach+0x53/0xb0
>  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff812da451>]
> __driver_attach+0x61/0xb0
> 
> stack backtrace:
> Pid: 1, comm: swapper/0 Not tainted 3.9.0-rc6+ #56
> Call Trace:
>  [<ffffffff81220167>] ? pci_device_probe+0x117/0x120
>  [<ffffffff81093529>] print_unlock_imbalance_bug+0xf9/0x100
>  [<ffffffff8109616f>] lock_set_class+0x27f/0x7c0
>  [<ffffffff81091d9e>] ? mark_held_locks+0x9e/0x130
>  [<ffffffff81220167>] ? pci_device_probe+0x117/0x120
>  [<ffffffff81066aeb>] work_on_cpu_nested+0x8b/0xc0
>  [<ffffffff810633c0>] ? keventd_up+0x20/0x20
>  [<ffffffff8121f420>] ? pci_pm_prepare+0x60/0x60
>  [<ffffffff81220167>] pci_device_probe+0x117/0x120
>  [<ffffffff812da0fa>] ? driver_sysfs_add+0x7a/0xb0
>  [<ffffffff812da24f>] driver_probe_device+0x8f/0x230
>  [<ffffffff812da493>] __driver_attach+0xa3/0xb0
>  [<ffffffff812da3f0>] ? driver_probe_device+0x230/0x230
>  [<ffffffff812da3f0>] ? driver_probe_device+0x230/0x230
>  [<ffffffff812d86fc>] bus_for_each_dev+0x8c/0xb0
>  [<ffffffff812da079>] driver_attach+0x19/0x20
>  [<ffffffff812d91a0>] bus_add_driver+0x1f0/0x250
>  [<ffffffff818bd596>] ? dmi_pcie_pme_disable_msi+0x21/0x21
>  [<ffffffff812daadf>] driver_register+0x6f/0x150
>  [<ffffffff818bd596>] ? dmi_pcie_pme_disable_msi+0x21/0x21
>  [<ffffffff8122026f>] __pci_register_driver+0x5f/0x70
>  [<ffffffff818bd5ff>] pcie_portdrv_init+0x69/0x7a
>  [<ffffffff810001fd>] do_one_initcall+0x3d/0x170
>  [<ffffffff81895943>] kernel_init_freeable+0x10d/0x19c
>  [<ffffffff818959d2>] ? kernel_init_freeable+0x19c/0x19c
>  [<ffffffff8145a040>] ? rest_init+0x160/0x160
>  [<ffffffff8145a049>] kernel_init+0x9/0xf0
>  [<ffffffff8146ca6c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8145a040>] ? rest_init+0x160/0x160
> ioapic: probe of 0000:00:13.0 failed with error -22
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5

Tejun, what do you say we use my patch for 3.9,
and revisit for 3.10.
The release is almost here.
If yes, please send your Ack.
Or Gerlitz - April 18, 2013, 2:49 p.m.
On 18/04/2013 11:33, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 06:43:39AM -0700, Tejun Heo wrote:
>> On Sun, Apr 14, 2013 at 03:58:55PM +0300, Or Gerlitz wrote:
>>> So the patch eliminated the lockdep warning for mlx4 nested probing
>>> sequence, but introduced lockdep warning for
>>> 00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC
>>> Interrupt Controller (rev 22)
>> Oops, the patch in itself doesn't really change anything.  The caller
>> should use a different subclass for the nested invocation, just like
>> spin_lock_nested() and friends.  Sorry about not being clear.
>> Michael, can you please help?
>>
>> Thanks.
>>
>> -- 
>> tejun
> So like this on top. Tejun, you didn't add your S.O.B and patch
> description, if this helps as we expect they will be needed.
>
> ---->
>
> pci: use work_on_cpu_nested for nested SRIOV
>
> Since 3.9-rc1 the mlx4 driver started triggering a lockdep warning.
>
> The issue is that a driver, in its probe function, calls
> pci_sriov_enable so a PF device probe causes VF probe (AKA nested
> probe).  Each probe in pci_device_probe which is (normally) run through
> work_on_cpu (this is to get the right numa node for memory allocated by
> the driver).  In turn work_on_cpu does this internally:
>
>          schedule_work_on(cpu, &wfc.work);
>          flush_work(&wfc.work);
>
> So if you are running probe on CPU1, and cause another
> probe on the same CPU, this will try to flush
> workqueue from inside same workqueue which triggers
> a lockdep warning.
>
> Nested probing might be tricky to get right generally.
>
> But for pci_sriov_enable, the situation is actually very simple:
> VFs almost never use the same driver as the PF so the warning
> is bogus there.
>
> This is hardly elegant as it might shut up some real warnings if a buggy
> driver actually probes itself in a nested way, but looks to me like an
> appropriate quick fix for 3.9.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> ---
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 1fa1e48..9c836ef 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -286,9 +286,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>   		int cpu;
>   
>   		get_online_cpus();
> -		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
> -		if (cpu < nr_cpu_ids)
> -			error = work_on_cpu(cpu, local_pci_probe, &ddi);
> +		cpu = cpumask_first_and(cpumask_of_node(node), cpu_online_mask);
> +		if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
> +			error = work_on_cpu_nested(cpu, local_pci_probe, &ddi);

As you wrote to me later, SINGLE_DEPTH_NESTING is missing here as the
last param to work_on_cpu_nested
>   		else
>   			error = local_pci_probe(&ddi);
>   		put_online_cpus();

So now I used Tejun's patch and Michael's patch on top of net.git as
of commit 2e0cbf2cc2c9371f0aa198857d799175ffe231a6
("net: mvmdio: add select PHYLIB") from April 13 -- and I still see
this... so we're not there yet

=====================================
[ BUG: bad unlock balance detected! ]
3.9.0-rc6+ #56 Not tainted
-------------------------------------
swapper/0/1 is trying to release lock ((&wfc.work)) at:
[<ffffffff81220167>] pci_device_probe+0x117/0x120
but there are no more locks to release!

other info that might help us debug this:
2 locks held by swapper/0/1:
  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff812da443>] 
__driver_attach+0x53/0xb0
  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff812da451>] 
__driver_attach+0x61/0xb0

stack backtrace:
Pid: 1, comm: swapper/0 Not tainted 3.9.0-rc6+ #56
Call Trace:
  [<ffffffff81220167>] ? pci_device_probe+0x117/0x120
  [<ffffffff81093529>] print_unlock_imbalance_bug+0xf9/0x100
  [<ffffffff8109616f>] lock_set_class+0x27f/0x7c0
  [<ffffffff81091d9e>] ? mark_held_locks+0x9e/0x130
  [<ffffffff81220167>] ? pci_device_probe+0x117/0x120
  [<ffffffff81066aeb>] work_on_cpu_nested+0x8b/0xc0
  [<ffffffff810633c0>] ? keventd_up+0x20/0x20
  [<ffffffff8121f420>] ? pci_pm_prepare+0x60/0x60
  [<ffffffff81220167>] pci_device_probe+0x117/0x120
  [<ffffffff812da0fa>] ? driver_sysfs_add+0x7a/0xb0
  [<ffffffff812da24f>] driver_probe_device+0x8f/0x230
  [<ffffffff812da493>] __driver_attach+0xa3/0xb0
  [<ffffffff812da3f0>] ? driver_probe_device+0x230/0x230
  [<ffffffff812da3f0>] ? driver_probe_device+0x230/0x230
  [<ffffffff812d86fc>] bus_for_each_dev+0x8c/0xb0
  [<ffffffff812da079>] driver_attach+0x19/0x20
  [<ffffffff812d91a0>] bus_add_driver+0x1f0/0x250
  [<ffffffff818bd596>] ? dmi_pcie_pme_disable_msi+0x21/0x21
  [<ffffffff812daadf>] driver_register+0x6f/0x150
  [<ffffffff818bd596>] ? dmi_pcie_pme_disable_msi+0x21/0x21
  [<ffffffff8122026f>] __pci_register_driver+0x5f/0x70
  [<ffffffff818bd5ff>] pcie_portdrv_init+0x69/0x7a
  [<ffffffff810001fd>] do_one_initcall+0x3d/0x170
  [<ffffffff81895943>] kernel_init_freeable+0x10d/0x19c
  [<ffffffff818959d2>] ? kernel_init_freeable+0x19c/0x19c
  [<ffffffff8145a040>] ? rest_init+0x160/0x160
  [<ffffffff8145a049>] kernel_init+0x9/0xf0
  [<ffffffff8146ca6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff8145a040>] ? rest_init+0x160/0x160
ioapic: probe of 0000:00:13.0 failed with error -22
pci_hotplug: PCI Hot Plug PCI Core version: 0.5

Tejun Heo - April 18, 2013, 6:19 p.m.
On Thu, Apr 18, 2013 at 04:54:58PM +0300, Michael S. Tsirkin wrote:
> Tejun, what do you say my patch is used for 3.9,
> and we can revisit for 3.10.
> The release is almost here.
> If yes please send your Ack.

Yeap, let's do that.

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Bjorn Helgaas - April 18, 2013, 6:25 p.m.
On Thu, Apr 18, 2013 at 12:19 PM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Apr 18, 2013 at 04:54:58PM +0300, Michael S. Tsirkin wrote:
>> Tejun, what do you say my patch is used for 3.9,
>> and we can revisit for 3.10.
>> The release is almost here.
>> If yes please send your Ack.
>
> Yeap, let's do that.
>
> Acked-by: Tejun Heo <tj@kernel.org>

Michael, can you post a new version with Tejun's ack?  IIRC, this was
in drivers/pci, but I haven't been following this and am not sure
exactly what you want applied.  Thanks.

Bjorn
Or Gerlitz - April 18, 2013, 6:41 p.m.
On Thu, Apr 18, 2013 at 4:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
[...]
> Tejun, what do you say my patch is used for 3.9, and we can revisit for 3.10.
> The release is almost here. If yes please send your Ack.

Michael,

I assume you mean pulling both Tejun's patch and your patch into 3.9,
correct? I wasn't sure what this really buys us... we got rid of the
false-positive lockdep warning which takes place during the nested
probe, and got another lockdep warning during the probe of the
interrupt controller

Or.
Michael S. Tsirkin - April 18, 2013, 8:03 p.m.
On Thu, Apr 18, 2013 at 09:41:31PM +0300, Or Gerlitz wrote:
> On Thu, Apr 18, 2013 at 4:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> [...]
> > Tejun, what do you say my patch is used for 3.9, and we can revisit for 3.10.
> > The release is almost here. If yes please send your Ack.
> 
> Michael,
> 
> I assume you mean pulling both Tejun's patch and your patch into 3.9,
> correct? I wasn't sure what this really buys us... we got rid of the
> false-positive lockdep warning which takes place during the nested
> probe, and got another lockdep warning during the probe of the
> interrupt controller
> 
> Or.

No I mean my original patch.
Michael S. Tsirkin - April 18, 2013, 8:11 p.m.
On Thu, Apr 18, 2013 at 12:25:59PM -0600, Bjorn Helgaas wrote:
> On Thu, Apr 18, 2013 at 12:19 PM, Tejun Heo <tj@kernel.org> wrote:
> > On Thu, Apr 18, 2013 at 04:54:58PM +0300, Michael S. Tsirkin wrote:
> >> Tejun, what do you say my patch is used for 3.9,
> >> and we can revisit for 3.10.
> >> The release is almost here.
> >> If yes please send your Ack.
> >
> > Yeap, let's do that.
> >
> > Acked-by: Tejun Heo <tj@kernel.org>
> 
> Michael, can you post a new version with Tejun's ack?  IIRC, this was
> in drivers/pci, but I haven't been following this and am not sure
> exactly what you want applied.  Thanks.
> 
> Bjorn

Done. Subject is:
[PATCHv2 for-3.9] pci: avoid work_on_cpu for nested SRIOV
It's the same patch with Tejun's ack and a minor
correction in the commit message.



Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 1fa1e48..9c836ef 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -286,9 +286,9 @@  static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 		int cpu;
 
 		get_online_cpus();
-		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
-		if (cpu < nr_cpu_ids)
-			error = work_on_cpu(cpu, local_pci_probe, &ddi);
+		cpu = cpumask_first_and(cpumask_of_node(node), cpu_online_mask);
+		if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
+			error = work_on_cpu_nested(cpu, local_pci_probe, &ddi);
 		else
 			error = local_pci_probe(&ddi);
 		put_online_cpus();