Patchwork Possible regression with cgroups in 3.11

Submitter Bjorn Helgaas
Date Nov. 16, 2013, 12:28 a.m.
Message ID <20131116002820.GA31073@google.com>
Permalink /patch/291728/
State Not Applicable

Comments

Bjorn Helgaas - Nov. 16, 2013, 12:28 a.m.
On Wed, Nov 13, 2013 at 04:38:06PM +0900, Tejun Heo wrote:
> Hey, guys.
> 
> cc'ing people from "workqueue, pci: INFO: possible recursive locking
> detected" thread.
> 
>   http://thread.gmane.org/gmane.linux.kernel/1525779
> 
> So, to resolve that issue, we ripped out lockdep annotation from
> work_on_cpu() and cgroup is now experiencing deadlock involving
> work_on_cpu().  It *could* be that workqueue is actually broken or
> memcg is looping but it doesn't seem like a very good idea to not have
> lockdep annotation around work_on_cpu().
> 
> IIRC, there was one pci code path which called work_on_cpu()
> recursively.  Would it be possible for that path to use something like
> work_on_cpu_nested(XXX, depth) so that we can retain lockdep
> annotation on work_on_cpu()?

I'm open to changing the way pci_call_probe() works, but my opinion is
that the PCI path that causes trouble is a broken design, and we shouldn't
complicate the work_on_cpu() interface just to accommodate that broken
design.

The problem is that when a PF .probe() method calls
pci_enable_sriov(), we add new VF devices and call *their* .probe()
methods before the PF .probe() method completes.  That is ugly and
error-prone.

When we call .probe() methods for the VFs, we're obviously already on the
correct node, because the VFs are on the same node as the PF, so I think
the best short-term fix is Alexander's patch to avoid work_on_cpu() when
we're already on the correct node -- something like the (untested) patch
below.
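
As a rough illustration of the check described above, here is a minimal
userspace sketch of the decision. It is not the kernel code:
probe_needs_cpu_switch() and SKETCH_NO_NODE are hypothetical names, and the
node numbers are passed in as plain integers instead of being read from
dev_to_node() and numa_node_id().

```c
#include <stdbool.h>

/* Hypothetical stand-in for "device has no NUMA node affinity". */
#define SKETCH_NO_NODE (-1)

/*
 * Return true when the probe would be handed off to another CPU,
 * i.e. the device has a known node and the caller is not already
 * running on that node. Otherwise the probe can be called directly.
 */
bool probe_needs_cpu_switch(int dev_node, int current_node)
{
	return dev_node >= 0 && dev_node != current_node;
}
```

Under these assumptions, a VF that sits on the same node as its PF never
takes the hand-off path, because the PF probe is already running there.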

Bjorn


PCI: Avoid unnecessary CPU switch when calling driver .probe() method

From: Bjorn Helgaas <bhelgaas@google.com>

If we are already on a CPU local to the device, call the driver .probe()
method directly without using work_on_cpu().

This is a workaround for a lockdep warning in the following scenario:

  pci_call_probe
    work_on_cpu(cpu, local_pci_probe, ...)
      driver .probe
        pci_enable_sriov
          ...
            pci_bus_add_device
              ...
                pci_call_probe
                  work_on_cpu(cpu, local_pci_probe, ...)

It would be better to fix PCI so we don't call VF driver .probe() methods
from inside a PF driver .probe() method, but that's a bigger project.
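
The nesting in the scenario above can be mimicked with a small userspace
model. This is only a sketch under stated assumptions: every sketch_* name
is hypothetical, the simulated work_on_cpu() just moves the caller to the
device's node and counts nesting depth, and the PF probe is assumed to
create exactly one VF on its own node.

```c
#include <stdbool.h>

struct sketch_state {
	int current_node;   /* node the simulated caller is running on */
	int depth;          /* current nesting of simulated work_on_cpu() */
	int max_depth;      /* deepest nesting observed so far */
	bool skip_if_local; /* apply the "already on this node" check */
};

static void sketch_call_probe(struct sketch_state *s, int dev_node, bool is_pf);

/* Simulated driver probe: a PF enables one VF on the same node. */
static void sketch_local_probe(struct sketch_state *s, int dev_node, bool is_pf)
{
	if (is_pf)
		sketch_call_probe(s, dev_node, false);
}

/* Simulated work_on_cpu(): run the probe "on" the device's node. */
static void sketch_work_on_cpu(struct sketch_state *s, int dev_node, bool is_pf)
{
	int saved_node = s->current_node;

	s->current_node = dev_node;
	s->depth++;
	if (s->depth > s->max_depth)
		s->max_depth = s->depth;
	sketch_local_probe(s, dev_node, is_pf);
	s->depth--;
	s->current_node = saved_node;
}

/* Simulated pci_call_probe(), with the node check made optional. */
static void sketch_call_probe(struct sketch_state *s, int dev_node, bool is_pf)
{
	if (dev_node >= 0 &&
	    (!s->skip_if_local || dev_node != s->current_node))
		sketch_work_on_cpu(s, dev_node, is_pf);
	else
		sketch_local_probe(s, dev_node, is_pf);
}

/* Probe a PF on node 1 from a caller on node 0; report max nesting. */
int sketch_max_nesting(bool skip_if_local)
{
	struct sketch_state s = {
		.current_node = 0,
		.skip_if_local = skip_if_local,
	};

	sketch_call_probe(&s, 1, true);
	return s.max_depth;
}
```

In this model the simulated work_on_cpu() nests two deep without the node
check and only one deep with it, which is the property the workaround
relies on.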

This patch is due to Alexander Duyck <alexander.h.duyck@intel.com>; I merely
added the preemption disable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci-driver.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Tejun Heo - Nov. 16, 2013, 4:53 a.m.
Hello, Bjorn.

On Fri, Nov 15, 2013 at 05:28:20PM -0700, Bjorn Helgaas wrote:
> It would be better to fix PCI so we don't call VF driver .probe() methods
> from inside a PF driver .probe() method, but that's a bigger project.

Yeah, if pci doesn't need the recursion, we can simply restore
the lockdep annotation on work_on_cpu().

> @@ -293,7 +293,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>  	   its local memory on the right node without any need to
>  	   change it. */
>  	node = dev_to_node(&dev->dev);
> -	if (node >= 0) {
> +	preempt_disable();
> +
> +	if (node >= 0 && node != numa_node_id()) {

A bit of comment here would be nice but yeah I think this should work.
Can you please also queue the revert of c2fda509667b ("workqueue:
allow work_on_cpu() to be called recursively") after this patch?
Please feel free to add my acked-by.

Thanks.

Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 454853507b7e..accae06aa79a 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -293,7 +293,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	   its local memory on the right node without any need to
 	   change it. */
 	node = dev_to_node(&dev->dev);
-	if (node >= 0) {
+	preempt_disable();
+
+	if (node >= 0 && node != numa_node_id()) {
 		int cpu;
 
 		get_online_cpus();
@@ -305,6 +307,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 		put_online_cpus();
 	} else
 		error = local_pci_probe(&ddi);
+
+	preempt_enable();
 	return error;
 }