From patchwork Thu Apr 18 20:08:55 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Michael S. Tsirkin" X-Patchwork-Id: 237735 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 70EC22C0205 for ; Fri, 19 Apr 2013 07:07:55 +1000 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754200Ab3DRVHy (ORCPT ); Thu, 18 Apr 2013 17:07:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]:48902 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751823Ab3DRVHx (ORCPT ); Thu, 18 Apr 2013 17:07:53 -0400 Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r3IL7epq008590 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Thu, 18 Apr 2013 17:07:40 -0400 Received: from redhat.com (vpn1-4-42.ams2.redhat.com [10.36.4.42]) by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with SMTP id r3IL7YpY006442; Thu, 18 Apr 2013 17:07:35 -0400 Date: Thu, 18 Apr 2013 23:08:55 +0300 From: "Michael S. Tsirkin" To: Or Gerlitz Cc: Ming Lei , Greg Kroah-Hartman , David Miller , Roland Dreier , netdev , Yan Burman , Jack Morgenstein , Tejun Heo , Bjorn Helgaas , linux-pci@vger.kernel.org Subject: [PATCHv2 for-3.9] pci: avoid work_on_cpu for nested SRIOV probes Message-ID: <20130418200855.GA24086@redhat.com> References: <51360D7C.3060209@mellanox.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <51360D7C.3060209@mellanox.com> X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12 Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org The following lockdep report triggers since 3.9-rc1: --- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ============================================= [ INFO: possible recursive locking detected ] 3.9.0-rc1 #96 Not tainted --------------------------------------------- kworker/0:1/734 is trying to acquire lock: ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0x250 but task is already holding lock: ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x162/0x4c0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock((&wfc.work)); lock((&wfc.work)); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/0:1/734: #0: (events){.+.+.+}, at: [] process_one_work+0x162/0x4c0 #1: ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x162/0x4c0 #2: (&__lockdep_no_validate__){......}, at: [] device_attach+0x25/0xb0 stack backtrace: Pid: 734, comm: kworker/0:1 Not tainted 3.9.0-rc1 #96 Call Trace: [] validate_chain+0xdcc/0x11f0 [] __lock_acquire+0x440/0xc70 [] ? __lock_acquire+0x440/0xc70 [] lock_acquire+0x5a/0x70 [] ? wq_worker_waking_up+0x60/0x60 [] flush_work+0x45/0x250 [] ? wq_worker_waking_up+0x60/0x60 [] ? mark_held_locks+0x9e/0x130 [] ? queue_work_on+0x46/0x90 [] ? trace_hardirqs_on_caller+0xfd/0x190 [] ? trace_hardirqs_on+0xd/0x10 [] work_on_cpu+0x74/0x90 [] ? keventd_up+0x20/0x20 [] ? pci_pm_prepare+0x60/0x60 [] ? cpumask_next_and+0x23/0x40 [] pci_device_probe+0xba/0x110 [] ? driver_sysfs_add+0x7a/0xb0 [] driver_probe_device+0x8f/0x230 [] ? __driver_attach+0xb0/0xb0 [] __device_attach+0x4b/0x60 [] bus_for_each_drv+0x64/0x90 [] device_attach+0x98/0xb0 [] pci_bus_add_device+0x24/0x50 [] virtfn_add+0x240/0x3e0 [] ? _raw_spin_unlock_irqrestore+0x3d/0x80 [] pci_enable_sriov+0x23e/0x500 [] __mlx4_init_one+0x5da/0xce0 [mlx4_core] [] mlx4_init_one+0x2d/0x60 [mlx4_core] [] local_pci_probe+0x49/0x80 [] work_for_cpu_fn+0x13/0x20 [] process_one_work+0x1c8/0x4c0 [] ? process_one_work+0x162/0x4c0 [] worker_thread+0x30b/0x430 [] ? manage_workers+0x340/0x340 [] kthread+0xd6/0xe0 [] ? __init_kthread_worker+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? __init_kthread_worker+0x70/0x70 The issue is that a driver, in it's probe function, calls pci_sriov_enable so a PF device probe causes VF probe (AKA nested probe). Each probe in pci_device_probe which is (normally) run through work_on_cpu (this is to get the right numa node for memory allocated by the driver). In turn work_on_cpu does this internally: schedule_work_on(cpu, &wfc.work); flush_work(&wfc.work); So if you are running probe on CPU1, and cause another probe on the same CPU, this will try to flush workqueue from inside same workqueue which causes a lockep warning. Nested probing might be tricky to get right generally. But for pci_sriov_enable, the situation is actually very simple: all VFs naturally have same affinity as the PF, and cpumask_any_and is actually same as cpumask_first_and, so it always gives us the same CPU. So let's just detect that, and run the probing for VFs locally without a workqueue. This is hardly elegant, but looks to me like an appropriate quick fix for 3.9. Tested-by: Or Gerlitz Signed-off-by: Michael S. Tsirkin Acked-by: Tejun Heo --- Changes from v1: - clarified commit log and added Ack by Tejun Heo patch is unchanged. diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 1fa1e48..6eeb5ec 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -286,8 +286,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, int cpu; get_online_cpus(); cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask); - if (cpu < nr_cpu_ids) + if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids) error = work_on_cpu(cpu, local_pci_probe, &ddi); else error = local_pci_probe(&ddi);