[{"id":1777163,"web_url":"http://patchwork.ozlabs.org/comment/1777163/","msgid":"<2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org>","list_archive_url":null,"date":"2017-09-28T16:47:45","subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","submitter":{"id":67496,"url":"http://patchwork.ozlabs.org/api/people/67496/","name":"Sinan Kaya","email":"okaya@codeaurora.org"},"content":"On 9/27/2017 5:42 PM, Govindarajulu Varadarajan wrote:\n> CPU0\t\t\t\t\tCPU1\n> ---------------------------------------------------------------------\n> __driver_attach()\n> device_lock(&dev->mutex) <--- device mutex lock here\n> driver_probe_device()\n> pci_enable_sriov()\n> pci_iov_add_virtfn()\n> pci_device_add()\n> \t\t\t\t\taer_isr()\t\t<--- pci aer error\n> \t\t\t\t\tdo_recovery()\n> \t\t\t\t\tbroadcast_error_message()\n> \t\t\t\t\tpci_walk_bus()\n> \t\t\t\t\tdown_read(&pci_bus_sem) <--- rd sem\n\nHow about releasing the device_lock here on CPU0? or in other words keep\ndevice_lock as short as possible?\n\n> down_write(&pci_bus_sem) <-- stuck on wr sem\n> \t\t\t\t\treport_error_detected()\n> \t\t\t\t\tdevice_lock(&dev->mutex)<--- DEAD LOCK","headers":{"Return-Path":"<linux-pci-owner@vger.kernel.org>","X-Original-To":"incoming@patchwork.ozlabs.org","Delivered-To":"patchwork-incoming@bilbo.ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=linux-pci-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (1024-bit key;\n\tunprotected) header.d=codeaurora.org header.i=@codeaurora.org\n\theader.b=\"jJpDVkvs\"; \n\tdkim=pass (1024-bit key) header.d=codeaurora.org\n\theader.i=@codeaurora.org header.b=\"HgvOnXKA\"; \n\tdkim-atps=neutral","pdx-caf-mail.web.codeaurora.org;\n\tdmarc=none (p=none dis=none)\n\theader.from=codeaurora.org","pdx-caf-mail.web.codeaurora.org;\n\tspf=none smtp.mailfrom=okaya@codeaurora.org"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y30x30PK6z9t3h\n\tfor <incoming@patchwork.ozlabs.org>;\n\tFri, 29 Sep 2017 02:47:51 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751813AbdI1Qrt (ORCPT <rfc822;incoming@patchwork.ozlabs.org>);\n\tThu, 28 Sep 2017 12:47:49 -0400","from smtp.codeaurora.org ([198.145.29.96]:49638 \"EHLO\n\tsmtp.codeaurora.org\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751595AbdI1Qrs (ORCPT\n\t<rfc822; linux-pci@vger.kernel.org>); Thu, 28 Sep 2017 12:47:48 -0400","by smtp.codeaurora.org (Postfix, from userid 1000)\n\tid 3718360A09; Thu, 28 Sep 2017 16:47:48 +0000 (UTC)","from [192.168.1.79] (104-182-54-152.lightspeed.rlghnc.sbcglobal.net\n\t[104.182.54.152])\n\t(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128\n\tbits)) (No client certificate requested)\n\t(Authenticated sender: okaya@smtp.codeaurora.org)\n\tby smtp.codeaurora.org (Postfix) with ESMTPSA id 9BA9E60227;\n\tThu, 28 Sep 2017 16:47:46 +0000 (UTC)"],"DKIM-Signature":["v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org;\n\ts=default; t=1506617268;\n\tbh=yZahSm/3E2YpY1qKMQ1+9DmkGzcs9yXWBhMkMLtetDk=;\n\th=Subject:To:References:From:Date:In-Reply-To:From;\n\tb=jJpDVkvsTNxC7UjCSxY43kMDRju6tFxxCbXjoJVMmy1l9iSe0vjn2NxGj8y2+5//r\n\tF/C4LWJwtsWnXX2fYGjCvBWDKhnlHtNYOJj/Kx2a4WWboEKxNttU6a3GH8Eua0A7AU\n\t3qzEfYtVFL0ZyREICveIyRhezxu0eqRbzmMRBFm0=","v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org;\n\ts=default; t=1506617267;\n\tbh=yZahSm/3E2YpY1qKMQ1+9DmkGzcs9yXWBhMkMLtetDk=;\n\th=Subject:To:References:From:Date:In-Reply-To:From;\n\tb=HgvOnXKA+KNZagcpVTuSQQi2arC91NJWx3m9hXBYz+OAfszAl2FASEIg/pahh9kTK\n\tYL4ch0xbjQFuX4eImy89ioXqgw5sbqI5ON/hoZvBu+FYCYa/Lg9VlsP3DvK7nFuRKI\n\tgLZnA8dDCVcymPLdO+agpQtat9nLxhNNTXL2kxV8="],"X-Spam-Checker-Version":"SpamAssassin 3.4.0 (2014-02-07) on\n\tpdx-caf-mail.web.codeaurora.org","X-Spam-Level":"","X-Spam-Status":"No, score=-2.8 required=2.0 tests=ALL_TRUSTED,BAYES_00,\n\tDKIM_SIGNED,\n\tT_DKIM_INVALID autolearn=no autolearn_force=no version=3.4.0","DMARC-Filter":"OpenDMARC Filter v1.3.2 smtp.codeaurora.org 9BA9E60227","Subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","To":"Govindarajulu Varadarajan <gvaradar@cisco.com>, benve@cisco.com,\n\tbhelgaas@google.com, linux-pci@vger.kernel.org,\n\tlinux-kernel@vger.kernel.org, jlbec@evilplan.org, hch@lst.de,\n\tmingo@redhat.com, peterz@infradead.org","References":"<20170927214220.41216-1-gvaradar@cisco.com>\n\t<20170927214220.41216-4-gvaradar@cisco.com>","From":"Sinan Kaya <okaya@codeaurora.org>","Message-ID":"<2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org>","Date":"Thu, 28 Sep 2017 12:47:45 -0400","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.3.0","MIME-Version":"1.0","In-Reply-To":"<20170927214220.41216-4-gvaradar@cisco.com>","Content-Type":"text/plain; charset=utf-8","Content-Language":"en-US","Content-Transfer-Encoding":"7bit","Sender":"linux-pci-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<linux-pci.vger.kernel.org>","X-Mailing-List":"linux-pci@vger.kernel.org"}},{"id":1777324,"web_url":"http://patchwork.ozlabs.org/comment/1777324/","msgid":"<alpine.LNX.2.20.1709281624330.24635@cae-iprp-alln-lb.cisco.com>","list_archive_url":null,"date":"2017-09-28T23:46:23","subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","submitter":{"id":46073,"url":"http://patchwork.ozlabs.org/api/people/46073/","name":"Govindarajulu Varadarajan","email":"gvaradar@cisco.com"},"content":"On Thu, 28 Sep 2017, Sinan Kaya wrote:\n\n> On 9/27/2017 5:42 PM, Govindarajulu Varadarajan wrote:\n>> CPU0\t\t\t\t\tCPU1\n>> ---------------------------------------------------------------------\n>> __driver_attach()\n>> device_lock(&dev->mutex) <--- device mutex lock here\n>> driver_probe_device()\n>> pci_enable_sriov()\n>> pci_iov_add_virtfn()\n>> pci_device_add()\n>> \t\t\t\t\taer_isr()\t\t<--- pci aer error\n>> \t\t\t\t\tdo_recovery()\n>> \t\t\t\t\tbroadcast_error_message()\n>> \t\t\t\t\tpci_walk_bus()\n>> \t\t\t\t\tdown_read(&pci_bus_sem) <--- rd sem\n>\n> How about releasing the device_lock here on CPU0?>\n\npci_device_add() is called by driver's pci probe function. device_lock(dev)\nshould be held before calling pci driver probe function.\n\n> or in other words keep device_lock as short as possible?\n\nThe problem is not the duration device_lock is held. It is the order two locks\nare aquired. We cannot control or implement a restriction that during\ndevice_lock() is held, driver probe should not call pci function which aquires\npci_bus_sem. And in case of pci aer, aer handler needs to call driver err_handler()\nfor which we need to hold device_lock() before calling err_handler(). In order\nto find all the devices on a pci bus, we should hold pci_bus_sem to do\npci_walk_bus().\n\n>> down_write(&pci_bus_sem) <-- stuck on wr sem\n>> \t\t\t\t\treport_error_detected()\n>> \t\t\t\t\tdevice_lock(&dev->mutex)<--- DEAD LOCK","headers":{"Return-Path":"<linux-pci-owner@vger.kernel.org>","X-Original-To":"incoming@patchwork.ozlabs.org","Delivered-To":"patchwork-incoming@bilbo.ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=linux-pci-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (1024-bit key;\n\tunprotected) header.d=cisco.com header.i=@cisco.com\n\theader.b=\"fIRsf0cX\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y3BDX1lLvz9t3B\n\tfor <incoming@patchwork.ozlabs.org>;\n\tFri, 29 Sep 2017 09:46:52 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751058AbdI1Xqr (ORCPT <rfc822;incoming@patchwork.ozlabs.org>);\n\tThu, 28 Sep 2017 19:46:47 -0400","from alln-iport-2.cisco.com ([173.37.142.89]:28664 \"EHLO\n\talln-iport-2.cisco.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750857AbdI1Xqq (ORCPT\n\t<rfc822; linux-pci@vger.kernel.org>); Thu, 28 Sep 2017 19:46:46 -0400","from rcdn-core-4.cisco.com ([173.37.93.155])\n\tby alln-iport-2.cisco.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;\n\t28 Sep 2017 23:46:30 +0000","from XCH-RCD-012.cisco.com (xch-rcd-012.cisco.com [173.37.102.22])\n\tby rcdn-core-4.cisco.com (8.14.5/8.14.5) with ESMTP id\n\tv8SNkUDj028103\n\t(version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=FAIL);\n\tThu, 28 Sep 2017 23:46:30 GMT","from cisco (10.157.132.141) by XCH-RCD-012.cisco.com\n\t(173.37.102.22) with Microsoft SMTP Server (TLS) id 15.0.1320.4;\n\tThu, 28 Sep 2017 18:46:29 -0500"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/simple;\n\td=cisco.com; i=@cisco.com; l=1407; q=dns/txt; s=iport;\n\tt=1506642406; x=1507852006;\n\th=date:from:to:cc:subject:in-reply-to:message-id:\n\treferences:mime-version;\n\tbh=fss5JWdjlndyGsh+HwEVVjLv9wXxsPF4v+Twet6Wcdg=;\n\tb=fIRsf0cX1Fi252nKphMOmfDTuaJ4t7Me5uxisQsX2TUrO1QaQ6KwYCHf\n\tqGh/DWhbgX4wYwo1IEncWqx0fWSZe0Dy2sC/O/tFpuP8ACMB0DzJnsJYW\n\tvBZRWp5WSxjrVjci4VRj5+3KI0xI8NiRR8EaZfI1RGFYSzXpjSu5Z/uQw 4=;","X-IronPort-Anti-Spam-Filtered":"true","X-IronPort-Anti-Spam-Result":"A0CoAQChic1Z/5tdJa1dGQEBAQEBAQEBAQEBBwEBAQEBg1yBUi6dcIFUIpg9CoU7AoQlVwECAQEBAQECayiFGAEBAQECAScRAj8FCwsOCi48GwYOiikFCKkFOotDAQEBAQEBAQEBAQEBAQEBAQEBAR+DK4ICgVGCIIJyincFihKIOo5cizucK0iUWAIRGQGBOVeBDngViAaHewGBDwEBAQ","X-IronPort-AV":"E=Sophos;i=\"5.42,451,1500940800\"; d=\"scan'208\";a=\"10194185\"","Date":"Thu, 28 Sep 2017 16:46:23 -0700","From":"Govindarajulu Varadarajan <gvaradar@cisco.com>","To":"Sinan Kaya <okaya@codeaurora.org>","CC":"<benve@cisco.com>, <bhelgaas@google.com>,\n\t<linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org>,\n\t<jlbec@evilplan.org>, <hch@lst.de>, <mingo@redhat.com>,\n\t<peterz@infradead.org>","Subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","In-Reply-To":"<2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org>","Message-ID":"<alpine.LNX.2.20.1709281624330.24635@cae-iprp-alln-lb.cisco.com>","References":"<20170927214220.41216-1-gvaradar@cisco.com>\n\t<20170927214220.41216-4-gvaradar@cisco.com>\n\t<2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org>","User-Agent":"Alpine 2.20 (LNX 67 2015-01-07)","MIME-Version":"1.0","Content-Type":"text/plain; charset=\"US-ASCII\"; format=flowed","X-Originating-IP":"[10.157.132.141]","X-ClientProxiedBy":"xch-rcd-008.cisco.com (173.37.102.18) To\n\tXCH-RCD-012.cisco.com (173.37.102.22)","Sender":"linux-pci-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<linux-pci.vger.kernel.org>","X-Mailing-List":"linux-pci@vger.kernel.org"}},{"id":1777543,"web_url":"http://patchwork.ozlabs.org/comment/1777543/","msgid":"<0a2a41c5-2872-fdb6-8ad2-97b0b6dc69b1@codeaurora.org>","list_archive_url":null,"date":"2017-09-29T13:32:42","subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","submitter":{"id":67496,"url":"http://patchwork.ozlabs.org/api/people/67496/","name":"Sinan Kaya","email":"okaya@codeaurora.org"},"content":"On 9/28/2017 7:46 PM, Govindarajulu Varadarajan wrote:\n>> How about releasing the device_lock here on CPU0?>\n> \n> pci_device_add() is called by driver's pci probe function. device_lock(dev)\n> should be held before calling pci driver probe function.\n\nI see. The goal of the lock held here is to protect probe() operation from\nbeing disrupted. I also don't think we can change this. \n\n> \n>> or in other words keep device_lock as short as possible?\n> \n> The problem is not the duration device_lock is held. It is the order two locks\n> are aquired. We cannot control or implement a restriction that during\n> device_lock() is held, driver probe should not call pci function which aquires\n> pci_bus_sem. And in case of pci aer, aer handler needs to call driver err_handler()\n> for which we need to hold device_lock() before calling err_handler(). In order\n> to find all the devices on a pci bus, we should hold pci_bus_sem to do\n> pci_walk_bus().\n\nI was reacting to this to see if there is a better way to do this.\n\n\"Only fix I could think of is to lock &pci_bus_sem and try locking all\ndevice->mutex under that pci_bus. If it fails, unlock all device->mutex\nand &pci_bus_sem and try again.\"\n\nHow about gracefully returning from report_error_detected() when we cannot obtain\nthe device_lock() by replacing it with device_trylock()? \n\naer_pci_walk_bus() can still poll like you did until it gets the lock. At least,\nwe don't get to introduce a new API, new lock semantics and code refactoring.\n__pci_bus_trylock() looked very powerful and also dangerously flexible to\nintroduce new bugs to me.\n\nFor instance, you called it like this.\n\n+\t\tdown_read(&pci_bus_sem);\n+\t\tlocked = __pci_bus_trylock(bus, pci_device_trylock,\n+\t\t\t\t\t   pci_device_unlock);\n\npci_bus_trylock() would obtain device + cfg locks whereas pci_device_trylock() only\nobtains the device lock. Can it race against cfg lock? It depends on the caller.\nVery subtle difference.","headers":{"Return-Path":"<linux-pci-owner@vger.kernel.org>","X-Original-To":"incoming@patchwork.ozlabs.org","Delivered-To":"patchwork-incoming@bilbo.ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=linux-pci-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (1024-bit key;\n\tunprotected) header.d=codeaurora.org header.i=@codeaurora.org\n\theader.b=\"G+HMo2Mm\"; \n\tdkim=pass (1024-bit key) header.d=codeaurora.org\n\theader.i=@codeaurora.org header.b=\"G+HMo2Mm\"; \n\tdkim-atps=neutral","pdx-caf-mail.web.codeaurora.org;\n\tdmarc=none (p=none dis=none)\n\theader.from=codeaurora.org","pdx-caf-mail.web.codeaurora.org;\n\tspf=none smtp.mailfrom=okaya@codeaurora.org"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y3XYX2QPnz9rxj\n\tfor <incoming@patchwork.ozlabs.org>;\n\tFri, 29 Sep 2017 23:32:48 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751346AbdI2Ncq (ORCPT <rfc822;incoming@patchwork.ozlabs.org>);\n\tFri, 29 Sep 2017 09:32:46 -0400","from smtp.codeaurora.org ([198.145.29.96]:48574 \"EHLO\n\tsmtp.codeaurora.org\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751176AbdI2Ncp (ORCPT\n\t<rfc822; linux-pci@vger.kernel.org>); Fri, 29 Sep 2017 09:32:45 -0400","by smtp.codeaurora.org (Postfix, from userid 1000)\n\tid E34F36099A; Fri, 29 Sep 2017 13:32:44 +0000 (UTC)","from [10.235.228.98] (global_nat1_iad_fw.qualcomm.com\n\t[129.46.232.65])\n\t(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128\n\tbits)) (No client certificate requested)\n\t(Authenticated sender: okaya@smtp.codeaurora.org)\n\tby smtp.codeaurora.org (Postfix) with ESMTPSA id 53F2E60227;\n\tFri, 29 Sep 2017 13:32:43 +0000 (UTC)"],"DKIM-Signature":["v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org;\n\ts=default; t=1506691964;\n\tbh=Mq23W+xVq+ngyU7oCW6mS6Y3fUKM9LKN8DW3+T6moto=;\n\th=Subject:To:Cc:References:From:Date:In-Reply-To:From;\n\tb=G+HMo2Mmp7LIi4ZF/rPG1yE49Abhx+zGd6l1Ykc2SdNsVs7jzl9lvPuScRuRCds79\n\t+TFS2jPsRxFGdiLRlb2g25Z7deu3CkE9mZMHchf+vy8g1uFDk/HbB9DOultKGZtgWg\n\txPmBj2lzSqAQ8Prhf6NyjJvMhfLgVAPLWSFu3uho=","v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org;\n\ts=default; t=1506691964;\n\tbh=Mq23W+xVq+ngyU7oCW6mS6Y3fUKM9LKN8DW3+T6moto=;\n\th=Subject:To:Cc:References:From:Date:In-Reply-To:From;\n\tb=G+HMo2Mmp7LIi4ZF/rPG1yE49Abhx+zGd6l1Ykc2SdNsVs7jzl9lvPuScRuRCds79\n\t+TFS2jPsRxFGdiLRlb2g25Z7deu3CkE9mZMHchf+vy8g1uFDk/HbB9DOultKGZtgWg\n\txPmBj2lzSqAQ8Prhf6NyjJvMhfLgVAPLWSFu3uho="],"X-Spam-Checker-Version":"SpamAssassin 3.4.0 (2014-02-07) on\n\tpdx-caf-mail.web.codeaurora.org","X-Spam-Level":"","X-Spam-Status":"No, score=-2.8 required=2.0 tests=ALL_TRUSTED,BAYES_00,\n\tDKIM_SIGNED,\n\tT_DKIM_INVALID autolearn=no autolearn_force=no version=3.4.0","DMARC-Filter":"OpenDMARC Filter v1.3.2 smtp.codeaurora.org 53F2E60227","Subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","To":"Govindarajulu Varadarajan <gvaradar@cisco.com>","Cc":"benve@cisco.com, bhelgaas@google.com, linux-pci@vger.kernel.org,\n\tlinux-kernel@vger.kernel.org, jlbec@evilplan.org, hch@lst.de,\n\tmingo@redhat.com, peterz@infradead.org","References":"<20170927214220.41216-1-gvaradar@cisco.com>\n\t<20170927214220.41216-4-gvaradar@cisco.com>\n\t<2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org>\n\t<alpine.LNX.2.20.1709281624330.24635@cae-iprp-alln-lb.cisco.com>","From":"Sinan Kaya <okaya@codeaurora.org>","Message-ID":"<0a2a41c5-2872-fdb6-8ad2-97b0b6dc69b1@codeaurora.org>","Date":"Fri, 29 Sep 2017 09:32:42 -0400","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.3.0","MIME-Version":"1.0","In-Reply-To":"<alpine.LNX.2.20.1709281624330.24635@cae-iprp-alln-lb.cisco.com>","Content-Type":"text/plain; charset=utf-8","Content-Language":"en-US","Content-Transfer-Encoding":"7bit","Sender":"linux-pci-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<linux-pci.vger.kernel.org>","X-Mailing-List":"linux-pci@vger.kernel.org"}},{"id":1777873,"web_url":"http://patchwork.ozlabs.org/comment/1777873/","msgid":"<alpine.LNX.2.20.1709292250520.24635@cae-iprp-alln-lb.cisco.com>","list_archive_url":null,"date":"2017-09-30T06:00:39","subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","submitter":{"id":46073,"url":"http://patchwork.ozlabs.org/api/people/46073/","name":"Govindarajulu Varadarajan","email":"gvaradar@cisco.com"},"content":"On Fri, 29 Sep 2017, Sinan Kaya wrote:\n\n> On 9/28/2017 7:46 PM, Govindarajulu Varadarajan wrote:\n>>> How about releasing the device_lock here on CPU0?>\n>>\n>> pci_device_add() is called by driver's pci probe function. device_lock(dev)\n>> should be held before calling pci driver probe function.\n>\n> I see. The goal of the lock held here is to protect probe() operation from\n> being disrupted. I also don't think we can change this.\n>\n>>\n>>> or in other words keep device_lock as short as possible?\n>>\n>> The problem is not the duration device_lock is held. It is the order two locks\n>> are aquired. We cannot control or implement a restriction that during\n>> device_lock() is held, driver probe should not call pci function which aquires\n>> pci_bus_sem. And in case of pci aer, aer handler needs to call driver err_handler()\n>> for which we need to hold device_lock() before calling err_handler(). In order\n>> to find all the devices on a pci bus, we should hold pci_bus_sem to do\n>> pci_walk_bus().\n>\n> I was reacting to this to see if there is a better way to do this.\n>\n> \"Only fix I could think of is to lock &pci_bus_sem and try locking all\n> device->mutex under that pci_bus. If it fails, unlock all device->mutex\n> and &pci_bus_sem and try again.\"\n>\n> How about gracefully returning from report_error_detected() when we cannot obtain\n> the device_lock() by replacing it with device_trylock()?\n>\n\nSome of the devices may miss the error reporthing. I have sent V2 where we do\na pci_bus_walk and adds all the devices to a list. After unlocking (up_read)\n&pci_bus_sem, we go through the list and call err_handler of the devices with\ndevic_lock() held. This way, we dont try to hold both locks at same time and\nwe dont hold 50+ device_lock. Let me know if this approach is better.\n\n> aer_pci_walk_bus() can still poll like you did until it gets the lock. At least,\n> we don't get to introduce a new API, new lock semantics and code refactoring.\n> __pci_bus_trylock() looked very powerful and also dangerously flexible to\n> introduce new bugs to me.\n>\n> For instance, you called it like this.\n>\n> +\t\tdown_read(&pci_bus_sem);\n> +\t\tlocked = __pci_bus_trylock(bus, pci_device_trylock,\n> +\t\t\t\t\t   pci_device_unlock);\n>\n> pci_bus_trylock() would obtain device + cfg locks whereas pci_device_trylock() only\n> obtains the device lock. Can it race against cfg lock? It depends on the caller.\n> Very subtle difference.\n>\n> -- \n> Sinan Kaya\n> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.\n> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.\n>","headers":{"Return-Path":"<linux-pci-owner@vger.kernel.org>","X-Original-To":"incoming@patchwork.ozlabs.org","Delivered-To":"patchwork-incoming@bilbo.ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=linux-pci-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (1024-bit key;\n\tunprotected) header.d=cisco.com header.i=@cisco.com\n\theader.b=\"Evh3GfKk\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y3yTb13hxz9t4r\n\tfor <incoming@patchwork.ozlabs.org>;\n\tSat, 30 Sep 2017 16:00:51 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751160AbdI3GAs (ORCPT <rfc822;incoming@patchwork.ozlabs.org>);\n\tSat, 30 Sep 2017 02:00:48 -0400","from alln-iport-4.cisco.com ([173.37.142.91]:19850 \"EHLO\n\talln-iport-4.cisco.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751099AbdI3GAr (ORCPT\n\t<rfc822; linux-pci@vger.kernel.org>); Sat, 30 Sep 2017 02:00:47 -0400","from rcdn-core-11.cisco.com ([173.37.93.147])\n\tby alln-iport-4.cisco.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;\n\t30 Sep 2017 06:00:46 +0000","from XCH-RCD-012.cisco.com (xch-rcd-012.cisco.com [173.37.102.22])\n\tby rcdn-core-11.cisco.com (8.14.5/8.14.5) with ESMTP id\n\tv8U60kvZ009000\n\t(version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=FAIL);\n\tSat, 30 Sep 2017 06:00:46 GMT","from cisco (10.24.76.29) by XCH-RCD-012.cisco.com (173.37.102.22)\n\twith Microsoft SMTP Server (TLS) id 15.0.1320.4;\n\tSat, 30 Sep 2017 01:00:45 -0500"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/simple;\n\td=cisco.com; i=@cisco.com; l=2689; q=dns/txt; s=iport;\n\tt=1506751247; x=1507960847;\n\th=date:from:to:cc:subject:in-reply-to:message-id:\n\treferences:mime-version;\n\tbh=zde6ZvDcxdN6ntDNq5OMUMkNdm/u/Wd8gfj8w9XpjZI=;\n\tb=Evh3GfKkpNwUFh29v0LZf3T1Koo35RI0iLhoj53Y5eLFAVWKxgbI9mdS\n\tpeERJcZ/BcD7ew7J2X+reFEvZzPOpqafCq7fha27SB19eC9Wy2wQzyjX6\n\twXLkRQElT/Di0S5MXQ2nOkPMkaMKh5GYzWMOOXI8gJoT8I4XTVMN+Ppnq k=;","X-IronPort-Anti-Spam-Filtered":"true","X-IronPort-Anti-Spam-Result":"A0C9AgAVMs9Z/5NdJa1UCBkBAQEBAQEBAQEBAQcBAQEBAYNcgVIunXWBVCKYPgqFOwKEM1cBAgEBAQEBAmsohRgBAQEBAgEnEQI/BQsLDgouPBsGDoopBQioCzqLQwEBAQEBAQEBAQEBAQEBAQEBAQEfgy2CAoFRgiCCcoRfhhgFkk+ODlKLPJwvSJRbAhEZAYE5V4EOeBVJhRocggeJHQGBDwEBAQ","X-IronPort-AV":"E=Sophos;i=\"5.42,456,1500940800\"; d=\"scan'208\";a=\"10730268\"","Date":"Fri, 29 Sep 2017 23:00:39 -0700","From":"Govindarajulu Varadarajan <gvaradar@cisco.com>","To":"Sinan Kaya <okaya@codeaurora.org>","CC":"<benve@cisco.com>, <bhelgaas@google.com>,\n\t<linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org>,\n\t<jlbec@evilplan.org>, <hch@lst.de>, <mingo@redhat.com>,\n\t<peterz@infradead.org>","Subject":"Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery","In-Reply-To":"<0a2a41c5-2872-fdb6-8ad2-97b0b6dc69b1@codeaurora.org>","Message-ID":"<alpine.LNX.2.20.1709292250520.24635@cae-iprp-alln-lb.cisco.com>","References":"<20170927214220.41216-1-gvaradar@cisco.com>\n\t<20170927214220.41216-4-gvaradar@cisco.com>\n\t<2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org>\n\t<alpine.LNX.2.20.1709281624330.24635@cae-iprp-alln-lb.cisco.com>\n\t<0a2a41c5-2872-fdb6-8ad2-97b0b6dc69b1@codeaurora.org>","User-Agent":"Alpine 2.20 (LNX 67 2015-01-07)","MIME-Version":"1.0","Content-Type":"text/plain; charset=\"US-ASCII\"; format=flowed","X-Originating-IP":"[10.24.76.29]","X-ClientProxiedBy":"xch-rcd-020.cisco.com (173.37.102.30) To\n\tXCH-RCD-012.cisco.com (173.37.102.22)","Sender":"linux-pci-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<linux-pci.vger.kernel.org>","X-Mailing-List":"linux-pci@vger.kernel.org"}}]