From patchwork Sun Apr 22 17:15:24 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Bottomley
X-Patchwork-Id: 154296
X-Patchwork-Delegate: davem@davemloft.net
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by ozlabs.org (Postfix) with ESMTP id 52F06B6FA3
 for ; Mon, 23 Apr 2012 03:15:55 +1000 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1751580Ab2DVRPd (ORCPT ); Sun, 22 Apr 2012 13:15:33 -0400
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:50432
 "EHLO bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1751340Ab2DVRPc (ORCPT );
 Sun, 22 Apr 2012 13:15:32 -0400
Received: from localhost (localhost [127.0.0.1])
 by bedivere.hansenpartnership.com (Postfix) with ESMTP id 90C808EE12A;
 Sun, 22 Apr 2012 10:15:29 -0700 (PDT)
Received: from bedivere.hansenpartnership.com ([127.0.0.1])
 by localhost (bedivere.hansenpartnership.com [127.0.0.1])
 (amavisd-new, port 10024) with ESMTP id CPj91JzIKRI1;
 Sun, 22 Apr 2012 10:15:29 -0700 (PDT)
Received: from [192.168.1.224] (unknown [178.109.28.157])
 by bedivere.hansenpartnership.com (Postfix) with ESMTPSA id AC3968EE0F1;
 Sun, 22 Apr 2012 10:15:27 -0700 (PDT)
Message-ID: <1335114924.13208.27.camel@dabdike.lan>
Subject: Re: [PATCH 12/12] scsi_transport_sas: fix delete vs scan race
From: James Bottomley
To: Dan Williams
Cc: linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org
Date: Sun, 22 Apr 2012 18:15:24 +0100
In-Reply-To:
References: <20120413233343.8025.18101.stgit@dwillia2-linux.jf.intel.com>
 <20120413233752.8025.97983.stgit@dwillia2-linux.jf.intel.com>
 <1335091115.13208.14.camel@dabdike.lan>
X-Mailer: Evolution 3.2.1
Mime-Version: 1.0
Sender: linux-ide-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-ide@vger.kernel.org

On Sun, 2012-04-22 at 08:43 -0700, Dan Williams wrote:
> On Sun, Apr 22, 2012 at 3:38 AM, James Bottomley wrote:
> > On Fri, 2012-04-13 at 16:37 -0700, Dan Williams wrote:
> >> The following crash results from cases where the end_device has been
> >> removed before scsi_sysfs_add_sdev has had a chance to run.
> >>
> >>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
> >>  IP: [] sysfs_create_dir+0x32/0xb6
> >>  ...
> >>  Call Trace:
> >>   [] kobject_add_internal+0x120/0x1e3
> >>   [] ? trace_hardirqs_on+0xd/0xf
> >>   [] kobject_add_varg+0x41/0x50
> >>   [] kobject_add+0x64/0x66
> >>   [] device_add+0x12d/0x63a
> >>   [] ? _raw_spin_unlock_irqrestore+0x47/0x56
> >>   [] ? module_refcount+0x89/0xa0
> >>   [] scsi_sysfs_add_sdev+0x4e/0x28a
> >>   [] do_scan_async+0x9c/0x145
> >>
> >> ...teach sas_rphy_remove to wait for async scanning to quiesce before
> >> removing the end_device.  It seems this is a more general problem [1],
> >> but this patch only addresses sas transport.
> >>
> >> [1]: 23edb6e [SCSI] mpt2sas: Do not set sas_device->starget to NULL from
> >>      the slave_destroy callback when all the LUNS have been deleted
> >>
> >> Signed-off-by: Dan Williams
> >> ---
> >>  drivers/scsi/scsi_transport_sas.c |    6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
> >> index f7565fc..47abb90 100644
> >> --- a/drivers/scsi/scsi_transport_sas.c
> >> +++ b/drivers/scsi/scsi_transport_sas.c
> >> @@ -33,8 +33,9 @@
> >>  #include
> >>
> >>  #include
> >> -#include
> >>  #include
> >> +#include
> >> +#include
> >>  #include
> >>  #include
> >>
> >> @@ -1667,6 +1668,9 @@ sas_rphy_remove(struct sas_rphy *rphy)
> >>  {
> >>  	struct device *dev = &rphy->dev;
> >>
> >> +	/* prevent device_del() while child device_add() may be in-flight */
> >> +	scsi_complete_async_scans();
> >> +
> >>  	switch (rphy->identify.device_type) {
> >
> > This doesn't really fix the problem, it merely narrows the window (we
> > still crash in the much shorter window if another async scan starts
> > after you check for completion).
>
> Oh, I was under the impression that async scanning was only the
> initial scan and everything was sync thereafter since
> scsi_finish_async_scan() clears the host ->async_scan flag?

Async scan here means any scan in a different thread, right ... it just
has to be asynchronous relative to us?  So that includes the manually
initiated ones and hotplug ones, doesn't it?

> > Isn't the fix that will prevent all of
> > this to hold the scan mutex across scsi_remove_device() ... in which
> > case it should probably be part of scsi_remove_device()?
>
> I thought along these lines initially, but in this case we're crashing
> because the sas rphy is removed before the starget is added, so
> scsi_remove_device() is out of the picture.

Just adding the sequence

mutex_lock(&shost->scan_mutex);
mutex_unlock(&shost->scan_mutex);

is logically a subset of scsi_complete_async_scans()

So putting it here:

should definitely be equivalent to scsi_complete_async_scans() above the
switch statement.

The questions are a) should it be inside scsi_remove_target() because
that seems to be the sync point and b) does it fix all the races.

James

---
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index f7565fc..c89bba6 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -1669,7 +1669,9 @@ sas_rphy_remove(struct sas_rphy *rphy)
 
 	switch (rphy->identify.device_type) {
 	case SAS_END_DEVICE:
+		mutex_lock(&shost->scan_mutex);
 		scsi_remove_target(dev);
+		mutex_unlock(&shost->scan_mutex);
 		break;
 	case SAS_EDGE_EXPANDER_DEVICE:
 	case SAS_FANOUT_EXPANDER_DEVICE:
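
For illustration only, the difference between the two locking shapes
discussed in this thread (a bare lock/unlock pair used as a barrier, versus
holding the scan mutex across the removal) can be sketched in userspace
with pthreads.  This is not kernel code: struct toy_host and every toy_*
function are hypothetical stand-ins, and a single-threaded trylock probe
stands in for "another thread tries to start a scan now".  It only shows
when the mutex is free, so it says nothing about question b) above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the host state discussed above. */
struct toy_host {
	pthread_mutex_t scan_mutex;	/* plays the role of shost->scan_mutex */
	bool target_present;		/* plays the role of the end_device target */
};

/* Would a scan that tried to take scan_mutex right now get it? */
static bool toy_scan_could_start(struct toy_host *h)
{
	int rc = pthread_mutex_trylock(&h->scan_mutex);

	if (rc == 0)
		pthread_mutex_unlock(&h->scan_mutex);
	return rc == 0;
}

/*
 * Barrier shape: lock then unlock, and only then remove.  The pair waits
 * for a scan that already holds the mutex, but the mutex is free again by
 * the time the removal happens.  Returns whether a scan could have started
 * at the moment of removal.
 */
static bool toy_remove_after_barrier(struct toy_host *h)
{
	bool scan_possible;

	assert(h->target_present);
	pthread_mutex_lock(&h->scan_mutex);
	pthread_mutex_unlock(&h->scan_mutex);
	scan_possible = toy_scan_could_start(h);
	h->target_present = false;
	return scan_possible;
}

/*
 * Held shape: keep the mutex across the removal.  The probe fails because
 * the mutex is still locked (here by the caller itself; in the scenario
 * above the contender would be a scanning thread).
 */
static bool toy_remove_under_mutex(struct toy_host *h)
{
	bool scan_possible;

	assert(h->target_present);
	pthread_mutex_lock(&h->scan_mutex);
	scan_possible = toy_scan_could_start(h);
	h->target_present = false;
	pthread_mutex_unlock(&h->scan_mutex);
	return scan_possible;
}
```

In this toy model the barrier variant reports that a scan could still take
the mutex at removal time, while the held variant reports that it could
not.  Whether the same holds for every real scan path depends on which of
them actually take scan_mutex, which the sketch does not model.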