From patchwork Sat Jul 10 01:24:08 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mark Lord <kernel@teksavvy.com>
X-Patchwork-Id: 58448
X-Patchwork-Delegate: davem@davemloft.net
Return-Path: <linux-ide-owner@vger.kernel.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 040F7B6F06
	for <incoming@patchwork.ozlabs.org>;
	Sat, 10 Jul 2010 11:24:18 +1000 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751614Ab0GJBYM (ORCPT <rfc822;incoming@patchwork.ozlabs.org>);
	Fri, 9 Jul 2010 21:24:12 -0400
Received: from ironport2-out.teksavvy.com ([206.248.154.181]:39128 "EHLO
	ironport2-out.pppoe.ca" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
	with ESMTP id S1751425Ab0GJBYL (ORCPT
	<rfc822;linux-ide@vger.kernel.org>); Fri, 9 Jul 2010 21:24:11 -0400
Received: from rtr.ca (HELO [10.0.0.6]) ([75.119.251.23])
	by ironport2-out.pppoe.ca with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
	09 Jul 2010 21:24:10 -0400
Message-ID: <4C37CBB8.1040909@teksavvy.com>
Date: Fri, 09 Jul 2010 21:24:08 -0400
From: Mark Lord <kernel@teksavvy.com>
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB;
	rv:1.9.1.10) Gecko/20100512 Thunderbird/3.0.5
MIME-Version: 1.0
To: Greg Freemyer <greg.freemyer@gmail.com>
CC: IDE/ATA development list <linux-ide@vger.kernel.org>,
	Mark Lord <liml@rtr.ca>
Subject: Re: If I have a single bad sector, how many failed reads should
	simple dd report?
References: <AANLkTil2Zfhpkq24qrCoHZVx1vlbIPTcDMrZ508yODta@mail.gmail.com>
	<AANLkTimWfp4mLF9DCpxtnnUQsYQ08gXmh491EfhARe1x@mail.gmail.com>
	<4C37CA99.1040104@teksavvy.com>
In-Reply-To: <4C37CA99.1040104@teksavvy.com>
Sender: linux-ide-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-ide.vger.kernel.org>
X-Mailing-List: linux-ide@vger.kernel.org

On 09/07/10 09:19 PM, Mark Lord wrote:
> On 09/07/10 03:04 PM, Greg Freemyer wrote:
> ..
>>> When I re-ran it, /var/log/messages reported 10 bad logical blocks.
>>> And even worse, dd reported 20 bad blocks. I examined the data dd
>>> read and it had 80KB of zero'ed out data. So that's 160 sectors worth
>>> of data lost because of a single bad sector. At most I was expecting
>>> 4KB of zero'ed out data.
> ..
>
> That's just the standard, undesirable result of the current SCSI EH
> when used with libata for (mainly) desktop computers.
>
> I have patches (against older kernels) to fix it, but have yet to
> get both myself and James B. interested enough simultaneously to
> actually get the kernel fixed. :)
..

Here (attached, and inline below) are my most recent patches for this.
They are still outdated, though; these are against the SLES11 2.6.27.19 kernel:

-----------------------------snip----------------------------

Stop the SCSI EH from performing excessive retries on unrecoverable medium errors,
so that error handling fails the command more quickly and we (EMC) avoid unneeded node resets.

The ugliness of this patch matches the ugliness of SCSI EH.
Does *anyone* actually understand this code completely?

Signed-off-by: Mark Lord <mlord@pobox.com>

-----------------------------snip----------------------------

sles11:
On encountering a bad sector, report and skip over it,
then continue with the remainder of the request.
Otherwise we would fail perfectly good sectors,
making a bad situation even worse.

Signed-off-by: Mark Lord <mlord@pobox.com>

--- old/drivers/scsi/scsi_lib.c	2009-06-04 12:26:52.000000000 -0400
+++ linux/drivers/scsi/scsi_lib.c	2009-06-04 14:40:11.000000000 -0400
@@ -952,6 +952,13 @@
 	 */
 	if (sense_valid && !sense_deferred) {
 		switch (sshdr.sense_key) {
+		case MEDIUM_ERROR:
+			/* Bad sector: fail it, then continue with the rest of the request. */
+			if (this_count && scsi_end_request(cmd, -EIO, cmd->device->sector_size, 1) == NULL) {
+				cmd->retries = 0;	/* go around again */
+				return;
+			}
+			/* otherwise, fall through */
 		case UNIT_ATTENTION:
 			if (cmd->device->removable) {
 				/* Detected disc change.  Set a bit