From patchwork Thu Nov 8 15:21:25 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Christopher Harvey X-Patchwork-Id: 197850 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from merlin.infradead.org (merlin.infradead.org [IPv6:2001:4978:20e::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by ozlabs.org (Postfix) with ESMTPS id 8DFBE2C0134 for ; Fri, 9 Nov 2012 02:20:03 +1100 (EST) Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.76 #1 (Red Hat Linux)) id 1TWTrY-0003IL-1S; Thu, 08 Nov 2012 15:17:48 +0000 Received: from mtxmxout6.matrox.com ([138.11.2.96]) by merlin.infradead.org with esmtp (Exim 4.76 #1 (Red Hat Linux)) id 1TWTrQ-0003H7-UJ for linux-mtd@lists.infradead.org; Thu, 08 Nov 2012 15:17:45 +0000 Received: from venus.matrox.com (venus.matrox.com [192.168.1.30]) by mtxmxout6.matrox.com (Postfix) with ESMTP id 085011D74FA; Thu, 8 Nov 2012 10:17:38 -0500 (EST) Received: (from ssmsp@localhost) by venus.matrox.com (8.14.4/8.13.2) id qA8FHbP1009394; Thu, 8 Nov 2012 10:17:37 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by venus.matrox.com (Postfix) with ESMTP id A958CF4DB6; Thu, 8 Nov 2012 10:17:37 -0500 (EST) X-Virus-MTX-Scanned: by Matrox Virus scanner at venus.matrox.com Received: from pluton.matrox.com (pluton.matrox.com [192.168.8.7]) by venus.matrox.com (Postfix) with ESMTP id 7C925F4DB5; Thu, 8 Nov 2012 10:17:37 -0500 (EST) Received: from harvey-pc.matrox.com (dyn-152-224.matrox.com [192.168.152.224]) by pluton.matrox.com (Postfix) with ESMTP id 750A27F45F; Thu, 8 Nov 2012 10:17:37 -0500 (EST) Date: Thu, 8 Nov 2012 10:21:25 -0500 From: Christopher Harvey To: Gerlando Falauto Subject: Re: state of support for "external ECC hardware" Message-ID: <20121108152125.GR2389@harvey-pc.matrox.com> References: <20121029204227.GA32300@harvey-pc.matrox.com> <509B9143.4020506@keymile.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <509B9143.4020506@keymile.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20121108_101741_389881_D1267543 X-CRM114-Status: GOOD ( 41.30 ) X-Spam-Score: -3.3 (---) X-Spam-Report: SpamAssassin version 3.3.2 on merlin.infradead.org summary: Content analysis details: (-3.3 points) pts rule name description ---- ---------------------- -------------------------------------------------- -0.7 RCVD_IN_DNSWL_LOW RBL: Sender listed at http://www.dnswl.org/, low trust [138.11.2.96 listed in list.dnswl.org] -0.0 SPF_PASS SPF: sender matches SPF record -0.7 RP_MATCHES_RCVD Envelope sender domain matches handover relay domain -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Cc: stefan.bigler@keymile.com, Ivan Djelic , "Brunck, Holger" , linux-mtd@lists.infradead.org X-BeenThere: linux-mtd@lists.infradead.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Linux MTD discussion mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: linux-mtd-bounces@lists.infradead.org Errors-To: linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org On Thu, Nov 08, 2012 at 12:02:27PM +0100, Gerlando Falauto wrote: > Hi Chris, > > good to hear we're not alone in this thinking... :-) > We're now facing the exact same issue as some Micron NAND chips (most > likely the same one you're dealing with) can no longer live with the > default, simple 1-bit ECC implementation used by default > (NAND_ECC_SOFT), I guess because chances of having multiple bitflips > within the same page are no longer negligible. So some 4-bit ECC > mechanism must be used at the very least. We had BCH8 code running, but it wasn't enough. The main reason we switched away from host side ECC was because we were getting bitflips within the ECC codeword data itself. Yes, it would have been possible to add a 1 byte hamming code to protect the main ECC data, but it was just easier to say, "hey, Micron knows their hardware, so we'll trust their algorithms", and enable the Micron ECC hardware. Although it didn't require too much work to enable it's all a total hack. I took the code that runs the "ECC disabled mode", and sprinkled in some extra init code and error checking code. Would be nice to add an "external ecc mode" to support these chips explicitly. > Support for software-based multiple-bit-resilient ECC mechanism (BCH) > was posted (http://lwn.net/Articles/426856/) by Ivan Djelic (which I > took liberty to Cc:) and merged in March last year. > I haven't been able to track how the situation evolved, but apparently > you need to enable it (in addition to within the kernel configuration), > also within your flash controller setup. > Micron gives an example of how to enable it on a sample NAND host > controller S3C6410 in this TN (rest of the code, mainly from the above > patch, would be already present in recent kernels): > http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2971_software_bch_ecc_on_linux.pdf I haven't looked into current software ECC algorithms in the kernel. Do the protect against corrupted ECC data? As in, corruptions in the out of bounds area? > As for hardware-based (or on-die) ECC support, one of the application > notes from Micron (TN-29-56 Enabling On-Die ECC for OMAP3 on > Linux/Android OS, > http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2956_ondie_ecc_omap3_linux.pdf) > shows how to enable that (rather, it shows how to disable software ECC > altogether after enabling it on the chip). However, I haven't been able > to find a code section where the information returned by the chip > ("Rewrite recommended") is actually used to solicit scrubbing... Neither > on the TN, nor on the upstream linux kernel... My next step would be to > give it a go and see what happens. I got that working, if you're running in "eec disabled mode", try something like this: Ran a trace on some manually inserted bitflips, and the block was moved. Hope that helps. > I'd love to hear some feedback, if anyone has had experience with this. > I know it's not been a long time since your post, but perhaps you've > heard something in the meantime? > > I have one additional question though. Looking at the code I got the > impression that decisions upon ECC seem to be based on the flash > controller rather than on the flash chip itself. > I mean, I would think of having a default 1-bit NAND_ECC_SOFT > implementation; only when it is detected that the flash part either > supports HW ECC or requires multiple-bit ECC, should the ECC mode get > switched to NAND_ECC_NONE or NAND_ECC_SOFT_BCH respectively. > No matter what the flash controller, I would say. > > Ivan, do you think that makes any sense? > > Thank you so much! > Gerlando > > On 10/29/2012 09:42 PM, Christopher Harvey wrote: > > I know of at least one Micron NAND chip that has the ability to handle > > ECC completely on the NAND chip itself. All the host has to do is send > > data and the OOB section is updated automatically. The automatic ECC > > hardware can be enabled and disabled with the "Set Feature" command, > > (0xEF) and bit flips are reported via get status after page reads. I > > don't see support for this in 2.6.37, and a quick check in the logs > > doesn't show anything new for these chips in the latest version of the > > kernel. Any idea floating around on this list? Are these chips going > > to be the future for NAND and does Linux care about them? > > > > thanks, > > Chris > > diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c index a796dd7..68af8b0 100644 --- a/drivers/mtd/nand/nand_base.c +++ b/drivers/mtd/nand/nand_base.c @@ -1069,7 +1069,16 @@ static int nand_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip, uint8_t *buf, int page) { chip->read_buf(mtd, buf, mtd->writesize); - chip->read_buf(mtd, chip->oob_poi, mtd->oobsize); + chip->read_buf(mtd, chip->oob_poi, mtd->oobsize); /* (this data used in sw ecc) */ + + /* TODO: only do this CMD_STATUS if we have Micron NAND */ + chip->cmdfunc(mtd, NAND_CMD_STATUS, -1, -1); + if (chip->read_byte(mtd) & NAND_STATUS_REWRITE_RECOMMENDED) { + /* Micron NAND is telling us that this block may be going bad, + tell Linux to move it */ + mtd->ecc_stats.corrected++; /* (we don't actually know if it's just one correction, could be up to 4) */ + } + return 0; }