diff mbox

[U-Boot] NAND ECC Error with wrong SMC ording bug

Message ID 20090820153644.631dbd7b@lappy.seanm.ca
State Not Applicable
Headers show

Commit Message

Sean MacLennan Aug. 20, 2009, 7:36 p.m. UTC
On Thu, 20 Aug 2009 07:01:21 +0200
Stefan Roese <sr@denx.de> wrote:

> On Thursday 20 August 2009 06:38:51 Sean MacLennan wrote:
> > > I see other boards using SMC as well, can someone comment on the
> > > change I am proposing.
> > > Should I change the correction algorithm or the calculate
> > > function? If the later is preferred
> > > it would mean the change must be pushed in both U-Boot and Linux.
> >
> > Odds are the calculate function is wrong. The correction algo is
> > used by many nand drivers, I *assume* it is correct. The calculate
> > function was set to agree with u-boot (1.3.0).
> 
> Yes, it seems that you changed the order in the calculation function
> while reworking the NDFC driver for arch/powerpc. So we should
> probably change this order back to the original version. And change
> it in U-Boot as well.
> 
> BTW: I didn't see any problems with ECC so far with the current code.
> Feng, how did you spot this problem?

Ok, I think I have reproduced the problem programmatically. Basically,
I force a one bit error with the following patch:


Does anybody see a problem with my method of reproducing the bug? This
bug is deadly for our customers. I don't want to make the change unless
it is absolutely necessary.

Cheers,
   Sean

Comments

Victor Gallardo Aug. 20, 2009, 10:56 p.m. UTC | #1
Hi Sean,

The change is necessary in both Linux and u-boot. Without this change customer are seeing the problem.

Best Regards,

Victor Gallardo

> -----Original Message-----
> From: linuxppc-dev-bounces+vgallardo=amcc.com@lists.ozlabs.org [mailto:linuxppc-dev-
> bounces+vgallardo=amcc.com@lists.ozlabs.org] On Behalf Of Sean MacLennan
> Sent: Thursday, August 20, 2009 12:37 PM
> To: Stefan Roese
> Cc: u-boot@lists.denx.de; Feng Kan; linux-mtd@lists.infradead.org; linuxppc-dev@ozlabs.org
> Subject: Re: [U-Boot] NAND ECC Error with wrong SMC ording bug
> 
> On Thu, 20 Aug 2009 07:01:21 +0200
> Stefan Roese <sr@denx.de> wrote:
> 
> > On Thursday 20 August 2009 06:38:51 Sean MacLennan wrote:
> > > > I see other boards using SMC as well, can someone comment on the
> > > > change I am proposing.
> > > > Should I change the correction algorithm or the calculate
> > > > function? If the later is preferred
> > > > it would mean the change must be pushed in both U-Boot and Linux.
> > >
> > > Odds are the calculate function is wrong. The correction algo is
> > > used by many nand drivers, I *assume* it is correct. The calculate
> > > function was set to agree with u-boot (1.3.0).
> >
> > Yes, it seems that you changed the order in the calculation function
> > while reworking the NDFC driver for arch/powerpc. So we should
> > probably change this order back to the original version. And change
> > it in U-Boot as well.
> >
> > BTW: I didn't see any problems with ECC so far with the current code.
> > Feng, how did you spot this problem?
> 
> Ok, I think I have reproduced the problem programmatically. Basically,
> I force a one bit error with the following patch:
> 
> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> index 8c21b89..91dd5b4 100644
> --- a/drivers/mtd/nand/nand_base.c
> +++ b/drivers/mtd/nand/nand_base.c
> @@ -1628,11 +1628,22 @@ static void nand_write_page_hwecc(struct mtd_info *mtd, struct nand_chip
> *chip,
>  	uint8_t *ecc_calc = chip->buffers->ecccalc;
>  	const uint8_t *p = buf;
>  	uint32_t *eccpos = chip->ecc.layout->eccpos;
> +	static int count;
> 
>  	for (i = 0; eccsteps; eccsteps--, i += eccbytes, p += eccsize) {
>  		chip->ecc.hwctl(mtd, NAND_ECC_WRITE);
> -		chip->write_buf(mtd, p, eccsize);
> -		chip->ecc.calculate(mtd, p, &ecc_calc[i]);
> +		if (count == 0) {
> +			count = 1;
> +			printk("Corrupt one bit: %08x => %08x\n",
> +			       *p, *p ^ 8);
> +			*(uint8_t *)p ^= 8;
> +			chip->write_buf(mtd, p, eccsize);
> +			*(uint8_t *)p ^= 8;
> +			nand_calculate_ecc(mtd, p, &ecc_calc[i]);
> +		} else {
> +			chip->write_buf(mtd, p, eccsize);
> +			chip->ecc.calculate(mtd, p, &ecc_calc[i]);
> +		}
>  	}
> 
>  	for (i = 0; i < chip->ecc.total; i++)
> 
> Basically I write a one bit error to the NAND, but calculate with the
> correct bit. This assumes nand_calculate_ecc is correct.
> 
> I then added debugs to the correction to make sure it corrected
> properly:
> 
> diff --git a/drivers/mtd/nand/nand_ecc.c b/drivers/mtd/nand/nand_ecc.c
> index c0cb87d..57dcaa1 100644
> --- a/drivers/mtd/nand/nand_ecc.c
> +++ b/drivers/mtd/nand/nand_ecc.c
> @@ -483,14 +483,20 @@ int nand_correct_data(struct mtd_info *mtd, unsigned char *buf,
>  			byte_addr = (addressbits[b2 & 0x3] << 8) +
>  				    (addressbits[b1] << 4) + addressbits[b0];
>  		bit_addr = addressbits[b2 >> 2];
> +
> +		printk("Single bit error: correct %08x => %08x\n",
> +		       buf[byte_addr], buf[byte_addr] ^ (1 << bit_addr));
> +
>  		/* flip the bit */
>  		buf[byte_addr] ^= (1 << bit_addr);
>  		return 1;
> 
>  	}
>  	/* count nr of bits; use table lookup, faster than calculating it */
> -	if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1)
> +	if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1) {
> +		printk("ECC DATA BAD\n"); // SAM DBG
>  		return 1;	/* error in ecc data; no action needed */
> +	}
> 
>  	printk(KERN_ERR "uncorrectable error : ");
>  	return -1;
> 
> With the current ndfc code, the error correction gets the bits wrong.
> Switching it back to the original way and the correction is correct.
> 
> diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c
> index 89bf85a..497e175 100644
> --- a/drivers/mtd/nand/ndfc.c
> +++ b/drivers/mtd/nand/ndfc.c
> @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
> 
>  	wmb();
>  	ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
> -	/* The NDFC uses Smart Media (SMC) bytes order */
> -	ecc_code[0] = p[2];
> -	ecc_code[1] = p[1];
> +	ecc_code[0] = p[1];
> +	ecc_code[1] = p[2];
>  	ecc_code[2] = p[3];
> 
>  	return 0;
> 
> Does anybody see a problem with my method of reproducing the bug? This
> bug is deadly for our customers. I don't want to make the change unless
> it is absolutely necessary.
> 
> Cheers,
>    Sean
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
vimal singh Aug. 21, 2009, 5:17 a.m. UTC | #2
<snip>

> With the current ndfc code, the error correction gets the bits wrong.
> Switching it back to the original way and the correction is correct.
>
> diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c
> index 89bf85a..497e175 100644
> --- a/drivers/mtd/nand/ndfc.c
> +++ b/drivers/mtd/nand/ndfc.c
> @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
>
>        wmb();
>        ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
> -       /* The NDFC uses Smart Media (SMC) bytes order */
> -       ecc_code[0] = p[2];
> -       ecc_code[1] = p[1];
> +       ecc_code[0] = p[1];
> +       ecc_code[1] = p[2];
>        ecc_code[2] = p[3];
>
>        return 0;
>
> Does anybody see a problem with my method of reproducing the bug? This
> bug is deadly for our customers. I don't want to make the change unless
> it is absolutely necessary..

Just one question: did you enabled MTD_NAND_ECC_SMC in configs?

-vimal

>
> Cheers,
>   Sean
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
>
Sean MacLennan Aug. 21, 2009, 6:26 a.m. UTC | #3
On Fri, 21 Aug 2009 10:47:09 +0530
vimal singh <vimal.newwork@gmail.com> wrote:

> Just one question: did you enabled MTD_NAND_ECC_SMC in configs?

It is automagically selected when you select the NDFC driver.

Cheers,
   Sean
Stefan Roese Aug. 21, 2009, 6:27 a.m. UTC | #4
On Friday 21 August 2009 07:17:09 vimal singh wrote:
> > diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c
> > index 89bf85a..497e175 100644
> > --- a/drivers/mtd/nand/ndfc.c
> > +++ b/drivers/mtd/nand/ndfc.c
> > @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
> >
> >        wmb();
> >        ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
> > -       /* The NDFC uses Smart Media (SMC) bytes order */
> > -       ecc_code[0] = p[2];
> > -       ecc_code[1] = p[1];
> > +       ecc_code[0] = p[1];
> > +       ecc_code[1] = p[2];
> >        ecc_code[2] = p[3];
> >
> >        return 0;
> >
> > Does anybody see a problem with my method of reproducing the bug? This
> > bug is deadly for our customers. I don't want to make the change unless
> > it is absolutely necessary..
>
> Just one question: did you enabled MTD_NAND_ECC_SMC in configs?

Yes, MTD_NAND_ECC_SMC is selected via Kconfig for this driver.

Cheers,
Stefan

--
DENX Software Engineering GmbH,      MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich,  Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-0 Fax: (+49)-8142-66989-80 Email: office@denx.de
Victor Gallardo Aug. 21, 2009, 6:30 a.m. UTC | #5
Hi Vimal,

> > With the current ndfc code, the error correction gets the bits wrong.
> > Switching it back to the original way and the correction is correct.
> >
> > diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c
> > index 89bf85a..497e175 100644
> > --- a/drivers/mtd/nand/ndfc.c
> > +++ b/drivers/mtd/nand/ndfc.c
> > @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
> >
> >        wmb();
> >        ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
> > -       /* The NDFC uses Smart Media (SMC) bytes order */
> > -       ecc_code[0] = p[2];
> > -       ecc_code[1] = p[1];
> > +       ecc_code[0] = p[1];
> > +       ecc_code[1] = p[2];
> >        ecc_code[2] = p[3];
> >
> >        return 0;
> >
> > Does anybody see a problem with my method of reproducing the bug? This
> > bug is deadly for our customers. I don't want to make the change unless
> > it is absolutely necessary..
> 
> Just one question: did you enabled MTD_NAND_ECC_SMC in configs?

Yes, it was set.

Best Regards,

Victor Gallardo
diff mbox

Patch

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index 8c21b89..91dd5b4 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -1628,11 +1628,22 @@  static void nand_write_page_hwecc(struct mtd_info *mtd, struct nand_chip *chip,
 	uint8_t *ecc_calc = chip->buffers->ecccalc;
 	const uint8_t *p = buf;
 	uint32_t *eccpos = chip->ecc.layout->eccpos;
+	static int count;
 
 	for (i = 0; eccsteps; eccsteps--, i += eccbytes, p += eccsize) {
 		chip->ecc.hwctl(mtd, NAND_ECC_WRITE);
-		chip->write_buf(mtd, p, eccsize);
-		chip->ecc.calculate(mtd, p, &ecc_calc[i]);
+		if (count == 0) {
+			count = 1;
+			printk("Corrupt one bit: %08x => %08x\n",
+			       *p, *p ^ 8);
+			*(uint8_t *)p ^= 8;
+			chip->write_buf(mtd, p, eccsize);
+			*(uint8_t *)p ^= 8;
+			nand_calculate_ecc(mtd, p, &ecc_calc[i]);
+		} else {
+			chip->write_buf(mtd, p, eccsize);
+			chip->ecc.calculate(mtd, p, &ecc_calc[i]);
+		}
 	}
 
 	for (i = 0; i < chip->ecc.total; i++)

Basically I write a one bit error to the NAND, but calculate with the
correct bit. This assumes nand_calculate_ecc is correct.

I then added debugs to the correction to make sure it corrected
properly:

diff --git a/drivers/mtd/nand/nand_ecc.c b/drivers/mtd/nand/nand_ecc.c
index c0cb87d..57dcaa1 100644
--- a/drivers/mtd/nand/nand_ecc.c
+++ b/drivers/mtd/nand/nand_ecc.c
@@ -483,14 +483,20 @@  int nand_correct_data(struct mtd_info *mtd, unsigned char *buf,
 			byte_addr = (addressbits[b2 & 0x3] << 8) +
 				    (addressbits[b1] << 4) + addressbits[b0];
 		bit_addr = addressbits[b2 >> 2];
+
+		printk("Single bit error: correct %08x => %08x\n",
+		       buf[byte_addr], buf[byte_addr] ^ (1 << bit_addr));
+
 		/* flip the bit */
 		buf[byte_addr] ^= (1 << bit_addr);
 		return 1;
 
 	}
 	/* count nr of bits; use table lookup, faster than calculating it */
-	if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1)
+	if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1) {
+		printk("ECC DATA BAD\n"); // SAM DBG
 		return 1;	/* error in ecc data; no action needed */
+	}
 
 	printk(KERN_ERR "uncorrectable error : ");
 	return -1;

With the current ndfc code, the error correction gets the bits wrong.
Switching it back to the original way and the correction is correct.

diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c
index 89bf85a..497e175 100644
--- a/drivers/mtd/nand/ndfc.c
+++ b/drivers/mtd/nand/ndfc.c
@@ -101,9 +101,8 @@  static int ndfc_calculate_ecc(struct mtd_info *mtd,
 
 	wmb();
 	ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
-	/* The NDFC uses Smart Media (SMC) bytes order */
-	ecc_code[0] = p[2];
-	ecc_code[1] = p[1];
+	ecc_code[0] = p[1];
+	ecc_code[1] = p[2];
 	ecc_code[2] = p[3];
 
 	return 0;