
mtd oob test is failing consistently at same places in NAND flash

Message ID 518397C60809E147AF5323E0420B992E3E934BF1@DBDE01.ent.ti.com
State New, archived

Commit Message

Philip, Avinash May 8, 2012, 12:33 p.m. UTC
Hi,

We have an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) connected to the GPMC
(General-Purpose Memory Controller) module from TI.
We have been seeing mtd_oobtest failures on a 248 MB partition. Most of
the time, it is test case 2 of mtd_oobtest that fails. On debugging further, it
seems that bit flips occur in the OOB area during test case 2. The failure
locations are consistent.

To verify further, we tried writing zeros to the OOB area and reading them back.
This test passes and confirms that none of the (programmable) OOB bits
are bad.

To test further, I modified mtd_oobtest.c as shown in the patch below and found
that the test passes.


Is this behavior due to bits getting corrupted by a certain data sequence?
Has anyone else observed this issue?

Thanks & Regards
Avinash Philip

Comments

Ivan Djelic May 8, 2012, 1:23 p.m. UTC | #1
On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> Hi,
> 
> We are having an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) connected to GPMC
> Module (General purpose memory controller) from TI.

Hi,
How is ecc performed ?
Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
Which version of omap2 driver are you using ?
Is OOB also ECC-protected ?

> We have been seeing mtd_oobtest failure on a partition size of 248 MB. Most of
> the time, test case 2 of mtd_oobtest is failing. On debugging further it seems
> that bit flip is happening on the test case 2 in OOB area. It is observed that
> the failure locations are consistent. 

If you are able to reproduce failures, then you should be able to tell which bits
in OOB are failing, by adding a few debugging lines in the code.
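
A minimal sketch of what such debugging lines could look like, written as a standalone C helper rather than as a patch to mtd_oobtest.c. The name `report_oob_bitflips` and its parameters are hypothetical; inside the kernel test the `printf` calls would be `printk` calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: compare the bytes written to OOB with the bytes read
 * back and print every bit that differs. Returns the number of flipped bits.
 */
static unsigned int report_oob_bitflips(const uint8_t *written,
                                        const uint8_t *readback, size_t len)
{
    unsigned int flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t diff = written[i] ^ readback[i];

        for (int bit = 0; bit < 8; bit++) {
            if (!(diff & (1u << bit)))
                continue;
            /* The direction shows whether the bit went 1 -> 0 or 0 -> 1. */
            printf("byte %zu bit %d: %d -> %d\n", i, bit,
                   (written[i] >> bit) & 1, (readback[i] >> bit) & 1);
            flips++;
        }
    }
    return flips;
}
```

Called with the write buffer, the read buffer and the number of OOB bytes compared, it lists the byte offset, bit number and direction of each flip.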

> To verify further we had tried writing zeros to OOB area and read it back. 
> This test is passing and confirms that all OOB bits (that are programmable)
> are not bad.

It does not confirm anything, bits can fail by remaining stuck at 0.
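
To illustrate with a toy model (an assumption for the sake of the example, not a description of the actual device): if a bit is stuck at 0, an all-zeros pattern reads back correctly, and only a pattern that expects a 1 in that position exposes the fault. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model (assumption): bits set in stuck_at_0 always read back as 0. */
static uint8_t read_back(uint8_t programmed, uint8_t stuck_at_0)
{
    return programmed & (uint8_t)~stuck_at_0;
}

/* True when the value read back equals the pattern that was programmed. */
static bool pattern_passes(uint8_t pattern, uint8_t stuck_at_0)
{
    return read_back(pattern, stuck_at_0) == pattern;
}
```

With bit 3 stuck (`stuck_at_0 = 0x08`), the patterns 0x00 and 0x55 pass in this model because both hold a 0 in bit 3, while 0xFF and 0xAA fail.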

BR,
--
Ivan
Philip, Avinash May 8, 2012, 3:09 p.m. UTC | #2
On Tue, May 08, 2012 at 18:53:54, Ivan Djelic wrote:
> On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> > Hi,
> > 
> > We are having an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) 
> > connected to GPMC Module (General purpose memory controller) from TI.
> 
> Hi,
> How is ecc performed ?
> Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
> Which version of omap2 driver are you using ?
> Is OOB also ECC-protected ?

ECC is performed in hardware.
The 4-bit BCH ECC scheme is used.
I am using the omap2 driver in the Linux 3.2.0 kernel. I don't know the omap2 driver version.
No, OOB is not ECC-protected.

> 
> > We have been seeing mtd_oobtest failure on a partition size of 248 MB. 
> > Most of the time, test case 2 of mtd_oobtest is failing. On debugging 
> > further it seems that bit flip is happening on the test case 2 in OOB 
> > area. It is observed that the failure locations are consistent.
> 
> If you are able to reproduce failures, then you should be able to tell which bits in OOB are failing, by adding a few debugging lines in the code.
> 

I added debug prints and found bit flips from 1 to 0. The locations of the bit flips may vary
between boards, but on the same board they are consistent.

> > To verify further we had tried writing zeros to OOB area and read it back. 
> > This test is passing and confirms that all OOB bits (that are 
> > programmable) are not bad.
> 
> It does not confirm anything, bits can fail by remaining stuck at 0.
>

As the bit flip is from 1 to 0, you are right. But the same experiment with 0x55
also passes.
 
> BR,
> --
> Ivan
>
Philip, Avinash May 8, 2012, 3:23 p.m. UTC | #3
On Tue, May 08, 2012 at 20:39:46, Philip, Avinash wrote:
> On Tue, May 08, 2012 at 18:53:54, Ivan Djelic wrote:
> > On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> > > Hi,
> > > 
> > > We are having an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) 
> > > connected to GPMC Module (General purpose memory controller) from TI.
> > 
> > Hi,
> > How is ecc performed ?
> > Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
> > Which version of omap2 driver are you using ?
> > Is OOB also ECC-protected ?
> 
> Hardware ECC is performing.
> 4-bit BCH ECC scheme is used.

One correction: the 8-bit BCH ECC scheme is used.

> I am using omap2 driver in Linux 3.2.0 Kernel. Don't know omap2 driver version.
> No, OOB is not ECC protected.
> 
> > 
> > > We have been seeing mtd_oobtest failure on a partition size of 248 MB. 
> > > Most of the time, test case 2 of mtd_oobtest is failing. On 
> > > debugging further it seems that bit flip is happening on the test 
> > > case 2 in OOB area. It is observed that the failure locations are consistent.
> > 
> > If you are able to reproduce failures, then you should be able to tell which bits in OOB are failing, by adding a few debugging lines in the code.
> > 
> 
> I add debugs and found bit flips from 1 to 0. The location of bit flips might vary on boards. But on the same board it is consistent.
> 
> > > To verify further we had tried writing zeros to OOB area and read it back. 
> > > This test is passing and confirms that all OOB bits (that are
> > > programmable) are not bad.
> > 
> > It does not confirm anything, bits can fail by remaining stuck at 0.
> >
> 
> As bit flip is from 1 to 0, you are right. But same experiment with 0x55, test is passing.
>  
> > BR,
> > --
> > Ivan
> > 
Ivan Djelic May 8, 2012, 6:45 p.m. UTC | #4
On Tue, May 08, 2012 at 04:09:46PM +0100, Philip, Avinash wrote:
> On Tue, May 08, 2012 at 18:53:54, Ivan Djelic wrote:
> > On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> > > Hi,
> > > 
> > > We are having an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) 
> > > connected to GPMC Module (General purpose memory controller) from TI.
> > 
> > Hi,
> > How is ecc performed ?
> > Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
> > Which version of omap2 driver are you using ?
> > Is OOB also ECC-protected ?
> 
> Hardware ECC is performing.
> 4-bit BCH ECC scheme is used.
> I am using omap2 driver in Linux 3.2.0 Kernel. Don't know omap2 driver version.

You are probably using a patched kernel, since 3.2.0 does not have GPMC BCH support ?!
What is your ecc layout ? Does it expose oobfree regions ?

> No, OOB is not ECC protected.

Well, in that case, isn't it normal that mtd_oobtest should fail if there happens to be
a single bitflip in the available OOB area ? (of size mtd->ecclayout->oobavail for each page)

BR,
--
Ivan
Ricard Wanderlof May 9, 2012, 6:37 a.m. UTC | #5
On Tue, 8 May 2012, Philip, Avinash wrote:

> I add debugs and found bit flips from 1 to 0. The location of bit flips might vary on
> boards. But on the same board it is consistent.
>
>>> To verify further we had tried writing zeros to OOB area and read it back.
>>> This test is passing and confirms that all OOB bits (that are
>>> programmable) are not bad.
>>
>> It does not confirm anything, bits can fail by remaining stuck at 0.
>>
>
> As bit flip is from 1 to 0, you are right. But same experiment with 0x55,
> test is passing.

Are you sure it's not the bits that happen to be 0 in 0x55 that are the 
ones suspected of flipping?

/Ricard
Philip, Avinash May 9, 2012, 3:12 p.m. UTC | #6
On Wed, May 09, 2012 at 00:15:16, Ivan Djelic wrote:
> On Tue, May 08, 2012 at 04:09:46PM +0100, Philip, Avinash wrote:
> > On Tue, May 08, 2012 at 18:53:54, Ivan Djelic wrote:
> > > On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> > > > Hi,
> > > > 
> > > > We are having an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) 
> > > > connected to GPMC Module (General purpose memory controller) from TI.
> > > 
> > > Hi,
> > > How is ecc performed ?
> > > Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
> > > Which version of omap2 driver are you using ?
> > > Is OOB also ECC-protected ?
> > 
> > Hardware ECC is performing.
> > 4-bit BCH ECC scheme is used.
> > I am using omap2 driver in Linux 3.2.0 Kernel. Don't know omap2 driver version.
> 
> You are probably using a patched kernel, since 3.2.0 does not have GPMC BCH support ?!
> What is your ecc layout ? Does it expose oobfree regions ?
>

Yes, we are using a patched kernel. The OOB free region is exposed.

The ECC layout is as follows.

0-1   -> bad block marker
2-57  -> ECC bytes (14 bytes per 512 bytes of data)
58-63 -> OOB free bytes

mtd->ecclayout->eccbytes		= 56
mtd->ecclayout->eccpos[0]	= 2
mtd->ecclayout->oobavail		= 6
mtd->ecclayout->oobfree[0].offset	= 58
mtd->ecclayout->oobfree[0].length	= 6
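
The arithmetic behind these numbers can be sketched as follows. The 64-byte OOB size and the 2048-byte page size are assumptions inferred from the offsets 0-63 and from 56 / 14 = 4 sectors of 512 bytes; the struct and function names are hypothetical.

```c
/* Values from the layout above; OOB and page size are assumptions. */
enum {
    NAND_OOB_SIZE    = 64,   /* assumed: offsets 0-63 are listed */
    NAND_BBM_BYTES   = 2,    /* offsets 0-1, bad block marker */
    ECC_PER_SECTOR   = 14,   /* 14 ECC bytes per 512 bytes of data */
    NAND_SECTOR_SIZE = 512,
    NAND_PAGE_SIZE   = 2048  /* assumed: 4 sectors per page */
};

struct oob_layout {
    unsigned int ecc_offset;   /* first ECC byte */
    unsigned int ecc_bytes;    /* total ECC bytes per page */
    unsigned int free_offset;  /* first free OOB byte */
    unsigned int free_bytes;   /* number of free OOB bytes */
};

static struct oob_layout compute_layout(void)
{
    struct oob_layout l;
    unsigned int sectors = NAND_PAGE_SIZE / NAND_SECTOR_SIZE;

    l.ecc_offset  = NAND_BBM_BYTES;
    l.ecc_bytes   = sectors * ECC_PER_SECTOR;
    l.free_offset = l.ecc_offset + l.ecc_bytes;
    l.free_bytes  = NAND_OOB_SIZE - l.free_offset;
    return l;
}
```

Under these assumptions the result matches the values listed: 56 ECC bytes starting at offset 2, and 6 free bytes starting at offset 58.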

Regards
Avinash

> > No, OOB is not ECC protected.
> 
> Well, in that case, isn't it normal that mtd_oobtest should fail if there happens to be a single bitflip in the available OOB area ? (of size mtd->ecclayout->oobavail for each page)
>
> BR,
> --
> Ivan
>
Ivan Djelic May 9, 2012, 3:24 p.m. UTC | #7
On Wed, May 09, 2012 at 04:12:05PM +0100, Philip, Avinash wrote:
> On Wed, May 09, 2012 at 00:15:16, Ivan Djelic wrote:
> > On Tue, May 08, 2012 at 04:09:46PM +0100, Philip, Avinash wrote:
> > > On Tue, May 08, 2012 at 18:53:54, Ivan Djelic wrote:
> > > > On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> > > > > Hi,
> > > > > 
> > > > > We are having an 8-bit NAND part (MT29F2G08ABAEAWP from Micron) 
> > > > > connected to GPMC Module (General purpose memory controller) from TI.
> > > > 
> > > > Hi,
> > > > How is ecc performed ?
> > > > Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
> > > > Which version of omap2 driver are you using ?
> > > > Is OOB also ECC-protected ?
> > > 
> > > Hardware ECC is performing.
> > > 4-bit BCH ECC scheme is used.
> > > I am using omap2 driver in Linux 3.2.0 Kernel. Don't know omap2 driver version.
> > 
> > You are probably using a patched kernel, since 3.2.0 does not have GPMC BCH support ?!
> > What is your ecc layout ? Does it expose oobfree regions ?
> >
> 
> Yes, we had using patched kernel. OOB free region is exposed.
> 
> ECC layout will be as follows.
> 
> 0-1	-> BAD block marking
> 2-57  -> ECC byte position, ( 14 bytes for 512 byte)
> 58-63 -> oob free bytes
> 
> mtd->ecclayout->eccbytes		= 56
> mtd->ecclayout->eccpos[0]	= 2
> mtd->ecclayout->oobavail		= 6
> mtd->ecclayout->oobfree[0].offset	= 58
> mtd->ecclayout->oobfree[0].length	= 6
> 

OK, then it is quite normal that mtd_oobtest should fail when it encounters a
bitflip (one that does not match the programmed data) in those unprotected 6
bytes (58-63). What do you think ?

BR,
--
Ivan
Philip, Avinash May 9, 2012, 3:46 p.m. UTC | #8
On Wed, May 09, 2012 at 20:54:37, Ivan Djelic wrote:
> On Wed, May 09, 2012 at 04:12:05PM +0100, Philip, Avinash wrote:
> > On Wed, May 09, 2012 at 00:15:16, Ivan Djelic wrote:
> > > On Tue, May 08, 2012 at 04:09:46PM +0100, Philip, Avinash wrote:
> > > > On Tue, May 08, 2012 at 18:53:54, Ivan Djelic wrote:
> > > > > On Tue, May 08, 2012 at 01:33:06PM +0100, Philip, Avinash wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > We are having an 8-bit NAND part (MT29F2G08ABAEAWP from 
> > > > > > Micron) connected to GPMC Module (General purpose memory controller) from TI.
> > > > > 
> > > > > Hi,
> > > > > How is ecc performed ?
> > > > > Using NAND internal ecc ? or with GPMC 1-bit Hamming ? 4-bit/8-bit BCH ?
> > > > > Which version of omap2 driver are you using ?
> > > > > Is OOB also ECC-protected ?
> > > > 
> > > > Hardware ECC is performing.
> > > > 4-bit BCH ECC scheme is used.

One correction: we are using the 8-bit BCH ECC scheme.

> > > > I am using omap2 driver in Linux 3.2.0 Kernel. Don't know omap2 driver version.
> > > 
> > > You are probably using a patched kernel, since 3.2.0 does not have GPMC BCH support ?!
> > > What is your ecc layout ? Does it expose oobfree regions ?
> > >
> > 
> > Yes, we had using patched kernel. OOB free region is exposed.
> > 
> > ECC layout will be as follows.
> > 
> > 0-1	-> BAD block marking
> > 2-57  -> ECC byte position, ( 14 bytes for 512 byte)
> > 58-63 -> oob free bytes
> > 
> > mtd->ecclayout->eccbytes		= 56
> > mtd->ecclayout->eccpos[0]	= 2
> > mtd->ecclayout->oobavail		= 6
> > mtd->ecclayout->oobfree[0].offset	= 58
> > mtd->ecclayout->oobfree[0].length	= 6
> > 
> 
> OK, then it is quite normal that mtd_oobtest should fail when it encounters a bitflip (one that does not match the programmed data) in those unprotected 6 bytes (58-63). What do you think ?


Is this behavior expected when the OOB area is left unprotected?
(I am not sure. What I understood is that ECC does not help with failures in the OOB area.
Ideally, should the OOB area be ECC-protected as well?)

Basically, I am investigating why bit flips happen in the OOB area. Some observations on
mtd_oobtest in our setup:
1. When mtd_oobtest is modified to write fixed patterns (0x0, 0x55, 0xAA, 0xff), the test passes
for all patterns.
2. When a delay of 10 ms is inserted after erase_whole_device() in mtd_oobtest, the test passes.

I can't explain why the test passes with the modified patterns, since between them the patterns
cover all bits.
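
The coverage claim can be checked with a small sketch like the one below (the function name is hypothetical). It only verifies that every bit position is programmed as 1 and as 0 by at least one pattern; it says nothing about the order in which bytes appear on the bus, which a constant fill pattern does not vary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if every bit position is written as 1 by at least one pattern
 * and as 0 by at least one pattern.
 */
static bool patterns_cover_all_bits(const uint8_t *patterns, size_t n)
{
    uint8_t seen_one = 0x00;   /* bits programmed as 1 at least once */
    uint8_t seen_zero = 0x00;  /* bits programmed as 0 at least once */

    for (size_t i = 0; i < n; i++) {
        seen_one |= patterns[i];
        seen_zero |= (uint8_t)~patterns[i];
    }
    return seen_one == 0xFF && seen_zero == 0xFF;
}
```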

The test passing with the delay inserted points to some problem in command issue. I am
debugging this.

Any suggestions will be helpful.

Thanks
Avinash
Ivan Djelic May 9, 2012, 4:05 p.m. UTC | #9
On Wed, May 09, 2012 at 04:46:17PM +0100, Philip, Avinash wrote:
> > > 
> > > Yes, we had using patched kernel. OOB free region is exposed.
> > > 
> > > ECC layout will be as follows.
> > > 
> > > 0-1	-> BAD block marking
> > > 2-57  -> ECC byte position, ( 14 bytes for 512 byte)
> > > 58-63 -> oob free bytes
> > > 
> > > mtd->ecclayout->eccbytes		= 56
> > > mtd->ecclayout->eccpos[0]	= 2
> > > mtd->ecclayout->oobavail		= 6
> > > mtd->ecclayout->oobfree[0].offset	= 58
> > > mtd->ecclayout->oobfree[0].length	= 6
> > > 
> > 
> > OK, then it is quite normal that mtd_oobtest should fail when it encounters a bitflip (one that does not match the programmed data) in those unprotected 6 bytes (58-63). What do you think ?
> 
> 
> Is this behavior is expected for which OOB area left unprotected?
> (I am not sure, What I understood is with failure in OOB area, ECC won't be useful.

Yes.

> Is it ideally we should have ECC protection for OOB area also required?)

There is no need for ECC protection on free oob bytes if you do not use them.

> Basically I am testing why bit flips is happening in OOB area. Some observation related
> to mtd_oob test in the setup we are having is
> 1. Modify mtd_oob test to write patterns (0x0, 0x55, 0xAA, 0xff), then test is getting passed
> for all patterns.

OK, strange.

> 2. On inserting a delay of 10 ms after erase_whole_device() in mtd oob test, test is getting passed.
> I can't correlate how test is getting passed on modifying pattern as we are covering all bits in
> either of the patterns.
> 
> On inserting delay test is getting passed, will point to me some problems in command issue. I am
> debugging on this.
> 

OK. If you are relying on an R/nB pin to wait for operation completion, you
might want to check that it works properly.
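
One way to cross-check the pin is to poll the NAND status register and wait for the ready bit (bit 6, 0x40) instead. A minimal sketch under assumptions: `read_status_fn` stands in for whatever issues READ STATUS on the real hardware, and `fake_status` is a made-up device used only to exercise the loop. A real driver would bound the wait by time, not by a poll count.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_READY 0x40  /* bit 6 of the NAND status register: 1 = ready */

/* Hypothetical callback that returns the current status register value. */
typedef uint8_t (*read_status_fn)(void *ctx);

/* Poll until the device reports ready or max_polls attempts are used up. */
static bool wait_ready(read_status_fn read_status, void *ctx,
                       unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        if (read_status(ctx) & STATUS_READY)
            return true;
    }
    return false;
}

/* Fake device for illustration: busy for *ctx polls, then ready. */
static uint8_t fake_status(void *ctx)
{
    unsigned int *busy_polls = ctx;

    if (*busy_polls > 0) {
        (*busy_polls)--;
        return 0x00;
    }
    return STATUS_READY;
}
```

If the test passes with status polling but fails when waiting on R/nB, that would point at the pin or its configuration.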

BR,
Ricard Wanderlof May 10, 2012, 7:51 a.m. UTC | #10
On Wed, 9 May 2012, Philip, Avinash wrote:

> Basically I am testing why bit flips is happening in OOB area. Some observation related
> to mtd_oob test in the setup we are having is
> 1. Modify mtd_oob test to write patterns (0x0, 0x55, 0xAA, 0xff), then test is getting passed
> for all patterns.
> 2. On inserting a delay of 10 ms after erase_whole_device() in mtd oob test, test is getting passed.
>
> I can't correlate how test is getting passed on modifying pattern as we are covering all bits in
> either of the patterns.
>
> On inserting delay test is getting passed, will point to me some problems in command issue. I am
> debugging on this.

We've had problems with bus buffers between the CPU and flash not being 
fast enough. The symptoms were similar to what is described above, certain 
patterns would fail, whereas others wouldn't. When a given byte failed, it 
depended on what the previous byte on the bus was if I remember correctly, 
in some pattern that we never bothered to find out. We upgraded the bus 
drivers and changed the CPU timing towards the flash which cleared up the 
problem.

One symptom was that the probability of errors changed drastically when 
the temperature was changed. It also varied a lot between individual 
devices.

/Ricard
Artem Bityutskiy May 10, 2012, 1:01 p.m. UTC | #11
On Tue, 2012-05-08 at 12:33 +0000, Philip, Avinash wrote:
> static inline unsigned int simple_rand(void)
> {
> -       next = next * 1103515245 + 12345;
> +       next = next * 1103515244 + 12345; /* 45 -> 44. Sequence is changed */
>         return (unsigned int)((next / 65536) % 32768);

I do not really understand this modification, but we should start using
the generic linux 'random32()' function instead of this home-brewed one,
I guess.
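
A note on what the modified multiplier does, as a sketch: 1103515244 is divisible by 4, so each step shifts the influence of the previous state up by two bits. With 32-bit state (assumed here, as `unsigned long` is on 32-bit ARM) the seed is shifted out entirely after 16 steps and the state stops changing, so the modified test may effectively be writing a constant pattern rather than a pseudo-random one. The function names below are hypothetical; with 64-bit state the same collapse takes 32 steps.

```c
#include <stdint.h>

/*
 * One step of the state update used by simple_rand(), with a selectable
 * multiplier. uint32_t models a 32-bit unsigned long (assumption).
 */
static uint32_t lcg_step(uint32_t state, uint32_t multiplier)
{
    return state * multiplier + 12345u;
}

/* Advance the state by n steps. */
static uint32_t lcg_advance(uint32_t state, uint32_t multiplier,
                            unsigned int n)
{
    while (n--)
        state = lcg_step(state, multiplier);
    return state;
}
```

With the multiplier 1103515244, `lcg_advance(seed, ..., 16)` and `lcg_advance(seed, ..., 17)` return the same value for any seed, whereas with 1103515245 consecutive states always differ.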

Patch

diff --git a/drivers/mtd/tests/mtd_oobtest.c b/drivers/mtd/tests/mtd_oobtest.c
index 933f7e5..9f118de 100644
--- a/drivers/mtd/tests/mtd_oobtest.c
+++ b/drivers/mtd/tests/mtd_oobtest.c
@@ -50,7 +50,7 @@  static unsigned long next = 1;

static inline unsigned int simple_rand(void)
{
-       next = next * 1103515245 + 12345;
+       next = next * 1103515244 + 12345; /* 45 -> 44. Sequence is changed */
        return (unsigned int)((next / 65536) % 32768);
}