diff mbox

[3/3] nandtest: Introduce multiple reads & check iterations

Message ID 1398690859-11494-4-git-send-email-ezequiel@vanguardiasur.com.ar
State Changes Requested
Headers show

Commit Message

Ezequiel Garcia April 28, 2014, 1:14 p.m. UTC
The current nandtest performs a simple test which consists of:

  1. erase block
  2. write data
  3. read and verify

In order to improve the nandtest strength, this commit adds a new parameter
to increase the number of "read and verify" iterations. In other words,
the test now consists of:

  1. erase block
  2. write data
  3. read and verify (N times)

This seem to apply more pressure on a NAND driver's ECC engine and has been
used to discover stability problems with an old OMAP2.

Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
---
 nandtest.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

Comments

Artem Bityutskiy May 5, 2014, 7:35 a.m. UTC | #1
On Mon, 2014-04-28 at 10:14 -0300, Ezequiel Garcia wrote:
>                         }
> -                       if (erase_and_write(test_ofs, wbuf, rbuf))
> +                       if (erase_and_write(test_ofs, wbuf, rbuf,
> nr_reads))
>                                 continue;
>                         if (keep_contents)
> -                               erase_and_write(test_ofs, kbuf, rbuf);
> +                               erase_and_write(test_ofs, kbuf, rbuf,
> 1);

I think it would make sense to make the default to be at least 2, or may
be 4.

If you added this improvement, let's make sure people have it on by
default.
Ezequiel Garcia May 5, 2014, 9:56 a.m. UTC | #2
On 5 May 2014 04:35, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> On Mon, 2014-04-28 at 10:14 -0300, Ezequiel Garcia wrote:
>>                         }
>> -                       if (erase_and_write(test_ofs, wbuf, rbuf))
>> +                       if (erase_and_write(test_ofs, wbuf, rbuf,
>> nr_reads))
>>                                 continue;
>>                         if (keep_contents)
>> -                               erase_and_write(test_ofs, kbuf, rbuf);
>> +                               erase_and_write(test_ofs, kbuf, rbuf,
>> 1);
>
> I think it would make sense to make the default to be at least 2, or may
> be 4.
>
> If you added this improvement, let's make sure people have it on by
> default.
>

No problem. How about 4?

Any other ideas from the community? (Ccing more NAND driver maintainers).
pekon gupta May 5, 2014, 10:07 a.m. UTC | #3
>From: linux-mtd [mailto:linux-mtd-bounces@lists.infradead.org] On
>
>The current nandtest performs a simple test which consists of:
>
>  1. erase block
>  2. write data
>  3. read and verify
>
>In order to improve the nandtest strength, this commit adds a new
>parameter
>to increase the number of "read and verify" iterations. In other
>words,
>the test now consists of:
>
>  1. erase block
>  2. write data
>  3. read and verify (N times)
>
>This seem to apply more pressure on a NAND driver's ECC engine and
>has been
>used to discover stability problems with an old OMAP2.
>
If you are just re-verifying "reads", then you may be testing unstable bits [1],
which is not a valid driver's fault but a problem arising due to sudden power-cut.
If you really want to test driver then iterate all the steps (erase -> write -> read)
multiple times. Same as what is done in torture_peb() test.
	@@ drivers/mtd/ubi/io.c: torture_peb()

[1] http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits

With regards, pekon
Ezequiel Garcia May 5, 2014, 10:33 a.m. UTC | #4
On 05 May 10:07 AM, Gupta, Pekon wrote:
> >From: linux-mtd [mailto:linux-mtd-bounces@lists.infradead.org] On
> >
> >The current nandtest performs a simple test which consists of:
> >
> >  1. erase block
> >  2. write data
> >  3. read and verify
> >
> >In order to improve the nandtest strength, this commit adds a new
> >parameter
> >to increase the number of "read and verify" iterations. In other
> >words,
> >the test now consists of:
> >
> >  1. erase block
> >  2. write data
> >  3. read and verify (N times)
> >
> >This seem to apply more pressure on a NAND driver's ECC engine and
> >has been
> >used to discover stability problems with an old OMAP2.
> >
> If you are just re-verifying "reads", then you may be testing unstable bits [1],
> which is not a valid driver's fault but a problem arising due to sudden power-cut.
> If you really want to test driver then iterate all the steps (erase -> write -> read)
> multiple times. Same as what is done in torture_peb() test.
> 	@@ drivers/mtd/ubi/io.c: torture_peb()
> 

I'm sorry Pekon, but your comment makes no sense to me.

First of all, we're adding a new nandtest capability. The tool *already*
handles multiple erase/write/read cycle (by using the --passes parameter)
and one can already use it to stress drivers. This is not under discussion.

However, while testing the OMAP2 NAND driver provided in TI SKD 6.0.0
(the one with a v3.2 kernel) the nandtest was left running a large number
of times, using the --passes parameter each block was erase/write/read
lots and lots of times. So, *that* particular test passed without issues.

And still, since we were still observing instability when doing filesystem
operations we developed this new test, which consists in
erase/write/read/.../read.

Now, since each block is *erased* before the write/read/.../read loop, how is
this related to the unstable bit issue?

In case it's not clear, we never did *any* power-cutting, and still this
improved test quickly showed ECC read errors in the mentioned driver.
pekon gupta May 5, 2014, 10:58 a.m. UTC | #5
Hi Ezequiel,

>From: Ezequiel Garcia [mailto:ezequiel@vanguardiasur.com.ar]
>>On 05 May 10:07 AM, Gupta, Pekon wrote:
>> >From: linux-mtd [mailto:linux-mtd-bounces@lists.infradead.org]

[...]

>> >This seem to apply more pressure on a NAND driver's ECC engine
>and
>> >has been
>> >used to discover stability problems with an old OMAP2.
>> >
>> If you are just re-verifying "reads", then you may be testing
>unstable bits [1],
>> which is not a valid driver's fault but a problem arising due to sudden
>power-cut.
>> If you really want to test driver then iterate all the steps (erase ->
>write -> read)
>> multiple times. Same as what is done in torture_peb() test.
>> 	@@ drivers/mtd/ubi/io.c: torture_peb()
>>
>
>I'm sorry Pekon, but your comment makes no sense to me.
>
>First of all, we're adding a new nandtest capability. The tool *already*
>handles multiple erase/write/read cycle (by using the --passes
>parameter)
>and one can already use it to stress drivers. This is not under
>discussion.
>
>However, while testing the OMAP2 NAND driver provided in TI SKD
>6.0.0
>(the one with a v3.2 kernel) the nandtest was left running a large
>number
>of times, using the --passes parameter each block was
>erase/write/read
>lots and lots of times. So, *that* particular test passed without issues.
>
>And still, since we were still observing instability when doing
>filesystem
>operations we developed this new test, which consists in
>erase/write/read/.../read.
>
>Now, since each block is *erased* before the write/read/.../read
>loop, how is
>this related to the unstable bit issue?
>
>In case it's not clear, we never did *any* power-cutting, and still this
>improved test quickly showed ECC read errors in the mentioned
>driver.
>--
>Ezequiel Garcia, VanguardiaSur
>www.vanguardiasur.com.ar


Ok.. I now get the background.
But ideally by re-reading the data you are just invoking the same data path
Again, And so it's unlikely that you are un-covering any driver issues.
However, re-reading the device may introduce some read-disturb errors
which are causing some additional effects in subsequent reads.
Therefore, I'm not sure having re-reads is a good test or not, because
re-reads is changing the underlying testing scenario by introducing _new_
bit-flips in neighboring regions pages because of read-disturbs.


However, I know of some issues in OMAP NAND driver bundled with
3.2 kernel, which might be helpful in nailing down your specific issue.

(1) 3.2 kernel does not have concept of bitflip_threshold, so by default
scrubbing and peb_torture happens even for single bit-flips. So please
pull-in following patch series.
[PATCH v2 0/7] mtd: Change meaning of -EUCLEAN return code on reads
http://lists.infradead.org/pipermail/linux-mtd/2012-April/040945.html

(2) OMAP NAND in 3.2 kernel does not factor bit-flips in empty pages
Hence if empty pages with bit-flips are encountered, then it treats them
like programmed pages and expects a ECC correction on them. But as
empty pages do not have ECC stored in OOB, the driver bails out giving
'uncorrectable ecc' read errors.


with regards, pekon
Artem Bityutskiy May 5, 2014, 11:09 a.m. UTC | #6
On Mon, 2014-05-05 at 10:58 +0000, Gupta, Pekon wrote:
> Therefore, I'm not sure having re-reads is a good test or not, because
> re-reads is changing the underlying testing scenario by introducing
> _new_
> bit-flips in neighboring regions pages because of read-disturbs.

If the driver / HW cannot properly handle 4 re-reads, we do want the
test to catch this, I think, irrespective of the true nature of the
error.

Therefore, I see this patch with a default 4 as an improvement.
pekon gupta May 5, 2014, 11:21 a.m. UTC | #7
Hi Artem,

>From: Artem Bityutskiy [mailto:dedekind1@gmail.com]
>>On Mon, 2014-05-05 at 10:58 +0000, Gupta, Pekon wrote:
>> Therefore, I'm not sure having re-reads is a good test or not,
>because
>> re-reads is changing the underlying testing scenario by introducing
>> _new_
>> bit-flips in neighboring regions pages because of read-disturbs.
>
>If the driver / HW cannot properly handle 4 re-reads, we do want the
>test to catch this, I think, irrespective of the true nature of the
>error.
>
>Therefore, I see this patch with a default 4 as an improvement.
>
>--
>Best Regards,
>Artem Bityutskiy

I'm not against the patch, but my thought is that re-reads are
introducing read-disturb bit-flips.
Now in event of no UBIFS there would be no scrubbing. So these bit-flips
will keep on accumulating. And once these bit-flips cross beyond
ecc.strength then the nand_read is bound to fail.
So this makes this test un-deterministic.

Example: If I use this test with any driver using 1-bit ECC correction. Then
- test will most likely fail on weak or aged NAND devices, as they are more
  prone to read-disturb errors.
- But _may_ not fail on fresh devices using same driver with 1-bit ECC correction.

So my argument is that this test is actually not testing the NAND driver,
it is actually testing the weakness of NAND device.


with regards, pekon
Ezequiel Garcia May 5, 2014, 12:50 p.m. UTC | #8
I've changed the subject of this mail, since it seems we've moved from the
discussion.

On 05 May 10:58 AM, Gupta, Pekon wrote:
> 
> However, I know of some issues in OMAP NAND driver bundled with
> 3.2 kernel, which might be helpful in nailing down your specific issue.
> 

FWIW, I'm not using such driver anymore. Right after spotting this, we've
ran the *same* test on the mainline driver, which passed. This was the kick
we needed to start using mainline (we also need mainline for other reasons).

> (1) 3.2 kernel does not have concept of bitflip_threshold, so by default
> scrubbing and peb_torture happens even for single bit-flips. So please
> pull-in following patch series.
> [PATCH v2 0/7] mtd: Change meaning of -EUCLEAN return code on reads
> http://lists.infradead.org/pipermail/linux-mtd/2012-April/040945.html
> 

Isn't this series related to the higher-level treatment of bitflips?
In that case, it's not related to the issue I got. Keep in mind that
this simple test failed in our v3.2 stress (using this nandtest patch):

  * erase
  * write
  * read
  * read
  * ...
  * read (20 times)

> (2) OMAP NAND in 3.2 kernel does not factor bit-flips in empty pages
> Hence if empty pages with bit-flips are encountered, then it treats them
> like programmed pages and expects a ECC correction on them. But as
> empty pages do not have ECC stored in OOB, the driver bails out giving
> 'uncorrectable ecc' read errors.
> 

Since the test doesn't involve empty pages, I'd say this is not relevant.

If you're still interested in debugging the problematic TI SDK omap2-nand
driver, I suggest that you try using this nandtest patch and see how it
goes.

Despite you saying drivers can fail the test, I think it's still a nice test.
Keep in mind the nandtest tool reports the number of corrected ECC errors
after reads, so if that number is adding-up and increasing you can even
use this patch to see this evolution.

All in all, I think it's still a nice improvement on stock nandtest.
pekon gupta May 5, 2014, 6:12 p.m. UTC | #9
>From: Ezequiel Garcia [mailto:ezequiel@vanguardiasur.com.ar]
>
>I've changed the subject of this mail, since it seems we've moved from the
>discussion.
>
>On 05 May 10:58 AM, Gupta, Pekon wrote:
>>
>> However, I know of some issues in OMAP NAND driver bundled with
>> 3.2 kernel, which might be helpful in nailing down your specific issue.
>>
>
>FWIW, I'm not using such driver anymore. Right after spotting this, we've
>ran the *same* test on the mainline driver, which passed. This was the kick
>we needed to start using mainline (we also need mainline for other reasons).
>
>> (1) 3.2 kernel does not have concept of bitflip_threshold, so by default
>> scrubbing and peb_torture happens even for single bit-flips. So please
>> pull-in following patch series.
>> [PATCH v2 0/7] mtd: Change meaning of -EUCLEAN return code on reads
>> http://lists.infradead.org/pipermail/linux-mtd/2012-April/040945.html
>>
>
>Isn't this series related to the higher-level treatment of bitflips?
>In that case, it's not related to the issue I got. Keep in mind that
>this simple test failed in our v3.2 stress (using this nandtest patch):
>
>  * erase
>  * write
>  * read
>  * read
>  * ...
>  * read (20 times)
>
>> (2) OMAP NAND in 3.2 kernel does not factor bit-flips in empty pages
>> Hence if empty pages with bit-flips are encountered, then it treats them
>> like programmed pages and expects a ECC correction on them. But as
>> empty pages do not have ECC stored in OOB, the driver bails out giving
>> 'uncorrectable ecc' read errors.
>>
>
>Since the test doesn't involve empty pages, I'd say this is not relevant.
>
>If you're still interested in debugging the problematic TI SDK omap2-nand
>driver, I suggest that you try using this nandtest patch and see how it
>goes.
>
>Despite you saying drivers can fail the test, I think it's still a nice test.
>Keep in mind the nandtest tool reports the number of corrected ECC errors
>after reads, so if that number is adding-up and increasing you can even
>use this patch to see this evolution.
>
>All in all, I think it's still a nice improvement on stock nandtest.

I was just presenting my view that
With multiple re-reads nand-test may fail randomly on some devices
due of accumulation of bitflips because of read-disturb errors.
However, this may not be the case on actual file-system like UBIFS
where upper layer performs scrubbing to avoid bit-flip accumulation.

Otherwise I have no issue with the patch. So if Artem | Brian feel okay,
they can anyways pick the patch.


with regards, pekon
Brian Norris May 30, 2014, 11:49 p.m. UTC | #10
On Mon, May 05, 2014 at 06:12:55PM +0000, Pekon Gupta wrote:
> >From: Ezequiel Garcia [mailto:ezequiel@vanguardiasur.com.ar]
> >All in all, I think it's still a nice improvement on stock nandtest.
> 
> I was just presenting my view that
> With multiple re-reads nand-test may fail randomly on some devices
> due of accumulation of bitflips because of read-disturb errors.

Read disturb is not totally "random." It has definite causes, and it
should be relatively obvious if we're observing it in the controlled
environment of nandtest, rather than amidst the complexity of UBI(FS)
scrubbing, wear-leveling, etc.

> However, this may not be the case on actual file-system like UBIFS
> where upper layer performs scrubbing to avoid bit-flip accumulation.
> 
> Otherwise I have no issue with the patch. So if Artem | Brian feel okay,
> they can anyways pick the patch.

For the record, I agree with Ezequiel. This is a good (small)
improvement to the test. If for any reason, we cannot reread the same
data consistently for 4 tries immediately after programming it, then we
have a big problem. And even if read disturb is causing an accumulation
of bitflips, I think this would be a nice way to view that progression.

Brian
diff mbox

Patch

diff --git a/nandtest.c b/nandtest.c
index 31301d6..32877b8 100644
--- a/nandtest.c
+++ b/nandtest.c
@@ -24,6 +24,7 @@  void usage(int status)
 		"  -m, --markbad        Mark blocks bad if they appear so\n"
 		"  -s, --seed           Supply random seed\n"
 		"  -p, --passes         Number of passes\n"
+		"  -r, --reads          Read & check iterations per pass (default=1)\n"
 		"  -o, --offset         Start offset on flash\n"
 		"  -l, --length         Length of flash to test\n"
 		"  -k, --keep           Restore existing contents after test\n",
@@ -37,14 +38,11 @@  int fd;
 int markbad=0;
 int seed;
 
-void read_and_compare(loff_t ofs, unsigned char *data, unsigned char *rbuf)
+int read_and_compare(loff_t ofs, unsigned char *data, unsigned char *rbuf)
 {
 	ssize_t len;
 	int i;
 
-	printf("\r%08x: reading...", (unsigned)ofs);
-	fflush(stdout);
-
 	len = pread(fd, rbuf, meminfo.erasesize, ofs);
 	if (len < meminfo.erasesize) {
 		printf("\n");
@@ -84,14 +82,16 @@  void read_and_compare(loff_t ofs, unsigned char *data, unsigned char *rbuf)
 				printf("Byte 0x%x is %02x should be %02x\n",
 				       i, rbuf[i], data[i]);
 		}
-		exit(1);
+		return 1;
 	}
+	return 0;
 }
 
-int erase_and_write(loff_t ofs, unsigned char *data, unsigned char *rbuf)
+int erase_and_write(loff_t ofs, unsigned char *data, unsigned char *rbuf, int nr_reads)
 {
 	struct erase_info_user er;
 	ssize_t len;
+	int i, read_errs = 0;
 
 	printf("\r%08x: erasing... ", (unsigned)ofs);
 	fflush(stdout);
@@ -127,7 +127,16 @@  int erase_and_write(loff_t ofs, unsigned char *data, unsigned char *rbuf)
 		exit(1);
 	}
 
-	read_and_compare(ofs, data, rbuf);
+	for (i=1; i<=nr_reads; i++) {
+		printf("\r%08x: reading (%d of %d)...", (unsigned)ofs, i, nr_reads);
+		fflush(stdout);
+		if (read_and_compare(ofs, data, rbuf))
+			read_errs++;
+	}
+	if (read_errs) {
+		fprintf(stderr, "read/check %d of %d failed. seed %d\n", read_errs, nr_reads, seed);
+		return 1;
+	}
 	return 0;
 }
 
@@ -141,6 +150,7 @@  int main(int argc, char **argv)
 	unsigned char *wbuf, *rbuf, *kbuf;
 	int pass;
 	int nr_passes = 1;
+	int nr_reads = 1;
 	int keep_contents = 0;
 	uint32_t offset = 0;
 	uint32_t length = -1;
@@ -148,7 +158,7 @@  int main(int argc, char **argv)
 	seed = time(NULL);
 
 	for (;;) {
-		static const char short_options[] = "hkl:mo:p:s:";
+		static const char short_options[] = "hkl:mo:p:r:s:";
 		static const struct option long_options[] = {
 			{ "help", no_argument, 0, 'h' },
 			{ "markbad", no_argument, 0, 'm' },
@@ -156,6 +166,7 @@  int main(int argc, char **argv)
 			{ "passes", required_argument, 0, 'p' },
 			{ "offset", required_argument, 0, 'o' },
 			{ "length", required_argument, 0, 'l' },
+			{ "reads", optional_argument, 0, 'r' },
 			{ "keep", no_argument, 0, 'k' },
 			{0, 0, 0, 0},
 		};
@@ -189,6 +200,10 @@  int main(int argc, char **argv)
 			nr_passes = atol(optarg);
 			break;
 
+		case 'r':
+			nr_reads = atol(optarg);
+			break;
+
 		case 'o':
 			offset = atol(optarg);
 			break;
@@ -286,10 +301,10 @@  int main(int argc, char **argv)
 					exit(1);
 				}
 			}
-			if (erase_and_write(test_ofs, wbuf, rbuf))
+			if (erase_and_write(test_ofs, wbuf, rbuf, nr_reads))
 				continue;
 			if (keep_contents)
-				erase_and_write(test_ofs, kbuf, rbuf);
+				erase_and_write(test_ofs, kbuf, rbuf, 1);
 		}
 		printf("\nFinished pass %d successfully\n", pass+1);
 	}