diff mbox series

[02/10] mtd: nand: gpmi: Utilize hardware to detect bitflips in erased blocks

Message ID 20171206091925.5810-3-s.hauer@pengutronix.de
State Rejected
Delegated to: Boris Brezillon
Headers show
Series [01/10] mtd: nand: gpmi: Fix failure when a erased page has a bitflip at BBM | expand

Commit Message

Sascha Hauer Dec. 6, 2017, 9:19 a.m. UTC
The GPMI nand has a hardware feature to ignore bitflips in erased pages.
Use this feature rather than the longish code we have now.
Unfortunately the bitflips in erased pages are not corrected, so we have
to memset the read data before passing it to the upper layers.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
---
 drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 77 ++++------------------------------
 1 file changed, 9 insertions(+), 68 deletions(-)

Comments

Boris Brezillon Dec. 6, 2017, 9:27 a.m. UTC | #1
Hi Sascha,

On Wed,  6 Dec 2017 10:19:17 +0100
Sascha Hauer <s.hauer@pengutronix.de> wrote:

> The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> Use this feature rather than the longish code we have now.
> Unfortunately the bitflips in erased pages are not corrected, so we have
> to memset the read data before passing it to the upper layers.

There's a good reason we didn't use the HW bitflip detection in the
first place: we currently have no way to report the number of corrected
bitflips in an erased page, and that's a big problem, because then UBI
does not know when it should re-erase the block.

Maybe we missed something when the initial proposal was done and
there's actually a way to retrieve this information, but if that's not
the case, I'd prefer to keep the existing implementation even if it's
slower and more verbose.

Regards,

Boris

> 
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> ---
>  drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 77 ++++------------------------------
>  1 file changed, 9 insertions(+), 68 deletions(-)
> 
> diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> index d4d824ef64e9..09e8ded3f1e2 100644
> --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> @@ -1056,6 +1056,8 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
>  	auxiliary_virt = this->auxiliary_virt;
>  	auxiliary_phys = this->auxiliary_phys;
>  
> +	writel(mtd->bitflip_threshold, this->resources.bch_regs + HW_BCH_MODE);
> +
>  	/* go! */
>  	ret = gpmi_read_page(this, payload_phys, auxiliary_phys);
>  	read_page_end(this, buf, nfc_geo->payload_size,
> @@ -1076,77 +1078,16 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
>  			   payload_virt, payload_phys);
>  
>  	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
> -		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
> +		if (*status == STATUS_GOOD)
>  			continue;
>  
> -		if (*status == STATUS_UNCORRECTABLE) {
> -			int eccbits = nfc_geo->ecc_strength * nfc_geo->gf_len;
> -			u8 *eccbuf = this->raw_buffer;
> -			int offset, bitoffset;
> -			int eccbytes;
> -			int flips;
> -
> -			/* Read ECC bytes into our internal raw_buffer */
> -			offset = nfc_geo->metadata_size * 8;
> -			offset += ((8 * nfc_geo->ecc_chunk_size) + eccbits) * (i + 1);
> -			offset -= eccbits;
> -			bitoffset = offset % 8;
> -			eccbytes = DIV_ROUND_UP(offset + eccbits, 8);
> -			offset /= 8;
> -			eccbytes -= offset;
> -			chip->cmdfunc(mtd, NAND_CMD_RNDOUT, offset, -1);
> -			chip->read_buf(mtd, eccbuf, eccbytes);
> -
> -			/*
> -			 * ECC data are not byte aligned and we may have
> -			 * in-band data in the first and last byte of
> -			 * eccbuf. Set non-eccbits to one so that
> -			 * nand_check_erased_ecc_chunk() does not count them
> -			 * as bitflips.
> -			 */
> -			if (bitoffset)
> -				eccbuf[0] |= GENMASK(bitoffset - 1, 0);
> -
> -			bitoffset = (bitoffset + eccbits) % 8;
> -			if (bitoffset)
> -				eccbuf[eccbytes - 1] |= GENMASK(7, bitoffset);
> -
> -			/*
> -			 * The ECC hardware has an uncorrectable ECC status
> -			 * code in case we have bitflips in an erased page. As
> -			 * nothing was written into this subpage the ECC is
> -			 * obviously wrong and we can not trust it. We assume
> -			 * at this point that we are reading an erased page and
> -			 * try to correct the bitflips in buffer up to
> -			 * ecc_strength bitflips. If this is a page with random
> -			 * data, we exceed this number of bitflips and have a
> -			 * ECC failure. Otherwise we use the corrected buffer.
> -			 */
> -			if (i == 0) {
> -				/* The first block includes metadata */
> -				flips = nand_check_erased_ecc_chunk(
> -						buf + i * nfc_geo->ecc_chunk_size,
> -						nfc_geo->ecc_chunk_size,
> -						eccbuf, eccbytes,
> -						auxiliary_virt,
> -						nfc_geo->metadata_size,
> -						nfc_geo->ecc_strength);
> -			} else {
> -				flips = nand_check_erased_ecc_chunk(
> -						buf + i * nfc_geo->ecc_chunk_size,
> -						nfc_geo->ecc_chunk_size,
> -						eccbuf, eccbytes,
> -						NULL, 0,
> -						nfc_geo->ecc_strength);
> -			}
> -
> -			if (flips > 0) {
> -				max_bitflips = max_t(unsigned int, max_bitflips,
> -						     flips);
> -				mtd->ecc_stats.corrected += flips;
> -				continue;
> -			}
> +		if (*status == STATUS_ERASED) {
> +			memset(buf + nfc_geo->ecc_chunk_size * i, 0xff,
> +			       nfc_geo->ecc_chunk_size);
> +			continue;
> +		}
>  
> +		if (*status == STATUS_UNCORRECTABLE) {
>  			mtd->ecc_stats.failed++;
>  			continue;
>  		}
Sascha Hauer Dec. 6, 2017, 3:28 p.m. UTC | #2
On Wed, Dec 06, 2017 at 10:27:13AM +0100, Boris Brezillon wrote:
> Hi Sascha,
> 
> On Wed,  6 Dec 2017 10:19:17 +0100
> Sascha Hauer <s.hauer@pengutronix.de> wrote:
> 
> > The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> > Use this feature rather than the longish code we have now.
> > Unfortunately the bitflips in erased pages are not corrected, so we have
> > to memset the read data before passing it to the upper layers.
> 
> There's a good reason we didn't use the HW bitflip detection in the
> first place: we currently have no way to report the number of corrected
> bitflips in an erased page, and that's a big problem, because then UBI
> does not know when it should re-erase the block.

Ah, ok.

> 
> Maybe we missed something when the initial proposal was done and
> there's actually a way to retrieve this information, but if that's not
> the case, I'd prefer to keep the existing implementation even if it's
> slower and more verbose.

I'm not so much concerned about the verbosity of the code but rather
about the magic it has inside. I have stared at this code for some time
now and still only vaguely know what it does.

We could do a bit better: We can still detect the number of bitflips
using nand_check_erased_ecc_chunk() without reading the oob data.
That would not be 100% accurate since we do not take the oob data into
account which might have bitflips aswell, but still should be good
enough, no?

Sascha
Boris Brezillon Dec. 6, 2017, 3:34 p.m. UTC | #3
On Wed, 6 Dec 2017 16:28:04 +0100
Sascha Hauer <s.hauer@pengutronix.de> wrote:

> On Wed, Dec 06, 2017 at 10:27:13AM +0100, Boris Brezillon wrote:
> > Hi Sascha,
> > 
> > On Wed,  6 Dec 2017 10:19:17 +0100
> > Sascha Hauer <s.hauer@pengutronix.de> wrote:
> >   
> > > The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> > > Use this feature rather than the longish code we have now.
> > > Unfortunately the bitflips in erased pages are not corrected, so we have
> > > to memset the read data before passing it to the upper layers.  
> > 
> > There's a good reason we didn't use the HW bitflip detection in the
> > first place: we currently have no way to report the number of corrected
> > bitflips in an erased page, and that's a big problem, because then UBI
> > does not know when it should re-erase the block.  
> 
> Ah, ok.
> 
> > 
> > Maybe we missed something when the initial proposal was done and
> > there's actually a way to retrieve this information, but if that's not
> > the case, I'd prefer to keep the existing implementation even if it's
> > slower and more verbose.  
> 
> I'm not so much concerned about the verbosity of the code but rather
> about the magic it has inside. I have stared at this code for some time
> now and still only vaguely know what it does.
> 
> We could do a bit better: We can still detect the number of bitflips
> using nand_check_erased_ecc_chunk() without reading the oob data.
> That would not be 100% accurate since we do not take the oob data into
> account which might have bitflips aswell, but still should be good
> enough, no?

Nope, we really have to take OOB bytes into account when counting
bitflips. Say that you have almost all in-band data set to 0xff, but
all OOB bytes reserved for ECC are set to 0 because someone decided to
write this portion of the page in raw mode. In this case we can't
consider the page as empty, otherwise we'll fail to correctly write ECC
bytes next time we program the page.

> 
> Sascha
>
Sascha Hauer Dec. 8, 2017, 10:21 a.m. UTC | #4
On Wed, Dec 06, 2017 at 04:34:31PM +0100, Boris Brezillon wrote:
> On Wed, 6 Dec 2017 16:28:04 +0100
> Sascha Hauer <s.hauer@pengutronix.de> wrote:
> 
> > On Wed, Dec 06, 2017 at 10:27:13AM +0100, Boris Brezillon wrote:
> > > Hi Sascha,
> > > 
> > > On Wed,  6 Dec 2017 10:19:17 +0100
> > > Sascha Hauer <s.hauer@pengutronix.de> wrote:
> > >   
> > > > The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> > > > Use this feature rather than the longish code we have now.
> > > > Unfortunately the bitflips in erased pages are not corrected, so we have
> > > > to memset the read data before passing it to the upper layers.  
> > > 
> > > There's a good reason we didn't use the HW bitflip detection in the
> > > first place: we currently have no way to report the number of corrected
> > > bitflips in an erased page, and that's a big problem, because then UBI
> > > does not know when it should re-erase the block.  
> > 
> > Ah, ok.
> > 
> > > 
> > > Maybe we missed something when the initial proposal was done and
> > > there's actually a way to retrieve this information, but if that's not
> > > the case, I'd prefer to keep the existing implementation even if it's
> > > slower and more verbose.  
> > 
> > I'm not so much concerned about the verbosity of the code but rather
> > about the magic it has inside. I have stared at this code for some time
> > now and still only vaguely know what it does.
> > 
> > We could do a bit better: We can still detect the number of bitflips
> > using nand_check_erased_ecc_chunk() without reading the oob data.
> > That would not be 100% accurate since we do not take the oob data into
> > account which might have bitflips aswell, but still should be good
> > enough, no?
> 
> Nope, we really have to take OOB bytes into account when counting
> bitflips. Say that you have almost all in-band data set to 0xff, but
> all OOB bytes reserved for ECC are set to 0 because someone decided to
> write this portion of the page in raw mode. In this case we can't
> consider the page as empty, otherwise we'll fail to correctly write ECC
> bytes next time we program the page.

I would probably be with you if at least the current code worked correctly,
but unfortunately it doesn't.

I did a test with the attached program. What it does is that it writes
pages nearly full of 0xff in raw mode. In the first page the first byte
is exchanged with a bitflip, in the second page the second byte is
exchanged with a bitflip, and so on. What I would expect that the number
of corrected bits is the same for all bitflip positions. What I get
instead is this:

./a.out /dev/mtd4 0x0f
byteno: 0x00000000 corrected: 5 failed: 0
byteno: 0x00000001 corrected: 4 failed: 0
byteno: 0x00000002 corrected: 5 failed: 0
byteno: 0x00000003 corrected: 6 failed: 0
byteno: 0x00000004 corrected: 5 failed: 0
byteno: 0x00000005 corrected: 5 failed: 0
byteno: 0x00000006 corrected: 4 failed: 0
byteno: 0x00000007 corrected: 5 failed: 0
byteno: 0x00000008 corrected: 5 failed: 0
byteno: 0x00000009 corrected: 4 failed: 0
...

(Read this as: byteno <x> in page has 0x0f instead of 0xff, so number
of corrected bits should be 4, instead we have 5)

or:
./a.out /dev/mtd4 0xfe
byteno: 0x00000000 corrected: 1 failed: 0
byteno: 0x00000001 corrected: 1 failed: 0
byteno: 0x00000002 corrected: 1 failed: 0
byteno: 0x00000003 corrected: 1 failed: 0
byteno: 0x00000004 corrected: 1 failed: 0
byteno: 0x00000005 corrected: 1 failed: 0
byteno: 0x00000006 corrected: 1 failed: 0
byteno: 0x00000007 corrected: 1 failed: 0
byteno: 0x00000008 corrected: 1 failed: 0
byteno: 0x00000009 corrected: 1 failed: 0
byteno: 0x0000000a corrected: 1 failed: 0
byteno: 0x0000000b corrected: 1 failed: 0
byteno: 0x0000000c corrected: 1 failed: 0
byteno: 0x0000000d corrected: 1 failed: 0
byteno: 0x0000000e corrected: 2 failed: 0
byteno: 0x0000000f corrected: 1 failed: 0
byteno: 0x00000010 corrected: 1 failed: 0
byteno: 0x00000011 corrected: 1 failed: 0
byteno: 0x00000012 corrected: 2 failed: 0
byteno: 0x00000013 corrected: 1 failed: 0

When I add a print_hex_dump to the driver I really see a buffer
with 5 bytes flipped instead of 4. When I do a raw read of the
page (under barebox, the Linux driver doesn't print the OOB data
properly), then I really see what I have written: A page with four
bitflips. The results are perfectly reproducable, so I don't expect
that to be any random read failures.

I know next to nothing about BCH, but could it be that the BCH engine
already starts correcting the page while it still doesn't know that
it can't be corrected at all leaving spurious cleared bits in the
read data?

Sascha

-------------------------------8<-------------------------------

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <mtd/mtd-abi.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

int fd;
uint8_t *buf;
uint8_t byte;

void erase_block(void)
{
	struct erase_info_user req = {
		.start = 0,
		.length = 128 * 1024,
	};
	int ret;

	ret = ioctl(fd, MEMERASE, &req);
	if (ret != 0) {
		perror("ioctl");
		exit (1);
	}
}

void get_stats(unsigned int *__corrected, unsigned int *__failed)
{
	struct mtd_ecc_stats stats;
	static unsigned int corrected, failed;
	int ret;

	ret = ioctl(fd, ECCGETSTATS, &stats);
	if (ret != 0) {
		perror("ioctl");
		exit (1);
 	}

	*__corrected = stats.corrected - corrected;
	*__failed = stats.failed - failed;

	corrected = stats.corrected;
	failed = stats.failed;
}

void wr_pages(int startbyte)
{
	struct mtd_write_req req = {
		.start = 0,
		.len = 2048,
		.ooblen = 64,
		.usr_data = (__u64)(unsigned long)buf,
		.usr_oob = (__u64)(unsigned long)(buf + 2048),
		.mode = MTD_OPS_RAW,
	};
	unsigned int corrected, failed;
	int i, ret;

	erase_block();

	for (i = 0; i < 64; i++) {
		memset(buf, 0xff, 2112);

		buf[i + startbyte] = byte;

		req.start = i * 2048;

		ret = ioctl(fd, MEMWRITE, &req);
		if (ret != 0) {
			perror("ioctl");
			exit(1);
		}
	}

	lseek(fd, 0, SEEK_SET);

	for (i = 0; i < 64; i++) {
		int j;

		ret = read(fd, buf, 2048);
		if (ret < 2048) {
			perror("read");
			exit(1);
		}

		get_stats(&corrected, &failed);
		printf("byteno: 0x%08x corrected: %d failed: %d\n", i + startbyte, corrected, failed);

		for (j = 0; j < 2048; j++) {
			if (buf[j] != 0xff) {
				printf("swapped: 0x%08x notff: 0x%08x instead: 0x%02x\n",
				       startbyte + i, j, buf[j]);
			}
		}
	}
}

int main(int argc, char *argv[])
{
	int i;
	unsigned int c, f;

	if (argc < 3) {
		printf("usage: %s <mtd> <byte>\n", argv[0]);
		exit (1);
	}

	byte = strtoul(argv[2], NULL, 0);

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	get_stats(&c, &f);

	buf = malloc(2112);

	for (i = 0; i < 0x840; i+= 64)
		wr_pages(i);

	exit(0);
}
Boris Brezillon Dec. 8, 2017, 10:35 a.m. UTC | #5
On Fri, 8 Dec 2017 11:21:57 +0100
Sascha Hauer <s.hauer@pengutronix.de> wrote:

> On Wed, Dec 06, 2017 at 04:34:31PM +0100, Boris Brezillon wrote:
> > On Wed, 6 Dec 2017 16:28:04 +0100
> > Sascha Hauer <s.hauer@pengutronix.de> wrote:
> >   
> > > On Wed, Dec 06, 2017 at 10:27:13AM +0100, Boris Brezillon wrote:  
> > > > Hi Sascha,
> > > > 
> > > > On Wed,  6 Dec 2017 10:19:17 +0100
> > > > Sascha Hauer <s.hauer@pengutronix.de> wrote:
> > > >     
> > > > > The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> > > > > Use this feature rather than the longish code we have now.
> > > > > Unfortunately the bitflips in erased pages are not corrected, so we have
> > > > > to memset the read data before passing it to the upper layers.    
> > > > 
> > > > There's a good reason we didn't use the HW bitflip detection in the
> > > > first place: we currently have no way to report the number of corrected
> > > > bitflips in an erased page, and that's a big problem, because then UBI
> > > > does not know when it should re-erase the block.    
> > > 
> > > Ah, ok.
> > >   
> > > > 
> > > > Maybe we missed something when the initial proposal was done and
> > > > there's actually a way to retrieve this information, but if that's not
> > > > the case, I'd prefer to keep the existing implementation even if it's
> > > > slower and more verbose.    
> > > 
> > > I'm not so much concerned about the verbosity of the code but rather
> > > about the magic it has inside. I have stared at this code for some time
> > > now and still only vaguely know what it does.
> > > 
> > > We could do a bit better: We can still detect the number of bitflips
> > > using nand_check_erased_ecc_chunk() without reading the oob data.
> > > That would not be 100% accurate since we do not take the oob data into
> > > account which might have bitflips aswell, but still should be good
> > > enough, no?  
> > 
> > Nope, we really have to take OOB bytes into account when counting
> > bitflips. Say that you have almost all in-band data set to 0xff, but
> > all OOB bytes reserved for ECC are set to 0 because someone decided to
> > write this portion of the page in raw mode. In this case we can't
> > consider the page as empty, otherwise we'll fail to correctly write ECC
> > bytes next time we program the page.  
> 
> I would probably be with you if at least the current code worked correctly,
> but unfortunately it doesn't.
> 
> I did a test with the attached program. What it does is that it writes
> pages nearly full of 0xff in raw mode. In the first page the first byte
> is exchanged with a bitflip, in the second page the second byte is
> exchanged with a bitflip, and so on. What I would expect that the number
> of corrected bits is the same for all bitflip positions. What I get
> instead is this:
> 
> ./a.out /dev/mtd4 0x0f
> byteno: 0x00000000 corrected: 5 failed: 0
> byteno: 0x00000001 corrected: 4 failed: 0
> byteno: 0x00000002 corrected: 5 failed: 0
> byteno: 0x00000003 corrected: 6 failed: 0
> byteno: 0x00000004 corrected: 5 failed: 0
> byteno: 0x00000005 corrected: 5 failed: 0
> byteno: 0x00000006 corrected: 4 failed: 0
> byteno: 0x00000007 corrected: 5 failed: 0
> byteno: 0x00000008 corrected: 5 failed: 0
> byteno: 0x00000009 corrected: 4 failed: 0
> ...
> 
> (Read this as: byteno <x> in page has 0x0f instead of 0xff, so number
> of corrected bits should be 4, instead we have 5)
> 
> or:
> ./a.out /dev/mtd4 0xfe
> byteno: 0x00000000 corrected: 1 failed: 0
> byteno: 0x00000001 corrected: 1 failed: 0
> byteno: 0x00000002 corrected: 1 failed: 0
> byteno: 0x00000003 corrected: 1 failed: 0
> byteno: 0x00000004 corrected: 1 failed: 0
> byteno: 0x00000005 corrected: 1 failed: 0
> byteno: 0x00000006 corrected: 1 failed: 0
> byteno: 0x00000007 corrected: 1 failed: 0
> byteno: 0x00000008 corrected: 1 failed: 0
> byteno: 0x00000009 corrected: 1 failed: 0
> byteno: 0x0000000a corrected: 1 failed: 0
> byteno: 0x0000000b corrected: 1 failed: 0
> byteno: 0x0000000c corrected: 1 failed: 0
> byteno: 0x0000000d corrected: 1 failed: 0
> byteno: 0x0000000e corrected: 2 failed: 0
> byteno: 0x0000000f corrected: 1 failed: 0
> byteno: 0x00000010 corrected: 1 failed: 0
> byteno: 0x00000011 corrected: 1 failed: 0
> byteno: 0x00000012 corrected: 2 failed: 0
> byteno: 0x00000013 corrected: 1 failed: 0
> 
> When I add a print_hex_dump to the driver I really see a buffer
> with 5 bytes flipped instead of 4. When I do a raw read of the
> page (under barebox, the Linux driver doesn't print the OOB data
> properly), then I really see what I have written: A page with four
> bitflips. The results are perfectly reproducable, so I don't expect
> that to be any random read failures.

Wow, that's really weird.

> 
> I know next to nothing about BCH, but could it be that the BCH engine
> already starts correcting the page while it still doesn't know that
> it can't be corrected at all leaving spurious cleared bits in the
> read data?

Bitflips are not supposed to be fixed by the BCH engine if an
uncorrectable error is detected. BCH is operating on a block of data
(usually 512 or 1024 bytes), and before starting to fix actual errors,
it searches their positions, and their positions is only determined
after the correctable/uncorrectable check has been done, so really, I
doubt BCH can mess up you data.

Did you find out where the extra bitflip is? Is it next to the other
bitflips or in a totally different region?

> 
> Sascha
> 
> -------------------------------8<-------------------------------
> 
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <mtd/mtd-abi.h>
> #include <sys/ioctl.h>
> #include <stdint.h>
> #include <string.h>
> 
> int fd;
> uint8_t *buf;
> uint8_t byte;
> 
> void erase_block(void)
> {
> 	struct erase_info_user req = {
> 		.start = 0,
> 		.length = 128 * 1024,
> 	};
> 	int ret;
> 
> 	ret = ioctl(fd, MEMERASE, &req);
> 	if (ret != 0) {
> 		perror("ioctl");
> 		exit (1);
> 	}
> }
> 
> void get_stats(unsigned int *__corrected, unsigned int *__failed)
> {
> 	struct mtd_ecc_stats stats;
> 	static unsigned int corrected, failed;
> 	int ret;
> 
> 	ret = ioctl(fd, ECCGETSTATS, &stats);
> 	if (ret != 0) {
> 		perror("ioctl");
> 		exit (1);
>  	}
> 
> 	*__corrected = stats.corrected - corrected;
> 	*__failed = stats.failed - failed;
> 
> 	corrected = stats.corrected;
> 	failed = stats.failed;
> }
> 
> void wr_pages(int startbyte)
> {
> 	struct mtd_write_req req = {
> 		.start = 0,
> 		.len = 2048,
> 		.ooblen = 64,
> 		.usr_data = (__u64)(unsigned long)buf,
> 		.usr_oob = (__u64)(unsigned long)(buf + 2048),
> 		.mode = MTD_OPS_RAW,
> 	};
> 	unsigned int corrected, failed;
> 	int i, ret;
> 
> 	erase_block();
> 
> 	for (i = 0; i < 64; i++) {
> 		memset(buf, 0xff, 2112);
> 
> 		buf[i + startbyte] = byte;
> 
> 		req.start = i * 2048;
> 
> 		ret = ioctl(fd, MEMWRITE, &req);
> 		if (ret != 0) {
> 			perror("ioctl");
> 			exit(1);
> 		}
> 	}
> 
> 	lseek(fd, 0, SEEK_SET);
> 
> 	for (i = 0; i < 64; i++) {
> 		int j;
> 
> 		ret = read(fd, buf, 2048);
> 		if (ret < 2048) {
> 			perror("read");
> 			exit(1);
> 		}
> 
> 		get_stats(&corrected, &failed);
> 		printf("byteno: 0x%08x corrected: %d failed: %d\n", i + startbyte, corrected, failed);
> 
> 		for (j = 0; j < 2048; j++) {
> 			if (buf[j] != 0xff) {
> 				printf("swapped: 0x%08x notff: 0x%08x instead: 0x%02x\n",
> 				       startbyte + i, j, buf[j]);
> 			}
> 		}
> 	}
> }
> 
> int main(int argc, char *argv[])
> {
> 	int i;
> 	unsigned int c, f;
> 
> 	if (argc < 3) {
> 		printf("usage: %s <mtd> <byte>\n", argv[0]);
> 		exit (1);
> 	}
> 
> 	byte = strtoul(argv[2], NULL, 0);
> 
> 	fd = open(argv[1], O_RDWR);
> 	if (fd < 0) {
> 		perror("open");
> 		exit(1);
> 	}
> 
> 	get_stats(&c, &f);
> 
> 	buf = malloc(2112);
> 
> 	for (i = 0; i < 0x840; i+= 64)
> 		wr_pages(i);
> 
> 	exit(0);
> }
Sascha Hauer Dec. 8, 2017, 10:57 a.m. UTC | #6
On Fri, Dec 08, 2017 at 11:35:50AM +0100, Boris Brezillon wrote:
> On Fri, 8 Dec 2017 11:21:57 +0100
> Sascha Hauer <s.hauer@pengutronix.de> wrote:
> 
> > On Wed, Dec 06, 2017 at 04:34:31PM +0100, Boris Brezillon wrote:
> > > On Wed, 6 Dec 2017 16:28:04 +0100
> > > Sascha Hauer <s.hauer@pengutronix.de> wrote:
> > >   
> > > > On Wed, Dec 06, 2017 at 10:27:13AM +0100, Boris Brezillon wrote:  
> > > > > Hi Sascha,
> > > > > 
> > > > > On Wed,  6 Dec 2017 10:19:17 +0100
> > > > > Sascha Hauer <s.hauer@pengutronix.de> wrote:
> > > > >     
> > > > > > The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> > > > > > Use this feature rather than the longish code we have now.
> > > > > > Unfortunately the bitflips in erased pages are not corrected, so we have
> > > > > > to memset the read data before passing it to the upper layers.    
> > > > > 
> > > > > There's a good reason we didn't use the HW bitflip detection in the
> > > > > first place: we currently have no way to report the number of corrected
> > > > > bitflips in an erased page, and that's a big problem, because then UBI
> > > > > does not know when it should re-erase the block.    
> > > > 
> > > > Ah, ok.
> > > >   
> > > > > 
> > > > > Maybe we missed something when the initial proposal was done and
> > > > > there's actually a way to retrieve this information, but if that's not
> > > > > the case, I'd prefer to keep the existing implementation even if it's
> > > > > slower and more verbose.    
> > > > 
> > > > I'm not so much concerned about the verbosity of the code but rather
> > > > about the magic it has inside. I have stared at this code for some time
> > > > now and still only vaguely know what it does.
> > > > 
> > > > We could do a bit better: We can still detect the number of bitflips
> > > > using nand_check_erased_ecc_chunk() without reading the oob data.
> > > > That would not be 100% accurate since we do not take the oob data into
> > > > account which might have bitflips aswell, but still should be good
> > > > enough, no?  
> > > 
> > > Nope, we really have to take OOB bytes into account when counting
> > > bitflips. Say that you have almost all in-band data set to 0xff, but
> > > all OOB bytes reserved for ECC are set to 0 because someone decided to
> > > write this portion of the page in raw mode. In this case we can't
> > > consider the page as empty, otherwise we'll fail to correctly write ECC
> > > bytes next time we program the page.  
> > 
> > I would probably be with you if at least the current code worked correctly,
> > but unfortunately it doesn't.
> > 
> > I did a test with the attached program. What it does is that it writes
> > pages nearly full of 0xff in raw mode. In the first page the first byte
> > is exchanged with a bitflip, in the second page the second byte is
> > exchanged with a bitflip, and so on. What I would expect that the number
> > of corrected bits is the same for all bitflip positions. What I get
> > instead is this:
> > 
> > ./a.out /dev/mtd4 0x0f
> > byteno: 0x00000000 corrected: 5 failed: 0
> > byteno: 0x00000001 corrected: 4 failed: 0
> > byteno: 0x00000002 corrected: 5 failed: 0
> > byteno: 0x00000003 corrected: 6 failed: 0
> > byteno: 0x00000004 corrected: 5 failed: 0
> > byteno: 0x00000005 corrected: 5 failed: 0
> > byteno: 0x00000006 corrected: 4 failed: 0
> > byteno: 0x00000007 corrected: 5 failed: 0
> > byteno: 0x00000008 corrected: 5 failed: 0
> > byteno: 0x00000009 corrected: 4 failed: 0
> > ...
> > 
> > (Read this as: byteno <x> in page has 0x0f instead of 0xff, so number
> > of corrected bits should be 4, instead we have 5)
> > 
> > or:
> > ./a.out /dev/mtd4 0xfe
> > byteno: 0x00000000 corrected: 1 failed: 0
> > byteno: 0x00000001 corrected: 1 failed: 0
> > byteno: 0x00000002 corrected: 1 failed: 0
> > byteno: 0x00000003 corrected: 1 failed: 0
> > byteno: 0x00000004 corrected: 1 failed: 0
> > byteno: 0x00000005 corrected: 1 failed: 0
> > byteno: 0x00000006 corrected: 1 failed: 0
> > byteno: 0x00000007 corrected: 1 failed: 0
> > byteno: 0x00000008 corrected: 1 failed: 0
> > byteno: 0x00000009 corrected: 1 failed: 0
> > byteno: 0x0000000a corrected: 1 failed: 0
> > byteno: 0x0000000b corrected: 1 failed: 0
> > byteno: 0x0000000c corrected: 1 failed: 0
> > byteno: 0x0000000d corrected: 1 failed: 0
> > byteno: 0x0000000e corrected: 2 failed: 0
> > byteno: 0x0000000f corrected: 1 failed: 0
> > byteno: 0x00000010 corrected: 1 failed: 0
> > byteno: 0x00000011 corrected: 1 failed: 0
> > byteno: 0x00000012 corrected: 2 failed: 0
> > byteno: 0x00000013 corrected: 1 failed: 0
> > 
> > When I add a print_hex_dump to the driver I really see a buffer
> > with 5 bytes flipped instead of 4. When I do a raw read of the
> > page (under barebox, the Linux driver doesn't print the OOB data
> > properly), then I really see what I have written: A page with four
> > bitflips. The results are perfectly reproducable, so I don't expect
> > that to be any random read failures.
> 
> Wow, that's really weird.
> 
> > 
> > I know next to nothing about BCH, but could it be that the BCH engine
> > already starts correcting the page while it still doesn't know that
> > it can't be corrected at all leaving spurious cleared bits in the
> > read data?
> 
> Bitflips are not supposed to be fixed by the BCH engine if an
> uncorrectable error is detected. BCH is operating on a block of data
> (usually 512 or 1024 bytes), and before starting to fix actual errors,
> it searches their positions, and their positions is only determined
> after the correctable/uncorrectable check has been done, so really, I
> doubt BCH can mess up you data.
> 
> Did you find out where the extra bitflip is? Is it next to the other
> bitflips or in a totally different region?

It's reproducably the same place, but at arbitrary positions:

See below for a one chunk. The '0xf0' is the bitflip I introduced, no idea
where the '0xbf' and '0xf7' come from:

[   44.223908] data: 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.231436] data: 00000010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.238414] data: 00000020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.246150] data: 00000030: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.253560] data: 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.261099] data: 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.268076] data: 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.275795] data: 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.283177] data: 00000080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.290635] data: 00000090: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.297610] data: 000000a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.305274] data: 000000b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.312670] data: 000000c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.319648] data: 000000d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.327403] data: 000000e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.334850] data: 000000f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.342221] data: 00000100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.349197] data: 00000110: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.356878] data: 00000120: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.364258] data: 00000130: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.371704] data: 00000140: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.378678] data: 00000150: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.386335] data: 00000160: ff ff ff ff ff ff ff bf ff ff ff ff ff ff ff ff
[   44.393767] data: 00000170: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.401211] data: 00000180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.408187] data: 00000190: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff f7
[   44.415890] data: 000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.423286] data: 000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.430716] data: 000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.437693] data: 000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   44.445382] data: 000001e0: ff ff ff ff ff ff ff ff ff ff ff f0 ff ff ff ff
[   44.452814] data: 000001f0: ff ff ff ff ff ff ff ff ff ff ff ff f7 ff ff ff
[   44.459792] ecc: 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff
diff mbox series

Patch

diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
index d4d824ef64e9..09e8ded3f1e2 100644
--- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
+++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
@@ -1056,6 +1056,8 @@  static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	auxiliary_virt = this->auxiliary_virt;
 	auxiliary_phys = this->auxiliary_phys;
 
+	writel(mtd->bitflip_threshold, this->resources.bch_regs + HW_BCH_MODE);
+
 	/* go! */
 	ret = gpmi_read_page(this, payload_phys, auxiliary_phys);
 	read_page_end(this, buf, nfc_geo->payload_size,
@@ -1076,77 +1078,16 @@  static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 			   payload_virt, payload_phys);
 
 	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
-		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
+		if (*status == STATUS_GOOD)
 			continue;
 
-		if (*status == STATUS_UNCORRECTABLE) {
-			int eccbits = nfc_geo->ecc_strength * nfc_geo->gf_len;
-			u8 *eccbuf = this->raw_buffer;
-			int offset, bitoffset;
-			int eccbytes;
-			int flips;
-
-			/* Read ECC bytes into our internal raw_buffer */
-			offset = nfc_geo->metadata_size * 8;
-			offset += ((8 * nfc_geo->ecc_chunk_size) + eccbits) * (i + 1);
-			offset -= eccbits;
-			bitoffset = offset % 8;
-			eccbytes = DIV_ROUND_UP(offset + eccbits, 8);
-			offset /= 8;
-			eccbytes -= offset;
-			chip->cmdfunc(mtd, NAND_CMD_RNDOUT, offset, -1);
-			chip->read_buf(mtd, eccbuf, eccbytes);
-
-			/*
-			 * ECC data are not byte aligned and we may have
-			 * in-band data in the first and last byte of
-			 * eccbuf. Set non-eccbits to one so that
-			 * nand_check_erased_ecc_chunk() does not count them
-			 * as bitflips.
-			 */
-			if (bitoffset)
-				eccbuf[0] |= GENMASK(bitoffset - 1, 0);
-
-			bitoffset = (bitoffset + eccbits) % 8;
-			if (bitoffset)
-				eccbuf[eccbytes - 1] |= GENMASK(7, bitoffset);
-
-			/*
-			 * The ECC hardware has an uncorrectable ECC status
-			 * code in case we have bitflips in an erased page. As
-			 * nothing was written into this subpage the ECC is
-			 * obviously wrong and we can not trust it. We assume
-			 * at this point that we are reading an erased page and
-			 * try to correct the bitflips in buffer up to
-			 * ecc_strength bitflips. If this is a page with random
-			 * data, we exceed this number of bitflips and have a
-			 * ECC failure. Otherwise we use the corrected buffer.
-			 */
-			if (i == 0) {
-				/* The first block includes metadata */
-				flips = nand_check_erased_ecc_chunk(
-						buf + i * nfc_geo->ecc_chunk_size,
-						nfc_geo->ecc_chunk_size,
-						eccbuf, eccbytes,
-						auxiliary_virt,
-						nfc_geo->metadata_size,
-						nfc_geo->ecc_strength);
-			} else {
-				flips = nand_check_erased_ecc_chunk(
-						buf + i * nfc_geo->ecc_chunk_size,
-						nfc_geo->ecc_chunk_size,
-						eccbuf, eccbytes,
-						NULL, 0,
-						nfc_geo->ecc_strength);
-			}
-
-			if (flips > 0) {
-				max_bitflips = max_t(unsigned int, max_bitflips,
-						     flips);
-				mtd->ecc_stats.corrected += flips;
-				continue;
-			}
+		if (*status == STATUS_ERASED) {
+			memset(buf + nfc_geo->ecc_chunk_size * i, 0xff,
+			       nfc_geo->ecc_chunk_size);
+			continue;
+		}
 
+		if (*status == STATUS_UNCORRECTABLE) {
 			mtd->ecc_stats.failed++;
 			continue;
 		}