diff mbox

[1/2] mtd: nand: add erased-page bitflip correction

Message ID 1394529112-9659-1-git-send-email-computersforpeace@gmail.com
State Rejected
Headers show

Commit Message

Brian Norris March 11, 2014, 9:11 a.m. UTC
Upper layers (e.g., UBI/UBIFS) expect that pages that have been erased
will return all 1's (0xff). However, modern NAND (MLC, and even some
SLC) experience bitflips in unprogrammed pages, and so they may not read
back all 1's. This is problematic for drivers whose ECC cannot correct
bitflips in an all-0xff page, as they will report an ECC error
(-EBADMSG) when they come across such a page. This appears to UBIFS as
"corrupt empty space".

Several others [1][2] are attempting to solve this problem, but none is
generically useful, and most (all?) have subtle bugs in their reasoning. Let's
implement a simple, generic, and correct solution instead.

To handle such situations, we should implement the following software
workaround for drivers that need it: when the hardware driver reports an
ECC error, nand_base should "verify" this error by

 * Re-reading the page without ECC
 * counting the number of 0 bits in each ECC sector (N[i], for i = 0, 1,
   ..., chip->ecc.steps)
 * If any N[i] exceeds the ECC strength (and thus, the maximum allowable
   bitflips) then we consider this to be an uncorrectable sector.
 * If all N[i] are less than the ECC strength, then we "correct" the
   output to all-0xff and report max-bitflips appropriately

Obviously, this sequence can be compute intensive if applied heavily.
Ideally, MTD users should not need to read un-programmed pages very
often and would require this software check, for instance, only during
attach/mount checks or certain recovery situations.

Drivers can request support for this new feature by OR'ing
NAND_ECC_NEED_CHECK_FF into their chip->options.

[1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051513.html
[2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051585.html

Signed-off-by: Brian Norris <computersforpeace@gmail.com>
---
 drivers/mtd/nand/nand_base.c | 68 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mtd/nand.h     |  5 ++++
 2 files changed, 73 insertions(+)

Comments

pekon gupta March 11, 2014, 10:12 a.m. UTC | #1
Hi Brian,

Few comments below...

>From: Brian Norris [mailto:computersforpeace@gmail.com]
>
>Upper layers (e.g., UBI/UBIFS) expect that pages that have been erased
>will return all 1's (0xff). However, modern NAND (MLC, and even some
>SLC) experience bitflips in unprogrammed pages, and so they may not read
>back all 1's. This is problematic for drivers whose ECC cannot correct
>bitflips in an all-0xff page, as they will report an ECC error
>(-EBADMSG) when they come across such a page. This appears to UBIFS as
>"corrupt empty space".
>
I'm not sure, if any driver can solve this either. As if an erased-page has no
ECC stored in OOB then how can any driver correct the bit-flips.
The only other way is to manually scanning through entire read_buf() and
count the bit-flips, which is similar what we are doing here.


>Several others [1][2] are attempting to solve this problem, but none is
>generically useful, and most (all?) have subtle bugs in their reasoning. Let's
>implement a simple, generic, and correct solution instead.
>
>To handle such situations, we should implement the following software
>workaround for drivers that need it: when the hardware driver reports an
>ECC error, nand_base should "verify" this error by
>
> * Re-reading the page without ECC
> * counting the number of 0 bits in each ECC sector (N[i], for i = 0, 1,
>   ..., chip->ecc.steps)
> * If any N[i] exceeds the ECC strength (and thus, the maximum allowable
>   bitflips) then we consider this to be an uncorrectable sector.
> * If all N[i] are less than the ECC strength, then we "correct" the
>   output to all-0xff and report max-bitflips appropriately
>
>Obviously, this sequence can be compute intensive if applied heavily.
>Ideally, MTD users should not need to read un-programmed pages very
>often and would require this software check, for instance, only during
>attach/mount checks or certain recovery situations.
>
>Drivers can request support for this new feature by OR'ing
>NAND_ECC_NEED_CHECK_FF into their chip->options.
>
>[1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051513.html
>[2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051585.html
>
>Signed-off-by: Brian Norris <computersforpeace@gmail.com>
>---
> drivers/mtd/nand/nand_base.c | 68 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/mtd/nand.h     |  5 ++++
> 2 files changed, 73 insertions(+)
>
>diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
>index 5826da39216e..f3723770e43a 100644
>--- a/drivers/mtd/nand/nand_base.c
>+++ b/drivers/mtd/nand/nand_base.c
>@@ -1483,6 +1483,62 @@ static int nand_setup_read_retry(struct mtd_info *mtd, int retry_mode)
> }
>
> /**
>+ * nand_verify_erased_page - [INTERN] Check a page for 0xff + bitflips
>+ * @mtd: MTD device structure
>+ * @buf: read buffer; will contain output data (either raw or "corrected")
>+ * @page: page to check
>+ *
>+ * Check a page to see if it is erased (w/ bitflips) after an uncorrectable ECC
>+ * error
>+ *
>+ * On a real error, return a negative error code (-EBADMSG for ECC error), and
>+ * buf will contain raw data.
>+ * Otherwise, fill buf with 0xff and return the maximum number of
>+ * bitflips per ECC sector to the caller.
>+ */
>+static int nand_verify_erased_page(struct mtd_info *mtd, uint8_t *buf, int page)
>+{
>+	struct nand_chip *chip = mtd->priv;
>+	int i, oobbits, databits;
>+	uint8_t *data = buf, *oob = chip->oob_poi;
>+	unsigned int max_bitflips = 0;
>+	int oob_step = mtd->oobsize / chip->ecc.steps;
>+	int ret;
>+
>+	oobbits = oob_step * 8;
>+	databits = chip->ecc.size * 8;
>+
>+	/* read without ECC for verification */
>+	ret = chip->ecc.read_page_raw(mtd, chip, buf, true, page);

You can re-use the *buf and *oobpoi  filled in by chip->ecc.read_page(),
Instead of re-reading the read_page_raw. 
(1) Re-using buffers will ensure, that you do not re-count any bit-flips
    which were already corrected by driver.
(2) it will save overhead of re-reading the page.


>+	if (ret)
>+		return ret;
>+
>+	/* Check each ECC step individually */
>+	for (i = 0; i < chip->ecc.steps; i++) {
>+		unsigned int flips = databits + oobbits;
>+
>+		flips -= bitmap_weight((unsigned long *)oob, oobbits);
>+		flips -= bitmap_weight((unsigned long *)data, databits);

It was observed that this scanning of all(0xff) in both OOB and Data buffers
caused significant performance penalty, especially with MLC and newer
technology NAND devices (as they exhibit extensive read-disturb errors
in erased-pages).

Hence as done in [2], a special function "count_zero_bits()" was introduced
which will return, as soon as 0-bit count exceeded ecc.strength. So that you
don't end-up counting complete buffer even if count exceeded the limit.

Also please use staggered approach of first counting bit-flips in OOB and
then in data buffer, so that you don't over count.
	int *bitflip_count;
	*bitflip_count = 0;
	/* count number of 0-bits in OOB */
	 If (count_zero_bits(oobpoi, eccbytes, eccstrength, &bitflip_count)) {
		return -EBADMSG;
	} else {
		if (count_zero_bits(read_buf, eccsize, eccstrength, &bitflip_count)
			return -EBADMSG;
		else
			break;
	}

 
>+
>+		/* Too many bitflips */
>+		if (flips > chip->ecc.strength)
>+			return -EBADMSG;
>+
>+		max_bitflips = max(max_bitflips, flips);
>+
>+		data += chip->ecc.size;
>+		oob += oob_step;
>+	}
>+
>+	pr_debug("correcting bitflips in erased page (max %u)\n",
>+			max_bitflips);
>+
>+	memset(buf, 0xff, mtd->writesize);
>+	memset(oob, 0xff, mtd->oobsize);
>+

Now this leaves back to same question, is it correct to fill the region
with 0xff to fool UBIFS (and other upper layers)
OR
Let UBIFS (and other upper layers) try extracting whatever they
could from the corrupt/junk data. Refer discussion in [1]


Also, it makes more sense to introduce this check just before
chip->ecc.correct(), as you can avoid running through failures first.
But then that means  adding this inside nand_read_page_hwecc()
which is replacable.


[1] http://lists.infradead.org/pipermail/linux-mtd/2014-March/052501.html

with regards, pekon
Brian Norris March 12, 2014, 5:32 a.m. UTC | #2
Hi Pekon,

On 03/11/2014 03:12 AM, Gupta, Pekon wrote:
>> From: Brian Norris [mailto:computersforpeace@gmail.com]
>>
>> Upper layers (e.g., UBI/UBIFS) expect that pages that have been erased
>> will return all 1's (0xff). However, modern NAND (MLC, and even some
>> SLC) experience bitflips in unprogrammed pages, and so they may not read
>> back all 1's. This is problematic for drivers whose ECC cannot correct
>> bitflips in an all-0xff page, as they will report an ECC error
>> (-EBADMSG) when they come across such a page. This appears to UBIFS as
>> "corrupt empty space".
>>
> I'm not sure, if any driver can solve this either. As if an erased-page has no
> ECC stored in OOB then how can any driver correct the bit-flips.

There are ways to design an ECC algorithm such that the ECC syndrome
bytes for all-0xff data are all 0xff. With such an ECC algorithm, you
then should be able to correct bitflips in pages that are almost all
0xff. But obviously, not all ECC implementations can do this.

Notably, the 1-bit software ECC implementation in
drivers/mtd/nand/nand_ecc.c can correct bit errors in erased pages.

>> @@ -1483,6 +1483,62 @@ static int nand_setup_read_retry(struct mtd_info *mtd, int retry_mode)
...
>> +	/* read without ECC for verification */
>> +	ret = chip->ecc.read_page_raw(mtd, chip, buf, true, page);
> 
> You can re-use the *buf and *oobpoi  filled in by chip->ecc.read_page(),
> Instead of re-reading the read_page_raw. 
> (1) Re-using buffers will ensure, that you do not re-count any bit-flips
>     which were already corrected by driver.
> (2) it will save overhead of re-reading the page.

OK, that might be doable. But there are two considerations:

1. nand_verify_erased_page() may be called after a subpage read (not
full page), so we will need to distinguish how much data we are
targeting here (i.e., column and length)

2. for some drivers, the read_page() implementation does not read out
OOB data by default (see the 'oob_required' parameter), so we will need
to re-read the page/subpage to get the full data+OOB in some cases. But
I suppose we can still weed out pages with totally garbage data by
checking for 0 bits before re-reading the page

>> +	if (ret)
>> +		return ret;
>> +
>> +	/* Check each ECC step individually */
>> +	for (i = 0; i < chip->ecc.steps; i++) {
>> +		unsigned int flips = databits + oobbits;
>> +
>> +		flips -= bitmap_weight((unsigned long *)oob, oobbits);
>> +		flips -= bitmap_weight((unsigned long *)data, databits);
> 
> It was observed that this scanning of all(0xff) in both OOB and Data buffers
> caused significant performance penalty, especially with MLC and newer
> technology NAND devices (as they exhibit extensive read-disturb errors
> in erased-pages).

Do you have any benchmarks? Remember, optimization must be driven by
data, not by opinion. "Extensive read-disturb" does not automatically
mean that the optimizations you imagine are actually worthwhile in practice.

> Hence as done in [2], a special function "count_zero_bits()" was introduced
> which will return, as soon as 0-bit count exceeded ecc.strength. So that you
> don't end-up counting complete buffer even if count exceeded the limit.

I really doubt that quitting early while counting bits will yield much
tangible benefit, and it's possible that it will make things worse.

Also, remember as you've said yourself, this should be optimized for the
mostly-0xff case. This means that we will likely have to check most or
all of the sector before we discover enough 0 bits, so adding a "quit
early" optimization may not be that useful.

> Also please use staggered approach of first counting bit-flips in OOB and
> then in data buffer, so that you don't over count.

OK, I can probably do something like this if I include the "quit early"
optimization.

>> +	memset(buf, 0xff, mtd->writesize);
>> +	memset(oob, 0xff, mtd->oobsize);
>> +
> 
> Now this leaves back to same question, is it correct to fill the region
> with 0xff to fool UBIFS (and other upper layers)

Yes, it absolutely is correct. The idea is that unprogrammed pages will
naturally have some small (?) number of bitflips, within the
correctability specifications of the flash. Such a page can still be
programmed, and likely be within our correction limits. So in order to
gracefully handle this, we remove the small number of 0 bits in order to
"correct" the page for the upper layers, which expect a clean 0xff page.

Or to put it another way, is it correct for ECC to "fool" upper layers
into thinking that there are no bit errors in a page when its syndrome
bytes can help it to do so? Our memset(0xff) is very much analogous.

> OR
> Let UBIFS (and other upper layers) try extracting whatever they
> could from the corrupt/junk data. Refer discussion in [1]

I'm not sure how [1] is related. Artem is simply explaining that you
can't remove those all-0xff checks in UBIFS. And I agree.

On the other hand, this patch is simply saying we can correct a simple,
benign piece of garbage data that is known to occur under normal
conditions, so that UBIFS won't choke on it. We cannot simply pass
easily-correctable garbage to higher layers.

> Also, it makes more sense to introduce this check just before
> chip->ecc.correct(), as you can avoid running through failures first.

No. We want to run the verify_erased_page code as a last-ditch effort,
after we've determined that the ECC implementation couldn't correct the
data. We don't want to run this on *every* page we come across!

> [1] http://lists.infradead.org/pipermail/linux-mtd/2014-March/052501.html

Brian
Elie De Brauwer March 12, 2014, 5:59 a.m. UTC | #3
On Wed, Mar 12, 2014 at 6:32 AM, Brian Norris
<computersforpeace@gmail.com> wrote:
> Hi Pekon,
>
> On 03/11/2014 03:12 AM, Gupta, Pekon wrote:
>>> From: Brian Norris [mailto:computersforpeace@gmail.com]
>>>
>>> Upper layers (e.g., UBI/UBIFS) expect that pages that have been erased
>>> will return all 1's (0xff). However, modern NAND (MLC, and even some
>>> SLC) experience bitflips in unprogrammed pages, and so they may not read
>>> back all 1's. This is problematic for drivers whose ECC cannot correct
>>> bitflips in an all-0xff page, as they will report an ECC error
>>> (-EBADMSG) when they come across such a page. This appears to UBIFS as
>>> "corrupt empty space".
>>>
>> I'm not sure, if any driver can solve this either. As if an erased-page has no
>> ECC stored in OOB then how can any driver correct the bit-flips.
>
> There are ways to design an ECC algorithm such that the ECC syndrome
> bytes for all-0xff data are all 0xff. With such an ECC algorithm, you
> then should be able to correct bitflips in pages that are almost all
> 0xff. But obviously, not all ECC implementations can do this.
>
> Notably, the 1-bit software ECC implementation in
> drivers/mtd/nand/nand_ecc.c can correct bit errors in erased pages.
>
>>> @@ -1483,6 +1483,62 @@ static int nand_setup_read_retry(struct mtd_info *mtd, int retry_mode)
> ...
>>> +    /* read without ECC for verification */
>>> +    ret = chip->ecc.read_page_raw(mtd, chip, buf, true, page);
>>
>> You can re-use the *buf and *oobpoi  filled in by chip->ecc.read_page(),
>> Instead of re-reading the read_page_raw.
>> (1) Re-using buffers will ensure, that you do not re-count any bit-flips
>>     which were already corrected by driver.
>> (2) it will save overhead of re-reading the page.
>
> OK, that might be doable. But there are two considerations:
>
> 1. nand_verify_erased_page() may be called after a subpage read (not
> full page), so we will need to distinguish how much data we are
> targeting here (i.e., column and length)
>
> 2. for some drivers, the read_page() implementation does not read out
> OOB data by default (see the 'oob_required' parameter), so we will need
> to re-read the page/subpage to get the full data+OOB in some cases. But
> I suppose we can still weed out pages with totally garbage data by
> checking for 0 bits before re-reading the page
>
>>> +    if (ret)
>>> +            return ret;
>>> +
>>> +    /* Check each ECC step individually */
>>> +    for (i = 0; i < chip->ecc.steps; i++) {
>>> +            unsigned int flips = databits + oobbits;
>>> +
>>> +            flips -= bitmap_weight((unsigned long *)oob, oobbits);
>>> +            flips -= bitmap_weight((unsigned long *)data, databits);
>>
>> It was observed that this scanning of all(0xff) in both OOB and Data buffers
>> caused significant performance penalty, especially with MLC and newer
>> technology NAND devices (as they exhibit extensive read-disturb errors
>> in erased-pages).
>
> Do you have any benchmarks? Remember, optimization must be driven by
> data, not by opinion. "Extensive read-disturb" does not automatically
> mean that the optimizations you imagine are actually worthwhile in practice.
>
>> Hence as done in [2], a special function "count_zero_bits()" was introduced
>> which will return, as soon as 0-bit count exceeded ecc.strength. So that you
>> don't end-up counting complete buffer even if count exceeded the limit.
>
> I really doubt that quitting early while counting bits will yield much
> tangible benefit, and it's possible that it will make things worse.
>
> Also, remember as you've said yourself, this should be optimized for the
> mostly-0xff case. This means that we will likely have to check most or
> all of the sector before we discover enough 0 bits, so adding a "quit
> early" optimization may not be that useful.
>

Hello Brian,

In [1] you an find some benchmarks which I did in the early days of the GPMI fix
where I tried several approaches ranging from the naive version based on some
of Pekon's work, going to making using of the BCH status register ending with
reading the syndromes and caching them, for me this last version is what I have
in our own Linux tree, because after this Huang took over and came
with the patch
which started these discussions which I'm waiting to upstream.

What my tests haves learned me is that there's probably very little to
gain in the
actual optimization of the erased-page correction, but the magic lies in quickly
and efficiently determining if a read-page is actually an all-0xff
case with a bitflip
causing the BCH block to detect it as an error. (In the case of GPMI
is, our n-bit
ECC failed to withstand a single bitflip). Also by default on the GPMI
implementation the syndromes are not by default/read accessible (see also [1]
where I modified the driver to write the syndromes in a read operation to cache
the syndromes of a correctly read erased page since I wanted to use the
syndrome to quickly verify a correctly erased page from an erased page with
bitflips).


[1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051414.html

my 2 cents
E.
Brian Norris March 12, 2014, 6:59 a.m. UTC | #4
Hi Elie,

Thanks for your response.

On 03/11/2014 10:59 PM, Elie De Brauwer wrote:
> In [1] you an find some benchmarks which I did in the early days of the GPMI fix
> where I tried several approaches ranging from the naive version based on some
> of Pekon's work, going to making using of the BCH status register ending with
> reading the syndromes and caching them, for me this last version is what I have
> in our own Linux tree, because after this Huang took over and came
> with the patch
> which started these discussions which I'm waiting to upstream.

I'm a little confused by the number of different patches out here. I'll
summarize what I understand, but please correct if I'm wrong:

[A] First, you (Elie) sent a series of patches that made it to v7 [1].
This utilizes a special GPMI hardware feature that can tell report an
ECC chunk as "erased" based on how many 0 bits it has (between 0 and
some threshold). This still required a fallback to count the number of
bits whenever it's under this threshold

[B] Then, you sent an additional patch [2] (on top of [1]) which tries
to cache the syndrome related to a fully erased page (no bitflips) for
speeding up some comparisons. You provided benchmarks in [3]

[C] Finally, Huang followed up with his own patch [4]. It doesn't do
anything specific to GPMI really, and it encouraged me to just submit my
own patch (the current thread) for nand_base.

But I can't tell what to do with your performance numbers. I see results
for [1] and for [1]+[2], but I don't see any results for [4].

Finally, is [4] supposed to replace your (Elie's) work from [1] and [2],
or supplement it? It sounded like you two were encouraging me to merge
it by itself.

> What my tests haves learned me is that there's probably very little to
> gain in the
> actual optimization of the erased-page correction, but the magic lies in quickly
> and efficiently determining if a read-page is actually an all-0xff
> case with a bitflip
> causing the BCH block to detect it as an error.

I'm not quite sure what you're saying here. What do you mean
"erased-page correction" vs. "determining ... all-0xff"? Aren't those
the same thing?

> (In the case of GPMI
> is, our n-bit
> ECC failed to withstand a single bitflip).

That's understandable. ECC algorithms must be written specifically so
that they can match and correct mostly-0xff patterns. You can't really
massage an inflexible hardware implementation to do this.

Brian

[1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051357.html

[2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051413.html

[3] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051414.html

[4] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051513.html
Huang Shijie March 12, 2014, 8:06 a.m. UTC | #5
On Tue, Mar 11, 2014 at 02:11:51AM -0700, Brian Norris wrote:
> +static int nand_verify_erased_page(struct mtd_info *mtd, uint8_t *buf, int page)
> +{
> +	struct nand_chip *chip = mtd->priv;
> +	int i, oobbits, databits;
> +	uint8_t *data = buf, *oob = chip->oob_poi;
> +	unsigned int max_bitflips = 0;
> +	int oob_step = mtd->oobsize / chip->ecc.steps;
> +	int ret;
> +
> +	oobbits = oob_step * 8;
> +	databits = chip->ecc.size * 8;
> +
> +	/* read without ECC for verification */
> +	ret = chip->ecc.read_page_raw(mtd, chip, buf, true, page);
> +	if (ret)
> +		return ret;
> +
> +	/* Check each ECC step individually */
> +	for (i = 0; i < chip->ecc.steps; i++) {
> +		unsigned int flips = databits + oobbits;
> +
> +		flips -= bitmap_weight((unsigned long *)oob, oobbits);
> +		flips -= bitmap_weight((unsigned long *)data, databits);
> +
> +		/* Too many bitflips */
> +		if (flips > chip->ecc.strength)
> +			return -EBADMSG;
> +
> +		max_bitflips = max(max_bitflips, flips);
> +
> +		data += chip->ecc.size;
The gpmi uses a NAND layout like this:

	 *    +---+----------+-+----------+-+----------+-+----------+-+
	 *    | M |   data   |E|   data   |E|   data   |E|   data   |E|
	 *    +---+----------+-+----------+-+----------+-+----------+-+

        M : The metadata, which is 10 by default.	 
 	E : The ECC parity for "data" or "data + M".

If you want to count each sector, it makes the code very complicated. :(


> +		oob += oob_step;
> +	}
> +
> +	pr_debug("correcting bitflips in erased page (max %u)\n",
> +			max_bitflips);
> +
> +	memset(buf, 0xff, mtd->writesize);
> +	memset(oob, 0xff, mtd->oobsize);
> +
> +	return max_bitflips;
> +}
> +/**
>   * nand_do_read_ops - [INTERN] Read data with ECC
>   * @mtd: MTD device structure
>   * @from: offset to read from
> @@ -1554,6 +1610,18 @@ read_retry:
>  				break;
>  			}
>  
> +			/* Check an ECC error for 0xff + bitflips? */
> +			if ((chip->options & NAND_ECC_NEED_CHECK_FF) &&
> +					mtd->ecc_stats.failed - ecc_failures) {
> +				ret = nand_verify_erased_page(mtd, bufpoi,
> +						page);
I think we can merge this code into the read-retry check like:
	---------------------------------------------
			if (mtd->ecc_stats.failed - ecc_failures) {

				// step 1: count the bitflips, assume N.
				N = bitflips_count();
				// step 2:  
				if (N < ecc_strength) {
					do the erase-page-check;
				} else
					do the read-retry.


				if (retry_mode + 1 <= chip->read_retries) {
					retry_mode++;
					ret = nand_setup_read_retry(mtd,
							retry_mode);
					if (ret < 0)
						break;

					/* Reset failures; retry */
					mtd->ecc_stats.failed = ecc_failures;
					goto read_retry;
				} else {
					/* No more retry modes; real failure */
					ecc_fail = true;
				}
			}
	---------------------------------------------


thanks
Huang Shijie
Elie De Brauwer March 12, 2014, 12:45 p.m. UTC | #6
On Wed, Mar 12, 2014 at 7:59 AM, Brian Norris
<computersforpeace@gmail.com> wrote:
> Hi Elie,
>
> Thanks for your response.
>
> On 03/11/2014 10:59 PM, Elie De Brauwer wrote:
>> In [1] you an find some benchmarks which I did in the early days of the GPMI fix
>> where I tried several approaches ranging from the naive version based on some
>> of Pekon's work, going to making using of the BCH status register ending with
>> reading the syndromes and caching them, for me this last version is what I have
>> in our own Linux tree, because after this Huang took over and came
>> with the patch
>> which started these discussions which I'm waiting to upstream.
>
> I'm a little confused by the number of different patches out here. I'll
> summarize what I understand, but please correct if I'm wrong:
>
> [A] First, you (Elie) sent a series of patches that made it to v7 [1].
> This utilizes a special GPMI hardware feature that can tell report an
> ECC chunk as "erased" based on how many 0 bits it has (between 0 and
> some threshold). This still required a fallback to count the number of
> bits whenever it's under this threshold
>
> [B] Then, you sent an additional patch [2] (on top of [1]) which tries
> to cache the syndrome related to a fully erased page (no bitflips) for
> speeding up some comparisons. You provided benchmarks in [3]
>
> [C] Finally, Huang followed up with his own patch [4]. It doesn't do
> anything specific to GPMI really, and it encouraged me to just submit my
> own patch (the current thread) for nand_base.
>
> But I can't tell what to do with your performance numbers. I see results
> for [1] and for [1]+[2], but I don't see any results for [4].
>
> Finally, is [4] supposed to replace your (Elie's) work from [1] and [2],
> or supplement it? It sounded like you two were encouraging me to merge
> it by itself.

Your archeological research is  correct. During A-B I followed a track in
which I believed was a water-tight system using as much as possible
assist  from hardware as possible. But Huang turned this into something
more usable using more internal knowledge of the working of the
BCH syndromes.

>> What my tests haves learned me is that there's probably very little to
>> gain in the
>> actual optimization of the erased-page correction, but the magic lies in quickly
>> and efficiently determining if a read-page is actually an all-0xff
>> case with a bitflip
>> causing the BCH block to detect it as an error.
>
> I'm not quite sure what you're saying here. What do you mean
> "erased-page correction" vs. "determining ... all-0xff"? Aren't those
> the same thing?
>

I meant there are two things:
a) "erase-page correction": if you have an erased page with bitflips,
 count and correct them. (E.g. the for loop with the hweight and
a potential break when you each a threshold, possible with an in
place setting of the to 0xff of the byte in question or followed by
a memset to 0xff at the end).
b) "determining you read an erased page". In case of the i.mx (and
thus GPMI) the BCH block can tell you three things:
 1. I read all 0xff's
 2. I read some data and nothing got corrected
 3. I read something but failed to correct it.
The third case can have two causes:
 3.a you read valid data with bitflips exceeding what the BCH could
correct
 3.b you read an erased page with bitflips.

Obviously case 3.b is what this discussion is all about, and my quest
revolved around a means to quickly identify case '3.b'.

And once you're in case '3' it's difficult to actually distinguish between
3.a and 3.b (think of a page wich consists out of 99.99% 1-bits but
has bitflips exceeding the threshold), You could only identify this by
looking at the syndrome data (which should be 0xff) but which is not
at hand in the GPMI.

So we experimented with tuning the '0xff' case detecting algorithm
(mainly [A] which I improved by caching the correct syndromes for
an erased page addition [B]).  Mainly focusing on the theoretical
background, Huang turned this into a more down to earth approach.

In my own kernel tree I'm still using my patches [A] and [B], and this
is what goes out to the field if we need to start shipping (since vanilla
has a high risk of turning our devices into bricks), but if I have time
I plan to integrate whatever makes it upstream (unless it really messes
up my boot times ;) ).


my 2 cents
E.


>> (In the case of GPMI
>> is, our n-bit
>> ECC failed to withstand a single bitflip).
>
> That's understandable. ECC algorithms must be written specifically so
> that they can match and correct mostly-0xff patterns. You can't really
> massage an inflexible hardware implementation to do this.
>
> Brian
>
> [1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051357.html
>
> [2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051413.html
>
> [3] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051414.html
>
> [4] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051513.html
Brian Norris March 13, 2014, 4:55 a.m. UTC | #7
Hi Huang,

On Wed, Mar 12, 2014 at 04:06:07PM +0800, Huang Shijie wrote:
> On Tue, Mar 11, 2014 at 02:11:51AM -0700, Brian Norris wrote:
> > +	/* Check each ECC step individually */
> > +	for (i = 0; i < chip->ecc.steps; i++) {

[...]

> The gpmi uses a NAND layout like this:
> 
> 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> 	 *    | M |   data   |E|   data   |E|   data   |E|   data   |E|
> 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> 
>         M : The metadata, which is 10 by default.	 
>  	E : The ECC parity for "data" or "data + M".
> 
> If you want to count each sector, it makes the code very complicated. :(

I see. That does make things complicated.

But I'm curious: why does GPMI's ecc.read_page_raw() implementation (the
default) have to give such a different data/spare layout than its
ecc.read_page() implementation? There seems to be no benefit to this,
except that it means gpmi-nand doesn't have to implement its own
read_page_raw() callback...

Look, for instance, at nand_base's nand_read_page_raw_syndrome(). It
actually bothers to untangle a layout like yours and place it into the
appropriate places in the data and OOB buffers. I think this is the
ideal implementation for read_page_raw(), at which point my patch makes
more sense; all NAND drivers should have a (reasonably) similar layout
when viewed from nand_base, without any mixed data/OOB striping like in
your diagram.

Side topic: did you plan on looking at this FIXME in gpmi-nand.c?

 * FIXME: The following paragraph is incorrect, now that there exist
 * ecc.read_oob_raw and ecc.write_oob_raw functions.

It might be worth addressing at the same time, since the stale comments
near the FIXME talk about a difference between a raw and an ECC-based
"view" of the page. IMO, there should be no difference between these
types of "view", if at all possible. The only difference should be that
one tries to apply ECC and the other doesn't.

> > @@ -1554,6 +1610,18 @@ read_retry:
> >  				break;
> >  			}
> >  
> > +			/* Check an ECC error for 0xff + bitflips? */
> > +			if ((chip->options & NAND_ECC_NEED_CHECK_FF) &&
> > +					mtd->ecc_stats.failed - ecc_failures) {
> > +				ret = nand_verify_erased_page(mtd, bufpoi,
> > +						page);
> I think we can merge this code into the read-retry check like:

There's some very specific ordering requirements for some of the steps,
so we'd have to be careful about moving this code downward (max_bitflips
checks; the !aligned buffer copy; nand_transfer_oob()). I'm not sure
your suggestion is possible. And now that I mention it, I think the
NAND_NEED_READRDY check might already be a bit misplaced in my patch...

> 	---------------------------------------------
> 			if (mtd->ecc_stats.failed - ecc_failures) {
> 
> 				// step 1: count the bitflips, assume N.
> 				N = bitflips_count();
> 				// step 2:  
> 				if (N < ecc_strength) {
> 					do the erase-page-check;
> 				} else
> 					do the read-retry.

I'm not 100% clear on the pseudocode here, but it looks like you're
saying erased-page checks and read-retry would be mutually exclusive. I
believe that is wrong. In fact, they may *both* be required; we should
first check if the ECC failure was just an erased page, and if not, we
should retry the page.

> 
> 
> 				if (retry_mode + 1 <= chip->read_retries) {
> 					retry_mode++;
> 					ret = nand_setup_read_retry(mtd,
> 							retry_mode);
> 					if (ret < 0)
> 						break;
> 
> 					/* Reset failures; retry */
> 					mtd->ecc_stats.failed = ecc_failures;
> 					goto read_retry;
> 				} else {
> 					/* No more retry modes; real failure */
> 					ecc_fail = true;
> 				}
> 			}
> 	---------------------------------------------

Brian
Brian Norris March 13, 2014, 5:22 a.m. UTC | #8
On Wed, Mar 12, 2014 at 01:45:15PM +0100, Elie De Brauwer wrote:
> On Wed, Mar 12, 2014 at 7:59 AM, Brian Norris <computersforpeace@gmail.com> wrote:
> > On 03/11/2014 10:59 PM, Elie De Brauwer wrote:
[...]
> > I'm a little confused by the number of different patches out here. I'll
> > summarize what I understand, but please correct if I'm wrong:
> >
> > [A] First, you (Elie) sent a series of patches that made it to v7 [1].
> > This utilizes a special GPMI hardware feature that can tell report an
> > ECC chunk as "erased" based on how many 0 bits it has (between 0 and
> > some threshold). This still required a fallback to count the number of
> > bits whenever it's under this threshold
> >
> > [B] Then, you sent an additional patch [2] (on top of [1]) which tries
> > to cache the syndrome related to a fully erased page (no bitflips) for
> > speeding up some comparisons. You provided benchmarks in [3]
> >
> > [C] Finally, Huang followed up with his own patch [4]. It doesn't do
> > anything specific to GPMI really, and it encouraged me to just submit my
> > own patch (the current thread) for nand_base.
> >
> > But I can't tell what to do with your performance numbers. I see results
> > for [1] and for [1]+[2], but I don't see any results for [4].
> >
> > Finally, is [4] supposed to replace your (Elie's) work from [1] and [2],
> > or supplement it? It sounded like you two were encouraging me to merge
> > it by itself.
> 
> Your archeological research is  correct. During A-B I followed a track in
> which I believed was a water-tight system using as much as possible
> assist  from hardware as possible. But Huang turned this into something
> more usable using more internal knowledge of the working of the
> BCH syndromes.

OK, so if an approach like [C] stabilizes, that should replace [A] and [B]?

> >> What my tests haves learned me is that there's probably very little to
> >> gain in the
> >> actual optimization of the erased-page correction, but the magic lies in quickly
> >> and efficiently determining if a read-page is actually an all-0xff
> >> case with a bitflip
> >> causing the BCH block to detect it as an error.
> >
> > I'm not quite sure what you're saying here. What do you mean
> > "erased-page correction" vs. "determining ... all-0xff"? Aren't those
> > the same thing?
> >
> 
> I meant there are two things:
> a) "erase-page correction": if you have an erased page with bitflips,
>  count and correct them. (E.g. the for loop with the hweight and
> a potential break when you each a threshold, possible with an in
> place setting of the to 0xff of the byte in question or followed by
> a memset to 0xff at the end).
> b) "determining you read an erased page". In case of the i.mx (and
> thus GPMI) the BCH block can tell you three things:
>  1. I read all 0xff's
>  2. I read some data and nothing got corrected
>  3. I read something but failed to correct it.
> The third case can have two causes:
>  3.a you read valid data with bitflips exceeding what the BCH could
> correct
>  3.b you read an erased page with bitflips.
> 
> Obviously case 3.b is what this discussion is all about, and my quest
> revolved around a means to quickly identify case '3.b'.

Yes, 3.a vs. 3.b is the big problem.

> And once you're in case '3' it's difficult to actually distinguish between
> 3.a and 3.b (think of a page wich consists out of 99.99% 1-bits but
> has bitflips exceeding the threshold), You could only identify this by
> looking at the syndrome data (which should be 0xff) but which is not
> at hand in the GPMI.

So if you can't distinguish between 3.a and 3.b, then you're in the same
boat as many other hardware/drivers. But if you can do something special
with the hardware, that is still potentially very interesting.

(BTW, I don't think saving the syndrome bytes is a very good approach
here. It seems like that would only help for very specific patterns.)

What *can* your BCH do to help w/ distinguishing 3.a and 3.b?

As I read your patches, it seems your BCH can report STATUS_ERASED when
it sees a page with only N or fewer 0-bits, where N is configurable from
0 to 255 (?). If only it could also report *how many* zero bits it
saw... Otherewise, it seems like you just want to leave N = 0 to signal
a clean page, and continue with a pure software solution for the
STATUS_UNCORRECTABLE case.

> So we experimented with tuning the '0xff' case detecting algorithm
> (mainly [A] which I improved by caching the correct syndromes for
> an erased page addition [B]).  Mainly focusing on the theoretical
> background, Huang turned this into a more down to earth approach.
> 
> In my own kernel tree I'm still using my patches [A] and [B], and this
> is what goes out to the field if we need to start shipping (since vanilla
> has a high risk of turning our devices into bricks), but if I have time
> I plan to integrate whatever makes it upstream (unless it really messes
> up my boot times ;) ).

OK, then I think I'll consider only some version of either Huang's patch
or my own. And if we can straighten out the GPMI data/OOB layout (per
Huang and my conversation on another fork of this thread), then I think
we'd want this code in nand_base.

Brian

> > [1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051357.html
> >
> > [2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051413.html
> >
> > [3] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051414.html
> >
> > [4] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051513.html
pekon gupta March 13, 2014, 5:55 a.m. UTC | #9
All,

>From: Brian Norris [mailto:computersforpeace@gmail.com]
>On Wed, Mar 12, 2014 at 01:45:15PM +0100, Elie De Brauwer wrote:
>> On Wed, Mar 12, 2014 at 7:59 AM, Brian Norris <computersforpeace@gmail.com> wrote:
>> > On 03/11/2014 10:59 PM, Elie De Brauwer wrote:
[...]
>> > I'm a little confused by the number of different patches out here. I'll
>> > summarize what I understand, but please correct if I'm wrong:
>> >
>> > [A] First, you (Elie) sent a series of patches that made it to v7 [1].
>> > This utilizes a special GPMI hardware feature that can tell report an
>> > ECC chunk as "erased" based on how many 0 bits it has (between 0 and
>> > some threshold). This still required a fallback to count the number of
>> > bits whenever it's under this threshold
>> >
>> > [B] Then, you sent an additional patch [2] (on top of [1]) which tries
>> > to cache the syndrome related to a fully erased page (no bitflips) for
>> > speeding up some comparisons. You provided benchmarks in [3]
>> >
>> > [C] Finally, Huang followed up with his own patch [4]. It doesn't do
>> > anything specific to GPMI really, and it encouraged me to just submit my
>> > own patch (the current thread) for nand_base.
>> >
>> > But I can't tell what to do with your performance numbers. I see results
>> > for [1] and for [1]+[2], but I don't see any results for [4].
>> >
>> > Finally, is [4] supposed to replace your (Elie's) work from [1] and [2],
>> > or supplement it? It sounded like you two were encouraging me to merge
>> > it by itself.
>>
>> Your archeological research is  correct. During A-B I followed a track in
>> which I believed was a water-tight system using as much as possible
>> assist  from hardware as possible. But Huang turned this into something
>> more usable using more internal knowledge of the working of the
>> BCH syndromes.
>
>OK, so if an approach like [C] stabilizes, that should replace [A] and [B]?
>
>> >> What my tests haves learned me is that there's probably very little to
>> >> gain in the
>> >> actual optimization of the erased-page correction, but the magic lies in quickly
>> >> and efficiently determining if a read-page is actually an all-0xff
>> >> case with a bitflip
>> >> causing the BCH block to detect it as an error.
>> >
>> > I'm not quite sure what you're saying here. What do you mean
>> > "erased-page correction" vs. "determining ... all-0xff"? Aren't those
>> > the same thing?
>> >
>>
>> I meant there are two things:
>> a) "erase-page correction": if you have an erased page with bitflips,
>>  count and correct them. (E.g. the for loop with the hweight and
>> a potential break when you each a threshold, possible with an in
>> place setting of the to 0xff of the byte in question or followed by
>> a memset to 0xff at the end).
>> b) "determining you read an erased page". In case of the i.mx (and
>> thus GPMI) the BCH block can tell you three things:
>>  1. I read all 0xff's
>>  2. I read some data and nothing got corrected
>>  3. I read something but failed to correct it.
>> The third case can have two causes:
>>  3.a you read valid data with bitflips exceeding what the BCH could
>> correct
>>  3.b you read an erased page with bitflips.
>>
>> Obviously case 3.b is what this discussion is all about, and my quest
>> revolved around a means to quickly identify case '3.b'.
>
>Yes, 3.a vs. 3.b is the big problem.
>
>> And once you're in case '3' it's difficult to actually distinguish between
>> 3.a and 3.b (think of a page wich consists out of 99.99% 1-bits but
>> has bitflips exceeding the threshold), You could only identify this by
>> looking at the syndrome data (which should be 0xff) but which is not
>> at hand in the GPMI.
>
>So if you can't distinguish between 3.a and 3.b, then you're in the same
>boat as many other hardware/drivers. But if you can do something special
>with the hardware, that is still potentially very interesting.
>
>(BTW, I don't think saving the syndrome bytes is a very good approach
>here. It seems like that would only help for very specific patterns.)
>
I think for OMAP NAND driver there needs to be some help on "1." also.
There is no hardware support in GPMC (Ti's controller) to find out if the
(read_data + read_oob) == 0xff. So you have to do this comparison in
software. And, thus there is performance penalty right at first-step.
(I'm trying to get the statistics of this soon).

So, for OMAP NAND driver flow becomes..
1. There needs to be something without much performance penalty to 
filter out 'erased-pages without bit-flips'. And this points to basic question,
which was asked by Brain earlier [1] and wasn't sure about ..
"Does OOB==all(0xff) can be assumed that pages_is_erased ?"
OR
"Is there any combination of data which produces all(0xff) ECC ?"

Then, statement "3." (3a and 3b) remains similar  to GPMI (FSL Controller) driver.


[1] http://lists.infradead.org/pipermail/linux-mtd/2014-March/052472.html

with regards, pekon
Brian Norris March 13, 2014, 6:28 a.m. UTC | #10
Hi Pekon,

On Thu, Mar 13, 2014 at 05:55:32AM +0000, Pekon Gupta wrote:
> >From: Brian Norris [mailto:computersforpeace@gmail.com]
> >On Wed, Mar 12, 2014 at 01:45:15PM +0100, Elie De Brauwer wrote:
> >> b) "determining you read an erased page". In case of the i.mx (and
> >> thus GPMI) the BCH block can tell you three things:
> >>  1. I read all 0xff's
> >>  2. I read some data and nothing got corrected
> >>  3. I read something but failed to correct it.
> >> The third case can have two causes:
> >>  3.a you read valid data with bitflips exceeding what the BCH could
> >> correct
> >>  3.b you read an erased page with bitflips.
> >>
> >> Obviously case 3.b is what this discussion is all about, and my quest
> >> revolved around a means to quickly identify case '3.b'.
> >
> >Yes, 3.a vs. 3.b is the big problem.
[...]

> I think for OMAP NAND driver there needs to be some help on "1." also.
> There is no hardware support in GPMC (Ti's controller) to find out if the
> (read_data + read_oob) == 0xff. So you have to do this comparison in
> software.

Really? So if you read a blank (all 0xff) page that has no bitflips, you
see an ECC error? I'm sorry, but I didn't realize that's how your
hardware worked. That's the worst hardware ECC design I've seen so far
:(

So currently, omap2.c tries to mitigate the overhead of checking 1 (and
3.b) by using a fixed-offset "programmed" marker, right? But this isn't
workable for all platforms (why, exactly?) so you're trying to remove
this marker in your patch sets, right? (Could you possibly move the
marker somewhere else, depending on the platform? But admittedly, this
is still fragile.)

In any case, if the special marker approach fails, you're falling back
to an approach similar to the Freescale approach (and mine), with a few
extra tweaks.

> And, thus there is performance penalty right at first-step.
> (I'm trying to get the statistics of this soon).

Yes, that would be nice.

Brian
pekon gupta March 13, 2014, 7:01 a.m. UTC | #11
Hi Brian,

>From: Brian Norris [mailto:computersforpeace@gmail.com]
>Hi Pekon,
>
>On Thu, Mar 13, 2014 at 05:55:32AM +0000, Pekon Gupta wrote:
>> >From: Brian Norris [mailto:computersforpeace@gmail.com]
>> >On Wed, Mar 12, 2014 at 01:45:15PM +0100, Elie De Brauwer wrote:
>> >> b) "determining you read an erased page". In case of the i.mx (and
>> >> thus GPMI) the BCH block can tell you three things:
>> >>  1. I read all 0xff's
>> >>  2. I read some data and nothing got corrected
>> >>  3. I read something but failed to correct it.
>> >> The third case can have two causes:
>> >>  3.a you read valid data with bitflips exceeding what the BCH could
>> >> correct
>> >>  3.b you read an erased page with bitflips.
>> >>
>> >> Obviously case 3.b is what this discussion is all about, and my quest
>> >> revolved around a means to quickly identify case '3.b'.
>> >
>> >Yes, 3.a vs. 3.b is the big problem.
>[...]
>
>> I think for OMAP NAND driver there needs to be some help on "1." also.
>> There is no hardware support in GPMC (Ti's controller) to find out if the
>> (read_data + read_oob) == 0xff. So you have to do this comparison in
>> software.
>
>Really? So if you read a blank (all 0xff) page that has no bitflips, you
>see an ECC error? I'm sorry, but I didn't realize that's how your
>hardware worked. 
>
Yes, it is like that.
Therefore if the driver does not filter out "blank (all 0xff) pages) before
'read_data + read_oob' being fed to ELM engine you see "uncorrectable" error.


> That's the worst hardware ECC design I've seen so far
>:(
>
- GPMC and ELM hardware engines were design during OMAP3 and earlier
  time-frames. When only 1-bit Hamming was enough, and there were
  hardly any bit-flips seen on erased-pages :-).
- And I don't think h/w IP designers have too much visibility on actual usage
  and software framework of these engines. So this remained un-changed.
However, I do have internally raised this issue of getting a feature added
in hardware to check for all(0xff). But still we have to support legacy devices.


>So currently, omap2.c tries to mitigate the overhead of checking 1 (and
>3.b) by using a fixed-offset "programmed" marker, right? But this isn't
>workable for all platforms (why, exactly?)
>
Because this 'special marker' is not present in ecc_layout of all ECC schemes 
supported by ROM code. (its only present in BCH8_HW). And so for other
ECC schemes (like BCH16) we have to fall-back on driver to manually compare
and filter out completely 'erased-pages'.


> so you're trying to remove
>this marker in your patch sets, right? (Could you possibly move the
>marker somewhere else, depending on the platform? But admittedly, this
>is still fragile.)
>
Yes, that is what my patch [2] does. I can move the marker to a location
in 'ecc_layout.oobfree' space but then other problems.
(1) This would require change in ecc.layout, for which I don't know how
  legacy devices will behave.
(2) And ECC marker location on an 'erased-page' is still susceptible to
 bit-flips, which will make it detectible as programmed page. And then
 again when ELM tries to correct it, it will show "un-correctable" errors.


>In any case, if the special marker approach fails, you're falling back
>to an approach similar to the Freescale approach (and mine), with a few
>extra tweaks.
>
Yes, so I want to effectively do away with "special marker approach".

I'm very much open to use your nand_base.c approach. But I'm just
trying to get more optimal solution in terms of performance.
because there would be anyways some performance hit to filter-out
erased-pages (without bit-flip) pages in OMAP NAND driver.
Then if there is another iteration to filter-out erased-pages (with bit-flips)
in nand_base.c, the driver's performance will be hit. (especially for MLC
and NAND devices at lower technologies).


>> And, thus there is performance penalty right at first-step.
>> (I'm trying to get the statistics of this soon).
>
>Yes, that would be nice.
>
>Brian

[2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051585.html


with regards, pekon
Elie De Brauwer March 13, 2014, 7:37 a.m. UTC | #12
On Thu, Mar 13, 2014 at 6:22 AM, Brian Norris
<computersforpeace@gmail.com> wrote:
> On Wed, Mar 12, 2014 at 01:45:15PM +0100, Elie De Brauwer wrote:
>> On Wed, Mar 12, 2014 at 7:59 AM, Brian Norris <computersforpeace@gmail.com> wrote:
>> > On 03/11/2014 10:59 PM, Elie De Brauwer wrote:
> [...]
>> > I'm a little confused by the number of different patches out here. I'll
>> > summarize what I understand, but please correct if I'm wrong:
>> >
>> > [A] First, you (Elie) sent a series of patches that made it to v7 [1].
>> > This utilizes a special GPMI hardware feature that can tell report an
>> > ECC chunk as "erased" based on how many 0 bits it has (between 0 and
>> > some threshold). This still required a fallback to count the number of
>> > bits whenever it's under this threshold
>> >
>> > [B] Then, you sent an additional patch [2] (on top of [1]) which tries
>> > to cache the syndrome related to a fully erased page (no bitflips) for
>> > speeding up some comparisons. You provided benchmarks in [3]
>> >
>> > [C] Finally, Huang followed up with his own patch [4]. It doesn't do
>> > anything specific to GPMI really, and it encouraged me to just submit my
>> > own patch (the current thread) for nand_base.
>> >
>> > But I can't tell what to do with your performance numbers. I see results
>> > for [1] and for [1]+[2], but I don't see any results for [4].
>> >
>> > Finally, is [4] supposed to replace your (Elie's) work from [1] and [2],
>> > or supplement it? It sounded like you two were encouraging me to merge
>> > it by itself.
>>
>> Your archeological research is  correct. During A-B I followed a track in
>> which I believed was a water-tight system using as much as possible
>> assist  from hardware as possible. But Huang turned this into something
>> more usable using more internal knowledge of the working of the
>> BCH syndromes.
>
> OK, so if an approach like [C] stabilizes, that should replace [A] and [B]?


Correct, obviously for mainstream code it's probably better if this can be
solved at a more generic level (e.g. at nand level instead of at the level
of each individual driver). The only reason [A] and [B] exist because for my
current needs (getting my boards stable) that was the point where it made
more sense, and it appeared the same path was being followed by Pekon
for OMAP.

>
>> >> What my tests haves learned me is that there's probably very little to
>> >> gain in the
>> >> actual optimization of the erased-page correction, but the magic lies in quickly
>> >> and efficiently determining if a read-page is actually an all-0xff
>> >> case with a bitflip
>> >> causing the BCH block to detect it as an error.
>> >
>> > I'm not quite sure what you're saying here. What do you mean
>> > "erased-page correction" vs. "determining ... all-0xff"? Aren't those
>> > the same thing?
>> >
>>
>> I meant there are two things:
>> a) "erase-page correction": if you have an erased page with bitflips,
>>  count and correct them. (E.g. the for loop with the hweight and
>> a potential break when you each a threshold, possible with an in
>> place setting of the to 0xff of the byte in question or followed by
>> a memset to 0xff at the end).
>> b) "determining you read an erased page". In case of the i.mx (and
>> thus GPMI) the BCH block can tell you three things:
>>  1. I read all 0xff's
>>  2. I read some data and nothing got corrected
>>  3. I read something but failed to correct it.
>> The third case can have two causes:
>>  3.a you read valid data with bitflips exceeding what the BCH could
>> correct
>>  3.b you read an erased page with bitflips.
>>
>> Obviously case 3.b is what this discussion is all about, and my quest
>> revolved around a means to quickly identify case '3.b'.
>
> Yes, 3.a vs. 3.b is the big problem.
>
>> And once you're in case '3' it's difficult to actually distinguish between
>> 3.a and 3.b (think of a page wich consists out of 99.99% 1-bits but
>> has bitflips exceeding the threshold), You could only identify this by
>> looking at the syndrome data (which should be 0xff) but which is not
>> at hand in the GPMI.
>
> So if you can't distinguish between 3.a and 3.b, then you're in the same
> boat as many other hardware/drivers. But if you can do something special
> with the hardware, that is still potentially very interesting.
>
> (BTW, I don't think saving the syndrome bytes is a very good approach
> here. It seems like that would only help for very specific patterns.)
>
> What *can* your BCH do to help w/ distinguishing 3.a and 3.b?
>
> As I read your patches, it seems your BCH can report STATUS_ERASED when
> it sees a page with only N or fewer 0-bits, where N is configurable from
> 0 to 255 (?). If only it could also report *how many* zero bits it
> saw... Otherewise, it seems like you just want to leave N = 0 to signal
> a clean page, and continue with a pure software solution for the
> STATUS_UNCORRECTABLE case.

Indeed, what can be done is tuning the all 0xff-detection algorithm, you
can modify this threshold to count how many non-1-bits are allowed before
the block is marked as uncorrectable. The only thing which is not clear to
me, and not very clearly documented, is how the OOB data matches in,
suppose we set this threshold to n (or even to zero) and bitflips occur
only in the OOB data, there would be no way to tell. The problem I saw
is that by looking at the data alone, this would not be enough information
to distinguish between an erased pages with bitflips and a non-erased
page which consists mainly out of 0xff's, which you could probably only
distinguish by having access to the ECC data which we had not. Hence
the approach of playing with the 0xff-detection algorithm in the BCH block.

E.g. page 1285 of the i.MX28 reference manual ( [1] ), states

<quote>
As the BCH decoder reads the data and parity blocks, it records a
special condition, i.e.,
that all of the bits of a payload data block or metadata block are
one, including any associated
parity bytes. The all-ones case for both parity and data indicates an
erased block in the
NAND device.
</quote>

[1] http://free-electrons.com/~maxime/pub/datasheet/MCIMX28RM.pdf

>
>> So we experimented with tuning the '0xff' case detecting algorithm
>> (mainly [A] which I improved by caching the correct syndromes for
>> an erased page addition [B]).  Mainly focusing on the theoretical
>> background, Huang turned this into a more down to earth approach.
>>
>> In my own kernel tree I'm still using my patches [A] and [B], and this
>> is what goes out to the field if we need to start shipping (since vanilla
>> has a high risk of turning our devices into bricks), but if I have time
>> I plan to integrate whatever makes it upstream (unless it really messes
>> up my boot times ;) ).
>
> OK, then I think I'll consider only some version of either Huang's patch
> or my own. And if we can straighten out the GPMI data/OOB layout (per
> Huang and my conversation on another fork of this thread), then I think
> we'd want this code in nand_base.

As stated I have no problem with that, on the contrary, I'm very eager to
see an accepted solution to this issue in the kernel.

my 2 cents
E.
Huang Shijie March 13, 2014, 8:04 a.m. UTC | #13
On Wed, Mar 12, 2014 at 09:55:19PM -0700, Brian Norris wrote:
> Hi Huang,
> 
> On Wed, Mar 12, 2014 at 04:06:07PM +0800, Huang Shijie wrote:
> > On Tue, Mar 11, 2014 at 02:11:51AM -0700, Brian Norris wrote:
> > > +	/* Check each ECC step individually */
> > > +	for (i = 0; i < chip->ecc.steps; i++) {
> 
> [...]
> 
> > The gpmi uses a NAND layout like this:
> > 
> > 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> > 	 *    | M |   data   |E|   data   |E|   data   |E|   data   |E|
> > 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> > 
> >         M : The metadata, which is 10 by default.	 
> >  	E : The ECC parity for "data" or "data + M".
> > 
> > If you want to count each sector, it makes the code very complicated. :(
> 
> I see. That does make things complicated.
> 
> But I'm curious: why does GPMI's ecc.read_page_raw() implementation (the
> default) have to give such a different data/spare layout than its
> ecc.read_page() implementation? There seems to be no benefit to this,
> except that it means gpmi-nand doesn't have to implement its own
> read_page_raw() callback...

The raw data of the NAND is like the diagram above.

Do you mean i should implement the ecc.read_page_raw to read out all
the "data"s in the diagram above, and concatenate them together? 

What is the "raw" data meaning? I feel confused now.


> 
> Look, for instance, at nand_base's nand_read_page_raw_syndrome(). It
> actually bothers to untangle a layout like yours and place it into the
> appropriate places in the data and OOB buffers. I think this is the
> ideal implementation for read_page_raw(), at which point my patch makes
> more sense; all NAND drivers should have a (reasonably) similar layout
> when viewed from nand_base, without any mixed data/OOB striping like in
> your diagram.
We'd better add more comment in the:
 * @read_page_raw:	function to read a raw page without ECC

Make it clear that the @read_oob_raw read out the data which do not contain
the ECC parity. 

> 
> Side topic: did you plan on looking at this FIXME in gpmi-nand.c?
> 
>  * FIXME: The following paragraph is incorrect, now that there exist
>  * ecc.read_oob_raw and ecc.write_oob_raw functions.
> 
I should remove these stale comments.

> It might be worth addressing at the same time, since the stale comments
> near the FIXME talk about a difference between a raw and an ECC-based
> "view" of the page. IMO, there should be no difference between these
The difference is caused by the block marker swap. That is maybe a different
topic from this one.

> types of "view", if at all possible. The only difference should be that
> one tries to apply ECC and the other doesn't.
> 
> > > @@ -1554,6 +1610,18 @@ read_retry:
> > >  				break;
> > >  			}
> > >  
> > > +			/* Check an ECC error for 0xff + bitflips? */
> > > +			if ((chip->options & NAND_ECC_NEED_CHECK_FF) &&
> > > +					mtd->ecc_stats.failed - ecc_failures) {
> > > +				ret = nand_verify_erased_page(mtd, bufpoi,
> > > +						page);
> > I think we can merge this code into the read-retry check like:
> 
> There's some very specific ordering requirements for some of the steps,
> so we'd have to be careful about moving this code downward (max_bitflips
> checks; the !aligned buffer copy; nand_transfer_oob()). I'm not sure
> your suggestion is possible. And now that I mention it, I think the
> NAND_NEED_READRDY check might already be a bit misplaced in my patch...

okay.


> 
> > 	---------------------------------------------
> > 			if (mtd->ecc_stats.failed - ecc_failures) {
> > 
> > 				// step 1: count the bitflips, assume N.
> > 				N = bitflips_count();
> > 				// step 2:  
> > 				if (N < ecc_strength) {
> > 					do the erase-page-check;
> > 				} else
> > 					do the read-retry.
> 
> I'm not 100% clear on the pseudocode here, but it looks like you're
> saying erased-page checks and read-retry would be mutually exclusive. I
> believe that is wrong. In fact, they may *both* be required; we should

yes, that is what i meant. they both are required.
But could we find a way to know that it is _NOT_ an erased-page-ECC-fail?

> first check if the ECC failure was just an erased page, and if not, we
> should retry the page.
My pseudocode code is not accurate enough.
Maybe it should like this:
                    if (is_NOT_an_erase_page_failure()) {
			    do the read-retry;
		    } else {
			   if (check_erase_page_ok())
				   /* nothing */;
			   else
				do the read-retry;
		    }

thanks
Huang Shijie
Ezequiel Garcia March 13, 2014, 12:57 p.m. UTC | #14
On Mar 12, Brian Norris wrote:
> Hi Pekon,
> 
> On Thu, Mar 13, 2014 at 05:55:32AM +0000, Pekon Gupta wrote:
> > >From: Brian Norris [mailto:computersforpeace@gmail.com]
> > >On Wed, Mar 12, 2014 at 01:45:15PM +0100, Elie De Brauwer wrote:
> > >> b) "determining you read an erased page". In case of the i.mx (and
> > >> thus GPMI) the BCH block can tell you three things:
> > >>  1. I read all 0xff's
> > >>  2. I read some data and nothing got corrected
> > >>  3. I read something but failed to correct it.
> > >> The third case can have two causes:
> > >>  3.a you read valid data with bitflips exceeding what the BCH could
> > >> correct
> > >>  3.b you read an erased page with bitflips.
> > >>
> > >> Obviously case 3.b is what this discussion is all about, and my quest
> > >> revolved around a means to quickly identify case '3.b'.
> > >
> > >Yes, 3.a vs. 3.b is the big problem.
> [...]
> 
> > I think for OMAP NAND driver there needs to be some help on "1." also.
> > There is no hardware support in GPMC (Ti's controller) to find out if the
> > (read_data + read_oob) == 0xff. So you have to do this comparison in
> > software.
> 
> Really? So if you read a blank (all 0xff) page that has no bitflips, you
> see an ECC error? I'm sorry, but I didn't realize that's how your
> hardware worked. That's the worst hardware ECC design I've seen so far
> :(
> 

I haven't had time to follow this discussion yet, but just wanted to point
out the above is also true for pxa3xx-nand.c.

If a completely erased page is read, the calculated ECC would be 0 and the
stored ECC would be all 0xff (although I never actually checked this).
See the snippet below:

static int pxa3xx_nand_read_page_hwecc(struct mtd_info *mtd,
		struct nand_chip *chip, uint8_t *buf, int oob_required,
		int page)
{
	[..]
	} else if (info->retcode == ERR_UNCORERR) {
		/*
		 * for blank page (all 0xff), HW will calculate its ECC as
		 * 0, which is different from the ECC information within
		 * OOB, ignore such uncorrectable errors
		 */
		if (is_buf_blank(buf, mtd->writesize))
			info->retcode = ERR_NONE;
	}
	[..]
}
Bill Pringlemeir March 13, 2014, 9:32 p.m. UTC | #15
On 11 Mar 2014, Brian Norris wrote:

> Upper layers (e.g., UBI/UBIFS) expect that pages that have been erased
> will return all 1's (0xff). However, modern NAND (MLC, and even some
> SLC) experience bitflips in unprogrammed pages, and so they may not read
> back all 1's. This is problematic for drivers whose ECC cannot correct
> bitflips in an all-0xff page, as they will report an ECC error
> (-EBADMSG) when they come across such a page. This appears to UBIFS as
> "corrupt empty space".

> Several others [1][2] are attempting to solve this problem, but none is
> generically useful, and most (all?) have subtle bugs in their reasoning. Let's
> implement a simple, generic, and correct solution instead.

> To handle such situations, we should implement the following software
> workaround for drivers that need it: when the hardware driver reports an
> ECC error, nand_base should "verify" this error by

> * Re-reading the page without ECC
> * counting the number of 0 bits in each ECC sector (N[i], for i = 0, 1,
> ..., chip->ecc.steps)
> * If any N[i] exceeds the ECC strength (and thus, the maximum allowable
> bitflips) then we consider this to be an uncorrectable sector.
> * If all N[i] are less than the ECC strength, then we "correct" the
> output to all-0xff and report max-bitflips appropriately

One issue is that a raw read will never see 'stuck at one' errors.  I
believe that Elie had a good diagnosis of the issue,

> 3. I read something but failed to correct it.
> The third case can have two causes:
> 3.a you read valid data with bitflips exceeding what the BCH could
>   correct
> 3.b you read an erased page with bitflips.

For 3.b, the permitted value of bitflips should probably be based on the
flash device and not the ECC controller.  If the chip is giving bit
flips on an SLC NAND device, do we wish to continue on 3.b?  I believe
that maybe only some MLC NAND devices might want to permit this.  If the
conclusion is that this is an erased page, then someone is going to write
to it and possibly then see 'stuck at one' issues.  At first glance,
using the ECC strength seems correct, but I don't think that this is
simple data correction in this case.

Another issue is that the management of flash is not at the MTD layer.
The other layers general know when a sector is erased.  There is no hint
ever given to the MTD driver.  For instance, many drivers implement
sub-pages by doing a full page read followed by a sub-page write, where
just the sub-page data is updated in the originally read page.  If this
is happening multiple times (read page w ECC, read page w/o ECC, write
page), the performance to write a sub-page in a known erased sector
could be pretty horrid.  This maybe a fairly common case.

So, I think this statement,

> Obviously, this sequence can be compute intensive if applied heavily.
> Ideally, MTD users should not need to read un-programmed pages very
> often and would require this software check, for instance, only during
> attach/mount checks or certain recovery situations.

... is not quite correct.  It seems common for some upper layers to ask
to read erased data during normal operation?  Or the only MTD drivers I
have looked at have sub-page handling broken and need to fixing.

If this happens only at boot, then I don't think people would be as
concerned about performance.

See: 
 http://git.infradead.org/linux-mtd.git/blob/HEAD:/drivers/mtd/nand/fsmc_nand.c#l800

for another driver which is doing this check.  I think you are right to
ask for an generic API solution to this issue.  However, I believe we
only need to determine whether it is an all xff page or an erased page.
If an upper layer gave a hint that this page is 'known to be written',
then this could be avoided.  I don't think we have such hints?

Fwiw,
Bill Pringlemeir.
Brian Norris March 17, 2014, 4:46 p.m. UTC | #16
On Thu, Mar 13, 2014 at 04:04:27PM +0800, Huang Shijie wrote:
> On Wed, Mar 12, 2014 at 09:55:19PM -0700, Brian Norris wrote:
> > On Wed, Mar 12, 2014 at 04:06:07PM +0800, Huang Shijie wrote:
> > > On Tue, Mar 11, 2014 at 02:11:51AM -0700, Brian Norris wrote:
> > > > +	/* Check each ECC step individually */
> > > > +	for (i = 0; i < chip->ecc.steps; i++) {
> > 
> > [...]
> > 
> > > The gpmi uses a NAND layout like this:
> > > 
> > > 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> > > 	 *    | M |   data   |E|   data   |E|   data   |E|   data   |E|
> > > 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> > > 
> > >         M : The metadata, which is 10 by default.	 
> > >  	E : The ECC parity for "data" or "data + M".
> > > 
> > > If you want to count each sector, it makes the code very complicated. :(
> > 
> > I see. That does make things complicated.
> > 
> > But I'm curious: why does GPMI's ecc.read_page_raw() implementation (the
> > default) have to give such a different data/spare layout than its
> > ecc.read_page() implementation? There seems to be no benefit to this,
> > except that it means gpmi-nand doesn't have to implement its own
> > read_page_raw() callback...
> 
> The raw data of the NAND is like the diagram above.
> 
> Do you mean i should implement the ecc.read_page_raw to read out all
> the "data"s in the diagram above, and concatenate them together? 

Yes, that is essentially what I'm suggesting. Is there a good reason to
have ecc.read_page() and ecc.read_page_raw() look so different to the
caller? Is it important to see the data exactly as it lays on the flash,
rather than concatenated to how the driver/hardware interprets it?

> What is the "raw" data meaning? I feel confused now.

I did not know there was such a confusion. I simply read this:

 * @read_page_raw:      function to read a raw page without ECC

and I assumed it meant that read_page_raw() was the same as read_page(),
except that there would be no *correction* done on the data. I didn't
previously think too hard about what it should look like for HW ECC that
requires shifting/concatenating data around.

> > Look, for instance, at nand_base's nand_read_page_raw_syndrome(). It
> > actually bothers to untangle a layout like yours and place it into the
> > appropriate places in the data and OOB buffers. I think this is the
> > ideal implementation for read_page_raw(), at which point my patch makes
> > more sense; all NAND drivers should have a (reasonably) similar layout
> > when viewed from nand_base, without any mixed data/OOB striping like in
> > your diagram.
> We'd better add more comment in the:
>  * @read_page_raw:	function to read a raw page without ECC
> 
> Make it clear that the @read_oob_raw read out the data which do not contain
> the ECC parity. 

No, that's not the point; the ECC parity data should still be read out
from the flash, and IMO, they should be placed in the same location as
with read_page(). They just shouldn't be used for correction. How about
this?

 * @read_page_raw:	function to read a page without correcting for
			bitflips

> > Side topic: did you plan on looking at this FIXME in gpmi-nand.c?
> > 
> >  * FIXME: The following paragraph is incorrect, now that there exist
> >  * ecc.read_oob_raw and ecc.write_oob_raw functions.
> > 
> I should remove these stale comments.
> 
> > It might be worth addressing at the same time, since the stale comments
> > near the FIXME talk about a difference between a raw and an ECC-based
> > "view" of the page. IMO, there should be no difference between these
> The difference is caused by the block marker swap. That is maybe a different
> topic from this one.

OK.

> > types of "view", if at all possible. The only difference should be that
> > one tries to apply ECC and the other doesn't.

[...]

> > saying erased-page checks and read-retry would be mutually exclusive. I
> > believe that is wrong. In fact, they may *both* be required; we should
> 
> yes, that is what i meant. they both are required.
> But could we find a way to know that it is _NOT_ an erased-page-ECC-fail?

Yes, I think there are a few small optimizations that Pekon has brought
up, for instance, that wouldn't be too complex and could help to quickly
rule out data which is not an erased-page failure.

> > first check if the ECC failure was just an erased page, and if not, we
> > should retry the page.
> My pseudocode code is not accurate enough.
> Maybe it should like this:
>                     if (is_NOT_an_erase_page_failure()) {
> 			    do the read-retry;
> 		    } else {
> 			   if (check_erase_page_ok())
> 				   /* nothing */;
> 			   else
> 				do the read-retry;
> 		    }

This logic is just a convoluted way of saying we should add a
short-circuit path for quickly detecting "this is not an erased page"
into the "check_erase_page_ok()" routine. I think this could be covered
by doing a bitflip count (with an early-exit path) on the original
buffer, before reading out the full raw page again to do the full check.
I'm not sure what other generic options we have.

I'll look at sending a v2 patch soon. Pekon already sent his own
modifications which address some of this.

Brian
pekon gupta March 17, 2014, 5:50 p.m. UTC | #17
> On Wed, Mar 12, 2014 at 09:55:19PM -0700, Brian Norris wrote:
> > On Wed, Mar 12, 2014 at 04:06:07PM +0800, Huang Shijie wrote:
> > > On Tue, Mar 11, 2014 at 02:11:51AM -0700, Brian Norris wrote:
> > > > +       /* Check each ECC step individually */
> > > > +       for (i = 0; i < chip->ecc.steps; i++) {
> >
> > [...]
> >
> > > The gpmi uses a NAND layout like this:
> > >
> > >    *    +---+----------+-+----------+-+----------+-+----------+-+
> > >    *    | M |   data   |E|   data   |E|   data   |E|   data   |E|
> > >    *    +---+----------+-+----------+-+----------+-+----------+-+
> > >
> > >         M : The metadata, which is 10 by default.
> > >   E : The ECC parity for "data" or "data + M".
> > >
> > > If you want to count each sector, it makes the code very complicated. :(
> >
> > I see. That does make things complicated.
> >

I have implemented another version of Brian's patch [1], which eliminates
need for chip->ecc.read_page_raw(). Its tested on OMAP NAND driver,
and does not degrade too much performance either. Please see if it works
well with GPMI and pxa3xx-nand controllers too.

[1] [RFC PATCH v2] mtd: nand: add erased-page bitflip correction
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052625.html


with regards, pekon
Brian Norris March 17, 2014, 6:47 p.m. UTC | #18
On Thu, Mar 13, 2014 at 07:01:13AM +0000, Pekon Gupta wrote:
> >From: Brian Norris [mailto:computersforpeace@gmail.com]
> >Really? So if you read a blank (all 0xff) page that has no bitflips, you
> >see an ECC error? I'm sorry, but I didn't realize that's how your
> >hardware worked. 
> >
> > That's the worst hardware ECC design I've seen so far
> >:(
> >
> - GPMC and ELM hardware engines were design during OMAP3 and earlier
>   time-frames. When only 1-bit Hamming was enough, and there were
>   hardly any bit-flips seen on erased-pages :-).
> - And I don't think h/w IP designers have too much visibility on actual usage
>   and software framework of these engines. So this remained un-changed.
> However, I do have internally raised this issue of getting a feature added
> in hardware to check for all(0xff). But still we have to support legacy devices.

I would think even a hardware designer could figure out that flagging an
ECC error every time you read an all-0xff erased page would be a bad
idea. And that has nothing to do with bit error rates.

Brian
Brian Norris March 17, 2014, 6:53 p.m. UTC | #19
On Thu, Mar 13, 2014 at 09:57:17AM -0300, Ezequiel Garcia wrote:
> On Mar 12, Brian Norris wrote:
> > Really? So if you read a blank (all 0xff) page that has no bitflips, you
> > see an ECC error? I'm sorry, but I didn't realize that's how your
> > hardware worked. That's the worst hardware ECC design I've seen so far
> > :(
> > 
> 
> I haven't had time to follow this discussion yet, but just wanted to point
> out the above is also true for pxa3xx-nand.c.

I'm learning every day!

> If a completely erased page is read, the calculated ECC would be 0 and the
> stored ECC would be all 0xff (although I never actually checked this).
> See the snippet below:
> 
> static int pxa3xx_nand_read_page_hwecc(struct mtd_info *mtd,
> 		struct nand_chip *chip, uint8_t *buf, int oob_required,
> 		int page)
> {
> 	[..]
> 	} else if (info->retcode == ERR_UNCORERR) {
> 		/*
> 		 * for blank page (all 0xff), HW will calculate its ECC as
> 		 * 0, which is different from the ECC information within
> 		 * OOB, ignore such uncorrectable errors
> 		 */
> 		if (is_buf_blank(buf, mtd->writesize))
> 			info->retcode = ERR_NONE;
> 	}
> 	[..]
> }

Looks like pxa3xx_nand will probably need the same flag, then we could
probably kill the is_buf_blank() check.

Brian
Brian Norris March 17, 2014, 7:46 p.m. UTC | #20
On Thu, Mar 13, 2014 at 05:32:02PM -0400, Bill Pringlemeir wrote:
> On 11 Mar 2014, Brian Norris wrote:
> > Upper layers (e.g., UBI/UBIFS) expect that pages that have been erased
> > will return all 1's (0xff). However, modern NAND (MLC, and even some
> > SLC) experience bitflips in unprogrammed pages, and so they may not read
> > back all 1's. This is problematic for drivers whose ECC cannot correct
> > bitflips in an all-0xff page, as they will report an ECC error
> > (-EBADMSG) when they come across such a page. This appears to UBIFS as
> > "corrupt empty space".
> 
> > Several others [1][2] are attempting to solve this problem, but none is
> > generically useful, and most (all?) have subtle bugs in their reasoning. Let's
> > implement a simple, generic, and correct solution instead.
> 
> > To handle such situations, we should implement the following software
> > workaround for drivers that need it: when the hardware driver reports an
> > ECC error, nand_base should "verify" this error by
> 
> > * Re-reading the page without ECC
> > * counting the number of 0 bits in each ECC sector (N[i], for i = 0, 1,
> > ..., chip->ecc.steps)
> > * If any N[i] exceeds the ECC strength (and thus, the maximum allowable
> > bitflips) then we consider this to be an uncorrectable sector.
> > * If all N[i] are less than the ECC strength, then we "correct" the
> > output to all-0xff and report max-bitflips appropriately
> 
> One issue is that a raw read will never see 'stuck at one' errors.  I
> believe that Elie had a good diagnosis of the issue,

I'm not aware of Elie's diagnosis of 'stuck at one' errors. Perhaps it
is lost somewhere in the many revisions of Eli's original patch series?

But I think that's a good point. We can't allow 100% of the
potentially-correctible flips to be 1->0 flips, since we may see more
0->1 flips once we try to program.

> > 3. I read something but failed to correct it.
> > The third case can have two causes:
> > 3.a you read valid data with bitflips exceeding what the BCH could
> >   correct
> > 3.b you read an erased page with bitflips.
> 
> For 3.b, the permitted value of bitflips should probably be based on the
> flash device and not the ECC controller.  If the chip is giving bit
> flips on an SLC NAND device, do we wish to continue on 3.b?  I believe
> that maybe only some MLC NAND devices might want to permit this.

I don't think your belief matches reality ;) I have seen reports from
users that very much looked like bitflips in an erased page (I didn't
personally diagnose it), and they were using SLC NAND. I don't think
that the SLC vs. MLC distinction is really so strong in some of these
scenarios.

> If the
> conclusion is that this is an erased page, then someone is going to write
> to it and possibly then see 'stuck at one' issues.  At first glance,
> using the ECC strength seems correct, but I don't think that this is
> simple data correction in this case.

1->0 and 0->1 flips don't make much difference to ECC; we can correct
either in a programmed page. You do have a good point that we shouldn't
"correct" the full ECC strength in an erased page, since you might see
additional 0->1 flips after you program. But I think this just means we
either have to do better error distribution modeling (not likely!) or
just pick a more reasonable threshold (ecc_strength / 2 ?).

> Another issue is that the management of flash is not at the MTD layer.
> The other layers general know when a sector is erased.  There is no hint
> ever given to the MTD driver.  For instance, many drivers implement
> sub-pages by doing a full page read followed by a sub-page write, where
> just the sub-page data is updated in the originally read page.  If this
> is happening multiple times (read page w ECC, read page w/o ECC, write
> page), the performance to write a sub-page in a known erased sector
> could be pretty horrid.  This maybe a fairly common case.

Subpage writing is performed within the MTD layer. Do you have any
examples aside from subpage writes? I really don't know the specifics of
how UBIFS tries to read blank pages (I think Artem replied pretty
in-depth to Pekon's UBIFS patches about this; I'll have to re-read), but
I did not understand them to be in the hot path.

> So, I think this statement,
> 
> > Obviously, this sequence can be compute intensive if applied heavily.
> > Ideally, MTD users should not need to read un-programmed pages very
> > often and would require this software check, for instance, only during
> > attach/mount checks or certain recovery situations.
> 
> ... is not quite correct.  It seems common for some upper layers to ask
> to read erased data during normal operation?

Is that a question or a statement? I don't think it is common, but feel
free to point out specifics where I'm wrong.

> Or the only MTD drivers I
> have looked at have sub-page handling broken and need to fixing.

I'm curious: which drivers are you looking at?

If MTD drivers are doing "subpage" writing by doing read/modify/write,
that does indeed seem to be rather broken. Or at least, it doesn't seem
very maintainable in the presence of bitflips like this. Can't subpage
writes be done where all "unprogrammed" data is assumed to be 0xff? (I'm
admittedly not very familiar with subpage programming implementations.)

> If this happens only at boot, then I don't think people would be as
> concerned about performance.

I still haven't seen anyone with numbers for real application
performance, not just benchmarks run specifically for erased areas.

> See: 
>  http://git.infradead.org/linux-mtd.git/blob/HEAD:/drivers/mtd/nand/fsmc_nand.c#l800
> 
> for another driver which is doing this check.

Thanks. Looks like the problem is more widespread (and ad-hoc) than I
thought.

> I think you are right to
> ask for an generic API solution to this issue.  However, I believe we
> only need to determine whether it is an all xff page or an erased page.
> If an upper layer gave a hint that this page is 'known to be written',
> then this could be avoided.  I don't think we have such hints?

Brian
Bill Pringlemeir March 17, 2014, 10:55 p.m. UTC | #21
> On Thu, Mar 13, 2014 at 05:32:02PM -0400, Bill Pringlemeir wrote:

>> One issue is that a raw read will never see 'stuck at one' errors.  I
>> believe that Elie had a good diagnosis of the issue,

On 17 Mar 2014, computersforpeace@gmail.com wrote:

> I'm not aware of Elie's diagnosis of 'stuck at one' errors. Perhaps it
> is lost somewhere in the many revisions of Eli's original patch series?

> But I think that's a good point. We can't allow 100% of the
> potentially-correctible flips to be 1->0 flips, since we may see more
> 0->1 flips once we try to program.

Well, that is my point (about 'stuck at one').  Elie's point was just
the need to discern,

 An almost all xff page with errors beyond ECC strength.

vs.

 An erased page with almost all bits xff.

Ie, it is either un-correctable or erased.  Some controllers can detect
all xff, others can't.  However, it seems that even this feature is not
enough as some erase pages can have the 'stuck at zero' bits.  Maybe
that is obvious; but it was a good point for me.

>> For 3.b, the permitted value of bitflips should probably be based on the
>> flash device and not the ECC controller.  If the chip is giving bit
>> flips on an SLC NAND device, do we wish to continue on 3.b?  I believe
>> that maybe only some MLC NAND devices might want to permit this.

> I don't think your belief matches reality ;) I have seen reports from
> users that very much looked like bitflips in an erased page (I didn't
> personally diagnose it), and they were using SLC NAND. I don't think
> that the SLC vs. MLC distinction is really so strong in some of these
> scenarios.

Ok.  But I think that the amount of 'stuck at zero' permitted is most
likely to be flash device based as opposed to controller based.
Probably I am technically wrong about SLC vs MLC.  Just the main point
is it is NAND chip based and not related to the controller.

> Subpage writing is performed within the MTD layer. Do you have any
> examples aside from subpage writes? I really don't know the specifics of
> how UBIFS tries to read blank pages (I think Artem replied pretty
> in-depth to Pekon's UBIFS patches about this; I'll have to re-read), but
> I did not understand them to be in the hot path.

[snip]

> I'm curious: which drivers are you looking at?

> If MTD drivers are doing "subpage" writing by doing read/modify/write,
> that does indeed seem to be rather broken. Or at least, it doesn't seem
> very maintainable in the presence of bitflips like this. Can't subpage
> writes be done where all "unprogrammed" data is assumed to be 0xff? (I'm
> admittedly not very familiar with subpage programming implementations.)

Sub page writing seems to be like this for many Freescale controllers
(the only ones I am familiar with).  I don't think the controllers can
read a partial page.  It must read/write full pages.  As the MTD has no
state about how many sub-pages are written, it does a
'read-modify-write' to ensure that already written data is preserved.
Zero to three sub-pages may have already been written with 2k pages and
512 byte sub-pages chips.

See:
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/mxc_nand.c#n1099

I beleive that the 'NAND_CMD_SEQIN' is sent before a
'NAND_CMD_PAGEPROG', by the MTD/nand base.  This will read the page
before programming it so that sub-pages are done correctly.

Also, I have benchmarked them and read performance is about 2x the write
performance.  When I look at the chip data sheet, this didn't seem to
have to be this way.  Optimizing the read performance also increases the
write performance.  In live systems I have, it seems common to read
erased pages with sub-pages enabled and UBI/UbiFs.

This is fine, it increases information on the device and I don't know of
any other way to work around it?  The controller's only read/write full
pages, so it seems that RMW cycle must be done (unless we had cached
them).  Adding a 2nd read with hardware ECC would be painful, beyond
initial boot scanning.  I have benchmarks that seem to indicate this and
in order to do subpages on yet another controller, I did this 'RMW'.  It
led me to comment on Huang Shijie's patches; mainly to try to figure out
what the best practice is.

Unfortunately, the controller tries to correct the erased page when it
is read with HW ECC.  The first few bytes usually have 'ECC strength'
zeros, before it decided that it can not correct it.  Raw reads are the
fool-proof way to tell the difference.. and the only way without
assumptions on the ECC mathematics (and maybe the only way at all?).

Fwiw,
Bill Pringlemeir.
Bill Pringlemeir March 17, 2014, 11:01 p.m. UTC | #22
> On 17 Mar 2014, computersforpeace@gmail.com wrote:

>> Subpage writing is performed within the MTD layer. Do you have any
>> examples aside from subpage writes? I really don't know the specifics of
>> how UBIFS tries to read blank pages (I think Artem replied pretty
>> in-depth to Pekon's UBIFS patches about this; I'll have to re-read), but
>> I did not understand them to be in the hot path.
>
> [snip]
>
>> I'm curious: which drivers are you looking at?
>
>> If MTD drivers are doing "subpage" writing by doing read/modify/write,
>> that does indeed seem to be rather broken. Or at least, it doesn't seem
>> very maintainable in the presence of bitflips like this. Can't subpage
>> writes be done where all "unprogrammed" data is assumed to be 0xff? (I'm
>> admittedly not very familiar with subpage programming implementations.)


On 17 Mar 2014, bpringlemeir@nbsps.com wrote:

> Sub page writing seems to be like this for many Freescale controllers
> (the only ones I am familiar with).  I don't think the controllers can
> read a partial page.  It must read/write full pages.  As the MTD has no
> state about how many sub-pages are written, it does a
> 'read-modify-write' to ensure that already written data is preserved.
> Zero to three sub-pages may have already been written with 2k pages and
> 512 byte sub-pages chips.

> See:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/mxc_nand.c#n1099

> I beleive that the 'NAND_CMD_SEQIN' is sent before a
> 'NAND_CMD_PAGEPROG', by the MTD/nand base.  This will read the page
> before programming it so that sub-pages are done correctly.

Sorry, this one is even more obvious.

 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/mpc5121_nfc.c#n367

As a matter of fact, I could be wrong about the 'mxc_nand'.  Although
my bench mark data does say that the read is ~2x as fast as the writes
for the 'mxc_nand'.

Regards,
Bill Pringlemeir.
Huang Shijie March 18, 2014, 6:48 a.m. UTC | #23
On Mon, Mar 17, 2014 at 09:46:35AM -0700, Brian Norris wrote:
> On Thu, Mar 13, 2014 at 04:04:27PM +0800, Huang Shijie wrote:
> > On Wed, Mar 12, 2014 at 09:55:19PM -0700, Brian Norris wrote:
> > > On Wed, Mar 12, 2014 at 04:06:07PM +0800, Huang Shijie wrote:
> > > > On Tue, Mar 11, 2014 at 02:11:51AM -0700, Brian Norris wrote:
> > > > > +	/* Check each ECC step individually */
> > > > > +	for (i = 0; i < chip->ecc.steps; i++) {
> > > 
> > > [...]
> > > 
> > > > The gpmi uses a NAND layout like this:
> > > > 
> > > > 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> > > > 	 *    | M |   data   |E|   data   |E|   data   |E|   data   |E|
> > > > 	 *    +---+----------+-+----------+-+----------+-+----------+-+
> > > > 
> > > >         M : The metadata, which is 10 by default.	 
> > > >  	E : The ECC parity for "data" or "data + M".
> > > > 
> > > > If you want to count each sector, it makes the code very complicated. :(
> > > 
> > > I see. That does make things complicated.
> > > 
> > > But I'm curious: why does GPMI's ecc.read_page_raw() implementation (the
> > > default) have to give such a different data/spare layout than its
> > > ecc.read_page() implementation? There seems to be no benefit to this,
> > > except that it means gpmi-nand doesn't have to implement its own
> > > read_page_raw() callback...
> > 
> > The raw data of the NAND is like the diagram above.
> > 
> > Do you mean i should implement the ecc.read_page_raw to read out all
> > the "data"s in the diagram above, and concatenate them together? 
> 
> Yes, that is essentially what I'm suggesting. Is there a good reason to
> have ecc.read_page() and ecc.read_page_raw() look so different to the
> caller? Is it important to see the data exactly as it lays on the flash,
> rather than concatenated to how the driver/hardware interprets it?

[1] Implement the ecc.read_page_raw() is not a easy thing for the GPMI driver.
    The "E"(ECC parity above) can be _NOT_ byte aligned. But the NAND_CMD_RNDOUT
    only works with the byte column.  

    So even you and Pekon have merged this patch set into the MTD, I still prefer to
    do not add the NAND_ECC_NEED_CHECK_FF to the gpmi driver, 
    and implement the specific handle function as i ever sent.

[2] For an erased-page with bitflips, the bitflips can occur at the "E" part too.
    But your patch ignore this case, is it ok? 


> 
> > What is the "raw" data meaning? I feel confused now.
> 
> I did not know there was such a confusion. I simply read this:
> 
>  * @read_page_raw:      function to read a raw page without ECC
> 
> and I assumed it meant that read_page_raw() was the same as read_page(),
> except that there would be no *correction* done on the data. I didn't
> previously think too hard about what it should look like for HW ECC that
> requires shifting/concatenating data around.
> 
> > > Look, for instance, at nand_base's nand_read_page_raw_syndrome(). It
> > > actually bothers to untangle a layout like yours and place it into the
> > > appropriate places in the data and OOB buffers. I think this is the
> > > ideal implementation for read_page_raw(), at which point my patch makes
> > > more sense; all NAND drivers should have a (reasonably) similar layout
> > > when viewed from nand_base, without any mixed data/OOB striping like in
> > > your diagram.
> > We'd better add more comment in the:
> >  * @read_page_raw:	function to read a raw page without ECC
> > 
> > Make it clear that the @read_oob_raw read out the data which do not contain
> > the ECC parity. 
> 
> No, that's not the point; the ECC parity data should still be read out
> from the flash, and IMO, they should be placed in the same location as
> with read_page(). They just shouldn't be used for correction. How about
> this?
> 
>  * @read_page_raw:	function to read a page without correcting for
> 			bitflips
much better to me.

thanks
Huang Shijie
Ricard Wanderlof March 18, 2014, 7:55 a.m. UTC | #24
On Mon, 17 Mar 2014, Brian Norris wrote:

> I would think even a hardware designer could figure out that flagging an
> ECC error every time you read an all-0xff erased page would be a bad
> idea. And that has nothing to do with bit error rates.

One would think so, but I've come across two NAND controllers lately where 
the hardware designers apparently had no clue this was an issue.

The same controllers also have no way of letting the rest of the world 
knowing how many bitflips were corrected during ECC correction. Which 
makes their ECC functionality almost worthless with flashes that exhibit a 
significant amount of bitflips, as it is impossible to know if it's just a 
bit or two that has flipped, or we are close to the error correction 
capability limit.

I think we software poeple have to be more vocal towards the hardware 
people in terms of want we want to see from a software point of view. The 
attitide taken by the hardware side seems to be "let's just correct the 
bits transparently so the software doesn't need to worry", which was 
alright in the 1-bit era but is not a sustainable philosophy nowadays.

/Ricard
diff mbox

Patch

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index 5826da39216e..f3723770e43a 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -1483,6 +1483,62 @@  static int nand_setup_read_retry(struct mtd_info *mtd, int retry_mode)
 }
 
 /**
+ * nand_verify_erased_page - [INTERN] Check a page for 0xff + bitflips
+ * @mtd: MTD device structure
+ * @buf: read buffer; will contain output data (either raw or "corrected")
+ * @page: page to check
+ *
+ * Check a page to see if it is erased (w/ bitflips) after an uncorrectable ECC
+ * error
+ *
+ * On a real error, return a negative error code (-EBADMSG for ECC error), and
+ * buf will contain raw data.
+ * Otherwise, fill buf with 0xff and return the maximum number of
+ * bitflips per ECC sector to the caller.
+ */
+static int nand_verify_erased_page(struct mtd_info *mtd, uint8_t *buf, int page)
+{
+	struct nand_chip *chip = mtd->priv;
+	int i, oobbits, databits;
+	uint8_t *data = buf, *oob = chip->oob_poi;
+	unsigned int max_bitflips = 0;
+	int oob_step = mtd->oobsize / chip->ecc.steps;
+	int ret;
+
+	oobbits = oob_step * 8;
+	databits = chip->ecc.size * 8;
+
+	/* read without ECC for verification */
+	ret = chip->ecc.read_page_raw(mtd, chip, buf, true, page);
+	if (ret)
+		return ret;
+
+	/* Check each ECC step individually */
+	for (i = 0; i < chip->ecc.steps; i++) {
+		unsigned int flips = databits + oobbits;
+
+		flips -= bitmap_weight((unsigned long *)oob, oobbits);
+		flips -= bitmap_weight((unsigned long *)data, databits);
+
+		/* Too many bitflips */
+		if (flips > chip->ecc.strength)
+			return -EBADMSG;
+
+		max_bitflips = max(max_bitflips, flips);
+
+		data += chip->ecc.size;
+		oob += oob_step;
+	}
+
+	pr_debug("correcting bitflips in erased page (max %u)\n",
+			max_bitflips);
+
+	memset(buf, 0xff, mtd->writesize);
+	memset(oob, 0xff, mtd->oobsize);
+
+	return max_bitflips;
+}
+/**
  * nand_do_read_ops - [INTERN] Read data with ECC
  * @mtd: MTD device structure
  * @from: offset to read from
@@ -1554,6 +1610,18 @@  read_retry:
 				break;
 			}
 
+			/* Check an ECC error for 0xff + bitflips? */
+			if ((chip->options & NAND_ECC_NEED_CHECK_FF) &&
+					mtd->ecc_stats.failed - ecc_failures) {
+				ret = nand_verify_erased_page(mtd, bufpoi,
+						page);
+				if (ret >= 0)
+					/* "corrected", so reset failures */
+					mtd->ecc_stats.failed = ecc_failures;
+				else if (ret == -EBADMSG)
+					ret = 0;
+			}
+
 			max_bitflips = max_t(unsigned int, max_bitflips, ret);
 
 			/* Transfer not aligned data */
diff --git a/include/linux/mtd/nand.h b/include/linux/mtd/nand.h
index 0747fef2bfc6..515f61f028ba 100644
--- a/include/linux/mtd/nand.h
+++ b/include/linux/mtd/nand.h
@@ -182,6 +182,11 @@  typedef enum {
  * before calling nand_scan_tail.
  */
 #define NAND_BUSWIDTH_AUTO      0x00080000
+/*
+ * Driver's ECC does not correct erased pages that have bitflips (i.e.,
+ * "mostly" 0xff data), so ECC errors should be checked for mostly-0xff
+ */
+#define NAND_ECC_NEED_CHECK_FF	0x00100000
 
 /* Options set by nand scan */
 /* Nand scan has allocated controller struct */