diff mbox

mtd: gpmi: Deal with bitflips in erased regions regions

Message ID 20980858CB6D3A4BAE95CA194937D5E73EA56E78@DBDE04.ent.ti.com
State RFC
Headers show

Commit Message

pekon gupta Dec. 17, 2013, 10:17 a.m. UTC
>From: Elie De Brauwer
>
>The BCH block typically used with a GPMI block on an i.MX28/i.MX6 is only
>able to correct bitflips on data actually streamed through the block.
>When erasing a block the data does not stream through the BCH block
>and therefore no ECC data is written to the NAND chip. This causes
>gpmi_ecc_read_page to return failure as soon as a single non-1-bit is
>found in an erased page. Typically causing problems at higher levels
>(ubifs corrupted empty space warnings). This problem was also observed
>when using SLC NAND devices.
>
>This patch configures the BCH block to mark a block as 'erased' if
>no more than gf_len/2 bitflips are found. Next HW_BCH_STATUS0:ALLONES
>is used to check if the data read were all ones, indicating a read of a
>properly erased chunk was performed. If this was not the case a slow path
>is entered where bitflips are counted and corrected in software,
>allowing the upper layers to take proper actions.
>
>Signed-off-by: Elie De Brauwer <eliedebrauwer@gmail.com>
>Acked-by: Peter Korsgaard <peter@korsgaard.com>
[...]

>--- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>+++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>@@ -286,6 +286,13 @@ int bch_set_geometry(struct gpmi_nand_data *this)
> 			| BF_BCH_FLASH0LAYOUT1_DATAN_SIZE(block_size, this),
> 			r->bch_regs + HW_BCH_FLASH0LAYOUT1);
>
>+	/*
>+	 * Set the tolerance for bitflips when reading erased blocks
>+	 * equal to half the gf_len.
>+	 */
>+	writel((gf_len / 2) & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
>+		r->bch_regs + HW_BCH_MODE);
>+
> 	/* Set *all* chip selects to use layout 0. */
> 	writel(0, r->bch_regs + HW_BCH_LAYOUTSELECT);
>
>@@ -1094,6 +1101,15 @@ int gpmi_is_ready(struct gpmi_nand_data *this, unsigned chip)
> 	return reg & mask;
> }
>
>+/* Returns 1 if the last transaction consisted only out of ones. */
>+int gpmi_all_ones(struct gpmi_nand_data *this)
>+{
>+	struct resources *r = &this->resources;
>+	uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
>+
>+	return reg & BM_BCH_STATUS0_ALLONES_MASK;
>+}
>+
>
Just query.. 
Is GPMI controller is able to detect number of 0s or 1s while reading
the raw data from NAND ?
I'm asking because a similar implementation was used in omap2.c
driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
But, we ended up degrading the driver performance, so even I was
looking for alternative implementations..

 [...]

>
>+/*
>+ * Count the number of 0 bits in a supposed to be
>+ * erased region and correct them. Return the number
>+ * of bitflips or zero when the region was correct.
>+ */
>+static unsigned int erased_sector_bitflips(unsigned char *data,
>+					unsigned int chunk,
>+					struct bch_geometry *geo)
>+{
>+	unsigned int flip_bits = 0;
>+	int i;
>+	int base = geo->ecc_chunk_size * chunk;
>+
>+	/* Count bitflips */
>+	for (i = 0; i < geo->ecc_chunk_size; i++)
>+		flip_bits += hweight8(~data[base + i]);
>+
>+	/* Correct bitflips by 0xFF'ing this chunk. */
>+	if (flip_bits)
>+		memset(&data[base], 0xFF, geo->ecc_chunk_size);

But I don't quite understand here is.. 
Why do you want to mask off the bit-flips in NAND, and give corrected
data to upper layers ? Won't this lead to accumulation of bit-flips ?

Actually UBIFS checks for clean erased-pages before writing anything
on it [1]. And if UBIFS finds an unclean page, it re-erases it without
harming the system to avoid accumulation of bit-flips [2].
(Artem, please correct me if I'm wrong here)..

But with your above patch you will actually fool UBIFS to write on
unclean pages, which will increase the probability of bit-flips failures.
Example:  your erased page had 4-bit flips, but you still reported
data as clean. Now when UBIFS will write on this erased page, it
has the margin of tolerating only 4 more bit-flips.

Now, if you want to get rid of the debug messages and stack_dump
coming into your boot log because of this, then better suppress
debug message in following file
----------------------
----------------------


[1] drivers/mtd/ubi/io.c
@@ ubi_io_write(...)
	err = ubi_self_check_all_ff( ... )

[2] drivers/mtd/ubi/wl.c
@@ wear_leveling_worker( ... )
       /UBI_IO_FF_BITFLIPS


with regards, pekon

Comments

Huang Shijie Dec. 17, 2013, 10:26 a.m. UTC | #1
On Tue, Dec 17, 2013 at 10:17:38AM +0000, Gupta, Pekon wrote:
> >+/* Returns 1 if the last transaction consisted only out of ones. */
> >+int gpmi_all_ones(struct gpmi_nand_data *this)
> >+{
> >+	struct resources *r = &this->resources;
> >+	uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
> >+
> >+	return reg & BM_BCH_STATUS0_ALLONES_MASK;
> >+}
> >+
> >
> Just query.. 
> Is GPMI controller is able to detect number of 0s or 1s while reading
> the raw data from NAND ?

The gpmi can detect the all-one case, in other word, if the page is all
0xff, we can know this case.

Since the bit flips of erase page is very very rare, i think 
the performance is not effected by this patch.


> I'm asking because a similar implementation was used in omap2.c
> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
> But, we ended up degrading the driver performance, so even I was
> looking for alternative implementations..
> 
>  [...]
> 
> >
> >+/*
> >+ * Count the number of 0 bits in a supposed to be
> >+ * erased region and correct them. Return the number
> >+ * of bitflips or zero when the region was correct.
> >+ */
> >+static unsigned int erased_sector_bitflips(unsigned char *data,
> >+					unsigned int chunk,
> >+					struct bch_geometry *geo)
> >+{
> >+	unsigned int flip_bits = 0;
> >+	int i;
> >+	int base = geo->ecc_chunk_size * chunk;
> >+
> >+	/* Count bitflips */
> >+	for (i = 0; i < geo->ecc_chunk_size; i++)
> >+		flip_bits += hweight8(~data[base + i]);
> >+
> >+	/* Correct bitflips by 0xFF'ing this chunk. */
> >+	if (flip_bits)
> >+		memset(&data[base], 0xFF, geo->ecc_chunk_size);
of course, this function could be optimized more better.

> 
> But I don't quite understand here is.. 
> Why do you want to mask off the bit-flips in NAND, and give corrected
> data to upper layers ? Won't this lead to accumulation of bit-flips ?
> 
> Actually UBIFS checks for clean erased-pages before writing anything
> on it [1]. And if UBIFS finds an unclean page, it re-erases it without
> harming the system to avoid accumulation of bit-flips [2].
> (Artem, please correct me if I'm wrong here)..
> 
> But with your above patch you will actually fool UBIFS to write on
> unclean pages, which will increase the probability of bit-flips failures.
> Example:  your erased page had 4-bit flips, but you still reported
> data as clean. Now when UBIFS will write on this erased page, it
> has the margin of tolerating only 4 more bit-flips.
> 
if we use the 8 as the ECC strengh, we still can correct the bitflips,
am I right? I feel a little confused at this.

Btw: From Pekon's comment, i suddenly notice that we should fill the
     ERASE_THRESHOLD like this:

	int threshold = gf_len / 2;

	if (threshold > geo->ecc_strength)
		threshold = geo->ecc_strength;

	/*
	 * Set the tolerance for bitflips when reading erased blocks
	 * equal to half the gf_len.
	 */
	writel(threshold & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
		r->bch_regs + HW_BCH_MODE);


thanks
Huang Shijie
Elie De Brauwer Dec. 17, 2013, 12:38 p.m. UTC | #2
On Tue, Dec 17, 2013 at 11:17 AM, Gupta, Pekon <pekon@ti.com> wrote:
>>From: Elie De Brauwer
>>
>>The BCH block typically used with a GPMI block on an i.MX28/i.MX6 is only
>>able to correct bitflips on data actually streamed through the block.
>>When erasing a block the data does not stream through the BCH block
>>and therefore no ECC data is written to the NAND chip. This causes
>>gpmi_ecc_read_page to return failure as soon as a single non-1-bit is
>>found in an erased page. Typically causing problems at higher levels
>>(ubifs corrupted empty space warnings). This problem was also observed
>>when using SLC NAND devices.
>>
>>This patch configures the BCH block to mark a block as 'erased' if
>>no more than gf_len/2 bitflips are found. Next HW_BCH_STATUS0:ALLONES
>>is used to check if the data read were all ones, indicating a read of a
>>properly erased chunk was performed. If this was not the case a slow path
>>is entered where bitflips are counted and corrected in software,
>>allowing the upper layers to take proper actions.
>>
>>Signed-off-by: Elie De Brauwer <eliedebrauwer@gmail.com>
>>Acked-by: Peter Korsgaard <peter@korsgaard.com>
> [...]
>
>>--- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>>+++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
>>@@ -286,6 +286,13 @@ int bch_set_geometry(struct gpmi_nand_data *this)
>>                       | BF_BCH_FLASH0LAYOUT1_DATAN_SIZE(block_size, this),
>>                       r->bch_regs + HW_BCH_FLASH0LAYOUT1);
>>
>>+      /*
>>+       * Set the tolerance for bitflips when reading erased blocks
>>+       * equal to half the gf_len.
>>+       */
>>+      writel((gf_len / 2) & BM_BCH_MODE_ERASE_THRESHOLD_MASK,
>>+              r->bch_regs + HW_BCH_MODE);
>>+
>>       /* Set *all* chip selects to use layout 0. */
>>       writel(0, r->bch_regs + HW_BCH_LAYOUTSELECT);
>>
>>@@ -1094,6 +1101,15 @@ int gpmi_is_ready(struct gpmi_nand_data *this, unsigned chip)
>>       return reg & mask;
>> }
>>
>>+/* Returns 1 if the last transaction consisted only out of ones. */
>>+int gpmi_all_ones(struct gpmi_nand_data *this)
>>+{
>>+      struct resources *r = &this->resources;
>>+      uint32_t reg = readl(r->bch_regs + HW_BCH_STATUS0);
>>+
>>+      return reg & BM_BCH_STATUS0_ALLONES_MASK;
>>+}
>>+
>>
> Just query..
> Is GPMI controller is able to detect number of 0s or 1s while reading
> the raw data from NAND ?
> I'm asking because a similar implementation was used in omap2.c
> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
> But, we ended up degrading the driver performance, so even I was
> looking for alternative implementations..
>
>  [...]

If you look at my original patch, the correction algorithm is more or
less stolen from omap2 driver. In my original version of this patch
there was indeed a performance penalty which occurred each time you
were reading an erased block. Fortunately Huang Shijie pointed my at
the 'allones' case where the BCH block is able to tell us whether or
net it read 'all ones' as in a properly erased block. We combine this
with the ERASE_THRESHOLD to create a fast path (erased block without
any bitflips) and a slow path (erased block with bitflips) and in this
slow path we correct this. Bringing the overhead of this patch in the
normal case to an additional register access which we only do when the
BCH block told us the block is likely an erased block.



>
>>
>>+/*
>>+ * Count the number of 0 bits in a supposed to be
>>+ * erased region and correct them. Return the number
>>+ * of bitflips or zero when the region was correct.
>>+ */
>>+static unsigned int erased_sector_bitflips(unsigned char *data,
>>+                                      unsigned int chunk,
>>+                                      struct bch_geometry *geo)
>>+{
>>+      unsigned int flip_bits = 0;
>>+      int i;
>>+      int base = geo->ecc_chunk_size * chunk;
>>+
>>+      /* Count bitflips */
>>+      for (i = 0; i < geo->ecc_chunk_size; i++)
>>+              flip_bits += hweight8(~data[base + i]);
>>+
>>+      /* Correct bitflips by 0xFF'ing this chunk. */
>>+      if (flip_bits)
>>+              memset(&data[base], 0xFF, geo->ecc_chunk_size);
>
> But I don't quite understand here is..
> Why do you want to mask off the bit-flips in NAND, and give corrected
> data to upper layers ? Won't this lead to accumulation of bit-flips ?

Actually the data is corrected the same way a properly functioning BCH
block would do if the FEC data would have been written. The trick is
that the in the function which calls this we return the number of
corrected bits (the same way as would happen on 'real' data with
proper ECC).  So we will not trick any upper layer into writing onto
possible damaged data. Whenever the upper layer reads an erased page
we will tell it 'we read what you requested but we corrected 'n'
bitflips' where the tolerance of the 'n' was an open point.  I'm no
expert on a decent value for this tolerance.

An earlier post ( ref:
http://article.gmane.org/gmane.linux.drivers.mtd/46266/match=corrupted+empty+space+again
) this corrupted empty space was tackled at ubifs level, returning
-EUCLEAN instead of -EBADMSG and the result of that was that the
bitflip was not corrected but migrated away to another PEB, hence in
this solution I wanted to tackle this at the mtd level.

I also tested this on a board with bitflips in empty space (see also
my initial posting) and my patch made the issue disappear.

>
> Actually UBIFS checks for clean erased-pages before writing anything
> on it [1]. And if UBIFS finds an unclean page, it re-erases it without
> harming the system to avoid accumulation of bit-flips [2].
> (Artem, please correct me if I'm wrong here)..
>
> But with your above patch you will actually fool UBIFS to write on
> unclean pages, which will increase the probability of bit-flips failures.
> Example:  your erased page had 4-bit flips, but you still reported
> data as clean. Now when UBIFS will write on this erased page, it
> has the margin of tolerating only 4 more bit-flips.
>
> Now, if you want to get rid of the debug messages and stack_dump
> coming into your boot log because of this, then better suppress
> debug message in following file
> ----------------------
> diff --git a/drivers/mtd/ubi/io.c b/drivers/mtd/ubi/io.c
> index bf79def..e589ff3 100644
> --- a/drivers/mtd/ubi/io.c
> +++ b/drivers/mtd/ubi/io.c
> @@ -1430,7 +1430,9 @@ fail:
>         print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1, buf, len, 1);
>         err = -EINVAL;
>  error:
> +#ifdef DEBUG
>         dump_stack();
> +#endif
>         vfree(buf);
>         return err;
>  }
> ----------------------
>
>
> [1] drivers/mtd/ubi/io.c
> @@ ubi_io_write(...)
>         err = ubi_self_check_all_ff( ... )
>
> [2] drivers/mtd/ubi/wl.c
> @@ wear_leveling_worker( ... )
>        /UBI_IO_FF_BITFLIPS
>
>
> with regards, pekon
pekon gupta Dec. 17, 2013, 1:21 p.m. UTC | #3
>From: Elie De Brauwer [mailto:eliedebrauwer@gmail.com]
>>On Tue, Dec 17, 2013 at 11:17 AM, Gupta, Pekon <pekon@ti.com> wrote:
>>>From: Elie De Brauwer
[...]

>> Just query..
>> Is GPMI controller is able to detect number of 0s or 1s while reading
>> the raw data from NAND ?
>> I'm asking because a similar implementation was used in omap2.c
>> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
>> But, we ended up degrading the driver performance, so even I was
>> looking for alternative implementations..
>>
>>  [...]
>
>If you look at my original patch, the correction algorithm is more or
>less stolen from omap2 driver. In my original version of this patch
>there was indeed a performance penalty which occurred each time you
>were reading an erased block. Fortunately Huang Shijie pointed my at
>the 'allones' case where the BCH block is able to tell us whether or
>net it read 'all ones' as in a properly erased block. We combine this
>with the ERASE_THRESHOLD to create a fast path (erased block without
>any bitflips) and a slow path (erased block with bitflips) and in this
>slow path we correct this. Bringing the overhead of this patch in the
>normal case to an additional register access which we only do when the
>BCH block told us the block is likely an erased block.
>
Yes, Shijie explained that..
GPMI (FSL controller) supports 'allones' in hardware. (good controller :-) )
Unfortunately GPMC (TI controller) does not have such mechanism,
hence omap2.c driver had to do all read-data comparisons in software,
leading to performance penalty, if bit-flips were encountered.


>>>
>>>+/*
>>>+ * Count the number of 0 bits in a supposed to be
>>>+ * erased region and correct them. Return the number
>>>+ * of bitflips or zero when the region was correct.
>>>+ */
>>>+static unsigned int erased_sector_bitflips(unsigned char *data,
>>>+                                      unsigned int chunk,
>>>+                                      struct bch_geometry *geo)
>>>+{
>>>+      unsigned int flip_bits = 0;
>>>+      int i;
>>>+      int base = geo->ecc_chunk_size * chunk;
>>>+
>>>+      /* Count bitflips */
>>>+      for (i = 0; i < geo->ecc_chunk_size; i++)
>>>+              flip_bits += hweight8(~data[base + i]);
>>>+
>>>+      /* Correct bitflips by 0xFF'ing this chunk. */
>>>+      if (flip_bits)
>>>+              memset(&data[base], 0xFF, geo->ecc_chunk_size);
>>
>> But I don't quite understand here is..
>> Why do you want to mask off the bit-flips in NAND, and give corrected
>> data to upper layers ? Won't this lead to accumulation of bit-flips ?
>
>Actually the data is corrected the same way a properly functioning BCH
>block would do if the FEC data would have been written. The trick is
>that the in the function which calls this we return the number of
>corrected bits (the same way as would happen on 'real' data with
>proper ECC).  So we will not trick any upper layer into writing onto
>possible damaged data. Whenever the upper layer reads an erased page
>we will tell it 'we read what you requested but we corrected 'n'
>bitflips' where the tolerance of the 'n' was an open point.  I'm no
>expert on a decent value for this tolerance.
>
>An earlier post ( ref:
>http://article.gmane.org/gmane.linux.drivers.mtd/46266/match=corrupted+empty+space+again
>) this corrupted empty space was tackled at ubifs level, returning
>-EUCLEAN instead of -EBADMSG and the result of that was that the
>bitflip was not corrected but migrated away to another PEB, hence in
>this solution I wanted to tackle this at the mtd level.
>
Understood.. Thanks for explaining..
I missed that you were incrementing "err_stats.corrected" on detecting
bit-flips in erased-pages. This caused nand_base.c to return -EUCLEAN.

But, another query is.. 
For erased-pages, does UBI differentiates between -EUCLEAN v/s -EBADMSG ?
I couldn't find such difference in mtd/ubi/*..

In UBI, I could trace that erased-pages are checked only on following paths:
(a) ubi_io_write_data()
	|- ubi_io_write()
		|- ubi_self_check_all_ff()

(b) ubi_io_write_ec_hdr()
	|- ubi_io_write()
		|- ubi_self_check_all_ff()

(c) ubi_io_write_vid_hdr()
	|- ubi_io_write()
		|- ubi_self_check_all_ff()

(d) ubi_wl_get_peb()
	|- ubi_self_check_all_ff()

And in each case if an erased-page page was returned with error
(whether -EUCLEAN or -EBADMSG) the PEB was re-added to
erase-queue for erasure.
So I'm not sure, what is causing UBI to scream on -EBADMSG and
not on -EUCLEAN (just for erased-pages). Do you have any pointers ?


with regards, pekon
Elie De Brauwer Dec. 17, 2013, 1:56 p.m. UTC | #4
On Tue, Dec 17, 2013 at 2:21 PM, Gupta, Pekon <pekon@ti.com> wrote:
>>From: Elie De Brauwer [mailto:eliedebrauwer@gmail.com]
>>>On Tue, Dec 17, 2013 at 11:17 AM, Gupta, Pekon <pekon@ti.com> wrote:
>>>>From: Elie De Brauwer
> [...]
>
>>> Just query..
>>> Is GPMI controller is able to detect number of 0s or 1s while reading
>>> the raw data from NAND ?
>>> I'm asking because a similar implementation was used in omap2.c
>>> driver reference: drivers/mtd/nand/omap2.c: erased_sector_bitflips()
>>> But, we ended up degrading the driver performance, so even I was
>>> looking for alternative implementations..
>>>
>>>  [...]
>>
>>If you look at my original patch, the correction algorithm is more or
>>less stolen from omap2 driver. In my original version of this patch
>>there was indeed a performance penalty which occurred each time you
>>were reading an erased block. Fortunately Huang Shijie pointed my at
>>the 'allones' case where the BCH block is able to tell us whether or
>>net it read 'all ones' as in a properly erased block. We combine this
>>with the ERASE_THRESHOLD to create a fast path (erased block without
>>any bitflips) and a slow path (erased block with bitflips) and in this
>>slow path we correct this. Bringing the overhead of this patch in the
>>normal case to an additional register access which we only do when the
>>BCH block told us the block is likely an erased block.
>>
> Yes, Shijie explained that..
> GPMI (FSL controller) supports 'allones' in hardware. (good controller :-) )
> Unfortunately GPMC (TI controller) does not have such mechanism,
> hence omap2.c driver had to do all read-data comparisons in software,
> leading to performance penalty, if bit-flips were encountered.
>
>
>>>>
>>>>+/*
>>>>+ * Count the number of 0 bits in a supposed to be
>>>>+ * erased region and correct them. Return the number
>>>>+ * of bitflips or zero when the region was correct.
>>>>+ */
>>>>+static unsigned int erased_sector_bitflips(unsigned char *data,
>>>>+                                      unsigned int chunk,
>>>>+                                      struct bch_geometry *geo)
>>>>+{
>>>>+      unsigned int flip_bits = 0;
>>>>+      int i;
>>>>+      int base = geo->ecc_chunk_size * chunk;
>>>>+
>>>>+      /* Count bitflips */
>>>>+      for (i = 0; i < geo->ecc_chunk_size; i++)
>>>>+              flip_bits += hweight8(~data[base + i]);
>>>>+
>>>>+      /* Correct bitflips by 0xFF'ing this chunk. */
>>>>+      if (flip_bits)
>>>>+              memset(&data[base], 0xFF, geo->ecc_chunk_size);
>>>
>>> But I don't quite understand here is..
>>> Why do you want to mask off the bit-flips in NAND, and give corrected
>>> data to upper layers ? Won't this lead to accumulation of bit-flips ?
>>
>>Actually the data is corrected the same way a properly functioning BCH
>>block would do if the FEC data would have been written. The trick is
>>that the in the function which calls this we return the number of
>>corrected bits (the same way as would happen on 'real' data with
>>proper ECC).  So we will not trick any upper layer into writing onto
>>possible damaged data. Whenever the upper layer reads an erased page
>>we will tell it 'we read what you requested but we corrected 'n'
>>bitflips' where the tolerance of the 'n' was an open point.  I'm no
>>expert on a decent value for this tolerance.
>>
>>An earlier post ( ref:
>>http://article.gmane.org/gmane.linux.drivers.mtd/46266/match=corrupted+empty+space+again
>>) this corrupted empty space was tackled at ubifs level, returning
>>-EUCLEAN instead of -EBADMSG and the result of that was that the
>>bitflip was not corrected but migrated away to another PEB, hence in
>>this solution I wanted to tackle this at the mtd level.
>>
> Understood.. Thanks for explaining..
> I missed that you were incrementing "err_stats.corrected" on detecting
> bit-flips in erased-pages. This caused nand_base.c to return -EUCLEAN.
>
> But, another query is..
> For erased-pages, does UBI differentiates between -EUCLEAN v/s -EBADMSG ?
> I couldn't find such difference in mtd/ubi/*..
>
> In UBI, I could trace that erased-pages are checked only on following paths:
> (a) ubi_io_write_data()
>         |- ubi_io_write()
>                 |- ubi_self_check_all_ff()
>
> (b) ubi_io_write_ec_hdr()
>         |- ubi_io_write()
>                 |- ubi_self_check_all_ff()
>
> (c) ubi_io_write_vid_hdr()
>         |- ubi_io_write()
>                 |- ubi_self_check_all_ff()
>
> (d) ubi_wl_get_peb()
>         |- ubi_self_check_all_ff()
>
> And in each case if an erased-page page was returned with error
> (whether -EUCLEAN or -EBADMSG) the PEB was re-added to
> erase-queue for erasure.
> So I'm not sure, what is causing UBI to scream on -EBADMSG and
> not on -EUCLEAN (just for erased-pages). Do you have any pointers ?
>

Well I traced what I made scream in my specific case. And the error
messages I got pointed to the right direction:

[    3.831323] UBI warning: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read only 16384 bytes, retry
[    3.845026] UBI warning: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read only 16384 bytes, retry
[    3.858710] UBI warning: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read only 16384 bytes, retry
[    3.872408] UBI error: ubi_io_read: error -74 (ECC error) while
reading 16384 bytes from PEB 443:245760, read 16384 bytes

And simply looking at ubi_io_read() shows that the code is
distinguishing between EUCLEAN and just deals with it. And EBADMSG
which is making it retry/dump stack and return EIO.

In a next stage I got UBIFS complaining about corrupted empty space
which looks something like:

[    4.011529] UBIFS error (pid 36): ubifs_recover_leb: corrupt empty
space LEB 27:237568, corruption starts at 9815
[    4.021897] UBIFS error (pid 36): ubifs_scanned_corruption:
corruption at LEB 27:247383

So  I actually followed the advice of earlier threads and dove deeper
(into the mtd layers) rather than upwards in the direction of
UBI/UBIFS.

On the other hand, just looking around points me to:
- wear_leveling_worker() which calls ubi_io_read_vid_hdr() (calling in
turn ubi_io_read(), see above ) and which may return a.o
UBI_IO_FF_BITFLIPS which the wear_levelling_worker then marks for
scrubbing.

my 2 cents
E.
diff mbox

Patch

diff --git a/drivers/mtd/ubi/io.c b/drivers/mtd/ubi/io.c
index bf79def..e589ff3 100644
--- a/drivers/mtd/ubi/io.c
+++ b/drivers/mtd/ubi/io.c
@@ -1430,7 +1430,9 @@  fail:
        print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1, buf, len, 1);
        err = -EINVAL;
 error:
+#ifdef DEBUG
        dump_stack();
+#endif
        vfree(buf);
        return err;
 }