diff mbox

flash bbt broken due to unitialized bitflip_threshold?

Message ID 20120606125013.5897a02d@pixies.home.jungo.com
State New, archived
Headers show

Commit Message

Shmulik Ladkani June 6, 2012, 9:50 a.m. UTC
Hi Sascha,

On Wed, 6 Jun 2012 00:06:47 +0200 Sascha Hauer <s.hauer@pengutronix.de> wrote:
> Hi All,
> 
> nand_scan_tail calls chip->scan_bbt which in case of a flash based bbt
> calls mtd_read. mtd_read returns -EUCLEAN when the number of errors
> exceeds bitflip_threshold. The problem is that bitflip_threshold is
> uninitialized at that time, it is initialized in add_mtd_device which is
> called after nand_scan. This is seen on the mxc_nand controller, but
> probably on other drivers aswell.
> 
> Am I missing something?

Yep, you are correct.
The result is that the BBTs will always be scrubbed (re-programmed).

I think the below fix is missing, should do the job.

Mike, does it look sane to you?

It assigns a 'bitflip_threshold' for the master mtd
(defaults to ecc_strength unless previously set by the driver).

It makes page reads of the BBT area adhere to an erase-block "health"
policy, either initially set by the driver, or set by default to
ecc_strength.

Note that if multiple partitions are later registered, one can't
later override the master's bitflip_threshold via sysfs.
But I guess this isn't important as the BBT is read once, when the nand
is probed.

Comments

Artem Bityutskiy June 6, 2012, 1:30 p.m. UTC | #1
On Wed, 2012-06-06 at 12:50 +0300, Shmulik Ladkani wrote:
> +	if (!mtd->bitflip_threshold)
> +		mtd->bitflip_threshold = mtd->ecc_strength;

Hmm, why se default is to report bit-flips only when one more flipping
bit would cause an ECC error? The default be 1 - report on any bit-flip.
The same should be done in 'add_mtd_device()' - we should preserve the
old behavior, as it was before these patches. Do I miss something?
Shmulik Ladkani June 6, 2012, 3:15 p.m. UTC | #2
Hi Artem,

On Wed, 06 Jun 2012 16:30:53 +0300 Artem Bityutskiy <artem.bityutskiy@linux.intel.com> wrote:
> On Wed, 2012-06-06 at 12:50 +0300, Shmulik Ladkani wrote:
> > +	if (!mtd->bitflip_threshold)
> > +		mtd->bitflip_threshold = mtd->ecc_strength;
> 
> Hmm, why se default is to report bit-flips only when one more flipping
> bit would cause an ECC error? The default be 1 - report on any bit-flip.
> The same should be done in 'add_mtd_device()' - we should preserve the
> old behavior, as it was before these patches. Do I miss something?

Mike's patchset modified behavior of mtd_read() to return EUCLEAN only
if max number of bitflips (per ecc step) exceeds the bitflip_threshold:

	return ret_code >= mtd->bitflip_threshold ? -EUCLEAN : 0;

mtd->bitflip_threshold can be set in the following ways:
- By the lowlevel driver
- By 'add_mtd_device'. Here it defaults to 'ecc_strength' if NOT
  previously set by the driver.
- Using sysfs attribute

Currently, no driver explicitly sets bitflip_threshold, so it defaults
to ecc_strength. This is not necessarily 1; it depends on the hardware
driver, or software algorithm, etc...

So the "old behavior" was not preserved by the patchset, at least for
those setups with ecc_strength greater than 1. And I don't think the
intention was to preserve 'bitflip_threshold' as 1.

As Sascha spotted, the patchset created a problem for 'scan_bbt'.
This is since 'scan_bbt' calls 'mtd_read()' to read the BBT pages,
however 'bitflip_threshold' is usually NOT YET assigned at that point,
which results in mtd_read's condition (quoted above) to ALWAYS
return -EUCLEAN.
This leads to constantly scrubbing the BBT. Ouch.

My suggestion is to assign the default value of 'ecc_strength' to
'bitflip_threshold' at the end of 'nand_scan_tail', PRIOR the call to
scan_bbt().
Hence, BBT will be scrubbed only if maximum bitflips exceeded the
'ecc_strength' of this mtd (which is the default behavior).

Does it make sense to you?

Or would you like to scrub the BBT upon every bitflip, regardless
ecc_strength?

Regards,
Shmulik
Artem Bityutskiy June 6, 2012, 3:46 p.m. UTC | #3
On Wed, 2012-06-06 at 18:15 +0300, Shmulik Ladkani wrote:
> - By 'add_mtd_device'. Here it defaults to 'ecc_strength' if NOT
>   previously set by the driver.

But this is wrong. If I use the old doc2000 driver, with ecc_strength =
2, and it works fine for me, and I am happy that UBI scrubs for a single
bit-flip, why should my system become broken because someone decided
that now UBI should start scrubbing on 2 bit-flips?

We should not change the defaults - if I do not set the threshold via
sysfs of in the driver, it should be 1.

Unless I am completely confused, we should change this, CC -stable if
needed, and ask dwmw2 to merge that.
Shmulik Ladkani June 6, 2012, 4:08 p.m. UTC | #4
Hi,

On Wed, 06 Jun 2012 18:46:15 +0300 Artem Bityutskiy <artem.bityutskiy@linux.intel.com> wrote:
> On Wed, 2012-06-06 at 18:15 +0300, Shmulik Ladkani wrote:
> > - By 'add_mtd_device'. Here it defaults to 'ecc_strength' if NOT
> >   previously set by the driver.
> 
> But this is wrong. If I use the old doc2000 driver, with ecc_strength =
> 2, and it works fine for me, and I am happy that UBI scrubs for a single
> bit-flip, why should my system become broken because someone decided
> that now UBI should start scrubbing on 2 bit-flips?

As I remember, the motivation was to reduce unnecessary scrubbing for
devices exposing high rate of bitflips but having strong ECC to
compensate for.
For these users, the system was "broken" in their perspective, as it
performed too many scrubbing...

Anyway, I understand your point preserving the old behavior.
I thought you were aware of this change.

> We should not change the defaults - if I do not set the threshold via
> sysfs of in the driver, it should be 1.

Fair point. This is safer, backwords compatible.

But eventually, wouldn't we end up with all the drivers assigning
bitflip_threshold to ecc_strength?

> Unless I am completely confused, we should change this, CC -stable if
> needed, and ask dwmw2 to merge that.

No you're not confused.
If we'd like to preserve old behavior, the default assignment needs to
be changed.

BTW this is independent of the 'scan_bbt' bug spotted by Sascha.

Regards,
Shmulik
Ivan Djelic June 6, 2012, 5:55 p.m. UTC | #5
On Wed, Jun 06, 2012 at 04:46:15PM +0100, Artem Bityutskiy wrote:
> On Wed, 2012-06-06 at 18:15 +0300, Shmulik Ladkani wrote:
> > - By 'add_mtd_device'. Here it defaults to 'ecc_strength' if NOT
> >   previously set by the driver.
> 
> But this is wrong. If I use the old doc2000 driver, with ecc_strength =
> 2, and it works fine for me, and I am happy that UBI scrubs for a single
> bit-flip, why should my system become broken because someone decided
> that now UBI should start scrubbing on 2 bit-flips?
> 
> We should not change the defaults - if I do not set the threshold via
> sysfs of in the driver, it should be 1.
> 
> Unless I am completely confused, we should change this, CC -stable if
> needed, and ask dwmw2 to merge that.
> 

Hi Artem,

The "safest" bitflip_threshold value actually depends on the nand device you are using.
If I want to use today's 4-bit and 8-bit nand devices, a threshold of 1
will make my system unusable (because of constant scrubbing and block torturing).

The rationale behind Mike's patch is that it should be safe to delay block scrubbing
until the nand controller has reached its maximum correction capability; at this point,
any additional bitflip would produce uncorrectable errors, so scrubbing is necessary.
This is the recommended policy in nand datasheets when the device ecc requirement matches
the controller capability. In your doc2000 example, why would it break with a threshold of 2 ?

On the other hand, I agree that we should not force this kind of changes on existing systems.
Let us list use cases:

1. on legacy systems with 1-bit nand and strength = 1, default bitflip_threshold is 1
2. on legacy systems with 1-bit nand and strength > 1, default bitflip_threshold is 'strength'
3. on new systems   with 2+ bit nand and strength > 1, default bitflip_threshold is 'strength'

Case 1 is OK, no change.

Case 2 introduces an unwanted change.

Case 3 (which was not properly handled until Mike's patch) seems OK to me, it fits current datasheet
requirements and the driver still has the possibility to explicitly set bitflip_threshold on a
per-board basis (because it really depends on the nand device).

So rather than reverting to a default bitflip_threshold of 1 for everyone, I suggest we revert the
change on case 2 legacy systems like the doc2000 by explicitly setting bitflip_threshold to 1 in the driver.
What do you think ?

BR,
Artem Bityutskiy June 7, 2012, 7:36 a.m. UTC | #6
On Wed, 2012-06-06 at 19:55 +0200, Ivan Djelic wrote:
> 1. on legacy systems with 1-bit nand and strength = 1, default bitflip_threshold is 1
> 2. on legacy systems with 1-bit nand and strength > 1, default bitflip_threshold is 'strength'
> 3. on new systems   with 2+ bit nand and strength > 1, default bitflip_threshold is 'strength'

Ivan, Shmulik,

I've gave this another though, and I think it is OK to leave this as it
is now. Your replies were very helpful, thanks.

Shmulik, would you please send your BBT read fix as a nice patch?
Please, add a not to the commit message that this fixes an issue
introduced in 3.5-rc1 and should be merged before 3.5. Thanks!
Artem Bityutskiy June 7, 2012, 7:43 a.m. UTC | #7
On Wed, 2012-06-06 at 19:55 +0200, Ivan Djelic wrote:
> In your doc2000 example, why would it break with a threshold of 2 ?

Yes, "break" was a wrong word. It would just change the behavior. But I
guess it would in fact _improve_ the behavior. Let's not try treat the
legacy specially, I think we can indeed treat this change as an
improvement and se if anyone complains.
Shmulik Ladkani June 7, 2012, 2:02 p.m. UTC | #8
On Thu, 07 Jun 2012 10:36:10 +0300 Artem Bityutskiy <artem.bityutskiy@linux.intel.com> wrote:
> Shmulik, would you please send your BBT read fix as a nice patch?
> Please, add a not to the commit message that this fixes an issue
> introduced in 3.5-rc1 and should be merged before 3.5. Thanks!

Will do.

Regards,
Shmulik
diff mbox

Patch

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index 61805e7..46c7d2b 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -3484,6 +3484,8 @@  int nand_scan_tail(struct mtd_info *mtd)
 	/* propagate ecc info to mtd_info */
 	mtd->ecclayout = chip->ecc.layout;
 	mtd->ecc_strength = chip->ecc.strength;
+	if (!mtd->bitflip_threshold)
+		mtd->bitflip_threshold = mtd->ecc_strength;
 
 	/* Check, if we should skip the bad block table scan */
 	if (chip->options & NAND_SKIP_BBTSCAN)