tango_nand.c: fix ecc.stats_corrected in empty flash case

Message ID	20170422104033.GA14508@amd
State	Superseded
Delegated to:	Boris Brezillon

Pavel Machek April 22, 2017, 10:40 a.m. UTC

Fix ecc.stats_corrected in empty flash case.

Signed-off-by: Pavel Machek <pavel@denx.de>

---

This was suggested by Boris Brezillon in another context. Not tested;
I don't have the hardware.

Marc Gonzalez April 24, 2017, 8:58 a.m. UTC | #1

[ Trimming CC list ]

On 22/04/2017 12:40, Pavel Machek wrote:

> Fix ecc.stats_corrected in empty flash case.
> 
> Signed-off-by: Pavel Machek <pavel@denx.de>
> 
> ---
> 
> This was suggested by Boris Brezillon in another context. Not tested;
> I don't have the hardware.
> 
> diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
> index 4a5e948..db4bff4 100644
> --- a/drivers/mtd/nand/tango_nand.c
> +++ b/drivers/mtd/nand/tango_nand.c
> @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
>  						  chip->ecc.strength);
>  		if (res < 0)
>  			mtd->ecc_stats.failed++;
> +		else
> +			mtd->ecc_stats.corrected += res;
>  
>  		bitflips = max(res, bitflips);
>  		buf += pkt_size;
> 

Hello Pavel,

You may have noticed that ecc_stats.corrected is not updated in
decode_error_report() which is the main code path, i.e. the path
that will succeed 99.99% of the time (HW read).

It turns out that the HW does not report the number of errors
corrected in a page... Instead it reports two values:
1) U = number of errors corrected in the first packet/step
2) V = max number of errors corrected in other packets/steps

Thus, it is not possible to determine the actual number of errors
corrected in a page (unless V is 0). Otherwise, we just have an
interval; let n be the number of packets/steps:

U + V <= corrected errors count <= U + (n-1)*V
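
A minimal sketch of that interval in C, for illustration only. The struct and function names are hypothetical (they are not part of tango_nand.c), and it assumes the page has at least one packet/step:

```c
/* Hypothetical helper type: bounds on the corrected-bit total of one page. */
struct corrected_bounds {
	unsigned int min;
	unsigned int max;
};

/*
 * u: errors corrected in the first packet/step
 * v: max errors corrected in any of the other packets/steps
 * n: number of packets/steps in the page (assumed >= 1)
 */
static struct corrected_bounds corrected_interval(unsigned int u,
						   unsigned int v,
						   unsigned int n)
{
	struct corrected_bounds b;

	if (n < 2)
		v = 0;	/* there are no "other" packets */

	/* At least one other packet saw v errors; at most all n-1 did. */
	b.min = u + v;
	b.max = u + (n - 1) * v;
	return b;
}
```

With u = 2, v = 3 and n = 8 the interval is [5, 23]; it only collapses to a single value when v is 0 (or when n is at most 2).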

In my opinion, it is better to provide no information than to
provide incorrect information. Therefore, I did not update
ecc_stats.corrected in decode_error_report().

One could argue that updating ecc_stats.corrected in
check_erased_page() sets the correct value, since the error
counts are computed in software for each step. But updating
the value here is IMO pointless if we can't do it in the main
code path.
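
For readers without the driver source at hand, here is a rough userspace model of the software path being discussed. It is a sketch under assumptions, not the driver code: `struct ecc_stats_model`, `erased_chunk_bitflips()` and `check_erased_page_model()` are hypothetical names, and the real driver calls `nand_check_erased_ecc_chunk()`, which also inspects the ECC and metadata bytes and returns -EBADMSG rather than -1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters of interest in ecc_stats. */
struct ecc_stats_model {
	unsigned int corrected;
	unsigned int failed;
};

/*
 * Count the 0 bits in one packet of a supposedly erased (all-0xff) page.
 * Returns the count, or -1 once it exceeds 'strength'.
 */
static int erased_chunk_bitflips(const uint8_t *buf, size_t len, int strength)
{
	int flips = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned int zeros = ~buf[i] & 0xffu;

		while (zeros) {
			zeros &= zeros - 1;	/* clear the lowest set bit */
			flips++;
		}
		if (flips > strength)
			return -1;
	}
	return flips;
}

/* Same loop shape as the hunk above, with the proposed 'else' branch. */
static int check_erased_page_model(const uint8_t *buf, size_t pkt_size,
				   int steps, int strength,
				   struct ecc_stats_model *stats)
{
	int bitflips = 0;
	int i;

	for (i = 0; i < steps; i++) {
		int res = erased_chunk_bitflips(buf, pkt_size, strength);

		if (res < 0)
			stats->failed++;
		else
			stats->corrected += (unsigned int)res;	/* added by the patch */

		if (res > bitflips)
			bitflips = res;
		buf += pkt_size;
	}
	return bitflips;
}
```

Because every step is counted in software here, the sum added to `corrected` is exact on this path, unlike the hardware path.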

Regards.

Pavel Machek April 24, 2017, 9:03 a.m. UTC | #2

Hi!

> [ Trimming CC list ]
> 
> On 22/04/2017 12:40, Pavel Machek wrote:
> 
> > Fix ecc.stats_corrected in empty flash case.
> > 
> > Signed-off-by: Pavel Machek <pavel@denx.de>
> > 
> > ---
> > 
> > This was suggested by Boris Brezillon in another context. Not tested;
> > I don't have the hardware.
> > 
> > diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
> > index 4a5e948..db4bff4 100644
> > --- a/drivers/mtd/nand/tango_nand.c
> > +++ b/drivers/mtd/nand/tango_nand.c
> > @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
> >  						  chip->ecc.strength);
> >  		if (res < 0)
> >  			mtd->ecc_stats.failed++;
> > +		else
> > +			mtd->ecc_stats.corrected += res;
> >  
> >  		bitflips = max(res, bitflips);
> >  		buf += pkt_size;
> > 
> 
> Hello Pavel,
> 
> You may have noticed that ecc_stats.corrected is not updated in
> decode_error_report() which is the main code path, i.e. the path
> that will succeed 99.99% of the time (HW read).
> 
> It turns out that the HW does not report the number of errors
> corrected in a page... Instead it reports two values:
> 1) U = number of errors corrected in the first packet/step
> 2) V = max number of errors corrected in other packets/steps
> 
> Thus, it is not possible to determine the actual number of errors
> corrected in a page (unless V is 0). Otherwise, we just have an
> interval; let n be the number of packets/steps:
> 
> U + V <= corrected errors count <= U + (n-1)*V
> 
> In my opinion, it is better to provide no information than to
> provide incorrect information. Therefore, I did not update
> ecc_stats.corrected in decode_error_report().
> 
> One could argue that updating ecc_stats.corrected in
> check_erased_page() sets the correct value, since the error
> counts are computed in software for each step. But updating
> the value here is IMO pointless if we can't do it in the main
> code path.

Aha, thanks for the explanation... perhaps a comment is worth adding? This
is certainly an "interesting" property (people would conclude that
check_erased_page is buggy -- and it is buggy -- but it matches the
behaviour of the rest of the driver). Also... people do copy&paste in
the kernel (I did :-) ) and this is quite a trap for them.

Thanks,
									Pavel

Boris Brezillon May 2, 2017, 9:42 a.m. UTC | #3

Hi Marc,

On Mon, 24 Apr 2017 10:58:47 +0200
Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote:

> [ Trimming CC list ]
> 
> On 22/04/2017 12:40, Pavel Machek wrote:
> 
> > Fix ecc.stats_corrected in empty flash case.
> > 
> > Signed-off-by: Pavel Machek <pavel@denx.de>
> > 
> > ---
> > 
> > This was suggested by Boris Brezillon in another context. Not tested;
> > I don't have the hardware.
> > 
> > diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
> > index 4a5e948..db4bff4 100644
> > --- a/drivers/mtd/nand/tango_nand.c
> > +++ b/drivers/mtd/nand/tango_nand.c
> > @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
> >  						  chip->ecc.strength);
> >  		if (res < 0)
> >  			mtd->ecc_stats.failed++;
> > +		else
> > +			mtd->ecc_stats.corrected += res;
> >  
> >  		bitflips = max(res, bitflips);
> >  		buf += pkt_size;
> >   
> 
> Hello Pavel,
> 
> You may have noticed that ecc_stats.corrected is not updated in
> decode_error_report() which is the main code path, i.e. the path
> that will succeed 99.99% of the time (HW read).
> 
> It turns out that the HW does not report the number of errors
> corrected in a page... Instead it reports two values:
> 1) U = number of errors corrected in the first packet/step
> 2) V = max number of errors corrected in other packets/steps
> 
> Thus, it is not possible to determine the actual number of errors
> corrected in a page (unless V is 0). Otherwise, we just have an
> interval; let n be the number of packets/steps:
> 
> U + V <= corrected errors count <= U + (n-1)*V
> 
> In my opinion, it is better to provide no information than to
> provide incorrect information. Therefore, I did not update
> ecc_stats.corrected in decode_error_report().

Hm, not sure I agree with that. The situation is far from ideal, but
some userspace tools query the number of corrected bits before and
after doing a read operation to report the number of bitflips that
have been corrected. Letting users think there were no bitflips at all is
not a good thing IMO.

> 
> One could argue that updating ecc_stats.corrected in
> check_erased_page() sets the correct value, since the error
> counts are computed in software for each step. But updating
> the value here is IMO pointless if we can't do it in the main
> code path.

Well, it's not pointless if you want to inform the user that some
bitflips were present on the NAND. Yes, the numbers you report
won't be accurate in your case, but at least the user can tell when
bitflips were discovered.

Note that ideally, we should have per-page (or per-block) max-bitflips
information (where max-bitflips is the number returned by
ecc->read_page()), because counting the total number of corrected bits
is certainly not helping when it comes to checking how reliable a NAND
page/eraseblock is.

Regards,

Boris

Marc Gonzalez May 2, 2017, 11:52 a.m. UTC | #4

On 02/05/2017 11:42, Boris Brezillon wrote:

> Hi Marc,
> 
> On Mon, 24 Apr 2017 10:58:47 +0200
> Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote:
> 
>> [ Trimming CC list ]
>>
>> On 22/04/2017 12:40, Pavel Machek wrote:
>>
>>> Fix ecc.stats_corrected in empty flash case.
>>>
>>> Signed-off-by: Pavel Machek <pavel@denx.de>
>>>
>>> ---
>>>
>>> This was suggested by Boris Brezillon in another context. Not tested;
>>> I don't have the hardware.
>>>
>>> diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
>>> index 4a5e948..db4bff4 100644
>>> --- a/drivers/mtd/nand/tango_nand.c
>>> +++ b/drivers/mtd/nand/tango_nand.c
>>> @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
>>>  						  chip->ecc.strength);
>>>  		if (res < 0)
>>>  			mtd->ecc_stats.failed++;
>>> +		else
>>> +			mtd->ecc_stats.corrected += res;
>>>  
>>>  		bitflips = max(res, bitflips);
>>>  		buf += pkt_size;
>>>   
>>
>> Hello Pavel,
>>
>> You may have noticed that ecc_stats.corrected is not updated in
>> decode_error_report() which is the main code path, i.e. the path
>> that will succeed 99.99% of the time (HW read).
>>
>> It turns out that the HW does not report the number of errors
>> corrected in a page... Instead it reports two values:
>> 1) U = number of errors corrected in the first packet/step
>> 2) V = max number of errors corrected in other packets/steps
>>
>> Thus, it is not possible to determine the actual number of errors
>> corrected in a page (unless V is 0). Otherwise, we just have an
>> interval; let n be the number of packets/steps:
>>
>> U + V <= corrected errors count <= U + (n-1)*V
>>
>> In my opinion, it is better to provide no information than to
>> provide incorrect information. Therefore, I did not update
>> ecc_stats.corrected in decode_error_report().
> 
> Hm, not sure I agree with that. The situation is far from ideal, but
> some userspace tools query the number of corrected bits before and
> after doing a read operation to report the number of bitflips that
> have been corrected. Letting users think there were no bitflips at all is
> not a good thing IMO.

I (still) find the API of ecc->read_page somewhat confusing ;-)
Let me try to work through the assumptions, as I remember.

1) If the driver was not able to read the page,
then we return an error code < 0

2) If the driver was able to read the page, and there were
no bitflips, then we return 0

3) If the driver was able to read the page, and there were
no more than 'strength' bitflips in each step, then we return
the *max* of all bitflips across all steps

4) If the driver was able to read the page, but there was
at least one step with too many bitflips, then we are
expected to increment mtd->ecc_stats.failed, and return
the max of all bitflips across successful steps
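
The four cases above can be sketched as a small userspace model. This is only an illustration of the convention as listed; `read_page_result()` and its parameters are hypothetical, and a negative entry in `step_flips` stands for a step the ECC engine could not correct:

```c
#include <stddef.h>

/*
 * io_error:      negative errno if the page could not be read at all, else 0
 * step_flips[i]: bitflips corrected in step i, or < 0 if uncorrectable
 * failed:        counter playing the role of mtd->ecc_stats.failed
 *
 * Returns what a read_page hook would return under cases 1) to 4).
 */
static int read_page_result(int io_error, const int *step_flips,
			    size_t steps, unsigned int *failed)
{
	int max_bitflips = 0;
	size_t i;

	if (io_error < 0)
		return io_error;		/* case 1 */

	for (i = 0; i < steps; i++) {
		if (step_flips[i] < 0) {
			(*failed)++;		/* case 4 */
			continue;
		}
		if (step_flips[i] > max_bitflips)
			max_bitflips = step_flips[i];
	}
	return max_bitflips;			/* cases 2, 3 and 4 */
}
```

Note that nothing in this convention carries the per-page total of corrected bits; that total only exists if the driver adds it to `ecc_stats.corrected` itself.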

AFAICT, the NAND framework does not use or update ecc_stats.corrected?
What about the MTD layer?


>> One could argue that updating ecc_stats.corrected in
>> check_erased_page() sets the correct value, since the error
>> counts are computed in software for each step. But updating
>> the value here is IMO pointless if we can't do it in the main
>> code path.
> 
> Well, it's not pointless if you want to inform the user that some
> bitflips were present on the NAND. Yes, the numbers you report
> won't be accurate in your case, but at least the user can tell when
> bitflips were discovered.

I do report some bitflips value: it's the return value from
ecc->read_page -- which is the max bitflips per step. But perhaps
that value is not available to user-space? In other words,
the NAND and MTD layers do not forward it to user-space, as this
is left as a responsibility of individual drivers?

> Note that ideally, we should have per-page (or per-block) max-bitflips
> information (where max-bitflips is the number returned by
> ecc->read_page()), because counting the total number of corrected bits
> is certainly not helping when it comes to checking how reliable a NAND
> page/eraseblock is.

That would be quite large an array for a 1 TB NAND chip?

Regards.

Boris Brezillon May 2, 2017, 12:20 p.m. UTC | #5

On Tue, 2 May 2017 13:52:30 +0200
Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote:

> On 02/05/2017 11:42, Boris Brezillon wrote:
> 
> > Hi Marc,
> > 
> > On Mon, 24 Apr 2017 10:58:47 +0200
> > Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote:
> >   
> >> [ Trimming CC list ]
> >>
> >> On 22/04/2017 12:40, Pavel Machek wrote:
> >>  
> >>> Fix ecc.stats_corrected in empty flash case.
> >>>
> >>> Signed-off-by: Pavel Machek <pavel@denx.de>
> >>>
> >>> ---
> >>>
> >>> This was suggested by Boris Brezillon in another context. Not tested;
> >>> I don't have the hardware.
> >>>
> >>> diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
> >>> index 4a5e948..db4bff4 100644
> >>> --- a/drivers/mtd/nand/tango_nand.c
> >>> +++ b/drivers/mtd/nand/tango_nand.c
> >>> @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
> >>>  						  chip->ecc.strength);
> >>>  		if (res < 0)
> >>>  			mtd->ecc_stats.failed++;
> >>> +		else
> >>> +			mtd->ecc_stats.corrected += res;
> >>>  
> >>>  		bitflips = max(res, bitflips);
> >>>  		buf += pkt_size;
> >>>     
> >>
> >> Hello Pavel,
> >>
> >> You may have noticed that ecc_stats.corrected is not updated in
> >> decode_error_report() which is the main code path, i.e. the path
> >> that will succeed 99.99% of the time (HW read).
> >>
> >> It turns out that the HW does not report the number of errors
> >> corrected in a page... Instead it reports two values:
> >> 1) U = number of errors corrected in the first packet/step
> >> 2) V = max number of errors corrected in other packets/steps
> >>
> >> Thus, it is not possible to determine the actual number of errors
> >> corrected in a page (unless V is 0). Otherwise, we just have an
> >> interval; let n be the number of packets/steps:
> >>
> >> U + V <= corrected errors count <= U + (n-1)*V
> >>
> >> In my opinion, it is better to provide no information than to
> >> provide incorrect information. Therefore, I did not update
> >> ecc_stats.corrected in decode_error_report().  
> > 
> > Hm, not sure I agree with that. The situation is far from ideal, but
> > some userspace tools query the number of corrected bits before and
> > after doing a read operation to report the number of bitflips that
> > have been corrected. Letting users think there were no bitflips at all is
> > not a good thing IMO.  
> 
> I (still) find the API of ecc->read_page somewhat confusing ;-)
> Let me try to work through the assumptions, as I remember.
> 
> 1) If the driver was not able to read the page,
> then we return an error code < 0
> 
> 2) If the driver was able to read the page, and there were
> no bitflips, then we return 0
> 
> 3) If the driver was able to read the page, and there were
> no more than 'strength' bitflips in each step, then we return
> the *max* of all bitflips across all steps
> 
> 4) If the driver was able to read the page, but there was
> at least one step with too many bitflips, then we are
> expected to increment mtd->ecc_stats.failed, and return
> the max of all bitflips across successful steps

Yep, you got it right.

> 
> AFAICT, the NAND framework does not use or update ecc_stats.corrected?

Nope, but this information is exposed to userspace, and some tools
(like nanddump) use it to calculate the number of bitflips found in a
page. It's definitely not reliable, because someone could have read
a different portion of the MTD device in parallel, thus invalidating the
stats initially retrieved by nanddump, but it works as long as no one
accesses the MTD device while nanddump is used.
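
A minimal sketch of that before/after technique, assuming the two snapshots have already been fetched (a real tool would typically get them with the ECCGETSTATS ioctl on the MTD character device, once before and once after the read). `struct ecc_snapshot` and `ecc_stats_delta()` are hypothetical names that only mirror the `corrected` and `failed` counters:

```c
#include <stdint.h>

/* Hypothetical copy of the two counters of interest from the ECC stats. */
struct ecc_snapshot {
	uint32_t corrected;
	uint32_t failed;
};

/*
 * Difference between two snapshots taken around a read. Unsigned
 * subtraction keeps the result right even if a counter wrapped once.
 * The delta is only attributable to "our" read if nobody else touched
 * the device in between.
 */
static struct ecc_snapshot ecc_stats_delta(struct ecc_snapshot before,
					   struct ecc_snapshot after)
{
	struct ecc_snapshot d;

	d.corrected = after.corrected - before.corrected;
	d.failed = after.failed - before.failed;
	return d;
}
```

If the driver never updates `corrected`, this delta is always 0 for corrected bits, whatever actually happened on the media.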

> What about the MTD layer?

Not sure what you mean by the MTD layer? The MTD layer itself does not
use this information, but it exposes it to its users, so MTD users
might use this information.

> 
> 
> >> One could argue that updating ecc_stats.corrected in
> >> check_erased_page() sets the correct value, since the error
> >> counts are computed in software for each step. But updating
> >> the value here is IMO pointless if we can't do it in the main
> >> code path.  
> > 
> > Well, it's not pointless if you want to inform the user that some
> > bitflips were present on the NAND. Yes, the numbers you report
> > won't be accurate in your case, but at least the user can tell when
> > bitflips were discovered.  
> 
> I do report some bitflips value: it's the return value from
> ecc->read_page -- which is the max bitflips per step. But perhaps
> that value is not available to user-space?

It's indeed not exposed to userspace. It's only used by the MTD core to
decide when to return -EUCLEAN (see here [1]).
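
A rough userspace model of that decision, written from the description above rather than copied from mtdcore.c. The function name is hypothetical; `ecc_strength` and `bitflip_threshold` are assumed to play the role of the MTD device fields of the same names, and `EUCLEAN` comes from the Linux `<errno.h>`:

```c
#include <errno.h>

/*
 * ret_code: value returned by the lower layer, i.e. a negative errno
 *           or the max number of bitflips seen in any ECC step.
 * Returns the status the caller of the read would see.
 */
static int mtd_read_status_model(int ret_code, unsigned int ecc_strength,
				 unsigned int bitflip_threshold)
{
	if (ret_code < 0)
		return ret_code;	/* hard error, passed through */
	if (ecc_strength == 0)
		return 0;		/* device has no ECC */

	return (unsigned int)ret_code >= bitflip_threshold ? -EUCLEAN : 0;
}
```

The bitflip count itself is dropped at this point; the caller only learns whether the threshold was reached.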

> In other words,
> the NAND and MTD layers do not forward it to user-space, as this
> is left as a responsibility of individual drivers?

I don't know the exact reason, but I guess no one needed it so far.
IOW, everyone was happy with the existing corrected/failed stats.

> 
> > Note that ideally, we should have per-page (or per-block) max-bitflips
> > information (where max-bitflips is the number returned by
> > ecc->read_page()), because counting the total number of corrected bits
> > is certainly not helping when it comes to checking how reliable a NAND
> > page/eraseblock is.  
> 
> That would be quite large an array for a 1 TB NAND chip?

The page/eraseblock sizes tend to grow with total chip size, but I
agree, keeping max-bitflips on a per-page basis has a
non-negligible cost.
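
To put a rough number on that cost, here is a back-of-the-envelope helper. The geometry values in the comment (16 KiB pages, 4 MiB eraseblocks) are assumptions chosen for illustration, not figures from this thread, and the function name is hypothetical:

```c
#include <stdint.h>

/*
 * Bytes needed for a table holding one max-bitflips byte per tracking
 * unit (page or eraseblock).
 *
 * Example, 1 TiB chip:
 *   16 KiB pages      -> 2^40 / 2^14 = 2^26 entries = 64 MiB
 *   4 MiB eraseblocks -> 2^40 / 2^22 = 2^18 entries = 256 KiB
 */
static uint64_t max_bitflips_table_bytes(uint64_t chip_bytes,
					 uint64_t unit_bytes)
{
	return unit_bytes ? chip_bytes / unit_bytes : 0;
}
```

Under those assumptions, per-page tracking costs tens of megabytes of RAM while per-eraseblock tracking stays well under a megabyte.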

Anyway, I think we went too far. I was just arguing that never updating
stats->corrected just because you don't know the exact number of
bitflips is not necessarily a better idea than updating corrected with
a potentially invalid number of bitflips. At least the 2nd approach
shows that bitflips were present on the media.

Regards,

Boris

[1]http://lxr.free-electrons.com/source/drivers/mtd/mtdcore.c#L1043

Pavel Machek May 3, 2017, 8:02 p.m. UTC | #6

Hi!

> Anyway, I think we went too far. I was just arguing that never updating
> stats->corrected just because you don't know the exact number of
> bitflips is not necessarily a better idea than updating corrected with
> a potentially invalid number of bitflips. At least the 2nd approach
> shows that bitflips were present on the media.

I have to agree here. Indication of "we have errors" is important.

									Pavel

Pavel Machek May 3, 2017, 8:04 p.m. UTC | #7

Hi!
On Mon 2017-04-24 10:58:47, Marc Gonzalez wrote:
> [ Trimming CC list ]
> 
> On 22/04/2017 12:40, Pavel Machek wrote:
> 
> > Fix ecc.stats_corrected in empty flash case.
> > 
> > Signed-off-by: Pavel Machek <pavel@denx.de>
> > 
> > ---
> > 
> > This was suggested by Boris Brezillon in another context. Not tested;
> > I don't have the hardware.
> > 
> > diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
> > index 4a5e948..db4bff4 100644
> > --- a/drivers/mtd/nand/tango_nand.c
> > +++ b/drivers/mtd/nand/tango_nand.c
> > @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
> >  						  chip->ecc.strength);
> >  		if (res < 0)
> >  			mtd->ecc_stats.failed++;
> > +		else
> > +			mtd->ecc_stats.corrected += res;
> >  
> >  		bitflips = max(res, bitflips);
> >  		buf += pkt_size;
> > 
> 
> Hello Pavel,
> 
> You may have noticed that ecc_stats.corrected is not updated in
> decode_error_report() which is the main code path, i.e. the path
> that will succeed 99.99% of the time (HW read).
> 
> It turns out that the HW does not report the number of errors
> corrected in a page... Instead it reports two values:
> 1) U = number of errors corrected in the first packet/step
> 2) V = max number of errors corrected in other packets/steps
> 
> Thus, it is not possible to determine the actual number of errors
> corrected in a page (unless V is 0). Otherwise, we just have an
> interval; let n be the number of packets/steps:
> 
> U + V <= corrected errors count <= U + (n-1)*V
> 
> In my opinion, it is better to provide no information than to
> provide incorrect information. Therefore, I did not update
> ecc_stats.corrected in decode_error_report().

Well... Having corrected ECC errors is pretty rare, right? So one
solution would be to re-compute ECCs in software if we see U or V >
0...

Regards,
									Pavel

Boris Brezillon May 4, 2017, 8:42 a.m. UTC | #8

On Wed, 3 May 2017 22:04:27 +0200
Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
> On Mon 2017-04-24 10:58:47, Marc Gonzalez wrote:
> > [ Trimming CC list ]
> > 
> > On 22/04/2017 12:40, Pavel Machek wrote:
> >   
> > > Fix ecc.stats_corrected in empty flash case.
> > > 
> > > Signed-off-by: Pavel Machek <pavel@denx.de>
> > > 
> > > ---
> > > 
> > > This was suggested by Boris Brezillon in another context. Not tested;
> > > I don't have the hardware.
> > > 
> > > diff --git a/drivers/mtd/nand/tango_nand.c b/drivers/mtd/nand/tango_nand.c
> > > index 4a5e948..db4bff4 100644
> > > --- a/drivers/mtd/nand/tango_nand.c
> > > +++ b/drivers/mtd/nand/tango_nand.c
> > > @@ -193,6 +193,8 @@ static int check_erased_page(struct nand_chip *chip, u8 *buf)
> > >  						  chip->ecc.strength);
> > >  		if (res < 0)
> > >  			mtd->ecc_stats.failed++;
> > > +		else
> > > +			mtd->ecc_stats.corrected += res;
> > >  
> > >  		bitflips = max(res, bitflips);
> > >  		buf += pkt_size;
> > >   
> > 
> > Hello Pavel,
> > 
> > You may have noticed that ecc_stats.corrected is not updated in
> > decode_error_report() which is the main code path, i.e. the path
> > that will succeed 99.99% of the time (HW read).
> > 
> > It turns out that the HW does not report the number of errors
> > corrected in a page... Instead it reports two values:
> > 1) U = number of errors corrected in the first packet/step
> > 2) V = max number of errors corrected in other packets/steps
> > 
> > Thus, it is not possible to determine the actual number of errors
> > corrected in a page (unless V is 0). Otherwise, we just have an
> > interval; let n be the number of packets/steps:
> > 
> > U + V <= corrected errors count <= U + (n-1)*V
> > 
> > In my opinion, it is better to provide no information than to
> > provide incorrect information. Therefore, I did not update
> > ecc_stats.corrected in decode_error_report().  
> 
> Well... Having corrected ECC errors is pretty rare, right?

Depends on the NAND chip. Modern SLC NAND chips requiring ECC of
8 bits/512 bytes are likely to have frequent bitflips.

> So one
> solution would be to re-compute ECCs in software if we see U or V >
> 0...

Hm, not sure it's worth the trouble for statistics that are anyway
rarely used, and when they are, are only used as a metric to determine
how worn the NAND is.

I'd prefer to see a better user-space interface returning the
max_bitflips information when someone reads from an MTD device (see [1])
rather than trying to fix drivers to return the exact number of
corrected bitflips (which might be impossible for some of them anyway).

[1]http://lists.infradead.org/pipermail/linux-mtd/2016-April/067187.html

Pavel Machek May 17, 2017, 12:04 p.m. UTC | #9

Hi!

> > > Hello Pavel,
> > > 
> > > You may have noticed that ecc_stats.corrected is not updated in
> > > decode_error_report() which is the main code path, i.e. the path
> > > that will succeed 99.99% of the time (HW read).
> > > 
> > > It turns out that the HW does not report the number of errors
> > > corrected in a page... Instead it reports two values:
> > > 1) U = number of errors corrected in the first packet/step
> > > 2) V = max number of errors corrected in other packets/steps
> > > 
> > > Thus, it is not possible to determine the actual number of errors
> > > corrected in a page (unless V is 0). Otherwise, we just have an
> > > interval; let n be the number of packets/steps:
> > > 
> > > U + V <= corrected errors count <= U + (n-1)*V
> > > 
> > > In my opinion, it is better to provide no information than to
> > > provide incorrect information. Therefore, I did not update
> > > ecc_stats.corrected in decode_error_report().  
> > 
> > Well... Having corrected ECC errors is pretty rare, right?
> 
> Depends on the NAND chip. Modern SLC NAND chips requiring ECC of
> 8 bits/512 bytes are likely to have frequent bitflips.
> 
> > So one
> > solution would be to re-compute ECCs in software if we see U or V >
> > 0...
> 
> Hm, not sure it's worth the trouble for statistics that are anyway
> rarely used, and when they are, are only used as a metric to determine
> how worn the NAND is.

Well, knowing "how worn the NAND is" and "when to refresh the
data". Both seem to be quite important for a storage system that works.

> I'd prefer to see a better user-space interface returning the
> max_bitflips information when someone reads from an MTD device (see [1])
> rather than trying to fix drivers to return the exact number of
> corrected bitflips (which might be impossible for some of them anyway).
> 
> > [1]http://lists.infradead.org/pipermail/linux-mtd/2016-April/067187.html

Yes, current interface leaves something to be desired...

Best regards,
										Pavel
