[v2,10/53] mtd: nand: denali: fix erased page checking

Message ID 1490191680-14481-11-git-send-email-yamada.masahiro@socionext.com
State Superseded
Delegated to: Boris Brezillon

Commit Message

Masahiro Yamada March 22, 2017, 2:07 p.m. UTC
This part is wrong in multiple ways:

[1] is_erased() is called against "buf" twice, so the second one is
meaningless.  The second call should check chip->oob_poi.

[2] This code block is nested inside a double "if (check_erased_page)".
The inner one is redundant.

[3] Erased page checking without a threshold yields false positives.
Basically, there are two ways for erased page checking:
- read the whole of page + oob in raw transfer, then check if all
  the data are 0xFF.
- read the ECC-corrected page + oob, then check if *almost* all the
  data are 0xFF (bit-flips less than ecc.strength are allowed)
While here, it checks if all data in ECC-corrected page are 0xFF.
This is too strong because not all of the data are 0xFF after they
are manipulated by the ECC engine.  Proper threshold must be taken
into account to avoid false-positive ecc_stats.failed increments.

[4] positive return value for uncorrectable bitflips

The comment of ecc->read_page() says it should return "0 if bitflips
uncorrectable", but the current code could return a positive value
in the case.

This commit solves the problems above.  The nand framework provides
a helper nand_check_erased_ecc_chunk() for erased page check with
threshold.  The driver's own helper is unneeded.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
---

Changes in v2:
  - Squash some patches into one.
  - Use nand_check_erased_ecc_chunk() with threshold

 drivers/mtd/nand/denali.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

Comments

Boris Brezillon March 22, 2017, 8:36 p.m. UTC | #1
On Wed, 22 Mar 2017 23:07:17 +0900
Masahiro Yamada <yamada.masahiro@socionext.com> wrote:

> This part is wrong in multiple ways:
> 
> [1] is_erased() is called against "buf" twice, so the second one is
> meaningless.  The second call should check chip->oob_poi.
> 
> [2] This code block is nested by double "if (check_erase_page)".
> The inner one is redundant.
> 
> [3] Erased page checking without threshold is false-positive.
> Basically, there are two ways for erased page checking:
> - read the whole of page + oob in raw transfer, then check if all
>   the data are 0xFF.
> - read the ECC-corrected page + oob, then check if *almost* all the
>   data are 0xFF (bit-flips less than ecc.strength are allowed)
> While here, it checks if all data in ECC-corrected page are 0xFF.
> This is too strong because not all of the data are 0xFF after they
> are manipulated by the ECC engine.  Proper threshold must be taken
> into account to avoid false-positive ecc_stats.failed increments.

Hm, the ECC engine should not introduce extra bitflips. I've seen 3
different cases in the various ECC engines I worked with:

1/ the ECC engine is able to correct bitflips in erased pages. In this
   case you should trust it and return the number of corrected
   bitflips or increment the ECC failed counter if it reports
   uncorrectable errors.
2/ the ECC engine is able to detect erased pages, but fails to detect
   those containing bitflips. In this case, you should rely on
   the default "empty page" detection and only manually check if the
   page is almost filled with 0xff when an error is reported.
3/ the ECC engine does not detect empty pages at all. In this case, you
   should check if the page is empty (or almost empty) each time an ECC
   error is reported.

In any case, if the ECC engine reports uncorrectable errors, it should
keep the data untouched, which means you don't have to re-read the whole
page in raw mode, only the OOB bytes.
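
As a rough illustration of the "almost filled with 0xff" check mentioned
in cases 2/ and 3/, here is a minimal userspace sketch. The names
sketch_count_zero_bits() and sketch_check_erased_chunk() are hypothetical,
and this is not the implementation of nand_check_erased_ecc_chunk(); it
only assumes that a chunk counts as erased when the number of 0 bits in
its data and ECC bytes does not exceed the ECC strength.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the 0 bits in buf; return -1 as soon as the count exceeds limit. */
static int sketch_count_zero_bits(const uint8_t *buf, size_t len, int limit)
{
	int zeros = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned int v = ~(unsigned int)buf[i] & 0xffu;

		while (v) {
			v &= v - 1;	/* clear the lowest set bit */
			if (++zeros > limit)
				return -1;
		}
	}
	return zeros;
}

/*
 * Hypothetical helper: treat one ECC chunk (data + ECC bytes) as erased
 * if it holds at most "strength" zero bits. On success, reset both
 * buffers to 0xff and return the number of bitflips found. Return -1
 * and leave the buffers untouched otherwise.
 */
static int sketch_check_erased_chunk(uint8_t *data, size_t datalen,
				     uint8_t *ecc, size_t ecclen,
				     int strength)
{
	int d, e;

	assert(strength >= 0);

	d = sketch_count_zero_bits(data, datalen, strength);
	if (d < 0)
		return -1;

	/* The ECC bytes share the same bitflip budget as the data. */
	e = sketch_count_zero_bits(ecc, ecclen, strength - d);
	if (e < 0)
		return -1;

	memset(data, 0xff, datalen);
	memset(ecc, 0xff, ecclen);
	return d + e;
}
```

The point of the shared budget is that data and ECC bytes of a chunk are
protected together, so the threshold applies to their combined bitflips.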

> 
> [4] positive return value for uncorrectable bitflips
> 
> The comment of ecc->read_page() says it should return "0 if bitflips
> uncorrectable", but the current code could return a positive value
> in the case.

This one should probably be fixed in the core. Returning a negative
error code for uncorrectable errors is forbidden, but reporting the
maximum number of bitflips that have been corrected in each valid
ECC sector of the page (even if the page contains uncorrectable
sectors) does not sound like a bad idea to me.

The reason the core asks drivers to return 0 in case of uncorrectable
errors is because it updates the max_bitflips variable before testing
if the page contains uncorrectable errors [1]. Moving this statement
here [2] (in an else branch) should solve the problem for all drivers
returning positive numbers even when uncorrectable errors are detected
in one of the ECC chunks contained in a page.

[1] http://lxr.free-electrons.com/source/drivers/mtd/nand/nand_base.c#L1999
[2] http://lxr.free-electrons.com/source/drivers/mtd/nand/nand_base.c#L2048
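
To make the proposed ordering concrete, here is a toy model of it. It is
not the nand_base.c code: struct sketch_stats, sketch_core_read() and the
sample callback sketch_faulty_read_page() are hypothetical, and a single
page read stands in for the core's read loop. It only assumes that the
core detects an uncorrectable page by comparing the failed counter before
and after the driver callback.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_stats {
	unsigned int failed;	/* uncorrectable ECC chunks seen so far */
};

/* Hypothetical driver callback: returns bitflips corrected, or < 0. */
typedef int (*sketch_read_page_fn)(struct sketch_stats *stats);

/*
 * Sample driver: one chunk is uncorrectable, yet it still reports the
 * 7 bitflips corrected in the valid chunks of the page.
 */
static int sketch_faulty_read_page(struct sketch_stats *stats)
{
	stats->failed++;
	return 7;
}

/*
 * Toy core read path: max_bitflips is only updated in the else branch,
 * that is, when the page did not contain an uncorrectable chunk.
 */
static int sketch_core_read(sketch_read_page_fn read_page,
			    struct sketch_stats *stats, int *ecc_fail)
{
	unsigned int failed_before;
	int max_bitflips = 0;
	int ret;

	assert(read_page != NULL && stats != NULL && ecc_fail != NULL);

	failed_before = stats->failed;
	ret = read_page(stats);
	if (ret < 0)
		return ret;	/* I/O error, not an ECC result */

	if (stats->failed != failed_before)
		*ecc_fail = 1;
	else if (ret > max_bitflips)
		max_bitflips = ret;

	return max_bitflips;
}
```

With this ordering, the positive value returned by a driver for a page
containing an uncorrectable chunk no longer leaks into max_bitflips.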
Boris Brezillon March 22, 2017, 8:56 p.m. UTC | #2
On Wed, 22 Mar 2017 23:07:17 +0900
Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
>  		dev_err(denali->dev,
> @@ -1148,12 +1136,15 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
>  	if (check_erased_page) {
>  		read_oob_data(mtd, chip->oob_poi, denali->page);
>  
> -		/* check ECC failures that may have occurred on erased pages */
> -		if (check_erased_page) {
> -			if (!is_erased(buf, mtd->writesize))
> -				mtd->ecc_stats.failed++;
> -			if (!is_erased(buf, mtd->oobsize))
> -				mtd->ecc_stats.failed++;
> +		stat = nand_check_erased_ecc_chunk(
> +					buf, mtd->writesize,
> +					chip->oob_poi, mtd->oobsize,
> +					NULL, 0,
> +					chip->ecc.strength * chip->ecc.steps);

That's not how it's supposed to be done. Each chunk should be checked
independently. Here is a simple example explaining why this is
important:

Let's consider the following setup:
- 4k pages
- 16bits/1024bytes ECC

With your approach, you turn this into:
- 4k pages
- 64bits/4096bytes ECC

Now suppose you have 32 bitflips in the first 1024 bytes. The real ECC
config is expected to report uncorrectable errors, but your approach
will just report that 32 bits have been fixed, which is wrong.
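
The arithmetic of this example can be checked with a small sketch. The
function names are hypothetical, and the per-chunk bitflip counts are
passed in directly instead of being counted from a real buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* Page-wide check: one threshold of strength * steps for the whole page. */
static bool sketch_page_wide_accepts(const int *flips, int steps, int strength)
{
	int total = 0;
	int i;

	assert(steps > 0 && strength >= 0);

	for (i = 0; i < steps; i++)
		total += flips[i];

	return total <= strength * steps;
}

/* Per-chunk check: every ECC chunk must stay within the ECC strength. */
static bool sketch_per_chunk_accepts(const int *flips, int steps, int strength)
{
	int i;

	assert(steps > 0 && strength >= 0);

	for (i = 0; i < steps; i++)
		if (flips[i] > strength)
			return false;

	return true;
}
```

For a 4k page with 16bits/1024bytes ECC (4 steps) and flips = {32, 0, 0, 0},
the page-wide check accepts the page because 32 <= 64, while the per-chunk
check rejects it because 32 > 16 in the first chunk.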
Masahiro Yamada March 23, 2017, 5:04 a.m. UTC | #3
Hi Boris,

2017-03-23 5:56 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
> On Wed, 22 Mar 2017 23:07:17 +0900
> Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
>>               dev_err(denali->dev,
>> @@ -1148,12 +1136,15 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
>>       if (check_erased_page) {
>>               read_oob_data(mtd, chip->oob_poi, denali->page);
>>
>> -             /* check ECC failures that may have occurred on erased pages */
>> -             if (check_erased_page) {
>> -                     if (!is_erased(buf, mtd->writesize))
>> -                             mtd->ecc_stats.failed++;
>> -                     if (!is_erased(buf, mtd->oobsize))
>> -                             mtd->ecc_stats.failed++;
>> +             stat = nand_check_erased_ecc_chunk(
>> +                                     buf, mtd->writesize,
>> +                                     chip->oob_poi, mtd->oobsize,
>> +                                     NULL, 0,
>> +                                     chip->ecc.strength * chip->ecc.steps);
>
> That's not how it's supposed to be done. Each chunk should be checked
> independently. Here is a simple example explaining why this is
> important:
>
> Let's consider the following setup:
> - 4k pages
> - 16bits/1024bytes ECC
>
> With your approach, you turn this into:
> - 4k pages
> - 64bits/4096bytes ECC
>
> Now suppose you have 32 bitflips in the first 1024 bytes. The real ECC
> config is expected to report uncorrectable errors, but your approach
> will just report that 32 bits have been fixed, which is wrong.


OK.  How about adding a helper as follows:

static int denali_check_erased_page(struct mtd_info *mtd,
                                    struct nand_chip *chip, uint8_t *buf)
{
        uint8_t *ecc_code = chip->buffers->ecccode;
        int ecc_steps = chip->ecc.steps;
        int ecc_size = chip->ecc.size;
        int ecc_bytes = chip->ecc.bytes;
        int i, ret;

        ret = mtd_ooblayout_get_eccbytes(mtd, ecc_code, chip->oob_poi, 0,
                                         chip->ecc.total);
        if (ret)
                return ret;

        for (i = 0; i < ecc_steps; i++) {
                ret = nand_check_erased_ecc_chunk(buf, ecc_size,
                                                  ecc_code, ecc_bytes,
                                                  NULL, 0,
                                                  chip->ecc.strength);
                if (ret < 0)
                        return ret;
                buf += ecc_size;
                ecc_code += ecc_bytes;
        }

        return 0;
}



Then,

                stat = denali_check_erased_page(mtd, chip, buf);
                if (stat < 0) {
                        mtd->ecc_stats.failed++;
                        /* return 0 for uncorrectable bitflips */
                        stat = 0;
                }
Masahiro Yamada March 23, 2017, 5:15 a.m. UTC | #4
Hi Boris,


2017-03-23 5:36 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
> On Wed, 22 Mar 2017 23:07:17 +0900
> Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
>
>> This part is wrong in multiple ways:
>>
>> [1] is_erased() is called against "buf" twice, so the second one is
>> meaningless.  The second call should check chip->oob_poi.
>>
>> [2] This code block is nested by double "if (check_erase_page)".
>> The inner one is redundant.
>>
>> [3] Erased page checking without threshold is false-positive.
>> Basically, there are two ways for erased page checking:
>> - read the whole of page + oob in raw transfer, then check if all
>>   the data are 0xFF.
>> - read the ECC-corrected page + oob, then check if *almost* all the
>>   data are 0xFF (bit-flips less than ecc.strength are allowed)
>> While here, it checks if all data in ECC-corrected page are 0xFF.
>> This is too strong because not all of the data are 0xFF after they
>> are manipulated by the ECC engine.  Proper threshold must be taken
>> into account to avoid false-positive ecc_stats.failed increments.
>
> Hm, the ECC engine should not introduce extra bitflips. I've seen 3
> different cases in the various ECC engine I worked with:
>
> 1/ the ECC engine is able to correct bitflips in erased pages. In this
>    case you should trust it and return the number of corrected
>    bitflips or increment the ECC failed counter if it reports
>    uncorrectable errors.
> 2/ the ECC engine is able to detect erased pages, but fails to detect
>    those containing bitflips in it. In this case, you should rely on
>    the default "empty page" detection and only manually check if the
>    page is almost filled with 0xff when an error is reported.
> 3/ the ECC engine does not detect empty pages at all. In this case, you
>    should check if the page empty (or almost empty) each time an ECC
>    error is reported


I think the Denali is case 3.
But, very new versions of this IP support erased page detection by hardware.
Please see 49/53:
http://patchwork.ozlabs.org/patch/742414/

Unfortunately this feature is not exactly what we want.
We want to detect per-sector emptiness,
but this feature is actually page oriented.

If you are unhappy about this,
it is possible to always turn off this feature
and use software detection (with nand_check_erased_ecc_chunk)



> In any case, if the ECC engine reports uncorrectable errors, it should
> keep the data untouched, which means you don't have to re-read the whole
> page in raw mode, only the OOB bytes.


OK.  We should respect the result from the ECC engine,
but we still need to fill the buffer with 0xff
if the page turned out to be empty.
(nand_check_erased_ecc_chunk() does this for us.)





>>
>> [4] positive return value for uncorrectable bitflips
>>
>> The comment of ecc->read_page() says it should return "0 if bitflips
>> uncorrectable", but the current code could return a positive value
>> in the case.
>
> This one should probably be fixed in the core. Returning a negative
> error code for uncorrectable errors is forbidden, but reporting the
> maximum number of bitflips that have been corrected in each valid
> ECC sector of the page (even if the page contains uncorrectable
> sectors) does not sound like a bad idea to me.
>
> The reason the core asks drivers to return 0 in case of uncorrectable
> errors is because it updates the max_bitflips variable before testing
> if the page contains uncorrectable errors [1]. Moving this statement
> here [2] (in an else branch) should solve the problem for all drivers
> returning positive numbers even when uncorrectable errors are detected
> in one of the ECC chunk contained in a page.


I understood your idea, but do you want this change in this series?
Boris Brezillon March 23, 2017, 7:56 a.m. UTC | #5
On Thu, 23 Mar 2017 14:04:44 +0900
Masahiro Yamada <yamada.masahiro@socionext.com> wrote:

> Hi Boris,
> 
> 2017-03-23 5:56 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
> > On Wed, 22 Mar 2017 23:07:17 +0900
> > Masahiro Yamada <yamada.masahiro@socionext.com> wrote:  
> >>               dev_err(denali->dev,
> >> @@ -1148,12 +1136,15 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
> >>       if (check_erased_page) {
> >>               read_oob_data(mtd, chip->oob_poi, denali->page);
> >>
> >> -             /* check ECC failures that may have occurred on erased pages */
> >> -             if (check_erased_page) {
> >> -                     if (!is_erased(buf, mtd->writesize))
> >> -                             mtd->ecc_stats.failed++;
> >> -                     if (!is_erased(buf, mtd->oobsize))
> >> -                             mtd->ecc_stats.failed++;
> >> +             stat = nand_check_erased_ecc_chunk(
> >> +                                     buf, mtd->writesize,
> >> +                                     chip->oob_poi, mtd->oobsize,
> >> +                                     NULL, 0,
> >> +                                     chip->ecc.strength * chip->ecc.steps);  
> >
> > That's not how it's supposed to be done. Each chunk should be checked
> > independently. Here is a simple example explaining why this is
> > important:
> >
> > Let's consider the following setup:
> > - 4k pages
> > - 16bits/1024bytes ECC
> >
> > With your approach, you turn this into:
> > - 4k pages
> > - 64bits/4096bytes ECC
> >
> > Now suppose you have 32 bitflips in the first 1024 bytes. The real ECC
> > config is expected to report uncorrectable errors, but your approach
> > will just report that 32 bits have been fixed, which is wrong.  
> 
> 
> OK.  How about adding a helper like follows:
> 
> static int denali_check_erased_page(struct mtd_info *mtd,
>                                     struct nand_chip *chip, uint8_t *buf)
> {
>         uint8_t *ecc_code = chip->buffers->ecccode;
>         int ecc_steps = chip->ecc.steps;
>         int ecc_size = chip->ecc.size;
>         int ecc_bytes = chip->ecc.bytes;
>         int i, ret;
> 
>         ret = mtd_ooblayout_get_eccbytes(mtd, ecc_code, chip->oob_poi, 0,
>                                          chip->ecc.total);
>         if (ret)
>                 return ret;
> 
>         for (i = 0; i < ecc_steps; i++) {
>                 ret = nand_check_erased_ecc_chunk(buf, ecc_size,
>                                                   ecc_code, ecc_bytes,
>                                                   NULL, 0,
>                                                   chip->ecc.strength);
>                 if (ret < 0)
>                         return ret;
>                 buf += ecc_size;
>                 ecc_code += ecc_bytes;
>         }
> 
>         return 0;
> }
> 
> 
> 
> Then,
> 
>                 stat = denali_check_erased_page(mtd, chip, buf);
>                 if (stat < 0) {
>                         mtd->ecc_stats.failed++;
>                         /* return 0 for uncorrectable bitflips */
>                         stat = 0;
>                 }

What's the point of checking all ECC chunks if only one contains ECC
errors? I really recommend putting the nand_check_erased_ecc_chunk()
call next to the per-ECC-block correction test.

Also, mtd->ecc_stats.failed is supposed to be incremented each time an
uncorrectable error is detected. In your denali_sw_ecc_fixup()
implementation you can detect errors at the ECC chunk level, so you
should increment ecc_stats.failed for each failure and not once if at
least one chunk is faulty.
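
A minimal sketch of per-chunk accounting, under the assumption that the
fixup code already has one result per ECC chunk (negative when the chunk
is uncorrectable, otherwise the number of bitflips corrected). The struct
and function names are hypothetical and this is not the
denali_sw_ecc_fixup() code.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_ecc_stats {
	unsigned int corrected;
	unsigned int failed;
};

/*
 * Account for each ECC chunk separately: every uncorrectable chunk bumps
 * stats->failed, every corrected bitflip is added to stats->corrected.
 * Returns the maximum number of bitflips corrected in a single chunk.
 */
static int sketch_account_chunks(const int *chunk_result, int steps,
				 struct sketch_ecc_stats *stats)
{
	int max_bitflips = 0;
	int i;

	assert(chunk_result != NULL && stats != NULL && steps > 0);

	for (i = 0; i < steps; i++) {
		if (chunk_result[i] < 0) {
			stats->failed++;	/* one increment per bad chunk */
			continue;
		}

		stats->corrected += (unsigned int)chunk_result[i];
		if (chunk_result[i] > max_bitflips)
			max_bitflips = chunk_result[i];
	}

	return max_bitflips;
}
```

With chunk results {3, -1, 0, -1}, failed grows by 2 rather than 1. Whether
the caller may then return the positive maximum for such a page depends on
the core behaviour discussed elsewhere in this thread.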
Boris Brezillon March 23, 2017, 8:03 a.m. UTC | #6
On Thu, 23 Mar 2017 14:15:59 +0900
Masahiro Yamada <yamada.masahiro@socionext.com> wrote:

> Hi Boris,
> 
> 
> 2017-03-23 5:36 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
> > On Wed, 22 Mar 2017 23:07:17 +0900
> > Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
> >  
> >> This part is wrong in multiple ways:
> >>
> >> [1] is_erased() is called against "buf" twice, so the second one is
> >> meaningless.  The second call should check chip->oob_poi.
> >>
> >> [2] This code block is nested by double "if (check_erase_page)".
> >> The inner one is redundant.
> >>
> >> [3] Erased page checking without threshold is false-positive.
> >> Basically, there are two ways for erased page checking:
> >> - read the whole of page + oob in raw transfer, then check if all
> >>   the data are 0xFF.
> >> - read the ECC-corrected page + oob, then check if *almost* all the
> >>   data are 0xFF (bit-flips less than ecc.strength are allowed)
> >> While here, it checks if all data in ECC-corrected page are 0xFF.
> >> This is too strong because not all of the data are 0xFF after they
> >> are manipulated by the ECC engine.  Proper threshold must be taken
> >> into account to avoid false-positive ecc_stats.failed increments.  
> >
> > Hm, the ECC engine should not introduce extra bitflips. I've seen 3
> > different cases in the various ECC engine I worked with:
> >
> > 1/ the ECC engine is able to correct bitflips in erased pages. In this
> >    case you should trust it and return the number of corrected
> >    bitflips or increment the ECC failed counter if it reports
> >    uncorrectable errors.
> > 2/ the ECC engine is able to detect erased pages, but fails to detect
> >    those containing bitflips in it. In this case, you should rely on
> >    the default "empty page" detection and only manually check if the
> >    page is almost filled with 0xff when an error is reported.
> > 3/ the ECC engine does not detect empty pages at all. In this case, you
> >    should check if the page empty (or almost empty) each time an ECC
> >    error is reported  
> 
> 
> I think the Denali is case 3.
> But, very new versions of this IP support erased page detection by hardware.
> Please see 49/53:
> http://patchwork.ozlabs.org/patch/742414/
> 
> Unfortunately this feature is not exactly what we want.
> We want to detect per-sector empty'ness,
> but this features is actually page oriented.
> 
> If you are unhappy about this,
> it is possible to always turn off this feature
> and use software detection (with nand_check_erased_ecc_chunk)

As long as the engine reports the maximum number of
bitflips-per-ECC-chunk we're good. Of course, if you have an
uncorrectable error reported and your engine does not tell you in which
chunk(s) this happened, you'll have to call
nand_check_erased_ecc_chunk() on all chunks, but that should be fine.

> 
> 
> 
> > In any case, if the ECC engine reports uncorrectable errors, it should
> > keep the data untouched, which means you don't have to re-read the whole
> > page in raw mode, only the OOB bytes.  
> 
> 
> OK.  We should respect the result from the ECC engine,
> but we still need to fill the buffer with 0xff
> if the page turned out to be empty.
> (nand_check_erased_ecc_chunk() does this for us.)

Yes, calling nand_check_erased_ecc_chunk() is still needed.

> 
> 
> 
> 
> 
> >>
> >> [4] positive return value for uncorrectable bitflips
> >>
> >> The comment of ecc->read_page() says it should return "0 if bitflips
> >> uncorrectable", but the current code could return a positive value
> >> in the case.  
> >
> > This one should probably be fixed in the core. Returning a negative
> > error code for uncorrectable errors is forbidden, but reporting the
> > maximum number of bitflips that have been corrected in each valid
> > ECC sector of the page (even if the page contains uncorrectable
> > sectors) does not sound like a bad idea to me.
> >
> > The reason the core asks drivers to return 0 in case of uncorrectable
> > errors is because it updates the max_bitflips variable before testing
> > if the page contains uncorrectable errors [1]. Moving this statement
> > here [2] (in an else branch) should solve the problem for all drivers
> > returning positive numbers even when uncorrectable errors are detected
> > in one of the ECC chunk contained in a page.  
> 
> 
> I understood your idea, but do you want this change in this series?

Not necessarily, but I'm pretty sure other drivers are making the same
mistake, so we'd better fix it in one place and stop requiring drivers
to return 0 if at least one ECC chunk is uncorrectable in the page.
Masahiro Yamada March 24, 2017, 2:43 a.m. UTC | #7
Hi Boris,


2017-03-23 16:56 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
> On Thu, 23 Mar 2017 14:04:44 +0900
> Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
>
>> Hi Boris,
>>
>> 2017-03-23 5:56 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
>> > On Wed, 22 Mar 2017 23:07:17 +0900
>> > Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
>> >>               dev_err(denali->dev,
>> >> @@ -1148,12 +1136,15 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
>> >>       if (check_erased_page) {
>> >>               read_oob_data(mtd, chip->oob_poi, denali->page);
>> >>
>> >> -             /* check ECC failures that may have occurred on erased pages */
>> >> -             if (check_erased_page) {
>> >> -                     if (!is_erased(buf, mtd->writesize))
>> >> -                             mtd->ecc_stats.failed++;
>> >> -                     if (!is_erased(buf, mtd->oobsize))
>> >> -                             mtd->ecc_stats.failed++;
>> >> +             stat = nand_check_erased_ecc_chunk(
>> >> +                                     buf, mtd->writesize,
>> >> +                                     chip->oob_poi, mtd->oobsize,
>> >> +                                     NULL, 0,
>> >> +                                     chip->ecc.strength * chip->ecc.steps);
>> >
>> > That's not how it's supposed to be done. Each chunk should be checked
>> > independently. Here is a simple example explaining why this is
>> > important:
>> >
>> > Let's consider the following setup:
>> > - 4k pages
>> > - 16bits/1024bytes ECC
>> >
>> > With your approach, you turn this into:
>> > - 4k pages
>> > - 64bits/4096bytes ECC
>> >
>> > Now suppose you have 32 bitflips in the first 1024 bytes. The real ECC
>> > config is expected to report uncorrectable errors, but your approach
>> > will just report that 32 bits have been fixed, which is wrong.
>>
>>
>> OK.  How about adding a helper like follows:
>>
>> static int denali_check_erased_page(struct mtd_info *mtd,
>>                                     struct nand_chip *chip, uint8_t *buf)
>> {
>>         uint8_t *ecc_code = chip->buffers->ecccode;
>>         int ecc_steps = chip->ecc.steps;
>>         int ecc_size = chip->ecc.size;
>>         int ecc_bytes = chip->ecc.bytes;
>>         int i, ret;
>>
>>         ret = mtd_ooblayout_get_eccbytes(mtd, ecc_code, chip->oob_poi, 0,
>>                                          chip->ecc.total);
>>         if (ret)
>>                 return ret;
>>
>>         for (i = 0; i < ecc_steps; i++) {
>>                 ret = nand_check_erased_ecc_chunk(buf, ecc_size,
>>                                                   ecc_code, ecc_bytes,
>>                                                   NULL, 0,
>>                                                   chip->ecc.strength);
>>                 if (ret < 0)
>>                         return ret;
>>                 buf += ecc_size;
>>                 ecc_code += ecc_bytes;
>>         }
>>
>>         return 0;
>> }
>>
>>
>>
>> Then,
>>
>>                 stat = denali_check_erased_page(mtd, chip, buf);
>>                 if (stat < 0) {
>>                         mtd->ecc_stats.failed++;
>>                         /* return 0 for uncorrectable bitflips */
>>                         stat = 0;
>>                 }
>
> What's the point of checking all ECC chunks if only one contains ECC
> errors? I really recommend to put the nand_check_erased_ecc_chunk()
> call next to the per-ECC-block correction test.


OK.  I can fix it for software ECC fixup.


What should I do for hardware ECC fixup case?
http://patchwork.ozlabs.org/patch/742321/


If at least one ECC sector fails to correct bit-flips,
the controller sets INTR__ECC_UNCOR_ERR flag.


In this case, we can not know the number of uncorrectable errors.

Possible solutions are:

  - Increment ecc_stats.failed only by one  (compromised solution)

  - If the controller IP supports sub-page read,
    transfer sectors once again, one by one, checking the register
    flag each time.


As far as I see, there are three cases.

[1] SW ECC fixup (Intel)
    This can be fixed

[2] HW ECC fixup is supported, but sub-page read is not supported
    (old UniPhier SoCs,  probably SOCFPGA too)

[3] HW ECC fixup is supported, and sub-page read is supported as well
    (new UniPhier SoCs)


I do not know how to precisely increment
ecc_stats.failed and ecc_stats.corrected for [2].


As for [3], we can solve the issue by making more efforts,
but I am not sure this effort is worthwhile.




> Also, mtd->ecc_stats.failed is supposed to be incremented each time an
> uncorrectable error is detected. In your denali_sw_ecc_fixup()
> implementation you can detect errors at the ECC chunk level, so you
> should increment ecc_stats.failed for each failure and not once if at
> least one chunk is faulty.


Yes, I can do this for denali_sw_ecc_fixup().

Can I ask what disadvantage there would be
if ecc_stats.failed / .corrected is incremented only by one
when errors actually happen in multiple sectors?
Boris Brezillon March 24, 2017, 8:06 a.m. UTC | #8
On Fri, 24 Mar 2017 11:43:43 +0900
Masahiro Yamada <yamada.masahiro@socionext.com> wrote:

> Hi Boris,
> 
> 
> 2017-03-23 16:56 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:
> > On Thu, 23 Mar 2017 14:04:44 +0900
> > Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
> >  
> >> Hi Boris,
> >>
> >> 2017-03-23 5:56 GMT+09:00 Boris Brezillon <boris.brezillon@free-electrons.com>:  
> >> > On Wed, 22 Mar 2017 23:07:17 +0900
> >> > Masahiro Yamada <yamada.masahiro@socionext.com> wrote:  
> >> >>               dev_err(denali->dev,
> >> >> @@ -1148,12 +1136,15 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
> >> >>       if (check_erased_page) {
> >> >>               read_oob_data(mtd, chip->oob_poi, denali->page);
> >> >>
> >> >> -             /* check ECC failures that may have occurred on erased pages */
> >> >> -             if (check_erased_page) {
> >> >> -                     if (!is_erased(buf, mtd->writesize))
> >> >> -                             mtd->ecc_stats.failed++;
> >> >> -                     if (!is_erased(buf, mtd->oobsize))
> >> >> -                             mtd->ecc_stats.failed++;
> >> >> +             stat = nand_check_erased_ecc_chunk(
> >> >> +                                     buf, mtd->writesize,
> >> >> +                                     chip->oob_poi, mtd->oobsize,
> >> >> +                                     NULL, 0,
> >> >> +                                     chip->ecc.strength * chip->ecc.steps);  
> >> >
> >> > That's not how it's supposed to be done. Each chunk should be checked
> >> > independently. Here is a simple example explaining why this is
> >> > important:
> >> >
> >> > Let's consider the following setup:
> >> > - 4k pages
> >> > - 16bits/1024bytes ECC
> >> >
> >> > With your approach, you turn this into:
> >> > - 4k pages
> >> > - 64bits/4096bytes ECC
> >> >
> >> > Now suppose you have 32 bitflips in the first 1024 bytes. The real ECC
> >> > config is expected to report uncorrectable errors, but your approach
> >> > will just report that 32 bits have been fixed, which is wrong.  
> >>
> >>
> >> OK.  How about adding a helper like follows:
> >>
> >> static int denali_check_erased_page(struct mtd_info *mtd,
> >>                                     struct nand_chip *chip, uint8_t *buf)
> >> {
> >>         uint8_t *ecc_code = chip->buffers->ecccode;
> >>         int ecc_steps = chip->ecc.steps;
> >>         int ecc_size = chip->ecc.size;
> >>         int ecc_bytes = chip->ecc.bytes;
> >>         int i, ret;
> >>
> >>         ret = mtd_ooblayout_get_eccbytes(mtd, ecc_code, chip->oob_poi, 0,
> >>                                          chip->ecc.total);
> >>         if (ret)
> >>                 return ret;
> >>
> >>         for (i = 0; i < ecc_steps; i++) {
> >>                 ret = nand_check_erased_ecc_chunk(buf, ecc_size,
> >>                                                   ecc_code, ecc_bytes,
> >>                                                   NULL, 0,
> >>                                                   chip->ecc.strength);
> >>                 if (ret < 0)
> >>                         return ret;
> >>                 buf += ecc_size;
> >>                 ecc_code += ecc_bytes;
> >>         }
> >>
> >>         return 0;
> >> }
> >>
> >>
> >>
> >> Then,
> >>
> >>                 stat = denali_check_erased_page(mtd, chip, buf);
> >>                 if (stat < 0) {
> >>                         mtd->ecc_stats.failed++;
> >>                         /* return 0 for uncorrectable bitflips */
> >>                         stat = 0;
> >>                 }  
> >
> > What's the point of checking all ECC chunks if only one contains ECC
> > errors? I really recommend to put the nand_check_erased_ecc_chunk()
> > call next to the per-ECC-block correction test.  
> 
> 
> OK.  I can fix it for software ECC fixup.
> 
> 
> What should I do for hardware ECC fixup case?
> http://patchwork.ozlabs.org/patch/742321/
> 
> 
> If at least one ECC sector fails to correct bit-flips,
> the controller sets INTR__ECC_UNCOR_ERR flag.
> 
> 
> In this case, we can not know the number of uncorrectable errors.
> 
> Possible solutions are:
> 
>   - Increment ecc_stats.failed only by one  (compromised solution)

Let's go for this solution.

> 
> 
> > Also, mtd->ecc_stats.failed is supposed to be incremented each time an
> > uncorrectable error is detected. In your denali_sw_ecc_fixup()
> > implementation you can detect errors at the ECC chunk level, so you
> > should increment ecc_stats.failed for each failure and not once if at
> > least one chunk is faulty.  
> 
> 
> Yes, I can do this for denali_sw_ecc_fixup().
> 
> Can I ask what disadvantage would happen
> if ecc_stats.failed / .corrected is incremented only by one,
> where actually errors happen in multiple sectors.

Reporting wrong stats, which is not such a big deal, but let's try to
keep them correct when we can (the SW ECC fixup case).
Patch

diff --git a/drivers/mtd/nand/denali.c b/drivers/mtd/nand/denali.c
index 2c59eb3..86381ac 100644
--- a/drivers/mtd/nand/denali.c
+++ b/drivers/mtd/nand/denali.c
@@ -883,19 +883,6 @@  static void read_oob_data(struct mtd_info *mtd, uint8_t *buf, int page)
 	}
 }
 
-/*
- * this function examines buffers to see if they contain data that
- * indicate that the buffer is part of an erased region of flash.
- */
-static bool is_erased(uint8_t *buf, int len)
-{
-	int i;
-
-	for (i = 0; i < len; i++)
-		if (buf[i] != 0xFF)
-			return false;
-	return true;
-}
 #define ECC_SECTOR_SIZE 512
 
 #define ECC_SECTOR(x)	(((x) & ECC_ERROR_ADDRESS__SECTOR_NR) >> 12)
@@ -1119,6 +1106,7 @@  static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	uint32_t irq_status;
 	uint32_t irq_mask = INTR__ECC_TRANSACTION_DONE | INTR__ECC_ERR;
 	bool check_erased_page = false;
+	int stat;
 
 	if (page != denali->page) {
 		dev_err(denali->dev,
@@ -1148,12 +1136,15 @@  static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	if (check_erased_page) {
 		read_oob_data(mtd, chip->oob_poi, denali->page);
 
-		/* check ECC failures that may have occurred on erased pages */
-		if (check_erased_page) {
-			if (!is_erased(buf, mtd->writesize))
-				mtd->ecc_stats.failed++;
-			if (!is_erased(buf, mtd->oobsize))
-				mtd->ecc_stats.failed++;
+		stat = nand_check_erased_ecc_chunk(
+					buf, mtd->writesize,
+					chip->oob_poi, mtd->oobsize,
+					NULL, 0,
+					chip->ecc.strength * chip->ecc.steps);
+		if (stat < 0) {
+			mtd->ecc_stats.failed++;
+			/* return 0 for uncorrectable bitflips */
+			max_bitflips = 0;
 		}
 	}
 	return max_bitflips;