diff mbox

[2/2] mtd: brcmnand: Detect sticky ucorr ecc error on dma reads

Message ID 1461961285-24159-2-git-send-email-kdasu.kdev@gmail.com
State Superseded
Headers show

Commit Message

Kamal Dasu April 29, 2016, 8:21 p.m. UTC
This change provides a fix for controller bug where nand
controller could have a possible sticky error after a PIO
followed by a DMA read. The fix retries a read if we see
a uncorr_ecc after read to detect such sticky errors.

Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
---
 drivers/mtd/nand/brcmnand/brcmnand.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Boris Brezillon May 30, 2016, 8:50 a.m. UTC | #1
On Fri, 29 Apr 2016 16:21:25 -0400
Kamal Dasu <kdasu.kdev@gmail.com> wrote:

> This change provides a fix for controller bug where nand
> controller could have a possible sticky error after a PIO
> followed by a DMA read. The fix retries a read if we see
> a uncorr_ecc after read to detect such sticky errors.
> 
> Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
> ---
>  drivers/mtd/nand/brcmnand/brcmnand.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mtd/nand/brcmnand/brcmnand.c b/drivers/mtd/nand/brcmnand/brcmnand.c
> index 29a9abd..13c7784 100644
> --- a/drivers/mtd/nand/brcmnand/brcmnand.c
> +++ b/drivers/mtd/nand/brcmnand/brcmnand.c
> @@ -1555,9 +1555,11 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
>  	struct brcmnand_controller *ctrl = host->ctrl;
>  	u64 err_addr = 0;
>  	int err;
> +	bool retry = true;
>  
>  	dev_dbg(ctrl->dev, "read %llx -> %p\n", (unsigned long long)addr, buf);
>  
> +try_dmaread:
>  	brcmnand_write_reg(ctrl, BRCMNAND_UNCORR_COUNT, 0);
>  
>  	if (has_flash_dma(ctrl) && !oob && flash_dma_buf_ok(buf)) {
> @@ -1579,7 +1581,18 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
>  
>  	if (mtd_is_eccerr(err)) {
>  		int ret;
> -
> +		/*
> +		 * On controller version >=7.0 if we are doing a DMA read
> +		 * after a prior PIO read that reported uncorrectable error,
> +		 * the DMA engine captures this error following DMA read
> +		 * cleared only on subsequent DMA read, so just retry once
> +		 * to clear a possible false error reported for current DMA
> +		 * read
> +		 */

Hm, shouldn't this BRCMNAND_UNCORR_COUNT bit be cleared just after
doing the PIO/DMA read instead of doing it before executing a new read?
This would solve your problem without the need for this extra retry, or
am I missing something?

> +		if ((ctrl->nand_version >= 0x0700) && retry) {
> +			retry = false;
> +			goto try_dmaread;
> +		}
>  		ret = brcmstb_nand_verify_erased_page(mtd, chip, buf, addr);
>  		if (ret < 0) {
>  			dev_dbg(ctrl->dev, "uncorrectable error at 0x%llx\n",
Kamal Dasu June 1, 2016, 4:50 p.m. UTC | #2
Boris,

On Mon, May 30, 2016 at 4:50 AM, Boris Brezillon
<boris.brezillon@free-electrons.com> wrote:
> On Fri, 29 Apr 2016 16:21:25 -0400
> Kamal Dasu <kdasu.kdev@gmail.com> wrote:
>
>> This change provides a fix for controller bug where nand
>> controller could have a possible sticky error after a PIO
>> followed by a DMA read. The fix retries a read if we see
>> a uncorr_ecc after read to detect such sticky errors.
>>
>> Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
>> ---
>>  drivers/mtd/nand/brcmnand/brcmnand.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/mtd/nand/brcmnand/brcmnand.c b/drivers/mtd/nand/brcmnand/brcmnand.c
>> index 29a9abd..13c7784 100644
>> --- a/drivers/mtd/nand/brcmnand/brcmnand.c
>> +++ b/drivers/mtd/nand/brcmnand/brcmnand.c
>> @@ -1555,9 +1555,11 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
>>       struct brcmnand_controller *ctrl = host->ctrl;
>>       u64 err_addr = 0;
>>       int err;
>> +     bool retry = true;
>>
>>       dev_dbg(ctrl->dev, "read %llx -> %p\n", (unsigned long long)addr, buf);
>>
>> +try_dmaread:
>>       brcmnand_write_reg(ctrl, BRCMNAND_UNCORR_COUNT, 0);
>>
>>       if (has_flash_dma(ctrl) && !oob && flash_dma_buf_ok(buf)) {
>> @@ -1579,7 +1581,18 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
>>
>>       if (mtd_is_eccerr(err)) {
>>               int ret;
>> -
>> +             /*
>> +              * On controller version >=7.0 if we are doing a DMA read
>> +              * after a prior PIO read that reported uncorrectable error,
>> +              * the DMA engine captures this error following DMA read
>> +              * cleared only on subsequent DMA read, so just retry once
>> +              * to clear a possible false error reported for current DMA
>> +              * read
>> +              */
>
> Hm, shouldn't this BRCMNAND_UNCORR_COUNT bit be cleared just after
> doing the PIO/DMA read instead of doing it before executing a new read?
> This would solve your problem without the need for this extra retry, or
> am I missing something?
>

Clearing the count registers or the intr registers does not clear the
condition. Only a clean read (a page that does not have errors) clears
the condition. So if this was a false error  ( page is really clean)
and we read again, it will clear the condition.



Thanks
Kamal
Boris Brezillon June 1, 2016, 5:20 p.m. UTC | #3
On Wed, 1 Jun 2016 12:50:56 -0400
Kamal Dasu <kamal.dasu@broadcom.com> wrote:

> Boris,
> 
> On Mon, May 30, 2016 at 4:50 AM, Boris Brezillon
> <boris.brezillon@free-electrons.com> wrote:
> > On Fri, 29 Apr 2016 16:21:25 -0400
> > Kamal Dasu <kdasu.kdev@gmail.com> wrote:
> >  
> >> This change provides a fix for controller bug where nand
> >> controller could have a possible sticky error after a PIO
> >> followed by a DMA read. The fix retries a read if we see
> >> a uncorr_ecc after read to detect such sticky errors.
> >>
> >> Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
> >> ---
> >>  drivers/mtd/nand/brcmnand/brcmnand.c | 15 ++++++++++++++-
> >>  1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/mtd/nand/brcmnand/brcmnand.c b/drivers/mtd/nand/brcmnand/brcmnand.c
> >> index 29a9abd..13c7784 100644
> >> --- a/drivers/mtd/nand/brcmnand/brcmnand.c
> >> +++ b/drivers/mtd/nand/brcmnand/brcmnand.c
> >> @@ -1555,9 +1555,11 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
> >>       struct brcmnand_controller *ctrl = host->ctrl;
> >>       u64 err_addr = 0;
> >>       int err;
> >> +     bool retry = true;
> >>
> >>       dev_dbg(ctrl->dev, "read %llx -> %p\n", (unsigned long long)addr, buf);
> >>
> >> +try_dmaread:
> >>       brcmnand_write_reg(ctrl, BRCMNAND_UNCORR_COUNT, 0);
> >>
> >>       if (has_flash_dma(ctrl) && !oob && flash_dma_buf_ok(buf)) {
> >> @@ -1579,7 +1581,18 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
> >>
> >>       if (mtd_is_eccerr(err)) {
> >>               int ret;
> >> -
> >> +             /*
> >> +              * On controller version >=7.0 if we are doing a DMA read
> >> +              * after a prior PIO read that reported uncorrectable error,
> >> +              * the DMA engine captures this error following DMA read
> >> +              * cleared only on subsequent DMA read, so just retry once
> >> +              * to clear a possible false error reported for current DMA
> >> +              * read
> >> +              */  
> >
> > Hm, shouldn't this BRCMNAND_UNCORR_COUNT bit be cleared just after
> > doing the PIO/DMA read instead of doing it before executing a new read?
> > This would solve your problem without the need for this extra retry, or
> > am I missing something?
> >  
> 
> Clearing the count registers or the intr registers does not clear the
> condition. Only a clean read (a page that does not have errors) clears
> the condition. So if this was a false error  ( page is really clean)
> and we read again, it will clear the condition.

This sounds like an expensive workaround, but you know the IP better
than I do, so I'll trust you ;).
Boris Brezillon June 1, 2016, 8:37 p.m. UTC | #4
On Wed, 1 Jun 2016 12:50:56 -0400
Kamal Dasu <kamal.dasu@broadcom.com> wrote:

> Boris,
> 
> On Mon, May 30, 2016 at 4:50 AM, Boris Brezillon
> <boris.brezillon@free-electrons.com> wrote:
> > On Fri, 29 Apr 2016 16:21:25 -0400
> > Kamal Dasu <kdasu.kdev@gmail.com> wrote:
> >  
> >> This change provides a fix for controller bug where nand
> >> controller could have a possible sticky error after a PIO
> >> followed by a DMA read. The fix retries a read if we see
> >> a uncorr_ecc after read to detect such sticky errors.
> >>
> >> Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
> >> ---
> >>  drivers/mtd/nand/brcmnand/brcmnand.c | 15 ++++++++++++++-
> >>  1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/mtd/nand/brcmnand/brcmnand.c b/drivers/mtd/nand/brcmnand/brcmnand.c
> >> index 29a9abd..13c7784 100644
> >> --- a/drivers/mtd/nand/brcmnand/brcmnand.c
> >> +++ b/drivers/mtd/nand/brcmnand/brcmnand.c
> >> @@ -1555,9 +1555,11 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
> >>       struct brcmnand_controller *ctrl = host->ctrl;
> >>       u64 err_addr = 0;
> >>       int err;
> >> +     bool retry = true;
> >>
> >>       dev_dbg(ctrl->dev, "read %llx -> %p\n", (unsigned long long)addr, buf);
> >>
> >> +try_dmaread:
> >>       brcmnand_write_reg(ctrl, BRCMNAND_UNCORR_COUNT, 0);
> >>
> >>       if (has_flash_dma(ctrl) && !oob && flash_dma_buf_ok(buf)) {
> >> @@ -1579,7 +1581,18 @@ static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
> >>
> >>       if (mtd_is_eccerr(err)) {
> >>               int ret;
> >> -
> >> +             /*
> >> +              * On controller version >=7.0 if we are doing a DMA read
> >> +              * after a prior PIO read that reported uncorrectable error,
> >> +              * the DMA engine captures this error following DMA read
> >> +              * cleared only on subsequent DMA read, so just retry once
> >> +              * to clear a possible false error reported for current DMA
> >> +              * read
> >> +              */  
> >
> > Hm, shouldn't this BRCMNAND_UNCORR_COUNT bit be cleared just after
> > doing the PIO/DMA read instead of doing it before executing a new read?
> > This would solve your problem without the need for this extra retry, or
> > am I missing something?
> >  
> 
> Clearing the count registers or the intr registers does not clear the
> condition. Only a clean read (a page that does not have errors) clears
> the condition. So if this was a false error  ( page is really clean)
> and we read again, it will clear the condition.
> 

Can you try that patch [1], just to make sure the extra read cycle is
really necessary?

[1]http://code.bulix.org/uzb7m8-99997
Kamal Dasu June 2, 2016, 6:55 p.m. UTC | #5
Yes have tried this, but its the controller internals that causes this
condition.

Kamal
diff mbox

Patch

diff --git a/drivers/mtd/nand/brcmnand/brcmnand.c b/drivers/mtd/nand/brcmnand/brcmnand.c
index 29a9abd..13c7784 100644
--- a/drivers/mtd/nand/brcmnand/brcmnand.c
+++ b/drivers/mtd/nand/brcmnand/brcmnand.c
@@ -1555,9 +1555,11 @@  static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
 	struct brcmnand_controller *ctrl = host->ctrl;
 	u64 err_addr = 0;
 	int err;
+	bool retry = true;
 
 	dev_dbg(ctrl->dev, "read %llx -> %p\n", (unsigned long long)addr, buf);
 
+try_dmaread:
 	brcmnand_write_reg(ctrl, BRCMNAND_UNCORR_COUNT, 0);
 
 	if (has_flash_dma(ctrl) && !oob && flash_dma_buf_ok(buf)) {
@@ -1579,7 +1581,18 @@  static int brcmnand_read(struct mtd_info *mtd, struct nand_chip *chip,
 
 	if (mtd_is_eccerr(err)) {
 		int ret;
-
+		/*
+		 * On controller version >=7.0 if we are doing a DMA read
+		 * after a prior PIO read that reported uncorrectable error,
+		 * the DMA engine captures this error following DMA read
+		 * cleared only on subsequent DMA read, so just retry once
+		 * to clear a possible false error reported for current DMA
+		 * read
+		 */
+		if ((ctrl->nand_version >= 0x0700) && retry) {
+			retry = false;
+			goto try_dmaread;
+		}
 		ret = brcmstb_nand_verify_erased_page(mtd, chip, buf, addr);
 		if (ret < 0) {
 			dev_dbg(ctrl->dev, "uncorrectable error at 0x%llx\n",