
mtd: rawnand: micron: handle "ecc off" devices correctly

Message ID 20190726074434.21627-1-m.felsch@pengutronix.de
State Changes Requested
Delegated to: Miquel Raynal
Series mtd: rawnand: micron: handle "ecc off" devices correctly

Commit Message

Marco Felsch July 26, 2019, 7:44 a.m. UTC
Some devices don't support ECC "officially". By "officially" I mean that the
feature can be set through the "SET FEATURE (EFh)" command but isn't
reported in the "READ ID Parameter Tables", because the "ECC Field"
still says that it is disabled. This is applicable at least
for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
datasheet describes the ECC feature in chapter "ECC Protection".

Currently the driver checks the "READ ID Parameter" field directly after
we enabled the feature. If the check fails we return immediately but
leave the ECC on. Now all future read/program cycles go through the ECC
and the host NFC gets confused and reports ECC errors.

To address this in a common way we need to turn off the ECC directly
after reading the "READ ID Parameter" and before checking the
"ECC status".

Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
---
 drivers/mtd/nand/raw/nand_micron.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
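
To make the ordering problem concrete, here is a minimal user-space sketch
in C. It is not the kernel code: struct fake_chip, set_ecc(), read_id4(),
probe_old() and probe_new() are hypothetical stand-ins, and ID_ECC_ENABLED
is an assumed placeholder for MICRON_ID_ECC_ENABLED. The model assumes a
chip that switches its ECC engine on via SET FEATURE but never reports
that in the READ ID byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed placeholder for MICRON_ID_ECC_ENABLED; the value is illustrative. */
#define ID_ECC_ENABLED 0x80

/* Hypothetical model of a NAND chip, not a kernel structure. */
struct fake_chip {
	bool ecc_engine_on;	/* real state of the on-die ECC unit */
	bool id_reports_ecc;	/* does READ ID reflect that state? */
};

/* Stand-in for micron_nand_on_die_ecc_setup(). */
static void set_ecc(struct fake_chip *c, bool on)
{
	c->ecc_engine_on = on;
}

/* Stand-in for reading byte 4 of the READ ID response. */
static uint8_t read_id4(const struct fake_chip *c)
{
	return (c->id_reports_ecc && c->ecc_engine_on) ? ID_ECC_ENABLED : 0;
}

/* Old order: the ID check comes first, so the early return skips the disable. */
static bool probe_old(struct fake_chip *c)
{
	uint8_t id4;

	set_ecc(c, true);
	id4 = read_id4(c);
	if (!(id4 & ID_ECC_ENABLED))
		return false;	/* engine is still on here */
	set_ecc(c, false);
	return true;
}

/* Patched order: always disable first, then evaluate the saved ID byte. */
static bool probe_new(struct fake_chip *c)
{
	uint8_t id4;

	set_ecc(c, true);
	id4 = read_id4(c);
	set_ecc(c, false);
	if (!(id4 & ID_ECC_ENABLED))
		return false;
	return true;
}
```

With a chip whose id_reports_ecc is false, probe_old() returns false and
leaves ecc_engine_on set, while probe_new() returns false with the engine
switched off again.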

Comments

Miquel Raynal July 26, 2019, 8:28 a.m. UTC | #1
Hi Marco,

+ Richard
+ Working e-mail address for Boris

Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019
09:44:34 +0200:

> Some devices don't support ecc "official". By "official" I mean that the
> feature can be set trough the "SET FEATURE (EFh)" command but isn't
> reported to the "READ ID Parameter Tables". Because the "ECC Field"
> still says that it is disabled. This is applicable at least
> for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
> datasheet describes the ECC feature in chapter "ECC Protection".
> 
> Currently the driver checks the "READ ID Parameter" field directly after
> we enabled the feature. If the check fails we return immediately but
> leave the ECC on. Now all future read/program cycles goes trough the ecc
> and the host nfc gets confused and reports ECC errors.
> 
> To address this in a common way we need to turn off the ECC directly
> after reading the "READ ID Parameter" and before checking the
> "ECC status".
> 
> Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>

Good catch! However you report that on-die ECC correction is working
but you still disable it; any reason to do so? Would it be better to
actually enable on-die ECC and explicitly mark these two chips as
buggy (see [1] for checking the chip IDs)?

[1] https://elixir.bootlin.com/linux/v5.3-rc1/source/drivers/mtd/nand/raw/nand_macronix.c#L83

> ---
>  drivers/mtd/nand/raw/nand_micron.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
> index 1622d3145587..fb199ad2f1a6 100644
> --- a/drivers/mtd/nand/raw/nand_micron.c
> +++ b/drivers/mtd/nand/raw/nand_micron.c
> @@ -390,6 +390,14 @@ static int micron_supports_on_die_ecc(struct nand_chip *chip)
>  	    (chip->id.data[4] & MICRON_ID_INTERNAL_ECC_MASK) != 0x2)
>  		return MICRON_ON_DIE_UNSUPPORTED;
>  
> +	/*
> +	 * It seems that there are devices which do not support ECC official.
> +	 * At least the MT29F2G08ABAGA / MT29F2G08ABBGA devices supports
> +	 * enabling the ECC feature but don't reflect that to the READ_ID table.
> +	 * So we have to guarantee that we disable the ECC feature directly
> +	 * after we did the READ_ID table command. Later we can evaluate the
> +	 * ECC_ENABLE support.
> +	 */
>  	ret = micron_nand_on_die_ecc_setup(chip, true);
>  	if (ret)
>  		return MICRON_ON_DIE_UNSUPPORTED;
> @@ -398,13 +406,13 @@ static int micron_supports_on_die_ecc(struct nand_chip *chip)
>  	if (ret)
>  		return MICRON_ON_DIE_UNSUPPORTED;
>  
> -	if (!(id[4] & MICRON_ID_ECC_ENABLED))
> -		return MICRON_ON_DIE_UNSUPPORTED;
> -
>  	ret = micron_nand_on_die_ecc_setup(chip, false);
>  	if (ret)
>  		return MICRON_ON_DIE_UNSUPPORTED;
>  
> +	if (!(id[4] & MICRON_ID_ECC_ENABLED))
> +		return MICRON_ON_DIE_UNSUPPORTED;
> +
>  	ret = nand_readid_op(chip, 0, id, sizeof(id));
>  	if (ret)
>  		return MICRON_ON_DIE_UNSUPPORTED;

Thanks,
Miquèl
Miquel Raynal July 26, 2019, 8:34 a.m. UTC | #2
+ Actual address for Boris

Lucas Stach July 26, 2019, 8:54 a.m. UTC | #3
Hi Miquel,

Am Freitag, den 26.07.2019, 10:28 +0200 schrieb Miquel Raynal:
> Hi Marco,
> 
> + Richard
> + Working e-mail address for Boris
> 
> Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019
> 09:44:34 +0200:
> 
> > Some devices don't support ecc "official". By "official" I mean that the
> > feature can be set trough the "SET FEATURE (EFh)" command but isn't
> > reported to the "READ ID Parameter Tables". Because the "ECC Field"
> > still says that it is disabled. This is applicable at least
> > for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
> > datasheet describes the ECC feature in chapter "ECC Protection".
> > 
> > Currently the driver checks the "READ ID Parameter" field directly after
> > we enabled the feature. If the check fails we return immediately but
> > leave the ECC on. Now all future read/program cycles goes trough the ecc
> > and the host nfc gets confused and reports ECC errors.
> > 
> > To address this in a common way we need to turn off the ECC directly
> > after reading the "READ ID Parameter" and before checking the
> > "ECC status".
> > 
> > Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
> 
> Good catch! However you report that on-die ECC correction is working
> but you still disable it; any reason to do so ? Would it be better to
> actually enable on-die ECC and explicitly mark these two chips as
> buggy (see [1] for checking the chip IDs)?

It's the other way around. The chip is not supposed to have on-die ECC
according to the datasheet and correctly reflects this fact in the
READ_ID, so Linux should not try to use the on-die ECC.

The bug is that the NAND is not supposed to have on-die ECC and reports
this correctly, but then actually enables an on-die ECC unit when asked
to, probably due to the same die being used for on-die ECC and ECC off
devices. The consequence is that Linux (correctly) assumes that the
full OOB size is available to the controller, but the on-die ECC unit
scribbles over some of the OOB data.

I think this fix is the most robust solution, as it makes sure to disable
the on-die ECC unit to avoid the issue, which might also be present on
other NAND chips we don't know about yet.
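
The OOB clash described above can be sketched with a small model in C.
Everything in it is an assumption for illustration: OOB_SIZE, the
ONDIE_ECC_OFF/ONDIE_ECC_LEN region and the 0xA5 fill byte are hypothetical
and not taken from any datasheet, and program_oob() and oob_intact() are
made-up helpers. It only shows why the host sees errors once the chip
silently claims part of an area the host believes it owns entirely.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define OOB_SIZE	64	/* assumed OOB size, illustration only */
#define ONDIE_ECC_OFF	8	/* hypothetical start of the chip's parity bytes */
#define ONDIE_ECC_LEN	8	/* hypothetical length of that region */

/*
 * Model a page program: the host supplies the full OOB buffer. If the
 * on-die engine is active, the chip overwrites its own region with
 * parity the host never asked for.
 */
static void program_oob(uint8_t *stored, const uint8_t *host, bool ondie_ecc_on)
{
	memcpy(stored, host, OOB_SIZE);
	if (ondie_ecc_on)
		memset(stored + ONDIE_ECC_OFF, 0xA5, ONDIE_ECC_LEN);
}

/* True when the host reads back exactly what it wrote. */
static bool oob_intact(const uint8_t *stored, const uint8_t *host)
{
	return memcmp(stored, host, OOB_SIZE) == 0;
}
```

With ondie_ecc_on set to false the stored OOB matches the host buffer; with
it set to true the bytes inside the hypothetical region differ, which is
the kind of mismatch a host ECC engine would report as an error.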

Regards,
Lucas 

Boris Brezillon July 26, 2019, 8:59 a.m. UTC | #4
On Fri, 26 Jul 2019 10:34:41 +0200
Miquel Raynal <miquel.raynal@bootlin.com> wrote:

> + Actual address for Boris
> 
> Miquel Raynal <miquel.raynal@bootlin.com> wrote on Fri, 26 Jul 2019
> 10:28:58 +0200:
> 
> > Hi Marco,
> > 
> > + Richard
> > + Working e-mail address for Boris
> > 
> > Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019
> > 09:44:34 +0200:
> >   
> > > Some devices don't support ecc "official". By "official" I mean that the
> > > feature can be set trough the "SET FEATURE (EFh)" command but isn't
> > > reported to the "READ ID Parameter Tables". Because the "ECC Field"
> > > still says that it is disabled. This is applicable at least
> > > for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
> > > datasheet describes the ECC feature in chapter "ECC Protection".
> > > 
> > > Currently the driver checks the "READ ID Parameter" field directly after
> > > we enabled the feature. If the check fails we return immediately but
> > > leave the ECC on. Now all future read/program cycles goes trough the ecc
> > > and the host nfc gets confused and reports ECC errors.
> > > 
> > > To address this in a common way we need to turn off the ECC directly
> > > after reading the "READ ID Parameter" and before checking the
> > > "ECC status".
> > > 
> > > Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>  

Duh! Yet another bug on those Micron chips. I can't say I'm
surprised :-).

Anyway, the change looks good:

Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>

> > 
> > Good catch! However you report that on-die ECC correction is working
> > but you still disable it; any reason to do so ? Would it be better to
> > actually enable on-die ECC and explicitly mark these two chips as
> > buggy (see [1] for checking the chip IDs)?

That's a solution, but are we even sure ECC works correctly on those
NANDs? Given all the problems we have with on-die ECC on Micron chips I
think it might be a good thing to base the "on-die ECC support"
detection on the full ID (or even better, the part name provided by the
ONFi param page) instead of trying to be smart. This way we can
whitelist the NANDs that are known to work correctly and stop adding
more quirks every time we find a new bug...
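
A whitelist of that kind could look roughly like the following C sketch.
It is only an illustration of the idea: the table on_die_ecc_allowed[],
the helper on_die_ecc_allowed_for() and the part names in the table are
hypothetical placeholders, not verified parts and not existing kernel
symbols. The model string is assumed to come from the ONFI parameter page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical allow-list; the entries are placeholders, not real parts. */
static const char *const on_die_ecc_allowed[] = {
	"EXAMPLE-PART-A",
	"EXAMPLE-PART-B",
};

/*
 * Return true only if the model string exactly matches a listed part.
 * Unknown or missing names are treated as "no on-die ECC".
 */
static bool on_die_ecc_allowed_for(const char *model)
{
	size_t i;

	if (!model)
		return false;

	for (i = 0; i < sizeof(on_die_ecc_allowed) / sizeof(on_die_ecc_allowed[0]); i++) {
		if (strcmp(model, on_die_ecc_allowed[i]) == 0)
			return true;
	}

	return false;
}
```

The design choice is that anything not explicitly listed falls back to
host-side ECC, so a new buggy chip needs no extra quirk.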

Miquel Raynal July 26, 2019, 9:17 a.m. UTC | #5
Hi Lucas, Marco,

Lucas Stach <l.stach@pengutronix.de> wrote on Fri, 26 Jul 2019 10:54:11
+0200:

> Hi Miguel,
> 
> Am Freitag, den 26.07.2019, 10:28 +0200 schrieb Miquel Raynal:
> > Hi Marco,
> > 
> > + Richard
> > + Working e-mail address for Boris
> >   
> > > Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019  
> > 09:44:34 +0200:
> >   
> > > Some devices don't support ecc "official". By "official" I mean that the

                                 ^ uppercase ECC

> > > feature can be set trough the "SET FEATURE (EFh)" command but isn't
> > > reported to the "READ ID Parameter Tables". Because the "ECC Field"
> > > still says that it is disabled. This is applicable at least
> > > for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
> > > datasheet describes the ECC feature in chapter "ECC Protection".

What about:

"Some devices are not supposed to support on-die ECC, but
experience shows that internal ECC machinery can actually be enabled
through the "SET FEATURE (EFh)" command, even if a read of the "READ ID
Parameter Tables" returns that it is not."

> > > 
> > > Currently the driver checks the "READ ID Parameter" field directly after
> > > we enabled the feature. If the check fails we return immediately but
> > > leave the ECC on. Now all future read/program cycles goes trough the ecc
> > > and the host nfc gets confused and reports ECC errors.

And here:

"Currently, the driver checks the "READ ID Parameter" field
directly after having enabled the feature. If the check fails it returns
immediately but leaves the ECC on. When using buggy chips like
MT29F2G08ABAGA and MT29F2G08ABBGA, all future read/program cycles will
go through the on-die ECC, confusing the host controller which is
supposed to be the one handling correction."

> > > To address this in a common way we need to turn off the ECC directly
> > > after reading the "READ ID Parameter" and before checking the
> > > "ECC status".
> > > 
> > > Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>  
> > 
> > Good catch! However you report that on-die ECC correction is working
> > but you still disable it; any reason to do so ? Would it be better to
> > actually enable on-die ECC and explicitly mark these two chips as
> > buggy (see [1] for checking the chip IDs)?  
> 
> It's the other way around. The chip is not supposed to have on-die ECC
> according to the datasheet and correctly reflects this fact in the
> READ_ID, so Linux should not try to use the on-die ECC.

Ok I understood the opposite because of the "Even worse the datasheet
describes the ECC feature [...]" which implied to me that the on-die ECC
feature was actually expected despite the status bit not being set.

Marco, can you rephrase the commit log a bit? I proposed something,
feel free to adapt.

> The bug is that the NAND is not supposed to have on-die ECC and reports
> this correctly, but then actually enables a on-die ECC unit when asked
> to, probably due to the same die being used for on-die ECC and ECC off
> devices. The consequence is that Linux (correctly) assumes that the
> full OOB size is available to the controller, but the on-die ECC unit
> scribbles over some of the OOB data.
> 
> I think this fix the most robust solution, as it makes sure to disable
> the on-die ECC unit to avoid the issue, which might also be present on
> other NAND chips we don't know about yet.
> 
> Regards,
> Lucas 
> 


Thanks,
Miquèl
Miquel Raynal July 26, 2019, 9:20 a.m. UTC | #6
Wrong address for Boris again, sorry for the noise.

Thanks,
Miquèl
Marco Felsch July 26, 2019, 9:40 a.m. UTC | #7
Hi Miquel,

On 19-07-26 11:20, Miquel Raynal wrote:
> Wrong address for Boris again, sorry for the noise.
> 
> > Hi Lucas, Marco,
> > 
> > Lucas Stach <l.stach@pengutronix.de> wrote on Fri, 26 Jul 2019 10:54:11
> > +0200:
> > 
> > > Hi Miguel,
> > > 
> > > Am Freitag, den 26.07.2019, 10:28 +0200 schrieb Miquel Raynal:  
> > > > Hi Marco,
> > > > 
> > > > + Richard
> > > > + Working e-mail address for Boris
> > > >     
> > > > > Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019    
> > > > 09:44:34 +0200:
> > > >     
> > > > > Some devices don't support ecc "official". By "official" I mean that the  
> > 
> >                                  ^ uppercase ECC
> > 
> > > > > feature can be set trough the "SET FEATURE (EFh)" command but isn't
> > > > > reported to the "READ ID Parameter Tables". Because the "ECC Field"
> > > > > still says that it is disabled. This is applicable at least
> > > > > for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
> > > > > datasheet describes the ECC feature in chapter "ECC Protection".  
> > 
> > What about:
> > 
> > "Some devices are supposed to do not support on-die ECC but
> > experience shows that internal ECC machinery can actually be enabled
> > through the "SET FEATURE (EFh)" command, even if a read of the "READ ID
> > Parameter Tables" returns that it is not."
> > 
> > > > > 
> > > > > Currently the driver checks the "READ ID Parameter" field directly after
> > > > > we enabled the feature. If the check fails we return immediately but
> > > > > leave the ECC on. Now all future read/program cycles goes trough the ecc
> > > > > and the host nfc gets confused and reports ECC errors.  
> > 
> > And here:
> > 
> > "Currently, the driver checks the "READ ID Parameter" field
> > directly after having enabled the feature. If the check fails it returns
> > immediately but leaves the ECC on. When using buggy chips like
> > MT29F2G08ABAGA and MT29F2G08ABBGA, all future read/program cycles will
> > go through the on-die ECC, confusing the host controller which is
> > supposed to be the one handling correction."
> > 
> > > > > To address this in a common way we need to turn off the ECC directly
> > > > > after reading the "READ ID Parameter" and before checking the
> > > > > "ECC status".
> > > > > 
> > > > > Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>    
> > > > 
> > > > Good catch! However you report that on-die ECC correction is working
> > > > but you still disable it; any reason to do so ? Would it be better to
> > > > actually enable on-die ECC and explicitly mark these two chips as
> > > > buggy (see [1] for checking the chip IDs)?    
> > > 
> > > It's the other way around. The chip is not supposed to have on-die ECC
> > > according to the datasheet and correctly reflects this fact in the
> > > READ_ID, so Linux should not try to use the on-die ECC.  
> > 
> > Ok I understood the opposite because of the "Even worse the datasheet
> > describes the ECC feature [...]" which implied to me that the on-die ECC
> > feature was actually expected despite the status bit not being set.
> > 
> > Marco, can you rephrase a bit the commit log? I proposed something,
> > feel free to adapt.

Thanks for the fast reply :) Of course I can adapt it and add Boris' Reviewed-by tag.

Regards,
  Marco

Miquel Raynal July 26, 2019, 1:26 p.m. UTC | #8
Hi Marco,

Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019
11:40:10 +0200:

> Hi Miquel,
> 
> On 19-07-26 11:20, Miquel Raynal wrote:
> > Wrong address for Boris again, sorry for the noise.
> >   
> > > Hi Lucas, Marco,
> > > 
> > > Lucas Stach <l.stach@pengutronix.de> wrote on Fri, 26 Jul 2019 10:54:11
> > > +0200:
> > >   
> > > > Hi Miquel,
> > > > 
> > > > Am Freitag, den 26.07.2019, 10:28 +0200 schrieb Miquel Raynal:    
> > > > > Hi Marco,
> > > > > 
> > > > > + Richard
> > > > > + Working e-mail address for Boris
> > > > >       
> > > > > > Marco Felsch <m.felsch@pengutronix.de> wrote on Fri, 26 Jul 2019      
> > > > > 09:44:34 +0200:
> > > > >       
> > > > > > Some devices don't support ecc "official". By "official" I mean that the    
> > > 
> > >                                  ^ uppercase ECC
> > >   
> > > > > > feature can be set trough the "SET FEATURE (EFh)" command but isn't
> > > > > > reported to the "READ ID Parameter Tables". Because the "ECC Field"
> > > > > > still says that it is disabled. This is applicable at least
> > > > > > for the MT29F2G08ABAGA and MT29F2G08ABBGA devices. Even worse the
> > > > > > datasheet describes the ECC feature in chapter "ECC Protection".    
> > > 
> > > What about:
> > > 
> > > "Some devices are not supposed to support on-die ECC, but
> > > experience shows that the internal ECC machinery can actually be enabled
> > > through the "SET FEATURE (EFh)" command, even if a read of the "READ ID
> > > Parameter Tables" reports that it is not supported."
> > >   
> > > > > > 
> > > > > > Currently the driver checks the "READ ID Parameter" field directly after
> > > > > > we enabled the feature. If the check fails we return immediately but
> > > > > > leave the ECC on. Now all future read/program cycles goes trough the ecc
> > > > > > and the host nfc gets confused and reports ECC errors.    
> > > 
> > > And here:
> > > 
> > > "Currently, the driver checks the "READ ID Parameter" field
> > > directly after having enabled the feature. If the check fails it returns
> > > immediately but leaves the ECC on. When using buggy chips like
> > > MT29F2G08ABAGA and MT29F2G08ABBGA, all future read/program cycles will
> > > go through the on-die ECC, confusing the host controller which is
> > > supposed to be the one handling correction."
> > >   
> > > > > > To address this in a common way we need to turn off the ECC directly
> > > > > > after reading the "READ ID Parameter" and before checking the
> > > > > > "ECC status".
> > > > > > 
> > > > > > Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>      
> > > > > 
> > > > > Good catch! However you report that on-die ECC correction is working
> > > > > but you still disable it; any reason to do so ? Would it be better to
> > > > > actually enable on-die ECC and explicitly mark these two chips as
> > > > > buggy (see [1] for checking the chip IDs)?      
> > > > 
> > > > It's the other way around. The chip is not supposed to have on-die ECC
> > > > according to the datasheet and correctly reflects this fact in the
> > > > READ_ID, so Linux should not try to use the on-die ECC.    
> > > 
> > > Ok I understood the opposite because of the "Even worse the datasheet
> > > describes the ECC feature [...]" which implied to me that the on-die ECC
> > > feature was actually expected despite the status bit not being set.
> > > 
> > > Marco, can you rephrase a bit the commit log? I proposed something,
> > > feel free to adapt.  
> 
> Thanks for the fast reply :) Of course I can adapt it and add Boris' rb-tag.

I suppose you can also add Fixes and Stable tags.

Thanks,
Miquèl

Patch

diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
index 1622d3145587..fb199ad2f1a6 100644
--- a/drivers/mtd/nand/raw/nand_micron.c
+++ b/drivers/mtd/nand/raw/nand_micron.c
@@ -390,6 +390,14 @@  static int micron_supports_on_die_ecc(struct nand_chip *chip)
 	    (chip->id.data[4] & MICRON_ID_INTERNAL_ECC_MASK) != 0x2)
 		return MICRON_ON_DIE_UNSUPPORTED;
 
+	/*
+	 * It seems that there are devices which do not support ECC official.
+	 * At least the MT29F2G08ABAGA / MT29F2G08ABBGA devices supports
+	 * enabling the ECC feature but don't reflect that to the READ_ID table.
+	 * So we have to guarantee that we disable the ECC feature directly
+	 * after we did the READ_ID table command. Later we can evaluate the
+	 * ECC_ENABLE support.
+	 */
 	ret = micron_nand_on_die_ecc_setup(chip, true);
 	if (ret)
 		return MICRON_ON_DIE_UNSUPPORTED;
@@ -398,13 +406,13 @@  static int micron_supports_on_die_ecc(struct nand_chip *chip)
 	if (ret)
 		return MICRON_ON_DIE_UNSUPPORTED;
 
-	if (!(id[4] & MICRON_ID_ECC_ENABLED))
-		return MICRON_ON_DIE_UNSUPPORTED;
-
 	ret = micron_nand_on_die_ecc_setup(chip, false);
 	if (ret)
 		return MICRON_ON_DIE_UNSUPPORTED;
 
+	if (!(id[4] & MICRON_ID_ECC_ENABLED))
+		return MICRON_ON_DIE_UNSUPPORTED;
+
 	ret = nand_readid_op(chip, 0, id, sizeof(id));
 	if (ret)
 		return MICRON_ON_DIE_UNSUPPORTED;
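
The effect of the reordering can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct mock_chip`, `mock_ecc_setup()`, `mock_read_id4()`, `probe_old()`, `probe_new()` and the `MOCK_ID_ECC_ENABLED` bit value are hypothetical stand-ins invented for the illustration, not kernel APIs, and the second READ_ID check that the real function performs after disabling ECC is left out. The model describes a chip whose ECC engine really switches on when asked, but whose ID byte never reports it, which is the behaviour discussed in this thread for the MT29F2G08ABAGA / MT29F2G08ABBGA.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit position, for illustration only. */
#define MOCK_ID_ECC_ENABLED 0x80

/* Hypothetical stand-in for a NAND chip; not the kernel's struct nand_chip. */
struct mock_chip {
	bool ecc_engine_on;   /* real state of the on-die ECC unit */
	bool id_reports_ecc;  /* whether READ_ID byte 4 reflects that state */
};

/* Models SET FEATURE: the engine always follows the request. */
static int mock_ecc_setup(struct mock_chip *chip, bool enable)
{
	chip->ecc_engine_on = enable;
	return 0;
}

/* Models READ_ID byte 4: the bit is set only if the chip reports ECC. */
static unsigned char mock_read_id4(const struct mock_chip *chip)
{
	if (chip->ecc_engine_on && chip->id_reports_ecc)
		return MOCK_ID_ECC_ENABLED;
	return 0;
}

/* Order before the patch: the early return skips the disable step. */
static bool probe_old(struct mock_chip *chip)
{
	unsigned char id4;

	if (mock_ecc_setup(chip, true))
		return false;
	id4 = mock_read_id4(chip);
	if (!(id4 & MOCK_ID_ECC_ENABLED))
		return false;	/* ECC engine is still on here */
	if (mock_ecc_setup(chip, false))
		return false;
	return true;
}

/* Order after the patch: always disable first, then evaluate the bit. */
static bool probe_new(struct mock_chip *chip)
{
	unsigned char id4;

	if (mock_ecc_setup(chip, true))
		return false;
	id4 = mock_read_id4(chip);
	if (mock_ecc_setup(chip, false))
		return false;
	if (!(id4 & MOCK_ID_ECC_ENABLED))
		return false;	/* ECC engine is already off here */
	return true;
}
```

Running both probes on a chip with `id_reports_ecc = false` gives the same "unsupported" answer, but only `probe_new()` leaves `ecc_engine_on` cleared, so the host controller keeps the full OOB area to itself.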