diff mbox

[v2] MTD/GPMI bugfix : reset the BCH module when it is not MX23

Message ID 1325233646-3343-1-git-send-email-b32955@freescale.com
State New, archived
Headers show

Commit Message

Huang Shijie Dec. 30, 2011, 8:27 a.m. UTC
In MX28, if we do not reset the BCH module. The BCH module may
becomes unstable when the board reboots for several thousands times.
This bug has been catched in customer's production.

The patch adds some comments, and fixes it now.

Also change gpmi_reset_block() to static.

Signed-off-by: Huang Shijie <b32955@freescale.com>
---
 drivers/mtd/nand/gpmi-nand/gpmi-lib.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

Comments

Marek Vasut Dec. 30, 2011, 2:01 p.m. UTC | #1
> In MX28, if we do not reset the BCH module. The BCH module may
> becomes unstable when the board reboots for several thousands times.
> This bug has been catched in customer's production.
> 
> The patch adds some comments, and fixes it now.
> 
> Also change gpmi_reset_block() to static.
> 
> Signed-off-by: Huang Shijie <b32955@freescale.com>
> ---
>  drivers/mtd/nand/gpmi-nand/gpmi-lib.c |   10 +++++++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
> b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c index de4db76..1573e99 100644
> --- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
> +++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
> @@ -69,7 +69,8 @@ static int clear_poll_bit(void __iomem *addr, u32 mask)
>   *  [1] enable the module.
>   *  [2] reset the module.
>   *
> - * In most of the cases, it's ok. But there is a hardware bug in the BCH
> block. + * In most of the cases, it's ok.
> + * But in MX23, there is a hardware bug in the BCH block.
>   * If you try to soft reset the BCH block, it becomes unusable until
>   * the next hard reset. This case occurs in the NAND boot mode. When the
> board * boots by NAND, the ROM of the chip will initialize the BCH blocks
> itself. @@ -78,8 +79,10 @@ static int clear_poll_bit(void __iomem *addr,
> u32 mask) *
>   * To avoid this bug, just add a new parameter `just_enable` for
>   * the mxs_reset_block(), and rewrite it here.
> + *
> + * The bug has been fixed in the following chips, such as MX28.
>   */
> -int gpmi_reset_block(void __iomem *reset_addr, bool just_enable)
> +static int gpmi_reset_block(void __iomem *reset_addr, bool just_enable)
>  {
>  	int ret;
>  	int timeout = 0x400;
> @@ -206,7 +209,8 @@ int bch_set_geometry(struct gpmi_nand_data *this)
>  	if (ret)
>  		goto err_out;
> 
> -	ret = gpmi_reset_block(r->bch_regs, true);
> +	/* The bug only exits in mx23, the following chips fix it. */
> +	ret = gpmi_reset_block(r->bch_regs, GPMI_IS_MX23(this));
>  	if (ret)
>  		goto err_out;

Looks sane, though I can't test right now.

Acked-by: Marek Vasut <marek.vasut@gmail.com>
Wolfram Sang Dec. 30, 2011, 2:36 p.m. UTC | #2
On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
> In MX28, if we do not reset the BCH module. The BCH module may
> becomes unstable when the board reboots for several thousands times.

Do you have more details when and why this happens? What happens on MX23 then?

> Signed-off-by: Huang Shijie <b32955@freescale.com>

Is this Bug 2847 from the errata? Should be mentioned in the commit message and
comments of the patch if so.

Regards,

   Wolfram
Huang Shijie Dec. 31, 2011, 2:23 a.m. UTC | #3
Hi,
> On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
>> In MX28, if we do not reset the BCH module. The BCH module may
>> becomes unstable when the board reboots for several thousands times.
> Do you have more details when and why this happens? What happens on MX23 then?
In one customer's 3G router which uses the MX28:
[0] NAND boot mode, mount the UBIFS in the NAND partition.
[1] We used the gpmi_reset_block(r->bch_regs, true) to init the BCH module.
[2] The board will automatically reboot again after it booted by NAND
boot mode.
[3] After reboot more then thousands times(cost nearly one day), the BCH
mode became UNSTABLE,
the data read out was not right, so the system could not mount the UBIFS.

After we use gpmi_reset_block(r->bch_regs, false) to init the BCH
module, the bug never happens.



The gpmi_reset_block() was coded to avoid the NAND boot bug in MX23. So
MX23 does not have the bug.
>> Signed-off-by: Huang Shijie <b32955@freescale.com>
> Is this Bug 2847 from the errata? Should be mentioned in the commit message and
Yes, this is the bug 2847 from the mx23's errata.

I will add some comments to it.

Best Regards
Huang Shijie
> comments of the patch if so.
>
> Regards,
>
>    Wolfram
>
Marek Vasut Dec. 31, 2011, 3:15 a.m. UTC | #4
> Hi,
> 
> > On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
> >> In MX28, if we do not reset the BCH module. The BCH module may
> >> becomes unstable when the board reboots for several thousands times.
> > 
> > Do you have more details when and why this happens? What happens on MX23
> > then?
> 
> In one customer's 3G router which uses the MX28:
> [0] NAND boot mode, mount the UBIFS in the NAND partition.
> [1] We used the gpmi_reset_block(r->bch_regs, true) to init the BCH module.
> [2] The board will automatically reboot again after it booted by NAND
> boot mode.
> [3] After reboot more then thousands times(cost nearly one day), the BCH
> mode became UNSTABLE,
> the data read out was not right, so the system could not mount the UBIFS.
> 
> After we use gpmi_reset_block(r->bch_regs, false) to init the BCH
> module, the bug never happens.
> 
> 
> 
> The gpmi_reset_block() was coded to avoid the NAND boot bug in MX23. So
> MX23 does not have the bug.
> 
> >> Signed-off-by: Huang Shijie <b32955@freescale.com>
> > 
> > Is this Bug 2847 from the errata? Should be mentioned in the commit
> > message and
> 
> Yes, this is the bug 2847 from the mx23's errata.
> 
> I will add some comments to it.
> 
> Best Regards
> Huang Shijie
> 
> > comments of the patch if so.
> > 
> > Regards,
> > 
> >    Wolfram

Ah, so you confirmed it's a problem with the silicon. Good
Huang Shijie Dec. 31, 2011, 3:31 a.m. UTC | #5
hi,
>> Hi,
>>
>>> On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
>>>> In MX28, if we do not reset the BCH module. The BCH module may
>>>> becomes unstable when the board reboots for several thousands times.
>>> Do you have more details when and why this happens? What happens on MX23
>>> then?
>> In one customer's 3G router which uses the MX28:
>> [0] NAND boot mode, mount the UBIFS in the NAND partition.
>> [1] We used the gpmi_reset_block(r->bch_regs, true) to init the BCH module.
>> [2] The board will automatically reboot again after it booted by NAND
>> boot mode.
>> [3] After reboot more then thousands times(cost nearly one day), the BCH
>> mode became UNSTABLE,
>> the data read out was not right, so the system could not mount the UBIFS.
>>
>> After we use gpmi_reset_block(r->bch_regs, false) to init the BCH
>> module, the bug never happens.
>>
>>
>>
>> The gpmi_reset_block() was coded to avoid the NAND boot bug in MX23. So
>> MX23 does not have the bug.
>>
>>>> Signed-off-by: Huang Shijie<b32955@freescale.com>
>>> Is this Bug 2847 from the errata? Should be mentioned in the commit
>>> message and
>> Yes, this is the bug 2847 from the mx23's errata.
>>
>> I will add some comments to it.
>>
>> Best Regards
>> Huang Shijie
>>
>>> comments of the patch if so.
>>>
>>> Regards,
>>>
>>>     Wolfram
> Ah, so you confirmed it's a problem with the silicon. Good
>
yes, the bug in mx23 is really a hardware bug.

BR
Huang Shijie
Marek Vasut Dec. 31, 2011, 4:45 a.m. UTC | #6
> hi,
> 
> >> Hi,
> >> 
> >>> On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
> >>>> In MX28, if we do not reset the BCH module. The BCH module may
> >>>> becomes unstable when the board reboots for several thousands times.
> >>> 
> >>> Do you have more details when and why this happens? What happens on
> >>> MX23 then?
> >> 
> >> In one customer's 3G router which uses the MX28:
> >> [0] NAND boot mode, mount the UBIFS in the NAND partition.
> >> [1] We used the gpmi_reset_block(r->bch_regs, true) to init the BCH
> >> module. [2] The board will automatically reboot again after it booted
> >> by NAND boot mode.
> >> [3] After reboot more then thousands times(cost nearly one day), the BCH
> >> mode became UNSTABLE,
> >> the data read out was not right, so the system could not mount the
> >> UBIFS.
> >> 
> >> After we use gpmi_reset_block(r->bch_regs, false) to init the BCH
> >> module, the bug never happens.
> >> 
> >> 
> >> 
> >> The gpmi_reset_block() was coded to avoid the NAND boot bug in MX23. So
> >> MX23 does not have the bug.
> >> 
> >>>> Signed-off-by: Huang Shijie<b32955@freescale.com>
> >>> 
> >>> Is this Bug 2847 from the errata? Should be mentioned in the commit
> >>> message and
> >> 
> >> Yes, this is the bug 2847 from the mx23's errata.
> >> 
> >> I will add some comments to it.
> >> 
> >> Best Regards
> >> Huang Shijie
> >> 
> >>> comments of the patch if so.
> >>> 
> >>> Regards,
> >>> 
> >>>     Wolfram
> > 
> > Ah, so you confirmed it's a problem with the silicon. Good
> 
> yes, the bug in mx23 is really a hardware bug.
> 
> BR
> Huang Shijie

All right. Well ... this is probably the last email your way until next year so:

I wish you a successful 2012, thanks for all your awesome work and I hope we'll 
be able to continue working (not only) on MXS well next year too!

M
Wolfram Sang Dec. 31, 2011, 3:43 p.m. UTC | #7
Hi,

On Sat, Dec 31, 2011 at 10:23:50AM +0800, Huang Shijie wrote:

> > On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
> >> In MX28, if we do not reset the BCH module. The BCH module may
> >> becomes unstable when the board reboots for several thousands times.
> > Do you have more details when and why this happens? What happens on MX23 then?
> In one customer's 3G router which uses the MX28:
> [0] NAND boot mode, mount the UBIFS in the NAND partition.
> [1] We used the gpmi_reset_block(r->bch_regs, true) to init the BCH module.
> [2] The board will automatically reboot again after it booted by NAND
> boot mode.
> [3] After reboot more then thousands times(cost nearly one day), the BCH
> mode became UNSTABLE,
> the data read out was not right, so the system could not mount the UBIFS.
> 
> After we use gpmi_reset_block(r->bch_regs, false) to init the BCH
> module, the bug never happens.

Yes, I undestood that. I was curious _why_ this happens, for example, because
of a bug in the rom-code or such. This would then lead to the question if this
can be fixed by some other means. Because currently it looks like this to me:
If you would do the above scenario of booting 10000 times on a MX23, then BCH
could enter the same unstable state (can it?). Since we cannot reset BCH due to
bug 2847, we have a serious problem, because NAND won't work until the next
power-cycle? I am curious if my assumptions are true and we have a serious
problem on the MX23.

> Yes, this is the bug 2847 from the mx23's errata.

OK, thanks.

Kind regards,

   Wolfram
Huang Shijie Jan. 1, 2012, 3:23 p.m. UTC | #8
On Sat, Dec 31, 2011 at 10:43 AM, Wolfram Sang <w.sang@pengutronix.de> wrote:
> Hi,
>
> On Sat, Dec 31, 2011 at 10:23:50AM +0800, Huang Shijie wrote:
>
>> > On Fri, Dec 30, 2011 at 04:27:26PM +0800, Huang Shijie wrote:
>> >> In MX28, if we do not reset the BCH module. The BCH module may
>> >> becomes unstable when the board reboots for several thousands times.
>> > Do you have more details when and why this happens? What happens on MX23 then?
>> In one customer's 3G router which uses the MX28:
>> [0] NAND boot mode, mount the UBIFS in the NAND partition.
>> [1] We used the gpmi_reset_block(r->bch_regs, true) to init the BCH module.
>> [2] The board will automatically reboot again after it booted by NAND
>> boot mode.
>> [3] After reboot more then thousands times(cost nearly one day), the BCH
>> mode became UNSTABLE,
>> the data read out was not right, so the system could not mount the UBIFS.
>>
>> After we use gpmi_reset_block(r->bch_regs, false) to init the BCH
>> module, the bug never happens.
>
> Yes, I undestood that. I was curious _why_ this happens, for example, because
> of a bug in the rom-code or such. This would then lead to the question if this

This happens because we do NOT follow the right init procedure. The IC
guy told me.
If the BCH was not initialized correctly, it can not guarantees the
data is right.
This is why the mx28 failed.


> can be fixed by some other means. Because currently it looks like this to me:
> If you would do the above scenario of booting 10000 times on a MX23, then BCH
> could enter the same unstable state (can it?). Since we cannot reset BCH due to

For mx23, it should not reset the BCH, this is the right init
procedure for mx23.
But I did not ever boot 10000 times on mx23. I am not sure if it can
fail or not.



> bug 2847, we have a serious problem, because NAND won't work until the next
> power-cycle? I am curious if my assumptions are true and we have a serious
> problem on the MX23.
You can test it if you have time. :)

Happy new year.

Huang Shijie
Wolfram Sang Jan. 1, 2012, 10:33 p.m. UTC | #9
Hi,

> This happens because we do NOT follow the right init procedure. The IC
> guy told me.
> If the BCH was not initialized correctly, it can not guarantees the
> data is right.
> This is why the mx28 failed.
...
> For mx23, it should not reset the BCH, this is the right init
> procedure for mx23.

Okay, this was the information I was looking for. So, the MX23 cannot be reset
due to bug #2847. But it also does not NEED to be reset, which is unlike the
MX28 which needs the reset. Correct? Before that, it was only clear that MX23
cannot be reset. It was not clear (at least to me) if it still could then run
into the same problems as the MX28 after 10000+x boot cycles. That would be a
really bad situation: being in need of the reset which we can't do due to 2847.

> But I did not ever boot 10000 times on mx23. I am not sure if it can
> fail or not.

Hmm, can't your IC guy tell you? I was hoping he knows if the same problem which
happens on the MX28 can also happen on the MX23? I don't want to make your life
unnecessarily hard, but I'd really like to know if I need to warn customers
when using MX23 and NAND (and if so, we have to update the comment in the code).

> > bug 2847, we have a serious problem, because NAND won't work until the next
> > power-cycle? I am curious if my assumptions are true and we have a serious
> > problem on the MX23.
> You can test it if you have time. :)

Testing is guessing in this case, the problem could arise after <n+1> cycles if
I tested <n>. Actually knowing the situation would be more helpful, I'd think.
If this is not possible to find out, we'd need to add this as a comment, too,
so people experiencing problems have a pointer which can save them *a lot of*
time.

Kind regards,

   Wolfram
Huang Shijie Jan. 3, 2012, 2:30 a.m. UTC | #10
On Sun, Jan 1, 2012 at 5:33 PM, Wolfram Sang <w.sang@pengutronix.de> wrote:
> Hi,
>
>> This happens because we do NOT follow the right init procedure. The IC
>> guy told me.
>> If the BCH was not initialized correctly, it can not guarantees the
>> data is right.
>> This is why the mx28 failed.
> ...
>> For mx23, it should not reset the BCH, this is the right init
>> procedure for mx23.
>
> Okay, this was the information I was looking for. So, the MX23 cannot be reset
> due to bug #2847. But it also does not NEED to be reset, which is unlike the
> MX28 which needs the reset. Correct? Before that, it was only clear that MX23
yes. i think so.

> cannot be reset. It was not clear (at least to me) if it still could then run
> into the same problems as the MX28 after 10000+x boot cycles. That would be a
> really bad situation: being in need of the reset which we can't do due to 2847.
I will do the MX23 test, when i am free. It really costs lot of time :(

The bug only occurs in the software reboot. If you do hardware
reboot(power from off to on).
it does not exit. I am wonder if it's really has the same scenario in
the real life.
Who will soft reboot for 10000 continously?



>
>> But I did not ever boot 10000 times on mx23. I am not sure if it can
>> fail or not.
>
> Hmm, can't your IC guy tell you? I was hoping he knows if the same problem which
Our IC guy did not do such tough test as our customers did.

> happens on the MX28 can also happen on the MX23? I don't want to make your life
> unnecessarily hard, but I'd really like to know if I need to warn customers
> when using MX23 and NAND (and if so, we have to update the comment in the code).
>
>> > bug 2847, we have a serious problem, because NAND won't work until the next
>> > power-cycle? I am curious if my assumptions are true and we have a serious
>> > problem on the MX23.
>> You can test it if you have time. :)
>
> Testing is guessing in this case, the problem could arise after <n+1> cycles if
> I tested <n>. Actually knowing the situation would be more helpful, I'd think.
> If this is not possible to find out, we'd need to add this as a comment, too,
> so people experiencing problems have a pointer which can save them *a lot of*
yes, i agree. add it in next version.


Best Regards
Huang Shijie
> time.
>
> Kind regards,
>
>   Wolfram
>
> --
> Pengutronix e.K.                           | Wolfram Sang                |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Wolfram Sang Jan. 3, 2012, 11:11 a.m. UTC | #11
Hi Huang,

> I will do the MX23 test, when i am free. It really costs lot of time :(

My point is like this: I wondered if you really found out _why_ the MX28
has this problem, e.g. "There is a case where the rom code does <x>, and
when the NAND then does <y> the BCH goes to state <z> which requires a
soft reset of the ip-core." If you had been at that point, it should be
quite easy to check if this can happen on MX23, too, I think.

If you don't know the cause of the problem in that detail because the
problem is hardly reproducable, well, that can happen, too. I fully
understand that. I don't want to enforce the time-consuming test on you
(would be interesting, though), but then at least the comments need
proper updates, so people are informed. From my understanding of the
situation, it could look like this:

===

Due to Errata #2847 of the MX23, the BCH cannot be soft reset on this
chip, otherwise it will lock up. So we skip resetting BCH on the MX23.
On the other hand, the MX28 needs the reset, because one case has been
seen where the BCH produced ECC errors constantly after 10000
consecutive reboots. The latter case has not been seen on the MX23 yet,
still we don't if it could happen there as well.

===

What do you think?

Thanks,

   Wolfram
Huang Shijie Jan. 4, 2012, 2:43 a.m. UTC | #12
hi,
> Due to Errata #2847 of the MX23, the BCH cannot be soft reset on this
> chip, otherwise it will lock up. So we skip resetting BCH on the MX23.
> On the other hand, the MX28 needs the reset, because one case has been
> seen where the BCH produced ECC errors constantly after 10000
> consecutive reboots. The latter case has not been seen on the MX23 yet,
> still we don't if it could happen there as well.
>
> ===
>
> What do you think?
>
ok, thanks. I will add this comment to the patch.

Huang Shijie
diff mbox

Patch

diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
index de4db76..1573e99 100644
--- a/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
+++ b/drivers/mtd/nand/gpmi-nand/gpmi-lib.c
@@ -69,7 +69,8 @@  static int clear_poll_bit(void __iomem *addr, u32 mask)
  *  [1] enable the module.
  *  [2] reset the module.
  *
- * In most of the cases, it's ok. But there is a hardware bug in the BCH block.
+ * In most of the cases, it's ok.
+ * But in MX23, there is a hardware bug in the BCH block.
  * If you try to soft reset the BCH block, it becomes unusable until
  * the next hard reset. This case occurs in the NAND boot mode. When the board
  * boots by NAND, the ROM of the chip will initialize the BCH blocks itself.
@@ -78,8 +79,10 @@  static int clear_poll_bit(void __iomem *addr, u32 mask)
  *
  * To avoid this bug, just add a new parameter `just_enable` for
  * the mxs_reset_block(), and rewrite it here.
+ *
+ * The bug has been fixed in the following chips, such as MX28.
  */
-int gpmi_reset_block(void __iomem *reset_addr, bool just_enable)
+static int gpmi_reset_block(void __iomem *reset_addr, bool just_enable)
 {
 	int ret;
 	int timeout = 0x400;
@@ -206,7 +209,8 @@  int bch_set_geometry(struct gpmi_nand_data *this)
 	if (ret)
 		goto err_out;
 
-	ret = gpmi_reset_block(r->bch_regs, true);
+	/* The bug only exits in mx23, the following chips fix it. */
+	ret = gpmi_reset_block(r->bch_regs, GPMI_IS_MX23(this));
 	if (ret)
 		goto err_out;