mtd: brcmnand: Workaround false ECC uncorrectable errors

Message ID	5659FF4A.7080203@simon.arlott.org.uk
State	Superseded
Headers	show Return-Path: <linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org> From: Simon Arlott <simon@fire.lp0.eu> Subject: [PATCH] mtd: brcmnand: Workaround false ECC uncorrectable errors To: Brian Norris <computersforpeace@gmail.com>, Kamal Dasu <kdasu.kdev@gmail.com>, David Woodhouse <dwmw2@infradead.org>, MTD Maling List <linux-mtd@lists.infradead.org>, bcm-kernel-feedback-list@broadcom.com, Linux Kernel Mailing List <linux-kernel@vger.kernel.org> Message-ID: <5659FF4A.7080203@simon.arlott.org.uk> Date: Sat, 28 Nov 2015 19:23:54 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 summary: Content analysis details: (-4.3 points) pts rule name description ---- ---------------------- -------------------------------------------------- -2.3 RCVD_IN_DNSWL_MED RBL: Sender listed at http://www.dnswl.org/, medium trust [2001:8b0:ffea:0:205:b4ff:fe12:530 listed in] [list.dnswl.org] -0.0 T_RP_MATCHES_RCVD Envelope sender domain matches handover relay domain -0.0 SPF_PASS SPF: sender matches SPF record -0.0 SPF_HELO_PASS SPF: HELO matches SPF record -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature Precedence: list Cc: Jonas Gorski <jogo@openwrt.org>, Florian Fainelli <f.fainelli@gmail.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-mtd" <linux-mtd-bounces@lists.infradead.org> Errors-To: linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org

Simon Arlott Nov. 28, 2015, 7:23 p.m. UTC

Workaround false ECC uncorrectable errors by checking if the data
has been erased and the OOB data indicates that the data has been
erased. The v4.0 controller on the BCM63168 incorrectly handles
these as uncorrectable errors.

I don't know which version of the controller handles this scenario
correctly. I'm only able to test this in Hamming mode, so the BCH
version needs to be checked.

This code is quite complicated because the fact that either the data
buffer or the OOB data may not have been read from the controller, and
both of them are required to determine if the error is incorrect.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
 drivers/mtd/nand/brcmnand/brcmnand.c | 107 ++++++++++++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

Jonas Gorski Dec. 1, 2015, 10:41 a.m. UTC | #1

Hi,

On Sat, Nov 28, 2015 at 8:23 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
> Workaround false ECC uncorrectable errors by checking if the data
> has been erased and the OOB data indicates that the data has been
> erased. The v4.0 controller on the BCM63168 incorrectly handles
> these as uncorrectable errors.
>
> I don't know which version of the controller handles this scenario
> correctly. I'm only able to test this in Hamming mode, so the BCH
> version needs to be checked.
>
> This code is quite complicated because the fact that either the data
> buffer or the OOB data may not have been read from the controller, and
> both of them are required to determine if the error is incorrect.
>
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
> ---
>  drivers/mtd/nand/brcmnand/brcmnand.c | 107 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 106 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mtd/nand/brcmnand/brcmnand.c b/drivers/mtd/nand/brcmnand/brcmnand.c
> index 5f26b8a..0857af7 100644
> --- a/drivers/mtd/nand/brcmnand/brcmnand.c
> +++ b/drivers/mtd/nand/brcmnand/brcmnand.c
> @@ -1387,6 +1387,102 @@ static int brcmnand_dma_trans(struct brcmnand_host *host, u64 addr, u32 *buf,
>  }
>
>  /*
> + * Workaround false ECC uncorrectable errors by checking if the data
> + * has been erased and the OOB data indicates that the data has been
> + * erased. The controller incorrectly handles these as uncorrectable
> + * errors.
> + */
> +static void brcmnand_read_ecc_unc_err(struct mtd_info *mtd,
> +                               struct nand_chip *chip,
> +                               int idx, unsigned int trans,
> +                               u32 *buf, u8 *oob_begin, u8 *oob_end,
> +                               u64 *err_addr)
> +{
> +       struct brcmnand_host *host = chip->priv;
> +       struct brcmnand_controller *ctrl = host->ctrl;
> +       u32 *buf_tmp = NULL;
> +       u8 *oob_tmp = NULL;
> +       bool buf_erased = false;
> +       bool oob_erased = false;
> +       int j;
> +
> +       /* Assume this is fixed in v5.0+ */
> +       if (ctrl->nand_version >= 0x0500)
> +               return;
> +
> +       /* Read OOB data if not already read */
> +       if (!oob_begin) {
> +               oob_tmp = kmalloc(ctrl->max_oob, GFP_KERNEL);
> +               if (!oob_tmp)
> +                       goto out_free;
> +
> +               oob_begin = oob_tmp;
> +               oob_end = oob_tmp;
> +
> +               oob_end += read_oob_from_regs(ctrl, idx, oob_tmp,
> +                               mtd->oobsize / trans,
> +                               host->hwcfg.sector_size_1k);
> +       }
> +
> +       if (is_hamming_ecc(&host->hwcfg)) {
> +               u8 *oob_offset = oob_begin + 6;
> +
> +               if (oob_offset + 3 < oob_end)
> +                       /* Erased if ECC bytes are all 0xFF, or the data bytes
> +                        * are all 0xFF which should have Hamming codes of 0x00
> +                        */
> +                       oob_erased = memchr_inv(oob_offset, 0xFF, 3) == NULL ||
> +                               memchr_inv(oob_offset, 0x00, 3) == NULL;
> +       } else { /* BCH */
> +               u8 *oob_offset = oob_end - ctrl->max_oob;
> +
> +               if (oob_offset >= oob_begin)
> +                       /* Erased if ECC bytes are all 0xFF */
> +                       oob_erased = memchr_inv(oob_offset, 0xFF,
> +                                               oob_end - oob_offset) == NULL;
> +       }
> +
> +       if (!oob_erased)
> +               goto out_free;
> +
> +       /* Read data buffer if not already read */
> +       if (!buf) {
> +               buf_tmp = kmalloc_array(FC_WORDS, sizeof(*buf_tmp), GFP_KERNEL);
> +               if (!buf_tmp)
> +                       goto out_free;
> +
> +               buf = buf_tmp;
> +
> +               brcmnand_soc_data_bus_prepare(ctrl->soc);
> +
> +               for (j = 0; j < FC_WORDS; j++, buf++)
> +                       *buf = brcmnand_read_fc(ctrl, j);
> +
> +               brcmnand_soc_data_bus_unprepare(ctrl->soc);
> +       }
> +
> +       /* Go to start of buffer */
> +       buf -= FC_WORDS;
> +
> +       /* Erased if all data bytes are 0xFF */
> +       buf_erased = memchr_inv(buf, 0xFF, FC_WORDS) == NULL;
> +
> +       if (!buf_erased)
> +               goto out_free;

We now have a function exactly for that use case in 4.4,
nand_check_erased_buf [1], consider using that. This also has the
benefit of treating bit flips as correctable as long as the ECC scheme
is strong enough.


Jonas

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/nand_base.c#n1110

Simon Arlott Dec. 2, 2015, 8:17 p.m. UTC | #2

On 01/12/15 10:41, Jonas Gorski wrote:
> On Sat, Nov 28, 2015 at 8:23 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
>> +
>> +       /* Go to start of buffer */
>> +       buf -= FC_WORDS;
>> +
>> +       /* Erased if all data bytes are 0xFF */
>> +       buf_erased = memchr_inv(buf, 0xFF, FC_WORDS) == NULL;
>> +
>> +       if (!buf_erased)
>> +               goto out_free;
> 
> We now have a function exactly for that use case in 4.4,
> nand_check_erased_buf [1], consider using that. This also has the
> benefit of treating bit flips as correctable as long as the ECC scheme
> is strong enough.

I have no idea whether or not it's appropriate to specify
bitflips_threshold > 0 so it'd just be a more complex way to do
a memchr_inv() search for 0xFF.

The code also has to check for the hamming code bytes being all 0x00,
because according to the comments [2], the controller also has
difficulty with the non-erased all-0xFFs scenario too.

[2] https://github.com/lp0/bcm963xx_4.12L.06B_consumer/blob/dd8fcb13046f738c311507dc2fcfd3e5d57a88e0/kernel/linux/drivers/mtd/brcmnand/brcmnand_base.c#L2459

> 
> Jonas
> 
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/nand_base.c#n1110
>

Jonas Gorski Dec. 2, 2015, 8:44 p.m. UTC | #3

On Wed, Dec 2, 2015 at 9:17 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
> On 01/12/15 10:41, Jonas Gorski wrote:
>> On Sat, Nov 28, 2015 at 8:23 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
>>> +
>>> +       /* Go to start of buffer */
>>> +       buf -= FC_WORDS;
>>> +
>>> +       /* Erased if all data bytes are 0xFF */
>>> +       buf_erased = memchr_inv(buf, 0xFF, FC_WORDS) == NULL;
>>> +
>>> +       if (!buf_erased)
>>> +               goto out_free;
>>
>> We now have a function exactly for that use case in 4.4,
>> nand_check_erased_buf [1], consider using that. This also has the
>> benefit of treating bit flips as correctable as long as the ECC scheme
>> is strong enough.
>
> I have no idea whether or not it's appropriate to specify
> bitflips_threshold > 0 so it'd just be a more complex way to do
> a memchr_inv() search for 0xFF.

The threshold would be the amount of bitflips the code can correct, so
basically ecc.strength (at least that is my understanding).

> The code also has to check for the hamming code bytes being all 0x00,
> because according to the comments [2], the controller also has
> difficulty with the non-erased all-0xFFs scenario too.

According to brcmnand.c hamming can fix up to fifteen bitflips, but in
the current code you would fail a hamming protected all-0xff-page for
even a single bitflip in the data or in the ecc bytes, which means
that all-0xff-pages wouldn't be protected at all.


Jonas

Brian Norris Dec. 2, 2015, 8:54 p.m. UTC | #4

Hi,

On Wed, Dec 02, 2015 at 09:44:04PM +0100, Jonas Gorski wrote:
> On Wed, Dec 2, 2015 at 9:17 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
> > On 01/12/15 10:41, Jonas Gorski wrote:
> >> On Sat, Nov 28, 2015 at 8:23 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
> >>> +
> >>> +       /* Go to start of buffer */
> >>> +       buf -= FC_WORDS;
> >>> +
> >>> +       /* Erased if all data bytes are 0xFF */
> >>> +       buf_erased = memchr_inv(buf, 0xFF, FC_WORDS) == NULL;
> >>> +
> >>> +       if (!buf_erased)
> >>> +               goto out_free;
> >>
> >> We now have a function exactly for that use case in 4.4,
> >> nand_check_erased_buf [1], consider using that. This also has the
> >> benefit of treating bit flips as correctable as long as the ECC scheme
> >> is strong enough.
> >
> > I have no idea whether or not it's appropriate to specify
> > bitflips_threshold > 0 so it'd just be a more complex way to do
> > a memchr_inv() search for 0xFF.
> 
> The threshold would be the amount of bitflips the code can correct, so
> basically ecc.strength (at least that is my understanding).
> 
> > The code also has to check for the hamming code bytes being all 0x00,
> > because according to the comments [2], the controller also has
> > difficulty with the non-erased all-0xFFs scenario too.
> 
> According to brcmnand.c hamming can fix up to fifteen bitflips, but in

Hamming only protects 1 bitflip. The '15' is the value used by the
controller to represent Hamming (i.e., there is no BCH-15).

> the current code you would fail a hamming protected all-0xff-page for
> even a single bitflip in the data or in the ecc bytes, which means
> that all-0xff-pages wouldn't be protected at all.

BTW, I think Kamal had code to handle protecting bitflips in erased
pages code in the Broadcom STB Linux BSP. Perhaps he can port that to
upstream with nand_check_erased_ecc_chunk()? IIUC, that would probably
handle your case too, Simon, although it wouldn't be optimal for an
all-0xff check (i.e., bitflip_threshold == 0).

If that's really an issue (i.e., we have an implementation + data), I'm
sure we could add optimization to nand_check_erased_ecc_chunk() to
support the bitflip_threshold == 0 case.

Brian

Brian Norris Dec. 2, 2015, 9:06 p.m. UTC | #5

On Wed, Dec 02, 2015 at 12:54:15PM -0800, Brian Norris wrote:
> BTW, I think Kamal had code to handle protecting bitflips in erased
> pages code in the Broadcom STB Linux BSP. Perhaps he can port that to
> upstream with nand_check_erased_ecc_chunk()? IIUC, that would probably
> handle your case too, Simon, although it wouldn't be optimal for an
> all-0xff check (i.e., bitflip_threshold == 0).

I think this was for BCH only, actually. Hamming wasn't used as much on
STB. So the above would only help part of this patch. I'm less sure of
the behavior for the Hamming engine.

Brian

Jonas Gorski Dec. 2, 2015, 9:08 p.m. UTC | #6

Hi,

On Wed, Dec 2, 2015 at 9:54 PM, Brian Norris
<computersforpeace@gmail.com> wrote:
> Hi,
>
> On Wed, Dec 02, 2015 at 09:44:04PM +0100, Jonas Gorski wrote:
>> On Wed, Dec 2, 2015 at 9:17 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
>> > On 01/12/15 10:41, Jonas Gorski wrote:
>> >> On Sat, Nov 28, 2015 at 8:23 PM, Simon Arlott <simon@fire.lp0.eu> wrote:
>> >>> +
>> >>> +       /* Go to start of buffer */
>> >>> +       buf -= FC_WORDS;
>> >>> +
>> >>> +       /* Erased if all data bytes are 0xFF */
>> >>> +       buf_erased = memchr_inv(buf, 0xFF, FC_WORDS) == NULL;
>> >>> +
>> >>> +       if (!buf_erased)
>> >>> +               goto out_free;
>> >>
>> >> We now have a function exactly for that use case in 4.4,
>> >> nand_check_erased_buf [1], consider using that. This also has the
>> >> benefit of treating bit flips as correctable as long as the ECC scheme
>> >> is strong enough.
>> >
>> > I have no idea whether or not it's appropriate to specify
>> > bitflips_threshold > 0 so it'd just be a more complex way to do
>> > a memchr_inv() search for 0xFF.
>>
>> The threshold would be the amount of bitflips the code can correct, so
>> basically ecc.strength (at least that is my understanding).
>>
>> > The code also has to check for the hamming code bytes being all 0x00,
>> > because according to the comments [2], the controller also has
>> > difficulty with the non-erased all-0xFFs scenario too.
>>
>> According to brcmnand.c hamming can fix up to fifteen bitflips, but in
>
> Hamming only protects 1 bitflip. The '15' is the value used by the
> controller to represent Hamming (i.e., there is no BCH-15).

Ah, yeah that confused me because I also vaguely remembered hamming
only providing protection for 1, but then saw the ecc_level = 15
assignment.

Still, that means that even hamming protected erased pages with a
single bitflip should be treated as readable / all-0xff, but with
correctable bitflips, and not as uncorrectable.

>> the current code you would fail a hamming protected all-0xff-page for
>> even a single bitflip in the data or in the ecc bytes, which means
>> that all-0xff-pages wouldn't be protected at all.
>
> BTW, I think Kamal had code to handle protecting bitflips in erased
> pages code in the Broadcom STB Linux BSP. Perhaps he can port that to
> upstream with nand_check_erased_ecc_chunk()? IIUC, that would probably
> handle your case too, Simon, although it wouldn't be optimal for an
> all-0xff check (i.e., bitflip_threshold == 0).
>
> If that's really an issue (i.e., we have an implementation + data), I'm
> sure we could add optimization to nand_check_erased_ecc_chunk() to
> support the bitflip_threshold == 0 case.

Maybe I'm missing something, but wasn't the point of introducing
nand_check_erased_ecc_chunk that bitflips in erased pages should be
treated as bitflips corrected by the ecc, and therefore fixed up
before passing the data further on? So having a theshold of 0 would be
wrong / no protection at all, and could be quite destructive on MLC
nand, where bitflips in erased pages are rather common.


Jonas

Simon Arlott Dec. 2, 2015, 9:29 p.m. UTC | #7

On 02/12/15 21:08, Jonas Gorski wrote:
> On Wed, Dec 2, 2015 at 9:54 PM, Brian Norris
> <computersforpeace@gmail.com> wrote:
>>
>> BTW, I think Kamal had code to handle protecting bitflips in erased
>> pages code in the Broadcom STB Linux BSP. Perhaps he can port that to
>> upstream with nand_check_erased_ecc_chunk()? IIUC, that would probably
>> handle your case too, Simon, although it wouldn't be optimal for an
>> all-0xff check (i.e., bitflip_threshold == 0).
>>
>> If that's really an issue (i.e., we have an implementation + data), I'm
>> sure we could add optimization to nand_check_erased_ecc_chunk() to
>> support the bitflip_threshold == 0 case.
> 
> Maybe I'm missing something, but wasn't the point of introducing
> nand_check_erased_ecc_chunk that bitflips in erased pages should be
> treated as bitflips corrected by the ecc, and therefore fixed up
> before passing the data further on? So having a theshold of 0 would be
> wrong / no protection at all, and could be quite destructive on MLC
> nand, where bitflips in erased pages are rather common.

Without this patch I can't access erased pages at all. I don't know if
the controller will still return an uncorrectable error if the page is
erased but has 1 or more bit flips.

Brian Norris Dec. 2, 2015, 10:22 p.m. UTC | #8

On Wed, Dec 02, 2015 at 10:08:20PM +0100, Jonas Gorski wrote:
> On Wed, Dec 2, 2015 at 9:54 PM, Brian Norris
> <computersforpeace@gmail.com> wrote:
> > If that's really an issue (i.e., we have an implementation + data), I'm
> > sure we could add optimization to nand_check_erased_ecc_chunk() to
> > support the bitflip_threshold == 0 case.
> 
> Maybe I'm missing something, but wasn't the point of introducing
> nand_check_erased_ecc_chunk that bitflips in erased pages should be
> treated as bitflips corrected by the ecc, and therefore fixed up
> before passing the data further on? So having a theshold of 0 would be
> wrong / no protection at all, and could be quite destructive on MLC
> nand, where bitflips in erased pages are rather common.

Yes, that would be pretty destructive on MLC NAND. But there might be
cases where this is reasonable. For instance, we might have:

 (a) SLC NAND that specifies up to 1 bitflip per 512 bytes
 (b) a Hamming ECC engine that can correct 1 bitflip
 (c) said Hamming engine does not handle all-0xff pages correctly, nor
 does it handle near-0xff pages (e.g., with a single 0 bit) correctly

Now, we might (for instance) allow up to ecc_strength / 2 bitflips in
erased pages [1], as a general rule, but for the above case, that means we
don't allow any on Hamming ECC. Hence, bitflip_threshold = 1 / 2 = 0.

This is kind of a degenerate case, so we might consider supporting it
differently than we would for BCH. e.g., don't use
nand_check_erased_ecc_chunk() at all.

Brian

[1] We might do this because we can expect that if an unprogrammed page
has N bitflips, there might be even more than N bitflips after
programming the page.

mtd: brcmnand: Workaround false ECC uncorrectable errors

Commit Message

Comments

Patch