Message ID | 1389343298-22473-1-git-send-email-b32955@freescale.com |
---|---|
State | Superseded |
Headers | show |
On 10 Jan 2014, b32955@freescale.com wrote: > This patch does a check for the uncorrectable failure in the following > steps: > [0] set the threshold. The threshold is set based on the truth: "A > single 0 bit will lead to gf_len(13 or 14) bits 0 after the BCH do the > ECC." > For the sake of safe, we will set the threshold with half the gf_len, > and do not make it bigger the ECC strength. > [1] count the bitflips of the current ECC chunk, assume it is N. > [2] if the (N <= threshold) is true, we continue to read out the page > with ECC disabled. and we count the bitflips again, assume it is N2. > [3] if the (N2 <= threshold) is true again, we can regard this is a > erased page. This is because a real erased page is full of 0xFF(maybe > also has several bitflips), while a page contains the 0xFF data will > definitely has many bitflips in the ECC parity area. > [4] if the [3] fails, we can regard this is a page filled with the > '0xFF' data. Sorry, I am a slow thinker. Why do we bother with steps 0-2 at all? Why not just read the page without ECC on an un-correctable error. Another driver (which I was patterning off of) is the fsmc_nand.c and it's count_written_bits() routine. It has an interesting feature to pass the strength to the counting routine, so that it aborts early if more than 'strength' zeros are encountered. If you remove steps 0-2, I think you end up with the same results and just the code size and run time change. For your cases, 1) Erased NAND sector, it will be much faster. 2) Un-correctable all xff it will be, a) just as fast for errors just above strength. b) slower for many errors. 3) A read error with non-ff data. Benefits from early abort due to strength exceeded. Will be slower if step 0-2 omitted. The case 2b should never happen for a properly functioning system. If a block has such a bad sector, it should be in the BB. I guess checking the ECCed data is of benefit for case 3. However, the most common case should be 1, an erased sector. It will be common during UBI scanning on boot up for instance. 2a and 3 are actually the same case with different page data. For certain, the short-circuit is a benefit if you leave the loop on the ECCed data. I think that the cases of permanent errors will be more common that read errors that repair by re-writing data/migrating data. So, I think that the probability is, 1) Erased page 2) Program error (just above strength) 3) Read failure (just above strength). 4) Errors far above strength (maybe impossible). For the items 2,3 these will be migrated to the bad blocks so maybe they are the same to us. I think that as Elie noted, the erased page is the really common item. Wouldn't it be best to optimize for that? Skipping the first check also makes the run time for xFF and non-FF data closer depending on an early abort of 'thresh' exceeded. Thanks for some interesting code. Bill Pringlemeir.
[cc trimmed] > On 10 Jan 2014, b32955@freescale.com wrote: >> This patch does a check for the uncorrectable failure in the following >> steps: On 10 Jan 2014, bpringlemeir@nbsps.com wrote: > Another driver (which I was patterning off of) is the fsmc_nand.c and > it's count_written_bits() routine. > It has an interesting feature to pass the strength to the counting > routine, so that it aborts early if more than 'strength' zeros are > encountered. I think 'fsmc_nand' will not detect the 'all ff data' and ECC failed condition? Also, it seems that UBI will write mostly xFF pages when just an EC page/sub-page is marked, so these type of pages maybe fairly common in practice. It certainly would be a nice feature for a hardware ECC which writes to the OOB data to have an 'erase detect' bit in the ECC status. I looked at some other MTD devices and it didn't seem they had to handle this. Do some controller recognize an erased page and give an ECC status as ok? Fwiw, Bill Pringlemeir.
On Fri, Jan 10, 2014 at 02:41:29PM -0500, Bill Pringlemeir wrote: > On 10 Jan 2014, b32955@freescale.com wrote: > > > This patch does a check for the uncorrectable failure in the following > > steps: > > > [0] set the threshold. The threshold is set based on the truth: "A > > single 0 bit will lead to gf_len(13 or 14) bits 0 after the BCH do the > > ECC." > > > For the sake of safe, we will set the threshold with half the gf_len, > > and do not make it bigger the ECC strength. > > > [1] count the bitflips of the current ECC chunk, assume it is N. > > > [2] if the (N <= threshold) is true, we continue to read out the page > > with ECC disabled. and we count the bitflips again, assume it is N2. > > > [3] if the (N2 <= threshold) is true again, we can regard this is a > > erased page. This is because a real erased page is full of 0xFF(maybe > > also has several bitflips), while a page contains the 0xFF data will > > definitely has many bitflips in the ECC parity area. > > > [4] if the [3] fails, we can regard this is a page filled with the > > '0xFF' data. > > Sorry, I am a slow thinker. Why do we bother with steps 0-2 at all? > Why not just read the page without ECC on an un-correctable error. Nearly all the nand chips (<19nm) have the read-retry feature. The steps 0-2 is for the read-retry shortcut. If we remove steps 0-2, the read-retry will have to read out the buffer with ECC disabled for each times the uncorrectable error occur, it will slower the READ-Retry. :) Most of the read-retry may need 8 times ECC reads or even more. So we'd better keep the steps 0-2. > Another driver (which I was patterning off of) is the fsmc_nand.c and > it's count_written_bits() routine. I think it is okay to add the similar shortcut code for steps 0-2. I will add it in the next version. thanks Huang Shijie
On Fri, Jan 10, 2014 at 03:46:20PM -0500, Bill Pringlemeir wrote: > I think 'fsmc_nand' will not detect the 'all ff data' and ECC failed > condition? Also, it seems that UBI will write mostly xFF pages when > just an EC page/sub-page is marked, so these type of pages maybe fairly > common in practice. > > It certainly would be a nice feature for a hardware ECC which writes to > the OOB data to have an 'erase detect' bit in the ECC status. I looked > at some other MTD devices and it didn't seem they had to handle this. > Do some controller recognize an erased page and give an ECC status as > ok? In fact, the gpmi can supports this feature by set the HW_BCH_MODE:ERASE_THRESHOLD. but Elie had proved it is a IC bug. I have issued a request to the hardware guy, and hope they can fix it in the future. thanks Huang Shijie
diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c index e2f5820..533f25f 100644 --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c @@ -958,6 +958,53 @@ static void block_mark_swapping(struct gpmi_nand_data *this, p[1] = (p[1] & mask) | (from_oob >> (8 - bit)); } +static bool gpmi_erased_check(struct gpmi_nand_data *this, + unsigned char *data, unsigned int chunk, int page, + unsigned int *max_bitflips) +{ + struct nand_chip *chip = &this->nand; + struct mtd_info *mtd = &this->mtd; + struct bch_geometry *geo = &this->bch_geometry; + int base = geo->ecc_chunk_size * chunk; + unsigned int flip_bits = 0, flip_bits_noecc = 0; + uint8_t *buf = this->data_buffer_dma; + unsigned int threshold; + int i; + + threshold = geo->gf_len / 2; + if (threshold > geo->ecc_strength) + threshold = geo->ecc_strength; + + /* Count bitflips */ + for (i = 0; i < geo->ecc_chunk_size; i++) + flip_bits += hweight8(~data[base + i]); + + if (flip_bits <= threshold) { + dev_dbg(this->dev, "check for the erased page:%d, chunk:%d\n", + page, chunk); + + /* Read out the page without ECC enabled, and check it again */ + chip->cmdfunc(mtd, NAND_CMD_READ0, 0, page); + chip->read_buf(mtd, buf, geo->payload_size); + + /* Count the bitflips for the no ECC buffer */ + for (i = 0; i < geo->payload_size; i++) + flip_bits_noecc += hweight8(~buf[i]); + + if (flip_bits_noecc <= threshold) { + dev_dbg(this->dev, "find an erased page %d(%d:%d)\n", + page, flip_bits, flip_bits_noecc); + + /* Tell the upper layer the bitflips we corrected. */ + *max_bitflips = max_t(unsigned int, *max_bitflips, + flip_bits); + memset(data, 0xff, geo->payload_size); + return true; + } + } + return false; +} + static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, uint8_t *buf, int oob_required, int page) { @@ -1007,6 +1054,10 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, continue; if (*status == STATUS_UNCORRECTABLE) { + if (gpmi_erased_check(this, payload_virt, i, + page, &max_bitflips)) + break; + mtd->ecc_stats.failed++; continue; }