diff mbox series

[9/9] mtd: nand: qcom: erased page bitflips detection

Message ID 1522845745-6624-10-git-send-email-absahu@codeaurora.org
State Changes Requested
Headers show
Series Update for QCOM NAND driver | expand

Commit Message

Abhishek Sahu April 4, 2018, 12:42 p.m. UTC
Some of the newer nand parts can have bit flips in an erased
page due to the process technology used. In this case, qpic
nand controller is not able to identify that page as an erased
page. Currently the driver calls nand_check_erased_ecc_chunk for
identifying the erased pages but this won’t work always since the
checking is being with ECC engine returned data. In case of
bitflips, the ECC engine tries to correct the data and then it
generates the uncorrectable error. Now, this data is not equal to
original raw data. For erased CW identification, the raw data
should be read again from NAND device and this
nand_check_erased_ecc_chunk function should be called for raw
data only.

Now following logic is being added to identify the erased
codeword bitflips.

1. In most of the case, not all the codewords will have bitflips
   and only single CW will have bitflips. So, there is no need to
   read the complete raw page data. The NAND raw read can be
   scheduled for any CW in page. The NAND controller works on CW
   basis and it will update the status register after each CW read.
   Maintain the bitmask for the CW which generated the uncorrectable
   error.
2. Schedule the raw flash read from NAND flash device to
   NAND controller buffer for all these CWs between first and last
   uncorrectable errors CWs. Copy the content from NAND controller
   buffer to actual data buffer only for the uncorrectable errors
   CWs so that other CW data content won’t be affected, and
   unnecessary data copy can be avoided.
3. Both DATA and OOB need to be checked for number of 0. The
   top-level API can be called with only data buf or oob buf so use
   chip->databuf if data buf is null and chip->oob_poi if
   oob buf is null for copying the raw bytes temporarily.
4. For each CW, check the number of 0 in cw_data and usable
   oob bytes, The bbm and spare bytes bit flip won’t affect the ECC
   so don’t check the number of bitflips in this area.

Signed-off-by: Abhishek Sahu <absahu@codeaurora.org>
---
 drivers/mtd/nand/qcom_nandc.c | 144 ++++++++++++++++++++++++++++++------------
 1 file changed, 104 insertions(+), 40 deletions(-)

Comments

Miquel Raynal April 10, 2018, 10:30 a.m. UTC | #1
Hi Abhishek,

On Wed,  4 Apr 2018 18:12:25 +0530, Abhishek Sahu
<absahu@codeaurora.org> wrote:

> Some of the newer nand parts can have bit flips in an erased
> page due to the process technology used. In this case, qpic

AFAIK, this has always been possible, it was just rare.

> nand controller is not able to identify that page as an erased
> page. Currently the driver calls nand_check_erased_ecc_chunk for
> identifying the erased pages but this won’t work always since the
> checking is being with ECC engine returned data. In case of
> bitflips, the ECC engine tries to correct the data and then it
> generates the uncorrectable error. Now, this data is not equal to
> original raw data. For erased CW identification, the raw data
> should be read again from NAND device and this
> nand_check_erased_ecc_chunk function should be called for raw
> data only.

Absolutely.

> 
> Now following logic is being added to identify the erased
> codeword bitflips.
> 
> 1. In most of the case, not all the codewords will have bitflips
>    and only single CW will have bitflips. So, there is no need to
>    read the complete raw page data. The NAND raw read can be
>    scheduled for any CW in page. The NAND controller works on CW
>    basis and it will update the status register after each CW read.
>    Maintain the bitmask for the CW which generated the uncorrectable
>    error.
> 2. Schedule the raw flash read from NAND flash device to
>    NAND controller buffer for all these CWs between first and last
>    uncorrectable errors CWs. Copy the content from NAND controller
>    buffer to actual data buffer only for the uncorrectable errors
>    CWs so that other CW data content won’t be affected, and
>    unnecessary data copy can be avoided.

In case of uncorrectable error, the penalty is huge anyway.

> 3. Both DATA and OOB need to be checked for number of 0. The
>    top-level API can be called with only data buf or oob buf so use
>    chip->databuf if data buf is null and chip->oob_poi if
>    oob buf is null for copying the raw bytes temporarily.

You can do that. But when you do, you should tell the core you used
that buffer and that it cannot rely on what is inside. Please
invalidate the page cache with:

chip->pagebuf = -1;

> 4. For each CW, check the number of 0 in cw_data and usable
>    oob bytes, The bbm and spare bytes bit flip won’t affect the ECC
>    so don’t check the number of bitflips in this area.

OOB is an area in which you are supposed to find the BBM, the ECC bytes
and the spare bytes. Spare bytes == usable OOB bytes. And the BBM
should be protected too. I don't get this sentence but I don't see its
application neither in the code?

> 
> Signed-off-by: Abhishek Sahu <absahu@codeaurora.org>
> ---

Thanks,
Miquèl
Abhishek Sahu April 12, 2018, 8 a.m. UTC | #2
On 2018-04-10 16:00, Miquel Raynal wrote:
> Hi Abhishek,
> 
> On Wed,  4 Apr 2018 18:12:25 +0530, Abhishek Sahu
> <absahu@codeaurora.org> wrote:
> 
>> Some of the newer nand parts can have bit flips in an erased
>> page due to the process technology used. In this case, qpic
> 
> AFAIK, this has always been possible, it was just rare.
> 

  Yes Miquel. It was rare earlier.
  Now, we are observing this more for newer parts coming.

>> nand controller is not able to identify that page as an erased
>> page. Currently the driver calls nand_check_erased_ecc_chunk for
>> identifying the erased pages but this won’t work always since the
>> checking is being with ECC engine returned data. In case of
>> bitflips, the ECC engine tries to correct the data and then it
>> generates the uncorrectable error. Now, this data is not equal to
>> original raw data. For erased CW identification, the raw data
>> should be read again from NAND device and this
>> nand_check_erased_ecc_chunk function should be called for raw
>> data only.
> 
> Absolutely.
> 
>> 
>> Now following logic is being added to identify the erased
>> codeword bitflips.
>> 
>> 1. In most of the case, not all the codewords will have bitflips
>>    and only single CW will have bitflips. So, there is no need to
>>    read the complete raw page data. The NAND raw read can be
>>    scheduled for any CW in page. The NAND controller works on CW
>>    basis and it will update the status register after each CW read.
>>    Maintain the bitmask for the CW which generated the uncorrectable
>>    error.
>> 2. Schedule the raw flash read from NAND flash device to
>>    NAND controller buffer for all these CWs between first and last
>>    uncorrectable errors CWs. Copy the content from NAND controller
>>    buffer to actual data buffer only for the uncorrectable errors
>>    CWs so that other CW data content won’t be affected, and
>>    unnecessary data copy can be avoided.
> 
> In case of uncorrectable error, the penalty is huge anyway.
> 

  Yes. We can't avoid that.
  But we are reducing that by doing raw read for few subpages in
  which we got uncorrectale error.

>> 3. Both DATA and OOB need to be checked for number of 0. The
>>    top-level API can be called with only data buf or oob buf so use
>>    chip->databuf if data buf is null and chip->oob_poi if
>>    oob buf is null for copying the raw bytes temporarily.
> 
> You can do that. But when you do, you should tell the core you used
> that buffer and that it cannot rely on what is inside. Please
> invalidate the page cache with:
> 
> chip->pagebuf = -1;
> 

  Thanks Miquel. I will check and update the patch.

>> 4. For each CW, check the number of 0 in cw_data and usable
>>    oob bytes, The bbm and spare bytes bit flip won’t affect the ECC
>>    so don’t check the number of bitflips in this area.
> 
> OOB is an area in which you are supposed to find the BBM, the ECC bytes
> and the spare bytes. Spare bytes == usable OOB bytes. And the BBM
> should be protected too. I don't get this sentence but I don't see its
> application neither in the code?
> 

  QCOM NAND layout does not support the BBM ECC protection.

  IN OOB,

  For all the possible layouts (4 bit RS/4 bit BCH/8 bit BCH)
  it has 16 usable OOB bytes which is protected with ECC.

  All the bytes in OOB other than BBM, ECC bytes and usable
  OOB bytes are ununsed.

  You can refer qcom_nand_host_setup for layout detail.

  Thanks,
  Abhishek
diff mbox series

Patch

diff --git a/drivers/mtd/nand/qcom_nandc.c b/drivers/mtd/nand/qcom_nandc.c
index f5d1fa4..ec0b7db 100644
--- a/drivers/mtd/nand/qcom_nandc.c
+++ b/drivers/mtd/nand/qcom_nandc.c
@@ -1698,25 +1698,112 @@  static int check_flash_errors(struct qcom_nand_host *host, int cw_cnt)
 }
 
 /*
+ * Bitflips can happen in erased codewords also so this function counts the
+ * number of 0 in each CW for which ECC engine returns the uncorrectable
+ * error. The page will be assumed as erased if this count is less than or
+ * equal to the ecc->strength for each CW.
+ *
+ * 1. Both DATA and OOB need to be checked for number of 0. The
+ *    top-level API can be called with only data buf or oob buf so use
+ *    chip->data_buf if data buf is null and chip->oob_poi if oob buf
+ *    is null for copying the raw bytes.
+ * 2. Perform raw read for all the CW which has uncorrectable errors.
+ * 3. For each CW, check the number of 0 in cw_data and usable oob bytes.
+ *    The bbm and spare bytes bit flip won’t affect the ECC so don’t check
+ *    the number of bitflips in this area.
+ */
+static int
+check_for_erased_page(struct qcom_nand_host *host, u8 *data_buf,
+		      u8 *oob_buf, unsigned long uncorrectable_err_cws,
+		      int page, unsigned int max_bitflips, bool last_cw)
+{
+	struct nand_chip *chip = &host->chip;
+	struct qcom_nand_controller *nandc = get_qcom_nand_controller(chip);
+	struct mtd_info *mtd = nand_to_mtd(chip);
+	struct nand_ecc_ctrl *ecc = &chip->ecc;
+	int i, start_step, last_step, ret = 0;
+
+	start_step = ffs(uncorrectable_err_cws) - 1;
+	last_step = fls(uncorrectable_err_cws);
+
+	if (!last_cw) {
+		if (!data_buf)
+			data_buf = chip->data_buf;
+		if (!oob_buf)
+			oob_buf = chip->oob_poi;
+		data_buf += start_step * host->cw_data;
+		oob_buf += start_step * ecc->bytes;
+	}
+
+	clear_read_regs(nandc);
+	nandc_read_page_raw(mtd, chip, data_buf, oob_buf, page,
+			    uncorrectable_err_cws);
+
+	for (i = start_step; i < last_step; i++) {
+		int data_size, oob_size;
+
+		if (i == (ecc->steps - 1)) {
+			data_size = ecc->size - ((ecc->steps - 1) << 2);
+			oob_size = (ecc->steps << 2) + host->ecc_bytes_hw;
+		} else {
+			data_size = host->cw_data;
+			oob_size = host->ecc_bytes_hw;
+		}
+
+		if (uncorrectable_err_cws & BIT(i)) {
+			/*
+			 * make sure it isn't an erased page reported
+			 * as not-erased by HW because of a few bitflips
+			 */
+			ret = nand_check_erased_ecc_chunk(data_buf,
+				data_size, oob_buf + host->bbm_size,
+				oob_size, NULL,
+				0, ecc->strength);
+			if (ret < 0) {
+				mtd->ecc_stats.failed++;
+			} else {
+				mtd->ecc_stats.corrected += ret;
+				max_bitflips =
+					max_t(unsigned int, max_bitflips, ret);
+			}
+		}
+
+		data_buf += data_size;
+		oob_buf += ecc->bytes;
+	}
+
+	return max_bitflips;
+}
+
+/*
  * reads back status registers set by the controller to notify page read
  * errors. this is equivalent to what 'ecc->correct()' would do.
  */
 static int parse_read_errors(struct qcom_nand_host *host, u8 *data_buf,
-			     u8 *oob_buf, bool last_cw)
+			     u8 *oob_buf, bool last_cw, int page)
 {
 	struct nand_chip *chip = &host->chip;
 	struct qcom_nand_controller *nandc = get_qcom_nand_controller(chip);
 	struct mtd_info *mtd = nand_to_mtd(chip);
 	struct nand_ecc_ctrl *ecc = &chip->ecc;
-	unsigned int max_bitflips = 0;
+	unsigned int max_bitflips = 0, uncorrectable_err_cws = 0;
 	struct read_stats *buf;
 	bool flash_op_err = false;
-	int i, cw_cnt = last_cw ? 1 : ecc->steps;
+	int i, start_cw, cw_cnt = last_cw ? 1 : ecc->steps;
+	u8 *data_buf_start = data_buf, *oob_buf_start = oob_buf;
 
 	buf = (struct read_stats *)nandc->reg_read_buf;
 	nandc_read_buffer_sync(nandc, true);
 
-	for (i = 0; i < cw_cnt; i++, buf++) {
+	if (last_cw) {
+		start_cw = ecc->steps - 1;
+		cw_cnt = 1;
+	} else {
+		start_cw = 0;
+		cw_cnt = ecc->steps;
+	}
+
+	for (i = start_cw; i < start_cw + cw_cnt; i++, buf++) {
 		u32 flash, buffer, erased_cw;
 		int data_len, oob_len;
 
@@ -1746,36 +1833,8 @@  static int parse_read_errors(struct qcom_nand_host *host, u8 *data_buf,
 				erased = false;
 			}
 
-			if (erased) {
-				data_buf += data_len;
-				if (oob_buf)
-					oob_buf += oob_len + ecc->bytes;
-				continue;
-			}
-
-			if (buffer & BS_UNCORRECTABLE_BIT) {
-				int ret, ecclen, extraooblen;
-				void *eccbuf;
-
-				eccbuf = oob_buf ? oob_buf + oob_len : NULL;
-				ecclen = oob_buf ? host->ecc_bytes_hw : 0;
-				extraooblen = oob_buf ? oob_len : 0;
-
-				/*
-				 * make sure it isn't an erased page reported
-				 * as not-erased by HW because of a few bitflips
-				 */
-				ret = nand_check_erased_ecc_chunk(data_buf,
-					data_len, eccbuf, ecclen, oob_buf,
-					extraooblen, ecc->strength);
-				if (ret < 0) {
-					mtd->ecc_stats.failed++;
-				} else {
-					mtd->ecc_stats.corrected += ret;
-					max_bitflips =
-						max_t(unsigned int, max_bitflips, ret);
-				}
-			}
+			if (!erased)
+				uncorrectable_err_cws |= BIT(i);
 		} else if (flash & (FS_OP_ERR | FS_MPU_ERR)) {
 			flash_op_err = true;
 		} else {
@@ -1795,7 +1854,12 @@  static int parse_read_errors(struct qcom_nand_host *host, u8 *data_buf,
 	if (flash_op_err)
 		return -EIO;
 
-	return max_bitflips;
+	if (!uncorrectable_err_cws)
+		return max_bitflips;
+
+	return check_for_erased_page(host, data_buf_start, oob_buf_start,
+				     uncorrectable_err_cws, page,
+				     max_bitflips, last_cw);
 }
 
 /*
@@ -1803,7 +1867,7 @@  static int parse_read_errors(struct qcom_nand_host *host, u8 *data_buf,
  * ecc->read_oob()
  */
 static int read_page_ecc(struct qcom_nand_host *host, u8 *data_buf,
-			 u8 *oob_buf)
+			 u8 *oob_buf, int page)
 {
 	struct nand_chip *chip = &host->chip;
 	struct qcom_nand_controller *nandc = get_qcom_nand_controller(chip);
@@ -1876,7 +1940,7 @@  static int read_page_ecc(struct qcom_nand_host *host, u8 *data_buf,
 
 	if (!ret)
 		ret = parse_read_errors(host, data_buf_start, oob_buf_start,
-					false);
+					false, page);
 
 	return ret;
 }
@@ -1917,7 +1981,7 @@  static int copy_last_cw(struct qcom_nand_host *host, int page)
 		if (host->use_ecc)
 			ret = parse_read_errors(host, nandc->data_buffer,
 						nandc->data_buffer + size,
-						true);
+						true, page);
 		else
 			ret = check_flash_errors(host, 1);
 	}
@@ -1939,7 +2003,7 @@  static int qcom_nandc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 
 	clear_bam_transaction(nandc);
 
-	return read_page_ecc(host, data_buf, oob_buf);
+	return read_page_ecc(host, data_buf, oob_buf, page);
 }
 
 /* implements ecc->read_page_raw() */
@@ -1966,7 +2030,7 @@  static int qcom_nandc_read_oob(struct mtd_info *mtd, struct nand_chip *chip,
 	set_address(host, 0, page);
 	update_rw_regs(host, ecc->steps, true);
 
-	return read_page_ecc(host, NULL, chip->oob_poi);
+	return read_page_ecc(host, NULL, chip->oob_poi, page);
 }
 
 /* implements ecc->write_page() */