diff mbox series

[2/2] ata: libata: skip error analysis for commands that are not errors

Message ID 20221111110921.1273193-3-niklas.cassel@wdc.com
State New
Headers show
Series libata NCQ error handling fix and improvement | expand

Commit Message

Niklas Cassel Nov. 11, 2022, 11:09 a.m. UTC
A NCQ error means that the device has aborted a single NCQ command and
halted further processing of queued commands.
To get the single NCQ command that caused the NCQ error, host software has
to read the NCQ error log, which also takes the device out of error state.

ata_eh_link_autopsy() starts off by calling ata_eh_analyze_ncq_error() to
read the NCQ error log, to find the offending command that caused the NCQ
error. ata_eh_analyze_ncq_error() marks the offending command by setting
qc->err_mask to AC_ERR_NCQ.

ata_eh_link_autopsy() then continues with further analysis for all
commands owned by libata EH.

However, once we have found the offending command, we know that all other
commands cannot be an error. (Theoretically a command could have timed out
just before the NCQ error happened (so EH was scheduled but did not yet
run), such command will have AC_ERR_TIMEOUT set in qc->err_mask.)

Therefore, after finding the offending command, we know that we can simply
skip the per command analysis for all commands that have not been marked
as error at this point, since we know that they have to be retried anyway.

Do this by changing ata_eh_analyze_ncq_error() to mark all non-failed
commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
to skip commands marked as ATA_QCFLAG_RETRY.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
Flag ATA_QCFLAG_FAILED is kept since it means that the command is owned
by libata EH. A failed command will always have qc->err_mask set.
Perhaps flag ATA_QCFLAG_FAILED should be renamed to something like
ATA_QCFLAG_OWNED_BY_EH to further clarify this meaning, and that the flag
does not necessarily mean that it is an error.

 drivers/ata/libata-eh.c   | 1 +
 drivers/ata/libata-sata.c | 8 ++++++++
 2 files changed, 9 insertions(+)

Comments

Damien Le Moal Nov. 14, 2022, 2:15 a.m. UTC | #1
On 11/11/22 20:09, Niklas Cassel wrote:
> A NCQ error means that the device has aborted a single NCQ command and
> halted further processing of queued commands.

...means that the device has aborted processing of all active commands.

> To get the single NCQ command that caused the NCQ error, host software has
> to read the NCQ error log, which also takes the device out of error state.
> 
> ata_eh_link_autopsy() starts off by calling ata_eh_analyze_ncq_error() to
> read the NCQ error log, to find the offending command that caused the NCQ
> error. ata_eh_analyze_ncq_error() marks the offending command by setting
> qc->err_mask to AC_ERR_NCQ.
> 
> ata_eh_link_autopsy() then continues with further analysis for all
> commands owned by libata EH.
> 
> However, once we have found the offending command, we know that all other
> commands cannot be an error. (Theoretically a command could have timed out
> just before the NCQ error happened (so EH was scheduled but did not yet
> run), such command will have AC_ERR_TIMEOUT set in qc->err_mask.)
> 
> Therefore, after finding the offending command, we know that we can simply
> skip the per command analysis for all commands that have not been marked
> as error at this point, since we know that they have to be retried anyway.
> 
> Do this by changing ata_eh_analyze_ncq_error() to mark all non-failed
> commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
> to skip commands marked as ATA_QCFLAG_RETRY.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
> Flag ATA_QCFLAG_FAILED is kept since it means that the command is owned
> by libata EH. A failed command will always have qc->err_mask set.
> Perhaps flag ATA_QCFLAG_FAILED should be renamed to something like
> ATA_QCFLAG_OWNED_BY_EH to further clarify this meaning, and that the flag
> does not necessarily mean that it is an error.
> 
>  drivers/ata/libata-eh.c   | 1 +
>  drivers/ata/libata-sata.c | 8 ++++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index bde15f855f70..34303ce67c14 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1955,6 +1955,7 @@ static void ata_eh_link_autopsy(struct ata_link *link)
>  
>  	ata_qc_for_each_raw(ap, qc, tag) {
>  		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
> +		    qc->flags & ATA_QCFLAG_RETRY ||
>  		    ata_dev_phys_link(qc->dev) != link)
>  			continue;
>  
> diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
> index 6b2dcf3eb2fb..18ef14e749a0 100644
> --- a/drivers/ata/libata-sata.c
> +++ b/drivers/ata/libata-sata.c
> @@ -1493,6 +1493,14 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  		 */
>  		qc->result_tf.status &= ~ATA_ERR;
>  		qc->result_tf.error = 0;
> +
> +		/*
> +		 * If we get a NCQ error, that means that a single command was
> +		 * aborted. All other failed commands for our link should be
> +		 * retried and has no business of going though further scrutiny
> +		 * by ata_eh_link_autopsy().
> +		 */
> +		qc->flags |= ATA_QCFLAG_RETRY;
>  	}
>  
>  	ehc->i.err_mask &= ~AC_ERR_DEV;

I think this patch should be squashed with the previous one. That will
really correctly fix the EH handling of NCQ errors.
diff mbox series

Patch

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index bde15f855f70..34303ce67c14 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1955,6 +1955,7 @@  static void ata_eh_link_autopsy(struct ata_link *link)
 
 	ata_qc_for_each_raw(ap, qc, tag) {
 		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
+		    qc->flags & ATA_QCFLAG_RETRY ||
 		    ata_dev_phys_link(qc->dev) != link)
 			continue;
 
diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 6b2dcf3eb2fb..18ef14e749a0 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -1493,6 +1493,14 @@  void ata_eh_analyze_ncq_error(struct ata_link *link)
 		 */
 		qc->result_tf.status &= ~ATA_ERR;
 		qc->result_tf.error = 0;
+
+		/*
+		 * If we get a NCQ error, that means that a single command was
+		 * aborted. All other failed commands for our link should be
+		 * retried and has no business of going though further scrutiny
+		 * by ata_eh_link_autopsy().
+		 */
+		qc->flags |= ATA_QCFLAG_RETRY;
 	}
 
 	ehc->i.err_mask &= ~AC_ERR_DEV;