
libata: don't request sense data on !ZAC ATA devices

Message ID 20190624163250.GP657710@devbig004.ftw2.facebook.com
State Not Applicable
Delegated to: David Miller
Series libata: don't request sense data on !ZAC ATA devices

Commit Message

Tejun Heo June 24, 2019, 4:32 p.m. UTC
ZAC support added sense data requesting on error for both ZAC and ATA
devices. This seems to cause erratic error handling behaviors on some
SSDs where the device reports sense data availability and then
delivers the wrong content making EH take the wrong actions.  The
failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
reproduced.

There is no value in requesting sense data from non-ZAC ATA devices
while there's a significant risk of introducing EH misbehaviors which
are difficult to reproduce and fix.  Let's do the sense data dancing
only for ZAC devices.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hannes Reinecke <hare@kernel.org>
---
 drivers/ata/libata-eh.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Damien Le Moal June 24, 2019, 8:27 p.m. UTC | #1
On 2019/06/25 1:33, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions.  The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
> 
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix.  Let's do the sense data dancing
> only for ZAC devices.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/ata/libata-eh.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 9d687e1d4325..3bfd9da58473 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1469,7 +1469,7 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
>  	tf->hob_lbah = buf[10];
>  	tf->nsect = buf[12];
>  	tf->hob_nsect = buf[13];
> -	if (ata_id_has_ncq_autosense(dev->id))
> +	if (dev->class == ATA_DEV_ZAC && ata_id_has_ncq_autosense(dev->id))
>  		tf->auxiliary = buf[14] << 16 | buf[15] << 8 | buf[16];
>  
>  	return 0;
> @@ -1716,7 +1716,8 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  	memcpy(&qc->result_tf, &tf, sizeof(tf));
>  	qc->result_tf.flags = ATA_TFLAG_ISADDR | ATA_TFLAG_LBA | ATA_TFLAG_LBA48;
>  	qc->err_mask |= AC_ERR_DEV | AC_ERR_NCQ;
> -	if ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary) {
> +	if (dev->class == ATA_DEV_ZAC &&
> +	    ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary)) {
>  		char sense_key, asc, ascq;
>  
>  		sense_key = (qc->result_tf.auxiliary >> 16) & 0xff;
> @@ -1770,10 +1771,11 @@ static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc,
>  	}
>  
>  	switch (qc->dev->class) {
> -	case ATA_DEV_ATA:
>  	case ATA_DEV_ZAC:
>  		if (stat & ATA_SENSE)
>  			ata_eh_request_sense(qc, qc->scsicmd);
> +		/* fall through */
> +	case ATA_DEV_ATA:
>  		if (err & ATA_ICRC)
>  			qc->err_mask |= AC_ERR_ATA_BUS;
>  		if (err & (ATA_UNC | ATA_AMNF))
> 

For NCQ commands, I believe it is mandatory to request sense data for the failed
command to get the device out of error mode. So isn't this approach breaking
anything for well behaving drives ? Wouldn't it be better to blacklist the
misbehaving SSD you observed the problem with ?

Tejun Heo June 24, 2019, 8:57 p.m. UTC | #2
Hello, Damien.

On Mon, Jun 24, 2019 at 08:27:02PM +0000, Damien Le Moal wrote:
> For NCQ commands, I believe it is mandatory to request sense data for the failed
> command to get the device out of error mode. So isn't this approach breaking

Hah, that's news to me.  We never had that code path before ZAC
support was added, so I'm kinda skeptical that'd be the case.

> anything for well behaving drives ? Wouldn't it be better to blacklist the
> misbehaving SSD you observed the problem with ?

Provided I'm not wrong with the assumption, there's virtually no
benefit in doing this and that's gonna be a *really* difficult
blacklist to develop.

Thanks.

Damien Le Moal June 24, 2019, 9:59 p.m. UTC | #3
Tejun,

On 2019/06/25 5:57, Tejun Heo wrote:
> Hello, Damien.
> 
> On Mon, Jun 24, 2019 at 08:27:02PM +0000, Damien Le Moal wrote:
>> For NCQ commands, I believe it is mandatory to request sense data for the failed
>> command to get the device out of error mode. So isn't this approach breaking
> 
> Hah, that's news to me.  We never had that code path before ZAC
> support was added, so I'm kinda skeptical that'd be the case.

I checked the ACS specs again, and you are right: REQUEST SENSE DATA EXT is
optional in general, dependent on support of the Sense Data Reporting feature set.

For NCQ command errors, from ACS:

"If an error occurs while the device is processing an NCQ command, then the
device shall return command aborted for all NCQ commands that are in the queue
and shall return command aborted for any subsequent commands, except a command
from the GPL feature set (see 4.10) that reads the NCQ Command Error log (see
9.13), until the device completes that command without error."

So as long as NCQ command error log page is read, the device queue will get out
of error mode and new commands can be issued. There is no need for REQUEST SENSE
DATA EXT. I got confused with the fact that the Sense data reporting feature is
mandatory with ZAC drives (that is defined in ZAC, not ACS).

>> anything for well behaving drives ? Wouldn't it be better to blacklist the
>> misbehaving SSD you observed the problem with ?
> 
> Provided I'm not wrong with the assumption, there's virtually no
> benefit in doing this and that's gonna be a *really* difficult
> blacklist to develop.

You are not wrong :)
Will test your patch on our test rig which generates (on purpose) a lot of
command failures on ZAC drives. We can also give it a run with generated errors
on regular disks.

Cheers.

> 
> Thanks.
>

Hannes Reinecke June 25, 2019, 6:05 a.m. UTC | #4
On 6/24/19 6:32 PM, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions.  The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
> 
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix.  Let's do the sense data dancing
> only for ZAC devices.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/ata/libata-eh.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
Ah well. I hoped those bothering to implement sense data would do it
properly; seems I've been mistaken.

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

Damien Le Moal June 25, 2019, 12:25 p.m. UTC | #5
On 2019/06/25 1:33, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions.  The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
> 
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix.  Let's do the sense data dancing
> only for ZAC devices.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/ata/libata-eh.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 9d687e1d4325..3bfd9da58473 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1469,7 +1469,7 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
>  	tf->hob_lbah = buf[10];
>  	tf->nsect = buf[12];
>  	tf->hob_nsect = buf[13];
> -	if (ata_id_has_ncq_autosense(dev->id))
> +	if (dev->class == ATA_DEV_ZAC && ata_id_has_ncq_autosense(dev->id))
>  		tf->auxiliary = buf[14] << 16 | buf[15] << 8 | buf[16];
>  
>  	return 0;
> @@ -1716,7 +1716,8 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  	memcpy(&qc->result_tf, &tf, sizeof(tf));
>  	qc->result_tf.flags = ATA_TFLAG_ISADDR | ATA_TFLAG_LBA | ATA_TFLAG_LBA48;
>  	qc->err_mask |= AC_ERR_DEV | AC_ERR_NCQ;
> -	if ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary) {
> +	if (dev->class == ATA_DEV_ZAC &&
> +	    ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary)) {
>  		char sense_key, asc, ascq;
>  
>  		sense_key = (qc->result_tf.auxiliary >> 16) & 0xff;
> @@ -1770,10 +1771,11 @@ static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc,
>  	}
>  
>  	switch (qc->dev->class) {
> -	case ATA_DEV_ATA:
>  	case ATA_DEV_ZAC:
>  		if (stat & ATA_SENSE)
>  			ata_eh_request_sense(qc, qc->scsicmd);
> +		/* fall through */
> +	case ATA_DEV_ATA:
>  		if (err & ATA_ICRC)
>  			qc->err_mask |= AC_ERR_ATA_BUS;
>  		if (err & (ATA_UNC | ATA_AMNF))
> 

No problems with tests.

Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

Jens Axboe June 25, 2019, 3:35 p.m. UTC | #6
On 6/24/19 10:32 AM, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions.  The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
> 
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix.  Let's do the sense data dancing
> only for ZAC devices.

Applied, thanks Tejun.
diff mbox series

Patch

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 9d687e1d4325..3bfd9da58473 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1469,7 +1469,7 @@  static int ata_eh_read_log_10h(struct ata_device *dev,
 	tf->hob_lbah = buf[10];
 	tf->nsect = buf[12];
 	tf->hob_nsect = buf[13];
-	if (ata_id_has_ncq_autosense(dev->id))
+	if (dev->class == ATA_DEV_ZAC && ata_id_has_ncq_autosense(dev->id))
 		tf->auxiliary = buf[14] << 16 | buf[15] << 8 | buf[16];
 
 	return 0;
@@ -1716,7 +1716,8 @@  void ata_eh_analyze_ncq_error(struct ata_link *link)
 	memcpy(&qc->result_tf, &tf, sizeof(tf));
 	qc->result_tf.flags = ATA_TFLAG_ISADDR | ATA_TFLAG_LBA | ATA_TFLAG_LBA48;
 	qc->err_mask |= AC_ERR_DEV | AC_ERR_NCQ;
-	if ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary) {
+	if (dev->class == ATA_DEV_ZAC &&
+	    ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary)) {
 		char sense_key, asc, ascq;
 
 		sense_key = (qc->result_tf.auxiliary >> 16) & 0xff;
@@ -1770,10 +1771,11 @@  static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc,
 	}
 
 	switch (qc->dev->class) {
-	case ATA_DEV_ATA:
 	case ATA_DEV_ZAC:
 		if (stat & ATA_SENSE)
 			ata_eh_request_sense(qc, qc->scsicmd);
+		/* fall through */
+	case ATA_DEV_ATA:
 		if (err & ATA_ICRC)
 			qc->err_mask |= AC_ERR_ATA_BUS;
 		if (err & (ATA_UNC | ATA_AMNF))