Message ID | 20190624163250.GP657710@devbig004.ftw2.facebook.com |
---|---|
State | Not Applicable |
Delegated to: | David Miller |
Series | libata: don't request sense data on !ZAC ATA devices |
On 2019/06/25 1:33, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions. The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
>
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix. Let's do the sense data dancing
> only for ZAC devices.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/ata/libata-eh.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 9d687e1d4325..3bfd9da58473 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1469,7 +1469,7 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
>  	tf->hob_lbah = buf[10];
>  	tf->nsect = buf[12];
>  	tf->hob_nsect = buf[13];
> -	if (ata_id_has_ncq_autosense(dev->id))
> +	if (dev->class == ATA_DEV_ZAC && ata_id_has_ncq_autosense(dev->id))
>  		tf->auxiliary = buf[14] << 16 | buf[15] << 8 | buf[16];
>
>  	return 0;
> @@ -1716,7 +1716,8 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  	memcpy(&qc->result_tf, &tf, sizeof(tf));
>  	qc->result_tf.flags = ATA_TFLAG_ISADDR | ATA_TFLAG_LBA | ATA_TFLAG_LBA48;
>  	qc->err_mask |= AC_ERR_DEV | AC_ERR_NCQ;
> -	if ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary) {
> +	if (dev->class == ATA_DEV_ZAC &&
> +	    ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary)) {
>  		char sense_key, asc, ascq;
>
>  		sense_key = (qc->result_tf.auxiliary >> 16) & 0xff;
> @@ -1770,10 +1771,11 @@ static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc,
>  	}
>
>  	switch (qc->dev->class) {
> -	case ATA_DEV_ATA:
>  	case ATA_DEV_ZAC:
>  		if (stat & ATA_SENSE)
>  			ata_eh_request_sense(qc, qc->scsicmd);
> +		/* fall through */
> +	case ATA_DEV_ATA:
>  		if (err & ATA_ICRC)
>  			qc->err_mask |= AC_ERR_ATA_BUS;
>  		if (err & (ATA_UNC | ATA_AMNF))
>

For NCQ commands, I believe it is mandatory to request sense data for the failed
command to get the device out of error mode. So isn't this approach breaking
anything for well-behaved drives? Wouldn't it be better to blacklist the
misbehaving SSD you observed the problem with?
Hello, Damien.

On Mon, Jun 24, 2019 at 08:27:02PM +0000, Damien Le Moal wrote:
> For NCQ commands, I believe it is mandatory to request sense data for the failed
> command to get the device out of error mode. So isn't this approach breaking

Hah, that's news to me. We never had that code path before ZAC
support was added, so I'm kinda skeptical that'd be the case.

> anything for well-behaved drives? Wouldn't it be better to blacklist the
> misbehaving SSD you observed the problem with?

Provided I'm not wrong with the assumption, there's virtually no
benefit in doing this and that's gonna be a *really* difficult
blacklist to develop.

Thanks.
Tejun,

On 2019/06/25 5:57, Tejun Heo wrote:
> Hello, Damien.
>
> On Mon, Jun 24, 2019 at 08:27:02PM +0000, Damien Le Moal wrote:
>> For NCQ commands, I believe it is mandatory to request sense data for the failed
>> command to get the device out of error mode. So isn't this approach breaking
>
> Hah, that's news to me. We never had that code path before ZAC
> support was added, so I'm kinda skeptical that'd be the case.

I checked the ACS specs again, and you are right: REQUEST SENSE DATA EXT is
optional in general, dependent on support of the Sense Data Reporting feature
set. For NCQ command errors, from ACS:

"If an error occurs while the device is processing an NCQ command, then the
device shall return command aborted for all NCQ commands that are in the queue
and shall return command aborted for any subsequent commands, except a command
from the GPL feature set (see 4.10) that reads the NCQ Command Error log (see
9.13), until the device completes that command without error."

So as long as the NCQ Command Error log page is read, the device queue will get
out of error mode and new commands can be issued. There is no need for REQUEST
SENSE DATA EXT. I got confused by the fact that the Sense Data Reporting feature
set is mandatory for ZAC drives (that is defined in ZAC, not ACS).

>> anything for well-behaved drives? Wouldn't it be better to blacklist the
>> misbehaving SSD you observed the problem with?
>
> Provided I'm not wrong with the assumption, there's virtually no
> benefit in doing this and that's gonna be a *really* difficult
> blacklist to develop.

You are not wrong :)

Will test your patch on our test rig, which generates (on purpose) a lot of
command failures on ZAC drives. We can also give it a run with generated errors
on regular disks.

Cheers.

>
> Thanks.
>
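The ACS rule quoted above can be sketched as a tiny state model. This is only an illustration of the quoted text, not libata code: the types `cmd_kind`, `cmd_result`, `dev_model` and the functions `dev_ncq_failure()` and `dev_issue()` are hypothetical names invented here, and the model assumes the log read itself always completes without error.

```c
#include <stdbool.h>

/* Hypothetical command categories; not taken from the kernel or the spec. */
enum cmd_kind { CMD_NCQ_IO, CMD_READ_NCQ_ERROR_LOG, CMD_OTHER };
enum cmd_result { CMD_OK, CMD_ABORTED };

/* Toy device: the only state tracked is "an NCQ error is outstanding". */
struct dev_model {
	bool ncq_error_pending;
};

/* An NCQ command failed: the device enters error mode. */
void dev_ncq_failure(struct dev_model *d)
{
	d->ncq_error_pending = true;
}

/*
 * While in error mode every command is aborted, except a read of the
 * NCQ Command Error log. Completing that read leaves error mode.
 * No REQUEST SENSE DATA EXT is involved in getting out of error mode.
 */
enum cmd_result dev_issue(struct dev_model *d, enum cmd_kind kind)
{
	if (!d->ncq_error_pending)
		return CMD_OK;
	if (kind == CMD_READ_NCQ_ERROR_LOG) {
		d->ncq_error_pending = false;
		return CMD_OK;
	}
	return CMD_ABORTED;
}
```

In this model, after `dev_ncq_failure()` a `CMD_NCQ_IO` or `CMD_OTHER` command is aborted until one `CMD_READ_NCQ_ERROR_LOG` has been issued, which matches the conclusion that reading the log page is sufficient.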
On 6/24/19 6:32 PM, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions. The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
>
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix. Let's do the sense data dancing
> only for ZAC devices.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/ata/libata-eh.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>

Ah well. I hoped those bothering to implement sense data would do it
properly; seems I've been mistaken.

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
On 2019/06/25 1:33, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions. The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
>
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix. Let's do the sense data dancing
> only for ZAC devices.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/ata/libata-eh.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 9d687e1d4325..3bfd9da58473 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1469,7 +1469,7 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
>  	tf->hob_lbah = buf[10];
>  	tf->nsect = buf[12];
>  	tf->hob_nsect = buf[13];
> -	if (ata_id_has_ncq_autosense(dev->id))
> +	if (dev->class == ATA_DEV_ZAC && ata_id_has_ncq_autosense(dev->id))
>  		tf->auxiliary = buf[14] << 16 | buf[15] << 8 | buf[16];
>
>  	return 0;
> @@ -1716,7 +1716,8 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  	memcpy(&qc->result_tf, &tf, sizeof(tf));
>  	qc->result_tf.flags = ATA_TFLAG_ISADDR | ATA_TFLAG_LBA | ATA_TFLAG_LBA48;
>  	qc->err_mask |= AC_ERR_DEV | AC_ERR_NCQ;
> -	if ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary) {
> +	if (dev->class == ATA_DEV_ZAC &&
> +	    ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary)) {
>  		char sense_key, asc, ascq;
>
>  		sense_key = (qc->result_tf.auxiliary >> 16) & 0xff;
> @@ -1770,10 +1771,11 @@ static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc,
>  	}
>
>  	switch (qc->dev->class) {
> -	case ATA_DEV_ATA:
>  	case ATA_DEV_ZAC:
>  		if (stat & ATA_SENSE)
>  			ata_eh_request_sense(qc, qc->scsicmd);
> +		/* fall through */
> +	case ATA_DEV_ATA:
>  		if (err & ATA_ICRC)
>  			qc->err_mask |= AC_ERR_ATA_BUS;
>  		if (err & (ATA_UNC | ATA_AMNF))
>

No problems with tests.

Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
On 6/24/19 10:32 AM, Tejun Heo wrote:
> ZAC support added sense data requesting on error for both ZAC and ATA
> devices. This seems to cause erratic error handling behaviors on some
> SSDs where the device reports sense data availability and then
> delivers the wrong content making EH take the wrong actions. The
> failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
> reproduced.
>
> There is no value in requesting sense data from non-ZAC ATA devices
> while there's a significant risk of introducing EH misbehaviors which
> are difficult to reproduce and fix. Let's do the sense data dancing
> only for ZAC devices.

Applied, thanks Tejun.
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 9d687e1d4325..3bfd9da58473 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1469,7 +1469,7 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
 	tf->hob_lbah = buf[10];
 	tf->nsect = buf[12];
 	tf->hob_nsect = buf[13];
-	if (ata_id_has_ncq_autosense(dev->id))
+	if (dev->class == ATA_DEV_ZAC && ata_id_has_ncq_autosense(dev->id))
 		tf->auxiliary = buf[14] << 16 | buf[15] << 8 | buf[16];
 
 	return 0;
@@ -1716,7 +1716,8 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
 	memcpy(&qc->result_tf, &tf, sizeof(tf));
 	qc->result_tf.flags = ATA_TFLAG_ISADDR | ATA_TFLAG_LBA | ATA_TFLAG_LBA48;
 	qc->err_mask |= AC_ERR_DEV | AC_ERR_NCQ;
-	if ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary) {
+	if (dev->class == ATA_DEV_ZAC &&
+	    ((qc->result_tf.command & ATA_SENSE) || qc->result_tf.auxiliary)) {
 		char sense_key, asc, ascq;
 
 		sense_key = (qc->result_tf.auxiliary >> 16) & 0xff;
@@ -1770,10 +1771,11 @@ static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc,
 	}
 
 	switch (qc->dev->class) {
-	case ATA_DEV_ATA:
 	case ATA_DEV_ZAC:
 		if (stat & ATA_SENSE)
 			ata_eh_request_sense(qc, qc->scsicmd);
+		/* fall through */
+	case ATA_DEV_ATA:
 		if (err & ATA_ICRC)
 			qc->err_mask |= AC_ERR_ATA_BUS;
 		if (err & (ATA_UNC | ATA_AMNF))
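For readers unfamiliar with the `auxiliary` field touched by the first two hunks, here is a minimal standalone sketch of how the three log bytes are packed into one 24-bit value and split again. The packing expression and the `sense_key` extraction mirror the lines visible in the diff; the `asc` and `ascq` extractions are an assumption based on the same layout, and `struct sense_triplet`, `pack_auxiliary()` and `unpack_auxiliary()` are hypothetical names, not libata functions.

```c
#include <stdint.h>

/* Hypothetical holder for the three sense fields; not a libata type. */
struct sense_triplet {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/*
 * Pack bytes 14..16 of the log buffer into a 24-bit value, following
 * the expression the diff uses for tf->auxiliary. The caller must pass
 * a buffer of at least 17 bytes.
 */
uint32_t pack_auxiliary(const uint8_t *buf)
{
	return (uint32_t)buf[14] << 16 | (uint32_t)buf[15] << 8 | buf[16];
}

/*
 * Split the packed value again. The sense_key shift matches the diff;
 * the asc/ascq shifts are assumed from the packing order above.
 */
struct sense_triplet unpack_auxiliary(uint32_t auxiliary)
{
	struct sense_triplet s;

	s.sense_key = (auxiliary >> 16) & 0xff;
	s.asc = (auxiliary >> 8) & 0xff;
	s.ascq = auxiliary & 0xff;
	return s;
}
```

With the patch applied, a non-ZAC device never has `auxiliary` filled in from the log, so this decoding step only runs for ZAC devices.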
ZAC support added sense data requesting on error for both ZAC and ATA
devices. This seems to cause erratic error handling behaviors on some
SSDs where the device reports sense data availability and then
delivers the wrong content making EH take the wrong actions. The
failure mode was sporadic on a LITE-ON ssd and couldn't be reliably
reproduced.

There is no value in requesting sense data from non-ZAC ATA devices
while there's a significant risk of introducing EH misbehaviors which
are difficult to reproduce and fix. Let's do the sense data dancing
only for ZAC devices.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hannes Reinecke <hare@kernel.org>
---
 drivers/ata/libata-eh.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
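The effect of reordering the `case` labels in `ata_eh_analyze_tf()` can be sketched outside the kernel as follows. This is a simplified model under stated assumptions: `enum dev_class`, `STAT_SENSE`, `struct eh_actions` and `analyze_tf_model()` are hypothetical stand-ins, the bit value chosen for `STAT_SENSE` is illustrative, and all other device classes are collapsed into a do-nothing `default`, which is not how the real function treats them.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's device class constants. */
enum dev_class { DEV_ATA, DEV_ZAC, DEV_OTHER };

/* Illustrative "sense data available" status bit (assumed value). */
#define STAT_SENSE 0x02u

/* Records which steps the model would take for one failed command. */
struct eh_actions {
	bool request_sense;      /* would issue a sense data request */
	bool analyze_error_bits; /* would inspect the error register  */
};

struct eh_actions analyze_tf_model(enum dev_class cls, unsigned int stat)
{
	struct eh_actions a = { false, false };

	switch (cls) {
	case DEV_ZAC:
		/* Only ZAC devices are asked for sense data. */
		if (stat & STAT_SENSE)
			a.request_sense = true;
		/* fall through */
	case DEV_ATA:
		/* Both classes share the error register analysis. */
		a.analyze_error_bits = true;
		break;
	default:
		break;
	}
	return a;
}
```

A plain ATA device that sets the sense bit now goes straight to the shared error analysis, while a ZAC device with the same status still gets the sense request first and then falls through to the same analysis.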