[1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM

Message ID 1415336894-15327-2-git-send-email-martin.petersen@oracle.com
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Martin K. Petersen Nov. 7, 2014, 5:08 a.m.
As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless and
because the results are unpredictable we decided to disable discard in
MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortuntely those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programatically. So we are compelled to disabling
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5*
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-core.c | 18 ++++++++++++++----
 drivers/ata/libata-scsi.c | 10 ++++++----
 include/linux/libata.h    |  1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig Nov. 7, 2014, 8:24 a.m. | #1
On Fri, Nov 07, 2014 at 12:08:12AM -0500, Martin K. Petersen wrote:
>  		if (ata_id_has_trim(args->id)) {
> -			rbuf[14] |= 0x80; /* TPE */
> +			rbuf[14] |= 0x80; /* LBPME */
>  
> -			if (ata_id_has_zero_after_trim(args->id))
> -				rbuf[14] |= 0x40; /* TPRZ */
> +			if (ata_id_has_zero_after_trim(args->id) &&
> +			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
> +				ata_dev_warn(dev, "Enabling discard_zeroes_data\n");

I think this should _info, not _warn.

Otherwise looks good to me,

Reviewed-by: Christoph Hellwig <hch@lst.de>

It would be nice if there was a way to trigger the flag from userspace,
so that we don't need to rebuild the kernel to add a whitelist entry.
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Nov. 7, 2014, 3:37 p.m. | #2
>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

>> + ata_dev_warn(dev, "Enabling discard_zeroes_data\n");

Christoph> I think this should _info, not _warn.

Fixed.
Paolo Bonzini Dec. 5, 2014, 4:45 p.m. | #3
On 07/11/2014 06:08, Martin K. Petersen wrote:
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5*
>    - Samsung SSDs
>    - Seagate SSDs
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> ---
>  drivers/ata/libata-core.c | 18 ++++++++++++++----
>  drivers/ata/libata-scsi.c | 10 ++++++----
>  include/linux/libata.h    |  1 +
>  3 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index c5ba15af87d3..f41f24a8bc21 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -4225,10 +4225,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
>  	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
>  
>  	/* devices that don't properly handle queued TRIM commands */
> -	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> +	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
> +						ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },

I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.

BTW. it's the same hardware as the M550, so probably the same set of
quirks should apply to both.

Paolo

> +
> +	/*
> +	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
> +	 * SSDs that provide reliable zero after TRIM.
> +	 */
> +	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
> +	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
>  
>  	/*
>  	 * Some WD SATA-I drives spin up and down erratically when the link
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 0586f66d70fa..deaa6e34ed4d 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2515,13 +2515,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
>  		rbuf[15] = lowest_aligned;
>  
>  		if (ata_id_has_trim(args->id)) {
> -			rbuf[14] |= 0x80; /* TPE */
> +			rbuf[14] |= 0x80; /* LBPME */
>  
> -			if (ata_id_has_zero_after_trim(args->id))
> -				rbuf[14] |= 0x40; /* TPRZ */
> +			if (ata_id_has_zero_after_trim(args->id) &&
> +			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
> +				ata_dev_warn(dev, "Enabling discard_zeroes_data\n");
> +				rbuf[14] |= 0x40; /* LBPRZ */
> +			}
>  		}
>  	}
> -
>  	return 0;
>  }
>  
> diff --git a/include/linux/libata.h b/include/linux/libata.h
> index bd5fefeaf548..45ac825b8366 100644
> --- a/include/linux/libata.h
> +++ b/include/linux/libata.h
> @@ -421,6 +421,7 @@ enum {
>  	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
>  	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
>  	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
> +	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
>  
>  	 /* DMA mask for user DMA control: User visible values; DO NOT
>  	    renumber */
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Elliott, Robert (Server Storage) Dec. 5, 2014, 10:58 p.m. | #4
> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, 06 November, 2014 11:08 PM
> To: linux-scsi@vger.kernel.org; linux-ide@vger.kernel.org; linux-
> fsdevel@vger.kernel.org; neilb@suse.de
> Cc: Martin K. Petersen
> Subject: [PATCH 1/3] libata: Whitelist SSDs that are known to properly return
> zeroes after TRIM
> 
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
> 
> In practice this renders the DRAT and RZAT flags completely useless and
> because the results are unpredictable we decided to disable discard in
> MD for 3.18 to avoid the risk of data corruption.
> 
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortuntely those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programatically. So we are compelled to disabling
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
> 
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
> 
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5*
>    - Samsung SSDs
>    - Seagate SSDs
> 

That description and Paolo's reply:
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Paolo Bonzini
> Sent: Friday, 05 December, 2014 10:45 AM
...
> I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.

make me concerned about this whitelist approach.

I think you need a manufacturer assertion that this is indeed
the design intent; you cannot be certain based on observation
from outside.

Since the SCSI and ATA standards allow ignoring the hint, it 
might be honored most of the time, but ignored in some rare
cases (e.g., drive firmware has a malloc() failure that only
happens when the drive is handling an overtemperature
condition and six other problems at the same time).

Maybe there should be two levels:
* vendor asserts the drive is designed to always honor the hint
* community observes the drive always seems to honor the hint

and a sysfs flag for users to select the level at which 
they feel safe.

A user running 3 replicas of the data in different sites 
might be more trusting than a user for which this is the 
only copy of the data.

---
Rob Elliott    HP Server Storage




--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Y. Ts'o Dec. 8, 2014, 3:15 p.m. | #5
On Fri, Dec 05, 2014 at 10:58:09PM +0000, Elliott, Robert (Server Storage) wrote:
> > I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.
> 
> make me concerned about this whitelist approach.
> 
> I think you need a manufacturer assertion that this is indeed
> the design intent; you cannot be certain based on observation
> from outside.

How is this different from a manufacturer assertion that they follow a
SCSI or ATA standard?  There have been cases in the distant past
(fortunately) of disk manufacturers that ignored a CACHE FLUSH command
just to get higher Winbench scores.  Does that mean we can't trust
them to do anything right?

What I'd suggest instead is that if a vendor states this on a spec
sheet --- more than just an e-mail assertion --- so they can be sued
if they knowingly misrepresent their product, that we take their word
at it.  Of course, there will be bugs, which is why we have
blacklists, or why we can remove them from the list if it turns out
there are edge conditions where it appears the disk doesn't quite do
the right thing.

After all, we generally take the manufacturer's word that air bags
will work as claimed, even if potentially 11 million of them are
currently subject to recall.  And do we think that "the community"
would necessarily be more suited than the vendors and the manufacturer
to figure out whether or not air bags or drives are working as
desired?

That being said, if someone wants to create a open source program
which stress tests SSD's to look for cases where it is dropping a
requested discard, that would certainly be a good thing to do...

Cheers,

						 - Ted
						 
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley Dec. 8, 2014, 3:28 p.m. | #6
On Mon, 2014-12-08 at 10:15 -0500, Theodore Ts'o wrote:
> On Fri, Dec 05, 2014 at 10:58:09PM +0000, Elliott, Robert (Server Storage) wrote:
> > > I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.
> > 
> > make me concerned about this whitelist approach.
> > 
> > I think you need a manufacturer assertion that this is indeed
> > the design intent; you cannot be certain based on observation
> > from outside.
> 
> How is this different from a manufacturer assertion that they follow a
> SCSI or ATA standard?  There have been cases in the distant past
> (fortunately) of disk manufacturers that ignored a CACHE FLUSH command
> just to get higher Winbench scores.  Does that mean we can't trust
> them to do anything right?

That answer depends on device type manufacturer.  USB devices, hell no.
ATA devices, maybe and SCSI devices usually.

The main problem is usually testing.  Consumer devices like USB and
(s)ATA rarely get tested on anything but windows.  USB devices tend to
supply their own driver, so they're on the "we fix it in the driver"
model which is why they bite us so badly. (S)ATA usually comply, but
they only test what windows exercises, so if windows doesn't do it,
chances are it never got tested.  SCSI devices still tend to be tested
in legacy UNIX environments, which are as diverse as we are.

> What I'd suggest instead is that if a vendor states this on a spec
> sheet --- more than just an e-mail assertion --- so they can be sued
> if they knowingly misrepresent their product, that we take their word
> at it.  Of course, there will be bugs, which is why we have
> blacklists, or why we can remove them from the list if it turns out
> there are edge conditions where it appears the disk doesn't quite do
> the right thing.
> 
> After all, we generally take the manufacturer's word that air bags
> will work as claimed, even if potentially 11 million of them are
> currently subject to recall.  And do we think that "the community"
> would necessarily be more suited than the vendors and the manufacturer
> to figure out whether or not air bags or drives are working as
> desired?
> 
> That being said, if someone wants to create a open source program
> which stress tests SSD's to look for cases where it is dropping a
> requested discard, that would certainly be a good thing to do...

The purpose of DRAT and RZAT is to enable disk arrays deterministically
to use TRIM/Unmap so arrays know what happens to stripes on discard.
Arrays are being built of mostly SATA technology these days, so some
manufacturers have retargetted to arrays and consumer technology (and
are testing the array cases).  However, windows doesn't use either
feature, so manufacturers not targetting arrays will never test this
feature.  Hence, in this case, I think a whitelist does make sense.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Cox Dec. 8, 2014, 10:59 p.m. | #7
On Mon, 8 Dec 2014 10:15:59 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> On Fri, Dec 05, 2014 at 10:58:09PM +0000, Elliott, Robert (Server Storage) wrote:
> > > I have a Crucial_CT256MX1 (i.e. MX100) and it does reliably zero.
> > 
> > make me concerned about this whitelist approach.
> > 
> > I think you need a manufacturer assertion that this is indeed
> > the design intent; you cannot be certain based on observation
> > from outside.
> 
> How is this different from a manufacturer assertion that they follow a
> SCSI or ATA standard?  There have been cases in the distant past
> (fortunately) of disk manufacturers that ignored a CACHE FLUSH command
> just to get higher Winbench scores.  Does that mean we can't trust
> them to do anything right?

At the time they never promised to honour cache flush. The reason it was
became mandatory in the specification was in part so that the vendors
could all force each other to play fair. If its "optional" then it's
tough..., if they say they meet the standard it's class action 8)

If this is a promise then it ought to be good 

James:  "USB devices tend to supply their own driver"

has not been true for some years now. Microsoft provide an in-box driver
and vendors have the choice of using that or certifying their own via
WHQL, which is a bit like choosing between free ice cream and banging
your head against a plank cover in nails.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c5ba15af87d3..f41f24a8bc21 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4225,10 +4225,20 @@  static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0586f66d70fa..deaa6e34ed4d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2515,13 +2515,15 @@  static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_warn(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index bd5fefeaf548..45ac825b8366 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -421,6 +421,7 @@  enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	 /* DMA mask for user DMA control: User visible values; DO NOT
 	    renumber */