diff mbox

configurable discard parameters

Message ID yq1616d4xxe.fsf@sermon.lab.mkp.net
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Martin K. Petersen June 24, 2015, 2:55 a.m. UTC
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> total untrimmed sectors: (310635 - 1 - 2410) / 32 = 9632 total
Tom> ranges: (43 GiB * (1024 ^ 3) / 512) / 65535 ~= 1376 average
Tom> untrimmed sectors per range: (9632 / 1376) = 7 = (65535 % 8)

Every type of drive has its own internal restrictions. Unless the drive
guarantees zero after TRIM it is perfectly normal for heads, tails or
random pieces in-between to be left untouched.

I am surprised that the common case of contiguous range entries was not
handled properly by your drive. Most of them deal with that just fine
(regardless of their internal granularity and alignment constraints). It
is normally only the partial block tracking between command invocations
that causes grief.

In any case. Unless discard_zeroes_data is 1 (and that requires
guarantees above and beyond what can be discovered via the ATA Command
Set) all bets are off wrt. dependable behavior.

The code below is an untested proof of concept to see what it would take
to alleviate your particular issue. I am, however, not at all convinced
that introducing such a change is worth the risk. I know of at least a
couple of devices that would blow up as a result of this patch...

Comments

Tom Yan June 24, 2015, 12:46 p.m. UTC | #1
To be honest I don't think this patch should get in as well. First of
all it wouldn't really solve problem for all devices which have
different limits. Secondly I doubt it's related to Deterministic Zero
AT ALL. (For one the SanDisk drive I have shows Deterministic Zero,
still it behaves similarly as the Intel drive.) IHMO adding "switch"
base on characteristics which is uncertain to be / not at all relevant
is the worst thing to do ever. (Yeah I recall how Intel "follows"
HDMI/DP color range spec senselessly again.)

The only patching which would really mean something is to allow user
to configure blocks per range and ranges per command, so that for
users can tune kernel TRIM per device if they really want to, while
leaving the "safe default" intact.

However I am really curious how the drives "blow up" on less blocks
per range. Isn't that even more of a firmware bug than the problem I
have? If they blow up even with single range with less then 65535
blocks addressed, wouldn't they blow up anytime if they do filesystem
TRIM?

On 24 June 2015 at 10:55, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:
>
> Tom> total untrimmed sectors: (310635 - 1 - 2410) / 32 = 9632 total
> Tom> ranges: (43 GiB * (1024 ^ 3) / 512) / 65535 ~= 1376 average
> Tom> untrimmed sectors per range: (9632 / 1376) = 7 = (65535 % 8)
>
> Every type of drive has its own internal restrictions. Unless the drive
> guarantees zero after TRIM it is perfectly normal for heads, tails or
> random pieces in-between to be left untouched.
>
> I am surprised that the common case of contiguous range entries was not
> handled properly by your drive. Most of them deal with that just fine
> (regardless of their internal granularity and alignment constraints). It
> is normally only the partial block tracking between command invocations
> that causes grief.
>
> In any case. Unless discard_zeroes_data is 1 (and that requires
> guarantees above and beyond what can be discovered via the ATA Command
> Set) all bets are off wrt. dependable behavior.
>
> The code below is an untested proof of concept to see what it would take
> to alleviate your particular issue. I am, however, not at all convinced
> that introducing such a change is worth the risk. I know of at least a
> couple of devices that would blow up as a result of this patch...
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 3131adcc1f87..9c270a303f0b 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2238,8 +2238,24 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>          * with the unmap bit set.
>          */
>         if (ata_id_has_trim(args->id)) {
> -               put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
> -               put_unaligned_be32(1, &rbuf[28]);
> +               /*
> +                * If the drive supports reliable zero after trim we set
> +                * the granularity to 1 and the blocks per range to
> +                * 0xffff. Otherwise we set set the granularity to 8 and
> +                * restrict the blocks per range to 0xfff8. This is done
> +                * to accommodate older drives which prefer 4K-alignment.
> +                */
> +
> +               if (ata_id_has_zero_after_trim(args->id) &&
> +                   args->dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
> +                       put_unaligned_be64(ATA_MAX_TRIM_RANGE * 512 / 8,
> +                                          &rbuf[36]);
> +                       put_unaligned_be32(1, &rbuf[28]);
> +               } else {
> +                       put_unaligned_be64(ATA_ALIGNED_TRIM_RANGE * 512 / 8,
> +                                          &rbuf[36]);
> +                       put_unaligned_be32(8, &rbuf[28]);
> +               }
>         }
>
>         return 0;
> @@ -3168,7 +3184,14 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>                 goto invalid_fld;
>
>         buf = page_address(sg_page(scsi_sglist(scmd)));
> -       size = ata_set_lba_range_entries(buf, 512, block, n_block);
> +
> +       if (ata_id_has_zero_after_trim(dev->id) &&
> +           dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM)
> +               size = ata_set_lba_range_entries(buf, 512, block, n_block,
> +                                                ATA_MAX_TRIM_RANGE);
> +       else
> +               size = ata_set_lba_range_entries(buf, 512, block, n_block,
> +                                                ATA_ALIGNED_TRIM_RANGE);
>
>         if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
>                 /* Newer devices support queued TRIM commands */
> diff --git a/include/linux/ata.h b/include/linux/ata.h
> index fed36418dd1c..8a17fa22cdbe 100644
> --- a/include/linux/ata.h
> +++ b/include/linux/ata.h
> @@ -47,6 +47,8 @@ enum {
>         ATA_MAX_SECTORS         = 256,
>         ATA_MAX_SECTORS_LBA48   = 65535,/* TODO: 65536? */
>         ATA_MAX_SECTORS_TAPE    = 65535,
> +       ATA_MAX_TRIM_RANGE      = 0xffff,
> +       ATA_ALIGNED_TRIM_RANGE  = 0xfff8,
>
>         ATA_ID_WORDS            = 256,
>         ATA_ID_CONFIG           = 0,
> @@ -1012,19 +1014,20 @@ static inline void ata_id_to_hd_driveid(u16 *id)
>   * TO NV CACHE PINNED SET.
>   */
>  static inline unsigned ata_set_lba_range_entries(void *_buffer,
> -               unsigned buf_size, u64 sector, unsigned long count)
> +               unsigned buf_size, u64 sector, unsigned long count,
> +               u16 bpe)
>  {
>         __le64 *buffer = _buffer;
>         unsigned i = 0, used_bytes;
>
>         while (i < buf_size / 8 ) { /* 6-byte LBA + 2-byte range per entry */
>                 u64 entry = sector |
> -                       ((u64)(count > 0xffff ? 0xffff : count) << 48);
> +                       ((u64)(count > bpe ? bpe : count) << 48);
>                 buffer[i++] = __cpu_to_le64(entry);
> -               if (count <= 0xffff)
> +               if (count <= bpe)
>                         break;
> -               count -= 0xffff;
> -               sector += 0xffff;
> +               count -= bpe;
> +               sector += bpe;
>         }
>
>         used_bytes = ALIGN(i * 8, 512);
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen June 25, 2015, 1:15 a.m. UTC | #2
>>>>> "Tom" == Tom Yan <tom.ty89@gmail.com> writes:

Tom> First of all it wouldn't really solve problem for all devices which
Tom> have different limits.

Only by virtue of us generally aligning on large power of two
boundaries.

The only reason I am entertaining this in the first place is that I have
one drive that behaves in a way similar to yours. And if we can make
things slightly better (but not perfect) for several drives without
causing any regressions then that's worth exploring.

Tom> Secondly I doubt it's related to Deterministic Zero AT ALL. (For
Tom> one the SanDisk drive I have shows Deterministic Zero, still it
Tom> behaves similarly as the Intel drive.)

The drive reporting deterministic zero is not enough. It needs to be
explicitly whitelisted before we report discard_zeroes_data=1.

Tom> The only patching which would really mean something is to allow
Tom> user to configure blocks per range and ranges per command, so that
Tom> for users can tune kernel TRIM per device if they really want to,
Tom> while leaving the "safe default" intact.

I'll think about it.

Tom> However I am really curious how the drives "blow up" on less blocks
Tom> per range. Isn't that even more of a firmware bug than the problem
Tom> I have?

I have several older drives that expect a single contiguous LBA
range. They don't handle multiple discontiguous ranges at all.
Tom Yan June 26, 2015, 7:05 a.m. UTC | #3
On 25 June 2015 at 09:15, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> The drive reporting deterministic zero is not enough. It needs to be
> explicitly whitelisted before we report discard_zeroes_data=1.
>

Not sure what you mean here. I just meant I don't think the "lower
limit" thing I am experiencing is related to the "type" of TRIM, at
least not for any Deterministic TRIM. (Although I don't really know
what exactly does Deterministic Read Data means. Perhaps it means
vendor would make sure the LBA after TRIM reads either zero or a
specific pattern of data?)

>
> I have several older drives that expect a single contiguous LBA
> range. They don't handle multiple discontiguous ranges at all.
>

Well discontiguous is somewhat inaccurate here. I guess what you mean
is they simply don't handle the next range until the previous one is
used up. For example, "0:65528, 65528:65528" are definitely
contiguous, but not fulfiling the requirement of your drives; while
"0:65535 131072:65535" are definitely not contiguous, but your drives
take them fine.

By the way what are the consequences? Data loss? System hang?
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 3131adcc1f87..9c270a303f0b 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2238,8 +2238,24 @@  static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 	 * with the unmap bit set.
 	 */
 	if (ata_id_has_trim(args->id)) {
-		put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
-		put_unaligned_be32(1, &rbuf[28]);
+		/*
+		 * If the drive supports reliable zero after trim we set
+		 * the granularity to 1 and the blocks per range to
+		 * 0xffff. Otherwise we set set the granularity to 8 and
+		 * restrict the blocks per range to 0xfff8. This is done
+		 * to accommodate older drives which prefer 4K-alignment.
+		 */
+
+		if (ata_id_has_zero_after_trim(args->id) &&
+		    args->dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+			put_unaligned_be64(ATA_MAX_TRIM_RANGE * 512 / 8,
+					   &rbuf[36]);
+			put_unaligned_be32(1, &rbuf[28]);
+		} else {
+			put_unaligned_be64(ATA_ALIGNED_TRIM_RANGE * 512 / 8,
+					   &rbuf[36]);
+			put_unaligned_be32(8, &rbuf[28]);
+		}
 	}
 
 	return 0;
@@ -3168,7 +3184,14 @@  static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 		goto invalid_fld;
 
 	buf = page_address(sg_page(scsi_sglist(scmd)));
-	size = ata_set_lba_range_entries(buf, 512, block, n_block);
+
+	if (ata_id_has_zero_after_trim(dev->id) &&
+	    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM)
+		size = ata_set_lba_range_entries(buf, 512, block, n_block,
+						 ATA_MAX_TRIM_RANGE);
+	else
+		size = ata_set_lba_range_entries(buf, 512, block, n_block,
+						 ATA_ALIGNED_TRIM_RANGE);
 
 	if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
 		/* Newer devices support queued TRIM commands */
diff --git a/include/linux/ata.h b/include/linux/ata.h
index fed36418dd1c..8a17fa22cdbe 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -47,6 +47,8 @@  enum {
 	ATA_MAX_SECTORS		= 256,
 	ATA_MAX_SECTORS_LBA48	= 65535,/* TODO: 65536? */
 	ATA_MAX_SECTORS_TAPE	= 65535,
+	ATA_MAX_TRIM_RANGE	= 0xffff,
+	ATA_ALIGNED_TRIM_RANGE	= 0xfff8,
 
 	ATA_ID_WORDS		= 256,
 	ATA_ID_CONFIG		= 0,
@@ -1012,19 +1014,20 @@  static inline void ata_id_to_hd_driveid(u16 *id)
  * TO NV CACHE PINNED SET.
  */
 static inline unsigned ata_set_lba_range_entries(void *_buffer,
-		unsigned buf_size, u64 sector, unsigned long count)
+		unsigned buf_size, u64 sector, unsigned long count,
+		u16 bpe)
 {
 	__le64 *buffer = _buffer;
 	unsigned i = 0, used_bytes;
 
 	while (i < buf_size / 8 ) { /* 6-byte LBA + 2-byte range per entry */
 		u64 entry = sector |
-			((u64)(count > 0xffff ? 0xffff : count) << 48);
+			((u64)(count > bpe ? bpe : count) << 48);
 		buffer[i++] = __cpu_to_le64(entry);
-		if (count <= 0xffff)
+		if (count <= bpe)
 			break;
-		count -= 0xffff;
-		sector += 0xffff;
+		count -= bpe;
+		sector += bpe;
 	}
 
 	used_bytes = ALIGN(i * 8, 512);