diff mbox

[2/6] libata: Report supported TRIM payload size

Message ID 1282232941-9910-3-git-send-email-martin.petersen@oracle.com
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Martin K. Petersen Aug. 19, 2010, 3:48 p.m. UTC
ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks of
TRIM payload information the device can accept in one command.  Use this
value to enable payloads > 512 bytes.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-scsi.c |    7 ++++++-
 include/linux/ata.h       |    9 +++++++++
 2 files changed, 15 insertions(+), 1 deletions(-)

Comments

Greg Freemyer Aug. 19, 2010, 5:27 p.m. UTC | #1
On Thu, Aug 19, 2010 at 11:48 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks of
> TRIM payload information the device can accept in one command.  Use this
> value to enable payloads > 512 bytes.
>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> ---
>  drivers/ata/libata-scsi.c |    7 ++++++-
>  include/linux/ata.h       |    9 +++++++++
>  2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index e280ae6..145f099 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2152,7 +2152,12 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>         * with the unmap bit set.
>         */
>        if (ata_id_has_trim(args->id)) {
> -               put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
> +               unsigned int blocks;

since blocks is nebulous, a comment would be nice, maybe:

                 unsigned int blocks;    /* the ATA spec specifies
512-byte blocks for trim payload */

> +
> +               /* Default to 1 if unspecified in word 105.  Cap at 1 page. */
> +               blocks = clamp(ata_id_trim_range_blocks(args->id), 1U, 8U);


Should there at least be a todo comment about raising that cap?

Or is there a fundamental reason for it.  ie. I don't think the ATA
spec calls for it, so this is a kernel restriction I assume.

> +
> +               put_unaligned_be32(65535 * 512 / 8 * blocks, &rbuf[20]);

Is this patch actually enabling the block layer to initiate ATA
multi-sector trim payloads, or is this only allowing the max payloads
sectors to be queried?

Are there plans (or existing code) to accumulate trim ranges in the
block layer and create trim commands with multiple sectors?

AFAIK, the only block layer feature exported to the file system layer
only accepts one range.

I'd like to see multiple trim ranges accumulated either by the
filesystem prior to calling into the block layer, of have the block
layer accumulate them.  (I suspect its easier for the filesystem to do
it via the proposed fitrim() ioctl that is on the ext4 list recently.

ie. the current proposed fitrim() ioctl calls into the block layer for
each trim range, but it could easily have the ability to accumulate a
vector of trim ranges and call the block layer only once per trim
vector if the block layer had a interface for that.

Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Aug. 19, 2010, 6:05 p.m. UTC | #2
>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

>> + /* Default to 1 if unspecified in word 105. Cap at 1 page. */

Greg> Should there at least be a todo comment about raising that cap?

We don't currently plan to.  The payload is a single page which we can
allocate fairly easily.  A bigger payload would be more problematic.

With the 4KB payload you can adress 16GB with one command.


Greg> Or is there a fundamental reason for it.  ie. I don't think the
Greg> ATA spec calls for it, so this is a kernel restriction I assume.

Yes, this is part of the internal SCSI-ATA translation mechanism in
Linux.  But fwiw, there I'm only aware of one drive that currently
supports more than 512 bytes of payload and it also caps at 4KB.


Greg> Is this patch actually enabling the block layer to initiate ATA
Greg> multi-sector trim payloads, or is this only allowing the max
Greg> payloads sectors to be queried?

TRIM only takes 65535 blocks per entry.  So we need many entries to
decribe a single, contiguous space being freed.  That's what's being
bumped here.


Greg> Are there plans (or existing code) to accumulate trim ranges in
Greg> the block layer and create trim commands with multiple sectors?

I have worked on this on and off.  It's not trivial for several reasons.
We discussed the issues at the storage workshop last week and there was
concensus that the changes were too intrusive.  So we're holding off
until we see what's happening with:

      a) the SSD market. Win 7 also issues lots of small, individual
         trims like we do

      b) the TRIM TNG command that's being worked on in T13

This doesn't mean that TRIM coalescing is impossible and that it will
never happen.  But right now it appears to be more hassle than it's
worth...
Greg Freemyer Aug. 19, 2010, 9:08 p.m. UTC | #3
On Thu, Aug 19, 2010 at 2:05 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:
>
>>> + /* Default to 1 if unspecified in word 105. Cap at 1 page. */
>
> Greg> Should there at least be a todo comment about raising that cap?
>
> We don't currently plan to.  The payload is a single page which we can
> allocate fairly easily.  A bigger payload would be more problematic.
>
> With the 4KB payload you can adress 16GB with one command.
>
>
> Greg> Or is there a fundamental reason for it.  ie. I don't think the
> Greg> ATA spec calls for it, so this is a kernel restriction I assume.
>
> Yes, this is part of the internal SCSI-ATA translation mechanism in
> Linux.  But fwiw, there I'm only aware of one drive that currently
> supports more than 512 bytes of payload and it also caps at 4KB.
>

If you're curious about harware support for larger payloads, you might
ask Mark Lord.  I understand the way he implemented it in hdparm
accepts many more 512 byte payload blocks than one page worth.  And
the only SSD hardware he's reported problems with are Intel which
apparently only supports one 512 byte payload.

ie. I believe hdparm will typically use up to 1MB of payload.  And the
ranges he collects via his script and dispatches via hdparm are
typically not contiguous.

But I believe he is bypassing much of the kernel stack.  To be honest,
I have not traced a hdparm created trim command through the kernel, so
I don't know the details, and he may be breaking that 1 MB payload
that is coming from userspace into smaller payloads.

The fundamental problem with hdparm is that because it bypasses most
of the kernel i/o stack, it is not compatible with DM/LVM, but I think
its a great prototype of what the kernel could do eventually.

==> from man hdparm

       --trim-sector-ranges
              For Solid State Drives (SSDs).  EXCEPTIONALLY DANGEROUS.
DO NOT USE THIS FLAG!!  Tells the drive firmware to discard unneeded
data sectors, destroying any data that may have been
              present  within  them.   This makes those sectors
available for immediate use by the firmware's garbage collection
mechanism, to improve scheduling for wear-leveling of the flash
              media.  This option expects one or more sector range
pairs immediately after the flag: an LBA starting address, a colon,
and a sector count, with no intervening  spaces.   EXCEP-
              TIONALLY DANGEROUS. DO NOT USE THIS FLAG!!

              Eg.  hdparm --trim-sector-ranges 1000:4 7894:16 /dev/sdz

       --trim-sector-ranges-stdin
              Identical  to  --trim-sector-ranges above, except the
list of lba:count pairs is read from stdin rather than being specified
on the command line.  This can be used to avoid prob-
              lems with excessively long command lines.  It also
permits batching of many more sector ranges into single commands to
the drive, up to the currently  configured  transfer  limit
              (max_sectors_kb).
==>


>
> Greg> Is this patch actually enabling the block layer to initiate ATA
> Greg> multi-sector trim payloads, or is this only allowing the max
> Greg> payloads sectors to be queried?
>
> TRIM only takes 65535 blocks per entry.  So we need many entries to
> decribe a single, contiguous space being freed.  That's what's being
> bumped here.
>

Thanks, I did not appreciate that restriction and thus my confusion
about the patch

>
> Greg> Are there plans (or existing code) to accumulate trim ranges in
> Greg> the block layer and create trim commands with multiple sectors?
>
> I have worked on this on and off.  It's not trivial for several reasons.
> We discussed the issues at the storage workshop last week and there was
> concensus that the changes were too intrusive.  So we're holding off
> until we see what's happening with:
>
>      a) the SSD market. Win 7 also issues lots of small, individual
>         trims like we do
>
>      b) the TRIM TNG command that's being worked on in T13
>
> This doesn't mean that TRIM coalescing is impossible and that it will
> never happen.  But right now it appears to be more hassle than it's
> worth...

I believe it's almost trivial to coalesce non-contiguous ranges into a
single vector of ranges in the proposed fitrim() ioctl because it is
walking the filesystem structures looking for free ranges.  I assume
its just a matter of maintaining the locks long enough to ensure the
blocks being trimmed are not used by other filesystem code while the
ranges are being collected.

The bigger problem for now is that the block layer does not export a
way to pass in a vectorized list of ranges to discard.

> --
> Martin K. Petersen      Oracle Linux Engineering
>

Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Lord Aug. 19, 2010, 9:50 p.m. UTC | #4
On 10-08-19 02:05 PM, Martin K. Petersen wrote:
>I'm only aware of one drive that currently
> supports more than 512 bytes of payload and it also caps at 4KB.

SSDs based upon the Indilinx Barefoot controller support
more or less "infinite" payload for TRIM, with no cap.
But it predates idword[105], so just has a zero there.

OCZ Vertex-LE specifies a single sector of payload.

Those are the ones I know about and have on hand here for TRIM.

Cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 20, 2010, 8:58 a.m. UTC | #5
On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote:
> On 10-08-19 02:05 PM, Martin K. Petersen wrote:
> >I'm only aware of one drive that currently
> >supports more than 512 bytes of payload and it also caps at 4KB.
> 
> SSDs based upon the Indilinx Barefoot controller support
> more or less "infinite" payload for TRIM, with no cap.
> But it predates idword[105], so just has a zero there.

Is there an easy way to identify them?  If so we could add a quirk
for them if it provides enough benefit.

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Lord Aug. 20, 2010, 1:53 p.m. UTC | #6
On 10-08-20 04:58 AM, Christoph Hellwig wrote:
> On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote:
>> On 10-08-19 02:05 PM, Martin K. Petersen wrote:
>>> I'm only aware of one drive that currently
>>> supports more than 512 bytes of payload and it also caps at 4KB.
>>
>> SSDs based upon the Indilinx Barefoot controller support
>> more or less "infinite" payload for TRIM, with no cap.
>> But it predates idword[105], so just has a zero there.
>
> Is there an easy way to identify them?  If so we could add a quirk
> for them if it provides enough benefit.


Each brand/model identifies itself differently.
But we could start a whitelist on based on the model name field
from the ATA identify data.

The OCZ VERTEX drives I have here, identify themselves as "OCZ-VERTEX",
and accept very very long payloads (no limit?).

The OCZ VERTEX-LE drives here do have a limit, of 8 sectors,
and identify themselves as "OCZ VERTEX-LE" in that field.

That's what hdparm-9.30 uses to figure out the max TRIM payloads,
in the absence of a value in word 105.

Cheers

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Freemyer Aug. 20, 2010, 5:13 p.m. UTC | #7
On Fri, Aug 20, 2010 at 9:53 AM, Mark Lord <kernel@teksavvy.com> wrote:
> On 10-08-20 04:58 AM, Christoph Hellwig wrote:
>>
>> On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote:
>>>
>>> On 10-08-19 02:05 PM, Martin K. Petersen wrote:
>>>>
>>>> I'm only aware of one drive that currently
>>>> supports more than 512 bytes of payload and it also caps at 4KB.
>>>
>>> SSDs based upon the Indilinx Barefoot controller support
>>> more or less "infinite" payload for TRIM, with no cap.
>>> But it predates idword[105], so just has a zero there.
>>
>> Is there an easy way to identify them?  If so we could add a quirk
>> for them if it provides enough benefit.
>
>
> Each brand/model identifies itself differently.
> But we could start a whitelist on based on the model name field
> from the ATA identify data.
>
> The OCZ VERTEX drives I have here, identify themselves as "OCZ-VERTEX",
> and accept very very long payloads (no limit?).
>
> The OCZ VERTEX-LE drives here do have a limit, of 8 sectors,
> and identify themselves as "OCZ VERTEX-LE" in that field.
>
> That's what hdparm-9.30 uses to figure out the max TRIM payloads,
> in the absence of a value in word 105.
>
> Cheers
>

A whitelist to enable large contiguous range trims via 8 512B-block
payloads seems useful for those devices that don't support word 105.
(ie. 4KB payload)

But, I doubt there is enough observable performance advantage to
justify reworking the internal SCSI-ATA translation mechanism in the
kernel to handle larger payloads.   Especially if only one or two SSDs
will accept greater than 4KB of payload to the trim command.

Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Lord Aug. 20, 2010, 6:28 p.m. UTC | #8
On 10-08-20 01:13 PM, Greg Freemyer wrote:
> On Fri, Aug 20, 2010 at 9:53 AM, Mark Lord<kernel@teksavvy.com>  wrote:
>> On 10-08-20 04:58 AM, Christoph Hellwig wrote:
>>>
>>> On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote:
>>>>
>>>> On 10-08-19 02:05 PM, Martin K. Petersen wrote:
>>>>>
>>>>> I'm only aware of one drive that currently
>>>>> supports more than 512 bytes of payload and it also caps at 4KB.
>>>>
>>>> SSDs based upon the Indilinx Barefoot controller support
>>>> more or less "infinite" payload for TRIM, with no cap.
>>>> But it predates idword[105], so just has a zero there.
>>>
>>> Is there an easy way to identify them?  If so we could add a quirk
>>> for them if it provides enough benefit.
..
> A whitelist to enable large contiguous range trims via 8 512B-block
> payloads seems useful for those devices that don't support word 105.
> (ie. 4KB payload)
>
> But, I doubt there is enough observable performance advantage to
> justify reworking the internal SCSI-ATA translation mechanism in the
> kernel to handle larger payloads.   Especially if only one or two SSDs
> will accept greater than 4KB of payload to the trim command.
..

Actually, for in-kernel uses, even just a single 512-byte long list
would likely be adequate.  The biggest gains will come from simply
having more than one range per TRIM command issued.

Once we get that, it's diminishing returns for larger and larger lists,
and in-kernel code is unlikely to ever be able to generate lists that long.

But getting started should be easier than folks are making it out to be.
The first step is to have "file deletion" issue multi-range trims for
all extents of the file at once, rather than one range at a time as at present.

That'll be the biggest gain, and shouldn't be too complex to implement.
Especially not after Christoph/Tejun's current barrier rework stuff is in place.
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 23, 2010, 1:50 p.m. UTC | #9
On Thu, Aug 19, 2010 at 11:48:57AM -0400, Martin K. Petersen wrote:
> ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks of
> TRIM payload information the device can accept in one command.  Use this
> value to enable payloads > 512 bytes.

Currently ata_scsi_write_same_xlat passes uses a constant 512 for the
payload all over the place, so the larger payloads won't actually work yet.

I don't think advertising a larger size than we actually support is a
good idea.

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Aug. 23, 2010, 6:56 p.m. UTC | #10
>>>>> "Christoph" == Christoph Hellwig <hch@lst.de> writes:

Christoph> On Thu, Aug 19, 2010 at 11:48:57AM -0400, Martin K. Petersen wrote:
>> ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks
>> of TRIM payload information the device can accept in one command.
>> Use this value to enable payloads > 512 bytes.

Christoph> Currently ata_scsi_write_same_xlat passes uses a constant 512
Christoph> for the payload all over the place, so the larger payloads
Christoph> won't actually work yet.

Christoph> I don't think advertising a larger size than we actually
Christoph> support is a good idea.

I was on crack when I included this in the series last week.  It came
from my discard coalescing tree where things are quite different.

In any case we're also limited by the range of WRITE SAME as long as
we're relying on SAT.  So under all circumstances we're dead in the
water with a payload > 1KB.

Let's just drop this one for now.
diff mbox

Patch

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e280ae6..145f099 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2152,7 +2152,12 @@  static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
 	 * with the unmap bit set.
 	 */
 	if (ata_id_has_trim(args->id)) {
-		put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
+		unsigned int blocks;
+
+		/* Default to 1 if unspecified in word 105.  Cap at 1 page. */
+		blocks = clamp(ata_id_trim_range_blocks(args->id), 1U, 8U);
+
+		put_unaligned_be32(65535 * 512 / 8 * blocks, &rbuf[20]);
 		put_unaligned_be32(1, &rbuf[28]);
 	}
 
diff --git a/include/linux/ata.h b/include/linux/ata.h
index fe6e681..5584356 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -88,6 +88,7 @@  enum {
 	ATA_ID_HW_CONFIG	= 93,
 	ATA_ID_SPG		= 98,
 	ATA_ID_LBA_CAPACITY_2	= 100,
+	ATA_ID_TRIM_RANGE_BLKS	= 105,
 	ATA_ID_SECTOR_SIZE	= 106,
 	ATA_ID_LAST_LUN		= 126,
 	ATA_ID_DLF		= 128,
@@ -827,6 +828,14 @@  static inline int ata_id_has_zero_after_trim(const u16 *id)
 	return 0;
 }
 
+static inline unsigned int ata_id_trim_range_blocks(const u16 *id)
+{
+	if (ata_id_has_trim(id))
+		return id[ATA_ID_TRIM_RANGE_BLKS];
+
+	return 0;
+}
+
 static inline int ata_id_current_chs_valid(const u16 *id)
 {
 	/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command