diff mbox

ext4: Add support for SFITRIM, an ioctl for secure FITRIM.

Message ID 20140613234134.GC5036@thunk.org
State Rejected
Headers show

Commit Message

Theodore Ts'o June 13, 2014, 11:41 p.m. UTC
On Fri, Jun 13, 2014 at 12:44:34PM -0700, JP Abgrall wrote:
> The per-file secure discard seems to be the way to go, as there are
> only a few places in Android where this needs to be turned on.
> The  current idletime-fstrim would  switch from FITRIM to SFITRIM to
> reduce the leftovers.

OK, how about this?  The following patch is in the Google data center
kernel, but I never got around to get it upstream (oops, was on my
todo list, but it never happened).

If you want to adopt this for usptream, and add support for
BLKSECDISCARD as well as BLKDISCARD, then you could for each file that
you want to do the per-file secure discard, you would just have to
open the file, call the BLKSECDISCARD ioctl, and then delete the file.

Cheers,

					- Ted

commit 16ff6352b123aa134417793d636f05cd4e240eaa
Author: Theodore Ts'o <tytso@google.com>
Date:   Fri Dec 20 12:48:26 2013 -0500

    ext4: add support for the BLKDISCARD ioctl
    
    The blkdicard ioctl previously only worked on block devices.  Allow
    this ioctl to work on ext4 files.
    
    This commit is intended to be sent upstream.
    
    Google-Bug-Id: 11517631
    Tested: fs smoke test; manual check of the blkdiscard ioctl
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Effort: fs/ext4
    Origin-3.3-SHA1: 85ef322daf0da0000759b1fe3a1e1bd3da3b906a
    Change-Id: If031d1b8e796ebcb0afcb8a5259c760441bc6b2e

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

JP Abgrall June 14, 2014, 12:46 a.m. UTC | #1
On Fri, Jun 13, 2014 at 4:41 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> If you want to adopt this for usptream, and add support for
> BLKSECDISCARD as well as BLKDISCARD, then you could for each file that
> you want to do the per-file secure discard, you would just have to
> open the file, call the BLKSECDISCARD ioctl, and then delete the file.

That would work also. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner June 17, 2014, 2:49 a.m. UTC | #2
On Fri, Jun 13, 2014 at 07:41:34PM -0400, Theodore Ts'o wrote:
> On Fri, Jun 13, 2014 at 12:44:34PM -0700, JP Abgrall wrote:
> > The per-file secure discard seems to be the way to go, as there are
> > only a few places in Android where this needs to be turned on.
> > The  current idletime-fstrim would  switch from FITRIM to SFITRIM to
> > reduce the leftovers.
> 
> OK, how about this?  The following patch is in the Google data center
> kernel, but I never got around to get it upstream (oops, was on my
> todo list, but it never happened).
> 
> If you want to adopt this for usptream, and add support for
> BLKSECDISCARD as well as BLKDISCARD, then you could for each file that
> you want to do the per-file secure discard, you would just have to
> open the file, call the BLKSECDISCARD ioctl, and then delete the file.
> 
> Cheers,
> 
> 					- Ted
> 
> commit 16ff6352b123aa134417793d636f05cd4e240eaa
> Author: Theodore Ts'o <tytso@google.com>
> Date:   Fri Dec 20 12:48:26 2013 -0500
> 
>     ext4: add support for the BLKDISCARD ioctl
>     
>     The blkdicard ioctl previously only worked on block devices.  Allow
>     this ioctl to work on ext4 files.
>     
>     This commit is intended to be sent upstream.

Not in that form - it's an ugly API hack.

This is really just an extension of hole punching (if the blocks in
the file are being removed) or zeroing (if the blocks are being
retained by the file). Either way, fallocate() is the interface
used for per-file block level manipulations, and either of these
operations could issue a discard (secure or not) during the
punch/zero operation....

Cheers,

Dave.
Theodore Ts'o June 17, 2014, 11:27 a.m. UTC | #3
On Tue, Jun 17, 2014 at 12:49:53PM +1000, Dave Chinner wrote:
> >     The blkdicard ioctl previously only worked on block devices.  Allow
> >     this ioctl to work on ext4 files.
> >     
> >     This commit is intended to be sent upstream.
> 
> Not in that form - it's an ugly API hack.

The whole *point* was that it's an API hack.  We had a userspace
program that wanted to use the same code regardless of whether it was
accessing a block device or a file, and we didn't want to spin up a
KVM just to simulate a block device (think the whole countainers
vs. virtualization argument).

> This is really just an extension of hole punching (if the blocks in
> the file are being removed) or zeroing (if the blocks are being
> retained by the file). Either way, fallocate() is the interface
> used for per-file block level manipulations, and either of these
> operations could issue a discard (secure or not) during the
> punch/zero operation....

Except that discard is not an exact equivalent of zeroing, since even
in the case of discard zeros data, the flash device is free to
disregard the discard request if it feels like it.  But if you have a
userspace application which is trying to manage the flash to optimize
write endurance, that's different from either hole punching or
zeroing.

Secure discard maps a bit more like a zero'ing or hole punching, since
hopefully the standard for secure discard was as incompetently
authored as the discard/trim specification.  So that might be a
different approach if this is an approach that we want to get adoption
across multiple file systems.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner June 17, 2014, 11:55 a.m. UTC | #4
On Tue, 17 Jun 2014, Dave Chinner wrote:

> Date: Tue, 17 Jun 2014 12:49:53 +1000
> From: Dave Chinner <david@fromorbit.com>
> To: Theodore Ts'o <tytso@mit.edu>
> Cc: JP Abgrall <jpa@google.com>, Eric Sandeen <sandeen@redhat.com>,
>     linux-ext4@vger.kernel.org, Geremy Condra <gcondra@google.com>,
>     "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
> Subject: Re: [PATCH] ext4: Add support for SFITRIM,
>     an ioctl for secure FITRIM.
> 
> On Fri, Jun 13, 2014 at 07:41:34PM -0400, Theodore Ts'o wrote:
> > On Fri, Jun 13, 2014 at 12:44:34PM -0700, JP Abgrall wrote:
> > > The per-file secure discard seems to be the way to go, as there are
> > > only a few places in Android where this needs to be turned on.
> > > The  current idletime-fstrim would  switch from FITRIM to SFITRIM to
> > > reduce the leftovers.
> > 
> > OK, how about this?  The following patch is in the Google data center
> > kernel, but I never got around to get it upstream (oops, was on my
> > todo list, but it never happened).
> > 
> > If you want to adopt this for usptream, and add support for
> > BLKSECDISCARD as well as BLKDISCARD, then you could for each file that
> > you want to do the per-file secure discard, you would just have to
> > open the file, call the BLKSECDISCARD ioctl, and then delete the file.
> > 
> > Cheers,
> > 
> > 					- Ted
> > 
> > commit 16ff6352b123aa134417793d636f05cd4e240eaa
> > Author: Theodore Ts'o <tytso@google.com>
> > Date:   Fri Dec 20 12:48:26 2013 -0500
> > 
> >     ext4: add support for the BLKDISCARD ioctl
> >     
> >     The blkdicard ioctl previously only worked on block devices.  Allow
> >     this ioctl to work on ext4 files.
> >     
> >     This commit is intended to be sent upstream.
> 
> Not in that form - it's an ugly API hack.
> 
> This is really just an extension of hole punching (if the blocks in
> the file are being removed) or zeroing (if the blocks are being
> retained by the file). Either way, fallocate() is the interface
> used for per-file block level manipulations, and either of these
> operations could issue a discard (secure or not) during the
> punch/zero operation....

I definitely agree with Dave here it is an ugly API hack. Fallocate
seems much more suitable for this.

New flag FALLOC_FL_ISSUE_DISCARD which would work with
FALLOC_FL_PUNCH_HOLE, FALLOC_FL_ZERO_RANGE and possibly
FALLOC_FL_COLLAPSE_RANGE might actually be useful.

-Lukas

> 
> Cheers,
> 
> Dave.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 17, 2014, 12:46 p.m. UTC | #5
On Tue, Jun 17, 2014 at 01:55:26PM +0200, Lukáš Czerner wrote:
> 
> I definitely agree with Dave here it is an ugly API hack. Fallocate
> seems much more suitable for this.
> 
> New flag FALLOC_FL_ISSUE_DISCARD which would work with
> FALLOC_FL_PUNCH_HOLE, FALLOC_FL_ZERO_RANGE and possibly
> FALLOC_FL_COLLAPSE_RANGE might actually be useful.

I agree it would be useful to have an FL_ISSUE_DISCARD (and while
we're at it, FL_ISSUE_SECDISCARD) as an fallocate flag.  That doesn't
obviate the usefulness of a BLKDISCARD ioctl for ext4 files, though.

Something else that might be useful, and perhaps more appropriate for
the Android use case, is to add a SECDISCARD flag to the unlinkat(2)
system call.  That way, people who want to do a "discard and then
unlink" don't have to be forced to do an open(2), fallocate(2),
close(2), and only *then* the unlink(2) system call.

Cheers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner June 17, 2014, 1 p.m. UTC | #6
On Tue, 17 Jun 2014, Theodore Ts'o wrote:

> Date: Tue, 17 Jun 2014 08:46:29 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>, JP Abgrall <jpa@google.com>,
>     Eric Sandeen <sandeen@redhat.com>, linux-ext4@vger.kernel.org,
>     Geremy Condra <gcondra@google.com>,
>     "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
> Subject: Re: [PATCH] ext4: Add support for SFITRIM,
>     an ioctl for secure FITRIM.
> 
> On Tue, Jun 17, 2014 at 01:55:26PM +0200, Lukáš Czerner wrote:
> > 
> > I definitely agree with Dave here it is an ugly API hack. Fallocate
> > seems much more suitable for this.
> > 
> > New flag FALLOC_FL_ISSUE_DISCARD which would work with
> > FALLOC_FL_PUNCH_HOLE, FALLOC_FL_ZERO_RANGE and possibly
> > FALLOC_FL_COLLAPSE_RANGE might actually be useful.
> 
> I agree it would be useful to have an FL_ISSUE_DISCARD (and while
> we're at it, FL_ISSUE_SECDISCARD) as an fallocate flag.  That doesn't
> obviate the usefulness of a BLKDISCARD ioctl for ext4 files, though.
> 
> Something else that might be useful, and perhaps more appropriate for
> the Android use case, is to add a SECDISCARD flag to the unlinkat(2)
> system call.  That way, people who want to do a "discard and then
> unlink" don't have to be forced to do an open(2), fallocate(2),
> close(2), and only *then* the unlink(2) system call.

What is the difference between -o discard mount option ? I guess
that this way you can do it selectively on certain files, but I
wonder how useful it is going to be anyway ?

Nevertheless, I think that there is a conclusion that there is no
"security" to be had with file system and SECDISCARD. And no secure
erase with this type of interface would be "secure" enough.

If they are ok with only best effort, then we can have FISTRIM ioctl
which would use the same internal file system functionality as
FITRIM but we would add a flag to be able to call sb_issue_discard()
with BLKDEV_DISCARD_SECURE flag, disable the optimization to skip
already discarded groups and call sync on the file system before we
start doing any actuall work. I wish I added flags to the FITRIM
ioctl when I created it...

If we do this though we should not add word "security" anywhere for
the use to see :)

-Lukas

> 
> Cheers,
> 
> 						- Ted
>
Theodore Ts'o June 17, 2014, 1:54 p.m. UTC | #7
On Tue, Jun 17, 2014 at 03:00:40PM +0200, Lukáš Czerner wrote:
> 
> What is the difference between -o discard mount option ? I guess
> that this way you can do it selectively on certain files, but I
> wonder how useful it is going to be anyway ?

Well, it will reduce the amount of flash wear, since a SECDISCARD
requires an immediate copy of the remaining data in the erase block
followed by a erase.  This increases write magnification.

> Nevertheless, I think that there is a conclusion that there is no
> "security" to be had with file system and SECDISCARD. And no secure
> erase with this type of interface would be "secure" enough.

There's an assumption here that the eMMC SECDISCARD functionality is
more competently spec'ed out compared to the ATA/SCSI interface.  I'm
not sure whether or not that's true, but perhaps JP and Geremy can
confirm that.  And even if it isn't guaranteed by the MMC spec, a
mobile handset manufacturer is buying in sufficently large quantities
that they can probably negotiate with their suppliers and demand a
custom firmware which doesn't drop the discard command if the flash
device doesn't feel like it.  

(There's nothing new about this, by the way.  Very large buyers of
hard drives such as EMC, Amazon, Facebook, etc. do their own
performance and quality control testing, and then have demanded custom
firmware if necessary for a very long time.)

So at least in some specific use cases, it should be possible to make
this to be secure.  And the reason why to call it secure is SECDISCARD
is the term used in the spec.  And if the spec doesn't guarantee it,
we can mock the spec.  :-)

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
JP Abgrall June 17, 2014, 5:35 p.m. UTC | #8
On Tue, Jun 17, 2014 at 6:00 AM, Lukáš Czerner <lczerner@redhat.com> wrote:
>
> What is the difference between -o discard mount option ?
Android will not mount -o discard because of the overall penalty.

> I guess
> that this way you can do it selectively on certain files, but I
> wonder how useful it is going to be anyway ?
For certain files only. Useful enough.


> If they are ok with only best effort, then we can have FISTRIM ioctl
> which would use the same internal file system functionality as
> FITRIM but we would add a flag to be able to call sb_issue_discard()
> with BLKDEV_DISCARD_SECURE flag,
I think this was the idea of the original patch.

> disable the optimization to skip
> already discarded groups and call sync on the file system before we
> start doing any actuall work.
That would need adding to the original patch.

> If we do this though we should not add word "security" anywhere for
> the use to see :)
We can live with that.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
JP Abgrall June 17, 2014, 5:53 p.m. UTC | #9
On Tue, Jun 17, 2014 at 6:54 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> There's an assumption here that the eMMC SECDISCARD functionality is
> more competently spec'ed out compared to the ATA/SCSI interface.  I'm
> not sure whether or not that's true, but perhaps JP and Geremy can
> confirm that.
We only care about the eMMC spec. In the eMMC spec there is no room
for ignoring the commands. If the card declares it can do a secure
erase/trim it must do it as per the spec.
  JEDEC Standard No. 84-A441
  7.6.9 Secure Erase
  7.6.10 Secure Trim
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner June 18, 2014, 9:33 a.m. UTC | #10
On Tue, 17 Jun 2014, JP Abgrall wrote:

> Date: Tue, 17 Jun 2014 10:53:21 -0700
> From: JP Abgrall <jpa@google.com>
> To: Theodore Ts'o <tytso@mit.edu>
> Cc: Lukáš Czerner <lczerner@redhat.com>, Dave Chinner <david@fromorbit.com>,
>     Eric Sandeen <sandeen@redhat.com>, linux-ext4@vger.kernel.org,
>     Geremy Condra <gcondra@google.com>,
>     "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
> Subject: Re: [PATCH] ext4: Add support for SFITRIM,
>     an ioctl for secure FITRIM.
> 
> On Tue, Jun 17, 2014 at 6:54 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > There's an assumption here that the eMMC SECDISCARD functionality is
> > more competently spec'ed out compared to the ATA/SCSI interface.  I'm
> > not sure whether or not that's true, but perhaps JP and Geremy can
> > confirm that.
> We only care about the eMMC spec. In the eMMC spec there is no room
> for ignoring the commands. If the card declares it can do a secure
> erase/trim it must do it as per the spec.

Except when it does not. Looking at the mmc driver I can see that we
have already some devices where secure trim is broken.

/*
 * On these Samsung MoviNAND parts, performing secure erase or
 * secure trim can result in unrecoverable corruption due to a
 * firmware bug.
 */

And I have no illusion that those are the only ones that does not
work. This hardware can not be trusted and this must not be
advertised as a security feature.

Btw, AFACIT with this backlisted hardware the user will not even
notice that secure discard did not went through and the driver will
simply use normal discard, which is a bug if you ask me.

Let's call is deep discard, or whatever but avoid the security word
at least when file system comes into play.

-Lukas

>   JEDEC Standard No. 84-A441
>   7.6.9 Secure Erase
>   7.6.10 Secure Trim
JP Abgrall June 18, 2014, 9:51 p.m. UTC | #11
On Wed, Jun 18, 2014 at 2:33 AM, Lukáš Czerner <lczerner@redhat.com> wrote:
> Except when it does not. Looking at the mmc driver I can see that we
> have already some devices where secure trim is broken.


And that is why we don't just blindly use random eMMC devices.

> Let's call is deep discard, or whatever but avoid the security word
> at least when file system comes into play.

Best effort discard :)
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 18, 2014, 10:06 p.m. UTC | #12
On Wed, Jun 18, 2014 at 11:33:47AM +0200, Lukáš Czerner wrote:
> Except when it does not. Looking at the mmc driver I can see that we
> have already some devices where secure trim is broken.
> 
> /*
>  * On these Samsung MoviNAND parts, performing secure erase or
>  * secure trim can result in unrecoverable corruption due to a
>  * firmware bug.
>  */

The bug IMHO is that the eMMC driver is falling back to discard,
instead of returning EOPNOTSUPP.  The question of whether we should
fall back to discard or not should be made at a level much higher up
than the MMC device driver....

> And I have no illusion that those are the only ones that does not
> work. This hardware can not be trusted and this must not be
> advertised as a security feature.

There's always crappy hardware out there.  If that's true, should then
not call ATA Secure Erase by that term because somewhere out there,
there will be an incompetently implemented SSD that doesn't do the
right thing with ATA Secure Erase?  I just don't think that's
particularly useful.  If the command is called "secure erase" or
"secure discard" in the specification, then that's what we should use,
just to avoid confusion if nothing else.

Now, if the spec explicitly disclaims correct behavior, in the case of
discard and discard zeros data, that's a different matter.  But I
don't think that is what's going on here.  The MMC specification makes
certain guarantees; there is no "the drive is allowed to disregard the
discard command if it's too busy to attend to it mealy-mouthed
language", as there is in the ATA discard specification.

The fact that there happens to be buggy hardware out there, is just
the natural state of affairs.  But that's what black lists are for.

    	    	     	       	   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner June 19, 2014, 12:36 a.m. UTC | #13
On Wed, Jun 18, 2014 at 06:06:01PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 18, 2014 at 11:33:47AM +0200, Lukáš Czerner wrote:
> > And I have no illusion that those are the only ones that does not
> > work. This hardware can not be trusted and this must not be
> > advertised as a security feature.
> 
> There's always crappy hardware out there.  If that's true, should then
> not call ATA Secure Erase by that term because somewhere out there,
> there will be an incompetently implemented SSD that doesn't do the
> right thing with ATA Secure Erase?  I just don't think that's
> particularly useful.  If the command is called "secure erase" or
> "secure discard" in the specification, then that's what we should use,
> just to avoid confusion if nothing else.

That's just a steaming pile of rhetoric. If that was true, then we
wouldn't be calling our operations BLKDISCARD or "discard", would
we?  It would be called "TRIM" or "WRITE_SAME" because that's what
the device layer standards call the operations.

Sure, we have a "FITRIM" ioctl, but we acknowledged early on that it
was badly named because different protocols use different names.
That's why we started to use "discard" instead - it's a protocol and
device neutral term that describes the intent of the operation - to
-discard blocks-.

IOWs, I think that Lukas is right on the money here - we should not
imply something is secure when it is not, nor should we name high
level interfaces based on the standardise name on the low level
primitive some class of device or protocol uses. 

Rather, we should describe it for what it is: it is a command
to *scrub the data* from a range of blocks. i.e. it's not a
discard operation at all - it's a "scrub" operation that we are
asking the device to perform.

And further, scrubbing has a specific meaning in the security
environment - it doesn't imply security - it just means there is a
mechanism for physically removing data from it's known locations.
Security comes from what you do with the scrubbing mechanism at
higher layers.

Scrubbing is something people already understand and it's clear
that it's a data manipulation operation and not some magic "secure"
operation. And by calling it "scrub" we get away from the idea that
it only works on specific hardware - hardware acceleration is good,
but there's no reason why we should design the functionality to only
be useful on systems with hardware scrubbing capability...

Cheers,

Dave.
Lukas Czerner June 19, 2014, 8:10 a.m. UTC | #14
On Wed, 18 Jun 2014, JP Abgrall wrote:

> Date: Wed, 18 Jun 2014 14:51:42 -0700
> From: JP Abgrall <jpa@google.com>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Theodore Ts'o <tytso@mit.edu>, Dave Chinner <david@fromorbit.com>,
>     Eric Sandeen <sandeen@redhat.com>, linux-ext4@vger.kernel.org,
>     Geremy Condra <gcondra@google.com>,
>     "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
> Subject: Re: [PATCH] ext4: Add support for SFITRIM,
>     an ioctl for secure FITRIM.
> 
> On Wed, Jun 18, 2014 at 2:33 AM, Lukáš Czerner <lczerner@redhat.com> wrote:
> > Except when it does not. Looking at the mmc driver I can see that we
> > have already some devices where secure trim is broken.
> 
> And that is why we don't just blindly use random eMMC devices.

'we' as in ... ? Remember, you're not the only one using this code.

> 
> > Let's call is deep discard, or whatever but avoid the security word
> > at least when file system comes into play.
> 
> Best effort discard :)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Lukas Czerner June 19, 2014, 8:15 a.m. UTC | #15
On Thu, 19 Jun 2014, Dave Chinner wrote:

> Date: Thu, 19 Jun 2014 10:36:57 +1000
> From: Dave Chinner <david@fromorbit.com>
> To: Theodore Ts'o <tytso@mit.edu>
> Cc: Lukáš Czerner <lczerner@redhat.com>, JP Abgrall <jpa@google.com>,
>     Eric Sandeen <sandeen@redhat.com>, linux-ext4@vger.kernel.org,
>     Geremy Condra <gcondra@google.com>,
>     "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
> Subject: Re: [PATCH] ext4: Add support for SFITRIM,
>     an ioctl for secure FITRIM.
> 
> On Wed, Jun 18, 2014 at 06:06:01PM -0400, Theodore Ts'o wrote:
> > On Wed, Jun 18, 2014 at 11:33:47AM +0200, Lukáš Czerner wrote:
> > > And I have no illusion that those are the only ones that does not
> > > work. This hardware can not be trusted and this must not be
> > > advertised as a security feature.
> > 
> > There's always crappy hardware out there.  If that's true, should then
> > not call ATA Secure Erase by that term because somewhere out there,
> > there will be an incompetently implemented SSD that doesn't do the
> > right thing with ATA Secure Erase?  I just don't think that's
> > particularly useful.  If the command is called "secure erase" or
> > "secure discard" in the specification, then that's what we should use,
> > just to avoid confusion if nothing else.
> 
> That's just a steaming pile of rhetoric. If that was true, then we
> wouldn't be calling our operations BLKDISCARD or "discard", would
> we?  It would be called "TRIM" or "WRITE_SAME" because that's what
> the device layer standards call the operations.
> 
> Sure, we have a "FITRIM" ioctl, but we acknowledged early on that it
> was badly named because different protocols use different names.
> That's why we started to use "discard" instead - it's a protocol and
> device neutral term that describes the intent of the operation - to
> -discard blocks-.
> 
> IOWs, I think that Lukas is right on the money here - we should not
> imply something is secure when it is not, nor should we name high
> level interfaces based on the standardise name on the low level
> primitive some class of device or protocol uses. 
> 
> Rather, we should describe it for what it is: it is a command
> to *scrub the data* from a range of blocks. i.e. it's not a
> discard operation at all - it's a "scrub" operation that we are
> asking the device to perform.
> 
> And further, scrubbing has a specific meaning in the security
> environment - it doesn't imply security - it just means there is a
> mechanism for physically removing data from it's known locations.
> Security comes from what you do with the scrubbing mechanism at
> higher layers.
> 
> Scrubbing is something people already understand and it's clear
> that it's a data manipulation operation and not some magic "secure"
> operation. And by calling it "scrub" we get away from the idea that
> it only works on specific hardware - hardware acceleration is good,
> but there's no reason why we should design the functionality to only
> be useful on systems with hardware scrubbing capability...

+1 for the "scrub" operation, it makes perfect sense to me.

-Lukas

> 
> Cheers,
> 
> Dave.
>
Lukas Czerner June 19, 2014, 8:33 a.m. UTC | #16
On Wed, 18 Jun 2014, Theodore Ts'o wrote:

> Date: Wed, 18 Jun 2014 18:06:01 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: JP Abgrall <jpa@google.com>, Dave Chinner <david@fromorbit.com>,
>     Eric Sandeen <sandeen@redhat.com>, linux-ext4@vger.kernel.org,
>     Geremy Condra <gcondra@google.com>,
>     "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
> Subject: Re: [PATCH] ext4: Add support for SFITRIM,
>     an ioctl for secure FITRIM.
> 
> On Wed, Jun 18, 2014 at 11:33:47AM +0200, Lukáš Czerner wrote:
> > Except when it does not. Looking at the mmc driver I can see that we
> > have already some devices where secure trim is broken.
> > 
> > /*
> >  * On these Samsung MoviNAND parts, performing secure erase or
> >  * secure trim can result in unrecoverable corruption due to a
> >  * firmware bug.
> >  */
> 
> The bug IMHO is that the eMMC driver is falling back to discard,
> instead of returning EOPNOTSUPP.  The question of whether we should
> fall back to discard or not should be made at a level much higher up
> than the MMC device driver....

Yes, that's what I meant and I already sent a patch to return
EOPNOTSUPP.

> 
> > And I have no illusion that those are the only ones that does not
> > work. This hardware can not be trusted and this must not be
> > advertised as a security feature.
> 
> There's always crappy hardware out there.  If that's true, should then
> not call ATA Secure Erase by that term because somewhere out there,
> there will be an incompetently implemented SSD that doesn't do the
> right thing with ATA Secure Erase?  I just don't think that's
> particularly useful.  If the command is called "secure erase" or
> "secure discard" in the specification, then that's what we should use,
> just to avoid confusion if nothing else.
> 
> Now, if the spec explicitly disclaims correct behavior, in the case of
> discard and discard zeros data, that's a different matter.  But I
> don't think that is what's going on here.  The MMC specification makes
> certain guarantees; there is no "the drive is allowed to disregard the
> discard command if it's too busy to attend to it mealy-mouthed
> language", as there is in the ATA discard specification.
> 
> The fact that there happens to be buggy hardware out there, is just
> the natural state of affairs.  But that's what black lists are for.

But there is no "security" in the way we're using it right now.
Moreover the secure trim command is actually split to two commands,
one which can be called multiple times to identify all the block
ranges and second one which will actually do the job. If the write
comes in between than the respective block (or blocks) will not be
processed. I do not think that this can happen as it is implemented
right now, but I might be wrong.

However currently we're not using it to it's full potential since
we're only doing the first step once, but I can imagine doing the
first step for all the ranges we want to discard and then doing the
actual discard. In that case we would actually need some guarantees
that no write can come in between. Also we would probably need some
protection against power failure and so on.

And what about defragmentation ? When the data is moving around we can
no longer tell where the previous version of that particular piece of
data is.

And I am sure there are loads of other problems as well, so
currently there are no guarantees at all and it simple does not have
anything to do with security. That's perfectly fine as long as we do
not staple it as "secure" operation. I think that the confusion here
comes from different point of view of the higher level file system and
the lower level device itself as Dave already pointed out.

-Lukas

> 
>     	    	     	       	   - Ted
Martin K. Petersen June 20, 2014, 2:44 a.m. UTC | #17
>>>>> "Lukáš" == Lukáš Czerner <lczerner@redhat.com> writes:

>> Scrubbing is something people already understand and it's clear that
>> it's a data manipulation operation and not some magic "secure"
>> operation. And by calling it "scrub" we get away from the idea that
>> it only works on specific hardware - hardware acceleration is good,
>> but there's no reason why we should design the functionality to only
>> be useful on systems with hardware scrubbing capability...

Lukáš> +1 for the "scrub" operation, it makes perfect sense to me.

I'm not sure I agree with the choice of "scrub" to describe this:

	http://en.wikipedia.org/wiki/Data_scrubbing

What about purge or sanitize? That's what the security folks generally
use...
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b3a0f21..0203d9c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2695,6 +2695,8 @@  extern int ext4_check_blockref(const char *, unsigned int,
 /* extents.c */
 struct ext4_ext_path;
 struct ext4_extent;
+typedef int (*extent_iterator_t)(struct inode *inode, struct extent_status *es,
+				 unsigned int flags, void *private);
 
 extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
 extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
@@ -2733,6 +2735,9 @@  extern int ext4_find_delalloc_range(struct inode *inode,
 				    ext4_lblk_t lblk_start,
 				    ext4_lblk_t lblk_end);
 extern int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk);
+extern int ext4_extent_iterator(struct inode *inode,
+				ext4_lblk_t block, ext4_lblk_t num,
+				extent_iterator_t callback, void *private);
 extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			__u64 start, __u64 len);
 extern int ext4_ext_precache(struct inode *inode);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2c84bc0..6118a1d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2118,9 +2118,13 @@  cleanup:
 	return err;
 }
 
-static int ext4_fill_fiemap_extents(struct inode *inode,
-				    ext4_lblk_t block, ext4_lblk_t num,
-				    struct fiemap_extent_info *fieinfo)
+
+typedef int (*extent_iterator_t)(struct inode *inode, struct extent_status *es,
+				 unsigned int flags, void *private);
+
+int ext4_extent_iterator(struct inode *inode,
+			 ext4_lblk_t block, ext4_lblk_t num,
+			 extent_iterator_t callback, void *private)
 {
 	struct ext4_ext_path *path = NULL;
 	struct ext4_extent *ex;
@@ -2129,7 +2133,6 @@  static int ext4_fill_fiemap_extents(struct inode *inode,
 	ext4_lblk_t last = block + num;
 	int exists, depth = 0, err = 0;
 	unsigned int flags = 0;
-	unsigned char blksize_bits = inode->i_sb->s_blocksize_bits;
 
 	while (block < last && block != EXT_MAX_BLOCKS) {
 		num = last - block;
@@ -2253,11 +2256,7 @@  static int ext4_fill_fiemap_extents(struct inode *inode,
 		}
 
 		if (exists) {
-			err = fiemap_fill_next_extent(fieinfo,
-				(__u64)es.es_lblk << blksize_bits,
-				(__u64)es.es_pblk << blksize_bits,
-				(__u64)es.es_len << blksize_bits,
-				flags);
+			err = callback(inode, &es, flags, private);
 			if (err < 0)
 				break;
 			if (err == 1) {
@@ -2277,6 +2276,27 @@  static int ext4_fill_fiemap_extents(struct inode *inode,
 	return err;
 }
 
+static int call_fill_fiemap(struct inode *inode, struct extent_status *es,
+			    unsigned int flags, void *private)
+{
+	unsigned char blksize_bits = inode->i_sb->s_blocksize_bits;
+
+	return fiemap_fill_next_extent(private,
+				       (__u64)es->es_lblk << blksize_bits,
+				       (__u64)es->es_pblk << blksize_bits,
+				       (__u64)es->es_len << blksize_bits,
+				       flags);
+}
+
+static int ext4_fill_fiemap_extents(struct inode *inode,
+				    ext4_lblk_t block, ext4_lblk_t num,
+				    struct fiemap_extent_info *fieinfo)
+{
+	return ext4_extent_iterator(inode, block, num,
+				    call_fill_fiemap, fieinfo);
+}
+
+
 /*
  * ext4_ext_put_gap_in_cache:
  * calculate boundaries of the gap that the requested block fits into
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 5498f75..da6eac0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -214,6 +214,132 @@  swap_boot_out:
 	return err;
 }
 
+static int discard_callback(struct inode *inode, struct extent_status *es,
+			    unsigned int flags, void *private)
+{
+	struct ext4_map_blocks *map = private;
+	ext4_lblk_t es_lblk = es->es_lblk;
+	ext4_lblk_t es_len = es->es_len;
+	ext4_fsblk_t es_pblk = es->es_pblk;
+
+	if (flags & (FIEMAP_EXTENT_UNKNOWN |
+		     FIEMAP_EXTENT_ENCODED |
+		     FIEMAP_EXTENT_DATA_ENCRYPTED |
+		     FIEMAP_EXTENT_DELALLOC |
+		     FIEMAP_EXTENT_DATA_TAIL |
+		     FIEMAP_EXTENT_DATA_INLINE |
+		     FIEMAP_EXTENT_NOT_ALIGNED |
+		     FIEMAP_EXTENT_SHARED))
+		return 0;
+
+	if (es_lblk < map->m_lblk) {
+		ext4_lblk_t d = map->m_lblk - es_lblk;
+		if (d > es_len)
+			return 0;
+		es_lblk += d;
+		es_pblk += d;
+		es_len -= d;
+	}
+
+	if (es_lblk + es_len > map->m_lblk + map->m_len)
+		es_len -= es_lblk + es_len - (map->m_lblk + map->m_len);
+#ifdef BLKDISCARD_DEBUG
+	ext4_msg(inode->i_sb, KERN_NOTICE, "discard: %llu len %lu",
+		 (unsigned long long) es_pblk, (unsigned long) es_len);
+	return 0;
+#else
+	return sb_issue_discard(inode->i_sb, es_pblk, es_len, GFP_KERNEL, 0);
+#endif
+}
+
+static int blkdiscard_inode(struct inode *inode, u64 start_offset, u64 len)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_map_blocks map;
+	unsigned int num;
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	if (!blk_queue_discard(bdev_get_queue(sb->s_bdev)))
+		return -EOPNOTSUPP;
+
+	if (!bdev_discard_zeroes_data(sb->s_bdev) && !capable(CAP_SYS_ADMIN))
+		return -EOPNOTSUPP;
+
+	num = start_offset & (sb->s_blocksize - 1);
+	if (num) {
+		num = sb->s_blocksize - num;
+		start_offset += num;
+		len = (len > num) ? len - num : 0;
+	}
+	if (len == 0)
+		return 0;
+	if (start_offset > sb->s_maxbytes)
+		return -EFBIG;
+	if (len > sb->s_maxbytes || (sb->s_maxbytes - len) < start_offset)
+		len = sb->s_maxbytes - start_offset;
+
+	map.m_lblk = start_offset >> sb->s_blocksize_bits;
+	map.m_len = len >> sb->s_blocksize_bits;
+
+#ifdef BLKDISCARD_DEBUG
+	ext4_msg(sb, KERN_NOTICE, "blkdiscard range: %lu len %lu",
+		 (unsigned long) map.m_lblk, (unsigned long) map.m_len);
+#endif
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		return ext4_extent_iterator(inode, map.m_lblk, map.m_len,
+					    discard_callback, &map);
+
+	num = map.m_len;
+	while (num) {
+		int ret = ext4_map_blocks(NULL, inode, &map, 0);
+
+		if (ret < 0)
+			return ret;
+
+		if (ret == 0) {
+#ifdef BLKDISCARD_DEBUG
+			ext4_msg(sb, KERN_NOTICE,
+				 "skip: lblk %lu len %lu ret %lu num %lu",
+				 (unsigned long) map.m_lblk,
+				 (unsigned long) map.m_len,
+				 (unsigned long) ret,
+				 (unsigned long) num);
+#endif
+			map.m_lblk++;
+			num--;
+			continue;
+		}
+#ifdef BLKDISCARD_DEBUG
+		ext4_msg(sb, KERN_NOTICE,
+			 "walk: lblk %lu pblk %llu len %lu ret %lu num %lu",
+			 (unsigned long) map.m_lblk,
+			 (unsigned long long) map.m_pblk,
+			 (unsigned long) map.m_len,
+			 (unsigned long) ret,
+			 (unsigned long) num);
+#endif
+		if (ret > num)
+			ret = num;
+		map.m_lblk += ret;
+		num -= ret;
+		map.m_len = num;
+
+#ifdef BLKDISCARD_DEBUG
+		ext4_msg(sb, KERN_NOTICE, "discard: %llu len %lu",
+			 (unsigned long long) map.m_pblk, (unsigned long) ret);
+#else
+		ret = sb_issue_discard(sb, map.m_pblk, ret,
+				       GFP_KERNEL, 0);
+		if (ret)
+			return ret;
+#endif
+	}
+	return 0;
+}
+
 long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -627,6 +753,18 @@  resizefs_out:
 	case EXT4_IOC_PRECACHE_EXTENTS:
 		return ext4_ext_precache(inode);
 
+	case BLKDISCARD: {
+		uint64_t range[2];
+
+		if (!(filp->f_mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+			return -EFAULT;
+
+		return blkdiscard_inode(file_inode(filp), range[0], range[1]);
+	}
+
 	default:
 		return -ENOTTY;
 	}
@@ -691,6 +829,7 @@  long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case FITRIM:
 	case EXT4_IOC_RESIZE_FS:
 	case EXT4_IOC_PRECACHE_EXTENTS:
+	case BLKDISCARD:
 		break;
 	default:
 		return -ENOIOCTLCMD;