
[RFC] ext4_bmap() may return blocks outside filesystem

Message ID 498AD58B.5000805@ph.tum.de
State Superseded, archived

Commit Message

Thiemo Nagel Feb. 5, 2009, 12:03 p.m. UTC
Hello,

during testing of ext4 with intentionally corrupted filesystem images I 
noticed that sometimes ext4_bmap() returns physical block numbers which 
lie outside of the filesystem.  In most cases, the error is caught by 
the block layer (?) leading to error messages of the kind:

attempt to access beyond end of device
loop0: rw=0, want=xxx, limit=xxx

But there also are cases which are not handled gracefully by bmap() callers.

I've attached a conceptual patch against 2.6.29-rc2 which fixes one case 
in which invalid block numbers are returned (there might be more) by 
adding sanity checks to ext4_ext_find_extent(), but before I start 
looking for further occurrences, I'd like to ask whether you think my 
approach is reasonable.

Kind regards,

Thiemo Nagel

Comments

Theodore Ts'o Feb. 5, 2009, 1:49 p.m. UTC | #1
On Thu, Feb 05, 2009 at 01:03:23PM +0100, Thiemo Nagel wrote:
> But there also are cases which are not handled gracefully by bmap() callers.
>
> I've attached a conceptual patch against 2.6.29-rc2 which fixes one case  
> in which invalid block numbers are returned (there might be more) by  
> adding sanity checks to ext4_ext_find_extent(), but before I start  
> looking for further occurences, I'd like to ask whether you think my  
> approach is reasonable.

Yes, it's reasonable; the right thing is not just to jump out to
errout, though, but to call ext4_error() first.  Since the filesystem
is clearly corrupted, we want to mark it as needing to be fsck'ed, and
to make sure the right thing happens if the filesystem is set to
"remount readonly" or "panic" on filesystem errors.  We should also
log the device name, inode number and logical block number that was
requested, so that someone looking in the console logs can see what
was going on at the time.
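
Roughly something along these lines (an untested sketch against
2.6.29-rc2, so take the details with a grain of salt; ext4_error()
already prefixes the device name, and ee_len is little-endian on disk
with its high bit flagging uninitialized extents, hence
ext4_ext_get_actual_len()):

	/* inside ext4_ext_find_extent(), after ext4_ext_binsearch() */
	if (path[ppos].p_ext) {
		struct ext4_extent *ex = path[ppos].p_ext;
		ext4_fsblk_t pblk = ext_pblock(ex);
		unsigned short len = ext4_ext_get_actual_len(ex);
		struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;

		if (pblk < le32_to_cpu(es->s_first_data_block) ||
		    pblk + len > ext4_blocks_count(es)) {
			ext4_error(inode->i_sb, __func__,
				   "inode #%lu: extent at logical block %u "
				   "maps to physical blocks %llu-%llu, "
				   "outside the filesystem",
				   inode->i_ino, (unsigned) block,
				   (unsigned long long) pblk,
				   (unsigned long long) (pblk + len - 1));
			/* err: frees the path and returns ERR_PTR(-EIO) */
			goto err;
		}
		path[ppos].p_block = pblk;
	}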

As an unrelated patch, we might also want to put a check in
fs/ext4/inode.c's ext4_get_branch(), so we can equivalently detect
bogus direct/indirect blocks and flag them with the appropriate
errors.
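
For ext4_get_branch() the check would have the same shape.  An
untested sketch of the chain-walking loop (the data block number
ultimately returned to the caller would want the same validation):

	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
	...
	while (--depth) {
		ext4_fsblk_t blk = le32_to_cpu(p->key);

		/* reject block references pointing outside the filesystem */
		if (blk < le32_to_cpu(es->s_first_data_block) ||
		    blk >= ext4_blocks_count(es)) {
			ext4_error(sb, __func__,
				   "inode #%lu: bogus indirect block %llu",
				   inode->i_ino, (unsigned long long) blk);
			goto failure;
		}
		bh = sb_bread(sb, blk);
		if (!bh)
			goto failure;
		add_chain(++p, bh, (__le32 *)bh->b_data + *++offsets);
		if (!p->key)
			goto no_block;
	}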

						- Ted
Greg Freemyer Feb. 5, 2009, 3:22 p.m. UTC | #2
On Thu, Feb 5, 2009 at 8:49 AM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Feb 05, 2009 at 01:03:23PM +0100, Thiemo Nagel wrote:
>> But there also are cases which are not handled gracefully by bmap() callers.
>>
>> I've attached a conceptual patch against 2.6.29-rc2 which fixes one case
>> in which invalid block numbers are returned (there might be more) by
>> adding sanity checks to ext4_ext_find_extent(), but before I start
>> looking for further occurences, I'd like to ask whether you think my
>> approach is reasonable.
>
> Yes, it's reasonable; the right thing is not just to jump out to
> errout, though, but to call ext4_error() first, since the filesystem is
> clearly corrupted, so we want to mark the filesystem as needing to be
> fsck'ed, and so if the filesystem is marked "remount readonly" or
> "panic" on filesystem errors, that the right thing happens.  We should
> also log the device name, inode number and logical block number that
> was requested, so that someone who is looking in the console logs can
> see what was going on at the time.
>
> As an unrelated patch, might also want to put a check in
> fs/ext4/inode.c's ext4_get_branch(), so we can equivalently detect
> bogus direct/indirect blocks and flag them with the appropriate
> errors.
>
>                                                - Ted

This is just a rant, and I doubt anyone can do anything about it, but
it is still worth reading imho.

<rant>
This brings up a concern I have with the proposed Thin Provisioning
updates to the SCSI and ATA specs.

As I'm sure most know, both are looking at supporting the concept of
mapped / unmapped sectors being tracked not only in the filesystem but
also in the storage device.

[SSDs are one use case, and storage arrays are the other.  Many
storage arrays already support thin provisioning, but not via the new
"discard" functionality in the Linux kernel.]

My big concern is that neither is proposing a way for a tool like fsck
to query the storage device and verify that the filesystem's view of
what is mapped vs unmapped agrees with the storage device's view.

For both sets of proposed spec updates there are circumstances where
the storage device spec allows garbage to be returned for non-mapped
sectors.  Thus in the situation of a corrupt filesystem, it is very
possible that some of the sectors that the filesystem is relying on
are actually unmapped and potentially garbage.

Lacking any knowledge of which specific sectors the underlying storage
system treats as reliable vs. unreliable, I can imagine filesystem
corruption going from a correctable situation to a "restore from
backups" situation.

The solution in my mind is that both specs add a way for diagnostic
tools to query the status of a sector to see if it is mapped vs
unmapped, etc.
</rant>.

Greg
Ric Wheeler Feb. 5, 2009, 3:39 p.m. UTC | #3
Greg Freemyer wrote:
> On Thu, Feb 5, 2009 at 8:49 AM, Theodore Tso <tytso@mit.edu> wrote:
>   
>> On Thu, Feb 05, 2009 at 01:03:23PM +0100, Thiemo Nagel wrote:
>>     
>>> But there also are cases which are not handled gracefully by bmap() callers.
>>>
>>> I've attached a conceptual patch against 2.6.29-rc2 which fixes one case
>>> in which invalid block numbers are returned (there might be more) by
>>> adding sanity checks to ext4_ext_find_extent(), but before I start
>>> looking for further occurences, I'd like to ask whether you think my
>>> approach is reasonable.
>>>       
>> Yes, it's reasonable; the right thing is not just to jump out to
>> errout, though, but to call ext4_error() first, since the filesystem is
>> clearly corrupted, so we want to mark the filesystem as needing to be
>> fsck'ed, and so if the filesystem is marked "remount readonly" or
>> "panic" on filesystem errors, that the right thing happens.  We should
>> also log the device name, inode number and logical block number that
>> was requested, so that someone who is looking in the console logs can
>> see what was going on at the time.
>>
>> As an unrelated patch, might also want to put a check in
>> fs/ext4/inode.c's ext4_get_branch(), so we can equivalently detect
>> bogus direct/indirect blocks and flag them with the appropriate
>> errors.
>>
>>                                                - Ted
>>     
>
> This is just a rant, and I doubt anyone can do anything about it, but
> it is still worth reading imho.
>
> <rant>
> This brings up a concern I have with the proposed Thin Provisioning
> updates to the SCSI and ATA specs.
>
> As I'm sure most know, both are looking at supporting the concept of
> mapped / unmapped sectors being tracked not only in the filesystem but
> also in the storage device.
>
> [SSDs are one use case, and  storage arrays are the other.  Many
> storage arrays already support thin provisioning but not via the new
> "discard" functionality in the linux kernel.]
>
> My big concern is that neither is proposing a way for a tool like fsck
> to query the storage device to verify the filesystem's view of what is
> mapped vs unmapped agrees with the storage devices view.
>   
I think that, from a file system point of view (including tools like 
fsck), this is a feature, not a bug. These features should, if done 
right, be invisible to us, and this should be irrelevant to fsck .....

> For both sets of proposed spec updates there are circumstances where
> the storage device spec allows garbage to be returned for non-mapped
> sectors.  Thus in the situation of a corrupt filesystem, it is very
> possible that some of the sectors that the filesystem is relying on
> are actually unmapped and potentially garbage.
>
> Lacking any knowledge of which specific sectors the underlying storage
> systems treats as reliable vs. unreliable, I can imagine the
> filesystem corruption will go from a correctable situation to a
> "restore from backups" situation.
>   

I disagree - any written data, specifically all meta-data, will have the 
correct data returned on read. All unmapped data is also by definition 
un-allocated at the fs layer (for fsck as well) and we should not be 
reading it back if the tools work correctly.

> The solution in my mind is that both specs add a way for diagnostic
> tools to query the status of a sector to see if it is mapped vs
> unmapped, etc.
> </rant>.
>
> Greg
>   

Theodore Ts'o Feb. 5, 2009, 4:48 p.m. UTC | #4
On Thu, Feb 05, 2009 at 10:39:59AM -0500, Ric Wheeler wrote:
> Greg Freemyer wrote:
>> This is just a rant, and I doubt anyone can do anything about it, but
>> it is still worth reading imho.

It also has absolutely nothing to do with the original thread, which
was block numbers which are far outside the range of valid block
numbers given the size of the block device.  :-)

>> My big concern is that neither is proposing a way for a tool like fsck
>> to query the storage device to verify the filesystem's view of what is
>> mapped vs unmapped agrees with the storage devices view.
>>   
> I think that from a file system point of view (including tools like  
> fsck), that is a feature, not a bug. The features should be, if done  
> right, invisible to us and this should be irrelevant to fsck .....

Yeah, pretty much.  There are two cases.  One is where the
thin-provisioned disk thinks the block is in use, but the filesystem
has it marked as not in use.  That is only a problem in that some
space is wasted.  It could occur the first time the filesystem is
moved onto the thin-provisioned device.  Adding code to e2fsck to
support this is not particularly hard.  It's just something you would
do after fsck's pass 5, once the block allocation bitmaps are
validated.  Or you could do it as a separate program, which requires
that the last fsck time == the mount time, and then uses the block
allocation bitmaps to tell the storage device which blocks aren't in use.
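
That separate program could be little more than the following (a very
rough, untested sketch using libext2fs and the BLKDISCARD ioctl; the
last-fsck-time == mount-time check and the coalescing of adjacent free
blocks into larger discard ranges are left out):

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKDISCARD */
#include <ext2fs/ext2fs.h>

/* meant to run against an unmounted (or clean, unmodified-since-fsck)
 * filesystem on a device that supports discard */
int main(int argc, char **argv)
{
	ext2_filsys fs;
	blk_t blk;
	uint64_t range[2];
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDWR);
	if (fd < 0 ||
	    ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs))
		return 1;
	if (ext2fs_read_block_bitmap(fs))
		return 1;

	for (blk = fs->super->s_first_data_block;
	     blk < fs->super->s_blocks_count; blk++) {
		if (ext2fs_test_block_bitmap(fs->block_map, blk))
			continue;	/* block is in use by the fs */
		range[0] = (uint64_t) blk * fs->blocksize;
		range[1] = fs->blocksize;
		if (ioctl(fd, BLKDISCARD, &range))
			perror("BLKDISCARD");
	}

	ext2fs_close(fs);
	close(fd);
	return 0;
}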

The other case is where the block is still in use by the filesystem,
but somehow the thin-provisioned disk thinks it has been freed.  This
is most likely to happen if a block is claimed as being in use by two
inodes, and then one of the two inodes is deleted.  This case almost
always involves data loss: the original inode's data has typically
already been overwritten, and even if it is the original inode that
gets deleted, the block is now marked free, so the filesystem can
allocate it for use by a new inode, and then the other inode's data
gets corrupted.

>> Lacking any knowledge of which specific sectors the underlying storage
>> systems treats as reliable vs. unreliable, I can imagine the
>> filesystem corruption will go from a correctable situation to a
>> "restore from backups" situation.

As I have described above, if the block allocation bitmaps are
corrupted to the point where this problem can occur, the possibility
of data loss already exists; this is nothing new.  The only
thing new here is that instead of the data getting corrupted when the
block gets reused by another inode in the same filesystem, it can get
corrupted when the block gets reused by data in another filesystem.
(And, the data might get immediately invalidated as soon as the TRIM
command or equivalent is sent to the storage device.  That just makes
the data loss deterministic.)

So this is not a new scenario, and fortunately, it rarely happens,
thanks to ext3's journalling.  The most common way it happened before
was with ext2 filesystems when impatient users bypassed fsck after an
unclean shutdown.  These days, it usually requires some kind of
hardware failure, either in memory or in the storage subsystem.

> I disagree - any written data, specifically all meta-data, will have the  
> correct data returned on read. All unmapped data is also by definition  
> un-allocated at the fs layer (for fsck as well) and we should not be  
> reading it back if the tools work correctly.

... or if there is a hardware problem, but that is a pre-existing,
unsolved problem.  And things work fairly well today.

>> The solution in my mind is that both specs add a way for diagnostic
>> tools to query the status of a sector to see if it is mapped vs
>> unmapped, etc.

This is not a bad thing, and if it's there, utilities to make sure the
storage device is in sync with the block allocation bitmaps wouldn't
hurt, but I suspect the main reason for them will be efficiency;
currently, on an unclean shutdown, we don't report the blocks that
were freed immediately before the shutdown, since that's just a
storage leak, not a data loss problem.  And if the blocks get reused
immediately afterwards, it's not a big deal at all.

       		   	       	    	  - Ted
Greg Freemyer Feb. 5, 2009, 10:01 p.m. UTC | #5
On Thu, Feb 5, 2009 at 11:48 AM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Feb 05, 2009 at 10:39:59AM -0500, Ric Wheeler wrote:
>> Greg Freemyer wrote:
>>> This is just a rant, and I doubt anyone can do anything about it, but
>>> it is still worth reading imho.
>
> It also has absolutely nothing to do with the original thread, which
> was block numbers which are far outside the range of valid block
> numbers given the size of the block device.  :-)
>

The subject was "return blocks outside filesystem".

In a thin-provisioning environment I'd argue that unmapped sectors are
"outside the filesystem". :)

Unfortunately, I can't get anyone else to see the world from my
apparently unique perspective. :(
Theodore Ts'o Feb. 5, 2009, 10:18 p.m. UTC | #6
On Thu, Feb 05, 2009 at 05:01:01PM -0500, Greg Freemyer wrote:
> > It also has absolutely nothing to do with the original thread, which
> > was block numbers which are far outside the range of valid block
> > numbers given the size of the block device.  :-)
> 
> The subject was "return blocks outside filesystem".

Yes, it's clear you didn't read the e-mail thread, but rather just
keyed off the subject line.  :-)

> In a thin-provisioning environment I'd argue that unmapped sectors are
> "outside the filesystem". :)
> 
> Unfortunately, I can't get anyone else to see the world from my
> apparently unique perspective. :(

If you don't like this, don't use thin-provisioned devices.  Again, I
don't see how your fears are likely to be a factor in a real-world
scenario.  If there are bugs in the
thin-provisioned devices, people shouldn't use them.  Given that we
are conservative about when we tell thin-provisioned devices that
blocks are no longer in use (i.e., on journal commits, and if we
crash, just don't tell the device the blocks can be reused), what's
the problem that you're worried about?  How does it occur in real
life?

It's hard to defend against a theoretical problem when you only give
vague fears about how it might be triggered...

						- Ted


Goswin von Brederlow Feb. 7, 2009, 1:27 p.m. UTC | #7
Theodore Tso <tytso@mit.edu> writes:

> On Thu, Feb 05, 2009 at 05:01:01PM -0500, Greg Freemyer wrote:
>> > It also has absolutely nothing to do with the original thread, which
>> > was block numbers which are far outside the range of valid block
>> > numbers given the size of the block device.  :-)
>> 
>> The subject was "return blocks outside filesystem".
>
> Yes, it's clear you didn't read the e-mail thread, but rather just
> keyed off the subject line.  :-)
>
>> In a thin-provisioning environment I'd argue that unmapped sectors are
>> "outside the filesystem". :)
>> 
>> Unfortunately, I can't get anyone else to see the world from my
>> apparently unique perspective. :(
>
> If you don't like this, don't use thin-provisioned devices.  Again, I
> don't see the likely scenario where your fears are likely to be a
> factor in a real world scenario.  If there are bugs in the

There will be bugs.

> thin-provisioned devices, people shouldn't use them.  Given that we

And people will still use them.

Assuming that storage boxes work perfectly is just ignoring reality.
Even if the software has no bugs, there will still be hardware
failures.  Given enough boxes, there will be multi-bit flips with a
correct ECC sum in RAM or on disk.  Power and battery backups will
fail mid-update, and so on.

> are conservative about when we tell thin-provisioned devices that
> blocks are no longer in use (i.e., on journal commits, and if we
> crash, just don't tell the device the blocks can be reused), what's
> the problem that you're worried about?  How does it occur in real
> life?
>
> It's hard to defend against a theoretical problem when you only give
> vague fears about how it might be triggered...
>
> 						- Ted

I see the following scenario:

1) The filesystem / thin-provisioning device gets corrupted somehow:
fs bug, hardware, whatever.

2) The thin-provisioning device thinks a block is free while the FS
thinks it is in use. Make it a metadata block so it really matters.

3) The thin-provisioning device still has the mapping and data of the
block and hasn't reused the block yet. On read the device will return
the correct data as long as the block is not reused. This seems to be
a valid implementation for a thin-provisioning device.

4) fsck will find no error, but future writes will reuse the block on
the thin-provisioning device, overwriting the data and causing
catastrophic FS corruption.


So I think an fsck pass to check FS-used blocks against hardware-used
blocks is essential if the FS supports thin-provisioned devices.  Once
you free hardware blocks, you have to check that the FS's and the
hardware's views agree.
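
In code the pass I have in mind would look roughly like this;
device_block_is_mapped() is hypothetical and stands for exactly the
query interface that the specs do not provide today:

#include <stdio.h>
#include <ext2fs/ext2fs.h>

/* hypothetical: ask the storage device whether it has this fs block mapped */
extern int device_block_is_mapped(int devfd, blk_t blk);

static int check_thin_provisioning(ext2_filsys fs, int devfd)
{
	blk_t blk;
	int errors = 0;

	for (blk = fs->super->s_first_data_block;
	     blk < fs->super->s_blocks_count; blk++) {
		if (!ext2fs_test_block_bitmap(fs->block_map, blk))
			continue;	/* free in the fs, nothing to check */
		if (!device_block_is_mapped(devfd, blk)) {
			/* fs relies on this block, device considers it free:
			 * a future write may silently reuse it */
			printf("block %u: allocated in fs, unmapped on device\n",
			       blk);
			errors++;
		}
	}
	return errors;
}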

MfG
        Goswin
Theodore Ts'o Feb. 7, 2009, 3:51 p.m. UTC | #8
On Sat, Feb 07, 2009 at 02:27:31PM +0100, Goswin von Brederlow wrote:
> I see the following scenario:
> 
> 1) The filesystem / thin-provision gets corrupted somehow. fs bug,
> hardware, whatever.
> 
> 2) The thin-provision thinks a block is free while the FS thinks it is
> in use. Make it a meta data block so it really matters.
> 
> 3) The thin-provision still has the mapping and data of the block and
> hasn't reused the block yet. On read the device will return the
> correct data as long as the block is not reused. This seems to be a
> valid implementation for a thin-provision device.

That's highly unlikely, actually.  Once you tell the thin-provisioning
device that the block is not in use, it will delete the mapping from
its mapping structures.  So it's highly unlikely you will be able to
recover once you send the TRIM command.

> 4) fsck will find no error but future writes will reuse the block on
> the thin-provision device overwriting the data and causing
> catastrophic FS corruption.

The way this can happen today is if the bitmap block gets corrupted,
so that a block which is in use gets used by another inode.  Now you
have a filesystem block overwritten by a data block from an inode ---
potentially catastrophic FS corruption, even before you issue the ATA
TRIM command.  This can happen today, and in practice it is extremely
rare.  So permit me to be highly dubious about your claim that this is
going to happen more often with thin-provisioned devices.

> So I think a fsck pass to check FS used blocks against hardware used
> blocks is essential if the FS does support thin-provisioned devices.

The filesystem might not even know whether or not a thin-provisioned
device is in use.  The OS may not even know whether the device is
thin-provisioned.  So ultimately, it's not up to the FS...

		      		       	      - Ted
Ric Wheeler Feb. 7, 2009, 6:20 p.m. UTC | #9
Theodore Tso wrote:
> On Sat, Feb 07, 2009 at 02:27:31PM +0100, Goswin von Brederlow wrote:
>   
>> I see the following scenario:
>>
>> 1) The filesystem / thin-provision gets corrupted somehow. fs bug,
>> hardware, whatever.
>>
>> 2) The thin-provision thinks a block is free while the FS thinks it is
>> in use. Make it a meta data block so it really matters.
>>
>> 3) The thin-provision still has the mapping and data of the block and
>> hasn't reused the block yet. On read the device will return the
>> correct data as long as the block is not reused. This seems to be a
>> valid implementation for a thin-provision device.
>>     
>
> That's highly unlikely, actually.  Once you tell the thin-provisioning
> device that the block is not in use, they will delete the mapping from
> their mapping structures.  So it's highly unlikely you will be able to
> recover once you send the TRIM command.
>   

For SCSI, that is actually not unlikely, since the spec does not
require you to actually do anything with the command; it can simply be
ignored, so the original data will stay there unchanged.

If it is unmapped (in SCSI speak), and you read that sector, the storage 
device must return consistent contents for each subsequent read.

Ric

>   
>> 4) fsck will find no error but future writes will reuse the block on
>> the thin-provision device overwriting the data and causing
>> catastrophic FS corruption.
>>     
>
> The way this can happen today is if the bitmap block gets corrupted,
> and so a block which is in use gets used by another inode.  So now you
> have a filesystem block overwritten by a data block from an inode ---
> so you have potentially catastrophic FS corruption, even before you
> issue the ATA TRIM command.  This can happen to day, and in practice,
> it is extremely rare.  So permit me for being highly dubious about
> your claim this is going to happen more often with thin-provisioned
> devices.
>
>   
>> So I think a fsck pass to check FS used blocks against hardware used
>> blocks is essential if the FS does support thin-provisioned devices.
>>     
>
> The filesystem might not even know whether or not a thin-provisioned
> device is in use.  The OS may not even know whether the device is
> thin-provisioned.  So ultiamtely, it's not up to the FS...
>
> 		      		       	      - Ted

Patch

--- ../download/linux-2.6.29-rc2-vanilla/fs/ext4/extents.c	2009-02-05 12:31:19.000000000 +0100
+++ fs/ext4/extents.c	2009-02-05 12:42:49.000000000 +0100
@@ -595,8 +595,15 @@ 
 	/* find extent */
 	ext4_ext_binsearch(inode, path + ppos, block);
 	/* if not an empty leaf */
-	if (path[ppos].p_ext)
+	if (path[ppos].p_ext) {
 		path[ppos].p_block = ext_pblock(path[ppos].p_ext);
+               if (path[ppos].p_block < EXT4_SB(inode->i_sb)->s_es->s_first_data_block
+                   || path[ppos].p_block + path[ppos].p_ext->ee_len
+                      >= ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es)) {
+                       printk("ext4_ext_find_extent: extent out of range\n");
+                       goto err;
+               }
+       }
 
 	ext4_ext_show_path(inode, path);