Message ID | 498AD58B.5000805@ph.tum.de |
---|---|
State | Superseded, archived |
Headers | show |
On Thu, Feb 05, 2009 at 01:03:23PM +0100, Thiemo Nagel wrote: > But there also are cases which are not handled gracefully by bmap() callers. > > I've attached a conceptual patch against 2.6.29-rc2 which fixes one case > in which invalid block numbers are returned (there might be more) by > adding sanity checks to ext4_ext_find_extent(), but before I start > looking for further occurences, I'd like to ask whether you think my > approach is reasonable. Yes, it's reasonable; the right thing is not just to jump out to errout, though, but to call ext4_error() first, since the filesystem is clearly corrupted, so we want to mark the filesystem as needing to be fsck'ed, and so if the filesystem is marked "remount readonly" or "panic" on filesystem errors, that the right thing happens. We should also log the device name, inode number and logical block number that was requested, so that someone who is looking in the console logs can see what was going on at the time. As an unrelated patch, might also want to put a check in fs/ext4/inode.c's ext4_get_branch(), so we can equivalently detect bogus direct/indirect blocks and flag them with the appropriate errors. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 5, 2009 at 8:49 AM, Theodore Tso <tytso@mit.edu> wrote: > On Thu, Feb 05, 2009 at 01:03:23PM +0100, Thiemo Nagel wrote: >> But there also are cases which are not handled gracefully by bmap() callers. >> >> I've attached a conceptual patch against 2.6.29-rc2 which fixes one case >> in which invalid block numbers are returned (there might be more) by >> adding sanity checks to ext4_ext_find_extent(), but before I start >> looking for further occurences, I'd like to ask whether you think my >> approach is reasonable. > > Yes, it's reasonable; the right thing is not just to jump out to > errout, though, but to call ext4_error() first, since the filesystem is > clearly corrupted, so we want to mark the filesystem as needing to be > fsck'ed, and so if the filesystem is marked "remount readonly" or > "panic" on filesystem errors, that the right thing happens. We should > also log the device name, inode number and logical block number that > was requested, so that someone who is looking in the console logs can > see what was going on at the time. > > As an unrelated patch, might also want to put a check in > fs/ext4/inode.c's ext4_get_branch(), so we can equivalently detect > bogus direct/indirect blocks and flag them with the appropriate > errors. > > - Ted This is just a rant, and I doubt anyone can do anything about it, but it is still worth reading imho. <rant> This brings up a concern I have with the proposed Thin Provisioning updates to the SCSI and ATA specs. As I'm sure most know, both are looking at supporting the concept of mapped / unmapped sectors being tracked not only in the filesystem but also in the storage device. [SSDs are one use case, and storage arrays are the other. Many storage arrays already support thin provisioning but not via the new "discard" functionality in the linux kernel.] My big concern is that neither is proposing a way for a tool like fsck to query the storage device to verify the filesystem's view of what is mapped vs unmapped agrees with the storage devices view. For both sets of proposed spec updates there are circumstances where the storage device spec allows garbage to be returned for non-mapped sectors. Thus in the situation of a corrupt filesystem, it is very possible that some of the sectors that the filesystem is relying on are actually unmapped and potentially garbage. Lacking any knowledge of which specific sectors the underlying storage systems treats as reliable vs. unreliable, I can imagine the filesystem corruption will go from a correctable situation to a "restore from backups" situation. The solution in my mind is that both specs add a way for diagnostic tools to query the status of a sector to see if it is mapped vs unmapped, etc. </rant>. Greg
Greg Freemyer wrote: > On Thu, Feb 5, 2009 at 8:49 AM, Theodore Tso <tytso@mit.edu> wrote: > >> On Thu, Feb 05, 2009 at 01:03:23PM +0100, Thiemo Nagel wrote: >> >>> But there also are cases which are not handled gracefully by bmap() callers. >>> >>> I've attached a conceptual patch against 2.6.29-rc2 which fixes one case >>> in which invalid block numbers are returned (there might be more) by >>> adding sanity checks to ext4_ext_find_extent(), but before I start >>> looking for further occurences, I'd like to ask whether you think my >>> approach is reasonable. >>> >> Yes, it's reasonable; the right thing is not just to jump out to >> errout, though, but to call ext4_error() first, since the filesystem is >> clearly corrupted, so we want to mark the filesystem as needing to be >> fsck'ed, and so if the filesystem is marked "remount readonly" or >> "panic" on filesystem errors, that the right thing happens. We should >> also log the device name, inode number and logical block number that >> was requested, so that someone who is looking in the console logs can >> see what was going on at the time. >> >> As an unrelated patch, might also want to put a check in >> fs/ext4/inode.c's ext4_get_branch(), so we can equivalently detect >> bogus direct/indirect blocks and flag them with the appropriate >> errors. >> >> - Ted >> > > This is just a rant, and I doubt anyone can do anything about it, but > it is still worth reading imho. > > <rant> > This brings up a concern I have with the proposed Thin Provisioning > updates to the SCSI and ATA specs. > > As I'm sure most know, both are looking at supporting the concept of > mapped / unmapped sectors being tracked not only in the filesystem but > also in the storage device. > > [SSDs are one use case, and storage arrays are the other. Many > storage arrays already support thin provisioning but not via the new > "discard" functionality in the linux kernel.] > > My big concern is that neither is proposing a way for a tool like fsck > to query the storage device to verify the filesystem's view of what is > mapped vs unmapped agrees with the storage devices view. > I think that from a file system point of view (including tools like fsck), that is a feature, not a bug. The features should be, if done right, invisible to us and this should be irrelevant to fsck ..... > For both sets of proposed spec updates there are circumstances where > the storage device spec allows garbage to be returned for non-mapped > sectors. Thus in the situation of a corrupt filesystem, it is very > possible that some of the sectors that the filesystem is relying on > are actually unmapped and potentially garbage. > > Lacking any knowledge of which specific sectors the underlying storage > systems treats as reliable vs. unreliable, I can imagine the > filesystem corruption will go from a correctable situation to a > "restore from backups" situation. > I disagree - any written data, specifically all meta-data, will have the correct data returned on read. All unmapped data is also by definition un-allocated at the fs layer (for fsck as well) and we should not be reading it back if the tools work correctly. > The solution in my mind is that both specs add a way for diagnostic > tools to query the status of a sector to see if it is mapped vs > unmapped, etc. > </rant>. > > Greg > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 05, 2009 at 10:39:59AM -0500, Ric Wheeler wrote: > Greg Freemyer wrote: >> This is just a rant, and I doubt anyone can do anything about it, but >> it is still worth reading imho. It also has absolutely nothing to do with the original thread, which was block numbers which are far outside the range of valid block numbers given the size of the block device. :-) >> My big concern is that neither is proposing a way for a tool like fsck >> to query the storage device to verify the filesystem's view of what is >> mapped vs unmapped agrees with the storage devices view. >> > I think that from a file system point of view (including tools like > fsck), that is a feature, not a bug. The features should be, if done > right, invisible to us and this should be irrelevant to fsck ..... Yeah, pretty much. There are two cases. One is where the thin-provisioned disk thinks the block is in use, but it is marked as not in use. That is only a problem in that some space is wasted. It could occur the first time the filesystem is moved onto the thin-provisioned device. Adding code to e2fsck to support this is not particularly hard. It's just somethining you would do after fsck's pass 5, once the block allocation bitmaps are validated. Or you could do it as a separate program, which requires that the last fsck time == the mount time, and then uses the block allocation bitmaps to tell the storage device that which blocks aren't in use. The other case is where the block is still in use by the filesystem, but somehow the thin-provisioned disk thinks it is freed. This is most likely going to happen if a block is claimed as being in use by two inodes, and then one of the two inodes is deleted. This case has almost always involved data loss, since the original inode's data was already overwritten, and even if it was the original inode which gets deleted, since the block is marked freed, the filesystem can allocate it for use by a new inode, and then the other inode's data would get corrupted. >> Lacking any knowledge of which specific sectors the underlying storage >> systems treats as reliable vs. unreliable, I can imagine the >> filesystem corruption will go from a correctable situation to a >> "restore from backups" situation. As I have described in above, if the block allocation bitmaps are corrupted to the point where this problem can occur, the possibility for data loss has already happened; this is nothing new. The only thing new here is that instead of the data getting corrupted when the block gets reused by another inode in the same filesystem, it can get corrupted when the block gets reused by data in another filesystem. (And, the data might get immediately invalidated as soon as the TRIM command or equivalent is sent to the storage device. That just makes the data loss deterministic.) So this is not a new secnario, and fortunately, it rarely happens, thanks to ext3's journalling. The most common way it happened before was with ext2 filesystems when impatient users bypassed fsck after an unclean shutdown. These days, it usually requires some kind of hardware failure, either in memory or in the storage subsystem. > I disagree - any written data, specifically all meta-data, will have the > correct data returned on read. All unmapped data is also by definition > un-allocated at the fs layer (for fsck as well) and we should not be > reading it back if the tools work correctly. ... or if there is a hardware problem, but this is a previously unsolved problem. And things work fairly well today. >> The solution in my mind is that both specs add a way for diagnostic >> tools to query the status of a sector to see if it is mapped vs >> unmapped, etc. This is not a bad thing, and if it's there, utilities to make sure the storage device is in synch with the block allocation devices isn't a bad thing, but I suspect the main reason will be for efficiency's sake; currently on an unclean shutdown, we don't report the blocks that were freed immediately before an unclean shutdown, since that's just a storage leak, not a data loss problem. And if the blocks gets reused immediately afterwards, it's not a big deal at all. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 5, 2009 at 11:48 AM, Theodore Tso <tytso@mit.edu> wrote: > On Thu, Feb 05, 2009 at 10:39:59AM -0500, Ric Wheeler wrote: >> Greg Freemyer wrote: >>> This is just a rant, and I doubt anyone can do anything about it, but >>> it is still worth reading imho. > > It also has absolutely nothing to do with the original thread, which > was block numbers which are far outside the range of valid block > numbers given the size of the block device. :-) > The subject was "return blocks outside filesystem". In a thin-provisioning environment I'd argue that unmapped sectors are "outside the filesystem". :) Unfortunately, I can't get anyone else to see the world from my apparently unique perspective. :( -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 05, 2009 at 05:01:01PM -0500, Greg Freemyer wrote: > > It also has absolutely nothing to do with the original thread, which > > was block numbers which are far outside the range of valid block > > numbers given the size of the block device. :-) > > The subject was "return blocks outside filesystem". Yes, it's clear you didn't read the e-mail thread, but rather just keyed off the subject line. :-) > In a thin-provisioning environment I'd argue that unmapped sectors are > "outside the filesystem". :) > > Unfortunately, I can't get anyone else to see the world from my > apparently unique perspective. :( If you don't like this, don't use thin-provisioned devices. Again, I don't see the likely scenario where your fears are likely to be a factor in a real world scenario. If there are bugs in the thin-provisioned devices, people shouldn't use them. Given that we are conservative about when we tell thin-provisioned devices that blocks are no longer in use (i.e., on journal commits, and if we crash, just don't tell the device the blocks can be reused), what's the problem that you're worried about? How does it occur in real life? It's hard to defend against a theoretical problem when you only give vague fears about how it might be triggered... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso <tytso@mit.edu> writes: > On Thu, Feb 05, 2009 at 05:01:01PM -0500, Greg Freemyer wrote: >> > It also has absolutely nothing to do with the original thread, which >> > was block numbers which are far outside the range of valid block >> > numbers given the size of the block device. :-) >> >> The subject was "return blocks outside filesystem". > > Yes, it's clear you didn't read the e-mail thread, but rather just > keyed off the subject line. :-) > >> In a thin-provisioning environment I'd argue that unmapped sectors are >> "outside the filesystem". :) >> >> Unfortunately, I can't get anyone else to see the world from my >> apparently unique perspective. :( > > If you don't like this, don't use thin-provisioned devices. Again, I > don't see the likely scenario where your fears are likely to be a > factor in a real world scenario. If there are bugs in the There will be bugs. > thin-provisioned devices, people shouldn't use them. Given that we And people will still use them. Assuming that storage boxes work perfectly is just ignoring reality. Even if the software has no bugs there will still be hardware failures. Given enough boxes there will be multi-bit toggles with correct ECC sum in ram or on disks. Power and battery backups will fail mid update and and and. > are conservative about when we tell thin-provisioned devices that > blocks are no longer in use (i.e., on journal commits, and if we > crash, just don't tell the device the blocks can be reused), what's > the problem that you're worried about? How does it occur in real > life? > > It's hard to defend against a theoretical problem when you only give > vague fears about how it might be triggered... > > - Ted I see the following scenario: 1) The filesystem / thin-provision gets corrupted somehow. fs bug, hardware, whatever. 2) The thin-provision thinks a block is free while the FS thinks it is in use. Make it a meta data block so it really matters. 3) The thin-provision still has the mapping and data of the block and hasn't reused the block yet. On read the device will return the correct data as long as the block is not reused. This seems to be a valid implementation for a thin-provision device. 4) fsck will find no error but future writes will reuse the block on the thin-provision device overwriting the data and causing catastrophic FS corruption. So I think a fsck pass to check FS used blocks against hardware used blocks is essential if the FS does support thin-provisioned devices. Once you free hardware blocks you have to check that what the FS and hardware think are compatible. MfG Goswin -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Feb 07, 2009 at 02:27:31PM +0100, Goswin von Brederlow wrote: > I see the following scenario: > > 1) The filesystem / thin-provision gets corrupted somehow. fs bug, > hardware, whatever. > > 2) The thin-provision thinks a block is free while the FS thinks it is > in use. Make it a meta data block so it really matters. > > 3) The thin-provision still has the mapping and data of the block and > hasn't reused the block yet. On read the device will return the > correct data as long as the block is not reused. This seems to be a > valid implementation for a thin-provision device. That's highly unlikely, actually. Once you tell the thin-provisioning device that the block is not in use, they will delete the mapping from their mapping structures. So it's highly unlikely you will be able to recover once you send the TRIM command. > 4) fsck will find no error but future writes will reuse the block on > the thin-provision device overwriting the data and causing > catastrophic FS corruption. The way this can happen today is if the bitmap block gets corrupted, and so a block which is in use gets used by another inode. So now you have a filesystem block overwritten by a data block from an inode --- so you have potentially catastrophic FS corruption, even before you issue the ATA TRIM command. This can happen to day, and in practice, it is extremely rare. So permit me for being highly dubious about your claim this is going to happen more often with thin-provisioned devices. > So I think a fsck pass to check FS used blocks against hardware used > blocks is essential if the FS does support thin-provisioned devices. The filesystem might not even know whether or not a thin-provisioned device is in use. The OS may not even know whether the device is thin-provisioned. So ultiamtely, it's not up to the FS... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Sat, Feb 07, 2009 at 02:27:31PM +0100, Goswin von Brederlow wrote: > >> I see the following scenario: >> >> 1) The filesystem / thin-provision gets corrupted somehow. fs bug, >> hardware, whatever. >> >> 2) The thin-provision thinks a block is free while the FS thinks it is >> in use. Make it a meta data block so it really matters. >> >> 3) The thin-provision still has the mapping and data of the block and >> hasn't reused the block yet. On read the device will return the >> correct data as long as the block is not reused. This seems to be a >> valid implementation for a thin-provision device. >> > > That's highly unlikely, actually. Once you tell the thin-provisioning > device that the block is not in use, they will delete the mapping from > their mapping structures. So it's highly unlikely you will be able to > recover once you send the TRIM command. > For SCSI, that is actually not unlikely since the spec does not require you to actually do anything with the command - they can simply be ignored, so the original data will stay there unchanged. If it is unmapped (in SCSI speak), and you read that sector, the storage device must return consistent contents for each subsequent read. Ric > >> 4) fsck will find no error but future writes will reuse the block on >> the thin-provision device overwriting the data and causing >> catastrophic FS corruption. >> > > The way this can happen today is if the bitmap block gets corrupted, > and so a block which is in use gets used by another inode. So now you > have a filesystem block overwritten by a data block from an inode --- > so you have potentially catastrophic FS corruption, even before you > issue the ATA TRIM command. This can happen to day, and in practice, > it is extremely rare. So permit me for being highly dubious about > your claim this is going to happen more often with thin-provisioned > devices. > > >> So I think a fsck pass to check FS used blocks against hardware used >> blocks is essential if the FS does support thin-provisioned devices. >> > > The filesystem might not even know whether or not a thin-provisioned > device is in use. The OS may not even know whether the device is > thin-provisioned. So ultiamtely, it's not up to the FS... > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--- ../download/linux-2.6.29-rc2-vanilla/fs/ext4/extents.c 2009-02-05 12:31:19.000000000 +0100 +++ fs/ext4/extents.c 2009-02-05 12:42:49.000000000 +0100 @@ -595,8 +595,15 @@ /* find extent */ ext4_ext_binsearch(inode, path + ppos, block); /* if not an empty leaf */ - if (path[ppos].p_ext) + if (path[ppos].p_ext) { path[ppos].p_block = ext_pblock(path[ppos].p_ext); + if (path[ppos].p_block < EXT4_SB(inode->i_sb)->s_es->s_first_data_block + || path[ppos].p_block + path[ppos].p_ext->ee_len + >= ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es)) { + printk("ext4_ext_find_extent: extent out of range\n"); + goto err; + } + } ext4_ext_show_path(inode, path);