Message ID | 20090428093145.GA13719@skywalker |
---|---|
State | Superseded, archived |
Headers | show |
On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote: > > Looking at the source again i guess setting just b_dev is not enough. > unmap_underlying_metadata looks at the mapping block number, which we > don't have in case on unwritten buffer_head. How about the below patch ? > It involve vfs changes. But i guess it is correct with respect to the > meaning of BH_New (Disk mapping was newly created by get_block). I guess > BH_New implies BH_Mapped. Argh. So we have multiple problems going on here. One is the original problem, namely that of a partial write into an preallocated block can leave garbage behind in that unitialized block. The other problem seems to be in the case of a delayed allocation write, where we return a buffer_head which is marked new, and this causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). In theory this could cause problems if we try installing a new bootloader in the filesystem's boot block while there's a delayed writes happening in the background, since we could end up discarding the write to the boot sector. We've lived with this for quite a wihle though. My concern with making the fs/buffer.c changes is that we need to make sure it doesn't break any of the other filesystems, so that's going to make it hard to try to slip this with 2.6.30-rc4 nearly upon us. (Silly question; why doesn't XFS get caught by this?) So the question is do we try to fix both bugs with one patch, and very likely have to wait until 2.6.31 before the patch is incorporated? Or do we fix the second bug using an ext4-only fix, with the knowledge that post 2.6.30, we'll need undo most of it and fix it properly with a change that involves fs/buffer.c? My preference is for the former, unless we belive the 2nd bug is serious enough that we really need to address it ASAP (in which case we have a lot of work ahead of us in terms of coordinating with the other filesystem developers). What do other folks think? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote: > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote: > > > > Looking at the source again i guess setting just b_dev is not enough. > > unmap_underlying_metadata looks at the mapping block number, which we > > don't have in case on unwritten buffer_head. How about the below patch ? > > It involve vfs changes. But i guess it is correct with respect to the > > meaning of BH_New (Disk mapping was newly created by get_block). I guess > > BH_New implies BH_Mapped. > > Argh. So we have multiple problems going on here. One is the > original problem, namely that of a partial write into an preallocated > block can leave garbage behind in that unitialized block. > > The other problem seems to be in the case of a delayed allocation > write, where we return a buffer_head which is marked new, and this > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). Not just that. On block allocation we are not calling unmap_underlying_metadata(dev, blocknumber) for delayed allocated blocks. That would imply file corruption. > > In theory this could cause problems if we try installing a new > bootloader in the filesystem's boot block while there's a delayed > writes happening in the background, since we could end up discarding > the write to the boot sector. We've lived with this for quite a wihle > though. > > My concern with making the fs/buffer.c changes is that we need to make > sure it doesn't break any of the other filesystems, so that's going to > make it hard to try to slip this with 2.6.30-rc4 nearly upon us. > (Silly question; why doesn't XFS get caught by this?) > > So the question is do we try to fix both bugs with one patch, and very > likely have to wait until 2.6.31 before the patch is incorporated? Or > do we fix the second bug using an ext4-only fix, with the knowledge > that post 2.6.30, we'll need undo most of it and fix it properly with > a change that involves fs/buffer.c? > > My preference is for the former, unless we belive the 2nd bug is > serious enough that we really need to address it ASAP (in which case > we have a lot of work ahead of us in terms of coordinating with the > other filesystem developers). What do other folks think? The original reported problem is something really easy to reproduce. So i guess if we can have a ext4 local change that would fix the original problem that would be good. Considering that map_bh(bdev, 0) didn't create any issues till now, what we can do is to do a similar update for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version of the patch with the below addition bh_result->b_blocknr = 0; -aneesh -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote: >> Looking at the source again i guess setting just b_dev is not enough. >> unmap_underlying_metadata looks at the mapping block number, which we >> don't have in case on unwritten buffer_head. How about the below patch ? >> It involve vfs changes. But i guess it is correct with respect to the >> meaning of BH_New (Disk mapping was newly created by get_block). I guess >> BH_New implies BH_Mapped. > > Argh. So we have multiple problems going on here. One is the > original problem, namely that of a partial write into an preallocated > block can leave garbage behind in that unitialized block. > > The other problem seems to be in the case of a delayed allocation > write, where we return a buffer_head which is marked new, and this > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). > > In theory this could cause problems if we try installing a new > bootloader in the filesystem's boot block while there's a delayed > writes happening in the background, since we could end up discarding > the write to the boot sector. We've lived with this for quite a wihle > though. > > My concern with making the fs/buffer.c changes is that we need to make > sure it doesn't break any of the other filesystems, so that's going to > make it hard to try to slip this with 2.6.30-rc4 nearly upon us. > (Silly question; why doesn't XFS get caught by this?) I'm not sure offhand. All xfs does is this in the get_block path: * With sub-block writes into unwritten extents we also need to mark * the buffer as new so that the unwritten parts of the buffer gets * correctly zeroed. */ if (create && ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) || (offset >= i_size_read(inode)) || (iomap.iomap_flags & (IOMAP_NEW|IOMAP_UNWRITTEN)))) set_buffer_new(bh_result); so it returns with BH_New as well. > So the question is do we try to fix both bugs with one patch, and very > likely have to wait until 2.6.31 before the patch is incorporated? Or > do we fix the second bug using an ext4-only fix, with the knowledge > that post 2.6.30, we'll need undo most of it and fix it properly with > a change that involves fs/buffer.c? I have the sense that this might need a bit more digging around, and I finally got stuff out of the way to do so :) -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote: > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote: > > > > > > Looking at the source again i guess setting just b_dev is not enough. > > > unmap_underlying_metadata looks at the mapping block number, which we > > > don't have in case on unwritten buffer_head. How about the below patch ? > > > It involve vfs changes. But i guess it is correct with respect to the > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess > > > BH_New implies BH_Mapped. > > > > Argh. So we have multiple problems going on here. One is the > > original problem, namely that of a partial write into an preallocated > > block can leave garbage behind in that unitialized block. > > > > The other problem seems to be in the case of a delayed allocation > > write, where we return a buffer_head which is marked new, and this > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). > > Not just that. On block allocation we are not calling > unmap_underlying_metadata(dev, blocknumber) for delayed allocated > blocks. That would imply file corruption. I don't think I'm following you . If we write into block that was delayed allocated. Are you saying we might get in trouble of the delayed allocation block is mmap'ed in? > The original reported problem is something really easy to reproduce. So > i guess if we can have a ext4 local change that would fix the original > problem that would be good. Considering that map_bh(bdev, 0) didn't > create any issues till now, what we can do is to do a similar update > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version > of the patch with the below addition > bh_result->b_blocknr = 0; OK, I can put togehter a patch to do this. Whatever we do, I think we're going to need a *lot* of testing. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote: > On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: > > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote: > > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote: > > > > > > > > Looking at the source again i guess setting just b_dev is not enough. > > > > unmap_underlying_metadata looks at the mapping block number, which we > > > > don't have in case on unwritten buffer_head. How about the below patch ? > > > > It involve vfs changes. But i guess it is correct with respect to the > > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess > > > > BH_New implies BH_Mapped. > > > > > > Argh. So we have multiple problems going on here. One is the > > > original problem, namely that of a partial write into an preallocated > > > block can leave garbage behind in that unitialized block. > > > > > > The other problem seems to be in the case of a delayed allocation > > > write, where we return a buffer_head which is marked new, and this > > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). > > > > Not just that. On block allocation we are not calling > > unmap_underlying_metadata(dev, blocknumber) for delayed allocated > > blocks. That would imply file corruption. > > I don't think I'm following you . If we write into block that was > delayed allocated. Are you saying we might get in trouble of the > delayed allocation block is mmap'ed in? We allocate blocks for delayed buffer during writepage. Now we need to make sure after getting the blocks we drop the old buffer_head mapping that we may have with this particular block attached to the block device. That is done by calling unmap_underlying_metadata. Now the current code doesn't call unmap_underlying_metadata for delayed allocated blocks. That would mean we can see corrupt files if old buffer_head mapping gets synced to disk AFTER we write the new buffer_head mapping. > > > The original reported problem is something really easy to reproduce. So > > i guess if we can have a ext4 local change that would fix the original > > problem that would be good. Considering that map_bh(bdev, 0) didn't > > create any issues till now, what we can do is to do a similar update > > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version > > of the patch with the below addition > > bh_result->b_blocknr = 0; > > OK, I can put togehter a patch to do this. Whatever we do, I think > we're going to need a *lot* of testing. I just sent -v3 version -aneesh -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V wrote: > On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote: >> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: ... >>>> The other problem seems to be in the case of a delayed allocation >>>> write, where we return a buffer_head which is marked new, and this >>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). >>> Not just that. On block allocation we are not calling >>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated >>> blocks. That would imply file corruption. >> I don't think I'm following you . If we write into block that was >> delayed allocated. Are you saying we might get in trouble of the >> delayed allocation block is mmap'ed in? > > We allocate blocks for delayed buffer during writepage. Now we need to > make sure after getting the blocks we drop the old buffer_head mapping > that we may have with this particular block attached to the block > device. That is done by calling unmap_underlying_metadata. Now the > current code doesn't call unmap_underlying_metadata for delayed > allocated blocks. That would mean we can see corrupt files if old > buffer_head mapping gets synced to disk AFTER we write the new > buffer_head mapping. Talking w/ Aneesh on IRC, I don't see how we can have stray dirty mappings lying around for this block device unless someone is writing directly to the mounted block device, which I don't think is ever considered safe ... I'm not quite sure what the call to __unmap_underlying_blocks() in mpage_da_map_blocks() is for, I guess? -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-04-28 at 13:00 -0400, Theodore Tso wrote: > On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: > > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote: > > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote: > > > > > > > > Looking at the source again i guess setting just b_dev is not enough. > > > > unmap_underlying_metadata looks at the mapping block number, which we > > > > don't have in case on unwritten buffer_head. How about the below patch ? > > > > It involve vfs changes. But i guess it is correct with respect to the > > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess > > > > BH_New implies BH_Mapped. > > > > > > Argh. So we have multiple problems going on here. One is the > > > original problem, namely that of a partial write into an preallocated > > > block can leave garbage behind in that unitialized block. > > > > > > The other problem seems to be in the case of a delayed allocation > > > write, where we return a buffer_head which is marked new, and this > > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). > > > > Not just that. On block allocation we are not calling > > unmap_underlying_metadata(dev, blocknumber) for delayed allocated > > blocks. That would imply file corruption. > > I don't think I'm following you . If we write into block that was > delayed allocated. Are you saying we might get in trouble of the > delayed allocation block is mmap'ed in? > > > The original reported problem is something really easy to reproduce. So > > i guess if we can have a ext4 local change that would fix the original > > problem that would be good. Considering that map_bh(bdev, 0) didn't > > create any issues till now, what we can do is to do a similar update > > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version > > of the patch with the below addition > > bh_result->b_blocknr = 0; > > OK, I can put togehter a patch to do this. Whatever we do, I think > we're going to need a *lot* of testing. > > - Ted Aneesh, Eric and I discussed this online today, we find a separate issue, the lookup on the preallocated extent doesn't set the buffer_mapped(), so loop up/write to the same preallocated block multiple times (e.g. write 1 byte at a time, for 10 bytes total) will end up calling ext4_get_blocks_wrap() multiple times. It seems reasonable to set the buffer mapped for preallocated buffer, with blocknr set to the real mapped block number (rather than faked -1 for the buffer blocknr in the V3 proposed fix for partial write garbage issue), and later reply on unwritten flag to force the writepage()/mpage_da_map_blocks calls get_block() to do the unintialized extent split. But this change seems require more thoughts and heavy auditing, and not as urgency as the data corruption problem. Mingming -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Aneesh Kumar K.V wrote: > > On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote: > >> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: > ... > >>>> The other problem seems to be in the case of a delayed allocation > >>>> write, where we return a buffer_head which is marked new, and this > >>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). > >>> Not just that. On block allocation we are not calling > >>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated > >>> blocks. That would imply file corruption. > >> I don't think I'm following you . If we write into block that was > >> delayed allocated. Are you saying we might get in trouble of the > >> delayed allocation block is mmap'ed in? > > > > We allocate blocks for delayed buffer during writepage. Now we need to > > make sure after getting the blocks we drop the old buffer_head mapping > > that we may have with this particular block attached to the block > > device. That is done by calling unmap_underlying_metadata. Now the > > current code doesn't call unmap_underlying_metadata for delayed > > allocated blocks. That would mean we can see corrupt files if old > > buffer_head mapping gets synced to disk AFTER we write the new > > buffer_head mapping. > > > Talking w/ Aneesh on IRC, I don't see how we can have stray dirty > mappings lying around for this block device unless someone is writing > directly to the mounted block device, which I don't think is ever > considered safe ... > > I'm not quite sure what the call to __unmap_underlying_blocks() in > mpage_da_map_blocks() is for, I guess? For ext3 / ext4 I think we don't need unmap_underlying_blocks() since before we reallocate a block, we make sure that the transaction freeing the block is committed and clear all dirty bits from freed blocks. But for more careless filesystems, if they reallocate metadata block as a data block and don't clear the dirty bit in blockdev mapping, unmap_underlying_blocks() does it for them. Honza
Jan Kara wrote: >> Aneesh Kumar K.V wrote: >>> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote: >>>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: >> ... >>>>>> The other problem seems to be in the case of a delayed allocation >>>>>> write, where we return a buffer_head which is marked new, and this >>>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). >>>>> Not just that. On block allocation we are not calling >>>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated >>>>> blocks. That would imply file corruption. >>>> I don't think I'm following you . If we write into block that was >>>> delayed allocated. Are you saying we might get in trouble of the >>>> delayed allocation block is mmap'ed in? >>> We allocate blocks for delayed buffer during writepage. Now we need to >>> make sure after getting the blocks we drop the old buffer_head mapping >>> that we may have with this particular block attached to the block >>> device. That is done by calling unmap_underlying_metadata. Now the >>> current code doesn't call unmap_underlying_metadata for delayed >>> allocated blocks. That would mean we can see corrupt files if old >>> buffer_head mapping gets synced to disk AFTER we write the new >>> buffer_head mapping. >> >> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty >> mappings lying around for this block device unless someone is writing >> directly to the mounted block device, which I don't think is ever >> considered safe ... >> >> I'm not quite sure what the call to __unmap_underlying_blocks() in >> mpage_da_map_blocks() is for, I guess? > For ext3 / ext4 I think we don't need unmap_underlying_blocks() since > before we reallocate a block, we make sure that the transaction freeing > the block is committed and clear all dirty bits from freed blocks. > But for more careless filesystems, if they reallocate metadata block > as a data block and don't clear the dirty bit in blockdev mapping, > unmap_underlying_blocks() does it for them. That's what I thought - so I was wondering why we have specific calls to this in ext4: mpage_da_map_blocks __unmap_underlying_blocks for (i = 0; i < blocks; i++) unmap_underlying_metadata -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 29-04-09 09:08:05, Eric Sandeen wrote: > Jan Kara wrote: > >> Aneesh Kumar K.V wrote: > >>> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote: > >>>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote: > >> ... > >>>>>> The other problem seems to be in the case of a delayed allocation > >>>>>> write, where we return a buffer_head which is marked new, and this > >>>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0). > >>>>> Not just that. On block allocation we are not calling > >>>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated > >>>>> blocks. That would imply file corruption. > >>>> I don't think I'm following you . If we write into block that was > >>>> delayed allocated. Are you saying we might get in trouble of the > >>>> delayed allocation block is mmap'ed in? > >>> We allocate blocks for delayed buffer during writepage. Now we need to > >>> make sure after getting the blocks we drop the old buffer_head mapping > >>> that we may have with this particular block attached to the block > >>> device. That is done by calling unmap_underlying_metadata. Now the > >>> current code doesn't call unmap_underlying_metadata for delayed > >>> allocated blocks. That would mean we can see corrupt files if old > >>> buffer_head mapping gets synced to disk AFTER we write the new > >>> buffer_head mapping. > >> > >> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty > >> mappings lying around for this block device unless someone is writing > >> directly to the mounted block device, which I don't think is ever > >> considered safe ... > >> > >> I'm not quite sure what the call to __unmap_underlying_blocks() in > >> mpage_da_map_blocks() is for, I guess? > > For ext3 / ext4 I think we don't need unmap_underlying_blocks() since > > before we reallocate a block, we make sure that the transaction freeing > > the block is committed and clear all dirty bits from freed blocks. > > But for more careless filesystems, if they reallocate metadata block > > as a data block and don't clear the dirty bit in blockdev mapping, > > unmap_underlying_blocks() does it for them. > > That's what I thought - so I was wondering why we have specific calls to > this in ext4: > > mpage_da_map_blocks > __unmap_underlying_blocks > for (i = 0; i < blocks; i++) > unmap_underlying_metadata Hmm, OK. So maybe change it warn on dirty blockdev buffer and if the warning does not trigger we can believe that our theory is right ;). Honza
diff --git a/fs/buffer.c b/fs/buffer.c index b3e5be7..13f0d52 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1867,15 +1867,22 @@ static int __block_prepare_write(struct inode *inode, struct page *page, err = get_block(inode, block, bh, 1); if (err) break; - if (buffer_new(bh)) { + if (buffer_new(bh)) unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); + if (buffer_new(bh) || buffer_unwritten(bh) || + buffer_delay(bh)) { if (PageUptodate(page)) { clear_buffer_new(bh); set_buffer_uptodate(bh); mark_buffer_dirty(bh); continue; } + /* + * sub-block writes into unwritten or + * delayed buffer should result in zero out + * of the rest of the buffer + */ if (block_end > to || block_start < from) zero_user_segments(page, to, block_end, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e91f978..504afb7 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1892,13 +1892,17 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical, if (buffer_delay(bh)) { bh->b_blocknr = pblock; clear_buffer_delay(bh); + set_buffer_mapped(bh); bh->b_bdev = inode->i_sb->s_bdev; + unmap_underlying_metadata(bh->b_bdev, + pblock); } else if (buffer_unwritten(bh)) { bh->b_blocknr = pblock; clear_buffer_unwritten(bh); set_buffer_mapped(bh); - set_buffer_new(bh); bh->b_bdev = inode->i_sb->s_bdev; + unmap_underlying_metadata(bh->b_bdev, + pblock); } else if (buffer_mapped(bh)) BUG_ON(bh->b_blocknr != pblock); @@ -2318,8 +2322,6 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock, /* not enough space to reserve */ return ret; - map_bh(bh_result, inode->i_sb, 0); - set_buffer_new(bh_result); set_buffer_delay(bh_result); } else if (ret > 0) { bh_result->b_size = (ret << inode->i_blkbits);