[RFC] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

Message ID 20090428093145.GA13719@skywalker
State Superseded, archived

Commit Message

Aneesh Kumar K.V April 28, 2009, 9:31 a.m. UTC
On Tue, Apr 28, 2009 at 09:50:49AM +0530, Aneesh Kumar K.V wrote:
> On Mon, Apr 27, 2009 at 04:04:54PM -0700, Mingming Cao wrote:
> .....
> 
> > 
> > Index: linux-2.6.28-rc6/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.28-rc6.orig/fs/ext4/inode.c	2009-03-12 10:21:05.000000000 -0700
> > +++ linux-2.6.28-rc6/fs/ext4/inode.c	2009-04-27 14:35:21.000000000 -0700
> > @@ -2177,7 +2177,10 @@ static int ext4_da_get_block_prep(struct
> >  		set_buffer_new(bh_result);
> >  		set_buffer_delay(bh_result);
> >  	} else if (ret > 0) {
> > +		if (buffer_unwritten(bh_result))
> > +			set_buffer_new(bh_result);
> >  		bh_result->b_size = (ret << inode->i_blkbits);
> > +		bh_result->b_bdev = inode->i_sb->s_bdev;
> 
> 
> Updated patch to set bh_result->b_bdev. I also added comments in the
> source to explain why we need to mark buffer_head new. Also updated
> single line patch summary. I will send the update (-v2) patch.

Looking at the source again, I guess setting just b_bdev is not enough.
unmap_underlying_metadata() looks at the mapped block number, which we
don't have in the case of an unwritten buffer_head. How about the patch
below? It involves VFS changes, but I guess it is correct with respect
to the meaning of BH_New (disk mapping was newly created by get_block).
I guess BH_New implies BH_Mapped.

I haven't tested the patch yet. Also, it should be split into multiple
patches. It also fixes a problem where we missed an
unmap_underlying_metadata() call for delayed allocated blocks. I guess
that can also cause corruption with delayed allocation.


From: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Subject: [PATCH -V3] ext4: Fix sub-block zeroing for buffered writes into unwritten extents.

We need to mark the buffer_head mapping preallocated space
as new during write_begin. Otherwise we don't zero out the
page cache content properly for a partial write. This will
cause file corruption with preallocation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

---
 fs/buffer.c     |    9 ++++++++-
 fs/ext4/inode.c |    8 +++++---
 2 files changed, 13 insertions(+), 4 deletions(-)
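
Editorial illustration (not part of the patch): the commit message above says that a buffer mapping preallocated space must be treated as new so that the untouched part of the block is zeroed on a partial write. The user-space sketch below models only that idea. Every toy_* name and the tiny block size are assumptions made for the example; none of this is kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16 /* toy value; real filesystem blocks are much larger */

/*
 * Toy model of a partial write into one page-cache block.
 * is_new stands in for the effect of BH_New: when set, the bytes of the
 * block outside [from, to) are zeroed before the caller's data is copied.
 */
static void toy_partial_write(unsigned char *block, int is_new,
                              size_t from, size_t to,
                              const unsigned char *data)
{
    assert(from <= to && to <= TOY_BLOCK_SIZE);
    if (is_new) {
        memset(block, 0, from);                      /* head: [0, from) */
        memset(block + to, 0, TOY_BLOCK_SIZE - to);  /* tail: [to, end) */
    }
    memcpy(block + from, data, to - from);
}

/* Returns 1 if every byte outside [from, to) is zero, 0 otherwise. */
static int toy_rest_is_zero(const unsigned char *block, size_t from, size_t to)
{
    for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
        if ((i < from || i >= to) && block[i] != 0)
            return 0;
    return 1;
}
```

If the block starts out holding stale bytes and is_new is 0, the stale bytes survive around the written range, which is the garbage the bug report describes. With is_new set to 1 the rest of the block reads back as zeros.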

Comments

Theodore Ts'o April 28, 2009, 12:48 p.m. UTC | #1
On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> 
> Looking at the source again i guess setting just b_dev is not enough.
> unmap_underlying_metadata looks at the mapping block number, which we
> don't have in case on unwritten buffer_head. How about the below patch ?
> It involve vfs changes. But i guess it is correct with respect to the
> meaning of BH_New (Disk mapping was newly created by get_block). I guess
> BH_New implies BH_Mapped.

Argh.  So we have multiple problems going on here.  One is the
original problem, namely that of a partial write into a preallocated
block can leave garbage behind in that uninitialized block.

The other problem seems to be in the case of a delayed allocation
write, where we return a buffer_head which is marked new, and this
causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).

In theory this could cause problems if we try installing a new
bootloader in the filesystem's boot block while there are delayed
writes happening in the background, since we could end up discarding
the write to the boot sector.  We've lived with this for quite a while
though.

My concern with making the fs/buffer.c changes is that we need to make
sure it doesn't break any of the other filesystems, so that's going to
make it hard to try to slip this with 2.6.30-rc4 nearly upon us.
(Silly question; why doesn't XFS get caught by this?) 

So the question is do we try to fix both bugs with one patch, and very
likely have to wait until 2.6.31 before the patch is incorporated?  Or
do we fix the second bug using an ext4-only fix, with the knowledge
that post 2.6.30, we'll need to undo most of it and fix it properly with
a change that involves fs/buffer.c?

My preference is for the former, unless we believe the 2nd bug is
serious enough that we really need to address it ASAP (in which case
we have a lot of work ahead of us in terms of coordinating with the
other filesystem developers).   What do other folks think?

				- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V April 28, 2009, 4:35 p.m. UTC | #2
On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > 
> > Looking at the source again i guess setting just b_dev is not enough.
> > unmap_underlying_metadata looks at the mapping block number, which we
> > don't have in case on unwritten buffer_head. How about the below patch ?
> > It involve vfs changes. But i guess it is correct with respect to the
> > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > BH_New implies BH_Mapped.
> 
> Argh.  So we have multiple problems going on here.  One is the
> original problem, namely that of a partial write into an preallocated
> block can leave garbage behind in that unitialized block.
> 
> The other problem seems to be in the case of a delayed allocation
> write, where we return a buffer_head which is marked new, and this
> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).

Not just that. On block allocation we are not calling
unmap_underlying_metadata(dev, blocknumber) for delayed allocated
blocks. That would imply file corruption.

> 
> In theory this could cause problems if we try installing a new
> bootloader in the filesystem's boot block while there's a delayed
> writes happening in the background, since we could end up discarding
> the write to the boot sector.  We've lived with this for quite a wihle
> though.
> 
> My concern with making the fs/buffer.c changes is that we need to make
> sure it doesn't break any of the other filesystems, so that's going to
> make it hard to try to slip this with 2.6.30-rc4 nearly upon us.
> (Silly question; why doesn't XFS get caught by this?) 
> 
> So the question is do we try to fix both bugs with one patch, and very
> likely have to wait until 2.6.31 before the patch is incorporated?  Or
> do we fix the second bug using an ext4-only fix, with the knowledge
> that post 2.6.30, we'll need undo most of it and fix it properly with
> a change that involves fs/buffer.c?
> 
> My preference is for the former, unless we belive the 2nd bug is
> serious enough that we really need to address it ASAP (in which case
> we have a lot of work ahead of us in terms of coordinating with the
> other filesystem developers).   What do other folks think?

The original reported problem is really easy to reproduce, so
I guess an ext4-local change that fixes the original problem would be
good. Considering that map_bh(bdev, 0) hasn't created any issues until
now, what we can do is a similar update for the unwritten buffer in
ext4_da_get_block_prep. That's the v2 version of the patch with the
addition below:
	bh_result->b_blocknr = 0;

-aneesh
Eric Sandeen April 28, 2009, 4:37 p.m. UTC | #3
Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
>> Looking at the source again i guess setting just b_dev is not enough.
>> unmap_underlying_metadata looks at the mapping block number, which we
>> don't have in case on unwritten buffer_head. How about the below patch ?
>> It involve vfs changes. But i guess it is correct with respect to the
>> meaning of BH_New (Disk mapping was newly created by get_block). I guess
>> BH_New implies BH_Mapped.
> 
> Argh.  So we have multiple problems going on here.  One is the
> original problem, namely that of a partial write into an preallocated
> block can leave garbage behind in that unitialized block.
> 
> The other problem seems to be in the case of a delayed allocation
> write, where we return a buffer_head which is marked new, and this
> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> 
> In theory this could cause problems if we try installing a new
> bootloader in the filesystem's boot block while there's a delayed
> writes happening in the background, since we could end up discarding
> the write to the boot sector.  We've lived with this for quite a wihle
> though.
> 
> My concern with making the fs/buffer.c changes is that we need to make
> sure it doesn't break any of the other filesystems, so that's going to
> make it hard to try to slip this with 2.6.30-rc4 nearly upon us.
> (Silly question; why doesn't XFS get caught by this?) 

I'm not sure offhand.  All xfs does is this in the get_block path:

        /*
         * With sub-block writes into unwritten extents we also need to mark
         * the buffer as new so that the unwritten parts of the buffer gets
         * correctly zeroed.
         */
        if (create &&
            ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
             (offset >= i_size_read(inode)) ||
             (iomap.iomap_flags & (IOMAP_NEW|IOMAP_UNWRITTEN))))
                set_buffer_new(bh_result);

so it returns with BH_New as well.

> So the question is do we try to fix both bugs with one patch, and very
> likely have to wait until 2.6.31 before the patch is incorporated?  Or
> do we fix the second bug using an ext4-only fix, with the knowledge
> that post 2.6.30, we'll need undo most of it and fix it properly with
> a change that involves fs/buffer.c?

I have the sense that this might need a bit more digging around, and I
finally got stuff out of the way to do so :)

-Eric

Theodore Ts'o April 28, 2009, 5 p.m. UTC | #4
On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > > 
> > > Looking at the source again i guess setting just b_dev is not enough.
> > > unmap_underlying_metadata looks at the mapping block number, which we
> > > don't have in case on unwritten buffer_head. How about the below patch ?
> > > It involve vfs changes. But i guess it is correct with respect to the
> > > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > > BH_New implies BH_Mapped.
> > 
> > Argh.  So we have multiple problems going on here.  One is the
> > original problem, namely that of a partial write into an preallocated
> > block can leave garbage behind in that unitialized block.
> > 
> > The other problem seems to be in the case of a delayed allocation
> > write, where we return a buffer_head which is marked new, and this
> > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> 
> Not just that. On block allocation we are not calling
> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> blocks. That would imply file corruption.

I don't think I'm following you.  If we write into a block that was
delayed allocated, are you saying we might get in trouble if the
delayed allocation block is mmap'ed in?

> The original reported problem is something really easy to reproduce. So
> i guess if we can have a ext4 local change that would fix the original
> problem that would be good. Considering that map_bh(bdev, 0) didn't
> create any issues till now, what we can do is to do a similar update
> for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
> of the patch with the below addition
> 	bh_result->b_blocknr = 0;

OK, I can put together a patch to do this.  Whatever we do, I think
we're going to need a *lot* of testing.

				- Ted
Aneesh Kumar K.V April 28, 2009, 6:57 p.m. UTC | #5
On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > > > 
> > > > Looking at the source again i guess setting just b_dev is not enough.
> > > > unmap_underlying_metadata looks at the mapping block number, which we
> > > > don't have in case on unwritten buffer_head. How about the below patch ?
> > > > It involve vfs changes. But i guess it is correct with respect to the
> > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > > > BH_New implies BH_Mapped.
> > > 
> > > Argh.  So we have multiple problems going on here.  One is the
> > > original problem, namely that of a partial write into an preallocated
> > > block can leave garbage behind in that unitialized block.
> > > 
> > > The other problem seems to be in the case of a delayed allocation
> > > write, where we return a buffer_head which is marked new, and this
> > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> > 
> > Not just that. On block allocation we are not calling
> > unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> > blocks. That would imply file corruption.
> 
> I don't think I'm following you .  If we write into block that was
> delayed allocated.  Are you saying we might get in trouble of the
> delayed allocation block is mmap'ed in?

We allocate blocks for delayed buffers during writepage. After getting
the blocks, we need to make sure we drop any old buffer_head mapping
for that particular block that may still be attached to the block
device. That is done by calling unmap_underlying_metadata(). The
current code doesn't call unmap_underlying_metadata() for delayed
allocated blocks. That would mean we can see corrupt files if the old
buffer_head mapping gets synced to disk AFTER we write the new
buffer_head mapping.
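
Editorial illustration (not from the original message): the ordering hazard described above can be modelled in user space as a disk plus a block-device cache that may still hold a dirty alias of a block later handed to a file. The sketch is hypothetical throughout; the toy_* names are invented, and toy_unmap_alias() only stands in for the effect attributed to unmap_underlying_metadata(). Whether this situation can actually arise on ext4 is questioned in the replies that follow.

```c
#include <assert.h>
#include <string.h>

#define TOY_NBLOCKS 4
#define TOY_BLKSZ   8

/* Toy disk plus a block-device cache that may hold stale dirty aliases. */
struct toy_dev {
    unsigned char disk[TOY_NBLOCKS][TOY_BLKSZ];
    unsigned char alias[TOY_NBLOCKS][TOY_BLKSZ]; /* blockdev cached copies */
    int alias_dirty[TOY_NBLOCKS];
};

/* Forget the blockdev alias for blk so that it is never written back. */
static void toy_unmap_alias(struct toy_dev *d, int blk)
{
    assert(blk >= 0 && blk < TOY_NBLOCKS);
    d->alias_dirty[blk] = 0;
}

/* Write file data to a block that was just allocated to a file. */
static void toy_write_file_block(struct toy_dev *d, int blk,
                                 unsigned char fill, int unmap_first)
{
    assert(blk >= 0 && blk < TOY_NBLOCKS);
    if (unmap_first)
        toy_unmap_alias(d, blk);
    memset(d->disk[blk], fill, TOY_BLKSZ);
}

/* Later blockdev writeback: flush every alias still marked dirty. */
static void toy_sync_aliases(struct toy_dev *d)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        if (d->alias_dirty[i]) {
            memcpy(d->disk[i], d->alias[i], TOY_BLKSZ);
            d->alias_dirty[i] = 0;
        }
    }
}
```

Without the unmap step, a dirty alias flushed after the file write overwrites the new file data. With it, the alias is dropped first and the file data survives the later sync.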

> 
> > The original reported problem is something really easy to reproduce. So
> > i guess if we can have a ext4 local change that would fix the original
> > problem that would be good. Considering that map_bh(bdev, 0) didn't
> > create any issues till now, what we can do is to do a similar update
> > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
> > of the patch with the below addition
> > 	bh_result->b_blocknr = 0;
> 
> OK, I can put togehter a patch to do this.  Whatever we do, I think
> we're going to need a *lot* of testing.

I just sent the -v3 version.

-aneesh
Eric Sandeen April 28, 2009, 7:35 p.m. UTC | #6
Aneesh Kumar K.V wrote:
> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
...

>>>> The other problem seems to be in the case of a delayed allocation
>>>> write, where we return a buffer_head which is marked new, and this
>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
>>> Not just that. On block allocation we are not calling
>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
>>> blocks. That would imply file corruption.
>> I don't think I'm following you .  If we write into block that was
>> delayed allocated.  Are you saying we might get in trouble of the
>> delayed allocation block is mmap'ed in?
> 
> We allocate blocks for delayed buffer during writepage. Now we need to
> make sure after getting the blocks we drop the old buffer_head mapping
> that we may have with this particular block attached to the block
> device. That is done by calling unmap_underlying_metadata. Now the
> current code doesn't call unmap_underlying_metadata for delayed
> allocated blocks. That would mean we can see corrupt files if old
> buffer_head mapping gets synced to disk AFTER we write the new
> buffer_head mapping.


Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
mappings lying around for this block device unless someone is writing
directly to the mounted block device, which I don't think is ever
considered safe ...

I'm not quite sure what the call to __unmap_underlying_blocks() in
mpage_da_map_blocks() is for, I guess?

-Eric
Mingming Cao April 29, 2009, 1:38 a.m. UTC | #7
On Tue, 2009-04-28 at 13:00 -0400, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > > > 
> > > > Looking at the source again i guess setting just b_dev is not enough.
> > > > unmap_underlying_metadata looks at the mapping block number, which we
> > > > don't have in case on unwritten buffer_head. How about the below patch ?
> > > > It involve vfs changes. But i guess it is correct with respect to the
> > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > > > BH_New implies BH_Mapped.
> > > 
> > > Argh.  So we have multiple problems going on here.  One is the
> > > original problem, namely that of a partial write into an preallocated
> > > block can leave garbage behind in that unitialized block.
> > > 
> > > The other problem seems to be in the case of a delayed allocation
> > > write, where we return a buffer_head which is marked new, and this
> > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> > 
> > Not just that. On block allocation we are not calling
> > unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> > blocks. That would imply file corruption.
> 
> I don't think I'm following you .  If we write into block that was
> delayed allocated.  Are you saying we might get in trouble of the
> delayed allocation block is mmap'ed in?
> 
> > The original reported problem is something really easy to reproduce. So
> > i guess if we can have a ext4 local change that would fix the original
> > problem that would be good. Considering that map_bh(bdev, 0) didn't
> > create any issues till now, what we can do is to do a similar update
> > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
> > of the patch with the below addition
> > 	bh_result->b_blocknr = 0;
> 
> OK, I can put togehter a patch to do this.  Whatever we do, I think
> we're going to need a *lot* of testing.
> 
> 				- Ted

Aneesh, Eric and I discussed this online today and found a separate
issue: the lookup on the preallocated extent doesn't set
buffer_mapped(), so looking up/writing to the same preallocated block
multiple times (e.g. writing 1 byte at a time, for 10 bytes total) will
end up calling ext4_get_blocks_wrap() multiple times.

It seems reasonable to set the buffer mapped for a preallocated buffer,
with blocknr set to the real mapped block number (rather than the faked
-1 for the buffer blocknr in the V3 proposed fix for the partial write
garbage issue), and later rely on the unwritten flag to force
writepage()/mpage_da_map_blocks() to call get_block() to do the
uninitialized extent split. But this change seems to require more
thought and heavy auditing, and is not as urgent as the data corruption
problem.
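
Editorial illustration (not from the original message): the repeated-lookup effect described above can be shown with a counter. The sketch assumes a write path that calls its block lookup whenever the buffer is not marked mapped. All toy_* names and the block number 1234 are hypothetical and exist only for the example.

```c
#include <assert.h>

struct toy_bh {
    int mapped;    /* stands in for BH_Mapped */
    int unwritten; /* stands in for BH_Unwritten */
    long blocknr;  /* physical block number, -1 if unknown */
};

/*
 * Toy lookup of a preallocated (unwritten) extent. When set_mapped is 0
 * the buffer is left unmapped, which is the behaviour described above.
 */
static void toy_get_block(struct toy_bh *bh, int set_mapped, int *calls)
{
    (*calls)++;
    bh->unwritten = 1;
    if (set_mapped) {
        bh->mapped = 1;
        bh->blocknr = 1234; /* arbitrary physical block for the example */
    }
}

/* Write nbytes one byte at a time; return how many lookups were needed. */
static int toy_write_one_byte_at_a_time(int nbytes, int set_mapped)
{
    struct toy_bh bh = { 0, 0, -1 };
    int calls = 0;

    assert(nbytes >= 0);
    for (int i = 0; i < nbytes; i++) {
        if (!bh.mapped)
            toy_get_block(&bh, set_mapped, &calls);
        /* the one-byte copy into the page cache would happen here */
    }
    return calls;
}
```

In this model, ten one-byte writes cost ten lookups when the buffer is never marked mapped, and a single lookup once the first call marks it mapped.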

Mingming

Jan Kara April 29, 2009, 11:57 a.m. UTC | #8
> Aneesh Kumar K.V wrote:
> > On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
> >> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> ...
> >>>> The other problem seems to be in the case of a delayed allocation
> >>>> write, where we return a buffer_head which is marked new, and this
> >>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> >>> Not just that. On block allocation we are not calling
> >>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> >>> blocks. That would imply file corruption.
> >> I don't think I'm following you .  If we write into block that was
> >> delayed allocated.  Are you saying we might get in trouble of the
> >> delayed allocation block is mmap'ed in?
> > 
> > We allocate blocks for delayed buffer during writepage. Now we need to
> > make sure after getting the blocks we drop the old buffer_head mapping
> > that we may have with this particular block attached to the block
> > device. That is done by calling unmap_underlying_metadata. Now the
> > current code doesn't call unmap_underlying_metadata for delayed
> > allocated blocks. That would mean we can see corrupt files if old
> > buffer_head mapping gets synced to disk AFTER we write the new
> > buffer_head mapping.
> 
> 
> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
> mappings lying around for this block device unless someone is writing
> directly to the mounted block device, which I don't think is ever
> considered safe ...
> 
> I'm not quite sure what the call to __unmap_underlying_blocks() in
> mpage_da_map_blocks() is for, I guess?
  For ext3 / ext4 I think we don't need unmap_underlying_blocks() since
before we reallocate a block, we make sure that the transaction freeing
the block is committed and clear all dirty bits from freed blocks.
  But for more careless filesystems, if they reallocate a metadata block
as a data block and don't clear the dirty bit in the blockdev mapping,
unmap_underlying_blocks() does it for them.

								Honza
Eric Sandeen April 29, 2009, 2:08 p.m. UTC | #9
Jan Kara wrote:
>> Aneesh Kumar K.V wrote:
>>> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
>>>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
>> ...
>>>>>> The other problem seems to be in the case of a delayed allocation
>>>>>> write, where we return a buffer_head which is marked new, and this
>>>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
>>>>> Not just that. On block allocation we are not calling
>>>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
>>>>> blocks. That would imply file corruption.
>>>> I don't think I'm following you .  If we write into block that was
>>>> delayed allocated.  Are you saying we might get in trouble of the
>>>> delayed allocation block is mmap'ed in?
>>> We allocate blocks for delayed buffer during writepage. Now we need to
>>> make sure after getting the blocks we drop the old buffer_head mapping
>>> that we may have with this particular block attached to the block
>>> device. That is done by calling unmap_underlying_metadata. Now the
>>> current code doesn't call unmap_underlying_metadata for delayed
>>> allocated blocks. That would mean we can see corrupt files if old
>>> buffer_head mapping gets synced to disk AFTER we write the new
>>> buffer_head mapping.
>>
>> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
>> mappings lying around for this block device unless someone is writing
>> directly to the mounted block device, which I don't think is ever
>> considered safe ...
>>
>> I'm not quite sure what the call to __unmap_underlying_blocks() in
>> mpage_da_map_blocks() is for, I guess?
>   For ext3 / ext4 I think we don't need unmap_underlying_blocks() since
> before we reallocate a block, we make sure that the transaction freeing
> the block is committed and clear all dirty bits from freed blocks.
>   But for more careless filesystems, if they reallocate metadata block
> as a data block and don't clear the dirty bit in blockdev mapping,
> unmap_underlying_blocks() does it for them.

That's what I thought - so I was wondering why we have specific calls to
this in ext4:

mpage_da_map_blocks
	__unmap_underlying_blocks
		for (i = 0; i < blocks; i++)
			unmap_underlying_metadata

-Eric
Jan Kara April 29, 2009, 6:13 p.m. UTC | #10
On Wed 29-04-09 09:08:05, Eric Sandeen wrote:
> Jan Kara wrote:
> >> Aneesh Kumar K.V wrote:
> >>> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
> >>>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> >> ...
> >>>>>> The other problem seems to be in the case of a delayed allocation
> >>>>>> write, where we return a buffer_head which is marked new, and this
> >>>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> >>>>> Not just that. On block allocation we are not calling
> >>>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> >>>>> blocks. That would imply file corruption.
> >>>> I don't think I'm following you .  If we write into block that was
> >>>> delayed allocated.  Are you saying we might get in trouble of the
> >>>> delayed allocation block is mmap'ed in?
> >>> We allocate blocks for delayed buffer during writepage. Now we need to
> >>> make sure after getting the blocks we drop the old buffer_head mapping
> >>> that we may have with this particular block attached to the block
> >>> device. That is done by calling unmap_underlying_metadata. Now the
> >>> current code doesn't call unmap_underlying_metadata for delayed
> >>> allocated blocks. That would mean we can see corrupt files if old
> >>> buffer_head mapping gets synced to disk AFTER we write the new
> >>> buffer_head mapping.
> >>
> >> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
> >> mappings lying around for this block device unless someone is writing
> >> directly to the mounted block device, which I don't think is ever
> >> considered safe ...
> >>
> >> I'm not quite sure what the call to __unmap_underlying_blocks() in
> >> mpage_da_map_blocks() is for, I guess?
> >   For ext3 / ext4 I think we don't need unmap_underlying_blocks() since
> > before we reallocate a block, we make sure that the transaction freeing
> > the block is committed and clear all dirty bits from freed blocks.
> >   But for more careless filesystems, if they reallocate metadata block
> > as a data block and don't clear the dirty bit in blockdev mapping,
> > unmap_underlying_blocks() does it for them.
> 
> That's what I thought - so I was wondering why we have specific calls to
> this in ext4:
> 
> mpage_da_map_blocks
> 	__unmap_underlying_blocks
> 		for (i = 0; i < blocks; i++)
> 			unmap_underlying_metadata
  Hmm, OK. So maybe change it to warn on a dirty blockdev buffer, and if the
warning does not trigger we can believe that our theory is right ;).

									Honza

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..13f0d52 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1867,15 +1867,22 @@  static int __block_prepare_write(struct inode *inode, struct page *page,
 			err = get_block(inode, block, bh, 1);
 			if (err)
 				break;
-			if (buffer_new(bh)) {
+			if (buffer_new(bh))
 				unmap_underlying_metadata(bh->b_bdev,
 							bh->b_blocknr);
+			if (buffer_new(bh) || buffer_unwritten(bh) ||
+					buffer_delay(bh)) {
 				if (PageUptodate(page)) {
 					clear_buffer_new(bh);
 					set_buffer_uptodate(bh);
 					mark_buffer_dirty(bh);
 					continue;
 				}
+				/*
+				 * sub-block writes into unwritten or
+				 * delayed buffer should result in zero out
+				 * of the rest of the buffer
+				 */
 				if (block_end > to || block_start < from)
 					zero_user_segments(page,
 						to, block_end,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e91f978..504afb7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1892,13 +1892,17 @@  static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical,
 				if (buffer_delay(bh)) {
 					bh->b_blocknr = pblock;
 					clear_buffer_delay(bh);
+					set_buffer_mapped(bh);
 					bh->b_bdev = inode->i_sb->s_bdev;
+					unmap_underlying_metadata(bh->b_bdev,
+								pblock);
 				} else if (buffer_unwritten(bh)) {
 					bh->b_blocknr = pblock;
 					clear_buffer_unwritten(bh);
 					set_buffer_mapped(bh);
-					set_buffer_new(bh);
 					bh->b_bdev = inode->i_sb->s_bdev;
+					unmap_underlying_metadata(bh->b_bdev,
+								pblock);
 				} else if (buffer_mapped(bh))
 					BUG_ON(bh->b_blocknr != pblock);
 
@@ -2318,8 +2322,6 @@  static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 			/* not enough space to reserve */
 			return ret;
 
-		map_bh(bh_result, inode->i_sb, 0);
-		set_buffer_new(bh_result);
 		set_buffer_delay(bh_result);
 	} else if (ret > 0) {
 		bh_result->b_size = (ret << inode->i_blkbits);