[-V4,1/2] Fix sub-block zeroing for buffered writes into unwritten extents

Message ID 1240980441-8105-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
State Superseded, archived

Commit Message

Aneesh Kumar K.V April 29, 2009, 4:47 a.m. UTC
We need to mark the buffer_head mapping prealloc space
as new during write_begin. Otherwise we don't zero out the
page cache content properly for a partial write. This will
cause file corruption with preallocation.

Also use block number -1 as the fake block number so that
unmap_underlying_metadata doesn't drop wrong buffer_head

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

---
 fs/ext4/inode.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

Comments

Eric Sandeen April 29, 2009, 1:59 p.m. UTC | #1
Aneesh Kumar K.V wrote:
> We need to mark the  buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
> 
> Also use block number -1 as the fake block number so that
> unmap_underlying_metadata doesn't drop wrong buffer_head
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> ---
>  fs/ext4/inode.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e91f978..12dcfab 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>  		set_buffer_delay(bh_result);
>  	} else if (ret > 0) {
>  		bh_result->b_size = (ret << inode->i_blkbits);
> +		/*
> +		 * With sub-block writes into unwritten extents
> +		 * we also need to mark the buffer as new so that
> +		 * the unwritten parts of the buffer gets correctly zeroed.
> +		 */
> +		if (buffer_unwritten(bh_result)) {
> +			bh_result->b_bdev = inode->i_sb->s_bdev;
> +			set_buffer_new(bh_result);
> +			bh_result->b_blocknr = -1;
> +		}
>  		ret = 0;
>  	}
>  

Ok, I guess this seems like the safest approach.  Long term we should
look really hard at the state & block nr of these buffer heads, but I
agree that keeping the changes restricted to the preallocation path for
now is safest.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mingming Cao April 29, 2009, 5:28 p.m. UTC | #2
On Wed, 2009-04-29 at 08:59 -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > We need to mark the  buffer_head mapping prealloc space
> > as new during write_begin. Otherwise we don't zero out the
> > page cache content properly for a partial write. This will
> > cause file corruption with preallocation.
> > 
> > Also use block number -1 as the fake block number so that
> > unmap_underlying_metadata doesn't drop wrong buffer_head
> > 
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > 
> > ---
> >  fs/ext4/inode.c |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e91f978..12dcfab 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> >  		set_buffer_delay(bh_result);
> >  	} else if (ret > 0) {
> >  		bh_result->b_size = (ret << inode->i_blkbits);
> > +		/*
> > +		 * With sub-block writes into unwritten extents
> > +		 * we also need to mark the buffer as new so that
> > +		 * the unwritten parts of the buffer gets correctly zeroed.
> > +		 */
> > +		if (buffer_unwritten(bh_result)) {
> > +			bh_result->b_bdev = inode->i_sb->s_bdev;
> > +			set_buffer_new(bh_result);
> > +			bh_result->b_blocknr = -1;
> > +		}
> >  		ret = 0;
> >  	}
> >  
> 
> Ok, I guess this seems like the safest approach.  Long term we should
> look really hard at the state & block nr of these buffer heads, but I
> agree that keeping the changes restricted to the preallocation path for
> now is safest.
> 

This path (ret > 0) is the one where get_blocks() finds the block already
allocated or preallocated. The buffer_unwritten() check restricts it to the
preallocation case, but why not take care of buffer_new() where we set
buffer_unwritten() for preallocation in ext4_ext_get_blocks() in the first
place? That would keep all of the preallocation handling together there.

But both patches are correct. I have tested prealloc, prealloc->partial
write, and prealloc->partial long write->partial short write, and the
content read back afterward seems sane with both patches.

Any thoughts on the comment update I made in my previous patch? The
comment in the preallocation handling in ext4_ext_get_blocks() needs
some cleanup.


Thinking this over: if we set the buffer new here (i.e. in the write_begin()
path), I wonder about the read case. Where do we set buffer_new() for a read
on preallocated space? ext4_ext_get_blocks() with create = 0 on a
preallocated extent will return the bh unwritten, but not new. However, my
read tests right after a new preallocation return all-zeroed data. I wonder
what I am missing.

Mingming
> -Eric
Theodore Ts'o May 12, 2009, 2:42 a.m. UTC | #3
On Wed, Apr 29, 2009 at 10:17:20AM +0530, Aneesh Kumar K.V wrote:
> We need to mark the  buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
> 
> Also use block number -1 as the fake block number so that
> unmap_underlying_metadata doesn't drop wrong buffer_head

The buffer_head code is starting to scare me more and more. 

I'm looking at this code again and I can't figure out why it's safe
(or why we would need to) put in an invalid number into
bh_result->b_blocknr:

> @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>  		set_buffer_delay(bh_result);
>  	} else if (ret > 0) {
>  		bh_result->b_size = (ret << inode->i_blkbits);
> +		/*
> +		 * With sub-block writes into unwritten extents
> +		 * we also need to mark the buffer as new so that
> +		 * the unwritten parts of the buffer gets correctly zeroed.
> +		 */
> +		if (buffer_unwritten(bh_result)) {
> +			bh_result->b_bdev = inode->i_sb->s_bdev;
> +			set_buffer_new(bh_result);
> +			bh_result->b_blocknr = -1;

Why do we need to avoid calling unmap_underlying_metadata()?

And after the buffer is zero'ed out, it leaves b_blocknr in a
buffer_head attached to the page at an invalid block number.  Doesn't
that get us in trouble later on?

I see that this line is removed later on in the for-2.6.31 patch "Mark
the unwritten buffer_head as mapped during write_begin".  But is it
safe for 2.6.30?

						- Ted
Eric Sandeen May 12, 2009, 3:37 a.m. UTC | #4
Theodore Tso wrote:
> On Wed, Apr 29, 2009 at 10:17:20AM +0530, Aneesh Kumar K.V wrote:
>> We need to mark the  buffer_head mapping prealloc space
>> as new during write_begin. Otherwise we don't zero out the
>> page cache content properly for a partial write. This will
>> cause file corruption with preallocation.
>>
>> Also use block number -1 as the fake block number so that
>> unmap_underlying_metadata doesn't drop wrong buffer_head
> 
> The buffer_head code is starting to scare me more and more. 
> 
> I'm looking at this code again and I can't figure out why it's safe
> (or why we would need to) put in an invalid number into
> bh_result->b_blocknr:

I don't know for sure why it should be invalid; I think a preallocated
block, since it has an *actual* *block* *allocated* after all, should
have that block number.  But if it's going to be fake, let's not use a
"real" one like the superblock location...

A real block nr does eventually get assigned when we do getblock with
create=1 AFAICT.
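
A minimal userspace sketch of what the -1 amounts to, assuming the block
number type is modeled as an unsigned 64-bit integer (in the kernel the
width of sector_t depends on configuration). The names sketch_sector_t,
sketch_fake_blocknr and sketch_block_in_device are hypothetical and exist
only for this illustration. The point is that -1 wraps to the largest
representable value, which lies beyond any real device, whereas block 0 is
a real on-disk location.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: model the block number type as an unsigned 64-bit integer. */
typedef uint64_t sketch_sector_t;

/* Storing -1 in an unsigned type wraps to the largest representable value. */
static sketch_sector_t sketch_fake_blocknr(void)
{
	return (sketch_sector_t)-1;
}

/* A block number refers to real storage only if it lies inside the device. */
static bool sketch_block_in_device(sketch_sector_t blocknr,
				   sketch_sector_t nr_blocks)
{
	assert(nr_blocks > 0);
	return blocknr < nr_blocks;
}
```

With this model, sketch_block_in_device(sketch_fake_blocknr(), n) is false
for every device size n, while sketch_block_in_device(0, n) is always true.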

>> @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>>  		set_buffer_delay(bh_result);
>>  	} else if (ret > 0) {
>>  		bh_result->b_size = (ret << inode->i_blkbits);
>> +		/*
>> +		 * With sub-block writes into unwritten extents
>> +		 * we also need to mark the buffer as new so that
>> +		 * the unwritten parts of the buffer gets correctly zeroed.
>> +		 */
>> +		if (buffer_unwritten(bh_result)) {
>> +			bh_result->b_bdev = inode->i_sb->s_bdev;
>> +			set_buffer_new(bh_result);
>> +			bh_result->b_blocknr = -1;
> 
> Why do we need to avoid calling unmap_underlying_metadata()?

For that matter, why do we call unmap_underlying_metadata at all, ever?

> And after the buffer is zero'ed out, it leaves b_blocknr in a
> buffer_head attached to the page at an invalid block number.  Doesn't
> that get us in trouble later on?
> 
> I see that this line is removed later on in the for-2.6.31 patch "Mark
> the unwritten buffer_head as mapped during write_begin".  But is it
> safe for 2.6.30?

I have this in F11 now, but it's giving me the heebie-jeebies still.  At
least it's confined to preallocation (one of the great new ext4 features
I've been promoting recently... :)

-Eric

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e91f978..12dcfab 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2323,6 +2323,16 @@  static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 		set_buffer_delay(bh_result);
 	} else if (ret > 0) {
 		bh_result->b_size = (ret << inode->i_blkbits);
+		/*
+		 * With sub-block writes into unwritten extents
+		 * we also need to mark the buffer as new so that
+		 * the unwritten parts of the buffer gets correctly zeroed.
+		 */
+		if (buffer_unwritten(bh_result)) {
+			bh_result->b_bdev = inode->i_sb->s_bdev;
+			set_buffer_new(bh_result);
+			bh_result->b_blocknr = -1;
+		}
 		ret = 0;
 	}
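
As a rough illustration of the commit message's point about partial
writes, the userspace sketch below models a single block of page cache and
a sub-block write into it. It is not kernel code: struct sketch_bh,
sketch_partial_write and sketch_stale_bytes are hypothetical names, and the
16-byte block size is chosen only to keep the example small. The sketch
assumes that, when the buffer is flagged new, the write path zeroes the
parts of the block outside the written range before copying the data in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 16u	/* tiny block so the effect is easy to inspect */

/* Hypothetical stand-in for the one buffer_head bit this sketch cares about. */
struct sketch_bh {
	bool is_new;		/* models buffer_new() */
};

/*
 * Copy src into page[from, to). If the buffer is flagged new, first zero
 * the bytes of the block that the write does not cover, so that stale
 * page contents are not left next to the freshly written data.
 */
static void sketch_partial_write(unsigned char *page,
				 const struct sketch_bh *bh,
				 size_t from, size_t to,
				 const unsigned char *src)
{
	assert(from <= to && to <= SKETCH_BLOCK_SIZE);

	if (bh->is_new) {
		memset(page, 0, from);				/* head: [0, from) */
		memset(page + to, 0, SKETCH_BLOCK_SIZE - to);	/* tail: [to, end) */
	}
	memcpy(page + from, src, to - from);
}

/* Count non-zero bytes outside [from, to), i.e. stale data left behind. */
static size_t sketch_stale_bytes(const unsigned char *page,
				 size_t from, size_t to)
{
	size_t i, stale = 0;

	for (i = 0; i < SKETCH_BLOCK_SIZE; i++)
		if ((i < from || i >= to) && page[i] != 0)
			stale++;
	return stale;
}
```

Filling the page with a non-zero pattern and writing 4 bytes at offset 4
leaves 12 stale bytes when is_new is false and none when it is true, which
is the difference the patch is after for unwritten extents.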