diff mbox

[2/3] ext4: Clear the unwritten buffer_head flag properly

Message ID 1241692770-22547-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
State Superseded, archived
Headers show

Commit Message

Aneesh Kumar K.V May 7, 2009, 10:39 a.m. UTC
ext4_get_blocks_wrap does a block lookup requesting to
allocate new blocks. A lookup of blocks in prealloc area
result in setting the unwritten flag in buffer_head. So
a write to an unwritten extent will cause the buffer_head
to have unwritten and mapped flag set. Clear hte unwritten
buffer_head flag before requesting to allocate blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/ext4/inode.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

Comments

Eric Sandeen May 7, 2009, 3:36 p.m. UTC | #1
Aneesh Kumar K.V wrote:
> ext4_get_blocks_wrap does a block lookup requesting to
> allocate new blocks. A lookup of blocks in prealloc area
> result in setting the unwritten flag in buffer_head. So
> a write to an unwritten extent will cause the buffer_head
> to have unwritten and mapped flag set. Clear hte unwritten
> buffer_head flag before requesting to allocate blocks.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  fs/ext4/inode.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c3cd00f..f6d7e9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1149,6 +1149,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
>  	int retval;
>  
>  	clear_buffer_mapped(bh);
> +	clear_buffer_unwritten(bh);
>  
>  	/*
>  	 * Try to see if we can get  the block without requesting
> @@ -1179,6 +1180,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
>  		return retval;
>  
>  	/*
> +	 * The above get_blocks can cause the buffer to be
> +	 * marked unwritten. So clear the same.
> +	 */
> +	clear_buffer_unwritten(bh);

hm, thinking out loud here.

ext4_ext_get_blocks() will only set unwritten if (!create) ... but then
ext4_get_blocks_wrap() calls ext4_ext_get_blocks() !create as an
argument no matter what, the first time, for an initial lookup.

But if ext4_get_blocks_wrap() was called with !create, then we return
regardless, so ok - by the time you get to the above hunk, we -are- in
create mode, we're planning to write it ... so I guess clearing the
unwritten state makes sense here.

But is this too late, because it's after this?

        /*
         * Returns if the blocks have already allocated
         *
         * Note that if blocks have been preallocated
         * ext4_ext_get_block() returns th create = 0
         * with buffer head unmapped.
         */
        if (retval > 0 && buffer_mapped(bh))
                return retval;

I guess not; ext4_ext_get_blocks() won't map the buffer if it's found to
be preallocated/unwritten because it was called with !create.  If we're
going on to write it, we want to clear unwritten.

So I guess this looks right, although I can't help but think that in
general, the buffer_head state management is really getting to be a
hard-to-follow mess...

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V May 8, 2009, 8:12 a.m. UTC | #2
On Thu, May 07, 2009 at 10:36:49AM -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > ext4_get_blocks_wrap does a block lookup requesting to
> > allocate new blocks. A lookup of blocks in prealloc area
> > result in setting the unwritten flag in buffer_head. So
> > a write to an unwritten extent will cause the buffer_head
> > to have unwritten and mapped flag set. Clear hte unwritten
> > buffer_head flag before requesting to allocate blocks.
> > 
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > ---
> >  fs/ext4/inode.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index c3cd00f..f6d7e9b 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1149,6 +1149,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> >  	int retval;
> >  
> >  	clear_buffer_mapped(bh);
> > +	clear_buffer_unwritten(bh);
> >  
> >  	/*
> >  	 * Try to see if we can get  the block without requesting
> > @@ -1179,6 +1180,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> >  		return retval;
> >  
> >  	/*
> > +	 * The above get_blocks can cause the buffer to be
> > +	 * marked unwritten. So clear the same.
> > +	 */
> > +	clear_buffer_unwritten(bh);
> 
> hm, thinking out loud here.
> 
> ext4_ext_get_blocks() will only set unwritten if (!create) ... but then
> ext4_get_blocks_wrap() calls ext4_ext_get_blocks() !create as an
> argument no matter what, the first time, for an initial lookup.
> 
> But if ext4_get_blocks_wrap() was called with !create, then we return
> regardless, so ok - by the time you get to the above hunk, we -are- in
> create mode, we're planning to write it ... so I guess clearing the
> unwritten state makes sense here.
> 
> But is this too late, because it's after this?
> 
>         /*
>          * Returns if the blocks have already allocated
>          *
>          * Note that if blocks have been preallocated
>          * ext4_ext_get_block() returns th create = 0
>          * with buffer head unmapped.
>          */
>         if (retval > 0 && buffer_mapped(bh))
>                 return retval;
> 
> I guess not; ext4_ext_get_blocks() won't map the buffer if it's found to
> be preallocated/unwritten because it was called with !create.  If we're
> going on to write it, we want to clear unwritten.
> 
> So I guess this looks right, although I can't help but think that in
> general, the buffer_head state management is really getting to be a
> hard-to-follow mess...

To further clarify what i think was causing the I/O error.

1) We do a multi block delayed alloc to prealloc space. That would get
us multiple buffer_heads marked with BH_Unwritten. (say 10, 11, 12)
2) pdflush attempt to write some pages (say mapping block 10) which cause
a get_block call with create = 1. That would attempt to convert
uninitialized extent to initialized one. This can cause multiple blocks
to be marked initialized. ( say 10, 11 , 12)
3) We do an overwrite of block 11. That would mean we call
ext4_da_get_block_prep, which would again do a get_block for block 11
with create = 0. But remember we already have buffer_head marked with
BH_Unwritten flag. But the buffer was unmapped because it is unwritten
( We are fixing this mess in the patch for 2.6.31).
4) The get_block call will find the buffer mapped due to step b. And
mark the buffer_head mapped. There we go . We end up with buffer_head
mapped and unwritten
5) later in ext4_da_get_block_prep we check whether the buffer_head in marked
BH_Unwritten if so we set the block number to ~0. This is introduced by
[PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents
6) So now we have a buffer_head that is mapped, unwritten, with
b_blocknr = ~0. That would result in the I/O error.

-aneesh


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o May 12, 2009, 3:08 a.m. UTC | #3
On Thu, May 07, 2009 at 04:09:29PM +0530, Aneesh Kumar K.V wrote:
> ext4_get_blocks_wrap does a block lookup requesting to
> allocate new blocks. A lookup of blocks in prealloc area
> result in setting the unwritten flag in buffer_head. So
> a write to an unwritten extent will cause the buffer_head
> to have unwritten and mapped flag set. Clear hte unwritten
> buffer_head flag before requesting to allocate blocks.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

I've rewritten the commit changelog to this, which I believe more
accurately describes the patch.  Comments, please?

ext4: Clear the unwritten buffer_head flag after the extent is initialized

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area.  That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.

When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate.  Hence, we need to make sure the
BH_Unwritten is not left set, to avoid the ensuing confusion and
hilarty.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V May 12, 2009, 4:46 a.m. UTC | #4
On Mon, May 11, 2009 at 11:08:56PM -0400, Theodore Tso wrote:
> On Thu, May 07, 2009 at 04:09:29PM +0530, Aneesh Kumar K.V wrote:
> > ext4_get_blocks_wrap does a block lookup requesting to
> > allocate new blocks. A lookup of blocks in prealloc area
> > result in setting the unwritten flag in buffer_head. So
> > a write to an unwritten extent will cause the buffer_head
> > to have unwritten and mapped flag set. Clear hte unwritten
> > buffer_head flag before requesting to allocate blocks.
> > 
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> I've rewritten the commit changelog to this, which I believe more
> accurately describes the patch.  Comments, please?
> 
> ext4: Clear the unwritten buffer_head flag after the extent is initialized
> 
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> The BH_Unwritten flag indicates that the buffer is allocated on disk
> but has not been written; that is, the disk was part of a persistent
> preallocation area.  That flag should only be set when a get_blocks()
> function is looking up a inode's logical to physical block mapping.
> 
> When ext4_get_blocks_wrap() is called with create=1, the uninitialized
> extent is converted into an initialized one, so the BH_Unwritten flag
> is no longer appropriate.  Hence, we need to make sure the
> BH_Unwritten is not left set, to avoid the ensuing confusion and
> hilarty.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 

I think it is good. But one thing missing in the commit message is,
what happens if we do a write to prealloc space. Since a
get_block(create = 1) is now split into __get_block(create = 0 )  and
__get_block(create = 1). That would mean if we pass a buffer head with
BH_Unwritten cleared we will have


1) buffer_head as BH_Unwritten cleared.

2) __get_block(create = 0 ) -> Since it is prealloc space we will have
BH_Unwritten set .

3) __get_block(create = 1) -> get the blocks out of prealloc space.
and retun with BH_Mapped set. 

That would imply we have BH_Unwritten and BH_Mapped set in the above
case which is wrong. So we need a BH_Unwritten clear between (2) and
(3). The patch does the same. May be we need to capture it in commit
message.

-aneesh



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen May 13, 2009, 6:56 p.m. UTC | #5
Aneesh Kumar K.V wrote:
> On Mon, May 11, 2009 at 11:08:56PM -0400, Theodore Tso wrote:
>> On Thu, May 07, 2009 at 04:09:29PM +0530, Aneesh Kumar K.V wrote:
>>> ext4_get_blocks_wrap does a block lookup requesting to
>>> allocate new blocks. A lookup of blocks in prealloc area
>>> result in setting the unwritten flag in buffer_head. So
>>> a write to an unwritten extent will cause the buffer_head
>>> to have unwritten and mapped flag set. Clear hte unwritten
>>> buffer_head flag before requesting to allocate blocks.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> I've rewritten the commit changelog to this, which I believe more
>> accurately describes the patch.  Comments, please?
>>
>> ext4: Clear the unwritten buffer_head flag after the extent is initialized
>>
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>
>> The BH_Unwritten flag indicates that the buffer is allocated on disk
>> but has not been written; that is, the disk was part of a persistent
>> preallocation area.  That flag should only be set when a get_blocks()
>> function is looking up a inode's logical to physical block mapping.
>>
>> When ext4_get_blocks_wrap() is called with create=1, the uninitialized
>> extent is converted into an initialized one, so the BH_Unwritten flag
>> is no longer appropriate.  Hence, we need to make sure the
>> BH_Unwritten is not left set, to avoid the ensuing confusion and
>> hilarty.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>>
> 
> I think it is good. But one thing missing in the commit message is,
> what happens if we do a write to prealloc space. Since a
> get_block(create = 1) is now split into __get_block(create = 0 )  and
> __get_block(create = 1). That would mean if we pass a buffer head with
> BH_Unwritten cleared we will have
> 
> 
> 1) buffer_head as BH_Unwritten cleared.
> 
> 2) __get_block(create = 0 ) -> Since it is prealloc space we will have
> BH_Unwritten set .

Why do we need to set BH_Unwritten on a !create call at all?

Or maybe another way of asking is, are there any !create callers of
get_block who -want- BH_Unwritten set?

Which is to say, should we just not be setting BH_Unwritten in get_block
in the !create case, ever?

The comment:

 	/*
+	 * The above get_blocks can cause the buffer to be
+	 * marked unwritten. So clear the same.
+	 */
+	clear_buffer_unwritten(bh);

is imho not helpful; to me it says "we -just- set this, so clear it!"
It leaves me scratching my head.

> 3) __get_block(create = 1) -> get the blocks out of prealloc space.
> and retun with BH_Mapped set. 
> 
> That would imply we have BH_Unwritten and BH_Mapped set in the above
> case which is wrong. So we need a BH_Unwritten clear between (2) and
> (3). The patch does the same. May be we need to capture it in commit
> message.

Better in comments, I think.  :)

-Eric

> -aneesh
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o May 13, 2009, 10:28 p.m. UTC | #6
On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> The comment:
> 
>  	/*
> +	 * The above get_blocks can cause the buffer to be
> +	 * marked unwritten. So clear the same.
> +	 */
> +	clear_buffer_unwritten(bh);
> 
> is imho not helpful; to me it says "we -just- set this, so clear it!"
> It leaves me scratching my head.

I've updated it the comment to say this instead.

	/*
	 * When we call get_blocks without the create flag, the
	 * BH_Unwritten flag could have gotten set if the blocks
	 * requested were part of a uninitialized extent.  We need to
	 * clear this flag now that we are committed to convert all or
	 * part of the uninitialized extent to be an initialized
	 * extent.  This is because we need to avoid the combination
	 * of BH_Unwritten and BH_Mapped flags being simultaneously
	 * set on the buffer_head.
	 */

I'm still not entirely clear what is the downside of having
BH_Unwritten and BH_Mapped at the same time; whether it is just a CPU
time-waster that will cause some unneeded calls to get_blocks() in the
write_begin path which "just" wastes a little CPU, or whether there
are other massive disasters that take place on the combination of
BH_Unwritten and BH_Mapped.

What we *desperately* need is documentation that describes which
functions sets these flags, and whether they are intended for
long-term use (either stored in buffer heads attached to a page, or in
the buffer cache, and if so, which), or just as short-term hacks to
pass information between two functions, or multiple functions deep in
a call stack, and if so, what the implications are of the bits, and
when they should be cleared --- and then we should add assert's or
BUG_ON's to enforce what we think the abstraction invariant should be.

Magic flags attached to objects that are really there because we don't
want to change function signatures are very scary; for example,
i_delalloc_reserved_flag in ext4_inode_info --- it's not clear to me
exactly what this is supposed to do, but I note that mballoc.c does a
lot more with the flag than balloc.c.  So it's not clear to me if
there is magic semantics associated with that flag which are being
done in mballoc.c, but we never got around to implementing them in
balloc.c.  This leads to the kind of code fragility that I've been
complaining about for some time.

The fact that Aneesh missed one of these magic code paths in his patch
attests to why this style of programming is really bad and not
long-term maintainable.  So we need to document all of this mess, and
then gradually clean it up, one tiny patch at a time, each one
describing what we were doing before, and what the new way of doing
this should be, and why it's safe to make that change.

> > That would imply we have BH_Unwritten and BH_Mapped set in the above
> > case which is wrong. So we need a BH_Unwritten clear between (2) and
> > (3). The patch does the same. May be we need to capture it in commit
> > message.
> 
> Better in comments, I think.  :)

The comprehensive description of all of this should be in comments,
yes.  See the example of the sort of comments that I've added in the
clear-unwritten-bh-flag and initialize-map_bh-state.  If we don't like
such verbose comments, then we should use simpler programming
semantics when passing information between the various abstracton
layers.  :-)

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V May 14, 2009, 5:40 a.m. UTC | #7
On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> > 
> > I think it is good. But one thing missing in the commit message is,
> > what happens if we do a write to prealloc space. Since a
> > get_block(create = 1) is now split into __get_block(create = 0 )  and
> > __get_block(create = 1). That would mean if we pass a buffer head with
> > BH_Unwritten cleared we will have
> > 
> > 
> > 1) buffer_head as BH_Unwritten cleared.
> > 
> > 2) __get_block(create = 0 ) -> Since it is prealloc space we will have
> > BH_Unwritten set .
> 
> Why do we need to set BH_Unwritten on a !create call at all?
> 
> Or maybe another way of asking is, are there any !create callers of
> get_block who -want- BH_Unwritten set?
> 
> Which is to say, should we just not be setting BH_Unwritten in get_block
> in the !create case, ever?

It should only be set in the !create case. With create == 1, we would
have already converted the uninitialized extent to initialized one and
the buffer_head should not be unwritten at all. My understanding is
unwritten flag is used to indicate the buffer_head state between a
write_begin and write_page phase with delayed allocation. ie, when we
write to fallocate space, since we have delalloc enabled, we  just
do a block lookup (get_block with create = 0). The buffer_head returned
in the above case should have unwritten set so that during writepage
we do the actual block allocation (get_block writh create = 1)
looking at the flag.


> 
> The comment:
> 
>  	/*
> +	 * The above get_blocks can cause the buffer to be
> +	 * marked unwritten. So clear the same.
> +	 */
> +	clear_buffer_unwritten(bh);
> 
> is imho not helpful; to me it says "we -just- set this, so clear it!"
> It leaves me scratching my head.
> 
> > 3) __get_block(create = 1) -> get the blocks out of prealloc space.
> > and retun with BH_Mapped set. 
> > 
> > That would imply we have BH_Unwritten and BH_Mapped set in the above
> > case which is wrong. So we need a BH_Unwritten clear between (2) and
> > (3). The patch does the same. May be we need to capture it in commit
> > message.
> 
> Better in comments, I think.  :)
> 

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V May 14, 2009, 6 a.m. UTC | #8
On Wed, May 13, 2009 at 06:28:47PM -0400, Theodore Tso wrote:
> On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> > The comment:
> > 
> >  	/*
> > +	 * The above get_blocks can cause the buffer to be
> > +	 * marked unwritten. So clear the same.
> > +	 */
> > +	clear_buffer_unwritten(bh);
> > 
> > is imho not helpful; to me it says "we -just- set this, so clear it!"
> > It leaves me scratching my head.
> 
> I've updated it the comment to say this instead.
> 
> 	/*
> 	 * When we call get_blocks without the create flag, the
> 	 * BH_Unwritten flag could have gotten set if the blocks
> 	 * requested were part of a uninitialized extent.  We need to
> 	 * clear this flag now that we are committed to convert all or
> 	 * part of the uninitialized extent to be an initialized
> 	 * extent.  This is because we need to avoid the combination
> 	 * of BH_Unwritten and BH_Mapped flags being simultaneously
> 	 * set on the buffer_head.
> 	 */

The last line in the above comment is not a problem with the latest
kernel with all the patches in the patch queue. The patch that does that
is  "ext4: Mark the unwritten buffer_head as mapped during write_begin"

The unwritten and mapped state together was a problem with the code path
we had before in ext4_da_get_block_prep
we had:

ret = ext4_get_blocks(NULL, inode, iblock, 1,  bh_result, 0);
.....
..
} else if (ret > 0) {
	bh_result->b_size = (ret << inode->i_blkbits);
	if (buffer_unwritten(bh_result)) {
		bh_result.b_blocknr = ~0;
		....
	}
}

The above can result in 

1) We do a multi block delayed alloc to prealloc space. That would get
us multiple buffer_heads marked with BH_Unwritten. (say 10, 11, 12)
2) pdflush attempt to write some pages (say mapping block 10) which
cause
a get_block call with create = 1. That would attempt to convert
uninitialized extent to initialized one. This can cause multiple blocks
to be marked initialized. ( say 10, 11 , 12)
3) We do an overwrite of block 11. That would mean we call
ext4_da_get_block_prep, which would again do a get_block for block 11
with create = 0. But remember we already have buffer_head marked with
BH_Unwritten flag. But the buffer was unmapped because it is unwritten
( We are fixing this mess in the patch for 2.6.31).
4) The get_block call will find the buffer mapped due to step b. And
mark the buffer_head mapped. There we go . We end up with buffer_head
mapped and unwritten
5) later in ext4_da_get_block_prep we check whether the buffer_head in
marked
BH_Unwritten if so we set the block number to ~0. This is introduced by
[PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten
extents
6) So now we have a buffer_head that is mapped, unwritten, with
b_blocknr = ~0. That would result in the I/O error.


Now that we have the actual block number in bh_result.b_blocknr I guess
we should be ok to have the unwritten flag set. But then i guess it is
against the VFS assumption of buffer_head state. The state unwritten
indicate the blocks are not yet written. So if we don't do a
clear_buffer_unwritten as in the patch we end up with a block that is
written ( done via create = 1 get_block call) and still marked
unwritten. (see the 6 step example above)

Now to explain what "ext4: Mark the unwritten buffer_head as mapped
during write_begin" does.

It marks the buffer_head as mapped in the write_begin phase.
And that helps in making sure we don't end up calling get_block
multiple times.  So with delayed allocation we now have between
the write_begin and writepage phase a buffer_head which is mapped
and unwritten for blocks in the fallocate space. Once we do
writepage for the block we will have buffer_head which is mapped and
unwritten flag cleared. The unwritten get cleared when we do a get_block
call with create = 1.


To achieve the above we need to make sure writepage code path looks at
the unwritten flag and does a get_blocks call with create = 1. With
mainline we have in writepage code path

mpage_da_map_blocks(...)

if ((mpd->b_state  & (1 << BH_Mapped)) &&
    !(mpd->b_state & (1 << BH_Delay)))
	return 0;
	...
	..
	ext4_da_get_block_write()


If we start marking buffer_head mapped for fallocate blocks in the
write_begin phase, then the above code will return without doing
any get_block(create = 1) call. That means we don't convert the
uninitialized extent to initialized one. So along with marking
buffer_head mapped and unwritten in the write_begin phase we also
need changes to writepage code path which does something like below

mpage_da_map_blocks(...)

if ((mpd->b_state  & (1 << BH_Mapped)) &&
	!(mpd->b_state & (1 << BH_Delay)) &&
	!(mpd->b_state & (1 << BH_Unwritten)))
	    return 0;
	...
	..
	ext4_da_get_block_write()

this is done in patch "ext4: Mark the unwritten buffer_head as mapped
during write_begin".

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o May 14, 2009, 1:14 p.m. UTC | #9
On Thu, May 14, 2009 at 11:10:02AM +0530, Aneesh Kumar K.V wrote:
> 
> It should only be set in the !create case. With create == 1, we would
> have already converted the uninitialized extent to initialized one and
> the buffer_head should not be unwritten at all.  My understanding is
> unwritten flag is used to indicate the buffer_head state between a
> write_begin and write_page phase with delayed allocation. ie, when we
> write to fallocate space, since we have delalloc enabled, we  just
> do a block lookup (get_block with create = 0). The buffer_head returned
> in the above case should have unwritten set so that during writepage
> we do the actual block allocation (get_block writh create = 1)
> looking at the flag.

At the moment, ext4_da_get_block_prep(), which is used as a callback
function by ext4_da_write_begin(), checks for buffer_unwritten(), and
if true, set BH_New and BH_Mapped.

So between the time that that write_begin() and the time when the page
is actually written out, BH_Unwritten and BH_Mapped will both be set.
If we end up bailing due to some error of some kind, such that we
don't complete the write(2) operation we *can* have some pages that
are simultaneously have BH_Unwritten and BH_Mapped flags set.  So this
had better be a harmless case, since I think it can happen.

What's confusing then is some of the comments which have been made
about why BH_Unwritten and BH_Mapped simultaneously are a bad.  It may
be bad at some points in time, but at other points in time it's
completely normal operations.  Or am I missing something?

	   	  	       	       - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c3cd00f..f6d7e9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1149,6 +1149,7 @@  int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 	int retval;
 
 	clear_buffer_mapped(bh);
+	clear_buffer_unwritten(bh);
 
 	/*
 	 * Try to see if we can get  the block without requesting
@@ -1179,6 +1180,12 @@  int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 		return retval;
 
 	/*
+	 * The above get_blocks can cause the buffer to be
+	 * marked unwritten. So clear the same.
+	 */
+	clear_buffer_unwritten(bh);
+
+	/*
 	 * New blocks allocate and/or writing to uninitialized extent
 	 * will possibly result in updating i_data, so we take
 	 * the write lock of i_data_sem, and call get_blocks()