diff mbox

ext4: fix overhead calculation in bigalloc filesystem (Re: ... )

Message ID 20130222040544.GA13667@thunk.org
State Accepted, archived
Headers show

Commit Message

Theodore Ts'o Feb. 22, 2013, 4:05 a.m. UTC
On Fri, Feb 22, 2013 at 11:03:27AM +0800, Zheng Liu wrote:
> 
> I agree with you that we should forbid user to use bigalloc feature with
> block size = 1k or 2k because I guess no one really use it, at least in
> Taobao we always use bigalloc feature with block size = 4k.

The main reason why file systems with 1k or 2k (with or without
bigalloc) is that it's useful is that it's a good way of testing what
happens on an architecture with a 16k page size and a 4k blocksize.  I
am regularly testing with a 1k blocksize because it catches problems
that would otherwise only show up on PowerPC or Intanium platforms.
I'm not testing with bigalloc plus 1k block size, though, I admit.

> FWIW, recently I am considering whether we could remove 'data=journal'
> and 'data=writeback' mode.  'data=journal' mode hurts performance
> dramatically.  Further, 'data=writeback' seems also useless, especially
> after we have 'no journal' feature.  TBH, these modes are lack of tests.
> Maybe this is a crazy idea in my mind.

As far as data=writeback and data=ordered are concerned, the only
difference is a single call to ext4_jbd2_file_end() in
ext4_ordered_write_end().  In fact, it looks like we could do
something like the following attached patch to simplify the code and
improve code coverage from a testing perspective.  (Note: patch not
yet tested!)

As far as "data=journalled", it's a bit more complicated but I do have
a sentimental attachment to it, since it's something which no other
file system has.  I have been regularly testing it, and it's something
which has been pretty stable for quite a while now.

If there was a mode that I'd be tempted to get rid of, it would be the
combined data=ordered/data=writeback modes.  The main reason why we
keep it is because of a concern of buggy applications that depend on
the implied fsync() of data=ordered mode at each commit.  However,
ext4 has been around for long enough that I think most of the buggy
applications have been fixed by now.  And of course, those buggy
applications will lose in the same way when they are using btrfs and
xfs.  Something to consider....

						- Ted

commit d075b5c3031752a4c41642ff505c3302e5321940
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Feb 21 23:04:59 2013 -0500

    ext4: collapse handling of data=ordered and data=writeback codepaths
    
    The only difference between how we handle data=ordered and
    data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
    code duplication by factoring out redundant the code paths.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Lukas Czerner Feb. 22, 2013, 8:04 a.m. UTC | #1
On Thu, 21 Feb 2013, Theodore Ts'o wrote:

> Date: Thu, 21 Feb 2013 23:05:44 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org
> Subject: Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
>     (Re: ... )
> 
> On Fri, Feb 22, 2013 at 11:03:27AM +0800, Zheng Liu wrote:
> > 
> > I agree with you that we should forbid user to use bigalloc feature with
> > block size = 1k or 2k because I guess no one really use it, at least in
> > Taobao we always use bigalloc feature with block size = 4k.
> 
> The main reason why file systems with 1k or 2k (with or without
> bigalloc) is that it's useful is that it's a good way of testing what
> happens on an architecture with a 16k page size and a 4k blocksize.  I
> am regularly testing with a 1k blocksize because it catches problems
> that would otherwise only show up on PowerPC or Intanium platforms.
> I'm not testing with bigalloc plus 1k block size, though, I admit.

I agree having block size smaller than 4k is useful not only for
testing purposes. However in my opinion with bigalloc it is entirely
unnecessary to allow those sizes.

-Lukas

> 
> > FWIW, recently I am considering whether we could remove 'data=journal'
> > and 'data=writeback' mode.  'data=journal' mode hurts performance
> > dramatically.  Further, 'data=writeback' seems also useless, especially
> > after we have 'no journal' feature.  TBH, these modes are lack of tests.
> > Maybe this is a crazy idea in my mind.
> 
> As far as data=writeback and data=ordered are concerned, the only
> difference is a single call to ext4_jbd2_file_end() in
> ext4_ordered_write_end().  In fact, it looks like we could do
> something like the following attached patch to simplify the code and
> improve code coverage from a testing perspective.  (Note: patch not
> yet tested!)
> 
> As far as "data=journalled", it's a bit more complicated but I do have
> a sentimental attachment to it, since it's something which no other
> file system has.  I have been regularly testing it, and it's something
> which has been pretty stable for quite a while now.
> 
> If there was a mode that I'd be tempted to get rid of, it would be the
> combined data=ordered/data=writeback modes.  The main reason why we
> keep it is because of a concern of buggy applications that depend on
> the implied fsync() of data=ordered mode at each commit.  However,
> ext4 has been around for long enough that I think most of the buggy
> applications have been fixed by now.  And of course, those buggy
> applications will lose in the same way when they are using btrfs and
> xfs.  Something to consider....
> 
> 						- Ted
> 
> commit d075b5c3031752a4c41642ff505c3302e5321940
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Feb 21 23:04:59 2013 -0500
> 
>     ext4: collapse handling of data=ordered and data=writeback codepaths
>     
>     The only difference between how we handle data=ordered and
>     data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
>     code duplication by factoring out redundant the code paths.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6e16c18..85dfd49 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1373,6 +1373,7 @@ enum {
>  	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
>  					   nolocking */
>  	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
> +	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 95a0c62..2e338a6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1055,77 +1055,36 @@ static int ext4_generic_write_end(struct file *file,
>   * ext4 never places buffers on inode->i_mapping->private_list.  metadata
>   * buffers are managed internally.
>   */
> -static int ext4_ordered_write_end(struct file *file,
> -				  struct address_space *mapping,
> -				  loff_t pos, unsigned len, unsigned copied,
> -				  struct page *page, void *fsdata)
> +static int ext4_write_end(struct file *file,
> +			  struct address_space *mapping,
> +			  loff_t pos, unsigned len, unsigned copied,
> +			  struct page *page, void *fsdata)
>  {
>  	handle_t *handle = ext4_journal_current_handle();
>  	struct inode *inode = mapping->host;
>  	int ret = 0, ret2;
>  
>  	trace_ext4_ordered_write_end(inode, pos, len, copied);
> -	ret = ext4_jbd2_file_inode(handle, inode);
> -
> -	if (ret == 0) {
> -		ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -		copied = ret2;
> -		if (pos + len > inode->i_size && ext4_can_truncate(inode))
> -			/* if we have allocated more blocks and copied
> -			 * less. We will have blocks allocated outside
> -			 * inode->i_size. So truncate them
> -			 */
> -			ext4_orphan_add(handle, inode);
> -		if (ret2 < 0)
> -			ret = ret2;
> -	} else {
> -		unlock_page(page);
> -		page_cache_release(page);
> -	}
> -
> -	ret2 = ext4_journal_stop(handle);
> -	if (!ret)
> -		ret = ret2;
> -
> -	if (pos + len > inode->i_size) {
> -		ext4_truncate_failed_write(inode);
> -		/*
> -		 * If truncate failed early the inode might still be
> -		 * on the orphan list; we need to make sure the inode
> -		 * is removed from the orphan list in that case.
> -		 */
> -		if (inode->i_nlink)
> -			ext4_orphan_del(NULL, inode);
> +	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
> +		ret = ext4_jbd2_file_inode(handle, inode);
> +		if (ret) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto errout;
> +		}
>  	}
>  
> -
> -	return ret ? ret : copied;
> -}
> -
> -static int ext4_writeback_write_end(struct file *file,
> -				    struct address_space *mapping,
> -				    loff_t pos, unsigned len, unsigned copied,
> -				    struct page *page, void *fsdata)
> -{
> -	handle_t *handle = ext4_journal_current_handle();
> -	struct inode *inode = mapping->host;
> -	int ret = 0, ret2;
> -
> -	trace_ext4_writeback_write_end(inode, pos, len, copied);
> -	ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -	copied = ret2;
> +	copied = ext4_generic_write_end(file, mapping, pos, len, copied,
> +					page, fsdata);
> +	if (copied < 0)
> +		ret = copied;
>  	if (pos + len > inode->i_size && ext4_can_truncate(inode))
>  		/* if we have allocated more blocks and copied
>  		 * less. We will have blocks allocated outside
>  		 * inode->i_size. So truncate them
>  		 */
>  		ext4_orphan_add(handle, inode);
> -
> -	if (ret2 < 0)
> -		ret = ret2;
> -
> +errout:
>  	ret2 = ext4_journal_stop(handle);
>  	if (!ret)
>  		ret = ret2;
> @@ -2656,18 +2615,9 @@ static int ext4_da_write_end(struct file *file,
>  	unsigned long start, end;
>  	int write_mode = (int)(unsigned long)fsdata;
>  
> -	if (write_mode == FALL_BACK_TO_NONDELALLOC) {
> -		switch (ext4_inode_journal_mode(inode)) {
> -		case EXT4_INODE_ORDERED_DATA_MODE:
> -			return ext4_ordered_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		case EXT4_INODE_WRITEBACK_DATA_MODE:
> -			return ext4_writeback_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		default:
> -			BUG();
> -		}
> -	}
> +	if (write_mode == FALL_BACK_TO_NONDELALLOC)
> +		return ext4_write_end(file, mapping, pos,
> +				      len, copied, page, fsdata);
>  
>  	trace_ext4_da_write_end(inode, pos, len, copied);
>  	start = pos & (PAGE_CACHE_SIZE - 1);
> @@ -3172,27 +3122,12 @@ static int ext4_journalled_set_page_dirty(struct page *page)
>  	return __set_page_dirty_nobuffers(page);
>  }
>  
> -static const struct address_space_operations ext4_ordered_aops = {
> +static const struct address_space_operations ext4_aops = {
>  	.readpage		= ext4_readpage,
>  	.readpages		= ext4_readpages,
>  	.writepage		= ext4_writepage,
>  	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_ordered_write_end,
> -	.bmap			= ext4_bmap,
> -	.invalidatepage		= ext4_invalidatepage,
> -	.releasepage		= ext4_releasepage,
> -	.direct_IO		= ext4_direct_IO,
> -	.migratepage		= buffer_migrate_page,
> -	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> -};
> -
> -static const struct address_space_operations ext4_writeback_aops = {
> -	.readpage		= ext4_readpage,
> -	.readpages		= ext4_readpages,
> -	.writepage		= ext4_writepage,
> -	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_writeback_write_end,
> +	.write_end		= ext4_write_end,
>  	.bmap			= ext4_bmap,
>  	.invalidatepage		= ext4_invalidatepage,
>  	.releasepage		= ext4_releasepage,
> @@ -3237,23 +3172,21 @@ void ext4_set_aops(struct inode *inode)
>  {
>  	switch (ext4_inode_journal_mode(inode)) {
>  	case EXT4_INODE_ORDERED_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_ordered_aops;
> +		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_writeback_aops;
> +		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_JOURNAL_DATA_MODE:
>  		inode->i_mapping->a_ops = &ext4_journalled_aops;
> -		break;
> +		return;
>  	default:
>  		BUG();
>  	}
> +	if (test_opt(inode->i_sb, DELALLOC))
> +		inode->i_mapping->a_ops = &ext4_da_aops;
> +	else
> +		inode->i_mapping->a_ops = &ext4_aops;
>  }
>  
>  
>
Zheng Liu Feb. 22, 2013, 1:18 p.m. UTC | #2
On Thu, Feb 21, 2013 at 11:05:44PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 22, 2013 at 11:03:27AM +0800, Zheng Liu wrote:
> > 
> > I agree with you that we should forbid user to use bigalloc feature with
> > block size = 1k or 2k because I guess no one really use it, at least in
> > Taobao we always use bigalloc feature with block size = 4k.
> 
> The main reason why file systems with 1k or 2k (with or without
> bigalloc) is that it's useful is that it's a good way of testing what
> happens on an architecture with a 16k page size and a 4k blocksize.  I
> am regularly testing with a 1k blocksize because it catches problems
> that would otherwise only show up on PowerPC or Intanium platforms.
> I'm not testing with bigalloc plus 1k block size, though, I admit.
> 
> > FWIW, recently I am considering whether we could remove 'data=journal'
> > and 'data=writeback' mode.  'data=journal' mode hurts performance
> > dramatically.  Further, 'data=writeback' seems also useless, especially
> > after we have 'no journal' feature.  TBH, these modes are lack of tests.
> > Maybe this is a crazy idea in my mind.
> 
> As far as data=writeback and data=ordered are concerned, the only
> difference is a single call to ext4_jbd2_file_end() in
> ext4_ordered_write_end().  In fact, it looks like we could do
> something like the following attached patch to simplify the code and
> improve code coverage from a testing perspective.  (Note: patch not
> yet tested!)
> 
> As far as "data=journalled", it's a bit more complicated but I do have
> a sentimental attachment to it, since it's something which no other
> file system has.  I have been regularly testing it, and it's something
> which has been pretty stable for quite a while now.

Indeed, it seems that other file systems don't have this feature AFAIK.

> 
> If there was a mode that I'd be tempted to get rid of, it would be the
> combined data=ordered/data=writeback modes.  The main reason why we
> keep it is because of a concern of buggy applications that depend on
> the implied fsync() of data=ordered mode at each commit.  However,
> ext4 has been around for long enough that I think most of the buggy
> applications have been fixed by now.  And of course, those buggy
> applications will lose in the same way when they are using btrfs and
> xfs.  Something to consider....

IMHO, the application shouldn't depend on a kernel feature.  So maybe it
is time to highlight this buggy usage.

> 
> 						- Ted
> 
> commit d075b5c3031752a4c41642ff505c3302e5321940
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Feb 21 23:04:59 2013 -0500
> 
>     ext4: collapse handling of data=ordered and data=writeback codepaths
>     
>     The only difference between how we handle data=ordered and
>     data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
>     code duplication by factoring out redundant the code paths.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Just one minor comment below.  Otherwise the patch looks good to me, and
it can pass in xfstests with 'data=ordered' and 'data=writeback'.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6e16c18..85dfd49 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1373,6 +1373,7 @@ enum {
>  	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
>  					   nolocking */
>  	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
> +	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 95a0c62..2e338a6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1055,77 +1055,36 @@ static int ext4_generic_write_end(struct file *file,
>   * ext4 never places buffers on inode->i_mapping->private_list.  metadata
>   * buffers are managed internally.
>   */
> -static int ext4_ordered_write_end(struct file *file,
> -				  struct address_space *mapping,
> -				  loff_t pos, unsigned len, unsigned copied,
> -				  struct page *page, void *fsdata)
> +static int ext4_write_end(struct file *file,
> +			  struct address_space *mapping,
> +			  loff_t pos, unsigned len, unsigned copied,
> +			  struct page *page, void *fsdata)
>  {
>  	handle_t *handle = ext4_journal_current_handle();
>  	struct inode *inode = mapping->host;
>  	int ret = 0, ret2;
>  
>  	trace_ext4_ordered_write_end(inode, pos, len, copied);

Here this function needs to be renamed with trace_ext4_write_end().

Regards,
                                                - Zheng

> -	ret = ext4_jbd2_file_inode(handle, inode);
> -
> -	if (ret == 0) {
> -		ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -		copied = ret2;
> -		if (pos + len > inode->i_size && ext4_can_truncate(inode))
> -			/* if we have allocated more blocks and copied
> -			 * less. We will have blocks allocated outside
> -			 * inode->i_size. So truncate them
> -			 */
> -			ext4_orphan_add(handle, inode);
> -		if (ret2 < 0)
> -			ret = ret2;
> -	} else {
> -		unlock_page(page);
> -		page_cache_release(page);
> -	}
> -
> -	ret2 = ext4_journal_stop(handle);
> -	if (!ret)
> -		ret = ret2;
> -
> -	if (pos + len > inode->i_size) {
> -		ext4_truncate_failed_write(inode);
> -		/*
> -		 * If truncate failed early the inode might still be
> -		 * on the orphan list; we need to make sure the inode
> -		 * is removed from the orphan list in that case.
> -		 */
> -		if (inode->i_nlink)
> -			ext4_orphan_del(NULL, inode);
> +	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
> +		ret = ext4_jbd2_file_inode(handle, inode);
> +		if (ret) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto errout;
> +		}
>  	}
>  
> -
> -	return ret ? ret : copied;
> -}
> -
> -static int ext4_writeback_write_end(struct file *file,
> -				    struct address_space *mapping,
> -				    loff_t pos, unsigned len, unsigned copied,
> -				    struct page *page, void *fsdata)
> -{
> -	handle_t *handle = ext4_journal_current_handle();
> -	struct inode *inode = mapping->host;
> -	int ret = 0, ret2;
> -
> -	trace_ext4_writeback_write_end(inode, pos, len, copied);
> -	ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -	copied = ret2;
> +	copied = ext4_generic_write_end(file, mapping, pos, len, copied,
> +					page, fsdata);
> +	if (copied < 0)
> +		ret = copied;
>  	if (pos + len > inode->i_size && ext4_can_truncate(inode))
>  		/* if we have allocated more blocks and copied
>  		 * less. We will have blocks allocated outside
>  		 * inode->i_size. So truncate them
>  		 */
>  		ext4_orphan_add(handle, inode);
> -
> -	if (ret2 < 0)
> -		ret = ret2;
> -
> +errout:
>  	ret2 = ext4_journal_stop(handle);
>  	if (!ret)
>  		ret = ret2;
> @@ -2656,18 +2615,9 @@ static int ext4_da_write_end(struct file *file,
>  	unsigned long start, end;
>  	int write_mode = (int)(unsigned long)fsdata;
>  
> -	if (write_mode == FALL_BACK_TO_NONDELALLOC) {
> -		switch (ext4_inode_journal_mode(inode)) {
> -		case EXT4_INODE_ORDERED_DATA_MODE:
> -			return ext4_ordered_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		case EXT4_INODE_WRITEBACK_DATA_MODE:
> -			return ext4_writeback_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		default:
> -			BUG();
> -		}
> -	}
> +	if (write_mode == FALL_BACK_TO_NONDELALLOC)
> +		return ext4_write_end(file, mapping, pos,
> +				      len, copied, page, fsdata);
>  
>  	trace_ext4_da_write_end(inode, pos, len, copied);
>  	start = pos & (PAGE_CACHE_SIZE - 1);
> @@ -3172,27 +3122,12 @@ static int ext4_journalled_set_page_dirty(struct page *page)
>  	return __set_page_dirty_nobuffers(page);
>  }
>  
> -static const struct address_space_operations ext4_ordered_aops = {
> +static const struct address_space_operations ext4_aops = {
>  	.readpage		= ext4_readpage,
>  	.readpages		= ext4_readpages,
>  	.writepage		= ext4_writepage,
>  	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_ordered_write_end,
> -	.bmap			= ext4_bmap,
> -	.invalidatepage		= ext4_invalidatepage,
> -	.releasepage		= ext4_releasepage,
> -	.direct_IO		= ext4_direct_IO,
> -	.migratepage		= buffer_migrate_page,
> -	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> -};
> -
> -static const struct address_space_operations ext4_writeback_aops = {
> -	.readpage		= ext4_readpage,
> -	.readpages		= ext4_readpages,
> -	.writepage		= ext4_writepage,
> -	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_writeback_write_end,
> +	.write_end		= ext4_write_end,
>  	.bmap			= ext4_bmap,
>  	.invalidatepage		= ext4_invalidatepage,
>  	.releasepage		= ext4_releasepage,
> @@ -3237,23 +3172,21 @@ void ext4_set_aops(struct inode *inode)
>  {
>  	switch (ext4_inode_journal_mode(inode)) {
>  	case EXT4_INODE_ORDERED_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_ordered_aops;
> +		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_writeback_aops;
> +		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_JOURNAL_DATA_MODE:
>  		inode->i_mapping->a_ops = &ext4_journalled_aops;
> -		break;
> +		return;
>  	default:
>  		BUG();
>  	}
> +	if (test_opt(inode->i_sb, DELALLOC))
> +		inode->i_mapping->a_ops = &ext4_da_aops;
> +	else
> +		inode->i_mapping->a_ops = &ext4_aops;
>  }
>  
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Feb. 22, 2013, 3:20 p.m. UTC | #3
On Fri, Feb 22, 2013 at 09:18:12PM +0800, Zheng Liu wrote:
> > If there was a mode that I'd be tempted to get rid of, it would be the
> > combined data=ordered/data=writeback modes.  The main reason why we
> > keep it is because of a concern of buggy applications that depend on
> > the implied fsync() of data=ordered mode at each commit.  However,
> > ext4 has been around for long enough that I think most of the buggy
> > applications have been fixed by now.  And of course, those buggy
> > applications will lose in the same way when they are using btrfs and
> > xfs.  Something to consider....
> 
> IMHO, the application shouldn't depend on a kernel feature.  So maybe it
> is time to highlight this buggy usage.

Oh, agreed.  The question is how many people want us to keep the ext3
semantics to support those buggy applications.  To the extent that
distros are considering using ext4 to support ext3 file systems, there
might be a desire to maintain (as closely as possible) ext3 semantics,
even those that support buggy applications.  The primary problem is
that the when comes down to application developers versus file system
developers, the application developers vastly outnumber us.   :-)

> Just one minor comment below.  Otherwise the patch looks good to me, and
> it can pass in xfstests with 'data=ordered' and 'data=writeback'.

I hadn't bothered testing it yet because I'm focused on testing
and cleaning up the set of patches for the merge window --- and this
change is clearly for the next merge window.  Thanks for testing it!

> >  	trace_ext4_ordered_write_end(inode, pos, len, copied);
> 
> Here this function needs to be renamed with trace_ext4_write_end().

Yes, agreed.

I also need to get rid of trace_ext4_writeback_write_end() in
include/trace/events/ext4.h.

The other thing that needs to be done --- probably in a separate
commit, just to make things easier to review for correctness, is now
that we've folded ext4_writeback_write_end() and ext4_ordered_write_end()
into a single function, we now have a single user of
ext4_generic_write_end(), so we can now fold ext4_generic_write_end()
into ext4_write_end().

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu Feb. 22, 2013, 4:26 p.m. UTC | #4
On Fri, Feb 22, 2013 at 10:20:42AM -0500, Theodore Ts'o wrote:
> On Fri, Feb 22, 2013 at 09:18:12PM +0800, Zheng Liu wrote:
> > > If there was a mode that I'd be tempted to get rid of, it would be the
> > > combined data=ordered/data=writeback modes.  The main reason why we
> > > keep it is because of a concern of buggy applications that depend on
> > > the implied fsync() of data=ordered mode at each commit.  However,
> > > ext4 has been around for long enough that I think most of the buggy
> > > applications have been fixed by now.  And of course, those buggy
> > > applications will lose in the same way when they are using btrfs and
> > > xfs.  Something to consider....
> > 
> > IMHO, the application shouldn't depend on a kernel feature.  So maybe it
> > is time to highlight this buggy usage.
> 
> Oh, agreed.  The question is how many people want us to keep the ext3
> semantics to support those buggy applications.  To the extent that
> distros are considering using ext4 to support ext3 file systems, there
> might be a desire to maintain (as closely as possible) ext3 semantics,
> even those that support buggy applications.  The primary problem is
> that the when comes down to application developers versus file system
> developers, the application developers vastly outnumber us.   :-)

Yes, as file system developers we always need to meet the application
developers' requirement.  So it seems that we still need to keep it in
ext4. :-)

> 
> > Just one minor comment below.  Otherwise the patch looks good to me, and
> > it can pass in xfstests with 'data=ordered' and 'data=writeback'.
> 
> I hadn't bothered testing it yet because I'm focused on testing
> and cleaning up the set of patches for the merge window --- and this
> change is clearly for the next merge window.  Thanks for testing it!

I guess you are busy testing patches for the merge window.  One thing I
need to let you know is this patch [1].  I really think it should be
applied for this merge window because it fixes a security hole due to my
fault.  As commit log describes, a non-privilege user could cause system
crash using a truncate(1) command.  So please check it.

1. http://www.spinics.net/lists/linux-ext4/msg36784.html

> 
> > >  	trace_ext4_ordered_write_end(inode, pos, len, copied);
> > 
> > Here this function needs to be renamed with trace_ext4_write_end().
> 
> Yes, agreed.
> 
> I also need to get rid of trace_ext4_writeback_write_end() in
> include/trace/events/ext4.h.
> 
> The other thing that needs to be done --- probably in a separate
> commit, just to make things easier to review for correctness, is now
> that we've folded ext4_writeback_write_end() and ext4_ordered_write_end()
> into a single function, we now have a single user of
> ext4_generic_write_end(), so we can now fold ext4_generic_write_end()
> into ext4_write_end().

Yes, we can take a close look at in next merge window.

Thanks,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6e16c18..85dfd49 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1373,6 +1373,7 @@  enum {
 	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
 					   nolocking */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
+	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 95a0c62..2e338a6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1055,77 +1055,36 @@  static int ext4_generic_write_end(struct file *file,
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_write_end(struct file *file,
-				  struct address_space *mapping,
-				  loff_t pos, unsigned len, unsigned copied,
-				  struct page *page, void *fsdata)
+static int ext4_write_end(struct file *file,
+			  struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
 	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 
 	trace_ext4_ordered_write_end(inode, pos, len, copied);
-	ret = ext4_jbd2_file_inode(handle, inode);
-
-	if (ret == 0) {
-		ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
-							page, fsdata);
-		copied = ret2;
-		if (pos + len > inode->i_size && ext4_can_truncate(inode))
-			/* if we have allocated more blocks and copied
-			 * less. We will have blocks allocated outside
-			 * inode->i_size. So truncate them
-			 */
-			ext4_orphan_add(handle, inode);
-		if (ret2 < 0)
-			ret = ret2;
-	} else {
-		unlock_page(page);
-		page_cache_release(page);
-	}
-
-	ret2 = ext4_journal_stop(handle);
-	if (!ret)
-		ret = ret2;
-
-	if (pos + len > inode->i_size) {
-		ext4_truncate_failed_write(inode);
-		/*
-		 * If truncate failed early the inode might still be
-		 * on the orphan list; we need to make sure the inode
-		 * is removed from the orphan list in that case.
-		 */
-		if (inode->i_nlink)
-			ext4_orphan_del(NULL, inode);
+	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
+		ret = ext4_jbd2_file_inode(handle, inode);
+		if (ret) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto errout;
+		}
 	}
 
-
-	return ret ? ret : copied;
-}
-
-static int ext4_writeback_write_end(struct file *file,
-				    struct address_space *mapping,
-				    loff_t pos, unsigned len, unsigned copied,
-				    struct page *page, void *fsdata)
-{
-	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = mapping->host;
-	int ret = 0, ret2;
-
-	trace_ext4_writeback_write_end(inode, pos, len, copied);
-	ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
-							page, fsdata);
-	copied = ret2;
+	copied = ext4_generic_write_end(file, mapping, pos, len, copied,
+					page, fsdata);
+	if (copied < 0)
+		ret = copied;
 	if (pos + len > inode->i_size && ext4_can_truncate(inode))
 		/* if we have allocated more blocks and copied
 		 * less. We will have blocks allocated outside
 		 * inode->i_size. So truncate them
 		 */
 		ext4_orphan_add(handle, inode);
-
-	if (ret2 < 0)
-		ret = ret2;
-
+errout:
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
@@ -2656,18 +2615,9 @@  static int ext4_da_write_end(struct file *file,
 	unsigned long start, end;
 	int write_mode = (int)(unsigned long)fsdata;
 
-	if (write_mode == FALL_BACK_TO_NONDELALLOC) {
-		switch (ext4_inode_journal_mode(inode)) {
-		case EXT4_INODE_ORDERED_DATA_MODE:
-			return ext4_ordered_write_end(file, mapping, pos,
-					len, copied, page, fsdata);
-		case EXT4_INODE_WRITEBACK_DATA_MODE:
-			return ext4_writeback_write_end(file, mapping, pos,
-					len, copied, page, fsdata);
-		default:
-			BUG();
-		}
-	}
+	if (write_mode == FALL_BACK_TO_NONDELALLOC)
+		return ext4_write_end(file, mapping, pos,
+				      len, copied, page, fsdata);
 
 	trace_ext4_da_write_end(inode, pos, len, copied);
 	start = pos & (PAGE_CACHE_SIZE - 1);
@@ -3172,27 +3122,12 @@  static int ext4_journalled_set_page_dirty(struct page *page)
 	return __set_page_dirty_nobuffers(page);
 }
 
-static const struct address_space_operations ext4_ordered_aops = {
+static const struct address_space_operations ext4_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.write_begin		= ext4_write_begin,
-	.write_end		= ext4_ordered_write_end,
-	.bmap			= ext4_bmap,
-	.invalidatepage		= ext4_invalidatepage,
-	.releasepage		= ext4_releasepage,
-	.direct_IO		= ext4_direct_IO,
-	.migratepage		= buffer_migrate_page,
-	.is_partially_uptodate  = block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
-};
-
-static const struct address_space_operations ext4_writeback_aops = {
-	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
-	.writepage		= ext4_writepage,
-	.write_begin		= ext4_write_begin,
-	.write_end		= ext4_writeback_write_end,
+	.write_end		= ext4_write_end,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -3237,23 +3172,21 @@  void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
 	case EXT4_INODE_ORDERED_DATA_MODE:
-		if (test_opt(inode->i_sb, DELALLOC))
-			inode->i_mapping->a_ops = &ext4_da_aops;
-		else
-			inode->i_mapping->a_ops = &ext4_ordered_aops;
+		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_WRITEBACK_DATA_MODE:
-		if (test_opt(inode->i_sb, DELALLOC))
-			inode->i_mapping->a_ops = &ext4_da_aops;
-		else
-			inode->i_mapping->a_ops = &ext4_writeback_aops;
+		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_JOURNAL_DATA_MODE:
 		inode->i_mapping->a_ops = &ext4_journalled_aops;
-		break;
+		return;
 	default:
 		BUG();
 	}
+	if (test_opt(inode->i_sb, DELALLOC))
+		inode->i_mapping->a_ops = &ext4_da_aops;
+	else
+		inode->i_mapping->a_ops = &ext4_aops;
 }