diff mbox

[GIT,PULL] Ext3 latency fixes

Message ID 1239294219.31257.10.camel@think.oraclecorp.com
State Superseded, archived
Headers show

Commit Message

Chris Mason April 9, 2009, 4:23 p.m. UTC
On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> 
> On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > 
> > One of these patches fixes a performance regression caused by a64c8610,
> > which unplugged the write queue after every page write.  Now that Jens
> > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > to further improbve ext3 latency, especially during the "sync" command
> > in Linus's write-big-file-and-sync workload.
> 
> So here's a question and a untested _conceptual_ patch. 
> 
> The kind of writeback mode I'd personally prefer would be more of a 
> mixture of the current "data=writeback" and "data=ordered" modes, with 
> something of the best of both worlds. I'd like the data writeback to get 
> _started_ when the journal is written to disk, but I'd like it to not 
> block journal updates.
> 
> IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> be totally unordered either.
> 

I started working on the xfs style i_size updates last night, and here's
my current (most definitely broken) proof of concept.  I call it
data=guarded.

In guarded mode the on disk i_size is not updated until after the data
writes are complete.  I've got a per FS work queue and I'm abusing
bh->b_private as a list pointer.  So, what happens is:

* writepage sets up the buffer with the guarded end_io handler

* The end_io handler puts the buffer onto the per-sb list of guarded
buffers and then it kicks the work queue

* The workqueue updates the on disk i_size to the min of the end of the
buffer or the in-memory i_size, and then it logs the inode.

* Then the regular async bh end_io handler is called to end writeback on
the page.

One big gotcha is that we starting a transaction while a page is
writeback.  It means that anyone who waits for writeback to finish on
the datapage with a transaction running could deadlock against the work
queue func trying to start a transaction.

I couldn't find anyone doing that, but if it matters, we can always just
mark the inode dirty and let some other async func handle the logging.
We could also play tricks with logging the inode after the real end_io
handler clears PG_writeback.

This code doesn't:

* Deal with hole filling (plan is just to use the ordered code there)

* Make sure all the blocks are on disk between the new disk i_size and
the old one.  For this, I'll add an rbtree to track BH_New buffers and
delay updating the disk isize until the pending BH_New IO is on disk.
Btrfs already does this, so I should have a handle on the spots I need
to fiddle.

There's a ton of room for optimization like not doing async end_io if
we're already inside disk i_size.

-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jan Kara April 9, 2009, 5:49 p.m. UTC | #1
> On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > 
> > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > 
> > > One of these patches fixes a performance regression caused by a64c8610,
> > > which unplugged the write queue after every page write.  Now that Jens
> > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > to further improbve ext3 latency, especially during the "sync" command
> > > in Linus's write-big-file-and-sync workload.
> > 
> > So here's a question and a untested _conceptual_ patch. 
> > 
> > The kind of writeback mode I'd personally prefer would be more of a 
> > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > something of the best of both worlds. I'd like the data writeback to get 
> > _started_ when the journal is written to disk, but I'd like it to not 
> > block journal updates.
> > 
> > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > be totally unordered either.
> > 
> 
> I started working on the xfs style i_size updates last night, and here's
> my current (most definitely broken) proof of concept.  I call it
> data=guarded.
> 
> In guarded mode the on disk i_size is not updated until after the data
> writes are complete.  I've got a per FS work queue and I'm abusing
> bh->b_private as a list pointer.  So, what happens is:
> 
> * writepage sets up the buffer with the guarded end_io handler
> 
> * The end_io handler puts the buffer onto the per-sb list of guarded
> buffers and then it kicks the work queue
> 
> * The workqueue updates the on disk i_size to the min of the end of the
> buffer or the in-memory i_size, and then it logs the inode.
> 
> * Then the regular async bh end_io handler is called to end writeback on
> the page.
> 
> One big gotcha is that we starting a transaction while a page is
> writeback.  It means that anyone who waits for writeback to finish on
> the datapage with a transaction running could deadlock against the work
> queue func trying to start a transaction.
  For ext3 I don't think anyone waits for PageWriteback with a
transaction open. We definitely don't do it from ext3 code and generic
code does usually sequence like:
  lock_page(page);
  ...
  wait_on_page_writeback(page)

  and because lock ordering is page_lock < transaction start, we
shouldn't have transaction open at that point.
  But with ext4 it may be different - there, the lock ordering is
transaction start > page_lock and so above code could well have
transaction started.
  Wouldn't it actually be better to update i_size when the page is
fully written out after we clear PG_writeback as you write below?
  One thing which does not seem to be handled is that your code can
happily race with truncate. So IO completion could reset i_size which
has been just set by truncate. And I'm not sure how to handle this
effectively. Generally I'm not sure how much this is going to cost...

> I couldn't find anyone doing that, but if it matters, we can always just
> mark the inode dirty and let some other async func handle the logging.
> We could also play tricks with logging the inode after the real end_io
> handler clears PG_writeback.
> 
> This code doesn't:
> 
> * Deal with hole filling (plan is just to use the ordered code there)
> 
> * Make sure all the blocks are on disk between the new disk i_size and
> the old one.  For this, I'll add an rbtree to track BH_New buffers and
> delay updating the disk isize until the pending BH_New IO is on disk.
> Btrfs already does this, so I should have a handle on the spots I need
> to fiddle.
> 
> There's a ton of room for optimization like not doing async end_io if
> we're already inside disk i_size.

									Honza
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..c5e1ffd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -505,7 +505,7 @@ still_busy:
>   * Completion handler for block_write_full_page() - pages which are unlocked
>   * during I/O, and which have PageWriteback cleared upon I/O completion.
>   */
> -static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
> +void end_buffer_async_write(struct buffer_head *bh, int uptodate)
>  {
>  	char b[BDEVNAME_SIZE];
>  	unsigned long flags;
> @@ -583,11 +583,17 @@ static void mark_buffer_async_read(struct buffer_head *bh)
>  	set_buffer_async_read(bh);
>  }
>  
> -void mark_buffer_async_write(struct buffer_head *bh)
> +void mark_buffer_async_write_endio(struct buffer_head *bh,
> +				   bh_end_io_t *handler)
>  {
> -	bh->b_end_io = end_buffer_async_write;
> +	bh->b_end_io = handler;
>  	set_buffer_async_write(bh);
>  }
> +
> +void mark_buffer_async_write(struct buffer_head *bh)
> +{
> +	mark_buffer_async_write_endio(bh, end_buffer_async_write);
> +}
>  EXPORT_SYMBOL(mark_buffer_async_write);
>  
>  
> @@ -1706,7 +1712,8 @@ EXPORT_SYMBOL(unmap_underlying_metadata);
>   * prevents this contention from occurring.
>   */
>  static int __block_write_full_page(struct inode *inode, struct page *page,
> -			get_block_t *get_block, struct writeback_control *wbc)
> +			get_block_t *get_block, struct writeback_control *wbc,
> +			bh_end_io_t *handler)
>  {
>  	int err;
>  	sector_t block;
> @@ -1789,7 +1796,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  			continue;
>  		}
>  		if (test_clear_buffer_dirty(bh)) {
> -			mark_buffer_async_write(bh);
> +			mark_buffer_async_write_endio(bh, handler);
>  		} else {
>  			unlock_buffer(bh);
>  		}
> @@ -1842,7 +1849,7 @@ recover:
>  		if (buffer_mapped(bh) && buffer_dirty(bh) &&
>  		    !buffer_delay(bh)) {
>  			lock_buffer(bh);
> -			mark_buffer_async_write(bh);
> +			mark_buffer_async_write_endio(bh, handler);
>  		} else {
>  			/*
>  			 * The buffer may have been set dirty during
> @@ -2760,7 +2767,8 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
>  out:
>  	ret = mpage_writepage(page, get_block, wbc);
>  	if (ret == -EAGAIN)
> -		ret = __block_write_full_page(inode, page, get_block, wbc);
> +		ret = __block_write_full_page(inode, page, get_block, wbc,
> +					      end_buffer_async_write);
>  	return ret;
>  }
>  EXPORT_SYMBOL(nobh_writepage);
> @@ -2918,9 +2926,10 @@ out:
>  
>  /*
>   * The generic ->writepage function for buffer-backed address_spaces
> + * this form passes in the end_io handler used to finish the IO.
>   */
> -int block_write_full_page(struct page *page, get_block_t *get_block,
> -			struct writeback_control *wbc)
> +int block_write_full_page_endio(struct page *page, get_block_t *get_block,
> +			struct writeback_control *wbc, bh_end_io_t *handler)
>  {
>  	struct inode * const inode = page->mapping->host;
>  	loff_t i_size = i_size_read(inode);
> @@ -2929,7 +2938,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
>  
>  	/* Is the page fully inside i_size? */
>  	if (page->index < end_index)
> -		return __block_write_full_page(inode, page, get_block, wbc);
> +		return __block_write_full_page(inode, page, get_block, wbc,
> +					       handler);
>  
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> @@ -2952,9 +2962,20 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
>  	 * writes to that region are not written out to the file."
>  	 */
>  	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> -	return __block_write_full_page(inode, page, get_block, wbc);
> +	return __block_write_full_page(inode, page, get_block, wbc, handler);
>  }
>  
> +/*
> + * The generic ->writepage function for buffer-backed address_spaces
> + */
> +int block_write_full_page(struct page *page, get_block_t *get_block,
> +			struct writeback_control *wbc)
> +{
> +	return block_write_full_page_endio(page, get_block, wbc,
> +					   end_buffer_async_write);
> +}
> +
> +
>  sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
>  			    get_block_t *get_block)
>  {
> @@ -3422,9 +3443,11 @@ EXPORT_SYMBOL(block_read_full_page);
>  EXPORT_SYMBOL(block_sync_page);
>  EXPORT_SYMBOL(block_truncate_page);
>  EXPORT_SYMBOL(block_write_full_page);
> +EXPORT_SYMBOL(block_write_full_page_endio);
>  EXPORT_SYMBOL(cont_write_begin);
>  EXPORT_SYMBOL(end_buffer_read_sync);
>  EXPORT_SYMBOL(end_buffer_write_sync);
> +EXPORT_SYMBOL_GPL(end_buffer_async_write);
>  EXPORT_SYMBOL(file_fsync);
>  EXPORT_SYMBOL(fsync_bdev);
>  EXPORT_SYMBOL(generic_block_bmap);
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 5fa453b..64995d0 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -38,6 +38,7 @@
>  #include <linux/bio.h>
>  #include <linux/fiemap.h>
>  #include <linux/namei.h>
> +#include <linux/workqueue.h>
>  #include "xattr.h"
>  #include "acl.h"
>  
> @@ -766,6 +767,21 @@ err_out:
>  	return err;
>  }
>  
> +static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
> +{
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	/* FIXME add a lock in the inode */
> +	spin_lock_irqsave(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
> +	if (EXT3_I(inode)->i_disksize < new_size) {
> +		EXT3_I(inode)->i_disksize = new_size;
> +		ret = 1;
> +	}
> +	spin_unlock_irqrestore(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
> +	return ret;
> +}
> +
>  /*
>   * Allocation strategy is simple: if we have to allocate something, we will
>   * have to go the whole way to leaf. So let's do it before attaching anything
> @@ -915,9 +931,13 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
>  	 * i_disksize growing is protected by truncate_mutex.  Don't forget to
>  	 * protect it if you're about to implement concurrent
>  	 * ext3_get_block() -bzzz
> +	 *
> +	 * FIXME, I think this only needs to extend the disk i_size when
> +	 * we're filling holes that came from using ftruncate to increase
> +	 * i_size.  Need to verify.
>  	*/
> -	if (!err && extend_disksize && inode->i_size > ei->i_disksize)
> -		ei->i_disksize = inode->i_size;
> +	if (!ext3_should_guard_data(inode) && !err && extend_disksize)
> +		maybe_update_disk_isize(inode, inode->i_size);
>  	mutex_unlock(&ei->truncate_mutex);
>  	if (err)
>  		goto cleanup;
> @@ -1079,6 +1099,50 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
>  	return NULL;
>  }
>  
> +void ext3_run_guarded_work(struct work_struct *work)
> +{
> +	struct ext3_sb_info *sbi =
> +		container_of(work, struct ext3_sb_info, guarded_work);
> +	struct buffer_head *bh;
> +	struct buffer_head *next;
> +	struct inode *inode;
> +	struct page *page;
> +	struct address_space *mapping;
> +	loff_t offset;
> +
> +	spin_lock_irq(&sbi->guarded_lock);
> +	while(sbi->guarded_buffers) {
> +		bh = sbi->guarded_buffers;
> +		next = bh->b_private;
> +		if (!next)
> +			sbi->guarded_tail = NULL;
> +		sbi->guarded_buffers = next;
> +		bh->b_private = NULL;
> +		spin_unlock_irq(&sbi->guarded_lock);
> +
> +		page = bh->b_page;
> +		mapping = page->mapping;
> +		if (!mapping)
> +			goto out;
> +
> +		/* set the offset to the end of this buffer */
> +		offset = page_offset(page) + bh_offset(bh) + bh->b_size;
> +		inode = mapping->host;
> +
> +		/*
> +		 * then chomp back to i_size if that is smaller than the
> +		 * offset
> +		 */
> +		offset = min(offset, inode->i_size);
> +		if (maybe_update_disk_isize(inode, offset))
> +			ext3_dirty_inode(inode);
> +out:
> +		end_buffer_async_write(bh, buffer_uptodate(bh));
> +		spin_lock_irq(&sbi->guarded_lock);
> +	}
> +	spin_unlock_irq(&sbi->guarded_lock);
> +}
> +
>  static int walk_page_buffers(	handle_t *handle,
>  				struct buffer_head *head,
>  				unsigned from,
> @@ -1275,8 +1339,7 @@ static int ext3_ordered_write_end(struct file *file,
>  		loff_t new_i_size;
>  
>  		new_i_size = pos + copied;
> -		if (new_i_size > EXT3_I(inode)->i_disksize)
> -			EXT3_I(inode)->i_disksize = new_i_size;
> +		maybe_update_disk_isize(inode, new_i_size);
>  		ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
>  							page, fsdata);
>  		copied = ret2;
> @@ -1303,8 +1366,30 @@ static int ext3_writeback_write_end(struct file *file,
>  	loff_t new_i_size;
>  
>  	new_i_size = pos + copied;
> -	if (new_i_size > EXT3_I(inode)->i_disksize)
> -		EXT3_I(inode)->i_disksize = new_i_size;
> +	maybe_update_disk_isize(inode, new_i_size);
> +
> +	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> +							page, fsdata);
> +	copied = ret2;
> +	if (ret2 < 0)
> +		ret = ret2;
> +
> +	ret2 = ext3_journal_stop(handle);
> +	if (!ret)
> +		ret = ret2;
> +	unlock_page(page);
> +	page_cache_release(page);
> +
> +	return ret ? ret : copied;
> +}
> +
> +static int ext3_guarded_write_end(struct file *file,
> +				struct address_space *mapping,
> +				loff_t pos, unsigned len, unsigned copied,
> +				struct page *page, void *fsdata)
> +{
> +	handle_t *handle = ext3_journal_current_handle();
> +	int ret = 0, ret2;
>  
>  	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
>  							page, fsdata);
> @@ -1553,6 +1638,74 @@ out_fail:
>  	return ret;
>  }
>  
> +/*
> + * Completion handler for block_write_full_page() - pages which are unlocked
> + * during I/O, and which have PageWriteback cleared upon I/O completion.
> + */
> +static void end_buffer_async_write_guarded(struct buffer_head *bh,
> +					   int uptodate)
> +{
> +	struct ext3_sb_info *sbi;
> +	struct address_space *mapping;
> +	unsigned long flags;
> +
> +	mapping = bh->b_page->mapping;
> +	if (!mapping || bh->b_private) {
> +		end_buffer_async_write(bh, uptodate);
> +		return;
> +	}
> +
> +	/*
> +	 * the end_io callback deals with IO errors later
> +	 */
> +	if (uptodate)
> +		set_buffer_uptodate(bh);
> +	else
> +		clear_buffer_uptodate(bh);
> +
> +	sbi = EXT3_SB(mapping->host->i_sb);
> +	spin_lock_irqsave(&sbi->guarded_lock, flags);
> +	if (sbi->guarded_tail) {
> +		struct buffer_head *last = sbi->guarded_tail;
> +		last->b_private = bh;
> +	} else
> +		sbi->guarded_buffers = bh;
> +	sbi->guarded_tail = bh;
> +	spin_unlock_irqrestore(&sbi->guarded_lock, flags);
> +	queue_work(sbi->guarded_wq, &sbi->guarded_work);
> +}
> +
> +static int ext3_guarded_writepage(struct page *page,
> +				struct writeback_control *wbc)
> +{
> +	struct inode *inode = page->mapping->host;
> +	handle_t *handle = NULL;
> +	int ret = 0;
> +	int err;
> +
> +	if (ext3_journal_current_handle())
> +		goto out_fail;
> +
> +	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> +	if (IS_ERR(handle)) {
> +		ret = PTR_ERR(handle);
> +		goto out_fail;
> +	}
> +
> +	ret = block_write_full_page_endio(page, ext3_get_block, wbc,
> +					  end_buffer_async_write_guarded);
> +
> +	err = ext3_journal_stop(handle);
> +	if (!ret)
> +		ret = err;
> +	return ret;
> +
> +out_fail:
> +	redirty_page_for_writepage(wbc, page);
> +	unlock_page(page);
> +	return ret;
> +}
> +
>  static int ext3_writeback_writepage(struct page *page,
>  				struct writeback_control *wbc)
>  {
> @@ -1812,6 +1965,21 @@ static const struct address_space_operations ext3_writeback_aops = {
>  	.is_partially_uptodate  = block_is_partially_uptodate,
>  };
>  
> +static const struct address_space_operations ext3_guarded_aops = {
> +	.readpage		= ext3_readpage,
> +	.readpages		= ext3_readpages,
> +	.writepage		= ext3_guarded_writepage,
> +	.sync_page		= block_sync_page,
> +	.write_begin		= ext3_write_begin,
> +	.write_end		= ext3_guarded_write_end,
> +	.bmap			= ext3_bmap,
> +	.invalidatepage		= ext3_invalidatepage,
> +	.releasepage		= ext3_releasepage,
> +	.direct_IO		= ext3_direct_IO,
> +	.migratepage		= buffer_migrate_page,
> +	.is_partially_uptodate  = block_is_partially_uptodate,
> +};
> +
>  static const struct address_space_operations ext3_journalled_aops = {
>  	.readpage		= ext3_readpage,
>  	.readpages		= ext3_readpages,
> @@ -1830,6 +1998,8 @@ void ext3_set_aops(struct inode *inode)
>  {
>  	if (ext3_should_order_data(inode))
>  		inode->i_mapping->a_ops = &ext3_ordered_aops;
> +	else if (ext3_should_guard_data(inode))
> +		inode->i_mapping->a_ops = &ext3_guarded_aops;
>  	else if (ext3_should_writeback_data(inode))
>  		inode->i_mapping->a_ops = &ext3_writeback_aops;
>  	else
> @@ -3081,6 +3251,14 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
>  		}
>  
>  		error = ext3_orphan_add(handle, inode);
> +
> +		/*
> +		 * this is pretty confusing, but we don't need to worry
> +		 * about guarded i_size here because ext3 truncate fixes
> +		 * it to the correct i_size when the truncate is all done,
> +		 * and the ext3_orphan_add makes sure we'll have a sane
> +		 * i_size after a crash
> +		 */
>  		EXT3_I(inode)->i_disksize = attr->ia_size;
>  		rc = ext3_mark_inode_dirty(handle, inode);
>  		if (!error)
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 4a97041..0534a95 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -37,6 +37,7 @@
>  #include <linux/quotaops.h>
>  #include <linux/seq_file.h>
>  #include <linux/log2.h>
> +#include <linux/workqueue.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -393,6 +394,9 @@ static void ext3_put_super (struct super_block * sb)
>  	struct ext3_super_block *es = sbi->s_es;
>  	int i, err;
>  
> +	flush_workqueue(sbi->guarded_wq);
> +	destroy_workqueue(sbi->guarded_wq);
> +
>  	ext3_xattr_put_super(sb);
>  	err = journal_destroy(sbi->s_journal);
>  	sbi->s_journal = NULL;
> @@ -628,6 +632,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
>  		seq_puts(seq, ",data=journal");
>  	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
>  		seq_puts(seq, ",data=ordered");
> +	else if (test_opt(sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
> +		seq_puts(seq, ",data=guarded");
>  	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
>  		seq_puts(seq, ",data=writeback");
>  
> @@ -786,7 +792,7 @@ enum {
>  	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
>  	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
>  	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
> -	Opt_data_err_abort, Opt_data_err_ignore,
> +	Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
>  	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>  	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>  	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> @@ -828,6 +834,7 @@ static const match_table_t tokens = {
>  	{Opt_abort, "abort"},
>  	{Opt_data_journal, "data=journal"},
>  	{Opt_data_ordered, "data=ordered"},
> +	{Opt_data_guarded, "data=guarded"},
>  	{Opt_data_writeback, "data=writeback"},
>  	{Opt_data_err_abort, "data_err=abort"},
>  	{Opt_data_err_ignore, "data_err=ignore"},
> @@ -1030,6 +1037,9 @@ static int parse_options (char *options, struct super_block *sb,
>  		case Opt_data_ordered:
>  			data_opt = EXT3_MOUNT_ORDERED_DATA;
>  			goto datacheck;
> +		case Opt_data_guarded:
> +			data_opt = EXT3_MOUNT_GUARDED_DATA;
> +			goto datacheck;
>  		case Opt_data_writeback:
>  			data_opt = EXT3_MOUNT_WRITEBACK_DATA;
>  		datacheck:
> @@ -1945,11 +1955,24 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  			clear_opt(sbi->s_mount_opt, NOBH);
>  		}
>  	}
> +
> +	/*
> +	 * setup the guarded work list
> +	 */
> +	EXT3_SB(sb)->guarded_buffers = NULL;
> +	EXT3_SB(sb)->guarded_tail = NULL;
> +	INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
> +	spin_lock_init(&EXT3_SB(sb)->guarded_lock);
> +	EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
> +	if (!EXT3_SB(sb)->guarded_wq) {
> +		printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
> +		goto failed_mount_guard;
> +	}
> +
>  	/*
>  	 * The journal_load will have done any necessary log recovery,
>  	 * so we can safely mount the rest of the filesystem now.
>  	 */
> -
>  	root = ext3_iget(sb, EXT3_ROOT_INO);
>  	if (IS_ERR(root)) {
>  		printk(KERN_ERR "EXT3-fs: get root inode failed\n");
> @@ -1961,6 +1984,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
>  		goto failed_mount4;
>  	}
> +
>  	sb->s_root = d_alloc_root(root);
>  	if (!sb->s_root) {
>  		printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
> @@ -1970,6 +1994,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  	}
>  
>  	ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
> +
>  	/*
>  	 * akpm: core read_super() calls in here with the superblock locked.
>  	 * That deadlocks, because orphan cleanup needs to lock the superblock
> @@ -1985,9 +2010,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		printk (KERN_INFO "EXT3-fs: recovery complete.\n");
>  	ext3_mark_recovery_complete(sb, es);
>  	printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
> -		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
> -		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
> -		"writeback");
> +	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
> +	      test_opt(sb,GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA ? "guarded":
> +	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
> +	      "writeback");
>  
>  	lock_kernel();
>  	return 0;
> @@ -1999,6 +2025,8 @@ cantfind_ext3:
>  	goto failed_mount;
>  
>  failed_mount4:
> +	destroy_workqueue(EXT3_SB(sb)->guarded_wq);
> +failed_mount_guard:
>  	journal_destroy(sbi->s_journal);
>  failed_mount3:
>  	percpu_counter_destroy(&sbi->s_freeblocks_counter);
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index bd7ac79..507b38d 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -155,6 +155,7 @@ void create_empty_buffers(struct page *, unsigned long,
>  			unsigned long b_state);
>  void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
>  void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
> +void end_buffer_async_write(struct buffer_head *bh, int uptodate);
>  
>  /* Things to do with buffers at mapping->private_list */
>  void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
> @@ -204,6 +205,8 @@ extern int buffer_heads_over_limit;
>  void block_invalidatepage(struct page *page, unsigned long offset);
>  int block_write_full_page(struct page *page, get_block_t *get_block,
>  				struct writeback_control *wbc);
> +int block_write_full_page_endio(struct page *page, get_block_t *get_block,
> +			struct writeback_control *wbc, bh_end_io_t *handler);
>  int block_read_full_page(struct page*, get_block_t*);
>  int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
>  				unsigned long from);
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index dd495b8..7966bdb 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -18,6 +18,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/magic.h>
> +#include <linux/workqueue.h>
>  
>  /*
>   * The second extended filesystem constants/structures
> @@ -397,7 +398,6 @@ struct ext3_inode {
>  #define EXT3_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
>  #define EXT3_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
>  #define EXT3_MOUNT_ABORT		0x00200	/* Fatal error detected */
> -#define EXT3_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
>  #define EXT3_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
>  #define EXT3_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
>  #define EXT3_MOUNT_WRITEBACK_DATA	0x00C00	/* No data ordering */
> @@ -413,6 +413,12 @@ struct ext3_inode {
>  #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
>  #define EXT3_MOUNT_DATA_ERR_ABORT	0x400000 /* Abort on file data write
>  						  * error in ordered mode */
> +#define EXT3_MOUNT_GUARDED_DATA		0x800000 /* guard new writes with
> +						    i_size */
> +#define EXT3_MOUNT_DATA_FLAGS		(EXT3_MOUNT_JOURNAL_DATA | \
> +					 EXT3_MOUNT_ORDERED_DATA | \
> +					 EXT3_MOUNT_WRITEBACK_DATA | \
> +					 EXT3_MOUNT_GUARDED_DATA)
>  
>  /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
>  #ifndef _LINUX_EXT2_FS_H
> @@ -891,6 +897,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
>  extern void ext3_set_aops(struct inode *inode);
>  extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
> +void ext3_run_guarded_work(struct work_struct *work);
>  
>  /* ioctl.c */
>  extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
> diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
> index f07f34d..868d2cd 100644
> --- a/include/linux/ext3_fs_sb.h
> +++ b/include/linux/ext3_fs_sb.h
> @@ -21,6 +21,7 @@
>  #include <linux/wait.h>
>  #include <linux/blockgroup_lock.h>
>  #include <linux/percpu_counter.h>
> +#include <linux/workqueue.h>
>  #endif
>  #include <linux/rbtree.h>
>  
> @@ -82,6 +83,12 @@ struct ext3_sb_info {
>  	char *s_qf_names[MAXQUOTAS];		/* Names of quota files with journalled quota */
>  	int s_jquota_fmt;			/* Format of quota to use */
>  #endif
> +
> +	struct workqueue_struct *guarded_wq;
> +	struct work_struct guarded_work;
> +	struct buffer_head *guarded_buffers;
> +	struct buffer_head *guarded_tail;
> +	spinlock_t guarded_lock;
>  };
>  
>  static inline spinlock_t *
> diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
> index cf82d51..45cb4aa 100644
> --- a/include/linux/ext3_jbd.h
> +++ b/include/linux/ext3_jbd.h
> @@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
>  	return 0;
>  }
>  
> +static inline int ext3_should_guard_data(struct inode *inode)
> +{
> +	if (!S_ISREG(inode->i_mode))
> +		return 0;
> +	if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
> +		return 0;
> +	if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
> +		return 1;
> +	return 0;
> +}
> +
>  static inline int ext3_should_writeback_data(struct inode *inode)
>  {
>  	if (!S_ISREG(inode->i_mode))
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason April 9, 2009, 6:10 p.m. UTC | #2
On Thu, 2009-04-09 at 19:49 +0200, Jan Kara wrote:
> > On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > > 
> > > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > > 
> > > > One of these patches fixes a performance regression caused by a64c8610,
> > > > which unplugged the write queue after every page write.  Now that Jens
> > > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > > to further improbve ext3 latency, especially during the "sync" command
> > > > in Linus's write-big-file-and-sync workload.
> > > 
> > > So here's a question and a untested _conceptual_ patch. 
> > > 
> > > The kind of writeback mode I'd personally prefer would be more of a 
> > > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > > something of the best of both worlds. I'd like the data writeback to get 
> > > _started_ when the journal is written to disk, but I'd like it to not 
> > > block journal updates.
> > > 
> > > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > > be totally unordered either.
> > > 
> > 
> > I started working on the xfs style i_size updates last night, and here's
> > my current (most definitely broken) proof of concept.  I call it
> > data=guarded.
> > 
> > In guarded mode the on disk i_size is not updated until after the data
> > writes are complete.  I've got a per FS work queue and I'm abusing
> > bh->b_private as a list pointer.  So, what happens is:
> > 
> > * writepage sets up the buffer with the guarded end_io handler
> > 
> > * The end_io handler puts the buffer onto the per-sb list of guarded
> > buffers and then it kicks the work queue
> > 
> > * The workqueue updates the on disk i_size to the min of the end of the
> > buffer or the in-memory i_size, and then it logs the inode.
> > 
> > * Then the regular async bh end_io handler is called to end writeback on
> > the page.
> > 
> > One big gotcha is that we starting a transaction while a page is
> > writeback.  It means that anyone who waits for writeback to finish on
> > the datapage with a transaction running could deadlock against the work
> > queue func trying to start a transaction.
>   For ext3 I don't think anyone waits for PageWriteback with a
> transaction open. We definitely don't do it from ext3 code and generic
> code does usually sequence like:
>   lock_page(page);
>   ...
>   wait_on_page_writeback(page)
> 
>   and because lock ordering is page_lock < transaction start, we
> shouldn't have transaction open at that point.
>   But with ext4 it may be different - there, the lock ordering is
> transaction start > page_lock and so above code could well have
> transaction started.
>   Wouldn't it actually be better to update i_size when the page is
> fully written out after we clear PG_writeback as you write below?

It would, but then we have to take a ref on the inode and risk iput
leading to inode deletion in the handler that is supposed to be doing IO
completion.  It's icky either way ;)

The nice part with doing it before writeback is that we know that when
we wait for page writeback, we don't also have to wait for i_size update
to be finished.

If we go this route and it gets copied to ext4, we can weigh our options
I guess.

>   One thing which does not seem to be handled is that your code can
> happily race with truncate. So IO completion could reset i_size which
> has been just set by truncate. And I'm not sure how to handle this
> effectively. Generally I'm not sure how much this is going to cost...
> 

Yeah, I was worried about that.  What ends up happening is the setattr
call sets the disk i_size and then calls inode_setattr, who calls
vmtruncate who actually waits on the writeback.

Then, we wander into the ext3 truncate who resets disk i_size down
again.  It's a pretty strange window of updates, but I was thinking it
would work to cut down i_size, wait on IO, then cut it down again in
setattr.

Once we wait on all IO past the new in-memory i_size, writepage won't
send any more down.  So updating disk i_size after the wait should be
enough.

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 9, 2009, 7:04 p.m. UTC | #3
On Thu 09-04-09 14:10:03, Chris Mason wrote:
> On Thu, 2009-04-09 at 19:49 +0200, Jan Kara wrote:
> > > On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > > > 
> > > > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > > > 
> > > > > One of these patches fixes a performance regression caused by a64c8610,
> > > > > which unplugged the write queue after every page write.  Now that Jens
> > > > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > > > to further improbve ext3 latency, especially during the "sync" command
> > > > > in Linus's write-big-file-and-sync workload.
> > > > 
> > > > So here's a question and a untested _conceptual_ patch. 
> > > > 
> > > > The kind of writeback mode I'd personally prefer would be more of a 
> > > > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > > > something of the best of both worlds. I'd like the data writeback to get 
> > > > _started_ when the journal is written to disk, but I'd like it to not 
> > > > block journal updates.
> > > > 
> > > > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > > > be totally unordered either.
> > > > 
> > > 
> > > I started working on the xfs style i_size updates last night, and here's
> > > my current (most definitely broken) proof of concept.  I call it
> > > data=guarded.
> > > 
> > > In guarded mode the on disk i_size is not updated until after the data
> > > writes are complete.  I've got a per FS work queue and I'm abusing
> > > bh->b_private as a list pointer.  So, what happens is:
> > > 
> > > * writepage sets up the buffer with the guarded end_io handler
> > > 
> > > * The end_io handler puts the buffer onto the per-sb list of guarded
> > > buffers and then it kicks the work queue
> > > 
> > > * The workqueue updates the on disk i_size to the min of the end of the
> > > buffer or the in-memory i_size, and then it logs the inode.
> > > 
> > > * Then the regular async bh end_io handler is called to end writeback on
> > > the page.
> > > 
> > > One big gotcha is that we starting a transaction while a page is
> > > writeback.  It means that anyone who waits for writeback to finish on
> > > the datapage with a transaction running could deadlock against the work
> > > queue func trying to start a transaction.
> >   For ext3 I don't think anyone waits for PageWriteback with a
> > transaction open. We definitely don't do it from ext3 code and generic
> > code does usually sequence like:
> >   lock_page(page);
> >   ...
> >   wait_on_page_writeback(page)
> > 
> >   and because lock ordering is page_lock < transaction start, we
> > shouldn't have transaction open at that point.
> >   But with ext4 it may be different - there, the lock ordering is
> > transaction start > page_lock and so above code could well have
> > transaction started.
> >   Wouldn't it actually be better to update i_size when the page is
> > fully written out after we clear PG_writeback as you write below?
> 
> It would, but then we have to take a ref on the inode and risk iput
> leading to inode deletion in the handler that is supposed to be doing IO
> completion.  It's icky either way ;)
  Yeah, I though something like this could happen... I had a similar
problem in JBD2 where kjournald could be the last to drop inode reference.
In the end I've solved it by waiting in clear_inode() until the commit code
is done with the inode (and the commit code does not hold any reference
against the inode).

> The nice part with doing it before writeback is that we know that when
> we wait for page writeback, we don't also have to wait for i_size update
> to be finished.
> 
> If we go this route and it gets copied to ext4, we can weigh our options
> I guess.
> 
> >   One thing which does not seem to be handled is that your code can
> > happily race with truncate. So IO completion could reset i_size which
> > has been just set by truncate. And I'm not sure how to handle this
> > effectively. Generally I'm not sure how much this is going to cost...
> > 
> 
> Yeah, I was worried about that.  What ends up happening is the setattr
> call sets the disk i_size and then calls inode_setattr, who calls
> vmtruncate who actually waits on the writeback.
> 
> Then, we wander into the ext3 truncate who resets disk i_size down
> again.  It's a pretty strange window of updates, but I was thinking it
> would work to cut down i_size, wait on IO, then cut it down again in
> setattr.
> 
> Once we wait on all IO past the new in-memory i_size, writepage won't
> send any more down.  So updating disk i_size after the wait should be
> enough.
  Yes, this should work. But it's a bit nasty...

								Honza
diff mbox

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..c5e1ffd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -505,7 +505,7 @@  still_busy:
  * Completion handler for block_write_full_page() - pages which are unlocked
  * during I/O, and which have PageWriteback cleared upon I/O completion.
  */
-static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
+void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 {
 	char b[BDEVNAME_SIZE];
 	unsigned long flags;
@@ -583,11 +583,17 @@  static void mark_buffer_async_read(struct buffer_head *bh)
 	set_buffer_async_read(bh);
 }
 
-void mark_buffer_async_write(struct buffer_head *bh)
+void mark_buffer_async_write_endio(struct buffer_head *bh,
+				   bh_end_io_t *handler)
 {
-	bh->b_end_io = end_buffer_async_write;
+	bh->b_end_io = handler;
 	set_buffer_async_write(bh);
 }
+
+void mark_buffer_async_write(struct buffer_head *bh)
+{
+	mark_buffer_async_write_endio(bh, end_buffer_async_write);
+}
 EXPORT_SYMBOL(mark_buffer_async_write);
 
 
@@ -1706,7 +1712,8 @@  EXPORT_SYMBOL(unmap_underlying_metadata);
  * prevents this contention from occurring.
  */
 static int __block_write_full_page(struct inode *inode, struct page *page,
-			get_block_t *get_block, struct writeback_control *wbc)
+			get_block_t *get_block, struct writeback_control *wbc,
+			bh_end_io_t *handler)
 {
 	int err;
 	sector_t block;
@@ -1789,7 +1796,7 @@  static int __block_write_full_page(struct inode *inode, struct page *page,
 			continue;
 		}
 		if (test_clear_buffer_dirty(bh)) {
-			mark_buffer_async_write(bh);
+			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			unlock_buffer(bh);
 		}
@@ -1842,7 +1849,7 @@  recover:
 		if (buffer_mapped(bh) && buffer_dirty(bh) &&
 		    !buffer_delay(bh)) {
 			lock_buffer(bh);
-			mark_buffer_async_write(bh);
+			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			/*
 			 * The buffer may have been set dirty during
@@ -2760,7 +2767,8 @@  int nobh_writepage(struct page *page, get_block_t *get_block,
 out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
-		ret = __block_write_full_page(inode, page, get_block, wbc);
+		ret = __block_write_full_page(inode, page, get_block, wbc,
+					      end_buffer_async_write);
 	return ret;
 }
 EXPORT_SYMBOL(nobh_writepage);
@@ -2918,9 +2926,10 @@  out:
 
 /*
  * The generic ->writepage function for buffer-backed address_spaces
+ * this form passes in the end_io handler used to finish the IO.
  */
-int block_write_full_page(struct page *page, get_block_t *get_block,
-			struct writeback_control *wbc)
+int block_write_full_page_endio(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc, bh_end_io_t *handler)
 {
 	struct inode * const inode = page->mapping->host;
 	loff_t i_size = i_size_read(inode);
@@ -2929,7 +2938,8 @@  int block_write_full_page(struct page *page, get_block_t *get_block,
 
 	/* Is the page fully inside i_size? */
 	if (page->index < end_index)
-		return __block_write_full_page(inode, page, get_block, wbc);
+		return __block_write_full_page(inode, page, get_block, wbc,
+					       handler);
 
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
@@ -2952,9 +2962,20 @@  int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-	return __block_write_full_page(inode, page, get_block, wbc);
+	return __block_write_full_page(inode, page, get_block, wbc, handler);
 }
 
+/*
+ * The generic ->writepage function for buffer-backed address_spaces
+ */
+int block_write_full_page(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc)
+{
+	return block_write_full_page_endio(page, get_block, wbc,
+					   end_buffer_async_write);
+}
+
+
 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
 			    get_block_t *get_block)
 {
@@ -3422,9 +3443,11 @@  EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(block_write_full_page);
+EXPORT_SYMBOL(block_write_full_page_endio);
 EXPORT_SYMBOL(cont_write_begin);
 EXPORT_SYMBOL(end_buffer_read_sync);
 EXPORT_SYMBOL(end_buffer_write_sync);
+EXPORT_SYMBOL_GPL(end_buffer_async_write);
 EXPORT_SYMBOL(file_fsync);
 EXPORT_SYMBOL(fsync_bdev);
 EXPORT_SYMBOL(generic_block_bmap);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..64995d0 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@ 
 #include <linux/bio.h>
 #include <linux/fiemap.h>
 #include <linux/namei.h>
+#include <linux/workqueue.h>
 #include "xattr.h"
 #include "acl.h"
 
@@ -766,6 +767,21 @@  err_out:
 	return err;
 }
 
+static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	/* FIXME add a lock in the inode */
+	spin_lock_irqsave(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
+	if (EXT3_I(inode)->i_disksize < new_size) {
+		EXT3_I(inode)->i_disksize = new_size;
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
+	return ret;
+}
+
 /*
  * Allocation strategy is simple: if we have to allocate something, we will
  * have to go the whole way to leaf. So let's do it before attaching anything
@@ -915,9 +931,13 @@  int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	 * i_disksize growing is protected by truncate_mutex.  Don't forget to
 	 * protect it if you're about to implement concurrent
 	 * ext3_get_block() -bzzz
+	 *
+	 * FIXME, I think this only needs to extend the disk i_size when
+	 * we're filling holes that came from using ftruncate to increase
+	 * i_size.  Need to verify.
 	*/
-	if (!err && extend_disksize && inode->i_size > ei->i_disksize)
-		ei->i_disksize = inode->i_size;
+	if (!ext3_should_guard_data(inode) && !err && extend_disksize)
+		maybe_update_disk_isize(inode, inode->i_size);
 	mutex_unlock(&ei->truncate_mutex);
 	if (err)
 		goto cleanup;
@@ -1079,6 +1099,50 @@  struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
 	return NULL;
 }
 
+void ext3_run_guarded_work(struct work_struct *work)
+{
+	struct ext3_sb_info *sbi =
+		container_of(work, struct ext3_sb_info, guarded_work);
+	struct buffer_head *bh;
+	struct buffer_head *next;
+	struct inode *inode;
+	struct page *page;
+	struct address_space *mapping;
+	loff_t offset;
+
+	spin_lock_irq(&sbi->guarded_lock);
+	while(sbi->guarded_buffers) {
+		bh = sbi->guarded_buffers;
+		next = bh->b_private;
+		if (!next)
+			sbi->guarded_tail = NULL;
+		sbi->guarded_buffers = next;
+		bh->b_private = NULL;
+		spin_unlock_irq(&sbi->guarded_lock);
+
+		page = bh->b_page;
+		mapping = page->mapping;
+		if (!mapping)
+			goto out;
+
+		/* set the offset to the end of this buffer */
+		offset = page_offset(page) + bh_offset(bh) + bh->b_size;
+		inode = mapping->host;
+
+		/*
+		 * then chomp back to i_size if that is smaller than the
+		 * offset
+		 */
+		offset = min(offset, inode->i_size);
+		if (maybe_update_disk_isize(inode, offset))
+			ext3_dirty_inode(inode);
+out:
+		end_buffer_async_write(bh, buffer_uptodate(bh));
+		spin_lock_irq(&sbi->guarded_lock);
+	}
+	spin_unlock_irq(&sbi->guarded_lock);
+}
+
 static int walk_page_buffers(	handle_t *handle,
 				struct buffer_head *head,
 				unsigned from,
@@ -1275,8 +1339,7 @@  static int ext3_ordered_write_end(struct file *file,
 		loff_t new_i_size;
 
 		new_i_size = pos + copied;
-		if (new_i_size > EXT3_I(inode)->i_disksize)
-			EXT3_I(inode)->i_disksize = new_i_size;
+		maybe_update_disk_isize(inode, new_i_size);
 		ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
 							page, fsdata);
 		copied = ret2;
@@ -1303,8 +1366,30 @@  static int ext3_writeback_write_end(struct file *file,
 	loff_t new_i_size;
 
 	new_i_size = pos + copied;
-	if (new_i_size > EXT3_I(inode)->i_disksize)
-		EXT3_I(inode)->i_disksize = new_i_size;
+	maybe_update_disk_isize(inode, new_i_size);
+
+	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	copied = ret2;
+	if (ret2 < 0)
+		ret = ret2;
+
+	ret2 = ext3_journal_stop(handle);
+	if (!ret)
+		ret = ret2;
+	unlock_page(page);
+	page_cache_release(page);
+
+	return ret ? ret : copied;
+}
+
+static int ext3_guarded_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
+{
+	handle_t *handle = ext3_journal_current_handle();
+	int ret = 0, ret2;
 
 	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
 							page, fsdata);
@@ -1553,6 +1638,74 @@  out_fail:
 	return ret;
 }
 
+/*
+ * Completion handler for block_write_full_page() - pages which are unlocked
+ * during I/O, and which have PageWriteback cleared upon I/O completion.
+ */
+static void end_buffer_async_write_guarded(struct buffer_head *bh,
+					   int uptodate)
+{
+	struct ext3_sb_info *sbi;
+	struct address_space *mapping;
+	unsigned long flags;
+
+	mapping = bh->b_page->mapping;
+	if (!mapping || bh->b_private) {
+		end_buffer_async_write(bh, uptodate);
+		return;
+	}
+
+	/*
+	 * the end_io callback deals with IO errors later
+	 */
+	if (uptodate)
+		set_buffer_uptodate(bh);
+	else
+		clear_buffer_uptodate(bh);
+
+	sbi = EXT3_SB(mapping->host->i_sb);
+	spin_lock_irqsave(&sbi->guarded_lock, flags);
+	if (sbi->guarded_tail) {
+		struct buffer_head *last = sbi->guarded_tail;
+		last->b_private = bh;
+	} else
+		sbi->guarded_buffers = bh;
+	sbi->guarded_tail = bh;
+	spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+	queue_work(sbi->guarded_wq, &sbi->guarded_work);
+}
+
+static int ext3_guarded_writepage(struct page *page,
+				struct writeback_control *wbc)
+{
+	struct inode *inode = page->mapping->host;
+	handle_t *handle = NULL;
+	int ret = 0;
+	int err;
+
+	if (ext3_journal_current_handle())
+		goto out_fail;
+
+	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out_fail;
+	}
+
+	ret = block_write_full_page_endio(page, ext3_get_block, wbc,
+					  end_buffer_async_write_guarded);
+
+	err = ext3_journal_stop(handle);
+	if (!ret)
+		ret = err;
+	return ret;
+
+out_fail:
+	redirty_page_for_writepage(wbc, page);
+	unlock_page(page);
+	return ret;
+}
+
 static int ext3_writeback_writepage(struct page *page,
 				struct writeback_control *wbc)
 {
@@ -1812,6 +1965,21 @@  static const struct address_space_operations ext3_writeback_aops = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
+static const struct address_space_operations ext3_guarded_aops = {
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_guarded_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_guarded_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
+};
+
 static const struct address_space_operations ext3_journalled_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
@@ -1830,6 +1998,8 @@  void ext3_set_aops(struct inode *inode)
 {
 	if (ext3_should_order_data(inode))
 		inode->i_mapping->a_ops = &ext3_ordered_aops;
+	else if (ext3_should_guard_data(inode))
+		inode->i_mapping->a_ops = &ext3_guarded_aops;
 	else if (ext3_should_writeback_data(inode))
 		inode->i_mapping->a_ops = &ext3_writeback_aops;
 	else
@@ -3081,6 +3251,14 @@  int ext3_setattr(struct dentry *dentry, struct iattr *attr)
 		}
 
 		error = ext3_orphan_add(handle, inode);
+
+		/*
+		 * this is pretty confusing, but we don't need to worry
+		 * about guarded i_size here because ext3 truncate fixes
+		 * it to the correct i_size when the truncate is all done,
+		 * and the ext3_orphan_add makes sure we'll have a sane
+		 * i_size after a crash
+		 */
 		EXT3_I(inode)->i_disksize = attr->ia_size;
 		rc = ext3_mark_inode_dirty(handle, inode);
 		if (!error)
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 4a97041..0534a95 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -37,6 +37,7 @@ 
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/log2.h>
+#include <linux/workqueue.h>
 
 #include <asm/uaccess.h>
 
@@ -393,6 +394,9 @@  static void ext3_put_super (struct super_block * sb)
 	struct ext3_super_block *es = sbi->s_es;
 	int i, err;
 
+	flush_workqueue(sbi->guarded_wq);
+	destroy_workqueue(sbi->guarded_wq);
+
 	ext3_xattr_put_super(sb);
 	err = journal_destroy(sbi->s_journal);
 	sbi->s_journal = NULL;
@@ -628,6 +632,8 @@  static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",data=journal");
 	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
 		seq_puts(seq, ",data=ordered");
+	else if (test_opt(sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+		seq_puts(seq, ",data=guarded");
 	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
 		seq_puts(seq, ",data=writeback");
 
@@ -786,7 +792,7 @@  enum {
 	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
 	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
-	Opt_data_err_abort, Opt_data_err_ignore,
+	Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
@@ -828,6 +834,7 @@  static const match_table_t tokens = {
 	{Opt_abort, "abort"},
 	{Opt_data_journal, "data=journal"},
 	{Opt_data_ordered, "data=ordered"},
+	{Opt_data_guarded, "data=guarded"},
 	{Opt_data_writeback, "data=writeback"},
 	{Opt_data_err_abort, "data_err=abort"},
 	{Opt_data_err_ignore, "data_err=ignore"},
@@ -1030,6 +1037,9 @@  static int parse_options (char *options, struct super_block *sb,
 		case Opt_data_ordered:
 			data_opt = EXT3_MOUNT_ORDERED_DATA;
 			goto datacheck;
+		case Opt_data_guarded:
+			data_opt = EXT3_MOUNT_GUARDED_DATA;
+			goto datacheck;
 		case Opt_data_writeback:
 			data_opt = EXT3_MOUNT_WRITEBACK_DATA;
 		datacheck:
@@ -1945,11 +1955,24 @@  static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 			clear_opt(sbi->s_mount_opt, NOBH);
 		}
 	}
+
+	/*
+	 * setup the guarded work list
+	 */
+	EXT3_SB(sb)->guarded_buffers = NULL;
+	EXT3_SB(sb)->guarded_tail = NULL;
+	INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
+	spin_lock_init(&EXT3_SB(sb)->guarded_lock);
+	EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
+	if (!EXT3_SB(sb)->guarded_wq) {
+		printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
+		goto failed_mount_guard;
+	}
+
 	/*
 	 * The journal_load will have done any necessary log recovery,
 	 * so we can safely mount the rest of the filesystem now.
 	 */
-
 	root = ext3_iget(sb, EXT3_ROOT_INO);
 	if (IS_ERR(root)) {
 		printk(KERN_ERR "EXT3-fs: get root inode failed\n");
@@ -1961,6 +1984,7 @@  static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
 		goto failed_mount4;
 	}
+
 	sb->s_root = d_alloc_root(root);
 	if (!sb->s_root) {
 		printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
@@ -1970,6 +1994,7 @@  static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 	}
 
 	ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
+
 	/*
 	 * akpm: core read_super() calls in here with the superblock locked.
 	 * That deadlocks, because orphan cleanup needs to lock the superblock
@@ -1985,9 +2010,10 @@  static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		printk (KERN_INFO "EXT3-fs: recovery complete.\n");
 	ext3_mark_recovery_complete(sb, es);
 	printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
-		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
-		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
-		"writeback");
+	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
+	      test_opt(sb,GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA ? "guarded":
+	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
+	      "writeback");
 
 	lock_kernel();
 	return 0;
@@ -1999,6 +2025,8 @@  cantfind_ext3:
 	goto failed_mount;
 
 failed_mount4:
+	destroy_workqueue(EXT3_SB(sb)->guarded_wq);
+failed_mount_guard:
 	journal_destroy(sbi->s_journal);
 failed_mount3:
 	percpu_counter_destroy(&sbi->s_freeblocks_counter);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index bd7ac79..507b38d 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -155,6 +155,7 @@  void create_empty_buffers(struct page *, unsigned long,
 			unsigned long b_state);
 void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
 void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
+void end_buffer_async_write(struct buffer_head *bh, int uptodate);
 
 /* Things to do with buffers at mapping->private_list */
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
@@ -204,6 +205,8 @@  extern int buffer_heads_over_limit;
 void block_invalidatepage(struct page *page, unsigned long offset);
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
+int block_write_full_page_endio(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc, bh_end_io_t *handler);
 int block_read_full_page(struct page*, get_block_t*);
 int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
 				unsigned long from);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index dd495b8..7966bdb 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -18,6 +18,7 @@ 
 
 #include <linux/types.h>
 #include <linux/magic.h>
+#include <linux/workqueue.h>
 
 /*
  * The second extended filesystem constants/structures
@@ -397,7 +398,6 @@  struct ext3_inode {
 #define EXT3_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT3_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
 #define EXT3_MOUNT_ABORT		0x00200	/* Fatal error detected */
-#define EXT3_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT3_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT3_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
 #define EXT3_MOUNT_WRITEBACK_DATA	0x00C00	/* No data ordering */
@@ -413,6 +413,12 @@  struct ext3_inode {
 #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
 #define EXT3_MOUNT_DATA_ERR_ABORT	0x400000 /* Abort on file data write
 						  * error in ordered mode */
+#define EXT3_MOUNT_GUARDED_DATA		0x800000 /* guard new writes with
+						    i_size */
+#define EXT3_MOUNT_DATA_FLAGS		(EXT3_MOUNT_JOURNAL_DATA | \
+					 EXT3_MOUNT_ORDERED_DATA | \
+					 EXT3_MOUNT_WRITEBACK_DATA | \
+					 EXT3_MOUNT_GUARDED_DATA)
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -891,6 +897,7 @@  extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+void ext3_run_guarded_work(struct work_struct *work);
 
 /* ioctl.c */
 extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..868d2cd 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@ 
 #include <linux/wait.h>
 #include <linux/blockgroup_lock.h>
 #include <linux/percpu_counter.h>
+#include <linux/workqueue.h>
 #endif
 #include <linux/rbtree.h>
 
@@ -82,6 +83,12 @@  struct ext3_sb_info {
 	char *s_qf_names[MAXQUOTAS];		/* Names of quota files with journalled quota */
 	int s_jquota_fmt;			/* Format of quota to use */
 #endif
+
+	struct workqueue_struct *guarded_wq;
+	struct work_struct guarded_work;
+	struct buffer_head *guarded_buffers;
+	struct buffer_head *guarded_tail;
+	spinlock_t guarded_lock;
 };
 
 static inline spinlock_t *
diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
index cf82d51..45cb4aa 100644
--- a/include/linux/ext3_jbd.h
+++ b/include/linux/ext3_jbd.h
@@ -212,6 +212,17 @@  static inline int ext3_should_order_data(struct inode *inode)
 	return 0;
 }
 
+static inline int ext3_should_guard_data(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
+		return 0;
+	if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+		return 1;
+	return 0;
+}
+
 static inline int ext3_should_writeback_data(struct inode *inode)
 {
 	if (!S_ISREG(inode->i_mode))