Patchwork Query about DIO/AIO WRITE throttling and ext4 serialization

login
register
mail settings
Submitter Christoph Hellwig
Date June 3, 2011, 1:02 a.m.
Message ID <20110603010233.GA17726@infradead.org>
Download mbox | patch
Permalink /patch/98498/
State New
Headers show

Comments

Christoph Hellwig - June 3, 2011, 1:02 a.m.
On Thu, Jun 02, 2011 at 08:54:03PM -0400, Vivek Goyal wrote:
> Just wondering why ext4 and XFS behavior are different and which is a
> more appropriate behavior. ext4 does not seem to be waiting for all
> pending AIO/DIO to finish while XFS does.

They're both wrong.  Ext4 completely misses support in fsync or sync
to catch pending unwrittent extent conversions, and thus fails to obey
the data integrity guarante.  XFS is beeing rather stupid about the
amount of synchronization it requires.  The untested patch below
should help with avoiding the synchronization if you're purely doing
overwrites:


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal - June 3, 2011, 1:28 a.m.
On Thu, Jun 02, 2011 at 09:02:33PM -0400, Christoph Hellwig wrote:
> On Thu, Jun 02, 2011 at 08:54:03PM -0400, Vivek Goyal wrote:
> > Just wondering why ext4 and XFS behavior are different and which is a
> > more appropriate behavior. ext4 does not seem to be waiting for all
> > pending AIO/DIO to finish while XFS does.
> 
> They're both wrong.  Ext4 completely misses support in fsync or sync
> to catch pending unwrittent extent conversions, and thus fails to obey
> the data integrity guarante.  XFS is beeing rather stupid about the
> amount of synchronization it requires.  The untested patch below
> should help with avoiding the synchronization if you're purely doing
> overwrites:

Yes this patch helps. I have already laid out the file and doing
overwrites.

I throttled aio-stress in one cgroup to 1 byte/sec and edited another
file from other cgroup and did a "sync" and it completed.

Thanks
Vivek

> 
> 
> Index: xfs/fs/xfs/linux-2.6/xfs_aops.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-06-03 09:54:52.964337556 +0900
> +++ xfs/fs/xfs/linux-2.6/xfs_aops.c	2011-06-03 09:57:06.877674259 +0900
> @@ -270,7 +270,7 @@ xfs_finish_ioend_sync(
>   * (vs. incore size).
>   */
>  STATIC xfs_ioend_t *
> -xfs_alloc_ioend(
> +__xfs_alloc_ioend(
>  	struct inode		*inode,
>  	unsigned int		type)
>  {
> @@ -290,7 +290,6 @@ xfs_alloc_ioend(
>  	ioend->io_inode = inode;
>  	ioend->io_buffer_head = NULL;
>  	ioend->io_buffer_tail = NULL;
> -	atomic_inc(&XFS_I(ioend->io_inode)->i_iocount);
>  	ioend->io_offset = 0;
>  	ioend->io_size = 0;
>  	ioend->io_iocb = NULL;
> @@ -300,6 +299,18 @@ xfs_alloc_ioend(
>  	return ioend;
>  }
>  
> +STATIC xfs_ioend_t *
> +xfs_alloc_ioend(
> +	struct inode		*inode,
> +	unsigned int		type)
> +{
> +	struct xfs_ioend	*ioend;
> +
> +	ioend = __xfs_alloc_ioend(inode, type);
> +	atomic_inc(&XFS_I(ioend->io_inode)->i_iocount);
> +	return ioend;
> +}
> +
>  STATIC int
>  xfs_map_blocks(
>  	struct inode		*inode,
> @@ -1318,6 +1329,7 @@ xfs_end_io_direct_write(
>  	 */
>  	iocb->private = NULL;
>  
> +	atomic_inc(&XFS_I(ioend->io_inode)->i_iocount);
>  	ioend->io_offset = offset;
>  	ioend->io_size = size;
>  	if (private && size > 0)
> @@ -1354,7 +1366,7 @@ xfs_vm_direct_IO(
>  	ssize_t			ret;
>  
>  	if (rw & WRITE) {
> -		iocb->private = xfs_alloc_ioend(inode, IO_DIRECT);
> +		iocb->private = __xfs_alloc_ioend(inode, IO_DIRECT);
>  
>  		ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
>  					    offset, nr_segs,
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal - June 3, 2011, 1:33 a.m.
On Thu, Jun 02, 2011 at 09:28:58PM -0400, Vivek Goyal wrote:
> On Thu, Jun 02, 2011 at 09:02:33PM -0400, Christoph Hellwig wrote:
> > On Thu, Jun 02, 2011 at 08:54:03PM -0400, Vivek Goyal wrote:
> > > Just wondering why ext4 and XFS behavior are different and which is a
> > > more appropriate behavior. ext4 does not seem to be waiting for all
> > > pending AIO/DIO to finish while XFS does.
> > 
> > They're both wrong.  Ext4 completely misses support in fsync or sync
> > to catch pending unwrittent extent conversions, and thus fails to obey
> > the data integrity guarante.  XFS is beeing rather stupid about the
> > amount of synchronization it requires.  The untested patch below
> > should help with avoiding the synchronization if you're purely doing
> > overwrites:
> 
> Yes this patch helps. I have already laid out the file and doing
> overwrites.
> 
> I throttled aio-stress in one cgroup to 1 byte/sec and edited another
> file from other cgroup and did a "sync" and it completed.

Even other test where I am running aio-stress in one window and edited
a file in another window and typed "sync" worked. "sync" does not hang
waiting for aio-stress to finish.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - June 3, 2011, 3:30 a.m.
On 6/2/11 8:02 PM, Christoph Hellwig wrote:
> On Thu, Jun 02, 2011 at 08:54:03PM -0400, Vivek Goyal wrote:
>> Just wondering why ext4 and XFS behavior are different and which is a
>> more appropriate behavior. ext4 does not seem to be waiting for all
>> pending AIO/DIO to finish while XFS does.
> 
> They're both wrong.  Ext4 completely misses support in fsync or sync
> to catch pending unwrittent extent conversions, and thus fails to obey
> the data integrity guarante.  

I'm not sure about that.

ext4_sync_file() does ext4_flush_completed_IO() which does:

 * When IO is completed, the work to convert unwritten extents to
 * written is queued on workqueue but may not get immediately
 * scheduled. When fsync is called, we need to ensure the
 * conversion is complete before fsync returns.
 * The inode keeps track of a list of pending/completed IO that
 * might needs to do the conversion. This function walks through
 * the list and convert the related unwritten extents for completed IO
 * to written.

Granted, I get easily lost in ext4's codepaths here, which is actually
why I suggested Vivek pose these questions to the list ;)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig - June 3, 2011, 5 a.m.
On Thu, Jun 02, 2011 at 10:30:38PM -0500, Eric Sandeen wrote:
> On 6/2/11 8:02 PM, Christoph Hellwig wrote:
> > On Thu, Jun 02, 2011 at 08:54:03PM -0400, Vivek Goyal wrote:
> >> Just wondering why ext4 and XFS behavior are different and which is a
> >> more appropriate behavior. ext4 does not seem to be waiting for all
> >> pending AIO/DIO to finish while XFS does.
> > 
> > They're both wrong.  Ext4 completely misses support in fsync or sync
> > to catch pending unwrittent extent conversions, and thus fails to obey
> > the data integrity guarante.  
> 
> I'm not sure about that.
> 
> ext4_sync_file() does ext4_flush_completed_IO() which does:

> Granted, I get easily lost in ext4's codepaths here, which is actually
> why I suggested Vivek pose these questions to the list ;)

You're right it gets fsync right, but the sync still seems to be missing,
which does not just include sync, but also the syncfs system call
and unmount.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig - June 9, 2011, 1:09 p.m.
On Thu, Jun 02, 2011 at 09:33:45PM -0400, Vivek Goyal wrote:
> > Yes this patch helps. I have already laid out the file and doing
> > overwrites.
> > 
> > I throttled aio-stress in one cgroup to 1 byte/sec and edited another
> > file from other cgroup and did a "sync" and it completed.
> 
> Even other test where I am running aio-stress in one window and edited
> a file in another window and typed "sync" worked. "sync" does not hang
> waiting for aio-stress to finish.

I've been thinking about the patch a bit more, and I think it's simply
incorrect.  i_iocount is the only thing that actually tracks in-flight
DIO/AIO requests, so we can't actually skip incrementing it as that
means we can't wait for pending AIO in fsync/sync/inode reclaim or
remount r/o.

We could simply declare AIO is off limits for sync and skip it there,
which is easily doable, but we'd still need a special case version of
sync for remount r/o as that absolutely needs to stop all pending I/O.

Of the other filesystem ext4 also has the counter, but only waits for
it during inode teardown, and using a slightly different, but also
effective scheme for fsync, but completely ignores sync and remount.

I couldn't find a similar scheme in other filesystem supporting AIO,
but it might be hidden a bit better.

I suspect we could optimize things by using the dual count and list
approach ext4 does - there is a counter for in-flight direct I/O, which
we only check for inode teardown and remount, as those need to stop
pending I/O, but sync and fsync can skip them as they only need to
flush pending I/O.  There is a list for the pending unwritten extent
conversions that only gets appended to when the actual I/O is done,
and the unwritten extent conversion is queued up. 

I'll see if I can come up with a good scheme for that, preferably
sitting directly in the direct I/O code, so that everyone gets it
without additional work.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

Index: xfs/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-06-03 09:54:52.964337556 +0900
+++ xfs/fs/xfs/linux-2.6/xfs_aops.c	2011-06-03 09:57:06.877674259 +0900
@@ -270,7 +270,7 @@  xfs_finish_ioend_sync(
  * (vs. incore size).
  */
 STATIC xfs_ioend_t *
-xfs_alloc_ioend(
+__xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type)
 {
@@ -290,7 +290,6 @@  xfs_alloc_ioend(
 	ioend->io_inode = inode;
 	ioend->io_buffer_head = NULL;
 	ioend->io_buffer_tail = NULL;
-	atomic_inc(&XFS_I(ioend->io_inode)->i_iocount);
 	ioend->io_offset = 0;
 	ioend->io_size = 0;
 	ioend->io_iocb = NULL;
@@ -300,6 +299,18 @@  xfs_alloc_ioend(
 	return ioend;
 }
 
+STATIC xfs_ioend_t *
+xfs_alloc_ioend(
+	struct inode		*inode,
+	unsigned int		type)
+{
+	struct xfs_ioend	*ioend;
+
+	ioend = __xfs_alloc_ioend(inode, type);
+	atomic_inc(&XFS_I(ioend->io_inode)->i_iocount);
+	return ioend;
+}
+
 STATIC int
 xfs_map_blocks(
 	struct inode		*inode,
@@ -1318,6 +1329,7 @@  xfs_end_io_direct_write(
 	 */
 	iocb->private = NULL;
 
+	atomic_inc(&XFS_I(ioend->io_inode)->i_iocount);
 	ioend->io_offset = offset;
 	ioend->io_size = size;
 	if (private && size > 0)
@@ -1354,7 +1366,7 @@  xfs_vm_direct_IO(
 	ssize_t			ret;
 
 	if (rw & WRITE) {
-		iocb->private = xfs_alloc_ioend(inode, IO_DIRECT);
+		iocb->private = __xfs_alloc_ioend(inode, IO_DIRECT);
 
 		ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
 					    offset, nr_segs,