[5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write

Message ID 1327091686-23177-6-git-send-email-jack@suse.cz
State Superseded, archived
Commit Message

Jan Kara Jan. 20, 2012, 8:34 p.m. UTC
Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
dictated by the page fault code we have to call sb_start_write() after we
acquire ilock.

Similarly we have to protect xfs_setattr_size() because it can modify last
page of truncated file. Because ilock is dropped in xfs_setattr_size() we
have to drop and retake write access as well to avoid deadlocks.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c |    6 ++++--
 fs/xfs/xfs_iops.c |    6 ++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

Dave Chinner Jan. 24, 2012, 7:19 a.m. UTC | #1
On Fri, Jan 20, 2012 at 09:34:43PM +0100, Jan Kara wrote:
> Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
> dictated by the page fault code we have to call sb_start_write() after we
> acquire ilock.

It appears to me that you have indeed confused the ilock with the
iolock.

> Similarly we have to protect xfs_setattr_size() because it can modify last
> page of truncated file. Because ilock is dropped in xfs_setattr_size() we
> have to drop and retake write access as well to avoid deadlocks.

> 
> CC: Ben Myers <bpm@sgi.com>
> CC: Alex Elder <elder@kernel.org>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/xfs/xfs_file.c |    6 ++++--
>  fs/xfs/xfs_iops.c |    6 ++++++
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 753ed9b..9efd153 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
>  		*iolock = XFS_IOLOCK_SHARED;
>  	}
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
>  	ret = generic_file_direct_write(iocb, iovp,
>  			&nr_segs, pos, &iocb->ki_pos, count, ocount);
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);

That's inside the iolock, not the ilock. Either way, it is
incorrect. This accounting should be outside the iolock - because
xfs_trans_alloc() can be called with the iolock held. Therefore the
freeze/lock order needs to be

	sb_start_write(SB_FREEZE_WRITE)
	  XFS(ip)->i_iolock
	    XFS(ip)->i_ilock
	sb_end_write(SB_FREEZE_WRITE)

Which matches the current freeze/lock order.
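The nesting above can be sketched as a toy rank checker in userspace C. This is only an illustration of the ordering rule, not kernel code: the names `lock_rank`, `held_locks`, `rank_acquire()` and `rank_release()` are hypothetical, and freeze protection is modelled as if it were an ordinary lock at the outermost rank.

```c
#include <assert.h>

/* Hypothetical ranks, outermost first: freeze protection, iolock, ilock. */
enum lock_rank { RANK_FREEZE = 0, RANK_IOLOCK = 1, RANK_ILOCK = 2, RANK_COUNT = 3 };

struct held_locks {
	enum lock_rank stack[RANK_COUNT];
	int depth;
};

/* Take rank r; fail (-1) if something of equal or inner rank is already held. */
int rank_acquire(struct held_locks *h, enum lock_rank r)
{
	if (h->depth > 0 && h->stack[h->depth - 1] >= r)
		return -1;
	assert(h->depth < RANK_COUNT);	/* ranks strictly increase, so this holds */
	h->stack[h->depth++] = r;
	return 0;
}

/* Release rank r; fail (-1) unless it is the innermost rank held. */
int rank_release(struct held_locks *h, enum lock_rank r)
{
	if (h->depth == 0 || h->stack[h->depth - 1] != r)
		return -1;
	h->depth--;
	return 0;
}

/* Order requested in the review: freeze, then iolock, then ilock. */
int write_path_reviewed_order(void)
{
	struct held_locks h = { .depth = 0 };

	if (rank_acquire(&h, RANK_FREEZE)) return -1;
	if (rank_acquire(&h, RANK_IOLOCK)) return -1;
	if (rank_acquire(&h, RANK_ILOCK)) return -1;
	if (rank_release(&h, RANK_ILOCK)) return -1;
	if (rank_release(&h, RANK_IOLOCK)) return -1;
	return rank_release(&h, RANK_FREEZE);
}

/* Order in the posted dio hunk: freeze protection taken inside the iolock. */
int write_path_posted_order(void)
{
	struct held_locks h = { .depth = 0 };

	if (rank_acquire(&h, RANK_IOLOCK)) return -1;
	return rank_acquire(&h, RANK_FREEZE);	/* rejected by the rank check */
}
```

Under this model the reviewed order passes and the posted order is rejected at the point where freeze protection is requested with the iolock already held.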

> @@ -945,8 +949,6 @@ xfs_file_aio_write(
>  	if (ocount == 0)
>  		return 0;
>  
> -	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> -

That's where sb_start_write() needs to be, and the sb_end_write()
call needs to be below the generic_write_sync() calls that will trigger
IO on O_SYNC writes. Otherwise it is not covering all of the IO path
correctly.

>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 3579bc8..798b9c6 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -793,6 +793,7 @@ xfs_setattr_size(
>  		return xfs_setattr_nonsize(ip, iattr, 0);
>  	}
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
>  	/*
>  	 * Make sure that the dquots are attached to the inode.
>  	 */
> @@ -849,10 +850,14 @@ xfs_setattr_size(
>  				     xfs_get_blocks);
>  	if (error)
>  		goto out_unlock;
> +	/* Drop the write access to avoid lock inversion with ilock */
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	lock_flags |= XFS_ILOCK_EXCL;
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> +

This is caused by the previous problems I pointed out. You should
not need to drop the freeze reference here at all.

Cheers,

Dave.
Jan Kara Jan. 24, 2012, 7:35 p.m. UTC | #2
On Tue 24-01-12 18:19:26, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:43PM +0100, Jan Kara wrote:
> > Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> > a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
> > dictated by the page fault code we have to call sb_start_write() after we
> > acquire ilock.
> 
> It appears to me that you have indeed confused the ilock with the
> iolock.
> 
> > Similarly we have to protect xfs_setattr_size() because it can modify last
> > page of truncated file. Because ilock is dropped in xfs_setattr_size() we
> > have to drop and retake write access as well to avoid deadlocks.
> 
> > 
> > CC: Ben Myers <bpm@sgi.com>
> > CC: Alex Elder <elder@kernel.org>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/xfs/xfs_file.c |    6 ++++--
> >  fs/xfs/xfs_iops.c |    6 ++++++
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 753ed9b..9efd153 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
> >  		*iolock = XFS_IOLOCK_SHARED;
> >  	}
> >  
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> >  	ret = generic_file_direct_write(iocb, iovp,
> >  			&nr_segs, pos, &iocb->ki_pos, count, ocount);
> > +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
> 
> That's inside the iolock, not the ilock. Either way, it is
> incorrect. This accounting should be outside the iolock - because
> xfs_trans_alloc() can be called with the iolock held. Therefore the
> freeze/lock order needs to be
> 
> 	sb_start_write(SB_FREEZE_WRITE)
> 	  XFS(ip)->i_iolock
> 	    XFS(ip)->i_ilock
> 	sb_end_write(SB_FREEZE_WRITE)
> 
> Which matches the current freeze/lock order.
  Hmm, so I was looking at this and I think there are the following locking
constraints (please correct me if I have something wrong):
iolock -> trans start (per your claim above)
trans start -> ilock (ditto)
iolock -> mmap_sem (aio write holds iolock and copying data from userspace
  might need mmap sem if it hits page fault)
mmap_sem -> ilock (do_wp_page -> block_page_mkwrite -> __xfs_get_blocks)
freezing -> trans start (so that we can clean the filesystem during
              freezing)

So I see two choices here.
  1) Put 'freezing' above iolock as you suggest. But then handling the page
fault path becomes challenging. We cannot block there easily because we are
called with mmap_sem held. I just talked with Mel and it seems that
dropping mmap_sem in ->page_mkwrite(), blocking, retaking mmap_sem and
returning VM_FAULT_RETRY might work but we'll see whether some other mm guy
won't kill me for that ;).
  2) Put 'freezing' below mmap_sem. That would put it below iolock/i_mutex
as well. Then handling page fault is easy. We could not block in ->aio_write
but we'd have to block in ->write_begin() instead. Similarly we would have
to block in other write paths.

The first approach has the advantage that we could put lots of frozen
checks into VFS thus making them shared among filesystems (possibly even
making freezing reliable for filesystems such as ext2). The second approach
is simpler as we could do most of the freezing checks while we start a
transaction at least for filesystems that have transactions... Any
preferences?
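The constraints and the two choices above can be checked mechanically by treating each "A -> B" as an edge in a directed graph and looking for cycles. The following is a minimal userspace sketch, not kernel code: the names `lock_graph`, `graph_init_base()` and `graph_has_cycle()` are made up for illustration, and "freezing" is modelled as one more node even though it is not an ordinary lock.

```c
#include <assert.h>
#include <string.h>

/* One node per entity named in the constraint list. */
enum node { IOLOCK, TRANS, ILOCK, MMAP_SEM, FREEZING, NNODES };

struct lock_graph {
	unsigned char edge[NNODES][NNODES];	/* edge[a][b] != 0 means a -> b */
};

/* Load the five constraints listed in the message. */
void graph_init_base(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
	g->edge[IOLOCK][TRANS] = 1;
	g->edge[TRANS][ILOCK] = 1;
	g->edge[IOLOCK][MMAP_SEM] = 1;
	g->edge[MMAP_SEM][ILOCK] = 1;
	g->edge[FREEZING][TRANS] = 1;
}

/* Depth-first search; state 0 = unvisited, 1 = on current path, 2 = done. */
static int visit(const struct lock_graph *g, int n, int *state)
{
	state[n] = 1;
	for (int m = 0; m < NNODES; m++) {
		if (!g->edge[n][m])
			continue;
		if (state[m] == 1)
			return 1;	/* back edge: ordering cycle */
		if (state[m] == 0 && visit(g, m, state))
			return 1;
	}
	state[n] = 2;
	return 0;
}

/* Returns 1 if the ordering constraints contain a cycle, 0 otherwise. */
int graph_has_cycle(const struct lock_graph *g)
{
	int state[NNODES] = { 0 };

	for (int n = 0; n < NNODES; n++)
		if (state[n] == 0 && visit(g, n, state))
			return 1;
	return 0;
}
```

In this model, adding freezing -> iolock (choice 1) or mmap_sem -> freezing (choice 2) alone keeps the graph acyclic. Adding both gives the cycle freezing -> iolock -> mmap_sem -> freezing, which is why choice 1 cannot simply block in the page fault path with mmap_sem held. Adding ilock -> freezing, as in the posted commit message, closes a cycle through trans start.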

								Honza
 
> > @@ -945,8 +949,6 @@ xfs_file_aio_write(
> >  	if (ocount == 0)
> >  		return 0;
> >  
> > -	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> > -
> 
> That's where sb_start_write() needs to be, and the sb_end_write()
> call needs to be below the generic_write_sync() calls that will trigger
> IO on O_SYNC writes. Otherwise it is not covering all of the IO path
> correctly.
> 
> >  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> >  		return -EIO;
> >  
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 3579bc8..798b9c6 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -793,6 +793,7 @@ xfs_setattr_size(
> >  		return xfs_setattr_nonsize(ip, iattr, 0);
> >  	}
> >  
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> >  	/*
> >  	 * Make sure that the dquots are attached to the inode.
> >  	 */
> > @@ -849,10 +850,14 @@ xfs_setattr_size(
> >  				     xfs_get_blocks);
> >  	if (error)
> >  		goto out_unlock;
> > +	/* Drop the write access to avoid lock inversion with ilock */
> > +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	lock_flags |= XFS_ILOCK_EXCL;
> >  
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> > +
> 
> This is caused by the previous problems I pointed out. You should
> not need to drop the freeze reference here at all.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 753ed9b..9efd153 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -862,9 +862,11 @@  xfs_file_dio_aio_write(
 		*iolock = XFS_IOLOCK_SHARED;
 	}
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
 	ret = generic_file_direct_write(iocb, iovp,
 			&nr_segs, pos, &iocb->ki_pos, count, ocount);
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* No fallback to buffered IO on errors for XFS. */
 	ASSERT(ret < 0 || ret == count);
@@ -899,6 +901,7 @@  xfs_file_buffered_aio_write(
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 write_retry:
 	trace_xfs_file_buffered_write(ip, count, iocb->ki_pos, 0);
 	ret = generic_file_buffered_write(iocb, iovp, nr_segs,
@@ -914,6 +917,7 @@  write_retry:
 		enospc = 1;
 		goto write_retry;
 	}
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	current->backing_dev_info = NULL;
 	return ret;
 }
@@ -945,8 +949,6 @@  xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
-
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 3579bc8..798b9c6 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -793,6 +793,7 @@  xfs_setattr_size(
 		return xfs_setattr_nonsize(ip, iattr, 0);
 	}
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	/*
 	 * Make sure that the dquots are attached to the inode.
 	 */
@@ -849,10 +850,14 @@  xfs_setattr_size(
 				     xfs_get_blocks);
 	if (error)
 		goto out_unlock;
+	/* Drop the write access to avoid lock inversion with ilock */
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	lock_flags |= XFS_ILOCK_EXCL;
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
+
 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, 0, XFS_ITRUNCATE_LOG_RES(mp), 0,
 				 XFS_TRANS_PERM_LOG_RES,
@@ -924,6 +929,7 @@  xfs_setattr_size(
 
 	error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
 out_unlock:
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	if (lock_flags)
 		xfs_iunlock(ip, lock_flags);
 	return error;