
[2/8] vfs: Protect write paths by sb_start_write - sb_end_write

Message ID 1327091686-23177-3-git-send-email-jack@suse.cz
State Superseded, archived

Commit Message

Jan Kara Jan. 20, 2012, 8:34 p.m. UTC
There are three entry points which dirty pages in a filesystem: mmap (handled
by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), and truncate (it can dirty the last partial page;
handled inside each filesystem separately). Protect these places with
sb_start_write() and sb_end_write().

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c  |   22 ++++------------------
 mm/filemap.c |    3 ++-
 2 files changed, 6 insertions(+), 19 deletions(-)
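
For context, the pairing being introduced looks roughly like the sketch below
(assuming the level-taking sb_start_write()/sb_end_write() prototypes added
earlier in this series; do_example_write() is a hypothetical helper):

/* Sketch only: the pattern this patch applies to write paths. */
static ssize_t example_write(struct file *file, const char __user *buf,
			     size_t len, loff_t *ppos)
{
	struct inode *inode = file->f_mapping->host;
	ssize_t ret;

	/* Block a freeze from completing while we may dirty pages. */
	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
	ret = do_example_write(file, buf, len, ppos);
	/* Let the freezer proceed once we are done dirtying pages. */
	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
	return ret;
}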

Comments

Dave Chinner Jan. 24, 2012, 8:21 a.m. UTC | #1
On Fri, Jan 20, 2012 at 09:34:40PM +0100, Jan Kara wrote:
> There are three entry points which dirty pages in a filesystem: mmap (handled
> by block_page_mkwrite()), buffered write (handled by
> __generic_file_aio_write()), and truncate (it can dirty the last partial page;
> handled inside each filesystem separately). Protect these places with
> sb_start_write() and sb_end_write().

fallocate can also dirty pages, either during preallocation or hole
punching.  Hence, if you are going to promote truncate to
SB_FREEZE_WRITE protection, then you need to promote everything else
that can zero partial blocks as well.

That also means that anything that has implemented the XFS_IOC_ ioctl
interfaces for preallocation and hole punching (xfs, ocfs2 and gfs2
IIRC) also needs to be protected in the same way.
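
Each such entry point would need something like the following sketch
(hypothetical "foo" filesystem and foo_do_fallocate() helper, using the
level-taking prototypes from this series):

static long foo_fallocate(struct file *file, int mode,
			  loff_t offset, loff_t len)
{
	struct inode *inode = file->f_mapping->host;
	long ret;

	/* Preallocation and hole punching can zero partial blocks and
	 * dirty pages, so they need the same freeze protection as
	 * write(2) and page faults. */
	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
	ret = foo_do_fallocate(inode, mode, offset, len);
	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
	return ret;
}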

Cheers,

Dave.
Jan Kara Jan. 24, 2012, 11:44 a.m. UTC | #2
On Tue 24-01-12 19:21:19, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:40PM +0100, Jan Kara wrote:
> > There are three entry points which dirty pages in a filesystem: mmap (handled
> > by block_page_mkwrite()), buffered write (handled by
> > __generic_file_aio_write()), and truncate (it can dirty the last partial page;
> > handled inside each filesystem separately). Protect these places with
> > sb_start_write() and sb_end_write().
> 
> fallocate can also dirty pages, either during preallocation or hole
> punching.  Hence, if you are going to promote truncate to
> SB_FREEZE_WRITE protection, then you need to promote everything else
> that can zero partial blocks as well.
> 
> That also means that anything that has implemented the XFS_IOC_ ioctl
> interfaces for preallocation and hole punching (xfs, ocfs2 and gfs2
> IIRC) also needs to be protected in the same way.
  Yeah, you are right. As I wrote in the introductory mail, there's a problem
with metadata operations (e.g. directory modifications) anyway, so we'll
have to audit all places where transactions are started. First I'll do this
for ext4 as a POC and then I'll try to do it for XFS if Eric doesn't beat
me to it (he promised to have a look at the XFS part ;).
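
The per-site shape of that audit would be roughly the following (a sketch
against ext4's journalling helpers; whether SB_FREEZE_WRITE is the right
level for metadata operations is part of what the audit has to decide):

static int example_dir_modify(struct inode *dir)
{
	handle_t *handle;
	int err;

	/* Take freeze protection before starting a transaction so a
	 * frozen filesystem never sees a new transaction. */
	sb_start_write(dir->i_sb, SB_FREEZE_WRITE);
	handle = ext4_journal_start(dir, EXT4_DATA_TRANS_BLOCKS(dir->i_sb));
	if (IS_ERR(handle)) {
		err = PTR_ERR(handle);
		goto out;
	}
	/* ... modify the directory under the transaction ... */
	err = ext4_journal_stop(handle);
out:
	sb_end_write(dir->i_sb, SB_FREEZE_WRITE);
	return err;
}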

								Honza
Eric Sandeen Feb. 5, 2012, 6:13 a.m. UTC | #3
On 1/20/12 2:34 PM, Jan Kara wrote:
> There are three entry points which dirty pages in a filesystem: mmap (handled
> by block_page_mkwrite()), buffered write (handled by
> __generic_file_aio_write()), and truncate (it can dirty the last partial page;
> handled inside each filesystem separately). Protect these places with
> sb_start_write() and sb_end_write().

The protection for truncate got lost since the first patchset; was that
on purpose?

-Eric

Jan Kara Feb. 6, 2012, 3:33 p.m. UTC | #4
On Sun 05-02-12 00:13:38, Eric Sandeen wrote:
> On 1/20/12 2:34 PM, Jan Kara wrote:
> > There are three entry points which dirty pages in a filesystem: mmap (handled
> > by block_page_mkwrite()), buffered write (handled by
> > __generic_file_aio_write()), and truncate (it can dirty the last partial page;
> > handled inside each filesystem separately). Protect these places with
> > sb_start_write() and sb_end_write().
> 
> The protection for truncate got lost since the first patchset; was that
> on purpose?
  It was not lost, but it got moved down into the filesystems. I forgot to
update the changelog. But after Dave's comments I think it can go back into
the VFS. It's just that lockdep complained about deadlocks in my first naive
approach - that's why I started doing weird things with XFS locks after all.

  Anyway, now I'm wiser regarding XFS locking and I also have a better idea
of how to achieve proper lock ordering in the VFS. It's just that we are
finishing SLE11 SP2, so I didn't get to writing the patches last week... But
I should get to it maybe even today, and if not, then at least during this
week ;)
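
For the truncate case, the idea is roughly the sketch below (a setattr-based
truncate path; "foo" and foo_truncate() are hypothetical, and where the
protection nests relative to i_mutex is exactly the lock-ordering question
above):

static int foo_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = dentry->d_inode;

	if (attr->ia_valid & ATTR_SIZE) {
		int err;

		/* Truncate can dirty the last partial page, so it needs
		 * the same freeze protection as write and page_mkwrite. */
		sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
		err = foo_truncate(inode, attr->ia_size);
		sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
		if (err)
			return err;
	}
	setattr_copy(inode, attr);
	mark_inode_dirty(inode);
	return 0;
}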

									Honza

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..550714d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2338,8 +2338,8 @@  EXPORT_SYMBOL(block_commit_write);
  * beyond EOF, then the page is guaranteed safe against truncation until we
  * unlock the page.
  *
- * Direct callers of this function should call vfs_check_frozen() so that page
- * fault does not busyloop until the fs is thawed.
+ * Direct callers of this function should protect against filesystem freezing
+ * using sb_start_write() - sb_end_write() functions.
  */
 int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 			 get_block_t get_block)
@@ -2371,18 +2371,7 @@  int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	if (unlikely(ret < 0))
 		goto out_unlock;
-	/*
-	 * Freezing in progress? We check after the page is marked dirty and
-	 * with page lock held so if the test here fails, we are sure freezing
-	 * code will wait during syncing until the page fault is done - at that
-	 * point page will be dirty and unlocked so freezing code will write it
-	 * and writeprotect it again.
-	 */
 	set_page_dirty(page);
-	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
 	wait_on_page_writeback(page);
 	return 0;
 out_unlock:
@@ -2397,12 +2386,9 @@  int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int ret;
 	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
 
-	/*
-	 * This check is racy but catches the common case. The check in
-	 * __block_page_mkwrite() is reliable.
-	 */
-	vfs_check_frozen(sb, SB_FREEZE_WRITE);
+	sb_start_write(sb, SB_FREEZE_WRITE);
 	ret = __block_page_mkwrite(vma, vmf, get_block);
+	sb_end_write(sb, SB_FREEZE_WRITE);
 	return block_page_mkwrite_return(ret);
 }
 EXPORT_SYMBOL(block_page_mkwrite);
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..471b9ae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2529,7 +2529,7 @@  ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	count = ocount;
 	pos = *ppos;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
@@ -2601,6 +2601,7 @@  ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				pos, ppos, count, written);
 	}
 out:
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }