ext4: Prevent race while walking extent tree

Message ID 1352372929-18513-1-git-send-email-lczerner@redhat.com
State Superseded, archived

Commit Message

Lukas Czerner Nov. 8, 2012, 11:08 a.m. UTC
Currently ext4_ext_walk_space() only takes i_data_sem for read while
searching for the extent at the given block with ext4_ext_find_extent().
It then drops the lock, so the extent tree can be changed at will.
Later on, however, we search for the 'next' extent, and because the
extent tree might already have changed, that information might not be
accurate.

In fact we can hit BUG_ON(end <= start) if an extent gets inserted into
the tree after the one we found and before the block we were searching
for. This has been reproduced by running xfstests 225 in a loop on the
s390x architecture. In theory we could hit it on any other architecture
as well, though probably not as often.

ext4_ext_walk_space() is currently only used from ext4_fiemap(), and even
if we do not hit the BUG_ON(), fiemap might return scrambled information
to the user.

Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
held. When calling it from ext4_fiemap() we only need to take i_data_sem
for read, but other possible users might want to modify the extents, in
which case they can take the write lock.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/extents.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)
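
The BUG_ON(end <= start) failure described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: model_hole_end() and model_range_is_sane() are hypothetical names, and the clamping rule is a simplified stand-in for how the walker bounds a hole by the start of the next allocated extent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model (assumption): a hole starts at 'block', is at most 'num'
 * blocks long, and is cut short at 'next', the first block of the next
 * allocated extent. Returns the exclusive end of the hole.
 */
static uint32_t model_hole_end(uint32_t block, uint32_t num, uint32_t next)
{
	uint32_t end = block + num;

	if (end >= next)
		end = next;
	return end;
}

/*
 * Returns 1 when [start, end) is a non-empty range, 0 when a check
 * equivalent to BUG_ON(end <= start) would fire.
 */
static int model_range_is_sane(uint32_t start, uint32_t end)
{
	return end > start;
}
```

With a consistent view of the tree, 'next' always lies beyond the block being searched for, so the range is non-empty. If an extent is inserted between the found extent and the searched block after the lookup, 'next' can be smaller than 'block', and the computed range collapses.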

Comments

Dmitry Monakhov Nov. 8, 2012, 12:01 p.m. UTC | #1
On Thu,  8 Nov 2012 12:08:49 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> Currently ext4_ext_walk_space() only takes i_data_sem for read when
> searching for the extent at given block with ext4_ext_find_extent().
> Then it drops the lock and the extent tree can be changed at will.
> However later on we're searching for the 'next' extent, but the extent
> tree might already have changed, so the information might not be
> accurate.
> 
> In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> the tree after the one we found and before the block we were searching
> for. This has been reproduced by running xfstests 225 in loop on s390x
> architecture, but theoretically we could hit this on any other
> architecture as well, but probably not as often.
> 
> ext4_ext_walk_space() is currently only used from ext4_fiemap() and even
> if we do not hit the BUG_ON() fiemap might return scrambled information
> to the user.
> 
> Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
> held. By calling it from ext4_fiemap() we can only take the i_data_sem
> for read, but possibly other users might want to modify the extents so
> they will be able to take write lock.
Agreed as a short-term fix for the BUG_ON case, but Theodore suggested
using a seqlock approach: http://lists.openwall.net/linux-ext4/2011/10/26/25

> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
>  fs/ext4/extents.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 7011ac9..f1aca06 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1959,6 +1959,11 @@ cleanup:
>  	return err;
>  }
>  
> +/*
> + * ext4_ext_walk_space() should be called with i_data_sem locked. If we're
> + * not modifying found extents, or extent tree in callback function, then
> + * read lock is ok.
> + */
>  static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
>  			       ext4_lblk_t num, ext_prepare_callback func,
>  			       void *cbdata)
> @@ -1976,9 +1981,7 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
>  	while (block < last && block != EXT_MAX_BLOCKS) {
>  		num = last - block;
>  		/* find extent for this block */
> -		down_read(&EXT4_I(inode)->i_data_sem);
>  		path = ext4_ext_find_extent(inode, block, path);
> -		up_read(&EXT4_I(inode)->i_data_sem);
>  		if (IS_ERR(path)) {
>  			err = PTR_ERR(path);
>  			path = NULL;
> @@ -5021,8 +5024,10 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		 * Walk the extent tree gathering extent information.
>  		 * ext4_ext_fiemap_cb will push extents back to user.
>  		 */
> +		down_read(&EXT4_I(inode)->i_data_sem);
>  		error = ext4_ext_walk_space(inode, start_blk, len_blks,
>  					  ext4_ext_fiemap_cb, fieinfo);
> +		up_read(&EXT4_I(inode)->i_data_sem);
>  	}
>  
>  	return error;
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner Nov. 8, 2012, 1:43 p.m. UTC | #2
On Thu, 8 Nov 2012, Dmitry Monakhov wrote:

> Date: Thu, 08 Nov 2012 16:01:17 +0400
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> To: Lukas Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org
> Cc: tytso@mit.edu, Lukas Czerner <lczerner@redhat.com>
> Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> 
> On Thu,  8 Nov 2012 12:08:49 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> > Currently ext4_ext_walk_space() only takes i_data_sem for read when
> > searching for the extent at given block with ext4_ext_find_extent().
> > Then it drops the lock and the extent tree can be changed at will.
> > However later on we're searching for the 'next' extent, but the extent
> > tree might already have changed, so the information might not be
> > accurate.
> > 
> > In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> > the tree after the one we found and before the block we were searching
> > for. This has been reproduced by running xfstests 225 in loop on s390x
> > architecture, but theoretically we could hit this on any other
> > architecture as well, but probably not as often.
> > 
> > ext4_ext_walk_space() is currently only used from ext4_fiemap() and even
> > if we do not hit the BUG_ON() fiemap might return scrambled information
> > to the user.
> > 
> > Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
> > held. By calling it from ext4_fiemap() we can only take the i_data_sem
> > for read, but possibly other users might want to modify the extents so
> > they will be able to take write lock.
> Agree as a short term fix for BUGON case, but Theodore suggested to use
> seqlock approach http://lists.openwall.net/linux-ext4/2011/10/26/25

Yeah, it makes sense to protect us from fiemap abuse, however using a
seqlock for walking the extent tree seems like overkill,
especially considering how much work that would require. We would
have to make sure that everything we do in ext4_ext_walk_space()
and in the other functions we call from there is safe even if the extent
tree changes under our hands. I do not think this is the right way.
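
To make the concern about the seqlock approach concrete, here is a minimal user-space sketch of a sequence-counter read/retry loop using C11 atomics. It is not the kernel seqlock API; struct seq_extent and its helpers are hypothetical, a single writer is assumed, and all atomics use the default sequentially consistent ordering for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical extent record protected by a sequence counter. */
struct seq_extent {
	atomic_uint seq;         /* even: stable, odd: update in progress */
	_Atomic uint32_t start;
	_Atomic uint32_t len;
};

static void seq_extent_init(struct seq_extent *e, uint32_t start, uint32_t len)
{
	atomic_init(&e->seq, 0);
	atomic_init(&e->start, start);
	atomic_init(&e->len, len);
}

/* Writer side (single writer assumed): bump the counter around the update. */
static void seq_extent_write(struct seq_extent *e, uint32_t start, uint32_t len)
{
	atomic_fetch_add(&e->seq, 1);   /* counter becomes odd */
	atomic_store(&e->start, start);
	atomic_store(&e->len, len);
	atomic_fetch_add(&e->seq, 1);   /* counter becomes even again */
}

/*
 * Reader side: takes no lock. It may observe a half-updated record, so it
 * retries until the counter is even and unchanged across the reads.
 */
static void seq_extent_read(struct seq_extent *e, uint32_t *start, uint32_t *len)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load(&e->seq);
		*start = atomic_load(&e->start);
		*len = atomic_load(&e->len);
		s2 = atomic_load(&e->seq);
	} while ((s1 & 1) || s1 != s2);
}
```

The reader never blocks the writer, but every step inside the retry loop has to tolerate seeing inconsistent data, which is the amount of auditing the paragraph above is worried about for a whole tree walk.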

I was thinking about checking for contention on the semaphore from
within ext4_ext_walk_space(), possibly enabling/disabling the check
with a function parameter?

Sadly the kernel does not provide a helper to check for that, so what
about something like this at the beginning of the while loop in
ext4_ext_walk_space()?

if (check_contention) {
	int contends = 0;
	unsigned long flags;

	/* i_data_sem is embedded in ext4_inode_info, so use '.' not '->' */
	raw_spin_lock_irqsave(&EXT4_I(inode)->i_data_sem.wait_lock, flags);
	if (!list_empty(&EXT4_I(inode)->i_data_sem.wait_list))
		contends = 1;
	raw_spin_unlock_irqrestore(&EXT4_I(inode)->i_data_sem.wait_lock, flags);

	/* stop walking early if someone is waiting for the semaphore */
	if (contends)
		break;
}

or we can add the helper to the rwsem code and use that.


What do you think ?

Thanks!
-Lukas

Lukas Czerner Nov. 8, 2012, 4:07 p.m. UTC | #3
On Thu, 8 Nov 2012, Lukáš Czerner wrote:

> Date: Thu, 8 Nov 2012 14:43:19 +0100 (CET)
> From: Lukáš Czerner <lczerner@redhat.com>
> To: Dmitry Monakhov <dmonakhov@openvz.org>
> Cc: Lukas Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org,
>     tytso@mit.edu
> Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> 
> On Thu, 8 Nov 2012, Dmitry Monakhov wrote:
> 
> > Date: Thu, 08 Nov 2012 16:01:17 +0400
> > From: Dmitry Monakhov <dmonakhov@openvz.org>
> > To: Lukas Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org
> > Cc: tytso@mit.edu, Lukas Czerner <lczerner@redhat.com>
> > Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> > 
> > On Thu,  8 Nov 2012 12:08:49 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> > > Currently ext4_ext_walk_space() only takes i_data_sem for read when
> > > searching for the extent at given block with ext4_ext_find_extent().
> > > Then it drops the lock and the extent tree can be changed at will.
> > > However later on we're searching for the 'next' extent, but the extent
> > > tree might already have changed, so the information might not be
> > > accurate.
> > > 
> > > In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> > > the tree after the one we found and before the block we were searching
> > > for. This has been reproduced by running xfstests 225 in loop on s390x
> > > architecture, but theoretically we could hit this on any other
> > > architecture as well, but probably not as often.
> > > 
> > > ext4_ext_walk_space() is currently only used from ext4_fiemap() and even
> > > if we do not hit the BUG_ON() fiemap might return scrambled information
> > > to the user.
> > > 
> > > Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
> > > held. By calling it from ext4_fiemap() we can only take the i_data_sem
> > > for read, but possibly other users might want to modify the extents so
> > > they will be able to take write lock.
> > Agree as a short term fix for BUGON case, but Theodore suggested to use
> > seqlock approach http://lists.openwall.net/linux-ext4/2011/10/26/25
> 
> Yeah, it make sense to protect us from fiemap abuse, however using
> seqlock for walking the extent tree seems like an overkill
> especially considering how much work will that require. We would
> have to make sure that everything we do in the ext4_ext_walk_space()
> and other function we're calling there is safe even if the extent
> tree change under our hands. I do not think this is the right way.
> 
> I was thinking about checking for contentions on the semaphore from
> within the ext4_ext_walk_space() - possibly enabling/disabling it
> with a function parameter ?
> 
> Sadly kernel does not provide a helper to check for that so what
> about something like this in the beginning of the while loop in
> ext4_ext_walk_space ?
> 
> if (check_contention) {
> 	int contends = 0;
> 	unsigned int flags;
> 
> 	raw_spin_lock_irqsave(&EXT4_I(inode)->i_data_sem->wait_lock, flags);
> 	if (!list_empty(&EXT4_I(inode)->i_data_sem->wait_list)
> 		contends = 1
> 	raw_spin_unlock_irqrestore(&EXT4_I(inode)->i_data_sem->wait_lock, flags);
> 
> 	if (contends)
> 		break
> }
> 
> or we can add the helper to the rwsem code and use that.
> 
> 
> What do you think ?

Never mind, there is no generic way to tell how many waiters for the
semaphore there are...

-Lukas

Zach Brown Nov. 8, 2012, 9:52 p.m. UTC | #4
On Thu, Nov 08, 2012 at 12:08:49PM +0100, Lukas Czerner wrote:
> +		down_read(&EXT4_I(inode)->i_data_sem);
>  		error = ext4_ext_walk_space(inode, start_blk, len_blks,
>  					  ext4_ext_fiemap_cb, fieinfo);
> +		up_read(&EXT4_I(inode)->i_data_sem);

Can this deadlock?  ext4_ext_fiemap_cb() seems to be doing all kinds of
exciting things that might also try and acquire the i_data_sem, like
GFP_KERNEL allocs (reclaim -> writepage) and copying to userspace (mmap
fault -> readpage -> get blocks).

It seems like the safer fix is to broaden the sampling lock coverage to
include referencing all the extent data but to release it around the
callback.

No?

- z
Lukas Czerner Nov. 9, 2012, 9:19 a.m. UTC | #5
On Thu, 8 Nov 2012, Zach Brown wrote:

> Date: Thu, 8 Nov 2012 13:52:33 -0800
> From: Zach Brown <zab@redhat.com>
> To: Lukas Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, tytso@mit.edu
> Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> 
> On Thu, Nov 08, 2012 at 12:08:49PM +0100, Lukas Czerner wrote:
> > +		down_read(&EXT4_I(inode)->i_data_sem);
> >  		error = ext4_ext_walk_space(inode, start_blk, len_blks,
> >  					  ext4_ext_fiemap_cb, fieinfo);
> > +		up_read(&EXT4_I(inode)->i_data_sem);
> 
> Can this deadlock?  ext4_ext_fiemap_cb() seems to be doing all kinds of
> exciting things that might also try and acquire the i_data_sem, like
> GFP_KERNEL allocs (reclaim -> writepage) and copying to userspace (mmap
> fault -> readpage -> get blocks).
> 
> It seems like the safer fix is to broaden the sampling lock coverage to
> include referencing all the extent data but to release it around the
> callback.
> 
> No?
> 
> - z

Yeah, you're right. Having the lock around the whole
ext4_ext_walk_space() might deadlock. I'll fix this.

Thanks for noticing this!
-Lukas

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..f1aca06 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1959,6 +1959,11 @@  cleanup:
 	return err;
 }
 
+/*
+ * ext4_ext_walk_space() should be called with i_data_sem locked. If we're
+ * not modifying found extents, or extent tree in callback function, then
+ * read lock is ok.
+ */
 static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 			       ext4_lblk_t num, ext_prepare_callback func,
 			       void *cbdata)
@@ -1976,9 +1981,7 @@  static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 	while (block < last && block != EXT_MAX_BLOCKS) {
 		num = last - block;
 		/* find extent for this block */
-		down_read(&EXT4_I(inode)->i_data_sem);
 		path = ext4_ext_find_extent(inode, block, path);
-		up_read(&EXT4_I(inode)->i_data_sem);
 		if (IS_ERR(path)) {
 			err = PTR_ERR(path);
 			path = NULL;
@@ -5021,8 +5024,10 @@  int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		 * Walk the extent tree gathering extent information.
 		 * ext4_ext_fiemap_cb will push extents back to user.
 		 */
+		down_read(&EXT4_I(inode)->i_data_sem);
 		error = ext4_ext_walk_space(inode, start_blk, len_blks,
 					  ext4_ext_fiemap_cb, fieinfo);
+		up_read(&EXT4_I(inode)->i_data_sem);
 	}
 
 	return error;