
More ext4 acl/xattr corruption - 4th occurrence now

Message ID 20090514212325.GG21316@mit.edu
State Accepted, archived

Commit Message

Theodore Ts'o May 14, 2009, 9:23 p.m. UTC
On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> short circuit i_cached_extent patch, mounted the fs without
> nodelalloc. I was able to run the full exchange backup without
> triggering the check_block_validity error.

Great!

So here's the final fix (it replaces the short circuit i_cached_extent
patch) which I plan to push to Linus.  It should be much less of a
performance hit than simply short-circuiting i_cached_extent...

Thanks so much for helping to track this down!!!  If ever someone
deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
you....

       	       	   	      	   	      - Ted

commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu May 14 17:09:37 2009 -0400

    ext4: Fix race in ext4_inode_info.i_cached_extent
    
    If one CPU is reading from a file while another CPU is writing to the
    same file at different locations, there is nothing protecting the
    i_cached_extent structure from being used and updated at the same
    time.  This could potentially cause the wrong location on disk to be
    read or written to, including potentially causing the corruption of
    the block group descriptors and/or inode table.
    
    Many thanks to Ken Shannah for helping to track down this problem.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Kevin Shanahan May 14, 2009, 9:33 p.m. UTC | #1
On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote:
> On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> > short circuit i_cached_extent patch, mounted the fs without
> > nodelalloc. I was able to run the full exchange backup without
> > triggering the check_block_validity error.
> 
> Great!
> 
> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus.  It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
> 
> Thanks so much for helping to track this down!!!  If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....

Hehe, no problem. Will do the final testing shortly (ran out of time
this morning, users are back on the system now). Just one little
correction to your patch below:

> commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu May 14 17:09:37 2009 -0400
> 
>     ext4: Fix race in ext4_inode_info.i_cached_extent
>     
>     If one CPU is reading from a file while another CPU is writing to the
>     same file at different locations, there is nothing protecting the
>     i_cached_extent structure from being used and updated at the same
>     time.  This could potentially cause the wrong location on disk to be
>     read or written to, including potentially causing the corruption of
>     the block group descriptors and/or inode table.
>     
>     Many thanks to Ken Shannah for helping to track down this problem.
                     ^^^^^^^^^^^

Cheers,
Kevin Shanahan.
Eric Sandeen May 15, 2009, 1:21 a.m. UTC | #2
Theodore Tso wrote:
> On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
>> Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
>> short circuit i_cached_extent patch, mounted the fs without
>> nodelalloc. I was able to run the full exchange backup without
>> triggering the check_block_validity error.
> 
> Great!
> 
> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus.  It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
> 
> Thanks so much for helping to track this down!!!  If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....
> 
>        	       	   	      	   	      - Ted

So here's a fio job I thought would try to hit this ...

[global]
ioengine=libaio
iodepth=1
filesize=4g
bs=1m
norandommap
direct=1
loops=500
size=16g

[setup]
filename=file
rw=randwrite
loops=1

[thread1]
stonewall
filename=file
rw=randwrite

[thread2]
filename=file
rw=randread

it should lay out a 4g file in random 1m direct IOs to fragment it and
get a lot of extents, then launch 2 threads, one each doing random reads
and random writes of that same file.

I can't make this trip it, though ...

-Eric
Aneesh Kumar K.V May 15, 2009, 4:55 a.m. UTC | #3
On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote:
> On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> > short circuit i_cached_extent patch, mounted the fs without
> > nodelalloc. I was able to run the full exchange backup without
> > triggering the check_block_validity error.
> 
> Great!
> 
> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus.  It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
> 
> Thanks so much for helping to track this down!!!  If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....
> 
>        	       	   	      	   	      - Ted
> 
> commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu May 14 17:09:37 2009 -0400
> 
>     ext4: Fix race in ext4_inode_info.i_cached_extent
>     
>     If one CPU is reading from a file while another CPU is writing to the
>     same file at different locations, there is nothing protecting the
>     i_cached_extent structure from being used and updated at the same
>     time.  This could potentially cause the wrong location on disk to be
>     read or written to, including potentially causing the corruption of
>     the block group descriptors and/or inode table.


It should be multiple readers. We don't allow read/write or multiple
writers via ext4_ext_get_blocks; &EXT4_I(inode)->i_data_sem is supposed
to protect against read/write and multiple-writer races. What it did
allow was multiple readers (get_block calls with create = 0), and the
readers also cached the extent information they read from the disk. So
the fix is correct, but we need to update the commit message.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


>     
>     Many thanks to Ken Shannah for helping to track down this problem.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 172656c..e3a55eb 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
>  {
>  	struct ext4_ext_cache *cex;
>  	BUG_ON(len == 0);
> +	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
>  	cex = &EXT4_I(inode)->i_cached_extent;
>  	cex->ec_type = type;
>  	cex->ec_block = block;
>  	cex->ec_len = len;
>  	cex->ec_start = start;
> +	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
>  }
> 
>  /*
> @@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
>  			struct ext4_extent *ex)
>  {
>  	struct ext4_ext_cache *cex;
> +	int ret = EXT4_EXT_CACHE_NO;
> 
> +	/* 
> +	 * We borrow i_block_reservation_lock to protect i_cached_extent
> +	 */
> +	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
>  	cex = &EXT4_I(inode)->i_cached_extent;
> 
>  	/* has cache valid data? */
>  	if (cex->ec_type == EXT4_EXT_CACHE_NO)
> -		return EXT4_EXT_CACHE_NO;
> +		goto errout;
> 
>  	BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP &&
>  			cex->ec_type != EXT4_EXT_CACHE_EXTENT);
> @@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
>  		ext_debug("%u cached by %u:%u:%llu\n",
>  				block,
>  				cex->ec_block, cex->ec_len, cex->ec_start);
> -		return cex->ec_type;
> +		ret = cex->ec_type;
>  	}
> -
> -	/* not in cache */
> -	return EXT4_EXT_CACHE_NO;
> +errout:
> +	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> +	return ret;
>  }
> 
>  /*
Theodore Ts'o May 15, 2009, 10:11 a.m. UTC | #4
On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote:
> > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> > Author: Theodore Ts'o <tytso@mit.edu>
> > Date:   Thu May 14 17:09:37 2009 -0400
> > 
> >     ext4: Fix race in ext4_inode_info.i_cached_extent
> >     
> >     If one CPU is reading from a file while another CPU is writing to the
> >     same file at different locations, there is nothing protecting the
> >     i_cached_extent structure from being used and updated at the same
> >     time.  This could potentially cause the wrong location on disk to be
> >     read or written to, including potentially causing the corruption of
> >     the block group descriptors and/or inode table.
> 
> 
> It should be multiple readers. We don't allow read/write or multiple
> writers via ext4_ext_get_blocks. &EXT4_I(inode)->i_data_sem is supposed
> to protect read/write and multiple writers. What it allowed was
> multiple readers(get_block call with create = 0). And readers did cache
> the extent information which it read from the disk. So the fix is
> correct, but we need to update the commit message.

Yes, you're right.  The i_data_sem would have protected us from the
worst of these problems.  The commit message isn't totally wrong,
though, since even a writer will call ext4_ext_get_blocks() with
create == 0.  But we should describe the failure mode much more
accurately, and say that it is two callers of ext4_ext_get_blocks()
with create=0 racing against one another.  It's possible, perhaps even
likely, that one of those callers came from a call of ext4_get_blocks()
with create=1, given that the people who reported this hit it while
running backup jobs.

However, one could imagine multiple mysql (or DB2) threads writing
into random parts of a database, and if two CPUs try to bmap already
allocated (and initialized) blocks out of the database file, so that
ext4_ext_get_blocks() is always being called with create=0, you could
very easily end up with a situation where blocks would be written to
the wrong location on disk.  <<shiver>>

We should make sure that all of the enterprise distributions that are
shipping ext4 as a "technology preview" get a copy of this patch as
the fix for a high-priority bug, since a database workload should trip this
pretty easily.

							- Ted
Theodore Ts'o May 15, 2009, 12:50 p.m. UTC | #5
On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
> 
> it should lay out a 4g file in random 1m direct IOs to fragment it and
> get a lot of extents, then launch 2 threads, one each doing random reads
> and random writes of that same file.
> 
> I can't make this trip it, though ...

If all of the blocks are in the page cache, you won't end up calling
ext4_get_blocks().  Try adding a shell script which runs in parallel
doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_caches; done".

							- Ted
Eric Sandeen May 15, 2009, 12:58 p.m. UTC | #6
Theodore Tso wrote:
> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>> it should lay out a 4g file in random 1m direct IOs to fragment it and
>> get a lot of extents, then launch 2 threads, one each doing random reads
>> and random writes of that same file.
>>
>> I can't make this trip it, though ...
> 
> If all of the blocks are in the page cache, you won't end up calling
> ext4_get_blocks().  Try adding a shell script which runs in parallel
> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_caches; done".
> 
> 							- Ted

I made sure it was a big enough file, and consumed enough memory on the
system before the test, that the entire file couldn't fit in memory.

I can try doing the dropping in the bg ... but it should have been going
to disk already.

-Eric
Theodore Ts'o May 15, 2009, 1:07 p.m. UTC | #7
On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote:
> So the fix is correct, but we need to update the commit message.

Here's the updated commit message.

						- Ted

ext4: Fix race in ext4_inode_info.i_cached_extent

If two CPUs call ext4_ext_get_blocks() simultaneously, there is
nothing protecting the i_cached_extent structure from being used and
updated at the same time.  This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development.  Fortunately once the data is stored in the page
cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult.  Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Eric Sandeen May 15, 2009, 3:24 p.m. UTC | #8
Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>>> it should lay out a 4g file in random 1m direct IOs to fragment it and
>>> get a lot of extents, then launch 2 threads, one each doing random reads
>>> and random writes of that same file.
>>>
>>> I can't make this trip it, though ...
>> If all of the blocks are in the page cache, you won't end up calling
>> ext4_get_blocks().  Try adding a shell script which runs in parallel
>> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_caches; done".
>>
>> 							- Ted
> 
> I made sure it was a big enough file, and consumed enough memory on the
> system before the test, that the entire file couldn't fit in memory.
> 
> I can try doing the dropping in the bg ... but it should have been going
> to disk already.
> 
> -Eric

in a desperate attempt to show the window, I tried this in
ext4_ext_put_in_cache():

        cex->ec_block = -1;
        cex->ec_start = -1;
        schedule_timeout_uninterruptible(HZ/2);
        cex->ec_start = start;
        cex->ec_block = block;

and this in ext4_ext_in_cache():

        if (cex->ec_block == -1 || cex->ec_start == -1)
                printk("%s got bad cache\n", __func__);

and it's not firing.

-Eric
Eric Sandeen May 15, 2009, 4:27 p.m. UTC | #9
Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>>>> it should lay out a 4g file in random 1m direct IOs to fragment it and
>>>> get a lot of extents, then launch 2 threads, one each doing random reads
>>>> and random writes of that same file.
>>>>
>>>> I can't make this trip it, though ...
>>> If all of the blocks are in the page cache, you won't end up calling
>>> ext4_get_blocks().  Try adding a shell script which runs in parallel
>>> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_caches; done".
>>>
>>> 							- Ted
>> I made sure it was a big enough file, and consumed enough memory on the
>> system before the test, that the entire file couldn't fit in memory.
>>
>> I can try doing the dropping in the bg ... but it should have been going
>> to disk already.
>>
>> -Eric
> 
> in a desperate attempt to show the window, I tried this in
> ext4_ext_put_in_cache():
> 
>         cex->ec_block = -1;
>         cex->ec_start = -1;
>         schedule_timeout_uninterruptible(HZ/2);
>         cex->ec_start = start;
>         cex->ec_block = block;
> 
> and this in ext4_ext_in_cache():
> 
>         if (cex->ec_block == -1 || cex->ec_start == -1)
>                 printk("%s got bad cache\n", __func__);
> 
> and it's not firing.

I take it back, needed a different workload.

Sorry for being pedantic, but if this race is so blindingly obvious and
we're getting so few reports, I wanted to be sure we could hit it.  With
my artificially wide window now I think I can see it, but I'm still not
winding up with any corruption or EIOs....

-Eric
Kevin Shanahan May 15, 2009, 11:18 p.m. UTC | #10
On Fri, May 15, 2009 at 07:03:14AM +0930, Kevin Shanahan wrote:
> On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote:
> > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> > > short circuit i_cached_extent patch, mounted the fs without
> > > nodelalloc. I was able to run the full exchange backup without
> > > triggering the check_block_validity error.
> > 
> > Great!
> > 
> > So here's the final fix (it replaces the short circuit i_cached_extent
> > patch) which I plan to push to Linus.  It should be much less of a
> > performance hit than simply short-circuiting i_cached_extent...
> > 
> > Thanks so much for helping to track this down!!!  If ever someone
> > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> > you....
> 
> Hehe, no problem. Will do the final testing shortly (ran out of time
> this morning, users are back on the system now). Just one little
> correction to your patch below:
> 
> > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> > Author: Theodore Ts'o <tytso@mit.edu>
> > Date:   Thu May 14 17:09:37 2009 -0400
> > 
> >     ext4: Fix race in ext4_inode_info.i_cached_extent
> >     
> >     If one CPU is reading from a file while another CPU is writing to the
> >     same file at different locations, there is nothing protecting the
> >     i_cached_extent structure from being used and updated at the same
> >     time.  This could potentially cause the wrong location on disk to be
> >     read or written to, including potentially causing the corruption of
> >     the block group descriptors and/or inode table.
> >     
> >     Many thanks to Ken Shannah for helping to track down this problem.
>                      ^^^^^^^^^^^

Tested-by: Kevin Shanahan <kmshanah@ucwb.org.au>

Yes, this patch seems to have fixed the issue. I ran my "exchange
backup to samba share" test on 2.6.29.3 + check_block_validity patch +
the fix race patch with no problems.

Cheers,
Kevin.
Thierry Vignaud May 19, 2009, 10 a.m. UTC | #11
Theodore Tso <tytso@mit.edu> writes:

> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus.  It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
> 
> Thanks so much for helping to track this down!!!  If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....
> 
>        	       	   	      	   	      - Ted
> 
> commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu May 14 17:09:37 2009 -0400
> 
>     ext4: Fix race in ext4_inode_info.i_cached_extent
>     
>     If one CPU is reading from a file while another CPU is writing to the
>     same file at different locations, there is nothing protecting the
>     i_cached_extent structure from being used and updated at the same
>     time.  This could potentially cause the wrong location on disk to be
>     read or written to, including potentially causing the corruption of
>     the block group descriptors and/or inode table.
>     
>     Many thanks to Ken Shannah for helping to track down this problem.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

I wonder if that would explain the corruption I reported a couple weeks
ago.

Now I remember that I mistakenly ran 2 parallel cp's from the same
source directory to the same target directory.
Could this be the cause?
Theodore Ts'o May 19, 2009, 11:36 a.m. UTC | #12
On Tue, May 19, 2009 at 12:00:56PM +0200, Thierry Vignaud wrote:
> I wonder if that would explain the corruption I reported a couple weeks
> ago.
> 
> Now I remember I wrongly got 2 parallel cp from the same source
> directory to the same target directory.
>
> Could this be the cause?

Yes, it's possible.  There could be a problem either if the two cp's
tried targeting the same file at the same time, or if the directory
was getting expanded at the same time by the two different processes.
Since we don't cache the logical->physical mapping for directories
(because of the journalling requirements, directories are stored in
the buffer cache, not the page cache), we're rather more likely to run
into problems with directories; more so since directories will tend
to be fragmented.

On the other hand, most of the time writes into the directory will
tend to be into pre-existing free space; but if you had two parallel
cp's copying a large number of files into the same directory, that
could certainly happen.

							- Ted


Alex Tomas May 19, 2009, 12:01 p.m. UTC | #13
Theodore Tso wrote:
> On the other hand, most of the time writes into the directory will
> tend to be into pre-existing free space; but if you had two parallel
> cp's copying a large number of files into the same directory, that
> could certainly happen.

isn't access to a directory protected by i_mutex?

thanks, Alex

Theodore Ts'o May 19, 2009, 3:04 p.m. UTC | #14
On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote:
> Theodore Tso wrote:
>> On the other hand, most of the time writes into the directory will
>> tend to be into pre-existing free space; but if you had two parallel
>> cp's copying a large number of files into the same directory, that
>> could certainly happen.
>
> isn't access to a directory protected with i_mutex ?

Hmm, good point.  Yes, that should prevent problems with directories.
So there should only be a problem when two processes are writing to
the same file at the same time.

						- Ted
Alex Tomas May 19, 2009, 3:16 p.m. UTC | #15
Theodore Tso wrote:
> On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote:
>> Theodore Tso wrote:
>>> On the other hand, most of the time writes into the directory will
>>> tend to be into pre-existing free space; but if you had two parallel
>>> cp's copying a large number of files into the same directory, that
>>> could certainly happen.
>> isn't access to a directory protected with i_mutex ?
> 
> Hmm, good point.  Yes, that should prevent problems with directories.
> So there should only be a problem when two processes are writing to
> the same file at the same time.

I guess reading can corrupt it as well?

thanks, Alex
Thierry Vignaud May 19, 2009, 3:18 p.m. UTC | #16
Theodore Tso <tytso@mit.edu> writes:

> >> On the other hand, most of the time writes into the directory will
> >> tend to be into pre-existing free space; but if you had two parallel
> >> cp's copying a large number of files into the same directory, that
> >> could certainly happen.
> >
> > isn't access to a directory protected with i_mutex ?
> 
> Hmm, good point.  Yes, that should prevent problems with directories.
> So there should only be a problem when two processes are writing to
> the same file at the same time.

I remember that happened...

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 172656c..e3a55eb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1841,11 +1841,13 @@  ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
 {
 	struct ext4_ext_cache *cex;
 	BUG_ON(len == 0);
+	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
 	cex = &EXT4_I(inode)->i_cached_extent;
 	cex->ec_type = type;
 	cex->ec_block = block;
 	cex->ec_len = len;
 	cex->ec_start = start;
+	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
 }
 
 /*
@@ -1902,12 +1904,17 @@  ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
 			struct ext4_extent *ex)
 {
 	struct ext4_ext_cache *cex;
+	int ret = EXT4_EXT_CACHE_NO;
 
+	/* 
+	 * We borrow i_block_reservation_lock to protect i_cached_extent
+	 */
+	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
 	cex = &EXT4_I(inode)->i_cached_extent;
 
 	/* has cache valid data? */
 	if (cex->ec_type == EXT4_EXT_CACHE_NO)
-		return EXT4_EXT_CACHE_NO;
+		goto errout;
 
 	BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP &&
 			cex->ec_type != EXT4_EXT_CACHE_EXTENT);
@@ -1918,11 +1925,11 @@  ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
 		ext_debug("%u cached by %u:%u:%llu\n",
 				block,
 				cex->ec_block, cex->ec_len, cex->ec_start);
-		return cex->ec_type;
+		ret = cex->ec_type;
 	}
-
-	/* not in cache */
-	return EXT4_EXT_CACHE_NO;
+errout:
+	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
+	return ret;
 }
 
 /*