Message ID | 20090514212325.GG21316@mit.edu |
---|---|
State | Accepted, archived |
Headers | show |
On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote: > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > > short circuit i_cached_extent patch, mounted the fs without > > nodelalloc. I was able to run the full exchange backup without > > triggering the check_block_validity error. > > Great! > > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... Hehe, no problem. Will do the final testing shortly (ran out of time this morning, users are back on the system now). Just one little correction to your patch below: > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 17:09:37 2009 -0400 > > ext4: Fix race in ext4_inode_info.i_cached_extent > > If one CPU is reading from a file while another CPU is writing to the > same file different locations, there is nothing protecting the > i_cached_extent structure from being used and updated at the same > time. This could potentially cause the wrong location on disk to be > read or written to, including potentially causing the corruption of > the block group descriptors and/or inode table. > > Many thanks to Ken Shannah for helping to track down this problem. ^^^^^^^^^^^ Cheers, Kevin Shanahan. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: >> Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + >> short circuit i_cached_extent patch, mounted the fs without >> nodelalloc. I was able to run the full exchange backup without >> triggering the check_block_validity error. > > Great! > > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... > > - Ted So here's a fio job I thought would try to hit this ... [global] ioengine=libaio iodepth=1 filesize=4g bs=1m norandommap direct=1 loops=500 size=16g [setup] filename=file rw=randwrite loops=1 [thread1] stonewall filename=file rw=randwrite [thread2] filename=file rw=randread it should lay out a 4g file in random 1m direct IOs to fragment it and get a lot of extents, then launch 2 threads, one each doing random reads and random writes of that same file. I can't make this trip it, though ... -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote: > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > > short circuit i_cached_extent patch, mounted the fs without > > nodelalloc. I was able to run the full exchange backup without > > triggering the check_block_validity error. > > Great! > > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... > > - Ted > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 17:09:37 2009 -0400 > > ext4: Fix race in ext4_inode_info.i_cached_extent > > If one CPU is reading from a file while another CPU is writing to the > same file different locations, there is nothing protecting the > i_cached_extent structure from being used and updated at the same > time. This could potentially cause the wrong location on disk to be > read or written to, including potentially causing the corruption of > the block group descriptors and/or inode table. It should be multiple readers. We don't allow read/write or multiple writers via ext4_ext_get_blocks. &EXT4_I(inode)->i_data_sem is supposed to protect read/write and multiple writers. What it allowed was multiple readers(get_block call with create = 0). And readers did cache the extent information which it read from the disk. So the fix is correct, but we need to update the commit message. Reviewed-by:Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > > Many thanks to Ken Shannah for helping to track down this problem. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index 172656c..e3a55eb 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, > { > struct ext4_ext_cache *cex; > BUG_ON(len == 0); > + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); > cex = &EXT4_I(inode)->i_cached_extent; > cex->ec_type = type; > cex->ec_block = block; > cex->ec_len = len; > cex->ec_start = start; > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); > } > > /* > @@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, > struct ext4_extent *ex) > { > struct ext4_ext_cache *cex; > + int ret = EXT4_EXT_CACHE_NO; > > + /* > + * We borrow i_block_reservation_lock to protect i_cached_extent > + */ > + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); > cex = &EXT4_I(inode)->i_cached_extent; > > /* has cache valid data? */ > if (cex->ec_type == EXT4_EXT_CACHE_NO) > - return EXT4_EXT_CACHE_NO; > + goto errout; > > BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP && > cex->ec_type != EXT4_EXT_CACHE_EXTENT); > @@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, > ext_debug("%u cached by %u:%u:%llu\n", > block, > cex->ec_block, cex->ec_len, cex->ec_start); > - return cex->ec_type; > + ret = cex->ec_type; > } > - > - /* not in cache */ > - return EXT4_EXT_CACHE_NO; > +errout: > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); > + return ret; > } > > /* > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote: > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > > Author: Theodore Ts'o <tytso@mit.edu> > > Date: Thu May 14 17:09:37 2009 -0400 > > > > ext4: Fix race in ext4_inode_info.i_cached_extent > > > > If one CPU is reading from a file while another CPU is writing to the > > same file different locations, there is nothing protecting the > > i_cached_extent structure from being used and updated at the same > > time. This could potentially cause the wrong location on disk to be > > read or written to, including potentially causing the corruption of > > the block group descriptors and/or inode table. > > > It should be multiple readers. We don't allow read/write or multiple > writers via ext4_ext_get_blocks. &EXT4_I(inode)->i_data_sem is supposed > to protect read/write and multiple writers. What it allowed was > multiple readers(get_block call with create = 0). And readers did cache > the extent information which it read from the disk. So the fix is > correct, but we need to update the commit message. Yes, you're right. The i_data_sem would have protected us from the worst of these problems. The commit message isn't totally wrong, though, since even a writer will call ext4_ext_get_blocks() with create == 0. But we should describe the failure mode much more accurately, and say that it is two callers of ext4_ext_get_blocks() with create=0 racing against one another. It's possible, perhaps even likely, that one of those callers came from a call of ext4_get_blocks() with create=1 --- given that the people who were reporting this were doing so when doing backup jobs. However, one could imagine multiple mysql (or DB2) threads writing into random parts of a database, and if two CPU's try to bmap already allocated (and initialized) blocks out of the database file, so that ext4_ext_get_blocks() is always being called with create=0, you could very easily end up with a situation where blocks would be written to the wrong location on disk. <<shiver>> We should make sure that all of the enterprise distributions that are shipping ext4 as a "technology preview" get a copy of this patch as a fixing a high-priority bug, since a database workload should trip this pretty easily. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: > > it should lay out a 4g file in random 1m direct IOs to fragment it and > get a lot of extents, then launch 2 threads, one each doing random reads > and random writes of that same file. > > I can't make this trip it, though ... If all of the blocks are in the page cache, you won't end up calling ext4_get_blocks(). Try adding a shell script which runs in parallel doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: >> it should lay out a 4g file in random 1m direct IOs to fragment it and >> get a lot of extents, then launch 2 threads, one each doing random reads >> and random writes of that same file. >> >> I can't make this trip it, though ... > > If all of the blocks are in the page cache, you won't end up calling > ext4_get_blocks(). Try adding a shell script which runs in parallel > doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". > > - Ted I made sure it was a big enough file, and consumed enough memory on the system before the test, that the entire file couldn't fit in memory. I can try doing the dropping in the bg ... but it should have been going to disk already. -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote: > So the fix is correct, but we need to update the commit message. Here's the updated commit message. - Ted ext4: Fix race in ext4_inode_info.i_cached_extent If two CPU's simultaneously call ext4_ext_get_blocks() at the same time, there is nothing protecting the i_cached_extent structure from being used and updated at the same time. This could potentially cause the wrong location on disk to be read or written to, including potentially causing the corruption of the block group descriptors and/or inode table. This bug has been in the ext4 code since almost the very beginning of ext4's development. Fortunately once the data is stored in the page cache cache, ext4_get_blocks() doesn't need to be called, so trying to replicate this problem to the point where we could identify its root cause was *extremely* difficult. Many thanks to Kevin Shanahan for working over several months to be able to reproduce this easily so we could finally nail down the cause of the corruption. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Sandeen wrote: > Theodore Tso wrote: >> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: >>> it should lay out a 4g file in random 1m direct IOs to fragment it and >>> get a lot of extents, then launch 2 threads, one each doing random reads >>> and random writes of that same file. >>> >>> I can't make this trip it, though ... >> If all of the blocks are in the page cache, you won't end up calling >> ext4_get_blocks(). Try adding a shell script which runs in parallel >> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". >> >> - Ted > > I made sure it was a big enough file, and consumed enough memory on the > system before the test, that the entire file couldn't fit in memory. > > I can try doing the dropping in the bg ... but it should have been going > to disk already. > > -Eric in a desperate attempt to show the window, I tried this in ext4_ext_put_in_cache(): cex->ec_block = -1; cex->ec_start = -1; schedule_timeout_uninterruptible(HZ/2); cex->ec_start = start; cex->ec_block = block; and this in ext4_ext_in_cache(): if (cex->ec_block == -1 || cex->ec_start == -1) printk("%s got bad cache\n", __func__); and it's not firing. -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Sandeen wrote: > Eric Sandeen wrote: >> Theodore Tso wrote: >>> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote: >>>> it should lay out a 4g file in random 1m direct IOs to fragment it and >>>> get a lot of extents, then launch 2 threads, one each doing random reads >>>> and random writes of that same file. >>>> >>>> I can't make this trip it, though ... >>> If all of the blocks are in the page cache, you won't end up calling >>> ext4_get_blocks(). Try adding a shell script which runs in parallel >>> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done". >>> >>> - Ted >> I made sure it was a big enough file, and consumed enough memory on the >> system before the test, that the entire file couldn't fit in memory. >> >> I can try doing the dropping in the bg ... but it should have been going >> to disk already. >> >> -Eric > > in a desperate attempt to show the window, I tried this in > ext4_ext_put_in_cache(): > > cex->ec_block = -1; > cex->ec_start = -1; > schedule_timeout_uninterruptible(HZ/2); > cex->ec_start = start; > cex->ec_block = block; > > and this in ext4_ext_in_cache(): > > if (cex->ec_block == -1 || cex->ec_start == -1) > printk("%s got bad cache\n", __func__); > > and it's not firing. I take it back, needed a different workload. Sorry for being pedantic, but if this race is so blindingly obvious and we're getting so few reports, I wanted to be sure we could hit it. With my artificially wide window now I think I can see it, but I'm still not winding up with any corruption or EIOs.... -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 15, 2009 at 07:03:14AM +0930, Kevin Shanahan wrote: > On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote: > > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote: > > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch + > > > short circuit i_cached_extent patch, mounted the fs without > > > nodelalloc. I was able to run the full exchange backup without > > > triggering the check_block_validity error. > > > > Great! > > > > So here's the final fix (it replaces the short circuit i_cached_extent > > patch) which I plan to push to Linus. It should be much less of a > > performance hit than simply short-circuiting i_cached_extent... > > > > Thanks so much for helping to find track this down!!! If ever someone > > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > > you.... > > Hehe, no problem. Will do the final testing shortly (ran out of time > this morning, users are back on the system now). Just one little > correction to your patch below: > > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > > Author: Theodore Ts'o <tytso@mit.edu> > > Date: Thu May 14 17:09:37 2009 -0400 > > > > ext4: Fix race in ext4_inode_info.i_cached_extent > > > > If one CPU is reading from a file while another CPU is writing to the > > same file different locations, there is nothing protecting the > > i_cached_extent structure from being used and updated at the same > > time. This could potentially cause the wrong location on disk to be > > read or written to, including potentially causing the corruption of > > the block group descriptors and/or inode table. > > > > Many thanks to Ken Shannah for helping to track down this problem. > ^^^^^^^^^^^ Tested-by: Kevin Shanahan <kmshanah@ucwb.org.au> Yes, this patch seems to have fixed the issue. I ran my "exchange backup to samba share" test on 2.6.29.3 + check_block_validity patch + the fix race patch with no problems. Cheers, Kevin. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso <tytso@mit.edu> writes: > So here's the final fix (it replaces the short circuit i_cached_extent > patch) which I plan to push to Linus. It should be much less of a > performance hit than simply short-circuiting i_cached_extent... > > Thanks so much for helping to find track this down!!! If ever someone > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be > you.... > > - Ted > > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu May 14 17:09:37 2009 -0400 > > ext4: Fix race in ext4_inode_info.i_cached_extent > > If one CPU is reading from a file while another CPU is writing to the > same file different locations, there is nothing protecting the > i_cached_extent structure from being used and updated at the same > time. This could potentially cause the wrong location on disk to be > read or written to, including potentially causing the corruption of > the block group descriptors and/or inode table. > > Many thanks to Ken Shannah for helping to track down this problem. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> I wonder if that would explain the corruption I reported a couple weeks ago. Now I remember I wrongly got 2 parallel cp from the same source directory to the same target directory. Could this be the cause? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2009 at 12:00:56PM +0200, Thierry Vignaud wrote: > I wonder if that would explain the corruption I reported a couple weeks > ago. > > Now I remember I wrongly got 2 parallel cp from the same source > directory to the same target directory. > > Could this be the cause? Yes, it's possible. There could be a problem either if the two cp's tried targetting the same file at the same time, or if the directory was getting expanded at the same time by the two different processes. Since we don't cache logical->physical mapping for directories (since because of the journalling requirements directories are stored in the buffer cache, not the page cache), rather more likely to run into problems with directories; more so since directoris will tend to be fragmented. On the other hand, most of the time writes into the directory will tend to be into pre-existing free space; but if you had two parallel cp's copying a large number of files into the same directory, that could certainly happen. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On the other hand, most of the time writes into the directory will > tend to be into pre-existing free space; but if you had two parallel > cp's copying a large number of files into the same directory, that > could certainly happen. isn't access to a directory protected with i_mutex ? thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote: > Theodore Tso wrote: >> On the other hand, most of the time writes into the directory will >> tend to be into pre-existing free space; but if you had two parallel >> cp's copying a large number of files into the same directory, that >> could certainly happen. > > isn't access to a directory protected with i_mutex ? Hmm, good point. Yes, that should prevent problems with directories. So there should only be a problem when two processes are writing to the same file at the same time. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote: >> Theodore Tso wrote: >>> On the other hand, most of the time writes into the directory will >>> tend to be into pre-existing free space; but if you had two parallel >>> cp's copying a large number of files into the same directory, that >>> could certainly happen. >> isn't access to a directory protected with i_mutex ? > > Hmm, good point. Yes, that should prevent problems with directories. > So there should only be a problem when two processes are writing to > the same file at the same time. I guess reading can corrupt it as well ? thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso <tytso@mit.edu> writes: > >> On the other hand, most of the time writes into the directory will > >> tend to be into pre-existing free space; but if you had two parallel > >> cp's copying a large number of files into the same directory, that > >> could certainly happen. > > > > isn't access to a directory protected with i_mutex ? > > Hmm, good point. Yes, that should prevent problems with directories. > So there should only be a problem when two processes are writing to > the same file at the same time. I remember that happened... -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 172656c..e3a55eb 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, { struct ext4_ext_cache *cex; BUG_ON(len == 0); + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); cex = &EXT4_I(inode)->i_cached_extent; cex->ec_type = type; cex->ec_block = block; cex->ec_len = len; cex->ec_start = start; + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); } /* @@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, struct ext4_extent *ex) { struct ext4_ext_cache *cex; + int ret = EXT4_EXT_CACHE_NO; + /* + * We borrow i_block_reservation_lock to protect i_cached_extent + */ + spin_lock(&EXT4_I(inode)->i_block_reservation_lock); cex = &EXT4_I(inode)->i_cached_extent; /* has cache valid data? */ if (cex->ec_type == EXT4_EXT_CACHE_NO) - return EXT4_EXT_CACHE_NO; + goto errout; BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP && cex->ec_type != EXT4_EXT_CACHE_EXTENT); @@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block, ext_debug("%u cached by %u:%u:%llu\n", block, cex->ec_block, cex->ec_len, cex->ec_start); - return cex->ec_type; + ret = cex->ec_type; } - - /* not in cache */ - return EXT4_EXT_CACHE_NO; +errout: + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); + return ret; } /*