Message ID | 20091201160324.GA25873@quack.suse.cz |
---|---|
State | Not Applicable, archived |
Headers | show |
On Tue, 2009-12-01 at 17:03 +0100, Jan Kara wrote:
> The patch below fixes the issue for me...
Ditto.
-Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2009-12-01, at 09:03, Jan Kara wrote: > On Tue 01-12-09 15:35:59, Jan Kara wrote: >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote: >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote: >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote: >>>>> This test case is distilled from an actual application which >>>>> doesn't even intentionally use writev: it just uses C++'s >>>>> ofstream class to write data to a file. Unfortunately, that >>>>> class smart and uses writev under the covers. Unfortunately, I >>>>> guess nobody ever tests linux writev behavior, since it's broken >>>>> _so_much_of_the_time_. I really am quite astounded to see such a >>>>> bad track record for such a fundamental core system call.... I suspect an excellent way of exposing problems with the writev() interface would be to wire it into fsx, which is commonly run as a stress test for Linux. I don't know if it would have caught this case, but it definitely couldn't hurt to get more testing cycles for it. >> Ext4 also has this problem but delayed allocation mitigates the >> effect to an error in accounting of blocks reserved for delayed >> allocation and thus under normal circumstances nothing bad happens. It looks like ext4 might still hit this problem, if delalloc is disabled. Could you please submit a similar patch for ext4 also. > The patch below fixes the issue for me... > > Honza > > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001 > From: Jan Kara <jack@suse.cz> > Date: Tue, 1 Dec 2009 16:53:06 +0100 > Subject: [PATCH] ext3: Fix data / filesystem corruption when write > fails to copy data > > When ext3_write_begin fails after allocating some blocks or > generic_perform_write fails to copy data to write, we truncate > blocks already instantiated beyond i_size. Although these blocks > were never inside i_size, we have to truncate pagecache of these > blocks so that corresponding buffers get unmapped. Otherwise > subsequent __block_prepare_write (called because we are retrying the > write) will find the buffers mapped, not call ->get_block, and thus > the page will be backed by already freed blocks leading to > filesystem and data corruption. > > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/ext3/inode.c | 18 ++++++++++++++---- > 1 files changed, 14 insertions(+), 4 deletions(-) > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c > index 354ed3b..f9d6937 100644 > --- a/fs/ext3/inode.c > +++ b/fs/ext3/inode.c > @@ -1151,6 +1151,16 @@ static int > do_journal_get_write_access(handle_t *handle, > return ext3_journal_get_write_access(handle, bh); > } > > +/* > + * Truncate blocks that were not used by write. We have to truncate > the > + * pagecache as well so that corresponding buffers get properly > unmapped. > + */ > +static void ext3_truncate_failed_write(struct inode *inode) > +{ > + truncate_inode_pages(inode->i_mapping, inode->i_size); > + ext3_truncate(inode); > +} > + > static int ext3_write_begin(struct file *file, struct address_space > *mapping, > loff_t pos, unsigned len, unsigned flags, > struct page **pagep, void **fsdata) > @@ -1209,7 +1219,7 @@ write_begin_failed: > unlock_page(page); > page_cache_release(page); > if (pos + len > inode->i_size) > - ext3_truncate(inode); > + ext3_truncate_failed_write(inode); > } > if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries)) > goto retry; > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file > *file, > page_cache_release(page); > > if (pos + len > inode->i_size) > - ext3_truncate(inode); > + ext3_truncate_failed_write(inode); > return ret ? ret : copied; > } > > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct > file *file, > page_cache_release(page); > > if (pos + len > inode->i_size) > - ext3_truncate(inode); > + ext3_truncate_failed_write(inode); > return ret ? ret : copied; > } > > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct > file *file, > page_cache_release(page); > > if (pos + len > inode->i_size) > - ext3_truncate(inode); > + ext3_truncate_failed_write(inode); > return ret ? ret : copied; > } > > -- > 1.6.4.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux- > ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote: > On 2009-12-01, at 09:03, Jan Kara wrote: > > On Tue 01-12-09 15:35:59, Jan Kara wrote: > >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote: > >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote: > >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote: > >>>>> This test case is distilled from an actual application which > >>>>> doesn't even intentionally use writev: it just uses C++'s > >>>>> ofstream class to write data to a file. Unfortunately, that > >>>>> class smart and uses writev under the covers. Unfortunately, I > >>>>> guess nobody ever tests linux writev behavior, since it's broken > >>>>> _so_much_of_the_time_. I really am quite astounded to see such a > >>>>> bad track record for such a fundamental core system call.... > > I suspect an excellent way of exposing problems with the writev() > interface would be to wire it into fsx, which is commonly run as a > stress test for Linux. I don't know if it would have caught this > case, but it definitely couldn't hurt to get more testing cycles for it. Maybe someone from LTP would be interested in adding this test functionality to fsx-linux ? Source/test program is available at http://marc.info/?l=linux-kernel&m=125961612418323&w=2 > >> Ext4 also has this problem but delayed allocation mitigates the > >> effect to an error in accounting of blocks reserved for delayed > >> allocation and thus under normal circumstances nothing bad happens. > > It looks like ext4 might still hit this problem, if delalloc is > disabled. Could you please submit a similar patch for ext4 also. > > > The patch below fixes the issue for me... > > > > Honza > > > > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001 > > From: Jan Kara <jack@suse.cz> > > Date: Tue, 1 Dec 2009 16:53:06 +0100 > > Subject: [PATCH] ext3: Fix data / filesystem corruption when write > > fails to copy data > > > > When ext3_write_begin fails after allocating some blocks or > > generic_perform_write fails to copy data to write, we truncate > > blocks already instantiated beyond i_size. Although these blocks > > were never inside i_size, we have to truncate pagecache of these > > blocks so that corresponding buffers get unmapped. Otherwise > > subsequent __block_prepare_write (called because we are retrying the > > write) will find the buffers mapped, not call ->get_block, and thus > > the page will be backed by already freed blocks leading to > > filesystem and data corruption. > > > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > fs/ext3/inode.c | 18 ++++++++++++++---- > > 1 files changed, 14 insertions(+), 4 deletions(-) > > > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c > > index 354ed3b..f9d6937 100644 > > --- a/fs/ext3/inode.c > > +++ b/fs/ext3/inode.c > > @@ -1151,6 +1151,16 @@ static int > > do_journal_get_write_access(handle_t *handle, > > return ext3_journal_get_write_access(handle, bh); > > } > > > > +/* > > + * Truncate blocks that were not used by write. We have to truncate > > the > > + * pagecache as well so that corresponding buffers get properly > > unmapped. > > + */ > > +static void ext3_truncate_failed_write(struct inode *inode) > > +{ > > + truncate_inode_pages(inode->i_mapping, inode->i_size); > > + ext3_truncate(inode); > > +} > > + > > static int ext3_write_begin(struct file *file, struct address_space > > *mapping, > > loff_t pos, unsigned len, unsigned flags, > > struct page **pagep, void **fsdata) > > @@ -1209,7 +1219,7 @@ write_begin_failed: > > unlock_page(page); > > page_cache_release(page); > > if (pos + len > inode->i_size) > > - ext3_truncate(inode); > > + ext3_truncate_failed_write(inode); > > } > > if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries)) > > goto retry; > > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file > > *file, > > page_cache_release(page); > > > > if (pos + len > inode->i_size) > > - ext3_truncate(inode); > > + ext3_truncate_failed_write(inode); > > return ret ? ret : copied; > > } > > > > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct > > file *file, > > page_cache_release(page); > > > > if (pos + len > inode->i_size) > > - ext3_truncate(inode); > > + ext3_truncate_failed_write(inode); > > return ret ? ret : copied; > > } > > > > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct > > file *file, > > page_cache_release(page); > > > > if (pos + len > inode->i_size) > > - ext3_truncate(inode); > > + ext3_truncate_failed_write(inode); > > return ret ? ret : copied; > > } > > > > -- > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. --- ~Randy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Dec 1, 2009, at 11:03 AM, Jan Kara wrote: > On Tue 01-12-09 15:35:59, Jan Kara wrote: >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote: >>> I bisected it this morning. Bisected cleanly to... >>> >>> 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit >> OK, I've debugged it. This commit is really at fault. The problem is >> following: >> When using writev, the page we copy from is not paged in (while when we >> use ordinary write, it is paged in). This difference might be worth >> investigation on its own (as it is likely to heavily impact performance of >> writev) but is irrelevant for us now - we should handle this without data >> corruption anyway. Because the source page is not available, we pass 0 as >> the number of copied bytes to write_end and thus ext3_write_end decides to >> truncate the file to original size. This is perfectly fine. The problem is >> that we do this by ext3_truncate() which just frees corresponding block but >> does not unmap buffers. So we leave mapped buffers beyond i_size (they >> actually never were inside i_size) but the blocks they are mapped to are >> already free. The write is then retried (after mapping the page), >> block_write_begin() sees the buffer is mapped (although it is beyond >> i_size) and thus it does not call ext3_get_block() anymore. So as a result, >> data is written to a block that is no longer allocated to the file. Bummer >> - welcome filesystem corruption. >> Ext4 also has this problem but delayed allocation mitigates the effect to >> an error in accounting of blocks reserved for delayed allocation and thus >> under normal circumstances nothing bad happens. >> The question is how to solve this in the cleanest way. We can call >> vmtruncate() instead of ext3_truncate() as we used to do but Nick wants to >> get rid of that (that's why I originally changed the code to what it is >> now). So probably we could just manually call truncate_pagecache() instead. >> Nick, I think your truncate calling sequence patch set needs similar fix >> for all filesystems as well. > The patch below fixes the issue for me... Thank you! I can confirm that the patch fixes the issue in my real application as well. James-- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-12-02 at 10:53 -0800, Randy Dunlap wrote: > On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote: > > > On 2009-12-01, at 09:03, Jan Kara wrote: > > > On Tue 01-12-09 15:35:59, Jan Kara wrote: > > >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote: > > >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote: > > >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote: > > >>>>> This test case is distilled from an actual application which > > >>>>> doesn't even intentionally use writev: it just uses C++'s > > >>>>> ofstream class to write data to a file. Unfortunately, that > > >>>>> class smart and uses writev under the covers. Unfortunately, I > > >>>>> guess nobody ever tests linux writev behavior, since it's broken > > >>>>> _so_much_of_the_time_. I really am quite astounded to see such a > > >>>>> bad track record for such a fundamental core system call.... > > > > I suspect an excellent way of exposing problems with the writev() > > interface would be to wire it into fsx, which is commonly run as a > > stress test for Linux. I don't know if it would have caught this > > case, but it definitely couldn't hurt to get more testing cycles for it. > > Maybe someone from LTP would be interested in adding this test functionality > to fsx-linux ? > Sure Randy, We will do it. Regards-- Subrata > Source/test program is available at > http://marc.info/?l=linux-kernel&m=125961612418323&w=2 > > > > >> Ext4 also has this problem but delayed allocation mitigates the > > >> effect to an error in accounting of blocks reserved for delayed > > >> allocation and thus under normal circumstances nothing bad happens. > > > > It looks like ext4 might still hit this problem, if delalloc is > > disabled. Could you please submit a similar patch for ext4 also. > > > > > The patch below fixes the issue for me... > > > > > > Honza > > > > > > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001 > > > From: Jan Kara <jack@suse.cz> > > > Date: Tue, 1 Dec 2009 16:53:06 +0100 > > > Subject: [PATCH] ext3: Fix data / filesystem corruption when write > > > fails to copy data > > > > > > When ext3_write_begin fails after allocating some blocks or > > > generic_perform_write fails to copy data to write, we truncate > > > blocks already instantiated beyond i_size. Although these blocks > > > were never inside i_size, we have to truncate pagecache of these > > > blocks so that corresponding buffers get unmapped. Otherwise > > > subsequent __block_prepare_write (called because we are retrying the > > > write) will find the buffers mapped, not call ->get_block, and thus > > > the page will be backed by already freed blocks leading to > > > filesystem and data corruption. > > > > > > Signed-off-by: Jan Kara <jack@suse.cz> > > > --- > > > fs/ext3/inode.c | 18 ++++++++++++++---- > > > 1 files changed, 14 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c > > > index 354ed3b..f9d6937 100644 > > > --- a/fs/ext3/inode.c > > > +++ b/fs/ext3/inode.c > > > @@ -1151,6 +1151,16 @@ static int > > > do_journal_get_write_access(handle_t *handle, > > > return ext3_journal_get_write_access(handle, bh); > > > } > > > > > > +/* > > > + * Truncate blocks that were not used by write. We have to truncate > > > the > > > + * pagecache as well so that corresponding buffers get properly > > > unmapped. > > > + */ > > > +static void ext3_truncate_failed_write(struct inode *inode) > > > +{ > > > + truncate_inode_pages(inode->i_mapping, inode->i_size); > > > + ext3_truncate(inode); > > > +} > > > + > > > static int ext3_write_begin(struct file *file, struct address_space > > > *mapping, > > > loff_t pos, unsigned len, unsigned flags, > > > struct page **pagep, void **fsdata) > > > @@ -1209,7 +1219,7 @@ write_begin_failed: > > > unlock_page(page); > > > page_cache_release(page); > > > if (pos + len > inode->i_size) > > > - ext3_truncate(inode); > > > + ext3_truncate_failed_write(inode); > > > } > > > if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries)) > > > goto retry; > > > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file > > > *file, > > > page_cache_release(page); > > > > > > if (pos + len > inode->i_size) > > > - ext3_truncate(inode); > > > + ext3_truncate_failed_write(inode); > > > return ret ? ret : copied; > > > } > > > > > > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct > > > file *file, > > > page_cache_release(page); > > > > > > if (pos + len > inode->i_size) > > > - ext3_truncate(inode); > > > + ext3_truncate_failed_write(inode); > > > return ret ? ret : copied; > > > } > > > > > > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct > > > file *file, > > > page_cache_release(page); > > > > > > if (pos + len > inode->i_size) > > > - ext3_truncate(inode); > > > + ext3_truncate_failed_write(inode); > > > return ret ? ret : copied; > > > } > > > > > > -- > > > > Cheers, Andreas > > -- > > Andreas Dilger > > Sr. Staff Engineer, Lustre Group > > Sun Microsystems of Canada, Inc. > > > --- > ~Randy > > ------------------------------------------------------------------------------ > Join us December 9, 2009 for the Red Hat Virtual Experience, > a free event focused on virtualization and cloud computing. > Attend in-depth sessions from your desk. Your couch. Anywhere. > http://p.sf.net/sfu/redhat-sfdev2dev > _______________________________________________ > Ltp-list mailing list > Ltp-list@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/ltp-list -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 354ed3b..f9d6937 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1151,6 +1151,16 @@ static int do_journal_get_write_access(handle_t *handle, return ext3_journal_get_write_access(handle, bh); } +/* + * Truncate blocks that were not used by write. We have to truncate the + * pagecache as well so that corresponding buffers get properly unmapped. + */ +static void ext3_truncate_failed_write(struct inode *inode) +{ + truncate_inode_pages(inode->i_mapping, inode->i_size); + ext3_truncate(inode); +} + static int ext3_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) @@ -1209,7 +1219,7 @@ write_begin_failed: unlock_page(page); page_cache_release(page); if (pos + len > inode->i_size) - ext3_truncate(inode); + ext3_truncate_failed_write(inode); } if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries)) goto retry; @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file *file, page_cache_release(page); if (pos + len > inode->i_size) - ext3_truncate(inode); + ext3_truncate_failed_write(inode); return ret ? ret : copied; } @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct file *file, page_cache_release(page); if (pos + len > inode->i_size) - ext3_truncate(inode); + ext3_truncate_failed_write(inode); return ret ? ret : copied; } @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct file *file, page_cache_release(page); if (pos + len > inode->i_size) - ext3_truncate(inode); + ext3_truncate_failed_write(inode); return ret ? ret : copied; }