Message ID | 1238491766-13182-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com |
---|---|
State | Accepted, archived |
Headers | show |
On Tue 31-03-09 14:59:25, Aneesh Kumar K.V wrote: > We should add inode to the orphan list in the same transaction > as block allocation. This ensures that if we crash after a failed > block allocation and before we do a vmtruncate we don't leak block > (ie block marked as used in bitmap but not claimed by the inode). > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > CC: Jan Kara <jack@suse.cz> Looks fine. Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/ext4/inode.c | 15 +++++++++++++-- > 1 files changed, 13 insertions(+), 2 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 2231a65..074185f 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1424,7 +1424,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, > struct page **pagep, void **fsdata) > { > struct inode *inode = mapping->host; > - int ret, needed_blocks = ext4_writepage_trans_blocks(inode); > + int ret, needed_blocks; > handle_t *handle; > int retries = 0; > struct page *page; > @@ -1435,6 +1435,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, > "dev %s ino %lu pos %llu len %u flags %u", > inode->i_sb->s_id, inode->i_ino, > (unsigned long long) pos, len, flags); > + /* > + * Reserve one block more for addition to orphan list in case > + * we allocate blocks but write fails for some reason > + */ > + needed_blocks = ext4_writepage_trans_blocks(inode) + 1; > index = pos >> PAGE_CACHE_SHIFT; > from = pos & (PAGE_CACHE_SIZE - 1); > to = from + len; > @@ -1468,14 +1473,20 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, > > if (ret) { > unlock_page(page); > - ext4_journal_stop(handle); > page_cache_release(page); > /* > * block_write_begin may have instantiated a few blocks > * outside i_size. Trim these off again. Don't need > * i_size_read because we hold i_mutex. > + * > + * Add inode to orphan list in case we crash before > + * truncate finishes > */ > if (pos + len > inode->i_size) > + ext4_orphan_add(handle, inode); > + > + ext4_journal_stop(handle); > + if (pos + len > inode->i_size) > vmtruncate(inode, inode->i_size); > } > > -- > 1.6.2.1.404.gb0085.dirty >
On Tue, Mar 31, 2009 at 02:59:25PM +0530, Aneesh Kumar K.V wrote: > We should add inode to the orphan list in the same transaction > as block allocation. This ensures that if we crash after a failed > block allocation and before we do a vmtruncate we don't leak block > (ie block marked as used in bitmap but not claimed by the inode). How likely is this going to happen? If it's a failure in the block allocation, we will have really end up with blocks outside i_size? And in that case, I'm missing how this ends up being a leaked block, since presumably the block is still being referenced by the inode. Am I missing something here? I guess it can happen if do_journal_get_write_access() returns an error, which could potentially happen if there is a ENOMEM error coming from the jbd2 layer. But that brings up another potential problem. It's possible for vmtruncate() to fail; if ext4_truncate() fails too early (particularly in ext4_ext_truncate, due to a memory error in a jbd2 routines), it's possible for it to return without calling ext4_orphan_del() --- and stray inodes on the orphan list will cause the kernel to panic on umount. I think this can be fixed by making sure that ext4_truncate() and ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error paths. That *should* the problem, since at the moment, it doesn't look vmtruncate() will return without calling inode->i_op->truncate(). But could you double check this carefully? Thanks, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Sat 04-04-09 23:11:16, Theodore Tso wrote: > On Tue, Mar 31, 2009 at 02:59:25PM +0530, Aneesh Kumar K.V wrote: > > We should add inode to the orphan list in the same transaction > > as block allocation. This ensures that if we crash after a failed > > block allocation and before we do a vmtruncate we don't leak block > > (ie block marked as used in bitmap but not claimed by the inode). > > How likely is this going to happen? If it's a failure in the block > allocation, we will have really end up with blocks outside i_size? > And in that case, I'm missing how this ends up being a leaked block, > since presumably the block is still being referenced by the inode. Am > I missing something here? Aneesh's changelog was not completely precise here. First note that some problems (namely those in ext4_write_begin()) can happen only if blocksize < pagesize - there we can succeed in allocating some blocks for the page but not all of them. Since we then decide to redo the whole write, we have to truncate blocks we have already allocated. The problem in ext4_..._write_end() is different (and rather easy to hit under heavy memory pressure) - the problem is that generic_perform_write() fails to copy data into our kernel page because the user page from which we should copy the data has been paged-out. In this situation we again decide to redo the write and so we should truncate the already allocated blocks since i_size is still set to the value before write. Otherwise we'd have blocks allocated beyond i_size and so a crash before we successfully redo the write would "leak" blocks (you're right the inode would be still referencing them but they'll be above i_size and I guess it could confuse some code). > I guess it can happen if do_journal_get_write_access() returns an > error, which could potentially happen if there is a ENOMEM error > coming from the jbd2 layer. But that brings up another potential > problem. It's possible for vmtruncate() to fail; if ext4_truncate() > fails too early (particularly in ext4_ext_truncate, due to a memory > error in a jbd2 routines), it's possible for it to return without > calling ext4_orphan_del() --- and stray inodes on the orphan list will > cause the kernel to panic on umount. > > I think this can be fixed by making sure that ext4_truncate() and > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error > paths. That *should* the problem, since at the moment, it doesn't > look vmtruncate() will return without calling inode->i_op->truncate(). > But could you double check this carefully? Ah, OK, that should be fixed. But note that current ext4_setattr() does exactly the same thing on standard truncates - it adds inode to orphan list and calls inode_setattr() which end's up calling vmtruncate(). Honza
On Mon, Apr 06, 2009 at 12:05:09PM +0200, Jan Kara wrote: > > I think this can be fixed by making sure that ext4_truncate() and > > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error > > paths. That *should* the problem, since at the moment, it doesn't > > look vmtruncate() will return without calling inode->i_op->truncate(). > > But could you double check this carefully? > > Ah, OK, that should be fixed. But note that current ext4_setattr() > does exactly the same thing on standard truncates - it adds inode to > orphan list and calls inode_setattr() which end's up calling vmtruncate(). I finally had a chance to take a closer look at this. ext4_setattr() is safe, because it does this after calling inode_setattr(): /* If inode_setattr's call to ext4_truncate failed to get a * transaction handle at all, we need to clean up the in-core * orphan list manually. */ if (inode->i_nlink) ext4_orphan_del(NULL, inode); So if we put the same thing into the ext4_write_begin() and ext4_writeback_write_end() in these patches, it should be OK. The key is that if the inode is already is on the orphan list, it's harmless to call ext4_orphan_add() --- and if the inode has already been removed from the orphan list, it's harmless to call ext4_orphan_del() on it. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I've tried applying these two patches to the ext4 patch queue. On a metadata-heavy workload (specifically, fsx), it causes a 5% degradation in the wall clock run-time of the fsx run-time. Maybe there's no way around that, but it's rather unfortunate.... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 05, 2009 at 02:22:34AM -0400, Theodore Tso wrote: > I've tried applying these two patches to the ext4 patch queue. On a > metadata-heavy workload (specifically, fsx), it causes a 5% > degradation in the wall clock run-time of the fsx run-time. I just did some more timing tests, and it looks like I can't confirm it. The test times are noisy enough I need to run some more under very controlled circumsntances. So ignore this for now, it might not be a problem after all. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri 05-06-09 03:24:17, Theodore Tso wrote: > On Fri, Jun 05, 2009 at 02:22:34AM -0400, Theodore Tso wrote: > > I've tried applying these two patches to the ext4 patch queue. On a > > metadata-heavy workload (specifically, fsx), it causes a 5% > > degradation in the wall clock run-time of the fsx run-time. > > I just did some more timing tests, and it looks like I can't confirm > it. The test times are noisy enough I need to run some more under very > controlled circumsntances. So ignore this for now, it might not be a > problem after all. It would be strange if the patches caused a measurable slowdown. They change something only in the failure path where we fail to allocate some blocks or fail to copy in all the data and these should better be exceptional cases... Honza
On Fri 05-06-09 00:31:17, Theodore Tso wrote: > On Mon, Apr 06, 2009 at 12:05:09PM +0200, Jan Kara wrote: > > > I think this can be fixed by making sure that ext4_truncate() and > > > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error > > > paths. That *should* the problem, since at the moment, it doesn't > > > look vmtruncate() will return without calling inode->i_op->truncate(). > > > But could you double check this carefully? > > > > Ah, OK, that should be fixed. But note that current ext4_setattr() > > does exactly the same thing on standard truncates - it adds inode to > > orphan list and calls inode_setattr() which end's up calling vmtruncate(). > > I finally had a chance to take a closer look at this. ext4_setattr() > is safe, because it does this after calling inode_setattr(): > > /* If inode_setattr's call to ext4_truncate failed to get a > * transaction handle at all, we need to clean up the in-core > * orphan list manually. */ > if (inode->i_nlink) > ext4_orphan_del(NULL, inode); > > So if we put the same thing into the ext4_write_begin() and > ext4_writeback_write_end() in these patches, it should be OK. The key > is that if the inode is already is on the orphan list, it's harmless > to call ext4_orphan_add() --- and if the inode has already been > removed from the orphan list, it's harmless to call ext4_orphan_del() > on it. Ah, good point. I haven't noticed this in ext4_inode_setattr when I was checking it. Aneesh, will you take care of it for ext4? I'll cook up an appropriate change for ext3... Thanks Ted for looking into this. Honza
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 2231a65..074185f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1424,7 +1424,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, struct page **pagep, void **fsdata) { struct inode *inode = mapping->host; - int ret, needed_blocks = ext4_writepage_trans_blocks(inode); + int ret, needed_blocks; handle_t *handle; int retries = 0; struct page *page; @@ -1435,6 +1435,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, "dev %s ino %lu pos %llu len %u flags %u", inode->i_sb->s_id, inode->i_ino, (unsigned long long) pos, len, flags); + /* + * Reserve one block more for addition to orphan list in case + * we allocate blocks but write fails for some reason + */ + needed_blocks = ext4_writepage_trans_blocks(inode) + 1; index = pos >> PAGE_CACHE_SHIFT; from = pos & (PAGE_CACHE_SIZE - 1); to = from + len; @@ -1468,14 +1473,20 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, if (ret) { unlock_page(page); - ext4_journal_stop(handle); page_cache_release(page); /* * block_write_begin may have instantiated a few blocks * outside i_size. Trim these off again. Don't need * i_size_read because we hold i_mutex. + * + * Add inode to orphan list in case we crash before + * truncate finishes */ if (pos + len > inode->i_size) + ext4_orphan_add(handle, inode); + + ext4_journal_stop(handle); + if (pos + len > inode->i_size) vmtruncate(inode, inode->i_size); }
We should add inode to the orphan list in the same transaction as block allocation. This ensures that if we crash after a failed block allocation and before we do a vmtruncate we don't leak block (ie block marked as used in bitmap but not claimed by the inode). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-)