Message ID | 20130318025401.GA12611@thunk.org |
---|---|
State | Accepted, archived |
On Sun 17-03-13 22:54:01, Ted Tso wrote:
> On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote:
> > > We use debian for a number of machines in our storage infrastructure
> > > and we have recently been seeing a number of "hangs". We primarily
> > > notice this by seeing nfsd processes locking up and then a hung task
> > > killer going wild. We finally managed to get a trace last night - it's
> > > pasted below:
>
> Thanks for reporting this. I thought we had fixed this in 3.0.
> Before then, when we had a tid wrap, it would result in kjournald
> spinning forever. I suspect this was your "spontaneous reboots" that
> you mentioned when you were using 2.6.39 --- did you
> have a hardware or software watchdog timer enabled by any chance?
>
> Since we didn't have a good way of reproducing the problem at the
> time, I didn't realize that the problem had not been fully fixed;
> since while jbd2_log_start_commit() would no longer cause kjournald to
> spin forever, a subsequent call to jbd2_log_wait_commit() with a
> stale transaction id would wait for a very long time (possibly until
> the heat death of the universe :-)
>
> I think a patch like this should fix things; I've run a stress test
> with a hack to increment the transaction id by 1 << 24 after each
> commit, to more quickly cause a tid wrap, and the regression tests
> seem to be passing without complaint.
>
> 					- Ted
>
> From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Sun, 17 Mar 2013 22:24:46 -0400
> Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
>  wraparound
>
> In the case where an inode has a very stale transaction id (tid) in
> i_datasync_tid or i_sync_tid, it's possible that after a very large
> (2**31) number of transactions, that the tid number space might wrap,
> causing tid_geq()'s calculations to fail.
>
> Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
> by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
> attempted to fix this problem, but it only avoided kjournald spinning
> forever by fixing the logic in jbd2_log_start_commit().
>
> Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
> that might call jbd2_log_start_commit() with a stale tid, those
> functions will subsequently call jbd2_log_wait_commit() with the same
> stale tid, and then wait for a very long time. To fix this, we
> replace the calls to jbd2_log_start_commit() and
> jbd2_log_wait_commit() with a call to a new function,
> jbd2_complete_transaction(), which will correctly handle stale tid's.
>
> As a bonus, jbd2_complete_transaction() will avoid locking
> j_state_lock for writing unless a commit needs to be started. This
> should have a small (but probably not measurable) improvement for
> ext4's scalability.
>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Reported-by: Ben Hutchings <ben@decadent.org.uk>
> Reported-by: George Barnett <gbarnett@atlassian.com>
> Cc: stable@vger.kernel.org

  Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
inventing a new function? So jbd2_log_wait_commit() would do something like:

 		       __func__, journal->j_commit_request, tid);
 	}
 #endif
+	/* Not running or committing trans => must be already committed. */
+	if (!((journal->j_running_transaction &&
+	       journal->j_running_transaction->t_tid == tid) ||
+	      (journal->j_committing_transaction &&
+	       journal->j_committing_transaction->t_tid == tid))) {
+		read_unlock(&journal->j_state_lock);
+		return 0;
+	}
 	while (tid_gt(tid, journal->j_commit_sequence)) {
 		jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
 				  tid, journal->j_commit_sequence);

								Honza

> ---
>  fs/ext4/fsync.c      |  3 +--
>  fs/ext4/inode.c      |  3 +--
>  fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
>  include/linux/jbd2.h |  1 +
>  4 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 3278e64..e0ba8a4 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	if (journal->j_flags & JBD2_BARRIER &&
>  	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
>  		needs_barrier = true;
> -	jbd2_log_start_commit(journal, commit_tid);
> -	ret = jbd2_log_wait_commit(journal, commit_tid);
> +	ret = jbd2_complete_transaction(journal, commit_tid);
>  	if (needs_barrier) {
>  		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>  		if (!ret)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b6fab7c..de4b58d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode)
>  			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>  			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
>  
> -			jbd2_log_start_commit(journal, commit_tid);
> -			jbd2_log_wait_commit(journal, commit_tid);
> +			jbd2_complete_transaction(journal, commit_tid);
>  			filemap_write_and_wait(&inode->i_data);
>  		}
>  		truncate_inode_pages(&inode->i_data, 0);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index ed10991..886ec2f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>  }
>  
>  /*
> + * When this function returns the transaction corresponding to tid
> + * will be completed. If the transaction has currently running, start
> + * committing that transaction before waiting for it to complete. If
> + * the transaction id is stale, it is by definition already completed,
> + * so just return SUCCESS.
> + */
> +int jbd2_complete_transaction(journal_t *journal, tid_t tid)
> +{
> +	int	need_to_wait = 1;
> +
> +	read_lock(&journal->j_state_lock);
> +	if (journal->j_running_transaction &&
> +	    journal->j_running_transaction->t_tid == tid) {
> +		if (journal->j_commit_request != tid) {
> +			/* transaction not yet started, so request it */
> +			read_unlock(&journal->j_state_lock);
> +			jbd2_log_start_commit(journal, tid);
> +			goto wait_commit;
> +		}
> +	} else if (!(journal->j_committing_transaction &&
> +		     journal->j_committing_transaction->t_tid == tid))
> +		need_to_wait = 0;
> +	read_unlock(&journal->j_state_lock);
> +	if (!need_to_wait)
> +		return 0;
> +wait_commit:
> +	return jbd2_log_wait_commit(journal, tid);
> +}
> +EXPORT_SYMBOL(jbd2_complete_transaction);
> +
> +/*
>   * Log buffer allocation routines:
>   */
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 50e5a5e..f028975 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
>  int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
>  int jbd2_journal_force_commit_nested(journal_t *journal);
>  int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
> +int jbd2_complete_transaction(journal_t *journal, tid_t tid);
>  int jbd2_log_do_checkpoint(journal_t *journal);
>  int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
>  
> -- 
> 1.7.12.rc0.22.gcdd159b
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
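
The wraparound described in the commit message can be reproduced in user space. What follows is a minimal sketch, assuming tid_t is an unsigned 32-bit counter and that tid_gt() and tid_geq() compare through a signed 32-bit difference, as the jbd2 helpers do. would_wait() is a hypothetical name used only for illustration; it stands in for the loop condition in jbd2_log_wait_commit() and is not a kernel function.

```c
#include <stdint.h>

typedef uint32_t tid_t;

/*
 * Same shape as the jbd2 helpers: the unsigned difference is reinterpreted
 * as a signed 32-bit value, so ordering is only meaningful while the two
 * tids are less than 2**31 apart. (The unsigned-to-signed conversion is
 * implementation-defined in C; mainstream compilers wrap as expected.)
 */
static int tid_gt(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference > 0;
}

static int tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}

/*
 * Hypothetical stand-in for the wait loop condition
 * "while (tid_gt(tid, journal->j_commit_sequence))".
 * Nonzero means the caller would go to sleep waiting for tid.
 */
static int would_wait(tid_t tid, tid_t commit_sequence)
{
	return tid_gt(tid, commit_sequence);
}
```

With a tid that is only slightly behind the commit sequence, would_wait() reports that no wait is needed. Once the journal has advanced more than 2**31 transactions past the stale tid, the signed difference flips and the same stale tid looks like a future transaction, which is the case where the waiter would block.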
On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote:
>   Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
> inventing a new function?

In most of the places where we call jbd2_log_start_commit(), we're
actually starting the current running transaction. So the fact that
we pass in a tid, and we're having to validate that the tid is
actually a valid one, is a bit of a waste. So in the long run I think
it's worth rethinking whether or not jbd2_log_{start,wait}_commit()
should exist in their current form, or whether we should reorganize
their functionality (e.g., by having a jbd2_start_running_commit()).
Piling on fixes to jbd2_log_wait_commit() would make
it get even more complicated, and I think if we separate out the
various ways in which we use these functions, we can make the code
simpler and easier to read.

In fact, I had started making this rather large set of changes when I
decided it would be better to save that kind of wholesale refactoring
for the next merge window. So the reason why I ended up fixing the
patch the way I did was to keep things simple.

Also as I mentioned in the commit description, by using a single
function I was also able to optimize the locking somewhat.

					- Ted
On Thu 21-03-13 17:09:40, Ted Tso wrote:
> On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote:
> >   Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
> > inventing a new function?
>
> In most of the places where we call jbd2_log_start_commit(), we're
> actually starting the current running transaction. So the fact that
> we pass in a tid, and we're having to validate that the tid is
> actually a valid one, is a bit of a waste. So in the long run I think
> it's worth rethinking whether or not jbd2_log_{start,wait}_commit()
> should exist in their current form, or whether we should reorganize
> their functionality (e.g., by having a jbd2_start_running_commit()).
> Piling on fixes to jbd2_log_wait_commit() would make
> it get even more complicated, and I think if we separate out the
> various ways in which we use these functions, we can make the code
> simpler and easier to read.

  I don't find jbd2_log_wait_commit() so complex that it couldn't bear
another if :) But given there are really two waiting operations that make
sense:
  a) request commit of running transaction and wait for it
  b) wait for committing transaction
then I agree there may be a better interface. OTOH I'm somewhat curious
about the new interface because the only race-free way of identifying a
transaction you want to wait for is using its tid.

> In fact, I had started making this rather large set of changes when I
> decided it would be better to save that kind of wholesale refactoring
> for the next merge window. So the reason why I ended up fixing the
> patch the way I did was to keep things simple.
>
> Also as I mentioned in the commit description, by using a single
> function I was also able to optimize the locking somewhat.

  Yeah. I'm not as much opposed to the new function doing start commit &
wait but what I dislike is the fact that we have still exposed the function
jbd2_log_wait_commit() which can possibly lock up if the tid overflows. I
agree there aren't currently any other callers where this could happen but
in a few years who knows...

								Honza
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 3278e64..e0ba8a4 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier) {
 		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
 		if (!ret)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b6fab7c..de4b58d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode)
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
 
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ed10991..886ec2f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 }
 
 /*
+ * When this function returns the transaction corresponding to tid
+ * will be completed. If the transaction has currently running, start
+ * committing that transaction before waiting for it to complete. If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return jbd2_log_wait_commit(journal, tid);
+}
+EXPORT_SYMBOL(jbd2_complete_transaction);
+
+/*
  * Log buffer allocation routines:
  */
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..f028975 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
 
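
The branch structure of jbd2_complete_transaction() in the patch above can be read as a three-way decision. The following is a user-space sketch of that decision only, with locking, the actual commit request, and the sleep left out. Every name prefixed with model_ or MODEL_ is hypothetical and exists only for this illustration; the struct is a simplified stand-in for journal_t, not the kernel layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* Simplified, hypothetical stand-ins for transaction_t and journal_t. */
struct model_transaction {
	tid_t t_tid;
};

struct model_journal {
	struct model_transaction *running;    /* j_running_transaction */
	struct model_transaction *committing; /* j_committing_transaction */
	tid_t commit_request;                 /* j_commit_request */
};

enum model_action {
	MODEL_RETURN_DONE,     /* stale tid: treat as already committed */
	MODEL_WAIT_ONLY,       /* commit already requested or in progress */
	MODEL_START_THEN_WAIT  /* running transaction, commit not yet requested */
};

/* Mirrors the if / else-if chain of the patch, without locks or sleeping. */
static enum model_action
model_complete_transaction(const struct model_journal *j, tid_t tid)
{
	if (j->running && j->running->t_tid == tid) {
		if (j->commit_request != tid)
			return MODEL_START_THEN_WAIT;
		return MODEL_WAIT_ONLY;
	}
	if (j->committing && j->committing->t_tid == tid)
		return MODEL_WAIT_ONLY;
	return MODEL_RETURN_DONE;
}
```

The point of the exact-match tests is that a tid is never compared by magnitude here: any tid that is neither the running nor the committing transaction is treated as complete, so a wrapped, stale tid cannot be mistaken for a future one.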