ext4/jbd2: don't wait (forever) for stale tid caused by wraparound

Message ID 1363582395.3937.319.camel@deadeye.wl.decadent.org.uk
State Not Applicable, archived
Headers show

Commit Message

Ben Hutchings March 18, 2013, 4:53 a.m.
On Mon, 2013-03-18 at 15:24 +1100, George Barnett wrote:
> On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote:

> > I think a patch like this should fix things; I've run a stress test
> > with a hack to increment the transaction id by 1 << 24 after each
> > commit, to more quickly cause an tid wrap, and the regression tests
> > seem to be passing without complaint.
> Excellent news.  Again, thank you for your help in this regard.
> @Ben - could you let me know what your preferred course of action
> would be here?  As I'm sure you can understand, I do not wish to
> maintain a forked kernel from Debian upstream.  Is this something you
> would be prepared to integrate into the 3.2 BPO kernels?

We need you to verify that this fix works first.  If it does, it should
get included in the various 3.x.y stable branches and in Debian kernel

Ted might prefer you to test this against current mainline (3.9-rc3),
though you may find it easier to test against the version you already
have.  In the latter case you'll need a slightly modified version of the
patch (attached) and instructions at



Theodore Y. Ts'o March 18, 2013, 2:31 p.m. | #1
On Mon, Mar 18, 2013 at 04:53:15AM +0000, Ben Hutchings wrote:
> We need you to verify that this fix works first.  If it does, it should
> get included in the various 3.x.y stable branches and in Debian kernel
> packages.

I suspect it will be hard for George to verify this, since it requires
a tid wrap, which by definition takes a long time.

Also, this will probably not hit the stable kernels until after the
next merge window, since it's already post -rc3 and I really want to
make sure this gets a lot of testing and review.

I'll also note that I managed to trigger a BUG when incrementing by a
factor of (1 << 24), but we don't see a BUG_ON when incrementing by
((1 << 24) + 1).  (Neither of these testing changes were in the patch
that I sent out; so the patch is "safe" in that I very much doubt it
will make things worse --- those changes were to stress test the patch
so that I wouldn't have to wait several months until the tid wrapped
to test whether we had finally fixed all of the potential problems.)
So there is something we probably do want to look at a bit more
closely before we formally push this fix into mainline.

As far as the Debian servers are concerned, I'm pretty sure the patch
should be safe in that it won't make things worse than they were
before --- however, if you are looking for the lowest risk approach,
it's probably better to simply schedule downtime every few months and
force a reboot at a time that is minimizes developer inconvenience.
You can use "dumpe2fs -h /dev/XXX" to get the current sequence number
of the journal, if you measure the sequence number separated by 24 or
48 hours, you should be able to calculate when the the sequence number
will have incremented by 2**31, and thus calculate the frequency of
scheduled reboots for your workload.


						- Ted
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Y. Ts'o March 18, 2013, 2:34 p.m. | #2
One other observeration; problems relating to tid wrap have been
around for a very long time.  It's an issue for both ext3 and ext4, so
Red Hat and SuSE have been shipping QA'ed enterprise kernels with this
issue for over a decade.  This is one of the reasons why I don't think
it's a good idea to rush a fix into mainline, the stable 3.2.x
kernels, or the Debian kernel, until we really make sure we have any
proposed bug fixes fully tested and verified.

						- Ted
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 17 Mar 2013 22:24:46 -0400
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
 fs/ext4/fsync.c      |  3 +--
 fs/ext4/inode.c      |  3 +--
 fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -260,8 +260,7 @@  int ext4_sync_file(struct file *file, lo
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier)
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -146,8 +146,7 @@  void ext4_evict_inode(struct inode *inod
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 		truncate_inode_pages(&inode->i_data, 0);
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -663,6 +663,37 @@  int jbd2_log_wait_commit(journal_t *jour
+ * When this function returns the transaction corresponding to tid
+ * will be completed.  If the transaction has currently running, start
+ * committing that transaction before waiting for it to complete.  If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+	int	need_to_wait = 1;
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+	return jbd2_log_wait_commit(journal, tid);
  * Log buffer allocation routines:
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1165,6 +1165,7 @@  int __jbd2_log_start_commit(journal_t *j
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);