ext4/jbd2: don't wait (forever) for stale tid caused by wraparound

Message ID 20130318025401.GA12611@thunk.org
State Accepted, archived

Commit Message

Theodore Ts'o March 18, 2013, 2:54 a.m. UTC
On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote:
> > We use Debian for a number of machines in our storage infrastructure
> > and we have recently been seeing a number of "hangs". We primarily
> > notice this by seeing nfsd processes locking up and then a hung task
> > killer going wild. We finally managed to get a trace last night - it's
> > pasted below:

Thanks for reporting this.  I thought we had fixed this in 3.0.
Before then, when we had a tid wrap, it would result in kjournald
spinning forever.  I suspect this was your "spontaneous reboots" that
you mentioned when you were using 2.6.39 --- did you
have a hardware or software watchdog timer enabled by any chance?

Since we didn't have a good way of reproducing the problem at the
time, I didn't realize that the problem had not been fully fixed;
since while jbd2_log_start_commit() would no longer cause kjournald to
spin forever, a subsequent call to jbd2_log_wait_commit() with a
stale transaction id would wait for a very long time (possibly until
the heat death of the universe :-)

I think a patch like this should fix things; I've run a stress test
with a hack to increment the transaction id by 1 << 24 after each
commit, to more quickly cause a tid wrap, and the regression tests
seem to be passing without complaint.

						- Ted

From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 17 Mar 2013 22:24:46 -0400
Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
 wraparound

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tids.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
---
 fs/ext4/fsync.c      |  3 +--
 fs/ext4/inode.c      |  3 +--
 fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)
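The wraparound described in the commit message can be illustrated with a minimal userspace sketch. It assumes tids are 32-bit unsigned values compared through a signed difference, which is how tid_gt() is used in the jbd2_log_wait_commit() loop; the helpers below are modelled on that idea and are not copied from the kernel. commits_until_stale_looks_newer() is a hypothetical name that mimics the stress-test hack of bumping the tid by a fixed step after each commit.

```c
#include <stdint.h>

typedef uint32_t tid_t;

/*
 * Signed-difference comparison: x is "after" y if (x - y), taken
 * modulo 2^32 and reinterpreted as a signed 32-bit value, is positive.
 * The unsigned-to-signed cast wraps on all mainstream compilers.
 */
static int tid_gt(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference > 0;
}

static int tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}

/*
 * Hypothetical helper: starting with the commit sequence equal to a
 * stale tid, advance the sequence by 'step' per commit and count how
 * many commits pass before the stale tid compares as newer than the
 * commit sequence.  From that point on, a loop of the form
 * "while (tid_gt(tid, commit_sequence)) wait;" would block on a
 * transaction that finished long ago.
 *
 * Assumes a step such as 1 or 1 << 24; a step of 0 or 1 << 31 never
 * reaches the "looks newer" region, so the loop would not terminate.
 */
static uint64_t commits_until_stale_looks_newer(tid_t stale, tid_t step)
{
	tid_t commit_sequence = stale;
	uint64_t commits = 0;

	while (!tid_gt(stale, commit_sequence)) {
		commit_sequence += step;	/* wraps modulo 2^32 */
		commits++;
	}
	return commits;
}
```

With a step of 1 << 24 the stale tid starts to look newer after 129 commits, which is why a hack of that kind makes the problem reproducible in a short test run; with the normal step of 1 it takes a little over 2**31 commits.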

Comments

Jan Kara March 21, 2013, 8:46 p.m. UTC | #1
On Sun 17-03-13 22:54:01, Ted Tso wrote:
> On Sat, Mar 16, 2013 at 05:34:22AM +0000, Ben Hutchings wrote:
> > > We use Debian for a number of machines in our storage infrastructure
> > > and we have recently been seeing a number of "hangs". We primarily
> > > notice this by seeing nfsd processes locking up and then a hung task
> > > killer going wild. We finally managed to get a trace last night - it's
> > > pasted below:
> 
> Thanks for reporting this.  I thought we had fixed this in 3.0.
> Before then, when we had a tid wrap, it would result in kjournald
> spinning forever.  I suspect this was your "spontaneous reboots" that
> you mentioned when you were using 2.6.39 --- did you
> have a hardware or software watchdog timer enabled by any chance?
> 
> Since we didn't have a good way of reproducing the problem at the
> time, I didn't realize that the problem had not been fully fixed;
> since while jbd2_log_start_commit() would no longer cause kjournald to
> spin forever, a subsequent call to jbd2_log_wait_commit() with a
> stale transaction id would wait for a very long time (possibly until
> the heat death of the universe :-)
> 
> I think a patch like this should fix things; I've run a stress test
> with a hack to increment the transaction id by 1 << 24 after each
> commit, to more quickly cause a tid wrap, and the regression tests
> seem to be passing without complaint.
> 
> 						- Ted
> 
> From 76b05344fef573701b22ced444223188f054f94d Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Sun, 17 Mar 2013 22:24:46 -0400
> Subject: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by
>  wraparound
> 
> In the case where an inode has a very stale transaction id (tid) in
> i_datasync_tid or i_sync_tid, it's possible that after a very large
> (2**31) number of transactions, the tid number space might wrap,
> causing tid_geq()'s calculations to fail.
> 
> Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
> by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
> attempted to fix this problem, but it only avoided kjournald spinning
> forever by fixing the logic in jbd2_log_start_commit().
> 
> Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
> that might call jbd2_log_start_commit() with a stale tid, those
> functions will subsequently call jbd2_log_wait_commit() with the same
> stale tid, and then wait for a very long time.  To fix this, we
> replace the calls to jbd2_log_start_commit() and
> jbd2_log_wait_commit() with a call to a new function,
> jbd2_complete_transaction(), which will correctly handle stale tids.
> 
> As a bonus, jbd2_complete_transaction() will avoid locking
> j_state_lock for writing unless a commit needs to be started.  This
> should have a small (but probably not measurable) improvement for
> ext4's scalability.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Reported-by: Ben Hutchings <ben@decadent.org.uk>
> Reported-by: George Barnett <gbarnett@atlassian.com>
> Cc: stable@vger.kernel.org
  Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
inventing a new function? So jbd2_log_wait_commit() would do something like:

			__func__, journal->j_commit_request, tid);
	}
#endif
+	/* Not running or committing trans => must be already committed. */
+	if (!((journal->j_running_transaction &&
+	     journal->j_running_transaction->t_tid == tid) ||
+	    (journal->j_committing_transaction &&
+	     journal->j_committing_transaction->t_tid == tid))) {
+		read_unlock(&journal->j_state_lock);
+		return 0;
+	}
	while (tid_gt(tid, journal->j_commit_sequence)) {
		jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
				  tid, journal->j_commit_sequence);
 
								Honza
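The early-return test proposed above can be modelled in userspace as a predicate. This is only a sketch under stated assumptions: struct toy_journal and struct toy_transaction are hypothetical stand-ins that keep just the fields the check reads, and no locking is modelled (the real check runs under j_state_lock).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* Hypothetical stand-ins holding only the fields the check needs. */
struct toy_transaction {
	tid_t t_tid;
};

struct toy_journal {
	struct toy_transaction *j_running_transaction;
	struct toy_transaction *j_committing_transaction;
};

/*
 * A tid is "live" only if it names the running or the committing
 * transaction.  Anything else is treated as already committed, so a
 * waiter could return immediately instead of entering the tid_gt()
 * loop with a stale value.
 */
static bool toy_tid_is_live(const struct toy_journal *journal, tid_t tid)
{
	return (journal->j_running_transaction &&
		journal->j_running_transaction->t_tid == tid) ||
	       (journal->j_committing_transaction &&
		journal->j_committing_transaction->t_tid == tid);
}
```

A caller modelling the proposed jbd2_log_wait_commit() change would return 0 whenever toy_tid_is_live() is false, and fall through to the wait loop otherwise.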

> ---
>  fs/ext4/fsync.c      |  3 +--
>  fs/ext4/inode.c      |  3 +--
>  fs/jbd2/journal.c    | 31 +++++++++++++++++++++++++++++++
>  include/linux/jbd2.h |  1 +
>  4 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 3278e64..e0ba8a4 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -166,8 +166,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	if (journal->j_flags & JBD2_BARRIER &&
>  	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
>  		needs_barrier = true;
> -	jbd2_log_start_commit(journal, commit_tid);
> -	ret = jbd2_log_wait_commit(journal, commit_tid);
> +	ret = jbd2_complete_transaction(journal, commit_tid);
>  	if (needs_barrier) {
>  		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
>  		if (!ret)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b6fab7c..de4b58d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -211,8 +211,7 @@ void ext4_evict_inode(struct inode *inode)
>  			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>  			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
>  
> -			jbd2_log_start_commit(journal, commit_tid);
> -			jbd2_log_wait_commit(journal, commit_tid);
> +			jbd2_complete_transaction(journal, commit_tid);
>  			filemap_write_and_wait(&inode->i_data);
>  		}
>  		truncate_inode_pages(&inode->i_data, 0);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index ed10991..886ec2f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -710,6 +710,37 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>  }
>  
>  /*
> + * When this function returns the transaction corresponding to tid
> + * will be completed.  If the transaction is currently running, start
> + * committing that transaction before waiting for it to complete.  If
> + * the transaction id is stale, it is by definition already completed,
> + * so just return SUCCESS.
> + */
> +int jbd2_complete_transaction(journal_t *journal, tid_t tid)
> +{
> +	int	need_to_wait = 1;
> +
> +	read_lock(&journal->j_state_lock);
> +	if (journal->j_running_transaction &&
> +	    journal->j_running_transaction->t_tid == tid) {
> +		if (journal->j_commit_request != tid) {
> +			/* transaction not yet started, so request it */
> +			read_unlock(&journal->j_state_lock);
> +			jbd2_log_start_commit(journal, tid);
> +			goto wait_commit;
> +		}
> +	} else if (!(journal->j_committing_transaction &&
> +		     journal->j_committing_transaction->t_tid == tid))
> +		need_to_wait = 0;
> +	read_unlock(&journal->j_state_lock);
> +	if (!need_to_wait)
> +		return 0;
> +wait_commit:
> +	return jbd2_log_wait_commit(journal, tid);
> +}
> +EXPORT_SYMBOL(jbd2_complete_transaction);
> +
> +/*
>   * Log buffer allocation routines:
>   */
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 50e5a5e..f028975 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1200,6 +1200,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
>  int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
>  int jbd2_journal_force_commit_nested(journal_t *journal);
>  int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
> +int jbd2_complete_transaction(journal_t *journal, tid_t tid);
>  int jbd2_log_do_checkpoint(journal_t *journal);
>  int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
>  
> -- 
> 1.7.12.rc0.22.gcdd159b
Theodore Ts'o March 21, 2013, 9:09 p.m. UTC | #2
On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote:
>   Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
> inventing a new function?

In most of the places where we call jbd2_log_start_commit(), we're
actually starting the current running transaction.  So the fact that
we pass in a tid, and we're having to validate that the tid is
actually a valid one, is a bit of a waste.  So in the long run I think
it's worth rethinking whether or not jbd2_log_{start,wait}_commit()
should exist in their current form, or whether we should reorganize
their functionality (e.g., by having a jbd2_start_running_commit()).
Piling on fixes to jbd2_log_wait_commit() would make
it get even more complicated, and I think if we separate out the
various ways in which we use these functions, we can make the code
simpler and easier to read.

In fact, I had started making this rather large set of changes when I
decided it would be better to save that kind of wholesale refactoring
for the next merge window.  So the reason why I ended up fixing the
patch the way I did was to keep things simple.

Also as I mentioned in the commit description, by using a single
function I was also able to optimize the locking somewhat.

						- Ted
Jan Kara March 21, 2013, 10:41 p.m. UTC | #3
On Thu 21-03-13 17:09:40, Ted Tso wrote:
> On Thu, Mar 21, 2013 at 09:46:38PM +0100, Jan Kara wrote:
> >   Good catch! But shouldn't we rather fix jbd2_log_wait_commit() instead of
> > inventing a new function?
> 
> In most of the places where we call jbd2_log_start_commit(), we're
> actually starting the current running transaction.  So the fact that
> we pass in a tid, and we're having to validate that the tid is
> actually a valid one, is a bit of a waste.  So in the long run I think
> it's worth rethinking whether or not jbd2_log_{start,wait}_commit()
> should exist in their current form, or whether we should reorganize
> their functionality (e.g., by having a jbd2_start_running_commit()).
> Piling on fixes to jbd2_log_wait_commit() would make
> it get even more complicated, and I think if we separate out the
> various ways in which we use these functions, we can make the code
> simpler and easier to read.
  I don't find jbd2_log_wait_commit() so complex that it couldn't bear
another if :) But given there are really two waiting operations that make
sense:
a) request commit of running transaction and wait for it
b) wait for committing transaction

then I agree there may be a better interface. OTOH I'm somewhat
curious about the new interface because the only race-free way of
identifying a transaction you want to wait for is using its tid.

> In fact, I had started making this rather large set of changes when I
> decided it would be better to save that kind of wholesale refactoring
> for the next merge window.  So the reason why I ended up fixing the
> patch the way I did was to keep things simple.
> 
> Also as I mentioned in the commit description, by using a single
> function I was also able to optimize the locking somewhat.
  Yeah. I'm not so much opposed to the new function doing start commit
& wait; what I dislike is that we still expose
jbd2_log_wait_commit(), which can lock up if the tid overflows.
I agree there aren't currently any other callers where this could happen
but in a few years who knows...

								Honza
Patch

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 3278e64..e0ba8a4 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -166,8 +166,7 @@  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	jbd2_log_start_commit(journal, commit_tid);
-	ret = jbd2_log_wait_commit(journal, commit_tid);
+	ret = jbd2_complete_transaction(journal, commit_tid);
 	if (needs_barrier) {
 		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
 		if (!ret)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b6fab7c..de4b58d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -211,8 +211,7 @@  void ext4_evict_inode(struct inode *inode)
 			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
 
-			jbd2_log_start_commit(journal, commit_tid);
-			jbd2_log_wait_commit(journal, commit_tid);
+			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ed10991..886ec2f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -710,6 +710,37 @@  int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 }
 
 /*
+ * When this function returns the transaction corresponding to tid
+ * will be completed.  If the transaction is currently running, start
+ * committing that transaction before waiting for it to complete.  If
+ * the transaction id is stale, it is by definition already completed,
+ * so just return SUCCESS.
+ */
+int jbd2_complete_transaction(journal_t *journal, tid_t tid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return jbd2_log_wait_commit(journal, tid);
+}
+EXPORT_SYMBOL(jbd2_complete_transaction);
+
+/*
  * Log buffer allocation routines:
  */
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..f028975 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1200,6 +1200,7 @@  int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
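The decision made by jbd2_complete_transaction() in the patch above can be traced with a small userspace model. This is a sketch, not kernel code: struct model_journal, struct model_transaction, and enum model_action are hypothetical names, locking is omitted, and the function only reports which path the patched code would take rather than starting or waiting for a commit.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* Hypothetical stand-ins holding only the fields the patch reads. */
struct model_transaction {
	tid_t t_tid;
};

struct model_journal {
	struct model_transaction *j_running_transaction;
	struct model_transaction *j_committing_transaction;
	tid_t j_commit_request;
};

enum model_action {
	MODEL_RETURN_IMMEDIATELY,	/* stale tid: already committed */
	MODEL_WAIT_ONLY,		/* commit requested or in progress */
	MODEL_START_THEN_WAIT		/* running, commit not yet requested */
};

/* Mirrors the branch structure of jbd2_complete_transaction(). */
static enum model_action
model_complete_transaction(const struct model_journal *journal, tid_t tid)
{
	if (journal->j_running_transaction &&
	    journal->j_running_transaction->t_tid == tid) {
		if (journal->j_commit_request != tid)
			return MODEL_START_THEN_WAIT;
		return MODEL_WAIT_ONLY;
	}
	if (journal->j_committing_transaction &&
	    journal->j_committing_transaction->t_tid == tid)
		return MODEL_WAIT_ONLY;
	return MODEL_RETURN_IMMEDIATELY;
}
```

Only the MODEL_START_THEN_WAIT path corresponds to calling jbd2_log_start_commit(), which is the case where j_state_lock would need to be taken for writing; a stale tid matches neither transaction and never reaches jbd2_log_wait_commit().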