diff mbox

[2/2] jbd: fix fsync() tid wraparound bug

Message ID 20110430171711.GA2819@thunk.org
State Not Applicable, archived
Headers show

Commit Message

Theodore Ts'o April 30, 2011, 5:17 p.m. UTC
Hi Jan,

I don't know if you've been following this thread, but I was wondering
if you could review this patch, (a) for inclusion in the ext3 tree,
and (b) because I'd appreciate a second pair of eyes looking at this
patch, since I intend to push similar change to jbd2.

I'm not entirely convinced this is caused by tid's wrapping around,
since that would be a huge number of commits, but if it's not that,
somehow i_datasync_tid or i_sync_tid is either getting corrupted or
not getting set --- and I have no idea how that could be happening.
This patch should at least avoid the system from crashing when we hit
the case, and harmlessly handle the situation --- with at the worst
case, an journal commit that wouldn't otherwise be needed.

As background, I've been on this bug for months now, as it's been
reported to me as occasionally happening on Android devices that have
been using ext4.  Since I hadn't seen any reports of this in the field
in the x86 world, and this code hadn't changed in a long, long time, I
had originally assumed it was an ARM-specific bug.  However, recently,
Martin Zielinski (on this thread) has reported this problem on an x86
system --- and on a x86 system to boot.

Martin suspects it may have to do with sqllite --- which is consistent
with I've seen, since I believe Android devices use sqllite quite
heavily as well.

				- Ted

jbd: fix fsync() tid wraparound bug

If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated.  If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of
the fdatasync() call, this can end up triggering a BUG_ON in
fs/jbd/commit.c:

	J_ASSERT(journal->j_running_transaction != NULL);

It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync().  But
with the right workload, it can.

We fix this by replacing the use of tid_geq() with an equality test,
since there's only one valid transaction id that we is valid for us to
wait until it is commited: namely, the currently running transaction
(if it exists).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/jbd/journal.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jan Kara May 2, 2011, 3:07 p.m. UTC | #1
Hi Ted,

On Sat 30-04-11 13:17:11, Ted Tso wrote:
> I don't know if you've been following this thread, but I was wondering
> if you could review this patch, (a) for inclusion in the ext3 tree,
> and (b) because I'd appreciate a second pair of eyes looking at this
> patch, since I intend to push similar change to jbd2.
  Thanks for forwarding. For some reason I got unsubscribed from linux-ext4
a while ago and didn't notice this since linux-fsdevel goes into the same
mailbox.

> I'm not entirely convinced this is caused by tid's wrapping around,
> since that would be a huge number of commits, but if it's not that,
> somehow i_datasync_tid or i_sync_tid is either getting corrupted or
> not getting set --- and I have no idea how that could be happening.
> This patch should at least avoid the system from crashing when we hit
> the case, and harmlessly handle the situation --- with at the worst
> case, an journal commit that wouldn't otherwise be needed.
  The patch looks OK in any case. I'll take it in my tree. It would take
about 24 days of constant 1000 trans/s load to trigger this. That's a quite
heavy load but not so unrealistic with today's HW.

> As background, I've been on this bug for months now, as it's been
> reported to me as occasionally happening on Android devices that have
> been using ext4.  Since I hadn't seen any reports of this in the field
> in the x86 world, and this code hadn't changed in a long, long time, I
> had originally assumed it was an ARM-specific bug.  However, recently,
> Martin Zielinski (on this thread) has reported this problem on an x86
> system --- and on a x86 system to boot.
> 
> Martin suspects it may have to do with sqllite --- which is consistent
> with I've seen, since I believe Android devices use sqllite quite
> heavily as well.
  Yeah, it may be.

								Honza
> jbd: fix fsync() tid wraparound bug
> 
> If an application program does not make any changes to the indirect
> blocks or extent tree, i_datasync_tid will not get updated.  If there
> are enough commits (i.e., 2**31) such that tid_geq()'s calculations
> wrap, and there isn't a currently active transaction at the time of
> the fdatasync() call, this can end up triggering a BUG_ON in
> fs/jbd/commit.c:
> 
> 	J_ASSERT(journal->j_running_transaction != NULL);
> 
> It's pretty rare that this can happen, since it requires the use of
> fdatasync() plus *very* frequent and excessive use of fsync().  But
> with the right workload, it can.
> 
> We fix this by replacing the use of tid_geq() with an equality test,
> since there's only one valid transaction id that we is valid for us to
> wait until it is commited: namely, the currently running transaction
> (if it exists).
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/jbd/journal.c |   16 +++++++++++++---
>  1 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index b3713af..1b71ce6 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -437,9 +437,12 @@ int __log_space_left(journal_t *journal)
>  int __log_start_commit(journal_t *journal, tid_t target)
>  {
>  	/*
> -	 * Are we already doing a recent enough commit?
> +	 * The only transaction we can possibly wait upon is the
> +	 * currently running transaction (if it exists).  Otherwise,
> +	 * the target tid must be an old one.
>  	 */
> -	if (!tid_geq(journal->j_commit_request, target)) {
> +	if (journal->j_running_transaction &&
> +	    journal->j_running_transaction->t_tid == target) {
>  		/*
>  		 * We want a new commit: OK, mark the request and wakeup the
>  		 * commit thread.  We do _not_ do the commit ourselves.
> @@ -451,7 +454,14 @@ int __log_start_commit(journal_t *journal, tid_t target)
>  			  journal->j_commit_sequence);
>  		wake_up(&journal->j_wait_commit);
>  		return 1;
> -	}
> +	} else if (!tid_geq(journal->j_commit_request, target))
> +		/* This should never happen, but if it does, preserve
> +		   the evidence before kjournald goes into a loop and
> +		   increments j_commit_sequence beyond all recognition. */
> +		WARN(1, "jbd: bad log_start_commit: %u %u %u %u\n",
> +		     journal->j_commit_request, journal->j_commit_sequence,
> +		     target, journal->j_running_transaction ? 
> +		     journal->j_running_transaction->t_tid : 0);
>  	return 0;
>  }
>
Theodore Ts'o May 2, 2011, 6:29 p.m. UTC | #2
On Mon, May 02, 2011 at 05:07:58PM +0200, Jan Kara wrote:
>   The patch looks OK in any case. I'll take it in my tree.
Great, thanks.

> It would take
> about 24 days of constant 1000 trans/s load to trigger this. That's a quite
> heavy load but not so unrealistic with today's HW.

True; but one of the reason why I'm not sure I believe that is this
bug is showing up on some Android devices, where (a) 1000 trans/second
*is* a not just a huge load, but almost impossible to believe, and (b)
if we really are doing 2**31 commits, we would be wearing out the
flash storage being used on the Android devices!  (And we're seeing a
handful of failures every week from the testers who participate in a
kerneloops-like reporting system.)

So while I would very much like to believe that it's caused by a tid
wrap, I'm worried there is another bug hiding here....

    		      	 	     	    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 2, 2011, 7:04 p.m. UTC | #3
On Mon 02-05-11 14:29:45, Ted Tso wrote:
> On Mon, May 02, 2011 at 05:07:58PM +0200, Jan Kara wrote:
> >   The patch looks OK in any case. I'll take it in my tree.
> Great, thanks.
> 
> > It would take
> > about 24 days of constant 1000 trans/s load to trigger this. That's a quite
> > heavy load but not so unrealistic with today's HW.
> 
> True; but one of the reason why I'm not sure I believe that is this
> bug is showing up on some Android devices, where (a) 1000 trans/second
> *is* a not just a huge load, but almost impossible to believe, and (b)
> if we really are doing 2**31 commits, we would be wearing out the
> flash storage being used on the Android devices!  (And we're seeing a
> handful of failures every week from the testers who participate in a
> kerneloops-like reporting system.)
  Ah, OK. I didn't know it's so many reports. Indeed it looks like there's
another bug. But then the WARN you have added should tell us more about
who's causing the problem, right?

								Honza
Theodore Ts'o May 2, 2011, 9:31 p.m. UTC | #4
On Mon, May 02, 2011 at 09:04:41PM +0200, Jan Kara wrote:
>   Ah, OK. I didn't know it's so many reports. Indeed it looks like there's
> another bug. But then the WARN you have added should tell us more about
> who's causing the problem, right?

That's the hope.  I'm not sure WARNs get logged in Android's reporting
system.  I need to work the Android folks to figure this out.  Worst
case I may have to turn the WARN into a BUG for them (although it may
be tough to convince them to make such a change).  I'm hoping though
that the debugging information from Martin's machines will also come
through with some useful information that will allow us to root cause
whatever is causing this to happen.

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin_Zielinski@McAfee.com May 4, 2011, 2:21 p.m. UTC | #5
Here's an update.
In my first post I was not aware of the implementation of tid_gt.
I agree that 2 and a half billion commits on an SD card are - hmph - unlikely

The kernels with the patch is not yet installed because we recently provided a kernel with a panic() instead of J_ASSERT and a crashkernel. 
We now have 5 cores all showing the same behavior in j_commit_sequence vs. j_commit_request.

What the cores show (snipped):
(gdb) bt
#0  crash_setup_regs (regs=0x0) at /usr/src/debug/linux-2.6.32/arch/x86/include/asm/kexec.h:127
#1  crash_kexec (regs=0x0) at kernel/kexec.c:1076
#2  0xffffffff81048872 in panic (fmt=0xffffffff818bdb95 "running transaction is NULL")
    at kernel/panic.c:82
#3  0xffffffff81159b79 in journal_commit_transaction (journal=0xffff88031e6c4600)
    at fs/jbd/commit.c:347
#4  0xffffffff8115de49 in kjournald (arg=<value optimized out>) at fs/jbd/journal.c:150
#5  0xffffffff8105fccf in kthread (_create=<value optimized out>) at kernel/kthread.c:78
#6  0xffffffff8100c9da in ?? ()
#7  0x0000000000000000 in ?? ()

gdb) p *journal
$4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8, 
  j_superblock = 0xffff88031f876000, j_format_version = 2, j_state_lock = {raw_lock = {
      slock = 2874125135}}, j_barrier_count = 0, j_barrier = {count = {counter = 1}, wait_lock = {
      raw_lock = {slock = 0}}, wait_list = {next = 0xffff88031e6c4638, 
      prev = 0xffff88031e6c4638}, owner = 0x0}, j_running_transaction = 0x0, 
  j_committing_transaction = 0x0, j_checkpoint_transactions = 0xffff88031bd16b40,
...
j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068, 
  j_commit_sequence = 2288014067, j_commit_request = 140530417,
...
  j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568, 
  j_average_commit_time = 69247, j_private = 0xffff88031fd49400}

Means that j_commit_request would be considered to be greater than j_commit_sequence and trigger a commit.

The j_last_sync_writer has a promising backtrace:

crash> bt 4568
PID: 4568   TASK: ffff8802c38ce300  CPU: 8   COMMAND: "core"
 #0 [ffff8802c38d1da8] schedule at ffffffff8159f5e8
 #1 [ffff8802c38d1e50] log_wait_commit at ffffffff8115d4a1
 #2 [ffff8802c38d1ec0] ext3_sync_file at ffffffff8113c589
 #3 [ffff8802c38d1ef0] vfs_fsync_range at ffffffff81105588
 #4 [ffff8802c38d1f30] vfs_fsync at ffffffff81105612
 #5 [ffff8802c38d1f40] do_fsync at ffffffff81105646
 #6 [ffff8802c38d1f70] sys_fdatasync at ffffffff8110566a
 #7 [ffff8802c38d1f80] system_call_fastpath at ffffffff8100ba2b
    RIP: 00007f5be9bd6587  RSP: 0000000047386f38  RFLAGS: 00010202
    RAX: 000000000000004b  RBX: ffffffff8100ba2b  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00000000000000f7
    RBP: 00007f5aecf23cc8   R8: 00007f5bc54e56a0   R9: 00000000000011d8
    R10: 0000000001c01800  R11: 0000000000000202  R12: ffffffff8110566a
    R13: ffff8802c38d1f78  R14: 00007f5aecf23b98  R15: 0000000000000000
    ORIG_RAX: 000000000000004b  CS: 0033  SS: 002b

To make a long story short: The inode did not carry the j_commit_request tid.

But:
crash> foreach files
...
247 ffff88031e65bb00 ffff88026c3ea540 ffff88031f0c0758 REG  /opt/data/users/user.db
...
(this is an sqlite database)

And:
(gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
$5 = {counter = -2006954411}
(gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
$3 = {counter = 140530417}

> j_commit_request = 140530417

So it *is* a datasync from sqlite. And your fix will catch it. 
I still don't understand, where this number comes from. 

Unfortunately I cannot provide the core files as they are coming from customers. 
But if there is anything I can do to provide further information, please let me know.

Cheers,
Martin

-----Original Message-----
From: Ted Ts'o [mailto:tytso@mit.edu] 
Sent: Montag, 2. Mai 2011 23:31
To: Jan Kara
Cc: Ext4 Developers List; Zielinski, Martin
Subject: Re: [PATCH 2/2] jbd: fix fsync() tid wraparound bug

On Mon, May 02, 2011 at 09:04:41PM +0200, Jan Kara wrote:
>   Ah, OK. I didn't know it's so many reports. Indeed it looks like there's
> another bug. But then the WARN you have added should tell us more about
> who's causing the problem, right?

That's the hope.  I'm not sure WARNs get logged in Android's reporting
system.  I need to work the Android folks to figure this out.  Worst
case I may have to turn the WARN into a BUG for them (although it may
be tough to convince them to make such a change).  I'm hoping though
that the debugging information from Martin's machines will also come
through with some useful information that will allow us to root cause
whatever is causing this to happen.

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 4, 2011, 9:55 p.m. UTC | #6
On Wed 04-05-11 09:21:04, Martin_Zielinski@McAfee.com wrote:
> Here's an update.
> In my first post I was not aware of the implementation of tid_gt.
> I agree that 2 and a half billion commits on an SD card are - hmph -
> unlikely
  <snip>

> gdb) p *journal
> $4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8, 
>   j_superblock = 0xffff88031f876000, j_format_version = 2, j_state_lock = {raw_lock = {
>       slock = 2874125135}}, j_barrier_count = 0, j_barrier = {count = {counter = 1}, wait_lock = {
>       raw_lock = {slock = 0}}, wait_list = {next = 0xffff88031e6c4638, 
>       prev = 0xffff88031e6c4638}, owner = 0x0}, j_running_transaction = 0x0, 
>   j_committing_transaction = 0x0, j_checkpoint_transactions = 0xffff88031bd16b40,
> ...
> j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068, 
>   j_commit_sequence = 2288014067, j_commit_request = 140530417,
> ...
>   j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568, 
>   j_average_commit_time = 69247, j_private = 0xffff88031fd49400}
  <snip>

> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
> $5 = {counter = -2006954411}
> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
> $3 = {counter = 140530417}
> 
> > j_commit_request = 140530417
> 
> So it *is* a datasync from sqlite. And your fix will catch it. 
> I still don't understand, where this number comes from. 
  Ok, so i_datasync_tid got corrupted. But look at the numbers in hex:
i_datasync_tid==140530417==0x86052F1
and
i_commit_sequence==2288014067==0x886052F3

So it's a single bit error - we lost the highest bit of the number. Are you
getting the cores from different machines? Otherwise I'd suspect the HW.
If it's not HW I'm at loss what can cause it... You can try moving
i_datasync_tid to a different place in struct ext3_inode_info so that we
can rule out / confirm whether some code external to i_datasync_tid
handling is just causing random memory corruption...

								Honza
Martin_Zielinski@McAfee.com May 5, 2011, 2:11 p.m. UTC | #7
It is not a hardware bug and very unlikely a race condition or random memory corruption.
We have 7 machines that failed with an uptime of 12-13 days.
12 days earlier about 5 different machines failed after the same uptime. All machines were rebooted
after this issue.

In 5 out of 7 cores the commit request number come from the same sqlite database.
In 2 cores I could not find an inode or file structure pointing to this database and hence
it could not be verified that the request number comes from this database inode.

I'm not so sure about the bit error. Some have the hi-bit set, some not.
Due to the implementation of tid_gt() I would  expect that the numbers differ in the high bit.
The condition produces a wrong result, if the difference between the numbers is greater than INT_MAX.

The sequence / request numbers / difference INT_MAX + x:
886052f3 / 086052f1 / 3
61d305fe / e1ce83f4 / 295434
887d10c8 / 087acf05 / 147908
8e3d0b25 / 0e374365 / 378817
702d4061 / f02a5e0b / 189014
73d6775a / f3d67756 / 4
824846ad / 024846ab / 2

On some of the machines we traced the commit_sequence for a short time.
However no indication could be found the logs, that a datasync was triggered with always the same tid.

Result example:

Uptime 5 days, 33 min : commit_sequence: 1135688364
Uptime 12 days,  2:40 : commit_sequence: 1960701710

825013346 commits in 170 hours

4853019 commits per hour => 442 hours or 18 days to have INT_MAX commits.

Cheers,
Martin

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz]
Sent: Mittwoch, 4. Mai 2011 23:55
To: Zielinski, Martin
Cc: tytso@mit.edu; jack@suse.cz; linux-ext4@vger.kernel.org
Subject: Re: [PATCH 2/2] jbd: fix fsync() tid wraparound bug

On Wed 04-05-11 09:21:04, Martin_Zielinski@McAfee.com wrote:
> Here's an update.
> In my first post I was not aware of the implementation of tid_gt.
> I agree that 2 and a half billion commits on an SD card are - hmph -
> unlikely
  <snip>

> gdb) p *journal
> $4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8,
>   j_superblock = 0xffff88031f876000, j_format_version = 2, j_state_lock = {raw_lock = {
>       slock = 2874125135}}, j_barrier_count = 0, j_barrier = {count = {counter = 1}, wait_lock = {
>       raw_lock = {slock = 0}}, wait_list = {next = 0xffff88031e6c4638,
>       prev = 0xffff88031e6c4638}, owner = 0x0}, j_running_transaction = 0x0,
>   j_committing_transaction = 0x0, j_checkpoint_transactions = 0xffff88031bd16b40,
> ...
> j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068,
>   j_commit_sequence = 2288014067, j_commit_request = 140530417,
> ...
>   j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568,
>   j_average_commit_time = 69247, j_private = 0xffff88031fd49400}
  <snip>

> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
> $5 = {counter = -2006954411}
> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
> $3 = {counter = 140530417}
>
> > j_commit_request = 140530417
>
> So it *is* a datasync from sqlite. And your fix will catch it.
> I still don't understand, where this number comes from.
  Ok, so i_datasync_tid got corrupted. But look at the numbers in hex:
i_datasync_tid==140530417==0x86052F1
and
i_commit_sequence==2288014067==0x886052F3

So it's a single bit error - we lost the highest bit of the number. Are you
getting the cores from different machines? Otherwise I'd suspect the HW.
If it's not HW I'm at loss what can cause it... You can try moving
i_datasync_tid to a different place in struct ext3_inode_info so that we
can rule out / confirm whether some code external to i_datasync_tid
handling is just causing random memory corruption...

                                                                Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin_Zielinski@McAfee.com May 5, 2011, 2:55 p.m. UTC | #8
Hello once more.
I have one concern against the patch:
If the situation is triggered again and again, the patch would produce lots of output.
Maybe it's better to use WARN_ONCE.

Cheers,
Martin

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Mittwoch, 4. Mai 2011 23:55
To: Zielinski, Martin
Cc: tytso@mit.edu; jack@suse.cz; linux-ext4@vger.kernel.org
Subject: Re: [PATCH 2/2] jbd: fix fsync() tid wraparound bug

On Wed 04-05-11 09:21:04, Martin_Zielinski@McAfee.com wrote:
> Here's an update.
> In my first post I was not aware of the implementation of tid_gt.
> I agree that 2 and a half billion commits on an SD card are - hmph -
> unlikely
  <snip>

> gdb) p *journal
> $4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8, 
>   j_superblock = 0xffff88031f876000, j_format_version = 2, j_state_lock = {raw_lock = {
>       slock = 2874125135}}, j_barrier_count = 0, j_barrier = {count = {counter = 1}, wait_lock = {
>       raw_lock = {slock = 0}}, wait_list = {next = 0xffff88031e6c4638, 
>       prev = 0xffff88031e6c4638}, owner = 0x0}, j_running_transaction = 0x0, 
>   j_committing_transaction = 0x0, j_checkpoint_transactions = 0xffff88031bd16b40,
> ...
> j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068, 
>   j_commit_sequence = 2288014067, j_commit_request = 140530417,
> ...
>   j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568, 
>   j_average_commit_time = 69247, j_private = 0xffff88031fd49400}
  <snip>

> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
> $5 = {counter = -2006954411}
> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
> $3 = {counter = 140530417}
> 
> > j_commit_request = 140530417
> 
> So it *is* a datasync from sqlite. And your fix will catch it. 
> I still don't understand, where this number comes from. 
  Ok, so i_datasync_tid got corrupted. But look at the numbers in hex:
i_datasync_tid==140530417==0x86052F1
and
i_commit_sequence==2288014067==0x886052F3

So it's a single bit error - we lost the highest bit of the number. Are you
getting the cores from different machines? Otherwise I'd suspect the HW.
If it's not HW I'm at loss what can cause it... You can try moving
i_datasync_tid to a different place in struct ext3_inode_info so that we
can rule out / confirm whether some code external to i_datasync_tid
handling is just causing random memory corruption...

								Honza
Jan Kara May 5, 2011, 3:43 p.m. UTC | #9
On Thu 05-05-11 09:55:22, Martin_Zielinski@McAfee.com wrote:
> Hello once more.
> I have one concern against the patch:
> If the situation is triggered again and again, the patch would produce lots of output.
> Maybe it's better to use WARN_ONCE.
  Yes, probably it will. Changed to WARN_ONCE in the JBD patch.

								Honza
> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Mittwoch, 4. Mai 2011 23:55
> To: Zielinski, Martin
> Cc: tytso@mit.edu; jack@suse.cz; linux-ext4@vger.kernel.org
> Subject: Re: [PATCH 2/2] jbd: fix fsync() tid wraparound bug
> 
> On Wed 04-05-11 09:21:04, Martin_Zielinski@McAfee.com wrote:
> > Here's an update.
> > In my first post I was not aware of the implementation of tid_gt.
> > I agree that 2 and a half billion commits on an SD card are - hmph -
> > unlikely
>   <snip>
> 
> > gdb) p *journal
> > $4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8, 
> >   j_superblock = 0xffff88031f876000, j_format_version = 2, j_state_lock = {raw_lock = {
> >       slock = 2874125135}}, j_barrier_count = 0, j_barrier = {count = {counter = 1}, wait_lock = {
> >       raw_lock = {slock = 0}}, wait_list = {next = 0xffff88031e6c4638, 
> >       prev = 0xffff88031e6c4638}, owner = 0x0}, j_running_transaction = 0x0, 
> >   j_committing_transaction = 0x0, j_checkpoint_transactions = 0xffff88031bd16b40,
> > ...
> > j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068, 
> >   j_commit_sequence = 2288014067, j_commit_request = 140530417,
> > ...
> >   j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568, 
> >   j_average_commit_time = 69247, j_private = 0xffff88031fd49400}
>   <snip>
> 
> > (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
> > $5 = {counter = -2006954411}
> > (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
> > $3 = {counter = 140530417}
> > 
> > > j_commit_request = 140530417
> > 
> > So it *is* a datasync from sqlite. And your fix will catch it. 
> > I still don't understand, where this number comes from. 
>   Ok, so i_datasync_tid got corrupted. But look at the numbers in hex:
> i_datasync_tid==140530417==0x86052F1
> and
> i_commit_sequence==2288014067==0x886052F3
> 
> So it's a single bit error - we lost the highest bit of the number. Are you
> getting the cores from different machines? Otherwise I'd suspect the HW.
> If it's not HW I'm at loss what can cause it... You can try moving
> i_datasync_tid to a different place in struct ext3_inode_info so that we
> can rule out / confirm whether some code external to i_datasync_tid
> handling is just causing random memory corruption...
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
Jan Kara May 5, 2011, 3:53 p.m. UTC | #10
On Thu 05-05-11 09:11:22, Martin_Zielinski@McAfee.com wrote:
> It is not a hardware bug and very unlikely a race condition or random
> memory corruption.  We have 7 machines that failed with an uptime of
> 12-13 days.  12 days earlier about 5 different machines failed after the
> same uptime. All machines were rebooted after this issue.
> 
> In 5 out of 7 cores the commit request number come from the same sqlite
> database.  In 2 cores I could not find an inode or file structure
> pointing to this database and hence it could not be verified that the
> request number comes from this database inode.
> 
> I'm not so sure about the bit error. Some have the hi-bit set, some not.
> Due to the implementation of tid_gt() I would  expect that the numbers
> differ in the high bit.  The condition produces a wrong result, if the
> difference between the numbers is greater than INT_MAX.
   Yes, sorry. I was confused yesterday. The numbers are exactly such as
they should be when an application constantly calls fdatasync() without
modifying file metadata. 
 
> The sequence / request numbers / difference INT_MAX + x:
> 886052f3 / 086052f1 / 3
> 61d305fe / e1ce83f4 / 295434
> 887d10c8 / 087acf05 / 147908
> 8e3d0b25 / 0e374365 / 378817
> 702d4061 / f02a5e0b / 189014
> 73d6775a / f3d67756 / 4
> 824846ad / 024846ab / 2
> 
V On some of the machines we traced the commit_sequence for a short time.
> However no indication could be found the logs, that a datasync was
> triggered with always the same tid.
> 
> Result example:
> 
> Uptime 5 days, 33 min : commit_sequence: 1135688364
> Uptime 12 days,  2:40 : commit_sequence: 1960701710
> 
> 825013346 commits in 170 hours
> 
> 4853019 commits per hour => 442 hours or 18 days to have INT_MAX commits.
  Yes. In your case it seems we really wrap-around tid. I'll push the fix
to Linus in the next merge window and then it will go also to -stable
kernels. Thanks for your testing!

									Honza
> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz]
> Sent: Mittwoch, 4. Mai 2011 23:55
> To: Zielinski, Martin
> Cc: tytso@mit.edu; jack@suse.cz; linux-ext4@vger.kernel.org
> Subject: Re: [PATCH 2/2] jbd: fix fsync() tid wraparound bug
> 
> On Wed 04-05-11 09:21:04, Martin_Zielinski@McAfee.com wrote:
> > Here's an update.
> > In my first post I was not aware of the implementation of tid_gt.
> > I agree that 2 and a half billion commits on an SD card are - hmph -
> > unlikely
>   <snip>
> 
> > gdb) p *journal
> > $4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8,
> >   j_superblock = 0xffff88031f876000, j_format_version = 2, j_state_lock = {raw_lock = {
> >       slock = 2874125135}}, j_barrier_count = 0, j_barrier = {count = {counter = 1}, wait_lock = {
> >       raw_lock = {slock = 0}}, wait_list = {next = 0xffff88031e6c4638,
> >       prev = 0xffff88031e6c4638}, owner = 0x0}, j_running_transaction = 0x0,
> >   j_committing_transaction = 0x0, j_checkpoint_transactions = 0xffff88031bd16b40,
> > ...
> > j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068,
> >   j_commit_sequence = 2288014067, j_commit_request = 140530417,
> > ...
> >   j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568,
> >   j_average_commit_time = 69247, j_private = 0xffff88031fd49400}
>   <snip>
> 
> > (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
> > $5 = {counter = -2006954411}
> > (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
> > $3 = {counter = 140530417}
> >
> > > j_commit_request = 140530417
> >
> > So it *is* a datasync from sqlite. And your fix will catch it.
> > I still don't understand, where this number comes from.
>   Ok, so i_datasync_tid got corrupted. But look at the numbers in hex:
> i_datasync_tid==140530417==0x86052F1
> and
> i_commit_sequence==2288014067==0x886052F3
> 
> So it's a single bit error - we lost the highest bit of the number. Are you
> getting the cores from different machines? Otherwise I'd suspect the HW.
> If it's not HW I'm at loss what can cause it... You can try moving
> i_datasync_tid to a different place in struct ext3_inode_info so that we
> can rule out / confirm whether some code external to i_datasync_tid
> handling is just causing random memory corruption...
> 
>                                                                 Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
diff mbox

Patch

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index b3713af..1b71ce6 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -437,9 +437,12 @@  int __log_space_left(journal_t *journal)
 int __log_start_commit(journal_t *journal, tid_t target)
 {
 	/*
-	 * Are we already doing a recent enough commit?
+	 * The only transaction we can possibly wait upon is the
+	 * currently running transaction (if it exists).  Otherwise,
+	 * the target tid must be an old one.
 	 */
-	if (!tid_geq(journal->j_commit_request, target)) {
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == target) {
 		/*
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
@@ -451,7 +454,14 @@  int __log_start_commit(journal_t *journal, tid_t target)
 			  journal->j_commit_sequence);
 		wake_up(&journal->j_wait_commit);
 		return 1;
-	}
+	} else if (!tid_geq(journal->j_commit_request, target))
+		/* This should never happen, but if it does, preserve
+		   the evidence before kjournald goes into a loop and
+		   increments j_commit_sequence beyond all recognition. */
+		WARN(1, "jbd: bad log_start_commit: %u %u %u %u\n",
+		     journal->j_commit_request, journal->j_commit_sequence,
+		     target, journal->j_running_transaction ? 
+		     journal->j_running_transaction->t_tid : 0);
 	return 0;
 }