Message ID: 20110430171711.GA2819@thunk.org
State: Not Applicable, archived
Hi Ted,

On Sat 30-04-11 13:17:11, Ted Tso wrote:
> I don't know if you've been following this thread, but I was wondering
> if you could review this patch, (a) for inclusion in the ext3 tree,
> and (b) because I'd appreciate a second pair of eyes looking at this
> patch, since I intend to push a similar change to jbd2.
  Thanks for forwarding. For some reason I got unsubscribed from
linux-ext4 a while ago and didn't notice, since linux-fsdevel goes into
the same mailbox.

> I'm not entirely convinced this is caused by tids wrapping around,
> since that would be a huge number of commits, but if it's not that,
> somehow i_datasync_tid or i_sync_tid is either getting corrupted or
> not getting set --- and I have no idea how that could be happening.
> This patch should at least keep the system from crashing when we hit
> the case, and handle the situation harmlessly --- with, in the worst
> case, a journal commit that wouldn't otherwise be needed.
  The patch looks OK in any case. I'll take it in my tree. It would take
about 24 days of constant 1000 trans/s load to trigger this. That's
quite a heavy load, but not so unrealistic with today's HW.

> As background, I've been on this bug for months now, as it's been
> reported to me as occasionally happening on Android devices that have
> been using ext4. Since I hadn't seen any reports of this in the field
> in the x86 world, and this code hadn't changed in a long, long time, I
> had originally assumed it was an ARM-specific bug. However, recently,
> Martin Zielinski (on this thread) has reported this problem --- and on
> an x86 system, to boot.
>
> Martin suspects it may have to do with sqlite --- which is consistent
> with what I've seen, since I believe Android devices use sqlite quite
> heavily as well.
  Yeah, it may be.

								Honza

> jbd: fix fsync() tid wraparound bug
>
> If an application program does not make any changes to the indirect
> blocks or extent tree, i_datasync_tid will not get updated.  If there
> are enough commits (i.e., 2**31) such that tid_geq()'s calculations
> wrap, and there isn't a currently active transaction at the time of
> the fdatasync() call, this can end up triggering a BUG_ON in
> fs/jbd/commit.c:
>
> 	J_ASSERT(journal->j_running_transaction != NULL);
>
> It's pretty rare that this can happen, since it requires the use of
> fdatasync() plus *very* frequent and excessive use of fsync().  But
> with the right workload, it can.
>
> We fix this by replacing the use of tid_geq() with an equality test,
> since there's only one transaction id that is valid for us to wait on
> until it is committed: namely, the currently running transaction
> (if it exists).
>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/jbd/journal.c |   16 +++++++++++++---
>  1 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index b3713af..1b71ce6 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -437,9 +437,12 @@ int __log_space_left(journal_t *journal)
>  int __log_start_commit(journal_t *journal, tid_t target)
>  {
>  	/*
> -	 * Are we already doing a recent enough commit?
> +	 * The only transaction we can possibly wait upon is the
> +	 * currently running transaction (if it exists).  Otherwise,
> +	 * the target tid must be an old one.
>  	 */
> -	if (!tid_geq(journal->j_commit_request, target)) {
> +	if (journal->j_running_transaction &&
> +	    journal->j_running_transaction->t_tid == target) {
>  		/*
>  		 * We want a new commit: OK, mark the request and wakeup the
>  		 * commit thread.  We do _not_ do the commit ourselves.
> @@ -451,7 +454,14 @@ int __log_start_commit(journal_t *journal, tid_t target)
>  			  journal->j_commit_sequence);
>  		wake_up(&journal->j_wait_commit);
>  		return 1;
> -	}
> +	} else if (!tid_geq(journal->j_commit_request, target))
> +		/* This should never happen, but if it does, preserve
> +		   the evidence before kjournald goes into a loop and
> +		   increments j_commit_sequence beyond all recognition. */
> +		WARN(1, "jbd: bad log_start_commit: %u %u %u %u\n",
> +		     journal->j_commit_request, journal->j_commit_sequence,
> +		     target, journal->j_running_transaction ?
> +		     journal->j_running_transaction->t_tid : 0);
>  	return 0;
>  }
On Mon, May 02, 2011 at 05:07:58PM +0200, Jan Kara wrote:
> The patch looks OK in any case. I'll take it in my tree.

Great, thanks.

> It would take about 24 days of constant 1000 trans/s load to trigger
> this. That's quite a heavy load but not so unrealistic with today's HW.

True; but one of the reasons why I'm not sure I believe that is that
this bug is showing up on some Android devices, where (a) 1000
trans/second is not just a huge load, but almost impossible to believe,
and (b) if we really were doing 2**31 commits, we would be wearing out
the flash storage used on the Android devices!  (And we're seeing a
handful of failures every week from the testers who participate in a
kerneloops-like reporting system.)

So while I would very much like to believe that it's caused by a tid
wrap, I'm worried there is another bug hiding here....

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Mon 02-05-11 14:29:45, Ted Tso wrote:
> On Mon, May 02, 2011 at 05:07:58PM +0200, Jan Kara wrote:
> > The patch looks OK in any case. I'll take it in my tree.
> Great, thanks.
>
> > It would take about 24 days of constant 1000 trans/s load to trigger
> > this. That's quite a heavy load but not so unrealistic with today's HW.
>
> True; but one of the reasons why I'm not sure I believe that is that
> this bug is showing up on some Android devices, where (a) 1000
> trans/second is not just a huge load, but almost impossible to believe,
> and (b) if we really were doing 2**31 commits, we would be wearing out
> the flash storage used on the Android devices!  (And we're seeing a
> handful of failures every week from the testers who participate in a
> kerneloops-like reporting system.)
  Ah, OK. I didn't know there were so many reports. Indeed it looks like
there's another bug. But then the WARN you have added should tell us more
about who's causing the problem, right?

								Honza
On Mon, May 02, 2011 at 09:04:41PM +0200, Jan Kara wrote:
> Ah, OK. I didn't know there were so many reports. Indeed it looks like
> there's another bug. But then the WARN you have added should tell us
> more about who's causing the problem, right?

That's the hope.  I'm not sure WARNs get logged in Android's reporting
system.  I need to work with the Android folks to figure this out.
Worst case I may have to turn the WARN into a BUG for them (although it
may be tough to convince them to make such a change).

I'm hoping though that the debugging information from Martin's machines
will also come through with some useful information that will allow us
to root-cause whatever is causing this to happen.

						- Ted
Here's an update. In my first post I was not aware of the implementation
of tid_gt. I agree that 2 and a half billion commits on an SD card are -
hmph - unlikely.

The kernel with the patch is not yet installed, because we recently
provided a kernel with a panic() instead of the J_ASSERT, and a
crashkernel. We now have 5 cores all showing the same behavior in
j_commit_sequence vs. j_commit_request.

What the cores show (snipped):

(gdb) bt
#0  crash_setup_regs (regs=0x0) at /usr/src/debug/linux-2.6.32/arch/x86/include/asm/kexec.h:127
#1  crash_kexec (regs=0x0) at kernel/kexec.c:1076
#2  0xffffffff81048872 in panic (fmt=0xffffffff818bdb95 "running transaction is NULL") at kernel/panic.c:82
#3  0xffffffff81159b79 in journal_commit_transaction (journal=0xffff88031e6c4600) at fs/jbd/commit.c:347
#4  0xffffffff8115de49 in kjournald (arg=<value optimized out>) at fs/jbd/journal.c:150
#5  0xffffffff8105fccf in kthread (_create=<value optimized out>) at kernel/kthread.c:78
#6  0xffffffff8100c9da in ?? ()
#7  0x0000000000000000 in ?? ()

(gdb) p *journal
$4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8,
  j_superblock = 0xffff88031f876000, j_format_version = 2,
  j_state_lock = {raw_lock = {slock = 2874125135}}, j_barrier_count = 0,
  j_barrier = {count = {counter = 1}, wait_lock = {raw_lock = {slock = 0}},
    wait_list = {next = 0xffff88031e6c4638, prev = 0xffff88031e6c4638},
    owner = 0x0}, j_running_transaction = 0x0,
  j_committing_transaction = 0x0,
  j_checkpoint_transactions = 0xffff88031bd16b40,
  ...
  j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068,
  j_commit_sequence = 2288014067, j_commit_request = 140530417,
  ...
  j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568,
  j_average_commit_time = 69247, j_private = 0xffff88031fd49400}

This means that j_commit_request would be considered to be greater than
j_commit_sequence and trigger a commit.
The j_last_sync_writer has a promising backtrace:

crash> bt 4568
PID: 4568   TASK: ffff8802c38ce300  CPU: 8   COMMAND: "core"
 #0 [ffff8802c38d1da8] schedule at ffffffff8159f5e8
 #1 [ffff8802c38d1e50] log_wait_commit at ffffffff8115d4a1
 #2 [ffff8802c38d1ec0] ext3_sync_file at ffffffff8113c589
 #3 [ffff8802c38d1ef0] vfs_fsync_range at ffffffff81105588
 #4 [ffff8802c38d1f30] vfs_fsync at ffffffff81105612
 #5 [ffff8802c38d1f40] do_fsync at ffffffff81105646
 #6 [ffff8802c38d1f70] sys_fdatasync at ffffffff8110566a
 #7 [ffff8802c38d1f80] system_call_fastpath at ffffffff8100ba2b
    RIP: 00007f5be9bd6587  RSP: 0000000047386f38  RFLAGS: 00010202
    RAX: 000000000000004b  RBX: ffffffff8100ba2b  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00000000000000f7
    RBP: 00007f5aecf23cc8   R8: 00007f5bc54e56a0   R9: 00000000000011d8
    R10: 0000000001c01800  R11: 0000000000000202  R12: ffffffff8110566a
    R13: ffff8802c38d1f78  R14: 00007f5aecf23b98  R15: 0000000000000000
    ORIG_RAX: 000000000000004b  CS: 0033  SS: 002b

To make a long story short: this inode did not carry the
j_commit_request tid. But:

crash> foreach files
...
247 ffff88031e65bb00 ffff88026c3ea540 ffff88031f0c0758 REG  /opt/data/users/user.db
...

(this is an sqlite database) And:

(gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
$5 = {counter = -2006954411}
(gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
$3 = {counter = 140530417}

> j_commit_request = 140530417

So it *is* a datasync from sqlite. And your fix will catch it. I still
don't understand where this number comes from.

Unfortunately I cannot provide the core files, as they are coming from
customers. But if there is anything I can do to provide further
information, please let me know.

Cheers,
Martin
On Wed 04-05-11 09:21:04, Martin_Zielinski@McAfee.com wrote:
> Here's an update.
> In my first post I was not aware of the implementation of tid_gt.
> I agree that 2 and a half billion commits on an SD card are - hmph -
> unlikely
<snip>
> (gdb) p *journal
> $4 = {j_flags = 16, j_errno = 0, j_sb_buffer = 0xffff88031f156dc8,
>   j_superblock = 0xffff88031f876000, j_format_version = 2,
>   j_state_lock = {raw_lock = {slock = 2874125135}}, j_barrier_count = 0,
>   j_barrier = {count = {counter = 1}, wait_lock = {raw_lock = {slock = 0}},
>     wait_list = {next = 0xffff88031e6c4638, prev = 0xffff88031e6c4638},
>     owner = 0x0}, j_running_transaction = 0x0,
>   j_committing_transaction = 0x0,
>   j_checkpoint_transactions = 0xffff88031bd16b40,
>   ...
>   j_tail_sequence = 2288011385, j_transaction_sequence = 2288014068,
>   j_commit_sequence = 2288014067, j_commit_request = 140530417,
>   ...
>   j_wbuf = 0xffff88031de98000, j_wbufsize = 512, j_last_sync_writer = 4568,
>   j_average_commit_time = 69247, j_private = 0xffff88031fd49400}
<snip>
> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_sync_tid
> $5 = {counter = -2006954411}
> (gdb) p ((struct ext3_inode_info*)(0xffff88031f0c0758-0xd0))->i_datasync_tid
> $3 = {counter = 140530417}
>
> > j_commit_request = 140530417
>
> So it *is* a datasync from sqlite. And your fix will catch it.
> I still don't understand where this number comes from.
  Ok, so i_datasync_tid got corrupted. But look at the numbers in hex:
	i_datasync_tid    == 140530417  == 0x086052F1
	j_commit_sequence == 2288014067 == 0x886052F3

So it looks like a single-bit error - we lost the highest bit of the
number. Are you getting the cores from different machines? Otherwise I'd
suspect the HW. If it's not HW, I'm at a loss what could cause it... You
can try moving i_datasync_tid to a different place in struct
ext3_inode_info so that we can rule out / confirm whether some code
external to the i_datasync_tid handling is just causing random memory
corruption...

								Honza
It is not a hardware bug, and very unlikely a race condition or random
memory corruption. We have 7 machines that failed with an uptime of
12-13 days. 12 days earlier, about 5 different machines failed after the
same uptime. All machines were rebooted after this issue.

In 5 out of 7 cores the commit request number comes from the same sqlite
database. In 2 cores I could not find an inode or file structure
pointing to this database, and hence it could not be verified that the
request number comes from this database inode.

I'm not so sure about the bit error. Some have the hi-bit set, some not.
Due to the implementation of tid_gt() I would expect that the numbers
differ in the high bit. The condition produces a wrong result if the
difference between the numbers is greater than INT_MAX.

The sequence / request numbers / difference INT_MAX + x:

	886052f3 / 086052f1 / 3
	61d305fe / e1ce83f4 / 295434
	887d10c8 / 087acf05 / 147908
	8e3d0b25 / 0e374365 / 378817
	702d4061 / f02a5e0b / 189014
	73d6775a / f3d67756 / 4
	824846ad / 024846ab / 2

On some of the machines we traced the commit_sequence for a short time.
However, no indication could be found in the logs that a datasync was
triggered with always the same tid.

Result example:

	Uptime  5 days,   33 min : commit_sequence: 1135688364
	Uptime 12 days, 2:40     : commit_sequence: 1960701710

825013346 commits in 170 hours
4853019 commits per hour => 442 hours or 18 days to have INT_MAX commits.

Cheers,
Martin
Hello once more. I have one concern about the patch: if the situation is
triggered again and again, the patch would produce lots of output. Maybe
it's better to use WARN_ONCE.

Cheers,
Martin
On Thu 05-05-11 09:55:22, Martin_Zielinski@McAfee.com wrote:
> Hello once more.
> I have one concern about the patch:
> If the situation is triggered again and again, the patch would produce
> lots of output. Maybe it's better to use WARN_ONCE.
  Yes, it probably will. Changed to WARN_ONCE in the JBD patch.

								Honza
On Thu 05-05-11 09:11:22, Martin_Zielinski@McAfee.com wrote:
> It is not a hardware bug and very unlikely a race condition or random
> memory corruption. We have 7 machines that failed with an uptime of
> 12-13 days. 12 days earlier about 5 different machines failed after the
> same uptime. All machines were rebooted after this issue.
>
> In 5 out of 7 cores the commit request number comes from the same
> sqlite database. In 2 cores I could not find an inode or file structure
> pointing to this database and hence it could not be verified that the
> request number comes from this database inode.
>
> I'm not so sure about the bit error. Some have the hi-bit set, some
> not. Due to the implementation of tid_gt() I would expect that the
> numbers differ in the high bit. The condition produces a wrong result
> if the difference between the numbers is greater than INT_MAX.
  Yes, sorry. I was confused yesterday. The numbers are exactly such as
they should be when an application constantly calls fdatasync() without
modifying file metadata.

> The sequence / request numbers / difference INT_MAX + x:
> 	886052f3 / 086052f1 / 3
> 	61d305fe / e1ce83f4 / 295434
> 	887d10c8 / 087acf05 / 147908
> 	8e3d0b25 / 0e374365 / 378817
> 	702d4061 / f02a5e0b / 189014
> 	73d6775a / f3d67756 / 4
> 	824846ad / 024846ab / 2
>
> On some of the machines we traced the commit_sequence for a short time.
> However, no indication could be found in the logs that a datasync was
> triggered with always the same tid.
>
> Result example:
>
> 	Uptime  5 days,   33 min : commit_sequence: 1135688364
> 	Uptime 12 days, 2:40     : commit_sequence: 1960701710
>
> 825013346 commits in 170 hours
> 4853019 commits per hour => 442 hours or 18 days to have INT_MAX
> commits.
  Yes. In your case it seems we really wrap around the tid. I'll push
the fix to Linus in the next merge window and then it will also go to
-stable kernels. Thanks for your testing!

								Honza
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index b3713af..1b71ce6 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -437,9 +437,12 @@ int __log_space_left(journal_t *journal)
 int __log_start_commit(journal_t *journal, tid_t target)
 {
 	/*
-	 * Are we already doing a recent enough commit?
+	 * The only transaction we can possibly wait upon is the
+	 * currently running transaction (if it exists).  Otherwise,
+	 * the target tid must be an old one.
 	 */
-	if (!tid_geq(journal->j_commit_request, target)) {
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == target) {
 		/*
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
@@ -451,7 +454,14 @@ int __log_start_commit(journal_t *journal, tid_t target)
 			  journal->j_commit_sequence);
 		wake_up(&journal->j_wait_commit);
 		return 1;
-	}
+	} else if (!tid_geq(journal->j_commit_request, target))
+		/* This should never happen, but if it does, preserve
+		   the evidence before kjournald goes into a loop and
+		   increments j_commit_sequence beyond all recognition. */
+		WARN(1, "jbd: bad log_start_commit: %u %u %u %u\n",
+		     journal->j_commit_request, journal->j_commit_sequence,
+		     target, journal->j_running_transaction ?
+		     journal->j_running_transaction->t_tid : 0);
 	return 0;
 }
Hi Jan,

I don't know if you've been following this thread, but I was wondering
if you could review this patch, (a) for inclusion in the ext3 tree, and
(b) because I'd appreciate a second pair of eyes looking at this patch,
since I intend to push a similar change to jbd2.

I'm not entirely convinced this is caused by tids wrapping around, since
that would be a huge number of commits, but if it's not that, somehow
i_datasync_tid or i_sync_tid is either getting corrupted or not getting
set --- and I have no idea how that could be happening.  This patch
should at least keep the system from crashing when we hit the case, and
handle the situation harmlessly --- with, in the worst case, a journal
commit that wouldn't otherwise be needed.

As background, I've been on this bug for months now, as it's been
reported to me as occasionally happening on Android devices that have
been using ext4.  Since I hadn't seen any reports of this in the field
in the x86 world, and this code hadn't changed in a long, long time, I
had originally assumed it was an ARM-specific bug.  However, recently,
Martin Zielinski (on this thread) has reported this problem --- and on
an x86 system, to boot.

Martin suspects it may have to do with sqlite --- which is consistent
with what I've seen, since I believe Android devices use sqlite quite
heavily as well.

						- Ted

jbd: fix fsync() tid wraparound bug

If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated.  If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of the
fdatasync() call, this can end up triggering a BUG_ON in
fs/jbd/commit.c:

	J_ASSERT(journal->j_running_transaction != NULL);

It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync().  But
with the right workload, it can.

We fix this by replacing the use of tid_geq() with an equality test,
since there's only one transaction id that is valid for us to wait on
until it is committed: namely, the currently running transaction (if it
exists).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/jbd/journal.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)