Message ID | 20121023221913.GC28626@thunk.org |
---|---|
State | Superseded, archived |
Headers | show |
On 23 Oct 2012, Theodore Ts'o said: > The reason why the problem happens rarely is that the effect of the > buggy commit is that if the journal's starting block is zero, we fail > to truncate the journal when we unmount the file system. Oh dear oh dear. > This can > happen if we mount and then unmount the file system fairly quickly, > before the log has a chance to wrap. ... which is quite likely if you're rebooting frequently to try to track down some other kernel bug. > After the first time this has > happened, it's not a disaster, since when we replay the journal, we'll > just replay some extra transactions. But if this happens twice, the > oldest valid transaction will still not have gotten updated, but some > of the newer transactions from the last mount session will have gotten > written by the very latest transacitons, and when we then try to do > the extra transaction replays, the metadata blocks can end up getting > very scrambled indeed. Ow. OK, it's a good thing I rebooted fast. :) and only fses that got written to, but not too much, will see this. Hence my /usr/src stayed intact because it had lots of updates of lots of tiny files, more than enough to cause the journal to wrap over and over again, even journalling only metadata. But /home doesn't see so many updates, and neither does /var... This seems to explain everything. It looks like fscking everything will fix it (it'll replay the buggered journal, mangling the metadata, but then fix up the scrambled metadata and fix the journal's starting block). So I probably don't need to worry about latent corruption hiding waiting to pounce. Phew. > *Sigh*. My apologies for not catching this when I reviewed this > patch. I believe the following patch should fix the bug; once it's > reviewed by other ext4 developers, I'll push this to Linus ASAP. No problem. This is my first data-corruption bug in more than seventeen years of ext* use (it even survived horribly faulty RAM). I call that a good record. And it happened one day after a full backup, and was immediately highlighted by corruption of .bash_history and input/output errors logging in -- and fsck pretty much fixed the problem, with only a few missing files, one file full of garbage, and one high-ASCII filename in a temporary directory to show for it. I call that luckier than I have any right to be. Plus, my faith in the amazingly fast bugfixing talents of ext4 devs is undimmed! :) > - Ted > > commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Tue Oct 23 18:15:22 2012 -0400 > > jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty I'll apply this tomorrow (enough fun with filesystem restoration for today) and see what happens. (What could *possibly* go wrong?) But I might not upgrade to stable kernels quite so often in future :( you know what they say: once burnt, twice not upgrading before doing a full backup!
On 23 Oct 2012, Theodore Ts'o verbalised: > *Sigh*. My apologies for not catching this when I reviewed this > patch. I believe the following patch should fix the bug; once it's > reviewed by other ext4 developers, I'll push this to Linus ASAP. I note that the patch is in the latest stable releases of 3.4.x and 3.5.x, which are IIRC end-of-lifed. I'd say that if your patch does indeed fix it, this justifies pushing out new releases of both these stable kernels with just this patch in, just to make sure people taking the latest stable kernel from those releases don't eat their filesystems. The bug did really quite a lot of damage to my /home fs in only a few minutes of uptime, given how few files I wrote to it. What it could have done to a more conventional distro install with everything including /home on one filesystem, I shudder to think. (I've posted a note to the LWN thread announcing 3.6.3, just in case that saves someone's fs.)
Just to follow up (mostly for ext4 developers). After talking to Eric via irc, it appears he thought it was sufficient to check (s_start == 0) from commit 24bcc89c7e, which was authored by Jan Kara. (Which we now need to audit very carefully, although it's been in the upstream kernel since 3.4, so it's obviously not causing failures as spectacularly or as easily as eeecef0af5e.) And I suspect the reason why Jan thought this was OK is because of the following totally bogus comment at fs/jbd2/recovery.c:259: /* * The journal superblock's s_start field (the current log head) * is always zero if, and only if, the journal was cleanly * unmounted. */ After doing some code archeology, I've found that this comment dates back to the very first commit in the historic git tree when the fs/jbd code was added to the 2.4.14 tree. I suspect that s_start was originally a physical block number (in the very early days when sct was initially developing ext3, before it was submitted to the kernel), but then when Stephen added the ability to store the journal in an inode, it became a logical block number, and this comment became incorrect, but no one noticed and/or decided to fix the comment in the last ten years. :-( So now we know the root cause of the thought processes that lead to the bug, and now we need to double check the changes in commits 24bcc89c7e for jbd2, and 9754e39c7b for jbd (a similar change was also added to ext3 in v3.5). - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 24, 2012 at 12:06:21AM +0100, Nix wrote: > I note that the patch is in the latest stable releases of 3.4.x and > 3.5.x, which are IIRC end-of-lifed. I'd say that if your patch does > indeed fix it, this justifies pushing out new releases of both these > stable kernels with just this patch in, just to make sure people taking > the latest stable kernel from those releases don't eat their > filesystems. Eric is in the process of reviewing the bug, and creating a repro case so we can definitely show that my theory is sound, and that the bug has been fixed by my proposed fix. We know that my patch definitely restores the behaviour previous to commit eeecef0af5, so it can't hurt, but we do want to make 100% sure that it really fixes the problem. (I found the potential bug by desk checking the all of the commits between v3.6.1 and v3.6.3, and none of the other commits triggered my WTF alarm, but we want to have a easy repro case so we can be 100% sure it's been fixed. It's always nice when theory is backed up with empircal evidence. :-) Until then, it should also be fine to just revert that commit on the other stable kernels. > The bug did really quite a lot of damage to my /home fs in only a few > minutes of uptime, given how few files I wrote to it. What it could have > done to a more conventional distro install with everything including > /home on one filesystem, I shudder to think. Well, the problem won't show up if the journal has wrapped. So it will only show up if the system has been rebooted twice in fairly quick succession. A full conventional distro install probably wouldn't have triggered a bug... although someone who habitually reboots their laptop instead of using suspend/resume or hiberbate, or someone who is trying to bisect the kernel looking for some other bug could easily trip over this --- which I guess is how you got hit by it. Regards, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 24 Oct 2012, Theodore Ts'o told this: > hurt, but we do want to make 100% sure that it really fixes the > problem. Well, yes, that would be nice. I can certainly try to verify that it stops my filesystems getting corrupted. (And if so, I owe you a $BEVERAGE. Though I suspect I owe you about three million of those already for other code written in the past.) >> The bug did really quite a lot of damage to my /home fs in only a few >> minutes of uptime, given how few files I wrote to it. What it could have >> done to a more conventional distro install with everything including >> /home on one filesystem, I shudder to think. > > Well, the problem won't show up if the journal has wrapped. So it > will only show up if the system has been rebooted twice in fairly > quick succession. A full conventional distro install probably > wouldn't have triggered a bug... A full *install* from scratch, no. I was more worried about the possibility of someone running -stable kernels on an existing distro installation, and shutting down every night (given what's been happening to UK electricity prices in the last few years I suspect there are quite a lot of people doing that in the UK to save power). If they happen not to do much on one particular day other than a bit of light distro updating, they could perfectly well end up roasting things touched during the distro update. Things like glibc :( > although someone who habitually > reboots their laptop instead of using suspend/resume or hiberbate, or > someone who is trying to bisect the kernel looking for some other bug > could easily trip over this --- which I guess is how you got hit by > it. I was first hit by it in /var before I was even trying to bisect: I was just rebooting to unwedge NFS lockd. It's true that in less than a week probably not all that many people have rebooted often enough to trip over this. I hope.
On 10/23/12 5:19 PM, Theodore Ts'o wrote: > On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote: >> >> It is now quite clear that this is a bug introduced by one or more of >> the post-3.6.1 ext4 patches (which have all been backported at least to >> 3.5, so the problem is probably there too). >> >> [ 60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd >> [ 60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. >> > > I think I've found the problem. I believe the commit at fault is commit > 14b4ed22a6 (upstream commit eeecef0af5e): > > jbd2: don't write superblock when if its empty > > which first appeared in v3.6.2. > > The reason why the problem happens rarely is that the effect of the > buggy commit is that if the journal's starting block is zero, we fail > to truncate the journal when we unmount the file system. This can > happen if we mount and then unmount the file system fairly quickly, > before the log has a chance to wrap.After the first time this has > happened, it's not a disaster, since when we replay the journal, we'll > just replay some extra transactions. But if this happens twice, the > oldest valid transaction will still not have gotten updated, but some > of the newer transactions from the last mount session will have gotten > written by the very latest transacitons, and when we then try to do > the extra transaction replays, the metadata blocks can end up getting > very scrambled indeed. I'm stumped by this; maybe Ted can see if I'm missing something. (and Nix, is there anything special about your fs? Any nondefault mkfs or mount options, external journal, inordinately large fs, or anything like that?) The suspect commit added this in jbd2_mark_journal_empty(): /* Is it already empty? */ if (sb->s_start == 0) { read_unlock(&journal->j_state_lock); return; } thereby short circuiting the function. But Ted's suggestion that mounting the fs, doing a little work, and unmounting before we wrap would lead to this doesn't make sense to me. When I do a little work, s_start is at 1, not 0. We start the journal at s_first: load_superblock() journal->j_first = be32_to_cpu(sb->s_first); And when we wrap the journal, we wrap back to j_first: jbd2_journal_next_log_block(): if (journal->j_head == journal->j_last) journal->j_head = journal->j_first; and j_first comes from s_first, which is set at journal creation time to be "1" for an internal journal. So s_start == 0 sure looks special to me; so far I can only see that we get there if we've been through jbd2_mark_journal_empty() already, though I'm eyeballing jbd2_journal_get_log_tail() as well. Ted's proposed patch seems harmless but so far I don't understand what problem it fixes, and I cannot recreate getting to jbd2_mark_journal_empty() with a dirty log and s_start == 0. -Eric > *Sigh*. My apologies for not catching this when I reviewed this > patch. I believe the following patch should fix the bug; once it's > reviewed by other ext4 developers, I'll push this to Linus ASAP. > > - Ted > > commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Tue Oct 23 18:15:22 2012 -0400 > > jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty > > Fix a potential file system corrupting bug which was introduced by > commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write > superblock when if its empty. > > We should only skip writing the journal superblock if there is nothing > to do --- not just when s_start is zero. > > This has caused users to report file system corruptions in ext4 that > look like this: > > EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd > JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > > after the file system has been corrupted. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: stable@vger.kernel.org > > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c > index 0f16edd..0064181 100644 > --- a/fs/jbd2/journal.c > +++ b/fs/jbd2/journal.c > @@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid, > static void jbd2_mark_journal_empty(journal_t *journal) > { > journal_superblock_t *sb = journal->j_superblock; > + __be32 new_tail_sequence; > > BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex)); > read_lock(&journal->j_state_lock); > - /* Is it already empty? */ > - if (sb->s_start == 0) { > + new_tail_sequence = cpu_to_be32(journal->j_tail_sequence); > + /* Nothing to do? */ > + if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) { > read_unlock(&journal->j_state_lock); > return; > } > jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n", > journal->j_tail_sequence); > > - sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); > + sb->s_sequence = new_tail_sequence; > sb->s_start = cpu_to_be32(0); > read_unlock(&journal->j_state_lock); > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am 24.10.2012 00:19, schrieb Theodore Ts'o: > [...] > The reason why the problem happens rarely is that the effect of the > buggy commit is that if the journal's starting block is zero, we fail > to truncate the journal when we unmount the file system. This can > happen if we mount and then unmount the file system fairly quickly, > before the log has a chance to wrap. After the first time this has > happened, it's not a disaster, since when we replay the journal, we'll > just replay some extra transactions. But if this happens twice, the > oldest valid transaction will still not have gotten updated, but some > of the newer transactions from the last mount session will have gotten > written by the very latest transacitons, and when we then try to do > the extra transaction replays, the metadata blocks can end up getting > very scrambled indeed. > [...] As a "normal linux user" I'm interested in the practical things to do now to avoid data loss. I'm running several systems with 3.6.2 and ext4. Fearing loss of data: - Is there a way to see whether the journal of a specific partition has been wrapped (since mounting) so that umounting and mounting (or doing a reboot to downgrade the kernel) is safe? - Is there a way to "force" a journal-wrap? Run any filesystem-benchmark? Which one with what parameters? Or is it unwise since I might even further corrupt data if I hit the case already? - Is it wise to umount now and run e2fsck or might I corrupt my files just by umounting now if the journal hasn't wrapped yet? - How do you define "fairly quickly"? Of course servers run 24/7 but I might be using my PC 2-5 hrs a day... Is that a "reboot to soon after booting"? - Any more advice you can give to the ordinary user to avoid fs-corruption? Don't shut down machines for some days? Better down- or upgrade the kernel? Best regards, Jannis Achstetter -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 23-10-12 19:57:09, Eric Sandeen wrote: > On 10/23/12 5:19 PM, Theodore Ts'o wrote: > > On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote: > >> > >> It is now quite clear that this is a bug introduced by one or more of > >> the post-3.6.1 ext4 patches (which have all been backported at least to > >> 3.5, so the problem is probably there too). > >> > >> [ 60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd > >> [ 60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > >> > > > > I think I've found the problem. I believe the commit at fault is commit > > 14b4ed22a6 (upstream commit eeecef0af5e): > > > > jbd2: don't write superblock when if its empty > > > > which first appeared in v3.6.2. > > > > The reason why the problem happens rarely is that the effect of the > > buggy commit is that if the journal's starting block is zero, we fail > > to truncate the journal when we unmount the file system. This can > > happen if we mount and then unmount the file system fairly quickly, > > before the log has a chance to wrap.After the first time this has > > happened, it's not a disaster, since when we replay the journal, we'll > > just replay some extra transactions. But if this happens twice, the > > oldest valid transaction will still not have gotten updated, but some > > of the newer transactions from the last mount session will have gotten > > written by the very latest transacitons, and when we then try to do > > the extra transaction replays, the metadata blocks can end up getting > > very scrambled indeed. > > I'm stumped by this; maybe Ted can see if I'm missing something. > > (and Nix, is there anything special about your fs? Any nondefault > mkfs or mount options, external journal, inordinately large fs, or > anything like that?) > > The suspect commit added this in jbd2_mark_journal_empty(): > > /* Is it already empty? */ > if (sb->s_start == 0) { > read_unlock(&journal->j_state_lock); > return; > } > > thereby short circuiting the function. > > But Ted's suggestion that mounting the fs, doing a little work, and > unmounting before we wrap would lead to this doesn't make sense to > me. When I do a little work, s_start is at 1, not 0. We start > the journal at s_first: > > load_superblock() > journal->j_first = be32_to_cpu(sb->s_first); > > And when we wrap the journal, we wrap back to j_first: > > jbd2_journal_next_log_block(): > if (journal->j_head == journal->j_last) > journal->j_head = journal->j_first; > > and j_first comes from s_first, which is set at journal creation > time to be "1" for an internal journal. > > So s_start == 0 sure looks special to me; so far I can only see that > we get there if we've been through jbd2_mark_journal_empty() already, > though I'm eyeballing jbd2_journal_get_log_tail() as well. > > Ted's proposed patch seems harmless but so far I don't understand > what problem it fixes, and I cannot recreate getting to > jbd2_mark_journal_empty() with a dirty log and s_start == 0. Agreed. I rather thing we might miss journal->j_flags |= JBD2_FLUSHED when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly see how that would cause the corruption... Honza
On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote: > > As a "normal linux user" I'm interested in the practical things to do > now to avoid data loss. I'm running several systems with 3.6.2 and ext4. > Fearing loss of data: > - Is there a way to see whether the journal of a specific partition has > been wrapped (since mounting) so that umounting and mounting (or doing a > reboot to downgrade the kernel) is safe? My initial analysis of what had been causing the problem now looks incorrect (or at least incomplete). Both Eric and I have been unable to reproduce the failure based on my initial theory of what had been going on. So the best information at this point is that it's probably not related to the file system getting unmounted before the journal has wrapped. (Keep in mind this is why commercial software corporations like Microsoft or Apple generally don't make discussions as they are trying to root cause a problem public; sometimes the initial theories can be incorrect, and it's unfortunate when misinformation ends up on Phoronix or Slashdot, leading to people to panic... but this is open source, so that means we do everything in the open, since that way we can all work towards finding the best answer.) At the *moment* it looks like it might be related to an unclean shutdown (i.e., a forced reset or power failure while the file system is mounted or is in the process of being unmounted). That being said, a simply kill -9 of kvm running a test kernel while the file system is mounted by otherwise quiscient doesn't trigger the problem (I was trying that last night). It's a little bit too early for this meme: http://memegenerator.net/instance/28936247 But do please note that that Fedora !7 users have been using 3.6.2 for a while, so if this were an easily triggered bug, (a) Eric and I would have managed to reproduce it by now, and (b) lots of people would be complaining, since the symptoms of the bug are not subtle. That's not to say we aren't treating this seriously; but people shouldn't panic unduly.... (and if you are using a critical enterprise/production server on bleeding edge kernels, may I suggest that this might not be such a good idea; there is a *reason* why enterprise Linux distro's spend 6-9 months or more just stablizing the kernel, and being super paranoid about making changes afterwards for years, and it's not because they enjoy backporting patches and working with trailing edge kernel sources. :-) Regards, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am 24.10.2012 23:31, schrieb Theodore Ts'o: > On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote: >> >> As a "normal linux user" I'm interested in the practical things to do >> now to avoid data loss. I'm running several systems with 3.6.2 and ext4. >> Fearing loss of data: >> - Is there a way to see whether the journal of a specific partition has >> been wrapped (since mounting) so that umounting and mounting (or doing a >> reboot to downgrade the kernel) is safe? > [...] > (Keep in mind this is why commercial software corporations like > [...] > can all work towards finding the best answer.) I really appreciate this and I like it since although the root-cause hasn't been found for sure yet, it is a transparent process. And it's great good thing that we can directly talk to the involved devs w/o going through 200 layers of marketing and spokesmen (as it were with the two companies you mentioned). > It's a little bit too early for this meme: > http://memegenerator.net/instance/28936247 That's a good one :) > But do please note that that Fedora !7 users have been using 3.6.2 for > [...] > with trailing edge kernel sources. :-) Yes, the downside of running Gentoo unstable. But even the "stable" tree used 3.5.7 and this is the one my NAS uses where I do store my backups. Nevertheless, your reply eased my mind to a great extend and I'm thankful for it. Time for bed now :) Jannis -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 24 Oct 2012, Theodore Ts'o uttered the following: > (Keep in mind this is why commercial software corporations like > Microsoft or Apple generally don't make discussions as they are trying > to root cause a problem public; sometimes the initial theories can be > incorrect, and it's unfortunate when misinformation ends up on > Phoronix or Slashdot, leading to people to panic... but this is open > source, so that means we do everything in the open, since that way we > can all work towards finding the best answer.) Quite. The first few days of any problem diagnosis are often a process of taking something from 'oh my god it might be the end of the world' to 'oh look it's really obscure, no wonder nobody has ever seen it before'. This is quite *definitely* such a problem. > It's a little bit too early for this meme: > > http://memegenerator.net/instance/28936247 It appears I have taken up a new post as the Iraqi Information Minister. This is why I was disturbed to see the thing hitting Phoronix and then Slashdot: as the guy whose FSes are being eaten, this is probably not an easy bug to hit! If it hits, the consequences are serious, but it doesn't seem to be easy to hit. (I should perhaps have phrased the subject line better, but I'd just had my $HOME eaten and was rather stressed out...) > But do please note that that Fedora !7 users have been using 3.6.2 for > a while, so if this were an easily triggered bug, (a) Eric and I would > have managed to reproduce it by now, and (b) lots of people would be > complaining, since the symptoms of the bug are not subtle. Quite.
On Wed, Oct 24, 2012 at 11:31 PM, Theodore Ts'o <tytso@mit.edu> wrote: > But do please note that that Fedora !7 users have been using 3.6.2 for > a while, Same in Arch Linux.
On 10/24/12 3:17 PM, Jan Kara wrote: > On Tue 23-10-12 19:57:09, Eric Sandeen wrote: >> On 10/23/12 5:19 PM, Theodore Ts'o wrote: >>> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote: >>>> >>>> It is now quite clear that this is a bug introduced by one or more of >>>> the post-3.6.1 ext4 patches (which have all been backported at least to >>>> 3.5, so the problem is probably there too). >>>> >>>> [ 60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd >>>> [ 60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. >>>> >>> >>> I think I've found the problem. I believe the commit at fault is commit >>> 14b4ed22a6 (upstream commit eeecef0af5e): >>> >>> jbd2: don't write superblock when if its empty >>> >>> which first appeared in v3.6.2. >>> >>> The reason why the problem happens rarely is that the effect of the >>> buggy commit is that if the journal's starting block is zero, we fail >>> to truncate the journal when we unmount the file system. This can >>> happen if we mount and then unmount the file system fairly quickly, >>> before the log has a chance to wrap.After the first time this has >>> happened, it's not a disaster, since when we replay the journal, we'll >>> just replay some extra transactions. But if this happens twice, the >>> oldest valid transaction will still not have gotten updated, but some >>> of the newer transactions from the last mount session will have gotten >>> written by the very latest transacitons, and when we then try to do >>> the extra transaction replays, the metadata blocks can end up getting >>> very scrambled indeed. >> >> I'm stumped by this; maybe Ted can see if I'm missing something. >> >> (and Nix, is there anything special about your fs? Any nondefault >> mkfs or mount options, external journal, inordinately large fs, or >> anything like that?) >> >> The suspect commit added this in jbd2_mark_journal_empty(): >> >> /* Is it already empty? */ >> if (sb->s_start == 0) { >> read_unlock(&journal->j_state_lock); >> return; >> } >> >> thereby short circuiting the function. >> >> But Ted's suggestion that mounting the fs, doing a little work, and >> unmounting before we wrap would lead to this doesn't make sense to >> me. When I do a little work, s_start is at 1, not 0. We start >> the journal at s_first: >> >> load_superblock() >> journal->j_first = be32_to_cpu(sb->s_first); >> >> And when we wrap the journal, we wrap back to j_first: >> >> jbd2_journal_next_log_block(): >> if (journal->j_head == journal->j_last) >> journal->j_head = journal->j_first; >> >> and j_first comes from s_first, which is set at journal creation >> time to be "1" for an internal journal. >> >> So s_start == 0 sure looks special to me; so far I can only see that >> we get there if we've been through jbd2_mark_journal_empty() already, >> though I'm eyeballing jbd2_journal_get_log_tail() as well. >> >> Ted's proposed patch seems harmless but so far I don't understand >> what problem it fixes, and I cannot recreate getting to >> jbd2_mark_journal_empty() with a dirty log and s_start == 0. > Agreed. I rather thing we might miss journal->j_flags |= JBD2_FLUSHED > when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly > see how that would cause the corruption... Agreed, except so far I cannot see any way to get here with s_start == 0 without ALREADY having JBD2_FLUSHED set. Can you? Anyway, I think the problem is still poorly understood; lots of random facts floating about, and a pretty weird usecase with nonstandard/dangerous mount options. I do want to figure out what regressed (if anything) but so far this investigation doesn't seem very methodical. -Eric > Honza > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 0f16edd..0064181 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid, static void jbd2_mark_journal_empty(journal_t *journal) { journal_superblock_t *sb = journal->j_superblock; + __be32 new_tail_sequence; BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex)); read_lock(&journal->j_state_lock); - /* Is it already empty? */ - if (sb->s_start == 0) { + new_tail_sequence = cpu_to_be32(journal->j_tail_sequence); + /* Nothing to do? */ + if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) { read_unlock(&journal->j_state_lock); return; } jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n", journal->j_tail_sequence); - sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); + sb->s_sequence = new_tail_sequence; sb->s_start = cpu_to_be32(0); read_unlock(&journal->j_state_lock);