diff mbox

Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

Message ID 20121023221913.GC28626@thunk.org
State Superseded, archived
Headers show

Commit Message

Theodore Ts'o Oct. 23, 2012, 10:19 p.m. UTC
On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
> 
> It is now quite clear that this is a bug introduced by one or more of
> the post-3.6.1 ext4 patches (which have all been backported at least to
> 3.5, so the problem is probably there too).
> 
> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> 

I think I've found the problem.  I believe the commit at fault is commit
14b4ed22a6 (upstream commit eeecef0af5e):

    jbd2: don't write superblock when if its empty

which first appeared in v3.6.2.

The reason why the problem happens rarely is that the effect of the
buggy commit is that if the journal's starting block is zero, we fail
to truncate the journal when we unmount the file system.  This can
happen if we mount and then unmount the file system fairly quickly,
before the log has a chance to wrap.  After the first time this has
happened, it's not a disaster, since when we replay the journal, we'll
just replay some extra transactions.  But if this happens twice, the
oldest valid transaction will still not have gotten updated, but some
of the newer transactions from the last mount session will have gotten
written by the very latest transacitons, and when we then try to do
the extra transaction replays, the metadata blocks can end up getting
very scrambled indeed.

*Sigh*.  My apologies for not catching this when I reviewed this
patch.  I believe the following patch should fix the bug; once it's
reviewed by other ext4 developers, I'll push this to Linus ASAP.

						- Ted

commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Tue Oct 23 18:15:22 2012 -0400

    jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
    
    Fix a potential file system corrupting bug which was introduced by
    commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
    superblock when if its empty.
    
    We should only skip writing the journal superblock if there is nothing
    to do --- not just when s_start is zero.
    
    This has caused users to report file system corruptions in ext4 that
    look like this:
    
    EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
    JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
    
    after the file system has been corrupted.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Nix Oct. 23, 2012, 10:47 p.m. UTC | #1
On 23 Oct 2012, Theodore Ts'o said:

> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.

Oh dear oh dear.

>                                                           This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.

... which is quite likely if you're rebooting frequently to try to track
down some other kernel bug.

>                                       After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transacitons, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

Ow. OK, it's a good thing I rebooted fast. :) and only fses that got
written to, but not too much, will see this. Hence my /usr/src stayed
intact because it had lots of updates of lots of tiny files, more than
enough to cause the journal to wrap over and over again, even
journalling only metadata. But /home doesn't see so many updates, and
neither does /var...

This seems to explain everything.

It looks like fscking everything will fix it (it'll replay the buggered
journal, mangling the metadata, but then fix up the scrambled metadata
and fix the journal's starting block). So I probably don't need to worry
about latent corruption hiding waiting to pounce. Phew.

> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.

No problem. This is my first data-corruption bug in more than seventeen
years of ext* use (it even survived horribly faulty RAM). I call that a
good record. And it happened one day after a full backup, and was
immediately highlighted by corruption of .bash_history and input/output
errors logging in -- and fsck pretty much fixed the problem, with only a
few missing files, one file full of garbage, and one high-ASCII filename
in a temporary directory to show for it. I call that luckier than I have
any right to be.

Plus, my faith in the amazingly fast bugfixing talents of ext4 devs is
undimmed! :)

> 						- Ted
>
> commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Tue Oct 23 18:15:22 2012 -0400
>
>     jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty

I'll apply this tomorrow (enough fun with filesystem restoration for
today) and see what happens. (What could *possibly* go wrong?)

But I might not upgrade to stable kernels quite so often in future :(
you know what they say: once burnt, twice not upgrading before doing a
full backup!
Nix Oct. 23, 2012, 11:06 p.m. UTC | #2
On 23 Oct 2012, Theodore Ts'o verbalised:
> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.

I note that the patch is in the latest stable releases of 3.4.x and
3.5.x, which are IIRC end-of-lifed. I'd say that if your patch does
indeed fix it, this justifies pushing out new releases of both these
stable kernels with just this patch in, just to make sure people taking
the latest stable kernel from those releases don't eat their
filesystems.

The bug did really quite a lot of damage to my /home fs in only a few
minutes of uptime, given how few files I wrote to it. What it could have
done to a more conventional distro install with everything including
/home on one filesystem, I shudder to think.

(I've posted a note to the LWN thread announcing 3.6.3, just in case
that saves someone's fs.)
Theodore Ts'o Oct. 23, 2012, 11:16 p.m. UTC | #3
Just to follow up (mostly for ext4 developers).  After talking to Eric
via irc, it appears he thought it was sufficient to check (s_start ==
0) from commit 24bcc89c7e, which was authored by Jan Kara.  (Which we
now need to audit very carefully, although it's been in the upstream
kernel since 3.4, so it's obviously not causing failures as
spectacularly or as easily as eeecef0af5e.)

And I suspect the reason why Jan thought this was OK is because of the
following totally bogus comment at fs/jbd2/recovery.c:259:

	/*
	 * The journal superblock's s_start field (the current log head)
	 * is always zero if, and only if, the journal was cleanly
	 * unmounted.
	 */

After doing some code archeology, I've found that this comment dates
back to the very first commit in the historic git tree when the fs/jbd
code was added to the 2.4.14 tree.  I suspect that s_start was
originally a physical block number (in the very early days when sct
was initially developing ext3, before it was submitted to the kernel),
but then when Stephen added the ability to store the journal in an
inode, it became a logical block number, and this comment became
incorrect, but no one noticed and/or decided to fix the comment in the
last ten years.  :-(

So now we know the root cause of the thought processes that lead to
the bug, and now we need to double check the changes in commits
24bcc89c7e for jbd2, and 9754e39c7b for jbd (a similar change was also
added to ext3 in v3.5).

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Oct. 23, 2012, 11:28 p.m. UTC | #4
On Wed, Oct 24, 2012 at 12:06:21AM +0100, Nix wrote:
> I note that the patch is in the latest stable releases of 3.4.x and
> 3.5.x, which are IIRC end-of-lifed. I'd say that if your patch does
> indeed fix it, this justifies pushing out new releases of both these
> stable kernels with just this patch in, just to make sure people taking
> the latest stable kernel from those releases don't eat their
> filesystems.

Eric is in the process of reviewing the bug, and creating a repro case
so we can definitely show that my theory is sound, and that the bug
has been fixed by my proposed fix.  We know that my patch definitely
restores the behaviour previous to commit eeecef0af5, so it can't
hurt, but we do want to make 100% sure that it really fixes the
problem.  (I found the potential bug by desk checking the all of the
commits between v3.6.1 and v3.6.3, and none of the other commits
triggered my WTF alarm, but we want to have a easy repro case so we
can be 100% sure it's been fixed.  It's always nice when theory is
backed up with empircal evidence.  :-)

Until then, it should also be fine to just revert that commit on the
other stable kernels.

> The bug did really quite a lot of damage to my /home fs in only a few
> minutes of uptime, given how few files I wrote to it. What it could have
> done to a more conventional distro install with everything including
> /home on one filesystem, I shudder to think.

Well, the problem won't show up if the journal has wrapped.  So it
will only show up if the system has been rebooted twice in fairly
quick succession.  A full conventional distro install probably
wouldn't have triggered a bug... although someone who habitually
reboots their laptop instead of using suspend/resume or hiberbate, or
someone who is trying to bisect the kernel looking for some other bug
could easily trip over this --- which I guess is how you got hit by
it.

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix Oct. 23, 2012, 11:34 p.m. UTC | #5
On 24 Oct 2012, Theodore Ts'o told this:

> hurt, but we do want to make 100% sure that it really fixes the
> problem.

Well, yes, that would be nice. I can certainly try to verify that it
stops my filesystems getting corrupted. (And if so, I owe you a
$BEVERAGE. Though I suspect I owe you about three million of those
already for other code written in the past.)

>> The bug did really quite a lot of damage to my /home fs in only a few
>> minutes of uptime, given how few files I wrote to it. What it could have
>> done to a more conventional distro install with everything including
>> /home on one filesystem, I shudder to think.
>
> Well, the problem won't show up if the journal has wrapped.  So it
> will only show up if the system has been rebooted twice in fairly
> quick succession.  A full conventional distro install probably
> wouldn't have triggered a bug...

A full *install* from scratch, no. I was more worried about the
possibility of someone running -stable kernels on an existing distro
installation, and shutting down every night (given what's been happening
to UK electricity prices in the last few years I suspect there are quite
a lot of people doing that in the UK to save power). If they happen not
to do much on one particular day other than a bit of light distro
updating, they could perfectly well end up roasting things touched
during the distro update. Things like glibc :(

>                                  although someone who habitually
> reboots their laptop instead of using suspend/resume or hiberbate, or
> someone who is trying to bisect the kernel looking for some other bug
> could easily trip over this --- which I guess is how you got hit by
> it.

I was first hit by it in /var before I was even trying to bisect: I was
just rebooting to unwedge NFS lockd. It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.
Eric Sandeen Oct. 24, 2012, 12:57 a.m. UTC | #6
On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>
>> It is now quite clear that this is a bug introduced by one or more of
>> the post-3.6.1 ext4 patches (which have all been backported at least to
>> 3.5, so the problem is probably there too).
>>
>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>>
> 
> I think I've found the problem.  I believe the commit at fault is commit
> 14b4ed22a6 (upstream commit eeecef0af5e):
> 
>     jbd2: don't write superblock when if its empty
> 
> which first appeared in v3.6.2.
> 
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transacitons, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

I'm stumped by this; maybe Ted can see if I'm missing something.

(and Nix, is there anything special about your fs?  Any nondefault
mkfs or mount options, external journal, inordinately large fs, or
anything like that?)

The suspect commit added this in jbd2_mark_journal_empty():

        /* Is it already empty? */
        if (sb->s_start == 0) {
                read_unlock(&journal->j_state_lock);
                return;
        }

thereby short circuiting the function.

But Ted's suggestion that mounting the fs, doing a little work, and
unmounting before we wrap would lead to this doesn't make sense to
me.  When I do a little work, s_start is at 1, not 0.  We start
the journal at s_first:

load_superblock()
	journal->j_first = be32_to_cpu(sb->s_first);

And when we wrap the journal, we wrap back to j_first:

jbd2_journal_next_log_block():
        if (journal->j_head == journal->j_last)
                journal->j_head = journal->j_first;

and j_first comes from s_first, which is set at journal creation
time to be "1" for an internal journal.

So s_start == 0 sure looks special to me; so far I can only see that
we get there if we've been through jbd2_mark_journal_empty() already,
though I'm eyeballing jbd2_journal_get_log_tail() as well.

Ted's proposed patch seems harmless but so far I don't understand
what problem it fixes, and I cannot recreate getting to
jbd2_mark_journal_empty() with a dirty log and s_start == 0.

-Eric

> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.
> 
> 						- Ted
> 
> commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Tue Oct 23 18:15:22 2012 -0400
> 
>     jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
>     
>     Fix a potential file system corrupting bug which was introduced by
>     commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
>     superblock when if its empty.
>     
>     We should only skip writing the journal superblock if there is nothing
>     to do --- not just when s_start is zero.
>     
>     This has caused users to report file system corruptions in ext4 that
>     look like this:
>     
>     EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
>     JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>     
>     after the file system has been corrupted.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>     Cc: stable@vger.kernel.org
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 0f16edd..0064181 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
>  static void jbd2_mark_journal_empty(journal_t *journal)
>  {
>  	journal_superblock_t *sb = journal->j_superblock;
> +	__be32		new_tail_sequence;
>  
>  	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
>  	read_lock(&journal->j_state_lock);
> -	/* Is it already empty? */
> -	if (sb->s_start == 0) {
> +	new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
> +	/* Nothing to do? */
> +	if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) {
>  		read_unlock(&journal->j_state_lock);
>  		return;
>  	}
>  	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
>  		  journal->j_tail_sequence);
>  
> -	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
> +	sb->s_sequence = new_tail_sequence;
>  	sb->s_start    = cpu_to_be32(0);
>  	read_unlock(&journal->j_state_lock);
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jannis Achstetter Oct. 24, 2012, 7:13 p.m. UTC | #7
Am 24.10.2012 00:19, schrieb Theodore Ts'o:
> [...]
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transacitons, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.
> [...]

As a "normal linux user" I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to "force" a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a "reboot to soon after
booting"?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
	Jannis Achstetter


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Oct. 24, 2012, 8:17 p.m. UTC | #8
On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> > On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
> >>
> >> It is now quite clear that this is a bug introduced by one or more of
> >> the post-3.6.1 ext4 patches (which have all been backported at least to
> >> 3.5, so the problem is probably there too).
> >>
> >> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
> >> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> >>
> > 
> > I think I've found the problem.  I believe the commit at fault is commit
> > 14b4ed22a6 (upstream commit eeecef0af5e):
> > 
> >     jbd2: don't write superblock when if its empty
> > 
> > which first appeared in v3.6.2.
> > 
> > The reason why the problem happens rarely is that the effect of the
> > buggy commit is that if the journal's starting block is zero, we fail
> > to truncate the journal when we unmount the file system.  This can
> > happen if we mount and then unmount the file system fairly quickly,
> > before the log has a chance to wrap.After the first time this has
> > happened, it's not a disaster, since when we replay the journal, we'll
> > just replay some extra transactions.  But if this happens twice, the
> > oldest valid transaction will still not have gotten updated, but some
> > of the newer transactions from the last mount session will have gotten
> > written by the very latest transacitons, and when we then try to do
> > the extra transaction replays, the metadata blocks can end up getting
> > very scrambled indeed.
> 
> I'm stumped by this; maybe Ted can see if I'm missing something.
> 
> (and Nix, is there anything special about your fs?  Any nondefault
> mkfs or mount options, external journal, inordinately large fs, or
> anything like that?)
> 
> The suspect commit added this in jbd2_mark_journal_empty():
> 
>         /* Is it already empty? */
>         if (sb->s_start == 0) {
>                 read_unlock(&journal->j_state_lock);
>                 return;
>         }
> 
> thereby short circuiting the function.
> 
> But Ted's suggestion that mounting the fs, doing a little work, and
> unmounting before we wrap would lead to this doesn't make sense to
> me.  When I do a little work, s_start is at 1, not 0.  We start
> the journal at s_first:
> 
> load_superblock()
> 	journal->j_first = be32_to_cpu(sb->s_first);
> 
> And when we wrap the journal, we wrap back to j_first:
> 
> jbd2_journal_next_log_block():
>         if (journal->j_head == journal->j_last)
>                 journal->j_head = journal->j_first;
> 
> and j_first comes from s_first, which is set at journal creation
> time to be "1" for an internal journal.
> 
> So s_start == 0 sure looks special to me; so far I can only see that
> we get there if we've been through jbd2_mark_journal_empty() already,
> though I'm eyeballing jbd2_journal_get_log_tail() as well.
> 
> Ted's proposed patch seems harmless but so far I don't understand
> what problem it fixes, and I cannot recreate getting to
> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
  Agreed. I rather thing we might miss journal->j_flags |= JBD2_FLUSHED
when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
see how that would cause the corruption...

								Honza
Theodore Ts'o Oct. 24, 2012, 9:31 p.m. UTC | #9
On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
> 
> As a "normal linux user" I'm interested in the practical things to do
> now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
> Fearing loss of data:
> - Is there a way to see whether the journal of a specific partition has
> been wrapped (since mounting) so that umounting and mounting (or doing a
> reboot to downgrade the kernel) is safe?

My initial analysis of what had been causing the problem now looks
incorrect (or at least incomplete).  Both Eric and I have been unable
to reproduce the failure based on my initial theory of what had been
going on.  So the best information at this point is that it's probably
not related to the file system getting unmounted before the journal
has wrapped.

(Keep in mind this is why commercial software corporations like
Microsoft or Apple generally don't make discussions as they are trying
to root cause a problem public; sometimes the initial theories can be
incorrect, and it's unfortunate when misinformation ends up on
Phoronix or Slashdot, leading to people to panic...  but this is open
source, so that means we do everything in the open, since that way we
can all work towards finding the best answer.)

At the *moment* it looks like it might be related to an unclean
shutdown (i.e., a forced reset or power failure while the file system
is mounted or is in the process of being unmounted).  That being said,
a simply kill -9 of kvm running a test kernel while the file system is
mounted by otherwise quiscient doesn't trigger the problem (I was
trying that last night).

It's a little bit too early for this meme:

    http://memegenerator.net/instance/28936247

But do please note that that Fedora !7 users have been using 3.6.2 for
a while, so if this were an easily triggered bug, (a) Eric and I would
have managed to reproduce it by now, and (b) lots of people would be
complaining, since the symptoms of the bug are not subtle.

That's not to say we aren't treating this seriously; but people
shouldn't panic unduly.... (and if you are using a critical
enterprise/production server on bleeding edge kernels, may I suggest
that this might not be such a good idea; there is a *reason* why
enterprise Linux distro's spend 6-9 months or more just stablizing the
kernel, and being super paranoid about making changes afterwards for
years, and it's not because they enjoy backporting patches and working
with trailing edge kernel sources.  :-)

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jannis Achstetter Oct. 24, 2012, 10:05 p.m. UTC | #10
Am 24.10.2012 23:31, schrieb Theodore Ts'o:
> On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
>>
>> As a "normal linux user" I'm interested in the practical things to do
>> now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
>> Fearing loss of data:
>> - Is there a way to see whether the journal of a specific partition has
>> been wrapped (since mounting) so that umounting and mounting (or doing a
>> reboot to downgrade the kernel) is safe?
> [...]
> (Keep in mind this is why commercial software corporations like
> [...]
> can all work towards finding the best answer.)

I really appreciate this and I like it since although the root-cause
hasn't been found for sure yet, it is a transparent process.
And it's great good thing that we can directly talk to the involved devs
w/o going through 200 layers of marketing and spokesmen (as it were with
the two companies you mentioned).

> It's a little bit too early for this meme:
>     http://memegenerator.net/instance/28936247

That's a good one :)

> But do please note that that Fedora !7 users have been using 3.6.2 for
> [...]
> with trailing edge kernel sources.  :-)

Yes, the downside of running Gentoo unstable. But even the "stable" tree
used 3.5.7 and this is the one my NAS uses where I do store my backups.
Nevertheless, your reply eased my mind to a great extend and I'm
thankful for it.
Time for bed now :)

	Jannis

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix Oct. 24, 2012, 11:47 p.m. UTC | #11
On 24 Oct 2012, Theodore Ts'o uttered the following:
> (Keep in mind this is why commercial software corporations like
> Microsoft or Apple generally don't make discussions as they are trying
> to root cause a problem public; sometimes the initial theories can be
> incorrect, and it's unfortunate when misinformation ends up on
> Phoronix or Slashdot, leading to people to panic...  but this is open
> source, so that means we do everything in the open, since that way we
> can all work towards finding the best answer.)

Quite. The first few days of any problem diagnosis are often a process
of taking something from 'oh my god it might be the end of the world' to
'oh look it's really obscure, no wonder nobody has ever seen it before'.

This is quite *definitely* such a problem.

> It's a little bit too early for this meme:
>
>     http://memegenerator.net/instance/28936247

It appears I have taken up a new post as the Iraqi Information Minister.
This is why I was disturbed to see the thing hitting Phoronix and then
Slashdot: as the guy whose FSes are being eaten, this is probably not an
easy bug to hit! If it hits, the consequences are serious, but it
doesn't seem to be easy to hit. (I should perhaps have phrased the
subject line better, but I'd just had my $HOME eaten and was rather
stressed out...)

> But do please note that that Fedora !7 users have been using 3.6.2 for
> a while, so if this were an easily triggered bug, (a) Eric and I would
> have managed to reproduce it by now, and (b) lots of people would be
> complaining, since the symptoms of the bug are not subtle.

Quite.
Felipe Contreras Oct. 25, 2012, 5:02 p.m. UTC | #12
On Wed, Oct 24, 2012 at 11:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:

> But do please note that that Fedora !7 users have been using 3.6.2 for
> a while,

Same in Arch Linux.
Eric Sandeen Oct. 26, 2012, 3:25 p.m. UTC | #13
On 10/24/12 3:17 PM, Jan Kara wrote:
> On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
>> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
>>> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>>>
>>>> It is now quite clear that this is a bug introduced by one or more of
>>>> the post-3.6.1 ext4 patches (which have all been backported at least to
>>>> 3.5, so the problem is probably there too).
>>>>
>>>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
>>>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>>>>
>>>
>>> I think I've found the problem.  I believe the commit at fault is commit
>>> 14b4ed22a6 (upstream commit eeecef0af5e):
>>>
>>>     jbd2: don't write superblock when if its empty
>>>
>>> which first appeared in v3.6.2.
>>>
>>> The reason why the problem happens rarely is that the effect of the
>>> buggy commit is that if the journal's starting block is zero, we fail
>>> to truncate the journal when we unmount the file system.  This can
>>> happen if we mount and then unmount the file system fairly quickly,
>>> before the log has a chance to wrap.After the first time this has
>>> happened, it's not a disaster, since when we replay the journal, we'll
>>> just replay some extra transactions.  But if this happens twice, the
>>> oldest valid transaction will still not have gotten updated, but some
>>> of the newer transactions from the last mount session will have gotten
>>> written by the very latest transacitons, and when we then try to do
>>> the extra transaction replays, the metadata blocks can end up getting
>>> very scrambled indeed.
>>
>> I'm stumped by this; maybe Ted can see if I'm missing something.
>>
>> (and Nix, is there anything special about your fs?  Any nondefault
>> mkfs or mount options, external journal, inordinately large fs, or
>> anything like that?)
>>
>> The suspect commit added this in jbd2_mark_journal_empty():
>>
>>         /* Is it already empty? */
>>         if (sb->s_start == 0) {
>>                 read_unlock(&journal->j_state_lock);
>>                 return;
>>         }
>>
>> thereby short circuiting the function.
>>
>> But Ted's suggestion that mounting the fs, doing a little work, and
>> unmounting before we wrap would lead to this doesn't make sense to
>> me.  When I do a little work, s_start is at 1, not 0.  We start
>> the journal at s_first:
>>
>> load_superblock()
>> 	journal->j_first = be32_to_cpu(sb->s_first);
>>
>> And when we wrap the journal, we wrap back to j_first:
>>
>> jbd2_journal_next_log_block():
>>         if (journal->j_head == journal->j_last)
>>                 journal->j_head = journal->j_first;
>>
>> and j_first comes from s_first, which is set at journal creation
>> time to be "1" for an internal journal.
>>
>> So s_start == 0 sure looks special to me; so far I can only see that
>> we get there if we've been through jbd2_mark_journal_empty() already,
>> though I'm eyeballing jbd2_journal_get_log_tail() as well.
>>
>> Ted's proposed patch seems harmless but so far I don't understand
>> what problem it fixes, and I cannot recreate getting to
>> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
>   Agreed. I rather thing we might miss journal->j_flags |= JBD2_FLUSHED
> when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
> see how that would cause the corruption...

Agreed, except so far I cannot see any way to get here with s_start == 0
without ALREADY having JBD2_FLUSHED set.  Can you?

Anyway, I think the problem is still poorly understood; lots of random facts
floating about, and a pretty weird usecase with nonstandard/dangerous mount
options.  I do want to figure out what regressed (if anything) but so far
this investigation doesn't seem very methodical.

-Eric

> 								Honza
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0f16edd..0064181 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1351,18 +1351,20 @@  void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	__be32		new_tail_sequence;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	read_lock(&journal->j_state_lock);
-	/* Is it already empty? */
-	if (sb->s_start == 0) {
+	new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
+	/* Nothing to do? */
+	if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) {
 		read_unlock(&journal->j_state_lock);
 		return;
 	}
 	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
 		  journal->j_tail_sequence);
 
-	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
+	sb->s_sequence = new_tail_sequence;
 	sb->s_start    = cpu_to_be32(0);
 	read_unlock(&journal->j_state_lock);