Patchwork Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

login
register
mail settings
Submitter Theodore Ts'o
Date Oct. 24, 2012, 5:23 a.m.
Message ID <20121024052351.GB21714@thunk.org>
Download mbox | patch
Permalink /patch/193684/
State Superseded
Headers show

Comments

Theodore Ts'o - Oct. 24, 2012, 5:23 a.m.
On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
> 
> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
> journal_async_commit, but that doesn't mean this isn't a regression.

Note that Toralf has reported almost exactly the same set of symptoms,
but he's using an external USB stick --- and as far as I know he
wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
you confirm what, if any, mount options you were using when you saw
it.

I've been looking at this some more, and there's one other thing that
the short circuit code does, which is neglects setting the
JBD2_FLUSHED flag, which is used by the commit code to know when it
needs to reset the s_start fields in the superblock when we make our
next commit.  However, this would only happen if the short circuit
code is getting hit some time other than when the file system is
getting unmounted --- and that's what Eric and I can't figure out how
it might be happening.  Journal flushes outside of an unmount does
happen as part of online resizing, the FIBMAP ioctl, or when the file
system is frozen.  But it didn't sound like Toralf or Nix was using
any of those features.  (Toralf, Nix, please correct me if my
assumptions here is wrong).

So here's a replacement patch which essentially restores the effects
of eeecef0af5e while still keeping the optimization and fixing the
read/only testing issue which eeecef0af5e is trying to fix up.  It
also have a debugging printk that will trigger so we can perhaps have
a better chance of figuring out what might be going on.

Toralf, Nix, if you could try applying this patch (at the end of this
message), and let me know how and when the WARN_ON triggers, and if it
does, please send the empty_bug_workaround plus the WARN_ON(1) report.
I know about the case where a file system is mounted and then
immediately unmounted, but we don't think that's the problematic case.
If you see any other cases where WARN_ON is triggering, it would be
really good to know....

      	    	       		      - Ted

P.S.  This is a list of all of the commits between v3.6.1 and v3.6.2
(there were no ext4-related changes between v3.6.2 and v3.6.3), and a
quick analysis of the patch.  The last commit, 14b4ed2, is the only
one that I could see as potentially being problematic, which is why
I've been pushing so hard on this one even though my original analysis
doesn't seem to be correct, and Eric and I can't see how the change in
14b4ed2 could be causing the fs corruption.


Online Defrag
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hugh Dickins - Oct. 24, 2012, 7 a.m.
On Wed, 24 Oct 2012, Theodore Ts'o wrote:
> Journal flushes outside of an unmount does
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here is wrong).

I believe it also happens at swapon of a swapfile on the filesystem.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 24, 2012, 11:45 a.m.
On 24 Oct 2012, Theodore Ts'o stated:

>                         Journal flushes outside of an unmount does
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.

Quite so -- the corrupted filesystems have space reserved for resizing,
and one of them has been resized, years ago, but I haven't resized
either of them with this kernel, or with any kernel numbered 3.x for
that matter.

> Toralf, Nix, if you could try applying this patch (at the end of this
> message), and let me know how and when the WARN_ON triggers, and if it
> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
> I know about the case where a file system is mounted and then
> immediately unmounted, but we don't think that's the problematic case.
> If you see any other cases where WARN_ON is triggering, it would be
> really good to know....

I'll give it a test later today, after another backup has finished.
Daily backups are normally overkill, but I don't think they are right
now.
Nix - Oct. 24, 2012, 11:46 a.m.
On 24 Oct 2012, Hugh Dickins verbalised:

> On Wed, 24 Oct 2012, Theodore Ts'o wrote:
>> Journal flushes outside of an unmount does
>> happen as part of online resizing, the FIBMAP ioctl, or when the file
>> system is frozen.  But it didn't sound like Toralf or Nix was using
>> any of those features.  (Toralf, Nix, please correct me if my
>> assumptions here is wrong).
>
> I believe it also happens at swapon of a swapfile on the filesystem.

I'm not using swapfiles, only swap partitions (on separate LVM LVs).
So that's not it either.
Eric Sandeen - Oct. 24, 2012, 5:22 p.m.
On 10/24/2012 12:23 AM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
>>
>> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
>> journal_async_commit, but that doesn't mean this isn't a regression.
> 
> Note that Toralf has reported almost exactly the same set of symptoms,
> but he's using an external USB stick --- and as far as I know he
> wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
> you confirm what, if any, mount options you were using when you saw
> it.
> 
> I've been looking at this some more, and there's one other thing that
> the short circuit code does, which is neglects setting the
> JBD2_FLUSHED flag, which is used by the commit code to know when it
> needs to reset the s_start fields in the superblock when we make our
> next commit.  However, this would only happen if the short circuit
> code is getting hit some time other than when the file system is
> getting unmounted --- and that's what Eric and I can't figure out how
> it might be happening.  Journal flushes outside of an unmount does
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here is wrong).

If I freeze w/ anything in the log, then s_start !=0 and we proceed
normally.  If I re-freeze w/o anything in the log, it's already set to
FLUSHED (which makes sense) so not re-setting it doesn't matter.  So I
don't see that that's an issue.

As for FIBMAP I think we only do journal_flush if it's data=journal.

In other news, Phoronix is on the case, so expect escalating freakouts ;)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 24, 2012, 7:49 p.m.
On 24 Oct 2012, Theodore Ts'o spake thusly:
> Toralf, Nix, if you could try applying this patch (at the end of this
> message), and let me know how and when the WARN_ON triggers, and if it
> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
> I know about the case where a file system is mounted and then
> immediately unmounted, but we don't think that's the problematic case.
> If you see any other cases where WARN_ON is triggering, it would be
> really good to know....

Confirmed, it triggers. Traceback below.


But first, a rather lengthy apology: I did indeed forget something
unusual about my system. In my defence, this is a change I made to my
shutdown scripts many years ago, when umount -l was first introduced
(early 2000s? something like that). So it's not surprising I forgot
about it until I needed to add sleeps to it to capture the tracebacks
below. It is really ugly. You may need a sick bag. In brief: some of my
filesystems will sometimes be uncleanly unmounted and experience journal
replay even on clean shutdowns, and which it is will vary unpredictably.


Some of my machines have fairly intricate webs of NFS-mounted and
non-NFS-mounted filesystems, and I expect them all to reboot
successfully if commanded remotely, because sometimes I'm hundreds of
miles away when I do it and can hardly hit the reset button.

Unfortunately, if I have a mount structure like this:

/usr         local
/usr/foo     NFS-mounted (may be loopback-NFS-mounted)
/usr/foo/bar local

and /usr/foo is down, any attempt to umount /usr/foo/bar will hang
indefinitely. Worse yet, if I umount the nfs filesystem, the local fs
isn't going to be reachable either -- but umounting nfs filesystems has
to happen first so I can killall everything (which would include e.g.
rpc.statd and rpc.nfsd) in order to free up the local filesystems for
umount.

The only way I could see to fix this is to umount -l everything rather
than umounting it (sure, I could do some sort of NFS-versus-non-NFS
analysis and only do this to some filesystems, but testing this
complexity for the -- for me -- rare case of system shutdown was too
annoying to consider). I consider a hang on shutdown much worse than an
occasional unclean umount, because all my filesystems are journalled so
journal recovery will make everything quite happy.

So I do

sync
umount -a -l -t nfs & sleep 2
killall5 -15
killall5 -9
exportfs -ua
quotaoff -a
swapoff -a
LANG=C sort -r -k 2 /proc/mounts | \
(DIRS=""
 while read DEV DIR TYPE REST; do
     case "$DIR" in
         /|/proc|/dev|/proc/*|/sys)
             continue;; # Ignoring virtual file systems needed later
     esac

     case $TYPE in
         proc|procfs|sysfs|usbfs|usbdevfs|devpts)
             continue;; # Ignoring non-tmpfs virtual file systems
     esac
     DIRS="$DIRS $DIR"
done
umount -l -r -d $DIRS) # rely on mount's toposort
sleep 2

The net effect of this being to cleanly umount everything whose mount
points are reachable and which unmounts cleanly in less than a couple of
seconds, and to leave the rest mounted and let journal recovery handle
them. This is clearly really horrible -- I'd far prefer to say 'sleep
until filesystems have finished doing I/O' or better have mount just not
return from mount(8) unless that is true. But this isn't available, and
even it was some fses would still be left to journal recovery, so I
kludged it -- and then forgot about doing anything to improve the
situation for many years.

So, the net effect of this is that normally I get no journal recovery on
anything at all -- but sometimes, if umounting takes longer than a few
seconds, I reboot with not everything unmounted, and journal recovery
kicks in on reboot. My post-test fscks this time suggest that only when
journal recovery kicks in after rebooting out of 2.6.3 do I see
corruption. So this is indeed an unclean shutdown journal-replay
situation: it just happens that I routinely have one or two fses
uncleanly unmounted when all the rest are cleanly unmounted. This
perhaps explains the scattershot nature of the corruption I see, and why
most of my ext4 filesystems get off scot-free.

I'll wait for a minute until you're finished projectile-vomiting. (And
if you have suggestions for making the case of nested local/rewmote
filesystems work without rebooting while umounts may still be in
progress, or even better suggestions to allow me to umount mounts that
happen to be mounted below NFS-mounted mounts with dead or nonresponsive
NFS server, I'd be glad to hear them! Distros appear to take the
opposite tack, and prefer to simply lock up forever waiting for a
nonresponsive NFS server in this situation. I could never accept that.)


[...]

OK. That umount of local filesystems sprayed your added
empty bug workaround and WARN_ONs so many times that nearly all of them
scrolled off the screen -- and because syslogd was dead by now and this
is where my netconsole logs go, they're lost. I suspect every single
umounted filesystem sprayed one of these (and this happened long before
any reboot-before-we're-done).

But I did the old trick of camera-capturing the last one (which was
probably /boot, which has never got corrupted because I hardly ever
write anything to it at all). I hope it's more useful than nothing. (I
can rearrange things to umount /var last, and try again, if you think
that a specific warning from an fs known to get corrupted is especially
likely to be valuable.)

So I see, for one umount at least (and the chunk of the previous one
that scrolled offscreen is consistent with this):

jbd2_mark_journal_empty bug workaround (21218, 21219)
[obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+06c/0xbd
...
[addresses omitted for sanity: traceback only]
warn_slowpath_common+0x83/0x9b
warn_slowpath_null+0x1a/0x1c
jbd2_mark_journal_empty+06c/0xbd
jbd2_journal_destroy+0x183/0x20c
? abort_exclusive_wait+0x8e/0x8e
ext4_put_super+0x6c/0x316
? evict_inodes+0xe6/0xf1
generic_shutdown_super+0x59/0xd1
? free_vfsmnt+0x18/0x3c
kill_block_super+0x27/0x6a
deactivate_locked_super+0x26/0x57
deactivate_super+0x3f/0x43
mntput_no_expire+0x134/0x13c
sys_umount+0x308/0x33a
system_call_fastpath+0x16/0x1b
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 24, 2012, 7:54 p.m.
On 24 Oct 2012, nix@esperi.org.uk uttered the following:
> So, the net effect of this is that normally I get no journal recovery on
> anything at all -- but sometimes, if umounting takes longer than a few
> seconds, I reboot with not everything unmounted, and journal recovery
> kicks in on reboot. My post-test fscks this time suggest that only when
> journal recovery kicks in after rebooting out of 2.6.3 do I see
> corruption. So this is indeed an unclean shutdown journal-replay
> situation: it just happens that I routinely have one or two fses
> uncleanly unmounted when all the rest are cleanly unmounted. This
> perhaps explains the scattershot nature of the corruption I see, and why
> most of my ext4 filesystems get off scot-free.

Note that two umounts are not required: fsck found corruption on /var
after a single boot+shutdown round in 3.6.3+this patch. (It did do a
journal replay on /var first.)
Eric Sandeen - Oct. 24, 2012, 8:30 p.m.
On 10/24/2012 02:49 PM, Nix wrote:
> On 24 Oct 2012, Theodore Ts'o spake thusly:
>> Toralf, Nix, if you could try applying this patch (at the end of this
>> message), and let me know how and when the WARN_ON triggers, and if it
>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>> I know about the case where a file system is mounted and then
>> immediately unmounted, but we don't think that's the problematic case.
>> If you see any other cases where WARN_ON is triggering, it would be
>> really good to know....
> 
> Confirmed, it triggers. Traceback below.
> 

<giant snip>

The warn on triggers, but I can't tell - did the corruption still occur
with Ted's patch?

-Eric

> 
> OK. That umount of local filesystems sprayed your added
> empty bug workaround and WARN_ONs so many times that nearly all of them
> scrolled off the screen -- and because syslogd was dead by now and this
> is where my netconsole logs go, they're lost. I suspect every single
> umounted filesystem sprayed one of these (and this happened long before
> any reboot-before-we're-done).
> 
> But I did the old trick of camera-capturing the last one (which was
> probably /boot, which has never got corrupted because I hardly ever
> write anything to it at all). I hope it's more useful than nothing. (I
> can rearrange things to umount /var last, and try again, if you think
> that a specific warning from an fs known to get corrupted is especially
> likely to be valuable.)
> 
> So I see, for one umount at least (and the chunk of the previous one
> that scrolled offscreen is consistent with this):
> 
> jbd2_mark_journal_empty bug workaround (21218, 21219)
> [obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+06c/0xbd
> ...
> [addresses omitted for sanity: traceback only]
> warn_slowpath_common+0x83/0x9b
> warn_slowpath_null+0x1a/0x1c
> jbd2_mark_journal_empty+06c/0xbd
> jbd2_journal_destroy+0x183/0x20c
> ? abort_exclusive_wait+0x8e/0x8e
> ext4_put_super+0x6c/0x316
> ? evict_inodes+0xe6/0xf1
> generic_shutdown_super+0x59/0xd1
> ? free_vfsmnt+0x18/0x3c
> kill_block_super+0x27/0x6a
> deactivate_locked_super+0x26/0x57
> deactivate_super+0x3f/0x43
> mntput_no_expire+0x134/0x13c
> sys_umount+0x308/0x33a
> system_call_fastpath+0x16/0x1b

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 24, 2012, 8:34 p.m.
On 24 Oct 2012, Eric Sandeen uttered the following:

> On 10/24/2012 02:49 PM, Nix wrote:
>> On 24 Oct 2012, Theodore Ts'o spake thusly:
>>> Toralf, Nix, if you could try applying this patch (at the end of this
>>> message), and let me know how and when the WARN_ON triggers, and if it
>>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>>> I know about the case where a file system is mounted and then
>>> immediately unmounted, but we don't think that's the problematic case.
>>> If you see any other cases where WARN_ON is triggering, it would be
>>> really good to know....
>> 
>> Confirmed, it triggers. Traceback below.
>
> <giant snip>
>
> The warn on triggers, but I can't tell - did the corruption still occur
> with Ted's patch?

Yes. I fscked the filesystems in 3.6.1 after rebooting: /var had a
journal replay, and the usual varieties of corruption (free space bitmap
problems and multiply-claimed blocks). (The other filesystems for which
the warning triggered had neither a journal replay nor corruption.
At least one of them, /home, likely had a few writes but not enough to
cause a journal wrap.)

I note that the warning may well *not* have triggered for /var: if the
reason it had a journal replay was simply that it was still in use by
something that hadn't died, the umount -l will have avoided doing a full
umount for that filesystem alone.

Also, the corrupted filesystem was mounted in 3.6.3 exactly once.
Multiple umounts are not necessary, but an unclean umount apparently is.
Nix - Oct. 24, 2012, 8:45 p.m.
On 24 Oct 2012, nix@esperi.org.uk spake thusly:
> So, the net effect of this is that normally I get no journal recovery on
> anything at all -- but sometimes, if umounting takes longer than a few
> seconds, I reboot with not everything unmounted, and journal recovery
> kicks in on reboot.

It occurs to me that it is possible that this bug hits only those
filesystems for which a umount has started but been unable to complete.
If so, this is a relatively rare and unimportant bug which probably hits
only me and users of slow removable filesystems in the whole world...
Theodore Ts'o - Oct. 24, 2012, 9:08 p.m.
On Wed, Oct 24, 2012 at 09:45:47PM +0100, Nix wrote:
> 
> It occurs to me that it is possible that this bug hits only those
> filesystems for which a umount has started but been unable to complete.
> If so, this is a relatively rare and unimportant bug which probably hits
> only me and users of slow removable filesystems in the whole world...

Can you verify this?  Does the bug show up if you just hit the power
switch while the system is booted?

How about changing the "sleep 2" to "sleep 0.5"?  (Feel free to
unmount your other partitions, and just leave a test file system
mounted to minimize the chances that you lose partitions that require
hours and hours to restore...)

If you can get a very reliable repro, we might have to ask you to try
the following experiments:

0) Make sure the reliable repro does _not_ work with 3.6.1 booted

1) Try a 3.6.2 kernel

2) (If the problem shows up above) try a 3.6.2 kernel with 14b4ed2 reverted

3) (If the problem shows up above) try a 3.6.2 kernel with all of ext4
   related patches reverted:
92b7722 ext4: fix mtime update in nodelalloc mode
34414b2 ext4: fix fdatasync() for files with only i_size changes
12ebdf0 ext4: always set i_op in ext4_mknod()
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of BUG'ing
14b4ed2 jbd2: don't write superblock when if its empty

4) (If the problem still shows up) then we may need to do a full
   bisect to figure out what is going on....

   	     	     	    	     	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 24, 2012, 11:27 p.m.
On 24 Oct 2012, Theodore Ts'o verbalised:

> On Wed, Oct 24, 2012 at 09:45:47PM +0100, Nix wrote:
>> 
>> It occurs to me that it is possible that this bug hits only those
>> filesystems for which a umount has started but been unable to complete.
>> If so, this is a relatively rare and unimportant bug which probably hits
>> only me and users of slow removable filesystems in the whole world...
>
> Can you verify this?  Does the bug show up if you just hit the power
> switch while the system is booted?

Verified! You do indeed need to do passing strange things to trigger
this bug -- not surprising, really, or everyone and his dog would have
reported it by now. As it is, I'm sorry this hit slashdot, because it
reflects unnecessarily badly on a filesystem that is experiencing
problems only when people do rather insane things to it.

> How about changing the "sleep 2" to "sleep 0.5"?

I tried the following:

 - /sbin/reboot -f of running system
   -> Journal replay, no problems other than the expected free block
      count problems. This is not such a severe problem after all!

 - Normal shutdown, but a 60 second pause after lazy umount, more than
   long enough for all umounts to proceed to termination
   -> no corruption, but curiously /home experienced a journal replay
      before being fscked, even though a cat of /proc/mounts after
      umounting revealed that the only mounted filesystem was /,
      read-only, so /home should have been clean

 - Normal shutdown, a 60 second pause after lazy umount of everything
   other than /var, and then a umount of /var the instant before
   reboot, no sleep at all
   -> massive corruption just as seen before.

Unfortunately, the massive corruption in the last testcase was seen in
3.6.1 as well as 3.6.3: it appears that the only effect that superblock
change had in 3.6.3 was to make this problem easier to hit, and that the
bug itself was introduced probably somewhere between 3.5 and 3.6 (though
I only rebooted 3.5.x twice, and it's rare enough before 3.6.[23], at
~1/20 boots, that it may have been present for longer and I never
noticed).

So the problem is caused by rebooting or powering off or disconnecting
the device *while* umounting a filesystem with a dirty journal, and
might have been introduced by I/O scheduler changes or who knows what
other changes, not just ext4 changes, since the order of block writes by
umount is clearly at issue.

Even though my own system relies on the possibility of rebooting during
umount to reboot reliably, I'd be inclined to say 'not a bug, don't do
that then' -- except that this renders it unreliable to use umount -l to
unmount all the filesystems you can, skipping those that are not
reachable due to having unresponsive servers in the way. As far as I can
tell, there is *no* way to tell when a lazy umount has completed, except
perhaps for polling /proc/mounts: and there is no way at all to tell
when a lazy umount switches from 'waiting for the last process to stop
using me, you can reboot without incident' to 'doing umount, rebooting
is disastrous'. And unfortunately I want to reboot if we're in the
former state, but not in the latter.

(It also makes it unreliable to use ext4 on devices like USB sticks that
might suddenly get disconnected during a umount.)


Further, it seems to me that this makes it dangerous to ever use umount
-l at all, even during normal system operation, since the real umount
might only start when all processes are killed at system shutdown, and
the reboot could well kick in before the umount has finished.

It also appears impossible for me to reliably shut my system down,
though a 60s timeout after lazy umount and before reboot is likely to
work in all but the most pathological of cases (where a downed NFS
server comes up at just the wrong instant): it is clear that the
previous 5s timeout eventually became insufficient simply because of the
amount of time it can take to do a umount on today's larger filesystems.

Truly, my joy is unbounded :(

> 0) Make sure the reliable repro does _not_ work with 3.6.1 booted

Oh dear. Sorry :(((

I can try to bisect this and track down which kernel release it appeared
in -- if it isn't expected behaviour, of course, which is perfectly
possible: rebooting during a umount is at best questionable. But I can't
do anything that lengthy before the weekend, I'm afraid.
Nix - Oct. 24, 2012, 11:42 p.m.
On 25 Oct 2012, nix@esperi.org.uk said:
> Even though my own system relies on the possibility of rebooting during
> umount to reboot reliably, I'd be inclined to say 'not a bug, don't do
> that then' -- except that this renders it unreliable to use umount -l to
> unmount all the filesystems you can, skipping those that are not
> reachable due to having unresponsive servers in the way.

It's worse than that. If you're using filesystem namespaces, how can
*any* shell script loop, or anything in userspace, reliably unmount all
filesystems before reboot? It seems to me this is impossible. There is
no process that necessarily has access to all namespaces, and when you
bring PID namespaces into the picture there is no process that can even
kill all userspace processes in order to zap their filesystems.

I suspect we need a new blocking 'umountall' syscall and a command that
calls it, which umounts everything it can in every filesystem namespace
it can, skipping those that are (unreachable?) network mounts, and
returns only when everything is done. (Possibly it should first kill
every process it sees in every PID namespace other than that of the
caller, too.)

Then shutdown scripts can just call this, and get the right behaviour
immediately.
Theodore Ts'o - Oct. 25, 2012, 1:10 a.m.
On Thu, Oct 25, 2012 at 12:27:02AM +0100, Nix wrote:
>
>  - /sbin/reboot -f of running system
>    -> Journal replay, no problems other than the expected free block
>       count problems. This is not such a severe problem after all!
> 
>  - Normal shutdown, but a 60 second pause after lazy umount, more than
>    long enough for all umounts to proceed to termination
>    -> no corruption, but curiously /home experienced a journal replay
>       before being fscked, even though a cat of /proc/mounts after
>       umounting revealed that the only mounted filesystem was /,
>       read-only, so /home should have been clean

Question: how are you doing the journal replay?  Is it happening as
part of running e2fsck, or are you mounting the file system and
letting kernel do the journal replay?

Also, can you reproduce the problem with the nobarrier and
journal_async_commit options *removed*?  Yes, I know you have battery
backup, but it would be interesting to see if the problem shows up in
the default configuration with none of the more specialist options.
(So it would probably be good to test with journal_checksum removed as
well.)

If that does make the problem go away, that will be a very interesting
data point....

> Unfortunately, the massive corruption in the last testcase was seen in
> 3.6.1 as well as 3.6.3: it appears that the only effect that superblock
> change had in 3.6.3 was to make this problem easier to hit, and that the
> bug itself was introduced probably somewhere between 3.5 and 3.6 (though
> I only rebooted 3.5.x twice, and it's rare enough before 3.6.[23], at
> ~1/20 boots, that it may have been present for longer and I never
> noticed).

Hmm.... ok.  Can you tell whether or not the 2nd patch I posted on
this thread made any difference to how frequently it happened?  The
main difference with 3.6.3 with 2nd patch applied compared to 3.6.1 is
that if it detects that the journal superblock update is a no-op, it
skips the write request.  With 3.6.1, it submits the journal
superblock write regardless of whether or not it would be a no-op.  So
if my patch isn't making a difference to the freqency to when you are
seeing the corruption, then it must be the write request itself which
is important.

When you say it's rare before 3.6.[23], how rare is it?  How reliably
can you trigger it under 3.6.1?  One in 3?  One in 5?  One in 20?

As far as bisecting, one experiment that I'd really appreciate your
doing is to check and see whether you can reproduce the problem using
the 3.4 kernel, and if you can, to see if it reproduces under the 3.3
kernel.

The reason why I ask this is there were not any major changes between
3.5 and 3.6, or between 3.4 and 3.5.  There *were* however, some
fairly major changes made by Jan Kara that were introduced between 3.3
and 3.4.  Among other things, this is where we started using FUA
(Force Unit Attention) writes to update the journal superblock instead
of just using REQ_FLUSH.  This is in fact the most likely place where
we might have introduced the regression, since it wouldn't surprise me
if Jan didn't test the case of using nobarrier with a storage array
with battery backup (I certainly didn't, since I don't have easy
access to such fancy toys :-).

> It also appears impossible for me to reliably shut my system down,
> though a 60s timeout after lazy umount and before reboot is likely to
> work in all but the most pathological of cases (where a downed NFS
> server comes up at just the wrong instant): it is clear that the
> previous 5s timeout eventually became insufficient simply because of the
> amount of time it can take to do a umount on today's larger filesystems.

Something that you might want to consider trying is after you kill all
of the processes, remount all of the local disk file systems
read-only, then kick off the unmount of the NFS file systems (just to
be nice to the NFS servers, so they are notified of the unmount), and
then just force the reboot.  If the file systems have been remounted
r/o, that will cause the journal to be shutdown cleanly, and all of
the write flushed out.  (Modulo issues with nobarrier, but that's a
separate issue.  I'm now thinking that a smart thing to do might be
force a flush on an unmount or remount r/o, regardless of whether
nobarrier is specified, just to make sure everything is written out
before the poweroff, battery backup or no.)

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 25, 2012, 1:45 a.m.
On 25 Oct 2012, Theodore Ts'o stated:

> On Thu, Oct 25, 2012 at 12:27:02AM +0100, Nix wrote:
>>
>>  - /sbin/reboot -f of running system
>>    -> Journal replay, no problems other than the expected free block
>>       count problems. This is not such a severe problem after all!
>> 
>>  - Normal shutdown, but a 60 second pause after lazy umount, more than
>>    long enough for all umounts to proceed to termination
>>    -> no corruption, but curiously /home experienced a journal replay
>>       before being fscked, even though a cat of /proc/mounts after
>>       umounting revealed that the only mounted filesystem was /,
>>       read-only, so /home should have been clean
>
> Question: how are you doing the journal replay?  Is it happening as
> part of running e2fsck, or are you mounting the file system and
> letting kernel do the journal replay?

This most recent instance was e2fsck. Normally, it's mount. Both
seem able to yield the same corruption.

> Also, can you reproduce the problem with the nobarrier and
> journal_async_commit options *removed*?  Yes, I know you have battery
> backup, but it would be interesting to see if the problem shows up in
> the default configuration with none of the more specialist options.
> (So it would probably be good to test with journal_checksum removed as
> well.)

I'll try that, hopefully tomorrow sometime. It's 2:30am now and probably
time to sleep.

>> Unfortunately, the massive corruption in the last testcase was seen in
>> 3.6.1 as well as 3.6.3: it appears that the only effect that superblock
>> change had in 3.6.3 was to make this problem easier to hit, and that the
>> bug itself was introduced probably somewhere between 3.5 and 3.6 (though
>> I only rebooted 3.5.x twice, and it's rare enough before 3.6.[23], at
>> ~1/20 boots, that it may have been present for longer and I never
>> noticed).
>
> Hmm.... ok.  Can you tell whether or not the 2nd patch I posted on
> this thread made any difference to how frequently it happened?  The

Well, I had a couple of reboots without corruption with that patch
applied, and /home was only ever corrupted with it not applied -- but
that could perfectly well be chance, since I only had two or three
instances of /home corruption so far, thank goodness.

> When you say it's rare before 3.6.[23], how rare is it?  How reliably
> can you trigger it under 3.6.1?  One in 3?  One in 5?  One in 20?

I've rebooted out of 3.6.1 about fifteen times so far. I've seen once
instance of corruption. I've never seen it before 3.6, but I only
rebooted 3.5.x or 3.4.x once or twice in total, so that too could be
chance.

> As far as bisecting, one experiment that I'd really appreciate your
> doing is to check and see whether you can reproduce the problem using
> the 3.4 kernel, and if you can, to see if it reproduces under the 3.3
> kernel.

Will try. It might be the weekend before I can find the time though :(

> The reason why I ask this is there were not any major changes between
> 3.5 and 3.6, or between 3.4 and 3.5.  There *were* however, some
> fairly major changes made by Jan Kara that were introduced between 3.3
> and 3.4.  Among other things, this is where we started using FUA
> (Force Unit Attention) writes to update the journal superblock instead
> of just using REQ_FLUSH.  This is in fact the most likely place where
> we might have introduced the regression, since it wouldn't surprise me
> if Jan didn't test the case of using nobarrier with a storage array
> with battery backup (I certainly didn't, since I don't have easy
> access to such fancy toys :-).

Hm. At boot, I see this for both volumes on the Areca controller:

[    0.855376] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    0.855465] sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

So it looks to me like FUA changes themselves could have little effect.

(btw, the controller cost only about £150... if it was particularly
fancy I certainly couldn't have afforded it.)

>> It also appears impossible for me to reliably shut my system down,
>> though a 60s timeout after lazy umount and before reboot is likely to
>> work in all but the most pathological of cases (where a downed NFS
>> server comes up at just the wrong instant): it is clear that the
>> previous 5s timeout eventually became insufficient simply because of the
>> amount of time it can take to do a umount on today's larger filesystems.
>
> Something that you might want to consider trying is after you kill all
> of the processes, remount all of the local disk file systems
> read-only, then kick off the unmount of the NFS file systems (just to
> be nice to the NFS servers, so they are notified of the unmount), and

Actually I umount NFS first of all, because if I kill the processes
first, this causes trouble with the NFS unmounts, particularly if I'm
doing self-mounting (which I do sometimes, though not at the moment).

I will certainly try a readonly remount instead.

> force a flush on an unmount or remount r/o, regardless of whether
> nobarrier is specified, just to make sure everything is written out
> before the poweroff, battery backup or no.)

I'm rather surprised that doesn't happen anyway. I always thought it
did.
Nix - Oct. 25, 2012, 11:06 a.m.
On 25 Oct 2012, Theodore Ts'o stated:
> Also, can you reproduce the problem with the nobarrier and
> journal_async_commit options *removed*?  Yes, I know you have battery
> backup, but it would be interesting to see if the problem shows up in
> the default configuration with none of the more specialist options.
> (So it would probably be good to test with journal_checksum removed as
> well.)

I'm going to spend some time after work today trying to reproduce this
in a virtual machine. If rebooting while umounting is truly all it
takes, as it seems, this should eliminate a bunch of variables and make
it a lot easier to reproduce. (Assuming Eric hasn't already done just
that, that is.)
Theodore Ts'o - Oct. 25, 2012, 2:12 p.m.
I've been thinking about this some more, and if you don't have a lot
of time, perhaps the most important test to do is this.  Does the
chance of your seeing corrupted files in v3.6.3 go down if you run
3.6.3 with commit 14b4ed22a6 reverted?  Keep your current
configuration, using nobarrier, et. al, constant.  If reverting the
commit makes things better, then that's what would be most useful to
know as soon as possible, since the correct short-term solution is to
revert that commit for 3.7-rcX, as well as the 3.6 and 3.5 stable
kernels.

We can investigate later whether nobarrier, journal_async_commit seem
to make the problem worse, and whether the less common corruption case
that you were seeing with 3.6.1 was actually a change which was
introduced between 3.3 and 3.4.

But most importantly, even if the bug doesn't show up with the default
mount options at all (which explains why Eric and I weren't able to
reproduce it), there are probably other users using nobarrier, so if
the frequency with which you were seeing corruptions went up
significantly between 3.6.1 and 3.6.3, and reverting 14b4ed22a6 brings
the frequency back down to what you were seeing with 3.6.1, we should
do that ASAP.

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 25, 2012, 2:15 p.m.
On 25 Oct 2012, Theodore Ts'o stated:

> I've been thinking about this some more, and if you don't have a lot
> of time,

I've got time, but it's this weekend, not during the week :)

>          perhaps the most important test to do is this.  Does the
> chance of your seeing corrupted files in v3.6.3 go down if you run
> 3.6.3 with commit 14b4ed22a6 reverted?

This I can verify, sometime this evening. (I presume what we're really
interested in is whether the window in which files get corrupted has
narrowed such that my 5s sleep after umount is now long enough to have a
lower likelihood of corruption, since we know that a near-0s sleep after
umount causes corruption almost every time on 3.6.1 as well: I've now
done that three times and got corruption every time.)

> But most importantly, even if the bug doesn't show up with the default
> mount options at all (which explains why Eric and I weren't able to
> reproduce it), there are probably other users using nobarrier, so if
> the frequency with which you were seeing corruptions went up
> significantly between 3.6.1 and 3.6.3, and reverting 14b4ed22a6 brings
> the frequency back down to what you were seeing with 3.6.1, we should
> do that ASAP.

Agreed.
Nix - Oct. 25, 2012, 5:39 p.m.
On 25 Oct 2012, nix@esperi.org.uk said:

> This I can verify, sometime this evening.

Sometime *tomorrow* evening. This has been quite stressful and I can
hardly keep my eyes open. I'm not doing anything risky in this state.
Nix - Oct. 26, 2012, 12:22 a.m.
On 25 Oct 2012, Theodore Ts'o told this:
> If that does make the problem go away, that will be a very interesting
> data point....

I'll be looking at this tomorrow, but as sod's law would have it I have
another user on this machine who didn't want it mega-rebooted tonight,
so I was reduced to trying to reproduce the problem in virtualization
under qemu.

I failed, for one very simple reason: on 3.6.3, even with a umount -l
still in the process of unmounting the fs and flushing changes, even on
an fs mounted nobarrier,journal_async_commit, even when mounted atop
LVM, reboot(2) will block until umount's writeout is complete (and lvm
vgchange refuses to deactivate the volume group while that is happening,
but I don't bother deactivating volume groups on the afflicted machine
so I know that can't be related). Obviously, this suffices to ensure
that a reboot is not possible while umounts are underway -- though a
power cut is still possible, I suppose.

On the afflicted machine (with a block device stack running LVM, then
libata/arcmsr), as far as I can tell reboot(8) is *not* blocking if a
unmount is underway: it shoots down everything and restarts at once. I
have no direct proof of this yet, but during the last week I've
routinely seen it reboot with lots of writes underway and umount -l log
messages streaming up the screen: it certainly doesn't wait for all the
umount -l's to be done the way it is in virtualization. I have no idea
how this can be possible: I thought fses on a block device had to be
quiesced (thus, in the case of an ongoing umount, unmounted and flushed)
before any attempt at all was made to shut the underlying block device
down, and I'd be fairly surprised if a flush wasn't done even if
nobarrier was active (it certainly seems to be for virtio-blk, but that
may well be a special case). But arcmsr (or libata? I should test with a
simulated libata rather than virtio-blk next) appears to be getting
around that somehow. This would probably explain all sorts of horrible
corruption if umounting during a reboot, right?

So maybe it's the stack of block devices that's at fault, and not the
filesystem at all! I'll admit I don't really understand what happens at
system halt time well enough to be sure, and getting log info from a
machine in the middle of reboot(8) appears likely to be a complete sod
(maybe halt(8) would be better: at least I could take a photo of the
screen then). If that's true, it would *certainly* explain why nobody
else can see this problem: only arcmsr users who also do umount -l's
would have a chance, and that population probably has a size of one.


I'll try to prove this tomorrow by writing a few gigs of junk to a temp
filesytem held open by a temporary cat /dev/null, umount -l'ing it and
killing off the cat the instant before the reboot -f call. If I don't
see the reboot call blocking, the hypothesis is proved. (This is much
what I did in virtualization, where I observe reboot blocking.)

(Another blockdev-related possibility, if reboot *is* observed to block,
is that arcmsr may be throwing away very-recently-written data when the
adapter is shut down right before reboot.)


Argh. How can rebooting a system be so damn complicated. Bring back the
C64 or BBC Master where I could just pull the power lead out and stick
it back in. :)
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

=============
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
   No behavioral change unless e4defrag has been used.

Online Resize
=============
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of BUG'ing
   No observable change unless online resizing (e2resize) has been used

Other Commits
=============
92b7722 ext4: fix mtime update in nodelalloc mode
   Changes where we call file_update_time()

34414b2 ext4: fix fdatasync() for files with only i_size changes
   Forces the inode changes to be commited if only i_sync changes when
   fdatasync() is called.  No changes except performance impact
   to fdatasync() and correctness after a system crash.

12ebdf0 ext4: always set i_op in ext4_mknod()
   Fixes a bug if CONFIG_EXT4_FS_XATTR is not defined;
   no change if CONFIG_EXT4_FS_XATTR is defined

2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
   Remove an erroneous "static" for an function so it is allocated on the stack;
   fixes a bug if two processes cat /proc/mounts at the same time

1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
   Fixes a circular lock dependency

14b4ed2 jbd2: don't write superblock when if its empty
   If journal->s_start is zero, we may not update journal->s_sequence when
   it might be needed.  (But we at the moement we can't see how this could
   lead to the reported fs corruptions.)


commit cb57108637e01ec2f02d9311cedc3013e96f25d4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Wed Oct 24 01:01:41 2012 -0400

    jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
    
    Fix a potential file system corrupting bug which was introduced by
    commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
    superblock when if its empty.
    
    We should only skip writing the journal superblock if there is nothing
    to do --- not just when s_start is zero.
    
    This has caused users to report file system corruptions in ext4 that
    look like this:
    
    EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
    JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
    
    after the file system has been corrupted.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0f16edd..26b2983 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1351,24 +1351,33 @@  void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	__be32		new_tail_sequence;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	read_lock(&journal->j_state_lock);
-	/* Is it already empty? */
+	new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
+	/* Nothing to do? */
 	if (sb->s_start == 0) {
+		pr_err("JBD2: jbd2_mark_journal_empty bug workaround (%u, %u)\n",
+		       (unsigned) be32_to_cpu(sb->s_sequence),
+		       (unsigned) be32_to_cpu(new_tail_sequence));
+		WARN_ON(1);
+	}
+	if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) {
 		read_unlock(&journal->j_state_lock);
-		return;
+		goto set_flushed;
 	}
 	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
 		  journal->j_tail_sequence);
 
-	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
+	sb->s_sequence = new_tail_sequence;
 	sb->s_start    = cpu_to_be32(0);
 	read_unlock(&journal->j_state_lock);
 
 	jbd2_write_superblock(journal, WRITE_FUA);
 
-	/* Log is no longer empty */
+set_flushed:
+	/* Log is empty */
 	write_lock(&journal->j_state_lock);
 	journal->j_flags |= JBD2_FLUSHED;
 	write_unlock(&journal->j_state_lock);