Patchwork ext4: fix unjournaled inode bitmap modification

login
register
mail settings
Submitter Eric Sandeen
Date Oct. 28, 2012, 4:23 a.m.
Message ID <508CB35D.9080102@redhat.com>
Download mbox | patch
Permalink /patch/194647/
State Superseded
Headers show

Comments

Eric Sandeen - Oct. 28, 2012, 4:23 a.m.
commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
ext4_new_inode() such that the inode bitmap was being modified
outside a transaction, which could lead to corruption, and was
discovered when journal_checksum found a bad checksum in the
journal during log replay.

Nix ran into this when using the journal_async_commit mount
option, which enables journal checksumming.  The ensuing
journal replay failures due to the bad checksums led to
filesystem corruption reported as the now infamous
"Apparent serious progressive ext4 data corruption bug"

I've tested this by mounting with journal_checksum and
running fsstress then dropping power; I've also tested by
hacking DM to create snapshots w/o first quiescing, which
allows me to test journal replay repeatedly w/o actually
power-cycling the box.  Without the patch I hit a journal
checksum error every time.  With this fix it survives
many iterations.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
cc: Nix <nix@esperi.org.uk>
---

A little more going on here to try to properly handle error
cases & moving to the next group; despite
ext4_handle_release_buffer being a no-op, I've tried
to sprinkle it in at the right places.  Double checking
on review is probably a fine idea ;)


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nix - Oct. 28, 2012, 1:59 p.m.
On 28 Oct 2012, Eric Sandeen outgrape:

> I've tested this by mounting with journal_checksum and
> running fsstress then dropping power; I've also tested by
> hacking DM to create snapshots w/o first quiescing, which
> allows me to test journal replay repeatedly w/o actually
> power-cycling the box.  Without the patch I hit a journal
> checksum error every time.  With this fix it survives
> many iterations.

Works for me too. Looks like you fixed it.


btw, I dug back through my old notes and found the reason I turned on
journal_async_commit in the first place, back in 2009. It's got nothing
to do with performance: I was trying to save power.

The server that's been having all this trouble has four WD GreenPower
disks. These are rightly maligned for leaving the way they save power
undocumented, changing it without notice, and having most of those
power-saving methods be extremely silly (e.g. complete spindowns, with a
nonresponsive disk for N seconds until it spins up again). But I was
lucky and got disks that actually did save power without being damaged.

By observing the machine's power draw I was able to guess that they spin
partway down (rumour says to ~2500rpm) after about eight seconds with no
activity at all, following which it takes a fairly large burst of
activity to get them to spin up again: they can service low loads
without spinning up.

Unfortunately back in 2009 ext4's journalling was preventing them from
ever spinning down, since even on idle rw-mounted filesystems it was
touching the disk with what blktrace said was something journal-related
once every five seconds, so the disks decided they should never spin
down, costing me about 10W of power. Now laptop_mode would have solved
this problem, but laptop_mode makes other changes that I didn't want
(e.g. changing the way dirty writeout is done so that all writeout is
lumpy: smooth dirty writeout is fine, I just don't want the disks
touched all the time when the system is actually idle).

So I turned on commit=30... and nothing changed, a steady pulse of
commits every five seconds. Only when I threw caution to the wind and
tried turning on journal_async_commit (even though the description of
its function seemed quite unrelated) did the commit rate slow to one
every 30s as the commit interval suggested.

I now suspect this was a bug, or multiple bugs, and I should have
reported it rather than flailing around to try to get things working --
but whatever the problem it has by now been fixed: journal committing is
now working rather better, one commit every 15s, even with async commit
turned off. (It is peculiar that I'm seeing one commit every 15s when I
asked for commit=30, but it's less often than once every 8s and that's
what matters to me.)

Patch

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4facdd2..1d18fba 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -706,10 +706,17 @@  got_group:
 			continue;
 		}
 
-		brelse(inode_bitmap_bh);
+		if (inode_bitmap_bh) {
+			ext4_handle_release_buffer(handle, inode_bitmap_bh);
+			brelse(inode_bitmap_bh);
+		}
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		if (!inode_bitmap_bh)
 			goto fail;
+		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+		if (err)
+			goto fail;
 
 repeat_in_this_group:
 		ino = ext4_find_next_zero_bit((unsigned long *)
@@ -734,10 +741,16 @@  repeat_in_this_group:
 		if (ino < EXT4_INODES_PER_GROUP(sb))
 			goto repeat_in_this_group;
 	}
+	ext4_handle_release_buffer(handle, inode_bitmap_bh);
 	err = -ENOSPC;
 	goto out;
 
 got:
+	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+	if (err)
+		goto fail;
+
 	/* We may have to initialize the block bitmap if it isn't already */
 	if (ext4_has_group_desc_csum(sb) &&
 	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
@@ -771,11 +784,6 @@  got:
 			goto fail;
 	}
 
-	BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
-	err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
-	if (err)
-		goto fail;
-
 	BUFFER_TRACE(group_desc_bh, "get_write_access");
 	err = ext4_journal_get_write_access(handle, group_desc_bh);
 	if (err)
@@ -823,11 +831,6 @@  got:
 	}
 	ext4_unlock_group(sb, group);
 
-	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
-	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
-	if (err)
-		goto fail;
-
 	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
 	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
 	if (err)