| Submitter | Eric Sandeen |
|---|---|
| Date | Oct. 28, 2012, 4:23 a.m. |
| Message ID | <508CB35D.9080102@redhat.com> |
| Download | mbox | patch |
| Permalink | /patch/194647/ |
| State | Superseded |
| Headers | show |
Comments
On 28 Oct 2012, Eric Sandeen outgrape: > I've tested this by mounting with journal_checksum and > running fsstress then dropping power; I've also tested by > hacking DM to create snapshots w/o first quiescing, which > allows me to test journal replay repeatedly w/o actually > power-cycling the box. Without the patch I hit a journal > checksum error every time. With this fix it survives > many iterations. Works for me too. Looks like you fixed it. btw, I dug back through my old notes and found the reason I turned on journal_async_commit in the first place, back in 2009. It's got nothing to do with performance: I was trying to save power. The server that's been having all this trouble has four WD GreenPower disks. These are rightly maligned for leaving the way they save power undocumented, changing it without notice, and having most of those power-saving methods be extremely silly (e.g. complete spindowns, with a nonresponsive disk for N seconds until it spins up again). But I was lucky and got disks that actually did save power without being damaged. By observing the machine's power draw I was able to guess that they spin partway down (rumour says to ~2500rpm) after about eight seconds with no activity at all, following which it takes a fairly large burst of activity to get them to spin up again: they can service low loads without spinning up. Unfortunately back in 2009 ext4's journalling was preventing them from ever spinning down, since even on idle rw-mounted filesystems it was touching the disk with what blktrace said was something journal-related once every five seconds, so the disks decided they should never spin down, costing me about 10W of power. Now laptop_mode would have solved this problem, but laptop_mode makes other changes that I didn't want (e.g. changing the way dirty writeout is done so that all writeout is lumpy: smooth dirty writeout is fine, I just don't want the disks touched all the time when the system is actually idle). So I turned on commit=30... and nothing changed, a steady pulse of commits every five seconds. Only when I threw caution to the wind and tried turning on journal_async_commit (even though the description of its function seemed quite unrelated) did the commit rate slow to one every 30s as the commit interval suggested. I now suspect this was a bug, or multiple bugs, and I should have reported it rather than flailing around to try to get things working -- but whatever the problem it has by now been fixed: journal committing is now working rather better, one commit every 15s, even with async commit turned off. (It is peculiar that I'm seeing one commit every 15s when I asked for commit=30, but it's less often than once every 8s and that's what matters to me.)
Patch
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 4facdd2..1d18fba 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -706,10 +706,17 @@ got_group: continue; } - brelse(inode_bitmap_bh); + if (inode_bitmap_bh) { + ext4_handle_release_buffer(handle, inode_bitmap_bh); + brelse(inode_bitmap_bh); + } inode_bitmap_bh = ext4_read_inode_bitmap(sb, group); if (!inode_bitmap_bh) goto fail; + BUFFER_TRACE(inode_bitmap_bh, "get_write_access"); + err = ext4_journal_get_write_access(handle, inode_bitmap_bh); + if (err) + goto fail; repeat_in_this_group: ino = ext4_find_next_zero_bit((unsigned long *) @@ -734,10 +741,16 @@ repeat_in_this_group: if (ino < EXT4_INODES_PER_GROUP(sb)) goto repeat_in_this_group; } + ext4_handle_release_buffer(handle, inode_bitmap_bh); err = -ENOSPC; goto out; got: + BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); + err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh); + if (err) + goto fail; + /* We may have to initialize the block bitmap if it isn't already */ if (ext4_has_group_desc_csum(sb) && gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { @@ -771,11 +784,6 @@ got: goto fail; } - BUFFER_TRACE(inode_bitmap_bh, "get_write_access"); - err = ext4_journal_get_write_access(handle, inode_bitmap_bh); - if (err) - goto fail; - BUFFER_TRACE(group_desc_bh, "get_write_access"); err = ext4_journal_get_write_access(handle, group_desc_bh); if (err) @@ -823,11 +831,6 @@ got: } ext4_unlock_group(sb, group); - BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); - err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh); - if (err) - goto fail; - BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh); if (err)
commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed ext4_new_inode() such that the inode bitmap was being modified outside a transaction, which could lead to corruption, and was discovered when journal_checksum found a bad checksum in the journal during log replay. Nix ran into this when using the journal_async_commit mount option, which enables journal checksumming. The ensuing journal replay failures due to the bad checksums led to filesystem corruption reported as the now infamous "Apparent serious progressive ext4 data corruption bug" I've tested this by mounting with journal_checksum and running fsstress then dropping power; I've also tested by hacking DM to create snapshots w/o first quiescing, which allows me to test journal replay repeatedly w/o actually power-cycling the box. Without the patch I hit a journal checksum error every time. With this fix it survives many iterations. Signed-off-by: Eric Sandeen <sandeen@redhat.com> cc: Nix <nix@esperi.org.uk> --- A little more going on here to try to properly handle error cases & moving to the next group; despite ext4_handle_release_buffer being a no-op, I've tried to sprinkle it in at the right places. Double checking on review is probably a fine idea ;) -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html