Message ID | 4AF4A429.7090507@redhat.com |
---|---|
State | Superseded, archived |
Headers | show |
On 2009-11-06, at 15:33, Eric Sandeen wrote: > commit a71ce8c6c9bf269b192f352ea555217815cf027e updated > ext4_statfs() to update the on-disk superblock counters, > but modified this buffer directly without any journaling of > the change. This is one of the accesses that was causing the crc > errors in journal replay as seen in kernel.org bugzilla #14354. > > Signed-off-by: Eric Sandeen <sandeen@redhat.com> > --- > > > @@ -3650,20 +3652,33 @@ static int ext4_statfs(struct dentry > *dentry, struct kstatfs *buf) > + handle = ext4_journal_start_sb(sb, 1); > + if (IS_ERR(handle)) { > + err = PTR_ERR(handle); > + goto out; > + } > + err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh); > + if (err) > + goto journal_stop; > + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); > + ext4_free_blocks_count_set(es, buf->f_bfree); > + err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh); I admit to being the instigator of this change. The intention is that we want to update the on-disk superblock block/ inode counters from the per-cpu data periodically, since they are never updated anymore (only the group summaries are updated, to avoid contention). However, this isn't critical work, since it is only useful for read-only e2fsck not reporting spurious errors on the filesystem and dumpe2fs/debugfs having some chance at reporting a reasonable value for the filesystem space usage. Starting a transaction as part of statfs is really counter-productive to making that code efficient, which was the whole point of the original patch to remove the per-call "overhead" calculation. The intention was that the in-memory superblock would be updated whenever statfs is called (this doesn't cost anything, since we've already computed the value for statfs), and if the superblock is written to disk for some other reason they will go along for the ride. If the choice is between adding a proper transaction here, or not doing this at all, I'd rather just not do it at all. Of course, I'd like to work out some kind of compromise, like only updating the superblock when there is already a shadow BH that is being used to write to the journal, or similar. If there is a desire to keep a transaction here and update the superblock counters, it _definitely_ doesn't need to be done on every statfs, but at most once every 30 seconds or whatever. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andreas Dilger wrote: > On 2009-11-06, at 15:33, Eric Sandeen wrote: >> commit a71ce8c6c9bf269b192f352ea555217815cf027e updated >> ext4_statfs() to update the on-disk superblock counters, but >> modified this buffer directly without any journaling of the change. >> This is one of the accesses that was causing the crc errors in >> journal replay as seen in kernel.org bugzilla #14354. >> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com> --- ... > I admit to being the instigator of this change. > > The intention is that we want to update the on-disk superblock > block/inode counters from the per-cpu data periodically, since they > are never updated anymore (only the group summaries are updated, to > avoid contention). However, this isn't critical work, since it is > only useful for read-only e2fsck not reporting spurious errors on the > filesystem and dumpe2fs/debugfs having some chance at reporting a > reasonable value for the filesystem space usage. > > Starting a transaction as part of statfs is really counter-productive > to making that code efficient, which was the whole point of the > original patch to remove the per-call "overhead" calculation. > > The intention was that the in-memory superblock would be updated > whenever statfs is called (this doesn't cost anything, since we've > already computed the value for statfs), and if the superblock is > written to disk for some other reason they will go along for the > ride. > > If the choice is between adding a proper transaction here, or not > doing this at all, I'd rather just not do it at all. Of course, I'd > like to work out some kind of compromise, like only updating the > superblock when there is already a shadow BH that is being used to > write to the journal, or similar. > > If there is a desire to keep a transaction here and update the > superblock counters, it _definitely_ doesn't need to be done on every > statfs, but at most once every 30 seconds or whatever. You know, I think I thought about all that, and I wrote the patch anyway somehow; blame a late friday evening for that one. :) I'll think of a better route to take. Thanks, -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote: > If the choice is between adding a proper transaction here, or not > doing this at all, I'd rather just not do it at all. Of course, I'd > like to work out some kind of compromise, like only updating the > superblock when there is already a shadow BH that is being used to > write to the journal, or similar. In practice, the superblock is never going to modified in normal operations, unless a resize happens to be happening. Since we already force the superblock summary counters to be correct during an unmount or file system freeze, we really only need this so that it's correct after a file system crash. I don't think people generally end up calling statfs() all that frequently, so it's not clear how much adding a 30 second throttle would help. Maybe we should just not bother trying to update the superblock at all on a statfs()? Hmm... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote: >> If the choice is between adding a proper transaction here, or not >> doing this at all, I'd rather just not do it at all. Of course, I'd >> like to work out some kind of compromise, like only updating the >> superblock when there is already a shadow BH that is being used to >> write to the journal, or similar. > > In practice, the superblock is never going to modified in normal > operations, unless a resize happens to be happening. Since we already > force the superblock summary counters to be correct during an unmount > or file system freeze, we really only need this so that it's correct > after a file system crash. > > I don't think people generally end up calling statfs() all that > frequently, so it's not clear how much adding a 30 second throttle > would help. Maybe we should just not bother trying to update the > superblock at all on a statfs()? for now maybe that's better.... But don't we journal the superblock sometimes, not others ... for example write_super -> ext4_write_super -> ext4_commit_super does no journaling of superblock modifications. ext4_orphan_add, however, does. This would likely lead to trouble w/ the debugging patch ... though I didn't see it ... ? So I was premature w/ this patch, I think. Maybe we could unconditionally do the copy-out in jbd2_journal_write_metadata_buffer() ...? -Eric > Hmm... > > - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2009-11-08, at 14:48, Theodore Tso wrote: > On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote: >> If the choice is between adding a proper transaction here, or not >> doing this at all, I'd rather just not do it at all. Of course, I'd >> like to work out some kind of compromise, like only updating the >> superblock when there is already a shadow BH that is being used to >> write to the journal, or similar. > > In practice, the superblock is never going to modified in normal > operations, unless a resize happens to be happening. Actually, Eric had a important case where the superblock is modified is whenever any file is added to the orphan list, so this basically happens all the time. When the orphan code was added, it made sense to put the head of the orphan list on the superblock because it was updated for every truncate to change the free block counters. > Since we already force the superblock summary counters to be correct > during an unmount or file system freeze, we really only need this so > that it's correct after a file system crash. > > I don't think people generally end up calling statfs() all that > frequently, so it's not clear how much adding a 30 second throttle > would help. Maybe we should just not bother trying to update the > superblock at all on a statfs()? The reason we added this was for running a read-only e2fsck on a filesystem without getting spurious errors just because the superblock summaries were incorrect. The other alternative is to change e2fsck so that it doesn't consider just a block/inode summary an error. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Nov 08, 2009 at 04:09:40PM -0600, Eric Sandeen wrote: > > But don't we journal the superblock sometimes, not others ... for > example write_super -> ext4_write_super -> ext4_commit_super does no > journaling of superblock modifications. ext4_orphan_add, however, does. > This would likely lead to trouble w/ the debugging patch ... though I > didn't see it ... ? Ah, I had forgotten about ext4_orphan_add(); that is indeed the one place where we would be updating the super block under normal operations, besides online-resize. I've been looking at the write_super() paths, and from what I can tell it's only used in two places. The generic fsync() handler, file_fsync(), which we do't use, and sync_supers(), which will indeed call write_super() -> ext4_write_super() if sb->s_dirt is set. That led me to examine the places where we set s_dirt, and it's in a lot of places where we're no longer modifying the superblock any more, but we're still setting sb->s_dirt. I don't know why you didn't see problems with the debugging patch; the only thing I can think of is that since the actual superblock update is deferred to a timer-triggered callback, you were getting consistently lucky --- which is hard for me to believe, but I don't have a better suggestion. What I think we do need to do is eliminate all of the places where we set sb->s_dirt, and if we need to update the superblock, we do it ourselves, under journaling control. That leaves places which call ext4_commit_super() directly, which is at mount and unmount time (which should be OK, as long as it's before or after journalling is active) and when we freeze the filesystem, which might be OK, but we need to take a careful look at it. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2009-11-09, at 05:53, Theodore Tso wrote: > On Sun, Nov 08, 2009 at 04:09:40PM -0600, Eric Sandeen wrote: >> But don't we journal the superblock sometimes, not others ... for >> example write_super -> ext4_write_super -> ext4_commit_super does no >> journaling of superblock modifications. ext4_orphan_add, however, >> does. >> This would likely lead to trouble w/ the debugging patch ... though I >> didn't see it ... ? > > Ah, I had forgotten about ext4_orphan_add(); that is indeed the one > place where we would be updating the super block under normal > operations, besides online-resize. > > I've been looking at the write_super() paths, and from what I can tell > it's only used in two places. The generic fsync() handler, > file_fsync(), which we do't use, and sync_supers(), which will indeed > call write_super() -> ext4_write_super() if sb->s_dirt is set. That > led me to examine the places where we set s_dirt, and it's in a lot of > places where we're no longer modifying the superblock any more, but > we're still setting sb->s_dirt. I don't know why you didn't see > problems with the debugging patch; the only thing I can think of is > that since the actual superblock update is deferred to a > timer-triggered callback, you were getting consistently lucky --- > which is hard for me to believe, but I don't have a better suggestion. I suspect this is because the only thing that changes in the superblock these days is the orphan list, so out-of-order writes to the superblock will at worst result in a few entries added/missing from the orphan list. I do recall that there are "inodes from a corrupt orphan list found" messages seen occasionally during full e2fsck runs, but it has never been important enough to investigate. > What I think we do need to do is eliminate all of the places where we > set sb->s_dirt, and if we need to update the superblock, we do it > ourselves, under journaling control. We have to ensure that writeout of the superblock is still being done correctly during non-journal mode operation. > That leaves places which call ext4_commit_super() directly, which is > at mount and unmount time (which should be OK, as long as it's before > or after journalling is active) and when we freeze the filesystem, > which might be OK, but we need to take a careful look at it. We also write out the superblock directly in ext4_error(), so that the EXT4_ERROR_FS flag is written to disk (if at all possible) rather than putting the superblock into a journal transaction that will not be replayed (due to the transaction never committing after the journal is aborted or the node panics). Since that will be in the last transaction anyways (unless errors=continue is used) I don't see it as a major problem. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 08370ae..b02daf6 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3605,6 +3605,8 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf) struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_super_block *es = sbi->s_es; u64 fsid; + handle_t *handle; + int err = 0; if (test_opt(sb, MINIX_DF)) { sbi->s_overhead_last = 0; @@ -3650,20 +3652,33 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf) buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last; buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) - percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter); - ext4_free_blocks_count_set(es, buf->f_bfree); buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es); if (buf->f_bfree < ext4_r_blocks_count(es)) buf->f_bavail = 0; buf->f_files = le32_to_cpu(es->s_inodes_count); buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter); - es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); buf->f_namelen = EXT4_NAME_LEN; fsid = le64_to_cpup((void *)es->s_uuid) ^ le64_to_cpup((void *)es->s_uuid + sizeof(u64)); buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL; buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL; - return 0; + handle = ext4_journal_start_sb(sb, 1); + if (IS_ERR(handle)) { + err = PTR_ERR(handle); + goto out; + } + err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh); + if (err) + goto journal_stop; + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); + ext4_free_blocks_count_set(es, buf->f_bfree); + err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh); + +journal_stop: + ext4_journal_stop(handle); +out: + return err; } /* Helper function for writing quotas on sync - we need to start transaction
commit a71ce8c6c9bf269b192f352ea555217815cf027e updated ext4_statfs() to update the on-disk superblock counters, but modified this buffer directly without any journaling of the change. This is one of the accesses that was causing the crc errors in journal replay as seen in kernel.org bugzilla #14354. Signed-off-by: Eric Sandeen <sandeen@redhat.com> --- -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html