diff mbox

[2/2] ext4: journal superblock modifications in ext4_statfs()

Message ID 4AF4A429.7090507@redhat.com
State Superseded, archived
Headers show

Commit Message

Eric Sandeen Nov. 6, 2009, 10:33 p.m. UTC
commit a71ce8c6c9bf269b192f352ea555217815cf027e updated
ext4_statfs() to update the on-disk superblock counters,
but modified this buffer directly without any journaling of
the change.  This is one of the accesses that was causing 
the crc errors in journal replay as seen in kernel.org 
bugzilla #14354.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andreas Dilger Nov. 7, 2009, 12:26 a.m. UTC | #1
On 2009-11-06, at 15:33, Eric Sandeen wrote:
> commit a71ce8c6c9bf269b192f352ea555217815cf027e updated
> ext4_statfs() to update the on-disk superblock counters,
> but modified this buffer directly without any journaling of
> the change.  This is one of the accesses that was causing the crc  
> errors in journal replay as seen in kernel.org bugzilla #14354.
>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
>
>
> @@ -3650,20 +3652,33 @@ static int ext4_statfs(struct dentry  
> *dentry, struct kstatfs *buf)
> +	handle = ext4_journal_start_sb(sb, 1);
> +	if (IS_ERR(handle)) {
> +		err = PTR_ERR(handle);
> +		goto out;
> +	}
> +	err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
> +	if (err)
> +		goto journal_stop;
> +	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
> +	ext4_free_blocks_count_set(es, buf->f_bfree);
> +	err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);


I admit to being the instigator of this change.

The intention is that we want to update the on-disk superblock block/ 
inode
counters from the per-cpu data periodically, since they are never  
updated
anymore (only the group summaries are updated, to avoid contention).   
However,
this isn't critical work, since it is only useful for read-only e2fsck  
not
reporting spurious errors on the filesystem and dumpe2fs/debugfs  
having some
chance at reporting a reasonable value for the filesystem space usage.

Starting a transaction as part of statfs is really counter-productive  
to making
that code efficient, which was the whole point of the original patch  
to remove
the per-call "overhead" calculation.

The intention was that the in-memory superblock would be updated  
whenever statfs
is called (this doesn't cost anything, since we've already computed  
the value
for statfs), and if the superblock is written to disk for some other  
reason they
will go along for the ride.

If the choice is between adding a proper transaction here, or not  
doing this at
all, I'd rather just not do it at all.  Of course, I'd like to work  
out some
kind of compromise, like only updating the superblock when there is  
already a shadow BH that is being used to write to the journal, or  
similar.

If there is a desire to keep a transaction here and update the  
superblock
counters, it _definitely_ doesn't need to be done on every statfs, but  
at most
once every 30 seconds or whatever.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Nov. 7, 2009, 1:08 a.m. UTC | #2
Andreas Dilger wrote:
> On 2009-11-06, at 15:33, Eric Sandeen wrote:
>> commit a71ce8c6c9bf269b192f352ea555217815cf027e updated 
>> ext4_statfs() to update the on-disk superblock counters, but
>> modified this buffer directly without any journaling of the change.
>> This is one of the accesses that was causing the crc errors in
>> journal replay as seen in kernel.org bugzilla #14354.
>> 
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com> ---

...

> I admit to being the instigator of this change.
> 
> The intention is that we want to update the on-disk superblock
> block/inode counters from the per-cpu data periodically, since they
> are never updated anymore (only the group summaries are updated, to
> avoid contention). However, this isn't critical work, since it is
> only useful for read-only e2fsck not reporting spurious errors on the
> filesystem and dumpe2fs/debugfs having some chance at reporting a
> reasonable value for the filesystem space usage.
> 
> Starting a transaction as part of statfs is really counter-productive
> to making that code efficient, which was the whole point of the
> original patch to remove the per-call "overhead" calculation.
> 
> The intention was that the in-memory superblock would be updated 
> whenever statfs is called (this doesn't cost anything, since we've
> already computed the value for statfs), and if the superblock is
> written to disk for some other reason they will go along for the
> ride.
> 
> If the choice is between adding a proper transaction here, or not
> doing this at all, I'd rather just not do it at all.  Of course, I'd
> like to work out some kind of compromise, like only updating the
> superblock when there is already a shadow BH that is being used to
> write to the journal, or similar.
> 
> If there is a desire to keep a transaction here and update the
> superblock counters, it _definitely_ doesn't need to be done on every
> statfs, but at most once every 30 seconds or whatever.

You know, I think I thought about all that, and I wrote the patch anyway 
somehow; blame a late friday evening for that one.  :)

I'll think of a better route to take.

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Nov. 8, 2009, 9:48 p.m. UTC | #3
On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote:
> If the choice is between adding a proper transaction here, or not
> doing this at all, I'd rather just not do it at all.  Of course, I'd
> like to work out some kind of compromise, like only updating the
> superblock when there is already a shadow BH that is being used to
> write to the journal, or similar.

In practice, the superblock is never going to modified in normal
operations, unless a resize happens to be happening.  Since we already
force the superblock summary counters to be correct during an unmount
or file system freeze, we really only need this so that it's correct
after a file system crash.

I don't think people generally end up calling statfs() all that
frequently, so it's not clear how much adding a 30 second throttle
would help.  Maybe we should just not bother trying to update the
superblock at all on a statfs()?

Hmm...

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Nov. 8, 2009, 10:09 p.m. UTC | #4
Theodore Tso wrote:
> On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote:
>> If the choice is between adding a proper transaction here, or not
>> doing this at all, I'd rather just not do it at all.  Of course, I'd
>> like to work out some kind of compromise, like only updating the
>> superblock when there is already a shadow BH that is being used to
>> write to the journal, or similar.
> 
> In practice, the superblock is never going to modified in normal
> operations, unless a resize happens to be happening.  Since we already
> force the superblock summary counters to be correct during an unmount
> or file system freeze, we really only need this so that it's correct
> after a file system crash.
> 
> I don't think people generally end up calling statfs() all that
> frequently, so it's not clear how much adding a 30 second throttle
> would help.  Maybe we should just not bother trying to update the
> superblock at all on a statfs()?

for now maybe that's better....

But don't we journal the superblock sometimes, not others ... for 
example write_super -> ext4_write_super -> ext4_commit_super does no 
journaling of superblock modifications. ext4_orphan_add, however, does. 
  This would likely lead to trouble w/ the debugging patch ... though I 
didn't see it ... ?

So I was premature w/ this patch, I think.

Maybe we could unconditionally do the copy-out in 
jbd2_journal_write_metadata_buffer() ...?

-Eric

> Hmm...
> 
> 						- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Nov. 9, 2009, 4:41 a.m. UTC | #5
On 2009-11-08, at 14:48, Theodore Tso wrote:
> On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote:
>> If the choice is between adding a proper transaction here, or not
>> doing this at all, I'd rather just not do it at all.  Of course, I'd
>> like to work out some kind of compromise, like only updating the
>> superblock when there is already a shadow BH that is being used to
>> write to the journal, or similar.
>
> In practice, the superblock is never going to modified in normal
> operations, unless a resize happens to be happening.

Actually, Eric had a important case where the superblock is modified
is whenever any file is added to the orphan list, so this basically
happens all the time.  When the orphan code was added, it made sense
to put the head of the orphan list on the superblock because it was
updated for every truncate to change the free block counters.

> Since we already force the superblock summary counters to be correct
> during an unmount or file system freeze, we really only need this so
> that it's correct after a file system crash.
>
> I don't think people generally end up calling statfs() all that
> frequently, so it's not clear how much adding a 30 second throttle
> would help.  Maybe we should just not bother trying to update the
> superblock at all on a statfs()?


The reason we added this was for running a read-only e2fsck on a
filesystem without getting spurious errors just because the superblock
summaries were incorrect.  The other alternative is to change e2fsck
so that it doesn't consider just a block/inode summary an error.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Nov. 9, 2009, 12:53 p.m. UTC | #6
On Sun, Nov 08, 2009 at 04:09:40PM -0600, Eric Sandeen wrote:
>
> But don't we journal the superblock sometimes, not others ... for  
> example write_super -> ext4_write_super -> ext4_commit_super does no  
> journaling of superblock modifications. ext4_orphan_add, however, does.  
> This would likely lead to trouble w/ the debugging patch ... though I  
> didn't see it ... ?

Ah, I had forgotten about ext4_orphan_add(); that is indeed the one
place where we would be updating the super block under normal
operations, besides online-resize.

I've been looking at the write_super() paths, and from what I can tell
it's only used in two places.  The generic fsync() handler,
file_fsync(), which we do't use, and sync_supers(), which will indeed
call write_super() -> ext4_write_super() if sb->s_dirt is set.  That
led me to examine the places where we set s_dirt, and it's in a lot of
places where we're no longer modifying the superblock any more, but
we're still setting sb->s_dirt.  I don't know why you didn't see
problems with the debugging patch; the only thing I can think of is
that since the actual superblock update is deferred to a
timer-triggered callback, you were getting consistently lucky ---
which is hard for me to believe, but I don't have a better suggestion.

What I think we do need to do is eliminate all of the places where we
set sb->s_dirt, and if we need to update the superblock, we do it
ourselves, under journaling control.

That leaves places which call ext4_commit_super() directly, which is
at mount and unmount time (which should be OK, as long as it's before
or after journalling is active) and when we freeze the filesystem,
which might be OK, but we need to take a careful look at it.

            	       	       	       	  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Nov. 9, 2009, 5:55 p.m. UTC | #7
On 2009-11-09, at 05:53, Theodore Tso wrote:
> On Sun, Nov 08, 2009 at 04:09:40PM -0600, Eric Sandeen wrote:
>> But don't we journal the superblock sometimes, not others ... for
>> example write_super -> ext4_write_super -> ext4_commit_super does no
>> journaling of superblock modifications. ext4_orphan_add, however,  
>> does.
>> This would likely lead to trouble w/ the debugging patch ... though I
>> didn't see it ... ?
>
> Ah, I had forgotten about ext4_orphan_add(); that is indeed the one
> place where we would be updating the super block under normal
> operations, besides online-resize.
>
> I've been looking at the write_super() paths, and from what I can tell
> it's only used in two places.  The generic fsync() handler,
> file_fsync(), which we do't use, and sync_supers(), which will indeed
> call write_super() -> ext4_write_super() if sb->s_dirt is set.  That
> led me to examine the places where we set s_dirt, and it's in a lot of
> places where we're no longer modifying the superblock any more, but
> we're still setting sb->s_dirt.  I don't know why you didn't see
> problems with the debugging patch; the only thing I can think of is
> that since the actual superblock update is deferred to a
> timer-triggered callback, you were getting consistently lucky ---
> which is hard for me to believe, but I don't have a better suggestion.

I suspect this is because the only thing that changes in the superblock
these days is the orphan list, so out-of-order writes to the superblock
will at worst result in a few entries added/missing from the orphan  
list.
I do recall that there are "inodes from a corrupt orphan list found"
messages seen occasionally during full e2fsck runs, but it has never
been important enough to investigate.

> What I think we do need to do is eliminate all of the places where we
> set sb->s_dirt, and if we need to update the superblock, we do it
> ourselves, under journaling control.

We have to ensure that writeout of the superblock is still being done
correctly during non-journal mode operation.

> That leaves places which call ext4_commit_super() directly, which is
> at mount and unmount time (which should be OK, as long as it's before
> or after journalling is active) and when we freeze the filesystem,
> which might be OK, but we need to take a careful look at it.


We also write out the superblock directly in ext4_error(), so that the
EXT4_ERROR_FS flag is written to disk (if at all possible) rather than
putting the superblock into a journal transaction that will not be
replayed (due to the transaction never committing after the journal is
aborted or the node panics).  Since that will be in the last transaction
anyways (unless errors=continue is used) I don't see it as a major  
problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 08370ae..b02daf6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3605,6 +3605,8 @@  static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_super_block *es = sbi->s_es;
 	u64 fsid;
+	handle_t *handle;
+	int err = 0;
 
 	if (test_opt(sb, MINIX_DF)) {
 		sbi->s_overhead_last = 0;
@@ -3650,20 +3652,33 @@  static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
 	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
 		       percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
-	ext4_free_blocks_count_set(es, buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
 	if (buf->f_bfree < ext4_r_blocks_count(es))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
 	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
-	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT4_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
 	       le64_to_cpup((void *)es->s_uuid + sizeof(u64));
 	buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
 	buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
 
-	return 0;
+	handle = ext4_journal_start_sb(sb, 1);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		goto out;
+	}
+	err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
+	if (err)
+		goto journal_stop;
+	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
+	ext4_free_blocks_count_set(es, buf->f_bfree);
+	err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+
+journal_stop:
+	ext4_journal_stop(handle);
+out:
+	return err;
 }
 
 /* Helper function for writing quotas on sync - we need to start transaction