Patchwork ext4: turn on i_version updates by default

login
register
mail settings
Submitter J. Bruce Fields
Date May 14, 2012, 2:06 p.m.
Message ID <20120514140618.GA29902@fieldses.org>
Download mbox | patch
Permalink /patch/159024/
State Superseded
Headers show

Comments

J. Bruce Fields - May 14, 2012, 2:06 p.m.
knfsd needs i_version updates on, as will userspace nfs servers and
probably others.

The only effects are that inode->i_version is bumped (under the i_lock)
in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
more frequently than once per jiffy on write (see file_update_time).
However the latter appears to be mostly a no-op in that case.

So, simplify our life and just keep this feature turned on all the time.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/ext4/super.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

If people are worried that the performance impact isn't obvious, I'll
find the time somehow to test this properly....
Andreas Dilger - May 14, 2012, 3:02 p.m.
On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> knfsd needs i_version updates on, as will userspace nfs servers and
> probably others.
> 
> The only effects are that inode->i_version is bumped (under the i_lock)
> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> more frequently than once per jiffy on write (see file_update_time).
> However the latter appears to be mostly a no-op in that case.

I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?

This is one of the reasons that the i_version update is conditional. If someone is exporting a filesystem from userspace the should be able to turn this on as a mount option, and knfsd could do it from inside the kernel. Why add overhead when it is not needed?

> So, simplify our life and just keep this feature turned on all the time.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
> fs/ext4/super.c |    3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> If people are worried that the performance impact isn't obvious, I'll
> find the time somehow to test this properly....
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e1fb1d5..a99f827 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1483,7 +1483,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>        sbi->s_mount_flags |= EXT4_MF_FS_ABORTED;
>        return 1;
>    case Opt_i_version:
> -        sb->s_flags |= MS_I_VERSION;
> +        /* no-op; this is on by default now */
>        return 1;
>    case Opt_journal_dev:
>        if (is_remount) {
> @@ -2979,6 +2979,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>        goto out_free_orig;
>    }
>    sb->s_fs_info = sbi;
> +    sb->s_flags |= MS_I_VERSION;
>    sbi->s_mount_opt = 0;
>    sbi->s_resuid = EXT4_DEF_RESUID;
>    sbi->s_resgid = EXT4_DEF_RESGID;
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields - May 14, 2012, 3:23 p.m.
On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
> On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > knfsd needs i_version updates on, as will userspace nfs servers and
> > probably others.
> > 
> > The only effects are that inode->i_version is bumped (under the i_lock)
> > in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> > more frequently than once per jiffy on write (see file_update_time).
> > However the latter appears to be mostly a no-op in that case.
> 
> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?

There's no reason it should be, should it, if we already just dirtied
the inode a moment ago?

> This is one of the reasons that the i_version update is conditional.
> If someone is exporting a filesystem from userspace the should be able
> to turn this on as a mount option, and knfsd could do it from inside
> the kernel. Why add overhead when it is not needed?

Any user of the change attribute also wants it to function correctly
while they're away.

And if it at all possible I'd rather have it be something that Just
Works rather than something that requires extra configuration.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - May 14, 2012, 5:27 p.m.
On 2012-05-14, at 9:23 AM, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
>> On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:
>>> knfsd needs i_version updates on, as will userspace nfs servers and
>>> probably others.
>>> 
>>> The only effects are that inode->i_version is bumped (under the i_lock)
>>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
>>> more frequently than once per jiffy on write (see file_update_time).
>>> However the latter appears to be mostly a no-op in that case.
>> 
>> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
> 
> There's no reason it should be, should it, if we already just dirtied
> the inode a moment ago?

Ideally not, but the way ext[34]_mark_inode_dirty() is implemented
is that it copies the whole in-core inode to the on-disk inode every
time it is marked dirty.  That ensures that the on-disk inode is
up-to-date when the journal flushes the blocks to disk, but is not
an ideal implementation.  It has been this way since the first ext3
implementation was done.

As a result, dirtying the inode very frequently for ext[34] is
currently expensive and should be avoided.

I _think_ that the ext4 metadata checksum patches have changed this
to only flag the inode dirty and run a pre-commit callback to copy
the in-core inode to the on-disk inode.  I'm not sure what the
current status of that patch is, nor how easily it could be split
from that patch series and land separately.

>> This is one of the reasons that the i_version update is conditional.
>> If someone is exporting a filesystem from userspace the should be able
>> to turn this on as a mount option, and knfsd could do it from inside
>> the kernel. Why add overhead when it is not needed?
> 
> Any user of the change attribute also wants it to function correctly
> while they're away.

It would only need to change once, however, not continuously.
Is there any way to know when a consumer has sampled the version?
That way the on-disk version could be bumped once after the version
was referenced, and wouldn't have to be changed thousands of times
per second, nor at all if nothing is using the version.

The MS_I_VERSION is intended to be used to indicate that i_version
needs to be updated.  I can imagine that it might make sense to
make this flag "sticky" on a filesystem, so that once it is used
for NFSv4 the version will be bumped once for an inode change even
if MS_I_VERSION is not in use, but that is sufficient for NFSv4
and it does not have to be a permanent drag on the filesystem.

> And if it at all possible I'd rather have it be something that Just
> Works rather than something that requires extra configuration.

Sure, but this is only useful for NFSv4, but costs everyone using
ext4 continuous overhead, so it isn't a clear-cut case to enable
the version just on the thought that NFS might one day be used on
any particular filesystem.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - May 14, 2012, 5:58 p.m.
On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > And if it at all possible I'd rather have it be something that Just
> > Works rather than something that requires extra configuration.
> 
> Sure, but this is only useful for NFSv4, but costs everyone using
> ext4 continuous overhead, so it isn't a clear-cut case to enable
> the version just on the thought that NFS might one day be used on
> any particular filesystem.

It's not a matter of "NFSv4 might one day be used"; if we don't turn
on i_version updates until the file system is actually exported via
NFSv4, there would be no deleterious effects.

I always thought that was going to be the plan; that there would be
some flag that would be set in struct super_block when the file system
was exported that would enable i_version updates.

That way we satisfy the "no extra configuration" needed requirement,
which I agree is ideal, but we also don't waste any CPU overhead if
the file system is not exported via NFSv4.  I tried to implement
anything along these lines because I don't care enough, and I don't
use NFSv4 personally....

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik - May 14, 2012, 6:33 p.m.
On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > And if it at all possible I'd rather have it be something that Just
> > > Works rather than something that requires extra configuration.
> > 
> > Sure, but this is only useful for NFSv4, but costs everyone using
> > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > the version just on the thought that NFS might one day be used on
> > any particular filesystem.
> 
> It's not a matter of "NFSv4 might one day be used"; if we don't turn
> on i_version updates until the file system is actually exported via
> NFSv4, there would be no deleterious effects.
> 
> I always thought that was going to be the plan; that there would be
> some flag that would be set in struct super_block when the file system
> was exported that would enable i_version updates.
> 
> That way we satisfy the "no extra configuration" needed requirement,
> which I agree is ideal, but we also don't waste any CPU overhead if
> the file system is not exported via NFSv4.  I tried to implement
> anything along these lines because I don't care enough, and I don't
> use NFSv4 personally....
> 

Seems like this is just a bad place to be doing inode_inc_iversion().  If
MS_IVERSION is set we will update iversion in file_update_time() and then call
mark_inode_dirty which will jack up the iversion again.  In btrfs we just change
it wherever we change ctime and that way you don't really notice the extra
overhead since you are doing it in paths where you are changing a bunch of stuff
in the inode already, and mostly where you hold the i_mutex so you aren't going
to be hitting any contention on the i_lock.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton - May 14, 2012, 6:48 p.m.
On Mon, 14 May 2012 14:33:17 -0400
Josef Bacik <josef@redhat.com> wrote:

> On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > And if it at all possible I'd rather have it be something that Just
> > > > Works rather than something that requires extra configuration.
> > > 
> > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > the version just on the thought that NFS might one day be used on
> > > any particular filesystem.
> > 
> > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > on i_version updates until the file system is actually exported via
> > NFSv4, there would be no deleterious effects.
> > 
> > I always thought that was going to be the plan; that there would be
> > some flag that would be set in struct super_block when the file system
> > was exported that would enable i_version updates.
> > 
> > That way we satisfy the "no extra configuration" needed requirement,
> > which I agree is ideal, but we also don't waste any CPU overhead if
> > the file system is not exported via NFSv4.  I tried to implement
> > anything along these lines because I don't care enough, and I don't
> > use NFSv4 personally....
> > 
> 
> Seems like this is just a bad place to be doing inode_inc_iversion().  If
> MS_IVERSION is set we will update iversion in file_update_time() and then call
> mark_inode_dirty which will jack up the iversion again.  In btrfs we just change
> it wherever we change ctime and that way you don't really notice the extra
> overhead since you are doing it in paths where you are changing a bunch of stuff
> in the inode already, and mostly where you hold the i_mutex so you aren't going
> to be hitting any contention on the i_lock.  Thanks,
> 

Well, you do incur a bit more overhead in btrfs too:

------------------[snip]----------------------
        if (!timespec_equal(&inode->i_mtime, &now))
                sync_it = S_MTIME;

        if (!timespec_equal(&inode->i_ctime, &now))
                sync_it |= S_CTIME;

        if (IS_I_VERSION(inode))
                sync_it |= S_VERSION;
------------------[snip]----------------------

So you'll end up with sync_it being 0 if i_version updates are
disabled, and the mtime/ctime didn't visibly change.

If your jiffies are coarse-grained enough, then you might get "lucky"
rather often, but is that a case worth optimizing for? How often does
it happen that you mark the inode dirty, flush it to disk and then
re-mark it dirty within the same jiffy?

If that's a common occurrence then you might see some performance
impact here from turning i_version on all the time. Otherwise, it seems
like it wouldn't make that big a difference.
Josef Bacik - May 14, 2012, 6:51 p.m.
On Mon, May 14, 2012 at 02:48:02PM -0400, Jeff Layton wrote:
> On Mon, 14 May 2012 14:33:17 -0400
> Josef Bacik <josef@redhat.com> wrote:
> 
> > On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > > And if it at all possible I'd rather have it be something that Just
> > > > > Works rather than something that requires extra configuration.
> > > > 
> > > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > > the version just on the thought that NFS might one day be used on
> > > > any particular filesystem.
> > > 
> > > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > > on i_version updates until the file system is actually exported via
> > > NFSv4, there would be no deleterious effects.
> > > 
> > > I always thought that was going to be the plan; that there would be
> > > some flag that would be set in struct super_block when the file system
> > > was exported that would enable i_version updates.
> > > 
> > > That way we satisfy the "no extra configuration" needed requirement,
> > > which I agree is ideal, but we also don't waste any CPU overhead if
> > > the file system is not exported via NFSv4.  I tried to implement
> > > anything along these lines because I don't care enough, and I don't
> > > use NFSv4 personally....
> > > 
> > 
> > Seems like this is just a bad place to be doing inode_inc_iversion().  If
> > MS_IVERSION is set we will update iversion in file_update_time() and then call
> > mark_inode_dirty which will jack up the iversion again.  In btrfs we just change
> > it wherever we change ctime and that way you don't really notice the extra
> > overhead since you are doing it in paths where you are changing a bunch of stuff
> > in the inode already, and mostly where you hold the i_mutex so you aren't going
> > to be hitting any contention on the i_lock.  Thanks,
> > 
> 
> Well, you do incur a bit more overhead in btrfs too:
> 
> ------------------[snip]----------------------
>         if (!timespec_equal(&inode->i_mtime, &now))
>                 sync_it = S_MTIME;
> 
>         if (!timespec_equal(&inode->i_ctime, &now))
>                 sync_it |= S_CTIME;
> 
>         if (IS_I_VERSION(inode))
>                 sync_it |= S_VERSION;
> ------------------[snip]----------------------
> 
> So you'll end up with sync_it being 0 if i_version updates are
> disabled, and the mtime/ctime didn't visibly change.
> 
> If your jiffies are coarse-grained enough, then you might get "lucky"
> rather often, but is that a case worth optimizing for? How often does
> it happen that you mark the inode dirty, flush it to disk and then
> re-mark it dirty within the same jiffy?
> 

Well sync_it just means do we need to call mark_inode_dirty, not necessarily do
we need to write it to disk, so you are just updating a field in memory.  Now if
your jiffies are coarse enough for you to not notice the ctime update then yes
you are incurring an extra lock/inc/unlock, but this is called in the write path
where you are going to do much more latency inducting operations than locking
and unlocking a generally uncontended spin lock.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields - May 14, 2012, 6:54 p.m.
On Mon, May 14, 2012 at 02:33:17PM -0400, Josef Bacik wrote:
> On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > And if it at all possible I'd rather have it be something that Just
> > > > Works rather than something that requires extra configuration.
> > > 
> > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > the version just on the thought that NFS might one day be used on
> > > any particular filesystem.
> > 
> > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > on i_version updates until the file system is actually exported via
> > NFSv4, there would be no deleterious effects.
> > 
> > I always thought that was going to be the plan; that there would be
> > some flag that would be set in struct super_block when the file system
> > was exported that would enable i_version updates.
> > 
> > That way we satisfy the "no extra configuration" needed requirement,
> > which I agree is ideal, but we also don't waste any CPU overhead if
> > the file system is not exported via NFSv4.  I tried to implement
> > anything along these lines because I don't care enough, and I don't
> > use NFSv4 personally....
> > 
> 
> Seems like this is just a bad place to be doing inode_inc_iversion().  If
> MS_IVERSION is set we will update iversion in file_update_time() and then call
> mark_inode_dirty which will jack up the iversion again.

Agreed, that's weird.

> In btrfs we just change
> it wherever we change ctime and that way you don't really notice the extra
> overhead since you are doing it in paths where you are changing a bunch of stuff
> in the inode already, and mostly where you hold the i_mutex so you aren't going
> to be hitting any contention on the i_lock.  Thanks,

I don't think they're worried about the inode_inc_iversion() calls
themselves, but the behavior of file_update_time():

        if (!timespec_equal(&inode->i_mtime, &now))
                sync_it = S_MTIME;

        if (!timespec_equal(&inode->i_ctime, &now))
                sync_it |= S_CTIME;

        if (IS_I_VERSION(inode))
                sync_it |= S_VERSION;

        if (!sync_it)
                return;
	...
	mark_inode_dirty_sync(inode);

So now mark_inode_dirty_sync() is called on every update, instead of
merely on every update that sees a time change (so at most once a
jiffy).

So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
may get called more often if you're writing very frequently.

I'm a bit surprised that's expected to add significant overhead to the
write.

I guess I should stare at the code and try to follow Andreas's
explanation....

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik - May 14, 2012, 7:05 p.m.
On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 02:33:17PM -0400, Josef Bacik wrote:
> > On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > > And if it at all possible I'd rather have it be something that Just
> > > > > Works rather than something that requires extra configuration.
> > > > 
> > > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > > the version just on the thought that NFS might one day be used on
> > > > any particular filesystem.
> > > 
> > > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > > on i_version updates until the file system is actually exported via
> > > NFSv4, there would be no deleterious effects.
> > > 
> > > I always thought that was going to be the plan; that there would be
> > > some flag that would be set in struct super_block when the file system
> > > was exported that would enable i_version updates.
> > > 
> > > That way we satisfy the "no extra configuration" needed requirement,
> > > which I agree is ideal, but we also don't waste any CPU overhead if
> > > the file system is not exported via NFSv4.  I tried to implement
> > > anything along these lines because I don't care enough, and I don't
> > > use NFSv4 personally....
> > > 
> > 
> > Seems like this is just a bad place to be doing inode_inc_iversion().  If
> > MS_IVERSION is set we will update iversion in file_update_time() and then call
> > mark_inode_dirty which will jack up the iversion again.
> 
> Agreed, that's weird.
> 
> > In btrfs we just change
> > it wherever we change ctime and that way you don't really notice the extra
> > overhead since you are doing it in paths where you are changing a bunch of stuff
> > in the inode already, and mostly where you hold the i_mutex so you aren't going
> > to be hitting any contention on the i_lock.  Thanks,
> 
> I don't think they're worried about the inode_inc_iversion() calls
> themselves, but the behavior of file_update_time():
> 
>         if (!timespec_equal(&inode->i_mtime, &now))
>                 sync_it = S_MTIME;
> 
>         if (!timespec_equal(&inode->i_ctime, &now))
>                 sync_it |= S_CTIME;
> 
>         if (IS_I_VERSION(inode))
>                 sync_it |= S_VERSION;
> 
>         if (!sync_it)
>                 return;
> 	...
> 	mark_inode_dirty_sync(inode);
> 
> So now mark_inode_dirty_sync() is called on every update, instead of
> merely on every update that sees a time change (so at most once a
> jiffy).
> 
> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
> may get called more often if you're writing very frequently.
> 
> I'm a bit surprised that's expected to add significant overhead to the
> write.
> 
> I guess I should stare at the code and try to follow Andreas's
> explanation....
>

It shouldn't, let's be honest, most systems aren't going to have such a coarse
jiffie counter that they'll be able to get away with doing 2 calls to write() or
->page_mkwrite() in the same jiffie and skip the update to mtime/ctime anyway.
If they do they are damned lucky, and again the amount of overhead added even if
they are should be negligible since 99% of us all incur the overhead from having
to update mtime/ctime anyway.  Thanks,

Josef 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - May 14, 2012, 9:27 p.m.
On 2012-05-14, at 1:05 PM, Josef Bacik wrote:
> On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
>> I don't think they're worried about the inode_inc_iversion() calls
>> themselves, but the behavior of file_update_time():
>> 
>>        if (!timespec_equal(&inode->i_mtime, &now))
>>                sync_it = S_MTIME;
>> 
>>        if (!timespec_equal(&inode->i_ctime, &now))
>>                sync_it |= S_CTIME;
>> 
>>        if (IS_I_VERSION(inode))
>>                sync_it |= S_VERSION;
>> 
>>        if (!sync_it)
>>                return;
>> 	...
>> 	mark_inode_dirty_sync(inode);
>> 
>> So now mark_inode_dirty_sync() is called on every update, instead of
>> merely on every update that sees a time change (so at most once a
>> jiffy).
>> 
>> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
>> may get called more often if you're writing very frequently.
>> 
>> I'm a bit surprised that's expected to add significant overhead to the
>> write.
> 
> It shouldn't, let's be honest, most systems aren't going to have such
> a coarse jiffie counter that they'll be able to get away with doing
> 2 calls to write() or ->page_mkwrite() in the same jiffie and skip the
> update to mtime/ctime anyway.  If they do they are damned lucky, and
> again the amount of overhead added even if they are should be
> negligible since 99% of us all incur the overhead from having
> to update mtime/ctime anyway.  Thanks,

Seriously?  The whole reason the above checks for timespec_equal()
are there is to avoid calling mark_inode_dirty_sync() thousands of
times per second.  If doing write() calls in the same jiffie were
so rare as you suggest then I don't think such an optimization
would ever have appeared in the first place.

For writes to a high-IOPS device (e.g. SSD) can run far higher than
1000 IOPS, and this is an important use case that people care about
today, so why add useless overhead when it isn't needed?

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust - May 14, 2012, 11:08 p.m.
On Mon, 2012-05-14 at 09:02 -0600, Andreas Dilger wrote:
> On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:

> > knfsd needs i_version updates on, as will userspace nfs servers and

> > probably others.

> > 

> > The only effects are that inode->i_version is bumped (under the i_lock)

> > in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called

> > more frequently than once per jiffy on write (see file_update_time).

> > However the latter appears to be mostly a no-op in that case.

> 

> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?

> 

> This is one of the reasons that the i_version update is conditional. If someone is exporting a filesystem from userspace the should be able to turn this on as a mount option, and knfsd could do it from inside the kernel. Why add overhead when it is not needed?


No. Having knfsd doing something like that under the covers is a BAD
idea. It is way too easy to get into situations where someone starts
changing files after the mount and before knfsd is started. As soon as
you allow files to be changed without i_version changing, then you are
setting yourself up for future corruption.

Ideally, an NFSv4-exported filesystem should be required to set the
tune2fs mount_opts to include the 'i_version' flag to make it hard to
inadvertently mount that filesystem without it.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Andreas Dilger - May 14, 2012, 11:33 p.m.
On 2012-05-14, at 5:08 PM, Myklebust, Trond wrote:
> On Mon, 2012-05-14 at 09:02 -0600, Andreas Dilger wrote:
>> On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:
>>> knfsd needs i_version updates on, as will userspace nfs servers and
>>> probably others.
>>> 
>>> The only effects are that inode->i_version is bumped (under the i_lock)
>>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
>>> more frequently than once per jiffy on write (see file_update_time).
>>> However the latter appears to be mostly a no-op in that case.
>> 
>> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
>> 
>> This is one of the reasons that the i_version update is conditional.
>> If someone is exporting a filesystem from userspace the should be
>> able to turn this on as a mount option, and knfsd could do it from inside the kernel. Why add overhead when it is not needed?
> 
> No. Having knfsd doing something like that under the covers is a BAD
> idea. It is way too easy to get into situations where someone starts
> changing files after the mount and before knfsd is started. As soon as
> you allow files to be changed without i_version changing, then you are
> setting yourself up for future corruption.
> 
> Ideally, an NFSv4-exported filesystem should be required to set the
> tune2fs mount_opts to include the 'i_version' flag to make it hard to
> inadvertently mount that filesystem without it.

I said as much in another reply - that once i_version is used on
a filesystem, it should be made "sticky" (i.e. permanently enabled
for that filesystem).  However, until that time it shouldn't be
enabled just because it might one day be used.

Even better than just blindly bumping the i_version on every change,
it would be better to have users of i_version (i.e. knfsd) flag the
inode with "needs i_version update" then read the version.  When the
filesystem/VFS bumps i_version the next time it can clear this flag
and not update i_version again until after the next time i_version
is actually used.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields - May 14, 2012, 11:54 p.m.
On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> I said as much in another reply - that once i_version is used on
> a filesystem, it should be made "sticky" (i.e. permanently enabled
> for that filesystem).  However, until that time it shouldn't be
> enabled just because it might one day be used.
> 
> Even better than just blindly bumping the i_version on every change,
> it would be better to have users of i_version (i.e. knfsd) flag the
> inode with "needs i_version update" then read the version.  When the
> filesystem/VFS bumps i_version the next time it can clear this flag
> and not update i_version again until after the next time i_version
> is actually used.

I really don't want to do anything more complicated than necessary.

What would be the worst-case test for the extra inode dirtying, so we
can see what the numbers actually are?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust - May 15, 2012, 12:13 a.m.
T24gTW9uLCAyMDEyLTA1LTE0IGF0IDE3OjMzIC0wNjAwLCBBbmRyZWFzIERpbGdlciB3cm90ZToN
Cj4gRXZlbiBiZXR0ZXIgdGhhbiBqdXN0IGJsaW5kbHkgYnVtcGluZyB0aGUgaV92ZXJzaW9uIG9u
IGV2ZXJ5IGNoYW5nZSwNCj4gaXQgd291bGQgYmUgYmV0dGVyIHRvIGhhdmUgdXNlcnMgb2YgaV92
ZXJzaW9uIChpLmUuIGtuZnNkKSBmbGFnIHRoZQ0KPiBpbm9kZSB3aXRoICJuZWVkcyBpX3ZlcnNp
b24gdXBkYXRlIiB0aGVuIHJlYWQgdGhlIHZlcnNpb24uICBXaGVuIHRoZQ0KPiBmaWxlc3lzdGVt
L1ZGUyBidW1wcyBpX3ZlcnNpb24gdGhlIG5leHQgdGltZSBpdCBjYW4gY2xlYXIgdGhpcyBmbGFn
DQo+IGFuZCBub3QgdXBkYXRlIGlfdmVyc2lvbiBhZ2FpbiB1bnRpbCBhZnRlciB0aGUgbmV4dCB0
aW1lIGlfdmVyc2lvbg0KPiBpcyBhY3R1YWxseSB1c2VkLg0KDQpUaGF0IGRvdWJseSBwZW5hbGlz
ZXMgYWN0dWFsIHVzZXJzIG9mIGlfdmVyc2lvbjogbm90IG9ubHkgZG9lcyB0aGUNCmZpbGVzeXN0
ZW0gaGF2ZSB0byBidW1wIGl0IG9uIGEgZmlsZSBjaGFuZ2UsIGJ1dCBpdCBub3cgaGFzIHRvIHNl
dCBhbmQNCndyaXRlIHRoaXMgZmxhZyBvbiB0aGUgbmV4dCByZWFkIG9mIGlfdmVyc2lvbi4NCg0K
SWYgdGhvc2UgcmVhZHMgb2YgaV92ZXJzaW9uIHdlcmUgcmVhbGx5IGZldyBhbmQgZmFyIGJldHdl
ZW4sIHRoZW4gSQ0KYWdyZWUgdGhhdCBpdCBtaWdodCBiZSBhIHNvbHV0aW9uLCBidXQgbW9zdCBO
RlMgY2xpZW50cyB3aWxsIGFzayBmb3IgdGhlDQp1cGRhdGVkIGlfdmVyc2lvbiBhZnRlciBldmVy
eSBjaGFuZ2UgdGhhdCB0aGV5IG1ha2UgdG8gdGhlIGZpbGUgYmVjYXVzZQ0KdGhleSBuZWVkIHRv
IHRyYWNrIHRoYXQgdmFsdWUgZm9yIGNhY2hlIGNvbnNpc3RlbmN5IG1hbmFnZW1lbnQgcHVycG9z
ZXMuDQoNCi0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51eCBORlMgY2xpZW50IG1haW50YWluZXIN
Cg0KTmV0QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbQ0Kd3d3Lm5ldGFwcC5jb20NCg0K
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - May 15, 2012, 10:30 a.m.
On Mon 14-05-12 19:54:32, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> > I said as much in another reply - that once i_version is used on
> > a filesystem, it should be made "sticky" (i.e. permanently enabled
> > for that filesystem).  However, until that time it shouldn't be
> > enabled just because it might one day be used.
> > 
> > Even better than just blindly bumping the i_version on every change,
> > it would be better to have users of i_version (i.e. knfsd) flag the
> > inode with "needs i_version update" then read the version.  When the
> > filesystem/VFS bumps i_version the next time it can clear this flag
> > and not update i_version again until after the next time i_version
> > is actually used.
> 
> I really don't want to do anything more complicated than necessary.
> 
> What would be the worst-case test for the extra inode dirtying, so we
> can see what the numbers actually are?
  Something like:

  int fd, i;
  struct timeval tv[2];

  fd = open("file", O_CREAT | O_RDWR, 0644);
  if (fd < 0)
    return 1;
  for (i = 0; i < 1000000; i++) {
    gettimeofday(tv);
    tv[1] = tv[0];
    if (futimes(fd, tv) < 0)
      return 1;
  }
  return 0;

  And see how long does it take with and without i_version?

								Honza
J. Bruce Fields - May 15, 2012, 12:35 p.m.
On Tue, May 15, 2012 at 12:30:21PM +0200, Jan Kara wrote:
> On Mon 14-05-12 19:54:32, J. Bruce Fields wrote:
> > On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> > > I said as much in another reply - that once i_version is used on
> > > a filesystem, it should be made "sticky" (i.e. permanently enabled
> > > for that filesystem).  However, until that time it shouldn't be
> > > enabled just because it might one day be used.
> > > 
> > > Even better than just blindly bumping the i_version on every change,
> > > it would be better to have users of i_version (i.e. knfsd) flag the
> > > inode with "needs i_version update" then read the version.  When the
> > > filesystem/VFS bumps i_version the next time it can clear this flag
> > > and not update i_version again until after the next time i_version
> > > is actually used.
> > 
> > I really don't want to do anything more complicated than necessary.
> > 
> > What would be the worst-case test for the extra inode dirtying, so we
> > can see what the numbers actually are?
>   Something like:
> 
>   int fd, i;
>   struct timeval tv[2];
> 
>   fd = open("file", O_CREAT | O_RDWR, 0644);
>   if (fd < 0)
>     return 1;
>   for (i = 0; i < 1000000; i++) {
>     gettimeofday(tv);
>     tv[1] = tv[0];
>     if (futimes(fd, tv) < 0)
>       return 1;
>   }
>   return 0;
> 
>   And see how long does it take with and without i_version?

The complaint I hear from Andreas is that we'll cause file_update_time()
to call mark_inode_dirty() more often.

I don't believe futimes() calls file_update_time().

So maybe replace that futimes() by a one-byte write?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik - May 15, 2012, 1:28 p.m.
On Mon, May 14, 2012 at 03:27:47PM -0600, Andreas Dilger wrote:
> On 2012-05-14, at 1:05 PM, Josef Bacik wrote:
> > On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
> >> I don't think they're worried about the inode_inc_iversion() calls
> >> themselves, but the behavior of file_update_time():
> >> 
> >>        if (!timespec_equal(&inode->i_mtime, &now))
> >>                sync_it = S_MTIME;
> >> 
> >>        if (!timespec_equal(&inode->i_ctime, &now))
> >>                sync_it |= S_CTIME;
> >> 
> >>        if (IS_I_VERSION(inode))
> >>                sync_it |= S_VERSION;
> >> 
> >>        if (!sync_it)
> >>                return;
> >> 	...
> >> 	mark_inode_dirty_sync(inode);
> >> 
> >> So now mark_inode_dirty_sync() is called on every update, instead of
> >> merely on every update that sees a time change (so at most once a
> >> jiffy).
> >> 
> >> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
> >> may get called more often if you're writing very frequently.
> >> 
> >> I'm a bit surprised that's expected to add significant overhead to the
> >> write.
> > 
> > It shouldn't, let's be honest, most systems aren't going to have such
> > a coarse jiffie counter that they'll be able to get away with doing
> > 2 calls to write() or ->page_mkwrite() in the same jiffie and skip the
> > update to mtime/ctime anyway.  If they do they are damned lucky, and
> > again the amount of overhead added even if they are should be
> > negligible since 99% of us all incur the overhead from having
> > to update mtime/ctime anyway.  Thanks,
> 
> Seriously?  The whole reason the above checks for timespec_equal()
> are there is to avoid calling mark_inode_dirty_sync() thousands of
> times per second.  If doing write() calls in the same jiffie were
> so rare as you suggest then I don't think such an optimization
> would ever have appeared in the first place.
> 

So here is the commit log

commit ce06e0b21d6732a2bab10a585a3ec6909499be28
Author: Andi Kleen <andi@firstfloor.org>
Date:   Fri Sep 18 13:05:48 2009 -0700

    vfs: optimize touch_time() too
    
    Do a similar optimization as earlier for touch_atime.  Getting the lock in
    mnt_get_write is relatively costly, so try all avenues to avoid it first.
    
    This patch is careful to still only update inode fields inside the lock
    region.
    
    This didn't show up in benchmarks, but it's easy enough to do.

Notice that last bit?  I'm sure maybe at some point in the future we'll be able
to see the overhead, but right now I highly doubt we can, and if we can I'd like
to see benchmarks before blanket dismissing a feature that would be super
helpful for people exporting nfs volumes.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - May 15, 2012, 2:43 p.m.
On Tue 15-05-12 08:35:50, J. Bruce Fields wrote:
> On Tue, May 15, 2012 at 12:30:21PM +0200, Jan Kara wrote:
> > On Mon 14-05-12 19:54:32, J. Bruce Fields wrote:
> > > On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> > > > I said as much in another reply - that once i_version is used on
> > > > a filesystem, it should be made "sticky" (i.e. permanently enabled
> > > > for that filesystem).  However, until that time it shouldn't be
> > > > enabled just because it might one day be used.
> > > > 
> > > > Even better than just blindly bumping the i_version on every change,
> > > > it would be better to have users of i_version (i.e. knfsd) flag the
> > > > inode with "needs i_version update" then read the version.  When the
> > > > filesystem/VFS bumps i_version the next time it can clear this flag
> > > > and not update i_version again until after the next time i_version
> > > > is actually used.
> > > 
> > > I really don't want to do anything more complicated than necessary.
> > > 
> > > What would be the worst-case test for the extra inode dirtying, so we
> > > can see what the numbers actually are?
> >   Something like:
> > 
> >   int fd, i;
> >   struct timeval tv[2];
> > 
> >   fd = open("file", O_CREAT | O_RDWR, 0644);
> >   if (fd < 0)
> >     return 1;
> >   for (i = 0; i < 1000000; i++) {
> >     gettimeofday(tv);
> >     tv[1] = tv[0];
> >     if (futimes(fd, tv) < 0)
> >       return 1;
> >   }
> >   return 0;
> > 
> >   And see how long does it take with and without i_version?
> 
> The complaint I hear from Andreas is that we'll cause file_update_time()
> to call mark_inode_dirty() more often.
> 
> I don't believe futimes() calls file_update_time().
  Yeah, right. I didn't check and I was wrong...

> So maybe replace that futimes() by a one-byte write?
  Yes, "pwrite(fd, "data", 1, 0);" should be then a proper test instead of
futimes().

								Honza
J. Bruce Fields - May 15, 2012, 5:33 p.m.
On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> On 2012-05-14, at 9:23 AM, J. Bruce Fields wrote:
> > On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
> >> On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> >>> knfsd needs i_version updates on, as will userspace nfs servers and
> >>> probably others.
> >>> 
> >>> The only effects are that inode->i_version is bumped (under the i_lock)
> >>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> >>> more frequently than once per jiffy on write (see file_update_time).
> >>> However the latter appears to be mostly a no-op in that case.
> >> 
> >> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
> > 
> > There's no reason it should be, should it, if we already just dirtied
> > the inode a moment ago?
> 
> Ideally not, but the way ext[34]_mark_inode_dirty() is implemented
> is that it copies the whole in-core inode to the on-disk inode every
> time it is marked dirty.  That ensures that the on-disk inode is
> up-to-date when the journal flushes the blocks to disk, but is not
> an ideal implementation.  It has been this way since the first ext3
> implementation was done.
> 
> As a result, dirtying the inode very frequently for ext[34] is
> currently expensive and should be avoided.
> 
> I _think_ that the ext4 metadata checksum patches have changed this
> to only flag the inode dirty and run a pre-commit callback to copy
> the in-core inode to the on-disk inode.  I'm not sure what the
> current status of that patch is, nor how easily it could be split
> from that patch series and land separately.

I did some searching, found a couple of versions of the metadata
checksum patches, but no patch that looked like what you're describing.
Any idea where that is?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marco Stornelli - May 15, 2012, 5:59 p.m.
Il 15/05/2012 15:28, Josef Bacik ha scritto:
> On Mon, May 14, 2012 at 03:27:47PM -0600, Andreas Dilger wrote:
>> On 2012-05-14, at 1:05 PM, Josef Bacik wrote:
>>> On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
>>>> I don't think they're worried about the inode_inc_iversion() calls
>>>> themselves, but the behavior of file_update_time():
>>>>
>>>>         if (!timespec_equal(&inode->i_mtime,&now))
>>>>                 sync_it = S_MTIME;
>>>>
>>>>         if (!timespec_equal(&inode->i_ctime,&now))
>>>>                 sync_it |= S_CTIME;
>>>>
>>>>         if (IS_I_VERSION(inode))
>>>>                 sync_it |= S_VERSION;
>>>>
>>>>         if (!sync_it)
>>>>                 return;
>>>> 	...
>>>> 	mark_inode_dirty_sync(inode);
>>>>
>>>> So now mark_inode_dirty_sync() is called on every update, instead of
>>>> merely on every update that sees a time change (so at most once a
>>>> jiffy).
>>>>
>>>> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
>>>> may get called more often if you're writing very frequently.
>>>>
>>>> I'm a bit surprised that's expected to add significant overhead to the
>>>> write.
>>>
>>> It shouldn't, let's be honest, most systems aren't going to have such
>>> a coarse jiffie counter that they'll be able to get away with doing
>>> 2 calls to write() or ->page_mkwrite() in the same jiffie and skip the
>>> update to mtime/ctime anyway.  If they do they are damned lucky, and
>>> again the amount of overhead added even if they are should be
>>> negligible since 99% of us all incur the overhead from having
>>> to update mtime/ctime anyway.  Thanks,
>>
>> Seriously?  The whole reason the above checks for timespec_equal()
>> are there is to avoid calling mark_inode_dirty_sync() thousands of
>> times per second.  If doing write() calls in the same jiffie were
>> so rare as you suggest then I don't think such an optimization
>> would ever have appeared in the first place.
>>
>

Only a really really stupid question (I don't know NFS protocol well 
enough). In 3.3 kernel, I see that only ext4 uses MS_I_VERSION, so I 
wonder: if i_version change it's needed for exportable fs and so for 
nfs, other exportable fs? Is this only a particular problem for ext4? I 
mean, it doesn't seems a blocking problem (or we could have a lot of 
traffic on fs-devel :) ), it seems a "more compliant behavior". If this 
considerations is right, I think the current behavior of ext4 is ok.

Marco

Marco
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong - May 15, 2012, 6:50 p.m.
On Tue, May 15, 2012 at 01:33:40PM -0400, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > On 2012-05-14, at 9:23 AM, J. Bruce Fields wrote:
> > > On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
> > >> On 2012-05-14, at 8:06, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > >>> knfsd needs i_version updates on, as will userspace nfs servers and
> > >>> probably others.
> > >>> 
> > >>> The only effects are that inode->i_version is bumped (under the i_lock)
> > >>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> > >>> more frequently than once per jiffy on write (see file_update_time).
> > >>> However the latter appears to be mostly a no-op in that case.
> > >> 
> > >> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
> > > 
> > > There's no reason it should be, should it, if we already just dirtied
> > > the inode a moment ago?
> > 
> > Ideally not, but the way ext[34]_mark_inode_dirty() is implemented
> > is that it copies the whole in-core inode to the on-disk inode every
> > time it is marked dirty.  That ensures that the on-disk inode is
> > up-to-date when the journal flushes the blocks to disk, but is not
> > an ideal implementation.  It has been this way since the first ext3
> > implementation was done.
> > 
> > As a result, dirtying the inode very frequently for ext[34] is
> > currently expensive and should be avoided.
> > 
> > I _think_ that the ext4 metadata checksum patches have changed this
> > to only flag the inode dirty and run a pre-commit callback to copy
> > the in-core inode to the on-disk inode.  I'm not sure what the
> > current status of that patch is, nor how easily it could be split
> > from that patch series and land separately.
> 
> I did some searching, found a couple of versions of the metadata
> checksum patches, but no patch that looked like what you're describing.
> Any idea where that is?

That _was_ going to be the basis of phase 2 of my metadata checksum patchset,
but since I've been moved to other projects, I don't see that on my plate in
the near future. :/

(tldr: It doesn't exist.)

--D
> 
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields - May 15, 2012, 7:18 p.m.
On Tue, May 15, 2012 at 07:59:11PM +0200, Marco Stornelli wrote:
> Only a really really stupid question (I don't know NFS protocol well
> enough). In 3.3 kernel, I see that only ext4 uses MS_I_VERSION, so I
> wonder: if i_version change it's needed for exportable fs and so for
> nfs, other exportable fs?

Yes, it's needed for others as well.  I believe btrfs and xfs are both
adding it.

We're currently using ctime for the nfs change attribute.  That's
effectively jiffy granularity.  So to see the problem at a minimum you'd
need two writes to be processed within one jiffy, and a stat to come
between them.  But that's a correctness problem, and we'd like to see it
fixed before it becomes more common.

More generally, it's useful to be able to ask whether a file changed
without rereading all its data, and a clock that registers every change
and is consistent across a filesystem sounds difficult to scale.  We may
eventually find we need something like this outside nfs.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e1fb1d5..a99f827 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1483,7 +1483,7 @@  static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		sbi->s_mount_flags |= EXT4_MF_FS_ABORTED;
 		return 1;
 	case Opt_i_version:
-		sb->s_flags |= MS_I_VERSION;
+		/* no-op; this is on by default now */
 		return 1;
 	case Opt_journal_dev:
 		if (is_remount) {
@@ -2979,6 +2979,7 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto out_free_orig;
 	}
 	sb->s_fs_info = sbi;
+	sb->s_flags |= MS_I_VERSION;
 	sbi->s_mount_opt = 0;
 	sbi->s_resuid = EXT4_DEF_RESUID;
 	sbi->s_resgid = EXT4_DEF_RESGID;