writeback: avoid double-writing the inode on a lazytime expiration

Message ID 20200307020043.60118-1-tytso@mit.edu
State New

Commit Message

Theodore Y. Ts'o March 7, 2020, 2 a.m. UTC
In the case that an inode has had dirty timestamps for longer than the
lazytime expiration timeout (or if all such inodes are being flushed
out due to a sync or syncfs system call), we need to inform the file
system that the inode is dirty so that the inode's timestamps can be
copied out to the on-disk data structures.  That's because if the file
system supports lazytime, it will have ignored the dirty_inode(inode,
I_DIRTY_TIME) notification when the timestamp was modified in memory.

Previously, this was accomplished by calling mark_inode_dirty_sync(),
but that has the unfortunate side effect of also putting the inode on the
writeback list, and that's not necessary in this case, since we will
immediately call write_inode() afterwards.

Eric Biggers noticed that this was causing problems for fscrypt after
the key was removed[1].

[1] https://lore.kernel.org/r/20200306004555.GB225345@gmail.com

Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/fs-writeback.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
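
For readers following along, here is a minimal user-space sketch of the
double write described above.  It is not kernel code: every demo_* name
and the flag values are assumptions made for illustration, and the
"writeback list" is reduced to a single boolean.

```c
#include <stdbool.h>

/* Illustrative flag values only; the kernel defines the real ones. */
enum {
	DEMO_DIRTY_SYNC         = 1 << 0,
	DEMO_DIRTY_PAGES        = 1 << 2,
	DEMO_DIRTY_TIME         = 1 << 11,
	DEMO_DIRTY_TIME_EXPIRED = 1 << 12,
};

struct demo_inode {
	int  fs_notifications;   /* times the dirty_inode stand-in ran */
	int  writes;             /* times the write_inode stand-in ran */
	bool on_writeback_list;  /* queued for another writeback pass? */
};

/* Stand-in for ->dirty_inode(): the file system copies timestamps out. */
static void demo_dirty_inode(struct demo_inode *inode, int flag)
{
	(void)flag;
	inode->fs_notifications++;
}

/* Old behaviour: notify the file system AND requeue the inode. */
static void demo_mark_inode_dirty_sync(struct demo_inode *inode)
{
	demo_dirty_inode(inode, DEMO_DIRTY_SYNC);
	inode->on_writeback_list = true;
}

/* One writeback pass over a single inode, before or after the patch. */
static void demo_writeback_single_inode(struct demo_inode *inode, int dirty,
					bool patched)
{
	inode->on_writeback_list = false;
	if (patched) {
		if (dirty & DEMO_DIRTY_TIME_EXPIRED)
			demo_dirty_inode(inode, DEMO_DIRTY_TIME_EXPIRED);
	} else {
		if (dirty & DEMO_DIRTY_TIME)
			demo_mark_inode_dirty_sync(inode);
	}
	/* Don't write the inode if only the pages flag was set */
	if (dirty & ~DEMO_DIRTY_PAGES)
		inode->writes++;
}

/* Run passes until the inode is no longer queued; return total writes. */
static int demo_total_writes(bool patched)
{
	struct demo_inode inode = {0};
	int dirty = DEMO_DIRTY_TIME | DEMO_DIRTY_TIME_EXPIRED;

	demo_writeback_single_inode(&inode, dirty, patched);
	if (inode.on_writeback_list)
		/* the requeued inode is seen as sync-dirty on the next pass */
		demo_writeback_single_inode(&inode, DEMO_DIRTY_SYNC, patched);
	return inode.writes;
}
```

In this model demo_total_writes(false) counts two writes for a single
expiration and demo_total_writes(true) counts one, which is the effect
the patch is after.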

Comments

Eric Biggers March 11, 2020, 3:20 a.m. UTC | #1
On Fri, Mar 06, 2020 at 09:00:43PM -0500, Theodore Ts'o wrote:
> In the case that an inode has had dirty timestamps for longer than the
> lazytime expiration timeout (or if all such inodes are being flushed
> out due to a sync or syncfs system call), we need to inform the file
> system that the inode is dirty so that the inode's timestamps can be
> copied out to the on-disk data structures.  That's because if the file
> system supports lazytime, it will have ignored the dirty_inode(inode,
> I_DIRTY_TIME) notification when the timestamp was modified in memory.
> 
> Previously, this was accomplished by calling mark_inode_dirty_sync(),
> but that has the unfortunate side effect of also putting the inode on the
> writeback list, and that's not necessary in this case, since we will
> immediately call write_inode() afterwards.
> 
> Eric Biggers noticed that this was causing problems for fscrypt after
> the key was removed[1].
> 
> [1] https://lore.kernel.org/r/20200306004555.GB225345@gmail.com
> 
> Reported-by: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/fs-writeback.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 76ac9c7d32ec..32101349ba97 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1504,8 +1504,9 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  
>  	spin_unlock(&inode->i_lock);
>  
> -	if (dirty & I_DIRTY_TIME)
> -		mark_inode_dirty_sync(inode);
> +	/* This was a lazytime expiration; we need to tell the file system */
> +	if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode)
> +		inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED);
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & ~I_DIRTY_PAGES) {
>  		int err = write_inode(inode, wbc);
> -- 

Thanks Ted!  This fixes the fscrypt test failure.

However, are you sure this works correctly on all filesystems?  I'm not sure
about XFS.  XFS only implements ->dirty_inode(), not ->write_inode(), and in its
->dirty_inode() it does:

	static void
	xfs_fs_dirty_inode(
		struct inode                    *inode,
		int                             flag)
	{
		struct xfs_inode                *ip = XFS_I(inode);
		struct xfs_mount                *mp = ip->i_mount;
		struct xfs_trans                *tp;

		if (!(inode->i_sb->s_flags & SB_LAZYTIME))
			return;
		if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
			return;

		if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
			return;
		xfs_ilock(ip, XFS_ILOCK_EXCL);
		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
		xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
		xfs_trans_commit(tp);
	}


So flag=I_DIRTY_TIME_EXPIRED will be a no-op.

Maybe you should be using I_DIRTY_SYNC instead?  Or perhaps XFS should be
checking for either I_DIRTY_TIME_EXPIRED or I_DIRTY_SYNC?

- Eric
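
The early returns in the quoted xfs_fs_dirty_inode() can be checked in
isolation with a small user-space sketch.  The demo_* names and the flag
values below are assumptions for illustration; only the shape of the two
tests is taken from the quoted function.

```c
#include <stdbool.h>

/* Illustrative flag values only; the kernel defines the real ones. */
enum {
	DEMO_I_DIRTY_SYNC         = 1 << 0,
	DEMO_I_DIRTY_TIME         = 1 << 11,
	DEMO_I_DIRTY_TIME_EXPIRED = 1 << 12,
};

/*
 * Returns true if a function with the same two checks as the quoted
 * xfs_fs_dirty_inode() would get past them and log the timestamps.
 */
static bool demo_xfs_would_log(bool sb_lazytime, int i_state, int flag)
{
	if (!sb_lazytime)
		return false;
	if (flag != DEMO_I_DIRTY_SYNC || !(i_state & DEMO_I_DIRTY_TIME))
		return false;
	return true;
}
```

With lazytime enabled and I_DIRTY_TIME set in i_state, only
flag == DEMO_I_DIRTY_SYNC gets through.  Both DEMO_I_DIRTY_TIME_EXPIRED
and DEMO_I_DIRTY_TIME fail the equality test and return early.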
Theodore Y. Ts'o March 11, 2020, 12:57 p.m. UTC | #2
On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> Thanks Ted!  This fixes the fscrypt test failure.
> 
> However, are you sure this works correctly on all filesystems?  I'm not sure
> about XFS.  XFS only implements ->dirty_inode(), not ->write_inode(), and in its
> ->dirty_inode() it does:
  ...
> 		if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> 			return;

That's true, but when the timestamps were originally modified,
dirty_inode() will be called with flag == I_DIRTY_TIME, which will
*not* be a no-op; which is to say, XFS will force the timestamps to be
updated on disk when the timestamps are first dirtied, because it
doesn't support I_DIRTY_TIME.

So I think we're fine.

					- Ted
Dave Chinner March 11, 2020, 11:54 p.m. UTC | #3
On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> On Fri, Mar 06, 2020 at 09:00:43PM -0500, Theodore Ts'o wrote:
> > In the case that an inode has had dirty timestamps for longer than the
> > lazytime expiration timeout (or if all such inodes are being flushed
> > out due to a sync or syncfs system call), we need to inform the file
> > system that the inode is dirty so that the inode's timestamps can be
> > copied out to the on-disk data structures.  That's because if the file
> > system supports lazytime, it will have ignored the dirty_inode(inode,
> > I_DIRTY_TIME) notification when the timestamp was modified in memory.
> > 
> > Previously, this was accomplished by calling mark_inode_dirty_sync(),
> > but that has the unfortunate side effect of also putting the inode on the
> > writeback list, and that's not necessary in this case, since we will
> > immediately call write_inode() afterwards.
> > 
> > Eric Biggers noticed that this was causing problems for fscrypt after
> > the key was removed[1].
> > 
> > [1] https://lore.kernel.org/r/20200306004555.GB225345@gmail.com
> > 
> > Reported-by: Eric Biggers <ebiggers@kernel.org>
> > Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> > ---
> >  fs/fs-writeback.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 76ac9c7d32ec..32101349ba97 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1504,8 +1504,9 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> >  
> >  	spin_unlock(&inode->i_lock);
> >  
> > -	if (dirty & I_DIRTY_TIME)
> > -		mark_inode_dirty_sync(inode);
> > +	/* This was a lazytime expiration; we need to tell the file system */
> > +	if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode)
> > +		inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED);
> >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> >  	if (dirty & ~I_DIRTY_PAGES) {
> >  		int err = write_inode(inode, wbc);
> > -- 
> 
> Thanks Ted!  This fixes the fscrypt test failure.
> 
> However, are you sure this works correctly on all filesystems?  I'm not sure
> about XFS.  XFS only implements ->dirty_inode(), not ->write_inode(), and in its
> ->dirty_inode() it does:
> 
> 	static void
> 	xfs_fs_dirty_inode(
> 		struct inode                    *inode,
> 		int                             flag)
> 	{
> 		struct xfs_inode                *ip = XFS_I(inode);
> 		struct xfs_mount                *mp = ip->i_mount;
> 		struct xfs_trans                *tp;
> 
> 		if (!(inode->i_sb->s_flags & SB_LAZYTIME))
> 			return;
> 		if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> 			return;
> 
> 		if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
> 			return;
> 		xfs_ilock(ip, XFS_ILOCK_EXCL);
> 		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> 		xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
> 		xfs_trans_commit(tp);
> 	}
> 
> 
> So flag=I_DIRTY_TIME_EXPIRED will be a no-op.
> 
> Maybe you should be using I_DIRTY_SYNC instead?  Or perhaps XFS should be
> checking for either I_DIRTY_TIME_EXPIRED or I_DIRTY_SYNC?

Right, XFS does not use the VFS inode writeback code at all - we
track all metadata changes internally via journalling. The VFS uses
I_DIRTY_SYNC to indicate an inode is metadata dirty and a writeback
candidate. Hence if we need to mark an inode dirty for integrity
purposes for _any reason_, then I_DIRTY_SYNC is the correct flag to
be passing to ->dirty_inode.

Cheers,

Dave.
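
One way to read the suggestion above is: call the file system's hook
directly, but pass I_DIRTY_SYNC.  The following user-space sketch models
that with a function pointer table.  All demo_* names and the flag value
are hypothetical; it is not the kernel's super_operations and not a
proposed patch.

```c
#include <stddef.h>

/* Illustrative flag value only; the kernel defines the real one. */
enum { DEMO_I_DIRTY_SYNC = 1 << 0 };

struct demo_inode;

/* Hypothetical stand-in for the per-superblock operations table. */
struct demo_super_ops {
	void (*dirty_inode)(struct demo_inode *inode, int flag);
};

struct demo_inode {
	const struct demo_super_ops *s_op;
	int last_flag;  /* flag most recently seen by the hook */
	int calls;      /* number of times the hook ran */
};

/* A hook that only records what it was told. */
static void demo_record_flag(struct demo_inode *inode, int flag)
{
	inode->last_flag = flag;
	inode->calls++;
}

/* Notify the file system of expiry without requeueing; 1 if the hook ran. */
static int demo_notify_expiry(struct demo_inode *inode)
{
	if (!inode->s_op || !inode->s_op->dirty_inode)
		return 0;
	inode->s_op->dirty_inode(inode, DEMO_I_DIRTY_SYNC);
	return 1;
}

/* Convenience wrapper: which flag does a file system with a hook see? */
static int demo_flag_seen_with_hook(void)
{
	const struct demo_super_ops ops = { .dirty_inode = demo_record_flag };
	struct demo_inode inode = { .s_op = &ops };

	demo_notify_expiry(&inode);
	return inode.calls == 1 ? inode.last_flag : -1;
}

/* Convenience wrapper: a file system without the hook is left alone. */
static int demo_hook_ran_without_hook(void)
{
	const struct demo_super_ops ops = { .dirty_inode = NULL };
	struct demo_inode inode = { .s_op = &ops };

	return demo_notify_expiry(&inode);
}
```

The NULL test mirrors the one in the patch; the only difference from the
patch in this sketch is the flag handed to the hook.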
Dave Chinner March 12, 2020, 12:07 a.m. UTC | #4
On Wed, Mar 11, 2020 at 08:57:49AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> > Thanks Ted!  This fixes the fscrypt test failure.
> > 
> > However, are you sure this works correctly on all filesystems?  I'm not sure
> > about XFS.  XFS only implements ->dirty_inode(), not ->write_inode(), and in its
> > ->dirty_inode() it does:
>   ...
> > 		if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> > 			return;
> 
> That's true, but when the timestamps were originally modified,
> dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> *not* be a no-op; which is to say, XFS will force the timestamps to be
> updated on disk when the timestamps are first dirtied, because it
> doesn't support I_DIRTY_TIME.

We log the initial timestamp change, and then ignore timestamp
updates until the dirty time expires and the inode is set
I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
time stamps that may be 24 hours out of date in memory, and they
still need to be flushed to the journal.

However, your change does not mark the inode dirty on expiry
anymore, so...

> So I think we're fine.

... we're not fine. This breaks XFS and any other filesystem that
relies on an I_DIRTY_SYNC notification to handle dirty time expiry
correctly.

Cheers,

Dave.
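
The timeline described above (first change logged, later changes kept in
memory, flush on expiry) can be sketched as a toy model.  The struct, the
demo_* functions, the flag values and the time units are all assumptions
for illustration and say nothing about how XFS stores this state.

```c
#include <stdbool.h>

/* Illustrative flag values only; the kernel defines the real ones. */
enum {
	DEMO_I_DIRTY_SYNC         = 1 << 0,
	DEMO_I_DIRTY_TIME_EXPIRED = 1 << 12,
};

struct demo_lazy_inode {
	long mem_time;     /* timestamp held in memory */
	long logged_time;  /* timestamp last written to the journal */
	bool time_dirty;   /* a lazytime window is open */
};

/* Timestamp update: log the first change, keep later ones in memory. */
static void demo_touch(struct demo_lazy_inode *ip, long now)
{
	ip->mem_time = now;
	if (!ip->time_dirty) {
		ip->logged_time = now;
		ip->time_dirty = true;
	}
}

/* Expiry notification: only the sync flag triggers a flush in this model. */
static void demo_expiry_notify(struct demo_lazy_inode *ip, int flag)
{
	if (flag != DEMO_I_DIRTY_SYNC || !ip->time_dirty)
		return;
	ip->logged_time = ip->mem_time;
	ip->time_dirty = false;
}

/* How far the journal lags memory after expiry is signalled with 'flag'. */
static long demo_staleness_after_expiry(int flag)
{
	struct demo_lazy_inode ip = {0};

	demo_touch(&ip, 100);  /* logged */
	demo_touch(&ip, 500);  /* memory only */
	demo_expiry_notify(&ip, flag);
	return ip.mem_time - ip.logged_time;
}
```

In this model, signalling expiry with DEMO_I_DIRTY_SYNC leaves no lag,
while DEMO_I_DIRTY_TIME_EXPIRED leaves the journal 400 units behind the
in-memory timestamp.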
Christoph Hellwig March 12, 2020, 2:34 p.m. UTC | #5
On Thu, Mar 12, 2020 at 11:07:17AM +1100, Dave Chinner wrote:
> > That's true, but when the timestamps were originally modified,
> > dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> > *not* be a no-op; which is to say, XFS will force the timestamps to be
> > updated on disk when the timestamps are first dirtied, because it
> > doesn't support I_DIRTY_TIME.
> 
> We log the initial timestamp change, and then ignore timestamp
> updates until the dirty time expires and the inode is set
> I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
> time stamps that may be 24 hours out of date in memory, and they
> still need to be flushed to the journal.
> 
> However, your change does not mark the inode dirty on expiry
> anymore, so...
> 
> > So I think we're fine.
> 
> ... we're not fine. This breaks XFS and any other filesystem that
> relies on an I_DIRTY_SYNC notification to handle dirty time expiry
> correctly.

I haven't seen the original mail this replies to, but if we could
get the lazytime expiry by some other means (e.g. an explicit
callback), XFS could opt out of all the VFS inode tracking again,
which would simplify a few things.
Dave Chinner March 12, 2020, 10:39 p.m. UTC | #6
On Thu, Mar 12, 2020 at 07:34:45AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 12, 2020 at 11:07:17AM +1100, Dave Chinner wrote:
> > > That's true, but when the timestamps were originally modified,
> > > dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> > > *not* be a no-op; which is to say, XFS will force the timestamps to be
> > > updated on disk when the timestamps are first dirtied, because it
> > > doesn't support I_DIRTY_TIME.
> > 
> > We log the initial timestamp change, and then ignore timestamp
> > updates until the dirty time expires and the inode is set
> > I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
> > time stamps that may be 24 hours out of date in memory, and they
> > still need to be flushed to the journal.
> > 
> > > However, your change does not mark the inode dirty on expiry
> > > anymore, so...
> > > 
> > > > So I think we're fine.
> > > 
> > > ... we're not fine. This breaks XFS and any other filesystem that
> > > relies on an I_DIRTY_SYNC notification to handle dirty time expiry
> > > correctly.
> 
> I haven't seen the original mail this replies to,

The original problem was that calling mark_inode_dirty_sync() on expiry
during inode writeback was causing the inode to be put back on the
dirty inode list and so ext4 was flushing it twice - once on expiry
and once 5 seconds later on the next background writeback pass.

This is a problem that XFS does not have because it does not
implement ->write_inode...

> but if we could
> get the lazytime expiry by some other means (e.g. an explicit
> callback), XFS could opt out of all the VFS inode tracking again,
> which would simplify a few things.

Yes, that would definitely make things simpler for XFS, and it would
also solve the problem that the generic lazytime expiry code has....

Cheers,

Dave.
Theodore Y. Ts'o March 20, 2020, 2:46 a.m. UTC | #7
On Thu, Mar 12, 2020 at 07:34:45AM -0700, Christoph Hellwig wrote:
> I haven't seen the original mail this replies to, but if we could
> get the lazytime expiry by some other means (e.g. an explicit
> callback), XFS could opt out of all the VFS inode tracking again,
> which would simplify a few things.

Part of my thinking in calling

       inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED);

was that it would be an explicit callback to XFS.  So why don't I break
this into two patches --- one which uses I_DIRTY_SYNC, as before, and a
second one which changes the call so that dirty_inode() is passed
I_DIRTY_TIME_EXPIRED, with a change to XFS so that it recognizes
I_DIRTY_TIME_EXPIRED as if it were I_DIRTY_SYNC.  If this would then
allow XFS to simplify how it handles VFS tracking, you could do that
in a separate patch.

Does that work?  I'll send out the two patches, and if you can
review/ack the second patch, that would be great.

	       	      	     	  	- Ted
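
A rough user-space sketch of what "recognizes I_DIRTY_TIME_EXPIRED as if
it were I_DIRTY_SYNC" could look like, modelled on the two checks in the
xfs_fs_dirty_inode() quoted earlier in the thread.  This is an assumption
about the shape of the second patch, not the patch itself; the demo_*
names and flag values are made up for illustration.

```c
#include <stdbool.h>

/* Illustrative flag values only; the kernel defines the real ones. */
enum {
	DEMO_I_DIRTY_SYNC         = 1 << 0,
	DEMO_I_DIRTY_TIME         = 1 << 11,
	DEMO_I_DIRTY_TIME_EXPIRED = 1 << 12,
};

/* True if 'flag' is one of the notifications that should flush timestamps. */
static bool demo_is_flush_request(int flag)
{
	return flag == DEMO_I_DIRTY_SYNC || flag == DEMO_I_DIRTY_TIME_EXPIRED;
}

/*
 * Same structure as the quoted checks, except that the expiry flag is
 * accepted alongside the sync flag.  Returns true if the timestamps
 * would be logged.
 */
static bool demo_xfs_would_log_v2(bool sb_lazytime, int i_state, int flag)
{
	if (!sb_lazytime)
		return false;
	if (!demo_is_flush_request(flag) || !(i_state & DEMO_I_DIRTY_TIME))
		return false;
	return true;
}
```

With this variant both the old I_DIRTY_SYNC notification and the new
explicit expiry callback pass the checks, while a plain I_DIRTY_TIME
notification is still ignored.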

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..32101349ba97 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1504,8 +1504,9 @@  __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 
 	spin_unlock(&inode->i_lock);
 
-	if (dirty & I_DIRTY_TIME)
-		mark_inode_dirty_sync(inode);
+	/* This was a lazytime expiration; we need to tell the file system */
+	if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode)
+		inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & ~I_DIRTY_PAGES) {
 		int err = write_inode(inode, wbc);