diff mbox series

[6/9] fat: Fix possibly missing inode write on fsync(2)

Message ID 20260511121356.241821-15-jack@suse.cz
State Not Applicable
Headers show
Series fs: Fix missed inode write during fsync | expand

Commit Message

Jan Kara May 11, 2026, 12:13 p.m. UTC
Use mmb inode buffer writeout infrastructure to reliably write out
inode's buffer on fsync(2).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fat/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

OGAWA Hirofumi May 11, 2026, 2:32 p.m. UTC | #1
Jan Kara <jack@suse.cz> writes:

> Use mmb inode buffer writeout infrastructure to reliably write out
> inode's buffer on fsync(2).

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fat/inode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> index 28f78df086ef..4ca00b7a618b 100644
> --- a/fs/fat/inode.c
> +++ b/fs/fat/inode.c
> @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait)
>  	}
>  	spin_unlock(&sbi->inode_hash_lock);
>  	mark_buffer_dirty(bh);
> +	MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr;

When inode position was changed/removed, this will point the wrong
block. And maybe sync a unrelated block and wait.

>  	err = 0;
>  	if (wait)
>  		err = sync_dirty_buffer(bh);
> @@ -925,7 +926,7 @@ static int fat_write_inode(struct inode *inode, struct writeback_control *wbc)
>  		err = fat_clusters_flush(sb);
>  		mutex_unlock(&MSDOS_SB(sb)->s_lock);
>  	} else
> -		err = __fat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
> +		err = __fat_write_inode(inode, 0);
>  
>  	return err;
>  }
Jan Kara May 11, 2026, 5:03 p.m. UTC | #2
On Mon 11-05-26 23:32:45, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > Use mmb inode buffer writeout infrastructure to reliably write out
> > inode's buffer on fsync(2).
> 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/fat/inode.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> > index 28f78df086ef..4ca00b7a618b 100644
> > --- a/fs/fat/inode.c
> > +++ b/fs/fat/inode.c
> > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait)
> >  	}
> >  	spin_unlock(&sbi->inode_hash_lock);
> >  	mark_buffer_dirty(bh);
> > +	MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr;
> 
> When inode position was changed/removed, this will point the wrong
> block. And maybe sync a unrelated block and wait.

So I didn't realize that e.g. rename does change the backing inode block.
But given we set i_metadata_bhs.inode_blk on each inode write, inode_blk
should always contain the current position where the inode was written so
fsync should be syncing the right block. Or am I still missing something?

								Honza
OGAWA Hirofumi May 11, 2026, 6:02 p.m. UTC | #3
Jan Kara <jack@suse.cz> writes:

> On Mon 11-05-26 23:32:45, OGAWA Hirofumi wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> > Use mmb inode buffer writeout infrastructure to reliably write out
>> > inode's buffer on fsync(2).
>> 
>> > Signed-off-by: Jan Kara <jack@suse.cz>
>> > ---
>> >  fs/fat/inode.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/fat/inode.c b/fs/fat/inode.c
>> > index 28f78df086ef..4ca00b7a618b 100644
>> > --- a/fs/fat/inode.c
>> > +++ b/fs/fat/inode.c
>> > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait)
>> >  	}
>> >  	spin_unlock(&sbi->inode_hash_lock);
>> >  	mark_buffer_dirty(bh);
>> > +	MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr;
>> 
>> When inode position was changed/removed, this will point the wrong
>> block. And maybe sync a unrelated block and wait.
>
> So I didn't realize that e.g. rename does change the backing inode block.
> But given we set i_metadata_bhs.inode_blk on each inode write, inode_blk
> should always contain the current position where the inode was written so
> fsync should be syncing the right block. Or am I still missing something?

I didn't check the case of rename completely, just recalled it when I
saw this code, need confirm/check.  But at least, the case of remove
will leave it even after the block is reused.
Jan Kara May 12, 2026, 7:29 a.m. UTC | #4
On Tue 12-05-26 03:02:13, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Mon 11-05-26 23:32:45, OGAWA Hirofumi wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >> 
> >> > Use mmb inode buffer writeout infrastructure to reliably write out
> >> > inode's buffer on fsync(2).
> >> 
> >> > Signed-off-by: Jan Kara <jack@suse.cz>
> >> > ---
> >> >  fs/fat/inode.c | 3 ++-
> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> >> > index 28f78df086ef..4ca00b7a618b 100644
> >> > --- a/fs/fat/inode.c
> >> > +++ b/fs/fat/inode.c
> >> > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait)
> >> >  	}
> >> >  	spin_unlock(&sbi->inode_hash_lock);
> >> >  	mark_buffer_dirty(bh);
> >> > +	MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr;
> >> 
> >> When inode position was changed/removed, this will point the wrong
> >> block. And maybe sync a unrelated block and wait.
> >
> > So I didn't realize that e.g. rename does change the backing inode block.
> > But given we set i_metadata_bhs.inode_blk on each inode write, inode_blk
> > should always contain the current position where the inode was written so
> > fsync should be syncing the right block. Or am I still missing something?
> 
> I didn't check the case of rename completely, just recalled it when I
> saw this code, need confirm/check.  But at least, the case of remove
> will leave it even after the block is reused.

Right. fat_detach() should set i_metadata_bhs.inode_blk to INVALID_BLK,
thanks for catching that. I was thinking whether we should set
i_metadata_bhs.inode_blk in fat_attach() instead of during inode dirtying.
It would be somewhat more obviously correct but it could lead to
unnecessary flushing in case the directory block gets dirtied by some other
entry in it while the inode we are fsyncing got never dirtied. IMHO that's
a sensible tradeoff so I'd do that but what is your opinion?

								Honza
OGAWA Hirofumi May 12, 2026, 2:17 p.m. UTC | #5
Jan Kara <jack@suse.cz> writes:

>> I didn't check the case of rename completely, just recalled it when I
>> saw this code, need confirm/check.  But at least, the case of remove
>> will leave it even after the block is reused.
>
> Right. fat_detach() should set i_metadata_bhs.inode_blk to INVALID_BLK,
> thanks for catching that. I was thinking whether we should set
> i_metadata_bhs.inode_blk in fat_attach() instead of during inode dirtying.
> It would be somewhat more obviously correct but it could lead to
> unnecessary flushing in case the directory block gets dirtied by some other
> entry in it while the inode we are fsyncing got never dirtied. IMHO that's
> a sensible tradeoff so I'd do that but what is your opinion?

IMO, the marker should be cleared like b_assoc_buffers or I_DIRTY_*
flags after each sync. Otherwise, because the block is shared with other
inodes, it would sync/wait the unrelated dirty easily.

[And more serious implementation, looks like it should be cleared at
similar points or such with b_assoc_buffers is cleared to minimize
unrelated sync/wait.]
Jan Kara May 13, 2026, 9:41 a.m. UTC | #6
On Tue 12-05-26 23:17:55, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >> I didn't check the case of rename completely, just recalled it when I
> >> saw this code, need confirm/check.  But at least, the case of remove
> >> will leave it even after the block is reused.
> >
> > Right. fat_detach() should set i_metadata_bhs.inode_blk to INVALID_BLK,
> > thanks for catching that. I was thinking whether we should set
> > i_metadata_bhs.inode_blk in fat_attach() instead of during inode dirtying.
> > It would be somewhat more obviously correct but it could lead to
> > unnecessary flushing in case the directory block gets dirtied by some other
> > entry in it while the inode we are fsyncing got never dirtied. IMHO that's
> > a sensible tradeoff so I'd do that but what is your opinion?
> 
> IMO, the marker should be cleared like b_assoc_buffers or I_DIRTY_*
> flags after each sync. Otherwise, because the block is shared with other
> inodes, it would sync/wait the unrelated dirty easily.

Well, even if we do that, we should still clear inode_blk in fat_detach()
AFAICT as nobody has to sync the inode before unlinking it.

Regarding clearing of inode_blk in mmb_sync() - yes, that makes sense could
be done although then we have to be careful about races of mmb_sync() with
.write_inode so that always guarantee mmb_sync() after .write_inode will
persist the buffer. I'll see how complex that will get.

> [And more serious implementation, looks like it should be cleared at
> similar points or such with b_assoc_buffers is cleared to minimize
> unrelated sync/wait.]

OTOH this doesn't really make much sense. We need to handle b_assoc_buffers
when bh is getting reclaimed so that we can free the bh. We deliberately
track inode block number and not bh pointer in mapping_metadata_bhs so that
bhs backing inodes can be freed independently as it's infeasible to track
down all mapping_metadata_bhs structs that might be referencing this bh.

And yes, I'm aware that in some corner cases the simple tracking can result
in mmb_sync() writing out inode buffer although the inode itself was
already persisted but exact tracking is way too expensive and not worth it
for simple filesystems using this infrastructure. If we implement clearing
of inode_blk in mmb_sync(), then we'll write out inode buffer unnecessarily
at most once after the inode gets dirty which is IMO a reasonable middle
ground between complexity of the tracking and unnecessary writeback.

								Honza
diff mbox series

Patch

diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 28f78df086ef..4ca00b7a618b 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -907,6 +907,7 @@  static int __fat_write_inode(struct inode *inode, int wait)
 	}
 	spin_unlock(&sbi->inode_hash_lock);
 	mark_buffer_dirty(bh);
+	MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr;
 	err = 0;
 	if (wait)
 		err = sync_dirty_buffer(bh);
@@ -925,7 +926,7 @@  static int fat_write_inode(struct inode *inode, struct writeback_control *wbc)
 		err = fat_clusters_flush(sb);
 		mutex_unlock(&MSDOS_SB(sb)->s_lock);
 	} else
-		err = __fat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+		err = __fat_write_inode(inode, 0);
 
 	return err;
 }