| Message ID | 20260511121356.241821-15-jack@suse.cz |
|---|---|
| State | Not Applicable |
| Headers | show |
| Series | fs: Fix missed inode write during fsync | expand |
Jan Kara <jack@suse.cz> writes: > Use mmb inode buffer writeout infrastructure to reliably write out > inode's buffer on fsync(2). > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/fat/inode.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/fat/inode.c b/fs/fat/inode.c > index 28f78df086ef..4ca00b7a618b 100644 > --- a/fs/fat/inode.c > +++ b/fs/fat/inode.c > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait) > } > spin_unlock(&sbi->inode_hash_lock); > mark_buffer_dirty(bh); > + MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr; When inode position was changed/removed, this will point the wrong block. And maybe sync a unrelated block and wait. > err = 0; > if (wait) > err = sync_dirty_buffer(bh); > @@ -925,7 +926,7 @@ static int fat_write_inode(struct inode *inode, struct writeback_control *wbc) > err = fat_clusters_flush(sb); > mutex_unlock(&MSDOS_SB(sb)->s_lock); > } else > - err = __fat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL); > + err = __fat_write_inode(inode, 0); > > return err; > }
On Mon 11-05-26 23:32:45, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > > Use mmb inode buffer writeout infrastructure to reliably write out > > inode's buffer on fsync(2). > > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > fs/fat/inode.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/fs/fat/inode.c b/fs/fat/inode.c > > index 28f78df086ef..4ca00b7a618b 100644 > > --- a/fs/fat/inode.c > > +++ b/fs/fat/inode.c > > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait) > > } > > spin_unlock(&sbi->inode_hash_lock); > > mark_buffer_dirty(bh); > > + MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr; > > When inode position was changed/removed, this will point the wrong > block. And maybe sync a unrelated block and wait. So I didn't realize that e.g. rename does change the backing inode block. But given we set i_metadata_bhs.inode_blk on each inode write, inode_blk should always contain the current position where the inode was written so fsync should be syncing the right block. Or am I still missing something? Honza
Jan Kara <jack@suse.cz> writes: > On Mon 11-05-26 23:32:45, OGAWA Hirofumi wrote: >> Jan Kara <jack@suse.cz> writes: >> >> > Use mmb inode buffer writeout infrastructure to reliably write out >> > inode's buffer on fsync(2). >> >> > Signed-off-by: Jan Kara <jack@suse.cz> >> > --- >> > fs/fat/inode.c | 3 ++- >> > 1 file changed, 2 insertions(+), 1 deletion(-) >> > >> > diff --git a/fs/fat/inode.c b/fs/fat/inode.c >> > index 28f78df086ef..4ca00b7a618b 100644 >> > --- a/fs/fat/inode.c >> > +++ b/fs/fat/inode.c >> > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait) >> > } >> > spin_unlock(&sbi->inode_hash_lock); >> > mark_buffer_dirty(bh); >> > + MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr; >> >> When inode position was changed/removed, this will point the wrong >> block. And maybe sync a unrelated block and wait. > > So I didn't realize that e.g. rename does change the backing inode block. > But given we set i_metadata_bhs.inode_blk on each inode write, inode_blk > should always contain the current position where the inode was written so > fsync should be syncing the right block. Or am I still missing something? I didn't check the case of rename completely, just recalled it when I saw this code, need confirm/check. But at least, the case of remove will leave it even after the block is reused.
On Tue 12-05-26 03:02:13, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > > On Mon 11-05-26 23:32:45, OGAWA Hirofumi wrote: > >> Jan Kara <jack@suse.cz> writes: > >> > >> > Use mmb inode buffer writeout infrastructure to reliably write out > >> > inode's buffer on fsync(2). > >> > >> > Signed-off-by: Jan Kara <jack@suse.cz> > >> > --- > >> > fs/fat/inode.c | 3 ++- > >> > 1 file changed, 2 insertions(+), 1 deletion(-) > >> > > >> > diff --git a/fs/fat/inode.c b/fs/fat/inode.c > >> > index 28f78df086ef..4ca00b7a618b 100644 > >> > --- a/fs/fat/inode.c > >> > +++ b/fs/fat/inode.c > >> > @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait) > >> > } > >> > spin_unlock(&sbi->inode_hash_lock); > >> > mark_buffer_dirty(bh); > >> > + MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr; > >> > >> When inode position was changed/removed, this will point the wrong > >> block. And maybe sync a unrelated block and wait. > > > > So I didn't realize that e.g. rename does change the backing inode block. > > But given we set i_metadata_bhs.inode_blk on each inode write, inode_blk > > should always contain the current position where the inode was written so > > fsync should be syncing the right block. Or am I still missing something? > > I didn't check the case of rename completely, just recalled it when I > saw this code, need confirm/check. But at least, the case of remove > will leave it even after the block is reused. Right. fat_detach() should set i_metadata_bhs.inode_blk to INVALID_BLK, thanks for catching that. I was thinking whether we should set i_metadata_bhs.inode_blk in fat_attach() instead of during inode dirtying. It would be somewhat more obviously correct but it could lead to unnecessary flushing in case the directory block gets dirtied by some other entry in it while the inode we are fsyncing got never dirtied. IMHO that's a sensible tradeoff so I'd do that but what is your opinion? Honza
Jan Kara <jack@suse.cz> writes: >> I didn't check the case of rename completely, just recalled it when I >> saw this code, need confirm/check. But at least, the case of remove >> will leave it even after the block is reused. > > Right. fat_detach() should set i_metadata_bhs.inode_blk to INVALID_BLK, > thanks for catching that. I was thinking whether we should set > i_metadata_bhs.inode_blk in fat_attach() instead of during inode dirtying. > It would be somewhat more obviously correct but it could lead to > unnecessary flushing in case the directory block gets dirtied by some other > entry in it while the inode we are fsyncing got never dirtied. IMHO that's > a sensible tradeoff so I'd do that but what is your opinion? IMO, the marker should be cleared like b_assoc_buffers or I_DIRTY_* flags after each sync. Otherwise, because the block is shared with other inodes, it would sync/wait the unrelated dirty easily. [And more serious implementation, looks like it should be cleared at similar points or such with b_assoc_buffers is cleared to minimize unrelated sync/wait.]
On Tue 12-05-26 23:17:55, OGAWA Hirofumi wrote: > Jan Kara <jack@suse.cz> writes: > > >> I didn't check the case of rename completely, just recalled it when I > >> saw this code, need confirm/check. But at least, the case of remove > >> will leave it even after the block is reused. > > > > Right. fat_detach() should set i_metadata_bhs.inode_blk to INVALID_BLK, > > thanks for catching that. I was thinking whether we should set > > i_metadata_bhs.inode_blk in fat_attach() instead of during inode dirtying. > > It would be somewhat more obviously correct but it could lead to > > unnecessary flushing in case the directory block gets dirtied by some other > > entry in it while the inode we are fsyncing got never dirtied. IMHO that's > > a sensible tradeoff so I'd do that but what is your opinion? > > IMO, the marker should be cleared like b_assoc_buffers or I_DIRTY_* > flags after each sync. Otherwise, because the block is shared with other > inodes, it would sync/wait the unrelated dirty easily. Well, even if we do that, we should still clear inode_blk in fat_detach() AFAICT as nobody has to sync the inode before unlinking it. Regarding clearing of inode_blk in mmb_sync() - yes, that makes sense could be done although then we have to be careful about races of mmb_sync() with .write_inode so that always guarantee mmb_sync() after .write_inode will persist the buffer. I'll see how complex that will get. > [And more serious implementation, looks like it should be cleared at > similar points or such with b_assoc_buffers is cleared to minimize > unrelated sync/wait.] OTOH this doesn't really make much sense. We need to handle b_assoc_buffers when bh is getting reclaimed so that we can free the bh. We deliberately track inode block number and not bh pointer in mapping_metadata_bhs so that bhs backing inodes can be freed independently as it's infeasible to track down all mapping_metadata_bhs structs that might be referencing this bh. And yes, I'm aware that in some corner cases the simple tracking can result in mmb_sync() writing out inode buffer although the inode itself was already persisted but exact tracking is way too expensive and not worth it for simple filesystems using this infrastructure. If we implement clearing of inode_blk in mmb_sync(), then we'll write out inode buffer unnecessarily at most once after the inode gets dirty which is IMO a reasonable middle ground between complexity of the tracking and unnecessary writeback. Honza
diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 28f78df086ef..4ca00b7a618b 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -907,6 +907,7 @@ static int __fat_write_inode(struct inode *inode, int wait) } spin_unlock(&sbi->inode_hash_lock); mark_buffer_dirty(bh); + MSDOS_I(inode)->i_metadata_bhs.inode_blk = bh->b_blocknr; err = 0; if (wait) err = sync_dirty_buffer(bh); @@ -925,7 +926,7 @@ static int fat_write_inode(struct inode *inode, struct writeback_control *wbc) err = fat_clusters_flush(sb); mutex_unlock(&MSDOS_SB(sb)->s_lock); } else - err = __fat_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL); + err = __fat_write_inode(inode, 0); return err; }
Use mmb inode buffer writeout infrastructure to reliably write out inode's buffer on fsync(2). Signed-off-by: Jan Kara <jack@suse.cz> --- fs/fat/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)