Message ID | 20211102024258.210439-1-sunrise_l@sjtu.edu.cn |
---|---|
State | Rejected |
Headers | show |
Series | ext4: remove unnecessary ext4_inode_datasync_dirty in read path | expand |
On Tue, Nov 02, 2021 at 10:42:58AM +0800, Zhongwei Cai wrote: > ext4_inode_datasync_dirty will call read_lock(&journal->j_state_lock) in > journal mode, which is unnecessary in read path (As far as I know, the > IOMAP_F_DIRTY flag set in the if branch is only used in write path, > making it unnecessary in read path. Please correct me if I'm wrong). IOMAP_F_DIRTY isn't conditional on the type of lookup being done. If the inode is dirty in a way that O_DSYNC would require it to be flushed to make the data stable, iomap should be told that it is dirty, even on read lookups... e.g. iomap_swapfile_activate() uses IOMAP_REPORT as the flags for extent mapping iteration passed to iomap_swapfile_iter(). THis then checks: /* No uncommitted metadata or shared blocks. */ if (iomap->flags & IOMAP_F_DIRTY) return iomap_swapfile_fail(isi, "is not committed"); IOWs, we expect the IOMAP_F_DIRTY flag to be set on all types of iomap mapping calls if the inode is dirty, not just IOMAP_WRITE calls. Cheers, Dave.
On 11/3/21 8:28 AM, Dave Chinner wrote: > IOWs, we expect the IOMAP_F_DIRTY flag to be set on all types of > iomap mapping calls if the inode is dirty, not just IOMAP_WRITE > calls. Thanks for correction! > /* > * Flags reported by the file system from iomap_begin: > * > * IOMAP_F_NEW indicates that the blocks have been newly allocated > * and need zeroing for areas that no data is copied to. > * > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata needed > * to access written data and requires fdatasync to commit them to > * persistent storage. This needs to take into account metadata > * changes that *may* be made at IO completion, such as file size > * updates from direct IO. > * > * IOMAP_F_SHARED indicates that the blocks are shared, and will > * need to be unshared as part a write. > * > * IOMAP_F_MERGED indicates that the iomap contains the merge of > * multiple block mappings. > * > * IOMAP_F_BUFFER_HEAD indicates that the file system requires the > * use of buffer heads for this mapping. > */ According to the comments in include/linux/iomap.h, it seems other flags in the iomap indicates the iomap-related status, but the IOMAP_F_DIRTY flag only indicates the status of the inode. So can I_DIRTY_INODE or I_DIRTY_PAGES flags in the inode replace it? And for ext4 at least we can do /* * Writes that span EOF might trigger an I/O size update on completion, * so consider them to be dirty for the purpose of O_DSYNC, even if * there is no other metadata changes being made or are pending. */ if (ext4_inode_datasync_dirty(inode) || - offset + length > i_size_read(inode)) + ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode))) iomap->flags |= IOMAP_F_DIRTY; , since only writes that span EOF might trigger an update. How do you feel about it? Best, Zhongwei.
On Thu, Nov 04, 2021 at 05:29:59PM +0800, Zhongwei Cai wrote: > On 11/3/21 8:28 AM, Dave Chinner wrote: > > IOWs, we expect the IOMAP_F_DIRTY flag to be set on all types of > > iomap mapping calls if the inode is dirty, not just IOMAP_WRITE > > calls. > > Thanks for correction! > > > /* > > * Flags reported by the file system from iomap_begin: > > * > > * IOMAP_F_NEW indicates that the blocks have been newly allocated > > * and need zeroing for areas that no data is copied to. > > * > > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata needed > > * to access written data and requires fdatasync to commit them to > > * persistent storage. This needs to take into account metadata > > * changes that *may* be made at IO completion, such as file size > > * updates from direct IO. > > * > > * IOMAP_F_SHARED indicates that the blocks are shared, and will > > * need to be unshared as part a write. > > * > > * IOMAP_F_MERGED indicates that the iomap contains the merge of > > * multiple block mappings. > > * > > * IOMAP_F_BUFFER_HEAD indicates that the file system requires the > > * use of buffer heads for this mapping. > > */ > > According to the comments in include/linux/iomap.h, it seems other > flags in the iomap indicates the iomap-related status, but the > IOMAP_F_DIRTY flag only indicates the status of the inode. So can > I_DIRTY_INODE or I_DIRTY_PAGES flags in the inode replace it? No. Some filesystems don't track inode metadata dirty status using the VFS inode; instead they track it more efficiently in internal inode and/or journal based structures. Hence the only way to get "inode needs journal flush for data stability" information to generic IO code is to have a specific per-IO mapping flag for it. > And for ext4 at least we can do > > /* > * Writes that span EOF might trigger an I/O size update on completion, > * so consider them to be dirty for the purpose of O_DSYNC, even if > * there is no other metadata changes being made or are pending. > */ > if (ext4_inode_datasync_dirty(inode) || > - offset + length > i_size_read(inode)) > + ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode))) > iomap->flags |= IOMAP_F_DIRTY; > > , since only writes that span EOF might trigger an update. How > do you feel about it? ext4 can do what it likes here, but given that the problem that was being addressed was avoiding lock contention in ext4_inode_datasync_dirty(), this appears to be a completely unnecessary change as it doesn't address the problem being reported. Cheers, Dave.
On 11/5/21 7:22 AM, Dave Chinner wrote: > > No. Some filesystems don't track inode metadata dirty status using > the VFS inode; instead they track it more efficiently in internal > inode and/or journal based structures. Hence the only way to get > "inode needs journal flush for data stability" information to > generic IO code is to have a specific per-IO mapping flag for it. > Could we add IOMAP_REPORT_DIRTY flag in the flags field of struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag needs to be set or not? Currently the IOMAP_F_DIRTY flag is only checked in iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw() (To be more specific, only the write path in dax_iomap_fault() and iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY flag in dax_iomap_rw() called in the previous tests. Other file systems that set the IOMAP_F_DIRTY flag efficiently could ignore the IOMAP_REPORT_DIRTY flag. Best, Zhongwei.
On Fri, Nov 05, 2021 at 01:28:10PM +0800, Zhongwei Cai wrote: > > > On 11/5/21 7:22 AM, Dave Chinner wrote: > > > > No. Some filesystems don't track inode metadata dirty status using > > the VFS inode; instead they track it more efficiently in internal > > inode and/or journal based structures. Hence the only way to get > > "inode needs journal flush for data stability" information to > > generic IO code is to have a specific per-IO mapping flag for it. > > > > Could we add IOMAP_REPORT_DIRTY flag in the flags field of > struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag > needs to be set or not? You can try. It might turn out OK, but you're also going to have to modify all the iomap code that current needs IOMAP_F_DIRTY to first set that flag, then change all the code that currently sets IOMAP_F_DIRTY to look at IOMAP_REPORT_DIRTY. i.e you now have to change iomap, ext4 and XFS to do this. > Currently the IOMAP_F_DIRTY flag is only checked in > iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw() > (To be more specific, only the write path in dax_iomap_fault() and > iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY > flag in dax_iomap_rw() called in the previous tests. I think you're trying to optimise the wrong thing - the API is not the problem, the problem is that the journal->j_state_lock is contended and the ext4 dirty inode check needs to take it. Fix the dirty check not to need the journal state lock and the ext4 problem goes away and there is no need to change the iomap infrastructure. > Other file systems that set the IOMAP_F_DIRTY flag efficiently > could ignore the IOMAP_REPORT_DIRTY flag. No, that's just bad API design. If we are adding IOMAP_REPORT_DIRTY then the iomap infrastructure should only use that control flag when it needs to know if the inode is dirty. At this point, it really becomes mandatory for all filesystems using iomap to support it because the absence of IOMAP_F_DIRTY because a filesystem doesn't support it is not the same as "filesystem didn't set it because the inode is clean". Cheers, Dave.
On 11/9/21 12:50 PM, Dave Chinner wrote: >> >> Could we add IOMAP_REPORT_DIRTY flag in the flags field of >> struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag >> needs to be set or not? > > You can try. It might turn out OK, but you're also going to have to > modify all the iomap code that current needs IOMAP_F_DIRTY to first > set that flag, then change all the code that currently sets > IOMAP_F_DIRTY to look at IOMAP_REPORT_DIRTY. i.e you now have to > change iomap, ext4 and XFS to do this. > I will make a v2 patch with this implementation. >> Currently the IOMAP_F_DIRTY flag is only checked in >> iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw() >> (To be more specific, only the write path in dax_iomap_fault() and >> iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY >> flag in dax_iomap_rw() called in the previous tests. > > I think you're trying to optimise the wrong thing - the API is not > the problem, the problem is that the journal->j_state_lock is > contended and the ext4 dirty inode check needs to take it. Fix the > dirty check not to need the journal state lock and the ext4 problem > goes away and there is no need to change the iomap infrastructure. I'll try to fix it inside ext4, although it seems difficult to do dirty check without journal->j_state_lock. >> Other file systems that set the IOMAP_F_DIRTY flag efficiently >> could ignore the IOMAP_REPORT_DIRTY flag. > > No, that's just bad API design. If we are adding IOMAP_REPORT_DIRTY > then the iomap infrastructure should only use that control flag when > it needs to know if the inode is dirty. At this point, it really > becomes mandatory for all filesystems using iomap to support it > because the absence of IOMAP_F_DIRTY because a filesystem doesn't > support it is not the same as "filesystem didn't set it because the > inode is clean". > Perhaps I have not made it clear that by "ignore" I mean other file systems can set IOMAP_F_DIRTY regardless of whether the IOMAP_REPORT_DIRTY flag is set or not, just like what they are doing now. So we might not need to modify XFS. I think even without the modification I made, the ambiguity that the absence of IOMAP_F_DIRTY can either be file systems not supporting it or be actually "clean inode" may exist since we do not have a flag to indicate whether the file system supports setting IOMAP_F_DIRTY. Best, Zhongwei
On Wed, Nov 10, 2021 at 04:07:57PM +0800, Zhongwei Cai wrote: > On 11/9/21 12:50 PM, Dave Chinner wrote: > > > > > > Could we add IOMAP_REPORT_DIRTY flag in the flags field of > > > struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag > > > needs to be set or not? > > > > You can try. It might turn out OK, but you're also going to have to > > modify all the iomap code that current needs IOMAP_F_DIRTY to first > > set that flag, then change all the code that currently sets > > IOMAP_F_DIRTY to look at IOMAP_REPORT_DIRTY. i.e you now have to > > change iomap, ext4 and XFS to do this. > > > I will make a v2 patch with this implementation. > > > > Currently the IOMAP_F_DIRTY flag is only checked in > > > iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw() > > > (To be more specific, only the write path in dax_iomap_fault() and > > > iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY > > > flag in dax_iomap_rw() called in the previous tests. > > > > I think you're trying to optimise the wrong thing - the API is not > > the problem, the problem is that the journal->j_state_lock is > > contended and the ext4 dirty inode check needs to take it. Fix the > > dirty check not to need the journal state lock and the ext4 problem > > goes away and there is no need to change the iomap infrastructure. > > I'll try to fix it inside ext4, although it seems difficult to do dirty > check without journal->j_state_lock. > > > > Other file systems that set the IOMAP_F_DIRTY flag efficiently > > > could ignore the IOMAP_REPORT_DIRTY flag. > > > > No, that's just bad API design. If we are adding IOMAP_REPORT_DIRTY > > then the iomap infrastructure should only use that control flag when > > it needs to know if the inode is dirty. At this point, it really > > becomes mandatory for all filesystems using iomap to support it > > because the absence of IOMAP_F_DIRTY because a filesystem doesn't > > support it is not the same as "filesystem didn't set it because the > > inode is clean". > > > Perhaps I have not made it clear that by "ignore" I mean other file > systems can set IOMAP_F_DIRTY regardless of whether the > IOMAP_REPORT_DIRTY flag is set or not, just like what they are doing > now. So we might not need to modify XFS. I assumed that was exactly what you meant. That's just bad/lazy API design. We need to be explicit in the way behaviours are defined, not create APIs where callees can set random flags whenever they like. > I think even without the modification I made, the ambiguity that > the absence of IOMAP_F_DIRTY can either be file systems not supporting > it or be actually "clean inode" may exist since we do not have a flag > to indicate whether the file system supports setting IOMAP_F_DIRTY. Exactly my point - the current behaviour it's not clearly defined (just set it always if you can!) but ext4 needs it to be much more explicitly constrained to avoid internal overhead. Hence we need to constrain the IOMAP_F_DIRTY flag behaviour to just the iomap operations that need to know if the inode is dirty (i.e. filesystems should set IOMAP_F_DIRTY if and only if IOMAP_REPORT_DIRTY is passed as a control flag from the iomap core and the inode is dirty) and then ensure that *every filesystem implements it correctly*. There are no shortcuts here - the cost of changing generic infrastructure is that you have to ensure that all users of interface use it properly. That includes filesystems that don't currently set IOMAP_F_DIRTY but use iomap interfaces that will now set IOMAP_REPORT_DIRTY.... Cheers, Dave.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0f06305167d5..72ec2074ef54 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3274,7 +3274,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode) static void ext4_set_iomap(struct inode *inode, struct iomap *iomap, struct ext4_map_blocks *map, loff_t offset, - loff_t length) + loff_t length, int flags) { u8 blkbits = inode->i_blkbits; @@ -3284,8 +3284,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap, * there is no other metadata changes being made or are pending. */ iomap->flags = 0; - if (ext4_inode_datasync_dirty(inode) || - offset + length > i_size_read(inode)) + if ((flags & IOMAP_WRITE) && (ext4_inode_datasync_dirty(inode) || + offset + length > i_size_read(inode))) iomap->flags |= IOMAP_F_DIRTY; if (map->m_flags & EXT4_MAP_NEW) @@ -3423,7 +3423,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, if (ret < 0) return ret; out: - ext4_set_iomap(inode, iomap, &map, offset, length); + ext4_set_iomap(inode, iomap, &map, offset, length, flags); return 0; } @@ -3543,7 +3543,7 @@ static int ext4_iomap_begin_report(struct inode *inode, loff_t offset, delalloc = ext4_iomap_is_delalloc(inode, &map); set_iomap: - ext4_set_iomap(inode, iomap, &map, offset, length); + ext4_set_iomap(inode, iomap, &map, offset, length, flags); if (delalloc && iomap->type == IOMAP_HOLE) iomap->type = IOMAP_DELALLOC;
ext4_inode_datasync_dirty will call read_lock(&journal->j_state_lock) in journal mode, which is unnecessary in read path (As far as I know, the IOMAP_F_DIRTY flag set in the if branch is only used in write path, making it unnecessary in read path. Please correct me if I'm wrong). and will cause cache contention overhead under high concurrency especially in DAX mode. The unnecessary ext4_inode_datasync_dirty can be eliminated by passing flags into ext4_set_iomap and checking it. Performance tests are shown below. Workloads include simply reading files, fileserver in filebench and readrandom/readsequence in RocksDB under nosync mode. Sixteen thread performance under ext4-DAX: Throughput (Kop/s) | original | +patch | improvement -------------------+----------+----------+-------------- Read 4KB block | 11456 | 27651 | +141.37% fileserver | 339 | 343 | +1.18% readrandom | 1807 | 1837 | +1.66% readsequence | 29724 | 30102 | +1.27% Signed-off-by: Zhongwei Cai <sunrise_l@sjtu.edu.cn> --- fs/ext4/inode.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)