diff mbox series

ext4: remove unnecessary ext4_inode_datasync_dirty in read path

Message ID 20211102024258.210439-1-sunrise_l@sjtu.edu.cn
State Rejected
Headers show
Series ext4: remove unnecessary ext4_inode_datasync_dirty in read path | expand

Commit Message

Zhongwei Cai Nov. 2, 2021, 2:42 a.m. UTC
ext4_inode_datasync_dirty will call read_lock(&journal->j_state_lock) in
journal mode, which is unnecessary in read path (As far as I know, the
IOMAP_F_DIRTY flag set in the if branch is only used in write path,
making it unnecessary in read path. Please correct me if I'm wrong).
and will cause cache contention overhead under high concurrency
especially in DAX mode. The unnecessary ext4_inode_datasync_dirty can be
eliminated by passing flags into ext4_set_iomap and checking it.

Performance tests are shown below. Workloads include simply reading files,
fileserver in filebench and readrandom/readsequence in RocksDB under
nosync mode.

Sixteen thread performance under ext4-DAX:
 Throughput (Kop/s) | original |  +patch  | improvement
 -------------------+----------+----------+--------------
 Read 4KB block     |   11456  |   27651  |  +141.37%
 fileserver         |     339  |     343  |  +1.18%
 readrandom         |    1807  |    1837  |  +1.66%
 readsequence       |   29724  |   30102  |  +1.27%

Signed-off-by: Zhongwei Cai <sunrise_l@sjtu.edu.cn>
---
 fs/ext4/inode.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Dave Chinner Nov. 3, 2021, 12:28 a.m. UTC | #1
On Tue, Nov 02, 2021 at 10:42:58AM +0800, Zhongwei Cai wrote:
> ext4_inode_datasync_dirty will call read_lock(&journal->j_state_lock) in
> journal mode, which is unnecessary in read path (As far as I know, the
> IOMAP_F_DIRTY flag set in the if branch is only used in write path,
> making it unnecessary in read path. Please correct me if I'm wrong).

IOMAP_F_DIRTY isn't conditional on the type of lookup being done. If
the inode is dirty in a way that O_DSYNC would require it to be
flushed to make the data stable, iomap should be told that it is
dirty, even on read lookups...

e.g. iomap_swapfile_activate() uses IOMAP_REPORT as the flags for
extent mapping iteration passed to iomap_swapfile_iter(). THis then
checks:

	/* No uncommitted metadata or shared blocks. */
	if (iomap->flags & IOMAP_F_DIRTY)
		return iomap_swapfile_fail(isi, "is not committed");

IOWs, we expect the IOMAP_F_DIRTY flag to be set on all types of
iomap mapping calls if the inode is dirty, not just IOMAP_WRITE
calls.

Cheers,

Dave.
Zhongwei Cai Nov. 4, 2021, 9:29 a.m. UTC | #2
On 11/3/21 8:28 AM, Dave Chinner wrote:
 > IOWs, we expect the IOMAP_F_DIRTY flag to be set on all types of
 > iomap mapping calls if the inode is dirty, not just IOMAP_WRITE
 > calls.

Thanks for correction!

 > /*
 >  * Flags reported by the file system from iomap_begin:
 >  *
 >  * IOMAP_F_NEW indicates that the blocks have been newly allocated
 >  * and need zeroing for areas that no data is copied to.
 >  *
 >  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata needed
 >  * to access written data and requires fdatasync to commit them to
 >  * persistent  storage. This needs to take into account metadata
 >  * changes that  *may* be made at IO completion, such as file size
 >  * updates from direct IO.
 >  *
 >  * IOMAP_F_SHARED indicates that the blocks are shared, and will
 >  * need to be unshared as part a write.
 >  *
 >  * IOMAP_F_MERGED indicates that the iomap contains the merge of
 >  * multiple block mappings.
 >  *
 >  * IOMAP_F_BUFFER_HEAD indicates that the file system requires the
 >  * use of buffer heads for this mapping.
 >  */

According to the comments in include/linux/iomap.h, it seems other
flags in the iomap indicates the iomap-related status, but the
IOMAP_F_DIRTY flag only indicates the status of the inode. So can
I_DIRTY_INODE or I_DIRTY_PAGES flags in the inode replace it?

And for ext4 at least we can do

    /*
     * Writes that span EOF might trigger an I/O size update on completion,
     * so consider them to be dirty for the purpose of O_DSYNC, even if
     * there is no other metadata changes being made or are pending.
     */
     if (ext4_inode_datasync_dirty(inode) ||
-       offset + length > i_size_read(inode))
+       ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode)))
         iomap->flags |= IOMAP_F_DIRTY;

, since only writes that span EOF might trigger an update. How
do you feel about it?

Best,

Zhongwei.
Dave Chinner Nov. 4, 2021, 11:22 p.m. UTC | #3
On Thu, Nov 04, 2021 at 05:29:59PM +0800, Zhongwei Cai wrote:
> On 11/3/21 8:28 AM, Dave Chinner wrote:
> > IOWs, we expect the IOMAP_F_DIRTY flag to be set on all types of
> > iomap mapping calls if the inode is dirty, not just IOMAP_WRITE
> > calls.
> 
> Thanks for correction!
> 
> > /*
> >  * Flags reported by the file system from iomap_begin:
> >  *
> >  * IOMAP_F_NEW indicates that the blocks have been newly allocated
> >  * and need zeroing for areas that no data is copied to.
> >  *
> >  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata needed
> >  * to access written data and requires fdatasync to commit them to
> >  * persistent  storage. This needs to take into account metadata
> >  * changes that  *may* be made at IO completion, such as file size
> >  * updates from direct IO.
> >  *
> >  * IOMAP_F_SHARED indicates that the blocks are shared, and will
> >  * need to be unshared as part a write.
> >  *
> >  * IOMAP_F_MERGED indicates that the iomap contains the merge of
> >  * multiple block mappings.
> >  *
> >  * IOMAP_F_BUFFER_HEAD indicates that the file system requires the
> >  * use of buffer heads for this mapping.
> >  */
> 
> According to the comments in include/linux/iomap.h, it seems other
> flags in the iomap indicates the iomap-related status, but the
> IOMAP_F_DIRTY flag only indicates the status of the inode. So can
> I_DIRTY_INODE or I_DIRTY_PAGES flags in the inode replace it?

No. Some filesystems don't track inode metadata dirty status using
the VFS inode; instead they track it more efficiently in internal
inode and/or journal based structures. Hence the only way to get
"inode needs journal flush for data stability" information to
generic IO code is to have a specific per-IO mapping flag for it.

> And for ext4 at least we can do
> 
>    /*
>     * Writes that span EOF might trigger an I/O size update on completion,
>     * so consider them to be dirty for the purpose of O_DSYNC, even if
>     * there is no other metadata changes being made or are pending.
>     */
>     if (ext4_inode_datasync_dirty(inode) ||
> -       offset + length > i_size_read(inode))
> +       ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode)))
>         iomap->flags |= IOMAP_F_DIRTY;
> 
> , since only writes that span EOF might trigger an update. How
> do you feel about it?

ext4 can do what it likes here, but given that the problem that was
being addressed was avoiding lock contention in
ext4_inode_datasync_dirty(), this appears to be a completely
unnecessary change as it doesn't address the problem being reported.

Cheers,

Dave.
Zhongwei Cai Nov. 5, 2021, 5:28 a.m. UTC | #4
On 11/5/21 7:22 AM, Dave Chinner wrote:
> 
> No. Some filesystems don't track inode metadata dirty status using
> the VFS inode; instead they track it more efficiently in internal
> inode and/or journal based structures. Hence the only way to get
> "inode needs journal flush for data stability" information to
> generic IO code is to have a specific per-IO mapping flag for it.
> 

Could we add IOMAP_REPORT_DIRTY flag in the flags field of
struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag
needs to be set or not?

Currently the IOMAP_F_DIRTY flag is only checked in
iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw()
(To be more specific, only the write path in dax_iomap_fault() and
iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY
flag in dax_iomap_rw() called in the previous tests.

Other file systems that set the IOMAP_F_DIRTY flag efficiently
could ignore the IOMAP_REPORT_DIRTY flag.

Best,

Zhongwei.
Dave Chinner Nov. 9, 2021, 4:50 a.m. UTC | #5
On Fri, Nov 05, 2021 at 01:28:10PM +0800, Zhongwei Cai wrote:
> 
> 
> On 11/5/21 7:22 AM, Dave Chinner wrote:
> > 
> > No. Some filesystems don't track inode metadata dirty status using
> > the VFS inode; instead they track it more efficiently in internal
> > inode and/or journal based structures. Hence the only way to get
> > "inode needs journal flush for data stability" information to
> > generic IO code is to have a specific per-IO mapping flag for it.
> > 
> 
> Could we add IOMAP_REPORT_DIRTY flag in the flags field of
> struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag
> needs to be set or not?

You can try. It might turn out OK, but you're also going to have to
modify all the iomap code that current needs IOMAP_F_DIRTY to first
set that flag, then change all the code that currently sets
IOMAP_F_DIRTY to look at IOMAP_REPORT_DIRTY. i.e you now have to
change iomap, ext4 and XFS to do this.

> Currently the IOMAP_F_DIRTY flag is only checked in
> iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw()
> (To be more specific, only the write path in dax_iomap_fault() and
> iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY
> flag in dax_iomap_rw() called in the previous tests.

I think you're trying to optimise the wrong thing - the API is not
the problem, the problem is that the journal->j_state_lock is
contended and the ext4 dirty inode check needs to take it. Fix the
dirty check not to need the journal state lock and the ext4 problem
goes away and there is no need to change the iomap infrastructure.

> Other file systems that set the IOMAP_F_DIRTY flag efficiently
> could ignore the IOMAP_REPORT_DIRTY flag.

No, that's just bad API design. If we are adding IOMAP_REPORT_DIRTY
then the iomap infrastructure should only use that control flag when
it needs to know if the inode is dirty. At this point, it really
becomes mandatory for all filesystems using iomap to support it
because the absence of IOMAP_F_DIRTY because a filesystem doesn't
support it is not the same as "filesystem didn't set it because the
inode is clean".

Cheers,

Dave.
Zhongwei Cai Nov. 10, 2021, 8:07 a.m. UTC | #6
On 11/9/21 12:50 PM, Dave Chinner wrote:
>>
>> Could we add IOMAP_REPORT_DIRTY flag in the flags field of
>> struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag
>> needs to be set or not?
> 
> You can try. It might turn out OK, but you're also going to have to
> modify all the iomap code that current needs IOMAP_F_DIRTY to first
> set that flag, then change all the code that currently sets
> IOMAP_F_DIRTY to look at IOMAP_REPORT_DIRTY. i.e you now have to
> change iomap, ext4 and XFS to do this.
>
I will make a v2 patch with this implementation.

>> Currently the IOMAP_F_DIRTY flag is only checked in
>> iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw()
>> (To be more specific, only the write path in dax_iomap_fault() and
>> iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY
>> flag in dax_iomap_rw() called in the previous tests.
> 
> I think you're trying to optimise the wrong thing - the API is not
> the problem, the problem is that the journal->j_state_lock is
> contended and the ext4 dirty inode check needs to take it. Fix the
> dirty check not to need the journal state lock and the ext4 problem
> goes away and there is no need to change the iomap infrastructure.

I'll try to fix it inside ext4, although it seems difficult to do dirty
check without journal->j_state_lock.

>> Other file systems that set the IOMAP_F_DIRTY flag efficiently
>> could ignore the IOMAP_REPORT_DIRTY flag.
> 
> No, that's just bad API design. If we are adding IOMAP_REPORT_DIRTY
> then the iomap infrastructure should only use that control flag when
> it needs to know if the inode is dirty. At this point, it really
> becomes mandatory for all filesystems using iomap to support it
> because the absence of IOMAP_F_DIRTY because a filesystem doesn't
> support it is not the same as "filesystem didn't set it because the
> inode is clean".
> 
Perhaps I have not made it clear that by "ignore" I mean other file
systems can set IOMAP_F_DIRTY regardless of whether the
IOMAP_REPORT_DIRTY flag is set or not, just like what they are doing
now. So we might not need to modify XFS.

I think even without the modification I made, the ambiguity that
the absence of IOMAP_F_DIRTY can either be file systems not supporting
it or be actually "clean inode" may exist since we do not have a flag
to indicate whether the file system supports setting IOMAP_F_DIRTY.

Best,

Zhongwei
Dave Chinner Nov. 11, 2021, 3:12 a.m. UTC | #7
On Wed, Nov 10, 2021 at 04:07:57PM +0800, Zhongwei Cai wrote:
> On 11/9/21 12:50 PM, Dave Chinner wrote:
> > > 
> > > Could we add IOMAP_REPORT_DIRTY flag in the flags field of
> > > struct iomap_iter to indicate whether the IOMAP_F_DIRTY flag
> > > needs to be set or not?
> > 
> > You can try. It might turn out OK, but you're also going to have to
> > modify all the iomap code that current needs IOMAP_F_DIRTY to first
> > set that flag, then change all the code that currently sets
> > IOMAP_F_DIRTY to look at IOMAP_REPORT_DIRTY. i.e you now have to
> > change iomap, ext4 and XFS to do this.
> > 
> I will make a v2 patch with this implementation.
> 
> > > Currently the IOMAP_F_DIRTY flag is only checked in
> > > iomap_swapfile_activate(), dax_iomap_fault() and iomap_dio_rw()
> > > (To be more specific, only the write path in dax_iomap_fault() and
> > > iomap_dio_rw()). So it would be unnecessary to set the IOMAP_F_DIRTY
> > > flag in dax_iomap_rw() called in the previous tests.
> > 
> > I think you're trying to optimise the wrong thing - the API is not
> > the problem, the problem is that the journal->j_state_lock is
> > contended and the ext4 dirty inode check needs to take it. Fix the
> > dirty check not to need the journal state lock and the ext4 problem
> > goes away and there is no need to change the iomap infrastructure.
> 
> I'll try to fix it inside ext4, although it seems difficult to do dirty
> check without journal->j_state_lock.
> 
> > > Other file systems that set the IOMAP_F_DIRTY flag efficiently
> > > could ignore the IOMAP_REPORT_DIRTY flag.
> > 
> > No, that's just bad API design. If we are adding IOMAP_REPORT_DIRTY
> > then the iomap infrastructure should only use that control flag when
> > it needs to know if the inode is dirty. At this point, it really
> > becomes mandatory for all filesystems using iomap to support it
> > because the absence of IOMAP_F_DIRTY because a filesystem doesn't
> > support it is not the same as "filesystem didn't set it because the
> > inode is clean".
> > 
> Perhaps I have not made it clear that by "ignore" I mean other file
> systems can set IOMAP_F_DIRTY regardless of whether the
> IOMAP_REPORT_DIRTY flag is set or not, just like what they are doing
> now. So we might not need to modify XFS.

I assumed that was exactly what you meant. That's just bad/lazy API
design. We need to be explicit in the way behaviours are defined,
not create APIs where callees can set random flags whenever they
like.

> I think even without the modification I made, the ambiguity that
> the absence of IOMAP_F_DIRTY can either be file systems not supporting
> it or be actually "clean inode" may exist since we do not have a flag
> to indicate whether the file system supports setting IOMAP_F_DIRTY.

Exactly my point - the current behaviour it's not clearly defined
(just set it always if you can!) but ext4 needs it to be much more
explicitly constrained to avoid internal overhead.

Hence we need to constrain the IOMAP_F_DIRTY flag behaviour to just
the iomap operations that need to know if the inode is dirty (i.e.
filesystems should set IOMAP_F_DIRTY if and only if
IOMAP_REPORT_DIRTY is passed as a control flag from the iomap core
and the inode is dirty) and then ensure that *every filesystem
implements it correctly*.

There are no shortcuts here - the cost of changing generic
infrastructure is that you have to ensure that all users of
interface use it properly. That includes filesystems that don't
currently set IOMAP_F_DIRTY but use iomap interfaces that will now
set IOMAP_REPORT_DIRTY....

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0f06305167d5..72ec2074ef54 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3274,7 +3274,7 @@  static bool ext4_inode_datasync_dirty(struct inode *inode)
 
 static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 			   struct ext4_map_blocks *map, loff_t offset,
-			   loff_t length)
+			   loff_t length, int flags)
 {
 	u8 blkbits = inode->i_blkbits;
 
@@ -3284,8 +3284,8 @@  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	 * there is no other metadata changes being made or are pending.
 	 */
 	iomap->flags = 0;
-	if (ext4_inode_datasync_dirty(inode) ||
-	    offset + length > i_size_read(inode))
+	if ((flags & IOMAP_WRITE) && (ext4_inode_datasync_dirty(inode) ||
+	    offset + length > i_size_read(inode)))
 		iomap->flags |= IOMAP_F_DIRTY;
 
 	if (map->m_flags & EXT4_MAP_NEW)
@@ -3423,7 +3423,7 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	if (ret < 0)
 		return ret;
 out:
-	ext4_set_iomap(inode, iomap, &map, offset, length);
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 
 	return 0;
 }
@@ -3543,7 +3543,7 @@  static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
 		delalloc = ext4_iomap_is_delalloc(inode, &map);
 
 set_iomap:
-	ext4_set_iomap(inode, iomap, &map, offset, length);
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 	if (delalloc && iomap->type == IOMAP_HOLE)
 		iomap->type = IOMAP_DELALLOC;