Message ID | da556191f9dba2b477cce57665ded57bfd396463.1570100361.git.mbobrowski@mbobrowski.org |
---|---|
State | Superseded |
Headers | show |
Series | ext4: port direct I/O to iomap infrastructure | expand |
On Thu 03-10-19 21:34:18, Matthew Bobrowski wrote: > In preparation for implementing the iomap direct I/O write path > modifications, the inode extension/truncate code needs to be moved out > from ext4_iomap_end(). For direct I/O, if the current code remained > within ext4_iomap_end() it would behave incorrectly. Updating the > inode size prior to converting unwritten extents to written extents > will potentially allow a racing direct I/O read operation to find > unwritten extents before they've been correctly converted. > > The inode extension/truncate code has been moved out into a new helper > ext4_handle_inode_extension(). This function has been designed so that > it can be used by both DAX and direct I/O paths. > > Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Looks good to me. Fell free to add: Reviewed-by: Jan Kara <jack@suse.cz> Just small nits below: > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset, > + ssize_t written, size_t count) > +{ > + int ret = 0; I think both the function and callsites may be slightly simpler if you let the function return 'written' or error (not 0 or error). But I'll leave that decision upto you. > + handle_t *handle; > + bool truncate = false; > + u8 blkbits = inode->i_blkbits; > + ext4_lblk_t written_blk, end_blk; > + > + /* > + * Note that EXT4_I(inode)->i_disksize can get extended up to > + * inode->i_size while the IO was running due to writeback of > + * delalloc blocks. But the code in ext4_iomap_alloc() is careful > + * to use zeroed / unwritten extents if this is possible and thus > + * we won't leave uninitialized blocks in a file even if we didn't > + * succeed in writing as much as we planned. > + */ Whitespace damaged here... Honza
On 10/3/19 5:04 PM, Matthew Bobrowski wrote: > In preparation for implementing the iomap direct I/O write path > modifications, the inode extension/truncate code needs to be moved out > from ext4_iomap_end(). For direct I/O, if the current code remained > within ext4_iomap_end() it would behave incorrectly. Updating the > inode size prior to converting unwritten extents to written extents > will potentially allow a racing direct I/O read operation to find > unwritten extents before they've been correctly converted. > > The inode extension/truncate code has been moved out into a new helper > ext4_handle_inode_extension(). This function has been designed so that > it can be used by both DAX and direct I/O paths. > > Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> checkpatch shows some whitespaces error in your comments in this patch. But apart from that, patch looks good to me. You may add: Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> > --- > fs/ext4/file.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++- > fs/ext4/inode.c | 48 +----------------------------- > 2 files changed, 79 insertions(+), 48 deletions(-) > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 69ac042fb74b..2883711e8a33 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -33,6 +33,7 @@ > #include "ext4_jbd2.h" > #include "xattr.h" > #include "acl.h" > +#include "truncate.h" > > static bool ext4_dio_supported(struct inode *inode) > { > @@ -233,12 +234,82 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from) > return iov_iter_count(from); > } > > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset, > + ssize_t written, size_t count) > +{ > + int ret = 0; > + handle_t *handle; > + bool truncate = false; > + u8 blkbits = inode->i_blkbits; > + ext4_lblk_t written_blk, end_blk; > + > + /* > + * Note that EXT4_I(inode)->i_disksize can get extended up to > + * inode->i_size while the IO was running due to writeback of > + * delalloc blocks. But the code in ext4_iomap_alloc() is careful > + * to use zeroed / unwritten extents if this is possible and thus > + * we won't leave uninitialized blocks in a file even if we didn't > + * succeed in writing as much as we planned. > + */ > + WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); > + if (offset + count <= EXT4_I(inode)->i_disksize) > + return written < 0 ? written : 0; > + > + if (written < 0) { > + ret = written; > + goto truncate; > + } > + > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); > + if (IS_ERR(handle)) { > + ret = PTR_ERR(handle); > + goto truncate; > + } > + > + if (ext4_update_inode_size(inode, offset + written)) > + ext4_mark_inode_dirty(handle, inode); > + > + /* > + * We may need to truncate allocated but not written blocks > + * beyond EOF. > + */ > + written_blk = ALIGN(offset + written, 1 << blkbits); > + end_blk = ALIGN(offset + count, 1 << blkbits); > + if (written_blk < end_blk && ext4_can_truncate(inode)) > + truncate = true; > + > + /* > + * Remove the inode from the orphan list if it has been > + * extended and everything went OK. > + */ > + if (!truncate && inode->i_nlink) > + ext4_orphan_del(handle, inode); > + ext4_journal_stop(handle); > + > + if (truncate) { > +truncate: > + ext4_truncate_failed_write(inode); > + /* > + * If the truncate operation failed early, then the > + * inode may still be on the orphan list. In that > + * case, we need to try remove the inode from the > + * in-memory linked list. > + */ > + if (inode->i_nlink) > + ext4_orphan_del(NULL, inode); > + } > + return ret; > +} > + > #ifdef CONFIG_FS_DAX > static ssize_t > ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) > { > - struct inode *inode = file_inode(iocb->ki_filp); > + int error; > ssize_t ret; > + size_t count; > + loff_t offset; > + struct inode *inode = file_inode(iocb->ki_filp); > > if (!inode_trylock(inode)) { > if (iocb->ki_flags & IOCB_NOWAIT) > @@ -255,7 +326,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) > if (ret) > goto out; > > + offset = iocb->ki_pos; > + count = iov_iter_count(from); > ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops); > + > + error = ext4_handle_inode_extension(inode, offset, ret, count); > + if (error) > + ret = error; > out: > inode_unlock(inode); > if (ret > 0) > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 159ffb92f82d..d616062b603e 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3592,53 +3592,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, > static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length, > ssize_t written, unsigned flags, struct iomap *iomap) > { > - int ret = 0; > - handle_t *handle; > - int blkbits = inode->i_blkbits; > - bool truncate = false; > - > - if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT)) > - return 0; > - > - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); > - if (IS_ERR(handle)) { > - ret = PTR_ERR(handle); > - goto orphan_del; > - } > - if (ext4_update_inode_size(inode, offset + written)) > - ext4_mark_inode_dirty(handle, inode); > - /* > - * We may need to truncate allocated but not written blocks beyond EOF. > - */ > - if (iomap->offset + iomap->length > > - ALIGN(inode->i_size, 1 << blkbits)) { > - ext4_lblk_t written_blk, end_blk; > - > - written_blk = (offset + written) >> blkbits; > - end_blk = (offset + length) >> blkbits; > - if (written_blk < end_blk && ext4_can_truncate(inode)) > - truncate = true; > - } > - /* > - * Remove inode from orphan list if we were extending a inode and > - * everything went fine. > - */ > - if (!truncate && inode->i_nlink && > - !list_empty(&EXT4_I(inode)->i_orphan)) > - ext4_orphan_del(handle, inode); > - ext4_journal_stop(handle); > - if (truncate) { > - ext4_truncate_failed_write(inode); > -orphan_del: > - /* > - * If truncate failed early the inode might still be on the > - * orphan list; we need to make sure the inode is removed from > - * the orphan list in that case. > - */ > - if (inode->i_nlink) > - ext4_orphan_del(NULL, inode); > - } > - return ret; > + return 0; > } > > const struct iomap_ops ext4_iomap_ops = { >
On Tue, Oct 08, 2019 at 01:25:12PM +0200, Jan Kara wrote: > On Thu 03-10-19 21:34:18, Matthew Bobrowski wrote: > Looks good to me. Fell free to add: > > Reviewed-by: Jan Kara <jack@suse.cz> Thanks Jan! > Just small nits below: > > > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset, > > + ssize_t written, size_t count) > > +{ > > + int ret = 0; > > I think both the function and callsites may be slightly simpler if you let > the function return 'written' or error (not 0 or error). But I'll leave > that decision upto you. Hm, don't we actually need to return 0 for success cases so that iomap_dio_complete() behaves correctly i.e. increments iocb->ki_pos, etc? > > + handle_t *handle; > > + bool truncate = false; > > + u8 blkbits = inode->i_blkbits; > > + ext4_lblk_t written_blk, end_blk; > > + > > + /* > > + * Note that EXT4_I(inode)->i_disksize can get extended up to > > + * inode->i_size while the IO was running due to writeback of > > + * delalloc blocks. But the code in ext4_iomap_alloc() is careful > > + * to use zeroed / unwritten extents if this is possible and thus > > + * we won't leave uninitialized blocks in a file even if we didn't > > + * succeed in writing as much as we planned. > > + */ > > Whitespace damaged here... I'll fix this. --<M>--
On Wed, Oct 09, 2019 at 11:57:04AM +0530, Ritesh Harjani wrote: > On 10/3/19 5:04 PM, Matthew Bobrowski wrote: > > In preparation for implementing the iomap direct I/O write path > > modifications, the inode extension/truncate code needs to be moved out > > from ext4_iomap_end(). For direct I/O, if the current code remained > > within ext4_iomap_end() it would behave incorrectly. Updating the > > inode size prior to converting unwritten extents to written extents > > will potentially allow a racing direct I/O read operation to find > > unwritten extents before they've been correctly converted. > > > > The inode extension/truncate code has been moved out into a new helper > > ext4_handle_inode_extension(). This function has been designed so that > > it can be used by both DAX and direct I/O paths. > > > > Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> > > checkpatch shows some whitespaces error in your comments > in this patch. > But apart from that, patch looks good to me. Yeah, I will fix those. > You may add: > > Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Thanks! --<M>--
On Wed 09-10-19 21:18:50, Matthew Bobrowski wrote: > > Just small nits below: > > > > > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset, > > > + ssize_t written, size_t count) > > > +{ > > > + int ret = 0; > > > > I think both the function and callsites may be slightly simpler if you let > > the function return 'written' or error (not 0 or error). But I'll leave > > that decision upto you. > > Hm, don't we actually need to return 0 for success cases so that > iomap_dio_complete() behaves correctly i.e. increments iocb->ki_pos, > etc? Correct, iomap_dio_complete() expects 0 on success. So if we keep calling ext4_handle_inode_extension() from ->end_io handler, we'd need some specialcasing there and I agree that changing ext4_handle_inode_extension() return convention isn't then very beneficial. If we stop calling ext4_handle_inode_extension() from ->end_io handler (patch 8/8 discussion pending), then the change would be a clear win. Honza
On Wed, Oct 09, 2019 at 02:51:32PM +0200, Jan Kara wrote: > On Wed 09-10-19 21:18:50, Matthew Bobrowski wrote: > > > Just small nits below: > > > > > > > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset, > > > > + ssize_t written, size_t count) > > > > +{ > > > > + int ret = 0; > > > > > > I think both the function and callsites may be slightly simpler if you let > > > the function return 'written' or error (not 0 or error). But I'll leave > > > that decision upto you. > > > > Hm, don't we actually need to return 0 for success cases so that > > iomap_dio_complete() behaves correctly i.e. increments iocb->ki_pos, > > etc? > > Correct, iomap_dio_complete() expects 0 on success. So if we keep calling > ext4_handle_inode_extension() from ->end_io handler, we'd need some > specialcasing there and I agree that changing ext4_handle_inode_extension() > return convention isn't then very beneficial. If we stop calling > ext4_handle_inode_extension() from ->end_io handler (patch 8/8 discussion > pending), then the change would be a clear win. Agreed. Well, I think we've got some movement in the right direction in 8/8, so it looks like changing up the return values is what we'll go ahead with. --<M>--
diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 69ac042fb74b..2883711e8a33 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -33,6 +33,7 @@ #include "ext4_jbd2.h" #include "xattr.h" #include "acl.h" +#include "truncate.h" static bool ext4_dio_supported(struct inode *inode) { @@ -233,12 +234,82 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from) return iov_iter_count(from); } +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset, + ssize_t written, size_t count) +{ + int ret = 0; + handle_t *handle; + bool truncate = false; + u8 blkbits = inode->i_blkbits; + ext4_lblk_t written_blk, end_blk; + + /* + * Note that EXT4_I(inode)->i_disksize can get extended up to + * inode->i_size while the IO was running due to writeback of + * delalloc blocks. But the code in ext4_iomap_alloc() is careful + * to use zeroed / unwritten extents if this is possible and thus + * we won't leave uninitialized blocks in a file even if we didn't + * succeed in writing as much as we planned. + */ + WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); + if (offset + count <= EXT4_I(inode)->i_disksize) + return written < 0 ? written : 0; + + if (written < 0) { + ret = written; + goto truncate; + } + + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + goto truncate; + } + + if (ext4_update_inode_size(inode, offset + written)) + ext4_mark_inode_dirty(handle, inode); + + /* + * We may need to truncate allocated but not written blocks + * beyond EOF. + */ + written_blk = ALIGN(offset + written, 1 << blkbits); + end_blk = ALIGN(offset + count, 1 << blkbits); + if (written_blk < end_blk && ext4_can_truncate(inode)) + truncate = true; + + /* + * Remove the inode from the orphan list if it has been + * extended and everything went OK. + */ + if (!truncate && inode->i_nlink) + ext4_orphan_del(handle, inode); + ext4_journal_stop(handle); + + if (truncate) { +truncate: + ext4_truncate_failed_write(inode); + /* + * If the truncate operation failed early, then the + * inode may still be on the orphan list. In that + * case, we need to try remove the inode from the + * in-memory linked list. + */ + if (inode->i_nlink) + ext4_orphan_del(NULL, inode); + } + return ret; +} + #ifdef CONFIG_FS_DAX static ssize_t ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) { - struct inode *inode = file_inode(iocb->ki_filp); + int error; ssize_t ret; + size_t count; + loff_t offset; + struct inode *inode = file_inode(iocb->ki_filp); if (!inode_trylock(inode)) { if (iocb->ki_flags & IOCB_NOWAIT) @@ -255,7 +326,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) if (ret) goto out; + offset = iocb->ki_pos; + count = iov_iter_count(from); ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops); + + error = ext4_handle_inode_extension(inode, offset, ret, count); + if (error) + ret = error; out: inode_unlock(inode); if (ret > 0) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 159ffb92f82d..d616062b603e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3592,53 +3592,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length, ssize_t written, unsigned flags, struct iomap *iomap) { - int ret = 0; - handle_t *handle; - int blkbits = inode->i_blkbits; - bool truncate = false; - - if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT)) - return 0; - - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); - if (IS_ERR(handle)) { - ret = PTR_ERR(handle); - goto orphan_del; - } - if (ext4_update_inode_size(inode, offset + written)) - ext4_mark_inode_dirty(handle, inode); - /* - * We may need to truncate allocated but not written blocks beyond EOF. - */ - if (iomap->offset + iomap->length > - ALIGN(inode->i_size, 1 << blkbits)) { - ext4_lblk_t written_blk, end_blk; - - written_blk = (offset + written) >> blkbits; - end_blk = (offset + length) >> blkbits; - if (written_blk < end_blk && ext4_can_truncate(inode)) - truncate = true; - } - /* - * Remove inode from orphan list if we were extending a inode and - * everything went fine. - */ - if (!truncate && inode->i_nlink && - !list_empty(&EXT4_I(inode)->i_orphan)) - ext4_orphan_del(handle, inode); - ext4_journal_stop(handle); - if (truncate) { - ext4_truncate_failed_write(inode); -orphan_del: - /* - * If truncate failed early the inode might still be on the - * orphan list; we need to make sure the inode is removed from - * the orphan list in that case. - */ - if (inode->i_nlink) - ext4_orphan_del(NULL, inode); - } - return ret; + return 0; } const struct iomap_ops ext4_iomap_ops = {
In preparation for implementing the iomap direct I/O write path modifications, the inode extension/truncate code needs to be moved out from ext4_iomap_end(). For direct I/O, if the current code remained within ext4_iomap_end() it would behave incorrectly. Updating the inode size prior to converting unwritten extents to written extents will potentially allow a racing direct I/O read operation to find unwritten extents before they've been correctly converted. The inode extension/truncate code has been moved out into a new helper ext4_handle_inode_extension(). This function has been designed so that it can be used by both DAX and direct I/O paths. Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> --- fs/ext4/file.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++- fs/ext4/inode.c | 48 +----------------------------- 2 files changed, 79 insertions(+), 48 deletions(-)