Message ID | 20200220152355.5ticlkptc7kwrifz@fiona |
---|---|
State | New |
Series | [v2] iomap: return partial I/O count on error in iomap_dio_bio_actor |
On Thu, Feb 20, 2020 at 09:23:55AM -0600, Goldwyn Rodrigues wrote:
> In case of a block device error, written parameter in iomap_end()
> is zero as opposed to the amount of submitted I/O.
> Filesystems such as btrfs need to account for the I/O in ordered
> extents, even if it resulted in an error. Having (incomplete)
> submitted bytes in written gives the filesystem the amount of data
> which has been submitted before the error occurred, and the
> filesystem code can choose how to use it.
>
> The final returned error for iomap_dio_rw() is set by
> iomap_dio_complete().
>
> Partial writes in direct I/O are considered an error. So,
> ->iomap_end() using written == 0 as error must be changed
> to written < length. In this case, ext4 is the only user.

I really had a hard time understanding this. I think what you meant
was:

Currently, I/Os that complete with an error indicate this by passing
written == 0 to the iomap_end function. However, btrfs needs to know how
many bytes were written for its own accounting. Change the convention
to pass the number of bytes which were actually written, and change the
only user to check for a short write instead of a zero length write.

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/ext4/inode.c      | 2 +-
>  fs/iomap/direct-io.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e60aca791d3f..e50e7414351a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  	 * the I/O. Any blocks that may have been allocated in preparation for
>  	 * the direct I/O will be reused during buffered I/O.
>  	 */
> -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
>  		return -ENOTBLK;
>
>  	return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41c1e7c20a1f..01865db1bd09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;
>  		}
>
> --
> 2.25.0
>

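A minimal user-space sketch of the convention change restated in the reply above may help. It is not kernel code: the DEMO_* flags and the demo_end_old()/demo_end_new() helpers are hypothetical stand-ins for IOMAP_WRITE, IOMAP_DIRECT and the ext4 end callback, and only the comparison against written is taken from the patch.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's IOMAP_WRITE / IOMAP_DIRECT flags. */
#define DEMO_WRITE  0x1u
#define DEMO_DIRECT 0x2u

/* Old convention: a failed I/O reaches the callback as written == 0. */
static int demo_end_old(long long length, long long written, unsigned flags)
{
    (void)length; /* not consulted under the old convention */
    if ((flags & (DEMO_WRITE | DEMO_DIRECT)) && written == 0)
        return -ENOTBLK;
    return 0;
}

/* New convention: written carries the partial count, so test for a short write. */
static int demo_end_new(long long length, long long written, unsigned flags)
{
    if ((flags & (DEMO_WRITE | DEMO_DIRECT)) && written < length)
        return -ENOTBLK;
    return 0;
}
```

With a partial count of 1024 bytes out of 4096, the old check lets the call through while the new check reports the short write, which is why the callback has to change together with the actor.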
On 9:42 20/02, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 09:23:55AM -0600, Goldwyn Rodrigues wrote:
> > In case of a block device error, written parameter in iomap_end()
> > is zero as opposed to the amount of submitted I/O.
> > Filesystems such as btrfs need to account for the I/O in ordered
> > extents, even if it resulted in an error. Having (incomplete)
> > submitted bytes in written gives the filesystem the amount of data
> > which has been submitted before the error occurred, and the
> > filesystem code can choose how to use it.
> >
> > The final returned error for iomap_dio_rw() is set by
> > iomap_dio_complete().
> >
> > Partial writes in direct I/O are considered an error. So,
> > ->iomap_end() using written == 0 as error must be changed
> > to written < length. In this case, ext4 is the only user.
>
> I really had a hard time understanding this. I think what you meant
> was:
>
> Currently, I/Os that complete with an error indicate this by passing
> written == 0 to the iomap_end function. However, btrfs needs to know how
> many bytes were written for its own accounting. Change the convention
> to pass the number of bytes which were actually written, and change the
> only user to check for a short write instead of a zero length write.

Yes, that's right. I was trying to cover base from the previous patch
and made a mess of it all. Thanks.

>
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/ext4/inode.c      | 2 +-
> >  fs/iomap/direct-io.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e60aca791d3f..e50e7414351a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> >  	 * the I/O. Any blocks that may have been allocated in preparation for
> >  	 * the direct I/O will be reused during buffered I/O.
> >  	 */
> > -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> > +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
> >  		return -ENOTBLK;
> >
> >  	return 0;
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 41c1e7c20a1f..01865db1bd09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		size_t n;
> >  		if (dio->error) {
> >  			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >  			goto out;
> >  		}
> >
> > --
> > 2.25.0
> >

On 2/20/20 8:53 PM, Goldwyn Rodrigues wrote:
> In case of a block device error, written parameter in iomap_end()
> is zero as opposed to the amount of submitted I/O.
> Filesystems such as btrfs need to account for the I/O in ordered
> extents, even if it resulted in an error. Having (incomplete)
> submitted bytes in written gives the filesystem the amount of data
> which has been submitted before the error occurred, and the
> filesystem code can choose how to use it.
>
> The final returned error for iomap_dio_rw() is set by
> iomap_dio_complete().
>
> Partial writes in direct I/O are considered an error. So,
> ->iomap_end() using written == 0 as error must be changed
> to written < length. In this case, ext4 is the only user.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/ext4/inode.c      | 2 +-
>  fs/iomap/direct-io.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e60aca791d3f..e50e7414351a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  	 * the I/O. Any blocks that may have been allocated in preparation for
>  	 * the direct I/O will be reused during buffered I/O.
>  	 */
> -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
>  		return -ENOTBLK;
>
>  	return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41c1e7c20a1f..01865db1bd09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;
>  		}

But if I am seeing this correctly, even after there was a dio->error
if you return copied > 0, then the loop in iomap_dio_rw will continue
for next iteration as well. Until the second time it won't copy
anything since dio->error is set and from there I guess it may return
0 which will break the loop.

Is this the correct flow? Shouldn't the while loop doing
iomap_apply in iomap_dio_rw also break in case of
dio->error? Or did I miss anything?

-ritesh

On 10:21 21/02, Ritesh Harjani wrote:
>
>
> On 2/20/20 8:53 PM, Goldwyn Rodrigues wrote:
> > In case of a block device error, written parameter in iomap_end()
> > is zero as opposed to the amount of submitted I/O.
> > Filesystems such as btrfs need to account for the I/O in ordered
> > extents, even if it resulted in an error. Having (incomplete)
> > submitted bytes in written gives the filesystem the amount of data
> > which has been submitted before the error occurred, and the
> > filesystem code can choose how to use it.
> >
> > The final returned error for iomap_dio_rw() is set by
> > iomap_dio_complete().
> >
> > Partial writes in direct I/O are considered an error. So,
> > ->iomap_end() using written == 0 as error must be changed
> > to written < length. In this case, ext4 is the only user.
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/ext4/inode.c      | 2 +-
> >  fs/iomap/direct-io.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e60aca791d3f..e50e7414351a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> >  	 * the I/O. Any blocks that may have been allocated in preparation for
> >  	 * the direct I/O will be reused during buffered I/O.
> >  	 */
> > -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> > +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
> >  		return -ENOTBLK;
> >  	return 0;
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 41c1e7c20a1f..01865db1bd09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		size_t n;
> >  		if (dio->error) {
> >  			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >  			goto out;
> >  		}
>
> But if I am seeing this correctly, even after there was a dio->error
> if you return copied > 0, then the loop in iomap_dio_rw will continue
> for next iteration as well. Until the second time it won't copy
> anything since dio->error is set and from there I guess it may return
> 0 which will break the loop.
>
> Is this the correct flow? Shouldn't the while loop doing
> iomap_apply in iomap_dio_rw also break in case of
> dio->error? Or did I miss anything?
>

Yes, we can save an extra iteration by checking for dio->error in the
while loop of iomap_dio_rw().

On Thu, Feb 20, 2020 at 09:23:55AM -0600, Goldwyn Rodrigues wrote:
> In case of a block device error, written parameter in iomap_end()
> is zero as opposed to the amount of submitted I/O.
> Filesystems such as btrfs need to account for the I/O in ordered
> extents, even if it resulted in an error. Having (incomplete)
> submitted bytes in written gives the filesystem the amount of data
> which has been submitted before the error occurred, and the
> filesystem code can choose how to use it.
>
> The final returned error for iomap_dio_rw() is set by
> iomap_dio_complete().
>
> Partial writes in direct I/O are considered an error. So,
> ->iomap_end() using written == 0 as error must be changed
> to written < length. In this case, ext4 is the only user.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41c1e7c20a1f..01865db1bd09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;
>  		}

This part fixes problems I saw with the dio-iomap btrfs conversion
patchset, thanks.

On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
> >  		if (dio->error) {
> >  			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >  			goto out;
> >  		}
>
> But if I am seeing this correctly, even after there was a dio->error
> if you return copied > 0, then the loop in iomap_dio_rw will continue
> for next iteration as well. Until the second time it won't copy
> anything since dio->error is set and from there I guess it may return
> 0 which will break the loop.

In addition to that, copied is also used in the iov_iter_reexpand call.
We don't really need the re-expand in case of errors, and in fact we also
have the iov_iter_revert call before jumping out, so this will
need a little bit more of an audit and needs to be properly documented
in the commit log.

>
> Is this the correct flow? Shouldn't the while loop doing
> iomap_apply in iomap_dio_rw also break in case of
> dio->error? Or did I miss anything?

We'd need something there iff we care about a good number of written
in case of the error. Goldwyn, can you explain what you need this
number for in btrfs? Maybe with a pointer to the current code base?

On 2020/02/26 5:53, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
>>>  		if (dio->error) {
>>>  			iov_iter_revert(dio->submit.iter, copied);
>>> -			copied = ret = 0;
>>> +			ret = 0;
>>>  			goto out;
>>>  		}
>>
>> But if I am seeing this correctly, even after there was a dio->error
>> if you return copied > 0, then the loop in iomap_dio_rw will continue
>> for next iteration as well. Until the second time it won't copy
>> anything since dio->error is set and from there I guess it may return
>> 0 which will break the loop.
>
> In addition to that copied is also iov_iter_reexpand call. We don't
> really need the re-expand in case of errors, and in fact we also
> have the iov_iter_revert call before jumping out, so this will
> need a little bit more of an audit and properly documented in the
> commit log.
>
>>
>> Is this the correct flow? Shouldn't the while loop doing
>> iomap_apply in iomap_dio_rw should also break in case of
>> dio->error? Or did I miss anything?
>
> We'd need something there iff we care about a good number of written
> in case of the error. Goldwyn, can you explain what you need this
> number for in btrfs? Maybe with a pointer to the current code base?

Not sure about btrfs, but for zonefs, getting the partial I/O count done
for a failed large dio would also be useful to avoid having to do the
error recovery dance with report zones for getting the current zone
write pointer.

On 12:53 25/02, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
> > >  		if (dio->error) {
> > >  			iov_iter_revert(dio->submit.iter, copied);
> > > -			copied = ret = 0;
> > > +			ret = 0;
> > >  			goto out;
> > >  		}
> >
> > But if I am seeing this correctly, even after there was a dio->error
> > if you return copied > 0, then the loop in iomap_dio_rw will continue
> > for next iteration as well. Until the second time it won't copy
> > anything since dio->error is set and from there I guess it may return
> > 0 which will break the loop.
>
> In addition to that copied is also iov_iter_reexpand call. We don't
> really need the re-expand in case of errors, and in fact we also
> have the iov_iter_revert call before jumping out, so this will
> need a little bit more of an audit and properly documented in the
> commit log.
>
> >
> > Is this the correct flow? Shouldn't the while loop doing
> > iomap_apply in iomap_dio_rw should also break in case of
> > dio->error? Or did I miss anything?
>
> We'd need something there iff we care about a good number of written
> in case of the error. Goldwyn, can you explain what you need this
> number for in btrfs? Maybe with a pointer to the current code base?

btrfs needs to account for the bytes "processed", failed or
uptodate. This is currently performed in
fs/btrfs/inode.c:__end_write_update_ordered().

For the current development version, my use of it is in my git
branch btrfs-iomap-dio [1]. The related commit besides this patch is:

9aeb2b31d10b ("btrfs: Use ->iomap_end() instead of btrfs_dio_data")

[1] https://github.com/goldwynr/linux/tree/btrfs-iomap-dio

On 12:53 25/02, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
> > >  		if (dio->error) {
> > >  			iov_iter_revert(dio->submit.iter, copied);
> > > -			copied = ret = 0;
> > > +			ret = 0;
> > >  			goto out;
> > >  		}
> >
> > But if I am seeing this correctly, even after there was a dio->error
> > if you return copied > 0, then the loop in iomap_dio_rw will continue
> > for next iteration as well. Until the second time it won't copy
> > anything since dio->error is set and from there I guess it may return
> > 0 which will break the loop.
>

Reading the code again, there are a few clarifications.

If iomap_end() handles (written < length) as an error, iomap_apply()
will return an error immediately. It will not execute the loop a
second time.

On the other hand, if there is no ->iomap_end() defined by the
filesystem such as in the case of XFS, we will need to check for
dio->error in the do {} while loop of iomap_dio_rw().

> In addition to that copied is also iov_iter_reexpand call. We don't
> really need the re-expand in case of errors, and in fact we also
> have the iov_iter_revert call before jumping out, so this will
> need a little bit more of an audit and properly documented in the
> commit log.

We are still handling this as an error, so why are we concerned about
expanding? There is no success/written returned in iomap_dio_rw() call
in case of an error.

Attached is an updated patch.

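The two cases described in the message above can be replayed in a small user-space model. This is only a sketch under stated assumptions: every demo_* name is hypothetical, the actor fails after a fixed 1024 bytes, the iterator bookkeeping is left out, and demo_apply() encodes the behaviour as the message describes it (an error from the end callback is returned as-is). It is not checked here against the kernel's iomap_apply().

```c
#include <errno.h>
#include <stddef.h>

struct demo_dio {
    int error;       /* sticky error, standing in for dio->error */
    int actor_calls; /* how many times the actor was entered */
};

typedef int (*demo_end_fn)(long long length, long long written);

/* Actor: the first call submits 1024 bytes and then hits a device error. */
static long long demo_actor(struct demo_dio *dio)
{
    dio->actor_calls++;
    if (dio->error)
        return 0;      /* nothing is copied once the error is set */
    dio->error = -EIO;
    return 1024;       /* partial count, as the patch would return it */
}

/* Apply step: run the actor, then let the optional end callback fail it. */
static long long demo_apply(struct demo_dio *dio, long long length,
                            demo_end_fn end)
{
    long long written = demo_actor(dio);

    if (end) {
        int ret = end(length, written);
        if (ret < 0)
            return ret;
    }
    return written;
}

/* End callback that treats a short write as an error. */
static int demo_end_short(long long length, long long written)
{
    return written < length ? -ENOTBLK : 0;
}

/* Outer loop; returns how many actor calls were needed before it stopped. */
static int demo_rw(long long count, demo_end_fn end, int check_error)
{
    struct demo_dio dio = { 0, 0 };

    do {
        long long ret = demo_apply(&dio, count, end);
        if (ret <= 0)
            break;
        count -= ret;
        if (check_error && dio.error)
            break;     /* the extra dio->error check discussed above */
    } while (count > 0);

    return dio.actor_calls;
}
```

In this model, demo_rw(4096, demo_end_short, 0) stops after one actor call, demo_rw(4096, NULL, 0) needs a second call that copies nothing, and demo_rw(4096, NULL, 1) stops after one call again.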
On Fri, Feb 28, 2020 at 01:44:01PM -0600, Goldwyn Rodrigues wrote:
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;

There's another change here ... look at the out label

out:
	/* Undo iter limitation to current extent */
	iov_iter_reexpand(dio->submit.iter, orig_count - copied);
	if (copied)
		return copied;
	return ret;

so you're also changing by how much the iter is reexpanded. I
don't know if it's the appropriate amount; I still don't quite get the
iov_iter complexities.

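The arithmetic behind the point above can be followed with a toy iterator that tracks only two byte counters. This is a sketch, not the kernel's iov_iter: the toy_* names are hypothetical, segments and pages are ignored, and the helpers only mirror the count updates that truncate, advance, revert and reexpand are understood to perform. It shows which values come out, not whether they are safe.

```c
#include <stddef.h>

struct toy_iter {
    size_t count;    /* bytes still available through the iterator */
    size_t consumed; /* bytes already advanced past */
};

static void toy_truncate(struct toy_iter *i, size_t max)
{
    if (i->count > max)
        i->count = max;
}

static void toy_advance(struct toy_iter *i, size_t n)
{
    i->count -= n;
    i->consumed += n;
}

static void toy_revert(struct toy_iter *i, size_t n)
{
    i->count += n;
    i->consumed -= n;
}

static void toy_reexpand(struct toy_iter *i, size_t count)
{
    i->count = count;
}

/*
 * Replays the error path for one extent of `length` bytes out of an
 * iterator holding `orig_count` bytes, after `copied` bytes were submitted.
 * zero_copied != 0 models the pre-patch "copied = ret = 0".
 */
static struct toy_iter toy_error_path(size_t orig_count, size_t length,
                                      size_t copied, int zero_copied)
{
    struct toy_iter it = { orig_count, 0 };

    toy_truncate(&it, length);  /* limit the iterator to the current extent */
    toy_advance(&it, copied);   /* bytes submitted before the error */
    toy_revert(&it, copied);    /* iov_iter_revert(dio->submit.iter, copied) */
    if (zero_copied)
        copied = 0;
    toy_reexpand(&it, orig_count - copied); /* the out: label */
    return it;
}
```

For orig_count = 8192, length = 4096 and copied = 1024, the pre-patch path leaves the toy iterator at its starting position with count 8192, while the patched path leaves it at the starting position with count 7168.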
On 11:59 28/02, Matthew Wilcox wrote:
> On Fri, Feb 28, 2020 at 01:44:01PM -0600, Goldwyn Rodrigues wrote:
> > +++ b/fs/iomap/direct-io.c
> > @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		size_t n;
> >  		if (dio->error) {
> >  			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >  			goto out;
>
> There's another change here ... look at the out label
>
> out:
> 	/* Undo iter limitation to current extent */
> 	iov_iter_reexpand(dio->submit.iter, orig_count - copied);
> 	if (copied)
> 		return copied;
> 	return ret;
>
> so you're also changing by how much the iter is reexpanded. I
> don't know if it's the appropriate amount; I still don't quite get the
> iov_iter complexities.
>

Ah, okay. Now I understand what Christoph was saying.

I suppose it is safe to remove iov_iter_reexpand(). I don't see any
other goto to this label which will have a non-zero copied value.
And we have already performed the iov_iter_revert().

On Fri, Feb 28, 2020 at 02:35:38PM -0600, Goldwyn Rodrigues wrote:
>
> Ah, okay. Now I understand what Christoph was saying.
>
> I suppose it is safe to remove iov_iter_reexpand(). I don't see any
> other goto to this label which will have a non-zero copied value.
> And we have already performed the iov_iter_revert().

I don't really understand the iov_iter complexities either, at least
not without spending significant time with the implementation. But
the important thing is that you document the changes in behavior
and your findings on why you think it is safe.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e60aca791d3f..e50e7414351a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	 * the I/O. Any blocks that may have been allocated in preparation for
 	 * the direct I/O will be reused during buffered I/O.
 	 */
-	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
 		return -ENOTBLK;

 	return 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 41c1e7c20a1f..01865db1bd09 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		size_t n;
 		if (dio->error) {
 			iov_iter_revert(dio->submit.iter, copied);
-			copied = ret = 0;
+			ret = 0;
 			goto out;
 		}

In case of a block device error, written parameter in iomap_end()
is zero as opposed to the amount of submitted I/O.
Filesystems such as btrfs need to account for the I/O in ordered
extents, even if it resulted in an error. Having (incomplete)
submitted bytes in written gives the filesystem the amount of data
which has been submitted before the error occurred, and the
filesystem code can choose how to use it.

The final returned error for iomap_dio_rw() is set by
iomap_dio_complete().

Partial writes in direct I/O are considered an error. So,
->iomap_end() using written == 0 as error must be changed
to written < length. In this case, ext4 is the only user.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/ext4/inode.c      | 2 +-
 fs/iomap/direct-io.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)