[v2] iomap: return partial I/O count on error in iomap_dio_bio_actor

Message ID 20200220152355.5ticlkptc7kwrifz@fiona

Commit Message

Goldwyn Rodrigues Feb. 20, 2020, 3:23 p.m. UTC
In case of a block device error, written parameter in iomap_end()
is zero as opposed to the amount of submitted I/O.
Filesystems such as btrfs need to account for the I/O in ordered
extents, even if it resulted in an error. Having (incomplete)
submitted bytes in written gives the filesystem the amount of data
which has been submitted before the error occurred, and the
filesystem code can choose how to use it.

The final returned error for iomap_dio_rw() is set by
iomap_dio_complete().

Partial writes in direct I/O are considered an error. So,
->iomap_end() using written == 0 as error must be changed
to written < length. In this case, ext4 is the only user.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/ext4/inode.c      | 2 +-
 fs/iomap/direct-io.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
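
The convention change described in the commit message can be illustrated with a minimal userspace sketch. It is not the kernel code: the flag macros and the function names end_check_old() and end_check_new() are hypothetical stand-ins, and only the comparison on written is modelled.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's IOMAP_WRITE and IOMAP_DIRECT bits. */
#define SKETCH_IOMAP_WRITE  (1u << 0)
#define SKETCH_IOMAP_DIRECT (1u << 1)

/* Old check: a failed direct write is recognised only by written == 0. */
static int end_check_old(unsigned int flags, long long length, long long written)
{
	(void)length; /* the old convention never looks at length */
	if ((flags & (SKETCH_IOMAP_WRITE | SKETCH_IOMAP_DIRECT)) && written == 0)
		return -ENOTBLK;
	return 0;
}

/* New check: any short write (written < length) is treated as the error case. */
static int end_check_new(unsigned int flags, long long length, long long written)
{
	if ((flags & (SKETCH_IOMAP_WRITE | SKETCH_IOMAP_DIRECT)) && written < length)
		return -ENOTBLK;
	return 0;
}
```

Once the actor reports the partially submitted byte count, a 4096-byte request that stopped after 1024 bytes passes the old check but is caught by the new one, which is why the ext4 hunk has to change together with the iomap hunk.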

Comments

Matthew Wilcox Feb. 20, 2020, 5:42 p.m. UTC | #1
On Thu, Feb 20, 2020 at 09:23:55AM -0600, Goldwyn Rodrigues wrote:
> In case of a block device error, written parameter in iomap_end()
> is zero as opposed to the amount of submitted I/O.
> Filesystems such as btrfs need to account for the I/O in ordered
> extents, even if it resulted in an error. Having (incomplete)
> submitted bytes in written gives the filesystem the amount of data
> which has been submitted before the error occurred, and the
> filesystem code can choose how to use it.
> 
> The final returned error for iomap_dio_rw() is set by
> iomap_dio_complete().
> 
> Partial writes in direct I/O are considered an error. So,
> ->iomap_end() using written == 0 as error must be changed
> to written < length. In this case, ext4 is the only user.

I really had a hard time understanding this.  I think what you meant
was:

Currently, I/Os that complete with an error indicate this by passing
written == 0 to the iomap_end function.  However, btrfs needs to know how
many bytes were written for its own accounting.  Change the convention
to pass the number of bytes which were actually written, and change the
only user to check for a short write instead of a zero length write.

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/ext4/inode.c      | 2 +-
>  fs/iomap/direct-io.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e60aca791d3f..e50e7414351a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  	 * the I/O. Any blocks that may have been allocated in preparation for
>  	 * the direct I/O will be reused during buffered I/O.
>  	 */
> -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
>  		return -ENOTBLK;
>  
>  	return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41c1e7c20a1f..01865db1bd09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;
>  		}
>  
> -- 
> 2.25.0
>
Goldwyn Rodrigues Feb. 21, 2020, 2:06 a.m. UTC | #2
On  9:42 20/02, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 09:23:55AM -0600, Goldwyn Rodrigues wrote:
> > In case of a block device error, written parameter in iomap_end()
> > is zero as opposed to the amount of submitted I/O.
> > Filesystems such as btrfs need to account for the I/O in ordered
> > extents, even if it resulted in an error. Having (incomplete)
> > submitted bytes in written gives the filesystem the amount of data
> > which has been submitted before the error occurred, and the
> > filesystem code can choose how to use it.
> > 
> > The final returned error for iomap_dio_rw() is set by
> > iomap_dio_complete().
> > 
> > Partial writes in direct I/O are considered an error. So,
> > ->iomap_end() using written == 0 as error must be changed
> > to written < length. In this case, ext4 is the only user.
> 
> I really had a hard time understanding this.  I think what you meant
> was:
> 
> Currently, I/Os that complete with an error indicate this by passing
> written == 0 to the iomap_end function.  However, btrfs needs to know how
> many bytes were written for its own accounting.  Change the convention
> to pass the number of bytes which were actually written, and change the
> only user to check for a short write instead of a zero length write.

Yes, that's right.  I was trying to cover the bases from the previous patch
and made a mess of it all. Thanks.


> 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/ext4/inode.c      | 2 +-
> >  fs/iomap/direct-io.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e60aca791d3f..e50e7414351a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> >  	 * the I/O. Any blocks that may have been allocated in preparation for
> >  	 * the direct I/O will be reused during buffered I/O.
> >  	 */
> > -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> > +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
> >  		return -ENOTBLK;
> >  
> >  	return 0;
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 41c1e7c20a1f..01865db1bd09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		size_t n;
> >  		if (dio->error) {
> >  			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >  			goto out;
> >  		}
> >  
> > -- 
> > 2.25.0
> >
Ritesh Harjani Feb. 21, 2020, 4:51 a.m. UTC | #3
On 2/20/20 8:53 PM, Goldwyn Rodrigues wrote:
> In case of a block device error, written parameter in iomap_end()
> is zero as opposed to the amount of submitted I/O.
> Filesystems such as btrfs need to account for the I/O in ordered
> extents, even if it resulted in an error. Having (incomplete)
> submitted bytes in written gives the filesystem the amount of data
> which has been submitted before the error occurred, and the
> filesystem code can choose how to use it.
> 
> The final returned error for iomap_dio_rw() is set by
> iomap_dio_complete().
> 
> Partial writes in direct I/O are considered an error. So,
> ->iomap_end() using written == 0 as error must be changed
> to written < length. In this case, ext4 is the only user.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>   fs/ext4/inode.c      | 2 +-
>   fs/iomap/direct-io.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e60aca791d3f..e50e7414351a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>   	 * the I/O. Any blocks that may have been allocated in preparation for
>   	 * the direct I/O will be reused during buffered I/O.
>   	 */
> -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
>   		return -ENOTBLK;
>   
>   	return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41c1e7c20a1f..01865db1bd09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   		size_t n;
>   		if (dio->error) {
>   			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>   			goto out;
>   		}

But if I am seeing this correctly, even after there was a dio->error,
if you return copied > 0, then the loop in iomap_dio_rw will continue
for the next iteration as well. On that second iteration it won't copy
anything since dio->error is set, and from there I guess it may return
0, which will break the loop.

Is this the correct flow? Shouldn't the while loop doing
iomap_apply in iomap_dio_rw also break in case of
dio->error? Or did I miss anything?


-ritesh
Goldwyn Rodrigues Feb. 21, 2020, 12:48 p.m. UTC | #4
On 10:21 21/02, Ritesh Harjani wrote:
> 
> 
> On 2/20/20 8:53 PM, Goldwyn Rodrigues wrote:
> > In case of a block device error, written parameter in iomap_end()
> > is zero as opposed to the amount of submitted I/O.
> > Filesystems such as btrfs need to account for the I/O in ordered
> > extents, even if it resulted in an error. Having (incomplete)
> > submitted bytes in written gives the filesystem the amount of data
> > which has been submitted before the error occurred, and the
> > filesystem code can choose how to use it.
> > 
> > The final returned error for iomap_dio_rw() is set by
> > iomap_dio_complete().
> > 
> > Partial writes in direct I/O are considered an error. So,
> > ->iomap_end() using written == 0 as error must be changed
> > to written < length. In this case, ext4 is the only user.
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >   fs/ext4/inode.c      | 2 +-
> >   fs/iomap/direct-io.c | 2 +-
> >   2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e60aca791d3f..e50e7414351a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3475,7 +3475,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> >   	 * the I/O. Any blocks that may have been allocated in preparation for
> >   	 * the direct I/O will be reused during buffered I/O.
> >   	 */
> > -	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> > +	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
> >   		return -ENOTBLK;
> >   	return 0;
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 41c1e7c20a1f..01865db1bd09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >   		size_t n;
> >   		if (dio->error) {
> >   			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >   			goto out;
> >   		}
> 
> But if I am seeing this correctly, even after there was a dio->error
> if you return copied > 0, then the loop in iomap_dio_rw will continue
> for next iteration as well. Until the second time it won't copy
> anything since dio->error is set and from there I guess it may return
> 0 which will break the loop.
> 
> Is this the correct flow? Shouldn't the while loop doing
> iomap_apply in iomap_dio_rw should also break in case of
> dio->error? Or did I miss anything?
> 

Yes, we can save an extra iteration by checking for dio->error in the
while loop of iomap_dio_rw().
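
The extra iteration discussed here can be counted with a small userspace simulation. This is a sketch under stated assumptions, not the kernel loop: struct sim_dio, sim_apply() and sim_dio_rw() are hypothetical names, the error is injected on a chosen call instead of arriving from bio completion, and the partial count returned on that call is simply half a chunk.

```c
#include <errno.h>

/* Hypothetical model of the dio state: the sticky error and a call counter. */
struct sim_dio {
	int error;
	int calls;
};

/*
 * Stand-in for one iomap_apply() round. Once the error is set it returns 0.
 * On the fail_call-th call it records an error but still reports a partial
 * byte count, as the patched actor would.
 */
static long sim_apply(struct sim_dio *dio, long count, long chunk, int fail_call)
{
	dio->calls++;
	if (dio->error)
		return 0;
	if (dio->calls == fail_call) {
		dio->error = -EIO;
		return chunk / 2;
	}
	return count < chunk ? count : chunk;
}

/* Stand-in for the do/while loop; check_error enables the proposed test. */
static int sim_dio_rw(long count, long chunk, int fail_call, int check_error)
{
	struct sim_dio dio = { 0, 0 };
	long ret;

	do {
		ret = sim_apply(&dio, count, chunk, fail_call);
		if (ret <= 0)
			break;
		count -= ret;
		if (check_error && dio.error)
			break;
	} while (count > 0);

	return dio.calls;
}
```

With a 16384-byte request, 4096-byte chunks and an error on the second call, the loop without the check makes a third call that copies nothing, while the loop with the check stops after two.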
David Sterba Feb. 21, 2020, 1:14 p.m. UTC | #5
On Thu, Feb 20, 2020 at 09:23:55AM -0600, Goldwyn Rodrigues wrote:
> In case of a block device error, written parameter in iomap_end()
> is zero as opposed to the amount of submitted I/O.
> Filesystems such as btrfs need to account for the I/O in ordered
> extents, even if it resulted in an error. Having (incomplete)
> submitted bytes in written gives the filesystem the amount of data
> which has been submitted before the error occurred, and the
> filesystem code can choose how to use it.
> 
> The final returned error for iomap_dio_rw() is set by
> iomap_dio_complete().
> 
> Partial writes in direct I/O are considered an error. So,
> ->iomap_end() using written == 0 as error must be changed
> to written < length. In this case, ext4 is the only user.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 41c1e7c20a1f..01865db1bd09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;
>  		}

This part fixes problems I saw with the dio-iomap btrfs conversion
patchset, thanks.
Christoph Hellwig Feb. 25, 2020, 8:53 p.m. UTC | #6
On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
> >   		if (dio->error) {
> >   			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >   			goto out;
> >   		}
> 
> But if I am seeing this correctly, even after there was a dio->error
> if you return copied > 0, then the loop in iomap_dio_rw will continue
> for next iteration as well. Until the second time it won't copy
> anything since dio->error is set and from there I guess it may return
> 0 which will break the loop.

In addition to that, copied is also used in the iov_iter_reexpand call.
We don't really need the re-expand in case of errors, and in fact we also
have the iov_iter_revert call before jumping out, so this will
need a little bit more of an audit and needs to be properly documented
in the commit log.

> 
> Is this the correct flow? Shouldn't the while loop doing
> iomap_apply in iomap_dio_rw should also break in case of
> dio->error? Or did I miss anything?

We'd need something there iff we care about a good number of written
in case of the error.  Goldwyn, can you explain what you need this
number for in btrfs?  Maybe with a pointer to the current code base?
Damien Le Moal Feb. 26, 2020, 2:12 a.m. UTC | #7
On 2020/02/26 5:53, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
>>>   		if (dio->error) {
>>>   			iov_iter_revert(dio->submit.iter, copied);
>>> -			copied = ret = 0;
>>> +			ret = 0;
>>>   			goto out;
>>>   		}
>>
>> But if I am seeing this correctly, even after there was a dio->error
>> if you return copied > 0, then the loop in iomap_dio_rw will continue
>> for next iteration as well. Until the second time it won't copy
>> anything since dio->error is set and from there I guess it may return
>> 0 which will break the loop.
> 
> In addition to that copied is also iov_iter_reexpand call.  We don't
> really need the re-expand in case of errors, and in fact we also
> have the iov_iter_revert call before jumping out, so this will
> need a little bit more of an audit and properly documented in the
> commit log.
> 
>>
>> Is this the correct flow? Shouldn't the while loop doing
>> iomap_apply in iomap_dio_rw should also break in case of
>> dio->error? Or did I miss anything?
> 
> We'd need something there iff we care about a good number of written
> in case of the error.  Goldwyn, can you explain what you need this
> number for in btrfs?  Maybe with a pointer to the current code base?

Not sure about btrfs, but for zonefs, getting the partial I/O count done for a
failed large dio would also be useful to avoid having to do the error recovery
dance with report zones for getting the current zone write pointer.
Goldwyn Rodrigues Feb. 26, 2020, 2:55 a.m. UTC | #8
On 12:53 25/02, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
> > >   		if (dio->error) {
> > >   			iov_iter_revert(dio->submit.iter, copied);
> > > -			copied = ret = 0;
> > > +			ret = 0;
> > >   			goto out;
> > >   		}
> > 
> > But if I am seeing this correctly, even after there was a dio->error
> > if you return copied > 0, then the loop in iomap_dio_rw will continue
> > for next iteration as well. Until the second time it won't copy
> > anything since dio->error is set and from there I guess it may return
> > 0 which will break the loop.
> 
> In addition to that copied is also iov_iter_reexpand call.  We don't
> really need the re-expand in case of errors, and in fact we also
> have the iov_iter_revert call before jumping out, so this will
> need a little bit more of an audit and properly documented in the
> commit log.
> 
> > 
> > Is this the correct flow? Shouldn't the while loop doing
> > iomap_apply in iomap_dio_rw should also break in case of
> > dio->error? Or did I miss anything?
> 
> We'd need something there iff we care about a good number of written
> in case of the error.  Goldwyn, can you explain what you need this
> number for in btrfs?  Maybe with a pointer to the current code base?

btrfs needs to account for the bytes "processed", failed or
uptodate. This is currently performed in
fs/btrfs/inode.c:__end_write_update_ordered().

For the current development version, how I am using it is in my git
branch btrfs-iomap-dio [1]. The related commit besides this patch
is:

9aeb2b31d10b ("btrfs: Use ->iomap_end() instead of btrfs_dio_data")

[1] https://github.com/goldwynr/linux/tree/btrfs-iomap-dio
Goldwyn Rodrigues Feb. 28, 2020, 7:44 p.m. UTC | #9
On 12:53 25/02, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 10:21:04AM +0530, Ritesh Harjani wrote:
> > >   		if (dio->error) {
> > >   			iov_iter_revert(dio->submit.iter, copied);
> > > -			copied = ret = 0;
> > > +			ret = 0;
> > >   			goto out;
> > >   		}
> > 
> > But if I am seeing this correctly, even after there was a dio->error
> > if you return copied > 0, then the loop in iomap_dio_rw will continue
> > for next iteration as well. Until the second time it won't copy
> > anything since dio->error is set and from there I guess it may return
> > 0 which will break the loop.
> 


Reading the code again, there are a few clarifications.

If iomap_end() handles (written < length) as an error, iomap_apply()
will return an error immediately. It will not execute the 
loop a second time.

On the other hand, if there is no ->iomap_end() defined by the
filesystem such as in the case of XFS, we will need to check for
dio->error in the do {} while loop of iomap_dio_rw().

> In addition to that copied is also iov_iter_reexpand call.  We don't
> really need the re-expand in case of errors, and in fact we also
> have the iov_iter_revert call before jumping out, so this will
> need a little bit more of an audit and properly documented in the
> commit log.

We are still handling this as an error, so why are we concerned about
expanding? There is no success/written returned in iomap_dio_rw() call
in case of an error.

Attached is an updated patch.
Matthew Wilcox Feb. 28, 2020, 7:59 p.m. UTC | #10
On Fri, Feb 28, 2020 at 01:44:01PM -0600, Goldwyn Rodrigues wrote:
> +++ b/fs/iomap/direct-io.c
> @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		size_t n;
>  		if (dio->error) {
>  			iov_iter_revert(dio->submit.iter, copied);
> -			copied = ret = 0;
> +			ret = 0;
>  			goto out;

There's another change here ... look at the out label

out:
        /* Undo iter limitation to current extent */
        iov_iter_reexpand(dio->submit.iter, orig_count - copied);
        if (copied)
                return copied;
        return ret;

so you're also changing by how much the iter is reexpanded.  I
don't know if it's the appropriate amount; I still don't quite get the
iov_iter complexities.
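
The arithmetic behind this observation can be replayed with a toy model that tracks only the iterator's remaining byte count. It is a sketch, not the kernel's iov_iter: struct toy_iter and the toy_*() helpers are hypothetical, they ignore segments and offsets, and they mirror only how truncate, advance, revert and reexpand change the count. Whether the resulting count is the appropriate one is left open, as in the message above.

```c
#include <stddef.h>

/* Toy iterator: only the number of bytes still available is tracked. */
struct toy_iter {
	size_t count;
};

static void toy_truncate(struct toy_iter *i, size_t count)
{
	if (i->count > count)
		i->count = count;
}

static void toy_advance(struct toy_iter *i, size_t bytes)
{
	i->count -= bytes;
}

static void toy_revert(struct toy_iter *i, size_t bytes)
{
	i->count += bytes;
}

static void toy_reexpand(struct toy_iter *i, size_t count)
{
	i->count = count;
}

/*
 * Replay the error path: limit the iterator to the extent, submit some
 * bytes, observe the error, revert, then reexpand at the out label.
 * keep_copied == 0 models "copied = ret = 0"; keep_copied == 1 models
 * the patched "ret = 0". Assumes submitted <= length <= orig_count.
 */
static size_t count_after_error(size_t orig_count, size_t length,
				size_t submitted, int keep_copied)
{
	struct toy_iter it = { orig_count };
	size_t copied = 0;

	toy_truncate(&it, length);
	toy_advance(&it, submitted);
	copied += submitted;

	toy_revert(&it, copied);	/* error seen: rewind what was consumed */
	if (!keep_copied)
		copied = 0;

	toy_reexpand(&it, orig_count - copied);	/* the out label */
	return it.count;
}
```

For an iterator of 16384 bytes, an 8192-byte extent and 4096 bytes submitted before the error, the old code leaves the count at 16384, while the patched code leaves it at 12288 even though the revert has already rewound the position.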
Goldwyn Rodrigues Feb. 28, 2020, 8:35 p.m. UTC | #11
On 11:59 28/02, Matthew Wilcox wrote:
> On Fri, Feb 28, 2020 at 01:44:01PM -0600, Goldwyn Rodrigues wrote:
> > +++ b/fs/iomap/direct-io.c
> > @@ -264,7 +264,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		size_t n;
> >  		if (dio->error) {
> >  			iov_iter_revert(dio->submit.iter, copied);
> > -			copied = ret = 0;
> > +			ret = 0;
> >  			goto out;
> 
> There's another change here ... look at the out label
> 
> out:
>         /* Undo iter limitation to current extent */
>         iov_iter_reexpand(dio->submit.iter, orig_count - copied);
>         if (copied)
>                 return copied;
>         return ret;
> 
> so you're also changing by how much the iter is reexpanded.  I
> don't know if it's the appropriate amount; I still don't quite get the
> iov_iter complexities.
> 

Ah, okay. Now I understand what Christoph was saying.

I suppose it is safe to remove iov_iter_reexpand(). I don't see any
other goto to this label which will have a non-zero copied value.
And we have already performed the iov_iter_revert().
Christoph Hellwig March 2, 2020, 1:31 p.m. UTC | #12
On Fri, Feb 28, 2020 at 02:35:38PM -0600, Goldwyn Rodrigues wrote:
> 
> Ah, okay. Now I understand what Christoph was saying.
> 
> I suppose it is safe to remove iov_iter_reexpand(). I don't see any
> other goto to this label which will have a non-zero copied value.
> And we have already performed the iov_iter_revert().

I don't really understand the iov_iter complexities either, at least
not without spending significant time with the implementation.  But
the important thing is that you document the changes in behavior and
your findings on why you think it is safe.

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e60aca791d3f..e50e7414351a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3475,7 +3475,7 @@  static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	 * the I/O. Any blocks that may have been allocated in preparation for
 	 * the direct I/O will be reused during buffered I/O.
 	 */
-	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written < length)
 		return -ENOTBLK;
 
 	return 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 41c1e7c20a1f..01865db1bd09 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -264,7 +264,7 @@  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		size_t n;
 		if (dio->error) {
 			iov_iter_revert(dio->submit.iter, copied);
-			copied = ret = 0;
+			ret = 0;
 			goto out;
 		}