diff mbox series

[v2] io_uring: fix short read slow path

Message ID 20220630010137.2518851-1-dominique.martinet@atmark-techno.com
State New
Headers show
Series [v2] io_uring: fix short read slow path | expand

Commit Message

Dominique MARTINET June 30, 2022, 1:01 a.m. UTC
sqeq.off here is the offset to read within the disk image, so obviously
not 'nread' (the amount we just read), but as the author meant to write
its current value incremented by the amount we just read.

Normally recent versions of linux will not issue short reads,
but it can happen so we should fix this.

This lead to weird image corruptions when short read happened

Fixes: 6663a0a33764 ("block/io_uring: implements interfaces for io_uring")
Link: https://lkml.kernel.org/r/YrrFGO4A1jS0GI0G@atmark-techno.com
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
---
v1 -> v2: also updated total_read to use += as suggested by Kevin,
thank you!

I've tested this quickly by making short reads "recursives", e.g. added
the following to luring_resubmit_short_read() after setting 'remaining':
    if (remaining > 4096) remaining -= 4096;

so when we ask for more we issue an extra short reads, making sure we go
through the two short reads path.
(Unfortunately I wasn't quite sure what to fiddle with to issue short
reads in the first place, I tried cutting one of the iovs short in
luring_do_submit() but I must not have been doing it properly as I ended
up with 0 return values which are handled by filling in with 0 (reads
after eof) and that didn't work well)

Anyway, this looks OK to me now.

Thanks,
Dominique

 block/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Hanna Czenczek June 30, 2022, 3:43 p.m. UTC | #1
On 30.06.22 03:01, Dominique Martinet wrote:
> sqeq.off here is the offset to read within the disk image, so obviously
> not 'nread' (the amount we just read), but as the author meant to write
> its current value incremented by the amount we just read.
>
> Normally recent versions of linux will not issue short reads,
> but it can happen so we should fix this.
>
> This lead to weird image corruptions when short read happened
>
> Fixes: 6663a0a33764 ("block/io_uring: implements interfaces for io_uring")
> Link: https://lkml.kernel.org/r/YrrFGO4A1jS0GI0G@atmark-techno.com
> Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
> ---
> v1 -> v2: also updated total_read to use += as suggested by Kevin,
> thank you!
>
> I've tested this quickly by making short reads "recursives", e.g. added
> the following to luring_resubmit_short_read() after setting 'remaining':
>      if (remaining > 4096) remaining -= 4096;
>
> so when we ask for more we issue an extra short reads, making sure we go
> through the two short reads path.
> (Unfortunately I wasn't quite sure what to fiddle with to issue short
> reads in the first place, I tried cutting one of the iovs short in
> luring_do_submit() but I must not have been doing it properly as I ended
> up with 0 return values which are handled by filling in with 0 (reads
> after eof) and that didn't work well)
>
> Anyway, this looks OK to me now.
>
> Thanks,
> Dominique
>
>   block/io_uring.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Stefano Garzarella June 30, 2022, 3:49 p.m. UTC | #2
On Thu, Jun 30, 2022 at 10:01:37AM +0900, Dominique Martinet wrote:
>sqeq.off here is the offset to read within the disk image, so obviously
>not 'nread' (the amount we just read), but as the author meant to write
>its current value incremented by the amount we just read.
>
>Normally recent versions of linux will not issue short reads,
>but it can happen so we should fix this.
>
>This lead to weird image corruptions when short read happened
>
>Fixes: 6663a0a33764 ("block/io_uring: implements interfaces for io_uring")
>Link: https://lkml.kernel.org/r/YrrFGO4A1jS0GI0G@atmark-techno.com
>Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>

Thanks for fixing this issue!

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

>---
>v1 -> v2: also updated total_read to use += as suggested by Kevin,
>thank you!
>
>I've tested this quickly by making short reads "recursives", e.g. added
>the following to luring_resubmit_short_read() after setting 'remaining':
>    if (remaining > 4096) remaining -= 4096;
>
>so when we ask for more we issue an extra short reads, making sure we go
>through the two short reads path.
>(Unfortunately I wasn't quite sure what to fiddle with to issue short
>reads in the first place, I tried cutting one of the iovs short in
>luring_do_submit() but I must not have been doing it properly as I ended
>up with 0 return values which are handled by filling in with 0 (reads
>after eof) and that didn't work well)

Do you remember the kernel version where you first saw these problems?

Thanks,
Stefano
Dominique MARTINET June 30, 2022, 10:52 p.m. UTC | #3
Stefano Garzarella wrote on Thu, Jun 30, 2022 at 05:49:21PM +0200:
> > so when we ask for more we issue an extra short reads, making sure we go
> > through the two short reads path.
> > (Unfortunately I wasn't quite sure what to fiddle with to issue short
> > reads in the first place, I tried cutting one of the iovs short in
> > luring_do_submit() but I must not have been doing it properly as I ended
> > up with 0 return values which are handled by filling in with 0 (reads
> > after eof) and that didn't work well)
> 
> Do you remember the kernel version where you first saw these problems?

Since you're quoting my paragraph about testing two short reads, I've
never seen any that I know of; but there's also no reason these couldn't
happen.

Single short reads have been happening for me with O_DIRECT (cache=none)
on btrfs for a while, but unfortunately I cannot remember which was the
first kernel I've seen this on -- I think rather than a kernel update it
was due to file manipulations that made the file eligible for short
reads in the first place (I started running deduplication on the backing
file)

The older kernel I have installed right now is 5.16 and that can
reproduce it --  I'll give my laptop some work over the weekend to test
still maintained stable branches if that's useful.
Dominique MARTINET July 1, 2022, 1:33 a.m. UTC | #4
Dominique Martinet wrote on Fri, Jul 01, 2022 at 07:52:31AM +0900:
> Stefano Garzarella wrote on Thu, Jun 30, 2022 at 05:49:21PM +0200:
> > > so when we ask for more we issue an extra short reads, making sure we go
> > > through the two short reads path.
> > > (Unfortunately I wasn't quite sure what to fiddle with to issue short
> > > reads in the first place, I tried cutting one of the iovs short in
> > > luring_do_submit() but I must not have been doing it properly as I ended
> > > up with 0 return values which are handled by filling in with 0 (reads
> > > after eof) and that didn't work well)
> > 
> > Do you remember the kernel version where you first saw these problems?
> 
> Since you're quoting my paragraph about testing two short reads, I've
> never seen any that I know of; but there's also no reason these couldn't
> happen.
> 
> Single short reads have been happening for me with O_DIRECT (cache=none)
> on btrfs for a while, but unfortunately I cannot remember which was the
> first kernel I've seen this on -- I think rather than a kernel update it
> was due to file manipulations that made the file eligible for short
> reads in the first place (I started running deduplication on the backing
> file)
> 
> The older kernel I have installed right now is 5.16 and that can
> reproduce it --  I'll give my laptop some work over the weekend to test
> still maintained stable branches if that's useful.

I can confirm the same behavior happens with 5.4.202
I'm not sure about older kernels as my io_uring test that reproduces
short reads just doesn't start on these, but for all intent and purposes
we can probably say this has just always been possible.

The fix Filipe sent me unfortunately doesn't apply as far back as 5.4
(btrfs_get_blocks_direct had no flags argument back then, that got
introduced with conversion to dio in 5.8), but should probably work as
is for 5.10/15 kernels.
Stefan Hajnoczi July 5, 2022, 1:28 p.m. UTC | #5
On Fri, Jul 01, 2022 at 07:52:31AM +0900, Dominique Martinet wrote:
> Stefano Garzarella wrote on Thu, Jun 30, 2022 at 05:49:21PM +0200:
> > > so when we ask for more we issue an extra short reads, making sure we go
> > > through the two short reads path.
> > > (Unfortunately I wasn't quite sure what to fiddle with to issue short
> > > reads in the first place, I tried cutting one of the iovs short in
> > > luring_do_submit() but I must not have been doing it properly as I ended
> > > up with 0 return values which are handled by filling in with 0 (reads
> > > after eof) and that didn't work well)
> > 
> > Do you remember the kernel version where you first saw these problems?
> 
> Since you're quoting my paragraph about testing two short reads, I've
> never seen any that I know of; but there's also no reason these couldn't
> happen.
> 
> Single short reads have been happening for me with O_DIRECT (cache=none)
> on btrfs for a while, but unfortunately I cannot remember which was the
> first kernel I've seen this on -- I think rather than a kernel update it
> was due to file manipulations that made the file eligible for short
> reads in the first place (I started running deduplication on the backing
> file)
> 
> The older kernel I have installed right now is 5.16 and that can
> reproduce it --  I'll give my laptop some work over the weekend to test
> still maintained stable branches if that's useful.

Hi Dominique,
Linux 5.16 contains commit 9d93a3f5a0c ("io_uring: punt short reads to
async context"). The comment above QEMU's luring_resubmit_short_read()
claims that short reads are a bug that was fixed by Linux commit
9d93a3f5a0c.

If the comment is inaccurate it needs to be fixed. Maybe short writes
need to be handled too.

I have CCed Jens and the io_uring mailing list to clarify:
1. Are short IORING_OP_READV reads possible on files/block devices?
2. Are short IORING_OP_WRITEV writes possible on files/block devices?

Thanks,
Stefan
Stefan Hajnoczi July 5, 2022, 1:34 p.m. UTC | #6
On Thu, Jun 30, 2022 at 10:01:37AM +0900, Dominique Martinet wrote:
> sqeq.off here is the offset to read within the disk image, so obviously
> not 'nread' (the amount we just read), but as the author meant to write
> its current value incremented by the amount we just read.
> 
> Normally recent versions of linux will not issue short reads,
> but it can happen so we should fix this.
> 
> This lead to weird image corruptions when short read happened
> 
> Fixes: 6663a0a33764 ("block/io_uring: implements interfaces for io_uring")
> Link: https://lkml.kernel.org/r/YrrFGO4A1jS0GI0G@atmark-techno.com
> Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
> ---
> v1 -> v2: also updated total_read to use += as suggested by Kevin,
> thank you!
> 
> I've tested this quickly by making short reads "recursives", e.g. added
> the following to luring_resubmit_short_read() after setting 'remaining':
>     if (remaining > 4096) remaining -= 4096;
> 
> so when we ask for more we issue an extra short reads, making sure we go
> through the two short reads path.
> (Unfortunately I wasn't quite sure what to fiddle with to issue short
> reads in the first place, I tried cutting one of the iovs short in
> luring_do_submit() but I must not have been doing it properly as I ended
> up with 0 return values which are handled by filling in with 0 (reads
> after eof) and that didn't work well)
> 
> Anyway, this looks OK to me now.
> 
> Thanks,
> Dominique
> 
>  block/io_uring.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan
Jens Axboe July 5, 2022, 7:23 p.m. UTC | #7
On 7/5/22 7:28 AM, Stefan Hajnoczi wrote:
> On Fri, Jul 01, 2022 at 07:52:31AM +0900, Dominique Martinet wrote:
>> Stefano Garzarella wrote on Thu, Jun 30, 2022 at 05:49:21PM +0200:
>>>> so when we ask for more we issue an extra short reads, making sure we go
>>>> through the two short reads path.
>>>> (Unfortunately I wasn't quite sure what to fiddle with to issue short
>>>> reads in the first place, I tried cutting one of the iovs short in
>>>> luring_do_submit() but I must not have been doing it properly as I ended
>>>> up with 0 return values which are handled by filling in with 0 (reads
>>>> after eof) and that didn't work well)
>>>
>>> Do you remember the kernel version where you first saw these problems?
>>
>> Since you're quoting my paragraph about testing two short reads, I've
>> never seen any that I know of; but there's also no reason these couldn't
>> happen.
>>
>> Single short reads have been happening for me with O_DIRECT (cache=none)
>> on btrfs for a while, but unfortunately I cannot remember which was the
>> first kernel I've seen this on -- I think rather than a kernel update it
>> was due to file manipulations that made the file eligible for short
>> reads in the first place (I started running deduplication on the backing
>> file)
>>
>> The older kernel I have installed right now is 5.16 and that can
>> reproduce it --  I'll give my laptop some work over the weekend to test
>> still maintained stable branches if that's useful.
> 
> Hi Dominique,
> Linux 5.16 contains commit 9d93a3f5a0c ("io_uring: punt short reads to
> async context"). The comment above QEMU's luring_resubmit_short_read()
> claims that short reads are a bug that was fixed by Linux commit
> 9d93a3f5a0c.
> 
> If the comment is inaccurate it needs to be fixed. Maybe short writes
> need to be handled too.
> 
> I have CCed Jens and the io_uring mailing list to clarify:
> 1. Are short IORING_OP_READV reads possible on files/block devices?
> 2. Are short IORING_OP_WRITEV writes possible on files/block devices?

In general we try very hard to avoid them, but if eg we get a short read
or write from blocking context (eg io-wq), then io_uring does return
that. There's really not much we can do here, it seems futile to retry
IO which was issued just like it would've been from a normal blocking
syscall yet it is still short.
Dominique MARTINET July 5, 2022, 10:52 p.m. UTC | #8
Stefan Hajnoczi wrote on Tue, Jul 05, 2022 at 02:28:08PM +0100:
> > The older kernel I have installed right now is 5.16 and that can
> > reproduce it --  I'll give my laptop some work over the weekend to test
> > still maintained stable branches if that's useful.
> 
> Linux 5.16 contains commit 9d93a3f5a0c ("io_uring: punt short reads to
> async context"). The comment above QEMU's luring_resubmit_short_read()
> claims that short reads are a bug that was fixed by Linux commit
> 9d93a3f5a0c.
> 
> If the comment is inaccurate it needs to be fixed. Maybe short writes
> need to be handled too.
> 
> I have CCed Jens and the io_uring mailing list to clarify:
> 1. Are short IORING_OP_READV reads possible on files/block devices?
> 2. Are short IORING_OP_WRITEV writes possible on files/block devices?

Jens replied before me, so I won't be adding much (I agree with his
reply -- linux tries hard to avoid short reads but we should assume they
can happen)

In this particular case it was another btrfs bug with O_DIRECT and mixed
compression in a file, that's been fixed by this patch:
https://lore.kernel.org/all/20220630151038.GA459423@falcondesktop/

queued here:
https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=dio_fixes&id=b3864441547e49a69d45c7771aa8cc5e595d18fc

It should be backported to 5.10, but the problem will likely persist in
5.4 kernels if anyone runs on that as the code changed enough to make
backporting non-trivial.


So, WRT that comment, we probably should remove the reference to that
commit and leave in that they should be very rare but we need to handle
them anyway.


For writes in particular, I haven't seen any and looking at the code
qemu would blow up that storage (IO treated as ENOSPC would likely mark
disk read-only?)
It might make sense to add some warning message that it's what happened
so it'll be obvious what needs doing in case anyone falls on that but I
think the status-quo is good enough here.
Stefan Hajnoczi July 6, 2022, 7:16 a.m. UTC | #9
On Tue, 5 Jul 2022 at 20:26, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 7/5/22 7:28 AM, Stefan Hajnoczi wrote:
> > On Fri, Jul 01, 2022 at 07:52:31AM +0900, Dominique Martinet wrote:
> >> Stefano Garzarella wrote on Thu, Jun 30, 2022 at 05:49:21PM +0200:
> >>>> so when we ask for more we issue an extra short reads, making sure we go
> >>>> through the two short reads path.
> >>>> (Unfortunately I wasn't quite sure what to fiddle with to issue short
> >>>> reads in the first place, I tried cutting one of the iovs short in
> >>>> luring_do_submit() but I must not have been doing it properly as I ended
> >>>> up with 0 return values which are handled by filling in with 0 (reads
> >>>> after eof) and that didn't work well)
> >>>
> >>> Do you remember the kernel version where you first saw these problems?
> >>
> >> Since you're quoting my paragraph about testing two short reads, I've
> >> never seen any that I know of; but there's also no reason these couldn't
> >> happen.
> >>
> >> Single short reads have been happening for me with O_DIRECT (cache=none)
> >> on btrfs for a while, but unfortunately I cannot remember which was the
> >> first kernel I've seen this on -- I think rather than a kernel update it
> >> was due to file manipulations that made the file eligible for short
> >> reads in the first place (I started running deduplication on the backing
> >> file)
> >>
> >> The older kernel I have installed right now is 5.16 and that can
> >> reproduce it --  I'll give my laptop some work over the weekend to test
> >> still maintained stable branches if that's useful.
> >
> > Hi Dominique,
> > Linux 5.16 contains commit 9d93a3f5a0c ("io_uring: punt short reads to
> > async context"). The comment above QEMU's luring_resubmit_short_read()
> > claims that short reads are a bug that was fixed by Linux commit
> > 9d93a3f5a0c.
> >
> > If the comment is inaccurate it needs to be fixed. Maybe short writes
> > need to be handled too.
> >
> > I have CCed Jens and the io_uring mailing list to clarify:
> > 1. Are short IORING_OP_READV reads possible on files/block devices?
> > 2. Are short IORING_OP_WRITEV writes possible on files/block devices?
>
> In general we try very hard to avoid them, but if eg we get a short read
> or write from blocking context (eg io-wq), then io_uring does return
> that. There's really not much we can do here, it seems futile to retry
> IO which was issued just like it would've been from a normal blocking
> syscall yet it is still short.

Thanks! QEMU's short I/O handling is spotty - some code paths handle
it while others don't. For the io_uring QEMU block driver we'll try to
handle short all I/Os.

Stefan
Stefan Hajnoczi July 6, 2022, 7:17 a.m. UTC | #10
On Tue, 5 Jul 2022 at 23:53, Dominique Martinet
<dominique.martinet@atmark-techno.com> wrote:
>
> Stefan Hajnoczi wrote on Tue, Jul 05, 2022 at 02:28:08PM +0100:
> > > The older kernel I have installed right now is 5.16 and that can
> > > reproduce it --  I'll give my laptop some work over the weekend to test
> > > still maintained stable branches if that's useful.
> >
> > Linux 5.16 contains commit 9d93a3f5a0c ("io_uring: punt short reads to
> > async context"). The comment above QEMU's luring_resubmit_short_read()
> > claims that short reads are a bug that was fixed by Linux commit
> > 9d93a3f5a0c.
> >
> > If the comment is inaccurate it needs to be fixed. Maybe short writes
> > need to be handled too.
> >
> > I have CCed Jens and the io_uring mailing list to clarify:
> > 1. Are short IORING_OP_READV reads possible on files/block devices?
> > 2. Are short IORING_OP_WRITEV writes possible on files/block devices?
>
> Jens replied before me, so I won't be adding much (I agree with his
> reply -- linux tries hard to avoid short reads but we should assume they
> can happen)
>
> In this particular case it was another btrfs bug with O_DIRECT and mixed
> compression in a file, that's been fixed by this patch:
> https://lore.kernel.org/all/20220630151038.GA459423@falcondesktop/
>
> queued here:
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=dio_fixes&id=b3864441547e49a69d45c7771aa8cc5e595d18fc
>
> It should be backported to 5.10, but the problem will likely persist in
> 5.4 kernels if anyone runs on that as the code changed enough to make
> backporting non-trivial.
>
>
> So, WRT that comment, we probably should remove the reference to that
> commit and leave in that they should be very rare but we need to handle
> them anyway.
>
>
> For writes in particular, I haven't seen any and looking at the code
> qemu would blow up that storage (IO treated as ENOSPC would likely mark
> disk read-only?)
> It might make sense to add some warning message that it's what happened
> so it'll be obvious what needs doing in case anyone falls on that but I
> think the status-quo is good enough here.

Great! I've already queued your fix.

Do you want to send a follow-up that updates the comment?

Thanks,
Stefan
Dominique MARTINET July 6, 2022, 7:26 a.m. UTC | #11
Stefan Hajnoczi wrote on Wed, Jul 06, 2022 at 08:17:42AM +0100:
> Great! I've already queued your fix.

Thanks!

> Do you want to send a follow-up that updates the comment?

I don't think I'd add much value at this point, leaving it to you unless
you really would prefer me to send it.


Cheers,
Stefan Hajnoczi July 6, 2022, 7:51 a.m. UTC | #12
On Wed, Jul 06, 2022 at 04:26:59PM +0900, Dominique Martinet wrote:
> Stefan Hajnoczi wrote on Wed, Jul 06, 2022 at 08:17:42AM +0100:
> > Great! I've already queued your fix.
> 
> Thanks!
> 
> > Do you want to send a follow-up that updates the comment?
> 
> I don't think I'd add much value at this point, leaving it to you unless
> you really would prefer me to send it.

That's fine. I'll send a patch. Thanks!

Stefan
diff mbox series

Patch

diff --git a/block/io_uring.c b/block/io_uring.c
index d48e472e74cb..b238661740f5 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -89,7 +89,7 @@  static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
     trace_luring_resubmit_short_read(s, luringcb, nread);
 
     /* Update read position */
-    luringcb->total_read = nread;
+    luringcb->total_read += nread;
     remaining = luringcb->qiov->size - luringcb->total_read;
 
     /* Shorten qiov */
@@ -103,7 +103,7 @@  static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
                       remaining);
 
     /* Update sqe */
-    luringcb->sqeq.off = nread;
+    luringcb->sqeq.off += nread;
     luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
     luringcb->sqeq.len = luringcb->resubmit_qiov.niov;