
[for-1.4,stable] block: handle spurious coroutine entries

Message ID 1360428298-15829-1-git-send-email-stefanha@redhat.com
State New

Commit Message

Stefan Hajnoczi Feb. 9, 2013, 4:44 p.m. UTC
bdrv_co_io_em(), bdrv_co_flush(), and bdrv_co_discard() yield the
coroutine when waiting for aio to complete.  They do not check that the
request has actually finished.

In simple cases this works, but it returns early when we get spurious
wake-ups due to qemu_coroutine_enter() being called from other sources.
One such example is block-migration.c:process_incoming_migration().

This patch fixes a segfault on incoming block migration.

Reported-by: David Pravec <david.pravec@nethost.cz>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
This patch only partially fixes block migration.  There is another issue that
causes block migration to fail with "Unknown flags" at 99%.

 block.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
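The guard the diff introduces (a completion flag plus a yield loop) can be modeled outside QEMU as a small, self-contained sketch. The struct mirrors the patch, but co_io_step() and io_complete() are hypothetical stand-ins for the coroutine enter/yield machinery, not real QEMU API:

```c
#include <assert.h>
#include <stdbool.h>

/* Completion state shared between the coroutine and the AIO callback,
 * as in the patch: done distinguishes a real completion from a
 * spurious enter. */
typedef struct {
    bool done;
    int  ret;
} CoroutineIOCompletion;

/* Each call models one qemu_coroutine_enter() delivered to the
 * waiting coroutine.  It returns true only once the request has
 * really finished; a spurious enter (done still false) just "yields"
 * again by returning false. */
static bool co_io_step(const CoroutineIOCompletion *co, int *out_ret)
{
    if (!co->done) {
        return false;            /* spurious wakeup: yield again */
    }
    *out_ret = co->ret;
    return true;
}

/* Models bdrv_co_io_em_complete(): store the result, mark done. */
static void io_complete(CoroutineIOCompletion *co, int ret)
{
    co->ret = ret;
    co->done = true;
}
```

Without the recheck of co.done after every wakeup, a spurious enter would read co.ret before the AIO callback has stored it, which is exactly the early return the patch closes.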

Comments

Paolo Bonzini Feb. 11, 2013, 10:16 a.m. UTC | #1
Il 09/02/2013 17:44, Stefan Hajnoczi ha scritto:
> bdrv_co_io_em(), bdrv_co_flush(), and bdrv_co_discard() yield the
> coroutine when waiting for aio to complete.  They do not check that the
> request has actually finished.
> 
> In simple cases this works, but it returns early when we get spurious
> wake-ups due to qemu_coroutine_enter() being called from other sources.
> One such example is block-migration.c:process_incoming_migration().
> 
> This patch fixes a segfault on incoming block migration.
> 
> Reported-by: David Pravec <david.pravec@nethost.cz>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

I would prefer to have ret initialized to -EINPROGRESS, but since we're
close to release it's easier this way.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
> This patch only partially fixes block migration.  There is another issue that
> causes block migration to fail with "Unknown flags" at 99%.
> 
>  block.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 50dab8e..f2f3b80 100644
> --- a/block.c
> +++ b/block.c
> @@ -3961,6 +3961,7 @@ void qemu_aio_release(void *p)
>  typedef struct CoroutineIOCompletion {
>      Coroutine *coroutine;
>      int ret;
> +    bool done;
>  } CoroutineIOCompletion;
>  
>  static void bdrv_co_io_em_complete(void *opaque, int ret)
> @@ -3968,6 +3969,7 @@ static void bdrv_co_io_em_complete(void *opaque, int ret)
>      CoroutineIOCompletion *co = opaque;
>  
>      co->ret = ret;
> +    co->done = true;
>      qemu_coroutine_enter(co->coroutine, NULL);
>  }
>  
> @@ -3992,7 +3994,9 @@ static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
>      if (!acb) {
>          return -EIO;
>      }
> -    qemu_coroutine_yield();
> +    while (!co.done) {
> +        qemu_coroutine_yield();
> +    }
>  
>      return co.ret;
>  }
> @@ -4051,7 +4055,9 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>          if (acb == NULL) {
>              ret = -EIO;
>          } else {
> -            qemu_coroutine_yield();
> +            while (!co.done) {
> +                qemu_coroutine_yield();
> +            }
>              ret = co.ret;
>          }
>      } else {
> @@ -4161,7 +4167,9 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>          if (acb == NULL) {
>              return -EIO;
>          } else {
> -            qemu_coroutine_yield();
> +            while (!co.done) {
> +                qemu_coroutine_yield();
> +            }
>              return co.ret;
>          }
>      } else {
>
Stefan Hajnoczi Feb. 11, 2013, 10:57 a.m. UTC | #2
On Mon, Feb 11, 2013 at 11:16:09AM +0100, Paolo Bonzini wrote:
> Il 09/02/2013 17:44, Stefan Hajnoczi ha scritto:
> > bdrv_co_io_em(), bdrv_co_flush(), and bdrv_co_discard() yield the
> > coroutine when waiting for aio to complete.  They do not check that the
> > request has actually finished.
> > 
> > In simple cases this works, but it returns early when we get spurious
> > wake-ups due to qemu_coroutine_enter() being called from other sources.
> > One such example is block-migration.c:process_incoming_migration().
> > 
> > This patch fixes a segfault on incoming block migration.
> > 
> > Reported-by: David Pravec <david.pravec@nethost.cz>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> I would prefer to have ret initialized to -EINPROGRESS, but since we're
> close to release it's easier this way.

I did it that way first but then I felt a little paranoid about using
the return value space as the sentinel.  Granted, most block functions
will never return -EINPROGRESS but I think a separate bool done variable
is cleaner.

Stefan
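For comparison, the -EINPROGRESS variant Paolo suggests can be sketched as below. SentinelCompletion and the helper names are hypothetical; the code makes Stefan's reservation visible: the scheme reuses the return-value space as the sentinel, so it is only safe while no block function can legitimately return -EINPROGRESS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Alternative to the bool done flag: ret itself doubles as the
 * "still running" marker. */
typedef struct {
    int ret;
} SentinelCompletion;

static void sentinel_init(SentinelCompletion *co)
{
    co->ret = -EINPROGRESS;      /* sentinel: request still running */
}

/* The yield loop would spin on this instead of on co.done. */
static bool sentinel_finished(const SentinelCompletion *co)
{
    return co->ret != -EINPROGRESS;
}

static void sentinel_complete(SentinelCompletion *co, int ret)
{
    /* ret must never itself be -EINPROGRESS, or completion is lost. */
    co->ret = ret;
}
```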
Paolo Bonzini Feb. 11, 2013, 11:32 a.m. UTC | #3
Il 11/02/2013 11:57, Stefan Hajnoczi ha scritto:
>> > I would prefer to have ret initialized to -EINPROGRESS, but since we're
>> > close to release it's easier this way.
> I did it that way first but then I felt a little paranoid about using
> the return value space as the sentinel.  Granted, most block functions
> will never return -EINPROGRESS but I think a separate bool done variable
> is cleaner.

I think we're already expecting in many places that it won't happen.

Paolo
Kevin Wolf Feb. 11, 2013, 12:29 p.m. UTC | #4
Am 09.02.2013 17:44, schrieb Stefan Hajnoczi:
> bdrv_co_io_em(), bdrv_co_flush(), and bdrv_co_discard() yield the
> coroutine when waiting for aio to complete.  They do not check that the
> request has actually finished.
> 
> In simple cases this works, but it returns early when we get spurious
> wake-ups due to qemu_coroutine_enter() being called from other sources.
> One such example is block-migration.c:process_incoming_migration().

The bug is not in this function but in process_incoming_migration(). It
should never blindly enter a coroutine which is in an unknown state. If
it can reenter here, it can reenter anywhere in block drivers, and
adding a workaround to one place that just yields again if it got an
early reentrance doesn't fix the real bug.

Which is the yield that corresponds to the enter in
enter_migration_coroutine()? We need to add some state that can be used
to make sure the enter happens only for this specific yield.

Kevin
Paolo Bonzini Feb. 11, 2013, 12:42 p.m. UTC | #5
Il 11/02/2013 13:29, Kevin Wolf ha scritto:
> The bug is not in this function but in process_incoming_migration(). It
> should never blindly enter a coroutine which is in an unknown state.

process_incoming_migration() is the function that _creates_ the
coroutine and enters it for the first time, so there shouldn't be any
problem there.

The function we're looking at should be, as you write correctly,
enter_migration_coroutine().

> If it can reenter here, it can reenter anywhere in block drivers, and
> adding a workaround to one place that just yields again if it got an
> early reentrance doesn't fix the real bug.
> 
> Which is the yield that corresponds to the enter in
> enter_migration_coroutine()? We need to add some state that can be used
> to make sure the enter happens only for this specific yield.

In this case the corresponding yield is in qemu-coroutine-io.c, and it
is driven by read-availability of the migration file descriptor.  So it
is possible to test before entering the coroutine, but really ugly (not
just because it would cost one extra system call).

But why is enter_migration_coroutine() being called at all?  bdrv_write
should be a synchronous call, it should run its own qemu_aio_wait()
event loop and that loop doesn't include the migration file descriptor.

Paolo
Kevin Wolf Feb. 11, 2013, 1 p.m. UTC | #6
Am 11.02.2013 13:42, schrieb Paolo Bonzini:
> Il 11/02/2013 13:29, Kevin Wolf ha scritto:
>> The bug is not in this function but in process_incoming_migration(). It
>> should never blindly enter a coroutine which is in an unknown state.
> 
> process_incoming_migration() is the function that _creates_ the
> coroutine and enters it for the first time, so there shouldn't be any
> problem there.
> 
> The function we're looking at should be, as you write correctly,
> enter_migration_coroutine().

Yes, Stefan's commit message confused me.

>> If it can reenter here, it can reenter anywhere in block drivers, and
>> adding a workaround to one place that just yields again if it got an
>> early reentrance doesn't fix the real bug.
>>
>> Which is the yield that corresponds to the enter in
>> enter_migration_coroutine()? We need to add some state that can be used
>> to make sure the enter happens only for this specific yield.
> 
> In this case the corresponding yield is in qemu-coroutine-io.c, and it
> is driven by read-availability of the migration file descriptor.  So it
> is possible to test before entering the coroutine, but really ugly (not
> just because it would cost one extra system call).

So the problem is that qemu-coroutine-io.c has a bad interface that
yields without scheduling the reenter itself. Wouldn't it be better if
it was qemu_co_sendv_recvv() which registered an fd handler and handled
the yield/reenter stuff transparently?

> But why is enter_migration_coroutine() being called at all?  bdrv_write
> should be a synchronous call, it should run its own qemu_aio_wait()
> event loop and that loop doesn't include the migration file descriptor.

Why doesn't it include the migration fd? Wouldn't we get such behaviour
only if we had different AioContexts?

Kevin
Stefan Hajnoczi Feb. 11, 2013, 1:17 p.m. UTC | #7
On Mon, Feb 11, 2013 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 11/02/2013 13:29, Kevin Wolf ha scritto:
>> The bug is not in this function but in process_incoming_migration(). It
>> should never blindly enter a coroutine which is in an unknown state.
>
> process_incoming_migration() is the function that _creates_ the
> coroutine and enters it for the first time, so there shouldn't be any
> problem there.
>
> The function we're looking at should be, as you write correctly,
> enter_migration_coroutine().
>
>> If it can reenter here, it can reenter anywhere in block drivers, and
>> adding a workaround to one place that just yields again if it got an
>> early reentrance doesn't fix the real bug.
>>
>> Which is the yield that corresponds to the enter in
>> enter_migration_coroutine()? We need to add some state that can be used
>> to make sure the enter happens only for this specific yield.
>
> In this case the corresponding yield is in qemu-coroutine-io.c, and it
> is driven by read-availability of the migration file descriptor.  So it
> is possible to test before entering the coroutine, but really ugly (not
> just because it would cost one extra system call).
>
> But why is enter_migration_coroutine() being called at all?  bdrv_write
> should be a synchronous call, it should run its own qemu_aio_wait()
> event loop and that loop doesn't include the migration file descriptor.

qemu_aio_wait() is not called because we are already in a coroutine -
the migration coroutine.  bdrv_write() yields instead of looping on
qemu_aio_wait().

If we want coroutine_fn to be composable, they must handle spurious
entries.  We could add code to clear the socket fd handler while we're
not waiting for it but I think handling spurious entries is the most
robust solution.

Stefan
Kevin Wolf Feb. 11, 2013, 1:29 p.m. UTC | #8
Am 11.02.2013 14:17, schrieb Stefan Hajnoczi:
> On Mon, Feb 11, 2013 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 11/02/2013 13:29, Kevin Wolf ha scritto:
>>> The bug is not in this function but in process_incoming_migration(). It
>>> should never blindly enter a coroutine which is in an unknown state.
>>
>> process_incoming_migration() is the function that _creates_ the
>> coroutine and enters it for the first time, so there shouldn't be any
>> problem there.
>>
>> The function we're looking at should be, as you write correctly,
>> enter_migration_coroutine().
>>
>>> If it can reenter here, it can reenter anywhere in block drivers, and
>>> adding a workaround to one place that just yields again if it got an
>>> early reentrance doesn't fix the real bug.
>>>
>>> Which is the yield that corresponds to the enter in
>>> enter_migration_coroutine()? We need to add some state that can be used
>>> to make sure the enter happens only for this specific yield.
>>
>> In this case the corresponding yield is in qemu-coroutine-io.c, and it
>> is driven by read-availability of the migration file descriptor.  So it
>> is possible to test before entering the coroutine, but really ugly (not
>> just because it would cost one extra system call).
>>
>> But why is enter_migration_coroutine() being called at all?  bdrv_write
>> should be a synchronous call, it should run its own qemu_aio_wait()
>> event loop and that loop doesn't include the migration file descriptor.
> 
> qemu_aio_wait() is not called because we are already in a coroutine -
> the migration coroutine.  bdrv_write() yields instead of looping on
> qemu_aio_wait().

Right. But that's not really a problem: the main loop will then call the
necessary AIO callbacks that reenter the coroutines - it's a superset of
qemu_aio_wait().

> If we want coroutine_fn to be composable, they must handle spurious
> entries.  We could add code to clear the socket fd handler while we're
> not waiting for it but I think handling spurious entries is the most
> robust solution.

Either clear the fd handler or modify the handler function to check the
state of the coroutine before it reenters (like block_job_resume() does,
for example).

So far the rule has been that each enter corresponds to a well-known
yield. If you really want to change this rule, you'd have to fix things
all over the place because this is an assumption that most if not all
coroutine based code in qemu relies on. I don't want to be the one to
review this change, and I don't think it's a good idea either...

Kevin
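Kevin's "check the state before reentering" idea can be modeled roughly as follows. MigrationState and try_enter() are hypothetical names for this sketch, not the real migration code; the point is that the fd handler only enters the coroutine when it is parked in the one well-known yield:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool waiting_for_fd;   /* set right before the well-known yield */
    int  enters;           /* counts deliveries, for this sketch only */
} MigrationState;

/* Called from the fd read handler.  Unlike a blind
 * qemu_coroutine_enter(), it refuses to enter a coroutine that is
 * blocked somewhere else (e.g. inside bdrv_write). */
static bool try_enter(MigrationState *s)
{
    if (!s->waiting_for_fd) {
        return false;      /* coroutine busy elsewhere: don't enter */
    }
    s->waiting_for_fd = false;
    s->enters++;           /* the real code would enter the coroutine */
    return true;
}
```

As Paolo notes below, a handler that merely skips the enter still busy-waits while the socket stays readable, which is why the thread moves on to clearing the handler instead.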
Paolo Bonzini Feb. 11, 2013, 1:54 p.m. UTC | #9
Il 11/02/2013 14:29, Kevin Wolf ha scritto:
> Am 11.02.2013 14:17, schrieb Stefan Hajnoczi:
>> On Mon, Feb 11, 2013 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 11/02/2013 13:29, Kevin Wolf ha scritto:
>>>> The bug is not in this function but in process_incoming_migration(). It
>>>> should never blindly enter a coroutine which is in an unknown state.
>>>
>>> process_incoming_migration() is the function that _creates_ the
>>> coroutine and enters it for the first time, so there shouldn't be any
>>> problem there.
>>>
>>> The function we're looking at should be, as you write correctly,
>>> enter_migration_coroutine().
>>>
>>>> If it can reenter here, it can reenter anywhere in block drivers, and
>>>> adding a workaround to one place that just yields again if it got an
>>>> early reentrance doesn't fix the real bug.
>>>>
>>>> Which is the yield that corresponds to the enter in
>>>> enter_migration_coroutine()? We need to add some state that can be used
>>>> to make sure the enter happens only for this specific yield.
>>>
>>> In this case the corresponding yield is in qemu-coroutine-io.c, and it
>>> is driven by read-availability of the migration file descriptor.  So it
>>> is possible to test before entering the coroutine, but really ugly (not
>>> just because it would cost one extra system call).
>>>
>>> But why is enter_migration_coroutine() being called at all?  bdrv_write
>>> should be a synchronous call, it should run its own qemu_aio_wait()
>>> event loop and that loop doesn't include the migration file descriptor.
>>
>> qemu_aio_wait() is not called because we are already in a coroutine -
>> the migration coroutine.  bdrv_write() yields instead of looping on
>> qemu_aio_wait().

Uh-uh, that's it.

> Right. But that's not really a problem, then the main loop will call the
> necessary AIO callbacks that reenter the coroutines - it's a superset of
> qemu_aio_wait().

But it will also call other unnecessary AIO callbacks, and that's a problem.

>> If we want coroutine_fn to be composable, they must handle spurious
>> entries.  We could add code to clear the socket fd handler while we're
>> not waiting for it but I think handling spurious entries is the most
>> robust solution.
> 
> Either clear the fd handler or modify the handler function to check the
> state of the coroutine before it reenters (like block_job_resume() does,
> for example)

That doesn't work because you'd get a busy wait while the socket remains
readable.  Clearing the FD handler during bdrv_write is also ugly.

Perhaps we need a bdrv_write_sync that forces the qemu_aio_wait() path.

> So far the rule has been that each enter corresponds to a well-known
> yield. If you really want to change this rule, you'd have to fix things
> all over the place because this is an assumption that most if not all
> coroutine based code in qemu relies on. I don't want to be the one to
> review this change, and I don't think it's a good idea either...

We don't have much coroutine code:

block.c:    qemu_coroutine_yield();
block.c:            qemu_coroutine_yield();
block.c:            qemu_coroutine_yield();

These are fixed by Stefan's patch.

block/blkdebug.c:    qemu_coroutine_yield();

Not ok, should be easily fixed.

block/mirror.c:        qemu_coroutine_yield();
block/mirror.c:            qemu_coroutine_yield();
block/mirror.c:        qemu_coroutine_yield();
block/mirror.c:                qemu_coroutine_yield();

All wrapped with while.

block/nbd.c:    qemu_coroutine_yield();

Not ok, easily fixed.

block/qed.c:        qemu_coroutine_yield();
block/qed.c:        qemu_coroutine_yield();

First fine, second not but easily fixed.

block/sheepdog.c:    qemu_coroutine_yield();
block/sheepdog.c:    qemu_coroutine_yield();
block/sheepdog.c:    qemu_coroutine_yield();

Not ok, easily fixed.

blockjob.c:        qemu_coroutine_yield();

Doesn't really matter, but could change surrounding if to while.

hw/9pfs/virtio-9p-coth.h:        qemu_coroutine_yield();
hw/9pfs/virtio-9p-coth.h:        qemu_coroutine_yield();

No idea. :)

qemu-coroutine-io.c:                qemu_coroutine_yield();

Ok.

qemu-coroutine-lock.c:    qemu_coroutine_yield();
qemu-coroutine-lock.c:    qemu_coroutine_yield();

These are the problematic ones.

qemu-coroutine-sleep.c:    qemu_coroutine_yield();

Should not be a problem if it enters early.

savevm.c:            qemu_coroutine_yield();
savevm.c:            qemu_coroutine_yield();

Ok.

thread-pool.c:    qemu_coroutine_yield();

Not ok, easily fixed.

Paolo
Stefan Hajnoczi Feb. 11, 2013, 3:44 p.m. UTC | #10
On Mon, Feb 11, 2013 at 2:54 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 11/02/2013 14:29, Kevin Wolf ha scritto:
>> Am 11.02.2013 14:17, schrieb Stefan Hajnoczi:
>>> On Mon, Feb 11, 2013 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> Il 11/02/2013 13:29, Kevin Wolf ha scritto:
>>>>> The bug is not in this function but in process_incoming_migration(). It
>>>>> should never blindly enter a coroutine which is in an unknown state.
>>>>
>>>> process_incoming_migration() is the function that _creates_ the
>>>> coroutine and enters it for the first time, so there shouldn't be any
>>>> problem there.
>>>>
>>>> The function we're looking at should be, as you write correctly,
>>>> enter_migration_coroutine().
>>>>
>>>>> If it can reenter here, it can reenter anywhere in block drivers, and
>>>>> adding a workaround to one place that just yields again if it got an
>>>>> early reentrance doesn't fix the real bug.
>>>>>
>>>>> Which is the yield that corresponds to the enter in
>>>>> enter_migration_coroutine()? We need to add some state that can be used
>>>>> to make sure the enter happens only for this specific yield.
>>>>
>>>> In this case the corresponding yield is in qemu-coroutine-io.c, and it
>>>> is driven by read-availability of the migration file descriptor.  So it
>>>> is possible to test before entering the coroutine, but really ugly (not
>>>> just because it would cost one extra system call).
>>>>
>>>> But why is enter_migration_coroutine() being called at all?  bdrv_write
>>>> should be a synchronous call, it should run its own qemu_aio_wait()
>>>> event loop and that loop doesn't include the migration file descriptor.
>>>
>>> qemu_aio_wait() is not called because we are already in a coroutine -
>>> the migration coroutine.  bdrv_write() yields instead of looping on
>>> qemu_aio_wait().
>
> Uh-uh, that's it.
>
>> Right. But that's not really a problem, then the main loop will call the
>> necessary AIO callbacks that reenter the coroutines - it's a superset of
>> qemu_aio_wait().
>
> But it will also call other unnecessary AIO callbacks, and that's a problem.
>
>>> If we want coroutine_fn to be composable, they must handle spurious
>>> entries.  We could add code to clear the socket fd handler while we're
>>> not waiting for it but I think handling spurious entries is the most
>>> robust solution.
>>
>> Either clear the fd handler or modify the handler function to check the
>> state of the coroutine before it reenters (like block_job_resume() does,
>> for example)
>
> That doesn't work because you'd get a busy wait while the socket remans
> readable.  Clearing the FD handler during bdrv_write is also ugly.
>
> Perhaps we need a bdrv_write_sync that forces the qemu_aio_wait() path.
>
>> So far the rule has been that each enter corresponds to a well-known
>> yield. If you really want to change this rule, you'd have to fix things
>> all over the place because this is an assumption that most if not all
>> coroutine based code in qemu relies on. I don't want to be the one to
>> review this change, and I don't think it's a good idea either...
>
> We don't have much coroutine code:

Thanks for auditing the coroutine users.

I would like to submit a small patch for QEMU 1.4, so I am sending a
different patch to set/clear the handler only around the QEMUFile
yield.  With this patch we don't have spurious coroutine entries.

It still seems like handling spurious entries is the most robust
approach.  It's better to discuss that approach fully during the QEMU
1.5 release cycle though.

Stefan
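The 1.4 approach Stefan describes - registering the fd read handler only around the QEMUFile yield - can be sketched as below. set_fd_handler() and wait_for_fd_data() are hypothetical stand-ins (the real code would use qemu_set_fd_handler() and the QEMUFile buffering logic); the invariant is that no handler exists while the coroutine is blocked elsewhere, so no spurious enter can happen:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*FdHandler)(void *opaque);

static FdHandler read_handler;   /* models the registered fd callback */

static void set_fd_handler(FdHandler fn)
{
    read_handler = fn;
}

static void wake(void *opaque)
{
    *(int *)opaque = 1;          /* models entering the coroutine */
}

/* Model of yielding on the QEMUFile fd: the handler exists only while
 * we are actually parked in this wait, and is cleared before anything
 * else (e.g. a bdrv_write yield) can run. */
static int wait_for_fd_data(void)
{
    int woken = 0;
    set_fd_handler(wake);
    read_handler(&woken);        /* event loop fires: fd is readable */
    set_fd_handler(NULL);        /* no enters outside this wait */
    return woken;
}
```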

Patch

diff --git a/block.c b/block.c
index 50dab8e..f2f3b80 100644
--- a/block.c
+++ b/block.c
@@ -3961,6 +3961,7 @@  void qemu_aio_release(void *p)
 typedef struct CoroutineIOCompletion {
     Coroutine *coroutine;
     int ret;
+    bool done;
 } CoroutineIOCompletion;
 
 static void bdrv_co_io_em_complete(void *opaque, int ret)
@@ -3968,6 +3969,7 @@  static void bdrv_co_io_em_complete(void *opaque, int ret)
     CoroutineIOCompletion *co = opaque;
 
     co->ret = ret;
+    co->done = true;
     qemu_coroutine_enter(co->coroutine, NULL);
 }
 
@@ -3992,7 +3994,9 @@  static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
     if (!acb) {
         return -EIO;
     }
-    qemu_coroutine_yield();
+    while (!co.done) {
+        qemu_coroutine_yield();
+    }
 
     return co.ret;
 }
@@ -4051,7 +4055,9 @@  int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
         if (acb == NULL) {
             ret = -EIO;
         } else {
-            qemu_coroutine_yield();
+            while (!co.done) {
+                qemu_coroutine_yield();
+            }
             ret = co.ret;
         }
     } else {
@@ -4161,7 +4167,9 @@  int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
         if (acb == NULL) {
             return -EIO;
         } else {
-            qemu_coroutine_yield();
+            while (!co.done) {
+                qemu_coroutine_yield();
+            }
             return co.ret;
         }
     } else {