diff mbox series

[5/7] block/mirror: utilize job_exit shim

Message ID 20180817190457.8292-6-jsnow@redhat.com
State New
Series jobs: remove job_defer_to_main_loop

Commit Message

John Snow Aug. 17, 2018, 7:04 p.m. UTC
Change the manual deferment to mirror_exit into the implicit
callback to job_exit and the mirror_exit callback.

This does change the order of some bdrv_unref calls relative to
job_completed, but thanks to the new context in which we call .job_exit,
it is safe to defer the possible flushing of any nodes to the
job_finalize_single cleanup stage.

Signed-off-by: John Snow <jsnow@redhat.com>
---
 block/mirror.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

Comments

Max Reitz Aug. 22, 2018, 12:06 p.m. UTC | #1
On 2018-08-17 21:04, John Snow wrote:
> Change the manual deferment to mirror_exit into the implicit
> callback to job_exit and the mirror_exit callback.
> 
> This does change the order of some bdrv_unref calls and job_completed,
> but thanks to the new context in which we call .job_exit, this is safe
> to defer the possible flushing of any nodes to the job_finalize_single
> cleanup stage.
> 
> Signed-off-by: John Snow <jsnow@redhat.com>
> ---
>  block/mirror.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)

Looks OK, but the same comment from the previous patch applies (I
wouldn't throw 'ret' around and instead work with just a boolean that
signifies failure or success.  Also, you might even want to remove
s->ret because the Error object should be enough, actually.)

Max
Max Reitz Aug. 22, 2018, 12:15 p.m. UTC | #2
On 2018-08-17 21:04, John Snow wrote:
> Change the manual deferment to mirror_exit into the implicit
> callback to job_exit and the mirror_exit callback.
> 
> This does change the order of some bdrv_unref calls and job_completed,
> but thanks to the new context in which we call .job_exit, this is safe
> to defer the possible flushing of any nodes to the job_finalize_single
> cleanup stage.

Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
is executed in the main loop, so it can make 'src' go away.  I don't see
any difference to before.

The only difference I see is that the BH-scheduled function is now
job_exit() instead of just mirror_exit() (which is now called as
part of job_exit()).  But then again, it was mirror_exit() itself
that called job_completed(), like it is now job_exit().  So if
everything worked after this patch, I don't see why mirror_exit()
would bdrv_ref() 'src' around job_completed().

Max

> Signed-off-by: John Snow <jsnow@redhat.com>
> ---
>  block/mirror.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)
John Snow Aug. 22, 2018, 10:05 p.m. UTC | #3
On 08/22/2018 08:15 AM, Max Reitz wrote:
> On 2018-08-17 21:04, John Snow wrote:
>> Change the manual deferment to mirror_exit into the implicit
>> callback to job_exit and the mirror_exit callback.
>>
>> This does change the order of some bdrv_unref calls and job_completed,
>> but thanks to the new context in which we call .job_exit, this is safe
>> to defer the possible flushing of any nodes to the job_finalize_single
>> cleanup stage.
> 
> Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
> is executed in the main loop, so it can make 'src' go away.  I don't see
> any difference to before.
> 

This changes the order in which we unreference these objects; if you
look at this patch the job_completed call I delete is in the middle of
what becomes the .exit() callback, which means there is a subtle change
in the ordering of how references are put down.

Take a look at the weird ordering of mirror_exit as it exists right now;
we call job_completed first and *then* put down the last references. If
you re-order this upstream right now, you'll deadlock QEMU because this
means job_completed is responsible for putting down the last reference
to some of these block/bds objects.

However, job_completed takes an additional AIO context lock and calls
job_finalize_single under *two* locks, which will hang QEMU if we
attempt to flush any of these nodes when we put down the last reference.

Performing the reordering here is *safe* because by removing the call to
job_completed and utilizing the exit shim, the .exit() callback executes
only under one lock, and when the finalize code runs later it is also
executed under only one lock, making this re-ordering safe.

Clear as mud?

--js

> The only difference I see is that the BH-scheduled function is now
> job_exit() instead of just mirror_exit() (which is now called as
> part of job_exit()).  But then again, it was mirror_exit() itself
> that called job_completed(), like it is now job_exit().  So if
> everything worked after this patch, I don't see why mirror_exit()
> would bdrv_ref() 'src' around job_completed().
> 
> Max
> 
>> Signed-off-by: John Snow <jsnow@redhat.com>
>> ---
>>  block/mirror.c | 25 +++++++++----------------
>>  1 file changed, 9 insertions(+), 16 deletions(-)
>
Max Reitz Aug. 25, 2018, 3:02 p.m. UTC | #4
On 2018-08-23 00:05, John Snow wrote:
> 
> 
> On 08/22/2018 08:15 AM, Max Reitz wrote:
>> On 2018-08-17 21:04, John Snow wrote:
>>> Change the manual deferment to mirror_exit into the implicit
>>> callback to job_exit and the mirror_exit callback.
>>>
>>> This does change the order of some bdrv_unref calls and job_completed,
>>> but thanks to the new context in which we call .job_exit, this is safe
>>> to defer the possible flushing of any nodes to the job_finalize_single
>>> cleanup stage.
>>
>> Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
>> is executed in the main loop, so it can make 'src' go away.  I don't see
>> any difference to before.
>>
> 
> This changes the order in which we unreference these objects; if you
> look at this patch the job_completed call I delete is in the middle of
> what becomes the .exit() callback, which means there is a subtle change
> in the ordering of how references are put down.
> 
> Take a look at the weird ordering of mirror_exit as it exists right now;
> we call job_completed first and *then* put down the last references. If
> you re-order this upstream right now, you'll deadlock QEMU because this
> means job_completed is responsible for putting down the last reference
> to some of these block/bds objects.
> 
> However, job_completed takes an additional AIO context lock and calls
> job_finalize_single under *two* locks, which will hang QEMU if we
> attempt to flush any of these nodes when we put down the last reference.

If you say so...  I have to admit I don't really understand.  The
comment doesn't explain why it's so important to keep src around until
job_completed(), so I don't know.  I thought AioContexts are recursive
so it doesn't matter whether you take them recursively or not.

Anyway.  So the difference now is that job_defer_to_main_loop() took the
lock around the whole exit function, whereas the new exit shim only
takes it around the .exit() method, but calls job_completed() without a
lock -- and then job_finalize_single() gets its lock again, so the job
methods are again called with locks.  That sounds OK to me.

> Performing the reordering here is *safe* because by removing the call to
> job_completed and utilizing the exit shim, the .exit() callback executes
> only under one lock, and when the finalize code runs later it is also
> executed under only one lock, making this re-ordering safe.
> 
> Clear as mud?

Well, I trust you that the drain issue was the reason that src had to
stay around until after job_completed().  It seems a bit
counter-intuitive, because the comment explaining that src needs to stay
around until job_completed() doesn't say much -- but it does imply that
without that bdrv_ref(), the BDS might be destroyed before
job_completed().  Which is different from simply having only one
reference left and then being deleted in job_completed().

Looking at 3f09bfbc7be, I'm inclined to believe the original reason may
be that src->job points to the job and that we shouldn't delete it as
long as it does (bdrv_delete() asserts that bs->job is NULL).  Oh no, a
tangent appears.

...I would assume that when bdrv_replace_node() is called, BlockJob.blk
is updated to point to the new BDS.  But nobody seems to update the
BDS.job field.  Investigation is in order.

Max
Max Reitz Aug. 25, 2018, 3:14 p.m. UTC | #5
On 2018-08-25 17:02, Max Reitz wrote:
> On 2018-08-23 00:05, John Snow wrote:
>>
>>
>> On 08/22/2018 08:15 AM, Max Reitz wrote:
>>> On 2018-08-17 21:04, John Snow wrote:
>>>> Change the manual deferment to mirror_exit into the implicit
>>>> callback to job_exit and the mirror_exit callback.
>>>>
>>>> This does change the order of some bdrv_unref calls and job_completed,
>>>> but thanks to the new context in which we call .job_exit, this is safe
>>>> to defer the possible flushing of any nodes to the job_finalize_single
>>>> cleanup stage.
>>>
>>> Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
>>> is executed in the main loop, so it can make 'src' go away.  I don't see
>>> any difference to before.
>>>
>>
>> This changes the order in which we unreference these objects; if you
>> look at this patch the job_completed call I delete is in the middle of
>> what becomes the .exit() callback, which means there is a subtle change
>> in the ordering of how references are put down.
>>
>> Take a look at the weird ordering of mirror_exit as it exists right now;
>> we call job_completed first and *then* put down the last references. If
>> you re-order this upstream right now, you'll deadlock QEMU because this
>> means job_completed is responsible for putting down the last reference
>> to some of these block/bds objects.
>>
>> However, job_completed takes an additional AIO context lock and calls
>> job_finalize_single under *two* locks, which will hang QEMU if we
>> attempt to flush any of these nodes when we put down the last reference.
> 
> If you say so...  I have to admit I don't really understand.  The
> comment doesn't explain why it's so important to keep src around until
> job_completed(), so I don't know.  I thought AioContexts are recursive
> so it doesn't matter whether you take them recursively or not.
> 
> Anyway.  So the difference now is that job_defer_to_main_loop() took the
> lock around the whole exit function, whereas the new exit shim only
> takes it around the .exit() method, but calls job_completed() without a
> lock -- and then job_finalize_single() gets its lock again, so the job
> methods are again called with locks.  That sounds OK to me.
> 
>> Performing the reordering here is *safe* because by removing the call to
>> job_completed and utilizing the exit shim, the .exit() callback executes
>> only under one lock, and when the finalize code runs later it is also
>> executed under only one lock, making this re-ordering safe.
>>
>> Clear as mud?
> 
> Well, I trust you that the drain issue was the reason that src had to
> stay around until after job_completed().  It seems a bit
> counter-intuitive, because the comment explaining that src needs to stay
> around until job_completed() doesn't say much -- but it does imply that
> without that bdrv_ref(), the BDS might be destroyed before
> job_completed().  Which is different from simply having only one
> reference left and then being deleted in job_completed().
> 
> Looking at 3f09bfbc7be, I'm inclined to believe the original reason may
> be that src->job points to the job and that we shouldn't delete it as
> long as it does (bdrv_delete() asserts that bs->job is NULL).  Oh no, a
> tangent appears.
> 
> ...I would assume that when bdrv_replace_node() is called, BlockJob.blk
> is updated to point to the new BDS.  But nobody seems to update the
> BDS.job field.  Investigation is in order.

Aha!  Mirror is run from the mirror filter (of course).  So the node
where BDS.job is set is actually not replaced.

Well, it is, but then we specifically switch the job BB back to point to
the mirror filter.  As long as it does not go away until then, all is
safe.  (And that bdrv_ref() guard doesn't change with this patch.)

Max
John Snow Aug. 28, 2018, 8:25 p.m. UTC | #6
On 08/25/2018 11:02 AM, Max Reitz wrote:
> If you say so...  I have to admit I don't really understand.  The
> comment doesn't explain why it's so important to keep src around until
> job_completed(), so I don't know.  I thought AioContexts are recursive
> so it doesn't matter whether you take them recursively or not.

bdrv_flush has troubles under a recursive lock if it is invoked from a
different thread. It tries to poll for flush completion but the
coroutine which gets scheduled (instead of entered) can't actually run
if we hold the lock twice from, say, the main thread -- which is where
we're doing the graph manipulation from.

>        co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
>        bdrv_coroutine_enter(bs, co);
>        BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);

BDRV_POLL_WHILE there causes us the grief via AIO_WAIT_WHILE, which
releases the AioContext lock only once, so we deadlock in bdrv_flush
when the lock is held recursively.

Kevin helped me figure it out; I CC'd him off-list on a mail that you
were also CC'd on ("hang in mirror_exit") that's probably pretty helpful:

> Took a little more than five minutes, but I think I've got it now. The
> important thing is that the test case is for dataplane, i.e. the job
> runs in an I/O thread AioContext. Job completion, however, happens in
> the main loop thread.
> 
> Therefore, when bdrv_delete() wants to flush the node first, it needs to
> run bdrv_co_flush() in a foreign AioContext, so the coroutine is
> scheduled. The I/O thread backtrace shows that it's waiting for the
> AioContext lock before it can actually enter the bdrv_co_flush()
> coroutine, so we must have deadlocked:

--js
John Snow Aug. 28, 2018, 9:51 p.m. UTC | #7
On 08/25/2018 11:02 AM, Max Reitz wrote:
> On 2018-08-23 00:05, John Snow wrote:
>>
>>
>> On 08/22/2018 08:15 AM, Max Reitz wrote:
>>> On 2018-08-17 21:04, John Snow wrote:
>>>> Change the manual deferment to mirror_exit into the implicit
>>>> callback to job_exit and the mirror_exit callback.
>>>>
>>>> This does change the order of some bdrv_unref calls and job_completed,
>>>> but thanks to the new context in which we call .job_exit, this is safe
>>>> to defer the possible flushing of any nodes to the job_finalize_single
>>>> cleanup stage.
>>>
>>> Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
>>> is executed in the main loop, so it can make 'src' go away.  I don't see
>>> any difference to before.
>>>
>>
>> This changes the order in which we unreference these objects; if you
>> look at this patch the job_completed call I delete is in the middle of
>> what becomes the .exit() callback, which means there is a subtle change
>> in the ordering of how references are put down.
>>
>> Take a look at the weird ordering of mirror_exit as it exists right now;
>> we call job_completed first and *then* put down the last references. If
>> you re-order this upstream right now, you'll deadlock QEMU because this
>> means job_completed is responsible for putting down the last reference
>> to some of these block/bds objects.
>>
>> However, job_completed takes an additional AIO context lock and calls
>> job_finalize_single under *two* locks, which will hang QEMU if we
>> attempt to flush any of these nodes when we put down the last reference.
> 
> If you say so...  I have to admit I don't really understand.  The
> comment doesn't explain why it's so important to keep src around until
> job_completed(), so I don't know.  I thought AioContexts are recursive
> so it doesn't matter whether you take them recursively or not.
> 
> Anyway.  So the difference now is that job_defer_to_main_loop() took the
> lock around the whole exit function, whereas the new exit shim only
>> takes it around the .exit() method, but calls job_completed() without a
> lock -- and then job_finalize_single() gets its lock again, so the job
> methods are again called with locks.  That sounds OK to me.
> 
>> Performing the reordering here is *safe* because by removing the call to
>> job_completed and utilizing the exit shim, the .exit() callback executes
>> only under one lock, and when the finalize code runs later it is also
>> executed under only one lock, making this re-ordering safe.
>>
>> Clear as mud?
> 
> Well, I trust you that the drain issue was the reason that src had to
> stay around until after job_completed().  It seems a bit
> counter-intuitive, because the comment explaining that src needs to stay
> around until job_completed() doesn't say much -- but it does imply that
> without that bdrv_ref(), the BDS might be destroyed before
> job_completed().  Which is different from simply having only one
> reference left and then being deleted in job_completed().
> 
> Looking at 3f09bfbc7be, I'm inclined to believe the original reason may
> be that src->job points to the job and that we shouldn't delete it as
> long as it does (bdrv_delete() asserts that bs->job is NULL).  Oh no, a
> tangent appears.
> 
> ...I would assume that when bdrv_replace_node() is called, BlockJob.blk
> is updated to point to the new BDS.  But nobody seems to update the
> BDS.job field.  Investigation is in order.
> 
> Max
> 

The real reason, presently, appears to just be that we want to call
bdrv_drained_end on src, but we may have put down the last reference
during bdrv_replace_node.

I'll update the comment accordingly. Anything beyond that is beyond the
scope of this series.

--js
Max Reitz Aug. 29, 2018, 8:28 a.m. UTC | #8
On 2018-08-28 22:25, John Snow wrote:
> 
> 
> On 08/25/2018 11:02 AM, Max Reitz wrote:
>> If you say so...  I have to admit I don't really understand.  The
>> comment doesn't explain why it's so important to keep src around until
>> job_completed(), so I don't know.  I thought AioContexts are recursive
>> so it doesn't matter whether you take them recursively or not.
> 
> bdrv_flush has troubles under a recursive lock if it is invoked from a
> different thread. It tries to poll for flush completion but the
> coroutine which gets scheduled (instead of entered) can't actually run
> if we hold the lock twice from, say, the main thread -- which is where
> we're doing the graph manipulation from.
> 
>>        co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
>>        bdrv_coroutine_enter(bs, co);
>>        BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
> 
> BDRV_POLL_WHILE there causes us the grief via AIO_WAIT_WHILE, which only
> puts down one reference, so we deadlock in bdrv_flush in a recursive
> context.
> 
> Kevin helped me figure it out; I CC'd him off-list on a mail that you
> were also CC'd on ("hang in mirror_exit") that's probably pretty helpful:
> 
>> Took a little more than five minutes, but I think I've got it now. The
>> important thing is that the test case is for dataplane, i.e. the job
>> runs in an I/O thread AioContext. Job completion, however, happens in
>> the main loop thread.
>>
>> Therefore, when bdrv_delete() wants to flush the node first, it needs to
>> run bdrv_co_flush() in a foreign AioContext, so the coroutine is
>> scheduled. The I/O thread backtrace shows that it's waiting for the
>> AioContext lock before it can actually enter the bdrv_co_flush()
>> coroutine, so we must have deadlocked:

OK, that makes more sense.  I still would have thought that you should
always be allowed to take an AioContext lock more than once in a single
other AioContext, but on my way through the commit log, I found
bd6458e410c1e7, d79df2a2ceb3cb07711, and maybe most importantly
17e2a4a47d4.  So maybe not.

So at some point we decided that, yeah, you can take them multiple times
in a single context, and, yeah, that was how it was designed, but don't
do that if you expect a BDRV_POLL_WHILE().

OK.  Got it now.

Max

Patch

diff --git a/block/mirror.c b/block/mirror.c
index be5dc6b7b0..57b4ac97d8 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -607,21 +607,17 @@  static void mirror_wait_for_all_io(MirrorBlockJob *s)
     }
 }
 
-typedef struct {
-    int ret;
-} MirrorExitData;
-
-static void mirror_exit(Job *job, void *opaque)
+static void mirror_exit(Job *job)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
     BlockJob *bjob = &s->common;
-    MirrorExitData *data = opaque;
     MirrorBDSOpaque *bs_opaque = s->mirror_top_bs->opaque;
     AioContext *replace_aio_context = NULL;
     BlockDriverState *src = s->mirror_top_bs->backing->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     BlockDriverState *mirror_top_bs = s->mirror_top_bs;
     Error *local_err = NULL;
+    int ret = job->ret;
 
     bdrv_release_dirty_bitmap(src, s->dirty_bitmap);
 
@@ -652,7 +648,7 @@  static void mirror_exit(Job *job, void *opaque)
             bdrv_set_backing_hd(target_bs, backing, &local_err);
             if (local_err) {
                 error_report_err(local_err);
-                data->ret = -EPERM;
+                ret = -EPERM;
             }
         }
     }
@@ -662,7 +658,7 @@  static void mirror_exit(Job *job, void *opaque)
         aio_context_acquire(replace_aio_context);
     }
 
-    if (s->should_complete && data->ret == 0) {
+    if (s->should_complete && ret == 0) {
         BlockDriverState *to_replace = src;
         if (s->to_replace) {
             to_replace = s->to_replace;
@@ -679,7 +675,7 @@  static void mirror_exit(Job *job, void *opaque)
         bdrv_drained_end(target_bs);
         if (local_err) {
             error_report_err(local_err);
-            data->ret = -EPERM;
+            ret = -EPERM;
         }
     }
     if (s->to_replace) {
@@ -710,12 +706,12 @@  static void mirror_exit(Job *job, void *opaque)
     blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort);
 
     bs_opaque->job = NULL;
-    job_completed(job, data->ret);
 
-    g_free(data);
     bdrv_drained_end(src);
     bdrv_unref(mirror_top_bs);
     bdrv_unref(src);
+
+    job->ret = ret;
 }
 
 static void mirror_throttle(MirrorBlockJob *s)
@@ -815,7 +811,6 @@  static int mirror_flush(MirrorBlockJob *s)
 static int coroutine_fn mirror_run(Job *job, Error **errp)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
-    MirrorExitData *data;
     BlockDriverState *bs = s->mirror_top_bs->backing->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     bool need_drain = true;
@@ -1035,14 +1030,10 @@  immediate_exit:
     g_free(s->in_flight_bitmap);
     bdrv_dirty_iter_free(s->dbi);
 
-    data = g_malloc(sizeof(*data));
-    data->ret = ret;
-
     if (need_drain) {
         bdrv_drained_begin(bs);
     }
 
-    job_defer_to_main_loop(&s->common.job, mirror_exit, data);
     return ret;
 }
 
@@ -1141,6 +1132,7 @@  static const BlockJobDriver mirror_job_driver = {
         .user_resume            = block_job_user_resume,
         .drain                  = block_job_drain,
         .run                    = mirror_run,
+        .exit                   = mirror_exit,
         .pause                  = mirror_pause,
         .complete               = mirror_complete,
     },
@@ -1157,6 +1149,7 @@  static const BlockJobDriver commit_active_job_driver = {
         .user_resume            = block_job_user_resume,
         .drain                  = block_job_drain,
         .run                    = mirror_run,
+        .exit                   = mirror_exit,
         .pause                  = mirror_pause,
         .complete               = mirror_complete,
     },