Message ID: 20180817190457.8292-6-jsnow@redhat.com
State: New
Series: jobs: remove job_defer_to_main_loop
On 2018-08-17 21:04, John Snow wrote:
> Change the manual deferment to mirror_exit into the implicit
> callback to job_exit and the mirror_exit callback.
>
> This does change the order of some bdrv_unref calls and job_completed,
> but thanks to the new context in which we call .job_exit, this is safe
> to defer the possible flushing of any nodes to the job_finalize_single
> cleanup stage.
>
> Signed-off-by: John Snow <jsnow@redhat.com>
> ---
>  block/mirror.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)

Looks OK, but the same comment from the previous patch applies: I wouldn't
throw 'ret' around and would instead work with just a boolean that
signifies failure or success.  Also, you might even want to remove s->ret,
because the Error object should be enough, actually.

Max
On 2018-08-17 21:04, John Snow wrote:
> Change the manual deferment to mirror_exit into the implicit
> callback to job_exit and the mirror_exit callback.
>
> This does change the order of some bdrv_unref calls and job_completed,
> but thanks to the new context in which we call .job_exit, this is safe
> to defer the possible flushing of any nodes to the job_finalize_single
> cleanup stage.

Ah, right, I forgot this.  Hm, what exactly do you mean?  This function is
executed in the main loop, so it can make 'src' go away.  I don't see any
difference to before.

The only difference I see is that the BH-scheduled function is now
job_exit() instead of just mirror_exit() (which is now called as part of
job_exit()).  But then again, it was mirror_exit() itself that called
job_completed(), like it is now job_exit().  So if everything worked after
this patch, I don't see why mirror_exit() would bdrv_ref() 'src' around
job_completed().

Max

> Signed-off-by: John Snow <jsnow@redhat.com>
> ---
>  block/mirror.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)
On 08/22/2018 08:15 AM, Max Reitz wrote:
> On 2018-08-17 21:04, John Snow wrote:
>> Change the manual deferment to mirror_exit into the implicit
>> callback to job_exit and the mirror_exit callback.
>>
>> This does change the order of some bdrv_unref calls and job_completed,
>> but thanks to the new context in which we call .job_exit, this is safe
>> to defer the possible flushing of any nodes to the job_finalize_single
>> cleanup stage.
>
> Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
> is executed in the main loop, so it can make 'src' go away.  I don't see
> any difference to before.

This changes the order in which we unreference these objects; if you look
at this patch, the job_completed call I delete is in the middle of what
becomes the .exit() callback, which means there is a subtle change in the
ordering of how references are put down.

Take a look at the weird ordering of mirror_exit as it exists right now:
we call job_completed first and *then* put down the last references.  If
you re-order this upstream right now, you'll deadlock QEMU, because this
means job_completed is responsible for putting down the last reference to
some of these block/BDS objects.

However, job_completed takes an additional AIO context lock and calls
job_finalize_single under *two* locks, which will hang QEMU if we attempt
to flush any of these nodes when we put down the last reference.

Performing the reordering here is *safe* because, by removing the call to
job_completed and utilizing the exit shim, the .exit() callback executes
under only one lock, and when the finalize code runs later it is also
executed under only one lock, making this re-ordering safe.

Clear as mud?

--js

> The only difference I see is that the BH-scheduled function is now
> job_exit() instead of just mirror_exit() (which is now called as part of
> job_exit()).  But then again, it was mirror_exit() itself that called
> job_completed(), like it is now job_exit().  So if everything worked
> after this patch, I don't see why mirror_exit() would bdrv_ref() 'src'
> around job_completed().
>
> Max
>
>> Signed-off-by: John Snow <jsnow@redhat.com>
>> ---
>>  block/mirror.c | 25 +++++++++----------------
>>  1 file changed, 9 insertions(+), 16 deletions(-)
On 2018-08-23 00:05, John Snow wrote:
> On 08/22/2018 08:15 AM, Max Reitz wrote:
>> [...]
>>
>> Ah, right, I forgot this.  Hm, what exactly do you mean?  This function
>> is executed in the main loop, so it can make 'src' go away.  I don't
>> see any difference to before.
>
> This changes the order in which we unreference these objects; if you
> look at this patch the job_completed call I delete is in the middle of
> what becomes the .exit() callback, which means there is a subtle change
> in the ordering of how references are put down.
>
> Take a look at the weird ordering of mirror_exit as it exists right now;
> we call job_completed first and *then* put down the last references.  If
> you re-order this upstream right now, you'll deadlock QEMU because this
> means job_completed is responsible for putting down the last reference
> to some of these block/bds objects.
>
> However, job_completed takes an additional AIO context lock and calls
> job_finalize_single under *two* locks, which will hang QEMU if we
> attempt to flush any of these nodes when we put down the last reference.

If you say so...  I have to admit I don't really understand.  The comment
doesn't explain why it's so important to keep src around until
job_completed(), so I don't know.  I thought AioContexts are recursive so
it doesn't matter whether you take them recursively or not.

Anyway.  So the difference now is that job_defer_to_main_loop() took the
lock around the whole exit function, whereas the new exit shim only takes
it around the .exit() method, but calls job_completed() without a lock --
and then job_finalize_single() gets its lock again, so the job methods are
again called with locks.  That sounds OK to me.

> Performing the reordering here is *safe* because by removing the call to
> job_completed and utilizing the exit shim, the .exit() callback executes
> only under one lock, and when the finalize code runs later it is also
> executed under only one lock, making this re-ordering safe.
>
> Clear as mud?

Well, I trust you that the drain issue was the reason that src had to stay
around until after job_completed().  It seems a bit counter-intuitive,
because the comment explaining that src needs to stay around until
job_completed() doesn't say much -- but it does imply that without that
bdrv_ref(), the BDS might be destroyed before job_completed().  Which is
different from simply having only one reference left and then being
deleted in job_completed().

Looking at 3f09bfbc7be, I'm inclined to believe the original reason may be
that src->job points to the job and that we shouldn't delete it as long as
it does (bdrv_delete() asserts that bs->job is NULL).  Oh no, a tangent
appears.

...I would assume that when bdrv_replace_node() is called, BlockJob.blk is
updated to point to the new BDS.  But nobody seems to update the BDS.job
field.  Investigation is in order.

Max
On 2018-08-25 17:02, Max Reitz wrote:
> On 2018-08-23 00:05, John Snow wrote:
>> [...]
>>
>> However, job_completed takes an additional AIO context lock and calls
>> job_finalize_single under *two* locks, which will hang QEMU if we
>> attempt to flush any of these nodes when we put down the last
>> reference.
>
> If you say so...  I have to admit I don't really understand.  The
> comment doesn't explain why it's so important to keep src around until
> job_completed(), so I don't know.  I thought AioContexts are recursive
> so it doesn't matter whether you take them recursively or not.
>
> Anyway.  So the difference now is that job_defer_to_main_loop() took
> the lock around the whole exit function, whereas the new exit shim only
> takes it around the .exit() method, but calls job_completed() without a
> lock -- and then job_finalize_single() gets its lock again, so the job
> methods are again called with locks.  That sounds OK to me.
>
>> Performing the reordering here is *safe* because by removing the call
>> to job_completed and utilizing the exit shim, the .exit() callback
>> executes only under one lock, and when the finalize code runs later it
>> is also executed under only one lock, making this re-ordering safe.
>>
>> Clear as mud?
>
> Well, I trust you that the drain issue was the reason that src had to
> stay around until after job_completed().  It seems a bit
> counter-intuitive, because the comment explaining that src needs to
> stay around until job_completed() doesn't say much -- but it does imply
> that without that bdrv_ref(), the BDS might be destroyed before
> job_completed().  Which is different from simply having only one
> reference left and then being deleted in job_completed().
>
> Looking at 3f09bfbc7be, I'm inclined to believe the original reason may
> be that src->job points to the job and that we shouldn't delete it as
> long as it does (bdrv_delete() asserts that bs->job is NULL).  Oh no, a
> tangent appears.
>
> ...I would assume that when bdrv_replace_node() is called, BlockJob.blk
> is updated to point to the new BDS.  But nobody seems to update the
> BDS.job field.  Investigation is in order.

Aha!  Mirror is run from the mirror filter (of course).  So the node where
BDS.job is set is actually not replaced.  Well, it is, but then we
specifically switch the job BB back to point to the mirror filter.  As
long as it does not go away until then, all is safe.  (And that bdrv_ref()
guard doesn't change with this patch.)

Max
On 08/25/2018 11:02 AM, Max Reitz wrote:
> If you say so...  I have to admit I don't really understand.  The
> comment doesn't explain why it's so important to keep src around until
> job_completed(), so I don't know.  I thought AioContexts are recursive
> so it doesn't matter whether you take them recursively or not.

bdrv_flush has troubles under a recursive lock if it is invoked from a
different thread.  It tries to poll for flush completion, but the
coroutine which gets scheduled (instead of entered) can't actually run if
we hold the lock twice from, say, the main thread -- which is where we're
doing the graph manipulation from.

>     co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
>     bdrv_coroutine_enter(bs, co);
>     BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);

BDRV_POLL_WHILE there causes us the grief via AIO_WAIT_WHILE, which only
puts down one reference, so we deadlock in bdrv_flush in a recursive
context.

Kevin helped me figure it out; I CC'd him off-list on a mail that you were
also CC'd on ("hang in mirror_exit") that's probably pretty helpful:

> Took a little more than five minutes, but I think I've got it now.  The
> important thing is that the test case is for dataplane, i.e. the job
> runs in an I/O thread AioContext.  Job completion, however, happens in
> the main loop thread.
>
> Therefore, when bdrv_delete() wants to flush the node first, it needs to
> run bdrv_co_flush() in a foreign AioContext, so the coroutine is
> scheduled.  The I/O thread backtrace shows that it's waiting for the
> AioContext lock before it can actually enter the bdrv_co_flush()
> coroutine, so we must have deadlocked:

--js
On 08/25/2018 11:02 AM, Max Reitz wrote:
> On 2018-08-23 00:05, John Snow wrote:
>> [...]
>
> If you say so...  I have to admit I don't really understand.  The
> comment doesn't explain why it's so important to keep src around until
> job_completed(), so I don't know.  I thought AioContexts are recursive
> so it doesn't matter whether you take them recursively or not.
>
> [...]
>
> Well, I trust you that the drain issue was the reason that src had to
> stay around until after job_completed().  It seems a bit
> counter-intuitive, because the comment explaining that src needs to
> stay around until job_completed() doesn't say much -- but it does imply
> that without that bdrv_ref(), the BDS might be destroyed before
> job_completed().  Which is different from simply having only one
> reference left and then being deleted in job_completed().
>
> Looking at 3f09bfbc7be, I'm inclined to believe the original reason may
> be that src->job points to the job and that we shouldn't delete it as
> long as it does (bdrv_delete() asserts that bs->job is NULL).  Oh no, a
> tangent appears.
>
> ...I would assume that when bdrv_replace_node() is called, BlockJob.blk
> is updated to point to the new BDS.  But nobody seems to update the
> BDS.job field.  Investigation is in order.
>
> Max

The real reason, presently, appears to just be that we want to call
bdrv_drained_end on src, but we may have put down the last reference
during bdrv_replace_node.  I'll update the comment accordingly.

Anything beyond that is beyond the scope of this series.

--js
On 2018-08-28 22:25, John Snow wrote:
> On 08/25/2018 11:02 AM, Max Reitz wrote:
>> If you say so...  I have to admit I don't really understand.  The
>> comment doesn't explain why it's so important to keep src around until
>> job_completed(), so I don't know.  I thought AioContexts are recursive
>> so it doesn't matter whether you take them recursively or not.
>
> bdrv_flush has troubles under a recursive lock if it is invoked from a
> different thread.  It tries to poll for flush completion but the
> coroutine which gets scheduled (instead of entered) can't actually run
> if we hold the lock twice from, say, the main thread -- which is where
> we're doing the graph manipulation from.
>
>>     co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
>>     bdrv_coroutine_enter(bs, co);
>>     BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
>
> BDRV_POLL_WHILE there causes us the grief via AIO_WAIT_WHILE, which
> only puts down one reference, so we deadlock in bdrv_flush in a
> recursive context.
>
> Kevin helped me figure it out; I CC'd him off-list on a mail that you
> were also CC'd on ("hang in mirror_exit") that's probably pretty
> helpful:
>
>> Took a little more than five minutes, but I think I've got it now.
>> The important thing is that the test case is for dataplane, i.e. the
>> job runs in an I/O thread AioContext.  Job completion, however,
>> happens in the main loop thread.
>>
>> Therefore, when bdrv_delete() wants to flush the node first, it needs
>> to run bdrv_co_flush() in a foreign AioContext, so the coroutine is
>> scheduled.  The I/O thread backtrace shows that it's waiting for the
>> AioContext lock before it can actually enter the bdrv_co_flush()
>> coroutine, so we must have deadlocked:

OK, that makes more sense.  I still would have thought that you should
always be allowed to take an AioContext lock more than once in a single
other AioContext, but on my way through the commit log, I found
bd6458e410c1e7, d79df2a2ceb3cb07711, and maybe most importantly
17e2a4a47d4.  So maybe not.

So at some point we decided that, yeah, you can take them multiple times
in a single context, and, yeah, that was how it was designed, but don't do
that if you expect a BDRV_POLL_WHILE().  OK.  Got it now.

Max
diff --git a/block/mirror.c b/block/mirror.c
index be5dc6b7b0..57b4ac97d8 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -607,21 +607,17 @@ static void mirror_wait_for_all_io(MirrorBlockJob *s)
     }
 }
 
-typedef struct {
-    int ret;
-} MirrorExitData;
-
-static void mirror_exit(Job *job, void *opaque)
+static void mirror_exit(Job *job)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
     BlockJob *bjob = &s->common;
-    MirrorExitData *data = opaque;
     MirrorBDSOpaque *bs_opaque = s->mirror_top_bs->opaque;
     AioContext *replace_aio_context = NULL;
     BlockDriverState *src = s->mirror_top_bs->backing->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     BlockDriverState *mirror_top_bs = s->mirror_top_bs;
     Error *local_err = NULL;
+    int ret = job->ret;
 
     bdrv_release_dirty_bitmap(src, s->dirty_bitmap);
 
@@ -652,7 +648,7 @@ static void mirror_exit(Job *job, void *opaque)
             bdrv_set_backing_hd(target_bs, backing, &local_err);
             if (local_err) {
                 error_report_err(local_err);
-                data->ret = -EPERM;
+                ret = -EPERM;
             }
         }
     }
@@ -662,7 +658,7 @@ static void mirror_exit(Job *job, void *opaque)
         aio_context_acquire(replace_aio_context);
     }
 
-    if (s->should_complete && data->ret == 0) {
+    if (s->should_complete && ret == 0) {
         BlockDriverState *to_replace = src;
         if (s->to_replace) {
             to_replace = s->to_replace;
@@ -679,7 +675,7 @@ static void mirror_exit(Job *job, void *opaque)
         bdrv_drained_end(target_bs);
         if (local_err) {
             error_report_err(local_err);
-            data->ret = -EPERM;
+            ret = -EPERM;
         }
     }
     if (s->to_replace) {
@@ -710,12 +706,12 @@ static void mirror_exit(Job *job, void *opaque)
     blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort);
 
     bs_opaque->job = NULL;
-    job_completed(job, data->ret);
-    g_free(data);
 
     bdrv_drained_end(src);
     bdrv_unref(mirror_top_bs);
     bdrv_unref(src);
+
+    job->ret = ret;
 }
 
 static void mirror_throttle(MirrorBlockJob *s)
@@ -815,7 +811,6 @@ static int mirror_flush(MirrorBlockJob *s)
 static int coroutine_fn mirror_run(Job *job, Error **errp)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
-    MirrorExitData *data;
     BlockDriverState *bs = s->mirror_top_bs->backing->bs;
     BlockDriverState *target_bs = blk_bs(s->target);
     bool need_drain = true;
@@ -1035,14 +1030,10 @@ immediate_exit:
     g_free(s->in_flight_bitmap);
     bdrv_dirty_iter_free(s->dbi);
 
-    data = g_malloc(sizeof(*data));
-    data->ret = ret;
-
     if (need_drain) {
         bdrv_drained_begin(bs);
     }
-    job_defer_to_main_loop(&s->common.job, mirror_exit, data);
 
     return ret;
 }
@@ -1141,6 +1132,7 @@ static const BlockJobDriver mirror_job_driver = {
         .user_resume            = block_job_user_resume,
         .drain                  = block_job_drain,
         .run                    = mirror_run,
+        .exit                   = mirror_exit,
         .pause                  = mirror_pause,
         .complete               = mirror_complete,
     },
@@ -1157,6 +1149,7 @@ static const BlockJobDriver commit_active_job_driver = {
         .user_resume            = block_job_user_resume,
         .drain                  = block_job_drain,
         .run                    = mirror_run,
+        .exit                   = mirror_exit,
         .pause                  = mirror_pause,
         .complete               = mirror_complete,
     },
Change the manual deferment to mirror_exit into the implicit callback to
job_exit and the mirror_exit callback.

This does change the order of some bdrv_unref calls and job_completed, but
thanks to the new context in which we call .job_exit, this is safe to
defer the possible flushing of any nodes to the job_finalize_single
cleanup stage.

Signed-off-by: John Snow <jsnow@redhat.com>
---
 block/mirror.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)