
job: Fix nested aio_poll() hanging in job_txn_apply

Message ID: 20180821064538.19750-1-famz@redhat.com
State: New
Series: job: Fix nested aio_poll() hanging in job_txn_apply

Commit Message

Fam Zheng Aug. 21, 2018, 6:45 a.m. UTC
All callers have acquired ctx already. Acquiring it again results in
an aio_poll() hang. This fixes the problem where a BDRV_POLL_WHILE() in
the callback cannot make progress because ctx is recursively locked,
for example when drive-backup finishes.

Cc: qemu-stable@nongnu.org
Reported-by: Gu Nini <ngu@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 job.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)
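
To see why the nested acquire hangs rather than simply nesting: the
AioContext lock is recursive, so the second acquire succeeds, but a
BDRV_POLL_WHILE() inside the callback only releases the context once
before waiting on the iothread. A minimal standalone sketch of that
shape, with pthreads standing in for the QEMU primitives (names and
flow are illustrative, not QEMU code):

    #define _XOPEN_SOURCE 700
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* stands in for the job's AioContext lock, which is recursive */
    static pthread_mutex_t ctx_lock;

    /* stands in for the iothread that BDRV_POLL_WHILE() waits on */
    static void *iothread_fn(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&ctx_lock);
        puts("iothread: got the context, making progress");
        pthread_mutex_unlock(&ctx_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t iothread;
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&ctx_lock, &attr);

        pthread_mutex_lock(&ctx_lock);  /* the caller, e.g. job_finalize() */
        pthread_mutex_lock(&ctx_lock);  /* job_txn_apply(..., true): depth 2 */

        pthread_create(&iothread, NULL, iothread_fn, NULL);

        /* BDRV_POLL_WHILE() drops the context once before waiting, but
         * the lock is still held at depth 1, so the iothread is stuck: */
        pthread_mutex_unlock(&ctx_lock);
        sleep(1);
        puts("main: still waiting -- this is the hang");

        pthread_mutex_unlock(&ctx_lock);  /* depth 0: now it can proceed */
        pthread_join(iothread, NULL);
        return 0;
    }

With the patch, job_txn_apply() never takes the lock itself, so the
depth stays at 1 and the single release inside BDRV_POLL_WHILE() is
enough for the iothread to make progress.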

Comments

Eric Blake Aug. 21, 2018, 4:15 p.m. UTC | #1
On 08/21/2018 01:45 AM, Fam Zheng wrote:
> All callers have acquired ctx already. Acquiring it again results in
> an aio_poll() hang. This fixes the problem where a BDRV_POLL_WHILE() in
> the callback cannot make progress because ctx is recursively locked,
> for example when drive-backup finishes.
> 
> Cc: qemu-stable@nongnu.org
> Reported-by: Gu Nini <ngu@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>   job.c | 18 +++++-------------
>   1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/job.c b/job.c
> index fa671b431a..b07c5d0ffc 100644
> --- a/job.c
> +++ b/job.c
> @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job)
>       }
>   }
>   
> -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
> +static int job_txn_apply(JobTxn *txn, int fn(Job *))

Is it worth adding a comment here that callers must have already 
acquired the job context?

However, I was unable to quickly audit whether all callers really did 
have the lock (it balloons into whether all callers of job_finalize() 
have the lock), so I'm reluctant to give R-b.

> @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
>           assert(other_job->ret == 0);
>       }
>   
> -    job_txn_apply(txn, job_transition_to_pending, false);
> +    job_txn_apply(txn, job_transition_to_pending);
>   
>       /* If no jobs need manual finalization, automatically do so */
> -    if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
> +    if (job_txn_apply(txn, job_needs_finalize) == 0) {
>           job_do_finalize(job);

That said, the change makes sense here: since the first direct call to 
job_txn_apply() did not need to lock, why should the second indirect 
call through job_do_finalize() need it?

Or, is the fix to have job_do_finalize() gain a bool parameter, where
job_completed_txn_success() would pass false to that parameter, while
job_finalize() would pass true (again, going back to auditing whether
all callers of job_finalize() have already acquired the context)?
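
Roughly this, if going that route (a hypothetical sketch mirroring the
hunk below, not something that was tested):

    /* Alternative: keep the decision and thread it through, instead of
     * deleting it (NOT what this patch does): */
    static void job_do_finalize(Job *job, bool lock)
    {
        int rc;

        assert(job && job->txn);

        /* prepare the transaction to complete */
        rc = job_txn_apply(job->txn, job_prepare, lock);
        if (rc) {
            job_completed_txn_abort(job);
        } else {
            job_txn_apply(job->txn, job_finalize_single, lock);
        }
    }

job_completed_txn_success() would then call job_do_finalize(job, false),
and job_finalize() would call job_do_finalize(job, true).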
Fam Zheng Aug. 22, 2018, 5:48 a.m. UTC | #2
On Tue, 08/21 11:15, Eric Blake wrote:
> On 08/21/2018 01:45 AM, Fam Zheng wrote:
> > All callers have acquired ctx already. Acquiring it again results in
> > an aio_poll() hang. This fixes the problem where a BDRV_POLL_WHILE() in
> > the callback cannot make progress because ctx is recursively locked,
> > for example when drive-backup finishes.
> > 
> > Cc: qemu-stable@nongnu.org
> > Reported-by: Gu Nini <ngu@redhat.com>
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >   job.c | 18 +++++-------------
> >   1 file changed, 5 insertions(+), 13 deletions(-)
> > 
> > diff --git a/job.c b/job.c
> > index fa671b431a..b07c5d0ffc 100644
> > --- a/job.c
> > +++ b/job.c
> > @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job)
> >       }
> >   }
> > -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
> > +static int job_txn_apply(JobTxn *txn, int fn(Job *))
> 
> Is it worth adding a comment here that callers must have already acquired
> the job context?

I think most job-accessing functions require it. If anything, the comment
should go on the public functions like job_finalize().
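
Something like this on the prototype, say (wording is illustrative, not
a quote from the header):

    /*
     * job_finalize:
     * @job: The job to finalize.
     * @errp: Error object.
     *
     * Callers must hold the AioContext lock of @job, i.e. must have
     * called aio_context_acquire(job->aio_context) beforehand.
     */
    void job_finalize(Job *job, Error **errp);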

> 
> However, I was unable to quickly audit whether all callers really did have
> the lock (it balloons into whether all callers of job_finalize() have the
> lock), so I'm reluctant to give R-b.
> 
> > @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
> >           assert(other_job->ret == 0);
> >       }
> > -    job_txn_apply(txn, job_transition_to_pending, false);
> > +    job_txn_apply(txn, job_transition_to_pending);
> >       /* If no jobs need manual finalization, automatically do so */
> > -    if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
> > +    if (job_txn_apply(txn, job_needs_finalize) == 0) {
> >           job_do_finalize(job);
> 
> That said, the change makes sense here: since the first direct call to
> job_txn_apply() did not need to lock, why should the second indirect call
> through job_do_finalize() need it?
> 
> Or, is the fix to have job_do_finalize() gain a bool parameter, where
> job_completed_txn_success() would pass false to that parameter, while
> job_finalize() would pass true (again, going back to auditing whether all
> callers of job_finalize() have already acquired the context)?

There are two callers of job_finalize():

    fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize'
    blockdev.c:    job_finalize(&job->job, errp);
    blockdev.c-    aio_context_release(aio_context);
    --
    job-qmp.c:    job_finalize(job, errp);
    job-qmp.c-    aio_context_release(aio_context);
    --
    tests/test-blockjob.c:    job_finalize(&job->job, &error_abort);
    tests/test-blockjob.c-    assert(job->job.status == JOB_STATUS_CONCLUDED);

Ignoring the test, it's easy to see that both callers have acquired the context.
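
Both sites follow the same shape (paraphrased from the context around
the grep hits, not the literal code):

    aio_context = job->aio_context;     /* or looked up from the job id */
    aio_context_acquire(aio_context);
    job_finalize(job, errp);            /* lock held across the call */
    aio_context_release(aio_context);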

Fam
Eric Blake Aug. 22, 2018, 1:42 p.m. UTC | #3
On 08/22/2018 12:48 AM, Fam Zheng wrote:
>> However, I was unable to quickly audit whether all callers really did have
>> the lock (it balloons into whether all callers of job_finalize() have the
>> lock), so I'm reluctant to give R-b.
>>
>>> @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
>>>            assert(other_job->ret == 0);
>>>        }
>>> -    job_txn_apply(txn, job_transition_to_pending, false);
>>> +    job_txn_apply(txn, job_transition_to_pending);
>>>        /* If no jobs need manual finalization, automatically do so */
>>> -    if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
>>> +    if (job_txn_apply(txn, job_needs_finalize) == 0) {
>>>            job_do_finalize(job);
>>
>> That said, the change makes sense here: since the first direct call to
>> job_txn_apply() did not need to lock, why should the second indirect call
>> through job_do_finalize() need it?
>>
>> Or, is the fix to have job_do_finalize() gain a bool parameter, where
>> job_completed_txn_success() would pass false to that parameter, while
>> job_finalize() would pass true (again, going back to auditing whether all
>> callers of job_finalize() have already acquired the context)?
> 
> There are two callers of job_finalize():

Okay, so the audit would have been easier if I had actually tried 
grepping for it.  Still, it can't hurt to make the commit message give 
more details on what you checked, to make it easier for the reviewers.

> 
>      fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize'
>      blockdev.c:    job_finalize(&job->job, errp);
>      blockdev.c-    aio_context_release(aio_context);
>      --
>      job-qmp.c:    job_finalize(job, errp);
>      job-qmp.c-    aio_context_release(aio_context);

Yep, that's pretty conclusive that the context is already held by all 
callers.

Reviewed-by: Eric Blake <eblake@redhat.com>

Patch

diff --git a/job.c b/job.c
index fa671b431a..b07c5d0ffc 100644
--- a/job.c
+++ b/job.c
@@ -136,21 +136,13 @@  static void job_txn_del_job(Job *job)
     }
 }
 
-static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
+static int job_txn_apply(JobTxn *txn, int fn(Job *))
 {
-    AioContext *ctx;
     Job *job, *next;
     int rc = 0;
 
     QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
-        if (lock) {
-            ctx = job->aio_context;
-            aio_context_acquire(ctx);
-        }
         rc = fn(job);
-        if (lock) {
-            aio_context_release(ctx);
-        }
         if (rc) {
             break;
         }
@@ -807,11 +799,11 @@  static void job_do_finalize(Job *job)
     assert(job && job->txn);
 
     /* prepare the transaction to complete */
-    rc = job_txn_apply(job->txn, job_prepare, true);
+    rc = job_txn_apply(job->txn, job_prepare);
     if (rc) {
         job_completed_txn_abort(job);
     } else {
-        job_txn_apply(job->txn, job_finalize_single, true);
+        job_txn_apply(job->txn, job_finalize_single);
     }
 }
 
@@ -857,10 +849,10 @@  static void job_completed_txn_success(Job *job)
         assert(other_job->ret == 0);
     }
 
-    job_txn_apply(txn, job_transition_to_pending, false);
+    job_txn_apply(txn, job_transition_to_pending);
 
     /* If no jobs need manual finalization, automatically do so */
-    if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
+    if (job_txn_apply(txn, job_needs_finalize) == 0) {
         job_do_finalize(job);
     }
 }