diff mbox series

[RESEND] hw/dma: fix crash caused by race condition

Message ID 20220427205056.2522-1-t.zhang2@samsung.com
State New
Headers show
Series [RESEND] hw/dma: fix crash caused by race condition | expand

Commit Message

Tong Zhang April 27, 2022, 8:51 p.m. UTC
assert(dbs->acb) is meant to check the return value of io_func per
documented in commit 6bee44ea34 ("dma: the passed io_func does not
return NULL"). However, there is a chance that after calling
aio_context_release(dbs->ctx); the dma_blk_cb function is called before
the assertion and dbs->acb is set to NULL again at line 121. Thus when
we run assert at line 181 it will fail.

  softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.

Reported-by: Francisco Londono <f.londono@samsung.com>
Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
---
 softmmu/dma-helpers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Hildenbrand May 30, 2022, 4:19 p.m. UTC | #1
On 27.04.22 22:51, Tong Zhang wrote:
> assert(dbs->acb) is meant to check the return value of io_func per
> documented in commit 6bee44ea34 ("dma: the passed io_func does not
> return NULL"). However, there is a chance that after calling
> aio_context_release(dbs->ctx); the dma_blk_cb function is called before
> the assertion and dbs->acb is set to NULL again at line 121. Thus when
> we run assert at line 181 it will fail.
> 
>   softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.
> 
> Reported-by: Francisco Londono <f.londono@samsung.com>
> Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
> ---
>  softmmu/dma-helpers.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> index 7820fec54c..cb81017928 100644
> --- a/softmmu/dma-helpers.c
> +++ b/softmmu/dma-helpers.c
> @@ -177,8 +177,8 @@ static void dma_blk_cb(void *opaque, int ret)
>      aio_context_acquire(dbs->ctx);
>      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
>                              dma_blk_cb, dbs, dbs->io_func_opaque);
> -    aio_context_release(dbs->ctx);
>      assert(dbs->acb);
> +    aio_context_release(dbs->ctx);
>  }
>  
>  static void dma_aio_cancel(BlockAIOCB *acb)

I'm fairly new to that code, but I wonder what prevents dma_blk_cb() to
run after you reshuffled the code?

After all, acquire/release is only around the dbs->io_func() call, so I
don't immediately see how it interacts with re-entrance?
Tong Zhang June 1, 2022, 12:20 a.m. UTC | #2
Hi David,

On Mon, May 30, 2022 at 9:19 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.04.22 22:51, Tong Zhang wrote:
> > assert(dbs->acb) is meant to check the return value of io_func per
> > documented in commit 6bee44ea34 ("dma: the passed io_func does not
> > return NULL"). However, there is a chance that after calling
> > aio_context_release(dbs->ctx); the dma_blk_cb function is called before
> > the assertion and dbs->acb is set to NULL again at line 121. Thus when
> > we run assert at line 181 it will fail.
> >
> >   softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.
> >
> > Reported-by: Francisco Londono <f.londono@samsung.com>
> > Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
> > ---
> >  softmmu/dma-helpers.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> > index 7820fec54c..cb81017928 100644
> > --- a/softmmu/dma-helpers.c
> > +++ b/softmmu/dma-helpers.c
> > @@ -177,8 +177,8 @@ static void dma_blk_cb(void *opaque, int ret)
> >      aio_context_acquire(dbs->ctx);
> >      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
> >                              dma_blk_cb, dbs, dbs->io_func_opaque);
> > -    aio_context_release(dbs->ctx);
> >      assert(dbs->acb);
> > +    aio_context_release(dbs->ctx);
> >  }
> >
> >  static void dma_aio_cancel(BlockAIOCB *acb)
>
> I'm fairly new to that code, but I wonder what prevents dma_blk_cb() to
> run after you reshuffled the code?
>

IMO if the assert is to test whether io_func returns a non-NULL value
shouldn't it be immediately after calling io_func.
Also... as suggested by commit 6bee44ea346aed24e12d525daf10542d695508db
  >     dma: the passed io_func does not return NULL

Thanks,
- Tong

> After all, acquire/release is only around the dbs->io_func() call, so I
> don't immediately see how it interacts with re-entrance?
>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand June 1, 2022, 8 a.m. UTC | #3
On 01.06.22 02:20, Tong Zhang wrote:
> Hi David,
> 
> On Mon, May 30, 2022 at 9:19 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 27.04.22 22:51, Tong Zhang wrote:
>>> assert(dbs->acb) is meant to check the return value of io_func per
>>> documented in commit 6bee44ea34 ("dma: the passed io_func does not
>>> return NULL"). However, there is a chance that after calling
>>> aio_context_release(dbs->ctx); the dma_blk_cb function is called before
>>> the assertion and dbs->acb is set to NULL again at line 121. Thus when
>>> we run assert at line 181 it will fail.
>>>
>>>   softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.
>>>
>>> Reported-by: Francisco Londono <f.londono@samsung.com>
>>> Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
>>> ---
>>>  softmmu/dma-helpers.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
>>> index 7820fec54c..cb81017928 100644
>>> --- a/softmmu/dma-helpers.c
>>> +++ b/softmmu/dma-helpers.c
>>> @@ -177,8 +177,8 @@ static void dma_blk_cb(void *opaque, int ret)
>>>      aio_context_acquire(dbs->ctx);
>>>      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
>>>                              dma_blk_cb, dbs, dbs->io_func_opaque);
>>> -    aio_context_release(dbs->ctx);
>>>      assert(dbs->acb);
>>> +    aio_context_release(dbs->ctx);
>>>  }
>>>
>>>  static void dma_aio_cancel(BlockAIOCB *acb)
>>
>> I'm fairly new to that code, but I wonder what prevents dma_blk_cb() to
>> run after you reshuffled the code?
>>
> 
> IMO if the assert is to test whether io_func returns a non-NULL value
> shouldn't it be immediately after calling io_func.
> Also... as suggested by commit 6bee44ea346aed24e12d525daf10542d695508db
>   >     dma: the passed io_func does not return NULL

Yes, but I just don't see how it would fix the assertion you document in
the patch description. The locking change to fix the assertion doesn't
make any sense to me, and most probably I am missing something important :)
Stefan Hajnoczi June 1, 2022, 1:24 p.m. UTC | #4
On Wed, Jun 01, 2022 at 10:00:50AM +0200, David Hildenbrand wrote:
> On 01.06.22 02:20, Tong Zhang wrote:
> > Hi David,
> > 
> > On Mon, May 30, 2022 at 9:19 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 27.04.22 22:51, Tong Zhang wrote:
> >>> assert(dbs->acb) is meant to check the return value of io_func per
> >>> documented in commit 6bee44ea34 ("dma: the passed io_func does not
> >>> return NULL"). However, there is a chance that after calling
> >>> aio_context_release(dbs->ctx); the dma_blk_cb function is called before
> >>> the assertion and dbs->acb is set to NULL again at line 121. Thus when
> >>> we run assert at line 181 it will fail.
> >>>
> >>>   softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.
> >>>
> >>> Reported-by: Francisco Londono <f.londono@samsung.com>
> >>> Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
> >>> ---
> >>>  softmmu/dma-helpers.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> >>> index 7820fec54c..cb81017928 100644
> >>> --- a/softmmu/dma-helpers.c
> >>> +++ b/softmmu/dma-helpers.c
> >>> @@ -177,8 +177,8 @@ static void dma_blk_cb(void *opaque, int ret)
> >>>      aio_context_acquire(dbs->ctx);
> >>>      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
> >>>                              dma_blk_cb, dbs, dbs->io_func_opaque);
> >>> -    aio_context_release(dbs->ctx);
> >>>      assert(dbs->acb);
> >>> +    aio_context_release(dbs->ctx);
> >>>  }
> >>>
> >>>  static void dma_aio_cancel(BlockAIOCB *acb)
> >>
> >> I'm fairly new to that code, but I wonder what prevents dma_blk_cb() to
> >> run after you reshuffled the code?
> >>
> > 
> > IMO if the assert is to test whether io_func returns a non-NULL value
> > shouldn't it be immediately after calling io_func.
> > Also... as suggested by commit 6bee44ea346aed24e12d525daf10542d695508db
> >   >     dma: the passed io_func does not return NULL
> 
> Yes, but I just don't see how it would fix the assertion you document in
> the patch description. The locking change to fix the assertion doesn't
> make any sense to me, and most probably I am missing something important :)

The other thread will invoke dma_blk_cb(), which modifies dbs->acb, when
it can take the lock. Therefore dbs->acb may contain a value different
from our io_func()'s return value by the time we perform the assertion
check (that's the race).

This patch makes sense to me. Can you rephrase your concern?

Stefan
David Hildenbrand June 1, 2022, 1:29 p.m. UTC | #5
On 01.06.22 15:24, Stefan Hajnoczi wrote:
> On Wed, Jun 01, 2022 at 10:00:50AM +0200, David Hildenbrand wrote:
>> On 01.06.22 02:20, Tong Zhang wrote:
>>> Hi David,
>>>
>>> On Mon, May 30, 2022 at 9:19 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 27.04.22 22:51, Tong Zhang wrote:
>>>>> assert(dbs->acb) is meant to check the return value of io_func per
>>>>> documented in commit 6bee44ea34 ("dma: the passed io_func does not
>>>>> return NULL"). However, there is a chance that after calling
>>>>> aio_context_release(dbs->ctx); the dma_blk_cb function is called before
>>>>> the assertion and dbs->acb is set to NULL again at line 121. Thus when
>>>>> we run assert at line 181 it will fail.
>>>>>
>>>>>   softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.
>>>>>
>>>>> Reported-by: Francisco Londono <f.londono@samsung.com>
>>>>> Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
>>>>> ---
>>>>>  softmmu/dma-helpers.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
>>>>> index 7820fec54c..cb81017928 100644
>>>>> --- a/softmmu/dma-helpers.c
>>>>> +++ b/softmmu/dma-helpers.c
>>>>> @@ -177,8 +177,8 @@ static void dma_blk_cb(void *opaque, int ret)
>>>>>      aio_context_acquire(dbs->ctx);
>>>>>      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
>>>>>                              dma_blk_cb, dbs, dbs->io_func_opaque);
>>>>> -    aio_context_release(dbs->ctx);
>>>>>      assert(dbs->acb);
>>>>> +    aio_context_release(dbs->ctx);
>>>>>  }
>>>>>
>>>>>  static void dma_aio_cancel(BlockAIOCB *acb)
>>>>
>>>> I'm fairly new to that code, but I wonder what prevents dma_blk_cb() to
>>>> run after you reshuffled the code?
>>>>
>>>
>>> IMO if the assert is to test whether io_func returns a non-NULL value
>>> shouldn't it be immediately after calling io_func.
>>> Also... as suggested by commit 6bee44ea346aed24e12d525daf10542d695508db
>>>   >     dma: the passed io_func does not return NULL
>>
>> Yes, but I just don't see how it would fix the assertion you document in
>> the patch description. The locking change to fix the assertion doesn't
>> make any sense to me, and most probably I am missing something important :)
> 
> The other thread will invoke dma_blk_cb(), which modifies dbs->acb, when
> it can take the lock. Therefore dbs->acb may contain a value different
> from our io_func()'s return value by the time we perform the assertion
> check (that's the race).
> 
> This patch makes sense to me. Can you rephrase your concern?

The locking is around dbs->io_func().

aio_context_acquire(dbs->ctx);
dbs->acb = dbs->io_func()
aio_context_release(dbs->ctx);


So where exactly would the lock that's now still held stop someone from
modifying dbs->acb = NULL at the beginning of the function, which seems
to be not protected by that lock?

Maybe I'm missing some locking magic due to the lock being a recursive lock.
Stefan Hajnoczi June 1, 2022, 1:55 p.m. UTC | #6
On Wed, 1 Jun 2022 at 14:29, David Hildenbrand <david@redhat.com> wrote:
>
> On 01.06.22 15:24, Stefan Hajnoczi wrote:
> > On Wed, Jun 01, 2022 at 10:00:50AM +0200, David Hildenbrand wrote:
> >> On 01.06.22 02:20, Tong Zhang wrote:
> >>> Hi David,
> >>>
> >>> On Mon, May 30, 2022 at 9:19 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 27.04.22 22:51, Tong Zhang wrote:
> >>>>> assert(dbs->acb) is meant to check the return value of io_func per
> >>>>> documented in commit 6bee44ea34 ("dma: the passed io_func does not
> >>>>> return NULL"). However, there is a chance that after calling
> >>>>> aio_context_release(dbs->ctx); the dma_blk_cb function is called before
> >>>>> the assertion and dbs->acb is set to NULL again at line 121. Thus when
> >>>>> we run assert at line 181 it will fail.
> >>>>>
> >>>>>   softmmu/dma-helpers.c:181: dma_blk_cb: Assertion `dbs->acb' failed.
> >>>>>
> >>>>> Reported-by: Francisco Londono <f.londono@samsung.com>
> >>>>> Signed-off-by: Tong Zhang <t.zhang2@samsung.com>
> >>>>> ---
> >>>>>  softmmu/dma-helpers.c | 2 +-
> >>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> >>>>> index 7820fec54c..cb81017928 100644
> >>>>> --- a/softmmu/dma-helpers.c
> >>>>> +++ b/softmmu/dma-helpers.c
> >>>>> @@ -177,8 +177,8 @@ static void dma_blk_cb(void *opaque, int ret)
> >>>>>      aio_context_acquire(dbs->ctx);
> >>>>>      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
> >>>>>                              dma_blk_cb, dbs, dbs->io_func_opaque);
> >>>>> -    aio_context_release(dbs->ctx);
> >>>>>      assert(dbs->acb);
> >>>>> +    aio_context_release(dbs->ctx);
> >>>>>  }
> >>>>>
> >>>>>  static void dma_aio_cancel(BlockAIOCB *acb)
> >>>>
> >>>> I'm fairly new to that code, but I wonder what prevents dma_blk_cb() to
> >>>> run after you reshuffled the code?
> >>>>
> >>>
> >>> IMO if the assert is to test whether io_func returns a non-NULL value
> >>> shouldn't it be immediately after calling io_func.
> >>> Also... as suggested by commit 6bee44ea346aed24e12d525daf10542d695508db
> >>>   >     dma: the passed io_func does not return NULL
> >>
> >> Yes, but I just don't see how it would fix the assertion you document in
> >> the patch description. The locking change to fix the assertion doesn't
> >> make any sense to me, and most probably I am missing something important :)
> >
> > The other thread will invoke dma_blk_cb(), which modifies dbs->acb, when
> > it can take the lock. Therefore dbs->acb may contain a value different
> > from our io_func()'s return value by the time we perform the assertion
> > check (that's the race).
> >
> > This patch makes sense to me. Can you rephrase your concern?
>
> The locking is around dbs->io_func().
>
> aio_context_acquire(dbs->ctx);
> dbs->acb = dbs->io_func()
> aio_context_release(dbs->ctx);
>
>
> So where exactly would the lock that's now still held stop someone from
> modifying dbs->acb = NULL at the beginning of the function, which seems
> to be not protected by that lock?
>
> Maybe I'm missing some locking magic due to the lock being a recursive lock.

Tong Zhang: Can you share a backtrace of all threads when the
assertion failure occurs?

Stefan
Tong Zhang June 2, 2022, 1:04 a.m. UTC | #7
Hi Stefan,

On Wed, Jun 1, 2022 at 6:56 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> > > This patch makes sense to me. Can you rephrase your concern?
> >
> > The locking is around dbs->io_func().
> >
> > aio_context_acquire(dbs->ctx);
> > dbs->acb = dbs->io_func()
> > aio_context_release(dbs->ctx);
> >
> >
> > So where exactly would the lock that's now still held stop someone from
> > modifying dbs->acb = NULL at the beginning of the function, which seems
> > to be not protected by that lock?
> >
> > Maybe I'm missing some locking magic due to the lock being a recursive lock.
>
> Tong Zhang: Can you share a backtrace of all threads when the
> assertion failure occurs?
>
Sorry I couldn't get the trace now -- but I can tell that we have some
internal code uses
this dma related code and will grab dbs->ctx lock in another thread
and could overwrite dbs->acb.

From my understanding, one of the reasons that the lock is required
here is to protect dbs->acb,
we could not reliably test io_func()'s return value after releasing
the lock here.

Since this code affects our internal code base and I did not reproduce
on master branch,
feel free to ignore it.

- Tong

> Stefan
Stefan Hajnoczi June 2, 2022, 5:29 a.m. UTC | #8
On Thu, Jun 2, 2022, 02:04 Tong Zhang <ztong0001@gmail.com> wrote:
>
> Hi Stefan,
>
> On Wed, Jun 1, 2022 at 6:56 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > > > This patch makes sense to me. Can you rephrase your concern?
> > >
> > > The locking is around dbs->io_func().
> > >
> > > aio_context_acquire(dbs->ctx);
> > > dbs->acb = dbs->io_func()
> > > aio_context_release(dbs->ctx);
> > >
> > >
> > > So where exactly would the lock that's now still held stop someone from
> > > modifying dbs->acb = NULL at the beginning of the function, which seems
> > > to be not protected by that lock?
> > >
> > > Maybe I'm missing some locking magic due to the lock being a recursive lock.
> >
> > Tong Zhang: Can you share a backtrace of all threads when the
> > assertion failure occurs?
> >
> Sorry I couldn't get the trace now -- but I can tell that we have some
> internal code uses
> this dma related code and will grab dbs->ctx lock in another thread
> and could overwrite dbs->acb.
>
> From my understanding, one of the reasons that the lock is required
> here is to protect dbs->acb,
> we could not reliably test io_func()'s return value after releasing
> the lock here.
>
> Since this code affects our internal code base and I did not reproduce
> on master branch,
> feel free to ignore it.

If this patch is unnecessary on qemu.git/master it raises the question
whether aio_context_acquire/release() should be removed from
dma_blk_cb(). It was added by:

commit 1919631e6b5562e474690853eca3c35610201e16
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Mon Feb 13 14:52:31 2017 +0100

    block: explicitly acquire aiocontext in bottom halves that need it

Paolo: Is dma_blk_cb() called without the AioContext lock (by
virtio-scsi or any code path with IOThreads)? David pointed out that
if that's the case then the dbs->acb is accessed without the lock at
the start of dma_blk_cb().

Stefan
diff mbox series

Patch

diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 7820fec54c..cb81017928 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -177,8 +177,8 @@  static void dma_blk_cb(void *opaque, int ret)
     aio_context_acquire(dbs->ctx);
     dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
                             dma_blk_cb, dbs, dbs->io_func_opaque);
-    aio_context_release(dbs->ctx);
     assert(dbs->acb);
+    aio_context_release(dbs->ctx);
 }
 
 static void dma_aio_cancel(BlockAIOCB *acb)