[v3] util/async: use atomic_mb_set in qemu_bh_cancel

Message ID 20171108063447.2842-1-slp@redhat.com
State New
Series [v3] util/async: use atomic_mb_set in qemu_bh_cancel

Commit Message

Sergio Lopez Nov. 8, 2017, 6:34 a.m. UTC
Commit b7a745d added a qemu_bh_cancel call to the completion function
as an optimization to prevent it from unnecessarily rescheduling itself.

This completion function is scheduled from worker_thread, after setting
the state of a ThreadPoolElement to THREAD_DONE.

This was considered to be safe, as the completion function restarts the
loop just after the call to qemu_bh_cancel. But, under certain access
patterns and scheduling conditions, the loop may wrongly use a
pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
the completion function without having processed a pending TPE linked at
pool->head:

         worker thread             |            I/O thread
------------------------------------------------------------------------
                                   | speculatively read req->state
req->state = THREAD_DONE;          |
qemu_bh_schedule(p->completion_bh) |
  bh->scheduled = 1;               |
                                   | qemu_bh_cancel(p->completion_bh)
                                   |   bh->scheduled = 0;
                                   | if (req->state == THREAD_DONE)
                                   |   // sees THREAD_QUEUED
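
For context, the consumer side is the completion function itself.
Condensed from util/thread-pool.c (tracing, the AioContext
acquire/release around the callback, and the no-callback path are
elided), the loop looks roughly like this:

    static void thread_pool_completion_bh(void *opaque)
    {
        ThreadPool *pool = opaque;
        ThreadPoolElement *elem, *next;

    restart:
        QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
            if (elem->state != THREAD_DONE) {
                continue;   /* the read that may see a stale value */
            }
            QLIST_REMOVE(elem, all);
            /* Reschedule in case elem->common.cb() calls aio_poll()
             * and waits for another request completing meanwhile. */
            qemu_bh_schedule(pool->completion_bh);
            elem->common.cb(elem->common.opaque, elem->ret);
            /* Cancelling here was considered safe because of the goto
             * below -- but only if the re-read of elem->state cannot
             * move before this call. */
            qemu_bh_cancel(pool->completion_bh);
            qemu_aio_unref(elem);
            goto restart;
        }
    }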

The source of the misunderstanding was that qemu_bh_cancel is now being
used by the _consumer_ rather than the producer, and therefore now needs
to have acquire semantics just like e.g. aio_bh_poll.
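
For comparison, aio_bh_poll consumes bh->scheduled with an atomic_xchg,
whose implicit full barrier provides exactly that; roughly (condensed
from util/async.c, deletion handling elided):

    for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
        next = atomic_rcu_read(&bh->next);
        /* The atomic_xchg is paired with the one in qemu_bh_schedule;
         * its implicit barrier ensures the callback sees all writes
         * done by the scheduling thread before bh->scheduled was set. */
        if (atomic_xchg(&bh->scheduled, 0)) {
            bh->cb(bh->opaque);
        }
    }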

In some situations, if there are no other independent requests in the
same aio context that could eventually trigger the scheduling of the
completion function, the omitted TPE and all operations pending on it
will get stuck forever.

Signed-off-by: Sergio Lopez <slp@redhat.com>
---
 util/async.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stefan Hajnoczi Nov. 8, 2017, 9:24 a.m. UTC | #1
On Wed, Nov 08, 2017 at 07:34:47AM +0100, Sergio Lopez wrote:
> Commit b7a745d added a qemu_bh_cancel call to the completion function
> as an optimization to prevent it from unnecessarily rescheduling itself.
> 
> This completion function is scheduled from worker_thread, after setting
> the state of a ThreadPoolElement to THREAD_DONE.
> 
> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, under certain access
> patterns and scheduling conditions, the loop may wrongly use a
> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
> the completion function without having processed a pending TPE linked at
> pool->head:
> 
>          worker thread             |            I/O thread
> ------------------------------------------------------------------------
>                                    | speculatively read req->state
> req->state = THREAD_DONE;          |
> qemu_bh_schedule(p->completion_bh) |
>   bh->scheduled = 1;               |
>                                    | qemu_bh_cancel(p->completion_bh)
>                                    |   bh->scheduled = 0;
>                                    | if (req->state == THREAD_DONE)
>                                    |   // sees THREAD_QUEUED
> 
> The source of the misunderstanding was that qemu_bh_cancel is now being
> used by the _consumer_ rather than the producer, and therefore now needs
> to have acquire semantics just like e.g. aio_bh_poll.
> 
> In some situations, if there are no other independent requests in the
> same aio context that could eventually trigger the scheduling of the
> completion function, the omitted TPE and all operations pending on it
> will get stuck forever.
> 
> Signed-off-by: Sergio Lopez <slp@redhat.com>
> ---
>  util/async.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan
Pavel Butsykin Nov. 8, 2017, 1:50 p.m. UTC | #2
On 08.11.2017 09:34, Sergio Lopez wrote:
> Commit b7a745d added a qemu_bh_cancel call to the completion function
> as an optimization to prevent it from unnecessarily rescheduling itself.
> 
> This completion function is scheduled from worker_thread, after setting
> the state of a ThreadPoolElement to THREAD_DONE.
>

Great! We are seeing the same problem, and I was writing up my own fix
when I came across your patch :)

> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, under certain access
> patterns and scheduling conditions, the loop may wrongly use a
> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
> the completion function without having processed a pending TPE linked at
> pool->head:

I'm not quite sure that pre-fetching is involved in this issue,
because a pre-fetched read of an address should be invalidated by a
write to the same address on another core. In our case, the write
req->state = THREAD_DONE should invalidate the read req->state ==
THREAD_DONE. I am inclined to think that this is a read being
reordered with a write. That is a very real case on x86, and I don't
see anything here that would prevent it:

.text:000000000060E21E loc_60E21E:                             ; CODE XREF: .text:000000000060E2F4j
.text:000000000060E21E                 mov     rbx, [r12+98h]
.text:000000000060E226                 test    rbx, rbx
.text:000000000060E229                 jnz     short loc_60E238
.text:000000000060E22B                 jmp     short exit_0
.text:000000000060E22B ; ---------------------------------------------------------------------------
.text:000000000060E22D                 align 10h
.text:000000000060E230 loc_60E230:                             ; CODE XREF: .text:000000000060E240j
.text:000000000060E230                 test    rbp, rbp
.text:000000000060E233                 jz      short exit_0
.text:000000000060E235
.text:000000000060E235 loc_60E235:                             ; CODE XREF: .text:000000000060E289j
.text:000000000060E235                 mov     rbx, rbp
.text:000000000060E238
.text:000000000060E238 loc_60E238:                             ; CODE XREF: .text:000000000060E229j
.text:000000000060E238                 cmp     [rbx+ThreadPoolElement.state], 2 ; THREAD_DONE
.text:000000000060E23C                 mov     rbp, [rbx+ThreadPoolElement.all.link_next]
.text:000000000060E240                 jnz     short loc_60E230
.text:000000000060E242                 mov     r15d, [rbx+ThreadPoolElement.ret]
.text:000000000060E246                 mov     r13, [rbx+ThreadPoolElement.common.opaque]
.text:000000000060E24A                 nop
.text:000000000060E24B                 lea     rax, trace_events_enabled_count
.text:000000000060E252                 mov     eax, [rax]
.text:000000000060E254                 test    eax, eax
.text:000000000060E256                 mov     rax, rbp
.text:000000000060E259                 jnz     loc_60E2F9
  ...

.text:000000000060E2BC loc_60E2BC:                             ; CODE XREF: .text:000000000060E27Cj
.text:000000000060E2BC                 mov     rdi, [r12+8]
.text:000000000060E2C1                 call    qemu_bh_schedule
.text:000000000060E2C6                 mov     rdi, [r12]
.text:000000000060E2CA                 call    aio_context_release
.text:000000000060E2CF                 mov     esi, [rbx+44h]
.text:000000000060E2D2                 mov     rdi, [rbx+18h]
.text:000000000060E2D6                 call    qword ptr [rbx+10h]
.text:000000000060E2D9                 mov     rdi, [r12]
.text:000000000060E2DD                 call    aio_context_acquire
.text:000000000060E2E2                 mov     rdi, [r12+8]
.text:000000000060E2E7                 call    qemu_bh_cancel
.text:000000000060E2EC                 mov     rdi, rbx
.text:000000000060E2EF                 call    qemu_aio_unref
.text:000000000060E2F4                 jmp     loc_60E21E


The read of req->state can be reordered with the write inside
qemu_bh_cancel(p->completion_bh), and then we get the same picture:

            worker thread             |            I/O thread
  ------------------------------------------------------------------------
                                      | reordered read req->state
   req->state = THREAD_DONE;          |
   qemu_bh_schedule(p->completion_bh) |
     bh->scheduled = 1;               |
                                      | qemu_bh_cancel(p->completion_bh)
                                      |   bh->scheduled = 0;
                                      | if (req->state == THREAD_DONE)
                                      |   // sees THREAD_QUEUED


> 
>          worker thread             |            I/O thread
> ------------------------------------------------------------------------
>                                    | speculatively read req->state
> req->state = THREAD_DONE;          |
> qemu_bh_schedule(p->completion_bh) |
>   bh->scheduled = 1;               |
>                                    | qemu_bh_cancel(p->completion_bh)
>                                    |   bh->scheduled = 0;
>                                    | if (req->state == THREAD_DONE)
>                                    |   // sees THREAD_QUEUED
> 
> The source of the misunderstanding was that qemu_bh_cancel is now being
> used by the _consumer_ rather than the producer, and therefore now needs
> to have acquire semantics just like e.g. aio_bh_poll.
> 
> In some situations, if there are no other independent requests in the
> same aio context that could eventually trigger the scheduling of the
> completion function, the omitted TPE and all operations pending on it
> will get stuck forever.
> 
> Signed-off-by: Sergio Lopez <slp@redhat.com>
> ---
>   util/async.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 355af73ee7..0e1bd8780a 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
>    */
>   void qemu_bh_cancel(QEMUBH *bh)
>   {
> -    bh->scheduled = 0;
> +    atomic_mb_set(&bh->scheduled, 0);

But in the end, the patch looks correct. On x86, atomic_mb_set() is an xchg:

#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))

Reads and writes cannot be reordered with locked instructions, so it
should protect against the reordering.
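
As a litmus-style sketch of the pairing (plain C11 atomics, hypothetical
names, not QEMU code):

    #include <stdatomic.h>

    _Atomic int scheduled;  /* stands in for bh->scheduled */
    _Atomic int state;      /* stands in for req->state, 0 = THREAD_QUEUED */

    void worker_thread_side(void)       /* producer */
    {
        atomic_store_explicit(&state, 2, memory_order_relaxed);
        /* qemu_bh_schedule: atomic_xchg, i.e. a locked instruction */
        atomic_exchange_explicit(&scheduled, 1, memory_order_seq_cst);
    }

    int io_thread_side(void)            /* consumer */
    {
        /* patched qemu_bh_cancel: the xchg is a full barrier, so the
         * load below can no longer be hoisted above it */
        atomic_exchange_explicit(&scheduled, 0, memory_order_seq_cst);
        return atomic_load_explicit(&state, memory_order_relaxed);
    }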


>   }
>   
>   /* This func is async.The bottom half will do the delete action at the finial
>
Sergio Lopez Nov. 8, 2017, 2:10 p.m. UTC | #3
On Wed, Nov 8, 2017 at 2:50 PM, Pavel Butsykin <pbutsykin@virtuozzo.com> wrote:
> On 08.11.2017 09:34, Sergio Lopez wrote:
>> This was considered to be safe, as the completion function restarts the
>> loop just after the call to qemu_bh_cancel. But, under certain access
>> patterns and scheduling conditions, the loop may wrongly use a
>> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
>> the completion function without having processed a pending TPE linked at
>> pool->head:
>
>
> I'm not quite sure that pre-fetching is involved in this issue,
> because a pre-fetched read of an address should be invalidated by a
> write to the same address on another core. In our case, the write
> req->state = THREAD_DONE should invalidate the read req->state ==
> THREAD_DONE. I am inclined to think that this is a read being
> reordered with a write. That is a very real case on x86, and I don't
> see anything here that would prevent it:
>

Yes, you're right. This is actually a memory reordering issue. I'm
going to rewrite that paragraph.

Thanks Pavel.
Paolo Bonzini Nov. 8, 2017, 2:15 p.m. UTC | #4
On 08/11/2017 15:10, Sergio Lopez wrote:
>> I'm not quite sure that pre-fetching is involved in this issue,
>> because a pre-fetched read of an address should be invalidated by a
>> write to the same address on another core. In our case, the write
>> req->state = THREAD_DONE should invalidate the read req->state ==
>> THREAD_DONE. I am inclined to think that this is a read being
>> reordered with a write. That is a very real case on x86, and I don't
>> see anything here that would prevent it:
>>
> Yes, you're right. This is actually a memory reordering issue. I'm
> going to rewrite that paragraph.

Well, memory reordering _is_ caused by speculative prefetching, delayed
cache invalidation (store buffers), and so on.

But it's probably better indeed to replace "pre-fetched" with
"outdated".  Whoever commits the patch can do the substitution (I can too).

Paolo
Sergio Lopez Nov. 8, 2017, 2:24 p.m. UTC | #5
On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/11/2017 15:10, Sergio Lopez wrote:
>>> I'm not quite sure that pre-fetching is involved in this issue,
>>> because a pre-fetched read of an address should be invalidated by a
>>> write to the same address on another core. In our case, the write
>>> req->state = THREAD_DONE should invalidate the read req->state ==
>>> THREAD_DONE. I am inclined to think that this is a read being
>>> reordered with a write. That is a very real case on x86, and I don't
>>> see anything here that would prevent it:
>>>
>> Yes, you're right. This is actually a memory reordering issue. I'm
>> going to rewrite that paragraph.
>
> Well, memory reordering _is_ caused by speculative prefetching, delayed
> cache invalidation (store buffers), and so on.
>
> But it's probably better indeed to replace "pre-fetched" with
> "outdated".  Whoever commits the patch can do the substitution (I can too).
>

Alternatively, if we want to explicitly mention the memory barrier, we
can replace the third paragraph with something like this:

<snip>
This was considered to be safe, as the completion function restarts the
loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
memory barrier, the read of req->state may actually happen _before_ the
call, seeing it still as THREAD_QUEUED, and ending the completion
function without having processed a pending TPE linked at pool->head:
</snip>

---
Sergio
Pavel Butsykin Nov. 8, 2017, 2:32 p.m. UTC | #6
On 08.11.2017 17:24, Sergio Lopez wrote:
> On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 08/11/2017 15:10, Sergio Lopez wrote:
>>>> I'm not quite sure that pre-fetching is involved in this issue,
>>>> because a pre-fetched read of an address should be invalidated by a
>>>> write to the same address on another core. In our case, the write
>>>> req->state = THREAD_DONE should invalidate the read req->state ==
>>>> THREAD_DONE. I am inclined to think that this is a read being
>>>> reordered with a write. That is a very real case on x86, and I don't
>>>> see anything here that would prevent it:
>>>>
>>> Yes, you're right. This is actually a memory reordering issue. I'm
>>> going to rewrite that paragraph.
>>
>> Well, memory reordering _is_ caused by speculative prefetching, delayed
>> cache invalidation (store buffers), and so on.
>>
>> But it's probably better indeed to replace "pre-fetched" with
>> "outdated".  Whoever commits the patch can do the substitution (I can too).
>>
> 
> Alternatively, if we want to explicitly mention the memory barrier, we
> can replace the third paragraph with something like this:
> 
> <snip>
> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
> memory barrier, the read of req->state may actually happen _before_ the
> call, seeing it still as THREAD_QUEUED, and ending the completion
> function without having processed a pending TPE linked at pool->head:
> </snip>

Yes, that's better. Thank you.

> ---
> Sergio
>
Pavel Butsykin Nov. 8, 2017, 4:36 p.m. UTC | #7
On 08.11.2017 17:15, Paolo Bonzini wrote:
> On 08/11/2017 15:10, Sergio Lopez wrote:
>>> I'm not quite sure that pre-fetching is involved in this issue,
>>> because a pre-fetched read of an address should be invalidated by a
>>> write to the same address on another core. In our case, the write
>>> req->state = THREAD_DONE should invalidate the read req->state ==
>>> THREAD_DONE. I am inclined to think that this is a read being
>>> reordered with a write. That is a very real case on x86, and I don't
>>> see anything here that would prevent it:
>>>
>> Yes, you're right. This is actually a memory reordering issue. I'm
>> going to rewrite that paragraph.
> 
> Well, memory reordering _is_ caused by speculative prefetching, delayed
> cache invalidation (store buffers), and so on.

What do you mean?

If we are speaking about x86, then a write on another core (like
req->state = THREAD_DONE in this issue) should invalidate a prefetched
read of req->state, and the hardware prevents the stale value from
being used. The prefetch is tied to the L1 cache: when another CPU
invalidates the cache line, the prefetched value is invalidated as
well (as far as I understand it).

> But it's probably better indeed to replace "pre-fetched" with
> "outdated".  Whoever commits the patch can do the substitution (I can too).
> 
> Paolo
>
Stefan Hajnoczi Nov. 8, 2017, 7:21 p.m. UTC | #8
On Wed, Nov 08, 2017 at 05:32:23PM +0300, Pavel Butsykin wrote:
> On 08.11.2017 17:24, Sergio Lopez wrote:
> > On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > On 08/11/2017 15:10, Sergio Lopez wrote:
> > > > > I'm not quite sure that pre-fetching is involved in this issue,
> > > > > because a pre-fetched read of an address should be invalidated by a
> > > > > write to the same address on another core. In our case, the write
> > > > > req->state = THREAD_DONE should invalidate the read req->state ==
> > > > > THREAD_DONE. I am inclined to think that this is a read being
> > > > > reordered with a write. That is a very real case on x86, and I don't
> > > > > see anything here that would prevent it:
> > > > > 
> > > > Yes, you're right. This is actually a memory reordering issue. I'm
> > > > going to rewrite that paragraph.
> > > 
> > > Well, memory reordering _is_ caused by speculative prefetching, delayed
> > > cache invalidation (store buffers), and so on.
> > > 
> > > But it's probably better indeed to replace "pre-fetched" with
> > > "outdated".  Whoever commits the patch can do the substitution (I can too).
> > > 
> > 
> > Alternatively, if we want to explicitly mention the memory barrier, we
> > can replace the third paragraph with something like this:
> > 
> > <snip>
> > This was considered to be safe, as the completion function restarts the
> > loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
> > memory barrier, the read of req->state may actually happen _before_ the
> > call, seeing it still as THREAD_QUEUED, and ending the completion
> > function without having processed a pending TPE linked at pool->head:
> > </snip>
> 
> Yes, that's better. Thank you.

I have updated the commit description and sent an updated pull request
for QEMU 2.11-rc1.

Stefan

Patch

diff --git a/util/async.c b/util/async.c
index 355af73ee7..0e1bd8780a 100644
--- a/util/async.c
+++ b/util/async.c
@@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
  */
 void qemu_bh_cancel(QEMUBH *bh)
 {
-    bh->scheduled = 0;
+    atomic_mb_set(&bh->scheduled, 0);
 }
 
 /* This func is async.The bottom half will do the delete action at the finial