
[v2,2/2] QEMUBH: make AioContext's bh re-entrant

Message ID 1371381681-14252-3-git-send-email-pingfanl@linux.vnet.ibm.com
State New

Commit Message

pingfan liu June 16, 2013, 11:21 a.m. UTC
BH will be used outside the big lock, so introduce a lock to protect
the writers, i.e., bh's adders and deleters.
Note that the lock only affects the writers; bh's callback does not
take this extra lock.

Signed-off-by: Liu Ping Fan <pingfanl@linux.vnet.ibm.com>
---
 async.c             | 21 +++++++++++++++++++++
 include/block/aio.h |  3 +++
 2 files changed, 24 insertions(+)

Comments

Stefan Hajnoczi June 17, 2013, 3:28 p.m. UTC | #1
On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>      bh->ctx = ctx;
>      bh->cb = cb;
>      bh->opaque = opaque;
> +    qemu_mutex_lock(&ctx->bh_lock);
>      bh->next = ctx->first_bh;
> +    /* Make sure the memebers ready before putting bh into list */

s/memebers/members/

> +    smp_wmb();

Why lock bh_lock before assigning bh->next?  Could you lock the mutex
here and then drop the smp_wmb() since the pthread function already does
that?

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11
Paolo Bonzini June 17, 2013, 4:41 p.m. UTC | #2
On 17/06/2013 17:28, Stefan Hajnoczi wrote:
>> > +    qemu_mutex_lock(&ctx->bh_lock);
>> >      bh->next = ctx->first_bh;
>> > +    /* Make sure the memebers ready before putting bh into list */
> s/memebers/members/
> 
>> > +    smp_wmb();
> Why lock bh_lock before assigning bh->next?  Could you lock the mutex
> here and then drop the smp_wmb() since the pthread function already does
> that?
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

Not sure I understand; ctx->first_bh is read here, and that's what the
lock protects.

    thread 1                           thread 2
    ------------------------------------------------------------------
    bh->next = ctx->first_bh;
                                       bh->next = ctx->first_bh;
                                       lock
                                       ctx->first_bh = bh;
                                       unlock
    lock
    ctx->first_bh = bh;
    unlock

and thread 2's bottom half is gone.  There is also a similar race that
leaves a dangling pointer if aio_bh_new races against aio_bh_poll.
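
For illustration, a sketch of the insertion path with the locking
applied, annotated with why each piece is needed (a sketch; names as in
the patch):

    QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
    {
        QEMUBH *bh = g_malloc0(sizeof(QEMUBH));
        bh->ctx = ctx;
        bh->cb = cb;
        bh->opaque = opaque;

        qemu_mutex_lock(&ctx->bh_lock);
        /* Read first_bh under the lock, or the interleaving above
         * loses one of the two insertions. */
        bh->next = ctx->first_bh;
        /* aio_bh_poll() walks the list without taking bh_lock, so
         * the members must be visible before the bh is reachable. */
        smp_wmb();
        ctx->first_bh = bh;
        qemu_mutex_unlock(&ctx->bh_lock);
        return bh;
    }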

Paolo
pingfan liu June 18, 2013, 2:19 a.m. UTC | #3
On Mon, Jun 17, 2013 at 11:28 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
>> @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>>      bh->ctx = ctx;
>>      bh->cb = cb;
>>      bh->opaque = opaque;
>> +    qemu_mutex_lock(&ctx->bh_lock);
>>      bh->next = ctx->first_bh;
>> +    /* Make sure the memebers ready before putting bh into list */
>
> s/memebers/members/
>
Will fix, thanks.
>> +    smp_wmb();
>
> Why lock bh_lock before assigning bh->next?  Could you lock the mutex
> here and then drop the smp_wmb() since the pthread function already does
> that?
>
As Paolo explained, it would open a race window for the writers.

Thanks and regards,
Pingfan
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11
Stefan Hajnoczi June 18, 2013, 9:31 a.m. UTC | #4
On Tue, Jun 18, 2013 at 10:19:40AM +0800, liu ping fan wrote:
> On Mon, Jun 17, 2013 at 11:28 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> > Why lock bh_lock before assigning bh->next?  Could you lock the mutex
> > here and then drop the smp_wmb() since the pthread function already does
> > that?
> >
> As Paolo explained, it would open a race window for the writers.

Right, thanks!
Michael Roth June 18, 2013, 3:14 p.m. UTC | #5
On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> BH will be used outside the big lock, so introduce a lock to protect
> the writers, i.e., bh's adders and deleters.
> Note that the lock only affects the writers; bh's callback does not
> take this extra lock.
> 
> Signed-off-by: Liu Ping Fan <pingfanl@linux.vnet.ibm.com>
> ---
>  async.c             | 21 +++++++++++++++++++++
>  include/block/aio.h |  3 +++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/async.c b/async.c
> index 90fe906..6a3269f 100644
> --- a/async.c
> +++ b/async.c
> @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>      bh->ctx = ctx;
>      bh->cb = cb;
>      bh->opaque = opaque;
> +    qemu_mutex_lock(&ctx->bh_lock);
>      bh->next = ctx->first_bh;
> +    /* Make sure the memebers ready before putting bh into list */
> +    smp_wmb();
>      ctx->first_bh = bh;
> +    qemu_mutex_unlock(&ctx->bh_lock);
>      return bh;
>  }
> 
> @@ -61,12 +65,18 @@ int aio_bh_poll(AioContext *ctx)
> 
>      ret = 0;
>      for (bh = ctx->first_bh; bh; bh = next) {
> +        /* Make sure fetching bh before accessing its members */
> +        smp_read_barrier_depends();
>          next = bh->next;
>          if (!bh->deleted && bh->scheduled) {
>              bh->scheduled = 0;
>              if (!bh->idle)
>                  ret = 1;
>              bh->idle = 0;
> +            /* Paired with write barrier in bh schedule to ensure reading for
> +             *  callbacks coming after bh's scheduling.
> +             */
> +            smp_rmb();
>              bh->cb(bh->opaque);

Could we possibly simplify this by introducing a recursive mutex that we
could use to protect the whole list loop and hold even during the cb?

I assume we can't hold the lock during the cb currently since we might
try to reschedule, but if it's a recursive mutex would that simplify
things?

I've been doing something similar with IOHandlers for the QContext
stuff, and that's the approach I took. This patch introduces the
recursive mutex:

https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53
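
For reference, a recursive mutex along these lines could be sketched on
top of QemuMutex (a hypothetical sketch, not necessarily what the
commit above does):

    typedef struct RecMutex {
        QemuMutex lock;
        QemuThread owner;   /* valid only while depth > 0 */
        unsigned depth;     /* recursion count of the owning thread */
    } RecMutex;

    static void rec_mutex_lock(RecMutex *m)
    {
        /* Unlocked fast path: only sound if reading 'owner' is atomic;
         * a production version would store an atomic thread id. */
        if (m->depth && qemu_thread_is_self(&m->owner)) {
            m->depth++;
            return;
        }
        qemu_mutex_lock(&m->lock);
        qemu_thread_get_self(&m->owner);
        m->depth = 1;
    }

    static void rec_mutex_unlock(RecMutex *m)
    {
        if (--m->depth == 0) {
            qemu_mutex_unlock(&m->lock);
        }
    }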


>          }
>      }
> @@ -75,6 +85,7 @@ int aio_bh_poll(AioContext *ctx)
> 
>      /* remove deleted bhs */
>      if (!ctx->walking_bh) {
> +        qemu_mutex_lock(&ctx->bh_lock);
>          bhp = &ctx->first_bh;
>          while (*bhp) {
>              bh = *bhp;
> @@ -85,6 +96,7 @@ int aio_bh_poll(AioContext *ctx)
>                  bhp = &bh->next;
>              }
>          }
> +        qemu_mutex_unlock(&ctx->bh_lock);
>      }
> 
>      return ret;
> @@ -94,6 +106,10 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
>  {
>      if (bh->scheduled)
>          return;
> +    /* Make sure any writes that are needed by the callback are done
> +     * before the locations are read in the aio_bh_poll.
> +     */
> +    smp_wmb();
>      bh->scheduled = 1;
>      bh->idle = 1;
>  }
> @@ -102,6 +118,10 @@ void qemu_bh_schedule(QEMUBH *bh)
>  {
>      if (bh->scheduled)
>          return;
> +    /* Make sure any writes that are needed by the callback are done
> +     * before the locations are read in the aio_bh_poll.
> +     */
> +    smp_wmb();
>      bh->scheduled = 1;
>      bh->idle = 0;
>      aio_notify(bh->ctx);
> @@ -211,6 +231,7 @@ AioContext *aio_context_new(void)
>      ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
>      ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
>      ctx->thread_pool = NULL;
> +    qemu_mutex_init(&ctx->bh_lock);
>      event_notifier_init(&ctx->notifier, false);
>      aio_set_event_notifier(ctx, &ctx->notifier, 
>                             (EventNotifierHandler *)
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 1836793..971fbef 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -17,6 +17,7 @@
>  #include "qemu-common.h"
>  #include "qemu/queue.h"
>  #include "qemu/event_notifier.h"
> +#include "qemu/thread.h"
> 
>  typedef struct BlockDriverAIOCB BlockDriverAIOCB;
>  typedef void BlockDriverCompletionFunc(void *opaque, int ret);
> @@ -53,6 +54,8 @@ typedef struct AioContext {
>       */
>      int walking_handlers;
> 
> +    /* lock to protect between bh's adders and deleter */
> +    QemuMutex bh_lock;
>      /* Anchor of the list of Bottom Halves belonging to the context */
>      struct QEMUBH *first_bh;
> 
> -- 
> 1.8.1.4
> 
>
Paolo Bonzini June 18, 2013, 7:20 p.m. UTC | #6
On 18/06/2013 17:14, mdroth wrote:
> Could we possibly simplify this by introducing a recursive mutex that we
> could use to protect the whole list loop and hold even during the cb?

If at all possible, we should avoid recursive locks.  They make it
impossible to establish a lock hierarchy.  For example:

> I assume we can't hold the lock during the cb currently since we might
> try to reschedule, but if it's a recursive mutex would that simplify
> things?

If you have two callbacks in two different AioContexts, both of which do
bdrv_drain_all(), you get an AB-BA deadlock.
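
Concretely, with a per-context recursive lock held across callbacks:

    thread 1 (callback in ctx A)       thread 2 (callback in ctx B)
    ------------------------------------------------------------------
    holds lock A                       holds lock B
    bdrv_drain_all() -> wants lock B   bdrv_drain_all() -> wants lock A
    blocks on B                        blocks on A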

Also, the barriers for qemu_bh_schedule are needed anyway.  It's only
those that order bh->next vs. ctx->first_bh that would go away.
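
For illustration, the pairing that has to stay (the smp_rmb() ensures
the callback sees everything written before scheduling):

    qemu_bh_schedule (writer)          aio_bh_poll (reader)
    ------------------------------------------------------------------
    write data the callback needs      if (bh->scheduled) {
    smp_wmb();                             bh->scheduled = 0;
    bh->scheduled = 1;                     smp_rmb();
                                           bh->cb(bh->opaque);
                                       }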

Paolo

> I've been doing something similar with IOHandlers for the QContext
> stuff, and that's the approach I took. This patch introduces the
> recursive mutex:
> 
> https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53
Michael Roth June 18, 2013, 10:26 p.m. UTC | #7
On Tue, Jun 18, 2013 at 09:20:26PM +0200, Paolo Bonzini wrote:
> On 18/06/2013 17:14, mdroth wrote:
> > Could we possibly simplify this by introducing a recursive mutex that we
> > could use to protect the whole list loop and hold even during the cb?
> 
> If at all possible, we should avoid recursive locks.  They make it
> impossible to establish a lock hierarchy.  For example:
> 
> > I assume we can't hold the lock during the cb currently since we might
> > try to reschedule, but if it's a recursive mutex would that simplify
> > things?
> 
> If you have two callbacks in two different AioContexts, both of which do
> bdrv_drain_all(), you get an AB-BA deadlock.

I think I see what you mean. That problem exists regardless of whether we
introduce a recursive mutex though, right? I guess the main issue is the
fact that we'd be encouraging sloppy locking practices without
addressing the root problem?

I'm just worried about what other subtle problems pop up if we instead rely
heavily on memory barriers and inevitably forget one here or there, but
maybe that's just me not having a good understanding of when to use them.

But doesn't RCU provide higher-level interfaces for these kinds of things?
Is it possible to hide any of this behind our list interfaces?

> 
> Also, the barriers for qemu_bh_schedule are needed anyway.  It's only
> those that order bh->next vs. ctx->first_bh that would go away.

I see, I guess it's unavoidable for some cases.

> 
> Paolo
> 
> > I've been doing something similar with IOHandlers for the QContext
> > stuff, and that's the approach I took. This patch introduces the
> > recursive mutex:
> > 
> > https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53
>
Paolo Bonzini June 19, 2013, 9:27 a.m. UTC | #8
On 19/06/2013 00:26, mdroth wrote:
> On Tue, Jun 18, 2013 at 09:20:26PM +0200, Paolo Bonzini wrote:
>> On 18/06/2013 17:14, mdroth wrote:
>>> Could we possibly simplify this by introducing a recursive mutex that we
>>> could use to protect the whole list loop and hold even during the cb?
>>
>> If at all possible, we should avoid recursive locks.  They make it
>> impossible to establish a lock hierarchy.  For example:
>>
>>> I assume we can't hold the lock during the cb currently since we might
>>> try to reschedule, but if it's a recursive mutex would that simplify
>>> things?
>>
>> If you have two callbacks in two different AioContexts, both of which do
>> bdrv_drain_all(), you get an AB-BA deadlock.
> 
> I think I see what you mean. That problem exists regardless of whether we
> introduce a recursive mutex though, right?

Without a recursive mutex, you only hold one lock at a time in each thread.

> I guess the main issue is the
> fact that we'd be encouraging sloppy locking practices without
> addressing the root problem?

Yeah.  We're basically standing where the Linux kernel stood 10 years
ago (let's say 2.2 timeframe).  If Linux got this far without recursive
mutexes, we can at least try. :)

> I'm just worried what other subtle problems pop up if we instead rely
> heavily on memory barriers and inevitably forget one here or there, but
> maybe that's just me not having a good understanding of when to use them.

That's true.  I hope that the docs in patch 1 help, and (much) more
thorough docs are available in the Linux kernel.

> But doesn't RCU provide higher-level interfaces for these kinds of things?
> Is it possible to hide any of this behind our list interfaces?

My atomics header file provides higher-level interfaces, but in most
cases barriers are not much harder to use, and the docs help with
converting one style to the other.

So far I've not used RCU for lists, only for entire data structures.
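
Hypothetically, the publish/read pattern from this patch could be
hidden behind small list helpers like these (a sketch only, not an
existing QEMU interface; the writer would still take bh_lock around the
publish):

    static inline void bh_list_publish(QEMUBH **head, QEMUBH *bh)
    {
        bh->next = *head;
        smp_wmb();                   /* members visible before the link */
        *head = bh;
    }

    static inline QEMUBH *bh_list_next(QEMUBH *bh)
    {
        smp_read_barrier_depends();  /* order dereference after the load */
        return bh->next;
    }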

Paolo
Stefan Hajnoczi June 20, 2013, 9:11 a.m. UTC | #9
On Wed, Jun 19, 2013 at 11:27:25AM +0200, Paolo Bonzini wrote:
> On 19/06/2013 00:26, mdroth wrote:
> > On Tue, Jun 18, 2013 at 09:20:26PM +0200, Paolo Bonzini wrote:
> >> On 18/06/2013 17:14, mdroth wrote:
> >>> Could we possibly simplify this by introducing a recursive mutex that we
> >>> could use to protect the whole list loop and hold even during the cb?
> >>
> >> If at all possible, we should avoid recursive locks.  They make it
> >> impossible to establish a lock hierarchy.  For example:
> >>
> >>> I assume we can't hold the lock during the cb currently since we might
> >>> try to reschedule, but if it's a recursive mutex would that simplify
> >>> things?
> >>
> >> If you have two callbacks in two different AioContexts, both of which do
> >> bdrv_drain_all(), you get an AB-BA deadlock.
> > 
> > I think I see what you mean. That problem exists regardless of whether we
> > introduce a recursive mutex though, right?
> 
> Without a recursive mutex, you only hold one lock at a time in each thread.
> 
> > I guess the main issue is the
> > fact that we'd be encouraging sloppy locking practices without
> > addressing the root problem?
> 
> Yeah.  We're basically standing where the Linux kernel stood 10 years
> ago (let's say 2.2 timeframe).  If Linux got this far without recursive
> mutexes, we can at least try. :)

FWIW I was also looking into recursive mutexes for the block layer.
What scared me a little is that they make it tempting to stop thinking
about locks since you know you'll be able to reacquire locks you already
hold.

Especially when converting existing code, I think we need to be rigorous
about exploring every function and thinking about the locks it needs and
which child functions it calls.

Otherwise we'll have code paths hidden away somewhere that were never
truly thought through.

Stefan

Patch

diff --git a/async.c b/async.c
index 90fe906..6a3269f 100644
--- a/async.c
+++ b/async.c
@@ -47,8 +47,12 @@  QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
     bh->ctx = ctx;
     bh->cb = cb;
     bh->opaque = opaque;
+    qemu_mutex_lock(&ctx->bh_lock);
     bh->next = ctx->first_bh;
+    /* Make sure the memebers ready before putting bh into list */
+    smp_wmb();
     ctx->first_bh = bh;
+    qemu_mutex_unlock(&ctx->bh_lock);
     return bh;
 }
 
@@ -61,12 +65,18 @@  int aio_bh_poll(AioContext *ctx)
 
     ret = 0;
     for (bh = ctx->first_bh; bh; bh = next) {
+        /* Make sure fetching bh before accessing its members */
+        smp_read_barrier_depends();
         next = bh->next;
         if (!bh->deleted && bh->scheduled) {
             bh->scheduled = 0;
             if (!bh->idle)
                 ret = 1;
             bh->idle = 0;
+            /* Paired with write barrier in bh schedule to ensure reading for
+             *  callbacks coming after bh's scheduling.
+             */
+            smp_rmb();
             bh->cb(bh->opaque);
         }
     }
@@ -75,6 +85,7 @@  int aio_bh_poll(AioContext *ctx)
 
     /* remove deleted bhs */
     if (!ctx->walking_bh) {
+        qemu_mutex_lock(&ctx->bh_lock);
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -85,6 +96,7 @@  int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
+        qemu_mutex_unlock(&ctx->bh_lock);
     }
 
     return ret;
@@ -94,6 +106,10 @@  void qemu_bh_schedule_idle(QEMUBH *bh)
 {
     if (bh->scheduled)
         return;
+    /* Make sure any writes that are needed by the callback are done
+     * before the locations are read in the aio_bh_poll.
+     */
+    smp_wmb();
     bh->scheduled = 1;
     bh->idle = 1;
 }
@@ -102,6 +118,10 @@  void qemu_bh_schedule(QEMUBH *bh)
 {
     if (bh->scheduled)
         return;
+    /* Make sure any writes that are needed by the callback are done
+     * before the locations are read in the aio_bh_poll.
+     */
+    smp_wmb();
     bh->scheduled = 1;
     bh->idle = 0;
     aio_notify(bh->ctx);
@@ -211,6 +231,7 @@  AioContext *aio_context_new(void)
     ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
     ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
     ctx->thread_pool = NULL;
+    qemu_mutex_init(&ctx->bh_lock);
     event_notifier_init(&ctx->notifier, false);
     aio_set_event_notifier(ctx, &ctx->notifier, 
                            (EventNotifierHandler *)
diff --git a/include/block/aio.h b/include/block/aio.h
index 1836793..971fbef 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -17,6 +17,7 @@ 
 #include "qemu-common.h"
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
+#include "qemu/thread.h"
 
 typedef struct BlockDriverAIOCB BlockDriverAIOCB;
 typedef void BlockDriverCompletionFunc(void *opaque, int ret);
@@ -53,6 +54,8 @@  typedef struct AioContext {
      */
     int walking_handlers;
 
+    /* lock to protect between bh's adders and deleter */
+    QemuMutex bh_lock;
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;