
[for-2.1?!?] AioContext: speed up aio_notify

Message ID 1404406760-22981-1-git-send-email-pbonzini@redhat.com
State New

Commit Message

Paolo Bonzini July 3, 2014, 4:59 p.m. UTC
In many cases, the call to event_notifier_set in aio_notify is unnecessary.
In particular, if we are executing aio_dispatch, or if aio_poll is not
blocking, we know that we will soon get to the next loop iteration (if
necessary); the thread that hosts the AioContext's event loop does not
need any nudging.

The patch includes a Promela formal model that shows that this really
works and does not need any further complication such as generation
counts.  It needs a memory barrier though.

The generation counts are not needed because we only care about the state
of ctx->dispatching at the time the memory barrier happens.  If
ctx->dispatching is one at that point, the aio_notify is not needed even
if it afterwards becomes zero.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        It should work, but I think this is a bit too tricky for 2.1.

 aio-posix.c             |  34 +++++++++++++++-
 async.c                 |  13 +++++-
 docs/aio_notify.promela | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/block/aio.h     |   9 +++++
 4 files changed, 158 insertions(+), 2 deletions(-)
 create mode 100644 docs/aio_notify.promela

Comments

Stefan Hajnoczi July 3, 2014, 6:17 p.m. UTC | #1
On Thu, Jul 3, 2014 at 6:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

Looks like a very interesting optimization for the block layer.  If we
can get some performance results - especially how it improves the
virtio-blk dataplane regression - maybe we should merge it for 2.1.

> @@ -247,9 +248,22 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>      return ctx->thread_pool;
>  }
>
> +void aio_set_dispatching(AioContext *ctx, bool dispatching)
> +{
> +    ctx->dispatching = dispatching;
> +    /* Write ctx->dispatching before reading e.g. bh->scheduled.  */
> +    if (!dispatching) {
> +        smp_mb();
> +    }

Is the if statement necessary?  It seems like an optimization to avoid
the memory barrier in 50% of the calls.

But by avoiding the barrier it could cause other threads to "miss" the
dispatching period on machines with weak memory models.  I guess
that's fine since we're not trying to optimize other threads but more
focussing on callers from the event loop's own thread.

Anyway, the if statement makes this function trickier than it has to
be IMO.  Want to drop it?
Paolo Bonzini July 4, 2014, 6:51 a.m. UTC | #2
Il 03/07/2014 20:17, Stefan Hajnoczi ha scritto:
> On Thu, Jul 3, 2014 at 6:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Looks like a very interesting optimization for the block layer.  If we
> can get some performance results - especially how it improves the
> virtio-blk dataplane regression - maybe we should merge it for 2.1.
>
>> @@ -247,9 +248,22 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>>      return ctx->thread_pool;
>>  }
>>
>> +void aio_set_dispatching(AioContext *ctx, bool dispatching)
>> +{
>> +    ctx->dispatching = dispatching;
>> +    /* Write ctx->dispatching before reading e.g. bh->scheduled.  */
>> +    if (!dispatching) {
>> +        smp_mb();
>> +    }
>
> Is the if statement necessary?  It seems like an optimization to avoid
> the memory barrier in 50% of the calls.

Yes.

> But by avoiding the barrier it could cause other threads to "miss" the
> dispatching period on machines with weak memory models.  I guess
> that's fine since we're not trying to optimize other threads but more
> focussing on callers from the event loop's own thread.
>
> Anyway, the if statement makes this function trickier than it has to
> be IMO.  Want to drop it?

I think this is the last "trickiness" problem of this patch, but of 
course I have no problem removing it if the patch goes into 2.1.  Ming 
Lei's latest message suggests it is not necessary though.

The patch should be tested at least on rbd or gluster; these are the 
main users of remote-thread notification via bottom halves.

Paolo
Ming Lei July 4, 2014, 7:23 a.m. UTC | #3
On Fri, Jul 4, 2014 at 2:51 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 03/07/2014 20:17, Stefan Hajnoczi ha scritto:
>
>> On Thu, Jul 3, 2014 at 6:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> Looks like a very interesting optimization for the block layer.  If we
>> can get some performance results - especially how it improves the
>> virtio-blk dataplane regression - maybe we should merge it for 2.1.
>>
>>> @@ -247,9 +248,22 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>>>      return ctx->thread_pool;
>>>  }
>>>
>>> +void aio_set_dispatching(AioContext *ctx, bool dispatching)
>>> +{
>>> +    ctx->dispatching = dispatching;
>>> +    /* Write ctx->dispatching before reading e.g. bh->scheduled.  */
>>> +    if (!dispatching) {
>>> +        smp_mb();
>>> +    }
>>
>>
>> Is the if statement necessary?  It seems like an optimization to avoid
>> the memory barrier in 50% of the calls.
>
>
> Yes.
>
>
>> But by avoiding the barrier it could cause other threads to "miss" the
>> dispatching period on machines with weak memory models.  I guess
>> that's fine since we're not trying to optimize other threads but more
>> focussing on callers from the event loop's own thread.
>>
>> Anyway, the if statement makes this function trickier than it has to
>> be IMO.  Want to drop it?
>
>
> I think this is the last "trickiness" problem of this patch, but of course I
> have no problem removing it if the patch goes into 2.1.  Ming Lei's latest
> message suggests it is not necessary though.

I think it is good, and better for it to go into 2.1; it should save lots
of write syscalls.

Also, should a regression be caused, the per-thread trick could be
resorted to, which should be simple.

With multi-virtqueue coming for virtio-blk, it should save even more, and I
also plan to use the improved QEMU BH to help merge requests from
multiple queues without introducing an extra notifier.

Thanks,
--
Ming Lei
Paolo Bonzini July 4, 2014, 7:27 a.m. UTC | #4
Il 04/07/2014 09:23, Ming Lei ha scritto:
> I think it is good and better to go to 2.1, and it should save lots of
> write syscall.
>
> Also should regression be caused, the per thread trick may be
> resorted to, which should be simple.

If we have the "right" solution (which we do, unlike the plug/unplug 
case), and the benefit is there but limited, I don't see a reason to 
rush another patch in 2.1.

Some reasonable level of performance degradation or increased host CPU 
utilization was expected in 2.1; of course 40% is not reasonable.

> With multi virtqueue's coming for virtio-blk, it should save more, and I
> also plan to use the improved qemu bh to help merge requests from
> multi queue, without introducing extra notifier.

But virtio-blk multiqueue is 2.2 material, and so is coalescing of irqfd 
writes.  I think Kevin or Stefan should queue this patch (with the 
smp_mb optimization, IMHO) for block-next.

Paolo
Kevin Wolf July 4, 2014, 10:26 a.m. UTC | #5
Am 04.07.2014 um 09:27 hat Paolo Bonzini geschrieben:
> Il 04/07/2014 09:23, Ming Lei ha scritto:
> >I think it is good and better to go to 2.1, and it should save lots of
> >write syscall.
> >
> >Also should regression be caused, the per thread trick may be
> >resorted to, which should be simple.
> 
> If we have the "right" solution (which we do, unlike the plug/unplug
> case), and the benefit is there but limited, I don't see a reason to
> rush another patch in 2.1.
> 
> Some reasonable level of performance degradation or increased host
> CPU utilization was expected in 2.1; of course 40% is not
> reasonable.
> 
> >With multi virtqueue's coming for virtio-blk, it should save more, and I
> >also plan to use the improved qemu bh to help merge requests from
> >multi queue, without introducing extra notifier.
> 
> But virtio-blk multiqueue is 2.2 material, and so is coalescing of
> irqfd writes.  I think Kevin or Stefan should queue this patch (with
> the smp_mb optimization, IMHO) for block-next.

We have a rather long freeze phase this time (which I think is a good
thing). This patch fixes a regression, even if it may not be the most
important one because it is in experimental code. But I still think that
this time in the hard freeze is the right time to commit patches like
this. I would be very hesitant about such a patch in the two weeks
before the release, but at this point I'm very open to including it.

All that requiring proper review and testing, of course. I reviewed it
and it looks good to me and Stefan seems to have reviewed it as well, so
I think it just needs a bit more testing.

Kevin
Paolo Bonzini July 4, 2014, 10:40 a.m. UTC | #6
> We have a rather long freeze phase this time (which I think is a good
> thing). This patch fixes a regression, even if it may not be the most
> important one because it is in experimental code. But I still think that
> this time in the hard freeze is the right time to commit patches like
> this. I would be very hesitant with such a patch like in the two weeks
> before the release, but at this point I'm very open to including it.
> 
> All that requiring proper review and testing, of course. I reviewed it
> and it looks good to me and Stefan seems to have reviewed it as well, so
> I think it just needs a bit more testing.

My judgement came from Ming saying that coroutine pools and plug/unplug
alone bring dataplane to the level of 2.0.  It doesn't surprise me that
writes to eventfd are so optimized that they matter less than expected.

If this patch is properly tested with rbd and gluster, it's okay for me
to include it.  Still, it feels good that for once I'm the one on the
safe side. ;)

Paolo
Stefan Hajnoczi July 7, 2014, 8:28 a.m. UTC | #7
On Thu, Jul 03, 2014 at 06:59:20PM +0200, Paolo Bonzini wrote:
> In many cases, the call to event_notifier_set in aio_notify is unnecessary.
> In particular, if we are executing aio_dispatch, or if aio_poll is not
> blocking, we know that we will soon get to the next loop iteration (if
> necessary); the thread that hosts the AioContext's event loop does not
> need any nudging.
> 
> The patch includes a Promela formal model that shows that this really
> works and does not need any further complication such as generation
> counts.  It needs a memory barrier though.
> 
> The generation counts are not needed because we only care of the state
> of ctx->dispatching at the time the memory barrier happens.  If
> ctx->dispatching is one at the time the memory barrier happens,
> the aio_notify is not needed even if it afterwards becomes zero.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>         It should work, but I think this is a bit too tricky for 2.1.
> 
>  aio-posix.c             |  34 +++++++++++++++-
>  async.c                 |  13 +++++-
>  docs/aio_notify.promela | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/block/aio.h     |   9 +++++
>  4 files changed, 158 insertions(+), 2 deletions(-)
>  create mode 100644 docs/aio_notify.promela

I can test rbd and gluster.

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan

Patch

diff --git a/aio-posix.c b/aio-posix.c
index f921d4f..e38334c 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -175,11 +175,38 @@ static bool aio_dispatch(AioContext *ctx)
 bool aio_poll(AioContext *ctx, bool blocking)
 {
     AioHandler *node;
+    bool was_dispatching;
     int ret;
     bool progress;
 
+    was_dispatching = ctx->dispatching;
     progress = false;
 
+    /* aio_notify can avoid the expensive event_notifier_set if
+     * everything (file descriptors, bottom halves, timers) will
+     * be re-evaluated before the next blocking poll().  This happens
+     * in two cases:
+     *
+     * 1) when aio_poll is called with blocking == false
+     *
+     * 2) when we are called after poll().  If we are called before
+     *    poll(), bottom halves will not be re-evaluated and we need
+     *    aio_notify() if blocking == true.
+     *
+     * The first aio_dispatch() only does something when AioContext is
+     * running as a GSource, and in that case aio_poll is used only
+     * with blocking == false, so this optimization is already quite
+     * effective.  However, the code is ugly and should be restructured
+     * to have a single aio_dispatch() call.  To do this, we need to
+     * reorganize aio_poll into a prepare/poll/dispatch model like
+     * glib's.
+     *
+     * If we're in a nested event loop, ctx->dispatching might be true.
+     * In that case we can restore it just before returning, but we
+     * have to clear it now.
+     */
+    aio_set_dispatching(ctx, !blocking);
+
     /*
      * If there are callbacks left that have been queued, we need to call them.
      * Do not call select in this case, because it is possible that the caller
@@ -190,12 +217,14 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress = true;
     }
 
+    /* Re-evaluate condition (1) above.  */
+    aio_set_dispatching(ctx, !blocking);
     if (aio_dispatch(ctx)) {
         progress = true;
     }
 
     if (progress && !blocking) {
-        return true;
+        goto out;
     }
 
     ctx->walking_handlers++;
@@ -234,9 +263,12 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }
 
     /* Run dispatch even if there were no readable fds to run timers */
+    aio_set_dispatching(ctx, true);
     if (aio_dispatch(ctx)) {
         progress = true;
     }
 
+out:
+    aio_set_dispatching(ctx, was_dispatching);
     return progress;
 }
diff --git a/async.c b/async.c
index 5b6fe6b..ecc201e 100644
--- a/async.c
+++ b/async.c
@@ -26,6 +26,7 @@ 
 #include "block/aio.h"
 #include "block/thread-pool.h"
 #include "qemu/main-loop.h"
+#include "qemu/atomic.h"
 
 /***********************************************************/
 /* bottom halves (can be seen as timers which expire ASAP) */
@@ -247,9 +248,22 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
     return ctx->thread_pool;
 }
 
+void aio_set_dispatching(AioContext *ctx, bool dispatching)
+{
+    ctx->dispatching = dispatching;
+    /* Write ctx->dispatching before reading e.g. bh->scheduled.  */
+    if (!dispatching) {
+        smp_mb();
+    }
+}
+
 void aio_notify(AioContext *ctx)
 {
-    event_notifier_set(&ctx->notifier);
+    /* Write e.g. bh->scheduled before reading ctx->dispatching.  */
+    smp_mb();
+    if (!ctx->dispatching) {
+        event_notifier_set(&ctx->notifier);
+    }
 }
 
 static void aio_timerlist_notify(void *opaque)
diff --git a/docs/aio_notify.promela b/docs/aio_notify.promela
new file mode 100644
index 0000000..ad3f6f0
--- /dev/null
+++ b/docs/aio_notify.promela
@@ -0,0 +1,104 @@ 
+/*
+ * This model describes the interaction between aio_set_dispatching()
+ * and aio_notify().
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This file is in the public domain.  If you really want a license,
+ * the WTFPL will do.
+ *
+ * To simulate it:
+ *     spin -p docs/aio_notify.promela
+ *
+ * To verify it:
+ *     spin -a docs/aio_notify.promela
+ *     gcc -O2 pan.c
+ *     ./a.out -a
+ */
+
+#define MAX   4
+#define LAST  (1 << (MAX - 1))
+#define FINAL ((LAST << 1) - 1)
+
+bool dispatching;
+bool event;
+
+int req, done;
+
+active proctype waiter()
+{
+     int fetch, blocking;
+
+     do
+        :: done != FINAL -> {
+            // Computing "blocking" is separate from execution of the
+            // "bottom half"
+            blocking = (req == 0);
+
+            // This is our "bottom half"
+            atomic { fetch = req; req = 0; }
+            done = done | fetch;
+
+            // Wait for a nudge from the other side
+            do
+                :: event == 1 -> { event = 0; break; }
+                :: !blocking  -> break;
+            od;
+
+            dispatching = 1;
+
+            // If you are simulating this model, you may want to add
+            // something like this here:
+            //
+            //      int foo; foo++; foo++; foo++;
+            //
+            // This only wastes some time and makes it more likely
+            // that the notifier process hits the "fast path".
+
+            dispatching = 0;
+        }
+        :: else -> break;
+    od
+}
+
+active proctype notifier()
+{
+    int next = 1;
+    int sets = 0;
+
+    do
+        :: next <= LAST -> {
+            // generate a request
+            req = req | next;
+            next = next << 1;
+
+            // aio_notify
+            if
+                :: dispatching == 0 -> sets++; event = 1;
+                :: else             -> skip;
+            fi;
+
+            // Test both synchronous and asynchronous delivery
+            if
+                :: 1 -> do
+                            :: req == 0 -> break;
+                        od;
+                :: 1 -> skip;
+            fi;
+        }
+        :: else -> break;
+    od;
+    printf("Skipped %d event_notifier_set\n", MAX - sets);
+}
+
+#define p (done == FINAL)
+
+never  {
+    do
+        :: 1                      // after an arbitrarily long prefix
+        :: p -> break             // p becomes true
+    od;
+    do
+        :: !p -> accept: break    // it then must remain true forever after
+    od
+}
diff --git a/include/block/aio.h b/include/block/aio.h
index a92511b..dffb061 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -60,8 +60,14 @@ struct AioContext {
      */
     int walking_handlers;
 
+    /* Used to avoid unnecessary event_notifier_set calls in aio_notify.
+     * Writes protected by lock or BQL, reads are lockless.
+     */
+    bool dispatching;
+
     /* lock to protect between bh's adders and deleter */
     QemuMutex bh_lock;
+
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;
 
@@ -83,6 +89,9 @@ struct AioContext {
     QEMUTimerListGroup tlg;
 };
 
+/* Used internally to synchronize aio_poll against qemu_bh_schedule.  */
+void aio_set_dispatching(AioContext *ctx, bool dispatching);
+
 /**
  * aio_context_new: Allocate a new AioContext.
  *