
[RFC,3/3] qemu-coroutine: use a ring per thread for the pool

Message ID 1417084026-12307-4-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven Nov. 27, 2014, 10:27 a.m. UTC
This patch creates a ring structure for the coroutine pool instead
of a linked list. The list implementation has the issue
that it always throws away the latest coroutines instead of the
oldest ones. This is a drawback since the most recently used coroutines
are more likely to be cached than old ones.

Furthermore, this patch creates a coroutine pool per thread
to remove the need for locking.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/block/coroutine.h |    2 +-
 iothread.c                |    3 +++
 qemu-coroutine.c          |   54 ++++++++++++++++++++-------------------------
 3 files changed, 28 insertions(+), 31 deletions(-)
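The ring semantics described in the commit message can be sketched in miniature (a hypothetical stand-in: ints in place of Coroutine pointers, -1 meaning "none"; the index arithmetic mirrors the patch's use of a power-of-two size and a wrap mask):

```c
#include <assert.h>

/* POOL_MAX_SIZE must be a power of two so indices wrap with a mask. */
#define POOL_MAX_SIZE 4

static struct {
    int ptrs[POOL_MAX_SIZE];
    unsigned int size;
    unsigned int nextfree;
} pool;

/* Return an item to the pool.  When the ring is full, the slot being
 * advanced onto holds the OLDEST entry, which is what gets evicted. */
static int pool_put(int co)
{
    int evicted = -1;
    pool.nextfree = (pool.nextfree + 1) & (POOL_MAX_SIZE - 1);
    if (pool.size == POOL_MAX_SIZE) {
        evicted = pool.ptrs[pool.nextfree];
    } else {
        pool.size++;
    }
    pool.ptrs[pool.nextfree] = co;
    return evicted;
}

/* Take the most recently returned (and thus cache-hottest) item. */
static int pool_get(void)
{
    int co;
    if (!pool.size) {
        return -1;
    }
    co = pool.ptrs[pool.nextfree];
    pool.size--;
    pool.nextfree = (pool.nextfree - 1) & (POOL_MAX_SIZE - 1);
    return co;
}
```

With a full ring, putting a fifth item evicts the first one put in, and the next get returns the fifth: newest entries survive, oldest are dropped, which is the opposite of the list's behavior.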

Comments

Paolo Bonzini Nov. 27, 2014, 4:40 p.m. UTC | #1
On 27/11/2014 11:27, Peter Lieven wrote:
> +static __thread struct CoRoutinePool {
> +    Coroutine *ptrs[POOL_MAX_SIZE];
> +    unsigned int size;
> +    unsigned int nextfree;
> +} CoPool;
>  

The per-thread ring unfortunately didn't work well last time it was
tested.  Devices that do not use ioeventfd (not just the slow ones, even
decently performing ones like ahci, nvme or megasas) will create the
coroutine in the VCPU thread, and destroy it in the iothread.  The
result is that coroutines cannot be reused.

Can you check if this is still the case?

Paolo
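Paolo's scenario can be demonstrated in miniature (hypothetical stand-ins: a per-thread counter instead of a real pool, and a tally of allocations that missed the pool). When the frees land in a different thread's TLS pool, the creating thread never sees a reuse:

```c
#include <assert.h>
#include <pthread.h>

static __thread unsigned int pool_size;  /* this thread's pool occupancy */
static unsigned int fresh_allocations;   /* creates that missed the pool */

static void create_one(void)   /* qemu_coroutine_create stand-in */
{
    if (pool_size) {
        pool_size--;           /* reuse from this thread's own pool */
    } else {
        fresh_allocations++;   /* pool empty: allocate from scratch */
    }
}

static void destroy_one(void)  /* coroutine_delete stand-in */
{
    pool_size++;               /* goes back to THIS thread's pool only */
}

static void *io_thread(void *arg)    /* destroys coroutines */
{
    int i;
    (void)arg;
    for (i = 0; i < 100; i++) {
        destroy_one();
    }
    return NULL;
}

static void *vcpu_thread(void *arg)  /* creates coroutines */
{
    int i;
    (void)arg;
    for (i = 0; i < 100; i++) {
        create_one();
    }
    return NULL;
}
```

Even after the iothread has returned 100 coroutines, every create in the vCPU thread allocates fresh, because the returns filled a pool the vCPU thread cannot see.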
Peter Lieven Nov. 28, 2014, 8:13 a.m. UTC | #2
Am 27.11.2014 um 17:40 schrieb Paolo Bonzini:
>
> On 27/11/2014 11:27, Peter Lieven wrote:
>> +static __thread struct CoRoutinePool {
>> +    Coroutine *ptrs[POOL_MAX_SIZE];
>> +    unsigned int size;
>> +    unsigned int nextfree;
>> +} CoPool;
>>  
> The per-thread ring unfortunately didn't work well last time it was
> tested.  Devices that do not use ioeventfd (not just the slow ones, even
> decently performing ones like ahci, nvme or megasas) will create the
> coroutine in the VCPU thread, and destroy it in the iothread.  The
> result is that coroutines cannot be reused.
>
> Can you check if this is still the case?

I already tested at least for IDE and for ioeventfd=off. The coroutine
is created in the vCPU thread and destroyed in the I/O thread.

I also have a more complicated version which sets a per-thread coroutine pool only
for dataplane, avoiding the lock for dedicated iothreads.

For those who want to take a look:

https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e

Peter
Peter Lieven Nov. 28, 2014, 10:37 a.m. UTC | #3
Am 28.11.2014 um 11:28 schrieb Paolo Bonzini:
>
> On 28/11/2014 09:13, Peter Lieven wrote:
>> Am 27.11.2014 um 17:40 schrieb Paolo Bonzini:
>>> On 27/11/2014 11:27, Peter Lieven wrote:
>>>> +static __thread struct CoRoutinePool {
>>>> +    Coroutine *ptrs[POOL_MAX_SIZE];
>>>> +    unsigned int size;
>>>> +    unsigned int nextfree;
>>>> +} CoPool;
>>>>  
>>> The per-thread ring unfortunately didn't work well last time it was
>>> tested.  Devices that do not use ioeventfd (not just the slow ones, even
>>> decently performing ones like ahci, nvme or megasas) will create the
>>> coroutine in the VCPU thread, and destroy it in the iothread.  The
>>> result is that coroutines cannot be reused.
>>>
>>> Can you check if this is still the case?
>> I already tested at least for IDE and for ioeventfd=off. The coroutine
>> is created in the vCPU thread and destroyed in the I/O thread.
>>
>> I also have a more complicated version which sets a per-thread coroutine pool only
>> for dataplane, avoiding the lock for dedicated iothreads.
>>
>> For those who want to take a look:
>>
>> https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e
> Can you test it against the patch I just sent in Kevin's linux-aio
> coroutine thread?

Was already doing it ;-) At least with test-coroutine.c....

master:
Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine

paolo:
Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine

plieven/perf_master2:
Run operation 40000000 iterations 9.013785 s, 4437K operations/s, 225ns per coroutine

plieven/perf_master:
Run operation 40000000 iterations 11.072883 s, 3612K operations/s, 276ns per coroutine

However, perf_master and perf_master2 seem to have a regression regarding nesting.
@Kevin: Could that be the reason why they perform badly in some scenarios?


Regarding the bypass that is being discussed: if it is not just a benchmark thing but really necessary
for some people's use cases, why not add a new aio mode like "bypass" and use it only then?
If the performance is really needed, the user might trade it in for lost features like I/O throttling, filters, etc.

Peter
Paolo Bonzini Nov. 28, 2014, 11:14 a.m. UTC | #4
> master:
> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
> 
> paolo:
> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine

Nice. :)

Can you please try "coroutine: Use __thread … " together, too?  I still
see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
I apply it here (my times are 191/160/145).

Paolo
Peter Lieven Nov. 28, 2014, 11:21 a.m. UTC | #5
Am 28.11.2014 um 12:14 schrieb Paolo Bonzini:
>> master:
>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>
>> paolo:
>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
> Nice. :)
>
> Can you please try "coroutine: Use __thread … " together, too?  I still
> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
> I apply it here (my times are 191/160/145).

indeed:

Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
Paolo Bonzini Nov. 28, 2014, 11:23 a.m. UTC | #6
On 28/11/2014 12:21, Peter Lieven wrote:
> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini:
>>> master:
>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>
>>> paolo:
>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>> Nice. :)
>>
>> Can you please try "coroutine: Use __thread … " together, too?  I still
>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>> I apply it here (my times are 191/160/145).
> 
> indeed:
> 
> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine

Your perf_master2 uses the ring buffer unconditionally, right?  I wonder
if we can use a similar algorithm but with arrays instead of lists...

Paolo
Peter Lieven Nov. 28, 2014, 11:27 a.m. UTC | #7
Am 28.11.2014 um 12:23 schrieb Paolo Bonzini:
>
> On 28/11/2014 12:21, Peter Lieven wrote:
>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini:
>>>> master:
>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>>
>>>> paolo:
>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>>> Nice. :)
>>>
>>> Can you please try "coroutine: Use __thread … " together, too?  I still
>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>>> I apply it here (my times are 191/160/145).
>> indeed:
>>
>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
> Your perf_master2 uses the ring buffer unconditionally, right?  I wonder
> if we can use a similar algorithm but with arrays instead of lists...

You mean an algorithm similar to perf_master2 or to the current implementation?

The ring buffer seems to have a drawback when it comes to excessive coroutine nesting.
My idea was that you do not throw away hot coroutines when the pool is full. However,
I do not know if this is really a problem, since the pool is only full when there is not much
I/O. Or is this assumption too simplistic?

Peter
Stefan Hajnoczi Nov. 28, 2014, 12:40 p.m. UTC | #8
On Thu, Nov 27, 2014 at 11:27:06AM +0100, Peter Lieven wrote:
> diff --git a/iothread.c b/iothread.c
> index 342a23f..b53529b 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -15,6 +15,7 @@
>  #include "qom/object_interfaces.h"
>  #include "qemu/module.h"
>  #include "block/aio.h"
> +#include "block/coroutine.h"
>  #include "sysemu/iothread.h"
>  #include "qmp-commands.h"
>  #include "qemu/error-report.h"
> @@ -47,6 +48,8 @@ static void *iothread_run(void *opaque)
>          }
>          aio_context_release(iothread->ctx);
>      }
> +
> +    coroutine_pool_cleanup();
>      return NULL;
>  }

The assumption here is that iothread_run() is the only thread function
that uses coroutines.

If another thread uses coroutines the pool will leak :(.  This is one of
the challenges of thread-local storage.
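One conventional mitigation for that leak (a sketch of the standard pthread idiom, not something this patch does) is to pair the __thread pool with a pthread_key_t whose destructor runs at any registered thread's exit, so cleanup is no longer tied to iothread_run(). The names pool_register_thread and pool_thread_cleanup are hypothetical:

```c
#include <assert.h>
#include <pthread.h>

static pthread_key_t pool_key;
static int cleanups_run;   /* observability only; real code would drain CoPool */

/* Runs automatically when any registered thread exits. */
static void pool_thread_cleanup(void *opaque)
{
    (void)opaque;
    cleanups_run++;        /* here: free this thread's pooled coroutines */
}

/* Call once from each thread that uses the pool.  Setting a non-NULL
 * specific value is what arms the destructor for this thread. */
static void pool_register_thread(void)
{
    pthread_setspecific(pool_key, (void *)1);
}

static void *worker(void *arg)
{
    (void)arg;
    pool_register_thread();
    return NULL;           /* destructor fires during thread exit */
}
```

The destructor runs as part of thread termination, before pthread_join() returns, so no thread's pool outlives the thread that owns it.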

Patch

diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 1d9343d..f0b3a2d 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -96,7 +96,7 @@  Coroutine *coroutine_fn qemu_coroutine_self(void);
  */
 bool qemu_in_coroutine(void);
 
-
+void coroutine_pool_cleanup(void);
 
 /**
  * CoQueues are a mechanism to queue coroutines in order to continue executing
diff --git a/iothread.c b/iothread.c
index 342a23f..b53529b 100644
--- a/iothread.c
+++ b/iothread.c
@@ -15,6 +15,7 @@ 
 #include "qom/object_interfaces.h"
 #include "qemu/module.h"
 #include "block/aio.h"
+#include "block/coroutine.h"
 #include "sysemu/iothread.h"
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
@@ -47,6 +48,8 @@  static void *iothread_run(void *opaque)
         }
         aio_context_release(iothread->ctx);
     }
+
+    coroutine_pool_cleanup();
     return NULL;
 }
 
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 4708521..1900155 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -24,22 +24,22 @@  enum {
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static __thread struct CoRoutinePool {
+    Coroutine *ptrs[POOL_MAX_SIZE];
+    unsigned int size;
+    unsigned int nextfree;
+} CoPool;
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
-
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
-        if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+        if (CoPool.size) {
+            co = CoPool.ptrs[CoPool.nextfree];
+            CoPool.size--;
+            CoPool.nextfree--;
+            CoPool.nextfree &= (POOL_MAX_SIZE - 1);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
@@ -54,36 +54,30 @@  Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < POOL_MAX_SIZE) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
-            co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
-            return;
+        CoPool.nextfree++;
+        CoPool.nextfree &= (POOL_MAX_SIZE - 1);
+        if (CoPool.size == POOL_MAX_SIZE) {
+            qemu_coroutine_delete(CoPool.ptrs[CoPool.nextfree]);
+        } else {
+            CoPool.size++;
         }
-        qemu_mutex_unlock(&pool_lock);
+        co->caller = NULL;
+        CoPool.ptrs[CoPool.nextfree] = co;
+    } else {
+        qemu_coroutine_delete(co);
     }
-
-    qemu_coroutine_delete(co);
 }
 
 static void __attribute__((constructor)) coroutine_pool_init(void)
 {
-    qemu_mutex_init(&pool_lock);
 }
 
-static void __attribute__((destructor)) coroutine_pool_cleanup(void)
+void __attribute__((destructor)) coroutine_pool_cleanup(void)
 {
-    Coroutine *co;
-    Coroutine *tmp;
-
-    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        qemu_coroutine_delete(co);
+    printf("coroutine_pool_cleanup %lx pool %p\n", pthread_self(), &CoPool);
+    while (CoPool.size) {
+        qemu_coroutine_delete(qemu_coroutine_create(NULL));
     }
-
-    qemu_mutex_destroy(&pool_lock);
 }
 
 static void coroutine_swap(Coroutine *from, Coroutine *to)