[7/7] coroutine: try harder not to delete coroutines

Message ID 1417183941-26329-8-git-send-email-pbonzini@redhat.com
State New

Commit Message

Paolo Bonzini Nov. 28, 2014, 2:12 p.m. UTC
From: Peter Lieven <pl@kamp.de>

Placing coroutines on the global pool is preferable, because it can
help all threads.  But if the global pool is full, we can still save
some allocations by stashing completed coroutines on the thread-local
pool.  This is quite cheap too, because it does not require atomic
operations.

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 qemu-coroutine.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
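
In outline, the delete side after this patch looks like the sketch below.
It is condensed from the hunks in the diff; the global-pool branch and its
threshold come from the surrounding code rather than from this patch, so
treat it as a reconstruction, not the verbatim source:

static void coroutine_delete(Coroutine *co)
{
    if (CONFIG_COROUTINE_POOL) {
        /* First choice: the global release_pool, which any thread can
         * refill from, at the cost of an atomic list insert plus an
         * atomic counter update. */
        if (release_pool_size < POOL_BATCH_SIZE * 2) {
            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
            atomic_inc(&release_pool_size);
            return;
        }
        /* New in this patch: if the global pool is full, stash the
         * coroutine on the thread-local alloc_pool instead.  No atomics
         * are needed, because both the list and the counter are __thread
         * variables. */
        if (alloc_pool_size < POOL_BATCH_SIZE) {
            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
            alloc_pool_size++;
            return;
        }
    }

    /* Both pools are full: really free the coroutine. */
    qemu_coroutine_delete(co);
}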

Comments

Peter Lieven Nov. 28, 2014, 8:52 p.m. UTC | #1
On 28.11.2014 at 15:12, Paolo Bonzini wrote:
> From: Peter Lieven <pl@kamp.de>
>
> Placing coroutines on the global pool is preferable, because it can
> help all threads.  But if the global pool is full, we can still save
> some allocations by stashing completed coroutines on the thread-local
> pool.  This is quite cheap too, because it does not require atomic
> operations.

At least in test-coroutine.c this turns out to be more than just a nice-to-have.
I have not fully understood why, but I get the following results:

master:
Run operation 40000000 iterations 13.612604 s, 2938K operations/s, 340ns per coroutine

this series up to patch 6:
Run operation 40000000 iterations 10.428382 s, 3835K operations/s, 260ns per coroutine

this series up to patch 7:
Run operation 40000000 iterations 9.112539 s, 4389K operations/s, 227ns per coroutine

So this confirms the +33% Paolo sees up to patch 5.  But I have not yet fully
understood the additional +15% that this patch gains.
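
(For reference, the figures in each line are mutually consistent: on master,
40000000 iterations / 13.612604 s ≈ 2938K operations/s, and
13.612604 s / 40000000 ≈ 340 ns per coroutine.)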

>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  qemu-coroutine.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index da1b961..977f114 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -27,6 +27,7 @@ enum {
>  static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
>  static unsigned int release_pool_size;
>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
> +static __thread unsigned int alloc_pool_size;
>  static __thread Notifier coroutine_pool_cleanup_notifier;
>  
>  static void coroutine_pool_cleanup(Notifier *n, void *value)
> @@ -58,13 +59,14 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>                   * release_pool_size and the actual size of release_pool.  But
>                   * it is just a heuristic, it does not need to be perfect.
>                   */
> -                release_pool_size = 0;
> +                alloc_pool_size += atomic_xchg(&release_pool_size, 0);

I had alloc_pool_size = in my original patch.
It shouldn't make a difference, since alloc_pool_size should be 0
when we reach this piece of code.  But if for some reason release_pool_size
is inaccurate, we add this error to alloc_pool_size again and again and
eventually end up not adding coroutines to the thread-local pool below,
although it might be empty in the worst case.

Peter
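
A toy program that illustrates the drift Peter describes, using a made-up
error of one per refill (hypothetical numbers, not QEMU code):

#include <stdio.h>

int main(void)
{
    /* Suppose release_pool_size over-counts the real list length by 1
     * on every refill.  With "+=" the error accumulates; with "=" it
     * would be reset each time. */
    unsigned int alloc_pool_size = 0;

    for (int refill = 1; refill <= 4; refill++) {
        alloc_pool_size += 65;   /* the counter claims 65 coroutines... */
        alloc_pool_size -= 64;   /* ...but only 64 can actually be popped */
        printf("after refill %d: alloc_pool_size = %u, real size = 0\n",
               refill, alloc_pool_size);
    }
    /* Once alloc_pool_size creeps past POOL_BATCH_SIZE, coroutine_delete()
     * stops stashing on the thread-local pool even though it is empty. */
    return 0;
}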
Paolo Bonzini Nov. 29, 2014, 2:27 p.m. UTC | #2
On 28/11/2014 21:52, Peter Lieven wrote:
> 
> master:
> Run operation 40000000 iterations 13.612604 s, 2938K operations/s, 340ns per coroutine
> 
> this series up to patch 6:
> Run operation 40000000 iterations 10.428382 s, 3835K operations/s, 260ns per coroutine
> 
> this series up to patch 7:
> Run operation 40000000 iterations 9.112539 s, 4389K operations/s, 227ns per coroutine
> 
> So this confirms the +33% Paolo sees up to patch 5.  But I have not yet fully
> understood the additional +15% that this patch gains.

No atomic operations once the release pool gets full.  We're talking
about 800 clock cycles here, and one atomic operation costs 50 cycles.
100 clock cycles out of 800 = 15% speedup (8/7 = 1.14).

Paolo
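
Spelled out: removing two ~50-cycle atomic operations from an ~800-cycle
create/delete round trip leaves ~700 cycles, and 800 / 700 = 8/7 ≈ 1.14,
i.e. roughly a 15% speedup.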
Paolo Bonzini Nov. 29, 2014, 2:28 p.m. UTC | #3
On 28/11/2014 21:52, Peter Lieven wrote:
>> > +                alloc_pool_size += atomic_xchg(&release_pool_size, 0);
> I had alloc_pool_size = in my original patch.
> It shouldn't make a difference, since alloc_pool_size should be 0
> when we reach this piece of code.  But if for some reason release_pool_size
> is inaccurate, we add this error to alloc_pool_size again and again and
> eventually end up not adding coroutines to the thread-local pool below,
> although it might be empty in the worst case.

Oops, this must come from a rebase.  Thanks for pointing it out.

Paolo
Peter Lieven Nov. 29, 2014, 9:28 p.m. UTC | #4
On 29.11.2014 at 15:27, Paolo Bonzini wrote:
>
> On 28/11/2014 21:52, Peter Lieven wrote:
>> master:
>> Run operation 40000000 iterations 13.612604 s, 2938K operations/s, 340ns per coroutine
>>
>> this series up to patch 6:
>> Run operation 40000000 iterations 10.428382 s, 3835K operations/s, 260ns per coroutine
>>
>> this series up to patch 7:
>> Run operation 40000000 iterations 9.112539 s, 4389K operations/s, 227ns per coroutine
>>
>> So this confirms the +33% Paolo sees up to patch 5.  But I have not yet fully
>> understood the additional +15% that this patch gains.
> No atomic operations once the release pool gets full.  We're talking
> about 800 clock cycles here, and one atomic operation costs 50 cycles.
> 100 clock cycles out of 800 = 15% speedup (8/7 = 1.14).

Maybe it's worth mentioning in the commit message that this patch can give
an additional 15% gain in the best case.  That makes +50% for the whole
series in the best case.

Peter

Patch

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index da1b961..977f114 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -27,6 +27,7 @@  enum {
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
 
 static void coroutine_pool_cleanup(Notifier *n, void *value)
@@ -58,13 +59,14 @@  Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  * release_pool_size and the actual size of release_pool.  But
                  * it is just a heuristic, it does not need to be perfect.
                  */
-                release_pool_size = 0;
+                alloc_pool_size += atomic_xchg(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
             }
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+            alloc_pool_size--;
         }
     }
 
@@ -87,6 +89,11 @@  static void coroutine_delete(Coroutine *co)
             atomic_inc(&release_pool_size);
             return;
         }
+        if (alloc_pool_size < POOL_BATCH_SIZE) {
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
+            return;
+        }
     }
 
     qemu_coroutine_delete(co);
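
For reference, the matching allocation side after the patch, again as a
condensed sketch reconstructed from the hunks above (the slow-path trigger
condition and the tail of the function come from the pre-existing code, not
from this diff, and other initialisation is omitted):

Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
{
    Coroutine *co = NULL;

    if (CONFIG_COROUTINE_POOL) {
        /* Fast path: pop from the thread-local pool without atomics. */
        co = QSLIST_FIRST(&alloc_pool);
        if (!co && release_pool_size > POOL_BATCH_SIZE) {
            /* Slow path: steal the whole global release_pool with one
             * atomic list move and take over its (approximate) size.
             * Per the discussion above, "+=" should really be "=". */
            alloc_pool_size += atomic_xchg(&release_pool_size, 0);
            QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
            co = QSLIST_FIRST(&alloc_pool);
        }
        if (co) {
            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
            alloc_pool_size--;
        }
    }

    if (!co) {
        co = qemu_coroutine_new();
    }
    co->entry = entry;
    return co;
}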