
[RFC,3/3] qemu-coroutine: use a ring per thread for the pool

Message ID 54785D60.1070306@kamp.de
State New

Commit Message

Peter Lieven Nov. 28, 2014, 11:32 a.m. UTC
On 28.11.2014 at 12:23, Paolo Bonzini wrote:
>
> On 28/11/2014 12:21, Peter Lieven wrote:
>> On 28.11.2014 at 12:14, Paolo Bonzini wrote:
>>>> master:
>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>>
>>>> paolo:
>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>>> Nice. :)
>>>
>>> Can you please try "coroutine: Use __thread … " together, too?  I still
>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>>> I apply it here (my times are 191/160/145).
>> indeed:
>>
>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
> Your perf_master2 uses the ring buffer unconditionally, right?  I wonder
> if we can use a similar algorithm but with arrays instead of lists...

Why do you set pool_size = 0 in the create path?

When I do the following (the diff is shown in the Patch section at the bottom of this page):

I get:
Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
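For readers without the rest of the series at hand: the pool_size line being discussed sits in the slow path of qemu_coroutine_create(). A rough sketch of how that path fits together, reconstructed from this thread rather than copied from the series (POOL_BATCH_SIZE, the atomic_read() check and the fallback to qemu_coroutine_new() are assumptions):

/* Sketch only -- reconstructed from the discussion, not taken verbatim from the RFC. */
Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
{
    Coroutine *co = QSLIST_FIRST(&alloc_pool);      /* per-thread pool, no lock needed */

    if (!co && atomic_read(&pool_size) > POOL_BATCH_SIZE) {
        /* The thread-local pool is empty, so steal the whole shared
         * release_pool in one go.  pool_size counts what is sitting in
         * release_pool, hence the "pool_size = 0" in the hunk under
         * discussion.
         */
        pool_size = 0;
        QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
        co = QSLIST_FIRST(&alloc_pool);
    }

    if (co) {
        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
    } else {
        co = qemu_coroutine_new();                  /* fall back to a fresh allocation */
    }

    co->entry = entry;
    return co;
}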

Comments

Peter Lieven Nov. 28, 2014, 11:46 a.m. UTC | #1
On 28.11.2014 at 12:32, Peter Lieven wrote:
> On 28.11.2014 at 12:23, Paolo Bonzini wrote:
>> On 28/11/2014 12:21, Peter Lieven wrote:
>>> On 28.11.2014 at 12:14, Paolo Bonzini wrote:
>>>>> master:
>>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>>>
>>>>> paolo:
>>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>>>> Nice. :)
>>>>
>>>> Can you please try "coroutine: Use __thread … " together, too?  I still
>>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>>>> I apply it here (my times are 191/160/145).
>>> indeed:
>>>
>>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
>> Your perf_master2 uses the ring buffer unconditionally, right?  I wonder
>> if we can use a similar algorithm but with arrays instead of lists...
> Why do you set pool_size = 0 in the create path?
>
> When I do the following:
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 6bee354..c79ee78 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>                   * and the actual size of alloc_pool.  But it is just a heuristic,
>                   * it does not need to be perfect.
>                   */
> -                pool_size = 0;
> +                atomic_dec(&pool_size);
>                  QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
>                  co = QSLIST_FIRST(&alloc_pool);
>
>
> I get:
> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
>

Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more
than one thread in need of a lot of coroutines?

Peter
Paolo Bonzini Nov. 28, 2014, 12:21 p.m. UTC | #2
On 28/11/2014 12:32, Peter Lieven wrote:
> On 28.11.2014 at 12:23, Paolo Bonzini wrote:
>>
>> On 28/11/2014 12:21, Peter Lieven wrote:
>>> On 28.11.2014 at 12:14, Paolo Bonzini wrote:
>>>>> master:
>>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>>>
>>>>> paolo:
>>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>>>> Nice. :)
>>>>
>>>> Can you please try "coroutine: Use __thread … " together, too?  I still
>>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>>>> I apply it here (my times are 191/160/145).
>>> indeed:
>>>
>>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
>> Your perf_master2 uses the ring buffer unconditionally, right?  I wonder
>> if we can use a similar algorithm but with arrays instead of lists...
> 
> Why do you set pool_size = 0 in the create path?
> 
> When I do the following:
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 6bee354..c79ee78 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>                   * and the actual size of alloc_pool.  But it is just a heuristic,
>                   * it does not need to be perfect.
>                   */
> -                pool_size = 0;
> +                atomic_dec(&pool_size);
>                  QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
>                  co = QSLIST_FIRST(&alloc_pool);
> 
> 
> I get:
> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine

Because pool_size is the (approximate) number of coroutines in the pool.
It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first.

Paolo
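The key point is that QSLIST_MOVE_ATOMIC moves the entire list, not a single element: the source head pointer is exchanged with NULL in one atomic step, so release_pool is empty afterwards and the counter that mirrors it has to drop to zero as well. Roughly (a paraphrase of the macro's semantics, not a quote of queue.h):

/* Approximate semantics of QSLIST_MOVE_ATOMIC(dest, src): one atomic
 * exchange of the head pointer empties src and hands everything to dest.
 */
#define QSLIST_MOVE_ATOMIC(dest, src) do {                              \
        (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);      \
    } while (0)

Decrementing pool_size by one instead would leave it overstating the shared pool, so other threads would keep taking the steal path only to find release_pool already empty.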
Peter Lieven Nov. 28, 2014, 12:26 p.m. UTC | #3
On 28.11.2014 at 13:21, Paolo Bonzini wrote:
>
> On 28/11/2014 12:32, Peter Lieven wrote:
>> On 28.11.2014 at 12:23, Paolo Bonzini wrote:
>>> On 28/11/2014 12:21, Peter Lieven wrote:
>>> On 28.11.2014 at 12:14, Paolo Bonzini wrote:
>>>>>> master:
>>>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>>>>
>>>>>> paolo:
>>>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>>>>> Nice. :)
>>>>>
>>>>> Can you please try "coroutine: Use __thread … " together, too?  I still
>>>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>>>>> I apply it here (my times are 191/160/145).
>>>> indeed:
>>>>
>>>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
>>> Your perf_master2 uses the ring buffer unconditionally, right?  I wonder
>>> if we can use a similar algorithm but with arrays instead of lists...
>> Why do you set pool_size = 0 in the create path?
>>
>> When I do the following:
>> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
>> index 6bee354..c79ee78 100644
>> --- a/qemu-coroutine.c
>> +++ b/qemu-coroutine.c
>> @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>>                   * and the actual size of alloc_pool.  But it is just a heuristic,
>>                   * it does not need to be perfect.
>>                   */
>> -                pool_size = 0;
>> +                atomic_dec(&pool_size);
>>                  QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
>>                  co = QSLIST_FIRST(&alloc_pool);
>>
>>
>> I get:
>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
> Because pool_size is the (approximate) number of coroutines in the pool.
>  It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first.

Got it in the meantime. And it's not as bad as I thought, since you only steal the release_pool if your
alloc_pool is empty. Right?

Peter
Paolo Bonzini Nov. 28, 2014, 12:26 p.m. UTC | #4
On 28/11/2014 12:46, Peter Lieven wrote:
> > I get:
> > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
>
> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more
> than one thread in need of a lot of coroutines?

Overall the algorithm is expected to adapt.  The N threads contribute to
the global release pool, so the pool will fill up N times faster than if
you had only one thread.  There can be some variance, which is why the
maximum size of the pool is twice the threshold (and probably could be
tuned better).

Benchmarks are needed on real I/O too, of course, especially with high
queue depth.

Paolo
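The "twice the threshold" cap Paolo mentions comes from the release side. A sketch of a release path with that behaviour (coroutine_delete, POOL_BATCH_SIZE and the exact comparison are illustrative assumptions, not quotes of the RFC):

/* Sketch of the release side; names and bounds are illustrative. */
static void coroutine_delete(Coroutine *co)
{
    co->caller = NULL;

    if (atomic_read(&pool_size) < POOL_BATCH_SIZE * 2) {
        /* Every thread pushes into the shared release_pool, so with N
         * threads the pool refills roughly N times faster; the 2x bound
         * absorbs the skew between pool_size and the real list length.
         */
        QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
        atomic_inc(&pool_size);
        return;
    }

    qemu_coroutine_delete(co);                  /* pool is full enough: free it */
}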

Patch

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..c79ee78 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -44,7 +44,7 @@  Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  * and the actual size of alloc_pool.  But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                atomic_dec(&pool_size);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);