Message ID: 1406720388-18671-2-git-send-email-ming.lei@canonical.com
On 30/07/2014 13:39, Ming Lei wrote:
> This patch introduces several APIs for supporting bypass qemu coroutine
> in case of being not necessary and for performance's sake.

No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
anything else is a step backwards.

If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
conditions that allow doing that and call the bdrv_aio_readv/writev
directly.

To begin with, have you benchmarked QEMU and can you provide a trace of
*where* the coroutine overhead lies?

Paolo

> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  include/block/coroutine.h     |  8 ++++++++
>  include/block/coroutine_int.h |  5 +++++
>  qemu-coroutine-lock.c         |  4 ++--
>  qemu-coroutine.c              | 33 +++++++++++++++++++++++++++++++++
>  4 files changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/include/block/coroutine.h b/include/block/coroutine.h
> index b408f96..46d2642 100644
> --- a/include/block/coroutine.h
> +++ b/include/block/coroutine.h
> @@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
>   * Note that this function clobbers the handlers for the file descriptor.
>   */
>  void coroutine_fn yield_until_fd_readable(int fd);
> +
> +/* qemu coroutine bypass APIs */
> +void qemu_coroutine_set_bypass(bool bypass);
> +bool qemu_coroutine_bypassed(Coroutine *self);
> +bool qemu_coroutine_self_bypassed(void);
> +void qemu_coroutine_set_var(void *var);
> +void *qemu_coroutine_get_var(void);
> +
>  #endif /* QEMU_COROUTINE_H */
> diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
> index f133d65..106d0b2 100644
> --- a/include/block/coroutine_int.h
> +++ b/include/block/coroutine_int.h
> @@ -39,6 +39,11 @@ struct Coroutine {
>      Coroutine *caller;
>      QSLIST_ENTRY(Coroutine) pool_next;
>
> +    bool bypass;
> +
> +    /* only used in bypass mode */
> +    void *opaque;
> +
>      /* Coroutines that should be woken up when we yield or terminate */
>      QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
>      QTAILQ_ENTRY(Coroutine) co_queue_next;
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index e4860ae..7c69ff6 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
>
>  bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
>  {
> -    assert(qemu_in_coroutine());
> +    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
>      return qemu_co_queue_do_restart(queue, true);
>  }
>
>  void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
>  {
> -    assert(qemu_in_coroutine());
> +    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
>      qemu_co_queue_do_restart(queue, false);
>  }
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 4708521..0597ed9 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
>      self->caller = NULL;
>      coroutine_swap(self, to);
>  }
> +
> +void qemu_coroutine_set_bypass(bool bypass)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    self->bypass = bypass;
> +}
> +
> +bool qemu_coroutine_bypassed(Coroutine *self)
> +{
> +    return self->bypass;
> +}
> +
> +bool qemu_coroutine_self_bypassed(void)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    return qemu_coroutine_bypassed(self);
> +}
> +
> +void qemu_coroutine_set_var(void *var)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    self->opaque = var;
> +}
> +
> +void *qemu_coroutine_get_var(void)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    return self->opaque;
> +}
On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 13:39, Ming Lei wrote:
>> This patch introduces several APIs for supporting bypass qemu coroutine
>> in case of being not necessary and for performance's sake.
>
> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
> anything else is a step backwards.

As we saw, coroutines have brought a performance regression to dataplane,
and it isn't necessary to use them in some cases, is it?

> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
> conditions that allow doing that and call the bdrv_aio_readv/writev
> directly.

That is easy to detect, please see the 5th patch.

> To begin with, have you benchmarked QEMU and can you provide a trace of
> *where* the coroutine overhead lies?

I guess it may be caused by the stack switch; at least on one of my
boxes, bypassing coroutines can improve throughput by ~7%, and by ~15%
on another box.

Thanks,
On 30/07/2014 19:15, Ming Lei wrote:
> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 30/07/2014 13:39, Ming Lei wrote:
>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>> in case of being not necessary and for performance's sake.
>>
>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>> anything else is a step backwards.
>
> As we saw, coroutine has brought up performance regression
> on dataplane, and it isn't necessary to use co in some cases, is it?

Yes, and it's not necessary on non-dataplane either.  It's not necessary
on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
either.

>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>> conditions that allow doing that and call the bdrv_aio_readv/writev
>> directly.
>
> That is easy to detect, please see the 5th patch.

No, that's not enough.  Dataplane right now prevents block jobs, but
that's going to change and it could require coroutines even for raw
devices.

>> To begin with, have you benchmarked QEMU and can you provide a trace of
>> *where* the coroutine overhead lies?
>
> I guess it may be caused by the stack switch, at least in one of
> my box, bypassing co can improve throughput by ~7%, and by
> ~15% in another box.

No guesses please.  Actually that's also my guess, but since you are
submitting the patch you must do better and show profiles where stack
switching disappears after the patches.

Paolo
On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.

It is good to know these cases, so they might benefit from this patch
in the future too, :-)

>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw
> devices.

Could you explain a bit why coroutines would be required for raw devices
in the future?

If the cases that don't require coroutines can be detected beforehand,
these patches are useful, IMO.

>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

OK, no problem, I will provide some profile results.

Thanks,
On Thursday 31 Jul 2014 at 11:55:28 (+0800), Ming Lei wrote:
> On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 30/07/2014 19:15, Ming Lei wrote:
>>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>>> in case of being not necessary and for performance's sake.
>>>>
>>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>>> anything else is a step backwards.
>>>
>>> As we saw, coroutine has brought up performance regression
>>> on dataplane, and it isn't necessary to use co in some cases, is it?
>>
>> Yes, and it's not necessary on non-dataplane either.  It's not necessary
>> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
>> either.
>
> It is good to know these cases, so they might benefit from this patch
> in future too, :-)
>
>>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>>> directly.
>>>
>>> That is easy to detect, please see the 5th patch.
>>
>> No, that's not enough.  Dataplane right now prevents block jobs, but
>> that's going to change and it could require coroutines even for raw
>> devices.
>
> Could you explain it a bit why co is required for raw devices in future?

Block mirroring of a device, for example, is done using coroutines.
As block mirroring can be done on a raw device, you need coroutines.

> If some cases for not requiring co can be detected beforehand, these
> patches are useful, IMO.
>
>>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>>> *where* the coroutine overhead lies?
>>>
>>> I guess it may be caused by the stack switch, at least in one of
>>> my box, bypassing co can improve throughput by ~7%, and by
>>> ~15% in another box.
>>
>> No guesses please.  Actually that's also my guess, but since you are
>> submitting the patch you must do better and show profiles where stack
>> switching disappears after the patches.
>
> OK, no problem, I will provide some profile result.
>
> Thanks,
On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.
>
>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw
> devices.
>
>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

Below are the hardware events reported by 'perf stat' when running the
fio randread benchmark for 2 min in the VM (single vq, 2 jobs):

sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./nqemu-start-mq 4 1

1) without bypassing coroutine (by forcing 's->raw_format' to false, see
patch 5/15)

- throughput: 95K

 Performance counter stats for './nqemu-start-mq 4 1':

     69,231,035,842 L1-dcache-loads                                     [40.10%]
      1,909,978,930 L1-dcache-load-misses  # 2.76% of all L1-dcache hits [39.98%]
    263,731,501,086 cpu-cycles                                          [40.03%]
    232,564,905,115 instructions           # 0.88 insns per cycle       [50.23%]
     46,157,868,745 branch-instructions                                 [49.82%]
        785,618,591 branch-misses          # 1.70% of all branches      [49.99%]
     46,280,342,654 branch-loads                                        [49.95%]
     34,934,790,140 branch-load-misses                                  [50.02%]
     69,447,857,237 dTLB-loads                                          [40.13%]
        169,617,374 dTLB-load-misses       # 0.24% of all dTLB cache hits [40.04%]

      161.991075781 seconds time elapsed

2) with bypassing coroutine

- throughput: 115K

 Performance counter stats for './nqemu-start-mq 4 1':

     76,784,224,509 L1-dcache-loads                                     [39.93%]
      1,334,036,447 L1-dcache-load-misses  # 1.74% of all L1-dcache hits [39.91%]
    262,697,428,470 cpu-cycles                                          [40.03%]
    255,526,629,881 instructions           # 0.97 insns per cycle       [50.01%]
     50,160,082,611 branch-instructions                                 [49.97%]
        564,407,788 branch-misses          # 1.13% of all branches      [50.08%]
     50,331,510,702 branch-loads                                        [50.08%]
     35,760,766,459 branch-load-misses                                  [50.03%]
     76,706,000,951 dTLB-loads                                          [40.00%]
        123,291,001 dTLB-load-misses       # 0.16% of all dTLB cache hits [40.02%]

      162.333465490 seconds time elapsed
On 31/07/2014 10:59, Ming Lei wrote:
>>> No guesses please.  Actually that's also my guess, but since you are
>>> submitting the patch you must do better and show profiles where stack
>>> switching disappears after the patches.
>
> Follows the below hardware events reported by 'perf stat' when running
> fio randread benchmark for 2min in VM (single vq, 2 jobs):
>
> sudo ~/bin/perf stat -e
> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
> ./nqemu-start-mq 4 1
>
> 1), without bypassing coroutine via forcing to set 's->raw_format' as
> false, see patch 5/15
>
> - throughput: 95K
> 232,564,905,115 instructions
> 161.991075781 seconds time elapsed
>
> 2), with bypassing coroutine
> - throughput: 115K
> 255,526,629,881 instructions
> 162.333465490 seconds time elapsed

Ok, so you are saving 10% of instructions per iop: before, 232G / 95K =
2.45M instructions/iop; after, 255G / 115K = 2.22M instructions/iop.

That's not small, and it's a good thing for CPU utilization even if you
were not increasing iops.  On top of this, can you provide the stack
traces to see the difference in the profiles?

Paolo
On Thu, Jul 31, 2014 at 3:37 PM, Benoît Canet <benoit.canet@irqsave.net> wrote:
> On Thursday 31 Jul 2014 at 11:55:28 (+0800), Ming Lei wrote:
>> On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 30/07/2014 19:15, Ming Lei wrote:
>>>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>>>> in case of being not necessary and for performance's sake.
>>>>>
>>>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>>>> anything else is a step backwards.
>>>>
>>>> As we saw, coroutine has brought up performance regression
>>>> on dataplane, and it isn't necessary to use co in some cases, is it?
>>>
>>> Yes, and it's not necessary on non-dataplane either.  It's not necessary
>>> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
>>> either.
>>
>> It is good to know these cases, so they might benefit from this patch
>> in future too, :-)
>>
>>>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>>>> directly.
>>>>
>>>> That is easy to detect, please see the 5th patch.
>>>
>>> No, that's not enough.  Dataplane right now prevents block jobs, but
>>> that's going to change and it could require coroutines even for raw
>>> devices.
>>
>> Could you explain it a bit why co is required for raw devices in future?
>
> Block mirroring of a device for example is done using coroutines.
> As block mirroring can be done on a raw device you need coroutines.

If the block layer knows that mirroring is in progress, it can still
enable coroutines by ignoring the bypass, or just let the device disable
the coroutine bypass in case of mirroring; the current APIs are very
flexible.

Thanks,
On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 10:59, Ming Lei wrote:
>>>> No guesses please.  Actually that's also my guess, but since you are
>>>> submitting the patch you must do better and show profiles where stack
>>>> switching disappears after the patches.
>> Follows the below hardware events reported by 'perf stat' when running
>> fio randread benchmark for 2min in VM (single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e
>> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
>> ./nqemu-start-mq 4 1
>>
>> 1), without bypassing coroutine via forcing to set 's->raw_format' as
>> false, see patch 5/15
>>
>> - throughput: 95K
>> 232,564,905,115 instructions
>> 161.991075781 seconds time elapsed
>>
>> 2), with bypassing coroutine
>> - throughput: 115K
>> 255,526,629,881 instructions
>> 162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% instructions per iop: before 232G / 95K =
> 2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.

I am wondering if that is meaningful, since the IOP count is from the
VM's point of view while the instruction count is from the host (qemu).
If the qemu dataplane handles I/O quickly enough, it can save
instructions by batching operations.

But I will collect the 'perf report' on the 'instructions' event for
you.

With coroutines, the L1-dcache-load-miss ratio increases by about one
percentage point (1.74% -> 2.76%), and instructions per cycle drop by
about 10% (0.97 -> 0.88), which is big too, and means qemu becomes much
slower when running block I/O in the VM via coroutines.

Thanks,
On 31/07/2014 11:47, Ming Lei wrote:
>> Block mirroring of a device for example is done using coroutines.
>> As block mirroring can be done on a raw device you need coroutines.
>
> If block layer knows the mirroring is in-progress, it still can enable
> coroutine by ignoring bypass coroutine, or just let device disable
> bypass coroutine in case of mirroring, and the current APIs are very
> flexible.

What matters is not whether you're mirroring.  What matters is whether
you're calling bdrv_aio_readv/writev or bdrv_co_readv/writev.  Under
some limitations, if you are calling bdrv_aio_readv/writev you can
bypass coroutines.  (In fact drive mirroring uses those APIs too, so it
doesn't need coroutines and can benefit from the speedup too. :)  But
drive-backup does need them.)

The limitations are essentially "would bdrv_co_do_preadv or
bdrv_co_do_pwritev do anything special?"  This means, for example, for
bdrv_co_do_pwritev:

- bs->drv must be non-NULL

- bs->read_only must be false

- bdrv_check_byte_request(bs, offset, bytes) must return false

- bs->io_limits_enabled must be false

- the request must be aligned

- (in bdrv_aligned_pwritev) the before_write_notifiers must be empty

- (in bdrv_aligned_pwritev) bs->detect_zeroes must be off

- (in bdrv_aligned_pwritev) the BDRV_REQ_ZERO_WRITE flag must be off

- (in bdrv_aligned_pwritev) bs->enable_write_cache must be false

and the hard part is organizing the code so that the code duplication is
as limited as possible.

Paolo
On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 10:59, Ming Lei wrote:
>>>> No guesses please.  Actually that's also my guess, but since you are
>>>> submitting the patch you must do better and show profiles where stack
>>>> switching disappears after the patches.
>> Follows the below hardware events reported by 'perf stat' when running
>> fio randread benchmark for 2min in VM (single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e
>> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
>> ./nqemu-start-mq 4 1
>>
>> 1), without bypassing coroutine via forcing to set 's->raw_format' as
>> false, see patch 5/15
>>
>> - throughput: 95K
>> 232,564,905,115 instructions
>> 161.991075781 seconds time elapsed
>>
>> 2), with bypassing coroutine
>> - throughput: 115K
>> 255,526,629,881 instructions
>> 162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% instructions per iop: before 232G / 95K =
> 2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.
>
> That's not small, and it's a good thing for CPU utilization even if you
> were not increasing iops.  On top of this, can you provide the stack
> traces to see the difference in the profiles?

Here is the 'perf report' result on the cycles event, with and without
bypassing coroutines:

http://pastebin.com/ae0vnQ6V

From the profiling result, it looks like bdrv_co_do_preadv() is a bit
slow without bypassing coroutines.

Thanks,
On 31/07/2014 18:13, Ming Lei wrote:
> Follows 'perf report' result on cycles event for with/without bypass
> coroutine:
>
> http://pastebin.com/ae0vnQ6V
>
> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> without bypass coroutine.

Yeah, I can count at least 3.3% of time spent here:

    0.87%  bdrv_co_do_preadv
    0.79%  bdrv_aligned_preadv
    0.71%  qemu_coroutine_switch
    0.52%  tracked_request_begin
    0.45%  coroutine_swap

Another ~3% is wasted in malloc, etc.

I suggest that we discuss it on the phone at next Tuesday's KVM call.

I'm not denying that bypassing coroutines is a useful thing to do.  But
*everyone* should be doing that if it is so useful, and it's hard to do
it without causing code duplication.

Paolo
On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 18:13, Ming Lei wrote:
>> Follows 'perf report' result on cycles event for with/without bypass
>> coroutine:
>>
>> http://pastebin.com/ae0vnQ6V
>>
>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> without bypass coroutine.
>
> Yeah, I can count at least 3.3% time spent here:
>
> 0.87% bdrv_co_do_preadv
> 0.79% bdrv_aligned_preadv
> 0.71% qemu_coroutine_switch
> 0.52% tracked_request_begin
> 0.45% coroutine_swap
>
> Another ~3% wasted in malloc, etc.

That should be related to coroutines and the BH in bdrv_co_do_rw().
In this post I didn't apply Stefan's coroutine resize patches, which
might decrease the usage of malloc() for coroutines.

At least, coroutines aren't cheap, judging from the profile result.

> I suggest that we discuss it on the phone at next Tuesday's KVM call.

No problem, and we can continue to discuss it on the mailing list too.
If you or anyone else needs more profiling data, please feel free to let
me know, and I am happy to provide it.

> I'm not denying that bypassing coroutines is a useful thing to do.  But
> *everyone* should be doing that if it is so useful, and it's hard to do
> it without causing code duplication.

Yes, I agree, and the patch is designed that way, more or less.  If a
device knows in advance that coroutines aren't required, it just calls
qemu_aio_set_bypass_co() to notify the block layer not to use
coroutines; that is the approach taken in this patch.

Thanks,
On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 31/07/2014 18:13, Ming Lei wrote:
>>> Follows 'perf report' result on cycles event for with/without bypass
>>> coroutine:
>>>
>>> http://pastebin.com/ae0vnQ6V
>>>
>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>> without bypass coroutine.
>>
>> Yeah, I can count at least 3.3% time spent here:
>>
>> 0.87% bdrv_co_do_preadv
>> 0.79% bdrv_aligned_preadv
>> 0.71% qemu_coroutine_switch
>> 0.52% tracked_request_begin
>> 0.45% coroutine_swap
>>
>> Another ~3% wasted in malloc, etc.
>
> That should be related with coroutine and the BH in bdrv_co_do_rw().
> In this post I didn't apply Stefan's coroutine resize patch which might
> decrease usage of malloc() for coroutine.

Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
size".

> At least, coroutine isn't cheap from the profile result.

Instead of bypassing coroutines we should first understand the overhead
that they impose.  Is it due to the coroutine implementation (switching
stacks), or due to the bdrv_co_*() code that happens to use coroutines
but is slow for other reasons?
On Thu, Jul 31, 2014 at 6:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 11:47, Ming Lei wrote:
>>> Block mirroring of a device for example is done using coroutines.
>>> As block mirroring can be done on a raw device you need coroutines.
>>
>> If block layer knows the mirroring is in-progress, it still can enable
>> coroutine by ignoring bypass coroutine, or just let device disable
>> bypass coroutine in case of mirroring, and the current APIs are very
>> flexible.
>
> What matters is not whether you're mirroring.  What matters is whether
> you're calling bdrv_aio_readv/writev or bdrv_co_readv/writev.  Under
> some limitation, if you are calling bdrv_aio_readv/writev you can bypass
> coroutines.  (In fact drive mirroring uses those APIs too so it doesn't
> need coroutines and can benefit from the speedup too. :)  But
> drive-backup does need them.)
>
> The limitations are essentially "would bdrv_co_do_preadv or
> bdrv_co_do_pwritev do anything special?"  This means for example for
> bdrv_co_do_pwritev:
>
> - bs->drv must be non-NULL
>
> - bs->read_only must be false
>
> - bdrv_check_byte_request(bs, offset, bytes) must return false
>
> - bs->io_limits_enabled must be false
>
> - the request must be aligned
>
> - (in bdrv_aligned_pwritev) the before_write_notifiers must be empty
>
> - (in bdrv_aligned_pwritev) bs->detect_zeroes must be off
>
> - (in bdrv_aligned_pwritev) the BDRV_REQ_ZERO_WRITE flag must be off
>
> - (in bdrv_aligned_pwritev) bs->enable_write_cache must be false
>
> and the hard part is organizing the code so that the code duplication is
> as limited as possible.

In this patchset, only the implementation of bypassing coroutines is
introduced, and it is simply applied to raw images for dataplane; it
won't break drive mirror & backup, if I understand correctly.  So I
suggest we focus on the implementation of bypassing coroutines.

All of the above is mostly about whether the bypass can be applied in
other cases too, so this patchset shouldn't have caused code
duplication.

I would very much appreciate it if the implementation of bypassing
coroutines could be reviewed in this discussion.

Thanks,
On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 31/07/2014 18:13, Ming Lei wrote:
>>>> Follows 'perf report' result on cycles event for with/without bypass
>>>> coroutine:
>>>>
>>>> http://pastebin.com/ae0vnQ6V
>>>>
>>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>>> without bypass coroutine.
>>>
>>> Yeah, I can count at least 3.3% time spent here:
>>>
>>> 0.87% bdrv_co_do_preadv
>>> 0.79% bdrv_aligned_preadv
>>> 0.71% qemu_coroutine_switch
>>> 0.52% tracked_request_begin
>>> 0.45% coroutine_swap
>>>
>>> Another ~3% wasted in malloc, etc.
>>
>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>> In this post I didn't apply Stefan's coroutine resize patch which might
>> decrease usage of malloc() for coroutine.
>
> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
> size".

No problem, will do that.  Actually, in my last RFC post this patchset
was against your coroutine resize patches.

I will provide the profile data tomorrow.

>> At least, coroutine isn't cheap from the profile result.
>
> Instead of bypassing coroutines we should first understand the overhead
> that they impose.  Is it due to the coroutine implementation (switching
> stacks) or due to the bdrv_co_*() code that happens to use coroutines
> but slow for other reasons.

From the 3rd patch (block: support to bypass qemu coroutine) and the 5th
patch (dataplane: enable selective bypassing coroutine), the change is
to bypass the coroutine and BH while the rest of the bdrv code path
stays the same, so it is due to the coroutine implementation, IMO.

Thanks,
On 01/08/2014 15:48, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>> In this post I didn't apply Stefan's coroutine resize patch which might
>>> decrease usage of malloc() for coroutine.
>>
>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> size".
>
> No problem, will do that.  Actually in my last post with rfc, this patchset
> was against your coroutine resize patches.
>
> I will provide the profile data tomorrow.
>
>>> At least, coroutine isn't cheap from the profile result.
>>
>> Instead of bypassing coroutines we should first understand the overhead
>> that they impose.  Is it due to the coroutine implementation (switching
>> stacks) or due to the bdrv_co_*() code that happens to use coroutines
>> but slow for other reasons.
>
> From the 3th patch (block: support to bypass qemu coroutinue)
> and the 5th patch (dataplane: enable selective bypassing coroutine),
> the change is to bypass coroutine and BH, and the other bdrv code
> path is same, so it is due to the coroutine implementation, IMO.

But your code breaks all sorts of invariants.  For example, the aiocb
must be valid when bdrv_aio_readv/writev return.  virtio-blk does not
use it, but virtio-scsi does.  If we apply your patches now, we will
have to redo it soon.

Basically we should be rewriting parts of block.c so that
bdrv_co_readv/writev calls bdrv_aio_readv/writev instead of vice versa.
Coroutine creation should be pushed down to
bdrv_aligned_preadv/bdrv_aligned_pwritev and, in the fast path, you can
simply call the driver's bdrv_aio_readv/writev.

Paolo
On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>> In this post I didn't apply Stefan's coroutine resize patch which might
>>> decrease usage of malloc() for coroutine.
>>
>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> size".
>
> No problem, will do that.  Actually in my last post with rfc, this patchset
> was against your coroutine resize patches.
>
> I will provide the profile data tomorrow.

Please see the link below for the profile without bypassing coroutines,
with your coroutine resize patches (V3) applied:

http://pastebin.com/10y00sir

BTW, if anyone would like to play with this stuff, these patches can be
pulled easily from the tree below:

git://kernel.ubuntu.com/ming/qemu.git

branches:
- v2.1.0-mq-co-resize: these 15 patches against 2.1.0-rc5 plus Stefan's
  coroutine resize patches (V3)
- v2.1.0-mq: only these 15 patches against 2.1.0-rc5

Thanks,
On Fri, Aug 1, 2014 at 10:17 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 01/08/2014 15:48, Ming Lei ha scritto:
>> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> Il 31/07/2014 18:13, Ming Lei ha scritto:
>>>>>> Follows 'perf report' result on cycles event for with/without bypass
>>>>>> coroutine:
>>>>>>
>>>>>> http://pastebin.com/ae0vnQ6V
>>>>>>
>>>>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>>>>> without bypass coroutine.
>>>>>
>>>>> Yeah, I can count at least 3.3% time spent here:
>>>>>
>>>>> 0.87% bdrv_co_do_preadv
>>>>> 0.79% bdrv_aligned_preadv
>>>>> 0.71% qemu_coroutine_switch
>>>>> 0.52% tracked_request_begin
>>>>> 0.45% coroutine_swap
>>>>>
>>>>> Another ~3% wasted in malloc, etc.
>>>>
>>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>>> In this post I didn't apply Stefan's coroutine resize patch, which might
>>>> decrease usage of malloc() for coroutines.
>>>
>>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>>> size".
>>
>> No problem, will do that. Actually in my last post with rfc, this patchset
>> was against your coroutine resize patches.
>>
>> I will provide the profile data tomorrow.
>>
>>>
>>>> At least, coroutine isn't cheap from the profile result.
>>>
>>> Instead of bypassing coroutines we should first understand the overhead
>>> that they impose. Is it due to the coroutine implementation (switching
>>> stacks) or due to the bdrv_co_*() code that happens to use coroutines
>>> but is slow for other reasons?
>>
>> From the 3rd patch (block: support to bypass qemu coroutine)
>> and the 5th patch (dataplane: enable selective bypassing coroutine),
>> the change is to bypass coroutine and BH, and the other bdrv code
>> path is the same, so it is due to the coroutine implementation, IMO.
>
> But your code breaks all sorts of invariants. For example, the aiocb
> must be valid when bdrv_aio_readv/writev return. virtio-blk does not
> use it, but virtio-scsi does. If we apply your patches now, we will
> have to redo it soon.

That can be supported by scheduling a BH in bdrv_co_io_em_complete() too.

>
> Basically we should be rewriting parts of block.c so that
> bdrv_co_readv/writev calls bdrv_aio_readv/writev instead of vice versa.
> Coroutine creation should be pushed down to
> bdrv_aligned_preadv/bdrv_aligned_pwritev and, in the fast path, you can
> simply call the driver's bdrv_aio_readv/writev.

This idea is very constructive, and I will investigate further to see
if it is easy to start on.

Thanks,
On Fri, Aug 01, 2014 at 10:52:55PM +0800, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
> > On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
> >>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
> >>> >> Follows 'perf report' result on cycles event for with/without bypass
> >>> >> coroutine:
> >>> >>
> >>> >> http://pastebin.com/ae0vnQ6V
> >>> >>
> >>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> >>> >> without bypass coroutine.
> >>> >
> >>> > Yeah, I can count at least 3.3% time spent here:
> >>> >
> >>> > 0.87% bdrv_co_do_preadv
> >>> > 0.79% bdrv_aligned_preadv
> >>> > 0.71% qemu_coroutine_switch
> >>> > 0.52% tracked_request_begin
> >>> > 0.45% coroutine_swap
> >>> >
> >>> > Another ~3% wasted in malloc, etc.
> >>>
> >>> That should be related with coroutine and the BH in bdrv_co_do_rw().
> >>> In this post I didn't apply Stefan's coroutine resize patch, which might
> >>> decrease usage of malloc() for coroutines.
> >>
> >> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
> >> size".
> >
> > No problem, will do that. Actually in my last post with rfc, this patchset
> > was against your coroutine resize patches.
> >
> > I will provide the profile data tomorrow.
>
> Please see the link below for without bypass coroutine, and with
> your coroutine resize patches (V3):
>
> http://pastebin.com/10y00sir

Thanks for sharing!

Do you have the results (IOPS and perf report) for just the coroutine
bypass (but not the other changes in this patch series)?

Coroutine: 101k IOPS
Bypass:    ? IOPS

Stefan
On Sat, Aug 2, 2014 at 12:03 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 10:52:55PM +0800, Ming Lei wrote:
>> On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
>> > On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> >> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>> >>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
>> >>> >> Follows 'perf report' result on cycles event for with/without bypass
>> >>> >> coroutine:
>> >>> >>
>> >>> >> http://pastebin.com/ae0vnQ6V
>> >>> >>
>> >>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> >>> >> without bypass coroutine.
>> >>> >
>> >>> > Yeah, I can count at least 3.3% time spent here:
>> >>> >
>> >>> > 0.87% bdrv_co_do_preadv
>> >>> > 0.79% bdrv_aligned_preadv
>> >>> > 0.71% qemu_coroutine_switch
>> >>> > 0.52% tracked_request_begin
>> >>> > 0.45% coroutine_swap
>> >>> >
>> >>> > Another ~3% wasted in malloc, etc.
>> >>>
>> >>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>> >>> In this post I didn't apply Stefan's coroutine resize patch, which might
>> >>> decrease usage of malloc() for coroutines.
>> >>
>> >> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> >> size".
>> >
>> > No problem, will do that. Actually in my last post with rfc, this patchset
>> > was against your coroutine resize patches.
>> >
>> > I will provide the profile data tomorrow.
>>
>> Please see the link below for without bypass coroutine, and with
>> your coroutine resize patches (V3):
>>
>> http://pastebin.com/10y00sir
>
> Thanks for sharing!
>
> Do you have the results (IOPS and perf report) for just the coroutine
> bypass (but not the other changes in this patch series)?
>
> Coroutine: 101k IOPS
> Bypass:    ?
> IOPS

Please see the link below; sorry for missing the data for the bypass
coroutine case, which was collected under the same conditions:

http://pastebin.com/JqrpF87G

Thanks,
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index b408f96..46d2642 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  * Note that this function clobbers the handlers for the file descriptor.
  */
 void coroutine_fn yield_until_fd_readable(int fd);
+
+/* qemu coroutine bypass APIs */
+void qemu_coroutine_set_bypass(bool bypass);
+bool qemu_coroutine_bypassed(Coroutine *self);
+bool qemu_coroutine_self_bypassed(void);
+void qemu_coroutine_set_var(void *var);
+void *qemu_coroutine_get_var(void);
+
 #endif /* QEMU_COROUTINE_H */
diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
index f133d65..106d0b2 100644
--- a/include/block/coroutine_int.h
+++ b/include/block/coroutine_int.h
@@ -39,6 +39,11 @@ struct Coroutine {
     Coroutine *caller;
     QSLIST_ENTRY(Coroutine) pool_next;
 
+    bool bypass;
+
+    /* only used in bypass mode */
+    void *opaque;
+
     /* Coroutines that should be woken up when we yield or terminate */
     QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
     QTAILQ_ENTRY(Coroutine) co_queue_next;
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..7c69ff6 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 
 bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     return qemu_co_queue_do_restart(queue, true);
 }
 
 void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     qemu_co_queue_do_restart(queue, false);
 }
 
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 4708521..0597ed9 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
+
+void qemu_coroutine_set_bypass(bool bypass)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->bypass = bypass;
+}
+
+bool qemu_coroutine_bypassed(Coroutine *self)
+{
+    return self->bypass;
+}
+
+bool qemu_coroutine_self_bypassed(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return qemu_coroutine_bypassed(self);
+}
+
+void qemu_coroutine_set_var(void *var)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->opaque = var;
+}
+
+void *qemu_coroutine_get_var(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return self->opaque;
+}
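A minimal standalone model of how the bypass flag interacts with the relaxed assertions in qemu-coroutine-lock.c: a caller that is not running inside a real coroutine marks itself as bypassing, and the assertion then lets it drive the queue-restart path. The Coroutine struct, qemu_coroutine_self(), and qemu_in_coroutine() below are simplified stubs, not the QEMU implementations:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stub standing in for QEMU's per-thread current coroutine. */
struct Coroutine { bool bypass; void *opaque; };
static struct Coroutine current;

static struct Coroutine *qemu_coroutine_self(void) { return &current; }

/* Same shape as the patch's accessors, operating on the stub. */
static void qemu_coroutine_set_bypass(bool bypass)
{
    qemu_coroutine_self()->bypass = bypass;
}

static bool qemu_coroutine_self_bypassed(void)
{
    return qemu_coroutine_self()->bypass;
}

/* Stub: we are never inside a real coroutine in this model, so only the
 * bypass flag can satisfy the assertion below. */
static bool qemu_in_coroutine(void) { return false; }

static void co_queue_next_checked(void)
{
    /* the relaxed assertion from the patch */
    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
    /* ... a real implementation would restart the next queued coroutine ... */
}
```

With `qemu_coroutine_set_bypass(true)` in effect, `co_queue_next_checked()` passes its assertion even though no coroutine is running; without the flag it would abort, which is exactly the invariant the patch loosens.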
This patch introduces several APIs to support bypassing the qemu
coroutine when it is not necessary, for performance's sake.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/block/coroutine.h     |  8 ++++++++
 include/block/coroutine_int.h |  5 +++++
 qemu-coroutine-lock.c         |  4 ++--
 qemu-coroutine.c              | 33 +++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 2 deletions(-)