Message ID: 1406720388-18671-2-git-send-email-ming.lei@canonical.com
On 30/07/2014 13:39, Ming Lei wrote:
> This patch introduces several APIs for supporting bypass qemu coroutine
> in case of being not necessary and for performance's sake.

No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
anything else is a step backwards.

If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
conditions that allow doing that and call the bdrv_aio_readv/writev
directly.

To begin with, have you benchmarked QEMU and can you provide a trace of
*where* the coroutine overhead lies?

Paolo

> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  include/block/coroutine.h     |  8 ++++++++
>  include/block/coroutine_int.h |  5 +++++
>  qemu-coroutine-lock.c         |  4 ++--
>  qemu-coroutine.c              | 33 +++++++++++++++++++++++++++++++++
>  4 files changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/include/block/coroutine.h b/include/block/coroutine.h
> index b408f96..46d2642 100644
> --- a/include/block/coroutine.h
> +++ b/include/block/coroutine.h
> @@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
>   * Note that this function clobbers the handlers for the file descriptor.
>   */
>  void coroutine_fn yield_until_fd_readable(int fd);
> +
> +/* qemu coroutine bypass APIs */
> +void qemu_coroutine_set_bypass(bool bypass);
> +bool qemu_coroutine_bypassed(Coroutine *self);
> +bool qemu_coroutine_self_bypassed(void);
> +void qemu_coroutine_set_var(void *var);
> +void *qemu_coroutine_get_var(void);
> +
>  #endif /* QEMU_COROUTINE_H */
> diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
> index f133d65..106d0b2 100644
> --- a/include/block/coroutine_int.h
> +++ b/include/block/coroutine_int.h
> @@ -39,6 +39,11 @@ struct Coroutine {
>      Coroutine *caller;
>      QSLIST_ENTRY(Coroutine) pool_next;
>
> +    bool bypass;
> +
> +    /* only used in bypass mode */
> +    void *opaque;
> +
>      /* Coroutines that should be woken up when we yield or terminate */
>      QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
>      QTAILQ_ENTRY(Coroutine) co_queue_next;
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index e4860ae..7c69ff6 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
>
>  bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
>  {
> -    assert(qemu_in_coroutine());
> +    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
>      return qemu_co_queue_do_restart(queue, true);
>  }
>
>  void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
>  {
> -    assert(qemu_in_coroutine());
> +    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
>      qemu_co_queue_do_restart(queue, false);
>  }
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 4708521..0597ed9 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
>      self->caller = NULL;
>      coroutine_swap(self, to);
>  }
> +
> +void qemu_coroutine_set_bypass(bool bypass)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    self->bypass = bypass;
> +}
> +
> +bool qemu_coroutine_bypassed(Coroutine *self)
> +{
> +    return self->bypass;
> +}
> +
> +bool qemu_coroutine_self_bypassed(void)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    return qemu_coroutine_bypassed(self);
> +}
> +
> +void qemu_coroutine_set_var(void *var)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    self->opaque = var;
> +}
> +
> +void *qemu_coroutine_get_var(void)
> +{
> +    Coroutine *self = qemu_coroutine_self();
> +
> +    return self->opaque;
> +}
On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 13:39, Ming Lei wrote:
>> This patch introduces several APIs for supporting bypass qemu coroutine
>> in case of being not necessary and for performance's sake.
>
> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
> anything else is a step backwards.

As we saw, coroutines have brought a performance regression to dataplane,
and it isn't necessary to use them in some cases, is it?

> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
> conditions that allow doing that and call the bdrv_aio_readv/writev
> directly.

That is easy to detect, please see the 5th patch.

> To begin with, have you benchmarked QEMU and can you provide a trace of
> *where* the coroutine overhead lies?

I guess it may be caused by the stack switch; at least on one of my
boxes, bypassing coroutines can improve throughput by ~7%, and by ~15%
on another box.

Thanks,
On 30/07/2014 19:15, Ming Lei wrote:
> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 30/07/2014 13:39, Ming Lei wrote:
>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>> in case of being not necessary and for performance's sake.
>>
>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>> anything else is a step backwards.
>
> As we saw, coroutine has brought up performance regression
> on dataplane, and it isn't necessary to use co in some cases, is it?

Yes, and it's not necessary on non-dataplane either.  It's not necessary
on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
either.

>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>> conditions that allow doing that and call the bdrv_aio_readv/writev
>> directly.
>
> That is easy to detect, please see the 5th patch.

No, that's not enough.  Dataplane right now prevents block jobs, but
that's going to change and it could require coroutines even for raw
devices.

>> To begin with, have you benchmarked QEMU and can you provide a trace of
>> *where* the coroutine overhead lies?
>
> I guess it may be caused by the stack switch, at least in one of
> my box, bypassing co can improve throughput by ~7%, and by
> ~15% in another box.

No guesses please.  Actually that's also my guess, but since you are
submitting the patch you must do better and show profiles where stack
switching disappears after the patches.

Paolo
On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.

It is good to know these cases, so they might benefit from this patch
in the future too, :-)

>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw
> devices.

Could you explain a bit why coroutines would be required for raw devices
in the future?

If the cases that don't require coroutines can be detected beforehand,
these patches are useful, IMO.

>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

OK, no problem, I will provide some profile results.

Thanks,
On Thursday 31 Jul 2014 at 11:55:28 (+0800), Ming Lei wrote:
> On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 30/07/2014 19:15, Ming Lei wrote:
>>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>>> in case of being not necessary and for performance's sake.
>>>>
>>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>>> anything else is a step backwards.
>>>
>>> As we saw, coroutine has brought up performance regression
>>> on dataplane, and it isn't necessary to use co in some cases, is it?
>>
>> Yes, and it's not necessary on non-dataplane either.  It's not necessary
>> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
>> either.
>
> It is good to know these cases, so they might benefit from this patch
> in future too, :-)
>
>>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>>> directly.
>>>
>>> That is easy to detect, please see the 5th patch.
>>
>> No, that's not enough.  Dataplane right now prevents block jobs, but
>> that's going to change and it could require coroutines even for raw
>> devices.
>
> Could you explain it a bit why co is required for raw devices in future?

Block mirroring of a device, for example, is done using coroutines.
As block mirroring can be done on a raw device, you need coroutines.

> If some cases for not requiring co can be detected beforehand, these
> patches are useful, IMO.
>
>>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>>> *where* the coroutine overhead lies?
>>>
>>> I guess it may be caused by the stack switch, at least in one of
>>> my box, bypassing co can improve throughput by ~7%, and by
>>> ~15% in another box.
>>
>> No guesses please.  Actually that's also my guess, but since you are
>> submitting the patch you must do better and show profiles where stack
>> switching disappears after the patches.
>
> OK, no problem, I will provide some profile result.
>
> Thanks,
On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.
>
>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw
> devices.
>
>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

Below are the hardware events reported by 'perf stat' when running the
fio randread benchmark for 2 min in the VM (single vq, 2 jobs):

sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./nqemu-start-mq 4 1

1) without bypassing coroutine (by forcing 's->raw_format' to false, see
patch 5/15)

- throughput: 95K

 Performance counter stats for './nqemu-start-mq 4 1':

     69,231,035,842 L1-dcache-loads                                     [40.10%]
      1,909,978,930 L1-dcache-load-misses  # 2.76% of all L1-dcache hits [39.98%]
    263,731,501,086 cpu-cycles                                          [40.03%]
    232,564,905,115 instructions           # 0.88 insns per cycle       [50.23%]
     46,157,868,745 branch-instructions                                 [49.82%]
        785,618,591 branch-misses          # 1.70% of all branches      [49.99%]
     46,280,342,654 branch-loads                                        [49.95%]
     34,934,790,140 branch-load-misses                                  [50.02%]
     69,447,857,237 dTLB-loads                                          [40.13%]
        169,617,374 dTLB-load-misses       # 0.24% of all dTLB cache hits [40.04%]

      161.991075781 seconds time elapsed

2) with bypassing coroutine

- throughput: 115K

 Performance counter stats for './nqemu-start-mq 4 1':

     76,784,224,509 L1-dcache-loads                                     [39.93%]
      1,334,036,447 L1-dcache-load-misses  # 1.74% of all L1-dcache hits [39.91%]
    262,697,428,470 cpu-cycles                                          [40.03%]
    255,526,629,881 instructions           # 0.97 insns per cycle       [50.01%]
     50,160,082,611 branch-instructions                                 [49.97%]
        564,407,788 branch-misses          # 1.13% of all branches      [50.08%]
     50,331,510,702 branch-loads                                        [50.08%]
     35,760,766,459 branch-load-misses                                  [50.03%]
     76,706,000,951 dTLB-loads                                          [40.00%]
        123,291,001 dTLB-load-misses       # 0.16% of all dTLB cache hits [40.02%]

      162.333465490 seconds time elapsed
On 31/07/2014 10:59, Ming Lei wrote:
>>> No guesses please.  Actually that's also my guess, but since you are
>>> submitting the patch you must do better and show profiles where stack
>>> switching disappears after the patches.
>
> Follows the below hardware events reported by 'perf stat' when running
> fio randread benchmark for 2min in VM (single vq, 2 jobs):
>
> sudo ~/bin/perf stat -e
> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
> ./nqemu-start-mq 4 1
>
> 1), without bypassing coroutine via forcing to set 's->raw_format' as
> false, see patch 5/15
>
> - throughput: 95K
> 232,564,905,115 instructions
> 161.991075781 seconds time elapsed
>
> 2), with bypassing coroutine
> - throughput: 115K
> 255,526,629,881 instructions
> 162.333465490 seconds time elapsed

Ok, so you are saving 10% of instructions per iop: before, 232G / 95K =
2.45M instructions/iop; after, 255G / 115K = 2.22M instructions/iop.

That's not small, and it's a good thing for CPU utilization even if you
were not increasing iops.  On top of this, can you provide the stack
traces to see the difference in the profiles?

Paolo
On Thu, Jul 31, 2014 at 3:37 PM, Benoît Canet <benoit.canet@irqsave.net> wrote:
> On Thursday 31 Jul 2014 at 11:55:28 (+0800), Ming Lei wrote:
>> On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 30/07/2014 19:15, Ming Lei wrote:
>>>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>>>> in case of being not necessary and for performance's sake.
>>>>>
>>>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>>>> anything else is a step backwards.
>>>>
>>>> As we saw, coroutine has brought up performance regression
>>>> on dataplane, and it isn't necessary to use co in some cases, is it?
>>>
>>> Yes, and it's not necessary on non-dataplane either.  It's not necessary
>>> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
>>> either.
>>
>> It is good to know these cases, so they might benefit from this patch
>> in future too, :-)
>>
>>>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>>>> directly.
>>>>
>>>> That is easy to detect, please see the 5th patch.
>>>
>>> No, that's not enough.  Dataplane right now prevents block jobs, but
>>> that's going to change and it could require coroutines even for raw
>>> devices.
>>
>> Could you explain it a bit why co is required for raw devices in future?
>
> Block mirroring of a device for example is done using coroutines.
> As block mirroring can be done on a raw device you need coroutines.

If the block layer knows that mirroring is in progress, it can still
enable coroutines by ignoring the bypass, or just let the device disable
the coroutine bypass in case of mirroring; the current APIs are very
flexible.

Thanks,
On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 10:59, Ming Lei wrote:
>>>> No guesses please.  Actually that's also my guess, but since you are
>>>> submitting the patch you must do better and show profiles where stack
>>>> switching disappears after the patches.
>> Follows the below hardware events reported by 'perf stat' when running
>> fio randread benchmark for 2min in VM (single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e
>> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
>> ./nqemu-start-mq 4 1
>>
>> 1), without bypassing coroutine via forcing to set 's->raw_format' as
>> false, see patch 5/15
>>
>> - throughput: 95K
>> 232,564,905,115 instructions
>> 161.991075781 seconds time elapsed
>>
>> 2), with bypassing coroutine
>> - throughput: 115K
>> 255,526,629,881 instructions
>> 162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% instructions per iop: before 232G / 95K =
> 2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.

I am wondering if that is meaningful, since the IOP count is from the
VM's point of view while the instruction count is from the host (qemu).
If the qemu dataplane handles I/O quickly enough, it can save
instructions by batching operations.

But I will collect the 'perf report' on the 'instructions' event for
you.

With coroutines, the L1-dcache-load-miss ratio increases by about one
percentage point (1.74% -> 2.76%), and instructions per cycle drop by
about 10% (0.97 -> 0.88), which is big too, and means qemu becomes much
slower when running block I/O in the VM via coroutines.

Thanks,
On 31/07/2014 11:47, Ming Lei wrote:
>> Block mirroring of a device for example is done using coroutines.
>> As block mirroring can be done on a raw device you need coroutines.
>
> If block layer knows the mirroring is in-progress, it still can enable
> coroutine by ignoring bypass coroutine, or just let device disable
> bypass coroutine in case of mirroring, and the current APIs are very
> flexible.

What matters is not whether you're mirroring.  What matters is whether
you're calling bdrv_aio_readv/writev or bdrv_co_readv/writev.  Under
some limitations, if you are calling bdrv_aio_readv/writev you can
bypass coroutines.  (In fact drive mirroring uses those APIs too, so it
doesn't need coroutines and can benefit from the speedup too. :)  But
drive-backup does need them.)

The limitations are essentially "would bdrv_co_do_preadv or
bdrv_co_do_pwritev do anything special?"  This means, for example, for
bdrv_co_do_pwritev:

- bs->drv must be non-NULL

- bs->read_only must be false

- bdrv_check_byte_request(bs, offset, bytes) must return false

- bs->io_limits_enabled must be false

- the request must be aligned

- (in bdrv_aligned_pwritev) the before_write_notifiers must be empty

- (in bdrv_aligned_pwritev) bs->detect_zeroes must be off

- (in bdrv_aligned_pwritev) the BDRV_REQ_ZERO_WRITE flag must be off

- (in bdrv_aligned_pwritev) bs->enable_write_cache must be false

and the hard part is organizing the code so that the code duplication is
as limited as possible.

Paolo
On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 10:59, Ming Lei wrote:
>>>> No guesses please.  Actually that's also my guess, but since you are
>>>> submitting the patch you must do better and show profiles where stack
>>>> switching disappears after the patches.
>> Follows the below hardware events reported by 'perf stat' when running
>> fio randread benchmark for 2min in VM (single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e
>> L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses
>> ./nqemu-start-mq 4 1
>>
>> 1), without bypassing coroutine via forcing to set 's->raw_format' as
>> false, see patch 5/15
>>
>> - throughput: 95K
>> 232,564,905,115 instructions
>> 161.991075781 seconds time elapsed
>>
>> 2), with bypassing coroutine
>> - throughput: 115K
>> 255,526,629,881 instructions
>> 162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% instructions per iop: before 232G / 95K =
> 2.45M instructions/iop, 255G / 115K = 2.22M instructions/iop.
>
> That's not small, and it's a good thing for CPU utilization even if you
> were not increasing iops.  On top of this, can you provide the stack
> traces to see the difference in the profiles?

Here is the 'perf report' result on the cycles event, with and without
bypassing coroutines:

http://pastebin.com/ae0vnQ6V

From the profiling result, it looks like bdrv_co_do_preadv() is a bit
slow without bypassing coroutines.

Thanks,
On 31/07/2014 18:13, Ming Lei wrote:
> Follows 'perf report' result on cycles event for with/without bypass
> coroutine:
>
> http://pastebin.com/ae0vnQ6V
>
> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> without bypass coroutine.

Yeah, I can count at least 3.3% of time spent here:

    0.87%  bdrv_co_do_preadv
    0.79%  bdrv_aligned_preadv
    0.71%  qemu_coroutine_switch
    0.52%  tracked_request_begin
    0.45%  coroutine_swap

Another ~3% is wasted in malloc, etc.

I suggest that we discuss it on the phone at next Tuesday's KVM call.

I'm not denying that bypassing coroutines is a useful thing to do.  But
*everyone* should be doing that if it is so useful, and it's hard to do
it without causing code duplication.

Paolo
On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 18:13, Ming Lei wrote:
>> Follows 'perf report' result on cycles event for with/without bypass
>> coroutine:
>>
>> http://pastebin.com/ae0vnQ6V
>>
>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> without bypass coroutine.
>
> Yeah, I can count at least 3.3% time spent here:
>
> 0.87% bdrv_co_do_preadv
> 0.79% bdrv_aligned_preadv
> 0.71% qemu_coroutine_switch
> 0.52% tracked_request_begin
> 0.45% coroutine_swap
>
> Another ~3% wasted in malloc, etc.

That should be related to coroutines and the BH in bdrv_co_do_rw().
In this post I didn't apply Stefan's coroutine resize patches, which
might decrease the usage of malloc() for coroutines.

At least, coroutines aren't cheap, judging from the profile result.

> I suggest that we discuss it on the phone at next Tuesday's KVM call.

No problem, and we can continue to discuss it on the mailing list too.
If you or anyone else needs more profiling data, please feel free to let
me know, and I am happy to provide it.

> I'm not denying that bypassing coroutines is a useful thing to do.  But
> *everyone* should be doing that if it is so useful, and it's hard to do
> it without causing code duplication.

Yes, I agree, and the patch is designed that way, more or less.  If a
device knows in advance that coroutines aren't required, it just calls
qemu_aio_set_bypass_co() to notify the block layer not to use
coroutines; that is the approach taken in this patch.

Thanks,
On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 31/07/2014 18:13, Ming Lei wrote:
>>> Follows 'perf report' result on cycles event for with/without bypass
>>> coroutine:
>>>
>>> http://pastebin.com/ae0vnQ6V
>>>
>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>> without bypass coroutine.
>>
>> Yeah, I can count at least 3.3% time spent here:
>>
>> 0.87% bdrv_co_do_preadv
>> 0.79% bdrv_aligned_preadv
>> 0.71% qemu_coroutine_switch
>> 0.52% tracked_request_begin
>> 0.45% coroutine_swap
>>
>> Another ~3% wasted in malloc, etc.
>
> That should be related with coroutine and the BH in bdrv_co_do_rw().
> In this post I didn't apply Stefan's coroutine resize patch which might
> decrease usage of malloc() for coroutine.

Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
size".

> At least, coroutine isn't cheap from the profile result.

Instead of bypassing coroutines we should first understand the overhead
that they impose.  Is it due to the coroutine implementation (switching
stacks), or due to the bdrv_co_*() code that happens to use coroutines
but is slow for other reasons?
On Thu, Jul 31, 2014 at 6:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 31/07/2014 11:47, Ming Lei wrote:
>>> Block mirroring of a device for example is done using coroutines.
>>> As block mirroring can be done on a raw device you need coroutines.
>>
>> If block layer knows the mirroring is in-progress, it still can enable
>> coroutine by ignoring bypass coroutine, or just let device disable
>> bypass coroutine in case of mirroring, and the current APIs are very
>> flexible.
>
> What matters is not whether you're mirroring.  What matters is whether
> you're calling bdrv_aio_readv/writev or bdrv_co_readv/writev.  Under
> some limitation, if you are calling bdrv_aio_readv/writev you can bypass
> coroutines.  (In fact drive mirroring uses those APIs too so it doesn't
> need coroutines and can benefit from the speedup too. :)  But
> drive-backup does need them.)
>
> The limitations are essentially "would bdrv_co_do_preadv or
> bdrv_co_do_pwritev do anything special?"  This means for example for
> bdrv_co_do_pwritev:
>
> - bs->drv must be non-NULL
>
> - bs->read_only must be false
>
> - bdrv_check_byte_request(bs, offset, bytes) must return false
>
> - bs->io_limits_enabled must be false
>
> - the request must be aligned
>
> - (in bdrv_aligned_pwritev) the before_write_notifiers must be empty
>
> - (in bdrv_aligned_pwritev) bs->detect_zeroes must be off
>
> - (in bdrv_aligned_pwritev) the BDRV_REQ_ZERO_WRITE flag must be off
>
> - (in bdrv_aligned_pwritev) bs->enable_write_cache must be false
>
> and the hard part is organizing the code so that the code duplication is
> as limited as possible.

In this patchset, only the implementation of bypassing coroutines is
introduced, and it is simply applied to raw images for dataplane; it
won't break drive mirror & backup, if I understand correctly.  So I
suggest we focus on the implementation of bypassing coroutines.

All of the above is mostly about whether the bypass can be applied in
other cases too, so this patchset shouldn't have caused code
duplication.

I would very much appreciate it if the implementation of bypassing
coroutines could be reviewed in this discussion.

Thanks,
On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> On 31/07/2014 18:13, Ming Lei wrote:
>>>> Follows 'perf report' result on cycles event for with/without bypass
>>>> coroutine:
>>>>
>>>> http://pastebin.com/ae0vnQ6V
>>>>
>>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>>> without bypass coroutine.
>>>
>>> Yeah, I can count at least 3.3% time spent here:
>>>
>>> 0.87% bdrv_co_do_preadv
>>> 0.79% bdrv_aligned_preadv
>>> 0.71% qemu_coroutine_switch
>>> 0.52% tracked_request_begin
>>> 0.45% coroutine_swap
>>>
>>> Another ~3% wasted in malloc, etc.
>>
>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>> In this post I didn't apply Stefan's coroutine resize patch which might
>> decrease usage of malloc() for coroutine.
>
> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
> size".

No problem, will do that.  Actually, in my last RFC post this patchset
was against your coroutine resize patches.

I will provide the profile data tomorrow.

>> At least, coroutine isn't cheap from the profile result.
>
> Instead of bypassing coroutines we should first understand the overhead
> that they impose.  Is it due to the coroutine implementation (switching
> stacks) or due to the bdrv_co_*() code that happens to use coroutines
> but slow for other reasons.

From the 3rd patch (block: support to bypass qemu coroutine) and the 5th
patch (dataplane: enable selective bypassing coroutine), the change is
to bypass the coroutine and BH while the rest of the bdrv code path
stays the same, so it is due to the coroutine implementation, IMO.

Thanks,
On 01/08/2014 15:48, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>> In this post I didn't apply Stefan's coroutine resize patch which might
>>> decrease usage of malloc() for coroutine.
>>
>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> size".
>
> No problem, will do that.  Actually in my last post with rfc, this patchset
> was against your coroutine resize patches.
>
> I will provide the profile data tomorrow.
>
>>> At least, coroutine isn't cheap from the profile result.
>>
>> Instead of bypassing coroutines we should first understand the overhead
>> that they impose.  Is it due to the coroutine implementation (switching
>> stacks) or due to the bdrv_co_*() code that happens to use coroutines
>> but slow for other reasons.
>
> From the 3th patch (block: support to bypass qemu coroutinue)
> and the 5th patch (dataplane: enable selective bypassing coroutine),
> the change is to bypass coroutine and BH, and the other bdrv code
> path is same, so it is due to the coroutine implementation, IMO.

But your code breaks all sorts of invariants.  For example, the aiocb
must be valid when bdrv_aio_readv/writev return.  virtio-blk does not
use it, but virtio-scsi does.  If we apply your patches now, we will
have to redo it soon.

Basically we should be rewriting parts of block.c so that
bdrv_co_readv/writev calls bdrv_aio_readv/writev instead of vice versa.
Coroutine creation should be pushed down to
bdrv_aligned_preadv/bdrv_aligned_pwritev and, in the fast path, you can
simply call the driver's bdrv_aio_readv/writev.

Paolo
On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>> In this post I didn't apply Stefan's coroutine resize patch which might
>>> decrease usage of malloc() for coroutine.
>>
>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> size".
>
> No problem, will do that.  Actually in my last post with rfc, this patchset
> was against your coroutine resize patches.
>
> I will provide the profile data tomorrow.

Please see the link below for the profile without bypassing coroutines,
with your coroutine resize patches (V3) applied:

http://pastebin.com/10y00sir

BTW, if anyone would like to play with this stuff, these patches can be
pulled easily from the tree below:

git://kernel.ubuntu.com/ming/qemu.git

branches:
- v2.1.0-mq-co-resize: these 15 patches against 2.1.0-rc5 plus Stefan's
  coroutine resize patches (V3)
- v2.1.0-mq: only these 15 patches against 2.1.0-rc5

Thanks,
On Fri, Aug 1, 2014 at 10:17 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 01/08/2014 15:48, Ming Lei ha scritto:
>> On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>>>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> Il 31/07/2014 18:13, Ming Lei ha scritto:
>>>>>> Follows 'perf report' result on cycles event for with/without bypass
>>>>>> coroutine:
>>>>>>
>>>>>> http://pastebin.com/ae0vnQ6V
>>>>>>
>>>>>> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>>>>>> without bypass coroutine.
>>>>>
>>>>> Yeah, I can count at least 3.3% time spent here:
>>>>>
>>>>> 0.87% bdrv_co_do_preadv
>>>>> 0.79% bdrv_aligned_preadv
>>>>> 0.71% qemu_coroutine_switch
>>>>> 0.52% tracked_request_begin
>>>>> 0.45% coroutine_swap
>>>>>
>>>>> Another ~3% wasted in malloc, etc.
>>>>
>>>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>>>> In this post I didn't apply Stefan's coroutine resize patch, which might
>>>> decrease usage of malloc() for coroutines.
>>>
>>> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>>> size".
>>
>> No problem, will do that. Actually in my last post with rfc, this patchset
>> was against your coroutine resize patches.
>>
>> I will provide the profile data tomorrow.
>>
>>>
>>>> At least, coroutine isn't cheap from the profile result.
>>>
>>> Instead of bypassing coroutines we should first understand the overhead
>>> that they impose. Is it due to the coroutine implementation (switching
>>> stacks) or due to the bdrv_co_*() code that happens to use coroutines
>>> but is slow for other reasons?
>>
>> From the 3rd patch (block: support to bypass qemu coroutine)
>> and the 5th patch (dataplane: enable selective bypassing coroutine),
>> the change is to bypass coroutine and BH, and the other bdrv code
>> path is the same, so it is due to the coroutine implementation, IMO.
>
> But your code breaks all sorts of invariants. For example, the aiocb
> must be valid when bdrv_aio_readv/writev return. virtio-blk does not
> use it, but virtio-scsi does. If we apply your patches now, we will
> have to redo it soon.

That can be supported by scheduling a BH in bdrv_co_io_em_complete() too.

>
> Basically we should be rewriting parts of block.c so that
> bdrv_co_readv/writev calls bdrv_aio_readv/writev instead of vice versa.
> Coroutine creation should be pushed down to
> bdrv_aligned_preadv/bdrv_aligned_pwritev and, in the fast path, you can
> simply call the driver's bdrv_aio_readv/writev.

This idea is very constructive, and I will investigate further to see
if it is easy to start on.

Thanks,
On Fri, Aug 01, 2014 at 10:52:55PM +0800, Ming Lei wrote:
> On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
> > On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
> >>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
> >>> >> Follows 'perf report' result on cycles event for with/without bypass
> >>> >> coroutine:
> >>> >>
> >>> >> http://pastebin.com/ae0vnQ6V
> >>> >>
> >>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
> >>> >> without bypass coroutine.
> >>> >
> >>> > Yeah, I can count at least 3.3% time spent here:
> >>> >
> >>> > 0.87% bdrv_co_do_preadv
> >>> > 0.79% bdrv_aligned_preadv
> >>> > 0.71% qemu_coroutine_switch
> >>> > 0.52% tracked_request_begin
> >>> > 0.45% coroutine_swap
> >>> >
> >>> > Another ~3% wasted in malloc, etc.
> >>>
> >>> That should be related with coroutine and the BH in bdrv_co_do_rw().
> >>> In this post I didn't apply Stefan's coroutine resize patch, which might
> >>> decrease usage of malloc() for coroutines.
> >>
> >> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
> >> size".
> >
> > No problem, will do that. Actually in my last post with rfc, this patchset
> > was against your coroutine resize patches.
> >
> > I will provide the profile data tomorrow.
>
> Please see the link below for without bypass coroutine, and with
> your coroutine resize patches (V3):
>
> http://pastebin.com/10y00sir

Thanks for sharing!

Do you have the results (IOPS and perf report) for just the coroutine
bypass (but not the other changes in this patch series)?

Coroutine: 101k IOPS
Bypass:    ? IOPS

Stefan
On Sat, Aug 2, 2014 at 12:03 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Aug 01, 2014 at 10:52:55PM +0800, Ming Lei wrote:
>> On Fri, Aug 1, 2014 at 9:48 PM, Ming Lei <ming.lei@canonical.com> wrote:
>> > On Fri, Aug 1, 2014 at 9:13 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> >> On Fri, Aug 01, 2014 at 10:54:02AM +0800, Ming Lei wrote:
>> >>> On Fri, Aug 1, 2014 at 12:30 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >>> > Il 31/07/2014 18:13, Ming Lei ha scritto:
>> >>> >> Follows 'perf report' result on cycles event for with/without bypass
>> >>> >> coroutine:
>> >>> >>
>> >>> >> http://pastebin.com/ae0vnQ6V
>> >>> >>
>> >>> >> From the profiling result, looks bdrv_co_do_preadv() is a bit slow
>> >>> >> without bypass coroutine.
>> >>> >
>> >>> > Yeah, I can count at least 3.3% time spent here:
>> >>> >
>> >>> > 0.87% bdrv_co_do_preadv
>> >>> > 0.79% bdrv_aligned_preadv
>> >>> > 0.71% qemu_coroutine_switch
>> >>> > 0.52% tracked_request_begin
>> >>> > 0.45% coroutine_swap
>> >>> >
>> >>> > Another ~3% wasted in malloc, etc.
>> >>>
>> >>> That should be related with coroutine and the BH in bdrv_co_do_rw().
>> >>> In this post I didn't apply Stefan's coroutine resize patch, which might
>> >>> decrease usage of malloc() for coroutines.
>> >>
>> >> Please rerun with "[PATCH v3 0/2] coroutine: dynamically scale pool
>> >> size".
>> >
>> > No problem, will do that. Actually in my last post with rfc, this patchset
>> > was against your coroutine resize patches.
>> >
>> > I will provide the profile data tomorrow.
>>
>> Please see the link below for without bypass coroutine, and with
>> your coroutine resize patches (V3):
>>
>> http://pastebin.com/10y00sir
>
> Thanks for sharing!
>
> Do you have the results (IOPS and perf report) for just the coroutine
> bypass (but not the other changes in this patch series)?
>
> Coroutine: 101k IOPS
> Bypass:    ?
> IOPS

Please see the link below; sorry for missing the data for the bypass
coroutine case, which was collected under the same conditions:

http://pastebin.com/JqrpF87G

Thanks,
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index b408f96..46d2642 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -223,4 +223,12 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  * Note that this function clobbers the handlers for the file descriptor.
  */
 void coroutine_fn yield_until_fd_readable(int fd);
+
+/* qemu coroutine bypass APIs */
+void qemu_coroutine_set_bypass(bool bypass);
+bool qemu_coroutine_bypassed(Coroutine *self);
+bool qemu_coroutine_self_bypassed(void);
+void qemu_coroutine_set_var(void *var);
+void *qemu_coroutine_get_var(void);
+
 #endif /* QEMU_COROUTINE_H */
diff --git a/include/block/coroutine_int.h b/include/block/coroutine_int.h
index f133d65..106d0b2 100644
--- a/include/block/coroutine_int.h
+++ b/include/block/coroutine_int.h
@@ -39,6 +39,11 @@ struct Coroutine {
     Coroutine *caller;
     QSLIST_ENTRY(Coroutine) pool_next;
 
+    bool bypass;
+
+    /* only used in bypass mode */
+    void *opaque;
+
     /* Coroutines that should be woken up when we yield or terminate */
     QTAILQ_HEAD(, Coroutine) co_queue_wakeup;
     QTAILQ_ENTRY(Coroutine) co_queue_next;
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..7c69ff6 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -82,13 +82,13 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single)
 
 bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     return qemu_co_queue_do_restart(queue, true);
 }
 
 void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
-    assert(qemu_in_coroutine());
+    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
     qemu_co_queue_do_restart(queue, false);
 }
 
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 4708521..0597ed9 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -137,3 +137,36 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
+
+void qemu_coroutine_set_bypass(bool bypass)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->bypass = bypass;
+}
+
+bool qemu_coroutine_bypassed(Coroutine *self)
+{
+    return self->bypass;
+}
+
+bool qemu_coroutine_self_bypassed(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return qemu_coroutine_bypassed(self);
+}
+
+void qemu_coroutine_set_var(void *var)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    self->opaque = var;
+}
+
+void *qemu_coroutine_get_var(void)
+{
+    Coroutine *self = qemu_coroutine_self();
+
+    return self->opaque;
+}
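A minimal standalone model of how the bypass flag interacts with the relaxed assertions in qemu-coroutine-lock.c: a caller that is not running inside a real coroutine marks itself as bypassing, and the assertion then lets it drive the queue-restart path. The Coroutine struct, qemu_coroutine_self(), and qemu_in_coroutine() below are simplified stubs, not the QEMU implementations:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stub standing in for QEMU's per-thread current coroutine. */
struct Coroutine { bool bypass; void *opaque; };
static struct Coroutine current;

static struct Coroutine *qemu_coroutine_self(void) { return &current; }

/* Same shape as the patch's accessors, operating on the stub. */
static void qemu_coroutine_set_bypass(bool bypass)
{
    qemu_coroutine_self()->bypass = bypass;
}

static bool qemu_coroutine_self_bypassed(void)
{
    return qemu_coroutine_self()->bypass;
}

/* Stub: we are never inside a real coroutine in this model, so only the
 * bypass flag can satisfy the assertion below. */
static bool qemu_in_coroutine(void) { return false; }

static void co_queue_next_checked(void)
{
    /* the relaxed assertion from the patch */
    assert(qemu_in_coroutine() || qemu_coroutine_self_bypassed());
    /* ... a real implementation would restart the next queued coroutine ... */
}
```

With `qemu_coroutine_set_bypass(true)` in effect, `co_queue_next_checked()` passes its assertion even though no coroutine is running; without the flag it would abort, which is exactly the invariant the patch loosens.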
This patch introduces several APIs to support bypassing the qemu
coroutine when it is not necessary, for performance's sake.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/block/coroutine.h     |  8 ++++++++
 include/block/coroutine_int.h |  5 +++++
 qemu-coroutine-lock.c         |  4 ++--
 qemu-coroutine.c              | 33 +++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 2 deletions(-)