
[-qemu] nvme: support Google vendor extension

Message ID 1448178345.7480.2.camel@hasee
State New

Commit Message

Ming Lin Nov. 22, 2015, 7:45 a.m. UTC
On Sat, 2015-11-21 at 13:56 +0100, Paolo Bonzini wrote:
> 
> On 21/11/2015 00:05, Ming Lin wrote:
> > [    1.752129] Freeing unused kernel memory: 420K (ffff880001b97000 - ffff880001c00000)
> > [    1.986573] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30e5c9bbf83, max_idle_ns: 440795378954 ns
> > [    1.988187] clocksource: Switched to clocksource tsc
> > [    3.235423] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [    3.358713] clocksource:                       'refined-jiffies' wd_now: fffeddf3 wd_last: fffedd76 mask: ffffffff
> > [    3.410013] clocksource:                       'tsc' cs_now: 3c121d4ec cs_last: 340888eb7 mask: ffffffffffffffff
> > [    3.450026] clocksource: Switched to clocksource refined-jiffies
> > [    7.696769] Adding 392188k swap on /dev/vda5.  Priority:-1 extents:1 across:392188k 
> > [    7.902174] EXT4-fs (vda1): re-mounted. Opts: (null)
> > [    8.734178] EXT4-fs (vda1): re-mounted. Opts: errors=remount-ro
> > 
> > Then it doesn't respond to input for almost 1 minute.
> > Without this patch, the kernel boots quickly.
> 
> Interesting.  I guess there's time to debug it, since QEMU 2.6 is still
> a few months away.  In the meantime we can apply your patch as is,
> apart from disabling the "if (new_head >= cq->size)" check and the
> similar one for "if (new_tail >= sq->size)".
> 
> But I have a possible culprit.  In your nvme_cq_notifier you are not doing the
> equivalent of:
> 
>         start_sqs = nvme_cq_full(cq) ? 1 : 0;
>         cq->head = new_head;
>         if (start_sqs) {
>             NvmeSQueue *sq;
>             QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
>                 timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>             }
>             timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>         }
> 
> Instead, you are just calling nvme_post_cqes, which is the equivalent of
> 
> 	timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> 
> Adding a loop to nvme_cq_notifier, and having it call nvme_process_sq, might
> fix the weird 1-minute delay.
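
Paolo's suggested loop would look roughly like this (a sketch only, reusing
cq->sq_list, nvme_process_sq() and nvme_post_cqes() as they already exist
in hw/block/nvme.c):

    static void nvme_cq_notifier(EventNotifier *e)
    {
        NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
        NvmeSQueue *sq;

        /* clear the eventfd so the event loop does not fire again at once */
        event_notifier_test_and_clear(&cq->notifier);

        /* completions may have freed room in a previously full CQ, so kick
         * every submission queue attached to it, then post new CQEs */
        QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
            nvme_process_sq(sq);
        }
        nvme_post_cqes(cq);
    }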

I found it.

Here are the new performance numbers:

qemu-nvme + google-ext + eventfd: 294MB/s
virtio-blk: 344MB/s
virtio-scsi: 296MB/s

It's almost the same as virtio-scsi. Nice.

Comments

Ming Lin Nov. 24, 2015, 6:29 a.m. UTC | #1
On Sat, 2015-11-21 at 23:45 -0800, Ming Lin wrote:
> On Sat, 2015-11-21 at 13:56 +0100, Paolo Bonzini wrote:
> > 
> > On 21/11/2015 00:05, Ming Lin wrote:
> > > [    1.752129] Freeing unused kernel memory: 420K (ffff880001b97000 - ffff880001c00000)
> > > [    1.986573] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30e5c9bbf83, max_idle_ns: 440795378954 ns
> > > [    1.988187] clocksource: Switched to clocksource tsc
> > > [    3.235423] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
> > > [    3.358713] clocksource:                       'refined-jiffies' wd_now: fffeddf3 wd_last: fffedd76 mask: ffffffff
> > > [    3.410013] clocksource:                       'tsc' cs_now: 3c121d4ec cs_last: 340888eb7 mask: ffffffffffffffff
> > > [    3.450026] clocksource: Switched to clocksource refined-jiffies
> > > [    7.696769] Adding 392188k swap on /dev/vda5.  Priority:-1 extents:1 across:392188k 
> > > [    7.902174] EXT4-fs (vda1): re-mounted. Opts: (null)
> > > [    8.734178] EXT4-fs (vda1): re-mounted. Opts: errors=remount-ro
> > > 
> > > Then it doesn't respond to input for almost 1 minute.
> > > Without this patch, the kernel boots quickly.
> > 
> > Interesting.  I guess there's time to debug it, since QEMU 2.6 is still
> > a few months away.  In the meantime we can apply your patch as is,
> > apart from disabling the "if (new_head >= cq->size)" check and the
> > similar one for "if (new_tail >= sq->size)".
> > 
> > But I have a possible culprit.  In your nvme_cq_notifier you are not doing the
> > equivalent of:
> > 
> >         start_sqs = nvme_cq_full(cq) ? 1 : 0;
> >         cq->head = new_head;
> >         if (start_sqs) {
> >             NvmeSQueue *sq;
> >             QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> >                 timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> >             }
> >             timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> >         }
> > 
> > Instead, you are just calling nvme_post_cqes, which is the equivalent of
> > 
> > 	timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> > 
> > Adding a loop to nvme_cq_notifier, and having it call nvme_process_sq, might
> > fix the weird 1-minute delay.
> 
> I found it.
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 31572f2..f27fd35 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -548,6 +548,7 @@ static void nvme_cq_notifier(EventNotifier *e)
>      NvmeCQueue *cq =
>          container_of(e, NvmeCQueue, notifier);
>  
> +    event_notifier_test_and_clear(&cq->notifier);
>      nvme_post_cqes(cq);
>  }
>  
> @@ -567,6 +568,7 @@ static void nvme_sq_notifier(EventNotifier *e)
>      NvmeSQueue *sq =
>          container_of(e, NvmeSQueue, notifier);
>  
> +    event_notifier_test_and_clear(&sq->notifier);
>      nvme_process_sq(sq);
>  }
>  
> Here are the new performance numbers:
> 
> qemu-nvme + google-ext + eventfd: 294MB/s
> virtio-blk: 344MB/s
> virtio-scsi: 296MB/s
> 
> It's almost the same as virtio-scsi. Nice.

(CC list stripped)

Looks like "regular MMIO" runs in vcpu thread, while "eventfd MMIO" runs
in the main loop thread.

Could you help to explain why eventfd MMIO gets better performance?

call stack: regular MMIO
========================
nvme_mmio_write (qemu/hw/block/nvme.c:921)
memory_region_write_accessor (qemu/memory.c:451)
access_with_adjusted_size (qemu/memory.c:506)
memory_region_dispatch_write (qemu/memory.c:1158)
address_space_rw (qemu/exec.c:2547)
kvm_cpu_exec (qemu/kvm-all.c:1849)
qemu_kvm_cpu_thread_fn (qemu/cpus.c:1050)
start_thread (pthread_create.c:312)
clone

call stack: eventfd MMIO
=========================
nvme_sq_notifier (qemu/hw/block/nvme.c:598)
aio_dispatch (qemu/aio-posix.c:329)
aio_ctx_dispatch (qemu/async.c:232)
g_main_context_dispatch
glib_pollfds_poll (qemu/main-loop.c:213)
os_host_main_loop_wait (qemu/main-loop.c:257)
main_loop_wait (qemu/main-loop.c:504)
main_loop (qemu/vl.c:1920)
main (qemu/vl.c:4682)
__libc_start_main
Paolo Bonzini Nov. 24, 2015, 11:01 a.m. UTC | #2
On 24/11/2015 07:29, Ming Lin wrote:
>> Here are the new performance numbers:
>>
>> qemu-nvme + google-ext + eventfd: 294MB/s
>> virtio-blk: 344MB/s
>> virtio-scsi: 296MB/s
>>
>> It's almost the same as virtio-scsi. Nice.

Pretty good indeed.

> Looks like "regular MMIO" runs in vcpu thread, while "eventfd MMIO" runs
> in the main loop thread.
> 
> Could you help to explain why eventfd MMIO gets better performance?

Because VCPU latency is really everything if the I/O is very fast _or_
if the queue depth is high; signaling an eventfd is cheap enough to give
a noticeable improvement in VCPU latency. Waking up a sleeping process is
a bit expensive, but if you manage to keep the event loop thread close to
100% CPU, its poll() is usually quite cheap too.
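
For reference, the eventfd path is wired up through the memory API's
ioeventfd support; a minimal sketch (the doorbell offset computation and
the n->iomem field here are assumptions, not the literal patch):

    /* sketch: have KVM signal an eventfd on a guest doorbell write,
     * instead of exiting to the vcpu thread to dispatch the MMIO */
    event_notifier_init(&sq->notifier, 0);
    memory_region_add_eventfd(&n->iomem,
                              0x1000 + sq->sqid * 8, /* assumed doorbell */
                              4, false, 0, &sq->notifier);
    event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);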

> call stack: regular MMIO
> ========================
> nvme_mmio_write (qemu/hw/block/nvme.c:921)
> memory_region_write_accessor (qemu/memory.c:451)
> access_with_adjusted_size (qemu/memory.c:506)
> memory_region_dispatch_write (qemu/memory.c:1158)
> address_space_rw (qemu/exec.c:2547)
> kvm_cpu_exec (qemu/kvm-all.c:1849)
> qemu_kvm_cpu_thread_fn (qemu/cpus.c:1050)
> start_thread (pthread_create.c:312)
> clone
> 
> call stack: eventfd MMIO
> =========================
> nvme_sq_notifier (qemu/hw/block/nvme.c:598)
> aio_dispatch (qemu/aio-posix.c:329)
> aio_ctx_dispatch (qemu/async.c:232)
> g_main_context_dispatch
> glib_pollfds_poll (qemu/main-loop.c:213)
> os_host_main_loop_wait (qemu/main-loop.c:257)
> main_loop_wait (qemu/main-loop.c:504)
> main_loop (qemu/vl.c:1920)
> main (qemu/vl.c:4682)
> __libc_start_main

For comparison, here is the "iothread + eventfd MMIO" call stack:

nvme_sq_notifier (qemu/hw/block/nvme.c:598)
aio_dispatch (qemu/aio-posix.c:329)
aio_poll (qemu/aio-posix.c:474)
iothread_run (qemu/iothread.c:170)
__libc_start_main

aio_poll() is much more specialized than the main loop (which uses glib
and thus wraps aio_poll() in a GSource adapter), and can be faster too.
(That said, things are still a bit in flux here: 2.6 will have pretty
heavy changes in this area, but the API will stay the same.)
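
Pointing the notifier at an iothread's AioContext instead of the main loop
is a small change; a sketch, assuming an n->iothread property (the exact
aio_set_event_notifier() signature differs a bit between QEMU versions):

    /* sketch: dispatch the doorbell eventfd in a dedicated iothread
     * rather than in the glib main loop */
    AioContext *ctx = iothread_get_aio_context(n->iothread);
    aio_set_event_notifier(ctx, &sq->notifier, nvme_sq_notifier);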

Even more performance can be squeezed out by adding a little busy waiting
to aio_poll() before it falls back to the blocking poll(); this avoids
sleeping across very short idle periods, along the lines sketched below.
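
Something along these lines, just a sketch of the idea (spin_ns is a
made-up tunable, not an existing knob):

    /* sketch: do non-blocking polls for a short while before sleeping */
    int64_t deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + spin_ns;
    bool progress = false;
    do {
        progress = aio_poll(ctx, false);   /* non-blocking pass */
    } while (!progress &&
             qemu_clock_get_ns(QEMU_CLOCK_REALTIME) < deadline);
    if (!progress) {
        aio_poll(ctx, true);               /* finally block in poll() */
    }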

BTW, you may want to Cc qemu-block@nongnu.org in addition to
qemu-devel@nongnu.org.  Most people are on both lists, but some notice
things faster if you write to the lower-traffic qemu-block mailing list.

Paolo

Patch

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 31572f2..f27fd35 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -548,6 +548,7 @@ static void nvme_cq_notifier(EventNotifier *e)
     NvmeCQueue *cq =
         container_of(e, NvmeCQueue, notifier);
 
+    event_notifier_test_and_clear(&cq->notifier);
     nvme_post_cqes(cq);
 }
 
@@ -567,6 +568,7 @@ static void nvme_sq_notifier(EventNotifier *e)
     NvmeSQueue *sq =
         container_of(e, NvmeSQueue, notifier);
 
+    event_notifier_test_and_clear(&sq->notifier);
     nvme_process_sq(sq);
 }