Message ID: 1285855312-11739-1-git-send-email-stefanha@linux.vnet.ibm.com
State: New
On 09/30/2010 04:01 PM, Stefan Hajnoczi wrote:
> Virtqueue notify is currently handled synchronously in userspace virtio.
> This prevents the vcpu from executing guest code while hardware
> emulation code handles the notify.
>
> On systems that support KVM, the ioeventfd mechanism can be used to make
> virtqueue notify a lightweight exit by deferring hardware emulation to
> the iothread and allowing the VM to continue execution. This model is
> similar to how vhost receives virtqueue notifies.

Note that this is a tradeoff. If an idle core is available and the
scheduler places the iothread on that core, then the heavyweight exit is
replaced by a lightweight exit + IPI. If the iothread is co-located with
the vcpu, then we'll take a heavyweight exit in any case.

The first case is very likely if the host cpu is undercommitted and there
is heavy I/O activity. This is a typical subsystem benchmark scenario (as
opposed to a system benchmark like specvirt). My feeling is that total
system throughput will be decreased unless the scheduler is clever enough
to place the iothread and vcpu on the same host cpu when the system is
overcommitted.

We can't balance "feeling" against numbers, especially when we have a
precedent in vhost-net, so I think this should go in. But I think we
should also try to understand the effects of the extra IPIs and cacheline
bouncing that this creates. While virtio was designed to minimize this,
we know it has severe problems in this area.

> The result of this change is improved performance for userspace virtio
> devices. Virtio-blk throughput increases especially for multithreaded
> scenarios and virtio-net transmit throughput increases substantially.
> Full numbers are below.
>
> This patch employs ioeventfd virtqueue notify for all virtio devices.
> Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
> must be taken so that vhost-net, the other ioeventfd user in QEMU, is
> able to function. On such kernels ioeventfd virtqueue notify will not
> be used.
>
> Khoa Huynh <khoa@us.ibm.com> collected the following data for
> virtio-blk with cache=none,aio=native:
>
> FFSB Test          Threads  Unmodified  Patched
>                             (MB/s)      (MB/s)
> Large file create  1        21.7        21.8
>                    8        101.0       118.0
>                    16       119.0       157.0
>
> Sequential reads   1        21.9        23.2
>                    8        114.0       139.0
>                    16       143.0       178.0
>
> Random reads       1        3.3         3.6
>                    8        23.0        25.4
>                    16       43.3        47.8
>
> Random writes      1        22.2        23.0
>                    8        93.1        111.6
>                    16       110.5       132.0

Impressive numbers. Can you also provide efficiency (bytes per host cpu
seconds)? How many guest vcpus were used with this? With enough vcpus,
there is also a reduction in cacheline bouncing, since the virtio state
in the host gets to stay on one cpu (especially with aio=native).

> Sridhar Samudrala <sri@us.ibm.com> collected the following data for
> virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.
>
> Guest to Host TCP_STREAM throughput(Mb/sec)
> -------------------------------------------
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> 65536     12755      6430        7590
> 16384     8499       3084        5764
> 4096      4723       1578        3659
> 1024      1827       981         2060

Even more impressive (expected since the copying, which isn't present
for block, is now shunted off into an iothread). On the last test you
even exceeded vhost-net. Any theories how/why?

Again, efficiency numbers would be interesting.
> Host to Guest TCP_STREAM throughput(Mb/sec)
> -------------------------------------------
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> 65536     11156      5790        5853
> 16384     10787      5575        5691
> 4096      10452      5556        4277
> 1024      4437       3671        5277

Here you exceed vhost-net, too.

> +static int kvm_check_many_iobus_devs(void)
> +{
> +    /* Older kernels have a 6 device limit on the KVM io bus.  In that case
> +     * creating many ioeventfds must be avoided.  This tests checks for the
> +     * limitation.
> +     */
> +    EventNotifier notifiers[7];
> +    int i, ret = 0;
> +    for (i = 0; i < ARRAY_SIZE(notifiers); i++) {
> +        ret = event_notifier_init(&notifiers[i], 0);
> +        if (ret < 0) {
> +            break;
> +        }
> +        ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true);
> +        if (ret < 0) {
> +            event_notifier_cleanup(&notifiers[i]);
> +            break;
> +        }
> +    }
> +
> +    /* Decide whether many devices are supported or not */
> +    ret = i == ARRAY_SIZE(notifiers);
> +
> +    while (i-- > 0) {
> +        kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false);
> +        event_notifier_cleanup(&notifiers[i]);
> +    }
> +    return ret;
> +}

Sorry about that.

IIRC there was a problem (shared by vhost-net) with interrupts remaining
enabled in the window between the guest kicking the queue and the host
waking up and disabling interrupts. An even more vague IIRC mst had an
idea to fix this?
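The mechanism being weighed in this exchange can be sketched as follows. This is illustrative only, not the patch's code: the VM fd, the notify pio address, and the helper name are assumptions. The guest's 16-bit pio write of the queue index to VIRTIO_PCI_QUEUE_NOTIFY is completed inside the kernel and only signals an eventfd; the iothread polls that fd and runs the virtqueue handler, so the vcpu never returns to userspace for the notify.

/*
 * Illustrative sketch, not the patch's code: register an eventfd so the
 * guest's virtqueue notify becomes a lightweight exit.  Assumptions: a KVM
 * VM fd, the device's notify pio address, and the helper name.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int assign_virtqueue_ioeventfd(int vm_fd, uint16_t notify_pio_addr,
                                      uint16_t queue_index)
{
    int efd = eventfd(0, 0);
    struct kvm_ioeventfd kick = {
        .datamatch = queue_index,       /* match only this queue's notify */
        .addr      = notify_pio_addr,   /* VIRTIO_PCI_QUEUE_NOTIFY register */
        .len       = 2,                 /* the notify register is 16 bits */
        .fd        = efd,
        .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
    };

    if (efd < 0) {
        return -1;
    }
    if (ioctl(vm_fd, KVM_IOEVENTFD, &kick) < 0) {
        close(efd);                     /* e.g. pre-2.6.34: io bus is full */
        return -1;
    }
    return efd;                         /* iothread select()s on this fd */
}

Without this registration the same pio write forces a full exit to userspace, which is the synchronous path the patch replaces.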
On Sun, Oct 03, 2010 at 01:01:59PM +0200, Avi Kivity wrote:
> >
> >Guest to Host TCP_STREAM throughput(Mb/sec)
> >-------------------------------------------
> >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >65536     12755      6430        7590
> >16384     8499       3084        5764
> > 4096     4723       1578        3659
> > 1024     1827       981         2060
>
> Even more impressive (expected since the copying, which isn't
> present for block, is now shunted off into an iothread).
>
> On the last test you even exceeded vhost-net. Any theories how/why?
>
> Again, efficiency numbers would be interesting.
>
> >Host to Guest TCP_STREAM throughput(Mb/sec)
> >-------------------------------------------
> >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >65536     11156      5790        5853
> >16384     10787      5575        5691
> > 4096     10452      5556        4277
> > 1024     4437       3671        5277
>
> Here you exceed vhost-net, too.

This is with small packets - I suspect this is the extra per interrupt
overhead that eventfd has.

> >+static int kvm_check_many_iobus_devs(void)
> >+{
> >+    /* Older kernels have a 6 device limit on the KVM io bus.  In that case
> >+     * creating many ioeventfds must be avoided.  This tests checks for the
> >+     * limitation.
> >+     */
> >+    EventNotifier notifiers[7];
> >+    int i, ret = 0;
> >+    for (i = 0; i < ARRAY_SIZE(notifiers); i++) {
> >+        ret = event_notifier_init(&notifiers[i], 0);
> >+        if (ret < 0) {
> >+            break;
> >+        }
> >+        ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true);
> >+        if (ret < 0) {
> >+            event_notifier_cleanup(&notifiers[i]);
> >+            break;
> >+        }
> >+    }
> >+
> >+    /* Decide whether many devices are supported or not */
> >+    ret = i == ARRAY_SIZE(notifiers);
> >+
> >+    while (i-- > 0) {
> >+        kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false);
> >+        event_notifier_cleanup(&notifiers[i]);
> >+    }
> >+    return ret;
> >+}
>
> Sorry about that.
>
> IIRC there was a problem (shared by vhost-net) with interrupts
> remaining enabled in the window between the guest kicking the queue
> and the host waking up and disabling interrupts. An even more vague
> IIRC mst had an idea to fix this?

This is one of the things that vring2 is supposed to fix.

> --
> error compiling committee.c: too many arguments to function
On 10/03/2010 03:51 PM, Michael S. Tsirkin wrote: > On Sun, Oct 03, 2010 at 01:01:59PM +0200, Avi Kivity wrote: > > > > > >Guest to Host TCP_STREAM throughput(Mb/sec) > > >------------------------------------------- > > >Msg Size vhost-net virtio-net virtio-net/ioeventfd > > >65536 12755 6430 7590 > > >16384 8499 3084 5764 > > > 4096 4723 1578 3659 > > > 1024 1827 981 2060 > > > > Even more impressive (expected since the copying, which isn't > > present for block, is now shunted off into an iothread). > > > > On the last test you even exceeded vhost-net. Any theories how/why? > > > > Again, efficiency numbers would be interesting. > > > > >Host to Guest TCP_STREAM throughput(Mb/sec) > > >------------------------------------------- > > >Msg Size vhost-net virtio-net virtio-net/ioeventfd > > >65536 11156 5790 5853 > > >16384 10787 5575 5691 > > > 4096 10452 5556 4277 > > > 1024 4437 3671 5277 > > > > Here you exceed vhost-net, too. > > This is with small packets- I suspect this is the extra > per interrupt overhead that eventfd has. This is using eventfd as well.
On Sun, Oct 03, 2010 at 04:21:57PM +0200, Avi Kivity wrote: > On 10/03/2010 03:51 PM, Michael S. Tsirkin wrote: > >On Sun, Oct 03, 2010 at 01:01:59PM +0200, Avi Kivity wrote: > >> > > >> >Guest to Host TCP_STREAM throughput(Mb/sec) > >> >------------------------------------------- > >> >Msg Size vhost-net virtio-net virtio-net/ioeventfd > >> >65536 12755 6430 7590 > >> >16384 8499 3084 5764 > >> > 4096 4723 1578 3659 > >> > 1024 1827 981 2060 > >> > >> Even more impressive (expected since the copying, which isn't > >> present for block, is now shunted off into an iothread). > >> > >> On the last test you even exceeded vhost-net. Any theories how/why? > >> > >> Again, efficiency numbers would be interesting. > >> > >> >Host to Guest TCP_STREAM throughput(Mb/sec) > >> >------------------------------------------- > >> >Msg Size vhost-net virtio-net virtio-net/ioeventfd > >> >65536 11156 5790 5853 > >> >16384 10787 5575 5691 > >> > 4096 10452 5556 4277 > >> > 1024 4437 3671 5277 > >> > >> Here you exceed vhost-net, too. > > > >This is with small packets- I suspect this is the extra > >per interrupt overhead that eventfd has. > > This is using eventfd as well. Sorry, I meant irqfd. > -- > error compiling committee.c: too many arguments to function
On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote: > >> This is using eventfd as well. >> > Sorry, I meant irqfd. > I've tried using irqfd in userspace. It hurts performance quite a bit compared to doing an ioctl so I would suspect this too. A last_used_idx or similar mechanism should help performance quite a bit on top of ioeventfd too. Regards, Anthony Liguori >> -- >> error compiling committee.c: too many arguments to function >>
On 10/04/2010 03:18 AM, Anthony Liguori wrote: > On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote: >> >>> This is using eventfd as well. >> Sorry, I meant irqfd. > > I've tried using irqfd in userspace. It hurts performance quite a bit > compared to doing an ioctl so I would suspect this too. > > A last_used_idx or similar mechanism should help performance quite a > bit on top of ioeventfd too. > Any idea why? While irqfd does quite a bit of extra locking, it shouldn't be that bad.
On 10/04/2010 03:04 AM, Avi Kivity wrote: > On 10/04/2010 03:18 AM, Anthony Liguori wrote: >> On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote: >>> >>>> This is using eventfd as well. >>> Sorry, I meant irqfd. >> >> I've tried using irqfd in userspace. It hurts performance quite a >> bit compared to doing an ioctl so I would suspect this too. >> >> A last_used_idx or similar mechanism should help performance quite a >> bit on top of ioeventfd too. >> > > Any idea why? While irqfd does quite a bit of extra locking, it > shouldn't be that bad. Not really. It was somewhat counter intuitive. A worthwhile experiment might be to do some layering violations and have vhost do an irq injection via an ioctl and see what the performance delta is. I suspect it could give vhost a nice boost. Regards, Anthony Liguori
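The experiment Anthony suggests amounts to choosing between two injection paths into KVM. A rough sketch of both follows; it is not code from vhost or QEMU, the helper names are hypothetical, and an already-routed GSI is assumed (for an MSI-X vector the GSI would come from the irq routing table).

/*
 * Rough sketch of the two injection paths being compared.  Assumptions:
 * hypothetical helper names, an already-routed GSI, a KVM VM fd.
 */
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* irqfd: the kernel injects 'gsi' whenever 'efd' is signalled, with no
 * exit to userspace. */
static int hook_irqfd(int vm_fd, int efd, unsigned int gsi)
{
    struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi };

    return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}

/* ioctl injection: what the "layering violation" experiment would call
 * from the vhost/QEMU side instead of signalling the irqfd. */
static int inject_via_ioctl(int vm_fd, unsigned int gsi)
{
    struct kvm_irq_level level = { .irq = gsi, .level = 1 };

    if (ioctl(vm_fd, KVM_IRQ_LINE, &level) < 0) {
        return -1;
    }
    level.level = 0;    /* message-signalled interrupts are edge-like */
    return ioctl(vm_fd, KVM_IRQ_LINE, &level);
}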
On Sun, Oct 3, 2010 at 12:01 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/30/2010 04:01 PM, Stefan Hajnoczi wrote: >> >> Virtqueue notify is currently handled synchronously in userspace virtio. >> This prevents the vcpu from executing guest code while hardware >> emulation code handles the notify. >> >> On systems that support KVM, the ioeventfd mechanism can be used to make >> virtqueue notify a lightweight exit by deferring hardware emulation to >> the iothread and allowing the VM to continue execution. This model is >> similar to how vhost receives virtqueue notifies. > > Note that this is a tradeoff. If an idle core is available and the > scheduler places the iothread on that core, then the heavyweight exit is > replaced by a lightweight exit + IPI. If the iothread is co-located with > the vcpu, then we'll take a heavyweight exit in any case. > > The first case is very likely if the host cpu is undercommitted and there is > heavy I/O activity. This is a typical subsystem benchmark scenario (as > opposed to a system benchmark like specvirt). My feeling is that total > system throughput will be decreased unless the scheduler is clever enough to > place the iothread and vcpu on the same host cpu when the system is > overcommitted. > > We can't balance "feeling" against numbers, especially when we have a > precedent in vhost-net, so I think this should go in. But I think we should > also try to understand the effects of the extra IPIs and cacheline bouncing > that this creates. While virtio was designed to minimize this, we know it > has severe problems in this area. Right, there is a danger of optimizing for subsystem benchmark cases rather than real world usage. I have posted some results that we've gathered but more scrutiny is welcome. >> Khoa Huynh<khoa@us.ibm.com> collected the following data for >> virtio-blk with cache=none,aio=native: >> >> FFSB Test Threads Unmodified Patched >> (MB/s) (MB/s) >> Large file create 1 21.7 21.8 >> 8 101.0 118.0 >> 16 119.0 157.0 >> >> Sequential reads 1 21.9 23.2 >> 8 114.0 139.0 >> 16 143.0 178.0 >> >> Random reads 1 3.3 3.6 >> 8 23.0 25.4 >> 16 43.3 47.8 >> >> Random writes 1 22.2 23.0 >> 8 93.1 111.6 >> 16 110.5 132.0 > > Impressive numbers. Can you also provide efficiency (bytes per host cpu > seconds)? Khoa, do you have the host CPU numbers for these benchmark runs? > How many guest vcpus were used with this? With enough vcpus, there is also > a reduction in cacheline bouncing, since the virtio state in the host gets > to stay on one cpu (especially with aio=native). Guest: 2 vcpu, 4 GB RAM Host: 16 cpus, 12 GB RAM Khoa, is this correct? Stefan
On Mon, Oct 04, 2010 at 09:01:14AM -0500, Anthony Liguori wrote: > On 10/04/2010 03:04 AM, Avi Kivity wrote: > > On 10/04/2010 03:18 AM, Anthony Liguori wrote: > >>On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote: > >>> > >>>>This is using eventfd as well. > >>>Sorry, I meant irqfd. > >> > >>I've tried using irqfd in userspace. It hurts performance quite > >>a bit compared to doing an ioctl so I would suspect this too. > >> > >>A last_used_idx or similar mechanism should help performance > >>quite a bit on top of ioeventfd too. > >> > > > >Any idea why? While irqfd does quite a bit of extra locking, it > >shouldn't be that bad. > > Not really. It was somewhat counter intuitive. > > A worthwhile experiment might be to do some layering violations and > have vhost do an irq injection via an ioctl and see what the > performance delta is. I think you don't even need to try that hard. Just comment this line: // proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier; this is what switches to irqfd when msi vector is unmasked. > I suspect it could give vhost a nice boost. > > Regards, > > Anthony Liguori
On 10/04/2010 11:12 AM, Michael S. Tsirkin wrote: > On Mon, Oct 04, 2010 at 09:01:14AM -0500, Anthony Liguori wrote: > >> On 10/04/2010 03:04 AM, Avi Kivity wrote: >> >>> On 10/04/2010 03:18 AM, Anthony Liguori wrote: >>> >>>> On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote: >>>> >>>>> >>>>>> This is using eventfd as well. >>>>>> >>>>> Sorry, I meant irqfd. >>>>> >>>> I've tried using irqfd in userspace. It hurts performance quite >>>> a bit compared to doing an ioctl so I would suspect this too. >>>> >>>> A last_used_idx or similar mechanism should help performance >>>> quite a bit on top of ioeventfd too. >>>> >>>> >>> Any idea why? While irqfd does quite a bit of extra locking, it >>> shouldn't be that bad. >>> >> Not really. It was somewhat counter intuitive. >> >> A worthwhile experiment might be to do some layering violations and >> have vhost do an irq injection via an ioctl and see what the >> performance delta is. >> > I think you don't even need to try that hard. > Just comment this line: > // proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier; > this is what switches to irqfd when msi vector is unmasked. > That drops to userspace though for all irqs, no? Or did you mean that commenting that line out improves performance demonstrating the overhead of irqfd? Regards, Anthony Liguori > >> I suspect it could give vhost a nice boost. >> >> Regards, >> >> Anthony Liguori >>
On Mon, Oct 04, 2010 at 11:20:19AM -0500, Anthony Liguori wrote: > On 10/04/2010 11:12 AM, Michael S. Tsirkin wrote: > >On Mon, Oct 04, 2010 at 09:01:14AM -0500, Anthony Liguori wrote: > >>On 10/04/2010 03:04 AM, Avi Kivity wrote: > >>>On 10/04/2010 03:18 AM, Anthony Liguori wrote: > >>>>On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote: > >>>>>>This is using eventfd as well. > >>>>>Sorry, I meant irqfd. > >>>>I've tried using irqfd in userspace. It hurts performance quite > >>>>a bit compared to doing an ioctl so I would suspect this too. > >>>> > >>>>A last_used_idx or similar mechanism should help performance > >>>>quite a bit on top of ioeventfd too. > >>>> > >>>Any idea why? While irqfd does quite a bit of extra locking, it > >>>shouldn't be that bad. > >>Not really. It was somewhat counter intuitive. > >> > >>A worthwhile experiment might be to do some layering violations and > >>have vhost do an irq injection via an ioctl and see what the > >>performance delta is. > >I think you don't even need to try that hard. > >Just comment this line: > >// proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier; > >this is what switches to irqfd when msi vector is unmasked. > > That drops to userspace though for all irqs, no? Exactly. > Or did you mean that commenting that line out improves performance > demonstrating the overhead of irqfd? > > Regards, > > Anthony Liguori Haven't tried this, but possibly. > >> I suspect it could give vhost a nice boost. > >> > >>Regards, > >> > >>Anthony Liguori
Hi, W.r.t: > Note that this is a tradeoff. If an idle core is available and the > scheduler places the iothread on that core, then the heavyweight exit is > replaced by a lightweight exit + IPI. If the iothread is co-located with > the vcpu, then we'll take a heavyweight exit in any case. > Q: Does the kvm kernel code check for such a condition and take a heavyweight exit? > The first case is very likely if the host cpu is undercommitted and there is > heavy I/O activity. This is a typical subsystem benchmark scenario (as > opposed to a system benchmark like specvirt). My feeling is that total > system throughput will be decreased unless the scheduler is clever enough to > place the iothread and vcpu on the same host cpu when the system is > overcommitted. > > Q: Sorry if the answer is obvious here. If the heavyweight exit is taken when both threads are assigned to the same core, how will the system throughput increase? Thanks Rukhsana
On 10/05/2010 01:00 PM, rukhsana ansari wrote: > Hi, > > W.r.t: > > Note that this is a tradeoff. If an idle core is available and the > > scheduler places the iothread on that core, then the heavyweight exit is > > replaced by a lightweight exit + IPI. If the iothread is co-located with > > the vcpu, then we'll take a heavyweight exit in any case. > > > Q: Does the kvm kernel code check for such a condition and take a > heavyweight exit? No. The heavyweight exit is caused by a context switch (partial) or return to userspace (full). > > The first case is very likely if the host cpu is undercommitted and there is > > heavy I/O activity. This is a typical subsystem benchmark scenario (as > > opposed to a system benchmark like specvirt). My feeling is that total > > system throughput will be decreased unless the scheduler is clever enough to > > place the iothread and vcpu on the same host cpu when the system is > > overcommitted. > > > > > Q: Sorry if the answer is obvious here. > If the heavyweight exit is taken when both threads are assigned to the > same core, how will the system throughput increase? > Co-locating threads on the same core reduces cross-core traffic.
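One way to reproduce the co-located case when measuring is simply to pin the iothread and the chosen vcpu thread to the same host cpu. A minimal sketch, assuming the experimenter identifies the threads themselves; this is not part of the patch.

/*
 * Sketch only: pin the calling thread to one host cpu.  Run it from the
 * iothread and from the vcpu thread (or apply the same mask externally).
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_current_thread_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

The same effect can be obtained from outside the process with taskset applied to the individual thread ids.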
On Thu, Sep 30, 2010 at 03:01:52PM +0100, Stefan Hajnoczi wrote: > Virtqueue notify is currently handled synchronously in userspace virtio. > This prevents the vcpu from executing guest code while hardware > emulation code handles the notify. > > On systems that support KVM, the ioeventfd mechanism can be used to make > virtqueue notify a lightweight exit by deferring hardware emulation to > the iothread and allowing the VM to continue execution. This model is > similar to how vhost receives virtqueue notifies. > > The result of this change is improved performance for userspace virtio > devices. Virtio-blk throughput increases especially for multithreaded > scenarios and virtio-net transmit throughput increases substantially. > Full numbers are below. > > This patch employs ioeventfd virtqueue notify for all virtio devices. > Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care > must be taken so that vhost-net, the other ioeventfd user in QEMU, is > able to function. On such kernels ioeventfd virtqueue notify will not > be used. > > Khoa Huynh <khoa@us.ibm.com> collected the following data for > virtio-blk with cache=none,aio=native: > > FFSB Test Threads Unmodified Patched > (MB/s) (MB/s) > Large file create 1 21.7 21.8 > 8 101.0 118.0 > 16 119.0 157.0 > > Sequential reads 1 21.9 23.2 > 8 114.0 139.0 > 16 143.0 178.0 > > Random reads 1 3.3 3.6 > 8 23.0 25.4 > 16 43.3 47.8 > > Random writes 1 22.2 23.0 > 8 93.1 111.6 > 16 110.5 132.0 > > Sridhar Samudrala <sri@us.ibm.com> collected the following data for > virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest. > > Guest to Host TCP_STREAM throughput(Mb/sec) > ------------------------------------------- > Msg Size vhost-net virtio-net virtio-net/ioeventfd > 65536 12755 6430 7590 > 16384 8499 3084 5764 > 4096 4723 1578 3659 > 1024 1827 981 2060 > > Host to Guest TCP_STREAM throughput(Mb/sec) > ------------------------------------------- > Msg Size vhost-net virtio-net virtio-net/ioeventfd > 65536 11156 5790 5853 > 16384 10787 5575 5691 > 4096 10452 5556 4277 > 1024 4437 3671 5277 > > Guest to Host TCP_RR latency(transactions/sec) > ---------------------------------------------- > > Msg Size vhost-net virtio-net virtio-net/ioeventfd > 1 9903 3459 3425 > 4096 7185 1931 1899 > 16384 6108 2102 1923 > 65536 3161 1610 1744 > > Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> > --- > Small changes are required for qemu-kvm.git. I will send them once qemu.git > has virtio-ioeventfd support. > > hw/vhost.c | 6 ++-- > hw/virtio.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > hw/virtio.h | 9 +---- > kvm-all.c | 39 +++++++++++++++++++++ > kvm-stub.c | 5 +++ > kvm.h | 1 + > 6 files changed, 156 insertions(+), 10 deletions(-) Is there anything stopping this patch from being merged? Thanks, Stefan
On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote: > Is there anything stopping this patch from being merged? > Michael, any objections? If not, I'll merge it. Regards, Anthony Liguori > Thanks, > Stefan >
As a general comment, could you please try to split this patch up, to make it easier to review? I did a pass over it but I am still not understanding it completely. My main concern is with the fact that we add more state in notifiers that can easily get out of sync with users. If we absolutely need this state, let's try to at least document the state machine, and make the API for state transitions more transparent. On Thu, Sep 30, 2010 at 03:01:52PM +0100, Stefan Hajnoczi wrote: > diff --git a/hw/vhost.c b/hw/vhost.c > index 1b8624d..f127a07 100644 > --- a/hw/vhost.c > +++ b/hw/vhost.c > @@ -517,7 +517,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev, > goto fail_guest_notifier; > } > > - r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, true); > + r = virtio_set_host_notifier(vdev, idx, true); > if (r < 0) { > fprintf(stderr, "Error binding host notifier: %d\n", -r); > goto fail_host_notifier; > @@ -539,7 +539,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev, > > fail_call: > fail_kick: > - vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false); > + virtio_set_host_notifier(vdev, idx, false); > fail_host_notifier: > vdev->binding->set_guest_notifier(vdev->binding_opaque, idx, false); > fail_guest_notifier: > @@ -575,7 +575,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev, > } > assert (r >= 0); > > - r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false); > + r = virtio_set_host_notifier(vdev, idx, false); > if (r < 0) { > fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r); > fflush(stderr); > diff --git a/hw/virtio.c b/hw/virtio.c > index fbef788..f075b3a 100644 > --- a/hw/virtio.c > +++ b/hw/virtio.c > @@ -16,6 +16,7 @@ > #include "trace.h" > #include "virtio.h" > #include "sysemu.h" > +#include "kvm.h" > > /* The alignment to use between consumer and producer parts of vring. > * x86 pagesize again. */ > @@ -77,6 +78,11 @@ struct VirtQueue > VirtIODevice *vdev; > EventNotifier guest_notifier; > EventNotifier host_notifier; > + enum { > + HOST_NOTIFIER_DEASSIGNED, /* inactive */ > + HOST_NOTIFIER_ASSIGNED, /* active */ > + HOST_NOTIFIER_OFFLIMITS, /* active but outside our control */ > + } host_notifier_state; This state machine confuses me. Please note that users already track notifier state and call set with assign/deassign correctly. The comment does not help: what does 'outside our control' mean? Who's control? > }; > > /* virt queue functions */ > @@ -453,6 +459,93 @@ void virtio_update_irq(VirtIODevice *vdev) > virtio_notify_vector(vdev, VIRTIO_NO_VECTOR); > } > > +/* Service virtqueue notify from a host notifier */ > +static void virtio_read_host_notifier(void *opaque) > +{ > + VirtQueue *vq = opaque; > + EventNotifier *notifier = virtio_queue_get_host_notifier(vq); > + if (event_notifier_test_and_clear(notifier)) { > + if (vq->vring.desc) { > + vq->handle_output(vq->vdev, vq); > + } > + } > +} > + > +/* Transition between host notifier states */ > +static int virtio_set_host_notifier_state(VirtIODevice *vdev, int n, int state) really unfortunate naming for functions: we seem to have about 4 of them starting with virtio_set_host_notifier* > +{ > + VirtQueue *vq = &vdev->vq[n]; > + EventNotifier *notifier = virtio_queue_get_host_notifier(vq); > + int rc; > + > + if (!kvm_enabled()) { > + return -ENOSYS; > + } If this means that there's no need to do anything for non kvm, return 0 here. 
> + > + /* If the number of ioeventfds is limited, use them for vhost only */ > + if (state == HOST_NOTIFIER_ASSIGNED && !kvm_has_many_iobus_devs()) { > + state = HOST_NOTIFIER_DEASSIGNED; > + } > + > + /* Ignore if no state change */ > + if (vq->host_notifier_state == state) { > + return 0; > + } > + > + /* Disable read handler if transitioning away from assigned */ > + if (vq->host_notifier_state == HOST_NOTIFIER_ASSIGNED) { > + qemu_set_fd_handler(event_notifier_get_fd(notifier), NULL, NULL, NULL); > + } > + > + /* Toggle host notifier if transitioning to or from deassigned */ > + if (state == HOST_NOTIFIER_DEASSIGNED || > + vq->host_notifier_state == HOST_NOTIFIER_DEASSIGNED) { > + rc = vdev->binding->set_host_notifier(vdev->binding_opaque, n, > + state != HOST_NOTIFIER_DEASSIGNED); > + if (rc < 0) { > + return rc; > + } > + } > + > + /* Enable read handler if transitioning to assigned */ > + if (state == HOST_NOTIFIER_ASSIGNED) { > + qemu_set_fd_handler(event_notifier_get_fd(notifier), > + virtio_read_host_notifier, NULL, vq); > + } > + > + vq->host_notifier_state = state; > + return 0; > +} > + > +/* Try to assign/deassign host notifiers for all virtqueues */ > +static void virtio_set_host_notifiers(VirtIODevice *vdev, bool assigned) void? don't we care whether this fails? > +{ This really confuses me. Why is this not a simple loop over all vqs? > + int state = assigned ? HOST_NOTIFIER_ASSIGNED : HOST_NOTIFIER_DEASSIGNED; > + int i; > + > + if (!vdev->binding->set_host_notifier) { > + return; > + } > + > + for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { > + if (vdev->vq[i].host_notifier_state == HOST_NOTIFIER_OFFLIMITS) { > + continue; > + } > + > + if (!vdev->vq[i].vring.desc) { > + continue; > + } > + > + virtio_set_host_notifier_state(vdev, i, state); ignores return value? > + } > +} > + > +int virtio_set_host_notifier(VirtIODevice *vdev, int n, bool assigned) > +{ > + int state = assigned ? HOST_NOTIFIER_OFFLIMITS : HOST_NOTIFIER_ASSIGNED; So here assigned == false sets state to ASSIGNED? > + return virtio_set_host_notifier_state(vdev, n, state); > +} > + > void virtio_reset(void *opaque) > { > VirtIODevice *vdev = opaque; > @@ -467,6 +560,7 @@ void virtio_reset(void *opaque) > vdev->isr = 0; > vdev->config_vector = VIRTIO_NO_VECTOR; > virtio_notify_vector(vdev, vdev->config_vector); > + virtio_set_host_notifiers(vdev, false); > > for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { > vdev->vq[i].vring.desc = 0; > @@ -592,6 +686,16 @@ void virtio_queue_set_vector(VirtIODevice *vdev, int n, uint16_t vector) > vdev->vq[n].vector = vector; > } > > +void virtio_set_status(VirtIODevice *vdev, uint8_t val) > +{ > + virtio_set_host_notifiers(vdev, val & VIRTIO_CONFIG_S_DRIVER_OK); > + > + if (vdev->set_status) { > + vdev->set_status(vdev, val); > + } > + vdev->status = val; How does this interact with vhost.c which uses both notifiers and status callback? 
> +} > + > VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, > void (*handle_output)(VirtIODevice *, VirtQueue *)) > { > @@ -719,6 +823,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f) > } > > virtio_notify_vector(vdev, VIRTIO_NO_VECTOR); > + virtio_set_host_notifiers(vdev, vdev->status & VIRTIO_CONFIG_S_DRIVER_OK); > return 0; > } > > @@ -746,6 +851,7 @@ VirtIODevice *virtio_common_init(const char *name, uint16_t device_id, > for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { > vdev->vq[i].vector = VIRTIO_NO_VECTOR; > vdev->vq[i].vdev = vdev; > + vdev->vq[i].host_notifier_state = HOST_NOTIFIER_DEASSIGNED; > } > > vdev->name = name; > diff --git a/hw/virtio.h b/hw/virtio.h > index 96514e6..d76157e 100644 > --- a/hw/virtio.h > +++ b/hw/virtio.h > @@ -125,13 +125,7 @@ struct VirtIODevice > uint16_t device_id; > }; > > -static inline void virtio_set_status(VirtIODevice *vdev, uint8_t val) > -{ > - if (vdev->set_status) { > - vdev->set_status(vdev, val); > - } > - vdev->status = val; > -} > +void virtio_set_status(VirtIODevice *vdev, uint8_t val); > > VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, > void (*handle_output)(VirtIODevice *, > @@ -217,6 +211,7 @@ target_phys_addr_t virtio_queue_get_ring_size(VirtIODevice *vdev, int n); > uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n); > void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx); > VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n); > +int virtio_set_host_notifier(VirtIODevice *vdev, int n, bool assigned); I think it's not the best API. Two functions e.g. _assign _deassign would make it easier to avoid confusion. > EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq); > EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); > void virtio_irq(VirtQueue *vq); > diff --git a/kvm-all.c b/kvm-all.c > index 1cc696f..2f09e34 100644 > --- a/kvm-all.c > +++ b/kvm-all.c > @@ -24,6 +24,7 @@ > #include "qemu-barrier.h" > #include "sysemu.h" > #include "hw/hw.h" > +#include "hw/event_notifier.h" > #include "gdbstub.h" > #include "kvm.h" > #include "bswap.h" > @@ -72,6 +73,7 @@ struct KVMState > int irqchip_in_kernel; > int pit_in_kernel; > int xsave, xcrs; > + int many_iobus_devs; > }; > > static KVMState *kvm_state; > @@ -423,6 +425,36 @@ int kvm_check_extension(KVMState *s, unsigned int extension) > return ret; > } > > +static int kvm_check_many_iobus_devs(void) > +{ > + /* Older kernels have a 6 device limit on the KVM io bus. In that case > + * creating many ioeventfds must be avoided. This tests This tests -> this test >+ checks for the limitation. > + */ > + EventNotifier notifiers[7]; This is low level stuff, we really don't need to use notifiers here, simple eventfd will be enough. 
> + int i, ret = 0; > + for (i = 0; i < ARRAY_SIZE(notifiers); i++) { > + ret = event_notifier_init(&notifiers[i], 0); > + if (ret < 0) { > + break; > + } > + ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true); > + if (ret < 0) { > + event_notifier_cleanup(&notifiers[i]); > + break; > + } > + } > + > + /* Decide whether many devices are supported or not */ > + ret = i == ARRAY_SIZE(notifiers); > + > + while (i-- > 0) { > + kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false); > + event_notifier_cleanup(&notifiers[i]); > + } > + return ret; > +} > + > static void kvm_set_phys_mem(target_phys_addr_t start_addr, > ram_addr_t size, > ram_addr_t phys_offset) > @@ -699,6 +731,8 @@ int kvm_init(int smp_cpus) > kvm_state = s; > cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client); > > + s->many_iobus_devs = kvm_check_many_iobus_devs(); > + > return 0; > > err: > @@ -1028,6 +1062,11 @@ int kvm_has_xcrs(void) > return kvm_state->xcrs; > } > > +int kvm_has_many_iobus_devs(void) > +{ > + return kvm_state->many_iobus_devs; > +} > + > void kvm_setup_guest_memory(void *start, size_t size) > { > if (!kvm_has_sync_mmu()) { > diff --git a/kvm-stub.c b/kvm-stub.c > index d45f9fa..b0887fb 100644 > --- a/kvm-stub.c > +++ b/kvm-stub.c > @@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void) > return 0; > } > > +int kvm_has_many_iobus_devs(void) > +{ > + return 0; > +} > + > void kvm_setup_guest_memory(void *start, size_t size) > { > } > diff --git a/kvm.h b/kvm.h > index 50b6c01..f405906 100644 > --- a/kvm.h > +++ b/kvm.h > @@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void); > int kvm_has_debugregs(void); > int kvm_has_xsave(void); > int kvm_has_xcrs(void); > +int kvm_has_many_iobus_devs(void); Looks like a pretty low level API. Name it 'kvm_has_ioeventfd' or something like this? > > #ifdef NEED_CPU_H > int kvm_init_vcpu(CPUState *env); > -- > 1.7.1
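For illustration, here is the probe rewritten along the lines the review suggests: plain eventfds instead of EventNotifier, and a name that reflects what is actually being tested. This is a sketch, not the committed code; it reuses the kvm_set_ioeventfd_pio_word() helper that the patch already calls, whose prototype is assumed here.

/*
 * Sketch of the io bus probe using raw eventfds, as the review suggests.
 * The kvm_set_ioeventfd_pio_word() prototype below is an assumption based
 * on how the patch calls it.
 */
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool assign);

#define IOEVENTFD_PROBE_COUNT 7    /* one more than the old 6-device limit */

static int kvm_check_many_ioeventfds(void)
{
    int fds[IOEVENTFD_PROBE_COUNT];
    int i, ret;

    for (i = 0; i < IOEVENTFD_PROBE_COUNT; i++) {
        fds[i] = eventfd(0, 0);
        if (fds[i] < 0) {
            break;
        }
        if (kvm_set_ioeventfd_pio_word(fds[i], 0, i, true) < 0) {
            close(fds[i]);
            break;
        }
    }

    /* Only if every registration succeeded is the io bus big enough. */
    ret = (i == IOEVENTFD_PROBE_COUNT);

    while (i-- > 0) {
        kvm_set_ioeventfd_pio_word(fds[i], 0, i, false);
        close(fds[i]);
    }
    return ret;
}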
On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote: > On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote: > >Is there anything stopping this patch from being merged? > > Michael, any objections? If not, I'll merge it. I don't really understand what's going on there. The extra state in notifiers especially scares me. If you do and are comfortable with the code, go ahead :)
On Tue, Oct 19, 2010 at 02:44:35PM +0100, Stefan Hajnoczi wrote: > On Tue, Oct 19, 2010 at 2:35 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote: > >> On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote: > >> >Is there anything stopping this patch from being merged? > >> > >> Michael, any objections? If not, I'll merge it. > > > > I don't really understand what's going on there. The extra state in > > notifiers especially scares me. If you do and are comfortable with the > > code, go ahead :) > > I'm happy to address your comments. The state machine was a bit icky > but I don't see a way around it. I think the situation is similar to irqfd in qemu-kvm - take a look there, specifically msix mask notifiers.
On Tue, Oct 19, 2010 at 2:35 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote: >> On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote: >> >Is there anything stopping this patch from being merged? >> >> Michael, any objections? If not, I'll merge it. > > I don't really understand what's going on there. The extra state in > notifiers especially scares me. If you do and are comfortable with the > code, go ahead :) I'm happy to address your comments. The state machine was a bit icky but I don't see a way around it. Will follow up to your review email. Stefan
On Tue, Oct 19, 2010 at 03:33:41PM +0200, Michael S. Tsirkin wrote:
> My main concern is with the fact that we add more state
> in notifiers that can easily get out of sync with users.
> If we absolutely need this state, let's try to at least
> document the state machine, and make the API
> for state transitions more transparent.

I'll try to describe how it works. If you're happy with the design in
principle then I can rework the code. Otherwise we can think about a
different design.

The goal is to use ioeventfd instead of the synchronous pio emulation
path that userspace virtqueues use today. Both virtio-blk and virtio-net
increase performance with this approach because it does not block the
vcpu from executing guest code while the I/O operation is initiated. We
want to automatically create an event notifier and set up ioeventfd for
each initialized virtqueue.

Vhost already uses ioeventfd so it is important not to interfere with
devices that have enabled vhost. If vhost is enabled, then the device's
virtqueues are off-limits and should not be tampered with.

Furthermore, older kernels limit you to 6 ioeventfds per guest. On such
systems it is risky to automatically use ioeventfd for userspace
virtqueues, since that could take a precious ioeventfd away from another
virtio device using vhost. Existing guest configurations would break so
it is simplest to avoid using ioeventfd for userspace virtqueues on such
hosts.

The design adds logic into hw/virtio.c to automatically use ioeventfd
for userspace virtqueues. Specific virtio devices like blk and net
require no modification. The logic sits below the set_host_notifier()
function that vhost uses.

This design stays in sync because it speaks two interfaces that allow it
to accurately track whether or not to use ioeventfd:

1. virtio_set_host_notifier() is used by vhost. When vhost enables the
   host notifier we stay out of the way.

2. virtio_reset()/virtio_set_status()/virtio_load() define the device
   life-cycle and transition the state machine appropriately. Migration
   is supported.

Here is the state machine that tracks a virtqueue. It has three states,
deassigned (the start state), assigned, and offlimits, with the following
transitions:

a. The virtqueue starts deassigned with no ioeventfd.
b. When the device status becomes VIRTIO_CONFIG_S_DRIVER_OK we try to
   assign an ioeventfd to each virtqueue, except if the 6 ioeventfd
   limitation is present.
c, d. The virtqueue becomes offlimits if vhost enables the host notifier.
e. The ioeventfd becomes assigned again when the host notifier is
   disabled by vhost.
f. Except when the 6 ioeventfd limitation is present, then the ioeventfd
   becomes deassigned because we want to avoid using ioeventfd.
g. When the device is reset its virtqueues become deassigned again.

Does this make sense?

Stefan
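The description above, restated as code for clarity. The state names mirror the patch, but the event names and the transition function are hypothetical; they only encode the transitions a-g that Stefan lists.

/*
 * Restatement of the explained state machine.  Event names and the
 * transition function are hypothetical; the patch's own code differs.
 */
#include <stdbool.h>

enum host_notifier_state {
    HOST_NOTIFIER_DEASSIGNED,   /* a: start state, no ioeventfd in use     */
    HOST_NOTIFIER_ASSIGNED,     /* b, e: ioeventfd bound, QEMU services it */
    HOST_NOTIFIER_OFFLIMITS,    /* c, d: vhost owns the host notifier      */
};

enum host_notifier_event {
    EV_DRIVER_OK,               /* status write or migration load          */
    EV_VHOST_START,             /* vhost enables the host notifier         */
    EV_VHOST_STOP,              /* vhost disables the host notifier        */
    EV_DEVICE_RESET,            /* virtio_reset()                          */
};

static enum host_notifier_state
host_notifier_next_state(enum host_notifier_state cur,
                         enum host_notifier_event ev,
                         bool has_many_ioeventfds)
{
    switch (ev) {
    case EV_DRIVER_OK:          /* b, or stay deassigned on old kernels */
        return has_many_ioeventfds ? HOST_NOTIFIER_ASSIGNED
                                   : HOST_NOTIFIER_DEASSIGNED;
    case EV_VHOST_START:        /* c, d */
        return HOST_NOTIFIER_OFFLIMITS;
    case EV_VHOST_STOP:         /* e, or f on old kernels */
        return has_many_ioeventfds ? HOST_NOTIFIER_ASSIGNED
                                   : HOST_NOTIFIER_DEASSIGNED;
    case EV_DEVICE_RESET:       /* g */
        return HOST_NOTIFIER_DEASSIGNED;
    }
    return cur;
}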
Hi Michael, I have looked into the way irqfd with msix mask notifiers works. From what I can tell, the guest notifiers are enabled by vhost net in order to hook up irqfds for the virtqueues. MSIX allows vectors to be masked so there is a mmio write notifier in qemu-kvm to toggle the irqfd and its QEMU fd handler when the guest toggles the MSIX mask. The irqfd is disabled but stays open as an eventfd while masked. That means masking/unmasking the vector does not close/open the eventfd file descriptor itself. I'm having trouble finding a direct parallel to virtio-ioeventfd here. We always want to have an ioeventfd per virtqueue unless the host kernel does not support >6 ioeventfds per VM. When vhost sets the host notifier we want to remove the QEMU fd handler and allow vhost to use the event notifier's fd as it wants. When vhost clears the host notifier we want to add the QEMU fd handler again (unless the kernel does not support >6 ioeventfds per VM). I think hooking in at the virtio-pci.c level instead of virtio.c is possible but we're still going to have the same state transitions. I hope it can be done without adding per-virtqueue variables that track state. Before I go down this route, is there something I've missed and do you think this approach will be better? Stefan
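Whichever layer the logic ends up in, the handover Stefan describes boils down to switching who reads the eventfd. A sketch of that switch as it would look inside the patch's hw/virtio.c follows; virtio_read_host_notifier() and the notifier accessors come from the patch, while the wrapper name is hypothetical.

/*
 * Sketch in the context of the patch's hw/virtio.c: toggle ownership of
 * the host notifier between vhost and QEMU's iothread.  Wrapper name is
 * hypothetical.
 */
static void virtio_queue_set_vhost_ownership(VirtQueue *vq, bool vhost_owns)
{
    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
    int fd = event_notifier_get_fd(notifier);

    if (vhost_owns) {
        /* vhost polls the fd itself: drop QEMU's handler. */
        qemu_set_fd_handler(fd, NULL, NULL, NULL);
    } else {
        /* Userspace virtio again: service notifies from the iothread. */
        qemu_set_fd_handler(fd, virtio_read_host_notifier, NULL, vq);
    }
}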
Virtqueue notify is currently handled synchronously in userspace virtio.
This prevents the vcpu from executing guest code while hardware emulation
code handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to
the iothread and allowing the VM to continue execution. This model is
similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio
devices. Virtio-blk throughput increases especially for multithreaded
scenarios and virtio-net transmit throughput increases substantially.
Full numbers are below.

This patch employs ioeventfd virtqueue notify for all virtio devices.
Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
must be taken so that vhost-net, the other ioeventfd user in QEMU, is
able to function. On such kernels ioeventfd virtqueue notify will not
be used.

Khoa Huynh <khoa@us.ibm.com> collected the following data for virtio-blk
with cache=none,aio=native:

FFSB Test          Threads  Unmodified  Patched
                            (MB/s)      (MB/s)
Large file create  1        21.7        21.8
                   8        101.0       118.0
                   16       119.0       157.0

Sequential reads   1        21.9        23.2
                   8        114.0       139.0
                   16       143.0       178.0

Random reads       1        3.3         3.6
                   8        23.0        25.4
                   16       43.3        47.8

Random writes      1        22.2        23.0
                   8        93.1        111.6
                   16       110.5       132.0

Sridhar Samudrala <sri@us.ibm.com> collected the following data for
virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.

Guest to Host TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536     12755      6430        7590
16384     8499       3084        5764
4096      4723       1578        3659
1024      1827       981         2060

Host to Guest TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536     11156      5790        5853
16384     10787      5575        5691
4096      10452      5556        4277
1024      4437       3671        5277

Guest to Host TCP_RR latency(transactions/sec)
----------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
1         9903       3459        3425
4096      7185       1931        1899
16384     6108       2102        1923
65536     3161       1610        1744

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
Small changes are required for qemu-kvm.git. I will send them once
qemu.git has virtio-ioeventfd support.

 hw/vhost.c  |    6 ++--
 hw/virtio.c |  106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio.h |    9 +----
 kvm-all.c   |   39 +++++++++++++++++++++
 kvm-stub.c  |    5 +++
 kvm.h       |    1 +
 6 files changed, 156 insertions(+), 10 deletions(-)