Patchwork virtio: Use ioeventfd for virtqueue notify

login
register
mail settings
Submitter Stefan Hajnoczi
Date Sept. 30, 2010, 2:01 p.m.
Message ID <1285855312-11739-1-git-send-email-stefanha@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/66175/
State New
Headers show

Comments

Stefan Hajnoczi - Sept. 30, 2010, 2:01 p.m.
Virtqueue notify is currently handled synchronously in userspace virtio.
This prevents the vcpu from executing guest code while hardware
emulation code handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to
the iothread and allowing the VM to continue execution.  This model is
similar to how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio
devices.  Virtio-blk throughput increases especially for multithreaded
scenarios and virtio-net transmit throughput increases substantially.
Full numbers are below.

This patch employs ioeventfd virtqueue notify for all virtio devices.
Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
must be taken so that vhost-net, the other ioeventfd user in QEMU, is
able to function.  On such kernels ioeventfd virtqueue notify will not
be used.

Khoa Huynh <khoa@us.ibm.com> collected the following data for
virtio-blk with cache=none,aio=native:

FFSB Test          Threads  Unmodified  Patched
                            (MB/s)      (MB/s)
Large file create  1        21.7        21.8
                   8        101.0       118.0
                   16       119.0       157.0

Sequential reads   1        21.9        23.2
                   8        114.0       139.0
                   16       143.0       178.0

Random reads       1        3.3         3.6
                   8        23.0        25.4
                   16       43.3        47.8

Random writes      1        22.2        23.0
                   8        93.1        111.6
                   16       110.5       132.0

Sridhar Samudrala <sri@us.ibm.com> collected the following data for
virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.

Guest to Host TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536         12755        6430                  7590
16384          8499        3084                  5764
 4096          4723        1578                  3659
 1024          1827         981                  2060

Host to Guest TCP_STREAM throughput(Mb/sec)
-------------------------------------------
Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
65536         11156        5790                  5853
16384         10787        5575                  5691
 4096         10452        5556                  4277
 1024          4437        3671                  5277

Guest to Host TCP_RR latency(transactions/sec)
----------------------------------------------

Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
    1          9903        3459                  3425
 4096          7185        1931                  1899
16384          6108        2102                  1923
65536          3161        1610                  1744

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
Small changes are required for qemu-kvm.git.  I will send them once qemu.git
has virtio-ioeventfd support.

 hw/vhost.c  |    6 ++--
 hw/virtio.c |  106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio.h |    9 +----
 kvm-all.c   |   39 +++++++++++++++++++++
 kvm-stub.c  |    5 +++
 kvm.h       |    1 +
 6 files changed, 156 insertions(+), 10 deletions(-)
Avi Kivity - Oct. 3, 2010, 11:01 a.m.
On 09/30/2010 04:01 PM, Stefan Hajnoczi wrote:
> Virtqueue notify is currently handled synchronously in userspace virtio.
> This prevents the vcpu from executing guest code while hardware
> emulation code handles the notify.
>
> On systems that support KVM, the ioeventfd mechanism can be used to make
> virtqueue notify a lightweight exit by deferring hardware emulation to
> the iothread and allowing the VM to continue execution.  This model is
> similar to how vhost receives virtqueue notifies.

Note that this is a tradeoff.  If an idle core is available and the 
scheduler places the iothread on that core, then the heavyweight exit is 
replaced by a lightweight exit + IPI.  If the iothread is co-located 
with the vcpu, then we'll take a heavyweight exit in any case.

The first case is very likely if the host cpu is undercommitted and 
there is heavy I/O activity.  This is a typical subsystem benchmark 
scenario (as opposed to a system benchmark like specvirt).  My feeling 
is that total system throughput will be decreased unless the scheduler 
is clever enough to place the iothread and vcpu on the same host cpu 
when the system is overcommitted.

We can't balance "feeling" against numbers, especially when we have a 
precedent in vhost-net, so I think this should go in.  But I think we 
should also try to understand the effects of the extra IPIs and 
cacheline bouncing that this creates.  While virtio was designed to 
minimize this, we know it has severe problems in this area.

> The result of this change is improved performance for userspace virtio
> devices.  Virtio-blk throughput increases especially for multithreaded
> scenarios and virtio-net transmit throughput increases substantially.
> Full numbers are below.
>
> This patch employs ioeventfd virtqueue notify for all virtio devices.
> Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
> must be taken so that vhost-net, the other ioeventfd user in QEMU, is
> able to function.  On such kernels ioeventfd virtqueue notify will not
> be used.
>
> Khoa Huynh<khoa@us.ibm.com>  collected the following data for
> virtio-blk with cache=none,aio=native:
>
> FFSB Test          Threads  Unmodified  Patched
>                              (MB/s)      (MB/s)
> Large file create  1        21.7        21.8
>                     8        101.0       118.0
>                     16       119.0       157.0
>
> Sequential reads   1        21.9        23.2
>                     8        114.0       139.0
>                     16       143.0       178.0
>
> Random reads       1        3.3         3.6
>                     8        23.0        25.4
>                     16       43.3        47.8
>
> Random writes      1        22.2        23.0
>                     8        93.1        111.6
>                     16       110.5       132.0

Impressive numbers.  Can you also provide efficiency (bytes per host cpu 
seconds)?

How many guest vcpus were used with this?  With enough vcpus, there is 
also a reduction in cacheline bouncing, since the virtio state in the 
host gets to stay on one cpu (especially with aio=native).

> Sridhar Samudrala<sri@us.ibm.com>  collected the following data for
> virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.
>
> Guest to Host TCP_STREAM throughput(Mb/sec)
> -------------------------------------------
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> 65536         12755        6430                  7590
> 16384          8499        3084                  5764
>   4096          4723        1578                  3659
>   1024          1827         981                  2060

Even more impressive (expected since the copying, which isn't present 
for block, is now shunted off into an iothread).

On the last test you even exceeded vhost-net.  Any theories how/why?

Again, efficiency numbers would be interesting.

> Host to Guest TCP_STREAM throughput(Mb/sec)
> -------------------------------------------
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> 65536         11156        5790                  5853
> 16384         10787        5575                  5691
>   4096         10452        5556                  4277
>   1024          4437        3671                  5277

Here you exceed vhost-net, too.

> +static int kvm_check_many_iobus_devs(void)
> +{
> +    /* Older kernels have a 6 device limit on the KVM io bus.  In that case
> +     * creating many ioeventfds must be avoided.  This tests checks for the
> +     * limitation.
> +     */
> +    EventNotifier notifiers[7];
> +    int i, ret = 0;
> +    for (i = 0; i<  ARRAY_SIZE(notifiers); i++) {
> +        ret = event_notifier_init(&notifiers[i], 0);
> +        if (ret<  0) {
> +            break;
> +        }
> +        ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true);
> +        if (ret<  0) {
> +            event_notifier_cleanup(&notifiers[i]);
> +            break;
> +        }
> +    }
> +
> +    /* Decide whether many devices are supported or not */
> +    ret = i == ARRAY_SIZE(notifiers);
> +
> +    while (i-->  0) {
> +        kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false);
> +        event_notifier_cleanup(&notifiers[i]);
> +    }
> +    return ret;
> +}

Sorry about that.

IIRC there was a problem (shared by vhost-net) with interrupts remaining 
enabled in the window between the guest kicking the queue and the host 
waking up and disabling interrupts.  An even more vague IIRC mst had an 
idea to fix this?
Michael S. Tsirkin - Oct. 3, 2010, 1:51 p.m.
On Sun, Oct 03, 2010 at 01:01:59PM +0200, Avi Kivity wrote:
> >
> >Guest to Host TCP_STREAM throughput(Mb/sec)
> >-------------------------------------------
> >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >65536         12755        6430                  7590
> >16384          8499        3084                  5764
> >  4096          4723        1578                  3659
> >  1024          1827         981                  2060
> 
> Even more impressive (expected since the copying, which isn't
> present for block, is now shunted off into an iothread).
> 
> On the last test you even exceeded vhost-net.  Any theories how/why?
> 
> Again, efficiency numbers would be interesting.
> 
> >Host to Guest TCP_STREAM throughput(Mb/sec)
> >-------------------------------------------
> >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >65536         11156        5790                  5853
> >16384         10787        5575                  5691
> >  4096         10452        5556                  4277
> >  1024          4437        3671                  5277
> 
> Here you exceed vhost-net, too.

This is with small packets- I suspect this is the extra
per interrupt overhead  that eventfd has.

> >+static int kvm_check_many_iobus_devs(void)
> >+{
> >+    /* Older kernels have a 6 device limit on the KVM io bus.  In that case
> >+     * creating many ioeventfds must be avoided.  This tests checks for the
> >+     * limitation.
> >+     */
> >+    EventNotifier notifiers[7];
> >+    int i, ret = 0;
> >+    for (i = 0; i<  ARRAY_SIZE(notifiers); i++) {
> >+        ret = event_notifier_init(&notifiers[i], 0);
> >+        if (ret<  0) {
> >+            break;
> >+        }
> >+        ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true);
> >+        if (ret<  0) {
> >+            event_notifier_cleanup(&notifiers[i]);
> >+            break;
> >+        }
> >+    }
> >+
> >+    /* Decide whether many devices are supported or not */
> >+    ret = i == ARRAY_SIZE(notifiers);
> >+
> >+    while (i-->  0) {
> >+        kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false);
> >+        event_notifier_cleanup(&notifiers[i]);
> >+    }
> >+    return ret;
> >+}
> 
> Sorry about that.
> 
> IIRC there was a problem (shared by vhost-net) with interrupts
> remaining enabled in the window between the guest kicking the queue
> and the host waking up and disabling interrupts.  An even more vague
> IIRC mst had an idea to fix this?

This is one of the things that vring2 is supposed to fix.

> -- 
> error compiling committee.c: too many arguments to function
Avi Kivity - Oct. 3, 2010, 2:21 p.m.
On 10/03/2010 03:51 PM, Michael S. Tsirkin wrote:
> On Sun, Oct 03, 2010 at 01:01:59PM +0200, Avi Kivity wrote:
> >  >
> >  >Guest to Host TCP_STREAM throughput(Mb/sec)
> >  >-------------------------------------------
> >  >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >  >65536         12755        6430                  7590
> >  >16384          8499        3084                  5764
> >  >   4096          4723        1578                  3659
> >  >   1024          1827         981                  2060
> >
> >  Even more impressive (expected since the copying, which isn't
> >  present for block, is now shunted off into an iothread).
> >
> >  On the last test you even exceeded vhost-net.  Any theories how/why?
> >
> >  Again, efficiency numbers would be interesting.
> >
> >  >Host to Guest TCP_STREAM throughput(Mb/sec)
> >  >-------------------------------------------
> >  >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >  >65536         11156        5790                  5853
> >  >16384         10787        5575                  5691
> >  >   4096         10452        5556                  4277
> >  >   1024          4437        3671                  5277
> >
> >  Here you exceed vhost-net, too.
>
> This is with small packets- I suspect this is the extra
> per interrupt overhead  that eventfd has.

This is using eventfd as well.
Michael S. Tsirkin - Oct. 3, 2010, 2:28 p.m.
On Sun, Oct 03, 2010 at 04:21:57PM +0200, Avi Kivity wrote:
>  On 10/03/2010 03:51 PM, Michael S. Tsirkin wrote:
> >On Sun, Oct 03, 2010 at 01:01:59PM +0200, Avi Kivity wrote:
> >>  >
> >>  >Guest to Host TCP_STREAM throughput(Mb/sec)
> >>  >-------------------------------------------
> >>  >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >>  >65536         12755        6430                  7590
> >>  >16384          8499        3084                  5764
> >>  >   4096          4723        1578                  3659
> >>  >   1024          1827         981                  2060
> >>
> >>  Even more impressive (expected since the copying, which isn't
> >>  present for block, is now shunted off into an iothread).
> >>
> >>  On the last test you even exceeded vhost-net.  Any theories how/why?
> >>
> >>  Again, efficiency numbers would be interesting.
> >>
> >>  >Host to Guest TCP_STREAM throughput(Mb/sec)
> >>  >-------------------------------------------
> >>  >Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> >>  >65536         11156        5790                  5853
> >>  >16384         10787        5575                  5691
> >>  >   4096         10452        5556                  4277
> >>  >   1024          4437        3671                  5277
> >>
> >>  Here you exceed vhost-net, too.
> >
> >This is with small packets- I suspect this is the extra
> >per interrupt overhead  that eventfd has.
> 
> This is using eventfd as well.

Sorry, I meant irqfd.

> -- 
> error compiling committee.c: too many arguments to function
Anthony Liguori - Oct. 4, 2010, 1:18 a.m.
On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote:
>
>> This is using eventfd as well.
>>      
> Sorry, I meant irqfd.
>    

I've tried using irqfd in userspace.  It hurts performance quite a bit 
compared to doing an ioctl so I would suspect this too.

A last_used_idx or similar mechanism should help performance quite a bit 
on top of ioeventfd too.

Regards,

Anthony Liguori

>> -- 
>> error compiling committee.c: too many arguments to function
>>
Avi Kivity - Oct. 4, 2010, 8:04 a.m.
On 10/04/2010 03:18 AM, Anthony Liguori wrote:
> On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote:
>>
>>> This is using eventfd as well.
>> Sorry, I meant irqfd.
>
> I've tried using irqfd in userspace.  It hurts performance quite a bit 
> compared to doing an ioctl so I would suspect this too.
>
> A last_used_idx or similar mechanism should help performance quite a 
> bit on top of ioeventfd too.
>

Any idea why?  While irqfd does quite a bit of extra locking, it 
shouldn't be that bad.
Anthony Liguori - Oct. 4, 2010, 2:01 p.m.
On 10/04/2010 03:04 AM, Avi Kivity wrote:
>  On 10/04/2010 03:18 AM, Anthony Liguori wrote:
>> On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote:
>>>
>>>> This is using eventfd as well.
>>> Sorry, I meant irqfd.
>>
>> I've tried using irqfd in userspace.  It hurts performance quite a 
>> bit compared to doing an ioctl so I would suspect this too.
>>
>> A last_used_idx or similar mechanism should help performance quite a 
>> bit on top of ioeventfd too.
>>
>
> Any idea why?  While irqfd does quite a bit of extra locking, it 
> shouldn't be that bad.

Not really.  It was somewhat counter intuitive.

A worthwhile experiment might be to do some layering violations and have 
vhost do an irq injection via an ioctl and see what the performance 
delta is.  I suspect it could give vhost a nice boost.

Regards,

Anthony Liguori
Stefan Hajnoczi - Oct. 4, 2010, 2:30 p.m.
On Sun, Oct 3, 2010 at 12:01 PM, Avi Kivity <avi@redhat.com> wrote:
>  On 09/30/2010 04:01 PM, Stefan Hajnoczi wrote:
>>
>> Virtqueue notify is currently handled synchronously in userspace virtio.
>> This prevents the vcpu from executing guest code while hardware
>> emulation code handles the notify.
>>
>> On systems that support KVM, the ioeventfd mechanism can be used to make
>> virtqueue notify a lightweight exit by deferring hardware emulation to
>> the iothread and allowing the VM to continue execution.  This model is
>> similar to how vhost receives virtqueue notifies.
>
> Note that this is a tradeoff.  If an idle core is available and the
> scheduler places the iothread on that core, then the heavyweight exit is
> replaced by a lightweight exit + IPI.  If the iothread is co-located with
> the vcpu, then we'll take a heavyweight exit in any case.
>
> The first case is very likely if the host cpu is undercommitted and there is
> heavy I/O activity.  This is a typical subsystem benchmark scenario (as
> opposed to a system benchmark like specvirt).  My feeling is that total
> system throughput will be decreased unless the scheduler is clever enough to
> place the iothread and vcpu on the same host cpu when the system is
> overcommitted.
>
> We can't balance "feeling" against numbers, especially when we have a
> precedent in vhost-net, so I think this should go in.  But I think we should
> also try to understand the effects of the extra IPIs and cacheline bouncing
> that this creates.  While virtio was designed to minimize this, we know it
> has severe problems in this area.

Right, there is a danger of optimizing for subsystem benchmark cases
rather than real world usage.  I have posted some results that we've
gathered but more scrutiny is welcome.

>> Khoa Huynh<khoa@us.ibm.com>  collected the following data for
>> virtio-blk with cache=none,aio=native:
>>
>> FFSB Test          Threads  Unmodified  Patched
>>                             (MB/s)      (MB/s)
>> Large file create  1        21.7        21.8
>>                    8        101.0       118.0
>>                    16       119.0       157.0
>>
>> Sequential reads   1        21.9        23.2
>>                    8        114.0       139.0
>>                    16       143.0       178.0
>>
>> Random reads       1        3.3         3.6
>>                    8        23.0        25.4
>>                    16       43.3        47.8
>>
>> Random writes      1        22.2        23.0
>>                    8        93.1        111.6
>>                    16       110.5       132.0
>
> Impressive numbers.  Can you also provide efficiency (bytes per host cpu
> seconds)?

Khoa, do you have the host CPU numbers for these benchmark runs?

> How many guest vcpus were used with this?  With enough vcpus, there is also
> a reduction in cacheline bouncing, since the virtio state in the host gets
> to stay on one cpu (especially with aio=native).

Guest: 2 vcpu, 4 GB RAM
Host: 16 cpus, 12 GB RAM

Khoa, is this correct?

Stefan
Michael S. Tsirkin - Oct. 4, 2010, 4:12 p.m.
On Mon, Oct 04, 2010 at 09:01:14AM -0500, Anthony Liguori wrote:
> On 10/04/2010 03:04 AM, Avi Kivity wrote:
> > On 10/04/2010 03:18 AM, Anthony Liguori wrote:
> >>On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote:
> >>>
> >>>>This is using eventfd as well.
> >>>Sorry, I meant irqfd.
> >>
> >>I've tried using irqfd in userspace.  It hurts performance quite
> >>a bit compared to doing an ioctl so I would suspect this too.
> >>
> >>A last_used_idx or similar mechanism should help performance
> >>quite a bit on top of ioeventfd too.
> >>
> >
> >Any idea why?  While irqfd does quite a bit of extra locking, it
> >shouldn't be that bad.
> 
> Not really.  It was somewhat counter intuitive.
> 
> A worthwhile experiment might be to do some layering violations and
> have vhost do an irq injection via an ioctl and see what the
> performance delta is.

I think you don't even need to try that hard.
Just comment this line:
//   proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier;
this is what switches to irqfd when msi vector is unmasked.


>  I suspect it could give vhost a nice boost.
> 
> Regards,
> 
> Anthony Liguori
Anthony Liguori - Oct. 4, 2010, 4:20 p.m.
On 10/04/2010 11:12 AM, Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2010 at 09:01:14AM -0500, Anthony Liguori wrote:
>    
>> On 10/04/2010 03:04 AM, Avi Kivity wrote:
>>      
>>> On 10/04/2010 03:18 AM, Anthony Liguori wrote:
>>>        
>>>> On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote:
>>>>          
>>>>>            
>>>>>> This is using eventfd as well.
>>>>>>              
>>>>> Sorry, I meant irqfd.
>>>>>            
>>>> I've tried using irqfd in userspace.  It hurts performance quite
>>>> a bit compared to doing an ioctl so I would suspect this too.
>>>>
>>>> A last_used_idx or similar mechanism should help performance
>>>> quite a bit on top of ioeventfd too.
>>>>
>>>>          
>>> Any idea why?  While irqfd does quite a bit of extra locking, it
>>> shouldn't be that bad.
>>>        
>> Not really.  It was somewhat counter intuitive.
>>
>> A worthwhile experiment might be to do some layering violations and
>> have vhost do an irq injection via an ioctl and see what the
>> performance delta is.
>>      
> I think you don't even need to try that hard.
> Just comment this line:
> //   proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier;
> this is what switches to irqfd when msi vector is unmasked.
>    

That drops to userspace though for all irqs, no?

Or did you mean that commenting that line out improves performance 
demonstrating the overhead of irqfd?

Regards,

Anthony Liguori

>    
>>   I suspect it could give vhost a nice boost.
>>
>> Regards,
>>
>> Anthony Liguori
>>
Michael S. Tsirkin - Oct. 4, 2010, 4:25 p.m.
On Mon, Oct 04, 2010 at 11:20:19AM -0500, Anthony Liguori wrote:
> On 10/04/2010 11:12 AM, Michael S. Tsirkin wrote:
> >On Mon, Oct 04, 2010 at 09:01:14AM -0500, Anthony Liguori wrote:
> >>On 10/04/2010 03:04 AM, Avi Kivity wrote:
> >>>On 10/04/2010 03:18 AM, Anthony Liguori wrote:
> >>>>On 10/03/2010 09:28 AM, Michael S. Tsirkin wrote:
> >>>>>>This is using eventfd as well.
> >>>>>Sorry, I meant irqfd.
> >>>>I've tried using irqfd in userspace.  It hurts performance quite
> >>>>a bit compared to doing an ioctl so I would suspect this too.
> >>>>
> >>>>A last_used_idx or similar mechanism should help performance
> >>>>quite a bit on top of ioeventfd too.
> >>>>
> >>>Any idea why?  While irqfd does quite a bit of extra locking, it
> >>>shouldn't be that bad.
> >>Not really.  It was somewhat counter intuitive.
> >>
> >>A worthwhile experiment might be to do some layering violations and
> >>have vhost do an irq injection via an ioctl and see what the
> >>performance delta is.
> >I think you don't even need to try that hard.
> >Just comment this line:
> >//   proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier;
> >this is what switches to irqfd when msi vector is unmasked.
> 
> That drops to userspace though for all irqs, no?

Exactly.

> Or did you mean that commenting that line out improves performance
> demonstrating the overhead of irqfd?
> 
> Regards,
> 
> Anthony Liguori

Haven't tried this, but possibly.

> >>  I suspect it could give vhost a nice boost.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
rukhsana ansari - Oct. 5, 2010, 11 a.m.
Hi,

W.r.t:
> Note that this is a tradeoff.  If an idle core is available and the
> scheduler places the iothread on that core, then the heavyweight exit is
> replaced by a lightweight exit + IPI.  If the iothread is co-located with
> the vcpu, then we'll take a heavyweight exit in any case.
>
Q: Does the kvm kernel code check for such a condition and take a
heavyweight exit?

> The first case is very likely if the host cpu is undercommitted and there is
> heavy I/O activity.  This is a typical subsystem benchmark scenario (as
> opposed to a system benchmark like specvirt).  My feeling is that total
> system throughput will be decreased unless the scheduler is clever enough to
> place the iothread and vcpu on the same host cpu when the system is
> overcommitted.
>
>
Q: Sorry if the answer is obvious here.
If the heavyweight exit is taken when both threads are assigned to the
same core, how will the system throughput increase?


Thanks
Rukhsana
Avi Kivity - Oct. 5, 2010, 11:58 a.m.
On 10/05/2010 01:00 PM, rukhsana ansari wrote:
> Hi,
>
> W.r.t:
> >  Note that this is a tradeoff.  If an idle core is available and the
> >  scheduler places the iothread on that core, then the heavyweight exit is
> >  replaced by a lightweight exit + IPI.  If the iothread is co-located with
> >  the vcpu, then we'll take a heavyweight exit in any case.
> >
> Q: Does the kvm kernel code check for such a condition and take a
> heavyweight exit?

No.  The heavyweight exit is caused by a context switch (partial) or 
return to userspace (full).

> >  The first case is very likely if the host cpu is undercommitted and there is
> >  heavy I/O activity.  This is a typical subsystem benchmark scenario (as
> >  opposed to a system benchmark like specvirt).  My feeling is that total
> >  system throughput will be decreased unless the scheduler is clever enough to
> >  place the iothread and vcpu on the same host cpu when the system is
> >  overcommitted.
> >
> >
> Q: Sorry if the answer is obvious here.
> If the heavyweight exit is taken when both threads are assigned to the
> same core, how will the system throughput increase?
>

Co-locating threads on the same core reduces cross-core traffic.
Stefan Hajnoczi - Oct. 19, 2010, 1:07 p.m.
On Thu, Sep 30, 2010 at 03:01:52PM +0100, Stefan Hajnoczi wrote:
> Virtqueue notify is currently handled synchronously in userspace virtio.
> This prevents the vcpu from executing guest code while hardware
> emulation code handles the notify.
> 
> On systems that support KVM, the ioeventfd mechanism can be used to make
> virtqueue notify a lightweight exit by deferring hardware emulation to
> the iothread and allowing the VM to continue execution.  This model is
> similar to how vhost receives virtqueue notifies.
> 
> The result of this change is improved performance for userspace virtio
> devices.  Virtio-blk throughput increases especially for multithreaded
> scenarios and virtio-net transmit throughput increases substantially.
> Full numbers are below.
> 
> This patch employs ioeventfd virtqueue notify for all virtio devices.
> Linux kernels pre-2.6.34 only allow for 6 ioeventfds per VM and care
> must be taken so that vhost-net, the other ioeventfd user in QEMU, is
> able to function.  On such kernels ioeventfd virtqueue notify will not
> be used.
> 
> Khoa Huynh <khoa@us.ibm.com> collected the following data for
> virtio-blk with cache=none,aio=native:
> 
> FFSB Test          Threads  Unmodified  Patched
>                             (MB/s)      (MB/s)
> Large file create  1        21.7        21.8
>                    8        101.0       118.0
>                    16       119.0       157.0
> 
> Sequential reads   1        21.9        23.2
>                    8        114.0       139.0
>                    16       143.0       178.0
> 
> Random reads       1        3.3         3.6
>                    8        23.0        25.4
>                    16       43.3        47.8
> 
> Random writes      1        22.2        23.0
>                    8        93.1        111.6
>                    16       110.5       132.0
> 
> Sridhar Samudrala <sri@us.ibm.com> collected the following data for
> virtio-net with 2.6.36-rc1 on the host and 2.6.34 on the guest.
> 
> Guest to Host TCP_STREAM throughput(Mb/sec)
> -------------------------------------------
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> 65536         12755        6430                  7590
> 16384          8499        3084                  5764
>  4096          4723        1578                  3659
>  1024          1827         981                  2060
> 
> Host to Guest TCP_STREAM throughput(Mb/sec)
> -------------------------------------------
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
> 65536         11156        5790                  5853
> 16384         10787        5575                  5691
>  4096         10452        5556                  4277
>  1024          4437        3671                  5277
> 
> Guest to Host TCP_RR latency(transactions/sec)
> ----------------------------------------------
> 
> Msg Size  vhost-net  virtio-net  virtio-net/ioeventfd
>     1          9903        3459                  3425
>  4096          7185        1931                  1899
> 16384          6108        2102                  1923
> 65536          3161        1610                  1744
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> Small changes are required for qemu-kvm.git.  I will send them once qemu.git
> has virtio-ioeventfd support.
> 
>  hw/vhost.c  |    6 ++--
>  hw/virtio.c |  106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/virtio.h |    9 +----
>  kvm-all.c   |   39 +++++++++++++++++++++
>  kvm-stub.c  |    5 +++
>  kvm.h       |    1 +
>  6 files changed, 156 insertions(+), 10 deletions(-)

Is there anything stopping this patch from being merged?

Thanks,
Stefan
Anthony Liguori - Oct. 19, 2010, 1:12 p.m.
On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote:
> Is there anything stopping this patch from being merged?
>    

Michael, any objections?  If not, I'll merge it.

Regards,

Anthony Liguori

> Thanks,
> Stefan
>
Michael S. Tsirkin - Oct. 19, 2010, 1:33 p.m.
As a general comment, could you please try to split this patch
up, to make it easier to review? I did a pass over it but I am
still not understanding it completely.

My main concern is with the fact that we add more state
in notifiers that can easily get out of sync with users.
If we absolutely need this state, let's try to at least
document the state machine, and make the API
for state transitions more transparent.

On Thu, Sep 30, 2010 at 03:01:52PM +0100, Stefan Hajnoczi wrote:
> diff --git a/hw/vhost.c b/hw/vhost.c
> index 1b8624d..f127a07 100644
> --- a/hw/vhost.c
> +++ b/hw/vhost.c
> @@ -517,7 +517,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
>          goto fail_guest_notifier;
>      }
>  
> -    r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, true);
> +    r = virtio_set_host_notifier(vdev, idx, true);
>      if (r < 0) {
>          fprintf(stderr, "Error binding host notifier: %d\n", -r);
>          goto fail_host_notifier;
> @@ -539,7 +539,7 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
>  
>  fail_call:
>  fail_kick:
> -    vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
> +    virtio_set_host_notifier(vdev, idx, false);
>  fail_host_notifier:
>      vdev->binding->set_guest_notifier(vdev->binding_opaque, idx, false);
>  fail_guest_notifier:
> @@ -575,7 +575,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
>      }
>      assert (r >= 0);
>  
> -    r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
> +    r = virtio_set_host_notifier(vdev, idx, false);
>      if (r < 0) {
>          fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
>          fflush(stderr);
> diff --git a/hw/virtio.c b/hw/virtio.c
> index fbef788..f075b3a 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -16,6 +16,7 @@
>  #include "trace.h"
>  #include "virtio.h"
>  #include "sysemu.h"
> +#include "kvm.h"
>  
>  /* The alignment to use between consumer and producer parts of vring.
>   * x86 pagesize again. */
> @@ -77,6 +78,11 @@ struct VirtQueue
>      VirtIODevice *vdev;
>      EventNotifier guest_notifier;
>      EventNotifier host_notifier;
> +    enum {
> +        HOST_NOTIFIER_DEASSIGNED,   /* inactive */
> +        HOST_NOTIFIER_ASSIGNED,     /* active */
> +        HOST_NOTIFIER_OFFLIMITS,    /* active but outside our control */
> +    } host_notifier_state;

This state machine confuses me. Please note that users already
track notifier state and call set with assign/deassign correctly.
The comment does not help: what does 'outside our control' mean?
Who's control?

>  };
>  
>  /* virt queue functions */
> @@ -453,6 +459,93 @@ void virtio_update_irq(VirtIODevice *vdev)
>      virtio_notify_vector(vdev, VIRTIO_NO_VECTOR);
>  }
>  
> +/* Service virtqueue notify from a host notifier */
> +static void virtio_read_host_notifier(void *opaque)
> +{
> +    VirtQueue *vq = opaque;
> +    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
> +    if (event_notifier_test_and_clear(notifier)) {
> +        if (vq->vring.desc) {
> +            vq->handle_output(vq->vdev, vq);
> +        }
> +    }
> +}
> +
> +/* Transition between host notifier states */
> +static int virtio_set_host_notifier_state(VirtIODevice *vdev, int n, int state)

really unfortunate naming for functions: we seem to have
about 4 of them starting with virtio_set_host_notifier*

> +{
> +    VirtQueue *vq = &vdev->vq[n];
> +    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
> +    int rc;
> +
> +    if (!kvm_enabled()) {
> +        return -ENOSYS;
> +    }

If this means that there's no need to do anything for non kvm,
return 0 here.

> +
> +    /* If the number of ioeventfds is limited, use them for vhost only */
> +    if (state == HOST_NOTIFIER_ASSIGNED && !kvm_has_many_iobus_devs()) {
> +        state = HOST_NOTIFIER_DEASSIGNED;
> +    }
> +
> +    /* Ignore if no state change */
> +    if (vq->host_notifier_state == state) {
> +        return 0;
> +    }
> +
> +    /* Disable read handler if transitioning away from assigned */
> +    if (vq->host_notifier_state == HOST_NOTIFIER_ASSIGNED) {
> +        qemu_set_fd_handler(event_notifier_get_fd(notifier), NULL, NULL, NULL);
> +    }
> +
> +    /* Toggle host notifier if transitioning to or from deassigned */
> +    if (state == HOST_NOTIFIER_DEASSIGNED ||
> +        vq->host_notifier_state == HOST_NOTIFIER_DEASSIGNED) {
> +        rc = vdev->binding->set_host_notifier(vdev->binding_opaque, n,
> +                state != HOST_NOTIFIER_DEASSIGNED);
> +        if (rc < 0) {
> +            return rc;
> +        }
> +    }
> +
> +    /* Enable read handler if transitioning to assigned */
> +    if (state == HOST_NOTIFIER_ASSIGNED) {
> +        qemu_set_fd_handler(event_notifier_get_fd(notifier),
> +                            virtio_read_host_notifier, NULL, vq);
> +    }
> +
> +    vq->host_notifier_state = state;
> +    return 0;
> +}
> +
> +/* Try to assign/deassign host notifiers for all virtqueues */
> +static void virtio_set_host_notifiers(VirtIODevice *vdev, bool assigned)

void? don't we care whether this fails?

> +{

This really confuses me. Why is this not a simple loop
over all vqs?

> +    int state = assigned ? HOST_NOTIFIER_ASSIGNED : HOST_NOTIFIER_DEASSIGNED;
> +    int i;
> +
> +    if (!vdev->binding->set_host_notifier) {
> +        return;
> +    }
> +
> +    for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> +        if (vdev->vq[i].host_notifier_state == HOST_NOTIFIER_OFFLIMITS) {
> +            continue;
> +        }
> +
> +        if (!vdev->vq[i].vring.desc) {
> +            continue;
> +        }
> +
> +        virtio_set_host_notifier_state(vdev, i, state);

ignores return value?

> +    }
> +}
> +
> +int virtio_set_host_notifier(VirtIODevice *vdev, int n, bool assigned)
> +{
> +    int state = assigned ? HOST_NOTIFIER_OFFLIMITS : HOST_NOTIFIER_ASSIGNED;

So here assigned == false sets state to ASSIGNED?


> +    return virtio_set_host_notifier_state(vdev, n, state);
> +}
> +
>  void virtio_reset(void *opaque)
>  {
>      VirtIODevice *vdev = opaque;
> @@ -467,6 +560,7 @@ void virtio_reset(void *opaque)
>      vdev->isr = 0;
>      vdev->config_vector = VIRTIO_NO_VECTOR;
>      virtio_notify_vector(vdev, vdev->config_vector);
> +    virtio_set_host_notifiers(vdev, false);
>  
>      for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>          vdev->vq[i].vring.desc = 0;
> @@ -592,6 +686,16 @@ void virtio_queue_set_vector(VirtIODevice *vdev, int n, uint16_t vector)
>          vdev->vq[n].vector = vector;
>  }
>  
> +void virtio_set_status(VirtIODevice *vdev, uint8_t val)
> +{
> +    virtio_set_host_notifiers(vdev, val & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +    if (vdev->set_status) {
> +        vdev->set_status(vdev, val);
> +    }
> +    vdev->status = val;

How does this interact with vhost.c which uses both notifiers and status
callback?

> +}
> +
>  VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
>                              void (*handle_output)(VirtIODevice *, VirtQueue *))
>  {
> @@ -719,6 +823,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>      }
>  
>      virtio_notify_vector(vdev, VIRTIO_NO_VECTOR);
> +    virtio_set_host_notifiers(vdev, vdev->status & VIRTIO_CONFIG_S_DRIVER_OK);
>      return 0;
>  }
>  
> @@ -746,6 +851,7 @@ VirtIODevice *virtio_common_init(const char *name, uint16_t device_id,
>      for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>          vdev->vq[i].vector = VIRTIO_NO_VECTOR;
>          vdev->vq[i].vdev = vdev;
> +        vdev->vq[i].host_notifier_state = HOST_NOTIFIER_DEASSIGNED;
>      }
>  
>      vdev->name = name;
> diff --git a/hw/virtio.h b/hw/virtio.h
> index 96514e6..d76157e 100644
> --- a/hw/virtio.h
> +++ b/hw/virtio.h
> @@ -125,13 +125,7 @@ struct VirtIODevice
>      uint16_t device_id;
>  };
>  
> -static inline void virtio_set_status(VirtIODevice *vdev, uint8_t val)
> -{
> -    if (vdev->set_status) {
> -        vdev->set_status(vdev, val);
> -    }
> -    vdev->status = val;
> -}
> +void virtio_set_status(VirtIODevice *vdev, uint8_t val);
>  
>  VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
>                              void (*handle_output)(VirtIODevice *,
> @@ -217,6 +211,7 @@ target_phys_addr_t virtio_queue_get_ring_size(VirtIODevice *vdev, int n);
>  uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n);
>  void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
>  VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n);
> +int virtio_set_host_notifier(VirtIODevice *vdev, int n, bool assigned);

I think it's not the best API. Two functions e.g.
_assign
_deassign
would make it easier to avoid confusion.

>  EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq);
>  EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
>  void virtio_irq(VirtQueue *vq);
> diff --git a/kvm-all.c b/kvm-all.c
> index 1cc696f..2f09e34 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -24,6 +24,7 @@
>  #include "qemu-barrier.h"
>  #include "sysemu.h"
>  #include "hw/hw.h"
> +#include "hw/event_notifier.h"
>  #include "gdbstub.h"
>  #include "kvm.h"
>  #include "bswap.h"
> @@ -72,6 +73,7 @@ struct KVMState
>      int irqchip_in_kernel;
>      int pit_in_kernel;
>      int xsave, xcrs;
> +    int many_iobus_devs;
>  };
>  
>  static KVMState *kvm_state;
> @@ -423,6 +425,36 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
>      return ret;
>  }
>  
> +static int kvm_check_many_iobus_devs(void)
> +{
> +    /* Older kernels have a 6 device limit on the KVM io bus.  In that case
> +     * creating many ioeventfds must be avoided.  This tests 

This tests -> this test

>+        checks for the limitation.
> +     */
> +    EventNotifier notifiers[7];

This is low level stuff, we really don't need to use notifiers here,
simple eventfd will be enough.

> +    int i, ret = 0;
> +    for (i = 0; i < ARRAY_SIZE(notifiers); i++) {
> +        ret = event_notifier_init(&notifiers[i], 0);
> +        if (ret < 0) {
> +            break;
> +        }
> +        ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true);
> +        if (ret < 0) {
> +            event_notifier_cleanup(&notifiers[i]);
> +            break;
> +        }
> +    }
> +
> +    /* Decide whether many devices are supported or not */
> +    ret = i == ARRAY_SIZE(notifiers);
> +
> +    while (i-- > 0) {
> +        kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false);
> +        event_notifier_cleanup(&notifiers[i]);
> +    }
> +    return ret;
> +}
> +
>  static void kvm_set_phys_mem(target_phys_addr_t start_addr,
>  			     ram_addr_t size,
>  			     ram_addr_t phys_offset)
> @@ -699,6 +731,8 @@ int kvm_init(int smp_cpus)
>      kvm_state = s;
>      cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client);
>  
> +    s->many_iobus_devs = kvm_check_many_iobus_devs();
> +
>      return 0;
>  
>  err:
> @@ -1028,6 +1062,11 @@ int kvm_has_xcrs(void)
>      return kvm_state->xcrs;
>  }
>  
> +int kvm_has_many_iobus_devs(void)
> +{
> +    return kvm_state->many_iobus_devs;
> +}
> +
>  void kvm_setup_guest_memory(void *start, size_t size)
>  {
>      if (!kvm_has_sync_mmu()) {
> diff --git a/kvm-stub.c b/kvm-stub.c
> index d45f9fa..b0887fb 100644
> --- a/kvm-stub.c
> +++ b/kvm-stub.c
> @@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void)
>      return 0;
>  }
>  
> +int kvm_has_many_iobus_devs(void)
> +{
> +    return 0;
> +}
> +
>  void kvm_setup_guest_memory(void *start, size_t size)
>  {
>  }
> diff --git a/kvm.h b/kvm.h
> index 50b6c01..f405906 100644
> --- a/kvm.h
> +++ b/kvm.h
> @@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void);
>  int kvm_has_debugregs(void);
>  int kvm_has_xsave(void);
>  int kvm_has_xcrs(void);
> +int kvm_has_many_iobus_devs(void);

Looks like a pretty low level API.
Name it 'kvm_has_ioeventfd' or something like this?

>  
>  #ifdef NEED_CPU_H
>  int kvm_init_vcpu(CPUState *env);
> -- 
> 1.7.1
Michael S. Tsirkin - Oct. 19, 2010, 1:35 p.m.
On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote:
> On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote:
> >Is there anything stopping this patch from being merged?
> 
> Michael, any objections?  If not, I'll merge it.

I don't really understand what's going on there.  The extra state in
notifiers especially scares me. If you do and are comfortable with the
code, go ahead :)
Michael S. Tsirkin - Oct. 19, 2010, 1:43 p.m.
On Tue, Oct 19, 2010 at 02:44:35PM +0100, Stefan Hajnoczi wrote:
> On Tue, Oct 19, 2010 at 2:35 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote:
> >> On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote:
> >> >Is there anything stopping this patch from being merged?
> >>
> >> Michael, any objections?  If not, I'll merge it.
> >
> > I don't really understand what's going on there.  The extra state in
> > notifiers especially scares me. If you do and are comfortable with the
> > code, go ahead :)
> 
> I'm happy to address your comments.  The state machine was a bit icky
> but I don't see a way around it.

I think the situation is similar to irqfd in qemu-kvm - take a look
there, specifically msix mask notifiers.
Stefan Hajnoczi - Oct. 19, 2010, 1:44 p.m.
On Tue, Oct 19, 2010 at 2:35 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Oct 19, 2010 at 08:12:42AM -0500, Anthony Liguori wrote:
>> On 10/19/2010 08:07 AM, Stefan Hajnoczi wrote:
>> >Is there anything stopping this patch from being merged?
>>
>> Michael, any objections?  If not, I'll merge it.
>
> I don't really understand what's going on there.  The extra state in
> notifiers especially scares me. If you do and are comfortable with the
> code, go ahead :)

I'm happy to address your comments.  The state machine was a bit icky
but I don't see a way around it.  Will follow up to your review email.

Stefan
Stefan Hajnoczi - Oct. 25, 2010, 1:25 p.m.
On Tue, Oct 19, 2010 at 03:33:41PM +0200, Michael S. Tsirkin wrote:
> My main concern is with the fact that we add more state
> in notifiers that can easily get out of sync with users.
> If we absolutely need this state, let's try to at least
> document the state machine, and make the API
> for state transitions more transparent.

I'll try to describe how it works.  If you're happy with the design in
principle then I can rework the code.  Otherwise we can think about a
different design.

The goal is to use ioeventfd instead of the synchronous pio emulation
path that userspace virtqueues use today.  Both virtio-blk and
virtio-net increase performance with this approach because it does not
block the vcpu from executing guest code while the I/O operation is
initiated.

We want to automatically create an event notifier and setup ioeventfd
for each initialized virtqueue.

Vhost already uses ioeventfd so it is important not to interfere with
devices that have enabled vhost.  If vhost is enabled, then the device's
virtqueues are off-limits and should not be tampered with.

Furthermore, older kernels limit you to 6 ioeventfds per guest.  On such
systems it is risky to automatically use ioeventfd for userspace
virtqueues, since that could take a precious ioeventfd away from another
virtio device using vhost.  Existing guest configurations would break so
it is simplest to avoid using ioeventfd for userspace virtqueues on such
hosts.

The design adds logic into hw/virtio.c to automatically use ioeventfd
for userspace virtqueues.  Specific virtio devices like blk and net
require no modification.  The logic sits below the set_host_notifier()
function that vhost uses.

This design stays in sync because it speaks two interfaces that allow it
to accurately track whether or not to use ioeventfd:
1. virtio_set_host_notifier() is used by vhost.  When vhost enables the
   host notifier we stay out of the way.
2. virtio_reset()/virtio_set_status()/virtio_load() define the device
   life-cycle and transition the state machine appropriately.  Migration
   is supported.

Here is the state machine that tracks a virtqueue:

                 assigned
             ^ /          \ ^
         e. / / c.      g. \ \ b.
           / /              \ \
          / v       f.       v \               a.
 offlimits   ---------------> deassigned <-- start
             <---------------
                    d.

a. The virtqueue starts deassigned with no ioeventfd.

b. When the device status becomes VIRTIO_CONFIG_S_DRIVER_OK we try to
assign an ioeventfd to each virtqueue, except if the 6 ioeventfd
limitation is present.

c, d. The virtqueue becomes offlimits if vhost enables the host notifier.

e. The ioeventfd becomes assigned again when the host notifier is disabled by vhost.
f. Except when the 6 ioeventfd limitation is present, then the ioeventfd
becomes unassigned because we want to avoid using ioeventfd.

g. When the device is reset its virtqueues become deassigned again.

Does this make sense?

Stefan
Stefan Hajnoczi - Oct. 25, 2010, 3:01 p.m.
On Tue, Oct 19, 2010 at 03:33:41PM +0200, Michael S. Tsirkin wrote:
Apologies if you receive this twice, the original message either
disappeared or was delayed somehow.

> My main concern is with the fact that we add more state
> in notifiers that can easily get out of sync with users.
> If we absolutely need this state, let's try to at least
> document the state machine, and make the API
> for state transitions more transparent.

I'll try to describe how it works.  If you're happy with the design in
principle then I can rework the code.  Otherwise we can think about a
different design.

The goal is to use ioeventfd instead of the synchronous pio emulation
path that userspace virtqueues use today.  Both virtio-blk and
virtio-net increase performance with this approach because it does not
block the vcpu from executing guest code while the I/O operation is
initiated.

We want to automatically create an event notifier and setup ioeventfd
for each initialized virtqueue.

Vhost already uses ioeventfd so it is important not to interfere with
devices that have enabled vhost.  If vhost is enabled, then the device's
virtqueues are off-limits and should not be tampered with.

Furthermore, older kernels limit you to 6 ioeventfds per guest.  On such
systems it is risky to automatically use ioeventfd for userspace
virtqueues, since that could take a precious ioeventfd away from another
virtio device using vhost.  Existing guest configurations would break so
it is simplest to avoid using ioeventfd for userspace virtqueues on such
hosts.

The design adds logic into hw/virtio.c to automatically use ioeventfd
for userspace virtqueues.  Specific virtio devices like blk and net
require no modification.  The logic sits below the set_host_notifier()
function that vhost uses.

This design stays in sync because it speaks two interfaces that allow it
to accurately track whether or not to use ioeventfd:
1. virtio_set_host_notifier() is used by vhost.  When vhost enables the
   host notifier we stay out of the way.
2. virtio_reset()/virtio_set_status()/virtio_load() define the device
   life-cycle and transition the state machine appropriately.  Migration
   is supported.

Here is the state machine that tracks a virtqueue:

                 assigned
             ^ /          \ ^
         e. / / c.      g. \ \ b.
           / /              \ \
          / v       f.       v \               a.
 offlimits   ---------------> deassigned <-- start
             <---------------
                    d.

a. The virtqueue starts deassigned with no ioeventfd.

b. When the device status becomes VIRTIO_CONFIG_S_DRIVER_OK we try to
assign an ioeventfd to each virtqueue, except if the 6 ioeventfd
limitation is present.

c, d. The virtqueue becomes offlimits if vhost enables the host notifier.

e. The ioeventfd becomes assigned again when the host notifier is
disabled by vhost.
f. Except when the 6 ioeventfd limitation is present, then the ioeventfd
becomes unassigned because we want to avoid using ioeventfd.

g. When the device is reset its virtqueues become deassigned again.

Does this make sense?

Stefan
Stefan Hajnoczi - Nov. 10, 2010, 2:52 p.m.
Hi Michael,
I have looked into the way irqfd with msix mask notifiers works.  From
what I can tell, the guest notifiers are enabled by vhost net in order
to hook up irqfds for the virtqueues.  MSIX allows vectors to be
masked so there is a mmio write notifier in qemu-kvm to toggle the
irqfd and its QEMU fd handler when the guest toggles the MSIX mask.

The irqfd is disabled but stays open as an eventfd while masked.  That
means masking/unmasking the vector does not close/open the eventfd
file descriptor itself.

I'm having trouble finding a direct parallel to virtio-ioeventfd here.
 We always want to have an ioeventfd per virtqueue unless the host
kernel does not support >6 ioeventfds per VM.

When vhost sets the host notifier we want to remove the QEMU fd
handler and allow vhost to use the event notifier's fd as it wants.
When vhost clears the host notifier we want to add the QEMU fd handler
again (unless the kernel does not support >6 ioeventfds per VM).

I think hooking in at the virtio-pci.c level instead of virtio.c is
possible but we're still going to have the same state transitions.  I
hope it can be done without adding per-virtqueue variables that track
state.

Before I go down this route, is there something I've missed and do you
think this approach will be better?

Stefan

Patch

diff --git a/hw/vhost.c b/hw/vhost.c
index 1b8624d..f127a07 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -517,7 +517,7 @@  static int vhost_virtqueue_init(struct vhost_dev *dev,
         goto fail_guest_notifier;
     }
 
-    r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, true);
+    r = virtio_set_host_notifier(vdev, idx, true);
     if (r < 0) {
         fprintf(stderr, "Error binding host notifier: %d\n", -r);
         goto fail_host_notifier;
@@ -539,7 +539,7 @@  static int vhost_virtqueue_init(struct vhost_dev *dev,
 
 fail_call:
 fail_kick:
-    vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
+    virtio_set_host_notifier(vdev, idx, false);
 fail_host_notifier:
     vdev->binding->set_guest_notifier(vdev->binding_opaque, idx, false);
 fail_guest_notifier:
@@ -575,7 +575,7 @@  static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
     }
     assert (r >= 0);
 
-    r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
+    r = virtio_set_host_notifier(vdev, idx, false);
     if (r < 0) {
         fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
         fflush(stderr);
diff --git a/hw/virtio.c b/hw/virtio.c
index fbef788..f075b3a 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -16,6 +16,7 @@ 
 #include "trace.h"
 #include "virtio.h"
 #include "sysemu.h"
+#include "kvm.h"
 
 /* The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. */
@@ -77,6 +78,11 @@  struct VirtQueue
     VirtIODevice *vdev;
     EventNotifier guest_notifier;
     EventNotifier host_notifier;
+    enum {
+        HOST_NOTIFIER_DEASSIGNED,   /* inactive */
+        HOST_NOTIFIER_ASSIGNED,     /* active */
+        HOST_NOTIFIER_OFFLIMITS,    /* active but outside our control */
+    } host_notifier_state;
 };
 
 /* virt queue functions */
@@ -453,6 +459,93 @@  void virtio_update_irq(VirtIODevice *vdev)
     virtio_notify_vector(vdev, VIRTIO_NO_VECTOR);
 }
 
+/* Service virtqueue notify from a host notifier */
+static void virtio_read_host_notifier(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(notifier)) {
+        if (vq->vring.desc) {
+            vq->handle_output(vq->vdev, vq);
+        }
+    }
+}
+
+/* Transition between host notifier states */
+static int virtio_set_host_notifier_state(VirtIODevice *vdev, int n, int state)
+{
+    VirtQueue *vq = &vdev->vq[n];
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    int rc;
+
+    if (!kvm_enabled()) {
+        return -ENOSYS;
+    }
+
+    /* If the number of ioeventfds is limited, use them for vhost only */
+    if (state == HOST_NOTIFIER_ASSIGNED && !kvm_has_many_iobus_devs()) {
+        state = HOST_NOTIFIER_DEASSIGNED;
+    }
+
+    /* Ignore if no state change */
+    if (vq->host_notifier_state == state) {
+        return 0;
+    }
+
+    /* Disable read handler if transitioning away from assigned */
+    if (vq->host_notifier_state == HOST_NOTIFIER_ASSIGNED) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier), NULL, NULL, NULL);
+    }
+
+    /* Toggle host notifier if transitioning to or from deassigned */
+    if (state == HOST_NOTIFIER_DEASSIGNED ||
+        vq->host_notifier_state == HOST_NOTIFIER_DEASSIGNED) {
+        rc = vdev->binding->set_host_notifier(vdev->binding_opaque, n,
+                state != HOST_NOTIFIER_DEASSIGNED);
+        if (rc < 0) {
+            return rc;
+        }
+    }
+
+    /* Enable read handler if transitioning to assigned */
+    if (state == HOST_NOTIFIER_ASSIGNED) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_read_host_notifier, NULL, vq);
+    }
+
+    vq->host_notifier_state = state;
+    return 0;
+}
+
+/* Try to assign/deassign host notifiers for all virtqueues */
+static void virtio_set_host_notifiers(VirtIODevice *vdev, bool assigned)
+{
+    int state = assigned ? HOST_NOTIFIER_ASSIGNED : HOST_NOTIFIER_DEASSIGNED;
+    int i;
+
+    if (!vdev->binding->set_host_notifier) {
+        return;
+    }
+
+    for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+        if (vdev->vq[i].host_notifier_state == HOST_NOTIFIER_OFFLIMITS) {
+            continue;
+        }
+
+        if (!vdev->vq[i].vring.desc) {
+            continue;
+        }
+
+        virtio_set_host_notifier_state(vdev, i, state);
+    }
+}
+
+int virtio_set_host_notifier(VirtIODevice *vdev, int n, bool assigned)
+{
+    int state = assigned ? HOST_NOTIFIER_OFFLIMITS : HOST_NOTIFIER_ASSIGNED;
+    return virtio_set_host_notifier_state(vdev, n, state);
+}
+
 void virtio_reset(void *opaque)
 {
     VirtIODevice *vdev = opaque;
@@ -467,6 +560,7 @@  void virtio_reset(void *opaque)
     vdev->isr = 0;
     vdev->config_vector = VIRTIO_NO_VECTOR;
     virtio_notify_vector(vdev, vdev->config_vector);
+    virtio_set_host_notifiers(vdev, false);
 
     for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
         vdev->vq[i].vring.desc = 0;
@@ -592,6 +686,16 @@  void virtio_queue_set_vector(VirtIODevice *vdev, int n, uint16_t vector)
         vdev->vq[n].vector = vector;
 }
 
+void virtio_set_status(VirtIODevice *vdev, uint8_t val)
+{
+    virtio_set_host_notifiers(vdev, val & VIRTIO_CONFIG_S_DRIVER_OK);
+
+    if (vdev->set_status) {
+        vdev->set_status(vdev, val);
+    }
+    vdev->status = val;
+}
+
 VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
                             void (*handle_output)(VirtIODevice *, VirtQueue *))
 {
@@ -719,6 +823,7 @@  int virtio_load(VirtIODevice *vdev, QEMUFile *f)
     }
 
     virtio_notify_vector(vdev, VIRTIO_NO_VECTOR);
+    virtio_set_host_notifiers(vdev, vdev->status & VIRTIO_CONFIG_S_DRIVER_OK);
     return 0;
 }
 
@@ -746,6 +851,7 @@  VirtIODevice *virtio_common_init(const char *name, uint16_t device_id,
     for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
         vdev->vq[i].vector = VIRTIO_NO_VECTOR;
         vdev->vq[i].vdev = vdev;
+        vdev->vq[i].host_notifier_state = HOST_NOTIFIER_DEASSIGNED;
     }
 
     vdev->name = name;
diff --git a/hw/virtio.h b/hw/virtio.h
index 96514e6..d76157e 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -125,13 +125,7 @@  struct VirtIODevice
     uint16_t device_id;
 };
 
-static inline void virtio_set_status(VirtIODevice *vdev, uint8_t val)
-{
-    if (vdev->set_status) {
-        vdev->set_status(vdev, val);
-    }
-    vdev->status = val;
-}
+void virtio_set_status(VirtIODevice *vdev, uint8_t val);
 
 VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
                             void (*handle_output)(VirtIODevice *,
@@ -217,6 +211,7 @@  target_phys_addr_t virtio_queue_get_ring_size(VirtIODevice *vdev, int n);
 uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n);
 void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
 VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n);
+int virtio_set_host_notifier(VirtIODevice *vdev, int n, bool assigned);
 EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_irq(VirtQueue *vq);
diff --git a/kvm-all.c b/kvm-all.c
index 1cc696f..2f09e34 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -24,6 +24,7 @@ 
 #include "qemu-barrier.h"
 #include "sysemu.h"
 #include "hw/hw.h"
+#include "hw/event_notifier.h"
 #include "gdbstub.h"
 #include "kvm.h"
 #include "bswap.h"
@@ -72,6 +73,7 @@  struct KVMState
     int irqchip_in_kernel;
     int pit_in_kernel;
     int xsave, xcrs;
+    int many_iobus_devs;
 };
 
 static KVMState *kvm_state;
@@ -423,6 +425,36 @@  int kvm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
+static int kvm_check_many_iobus_devs(void)
+{
+    /* Older kernels have a 6 device limit on the KVM io bus.  In that case
+     * creating many ioeventfds must be avoided.  This tests checks for the
+     * limitation.
+     */
+    EventNotifier notifiers[7];
+    int i, ret = 0;
+    for (i = 0; i < ARRAY_SIZE(notifiers); i++) {
+        ret = event_notifier_init(&notifiers[i], 0);
+        if (ret < 0) {
+            break;
+        }
+        ret = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, true);
+        if (ret < 0) {
+            event_notifier_cleanup(&notifiers[i]);
+            break;
+        }
+    }
+
+    /* Decide whether many devices are supported or not */
+    ret = i == ARRAY_SIZE(notifiers);
+
+    while (i-- > 0) {
+        kvm_set_ioeventfd_pio_word(event_notifier_get_fd(&notifiers[i]), 0, i, false);
+        event_notifier_cleanup(&notifiers[i]);
+    }
+    return ret;
+}
+
 static void kvm_set_phys_mem(target_phys_addr_t start_addr,
 			     ram_addr_t size,
 			     ram_addr_t phys_offset)
@@ -699,6 +731,8 @@  int kvm_init(int smp_cpus)
     kvm_state = s;
     cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client);
 
+    s->many_iobus_devs = kvm_check_many_iobus_devs();
+
     return 0;
 
 err:
@@ -1028,6 +1062,11 @@  int kvm_has_xcrs(void)
     return kvm_state->xcrs;
 }
 
+int kvm_has_many_iobus_devs(void)
+{
+    return kvm_state->many_iobus_devs;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
     if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index d45f9fa..b0887fb 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -99,6 +99,11 @@  int kvm_has_robust_singlestep(void)
     return 0;
 }
 
+int kvm_has_many_iobus_devs(void)
+{
+    return 0;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 50b6c01..f405906 100644
--- a/kvm.h
+++ b/kvm.h
@@ -42,6 +42,7 @@  int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
 int kvm_has_xsave(void);
 int kvm_has_xcrs(void);
+int kvm_has_many_iobus_devs(void);
 
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);