Patchwork [v5,2/4] virtio-pci: Use ioeventfd for virtqueue notify

login
register
mail settings
Submitter Stefan Hajnoczi
Date Dec. 12, 2010, 3:02 p.m.
Message ID <1292166128-10874-3-git-send-email-stefanha@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/75251/
State New
Headers show

Comments

Stefan Hajnoczi - Dec. 12, 2010, 3:02 p.m.
Virtqueue notify is currently handled synchronously in userspace virtio.  This
prevents the vcpu from executing guest code while hardware emulation code
handles the notify.

On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to the
iothread and allowing the VM to continue execution.  This model is similar to
how vhost receives virtqueue notifies.

The result of this change is improved performance for userspace virtio devices.
Virtio-blk throughput increases especially for multithreaded scenarios and
virtio-net transmit throughput increases substantially.

Some virtio devices are known to have guest drivers which expect a notify to be
processed synchronously and spin waiting for completion.  Only enable ioeventfd
for virtio-blk and virtio-net for now.

Care must be taken not to interfere with vhost-net, which uses host
notifiers.  If the set_host_notifier() API is used by a device
virtio-pci will disable virtio-ioeventfd and let the device deal with
host notifiers as it wishes.

After migration and on VM change state (running/paused) virtio-ioeventfd
will enable/disable itself.

 * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
 * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
 * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
 * vm_change_state(running=0) -> disable virtio-ioeventfd
 * vm_change_state(running=1) -> enable virtio-ioeventfd

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-pci.c |  190 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 hw/virtio.c     |   14 +++-
 hw/virtio.h     |    1 +
 3 files changed, 179 insertions(+), 26 deletions(-)
Kevin Wolf - Jan. 24, 2011, 6:54 p.m.
Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
> Virtqueue notify is currently handled synchronously in userspace virtio.  This
> prevents the vcpu from executing guest code while hardware emulation code
> handles the notify.
> 
> On systems that support KVM, the ioeventfd mechanism can be used to make
> virtqueue notify a lightweight exit by deferring hardware emulation to the
> iothread and allowing the VM to continue execution.  This model is similar to
> how vhost receives virtqueue notifies.
> 
> The result of this change is improved performance for userspace virtio devices.
> Virtio-blk throughput increases especially for multithreaded scenarios and
> virtio-net transmit throughput increases substantially.
> 
> Some virtio devices are known to have guest drivers which expect a notify to be
> processed synchronously and spin waiting for completion.  Only enable ioeventfd
> for virtio-blk and virtio-net for now.
> 
> Care must be taken not to interfere with vhost-net, which uses host
> notifiers.  If the set_host_notifier() API is used by a device
> virtio-pci will disable virtio-ioeventfd and let the device deal with
> host notifiers as it wishes.
> 
> After migration and on VM change state (running/paused) virtio-ioeventfd
> will enable/disable itself.
> 
>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>  * vm_change_state(running=1) -> enable virtio-ioeventfd
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

On current git master I'm getting hangs when running iozone on a
virtio-blk disk. "Hang" means that it's not responsive any more and has
100% CPU consumption.

I bisected the problem to this patch. Any ideas?

Kevin
Michael S. Tsirkin - Jan. 24, 2011, 7:36 p.m.
On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
> > Virtqueue notify is currently handled synchronously in userspace virtio.  This
> > prevents the vcpu from executing guest code while hardware emulation code
> > handles the notify.
> > 
> > On systems that support KVM, the ioeventfd mechanism can be used to make
> > virtqueue notify a lightweight exit by deferring hardware emulation to the
> > iothread and allowing the VM to continue execution.  This model is similar to
> > how vhost receives virtqueue notifies.
> > 
> > The result of this change is improved performance for userspace virtio devices.
> > Virtio-blk throughput increases especially for multithreaded scenarios and
> > virtio-net transmit throughput increases substantially.
> > 
> > Some virtio devices are known to have guest drivers which expect a notify to be
> > processed synchronously and spin waiting for completion.  Only enable ioeventfd
> > for virtio-blk and virtio-net for now.
> > 
> > Care must be taken not to interfere with vhost-net, which uses host
> > notifiers.  If the set_host_notifier() API is used by a device
> > virtio-pci will disable virtio-ioeventfd and let the device deal with
> > host notifiers as it wishes.
> > 
> > After migration and on VM change state (running/paused) virtio-ioeventfd
> > will enable/disable itself.
> > 
> >  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
> >  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
> >  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
> >  * vm_change_state(running=0) -> disable virtio-ioeventfd
> >  * vm_change_state(running=1) -> enable virtio-ioeventfd
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> 
> On current git master I'm getting hangs when running iozone on a
> virtio-blk disk. "Hang" means that it's not responsive any more and has
> 100% CPU consumption.
> 
> I bisected the problem to this patch. Any ideas?
> 
> Kevin

Does it help if you set ioeventfd=off on command line?
Michael S. Tsirkin - Jan. 24, 2011, 7:47 p.m.
On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
> > On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
> >> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
> >>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
> >>> prevents the vcpu from executing guest code while hardware emulation code
> >>> handles the notify.
> >>>
> >>> On systems that support KVM, the ioeventfd mechanism can be used to make
> >>> virtqueue notify a lightweight exit by deferring hardware emulation to the
> >>> iothread and allowing the VM to continue execution.  This model is similar to
> >>> how vhost receives virtqueue notifies.
> >>>
> >>> The result of this change is improved performance for userspace virtio devices.
> >>> Virtio-blk throughput increases especially for multithreaded scenarios and
> >>> virtio-net transmit throughput increases substantially.
> >>>
> >>> Some virtio devices are known to have guest drivers which expect a notify to be
> >>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
> >>> for virtio-blk and virtio-net for now.
> >>>
> >>> Care must be taken not to interfere with vhost-net, which uses host
> >>> notifiers.  If the set_host_notifier() API is used by a device
> >>> virtio-pci will disable virtio-ioeventfd and let the device deal with
> >>> host notifiers as it wishes.
> >>>
> >>> After migration and on VM change state (running/paused) virtio-ioeventfd
> >>> will enable/disable itself.
> >>>
> >>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
> >>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
> >>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
> >>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
> >>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
> >>>
> >>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> >>
> >> On current git master I'm getting hangs when running iozone on a
> >> virtio-blk disk. "Hang" means that it's not responsive any more and has
> >> 100% CPU consumption.
> >>
> >> I bisected the problem to this patch. Any ideas?
> >>
> >> Kevin
> > 
> > Does it help if you set ioeventfd=off on command line?
> 
> Yes, with ioeventfd=off it seems to work fine.
> 
> Kevin

Then it's the ioeventfd that is to blame.
Is it the io thread that consumes 100% CPU?
Or the vcpu thread?
Kevin Wolf - Jan. 24, 2011, 7:48 p.m.
Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
>>> prevents the vcpu from executing guest code while hardware emulation code
>>> handles the notify.
>>>
>>> On systems that support KVM, the ioeventfd mechanism can be used to make
>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
>>> iothread and allowing the VM to continue execution.  This model is similar to
>>> how vhost receives virtqueue notifies.
>>>
>>> The result of this change is improved performance for userspace virtio devices.
>>> Virtio-blk throughput increases especially for multithreaded scenarios and
>>> virtio-net transmit throughput increases substantially.
>>>
>>> Some virtio devices are known to have guest drivers which expect a notify to be
>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
>>> for virtio-blk and virtio-net for now.
>>>
>>> Care must be taken not to interfere with vhost-net, which uses host
>>> notifiers.  If the set_host_notifier() API is used by a device
>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>>> host notifiers as it wishes.
>>>
>>> After migration and on VM change state (running/paused) virtio-ioeventfd
>>> will enable/disable itself.
>>>
>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>
>> On current git master I'm getting hangs when running iozone on a
>> virtio-blk disk. "Hang" means that it's not responsive any more and has
>> 100% CPU consumption.
>>
>> I bisected the problem to this patch. Any ideas?
>>
>> Kevin
> 
> Does it help if you set ioeventfd=off on command line?

Yes, with ioeventfd=off it seems to work fine.

Kevin
Kevin Wolf - Jan. 24, 2011, 8:05 p.m.
Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
>>>>> prevents the vcpu from executing guest code while hardware emulation code
>>>>> handles the notify.
>>>>>
>>>>> On systems that support KVM, the ioeventfd mechanism can be used to make
>>>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
>>>>> iothread and allowing the VM to continue execution.  This model is similar to
>>>>> how vhost receives virtqueue notifies.
>>>>>
>>>>> The result of this change is improved performance for userspace virtio devices.
>>>>> Virtio-blk throughput increases especially for multithreaded scenarios and
>>>>> virtio-net transmit throughput increases substantially.
>>>>>
>>>>> Some virtio devices are known to have guest drivers which expect a notify to be
>>>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
>>>>> for virtio-blk and virtio-net for now.
>>>>>
>>>>> Care must be taken not to interfere with vhost-net, which uses host
>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>>>>> host notifiers as it wishes.
>>>>>
>>>>> After migration and on VM change state (running/paused) virtio-ioeventfd
>>>>> will enable/disable itself.
>>>>>
>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>>>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>>>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>>>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
>>>>>
>>>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>>>
>>>> On current git master I'm getting hangs when running iozone on a
>>>> virtio-blk disk. "Hang" means that it's not responsive any more and has
>>>> 100% CPU consumption.
>>>>
>>>> I bisected the problem to this patch. Any ideas?
>>>>
>>>> Kevin
>>>
>>> Does it help if you set ioeventfd=off on command line?
>>
>> Yes, with ioeventfd=off it seems to work fine.
>>
>> Kevin
> 
> Then it's the ioeventfd that is to blame.
> Is it the io thread that consumes 100% CPU?
> Or the vcpu thread?

I was building with the default options, i.e. there is no IO thread.

Now I'm just running the test with IO threads enabled, and so far
everything looks good. So I can only reproduce the problem with IO
threads disabled.

Kevin
Stefan Hajnoczi - Jan. 25, 2011, 7:12 a.m.
On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
>>>>>> prevents the vcpu from executing guest code while hardware emulation code
>>>>>> handles the notify.
>>>>>>
>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to make
>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
>>>>>> iothread and allowing the VM to continue execution.  This model is similar to
>>>>>> how vhost receives virtqueue notifies.
>>>>>>
>>>>>> The result of this change is improved performance for userspace virtio devices.
>>>>>> Virtio-blk throughput increases especially for multithreaded scenarios and
>>>>>> virtio-net transmit throughput increases substantially.
>>>>>>
>>>>>> Some virtio devices are known to have guest drivers which expect a notify to be
>>>>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
>>>>>> for virtio-blk and virtio-net for now.
>>>>>>
>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>>>>>> host notifiers as it wishes.
>>>>>>
>>>>>> After migration and on VM change state (running/paused) virtio-ioeventfd
>>>>>> will enable/disable itself.
>>>>>>
>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>>>>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>>>>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>>>>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
>>>>>>
>>>>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>>>>
>>>>> On current git master I'm getting hangs when running iozone on a
>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and has
>>>>> 100% CPU consumption.
>>>>>
>>>>> I bisected the problem to this patch. Any ideas?
>>>>>
>>>>> Kevin
>>>>
>>>> Does it help if you set ioeventfd=off on command line?
>>>
>>> Yes, with ioeventfd=off it seems to work fine.
>>>
>>> Kevin
>>
>> Then it's the ioeventfd that is to blame.
>> Is it the io thread that consumes 100% CPU?
>> Or the vcpu thread?
>
> I was building with the default options, i.e. there is no IO thread.
>
> Now I'm just running the test with IO threads enabled, and so far
> everything looks good. So I can only reproduce the problem with IO
> threads disabled.

Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
(relevant when --enable-io-thread is not used).  I will take a look at
that again and see why we're spinning without checking for ioeventfd
completion.

Stefan
Stefan Hajnoczi - Jan. 25, 2011, 9:49 a.m.
On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
>>>>>>> prevents the vcpu from executing guest code while hardware emulation code
>>>>>>> handles the notify.
>>>>>>>
>>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to make
>>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
>>>>>>> iothread and allowing the VM to continue execution.  This model is similar to
>>>>>>> how vhost receives virtqueue notifies.
>>>>>>>
>>>>>>> The result of this change is improved performance for userspace virtio devices.
>>>>>>> Virtio-blk throughput increases especially for multithreaded scenarios and
>>>>>>> virtio-net transmit throughput increases substantially.
>>>>>>>
>>>>>>> Some virtio devices are known to have guest drivers which expect a notify to be
>>>>>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
>>>>>>> for virtio-blk and virtio-net for now.
>>>>>>>
>>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>>>>>>> host notifiers as it wishes.
>>>>>>>
>>>>>>> After migration and on VM change state (running/paused) virtio-ioeventfd
>>>>>>> will enable/disable itself.
>>>>>>>
>>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>>>>>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>>>>>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>>>>>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
>>>>>>>
>>>>>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>>>>>
>>>>>> On current git master I'm getting hangs when running iozone on a
>>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and has
>>>>>> 100% CPU consumption.
>>>>>>
>>>>>> I bisected the problem to this patch. Any ideas?
>>>>>>
>>>>>> Kevin
>>>>>
>>>>> Does it help if you set ioeventfd=off on command line?
>>>>
>>>> Yes, with ioeventfd=off it seems to work fine.
>>>>
>>>> Kevin
>>>
>>> Then it's the ioeventfd that is to blame.
>>> Is it the io thread that consumes 100% CPU?
>>> Or the vcpu thread?
>>
>> I was building with the default options, i.e. there is no IO thread.
>>
>> Now I'm just running the test with IO threads enabled, and so far
>> everything looks good. So I can only reproduce the problem with IO
>> threads disabled.
>
> Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
> (relevant when --enable-io-thread is not used).  I will take a look at
> that again and see why we're spinning without checking for ioeventfd
> completion.

Here's my understanding of --disable-io-thread.  Added Anthony on CC,
please correct me.

When I/O thread is disabled our only thread runs guest code until an
exit request is made.  There are synchronous exit cases like a halt
instruction or single step.  There are also asynchronous exit cases
when signal handlers use qemu_notify_event(), which does cpu_exit(),
to set env->exit_request = 1 and unlink the current tb.

With this structure in mind, anything which needs to interrupt the
vcpu in order to process events must use signals and
qemu_notify_event().  Otherwise that event source may be starved and
never processed.

virtio-ioeventfd currently does not use signals and will therefore
never interrupt the vcpu.

However, you normally don't notice the missing signal handler because
some other event interrupts the vcpu and we enter select(2) to process
all pending handlers.  So virtio-ioeventfd mostly gets a free ride on
top of timer events.  This is suboptimal because it adds latency to
virtqueue kick - we're waiting for another event to interrupt the vcpu
before we can process virtqueue-kick.

If any other vcpu interruption makes virtio-ioeventfd chug along then
why are you seeing 100% CPU livelock?  My theory is that dynticks has
a race condition which causes timers to stop working in QEMU.  Here is
an strace of QEMU --disable-io-thread entering live lock.  I can
trigger this by starting a VM and running "while true; do true; done"
at the shell.  Then strace the QEMU process:

08:04:34.985177 ioctl(11, KVM_RUN, 0)   = 0
08:04:34.985242 --- SIGALRM (Alarm clock) @ 0 (0) ---
08:04:34.985319 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
08:04:34.985368 rt_sigreturn(0x2758ad0) = 0
08:04:34.985423 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
08:04:34.985484 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
08:04:34.985538 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
08:04:34.985588 timer_settime(0, 0, {it_interval={0, 0}, it_value={0,
273000}}, NULL) = 0
08:04:34.985646 ioctl(11, KVM_RUN, 0)   = -1 EINTR (Interrupted system call)
08:04:34.985928 --- SIGALRM (Alarm clock) @ 0 (0) ---
08:04:34.986007 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
08:04:34.986063 rt_sigreturn(0x2758ad0) = -1 EINTR (Interrupted system call)
08:04:34.986124 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
08:04:34.986188 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
08:04:34.986246 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
08:04:34.986299 timer_settime(0, 0, {it_interval={0, 0}, it_value={0,
250000}}, NULL) = 0
08:04:34.986359 ioctl(11, KVM_INTERRUPT, 0x7fff90404ef0) = 0
08:04:34.986406 ioctl(11, KVM_RUN, 0)   = 0
08:04:34.986465 ioctl(11, KVM_RUN, 0)   = 0              <--- guest
finishes execution

                v--- dynticks_rearm_timer() returns early because
timer is already scheduled
08:04:34.986533 timer_gettime(0, {it_interval={0, 0}, it_value={0, 24203}}) = 0
08:04:34.986585 --- SIGALRM (Alarm clock) @ 0 (0) ---    <--- timer expires
08:04:34.986661 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
08:04:34.986710 rt_sigreturn(0x2758ad0) = 0

                v--- we re-enter the guest without rearming the timer!
08:04:34.986754 ioctl(11, KVM_RUN^C <unfinished ...>
[QEMU hang, 100% CPU]

So dynticks fails to rearm the timer before we enter the guest.  This
is a race condition: we check that there is already a timer scheduled
and head on towards re-entering the guest, the timer expires before we
enter the guest, we re-enter the guest without realizing the timer has
expired.  Now we're inside the guest without the hope of a timer
expiring - and the guest is running a CPU-bound workload that doesn't
need to perform I/O.

The result is a hung QEMU (screen does not update) and a softlockup
inside the guest once we do kick it to life again (by detaching
strace).

I think the only way to avoid this race condition in dynticks is to
mask SIGALRM, then check if the timer expired, and then ioctl(KVM_RUN)
with atomic signal mask change back to SIGALRM enabled.  Thoughts?

Back to virtio-ioeventfd, we really shouldn't use virtio-ioeventfd
when there is no I/O thread.  It doesn't make sense because there's no
opportunity to process the virtqueue while the guest code is executing
in parallel like there is with I/O thread.  It will just degrade
performance when QEMU only has one thread.  I'll send a patch to
disable it when we build without I/O thread.

Stefan
Stefan Hajnoczi - Jan. 25, 2011, 9:54 a.m.
On Tue, Jan 25, 2011 at 9:49 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> If any other vcpu interruption makes virtio-ioeventfd chug along then
> why are you seeing 100% CPU livelock?  My theory is that dynticks has
> a race condition which causes timers to stop working in QEMU.

I forgot to mention that you can test this theory by building without
I/O thread and running with -clock hpet.

If the guest no longer hangs, then this suggests you're seeing the
dynticks race condition.

Stefan
Michael S. Tsirkin - Jan. 25, 2011, 11:27 a.m.
On Tue, Jan 25, 2011 at 09:49:04AM +0000, Stefan Hajnoczi wrote:
> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
> >>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
> >>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
> >>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
> >>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
> >>>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
> >>>>>>> prevents the vcpu from executing guest code while hardware emulation code
> >>>>>>> handles the notify.
> >>>>>>>
> >>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to make
> >>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
> >>>>>>> iothread and allowing the VM to continue execution.  This model is similar to
> >>>>>>> how vhost receives virtqueue notifies.
> >>>>>>>
> >>>>>>> The result of this change is improved performance for userspace virtio devices.
> >>>>>>> Virtio-blk throughput increases especially for multithreaded scenarios and
> >>>>>>> virtio-net transmit throughput increases substantially.
> >>>>>>>
> >>>>>>> Some virtio devices are known to have guest drivers which expect a notify to be
> >>>>>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
> >>>>>>> for virtio-blk and virtio-net for now.
> >>>>>>>
> >>>>>>> Care must be taken not to interfere with vhost-net, which uses host
> >>>>>>> notifiers.  If the set_host_notifier() API is used by a device
> >>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
> >>>>>>> host notifiers as it wishes.
> >>>>>>>
> >>>>>>> After migration and on VM change state (running/paused) virtio-ioeventfd
> >>>>>>> will enable/disable itself.
> >>>>>>>
> >>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
> >>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
> >>>>>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
> >>>>>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
> >>>>>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
> >>>>>>>
> >>>>>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> >>>>>>
> >>>>>> On current git master I'm getting hangs when running iozone on a
> >>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and has
> >>>>>> 100% CPU consumption.
> >>>>>>
> >>>>>> I bisected the problem to this patch. Any ideas?
> >>>>>>
> >>>>>> Kevin
> >>>>>
> >>>>> Does it help if you set ioeventfd=off on command line?
> >>>>
> >>>> Yes, with ioeventfd=off it seems to work fine.
> >>>>
> >>>> Kevin
> >>>
> >>> Then it's the ioeventfd that is to blame.
> >>> Is it the io thread that consumes 100% CPU?
> >>> Or the vcpu thread?
> >>
> >> I was building with the default options, i.e. there is no IO thread.
> >>
> >> Now I'm just running the test with IO threads enabled, and so far
> >> everything looks good. So I can only reproduce the problem with IO
> >> threads disabled.
> >
> > Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
> > (relevant when --enable-io-thread is not used).  I will take a look at
> > that again and see why we're spinning without checking for ioeventfd
> > completion.
> 
> Here's my understanding of --disable-io-thread.  Added Anthony on CC,
> please correct me.
> 
> When I/O thread is disabled our only thread runs guest code until an
> exit request is made.  There are synchronous exit cases like a halt
> instruction or single step.  There are also asynchronous exit cases
> when signal handlers use qemu_notify_event(), which does cpu_exit(),
> to set env->exit_request = 1 and unlink the current tb.
> 
> With this structure in mind, anything which needs to interrupt the
> vcpu in order to process events must use signals and
> qemu_notify_event().  Otherwise that event source may be starved and
> never processed.
> 
> virtio-ioeventfd currently does not use signals and will therefore
> never interrupt the vcpu.
> 
> However, you normally don't notice the missing signal handler because
> some other event interrupts the vcpu and we enter select(2) to process
> all pending handlers.  So virtio-ioeventfd mostly gets a free ride on
> top of timer events.  This is suboptimal because it adds latency to
> virtqueue kick - we're waiting for another event to interrupt the vcpu
> before we can process virtqueue-kick.
> 
> If any other vcpu interruption makes virtio-ioeventfd chug along then
> why are you seeing 100% CPU livelock?  My theory is that dynticks has
> a race condition which causes timers to stop working in QEMU.  Here is
> an strace of QEMU --disable-io-thread entering live lock.  I can
> trigger this by starting a VM and running "while true; do true; done"
> at the shell.  Then strace the QEMU process:
> 
> 08:04:34.985177 ioctl(11, KVM_RUN, 0)   = 0
> 08:04:34.985242 --- SIGALRM (Alarm clock) @ 0 (0) ---
> 08:04:34.985319 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> 08:04:34.985368 rt_sigreturn(0x2758ad0) = 0
> 08:04:34.985423 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
> 08:04:34.985484 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
> 08:04:34.985538 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
> 08:04:34.985588 timer_settime(0, 0, {it_interval={0, 0}, it_value={0,
> 273000}}, NULL) = 0
> 08:04:34.985646 ioctl(11, KVM_RUN, 0)   = -1 EINTR (Interrupted system call)
> 08:04:34.985928 --- SIGALRM (Alarm clock) @ 0 (0) ---
> 08:04:34.986007 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> 08:04:34.986063 rt_sigreturn(0x2758ad0) = -1 EINTR (Interrupted system call)
> 08:04:34.986124 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
> 08:04:34.986188 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
> 08:04:34.986246 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
> 08:04:34.986299 timer_settime(0, 0, {it_interval={0, 0}, it_value={0,
> 250000}}, NULL) = 0
> 08:04:34.986359 ioctl(11, KVM_INTERRUPT, 0x7fff90404ef0) = 0
> 08:04:34.986406 ioctl(11, KVM_RUN, 0)   = 0
> 08:04:34.986465 ioctl(11, KVM_RUN, 0)   = 0              <--- guest
> finishes execution
> 
>                 v--- dynticks_rearm_timer() returns early because
> timer is already scheduled
> 08:04:34.986533 timer_gettime(0, {it_interval={0, 0}, it_value={0, 24203}}) = 0
> 08:04:34.986585 --- SIGALRM (Alarm clock) @ 0 (0) ---    <--- timer expires
> 08:04:34.986661 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> 08:04:34.986710 rt_sigreturn(0x2758ad0) = 0
> 
>                 v--- we re-enter the guest without rearming the timer!
> 08:04:34.986754 ioctl(11, KVM_RUN^C <unfinished ...>
> [QEMU hang, 100% CPU]
> 
> So dynticks fails to rearm the timer before we enter the guest.  This
> is a race condition: we check that there is already a timer scheduled
> and head on towards re-entering the guest, the timer expires before we
> enter the guest, we re-enter the guest without realizing the timer has
> expired.  Now we're inside the guest without the hope of a timer
> expiring - and the guest is running a CPU-bound workload that doesn't
> need to perform I/O.
> 
> The result is a hung QEMU (screen does not update) and a softlockup
> inside the guest once we do kick it to life again (by detaching
> strace).
> 
> I think the only way to avoid this race condition in dynticks is to
> mask SIGALRM, then check if the timer expired, and then ioctl(KVM_RUN)
> with atomic signal mask change back to SIGALRM enabled.  Thoughts?
> 
> Back to virtio-ioeventfd, we really shouldn't use virtio-ioeventfd
> when there is no I/O thread.

Can we make it work with SIGIO?

>  It doesn't make sense because there's no
> opportunity to process the virtqueue while the guest code is executing
> in parallel like there is with I/O thread.  It will just degrade
> performance when QEMU only has one thread.

Probably. But it's really better to check this than theorethise about
it.

>  I'll send a patch to
> disable it when we build without I/O thread.
> 
> Stefan
Stefan Hajnoczi - Jan. 25, 2011, 1:20 p.m.
On Tue, Jan 25, 2011 at 11:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jan 25, 2011 at 09:49:04AM +0000, Stefan Hajnoczi wrote:
>> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>> >>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>> >>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>> >>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>> >>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>> >>>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
>> >>>>>>> prevents the vcpu from executing guest code while hardware emulation code
>> >>>>>>> handles the notify.
>> >>>>>>>
>> >>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to make
>> >>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
>> >>>>>>> iothread and allowing the VM to continue execution.  This model is similar to
>> >>>>>>> how vhost receives virtqueue notifies.
>> >>>>>>>
>> >>>>>>> The result of this change is improved performance for userspace virtio devices.
>> >>>>>>> Virtio-blk throughput increases especially for multithreaded scenarios and
>> >>>>>>> virtio-net transmit throughput increases substantially.
>> >>>>>>>
>> >>>>>>> Some virtio devices are known to have guest drivers which expect a notify to be
>> >>>>>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
>> >>>>>>> for virtio-blk and virtio-net for now.
>> >>>>>>>
>> >>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>> >>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>> >>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>> >>>>>>> host notifiers as it wishes.
>> >>>>>>>
>> >>>>>>> After migration and on VM change state (running/paused) virtio-ioeventfd
>> >>>>>>> will enable/disable itself.
>> >>>>>>>
>> >>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
>> >>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
>> >>>>>>>  * virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
>> >>>>>>>  * vm_change_state(running=0) -> disable virtio-ioeventfd
>> >>>>>>>  * vm_change_state(running=1) -> enable virtio-ioeventfd
>> >>>>>>>
>> >>>>>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>> >>>>>>
>> >>>>>> On current git master I'm getting hangs when running iozone on a
>> >>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and has
>> >>>>>> 100% CPU consumption.
>> >>>>>>
>> >>>>>> I bisected the problem to this patch. Any ideas?
>> >>>>>>
>> >>>>>> Kevin
>> >>>>>
>> >>>>> Does it help if you set ioeventfd=off on command line?
>> >>>>
>> >>>> Yes, with ioeventfd=off it seems to work fine.
>> >>>>
>> >>>> Kevin
>> >>>
>> >>> Then it's the ioeventfd that is to blame.
>> >>> Is it the io thread that consumes 100% CPU?
>> >>> Or the vcpu thread?
>> >>
>> >> I was building with the default options, i.e. there is no IO thread.
>> >>
>> >> Now I'm just running the test with IO threads enabled, and so far
>> >> everything looks good. So I can only reproduce the problem with IO
>> >> threads disabled.
>> >
>> > Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
>> > (relevant when --enable-io-thread is not used).  I will take a look at
>> > that again and see why we're spinning without checking for ioeventfd
>> > completion.
>>
>> Here's my understanding of --disable-io-thread.  Added Anthony on CC,
>> please correct me.
>>
>> When I/O thread is disabled our only thread runs guest code until an
>> exit request is made.  There are synchronous exit cases like a halt
>> instruction or single step.  There are also asynchronous exit cases
>> when signal handlers use qemu_notify_event(), which does cpu_exit(),
>> to set env->exit_request = 1 and unlink the current tb.
>>
>> With this structure in mind, anything which needs to interrupt the
>> vcpu in order to process events must use signals and
>> qemu_notify_event().  Otherwise that event source may be starved and
>> never processed.
>>
>> virtio-ioeventfd currently does not use signals and will therefore
>> never interrupt the vcpu.
>>
>> However, you normally don't notice the missing signal handler because
>> some other event interrupts the vcpu and we enter select(2) to process
>> all pending handlers.  So virtio-ioeventfd mostly gets a free ride on
>> top of timer events.  This is suboptimal because it adds latency to
>> virtqueue kick - we're waiting for another event to interrupt the vcpu
>> before we can process virtqueue-kick.
>>
>> If any other vcpu interruption makes virtio-ioeventfd chug along then
>> why are you seeing 100% CPU livelock?  My theory is that dynticks has
>> a race condition which causes timers to stop working in QEMU.  Here is
>> an strace of QEMU --disable-io-thread entering live lock.  I can
>> trigger this by starting a VM and running "while true; do true; done"
>> at the shell.  Then strace the QEMU process:
>>
>> 08:04:34.985177 ioctl(11, KVM_RUN, 0)   = 0
>> 08:04:34.985242 --- SIGALRM (Alarm clock) @ 0 (0) ---
>> 08:04:34.985319 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
>> 08:04:34.985368 rt_sigreturn(0x2758ad0) = 0
>> 08:04:34.985423 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
>> 08:04:34.985484 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
>> 08:04:34.985538 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
>> 08:04:34.985588 timer_settime(0, 0, {it_interval={0, 0}, it_value={0,
>> 273000}}, NULL) = 0
>> 08:04:34.985646 ioctl(11, KVM_RUN, 0)   = -1 EINTR (Interrupted system call)
>> 08:04:34.985928 --- SIGALRM (Alarm clock) @ 0 (0) ---
>> 08:04:34.986007 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
>> 08:04:34.986063 rt_sigreturn(0x2758ad0) = -1 EINTR (Interrupted system call)
>> 08:04:34.986124 select(15, [5 8 14], [], [], {0, 0}) = 1 (in [5], left {0, 0})
>> 08:04:34.986188 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
>> 08:04:34.986246 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
>> 08:04:34.986299 timer_settime(0, 0, {it_interval={0, 0}, it_value={0,
>> 250000}}, NULL) = 0
>> 08:04:34.986359 ioctl(11, KVM_INTERRUPT, 0x7fff90404ef0) = 0
>> 08:04:34.986406 ioctl(11, KVM_RUN, 0)   = 0
>> 08:04:34.986465 ioctl(11, KVM_RUN, 0)   = 0              <--- guest
>> finishes execution
>>
>>                 v--- dynticks_rearm_timer() returns early because
>> timer is already scheduled
>> 08:04:34.986533 timer_gettime(0, {it_interval={0, 0}, it_value={0, 24203}}) = 0
>> 08:04:34.986585 --- SIGALRM (Alarm clock) @ 0 (0) ---    <--- timer expires
>> 08:04:34.986661 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
>> 08:04:34.986710 rt_sigreturn(0x2758ad0) = 0
>>
>>                 v--- we re-enter the guest without rearming the timer!
>> 08:04:34.986754 ioctl(11, KVM_RUN^C <unfinished ...>
>> [QEMU hang, 100% CPU]
>>
>> So dynticks fails to rearm the timer before we enter the guest.  This
>> is a race condition: we check that there is already a timer scheduled
>> and head on towards re-entering the guest, the timer expires before we
>> enter the guest, we re-enter the guest without realizing the timer has
>> expired.  Now we're inside the guest without the hope of a timer
>> expiring - and the guest is running a CPU-bound workload that doesn't
>> need to perform I/O.
>>
>> The result is a hung QEMU (screen does not update) and a softlockup
>> inside the guest once we do kick it to life again (by detaching
>> strace).
>>
>> I think the only way to avoid this race condition in dynticks is to
>> mask SIGALRM, then check if the timer expired, and then ioctl(KVM_RUN)
>> with atomic signal mask change back to SIGALRM enabled.  Thoughts?
>>
>> Back to virtio-ioeventfd, we really shouldn't use virtio-ioeventfd
>> when there is no I/O thread.
>
> Can we make it work with SIGIO?
>
>>  It doesn't make sense because there's no
>> opportunity to process the virtqueue while the guest code is executing
>> in parallel like there is with I/O thread.  It will just degrade
>> performance when QEMU only has one thread.
>
> Probably. But it's really better to check this than theorethise about
> it.

eventfd does not seem to support O_ASYNC.  After adding the necessary
code into QEMU no signals were firing so I wrote a test:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/eventfd.h>

int main(int argc, char **argv)
{
        int fd = eventfd(0, 0);
        if (fd < 0) {
                perror("eventfd");
                exit(1);
        }

        if (fcntl(fd, F_SETSIG, SIGTERM) < 0) {
                perror("fcntl(F_SETSIG)");
                exit(1);
        }

        if (fcntl(fd, F_SETOWN, getpid()) < 0) {
                perror("fcntl(F_SETOWN)");
                exit(1);
        }

        if (fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC) < 0) {
                perror("fcntl(F_SETFL)");
                exit(1);
        }

        switch (fork()) {
        case -1:
                perror("fork");
                exit(1);

        case 0:         /* child */
                eventfd_write(fd, 1);
                exit(0);

        default:        /* parent */
                break;
        }

        sleep(5);
        wait(NULL);
        close(fd);
        return 0;
}

I'd expect the parent to get a SIGTERM but the process just sleeps and
then exits.  When replacing the eventfd with a pipe in this program
the parent does receive a SIGKILL.

Stefan
Stefan Hajnoczi - Jan. 25, 2011, 2:07 p.m.
On Tue, Jan 25, 2011 at 1:20 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> eventfd does not seem to support O_ASYNC.

linux-2.6/fs/eventfd.c does not implement file_operations::fasync() so
I'm convinced SIGIO is not possible here.

I have sent a patch to disable virtio-ioeventfd when !CONFIG_IOTHREAD.

Stefan
Anthony Liguori - Jan. 25, 2011, 7:18 p.m.
On 01/25/2011 03:49 AM, Stefan Hajnoczi wrote:
> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi<stefanha@gmail.com>  wrote:
>    
>> On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>      
>>> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>>>        
>>>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>>>>          
>>>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>>>>            
>>>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>>>>              
>>>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>>>>                
>>>>>>>> Virtqueue notify is currently handled synchronously in userspace virtio.  This
>>>>>>>> prevents the vcpu from executing guest code while hardware emulation code
>>>>>>>> handles the notify.
>>>>>>>>
>>>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to make
>>>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation to the
>>>>>>>> iothread and allowing the VM to continue execution.  This model is similar to
>>>>>>>> how vhost receives virtqueue notifies.
>>>>>>>>
>>>>>>>> The result of this change is improved performance for userspace virtio devices.
>>>>>>>> Virtio-blk throughput increases especially for multithreaded scenarios and
>>>>>>>> virtio-net transmit throughput increases substantially.
>>>>>>>>
>>>>>>>> Some virtio devices are known to have guest drivers which expect a notify to be
>>>>>>>> processed synchronously and spin waiting for completion.  Only enable ioeventfd
>>>>>>>> for virtio-blk and virtio-net for now.
>>>>>>>>
>>>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>>>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal with
>>>>>>>> host notifiers as it wishes.
>>>>>>>>
>>>>>>>> After migration and on VM change state (running/paused) virtio-ioeventfd
>>>>>>>> will enable/disable itself.
>>>>>>>>
>>>>>>>>   * VIRTIO_CONFIG_S_DRIVER_OK ->  enable virtio-ioeventfd
>>>>>>>>   * !VIRTIO_CONFIG_S_DRIVER_OK ->  disable virtio-ioeventfd
>>>>>>>>   * virtio_pci_set_host_notifier() ->  disable virtio-ioeventfd
>>>>>>>>   * vm_change_state(running=0) ->  disable virtio-ioeventfd
>>>>>>>>   * vm_change_state(running=1) ->  enable virtio-ioeventfd
>>>>>>>>
>>>>>>>> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>>>>>>>                  
>>>>>>> On current git master I'm getting hangs when running iozone on a
>>>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and has
>>>>>>> 100% CPU consumption.
>>>>>>>
>>>>>>> I bisected the problem to this patch. Any ideas?
>>>>>>>
>>>>>>> Kevin
>>>>>>>                
>>>>>> Does it help if you set ioeventfd=off on command line?
>>>>>>              
>>>>> Yes, with ioeventfd=off it seems to work fine.
>>>>>
>>>>> Kevin
>>>>>            
>>>> Then it's the ioeventfd that is to blame.
>>>> Is it the io thread that consumes 100% CPU?
>>>> Or the vcpu thread?
>>>>          
>>> I was building with the default options, i.e. there is no IO thread.
>>>
>>> Now I'm just running the test with IO threads enabled, and so far
>>> everything looks good. So I can only reproduce the problem with IO
>>> threads disabled.
>>>        
>> Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
>> (relevant when --enable-io-thread is not used).  I will take a look at
>> that again and see why we're spinning without checking for ioeventfd
>> completion.
>>      
> Here's my understanding of --disable-io-thread.  Added Anthony on CC,
> please correct me.
>
> When I/O thread is disabled our only thread runs guest code until an
> exit request is made.  There are synchronous exit cases like a halt
> instruction or single step.  There are also asynchronous exit cases
> when signal handlers use qemu_notify_event(), which does cpu_exit(),
> to set env->exit_request = 1 and unlink the current tb.
>    

Correct.

Note that this is a problem today.  If you have a tight loop in TCG and 
you have nothing that would generate a signal (no pending disk I/O and 
no periodic timer) then the main loop is starved.

This is a fundamental flaw in TCG.

Regards,

Anthony Liguori
Stefan Hajnoczi - Jan. 25, 2011, 7:45 p.m.
On Tue, Jan 25, 2011 at 7:18 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> On 01/25/2011 03:49 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi<stefanha@gmail.com>
>>  wrote:
>>
>>>
>>> On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>
>>>>
>>>> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>>>>
>>>>>
>>>>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>>>>>
>>>>>>
>>>>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Virtqueue notify is currently handled synchronously in userspace
>>>>>>>>> virtio.  This
>>>>>>>>> prevents the vcpu from executing guest code while hardware
>>>>>>>>> emulation code
>>>>>>>>> handles the notify.
>>>>>>>>>
>>>>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to
>>>>>>>>> make
>>>>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation
>>>>>>>>> to the
>>>>>>>>> iothread and allowing the VM to continue execution.  This model is
>>>>>>>>> similar to
>>>>>>>>> how vhost receives virtqueue notifies.
>>>>>>>>>
>>>>>>>>> The result of this change is improved performance for userspace
>>>>>>>>> virtio devices.
>>>>>>>>> Virtio-blk throughput increases especially for multithreaded
>>>>>>>>> scenarios and
>>>>>>>>> virtio-net transmit throughput increases substantially.
>>>>>>>>>
>>>>>>>>> Some virtio devices are known to have guest drivers which expect a
>>>>>>>>> notify to be
>>>>>>>>> processed synchronously and spin waiting for completion.  Only
>>>>>>>>> enable ioeventfd
>>>>>>>>> for virtio-blk and virtio-net for now.
>>>>>>>>>
>>>>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>>>>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal
>>>>>>>>> with
>>>>>>>>> host notifiers as it wishes.
>>>>>>>>>
>>>>>>>>> After migration and on VM change state (running/paused)
>>>>>>>>> virtio-ioeventfd
>>>>>>>>> will enable/disable itself.
>>>>>>>>>
>>>>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK ->  enable virtio-ioeventfd
>>>>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK ->  disable virtio-ioeventfd
>>>>>>>>>  * virtio_pci_set_host_notifier() ->  disable virtio-ioeventfd
>>>>>>>>>  * vm_change_state(running=0) ->  disable virtio-ioeventfd
>>>>>>>>>  * vm_change_state(running=1) ->  enable virtio-ioeventfd
>>>>>>>>>
>>>>>>>>> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>>>>>>>>
>>>>>>>>
>>>>>>>> On current git master I'm getting hangs when running iozone on a
>>>>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and
>>>>>>>> has
>>>>>>>> 100% CPU consumption.
>>>>>>>>
>>>>>>>> I bisected the problem to this patch. Any ideas?
>>>>>>>>
>>>>>>>> Kevin
>>>>>>>>
>>>>>>>
>>>>>>> Does it help if you set ioeventfd=off on command line?
>>>>>>>
>>>>>>
>>>>>> Yes, with ioeventfd=off it seems to work fine.
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>
>>>>> Then it's the ioeventfd that is to blame.
>>>>> Is it the io thread that consumes 100% CPU?
>>>>> Or the vcpu thread?
>>>>>
>>>>
>>>> I was building with the default options, i.e. there is no IO thread.
>>>>
>>>> Now I'm just running the test with IO threads enabled, and so far
>>>> everything looks good. So I can only reproduce the problem with IO
>>>> threads disabled.
>>>>
>>>
>>> Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
>>> (relevant when --enable-io-thread is not used).  I will take a look at
>>> that again and see why we're spinning without checking for ioeventfd
>>> completion.
>>>
>>
>> Here's my understanding of --disable-io-thread.  Added Anthony on CC,
>> please correct me.
>>
>> When I/O thread is disabled our only thread runs guest code until an
>> exit request is made.  There are synchronous exit cases like a halt
>> instruction or single step.  There are also asynchronous exit cases
>> when signal handlers use qemu_notify_event(), which does cpu_exit(),
>> to set env->exit_request = 1 and unlink the current tb.
>>
>
> Correct.
>
> Note that this is a problem today.  If you have a tight loop in TCG and you
> have nothing that would generate a signal (no pending disk I/O and no
> periodic timer) then the main loop is starved.

Even with KVM we can spin inside the guest and get a softlockup due to
the dynticks race condition shown above.  In a CPU bound guest that's
doing no I/O it's possible to go AWOL for extended periods of time.

I can think of two solutions:
1. Block SIGALRM during critical regions, not sure if the necessary
atomic signal mask capabilities are there in KVM.  Haven't looked at
TCG yet either.
2. Make a portion of the timer code signal-safe and rearm the timer
from within the SIGLARM handler.

Stefan
Anthony Liguori - Jan. 25, 2011, 7:51 p.m.
On 01/25/2011 01:45 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 25, 2011 at 7:18 PM, Anthony Liguori
> <aliguori@linux.vnet.ibm.com>  wrote:
>    
>> On 01/25/2011 03:49 AM, Stefan Hajnoczi wrote:
>>      
>>> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi<stefanha@gmail.com>
>>>   wrote:
>>>
>>>        
>>>> On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf<kwolf@redhat.com>    wrote:
>>>>
>>>>          
>>>>> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>>>>>
>>>>>            
>>>>>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>>>>>>
>>>>>>              
>>>>>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>>>>>>
>>>>>>>                
>>>>>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>>> Virtqueue notify is currently handled synchronously in userspace
>>>>>>>>>> virtio.  This
>>>>>>>>>> prevents the vcpu from executing guest code while hardware
>>>>>>>>>> emulation code
>>>>>>>>>> handles the notify.
>>>>>>>>>>
>>>>>>>>>> On systems that support KVM, the ioeventfd mechanism can be used to
>>>>>>>>>> make
>>>>>>>>>> virtqueue notify a lightweight exit by deferring hardware emulation
>>>>>>>>>> to the
>>>>>>>>>> iothread and allowing the VM to continue execution.  This model is
>>>>>>>>>> similar to
>>>>>>>>>> how vhost receives virtqueue notifies.
>>>>>>>>>>
>>>>>>>>>> The result of this change is improved performance for userspace
>>>>>>>>>> virtio devices.
>>>>>>>>>> Virtio-blk throughput increases especially for multithreaded
>>>>>>>>>> scenarios and
>>>>>>>>>> virtio-net transmit throughput increases substantially.
>>>>>>>>>>
>>>>>>>>>> Some virtio devices are known to have guest drivers which expect a
>>>>>>>>>> notify to be
>>>>>>>>>> processed synchronously and spin waiting for completion.  Only
>>>>>>>>>> enable ioeventfd
>>>>>>>>>> for virtio-blk and virtio-net for now.
>>>>>>>>>>
>>>>>>>>>> Care must be taken not to interfere with vhost-net, which uses host
>>>>>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal
>>>>>>>>>> with
>>>>>>>>>> host notifiers as it wishes.
>>>>>>>>>>
>>>>>>>>>> After migration and on VM change state (running/paused)
>>>>>>>>>> virtio-ioeventfd
>>>>>>>>>> will enable/disable itself.
>>>>>>>>>>
>>>>>>>>>>   * VIRTIO_CONFIG_S_DRIVER_OK ->    enable virtio-ioeventfd
>>>>>>>>>>   * !VIRTIO_CONFIG_S_DRIVER_OK ->    disable virtio-ioeventfd
>>>>>>>>>>   * virtio_pci_set_host_notifier() ->    disable virtio-ioeventfd
>>>>>>>>>>   * vm_change_state(running=0) ->    disable virtio-ioeventfd
>>>>>>>>>>   * vm_change_state(running=1) ->    enable virtio-ioeventfd
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>> On current git master I'm getting hangs when running iozone on a
>>>>>>>>> virtio-blk disk. "Hang" means that it's not responsive any more and
>>>>>>>>> has
>>>>>>>>> 100% CPU consumption.
>>>>>>>>>
>>>>>>>>> I bisected the problem to this patch. Any ideas?
>>>>>>>>>
>>>>>>>>> Kevin
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>> Does it help if you set ioeventfd=off on command line?
>>>>>>>>
>>>>>>>>                  
>>>>>>> Yes, with ioeventfd=off it seems to work fine.
>>>>>>>
>>>>>>> Kevin
>>>>>>>
>>>>>>>                
>>>>>> Then it's the ioeventfd that is to blame.
>>>>>> Is it the io thread that consumes 100% CPU?
>>>>>> Or the vcpu thread?
>>>>>>
>>>>>>              
>>>>> I was building with the default options, i.e. there is no IO thread.
>>>>>
>>>>> Now I'm just running the test with IO threads enabled, and so far
>>>>> everything looks good. So I can only reproduce the problem with IO
>>>>> threads disabled.
>>>>>
>>>>>            
>>>> Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
>>>> (relevant when --enable-io-thread is not used).  I will take a look at
>>>> that again and see why we're spinning without checking for ioeventfd
>>>> completion.
>>>>
>>>>          
>>> Here's my understanding of --disable-io-thread.  Added Anthony on CC,
>>> please correct me.
>>>
>>> When I/O thread is disabled our only thread runs guest code until an
>>> exit request is made.  There are synchronous exit cases like a halt
>>> instruction or single step.  There are also asynchronous exit cases
>>> when signal handlers use qemu_notify_event(), which does cpu_exit(),
>>> to set env->exit_request = 1 and unlink the current tb.
>>>
>>>        
>> Correct.
>>
>> Note that this is a problem today.  If you have a tight loop in TCG and you
>> have nothing that would generate a signal (no pending disk I/O and no
>> periodic timer) then the main loop is starved.
>>      
> Even with KVM we can spin inside the guest and get a softlockup due to
> the dynticks race condition shown above.  In a CPU bound guest that's
> doing no I/O it's possible to go AWOL for extended periods of time.
>    

This is a different race.  I need to look more deeply into the code.

> I can think of two solutions:
> 1. Block SIGALRM during critical regions, not sure if the necessary
> atomic signal mask capabilities are there in KVM.  Haven't looked at
> TCG yet either.
> 2. Make a portion of the timer code signal-safe and rearm the timer
> from within the SIGLARM handler.
>    

Or, switch to timerfd and stop using a signal based alarm timer.

Regards,

Anthony Liguori



> Stefan
>
Stefan Hajnoczi - Jan. 25, 2011, 7:59 p.m.
On Tue, Jan 25, 2011 at 7:51 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> On 01/25/2011 01:45 PM, Stefan Hajnoczi wrote:
>>
>> On Tue, Jan 25, 2011 at 7:18 PM, Anthony Liguori
>> <aliguori@linux.vnet.ibm.com>  wrote:
>>
>>>
>>> On 01/25/2011 03:49 AM, Stefan Hajnoczi wrote:
>>>
>>>>
>>>> On Tue, Jan 25, 2011 at 7:12 AM, Stefan Hajnoczi<stefanha@gmail.com>
>>>>  wrote:
>>>>
>>>>
>>>>>
>>>>> On Mon, Jan 24, 2011 at 8:05 PM, Kevin Wolf<kwolf@redhat.com>    wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Am 24.01.2011 20:47, schrieb Michael S. Tsirkin:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 24, 2011 at 08:48:05PM +0100, Kevin Wolf wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Am 24.01.2011 20:36, schrieb Michael S. Tsirkin:
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jan 24, 2011 at 07:54:20PM +0100, Kevin Wolf wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Am 12.12.2010 16:02, schrieb Stefan Hajnoczi:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Virtqueue notify is currently handled synchronously in userspace
>>>>>>>>>>> virtio.  This
>>>>>>>>>>> prevents the vcpu from executing guest code while hardware
>>>>>>>>>>> emulation code
>>>>>>>>>>> handles the notify.
>>>>>>>>>>>
>>>>>>>>>>> On systems that support KVM, the ioeventfd mechanism can be used
>>>>>>>>>>> to
>>>>>>>>>>> make
>>>>>>>>>>> virtqueue notify a lightweight exit by deferring hardware
>>>>>>>>>>> emulation
>>>>>>>>>>> to the
>>>>>>>>>>> iothread and allowing the VM to continue execution.  This model
>>>>>>>>>>> is
>>>>>>>>>>> similar to
>>>>>>>>>>> how vhost receives virtqueue notifies.
>>>>>>>>>>>
>>>>>>>>>>> The result of this change is improved performance for userspace
>>>>>>>>>>> virtio devices.
>>>>>>>>>>> Virtio-blk throughput increases especially for multithreaded
>>>>>>>>>>> scenarios and
>>>>>>>>>>> virtio-net transmit throughput increases substantially.
>>>>>>>>>>>
>>>>>>>>>>> Some virtio devices are known to have guest drivers which expect
>>>>>>>>>>> a
>>>>>>>>>>> notify to be
>>>>>>>>>>> processed synchronously and spin waiting for completion.  Only
>>>>>>>>>>> enable ioeventfd
>>>>>>>>>>> for virtio-blk and virtio-net for now.
>>>>>>>>>>>
>>>>>>>>>>> Care must be taken not to interfere with vhost-net, which uses
>>>>>>>>>>> host
>>>>>>>>>>> notifiers.  If the set_host_notifier() API is used by a device
>>>>>>>>>>> virtio-pci will disable virtio-ioeventfd and let the device deal
>>>>>>>>>>> with
>>>>>>>>>>> host notifiers as it wishes.
>>>>>>>>>>>
>>>>>>>>>>> After migration and on VM change state (running/paused)
>>>>>>>>>>> virtio-ioeventfd
>>>>>>>>>>> will enable/disable itself.
>>>>>>>>>>>
>>>>>>>>>>>  * VIRTIO_CONFIG_S_DRIVER_OK ->    enable virtio-ioeventfd
>>>>>>>>>>>  * !VIRTIO_CONFIG_S_DRIVER_OK ->    disable virtio-ioeventfd
>>>>>>>>>>>  * virtio_pci_set_host_notifier() ->    disable virtio-ioeventfd
>>>>>>>>>>>  * vm_change_state(running=0) ->    disable virtio-ioeventfd
>>>>>>>>>>>  * vm_change_state(running=1) ->    enable virtio-ioeventfd
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On current git master I'm getting hangs when running iozone on a
>>>>>>>>>> virtio-blk disk. "Hang" means that it's not responsive any more
>>>>>>>>>> and
>>>>>>>>>> has
>>>>>>>>>> 100% CPU consumption.
>>>>>>>>>>
>>>>>>>>>> I bisected the problem to this patch. Any ideas?
>>>>>>>>>>
>>>>>>>>>> Kevin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does it help if you set ioeventfd=off on command line?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, with ioeventfd=off it seems to work fine.
>>>>>>>>
>>>>>>>> Kevin
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Then it's the ioeventfd that is to blame.
>>>>>>> Is it the io thread that consumes 100% CPU?
>>>>>>> Or the vcpu thread?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I was building with the default options, i.e. there is no IO thread.
>>>>>>
>>>>>> Now I'm just running the test with IO threads enabled, and so far
>>>>>> everything looks good. So I can only reproduce the problem with IO
>>>>>> threads disabled.
>>>>>>
>>>>>>
>>>>>
>>>>> Hrm...aio uses SIGUSR2 to force the vcpu to process aio completions
>>>>> (relevant when --enable-io-thread is not used).  I will take a look at
>>>>> that again and see why we're spinning without checking for ioeventfd
>>>>> completion.
>>>>>
>>>>>
>>>>
>>>> Here's my understanding of --disable-io-thread.  Added Anthony on CC,
>>>> please correct me.
>>>>
>>>> When I/O thread is disabled our only thread runs guest code until an
>>>> exit request is made.  There are synchronous exit cases like a halt
>>>> instruction or single step.  There are also asynchronous exit cases
>>>> when signal handlers use qemu_notify_event(), which does cpu_exit(),
>>>> to set env->exit_request = 1 and unlink the current tb.
>>>>
>>>>
>>>
>>> Correct.
>>>
>>> Note that this is a problem today.  If you have a tight loop in TCG and
>>> you
>>> have nothing that would generate a signal (no pending disk I/O and no
>>> periodic timer) then the main loop is starved.
>>>
>>
>> Even with KVM we can spin inside the guest and get a softlockup due to
>> the dynticks race condition shown above.  In a CPU bound guest that's
>> doing no I/O it's possible to go AWOL for extended periods of time.
>>
>
> This is a different race.  I need to look more deeply into the code.

int kvm_cpu_exec(CPUState *env)
{
    struct kvm_run *run = env->kvm_run;
    int ret;

    DPRINTF("kvm_cpu_exec()\n");

    do {

This is broken because a signal handler could change env->exit_request
after this check:

#ifndef CONFIG_IOTHREAD
        if (env->exit_request) {
            DPRINTF("interrupt exit requested\n");
            ret = 0;
            break;
        }
#endif

        if (kvm_arch_process_irqchip_events(env)) {
            ret = 0;
            break;
        }

        if (env->kvm_vcpu_dirty) {
            kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
            env->kvm_vcpu_dirty = 0;
        }

        kvm_arch_pre_run(env, run);
        cpu_single_env = NULL;
        qemu_mutex_unlock_iothread();

env->exit_request might be set but we still reenter, possibly without
rearming the timer:
        ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);

>> I can think of two solutions:
>> 1. Block SIGALRM during critical regions, not sure if the necessary
>> atomic signal mask capabilities are there in KVM.  Haven't looked at
>> TCG yet either.
>> 2. Make a portion of the timer code signal-safe and rearm the timer
>> from within the SIGLARM handler.
>>
>
> Or, switch to timerfd and stop using a signal based alarm timer.

Doesn't work for !CONFIG_IOTHREAD.

Stefan
Anthony Liguori - Jan. 26, 2011, 12:18 a.m.
On 01/25/2011 01:59 PM, Stefan Hajnoczi wrote:
> int kvm_cpu_exec(CPUState *env)
> {
>      struct kvm_run *run = env->kvm_run;
>      int ret;
>
>      DPRINTF("kvm_cpu_exec()\n");
>
>      do {
>
> This is broken because a signal handler could change env->exit_request
> after this check:
>
> #ifndef CONFIG_IOTHREAD
>          if (env->exit_request) {
>              DPRINTF("interrupt exit requested\n");
>              ret = 0;
>              break;
>          }
> #endif
>    

Yeah, this is classic signal/select race with ioctl(KVM_RUN) subbing in 
for select.  But this is supposed to be mitigated by the fact that we 
block SIG_IPI except for when we execute KVM_RUN which means that we can 
reliably send SIG_IPI.

Of course, that doesn't help for SIGALRM unless we send a SIG_IPI from 
the SIGALRM handler which we do with the I/O thread but not w/o it.  At 
any rate, post stable-0.14, I want to enable I/O thread by default so I 
don't know that we really need to fix this...

>          if (kvm_arch_process_irqchip_events(env)) {
>              ret = 0;
>              break;
>          }
>
>          if (env->kvm_vcpu_dirty) {
>              kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
>              env->kvm_vcpu_dirty = 0;
>          }
>
>          kvm_arch_pre_run(env, run);
>          cpu_single_env = NULL;
>          qemu_mutex_unlock_iothread();
>
> env->exit_request might be set but we still reenter, possibly without
> rearming the timer:
>          ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);
>
>    
>>> I can think of two solutions:
>>> 1. Block SIGALRM during critical regions, not sure if the necessary
>>> atomic signal mask capabilities are there in KVM.  Haven't looked at
>>> TCG yet either.
>>> 2. Make a portion of the timer code signal-safe and rearm the timer
>>> from within the SIGLARM handler.
>>>
>>>        
>> Or, switch to timerfd and stop using a signal based alarm timer.
>>      
> Doesn't work for !CONFIG_IOTHREAD.
>    

Yeah, we need to get rid of !CONFIG_IOTHREAD.  We need to run select() 
in parallel with TCG/KVM and interrupt the VCPUs appropriately when 
select() returns.

Regards,

Anthony Liguori

> Stefan
>
>

Patch

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 13dd391..f57c45a 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -83,6 +83,11 @@ 
 /* Flags track per-device state like workarounds for quirks in older guests. */
 #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG  (1 << 0)
 
+/* Performance improves when virtqueue kick processing is decoupled from the
+ * vcpu thread using ioeventfd for some devices. */
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT 1
+#define VIRTIO_PCI_FLAG_USE_IOEVENTFD   (1 << VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT)
+
 /* QEMU doesn't strictly need write barriers since everything runs in
  * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
  * KVM or if kqemu gets SMP support.
@@ -107,6 +112,8 @@  typedef struct {
     /* Max. number of ports we can have for a the virtio-serial device */
     uint32_t max_virtserial_ports;
     virtio_net_conf net;
+    bool ioeventfd_started;
+    VMChangeStateEntry *vm_change_state_entry;
 } VirtIOPCIProxy;
 
 /* virtio device */
@@ -179,12 +186,131 @@  static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f)
     return 0;
 }
 
+static int virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
+                                                 int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    int r;
+    if (assign) {
+        r = event_notifier_init(notifier, 1);
+        if (r < 0) {
+            return r;
+        }
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            event_notifier_cleanup(notifier);
+        }
+    } else {
+        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
+                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                                       n, assign);
+        if (r < 0) {
+            return r;
+        }
+
+        /* Handle the race condition where the guest kicked and we deassigned
+         * before we got around to handling the kick.
+         */
+        if (event_notifier_test_and_clear(notifier)) {
+            virtio_queue_notify_vq(vq);
+        }
+
+        event_notifier_cleanup(notifier);
+    }
+    return r;
+}
+
+static void virtio_pci_host_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_get_host_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_queue_notify_vq(vq);
+    }
+}
+
+static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy,
+                                                    int n, bool assign)
+{
+    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
+    if (assign) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_host_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+}
+
+static int virtio_pci_start_ioeventfd(VirtIOPCIProxy *proxy)
+{
+    int n, r;
+
+    if (!(proxy->flags & VIRTIO_PCI_FLAG_USE_IOEVENTFD) ||
+        proxy->ioeventfd_started) {
+        return 0;
+    }
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX; n++) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        r = virtio_pci_set_host_notifier_internal(proxy, n, true);
+        if (r < 0) {
+            goto assign_error;
+        }
+
+        virtio_pci_set_host_notifier_fd_handler(proxy, n, true);
+    }
+    proxy->ioeventfd_started = true;
+    return 0;
+
+assign_error:
+    while (--n >= 0) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        virtio_pci_set_host_notifier_fd_handler(proxy, n, false);
+        virtio_pci_set_host_notifier_internal(proxy, n, false);
+    }
+    proxy->ioeventfd_started = false;
+    proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    return r;
+}
+
+static int virtio_pci_stop_ioeventfd(VirtIOPCIProxy *proxy)
+{
+    int n;
+
+    if (!proxy->ioeventfd_started) {
+        return 0;
+    }
+
+    for (n = 0; n < VIRTIO_PCI_QUEUE_MAX; n++) {
+        if (!virtio_queue_get_num(proxy->vdev, n)) {
+            continue;
+        }
+
+        virtio_pci_set_host_notifier_fd_handler(proxy, n, false);
+        virtio_pci_set_host_notifier_internal(proxy, n, false);
+    }
+    proxy->ioeventfd_started = false;
+    return 0;
+}
+
 static void virtio_pci_reset(DeviceState *d)
 {
     VirtIOPCIProxy *proxy = container_of(d, VirtIOPCIProxy, pci_dev.qdev);
+    virtio_pci_stop_ioeventfd(proxy);
     virtio_reset(proxy->vdev);
     msix_reset(&proxy->pci_dev);
-    proxy->flags = 0;
+    proxy->flags &= ~VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
 }
 
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
@@ -209,6 +335,7 @@  static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
     case VIRTIO_PCI_QUEUE_PFN:
         pa = (target_phys_addr_t)val << VIRTIO_PCI_QUEUE_ADDR_SHIFT;
         if (pa == 0) {
+            virtio_pci_stop_ioeventfd(proxy);
             virtio_reset(proxy->vdev);
             msix_unuse_all_vectors(&proxy->pci_dev);
         }
@@ -223,6 +350,12 @@  static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         virtio_queue_notify(vdev, val);
         break;
     case VIRTIO_PCI_STATUS:
+        if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
+            virtio_pci_start_ioeventfd(proxy);
+        } else {
+            virtio_pci_stop_ioeventfd(proxy);
+        }
+
         virtio_set_status(vdev, val & 0xFF);
         if (vdev->status == 0) {
             virtio_reset(proxy->vdev);
@@ -403,6 +536,7 @@  static void virtio_write_config(PCIDevice *pci_dev, uint32_t address,
     if (PCI_COMMAND == address) {
         if (!(val & PCI_COMMAND_MASTER)) {
             if (!(proxy->flags & VIRTIO_PCI_FLAG_BUS_MASTER_BUG)) {
+                virtio_pci_stop_ioeventfd(proxy);
                 virtio_set_status(proxy->vdev,
                                   proxy->vdev->status & ~VIRTIO_CONFIG_S_DRIVER_OK);
             }
@@ -480,30 +614,27 @@  assign_error:
 static int virtio_pci_set_host_notifier(void *opaque, int n, bool assign)
 {
     VirtIOPCIProxy *proxy = opaque;
-    VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
-    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
-    int r;
-    if (assign) {
-        r = event_notifier_init(notifier, 1);
-        if (r < 0) {
-            return r;
-        }
-        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
-                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
-                                       n, assign);
-        if (r < 0) {
-            event_notifier_cleanup(notifier);
-        }
+
+    /* Stop using ioeventfd for virtqueue kick if the device starts using host
+     * notifiers.  This makes it easy to avoid stepping on each others' toes.
+     */
+    if (proxy->ioeventfd_started) {
+        virtio_pci_stop_ioeventfd(proxy);
+        proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
+    }
+
+    return virtio_pci_set_host_notifier_internal(proxy, n, assign);
+}
+
+static void virtio_pci_vm_change_state_handler(void *opaque, int running, int reason)
+{
+    VirtIOPCIProxy *proxy = opaque;
+
+    if (running && (proxy->vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
+        virtio_pci_start_ioeventfd(proxy);
     } else {
-        r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
-                                       proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
-                                       n, assign);
-        if (r < 0) {
-            return r;
-        }
-        event_notifier_cleanup(notifier);
+        virtio_pci_stop_ioeventfd(proxy);
     }
-    return r;
 }
 
 static const VirtIOBindings virtio_pci_bindings = {
@@ -563,6 +694,10 @@  static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
     proxy->host_features |= 0x1 << VIRTIO_F_NOTIFY_ON_EMPTY;
     proxy->host_features |= 0x1 << VIRTIO_F_BAD_FEATURE;
     proxy->host_features = vdev->get_features(vdev, proxy->host_features);
+
+    proxy->vm_change_state_entry = qemu_add_vm_change_state_handler(
+                                        virtio_pci_vm_change_state_handler,
+                                        proxy);
 }
 
 static int virtio_blk_init_pci(PCIDevice *pci_dev)
@@ -590,6 +725,9 @@  static int virtio_blk_init_pci(PCIDevice *pci_dev)
 
 static int virtio_exit_pci(PCIDevice *pci_dev)
 {
+    VirtIOPCIProxy *proxy = DO_UPCAST(VirtIOPCIProxy, pci_dev, pci_dev);
+
+    qemu_del_vm_change_state_handler(proxy->vm_change_state_entry);
     return msix_uninit(pci_dev);
 }
 
@@ -597,6 +735,7 @@  static int virtio_blk_exit_pci(PCIDevice *pci_dev)
 {
     VirtIOPCIProxy *proxy = DO_UPCAST(VirtIOPCIProxy, pci_dev, pci_dev);
 
+    virtio_pci_stop_ioeventfd(proxy);
     virtio_blk_exit(proxy->vdev);
     blockdev_mark_auto_del(proxy->block.bs);
     return virtio_exit_pci(pci_dev);
@@ -658,6 +797,7 @@  static int virtio_net_exit_pci(PCIDevice *pci_dev)
 {
     VirtIOPCIProxy *proxy = DO_UPCAST(VirtIOPCIProxy, pci_dev, pci_dev);
 
+    virtio_pci_stop_ioeventfd(proxy);
     virtio_net_exit(proxy->vdev);
     return virtio_exit_pci(pci_dev);
 }
@@ -705,6 +845,8 @@  static PCIDeviceInfo virtio_info[] = {
         .qdev.props = (Property[]) {
             DEFINE_PROP_HEX32("class", VirtIOPCIProxy, class_code, 0),
             DEFINE_BLOCK_PROPERTIES(VirtIOPCIProxy, block),
+            DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
+                            VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
             DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
             DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
             DEFINE_PROP_END_OF_LIST(),
@@ -717,6 +859,8 @@  static PCIDeviceInfo virtio_info[] = {
         .exit       = virtio_net_exit_pci,
         .romfile    = "pxe-virtio.bin",
         .qdev.props = (Property[]) {
+            DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
+                            VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
             DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
             DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
             DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
diff --git a/hw/virtio.c b/hw/virtio.c
index 07dbf86..e40296a 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -575,11 +575,19 @@  int virtio_queue_get_num(VirtIODevice *vdev, int n)
     return vdev->vq[n].vring.num;
 }
 
+void virtio_queue_notify_vq(VirtQueue *vq)
+{
+    if (vq->vring.desc) {
+        VirtIODevice *vdev = vq->vdev;
+        trace_virtio_queue_notify(vdev, vq - vdev->vq, vq);
+        vq->handle_output(vdev, vq);
+    }
+}
+
 void virtio_queue_notify(VirtIODevice *vdev, int n)
 {
-    if (n < VIRTIO_PCI_QUEUE_MAX && vdev->vq[n].vring.desc) {
-        trace_virtio_queue_notify(vdev, n, &vdev->vq[n]);
-        vdev->vq[n].handle_output(vdev, &vdev->vq[n]);
+    if (n < VIRTIO_PCI_QUEUE_MAX) {
+        virtio_queue_notify_vq(&vdev->vq[n]);
     }
 }
 
diff --git a/hw/virtio.h b/hw/virtio.h
index 02fa312..5ae521c 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -219,5 +219,6 @@  void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
 VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n);
 EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
+void virtio_queue_notify_vq(VirtQueue *vq);
 void virtio_irq(VirtQueue *vq);
 #endif