Patchwork kvm: Move kvm_allows_irq0_override() to target-i386

login
register
mail settings
Submitter Peter Maydell
Date July 20, 2012, 7:14 p.m.
Message ID <1342811652-16931-1-git-send-email-peter.maydell@linaro.org>
Download mbox | patch
Permalink /patch/172326/
State New
Headers show

Comments

Peter Maydell - July 20, 2012, 7:14 p.m.
kvm_allows_irq0_override() is a totally x86 specific concept:
move it to the target-specific source file where it belongs.
This means we need a new header file for the prototype:
kvm_i386.h, in line with the existing kvm_ppc.h.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
I'm sure this isn't the only x86ism in the KVM generic source
files. However the thing I'm specifically trying to do is
nuke all the uses of kvm_irqchip_in_kernel() in common code,
as part of trying to untangle "what irqchip model do you want"
from other aspects of the interrupt handling model.

Other than this one, there are uses in cpus.c and hw/virtio-pci.c,
but I haven't figured out yet exactly what those bits of code
are trying to do...

 hw/pc.c                |    1 +
 kvm-all.c              |    5 -----
 kvm.h                  |    2 --
 target-i386/kvm.c      |    6 ++++++
 target-i386/kvm_i386.h |   16 ++++++++++++++++
 5 files changed, 23 insertions(+), 7 deletions(-)
 create mode 100644 target-i386/kvm_i386.h
Jan Kiszka - July 21, 2012, 6:57 a.m.
On 2012-07-20 21:14, Peter Maydell wrote:
> kvm_allows_irq0_override() is a totally x86 specific concept:
> move it to the target-specific source file where it belongs.
> This means we need a new header file for the prototype:
> kvm_i386.h, in line with the existing kvm_ppc.h.

First of all, the patch is inconsistent as it lacks something like
target-i386/kvm-stub.c (IOW, you forgot about kvm_allows_irq0_override
in kvm-stub.c). But the approach is wrong in general, see below.

> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> I'm sure this isn't the only x86ism in the KVM generic source
> files. However the thing I'm specifically trying to do is
> nuke all the uses of kvm_irqchip_in_kernel() in common code,

No, "irqchip in kernel" is supposed to be a generic concept. We will
also have it on Power. Not sure what your plans are for ARM, maybe it
will always be true there.

That said, maybe there is room for discussion about what it means for
the general KVM code and its users if the irqchip is in the kernel. Two
things that should be common for every arch:
 - VCPU idle management is done inside the kernel
 - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly

The latter point implies that irqfd is available and that interrupt
routes from virtual IRQs (*) (like the one associated with an irqfd) to
the in-kernel IRQ controller have to be established. That's pretty generic.

> as part of trying to untangle "what irqchip model do you want"
> from other aspects of the interrupt handling model.
> 
> Other than this one, there are uses in cpus.c and hw/virtio-pci.c,
> but I haven't figured out yet exactly what those bits of code
> are trying to do...

See above for the two reasons.

Jan

(*) "GSI" actually means "physical or virtual IRQ line".
Peter Maydell - July 21, 2012, 8:54 a.m.
On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-07-20 21:14, Peter Maydell wrote:
>> I'm sure this isn't the only x86ism in the KVM generic source
>> files. However the thing I'm specifically trying to do is
>> nuke all the uses of kvm_irqchip_in_kernel() in common code,
>
> No, "irqchip in kernel" is supposed to be a generic concept. We will
> also have it on Power. Not sure what your plans are for ARM, maybe it
> will always be true there.

I agree that "irqchip in kernel?" is generic (though as you'll see
below there's disagreement about what that ought to mean or imply).
"irq0_override" though seems to me to be absolutely x86 specific.

> That said, maybe there is room for discussion about what it means for
> the general KVM code and its users if the irqchip is in the kernel. Two
> things that should be common for every arch:
>  - VCPU idle management is done inside the kernel

The trouble is that at the moment QEMU assumes that "is the
irqchip in kernel?" == "is VCPU idle management in kernel", for
instance. For ARM, VCPU idle management is in kernel whether
we're using the kernel's model of the VGIC or not. Alex tells
me PPC is the same way. It's only x86 that has tied these two
concepts together.

The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
is because I think they're all similar to this -- the common code is
using the check as a proxy for something else, and it should be directly
asking about that something else. The only bits of code that should
care about "is the irqchip in kernel?" are:
 * target-specific device/machine setup code which needs to know
   which apic/etc to instantiate
 * target-specific x86 code which has this weird synchronous IRQ
   delivery model for irqchip-not-in-kernel
(Obviously I might have missed something, I'm flailing around
trying to understand this code :-))

>  - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly
>
> The latter point implies that irqfd is available and that interrupt
> routes from virtual IRQs (*) (like the one associated with an irqfd) to
> the in-kernel IRQ controller have to be established. That's pretty generic.

But you can perfectly well have an in-kernel-irqchip that doesn't
support irqfd -- it just means that interrupts from devices have
to come in via the ioctls same as anything else. Some in-kernel
helpers obviously would depend on the existence and use of a full
featured in-kernel irqchip (on ARM you don't get the in kernel timer
unless you have in kernel VGIC), but I don't see why the virtio code
should be asking "is there an in kernel irqchip?" rather than "can
I do irqfd routing?" or whatever the question is it actually wants
to ask.  (In fact the virtio code probably needs to do something
more complex anyway: you could perfectly well have a system where
there is a full-featured irqchip in the kernel but the virtio
device is on the "wrong" side of a second interrupt controller
which is not in-kernel. So the actual question it needs to ask
is "does the interrupt wiring in this specific machine model mean
I can get and use an irqfd from where I am to the main CPU
interrupt controller?" or something similar.)

>> as part of trying to untangle "what irqchip model do you want"
>> from other aspects of the interrupt handling model.
>>
>> Other than this one, there are uses in cpus.c and hw/virtio-pci.c,
>> but I haven't figured out yet exactly what those bits of code
>> are trying to do...
>
> See above for the two reasons.

Thanks for the explanations.

-- PMM
Jan Kiszka - July 21, 2012, 9:14 a.m.
On 2012-07-21 10:54, Peter Maydell wrote:
> On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-20 21:14, Peter Maydell wrote:
>>> I'm sure this isn't the only x86ism in the KVM generic source
>>> files. However the thing I'm specifically trying to do is
>>> nuke all the uses of kvm_irqchip_in_kernel() in common code,
>>
>> No, "irqchip in kernel" is supposed to be a generic concept. We will
>> also have it on Power. Not sure what your plans are for ARM, maybe it
>> will always be true there.
> 
> I agree that "irqchip in kernel?" is generic (though as you'll see
> below there's disagreement about what that ought to mean or imply).
> "irq0_override" though seems to me to be absolutely x86 specific.

Naming is x86 specific, semantic not. It means that KVM doesn't prevent
remapping of IRQs. Granted, I really hope you don't make such mistakes
in your arch.

> 
>> That said, maybe there is room for discussion about what it means for
>> the general KVM code and its users if the irqchip is in the kernel. Two
>> things that should be common for every arch:
>>  - VCPU idle management is done inside the kernel
> 
> The trouble is that at the moment QEMU assumes that "is the
> irqchip in kernel?" == "is VCPU idle management in kernel", for
> instance. For ARM, VCPU idle management is in kernel whether
> we're using the kernel's model of the VGIC or not. Alex tells
> me PPC is the same way. It's only x86 that has tied these two
> concepts together.

Hmm, and why does Power work despite this mismatch?

If cpu_thread_is_idle doesn't work for you, define something like
kvm_idle_in_kernel() to replace kvm_irqchip_in_kernel here.

> 
> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
> is because I think they're all similar to this -- the common code is
> using the check as a proxy for something else, and it should be directly
> asking about that something else. The only bits of code that should
> care about "is the irqchip in kernel?" are:
>  * target-specific device/machine setup code which needs to know
>    which apic/etc to instantiate
>  * target-specific x86 code which has this weird synchronous IRQ
>    delivery model for irqchip-not-in-kernel
> (Obviously I might have missed something, I'm flailing around
> trying to understand this code :-))
> 
>>  - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly
>>
>> The latter point implies that irqfd is available and that interrupt
>> routes from virtual IRQs (*) (like the one associated with an irqfd) to
>> the in-kernel IRQ controller have to be established. That's pretty generic.
> 
> But you can perfectly well have an in-kernel-irqchip that doesn't
> support irqfd 

You could, thought this doesn't make much sense.

> -- it just means that interrupts from devices have
> to come in via the ioctls same as anything else. Some in-kernel
> helpers obviously would depend on the existence and use of a full
> featured in-kernel irqchip (on ARM you don't get the in kernel timer
> unless you have in kernel VGIC), but I don't see why the virtio code
> should be asking "is there an in kernel irqchip?" rather than "can
> I do irqfd routing?" or whatever the question is it actually wants
> to ask.  (In fact the virtio code probably needs to do something
> more complex anyway: you could perfectly well have a system where
> there is a full-featured irqchip in the kernel but the virtio
> device is on the "wrong" side of a second interrupt controller
> which is not in-kernel. So the actual question it needs to ask
> is "does the interrupt wiring in this specific machine model mean
> I can get and use an irqfd from where I am to the main CPU
> interrupt controller?" or something similar.)

Well, same here: then define more precise generic test functions.

Jan
Peter Maydell - July 21, 2012, 9:30 a.m.
On 21 July 2012 10:14, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-07-21 10:54, Peter Maydell wrote:
>> On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> On 2012-07-20 21:14, Peter Maydell wrote:
>>>> I'm sure this isn't the only x86ism in the KVM generic source
>>>> files. However the thing I'm specifically trying to do is
>>>> nuke all the uses of kvm_irqchip_in_kernel() in common code,
>>>
>>> No, "irqchip in kernel" is supposed to be a generic concept. We will
>>> also have it on Power. Not sure what your plans are for ARM, maybe it
>>> will always be true there.
>>
>> I agree that "irqchip in kernel?" is generic (though as you'll see
>> below there's disagreement about what that ought to mean or imply).
>> "irq0_override" though seems to me to be absolutely x86 specific.
>
> Naming is x86 specific, semantic not. It means that KVM doesn't prevent
> remapping of IRQs. Granted, I really hope you don't make such mistakes
> in your arch.

What does "remapping of IRQs" mean here? This is still sounding
not very generic to me, in that I really don't know what it would
mean in an ARM context. The fact that the only caller of this is
in hw/pc.c is also a big red flag that this isn't exactly generic.

>>> That said, maybe there is room for discussion about what it means for
>>> the general KVM code and its users if the irqchip is in the kernel. Two
>>> things that should be common for every arch:
>>>  - VCPU idle management is done inside the kernel
>>
>> The trouble is that at the moment QEMU assumes that "is the
>> irqchip in kernel?" == "is VCPU idle management in kernel", for
>> instance. For ARM, VCPU idle management is in kernel whether
>> we're using the kernel's model of the VGIC or not. Alex tells
>> me PPC is the same way. It's only x86 that has tied these two
>> concepts together.
>
> Hmm, and why does Power work despite this mismatch?

I think because hw/ppc.c:ppc_set_irq() both calls cpu_interrupt()
and also kvmppc_set_interrupt(), so we end up with a sort of odd
mix of both models... Alex?

> If cpu_thread_is_idle doesn't work for you, define something like
> kvm_idle_in_kernel() to replace kvm_irqchip_in_kernel here.

Yes, this is kind of where I'm headed. I thought I'd start with this
patch as the easiest one first, though...

>> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
>> is because I think they're all similar to this -- the common code is
>> using the check as a proxy for something else, and it should be directly
>> asking about that something else. The only bits of code that should
>> care about "is the irqchip in kernel?" are:
>>  * target-specific device/machine setup code which needs to know
>>    which apic/etc to instantiate
>>  * target-specific x86 code which has this weird synchronous IRQ
>>    delivery model for irqchip-not-in-kernel
>> (Obviously I might have missed something, I'm flailing around
>> trying to understand this code :-))
>>
>>>  - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly
>>>
>>> The latter point implies that irqfd is available and that interrupt
>>> routes from virtual IRQs (*) (like the one associated with an irqfd) to
>>> the in-kernel IRQ controller have to be established. That's pretty generic.
>>
>> But you can perfectly well have an in-kernel-irqchip that doesn't
>> support irqfd
>
> You could, thought this doesn't make much sense.

Why doesn't it make sense? On ARM, in-kernel-irqchip means you can take
advantage of the hardware support for a virtual GIC, and you can use
the virtual timer support too. These are both big performance advantages
even if QEMU never does anything with irqfds. (In fact the current
ARM KVM VGIC code doesn't support irqfds as far as I can see from
a quick scan of the kernel code.)

>> -- it just means that interrupts from devices have
>> to come in via the ioctls same as anything else. Some in-kernel
>> helpers obviously would depend on the existence and use of a full
>> featured in-kernel irqchip (on ARM you don't get the in kernel timer
>> unless you have in kernel VGIC), but I don't see why the virtio code
>> should be asking "is there an in kernel irqchip?" rather than "can
>> I do irqfd routing?" or whatever the question is it actually wants
>> to ask.  (In fact the virtio code probably needs to do something
>> more complex anyway: you could perfectly well have a system where
>> there is a full-featured irqchip in the kernel but the virtio
>> device is on the "wrong" side of a second interrupt controller
>> which is not in-kernel. So the actual question it needs to ask
>> is "does the interrupt wiring in this specific machine model mean
>> I can get and use an irqfd from where I am to the main CPU
>> interrupt controller?" or something similar.)
>
> Well, same here: then define more precise generic test functions.

First I have to understand what the current code is trying to do;
your explanations have been very helpful in that respect.

-- PMM
Jan Kiszka - July 21, 2012, 9:44 a.m.
On 2012-07-21 11:30, Peter Maydell wrote:
> On 21 July 2012 10:14, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-21 10:54, Peter Maydell wrote:
>>> On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> On 2012-07-20 21:14, Peter Maydell wrote:
>>>>> I'm sure this isn't the only x86ism in the KVM generic source
>>>>> files. However the thing I'm specifically trying to do is
>>>>> nuke all the uses of kvm_irqchip_in_kernel() in common code,
>>>>
>>>> No, "irqchip in kernel" is supposed to be a generic concept. We will
>>>> also have it on Power. Not sure what your plans are for ARM, maybe it
>>>> will always be true there.
>>>
>>> I agree that "irqchip in kernel?" is generic (though as you'll see
>>> below there's disagreement about what that ought to mean or imply).
>>> "irq0_override" though seems to me to be absolutely x86 specific.
>>
>> Naming is x86 specific, semantic not. It means that KVM doesn't prevent
>> remapping of IRQs. Granted, I really hope you don't make such mistakes
>> in your arch.
> 
> What does "remapping of IRQs" mean here? This is still sounding

It means that the QEMU model of the board can define interrupt routes in
an unconfined way, which is obviously always true when the irqchips are
all in userspace but not necessarily when KVM support is in the loop.

> not very generic to me, in that I really don't know what it would
> mean in an ARM context. The fact that the only caller of this is
> in hw/pc.c is also a big red flag that this isn't exactly generic.

x86 is also still the only arch with full in-kernel irqchip support. And
even if there is only one arch using it, that doesn't mean the test
needs to be moved around - if the test itself is generic, just always
true for other archs.

> 
>>>> That said, maybe there is room for discussion about what it means for
>>>> the general KVM code and its users if the irqchip is in the kernel. Two
>>>> things that should be common for every arch:
>>>>  - VCPU idle management is done inside the kernel
>>>
>>> The trouble is that at the moment QEMU assumes that "is the
>>> irqchip in kernel?" == "is VCPU idle management in kernel", for
>>> instance. For ARM, VCPU idle management is in kernel whether
>>> we're using the kernel's model of the VGIC or not. Alex tells
>>> me PPC is the same way. It's only x86 that has tied these two
>>> concepts together.
>>
>> Hmm, and why does Power work despite this mismatch?
> 
> I think because hw/ppc.c:ppc_set_irq() both calls cpu_interrupt()
> and also kvmppc_set_interrupt(), so we end up with a sort of odd
> mix of both models... Alex?
> 
>> If cpu_thread_is_idle doesn't work for you, define something like
>> kvm_idle_in_kernel() to replace kvm_irqchip_in_kernel here.
> 
> Yes, this is kind of where I'm headed. I thought I'd start with this
> patch as the easiest one first, though...

Before moving anything, let's refine/break up the semantics of
kvm_irqchip_in_kernel first.

> 
>>> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
>>> is because I think they're all similar to this -- the common code is
>>> using the check as a proxy for something else, and it should be directly
>>> asking about that something else. The only bits of code that should
>>> care about "is the irqchip in kernel?" are:
>>>  * target-specific device/machine setup code which needs to know
>>>    which apic/etc to instantiate
>>>  * target-specific x86 code which has this weird synchronous IRQ
>>>    delivery model for irqchip-not-in-kernel
>>> (Obviously I might have missed something, I'm flailing around
>>> trying to understand this code :-))
>>>
>>>>  - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly
>>>>
>>>> The latter point implies that irqfd is available and that interrupt
>>>> routes from virtual IRQs (*) (like the one associated with an irqfd) to
>>>> the in-kernel IRQ controller have to be established. That's pretty generic.
>>>
>>> But you can perfectly well have an in-kernel-irqchip that doesn't
>>> support irqfd
>>
>> You could, thought this doesn't make much sense.
> 
> Why doesn't it make sense? On ARM, in-kernel-irqchip means you can take
> advantage of the hardware support for a virtual GIC, and you can use
> the virtual timer support too. These are both big performance advantages
> even if QEMU never does anything with irqfds. (In fact the current
> ARM KVM VGIC code doesn't support irqfds as far as I can see from
> a quick scan of the kernel code.)

It doesn't make sense as it means your in-kernel irqchip model is
semi-finished. If you didn't consider how to support direct in-kernel
IRQ injections, you risk designing something that requires userspace
quirk handling later on when extending it to full-featured in-kernel
irqchip support. Of course, you are free to do this stepwise, but your
digging through QEMU /wrt x86-specifics in the KVM layer should warn you
about the risks. ;)

Jan
Peter Maydell - July 21, 2012, 9:56 a.m.
On 21 July 2012 10:44, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-07-21 11:30, Peter Maydell wrote:
>> On 21 July 2012 10:14, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> On 2012-07-21 10:54, Peter Maydell wrote:
>>>> On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Naming is x86 specific, semantic not. It means that KVM doesn't prevent
>>> remapping of IRQs. Granted, I really hope you don't make such mistakes
>>> in your arch.
>>
>> What does "remapping of IRQs" mean here?
>
> It means that the QEMU model of the board can define interrupt routes in
> an unconfined way, which is obviously always true when the irqchips are
> all in userspace but not necessarily when KVM support is in the loop.

Er, surely it's only false if your KVM irqchip is obviously broken?
I mean, any sane in-kernel-irqchip model needs to support the same set
of incoming irq lines as the out-of-kernel irqchip model it's replacing.
There's no difference in flexibility there.

Or are you trying to talk about defining interrupt routes when the
source and destination are both kernel code but the route needs to
be set by userspace (ie is machine specific not cpu specific)?
Whether that's possible sounds to me like it would depend on all
the board model code between the source and destination rather
than being a single global boolean check.

>> not very generic to me, in that I really don't know what it would
>> mean in an ARM context. The fact that the only caller of this is
>> in hw/pc.c is also a big red flag that this isn't exactly generic.
>
> x86 is also still the only arch with full in-kernel irqchip support. And
> even if there is only one arch using it, that doesn't mean the test
> needs to be moved around - if the test itself is generic, just always
> true for other archs.

I just don't see why this check is anything other than "do I have
a broken x86 kernel irqchip?" In particular it doesn't have any
documented semantics defined in generic terms that you could usefully
use in general QEMU code to say "do I need to call this function
and what should I be doing based on the result?"

>>>>>  - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly
>>>>>
>>>>> The latter point implies that irqfd is available and that interrupt
>>>>> routes from virtual IRQs (*) (like the one associated with an irqfd) to
>>>>> the in-kernel IRQ controller have to be established. That's pretty generic.
>>>>
>>>> But you can perfectly well have an in-kernel-irqchip that doesn't
>>>> support irqfd
>>>
>>> You could, thought this doesn't make much sense.
>>
>> Why doesn't it make sense? On ARM, in-kernel-irqchip means you can take
>> advantage of the hardware support for a virtual GIC, and you can use
>> the virtual timer support too. These are both big performance advantages
>> even if QEMU never does anything with irqfds. (In fact the current
>> ARM KVM VGIC code doesn't support irqfds as far as I can see from
>> a quick scan of the kernel code.)
>
> It doesn't make sense as it means your in-kernel irqchip model is
> semi-finished. If you didn't consider how to support direct in-kernel
> IRQ injections, you risk designing something that requires userspace
> quirk handling later on when extending it to full-featured in-kernel
> irqchip support.

Well, the in-kernel virtual timer already does direct in-kernel IRQ
injection to the VGIC: it calls a function to say "inject IRQ X"...
(this works because the interrupt line used is fixed by the CPU,
it's not a board model property so there is no need for the routing
to be defined by userspace. (ie both ends of this irq injection are
in the CPU proper.))

As far as I can tell you seem to be saying "irqfds are an extra
feature you might want later", which is exactly "you can have
an irqchip without them".

(Is there some summary of the design of the irqfds stuff somewhere I
can go and read up?)

thanks
-- PMM
Jan Kiszka - July 21, 2012, 10:22 a.m.
On 2012-07-21 11:56, Peter Maydell wrote:
> On 21 July 2012 10:44, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-21 11:30, Peter Maydell wrote:
>>> On 21 July 2012 10:14, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> On 2012-07-21 10:54, Peter Maydell wrote:
>>>>> On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> Naming is x86 specific, semantic not. It means that KVM doesn't prevent
>>>> remapping of IRQs. Granted, I really hope you don't make such mistakes
>>>> in your arch.
>>>
>>> What does "remapping of IRQs" mean here?
>>
>> It means that the QEMU model of the board can define interrupt routes in
>> an unconfined way, which is obviously always true when the irqchips are
>> all in userspace but not necessarily when KVM support is in the loop.
> 
> Er, surely it's only false if your KVM irqchip is obviously broken?
> I mean, any sane in-kernel-irqchip model needs to support the same set
> of incoming irq lines as the out-of-kernel irqchip model it's replacing.
> There's no difference in flexibility there.
> 
> Or are you trying to talk about defining interrupt routes when the
> source and destination are both kernel code but the route needs to
> be set by userspace (ie is machine specific not cpu specific)?

It describes this requirement primarily.

> Whether that's possible sounds to me like it would depend on all
> the board model code between the source and destination rather
> than being a single global boolean check.

It depends on the feature set of the in-kernel irqchips and if this can
possibly vary on real hw.

> 
>>> not very generic to me, in that I really don't know what it would
>>> mean in an ARM context. The fact that the only caller of this is
>>> in hw/pc.c is also a big red flag that this isn't exactly generic.
>>
>> x86 is also still the only arch with full in-kernel irqchip support. And
>> even if there is only one arch using it, that doesn't mean the test
>> needs to be moved around - if the test itself is generic, just always
>> true for other archs.
> 
> I just don't see why this check is anything other than "do I have
> a broken x86 kernel irqchip?" In particular it doesn't have any
> documented semantics defined in generic terms that you could usefully
> use in general QEMU code to say "do I need to call this function
> and what should I be doing based on the result?"

Let's look at this again when we redefined the KVM test functions. Maybe
then the infrastructure is available to move it cleanly. Maybe it will
continue to not matter and we can leave it alone.

> 
>>>>>>  - in-kernel KVM helpers like vhost or VFIO can inject IRQs directly
>>>>>>
>>>>>> The latter point implies that irqfd is available and that interrupt
>>>>>> routes from virtual IRQs (*) (like the one associated with an irqfd) to
>>>>>> the in-kernel IRQ controller have to be established. That's pretty generic.
>>>>>
>>>>> But you can perfectly well have an in-kernel-irqchip that doesn't
>>>>> support irqfd
>>>>
>>>> You could, thought this doesn't make much sense.
>>>
>>> Why doesn't it make sense? On ARM, in-kernel-irqchip means you can take
>>> advantage of the hardware support for a virtual GIC, and you can use
>>> the virtual timer support too. These are both big performance advantages
>>> even if QEMU never does anything with irqfds. (In fact the current
>>> ARM KVM VGIC code doesn't support irqfds as far as I can see from
>>> a quick scan of the kernel code.)
>>
>> It doesn't make sense as it means your in-kernel irqchip model is
>> semi-finished. If you didn't consider how to support direct in-kernel
>> IRQ injections, you risk designing something that requires userspace
>> quirk handling later on when extending it to full-featured in-kernel
>> irqchip support.
> 
> Well, the in-kernel virtual timer already does direct in-kernel IRQ
> injection to the VGIC: it calls a function to say "inject IRQ X"...
> (this works because the interrupt line used is fixed by the CPU,
> it's not a board model property so there is no need for the routing
> to be defined by userspace. (ie both ends of this irq injection are
> in the CPU proper.))

Could you inject IRQs from an in-kernel helper that (partially or fully)
emulates some device sitting on peripheral buses as well (like PCI)? If
not, you aren't done with the in-kernel irqchip model yet.

> 
> As far as I can tell you seem to be saying "irqfds are an extra
> feature you might want later", which is exactly "you can have
> an irqchip without them".

Once the prerequisites for peripheral interrupt injections are there,
enabling irqfd for your arch should be straightforward. I'm picking on
those prerequisites here, not irqfd.

> 
> (Is there some summary of the design of the irqfds stuff somewhere I
> can go and read up?)

linux/Documentation/virtual/kvm/api.txt is a good start, though not
really a high-level summary.

Jan
Peter Maydell - July 21, 2012, 10:53 a.m.
On 21 July 2012 11:22, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-07-21 11:56, Peter Maydell wrote:
>> Or are you trying to talk about defining interrupt routes when the
>> source and destination are both kernel code but the route needs to
>> be set by userspace (ie is machine specific not cpu specific)?
>
> It describes this requirement primarily.
>
>> Whether that's possible sounds to me like it would depend on all
>> the board model code between the source and destination rather
>> than being a single global boolean check.
>
> It depends on the feature set of the in-kernel irqchips and if this can
> possibly vary on real hw.

If the interrupt route is on-CPU then its routing is fixed (for
that CPU), and you don't need to care about irqfds because the
kernel knows what CPU it's providing to the guest, has both ends
of the connection and can just do the right thing however is most
convenient for the internal implementation. If the interrupt route
is off-CPU then all bets are off because the routing is machine
specific and could go through any kind of logic between the peripheral
and the CPU's irqchip.

I don't see how you can do this with QEMU's current IRQ infrastructure,
which basically just hardwires everything with qemu_irq lines and
doesn't provide any way to query the routing and logic from an
irq source to its destination.

>>>>>> But you can perfectly well have an in-kernel-irqchip that doesn't
>>>>>> support irqfd
>>>>>
>>>>> You could, thought this doesn't make much sense.
>>>>
>>>> Why doesn't it make sense? On ARM, in-kernel-irqchip means you can take
>>>> advantage of the hardware support for a virtual GIC, and you can use
>>>> the virtual timer support too. These are both big performance advantages
>>>> even if QEMU never does anything with irqfds. (In fact the current
>>>> ARM KVM VGIC code doesn't support irqfds as far as I can see from
>>>> a quick scan of the kernel code.)
>>>
>>> It doesn't make sense as it means your in-kernel irqchip model is
>>> semi-finished. If you didn't consider how to support direct in-kernel
>>> IRQ injections, you risk designing something that requires userspace
>>> quirk handling later on when extending it to full-featured in-kernel
>>> irqchip support.
>>
>> Well, the in-kernel virtual timer already does direct in-kernel IRQ
>> injection to the VGIC: it calls a function to say "inject IRQ X"...
>> (this works because the interrupt line used is fixed by the CPU,
>> it's not a board model property so there is no need for the routing
>> to be defined by userspace. (ie both ends of this irq injection are
>> in the CPU proper.))
>
> Could you inject IRQs from an in-kernel helper that (partially or fully)
> emulates some device sitting on peripheral buses as well (like PCI)? If
> not, you aren't done with the in-kernel irqchip model yet.

This is still sounding like "there is an extra feature which you should
probably implement at some point and should certainly design with the
intention of supporting", not "you cannot have an irqchip without irqfds".

Therefore QEMU code which cares about irqfds should be doing
checks for irqfd functionality, not "is there an in kernel
irqchip".

>> As far as I can tell you seem to be saying "irqfds are an extra
>> feature you might want later", which is exactly "you can have
>> an irqchip without them".
>
> Once the prerequisites for peripheral interrupt injections are there,
> enabling irqfd for your arch should be straightforward. I'm picking on
> those prerequisites here, not irqfd.
>
>>
>> (Is there some summary of the design of the irqfds stuff somewhere I
>> can go and read up?)
>
> linux/Documentation/virtual/kvm/api.txt is a good start, though not
> really a high-level summary.

I looked for 'irqfd' in that and found a straightforward ioctl for
"wire this eventfd up to this irqchip input". That doesn't say anything
about remapping of IRQs, and it's not clear to me either why a
straightforward "use an ioctl to deliver incoming interrupts" design
would be broken by adding that later: it's just a different source
for the interrupt...

-- PMM
Jan Kiszka - July 21, 2012, 11:08 a.m.
On 2012-07-21 12:53, Peter Maydell wrote:
> On 21 July 2012 11:22, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-21 11:56, Peter Maydell wrote:
>>> Or are you trying to talk about defining interrupt routes when the
>>> source and destination are both kernel code but the route needs to
>>> be set by userspace (ie is machine specific not cpu specific)?
>>
>> It describes this requirement primarily.
>>
>>> Whether that's possible sounds to me like it would depend on all
>>> the board model code between the source and destination rather
>>> than being a single global boolean check.
>>
>> It depends on the feature set of the in-kernel irqchips and if this can
>> possibly vary on real hw.
> 
> If the interrupt route is on-CPU then its routing is fixed (for
> that CPU), and you don't need to care about irqfds because the
> kernel knows what CPU it's providing to the guest, has both ends
> of the connection and can just do the right thing however is most
> convenient for the internal implementation. If the interrupt route
> is off-CPU then all bets are off because the routing is machine
> specific and could go through any kind of logic between the peripheral
> and the CPU's irqchip.
> 
> I don't see how you can do this with QEMU's current IRQ infrastructure,
> which basically just hardwires everything with qemu_irq lines and
> doesn't provide any way to query the routing and logic from an
> irq source to its destination.

Routing from arbitrary sources to the in-kernel sink, skipping
intermediate steps in the hotpath is in fact an unsolved issue in QEMU.
We are just introducing band-aid for PCI on x86 (to introduce PCI device
assignment). Long-term, this requires a generic solution which allows
path discovery etc. But this is a userspace problem, nothing related to
the KVM kernel features.

> 
>>>>>>> But you can perfectly well have an in-kernel-irqchip that doesn't
>>>>>>> support irqfd
>>>>>>
>>>>>> You could, thought this doesn't make much sense.
>>>>>
>>>>> Why doesn't it make sense? On ARM, in-kernel-irqchip means you can take
>>>>> advantage of the hardware support for a virtual GIC, and you can use
>>>>> the virtual timer support too. These are both big performance advantages
>>>>> even if QEMU never does anything with irqfds. (In fact the current
>>>>> ARM KVM VGIC code doesn't support irqfds as far as I can see from
>>>>> a quick scan of the kernel code.)
>>>>
>>>> It doesn't make sense as it means your in-kernel irqchip model is
>>>> semi-finished. If you didn't consider how to support direct in-kernel
>>>> IRQ injections, you risk designing something that requires userspace
>>>> quirk handling later on when extending it to full-featured in-kernel
>>>> irqchip support.
>>>
>>> Well, the in-kernel virtual timer already does direct in-kernel IRQ
>>> injection to the VGIC: it calls a function to say "inject IRQ X"...
>>> (this works because the interrupt line used is fixed by the CPU,
>>> it's not a board model property so there is no need for the routing
>>> to be defined by userspace. (ie both ends of this irq injection are
>>> in the CPU proper.))
>>
>> Could you inject IRQs from an in-kernel helper that (partially or fully)
>> emulates some device sitting on peripheral buses as well (like PCI)? If
>> not, you aren't done with the in-kernel irqchip model yet.
> 
> This is still sounding like "there is an extra feature which you should
> probably implement at some point and should certainly design with the
> intention of supporting", not "you cannot have an irqchip without irqfds".
> 
> Therefore QEMU code which cares about irqfds should be doing
> checks for irqfd functionality, not "is there an in kernel
> irqchip".

Defining some kvm_irqfd_available() is one thing. Ignoring irqfd "for
now" while introducing in-kernel irqchip is another, less wise decision.

>>> As far as I can tell you seem to be saying "irqfds are an extra
>>> feature you might want later", which is exactly "you can have
>>> an irqchip without them".
>>
>> Once the prerequisites for peripheral interrupt injections are there,
>> enabling irqfd for your arch should be straightforward. I'm picking on
>> those prerequisites here, not irqfd.
>>
>>>
>>> (Is there some summary of the design of the irqfds stuff somewhere I
>>> can go and read up?)
>>
>> linux/Documentation/virtual/kvm/api.txt is a good start, though not
>> really a high-level summary.
> 
> I looked for 'irqfd' in that and found a straightforward ioctl for
> "wire this eventfd up to this irqchip input". That doesn't say anything
> about remapping of IRQs, and it's not clear to me either why a
> straightforward "use an ioctl to deliver incoming interrupts" design
> would be broken by adding that later: it's just a different source
> for the interrupt...

Once you support the backend (KVM_SET_GSI_ROUTING + KVM_IRQ_LINE),
adding irqfd is in fact simple.

Jan
Peter Maydell - July 21, 2012, 12:17 p.m.
On 21 July 2012 12:08, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-07-21 12:53, Peter Maydell wrote:
>> This is still sounding like "there is an extra feature which you should
>> probably implement at some point and should certainly design with the
>> intention of supporting", not "you cannot have an irqchip without irqfds".
>>
>> Therefore QEMU code which cares about irqfds should be doing
>> checks for irqfd functionality, not "is there an in kernel
>> irqchip".
>
> Defining some kvm_irqfd_available() is one thing. Ignoring irqfd "for
> now" while introducing in-kernel irqchip is another, less wise decision.

You still haven't really explained why we can't just ignore irqfd
for now. I don't see how it would particularly affect the design
of the kernel implementation very much, and the userspace interface
seems to just be an extra ioctl.

> Once you support the backend (KVM_SET_GSI_ROUTING + KVM_IRQ_LINE),
> adding irqfd is in fact simple.

I don't really understand where KVM_SET_GSI_ROUTING comes into
this -- the documentation is a bit opaque. It says "Sets the GSI
routing table entries" but it doesn't define what a GSI is or
what we're routing to where. Googling suggests GSI is an APIC
specific term so it doesn't sound like it's relevant for non-x86.

I'm not sure what KVM_IRQ_LINE has to do with it either -- this
is just the standard interrupt injection ioctl. KVM ARM supports
this whether there's an in-kernel irqchip or not.

-- PMM
Jan Kiszka - July 21, 2012, 12:35 p.m.
On 2012-07-21 14:17, Peter Maydell wrote:
> On 21 July 2012 12:08, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-21 12:53, Peter Maydell wrote:
>>> This is still sounding like "there is an extra feature which you should
>>> probably implement at some point and should certainly design with the
>>> intention of supporting", not "you cannot have an irqchip without irqfds".
>>>
>>> Therefore QEMU code which cares about irqfds should be doing
>>> checks for irqfd functionality, not "is there an in kernel
>>> irqchip".
>>
>> Defining some kvm_irqfd_available() is one thing. Ignoring irqfd "for
>> now" while introducing in-kernel irqchip is another, less wise decision.
> 
> You still haven't really explained why we can't just ignore irqfd
> for now. I don't see how it would particularly affect the design
> of the kernel implementation very much, and the userspace interface
> seems to just be an extra ioctl.

I bet you ignored MSI so far. Physical IRQ lines are just a part of the
whole picture. How are MSIs delivered on the systems you want to add?

> 
>> Once you support the backend (KVM_SET_GSI_ROUTING + KVM_IRQ_LINE),
>> adding irqfd is in fact simple.
> 
> I don't really understand where KVM_SET_GSI_ROUTING comes into
> this -- the documentation is a bit opaque. It says "Sets the GSI
> routing table entries" but it doesn't define what a GSI is or
> what we're routing to where. Googling suggests GSI is an APIC
> specific term so it doesn't sound like it's relevant for non-x86.

As I said before: "GSI" needs to be read as "physical or virtual IRQ
line". The virtual ones can be of any source you define, irqfd is just one.

> 
> I'm not sure what KVM_IRQ_LINE has to do with it either -- this
> is just the standard interrupt injection ioctl. KVM ARM supports
> this whether there's an in-kernel irqchip or not.

KVM_IRQ_LINE injects according to the routing defined via
KVM_SET_GSI_ROUTING, at least on x86 (and ia64). If you define your own
KVM_IRQ_LINE semantics, you may have problems later on when you need to
redefine it after adding IRQ routing support.

Even if you don't want to add full irqchip support now, plan for it.
Define the so far used interfaces in a way that they will work
seamlessly when adding the missing bits and pieces. However, I still
think it's better to skip the intermediate steps in order to avoid
planning mistakes.

Jan
Peter Maydell - July 21, 2012, 12:57 p.m.
On 21 July 2012 13:35, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-07-21 14:17, Peter Maydell wrote:
>> You still haven't really explained why we can't just ignore irqfd
>> for now. I don't see how it would particularly affect the design
>> of the kernel implementation very much, and the userspace interface
>> seems to just be an extra ioctl.
>
> I bet you ignored MSI so far. Physical IRQ lines are just a part of the
> whole picture. How are MSIs delivered on the systems you want to add?

You're using random acronyms without defining them again. It looks
as if MSI is a PCI specific term. That would seem to me to fall
under the heading of "routing across a board model" which we can't
do anyway, because you have no idea how this all wired up, it
will depend on the details of the SoC and the PCI controller.
(As it happens the initial board model doesn't have PCI support;
most ARM boards don't.) I'm not entirely sure we want to have
"in kernel random SoC-specific PCI controller"...

[Point taken that thought is required here, though.]

>>> Once you support the backend (KVM_SET_GSI_ROUTING + KVM_IRQ_LINE),
>>> adding irqfd is in fact simple.
>>
>> I don't really understand where KVM_SET_GSI_ROUTING comes into
>> this -- the documentation is a bit opaque. It says "Sets the GSI
>> routing table entries" but it doesn't define what a GSI is or
>> what we're routing to where. Googling suggests GSI is an APIC
>> specific term so it doesn't sound like it's relevant for non-x86.
>
> As I said before: "GSI" needs to be read as "physical or virtual IRQ
> line". The virtual ones can be of any source you define, irqfd is just one.

What's a virtual irq line in this context? We're modelling a physical
bit of hardware which has N interrupt lines, so I'm not sure what
a virtual irq line would be or how it would appear to the guest...

-- PMM
Jan Kiszka - July 21, 2012, 1:16 p.m.
On 2012-07-21 14:57, Peter Maydell wrote:
> On 21 July 2012 13:35, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-21 14:17, Peter Maydell wrote:
>>> You still haven't really explained why we can't just ignore irqfd
>>> for now. I don't see how it would particularly affect the design
>>> of the kernel implementation very much, and the userspace interface
>>> seems to just be an extra ioctl.
>>
>> I bet you ignored MSI so far. Physical IRQ lines are just a part of the
>> whole picture. How are MSIs delivered on the systems you want to add?
> 
> You're using random acronyms without defining them again. It looks
> as if MSI is a PCI specific term. That would seem to me to fall
> under the heading of "routing across a board model" which we can't
> do anyway, because you have no idea how this all wired up, it
> will depend on the details of the SoC and the PCI controller.

For sure you can. You need to discover those wiring, cache it, and then
let the source inject to the final destination directly. See the INTx
routing notifier and pci_device_route_intx_to_irq from [1] for that
simplistic approach we are taking on the x86/PC architecture.

> (As it happens the initial board model doesn't have PCI support;
> most ARM boards don't.) I'm not entirely sure we want to have
> "in kernel random SoC-specific PCI controller"...

Isn't ARM going after server scenarios as well? Will be hard without
some PCI support. The good news is that you likely won't need a full
in-kernel PCI model for this (we don't do so on x86 as well).

> 
> [Point taken that thought is required here, though.]
> 
>>>> Once you support the backend (KVM_SET_GSI_ROUTING + KVM_IRQ_LINE),
>>>> adding irqfd is in fact simple.
>>>
>>> I don't really understand where KVM_SET_GSI_ROUTING comes into
>>> this -- the documentation is a bit opaque. It says "Sets the GSI
>>> routing table entries" but it doesn't define what a GSI is or
>>> what we're routing to where. Googling suggests GSI is an APIC
>>> specific term so it doesn't sound like it's relevant for non-x86.
>>
>> As I said before: "GSI" needs to be read as "physical or virtual IRQ
>> line". The virtual ones can be of any source you define, irqfd is just one.
> 
> What's a virtual irq line in this context? We're modelling a physical
> bit of hardware which has N interrupt lines, so I'm not sure what
> a virtual irq line would be or how it would appear to the guest...

A virtual line is an input of the in-kernel IRQ router you configure via
SET_GSI_ROUTING. A physical line is a potential output of it that goes
into some in-kernel interrupt controller model. It can also be an
interrupt message sent to a specific CPU - provided the arch supports
such a delivery protocol.

Of course, the current router was modeled after x86 and ia64. So I
wouldn't be surprised if some ARM system configuration cannot be
expressed this way. Then we need to discuss reasonable extensions. But
it should provide a sound foundation at least.

Jan

[1] http://permalink.gmane.org/gmane.comp.emulators.qemu/160792
Cornelia Huck - July 23, 2012, 12:04 p.m.
On Sat, 21 Jul 2012 15:16:56 +0200
Jan Kiszka <jan.kiszka@web.de> wrote:

> On 2012-07-21 14:57, Peter Maydell wrote:
> > On 21 July 2012 13:35, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> On 2012-07-21 14:17, Peter Maydell wrote:
> >>> You still haven't really explained why we can't just ignore irqfd
> >>> for now. I don't see how it would particularly affect the design
> >>> of the kernel implementation very much, and the userspace interface
> >>> seems to just be an extra ioctl.
> >>
> >> I bet you ignored MSI so far. Physical IRQ lines are just a part of the
> >> whole picture. How are MSIs delivered on the systems you want to add?
> > 
> > You're using random acronyms without defining them again. It looks
> > as if MSI is a PCI specific term. That would seem to me to fall
> > under the heading of "routing across a board model" which we can't
> > do anyway, because you have no idea how this all wired up, it
> > will depend on the details of the SoC and the PCI controller.
> 
> For sure you can. You need to discover those wiring, cache it, and then
> let the source inject to the final destination directly. See the INTx
> routing notifier and pci_device_route_intx_to_irq from [1] for that
> simplistic approach we are taking on the x86/PC architecture.

> >>>> Once you support the backend (KVM_SET_GSI_ROUTING + KVM_IRQ_LINE),
> >>>> adding irqfd is in fact simple.
> >>>
> >>> I don't really understand where KVM_SET_GSI_ROUTING comes into
> >>> this -- the documentation is a bit opaque. It says "Sets the GSI
> >>> routing table entries" but it doesn't define what a GSI is or
> >>> what we're routing to where. Googling suggests GSI is an APIC
> >>> specific term so it doesn't sound like it's relevant for non-x86.
> >>
> >> As I said before: "GSI" needs to be read as "physical or virtual IRQ
> >> line". The virtual ones can be of any source you define, irqfd is just one.
> > 
> > What's a virtual irq line in this context? We're modelling a physical
> > bit of hardware which has N interrupt lines, so I'm not sure what
> > a virtual irq line would be or how it would appear to the guest...
> 
> A virtual line is an input of the in-kernel IRQ router you configure via
> SET_GSI_ROUTING. A physical line is a potential output of it that goes
> into some in-kernel interrupt controller model. It can also be an
> interrupt message sent to a specific CPU - provided the arch supports
> such a delivery protocol.
> 
> Of course, the current router was modeled after x86 and ia64. So I
> wouldn't be surprised if some ARM system configuration cannot be
> expressed this way. Then we need to discuss reasonable extensions. But
> it should provide a sound foundation at least.

OK, so I was reading through this thread since I want to add irqfd
support for s390, but we don't have any kind of "irqchip".

The understanding I got so far is that !s390 architectures have some
kind of mechanism that allows them to "route" an interrupt between a
device and a cpu, meaning that there's a fixed tie-in between a device
and a cpu. If that's correct, I don't see how to model irqfds via
this irqchip infrastructure for s390.

Here's in a nutshell how (external and I/O) interrupts work on s390:

- Interrupts have an internal prioritation, that means different types
of interrupts (external, I/O, machine check, ...) take precedent over
other types

- External and I/O interrupts are "floating", i. e. they are not tied
to a specific cpu, but can be delivered to any cpu that has external
resp. I/O interrupts enabled

- Interrupts take payload with them that defines which device they are
for

So, for example, if a specific subchannel (=device) has pending status
and an I/O interrupt is to be generated, this interrupt remains pending
until an arbitrary cpu is enabled for I/O interrupts. If several cpus
are enabled for I/O interrupts, any of them may be interrupted. When an
I/O interrupt is delivered on a cpu, the cpu's lowcore contains the
interrupt payload which defines the subchannel (=device) the interrupt
is for.

Any idea on how this architecture can be married with the irqchip
concept is welcome. If all else fails, would a special irqfd concept
for !irqchip be acceptable?

Cornelia
Avi Kivity - July 23, 2012, 12:18 p.m.
On 07/23/2012 03:04 PM, Cornelia Huck wrote:
> 
> OK, so I was reading through this thread since I want to add irqfd
> support for s390, but we don't have any kind of "irqchip".
> 
> The understanding I got so far is that !s390 architectures have some
> kind of mechanism that allows them to "route" an interrupt between a
> device and a cpu, meaning that there's a fixed tie-in between a device
> and a cpu. 

It's not fixed, but it changes rarely.  Various interrupt routers can be
programmed to change the route, but guests do so only rarely.

> If that's correct, I don't see how to model irqfds via
> this irqchip infrastructure for s390.
> 
> Here's in a nutshell how (external and I/O) interrupts work on s390:
> 
> - Interrupts have an internal prioritation, that means different types
> of interrupts (external, I/O, machine check, ...) take precedent over
> other types
> 
> - External and I/O interrupts are "floating", i. e. they are not tied
> to a specific cpu, but can be delivered to any cpu that has external
> resp. I/O interrupts enabled
> 
> - Interrupts take payload with them that defines which device they are
> for
> 
> So, for example, if a specific subchannel (=device) has pending status
> and an I/O interrupt is to be generated, this interrupt remains pending
> until an arbitrary cpu is enabled for I/O interrupts. If several cpus
> are enabled for I/O interrupts, any of them may be interrupted.

This may be costly to emulate.  On x86 we do not have access to a
guest's interrupt status while it is running.  Is this not the case for
s390?

Oh, let me guess.  You write some interrupt descriptor in memory
somewhere, issue one of your famous instructions, and the hardware finds
a guest vcpu and injects the interrupt.

x86 has a "least priority" mode which is similar to what you're
describing, but I don't think we emulate it correctly.

> When an
> I/O interrupt is delivered on a cpu, the cpu's lowcore contains the
> interrupt payload which defines the subchannel (=device) the interrupt
> is for.
> 
> Any idea on how this architecture can be married with the irqchip
> concept is welcome. If all else fails, would a special irqfd concept
> for !irqchip be acceptable?

I don't see an issue.  You need an arch-specific irqfd configuration
ioctl (or your own data field in the existing ioctl) that defines the
payload.  Then the kernel installs a poll function on that eventfd that,
when called, does the magic sequence needed to get the interrupt there.
 While you don't have an irqchip, you do have asynchronous interrupt
injection, yes?  That's what irqchip really is all about.
Peter Maydell - July 23, 2012, 12:25 p.m.
On 23 July 2012 13:18, Avi Kivity <avi@redhat.com> wrote:
>  While you don't have an irqchip, you do have asynchronous interrupt
> injection, yes?  That's what irqchip really is all about.

This seems an odd point of view -- async interrupt injection
doesn't have anything to do with whether we're modelling
the irqchip in the kernel or in QEMU, I thought...

-- PMM
Avi Kivity - July 23, 2012, 12:26 p.m.
On 07/21/2012 11:54 AM, Peter Maydell wrote:
> On 21 July 2012 07:57, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-07-20 21:14, Peter Maydell wrote:
>>> I'm sure this isn't the only x86ism in the KVM generic source
>>> files. However the thing I'm specifically trying to do is
>>> nuke all the uses of kvm_irqchip_in_kernel() in common code,
>>
>> No, "irqchip in kernel" is supposed to be a generic concept. We will
>> also have it on Power. Not sure what your plans are for ARM, maybe it
>> will always be true there.
> 
> I agree that "irqchip in kernel?" is generic (though as you'll see
> below there's disagreement about what that ought to mean or imply).
> "irq0_override" though seems to me to be absolutely x86 specific.
> 
>> That said, maybe there is room for discussion about what it means for
>> the general KVM code and its users if the irqchip is in the kernel. Two
>> things that should be common for every arch:
>>  - VCPU idle management is done inside the kernel
> 
> The trouble is that at the moment QEMU assumes that "is the
> irqchip in kernel?" == "is VCPU idle management in kernel", for
> instance. For ARM, VCPU idle management is in kernel whether
> we're using the kernel's model of the VGIC or not. Alex tells
> me PPC is the same way. It's only x86 that has tied these two
> concepts together.

Really, "irqchip in kernel" means asynchronous interrupts - you can
inject an interrupt from outside the vcpu thread.  Obviously if the vcpu
is sleeping you need to wake it up and that pulls in idle management.

"irqchip" for x86 really means the local APIC, which has a synchronous
interface with the cpu core.  "local APIC in kernel" really is
equivalent to "kernel idle management", "KVM_IRQ_LINE", and irqfd.  The
ioapic and pit, on the other hand, don't contribute anything to this
(just performance).

So yes, ARM with and without GIC are irqchip_in_kernel, since the
ARM<->GIC interface is asynchronous.  Whether to emulate the GIC or not
is just a performance question.

> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
> is because I think they're all similar to this -- the common code is
> using the check as a proxy for something else, and it should be directly
> asking about that something else. The only bits of code that should
> care about "is the irqchip in kernel?" are:
>  * target-specific device/machine setup code which needs to know
>    which apic/etc to instantiate
>  * target-specific x86 code which has this weird synchronous IRQ
>    delivery model for irqchip-not-in-kernel
> (Obviously I might have missed something, I'm flailing around
> trying to understand this code :-))

Agree naming should be improved.  In fact the early series I pushed to
decompose local apic, ioapic, and pic, but that didn't happen.  If it
did we'd probably not have this conversation.
Avi Kivity - July 23, 2012, 12:31 p.m.
On 07/23/2012 03:25 PM, Peter Maydell wrote:
> On 23 July 2012 13:18, Avi Kivity <avi@redhat.com> wrote:
>>  While you don't have an irqchip, you do have asynchronous interrupt
>> injection, yes?  That's what irqchip really is all about.
> 
> This seems an odd point of view -- async interrupt injection
> doesn't have anything to do with whether we're modelling
> the irqchip in the kernel or in QEMU, I thought...

It does on x86.  The relationship between the APIC and the core is
synchronous - the APIC presents the interrupt, the core grabs is when it
is ready (interrupt flag, etc.) and signals the APIC it has done so.
The APIC then clears the interrupt from the interrupt request register
and sets it in the interrupt status register.  This sequence has to be
done while the vcpu is stopped, since we don't have access to the
interrupt flag otherwise.
Avi Kivity - July 23, 2012, 12:34 p.m.
On 07/23/2012 03:31 PM, Avi Kivity wrote:
> On 07/23/2012 03:25 PM, Peter Maydell wrote:
>> On 23 July 2012 13:18, Avi Kivity <avi@redhat.com> wrote:
>>>  While you don't have an irqchip, you do have asynchronous interrupt
>>> injection, yes?  That's what irqchip really is all about.
>> 
>> This seems an odd point of view -- async interrupt injection
>> doesn't have anything to do with whether we're modelling
>> the irqchip in the kernel or in QEMU, I thought...
> 
> It does on x86.  The relationship between the APIC and the core is
> synchronous - the APIC presents the interrupt, the core grabs is when it
> is ready (interrupt flag, etc.) and signals the APIC it has done so.
> The APIC then clears the interrupt from the interrupt request register
> and sets it in the interrupt status register.  This sequence has to be
> done while the vcpu is stopped, since we don't have access to the
> interrupt flag otherwise.

Again, this is just the local APIC.  The IOAPIC (which is x86 equivalent
to the GIC, IIUC) doesn't change this picture and could have been
emulated in userspace even with async interrupts.  As it happens local
APIC emulation pulls in the IOAPIC and PIC.

So my view is that ARM with and without kernel GIC are
irqchip_in_kernel, since whatever is the local APIC in ARM is always
emulated in the kernel.
Peter Maydell - July 23, 2012, 12:58 p.m.
On 23 July 2012 13:26, Avi Kivity <avi@redhat.com> wrote:
> Really, "irqchip in kernel" means asynchronous interrupts - you can
> inject an interrupt from outside the vcpu thread.  Obviously if the vcpu
> is sleeping you need to wake it up and that pulls in idle management.
>
> "irqchip" for x86 really means the local APIC, which has a synchronous
> interface with the cpu core.  "local APIC in kernel" really is
> equivalent to "kernel idle management", "KVM_IRQ_LINE", and irqfd.  The
> ioapic and pit, on the other hand, don't contribute anything to this
> (just performance).

> So yes, ARM with and without GIC are irqchip_in_kernel, since the
> ARM<->GIC interface is asynchronous.  Whether to emulate the GIC or not
> is just a performance question.

So should we be using something other than KVM_CREATE_IRQCHIP to
ask the kernel to create a GIC model for us (and leave KVM_CREATE_IRQCHIP
as a dummy "always succeed" ioctl)?

> So my view is that ARM with and without kernel GIC are
> irqchip_in_kernel, since whatever is the local APIC in ARM is always
> emulated in the kernel.

I'm not sure ARM has any equivalent to the local APIC -- the GIC
deals with everything and we don't have any equivalent division
of labour to the x86 LAPIC-IOAPIC one.

-- PMM
Cornelia Huck - July 23, 2012, 1:06 p.m.
On Mon, 23 Jul 2012 15:18:49 +0300
Avi Kivity <avi@redhat.com> wrote:

> > So, for example, if a specific subchannel (=device) has pending status
> > and an I/O interrupt is to be generated, this interrupt remains pending
> > until an arbitrary cpu is enabled for I/O interrupts. If several cpus
> > are enabled for I/O interrupts, any of them may be interrupted.
> 
> This may be costly to emulate.  On x86 we do not have access to a
> guest's interrupt status while it is running.  Is this not the case for
> s390?
> 
> Oh, let me guess.  You write some interrupt descriptor in memory
> somewhere, issue one of your famous instructions, and the hardware finds
> a guest vcpu and injects the interrupt.

Basically, we have some flags in our control block we can set so that
the cpu drops out of SIE whenever external/I/O/... interrupts are
enabled and then have the host do the lowcore updates, psw swaps, etc.

> 
> x86 has a "least priority" mode which is similar to what you're
> describing, but I don't think we emulate it correctly.
> 
> > When an
> > I/O interrupt is delivered on a cpu, the cpu's lowcore contains the
> > interrupt payload which defines the subchannel (=device) the interrupt
> > is for.
> > 
> > Any idea on how this architecture can be married with the irqchip
> > concept is welcome. If all else fails, would a special irqfd concept
> > for !irqchip be acceptable?
> 
> I don't see an issue.  You need an arch-specific irqfd configuration
> ioctl (or your own data field in the existing ioctl) that defines the
> payload.  Then the kernel installs a poll function on that eventfd that,
> when called, does the magic sequence needed to get the interrupt there.

If extending the existing ioctl is acceptable, I think we will go that
route.

>  While you don't have an irqchip, you do have asynchronous interrupt
> injection, yes?  That's what irqchip really is all about.

You mean injection via ioctl() that is asynchronous to vcpu execution?
Yes, although we use a different ioctl than the others.

Cornelia
Avi Kivity - July 23, 2012, 1:09 p.m.
On 07/23/2012 03:58 PM, Peter Maydell wrote:
> On 23 July 2012 13:26, Avi Kivity <avi@redhat.com> wrote:
>> Really, "irqchip in kernel" means asynchronous interrupts - you can
>> inject an interrupt from outside the vcpu thread.  Obviously if the vcpu
>> is sleeping you need to wake it up and that pulls in idle management.
>>
>> "irqchip" for x86 really means the local APIC, which has a synchronous
>> interface with the cpu core.  "local APIC in kernel" really is
>> equivalent to "kernel idle management", "KVM_IRQ_LINE", and irqfd.  The
>> ioapic and pit, on the other hand, don't contribute anything to this
>> (just performance).
> 
>> So yes, ARM with and without GIC are irqchip_in_kernel, since the
>> ARM<->GIC interface is asynchronous.  Whether to emulate the GIC or not
>> is just a performance question.
> 
> So should we be using something other than KVM_CREATE_IRQCHIP to
> ask the kernel to create a GIC model for us (and leave KVM_CREATE_IRQCHIP
> as a dummy "always succeed" ioctl)?

Some time ago I suggested using the parameter to KVM_CREATE_IRQCHIP to
select the "irqchip type".

> 
>> So my view is that ARM with and without kernel GIC are
>> irqchip_in_kernel, since whatever is the local APIC in ARM is always
>> emulated in the kernel.
> 
> I'm not sure ARM has any equivalent to the local APIC -- the GIC
> deals with everything and we don't have any equivalent division
> of labour to the x86 LAPIC-IOAPIC one.

It's probably a tiny part of the core with no name.  The point is that
the x86<->lapic interface is synchronous and bidirectional, while the
lapic<->ioapic interface is asynchronous (it is still bidirectional, but
not in a stop-the-vcpu way).  I assume the ARM<->GIC interface is
unidirectional?
Avi Kivity - July 23, 2012, 1:14 p.m.
On 07/23/2012 04:06 PM, Cornelia Huck wrote:
> On Mon, 23 Jul 2012 15:18:49 +0300
> Avi Kivity <avi@redhat.com> wrote:
> 
>> > So, for example, if a specific subchannel (=device) has pending status
>> > and an I/O interrupt is to be generated, this interrupt remains pending
>> > until an arbitrary cpu is enabled for I/O interrupts. If several cpus
>> > are enabled for I/O interrupts, any of them may be interrupted.
>> 
>> This may be costly to emulate.  On x86 we do not have access to a
>> guest's interrupt status while it is running.  Is this not the case for
>> s390?
>> 
>> Oh, let me guess.  You write some interrupt descriptor in memory
>> somewhere, issue one of your famous instructions, and the hardware finds
>> a guest vcpu and injects the interrupt.
> 
> Basically, we have some flags in our control block we can set so that
> the cpu drops out of SIE whenever external/I/O/... interrupts are
> enabled and then have the host do the lowcore updates, psw swaps, etc.

Can you write them from a different cpu and expect them to take effect?

How do you emulate an interrupt with a large guest?  You have to update
the flags in the control blocks for all vcpus; then restore them when
the interrupt is delivered?

>> I don't see an issue.  You need an arch-specific irqfd configuration
>> ioctl (or your own data field in the existing ioctl) that defines the
>> payload.  Then the kernel installs a poll function on that eventfd that,
>> when called, does the magic sequence needed to get the interrupt there.
> 
> If extending the existing ioctl is acceptable, I think we will go that
> route.

Sure.

>>  While you don't have an irqchip, you do have asynchronous interrupt
>> injection, yes?  That's what irqchip really is all about.
> 
> You mean injection via ioctl() that is asynchronous to vcpu execution?
> Yes, although we use a different ioctl than the others.

Ok.  The difference between irqfd and the ioctl is that with irqfd
everything (the payload in your case, the interrupt number for others)
is prepared beforehand, while with the ioctl the extra information is
delivered with the ioctl.
Peter Maydell - July 23, 2012, 1:27 p.m.
On 23 July 2012 14:09, Avi Kivity <avi@redhat.com> wrote:
> On 07/23/2012 03:58 PM, Peter Maydell wrote:
>> So should we be using something other than KVM_CREATE_IRQCHIP to
>> ask the kernel to create a GIC model for us (and leave KVM_CREATE_IRQCHIP
>> as a dummy "always succeed" ioctl)?
>
> Some time ago I suggested using the parameter to KVM_CREATE_IRQCHIP to
> select the "irqchip type".

That seems reasonable, although we have an awkward ordering issue
in KVM as it stands: KVM_CREATE_IRQCHIP needs to be called before
we start creating VCPU threads (at the moment it is done via kvm_init)
but we don't know what kind of irqchip we want to create until the
machine model code actually creates the irqchip device, which generally
happens after the CPU object is created (and VCPU threads are created
at that point). We could fix that by moving creation of the actual
VCPU thread to vl.c after the model has been initialized.

>> I'm not sure ARM has any equivalent to the local APIC -- the GIC
>> deals with everything and we don't have any equivalent division
>> of labour to the x86 LAPIC-IOAPIC one.
>
> It's probably a tiny part of the core with no name.  The point is that
> the x86<->lapic interface is synchronous and bidirectional, while the
> lapic<->ioapic interface is asynchronous (it is still bidirectional, but
> not in a stop-the-vcpu way).  I assume the ARM<->GIC interface is
> unidirectional?

Well, strictly speaking the ARM<->GIC interface is implementation
defined, but in practice you can think of it as "the GIC controls
the IRQ and FIQ input lines to each core and uses them to signal
that an interrupt is present". There's no need for anything to
be signalled back in the other direction: the GIC will just continue
to hold IRQ asserted until the interrupt handler code writes the
relevant GIC register to indicate that the interrupt has been
handled.

(On a core with the virtualization extensions there are also
signals for the GIC to raise a virtual IRQ or FIQ, but we can
ignore those for KVM because we don't and can't provide the
virtualization extensions to a guest.)

-- PMM
Avi Kivity - July 23, 2012, 1:38 p.m.
On 07/23/2012 04:27 PM, Peter Maydell wrote:
> On 23 July 2012 14:09, Avi Kivity <avi@redhat.com> wrote:
>> On 07/23/2012 03:58 PM, Peter Maydell wrote:
>>> So should we be using something other than KVM_CREATE_IRQCHIP to
>>> ask the kernel to create a GIC model for us (and leave KVM_CREATE_IRQCHIP
>>> as a dummy "always succeed" ioctl)?
>>
>> Some time ago I suggested using the parameter to KVM_CREATE_IRQCHIP to
>> select the "irqchip type".
> 
> That seems reasonable, although we have an awkward ordering issue
> in KVM as it stands: KVM_CREATE_IRQCHIP needs to be called before
> we start creating VCPU threads (at the moment it is done via kvm_init)
> but we don't know what kind of irqchip we want to create until the
> machine model code actually creates the irqchip device, which generally
> happens after the CPU object is created (and VCPU threads are created
> at that point). We could fix that by moving creation of the actual
> VCPU thread to vl.c after the model has been initialized.

Seems to be purely a qemu problem, no?  While I think it's reasonable to
be flexible, in this case I think qemu ought to know all these things
beforehand.

> 
>>> I'm not sure ARM has any equivalent to the local APIC -- the GIC
>>> deals with everything and we don't have any equivalent division
>>> of labour to the x86 LAPIC-IOAPIC one.
>>
>> It's probably a tiny part of the core with no name.  The point is that
>> the x86<->lapic interface is synchronous and bidirectional, while the
>> lapic<->ioapic interface is asynchronous (it is still bidirectional, but
>> not in a stop-the-vcpu way).  I assume the ARM<->GIC interface is
>> unidirectional?
> 
> Well, strictly speaking the ARM<->GIC interface is implementation
> defined, but in practice you can think of it as "the GIC controls
> the IRQ and FIQ input lines to each core and uses them to signal
> that an interrupt is present". There's no need for anything to
> be signalled back in the other direction: the GIC will just continue
> to hold IRQ asserted until the interrupt handler code writes the
> relevant GIC register to indicate that the interrupt has been
> handled.

Okay.  This agrees with my mental model of how it works.

> (On a core with the virtualization extensions there are also
> signals for the GIC to raise a virtual IRQ or FIQ, but we can
> ignore those for KVM because we don't and can't provide the
> virtualization extensions to a guest.)

Yet.
Peter Maydell - July 23, 2012, 1:50 p.m.
On 23 July 2012 14:38, Avi Kivity <avi@redhat.com> wrote:
> On 07/23/2012 04:27 PM, Peter Maydell wrote:
>> That seems reasonable, although we have an awkward ordering issue
>> in KVM as it stands: KVM_CREATE_IRQCHIP needs to be called before
>> we start creating VCPU threads (at the moment it is done via kvm_init)
>> but we don't know what kind of irqchip we want to create until the
>> machine model code actually creates the irqchip device, which generally
>> happens after the CPU object is created (and VCPU threads are created
>> at that point). We could fix that by moving creation of the actual
>> VCPU thread to vl.c after the model has been initialized.
>
> Seems to be purely a qemu problem, no?  While I think it's reasonable to
> be flexible, in this case I think qemu ought to know all these things
> beforehand.

Yes, sorry, I meant "issue in QEMU" but had a thinko; my suggested
solution is basically shuffling code around in QEMU so we wait
until we know what we are dealing with before asking the kernel
to create the vcpu threads.

>> (On a core with the virtualization extensions there are also
>> signals for the GIC to raise a virtual IRQ or FIQ, but we can
>> ignore those for KVM because we don't and can't provide the
>> virtualization extensions to a guest.)
>
> Yet.

There is no mechanism in the virtualization extensions to either
trap on or present a false value for guest accesses to the CPSR
mode bits. So you can't make the guest OS think it is in Hypervisor
mode. Therefore you can't provide the guest with the virtualization
extensions. (The same argument applies for Monitor mode and means
you can't provide the Security extensions (TrustZone) to the guest
either.) I guess you could handle all code in the guest hypervisor
under TCG but implementing a hybrid TCG+KVM mode would be a lot
of effort and probably not really perform very well anyway...

-- PMM
Cornelia Huck - July 23, 2012, 1:55 p.m.
On Mon, 23 Jul 2012 16:14:02 +0300
Avi Kivity <avi@redhat.com> wrote:

> On 07/23/2012 04:06 PM, Cornelia Huck wrote:
> > On Mon, 23 Jul 2012 15:18:49 +0300
> > Avi Kivity <avi@redhat.com> wrote:
> > 
> >> > So, for example, if a specific subchannel (=device) has pending status
> >> > and an I/O interrupt is to be generated, this interrupt remains pending
> >> > until an arbitrary cpu is enabled for I/O interrupts. If several cpus
> >> > are enabled for I/O interrupts, any of them may be interrupted.
> >> 
> >> This may be costly to emulate.  On x86 we do not have access to a
> >> guest's interrupt status while it is running.  Is this not the case for
> >> s390?
> >> 
> >> Oh, let me guess.  You write some interrupt descriptor in memory
> >> somewhere, issue one of your famous instructions, and the hardware finds
> >> a guest vcpu and injects the interrupt.
> > 
> > Basically, we have some flags in our control block we can set so that
> > the cpu drops out of SIE whenever external/I/O/... interrupts are
> > enabled and then have the host do the lowcore updates, psw swaps, etc.
> 
> Can you write them from a different cpu and expect them to take effect?
> 
> How do you emulate an interrupt with a large guest?  You have to update
> the flags in the control blocks for all vcpus; then restore them when
> the interrupt is delivered?

We may access the flags for any vcpu via the kvm_s390_float_interrupt
structure which is contained in kvm->arch.

We'll get control when a vcpu enters a wait state and try to deliver
pending interrupts or set/clear the flags. Moreover, idle vcpus are on
a wait queue and are the first target for getting an interrupt injected.

> >>  While you don't have an irqchip, you do have asynchronous interrupt
> >> injection, yes?  That's what irqchip really is all about.
> > 
> > You mean injection via ioctl() that is asynchronous to vcpu execution?
> > Yes, although we use a different ioctl than the others.
> 
> Ok.  The difference between irqfd and the ioctl is that with irqfd
> everything (the payload in your case, the interrupt number for others)
> is prepared beforehand, while with the ioctl the extra information is
> delivered with the ioctl.

This should work for us.

Cornelia
Avi Kivity - July 23, 2012, 2:27 p.m.
On 07/23/2012 04:55 PM, Cornelia Huck wrote:
>> > Basically, we have some flags in our control block we can set so that
>> > the cpu drops out of SIE whenever external/I/O/... interrupts are
>> > enabled and then have the host do the lowcore updates, psw swaps, etc.
>> 
>> Can you write them from a different cpu and expect them to take effect?
>> 
>> How do you emulate an interrupt with a large guest?  You have to update
>> the flags in the control blocks for all vcpus; then restore them when
>> the interrupt is delivered?
> 
> We may access the flags for any vcpu via the kvm_s390_float_interrupt
> structure which is contained in kvm->arch.
> 
> We'll get control when a vcpu enters a wait state and try to deliver
> pending interrupts or set/clear the flags. Moreover, idle vcpus are on
> a wait queue and are the first target for getting an interrupt injected.

Okay.  And you can ask a vcpu to exit when it enables interrupts,
without interrupting it?

On x86, we have to IPI the vcpu so it drops to the host, ask it to let
us know when interrupts are enabled, and let it run again.
Avi Kivity - July 23, 2012, 2:30 p.m.
On 07/23/2012 04:50 PM, Peter Maydell wrote:
>>
>> Yet.
> 
> There is no mechanism in the virtualization extensions to either
> trap on or present a false value for guest accesses to the CPSR
> mode bits. So you can't make the guest OS think it is in Hypervisor
> mode. Therefore you can't provide the guest with the virtualization
> extensions. (The same argument applies for Monitor mode and means
> you can't provide the Security extensions (TrustZone) to the guest
> either.) I guess you could handle all code in the guest hypervisor
> under TCG but implementing a hybrid TCG+KVM mode would be a lot
> of effort and probably not really perform very well anyway...

Gaah, people add virtualization extensions to fix an ISA's
non-virtualizability, then do the same mistake again.

But I was only joking.  Nested virtualization is interesting technically
but so far I haven't seen any huge or even small uptake.
Cornelia Huck - July 23, 2012, 3:01 p.m.
On Mon, 23 Jul 2012 17:27:47 +0300
Avi Kivity <avi@redhat.com> wrote:

> On 07/23/2012 04:55 PM, Cornelia Huck wrote:
> >> > Basically, we have some flags in our control block we can set so that
> >> > the cpu drops out of SIE whenever external/I/O/... interrupts are
> >> > enabled and then have the host do the lowcore updates, psw swaps, etc.
> >> 
> >> Can you write them from a different cpu and expect them to take effect?
> >> 
> >> How do you emulate an interrupt with a large guest?  You have to update
> >> the flags in the control blocks for all vcpus; then restore them when
> >> the interrupt is delivered?
> > 
> > We may access the flags for any vcpu via the kvm_s390_float_interrupt
> > structure which is contained in kvm->arch.
> > 
> > We'll get control when a vcpu enters a wait state and try to deliver
> > pending interrupts or set/clear the flags. Moreover, idle vcpus are on
> > a wait queue and are the first target for getting an interrupt injected.
> 
> Okay.  And you can ask a vcpu to exit when it enables interrupts,
> without interrupting it?

Yes.

See arch/s390/kvm/interrupt.c and the CPUSTAT_* flags.

> 
> On x86, we have to IPI the vcpu so it drops to the host, ask it to let
> us know when interrupts are enabled, and let it run again.
> 

Cornelia
Peter Maydell - July 23, 2012, 3:19 p.m.
On 23 July 2012 13:26, Avi Kivity <avi@redhat.com> wrote:
> On 07/21/2012 11:54 AM, Peter Maydell wrote:
>> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
>> is because I think they're all similar to this -- the common code is
>> using the check as a proxy for something else, and it should be directly
>> asking about that something else. The only bits of code that should
>> care about "is the irqchip in kernel?" are:
>>  * target-specific device/machine setup code which needs to know
>>    which apic/etc to instantiate
>>  * target-specific x86 code which has this weird synchronous IRQ
>>    delivery model for irqchip-not-in-kernel
>> (Obviously I might have missed something, I'm flailing around
>> trying to understand this code :-))
>
> Agree naming should be improved.  In fact the early series I pushed to
> decompose local apic, ioapic, and pic, but that didn't happen.  If it
> did we'd probably not have this conversation.

OK, let's see if we can get some agreement about naming here.

First, some test-functions I think we definitely need:

 kvm_interrupts_are_async()
   -- true if interrupt delivery is asynchronous
      default false in kvm_init, set true in kvm_irqchip_create,
      architectures may set it true in kvm_arch_init [ARM will
      do so; PPC might want to do so]

 kvm_irqchip_in_kernel()
   -- the user-settable option, actual behaviour is arch specific
      on x86, true means (as it does now) LAPIC,IOAPIC,PIT in kernel
      on ARM, we ignore this setting and just DTRT
      on PPC, used as a convenience setting for whether to use
      an in-kernel model of the interrupt controller
      Shouldn't be used in non-target-specific code

and two I'm not quite so sure about:

 kvm_has_msi_routing()
   -- true if we can do routing of MSIs
      set true only if x86 and kvm_irqchip_in_kernel

 kvm_has_irqfds()
   -- true if kernel supports IRQFDs
      currently true only if x86 and kvm_irqchip_in_kernel


Second, current uses of kvm_irqchip_in_kernel():

hw/kvmvapic.c, hw/pc.c, hw/pc_piix.c, target-i386/kvm.c:
 -- these are all x86 specific and can continue to use
    kvm_irqchip_in_kernel()

cpus.c:cpu_thread_is_idle()
 -- should use !kvm_interrupts_are_async() [because halt is
in userspace iff we're using the synchronous interrupt model]

kvm-all.c:kvm_irqchip_set_irq():
 -- (just an assert) should be kvm_interrupts_are_async()

kvm-all.c:kvm_irqchip_add_msi_route():
 -- should be kvm_have_msi_routing()

kvm-all.c:kvm_irqchip_assign_irqfd():
 -- should be true if kvm_has_irqfds()

kvm-all.c:kvm_allows_irq0_override():
 -- this still seems to me to be a completely x86 specific concept;
    it should move to a source file in target-x86 and then it
    can continue to use kvm_irqchip_in_kernel()

hw/virtio-pci.c:virtio_pci_set_guest_notifiers()
 -- not entirely sure about this one but I think it
    should be testing kvm_has_msi_routing().

Any mistakes/countersuggestions?

thanks
-- PMM
Jan Kiszka - July 23, 2012, 4:55 p.m.
On 2012-07-23 17:19, Peter Maydell wrote:
> On 23 July 2012 13:26, Avi Kivity <avi@redhat.com> wrote:
>> On 07/21/2012 11:54 AM, Peter Maydell wrote:
>>> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
>>> is because I think they're all similar to this -- the common code is
>>> using the check as a proxy for something else, and it should be directly
>>> asking about that something else. The only bits of code that should
>>> care about "is the irqchip in kernel?" are:
>>>  * target-specific device/machine setup code which needs to know
>>>    which apic/etc to instantiate
>>>  * target-specific x86 code which has this weird synchronous IRQ
>>>    delivery model for irqchip-not-in-kernel
>>> (Obviously I might have missed something, I'm flailing around
>>> trying to understand this code :-))
>>
>> Agree naming should be improved.  In fact the early series I pushed to
>> decompose local apic, ioapic, and pic, but that didn't happen.  If it
>> did we'd probably not have this conversation.
> 
> OK, let's see if we can get some agreement about naming here.
> 
> First, some test-functions I think we definitely need:
> 
>  kvm_interrupts_are_async()
>    -- true if interrupt delivery is asynchronous
>       default false in kvm_init, set true in kvm_irqchip_create,
>       architectures may set it true in kvm_arch_init [ARM will
>       do so; PPC might want to do so]
> 
>  kvm_irqchip_in_kernel()
>    -- the user-settable option, actual behaviour is arch specific
>       on x86, true means (as it does now) LAPIC,IOAPIC,PIT in kernel
>       on ARM, we ignore this setting and just DTRT

You should reject kernel_irqchip=off as long as you only support an
in-kernel GIC model.

>       on PPC, used as a convenience setting for whether to use
>       an in-kernel model of the interrupt controller
>       Shouldn't be used in non-target-specific code
> 
> and two I'm not quite so sure about:
> 
>  kvm_has_msi_routing()
>    -- true if we can do routing of MSIs

GSI, not MSI.

>       set true only if x86 and kvm_irqchip_in_kernel

It means that the target architecture supports routing of various
interrupt sources (userspace, irqfds, in-kernel device models) to
different in-kernel IRQ sinks (CPU cores, irqchip models, whatever).
Interrupt messages via (binary-state) irqfd depend on it.

> 
>  kvm_has_irqfds()
>    -- true if kernel supports IRQFDs
>       currently true only if x86 and kvm_irqchip_in_kernel

Note that this and the above are currently static feature tests, not
mode checks (i.e. they are true even if kernel_irqchip=off). The
"kvm_has" namespace is reserved for such tests.

> 
> 
> Second, current uses of kvm_irqchip_in_kernel():
> 
> hw/kvmvapic.c, hw/pc.c, hw/pc_piix.c, target-i386/kvm.c:
>  -- these are all x86 specific and can continue to use
>     kvm_irqchip_in_kernel()
> 
> cpus.c:cpu_thread_is_idle()
>  -- should use !kvm_interrupts_are_async() [because halt is
> in userspace iff we're using the synchronous interrupt model]
> 
> kvm-all.c:kvm_irqchip_set_irq():
>  -- (just an assert) should be kvm_interrupts_are_async()

The name kvm_irqchip_set_irq implies so far that it injects into an
in-kernel irqchip model. Either different functions for archs that don't
follow this concept need to be provided, or this function requires
renaming (kvm_set_irq_async or so).

> 
> kvm-all.c:kvm_irqchip_add_msi_route():
>  -- should be kvm_have_msi_routing()

Only if you change the semantics of kvm_has_gsi_routing (and rename it).

> 
> kvm-all.c:kvm_irqchip_assign_irqfd():
>  -- should be true if kvm_has_irqfds()

The same issue. Plus there is that naming conflict again if we should
ever see irqfd without some in-kernel irqchip. But even s390 would have
a logical "irqchip" for me at the point it may route interrupt messages
from devices directly to the CPUs.

> 
> kvm-all.c:kvm_allows_irq0_override():
>  -- this still seems to me to be a completely x86 specific concept;
>     it should move to a source file in target-x86 and then it
>     can continue to use kvm_irqchip_in_kernel()
> 
> hw/virtio-pci.c:virtio_pci_set_guest_notifiers()
>  -- not entirely sure about this one but I think it
>     should be testing kvm_has_msi_routing().

It depends on full irqfd support, which includes IRQ routing to allow
MSI via irqfd. Something like kvm_msi_via_irqfd_available.

Jan
Peter Maydell - July 23, 2012, 5:41 p.m.
On 23 July 2012 17:55, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-07-23 17:19, Peter Maydell wrote:
>> OK, let's see if we can get some agreement about naming here.
>>
>> First, some test-functions I think we definitely need:
>>
>>  kvm_interrupts_are_async()
>>    -- true if interrupt delivery is asynchronous
>>       default false in kvm_init, set true in kvm_irqchip_create,
>>       architectures may set it true in kvm_arch_init [ARM will
>>       do so; PPC might want to do so]
>>
>>  kvm_irqchip_in_kernel()
>>    -- the user-settable option, actual behaviour is arch specific
>>       on x86, true means (as it does now) LAPIC,IOAPIC,PIT in kernel
>>       on ARM, we ignore this setting and just DTRT
>
> You should reject kernel_irqchip=off as long as you only support an
> in-kernel GIC model.

Agreed. (At the moment we still have code in QEMU for out-of-kernel
GIC models, which is legacy from before the VGIC implementation;
depending on whether that actually vanishes from the kernel ABI
or not I may include the QEMU code which makes arm_pic.c handle
injecting interrupts to KVM via ioctl. But in that case we would
reject =off for A15 and =on for non-A15 (slightly pointlessly
since non-A15 will fail the "must be an A15" check.))

>>       on PPC, used as a convenience setting for whether to use
>>       an in-kernel model of the interrupt controller
>>       Shouldn't be used in non-target-specific code
>>
>> and two I'm not quite so sure about:
>>
>>  kvm_has_msi_routing()
>>    -- true if we can do routing of MSIs
>
> GSI, not MSI.
>
>>       set true only if x86 and kvm_irqchip_in_kernel
>
> It means that the target architecture supports routing of various
> interrupt sources (userspace, irqfds, in-kernel device models) to
> different in-kernel IRQ sinks (CPU cores, irqchip models, whatever).
> Interrupt messages via (binary-state) irqfd depend on it.

So I still don't actually understand what this interrupt routing
is (as far as the kernel is concerned). Surely the way this should
work is that you use KVM_IRQFD to say "this fd goes to this
interrupt sink" (with some sensible scheme for encoding what the
interrupt destination actually is, like an (irqchip,pin) tuple
or something). Why does the kernel need to care about irq
routing?

>> kvm-all.c:kvm_irqchip_set_irq():
>>  -- (just an assert) should be kvm_interrupts_are_async()
>
> The name kvm_irqchip_set_irq implies so far that it injects into an
> in-kernel irqchip model. Either different functions for archs that don't
> follow this concept need to be provided, or this function requires
> renaming (kvm_set_irq_async or so).

Yes, renaming the function would probably be reasonable.
(Currently my QEMU code for using the VGIC just does a direct
ioctl itself though, because kvm_irqchip_set_irq() doesn't actually
do very much and it's a bit awkward that it insists on fishing
the ioctl number out from the KVMState.)

>> kvm-all.c:kvm_irqchip_assign_irqfd():
>>  -- should be true if kvm_has_irqfds()
>
> The same issue. Plus there is that naming conflict again if we should
> ever see irqfd without some in-kernel irqchip. But even s390 would have
> a logical "irqchip" for me at the point it may route interrupt messages
> from devices directly to the CPUs.

I don't think you can have irqfds without an irqchip because where
would you be sending the interrupts?

-- PMM
Jan Kiszka - July 23, 2012, 5:51 p.m.
On 2012-07-23 19:41, Peter Maydell wrote:
> On 23 July 2012 17:55, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2012-07-23 17:19, Peter Maydell wrote:
>>> OK, let's see if we can get some agreement about naming here.
>>>
>>> First, some test-functions I think we definitely need:
>>>
>>>  kvm_interrupts_are_async()
>>>    -- true if interrupt delivery is asynchronous
>>>       default false in kvm_init, set true in kvm_irqchip_create,
>>>       architectures may set it true in kvm_arch_init [ARM will
>>>       do so; PPC might want to do so]
>>>
>>>  kvm_irqchip_in_kernel()
>>>    -- the user-settable option, actual behaviour is arch specific
>>>       on x86, true means (as it does now) LAPIC,IOAPIC,PIT in kernel
>>>       on ARM, we ignore this setting and just DTRT
>>
>> You should reject kernel_irqchip=off as long as you only support an
>> in-kernel GIC model.
> 
> Agreed. (At the moment we still have code in QEMU for out-of-kernel
> GIC models, which is legacy from before the VGIC implementation;
> depending on whether that actually vanishes from the kernel ABI
> or not I may include the QEMU code which makes arm_pic.c handle
> injecting interrupts to KVM via ioctl. But in that case we would
> reject =off for A15 and =on for non-A15 (slightly pointlessly
> since non-A15 will fail the "must be an A15" check.))
> 
>>>       on PPC, used as a convenience setting for whether to use
>>>       an in-kernel model of the interrupt controller
>>>       Shouldn't be used in non-target-specific code
>>>
>>> and two I'm not quite so sure about:
>>>
>>>  kvm_has_msi_routing()
>>>    -- true if we can do routing of MSIs
>>
>> GSI, not MSI.
>>
>>>       set true only if x86 and kvm_irqchip_in_kernel
>>
>> It means that the target architecture supports routing of various
>> interrupt sources (userspace, irqfds, in-kernel device models) to
>> different in-kernel IRQ sinks (CPU cores, irqchip models, whatever).
>> Interrupt messages via (binary-state) irqfd depend on it.
> 
> So I still don't actually understand what this interrupt routing
> is (as far as the kernel is concerned). Surely the way this should
> work is that you use KVM_IRQFD to say "this fd goes to this
> interrupt sink" (with some sensible scheme for encoding what the
> interrupt destination actually is, like an (irqchip,pin) tuple
> or something). Why does the kernel need to care about irq
> routing?

- to define where to inject if you have multiple irqchips to address, or
  if there is a special sink that reattaches payload (e.g. MSI
  messages) to a binary source like irqfd is
- to configure board-specific re-directions (like that IRQ0 override on
  x86)

> 
>>> kvm-all.c:kvm_irqchip_set_irq():
>>>  -- (just an assert) should be kvm_interrupts_are_async()
>>
>> The name kvm_irqchip_set_irq implies so far that it injects into an
>> in-kernel irqchip model. Either different functions for archs that don't
>> follow this concept need to be provided, or this function requires
>> renaming (kvm_set_irq_async or so).
> 
> Yes, renaming the function would probably be reasonable.
> (Currently my QEMU code for using the VGIC just does a direct
> ioctl itself though, because kvm_irqchip_set_irq() doesn't actually
> do very much and it's a bit awkward that it insists on fishing
> the ioctl number out from the KVMState.)

The latter has to do with legacy and lacking support for certain IOCTLs
on older kernels. However, it shouldn't hurt you to take the generic
route for the sake of consistency and a to ease instrumentations etc.

> 
>>> kvm-all.c:kvm_irqchip_assign_irqfd():
>>>  -- should be true if kvm_has_irqfds()
>>
>> The same issue. Plus there is that naming conflict again if we should
>> ever see irqfd without some in-kernel irqchip. But even s390 would have
>> a logical "irqchip" for me at the point it may route interrupt messages
>> from devices directly to the CPUs.
> 
> I don't think you can have irqfds without an irqchip because where
> would you be sending the interrupts?

To the CPU cores. But even that requires routing and arbitration if
there are multiple receivers. So I think s390 has an in-kernel (or
"in-hardware") irqchip, conceptually.

Jan
Peter Maydell - July 23, 2012, 5:58 p.m.
On 23 July 2012 15:30, Avi Kivity <avi@redhat.com> wrote:
> But I was only joking.  Nested virtualization is interesting technically
> but so far I haven't seen any huge or even small uptake.

Yes; that (as I understand it) is why it wasn't an expected use
case for the architecture extensions. The other related thing that
might be surprising for x86-background people is that being
able to present the guest with a virtual CPU that looks like
a pre-virtualization CPU (eg the A9) isn't really an intended
use case either. (The ARM world has much less of the 'everything
must be fully backwards compatible for existing OSes' than x86...)

-- PMM
Avi Kivity - July 24, 2012, 8:50 a.m.
On 07/23/2012 08:58 PM, Peter Maydell wrote:
> On 23 July 2012 15:30, Avi Kivity <avi@redhat.com> wrote:
>> But I was only joking.  Nested virtualization is interesting technically
>> but so far I haven't seen any huge or even small uptake.
> 
> Yes; that (as I understand it) is why it wasn't an expected use
> case for the architecture extensions. The other related thing that
> might be surprising for x86-background people is that being
> able to present the guest with a virtual CPU that looks like
> a pre-virtualization CPU (eg the A9) isn't really an intended
> use case either. (The ARM world has much less of the 'everything
> must be fully backwards compatible for existing OSes' than x86...)

I expect this to change once ARM servers become a reality.
Peter Maydell - July 24, 2012, 8:54 a.m.
On 24 July 2012 09:50, Avi Kivity <avi@redhat.com> wrote:
> On 07/23/2012 08:58 PM, Peter Maydell wrote:
>> The other related thing that
>> might be surprising for x86-background people is that being
>> able to present the guest with a virtual CPU that looks like
>> a pre-virtualization CPU (eg the A9) isn't really an intended
>> use case either. (The ARM world has much less of the 'everything
>> must be fully backwards compatible for existing OSes' than x86...)
>
> I expect this to change once ARM servers become a reality.

Yes, compatibility for guests migrating forwards from old
virt-supporting host CPUs to new ones is important; it's
only the pre-virt to post-virt that is less interesting.

-- PMM
Avi Kivity - July 24, 2012, 8:56 a.m.
On 07/23/2012 06:19 PM, Peter Maydell wrote:
> On 23 July 2012 13:26, Avi Kivity <avi@redhat.com> wrote:
>> On 07/21/2012 11:54 AM, Peter Maydell wrote:
>>> The reason I want to get rid of common-code uses of kvm_irqchip_in_kernel()
>>> is because I think they're all similar to this -- the common code is
>>> using the check as a proxy for something else, and it should be directly
>>> asking about that something else. The only bits of code that should
>>> care about "is the irqchip in kernel?" are:
>>>  * target-specific device/machine setup code which needs to know
>>>    which apic/etc to instantiate
>>>  * target-specific x86 code which has this weird synchronous IRQ
>>>    delivery model for irqchip-not-in-kernel
>>> (Obviously I might have missed something, I'm flailing around
>>> trying to understand this code :-))
>>
>> Agree naming should be improved.  In fact the early series I pushed to
>> decompose local apic, ioapic, and pic, but that didn't happen.  If it
>> did we'd probably not have this conversation.
> 
> OK, let's see if we can get some agreement about naming here.
> 
> First, some test-functions I think we definitely need:
> 
>  kvm_interrupts_are_async()
>    -- true if interrupt delivery is asynchronous
>       default false in kvm_init, set true in kvm_irqchip_create,
>       architectures may set it true in kvm_arch_init [ARM will
>       do so; PPC might want to do so]

Interrupts are by nature async.  I'd say kvm_async_interrupt_injection()
to make it clearer.

> 
>  kvm_irqchip_in_kernel()
>    -- the user-settable option, actual behaviour is arch specific
>       on x86, true means (as it does now) LAPIC,IOAPIC,PIT in kernel
>       on ARM, we ignore this setting and just DTRT
>       on PPC, used as a convenience setting for whether to use
>       an in-kernel model of the interrupt controller
>       Shouldn't be used in non-target-specific code

If it's 100% arch specific, the name can/should be arch specific since
it will never be used in generic core.  So kvm_ioapic_in_kernel(),
kvm_gic_in_kernel() (or even kvm_ioapic(), kvm_gic(), since "kvm"
already implies the kernel (that's the k in kvm, after all).

> 
> and two I'm not quite so sure about:
> 
>  kvm_has_msi_routing()
>    -- true if we can do routing of MSIs
>       set true only if x86 and kvm_irqchip_in_kernel
> 
>  kvm_has_irqfds()
>    -- true if kernel supports IRQFDs
>       currently true only if x86 and kvm_irqchip_in_kernel

Will be true for everyone, eventually.
Jan Kiszka - July 24, 2012, 8:58 a.m.
On 2012-07-24 10:54, Peter Maydell wrote:
> On 24 July 2012 09:50, Avi Kivity <avi@redhat.com> wrote:
>> On 07/23/2012 08:58 PM, Peter Maydell wrote:
>>> The other related thing that
>>> might be surprising for x86-background people is that being
>>> able to present the guest with a virtual CPU that looks like
>>> a pre-virtualization CPU (eg the A9) isn't really an intended
>>> use case either. (The ARM world has much less of the 'everything
>>> must be fully backwards compatible for existing OSes' than x86...)
>>
>> I expect this to change once ARM servers become a reality.
> 
> Yes, compatibility for guests migrating forwards from old
> virt-supporting host CPUs to new ones is important; it's
> only the pre-virt to post-virt that is less interesting.

One of the many use cases for virtualization on other archs is precisely
this (we run 286 code on latest and greatest CPUs in KVM). There should
be quite a few embedded systems out there that could happily be
consolidated on a more modern and powerful ARM platform via virtualization.

Jan

Patch

diff --git a/hw/pc.c b/hw/pc.c
index 598267a..4953376 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -42,6 +42,7 @@ 
 #include "sysbus.h"
 #include "sysemu.h"
 #include "kvm.h"
+#include "kvm_i386.h"
 #include "xen.h"
 #include "blockdev.h"
 #include "hw/block-common.h"
diff --git a/kvm-all.c b/kvm-all.c
index 2148b20..65f1160 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1667,11 +1667,6 @@  int kvm_has_gsi_routing(void)
 #endif
 }
 
-int kvm_allows_irq0_override(void)
-{
-    return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
-}
-
 void *kvm_vmalloc(ram_addr_t size)
 {
 #ifdef TARGET_S390X
diff --git a/kvm.h b/kvm.h
index 2617dd5..c46132f 100644
--- a/kvm.h
+++ b/kvm.h
@@ -62,8 +62,6 @@  int kvm_has_pit_state2(void);
 int kvm_has_many_ioeventfds(void);
 int kvm_has_gsi_routing(void);
 
-int kvm_allows_irq0_override(void);
-
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUArchState *env);
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index e53c2f6..503abeb 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -23,6 +23,7 @@ 
 #include "qemu-common.h"
 #include "sysemu.h"
 #include "kvm.h"
+#include "kvm_i386.h"
 #include "cpu.h"
 #include "gdbstub.h"
 #include "host-utils.h"
@@ -65,6 +66,11 @@  static bool has_msr_async_pf_en;
 static bool has_msr_misc_enable;
 static int lm_capable_kernel;
 
+int kvm_allows_irq0_override(void)
+{
+    return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
+}
+
 static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max)
 {
     struct kvm_cpuid2 *cpuid;
diff --git a/target-i386/kvm_i386.h b/target-i386/kvm_i386.h
new file mode 100644
index 0000000..87031c0
--- /dev/null
+++ b/target-i386/kvm_i386.h
@@ -0,0 +1,16 @@ 
+/*
+ * QEMU KVM support -- x86 specific functions.
+ *
+ * Copyright (c) 2012 Linaro Limited
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_KVM_I386_H
+#define QEMU_KVM_I386_H
+
+int kvm_allows_irq0_override(void);
+
+#endif