Patchwork [1/2] kvm irqfd: support MSIMessage to IRQ translation in PHB

Submitter Alexander Graf
Date Aug. 27, 2013, 10:06 a.m.
Message ID <6CCA8220-0125-4CB5-A54E-9648D8B409D2@suse.de>
Permalink /patch/270078/
State New

Comments

Alexander Graf - Aug. 27, 2013, 10:06 a.m.
On 23.08.2013, at 06:29, Alexey Kardashevskiy wrote:

> On 08/19/2013 07:01 PM, Michael S. Tsirkin wrote:
>> On Mon, Aug 19, 2013 at 06:10:01PM +1000, Alexey Kardashevskiy wrote:
>>> On 08/19/2013 05:54 PM, Michael S. Tsirkin wrote:
>>>> On Mon, Aug 19, 2013 at 05:44:04PM +1000, Alexey Kardashevskiy wrote:
>>>>> On 08/19/2013 05:35 PM, Michael S. Tsirkin wrote:
>>>>>> On Wed, Aug 07, 2013 at 06:51:32PM +1000, Alexey Kardashevskiy wrote:
>>>>>>> On PPC64 systems MSI Messages are translated to system IRQ in a PCI
>>>>>>> host bridge. This is already supported for emulated MSI/MSIX but
>>>>>>> not for irqfd where the current QEMU allocates IRQ numbers from
>>>>>>> irqchip and maps MSIMessages to those IRQ in the host kernel.
>>>>>>> 
>>>>>>> The patch extends irqfd support in order to avoid unnecessary
>>>>>>> mapping and reuse the one which already exists in a PCI host bridge.
>>>>>>> 
>>>>>>> Specifically, a map_msi callback is added to PCIBus and pci_bus_map_msi()
>>>>>>> to PCI API. The latter tries a bus specific map_msi and falls back to
>>>>>>> the default kvm_irqchip_add_msi_route() if none set.
>>>>>>> 
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> 
>>>>>> The mapping (irq # within data) is really KVM on PPC64 specific,
>>>>>> isn't it?
>>>>>> 
>>>>>> So why not implement
>>>>>> kvm_irqchip_add_msi_route(kvm_state, msg);
>>>>>> to simply return msg.data on PPC64?
>>>>> 
>>>>> How exactly? Please give some details. kvm_irqchip_add_msi_route() is
>>>>> implemented in kvm-all.c once for all platforms so hack it right there?
>>>> 
>>>> You can add the map_msi callback in kvm state,
>>>> then just call it if it's set, right?
>>>> 
>>>>> I thought we discussed all of this already and decided that this thing
>>>>> should go to PCI host bridge and by default PHB's map_msi() callback should
>>>>> just call kvm_irqchip_add_msi_route(). This is why I touched i386.
>>>>> 
>>>>> Things have changed since then?
>>>> 
>>>> 
>>>> I think pci_bus_map_msi was there since day 1, it's not like
>>>> we are going back and forth on it, right?
>>> 
>>> 
>>> Sorry, I am not following you. pci_bus_map_msi() is added by this patch.
>>> Where was it from day 1?
>> 
>> I'm sorry. I merely meant that it's there in v1 of this patch.
>> 
>>> 
>>>> I always felt it's best to have irqfd isolated to kvm somehow
>>>> rather than have hooks in pci code, I just don't know enough
>>>> about kvm on ppc to figure out the right way to do this.
>>>> With your latest patches I think this is becoming clearer ...
>>> 
>>> 
>>> Well... This encoding (irq# as msg.data) is used in spapr_pci.c in any
>>> case, whether it is KVM or TCG.
>> 
>> Not only on TCG. All emulated devices merely do a write to send an MSI,
>> this is exactly what happens with real hardware.  If this happens to
>> land in the MSI region, you get an interrupt.  The concept of mapping
>> msi to irq normally doesn't belong in devices/pci core, it's something
>> done by APIC or pci host.
>> 
>> For KVM, we have this special hook where devices allocate a route and
>> then Linux can send an interrupt to guest quickly bypassing QEMU.  It
>> was originally designed for paravirtualization, so it doesn't support
>> strange cases such as guest programming MSI to write into guest memory,
>> which is possible with real PCI devices. However, no one seems to do
>> these hacks in practice, so in practice this works for
>> some other cases, such as device assignment.
>> 
>> That's why we have kvm_irqchip_add_msi_route in some places
>> in code - it's so specific devices implemented by
>> Linux can take this shortcut (it does not make sense for
>> devices implemented by qemu).
>> So the way I see it, it's not a PCI thing at all, it's
>> a KVM specific implementation, so it seems cleaner if
>> it's isolated there.
> 
> The only PCI thing here (at least for spapr) is the way we translate
> MSIMessage to IRQ number. Besides that, yeah, there is no good reason to
> poison pci.c or spapr_pci.c with KVM.
> 
> 
>> Now, we have this allocate/free API for reasons that
>> have to do with the API of kvm on x86. spapr doesn't need to
>> allocate/free resources, so just shortcut free and
>> do map when we allocate.
>> 
>> Doesn't this sound reasonable?
> 
> 
> Yes, pretty much. But it did not help to get this patch accepted at the
> first place and my vote costs a little here :)
> 
> 
>>> I am confused.
>>> 1.5 month ago Anthony suggested (I'll answer to that mail to bring that
>>> discussion up) to do this thing as:
>>> 
>>>> So what this should look like is:
>>>> 
>>>> 1) A PCI bus function to do the MSI -> virq mapping
>>>> 2) On x86 (and e500), this is implemented by calling
>>> kvm_irqchip_add_msi_route()
>>>> 3) On pseries, this just returns msi->data
>>>> 
>>>> Perhaps (2) can just be the default PCI bus implementation to simplify
>>> things.
>>> 
>>> 
>>> Now it is not right. Anthony, please help.
>> 
>> That's right, you implemented exactly what Anthony suggested.  Now that
>> you did, I think I see an even better, more contained way to do this.
>> And it's not much of a change as compared to what you have, is it?
>> 
>> I'm sorry if this looks like you are given a run-around,
>> not everyone understands how spapr works, sometimes
>> it takes a full implementation to make everyone understand
>> the issues.
>> 
>> But I agree, let's see what Anthony thinks, maybe I
>> misunderstood how spapr works.
> 
> 
> Anthony, Alex (Graf), ping, please?

I think it makes sense to treat this special case as a special case. Why don't we just add a new global check in kvm.c, kvm_gsi_routing_linear(), similar to kvm_gsi_routing_enabled()? Enabling linear routing should probably disable routing_enabled().

So basically we would end up with something like

diff --git a/kvm-all.c b/kvm-all.c
index 716860f..ca3251e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1190,6 +1190,10 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
     struct kvm_irq_routing_entry kroute = {};
     int virq;

+    if (kvm_gsi_routing_linear()) {
+        return msg.data & 0xffff;
+    }
+
     if (!kvm_gsi_routing_enabled()) {
         return -ENOSYS;
     }

I agree that we should isolate kvm specifics to kvm code when we can.


Alex
Michael S. Tsirkin - Aug. 27, 2013, 10:26 a.m.
On Tue, Aug 27, 2013 at 12:06:49PM +0200, Alexander Graf wrote:
> 
> > [... full quote of the thread above snipped ...]
> I think it makes sense to treat this special case as a special case. Why don't we just add a new global check in kvm.c, kvm_gsi_routing_linear(), similar to kvm_gsi_routing_enabled()? Enabling linear routing should probably disable routing_enabled().
> 
> So basically we would end up with something like
> 
> diff --git a/kvm-all.c b/kvm-all.c
> index 716860f..ca3251e 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -1190,6 +1190,10 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
>      struct kvm_irq_routing_entry kroute = {};
>      int virq;
> 
> +    if (kvm_gsi_routing_linear()) {
> > +        return msg.data & 0xffff;
> +    }
> +
>      if (!kvm_gsi_routing_enabled()) {
>          return -ENOSYS;
>      }
> 
> I agree that we should isolate kvm specifics to kvm code when we can.
> 
> 
> Alex

This makes sense to me.
Benjamin Herrenschmidt - Aug. 27, 2013, 10:32 a.m.
On Tue, 2013-08-27 at 13:26 +0300, Michael S. Tsirkin wrote:
> e would end up with something like
> > 
> > diff --git a/kvm-all.c b/kvm-all.c
> > index 716860f..ca3251e 100644
> > --- a/kvm-all.c
> > +++ b/kvm-all.c
> > @@ -1190,6 +1190,10 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
> >      struct kvm_irq_routing_entry kroute = {};
> >      int virq;
> > 
> > +    if (kvm_gsi_routing_linear()) {
> > +        return msg.data & 0xffff;
> > +    }
> > +

I haven't followed the whole discussion, Alexey, is this the per-host
bridge source number or the global XIRR (XICS interrupt number) ?

Because in the latter case, it can be up to 24 bits... (And yes, MSI
data is limited to 16).

However, maybe we can decide arbitrarily that under qemu/kvm we only
support 16-bit XIRRs (it's fine from a PAPR perspective, at least if it
keeps things simpler).

Cheers,
Ben.
Alexander Graf - Aug. 27, 2013, 11:31 a.m.
On 27.08.2013, at 12:32, Benjamin Herrenschmidt wrote:

> On Tue, 2013-08-27 at 13:26 +0300, Michael S. Tsirkin wrote:
>> e would end up with something like
>>> 
>>> diff --git a/kvm-all.c b/kvm-all.c
>>> index 716860f..ca3251e 100644
>>> --- a/kvm-all.c
>>> +++ b/kvm-all.c
>>> @@ -1190,6 +1190,10 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
>>>     struct kvm_irq_routing_entry kroute = {};
>>>     int virq;
>>> 
>>> +    if (kvm_gsi_routing_linear()) {
>>> +        return msg.data & 0xffff;
>>> +    }
>>> +
> 
> I haven't followed the whole discussion, Alexey, is this the per-host
> bridge source number or the global XIRR (XICS interrupt number) ?

This is a global interrupt number now. One of Alexey's previous patches changed it to fit that scheme, which makes things a lot easier for irqfd and the like :).

> Because in the latter case, it can be up to 24 bits... (And yes, MSI
> data is limited to 16).

Oh. We define the MSI data field from QEMU anyways, so maybe we should just not mask at all.


Alex

> However maybe we can decide arbitrarily that under qemu/kvm we only
> support 16-bit XIRRs (it's fine, from a PAPR perspective at least if it
> keep things simpler).
> 
> Cheers,
> Ben.
> 
>
Alexey Kardashevskiy - Aug. 28, 2013, 12:55 a.m.
On 08/27/2013 08:32 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2013-08-27 at 13:26 +0300, Michael S. Tsirkin wrote:
>> e would end up with something like
>>>
>>> diff --git a/kvm-all.c b/kvm-all.c
>>> index 716860f..ca3251e 100644
>>> --- a/kvm-all.c
>>> +++ b/kvm-all.c
>>> @@ -1190,6 +1190,10 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
>>>      struct kvm_irq_routing_entry kroute = {};
>>>      int virq;
>>>
>>> +    if (kvm_gsi_routing_linear()) {
>>> +        return msg.data & 0xffff;
>>> +    }
>>> +
> 
> I haven't followed the whole discussion, Alexey, is this the per-host
> bridge source number or the global XIRR (XICS interrupt number) ?

When I started this topic it was per host bridge, which is why I changed it
to be global; that spapr MSIX fix reached agraf/ppc-next last week, so the
fix above should just work now.


> Because in the latter case, it can be up to 24 bits... (And yes, MSI
> data is limited to 16).

Yes, this is the main limitation here.

> However maybe we can decide arbitrarily that under qemu/kvm we only
> support 16-bit XIRRs (it's fine, from a PAPR perspective at least if it
> keep things simpler).

I'll keep 0xffff then as everybody is happy with this.

Patch

diff --git a/kvm-all.c b/kvm-all.c
index 716860f..ca3251e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1190,6 +1190,10 @@  int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
     struct kvm_irq_routing_entry kroute = {};
     int virq;

+    if (kvm_gsi_routing_linear()) {
+        return msg.data & 0xffff;
+    }
+
     if (!kvm_gsi_routing_enabled()) {
         return -ENOSYS;
     }