Patchwork [Xen-devel] Re: [PATCH] xen: fix interrupt routing

login
register
mail settings
Submitter Stefano Stabellini
Date June 14, 2011, 1:27 p.m.
Message ID <alpine.DEB.2.00.1106141344260.12963@kaball-desktop>
Download mbox | patch
Permalink /patch/100328/
State New
Headers show

Comments

Stefano Stabellini - June 14, 2011, 1:27 p.m.
On Tue, 14 Jun 2011, Alexander Graf wrote:
> >>>>> static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
> >>>>> {
> >>>>>   PCII440FXState *d = opaque;
> >>>>> @@ -267,8 +263,17 @@ static PCIBus *i440fx_common_init(const char *device_name,
> >>>>>   d = pci_create_simple(b, 0, device_name);
> >>>>>   *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
> >>>>> 
> >>>>> -    piix3 = DO_UPCAST(PIIX3State, dev,
> >>>>> -                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
> >>>>> +    if (xen_enabled()) {
> >>>>> +        piix3 = DO_UPCAST(PIIX3State, dev,
> >>>>> +                pci_create_simple_multifunction(b, -1, true, "PIIX3-xen"));
> >>>>> +        pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
> >>>>> +                piix3, XEN_PIIX_NUM_PIRQS);
> >>>> 
> >>>> But with XEN_PIIX_NUM_PIRQS it's not a piix3 anymore, no? What's the reason behind this change?
> >>> 
> >>> It is still a piix3, but also provides non-legacy interrupt links to the
> >>> IO-APIC.
> >>> The four pins of each PCI device on the bus not only are routed to the
> >>> normal four pirqs (programmed writing to 0x60-0x63, see above) but also
> >>> they are connected to the IO-APIC directly.
> >>> These additional routes can only be discovered through ACPI, so you need
> >>> matching ACPI tables. We used to build the old ACPI tables like this:
> >>> 
> >>> /* PRTA: APIC routing table (via non-legacy IOAPIC GSIs). */
> >>> printf("Name(PRTA, Package() {\n");
> >>> for ( dev = 1; dev < 32; dev++ )
> >>>   for ( intx = 0; intx < 4; intx++ ) /* INTA-D */
> >>>       printf("Package(){0x%04xffff, %u, 0, %u},\n",
> >>>              dev, intx, ((dev*4+dev/8+intx)&31)+16);
> >>> printf("})\n");
> >>> 
> >> 
> >> Interesting concept, but completely non-standard and very much
> >> different from real hardware. Please at least add a comment there to
> >> show readers that Xen is doing a hack which is not at all related to
> >> how the PIIX really works.
> > 
> > Isn't this more a function of the "wires" on the motherboard than the
> > PIIX specifically? i.e. this just encodes the permutation of the wires
> > from the PCI slots into the IO-APIC input pins (bypassing the PIIX,
> > which is only used for legacy ISA IRQs i.e. by non-APIC aware OSes)?
> 
> Interrupts with PCI work slightly different. PCI devices can map (themselves or by software) to one of 4 interrupt lines: INTA, INTB, INTC, INTD. These get converted using PCI host controller specific logic to 4 interrupt lines which then go into the IO-APIC.
> 
> The IO-APIC is a chip with a limited number of pins. IIRC it was 24, could be 26 though.

The number of redirection entries in the IOAPIC can be discovered
reading from the IOAPICVER register and it is a property of a specific
model of IOAPIC. As a matter of fact Xen's emulated IOAPIC supports more
pins than the most popular IOAPIC used with PIIX3.

 
> I haven't seen a single case where PCI devices have a direct link to the IO-APIC. I also have not seen any PCI host controller that exports more than 4 interrupts. Giving each PCI device its own line, on top of that more than ever could be in real hardware, is a plain hack IMHO.

Actually this happens quite often: if I am not mistaken all the GSIs
higher than 15 are actually the result of a direct connection between
an interrupt source and the IOAPIC. I have several on my testboxes.
Also give a look at the Intel Multiprocessor Specification, section
3.6.2.3: as you can see from the diagram in "Symmetric I/O Mode" all the
interrupts are routed through the IOAPIC directly.


> Did this really give you actual performance/latency/scalability gains? I still think for devices that matter, we should go with MSI rather than deriving from real hw.
> 

Not all the operating systems support MSIs, it is nice to be able to
avoid interrupt sharing without recurring to MSIs.
Also this is how Xen has been working for more then 5 years in HVM mode,
so this configuration is well tested and supported by most operating
systems (at least all the ones we tried so far).


In any case I think it is a good idea to add a comment to better explain
what we are doing, see below.



commit 973bb091a967fdec37a1bc8fe30d46a483d2903d
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Tue May 17 12:10:36 2011 +0000

    xen: fix interrupt routing
    
    - remove i440FX-xen and i440fx_write_config_xen
    we don't need to intercept pci config writes to i440FX anymore;
    
    - introduce PIIX3-xen and piix3_write_config_xen
    we do need to intercept pci config write to the PCI-ISA bridge to update
    the PCI link routing;
    
    - set the number of PIIX3-xen interrupts line to 128;
    
    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Jan Kiszka - June 14, 2011, 3:30 p.m.
On 2011-06-14 15:27, Stefano Stabellini wrote:
> On Tue, 14 Jun 2011, Alexander Graf wrote:
>>>>>>> static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
>>>>>>> {
>>>>>>>   PCII440FXState *d = opaque;
>>>>>>> @@ -267,8 +263,17 @@ static PCIBus *i440fx_common_init(const char *device_name,
>>>>>>>   d = pci_create_simple(b, 0, device_name);
>>>>>>>   *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
>>>>>>>
>>>>>>> -    piix3 = DO_UPCAST(PIIX3State, dev,
>>>>>>> -                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
>>>>>>> +    if (xen_enabled()) {
>>>>>>> +        piix3 = DO_UPCAST(PIIX3State, dev,
>>>>>>> +                pci_create_simple_multifunction(b, -1, true, "PIIX3-xen"));
>>>>>>> +        pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
>>>>>>> +                piix3, XEN_PIIX_NUM_PIRQS);
>>>>>>
>>>>>> But with XEN_PIIX_NUM_PIRQS it's not a piix3 anymore, no? What's the reason behind this change?
>>>>>
>>>>> It is still a piix3, but also provides non-legacy interrupt links to the
>>>>> IO-APIC.
>>>>> The four pins of each PCI device on the bus not only are routed to the
>>>>> normal four pirqs (programmed writing to 0x60-0x63, see above) but also
>>>>> they are connected to the IO-APIC directly.
>>>>> These additional routes can only be discovered through ACPI, so you need
>>>>> matching ACPI tables. We used to build the old ACPI tables like this:
>>>>>
>>>>> /* PRTA: APIC routing table (via non-legacy IOAPIC GSIs). */
>>>>> printf("Name(PRTA, Package() {\n");
>>>>> for ( dev = 1; dev < 32; dev++ )
>>>>>   for ( intx = 0; intx < 4; intx++ ) /* INTA-D */
>>>>>       printf("Package(){0x%04xffff, %u, 0, %u},\n",
>>>>>              dev, intx, ((dev*4+dev/8+intx)&31)+16);
>>>>> printf("})\n");
>>>>>
>>>>
>>>> Interesting concept, but completely non-standard and very much
>>>> different from real hardware. Please at least add a comment there to
>>>> show readers that Xen is doing a hack which is not at all related to
>>>> how the PIIX really works.
>>>
>>> Isn't this more a function of the "wires" on the motherboard than the
>>> PIIX specifically? i.e. this just encodes the permutation of the wires
>>> from the PCI slots into the IO-APIC input pins (bypassing the PIIX,
>>> which is only used for legacy ISA IRQs i.e. by non-APIC aware OSes)?
>>
>> Interrupts with PCI work slightly different. PCI devices can map (themselves or by software) to one of 4 interrupt lines: INTA, INTB, INTC, INTD. These get converted using PCI host controller specific logic to 4 interrupt lines which then go into the IO-APIC.
>>
>> The IO-APIC is a chip with a limited number of pins. IIRC it was 24, could be 26 though.
> 
> The number of redirection entries in the IOAPIC can be discovered
> reading from the IOAPICVER register and it is a property of a specific
> model of IOAPIC. As a matter of fact Xen's emulated IOAPIC supports more
> pins than the most popular IOAPIC used with PIIX3.

Do real IOAPICs exist with more than 24 pins? Otherwise there is the
risk that OSes aren't well prepared for this oddity - specifically not
when the chipset is specified to include a 24-pin IOAPIC.

> 
>  
>> I haven't seen a single case where PCI devices have a direct link to the IO-APIC. I also have not seen any PCI host controller that exports more than 4 interrupts. Giving each PCI device its own line, on top of that more than ever could be in real hardware, is a plain hack IMHO.
> 
> Actually this happens quite often: if I am not mistaken all the GSIs
> higher than 15 are actually the result of a direct connection between
> an interrupt source and the IOAPIC. I have several on my testboxes.

Except that the interrupt source is the chipset with its PCI bridge, not
individual PCI devices.

Jan
Stefano Stabellini - June 14, 2011, 6:18 p.m.
On Tue, 14 Jun 2011, Jan Kiszka wrote:
> On 2011-06-14 15:27, Stefano Stabellini wrote:
> > On Tue, 14 Jun 2011, Alexander Graf wrote:
> >>>>>>> static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
> >>>>>>> {
> >>>>>>>   PCII440FXState *d = opaque;
> >>>>>>> @@ -267,8 +263,17 @@ static PCIBus *i440fx_common_init(const char *device_name,
> >>>>>>>   d = pci_create_simple(b, 0, device_name);
> >>>>>>>   *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
> >>>>>>>
> >>>>>>> -    piix3 = DO_UPCAST(PIIX3State, dev,
> >>>>>>> -                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
> >>>>>>> +    if (xen_enabled()) {
> >>>>>>> +        piix3 = DO_UPCAST(PIIX3State, dev,
> >>>>>>> +                pci_create_simple_multifunction(b, -1, true, "PIIX3-xen"));
> >>>>>>> +        pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
> >>>>>>> +                piix3, XEN_PIIX_NUM_PIRQS);
> >>>>>>
> >>>>>> But with XEN_PIIX_NUM_PIRQS it's not a piix3 anymore, no? What's the reason behind this change?
> >>>>>
> >>>>> It is still a piix3, but also provides non-legacy interrupt links to the
> >>>>> IO-APIC.
> >>>>> The four pins of each PCI device on the bus not only are routed to the
> >>>>> normal four pirqs (programmed writing to 0x60-0x63, see above) but also
> >>>>> they are connected to the IO-APIC directly.
> >>>>> These additional routes can only be discovered through ACPI, so you need
> >>>>> matching ACPI tables. We used to build the old ACPI tables like this:
> >>>>>
> >>>>> /* PRTA: APIC routing table (via non-legacy IOAPIC GSIs). */
> >>>>> printf("Name(PRTA, Package() {\n");
> >>>>> for ( dev = 1; dev < 32; dev++ )
> >>>>>   for ( intx = 0; intx < 4; intx++ ) /* INTA-D */
> >>>>>       printf("Package(){0x%04xffff, %u, 0, %u},\n",
> >>>>>              dev, intx, ((dev*4+dev/8+intx)&31)+16);
> >>>>> printf("})\n");
> >>>>>
> >>>>
> >>>> Interesting concept, but completely non-standard and very much
> >>>> different from real hardware. Please at least add a comment there to
> >>>> show readers that Xen is doing a hack which is not at all related to
> >>>> how the PIIX really works.
> >>>
> >>> Isn't this more a function of the "wires" on the motherboard than the
> >>> PIIX specifically? i.e. this just encodes the permutation of the wires
> >>> from the PCI slots into the IO-APIC input pins (bypassing the PIIX,
> >>> which is only used for legacy ISA IRQs i.e. by non-APIC aware OSes)?
> >>
> >> Interrupts with PCI work slightly different. PCI devices can map (themselves or by software) to one of 4 interrupt lines: INTA, INTB, INTC, INTD. These get converted using PCI host controller specific logic to 4 interrupt lines which then go into the IO-APIC.
> >>
> >> The IO-APIC is a chip with a limited number of pins. IIRC it was 24, could be 26 though.
> > 
> > The number of redirection entries in the IOAPIC can be discovered
> > reading from the IOAPICVER register and it is a property of a specific
> > model of IOAPIC. As a matter of fact Xen's emulated IOAPIC supports more
> > pins than the most popular IOAPIC used with PIIX3.
> 
> Do real IOAPICs exist with more than 24 pins? Otherwise there is the
> risk that OSes aren't well prepared for this oddity - specifically not
> when the chipset is specified to include a 24-pin IOAPIC.

Linux supports up to 128 pins and as I wrote before all the other OSes
we tested so far seem to react well.


> >> I haven't seen a single case where PCI devices have a direct link to the IO-APIC. I also have not seen any PCI host controller that exports more than 4 interrupts. Giving each PCI device its own line, on top of that more than ever could be in real hardware, is a plain hack IMHO.
> > 
> > Actually this happens quite often: if I am not mistaken all the GSIs
> > higher than 15 are actually the result of a direct connection between
> > an interrupt source and the IOAPIC. I have several on my testboxes.
> 
> Except that the interrupt source is the chipset with its PCI bridge, not
> individual PCI devices.

That is the most common configuration but it is not the only one: I have
an ACPI table that has individual PCI devices as source in some test
boxes.
In fact there is even an example of it in this good article about
interrupt routing from the FreeBSD guys (it is the last figure):

http://people.freebsd.org/~jhb/papers/bsdcan/2007/article/node5.html

"Figure 6 contains a portion of an example _PRT. Specifically, it
includes the first entry in the table. This corresponds to the PCI
interrupt for PCI bus 3, slot 7"

..ZIP...

"For APIC mode, the interrupt is routed to GSI 66.  For this machine,
ACPI assigns a base GSI of 64 to the I/O APIC with an APIC ID of 10.
Thus, GSI 66 corresponds to pin 2 on that I/O APIC"

Unless I am missing something I don't think that interrupt is going
through any PCI bridges...
Alexander Graf - June 15, 2011, 8:16 a.m.
On 14.06.2011, at 15:27, Stefano Stabellini wrote:

> On Tue, 14 Jun 2011, Alexander Graf wrote:
>>>>>>> static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
>>>>>>> {
>>>>>>>  PCII440FXState *d = opaque;
>>>>>>> @@ -267,8 +263,17 @@ static PCIBus *i440fx_common_init(const char *device_name,
>>>>>>>  d = pci_create_simple(b, 0, device_name);
>>>>>>>  *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
>>>>>>> 
>>>>>>> -    piix3 = DO_UPCAST(PIIX3State, dev,
>>>>>>> -                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
>>>>>>> +    if (xen_enabled()) {
>>>>>>> +        piix3 = DO_UPCAST(PIIX3State, dev,
>>>>>>> +                pci_create_simple_multifunction(b, -1, true, "PIIX3-xen"));
>>>>>>> +        pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
>>>>>>> +                piix3, XEN_PIIX_NUM_PIRQS);
>>>>>> 
>>>>>> But with XEN_PIIX_NUM_PIRQS it's not a piix3 anymore, no? What's the reason behind this change?
>>>>> 
>>>>> It is still a piix3, but also provides non-legacy interrupt links to the
>>>>> IO-APIC.
>>>>> The four pins of each PCI device on the bus not only are routed to the
>>>>> normal four pirqs (programmed writing to 0x60-0x63, see above) but also
>>>>> they are connected to the IO-APIC directly.
>>>>> These additional routes can only be discovered through ACPI, so you need
>>>>> matching ACPI tables. We used to build the old ACPI tables like this:
>>>>> 
>>>>> /* PRTA: APIC routing table (via non-legacy IOAPIC GSIs). */
>>>>> printf("Name(PRTA, Package() {\n");
>>>>> for ( dev = 1; dev < 32; dev++ )
>>>>>  for ( intx = 0; intx < 4; intx++ ) /* INTA-D */
>>>>>      printf("Package(){0x%04xffff, %u, 0, %u},\n",
>>>>>             dev, intx, ((dev*4+dev/8+intx)&31)+16);
>>>>> printf("})\n");
>>>>> 
>>>> 
>>>> Interesting concept, but completely non-standard and very much
>>>> different from real hardware. Please at least add a comment there to
>>>> show readers that Xen is doing a hack which is not at all related to
>>>> how the PIIX really works.
>>> 
>>> Isn't this more a function of the "wires" on the motherboard than the
>>> PIIX specifically? i.e. this just encodes the permutation of the wires
>>> from the PCI slots into the IO-APIC input pins (bypassing the PIIX,
>>> which is only used for legacy ISA IRQs i.e. by non-APIC aware OSes)?
>> 
>> Interrupts with PCI work slightly different. PCI devices can map (themselves or by software) to one of 4 interrupt lines: INTA, INTB, INTC, INTD. These get converted using PCI host controller specific logic to 4 interrupt lines which then go into the IO-APIC.
>> 
>> The IO-APIC is a chip with a limited number of pins. IIRC it was 24, could be 26 though.
> 
> The number of redirection entries in the IOAPIC can be discovered
> reading from the IOAPICVER register and it is a property of a specific
> model of IOAPIC. As a matter of fact Xen's emulated IOAPIC supports more
> pins than the most popular IOAPIC used with PIIX3.

which means you're emulating hardware that never existed :).

> 
> 
>> I haven't seen a single case where PCI devices have a direct link to the IO-APIC. I also have not seen any PCI host controller that exports more than 4 interrupts. Giving each PCI device its own line, on top of that more than ever could be in real hardware, is a plain hack IMHO.
> 
> Actually this happens quite often: if I am not mistaken all the GSIs
> higher than 15 are actually the result of a direct connection between
> an interrupt source and the IOAPIC. I have several on my testboxes.

Yes. "Interrupt source" meaning a wire on the board. I haven't seen any situation so far where you get direct IO-APIC connections to PCI _device_ pins. You obviously get plenty connections to PCI _bus_ pins.

> Also give a look at the Intel Multiprocessor Specification, section
> 3.6.2.3: as you can see from the diagram in "Symmetric I/O Mode" all the
> interrupts are routed through the IOAPIC directly.
> 
> 
>> Did this really give you actual performance/latency/scalability gains? I still think for devices that matter, we should go with MSI rather than deriving from real hw.
>> 
> 
> Not all the operating systems support MSIs, it is nice to be able to
> avoid interrupt sharing without recurring to MSIs.

Yes and no. It's a tradeoff. If no interrupt sharing means that we emulate hardware that simply never could have existed the way we model it, I think it's a bad idea.

> Also this is how Xen has been working for more then 5 years in HVM mode,
> so this configuration is well tested and supported by most operating
> systems (at least all the ones we tried so far).

I'm fine with Xen breaking its own neck, as long as it doesn't affect non-Xen code paths. Just be aware that I'm not a huge fan of this approach :).

> In any case I think it is a good idea to add a comment to better explain
> what we are doing, see below.
> 
> 
> 
> commit 973bb091a967fdec37a1bc8fe30d46a483d2903d
> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Date:   Tue May 17 12:10:36 2011 +0000
> 
>    xen: fix interrupt routing
> 
>    - remove i440FX-xen and i440fx_write_config_xen
>    we don't need to intercept pci config writes to i440FX anymore;
> 
>    - introduce PIIX3-xen and piix3_write_config_xen
>    we do need to intercept pci config write to the PCI-ISA bridge to update
>    the PCI link routing;
> 
>    - set the number of PIIX3-xen interrupts line to 128;

I still find it unpretty and I'm pretty sure it's completely different from real hardware, but since Xen code is your call and this doesn't affect non-Xen workloads, I won't block it, unless someone else is very much opposed to it.

Please resend as proper patch.


Alex
Alexander Graf - June 15, 2011, 8:24 a.m.
On 14.06.2011, at 20:18, Stefano Stabellini wrote:

> On Tue, 14 Jun 2011, Jan Kiszka wrote:
>> On 2011-06-14 15:27, Stefano Stabellini wrote:
>>> On Tue, 14 Jun 2011, Alexander Graf wrote:
>>>>>>>>> static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
>>>>>>>>> {
>>>>>>>>>  PCII440FXState *d = opaque;
>>>>>>>>> @@ -267,8 +263,17 @@ static PCIBus *i440fx_common_init(const char *device_name,
>>>>>>>>>  d = pci_create_simple(b, 0, device_name);
>>>>>>>>>  *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
>>>>>>>>> 
>>>>>>>>> -    piix3 = DO_UPCAST(PIIX3State, dev,
>>>>>>>>> -                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
>>>>>>>>> +    if (xen_enabled()) {
>>>>>>>>> +        piix3 = DO_UPCAST(PIIX3State, dev,
>>>>>>>>> +                pci_create_simple_multifunction(b, -1, true, "PIIX3-xen"));
>>>>>>>>> +        pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
>>>>>>>>> +                piix3, XEN_PIIX_NUM_PIRQS);
>>>>>>>> 
>>>>>>>> But with XEN_PIIX_NUM_PIRQS it's not a piix3 anymore, no? What's the reason behind this change?
>>>>>>> 
>>>>>>> It is still a piix3, but also provides non-legacy interrupt links to the
>>>>>>> IO-APIC.
>>>>>>> The four pins of each PCI device on the bus not only are routed to the
>>>>>>> normal four pirqs (programmed writing to 0x60-0x63, see above) but also
>>>>>>> they are connected to the IO-APIC directly.
>>>>>>> These additional routes can only be discovered through ACPI, so you need
>>>>>>> matching ACPI tables. We used to build the old ACPI tables like this:
>>>>>>> 
>>>>>>> /* PRTA: APIC routing table (via non-legacy IOAPIC GSIs). */
>>>>>>> printf("Name(PRTA, Package() {\n");
>>>>>>> for ( dev = 1; dev < 32; dev++ )
>>>>>>>  for ( intx = 0; intx < 4; intx++ ) /* INTA-D */
>>>>>>>      printf("Package(){0x%04xffff, %u, 0, %u},\n",
>>>>>>>             dev, intx, ((dev*4+dev/8+intx)&31)+16);
>>>>>>> printf("})\n");
>>>>>>> 
>>>>>> 
>>>>>> Interesting concept, but completely non-standard and very much
>>>>>> different from real hardware. Please at least add a comment there to
>>>>>> show readers that Xen is doing a hack which is not at all related to
>>>>>> how the PIIX really works.
>>>>> 
>>>>> Isn't this more a function of the "wires" on the motherboard than the
>>>>> PIIX specifically? i.e. this just encodes the permutation of the wires
>>>>> from the PCI slots into the IO-APIC input pins (bypassing the PIIX,
>>>>> which is only used for legacy ISA IRQs i.e. by non-APIC aware OSes)?
>>>> 
>>>> Interrupts with PCI work slightly different. PCI devices can map (themselves or by software) to one of 4 interrupt lines: INTA, INTB, INTC, INTD. These get converted using PCI host controller specific logic to 4 interrupt lines which then go into the IO-APIC.
>>>> 
>>>> The IO-APIC is a chip with a limited number of pins. IIRC it was 24, could be 26 though.
>>> 
>>> The number of redirection entries in the IOAPIC can be discovered
>>> reading from the IOAPICVER register and it is a property of a specific
>>> model of IOAPIC. As a matter of fact Xen's emulated IOAPIC supports more
>>> pins than the most popular IOAPIC used with PIIX3.
>> 
>> Do real IOAPICs exist with more than 24 pins? Otherwise there is the
>> risk that OSes aren't well prepared for this oddity - specifically not
>> when the chipset is specified to include a 24-pin IOAPIC.
> 
> Linux supports up to 128 pins and as I wrote before all the other OSes
> we tested so far seem to react well.
> 
> 
>>>> I haven't seen a single case where PCI devices have a direct link to the IO-APIC. I also have not seen any PCI host controller that exports more than 4 interrupts. Giving each PCI device its own line, on top of that more than ever could be in real hardware, is a plain hack IMHO.
>>> 
>>> Actually this happens quite often: if I am not mistaken all the GSIs
>>> higher than 15 are actually the result of a direct connection between
>>> an interrupt source and the IOAPIC. I have several on my testboxes.
>> 
>> Except that the interrupt source is the chipset with its PCI bridge, not
>> individual PCI devices.
> 
> That is the most common configuration but it is not the only one: I have
> an ACPI table that has individual PCI devices as source in some test
> boxes.
> In fact there is even an example of it in this good article about
> interrupt routing from the FreeBSD guys (it is the last figure):
> 
> http://people.freebsd.org/~jhb/papers/bsdcan/2007/article/node5.html
> 
> "Figure 6 contains a portion of an example _PRT. Specifically, it
> includes the first entry in the table. This corresponds to the PCI
> interrupt for PCI bus 3, slot 7"
> 
> ..ZIP...
> 
> "For APIC mode, the interrupt is routed to GSI 66.  For this machine,
> ACPI assigns a base GSI of 64 to the I/O APIC with an APIC ID of 10.
> Thus, GSI 66 corresponds to pin 2 on that I/O APIC"
> 
> Unless I am missing something I don't think that interrupt is going
> through any PCI bridges...

I'm actually not quite sure what exactly he's describing here. But if it's bypassing the bus logic, it's not a normal PCI device :). Sure, there are special case devices that also expose a PCI interface. But real PCI cards that you plug in onto the PCI bus can't bypass the interrupt logic of the bus, as the only interrupt wires they have go to the bus. And since the PCI adapters we use in PC machines in Qemu are all non-special, guests can possibly choke on this.

But either way, I won't block the patch as I mentioned before.


Alex
Stefano Stabellini - June 15, 2011, 4:32 p.m.
On Wed, 15 Jun 2011, Alexander Graf wrote:
> > commit 973bb091a967fdec37a1bc8fe30d46a483d2903d
> > Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > Date:   Tue May 17 12:10:36 2011 +0000
> > 
> >    xen: fix interrupt routing
> > 
> >    - remove i440FX-xen and i440fx_write_config_xen
> >    we don't need to intercept pci config writes to i440FX anymore;
> > 
> >    - introduce PIIX3-xen and piix3_write_config_xen
> >    we do need to intercept pci config write to the PCI-ISA bridge to update
> >    the PCI link routing;
> > 
> >    - set the number of PIIX3-xen interrupts line to 128;
> 
> I still find it unpretty and I'm pretty sure it's completely different from real hardware, but since Xen code is your call and this doesn't affect non-Xen workloads, I won't block it, unless someone else is very much opposed to it.
> 
> Please resend as proper patch.
> 
 
Thanks, I'll do that.
Avi Kivity - June 15, 2011, 4:34 p.m.
On 06/15/2011 11:24 AM, Alexander Graf wrote:
> I'm actually not quite sure what exactly he's describing here. But if it's bypassing the bus logic, it's not a normal PCI device :). Sure, there are special case devices that also expose a PCI interface. But real PCI cards that you plug in onto the PCI bus can't bypass the interrupt logic of the bus, as the only interrupt wires they have go to the bus. And since the PCI adapters we use in PC machines in Qemu are all non-special, guests can possibly choke on this.
>

There actually is a special device in qemu - acpi power management is 
configured as a PCI device, but its interrupt is hard-wired to gsi 9 and 
is edge-triggered (so it can't share the irq line).

I other devices that are special in this regard to also be part of the 
chipset, not devices you can plug into arbitrary slots.
Alexander Graf - June 15, 2011, 4:54 p.m.
Am 15.06.2011 um 18:34 schrieb Avi Kivity <avi@redhat.com>:

> On 06/15/2011 11:24 AM, Alexander Graf wrote:
>> I'm actually not quite sure what exactly he's describing here. But if it's bypassing the bus logic, it's not a normal PCI device :). Sure, there are special case devices that also expose a PCI interface. But real PCI cards that you plug in onto the PCI bus can't bypass the interrupt logic of the bus, as the only interrupt wires they have go to the bus. And since the PCI adapters we use in PC machines in Qemu are all non-special, guests can possibly choke on this.
>> 
> 
> There actually is a special device in qemu - acpi power management is configured as a PCI device, but its interrupt is hard-wired to gsi 9 and is edge-triggered (so it can't share the irq line).
> 
> I other devices that are special in this regard to also be part of the chipset, not devices you can plug into arbitrary slots.

Sure, platform devices can do that. Real PCI cards can not. Have you ever seen an e1000 with direct mapped interrupt lines? :)

I admit though that we also emulate platform devices that happen to expose themselves on the PCI bus. It's not common though and I wouldn't expect every OS/driver to be happy about it.


Alex

>

Patch

diff --git a/hw/pc.h b/hw/pc.h
index 0dcbee7..6d5730b 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -176,7 +176,6 @@  struct PCII440FXState;
 typedef struct PCII440FXState PCII440FXState;
 
 PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int *piix_devfn, qemu_irq *pic, ram_addr_t ram_size);
-PCIBus *i440fx_xen_init(PCII440FXState **pi440fx_state, int *piix3_devfn, qemu_irq *pic, ram_addr_t ram_size);
 void i440fx_init_memory_mappings(PCII440FXState *d);
 
 /* piix4.c */
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 9a22a8a..ba198de 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -124,11 +124,7 @@  static void pc_init1(ram_addr_t ram_size,
     isa_irq = qemu_allocate_irqs(isa_irq_handler, isa_irq_state, 24);
 
     if (pci_enabled) {
-        if (!xen_enabled()) {
-            pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, isa_irq, ram_size);
-        } else {
-            pci_bus = i440fx_xen_init(&i440fx_state, &piix3_devfn, isa_irq, ram_size);
-        }
+        pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, isa_irq, ram_size);
     } else {
         pci_bus = NULL;
         i440fx_state = NULL;
diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 7f1c4cc..1e29ec4 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -40,6 +40,7 @@  typedef PCIHostState I440FXState;
 
 #define PIIX_NUM_PIC_IRQS       16      /* i8259 * 2 */
 #define PIIX_NUM_PIRQS          4ULL    /* PIRQ[A-D] */
+#define XEN_PIIX_NUM_PIRQS      128ULL
 #define PIIX_PIRQC              0x60
 
 typedef struct PIIX3State {
@@ -78,6 +79,8 @@  struct PCII440FXState {
 #define I440FX_SMRAM    0x72
 
 static void piix3_set_irq(void *opaque, int pirq, int level);
+static void piix3_write_config_xen(PCIDevice *dev,
+                               uint32_t address, uint32_t val, int len);
 
 /* return the global irq number corresponding to a given device irq
    pin. We could also use the bus number to have a more precise
@@ -173,13 +176,6 @@  static void i440fx_write_config(PCIDevice *dev,
     }
 }
 
-static void i440fx_write_config_xen(PCIDevice *dev,
-                                    uint32_t address, uint32_t val, int len)
-{
-    xen_piix_pci_write_config_client(address, val, len);
-    i440fx_write_config(dev, address, val, len);
-}
-
 static int i440fx_load_old(QEMUFile* f, void *opaque, int version_id)
 {
     PCII440FXState *d = opaque;
@@ -267,8 +263,21 @@  static PCIBus *i440fx_common_init(const char *device_name,
     d = pci_create_simple(b, 0, device_name);
     *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
 
-    piix3 = DO_UPCAST(PIIX3State, dev,
-                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
+    /* Xen supports additional interrupt routes from the PCI devices to
+     * the IOAPIC: the four pins of each PCI device on the bus are also
+     * connected to the IOAPIC directly.
+     * These additional routes can be discovered through ACPI. */
+    if (xen_enabled()) {
+        piix3 = DO_UPCAST(PIIX3State, dev,
+                pci_create_simple_multifunction(b, -1, true, "PIIX3-xen"));
+        pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
+                piix3, XEN_PIIX_NUM_PIRQS);
+    } else {
+        piix3 = DO_UPCAST(PIIX3State, dev,
+                pci_create_simple_multifunction(b, -1, true, "PIIX3"));
+        pci_bus_irqs(b, piix3_set_irq, pci_slot_get_pirq, piix3,
+                PIIX_NUM_PIRQS);
+    }
     piix3->pic = pic;
 
     (*pi440fx_state)->piix3 = piix3;
@@ -289,21 +298,6 @@  PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int *piix3_devfn,
     PCIBus *b;
 
     b = i440fx_common_init("i440FX", pi440fx_state, piix3_devfn, pic, ram_size);
-    pci_bus_irqs(b, piix3_set_irq, pci_slot_get_pirq, (*pi440fx_state)->piix3,
-                 PIIX_NUM_PIRQS);
-
-    return b;
-}
-
-PCIBus *i440fx_xen_init(PCII440FXState **pi440fx_state, int *piix3_devfn,
-                        qemu_irq *pic, ram_addr_t ram_size)
-{
-    PCIBus *b;
-
-    b = i440fx_common_init("i440FX-xen", pi440fx_state, piix3_devfn, pic, ram_size);
-    pci_bus_irqs(b, xen_piix3_set_irq, xen_pci_slot_get_pirq,
-                 (*pi440fx_state)->piix3, PIIX_NUM_PIRQS);
-
     return b;
 }
 
@@ -365,6 +359,13 @@  static void piix3_write_config(PCIDevice *dev,
     }
 }
 
+static void piix3_write_config_xen(PCIDevice *dev,
+                               uint32_t address, uint32_t val, int len)
+{
+    xen_piix_pci_write_config_client(address, val, len);
+    piix3_write_config(dev, address, val, len);
+}
+
 static void piix3_reset(void *opaque)
 {
     PIIX3State *d = opaque;
@@ -465,14 +466,6 @@  static PCIDeviceInfo i440fx_info[] = {
         .init         = i440fx_initfn,
         .config_write = i440fx_write_config,
     },{
-        .qdev.name    = "i440FX-xen",
-        .qdev.desc    = "Host bridge",
-        .qdev.size    = sizeof(PCII440FXState),
-        .qdev.vmsd    = &vmstate_i440fx,
-        .qdev.no_user = 1,
-        .init         = i440fx_initfn,
-        .config_write = i440fx_write_config_xen,
-    },{
         .qdev.name    = "PIIX3",
         .qdev.desc    = "ISA bridge",
         .qdev.size    = sizeof(PIIX3State),
@@ -482,6 +475,15 @@  static PCIDeviceInfo i440fx_info[] = {
         .init         = piix3_initfn,
         .config_write = piix3_write_config,
     },{
+        .qdev.name    = "PIIX3-xen",
+        .qdev.desc    = "ISA bridge",
+        .qdev.size    = sizeof(PIIX3State),
+        .qdev.vmsd    = &vmstate_piix3,
+        .qdev.no_user = 1,
+        .no_hotplug   = 1,
+        .init         = piix3_initfn,
+        .config_write = piix3_write_config_xen,
+    },{
         /* end of list */
     }
 };