diff mbox

[RFC] PCI: Introduce INTx check & mask API

Message ID 4FBDE6D6.80700@ozlabs.ru
State New
Headers show

Commit Message

Alexey Kardashevskiy May 24, 2012, 7:44 a.m. UTC
[Found while debugging VFIO on POWER but it is platform independent]

There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
PCI_STATUS registers.

And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).

I have a network adapter:
0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

pci_intx_mask_supported() reports that the feature is supported for this adapter
BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.

If I remove the check of this bit, it works fine as it is called from an interrupt handler and
Status bit check is redundant.

Opened a spec:
PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits

Comments

Jan Kiszka May 24, 2012, 12:02 p.m. UTC | #1
On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
> [Found while debugging VFIO on POWER but it is platform independent]
> 
> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
> PCI_STATUS registers.

Yes, 2.3 introduced this. Masking is done via command register, checking
if the source was the PCI in question via the status register. The
latter is important for supporting IRQ sharing - and that's why we
introduced this masking API to the PCI layer.

> 
> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
> 
> I have a network adapter:
> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 
> pci_intx_mask_supported() reports that the feature is supported for this adapter
> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
> 
> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
> Status bit check is redundant.
> 
> Opened a spec:
> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
> ===
> 3	This read-only bit reflects the state of the interrupt in the
> device/function. Only when the Interrupt Disable bit in the command
> register is a 0 and this Interrupt Status bit is a 1, will the
> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>    Disable bit to a 1 has no effect on the state of this bit.
> ===
> With this adapter, INTx# is asserted but Status bit is still 0.
> 
> Is it mandatory for a device to set Status bit if it supports INTx masking?
> 
> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
> somehow.

Since PCI 2.3, this bit is mandatory, and it should be independent of
the masking bit. The question is, if your device is supposed to support
2.3, thus is just buggy, or if our detection algorithm is unreliable. It
basically builds on the assumption that, if we can flip the mask bit,
the feature should be present. I guess that is the best we can do. Maybe
we can augment this with a blacklist of devices that "support" flipping
without actually providing the feature.

Jan
Alex Williamson May 24, 2012, 2:41 p.m. UTC | #2
On Thu, 2012-05-24 at 09:02 -0300, Jan Kiszka wrote:
> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
> > [Found while debugging VFIO on POWER but it is platform independent]
> > 
> > There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
> > PCI_STATUS registers.
> 
> Yes, 2.3 introduced this. Masking is done via command register, checking
> if the source was the PCI in question via the status register. The
> latter is important for supporting IRQ sharing - and that's why we
> introduced this masking API to the PCI layer.
> 
> > 
> > And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
> > 
> > I have a network adapter:
> > 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
> > 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
> > 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > 
> > pci_intx_mask_supported() reports that the feature is supported for this adapter
> > BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
> > never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
> > 
> > If I remove the check of this bit, it works fine as it is called from an interrupt handler and
> > Status bit check is redundant.
> > 
> > Opened a spec:
> > PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
> > ===
> > 3	This read-only bit reflects the state of the interrupt in the
> > device/function. Only when the Interrupt Disable bit in the command
> > register is a 0 and this Interrupt Status bit is a 1, will the
> > device’s/function’s INTx# signal be asserted. Setting the Interrupt
> >    Disable bit to a 1 has no effect on the state of this bit.
> > ===
> > With this adapter, INTx# is asserted but Status bit is still 0.
> > 
> > Is it mandatory for a device to set Status bit if it supports INTx masking?
> > 
> > 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
> > somehow.
> 
> Since PCI 2.3, this bit is mandatory, and it should be independent of
> the masking bit. The question is, if your device is supposed to support
> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
> basically builds on the assumption that, if we can flip the mask bit,
> the feature should be present. I guess that is the best we can do. Maybe
> we can augment this with a blacklist of devices that "support" flipping
> without actually providing the feature.

Yep, that's what I'd suggest as well, add a blacklist to
pci_intx_mask_supported() so this device returns false and we require an
exclusive interrupt for it.  Thanks,

Alex
Alexey Kardashevskiy May 25, 2012, 1:06 a.m. UTC | #3
On 25/05/12 00:41, Alex Williamson wrote:

>>> [Found while debugging VFIO on POWER but it is platform independent]
>>>
>>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>>> PCI_STATUS registers.
>>
>> Yes, 2.3 introduced this. Masking is done via command register, checking
>> if the source was the PCI in question via the status register. The
>> latter is important for supporting IRQ sharing - and that's why we
>> introduced this masking API to the PCI layer.
>>> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
>>>
>>> I have a network adapter:
>>> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>
>>> pci_intx_mask_supported() reports that the feature is supported for this adapter
>>> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
>>> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
>>>
>>> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
>>> Status bit check is redundant.
>>>
>>> Opened a spec:
>>> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
>>> ===
>>> 3	This read-only bit reflects the state of the interrupt in the
>>> device/function. Only when the Interrupt Disable bit in the command
>>> register is a 0 and this Interrupt Status bit is a 1, will the
>>> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>>>    Disable bit to a 1 has no effect on the state of this bit.
>>> ===
>>> With this adapter, INTx# is asserted but Status bit is still 0.
>>>
>>> Is it mandatory for a device to set Status bit if it supports INTx masking?
>>>
>>> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
>>> somehow.
>>
>> Since PCI 2.3, this bit is mandatory, and it should be independent of
>> the masking bit. The question is, if your device is supposed to support
>> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
>> basically builds on the assumption that, if we can flip the mask bit,
>> the feature should be present. I guess that is the best we can do. Maybe
>> we can augment this with a blacklist of devices that "support" flipping
>> without actually providing the feature.
> 
> Yep, that's what I'd suggest as well, add a blacklist to
> pci_intx_mask_supported() so this device returns false and we require an
> exclusive interrupt for it.  Thanks,

Okay, here is one for the starter:

aik@vpl2:~$ lspci -s 1:1:0.0
0001:01:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
aik@vpl2:~$ lspci -ns 1:1:0.0
0001:01:00.0 0200: 1425:0030
Alexey Kardashevskiy May 25, 2012, 1:18 a.m. UTC | #4
On 24/05/12 22:02, Jan Kiszka wrote:
> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
>> [Found while debugging VFIO on POWER but it is platform independent]
>>
>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>> PCI_STATUS registers.
> 
> Yes, 2.3 introduced this. Masking is done via command register, checking
> if the source was the PCI in question via the status register. The
> latter is important for supporting IRQ sharing - and that's why we
> introduced this masking API to the PCI layer.


Is not it just a quite small optimization to not to disable interrupts on all devices which share
the same IRQ but just on those who fired an interrupt? If so, do PCI devices really often share
IRQs? Does not supporting this mean real slowdown on such devices?

As far as I understand, everyone who cares about performance uses MSI/MSIX, no?


>> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
>>
>> I have a network adapter:
>> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>
>> pci_intx_mask_supported() reports that the feature is supported for this adapter
>> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
>> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
>>
>> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
>> Status bit check is redundant.
>>
>> Opened a spec:
>> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
>> ===
>> 3	This read-only bit reflects the state of the interrupt in the
>> device/function. Only when the Interrupt Disable bit in the command
>> register is a 0 and this Interrupt Status bit is a 1, will the
>> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>>    Disable bit to a 1 has no effect on the state of this bit.
>> ===
>> With this adapter, INTx# is asserted but Status bit is still 0.
>>
>> Is it mandatory for a device to set Status bit if it supports INTx masking?
>>
>> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
>> somehow.
> 
> Since PCI 2.3, this bit is mandatory, and it should be independent of
> the masking bit. The question is, if your device is supposed to support
> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
> basically builds on the assumption that, if we can flip the mask bit,
> the feature should be present. I guess that is the best we can do. Maybe
> we can augment this with a blacklist of devices that "support" flipping
> without actually providing the feature.

It is a good moment to start :)
Not sure where - in VFIO or along with that PCI INTx API.

Here is that broken device:
aik@vpl2:~$ lspci -s 1:1:0.0
0001:01:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
aik@vpl2:~$ lspci -ns 1:1:0.0
0001:01:00.0 0200: 1425:0030
Jan Kiszka May 25, 2012, 2:29 a.m. UTC | #5
On 2012-05-24 22:18, Alexey Kardashevskiy wrote:
> On 24/05/12 22:02, Jan Kiszka wrote:
>> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
>>> [Found while debugging VFIO on POWER but it is platform independent]
>>>
>>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>>> PCI_STATUS registers.
>>
>> Yes, 2.3 introduced this. Masking is done via command register, checking
>> if the source was the PCI in question via the status register. The
>> latter is important for supporting IRQ sharing - and that's why we
>> introduced this masking API to the PCI layer.
> 
> 
> Is not it just a quite small optimization to not to disable interrupts on all devices which share
> the same IRQ but just on those who fired an interrupt? If so, do PCI devices really often share
> IRQs? Does not supporting this mean real slowdown on such devices?
> 
> As far as I understand, everyone who cares about performance uses MSI/MSIX, no?

Not everyone is blessed with MSI-only PCI devices. From my notebook:

# cat /proc/interrupts
[...]
 22: [...] IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2

So, if I want to assign one EHCI controller to a guest, I have to
disable the other as well. The same can happen quickly if you attach a
few legacy PCI adapters to a system and want to pass them through.

> 
> 
>>> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
>>>
>>> I have a network adapter:
>>> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>
>>> pci_intx_mask_supported() reports that the feature is supported for this adapter
>>> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
>>> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
>>>
>>> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
>>> Status bit check is redundant.
>>>
>>> Opened a spec:
>>> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
>>> ===
>>> 3	This read-only bit reflects the state of the interrupt in the
>>> device/function. Only when the Interrupt Disable bit in the command
>>> register is a 0 and this Interrupt Status bit is a 1, will the
>>> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>>>    Disable bit to a 1 has no effect on the state of this bit.
>>> ===
>>> With this adapter, INTx# is asserted but Status bit is still 0.
>>>
>>> Is it mandatory for a device to set Status bit if it supports INTx masking?
>>>
>>> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
>>> somehow.
>>
>> Since PCI 2.3, this bit is mandatory, and it should be independent of
>> the masking bit. The question is, if your device is supposed to support
>> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
>> basically builds on the assumption that, if we can flip the mask bit,
>> the feature should be present. I guess that is the best we can do. Maybe
>> we can augment this with a blacklist of devices that "support" flipping
>> without actually providing the feature.
> 
> It is a good moment to start :)
> Not sure where - in VFIO or along with that PCI INTx API.

At PCI level as the API is VFIO agnostic (it was introduced for
"classic" KVM device assignment, in fact).

> 
> Here is that broken device:
> aik@vpl2:~$ lspci -s 1:1:0.0
> 0001:01:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
> aik@vpl2:~$ lspci -ns 1:1:0.0
> 0001:01:00.0 0200: 1425:0030

A patch to add the infrastructure as well would be even more welcome. :)
You could have a look at drivers/pci/quirks.c for patterns how to do this.

Jan
Alexey Kardashevskiy May 25, 2012, 2:47 a.m. UTC | #6
On 25/05/12 12:29, Jan Kiszka wrote:
> On 2012-05-24 22:18, Alexey Kardashevskiy wrote:
>> On 24/05/12 22:02, Jan Kiszka wrote:
>>> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
>>>> [Found while debugging VFIO on POWER but it is platform independent]
>>>>
>>>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>>>> PCI_STATUS registers.
>>>
>>> Yes, 2.3 introduced this. Masking is done via command register, checking
>>> if the source was the PCI in question via the status register. The
>>> latter is important for supporting IRQ sharing - and that's why we
>>> introduced this masking API to the PCI layer.
>>
>>
>> Is not it just a quite small optimization to not to disable interrupts on all devices which share
>> the same IRQ but just on those who fired an interrupt? If so, do PCI devices really often share
>> IRQs? Does not supporting this mean real slowdown on such devices?
>>
>> As far as I understand, everyone who cares about performance uses MSI/MSIX, no?
> 
> Not everyone is blessed with MSI-only PCI devices. From my notebook:
> 
> # cat /proc/interrupts
> [...]
>  22: [...] IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2
> 
> So, if I want to assign one EHCI controller to a guest, I have to
> disable the other as well. The same can happen quickly if you attach a
> few legacy PCI adapters to a system and want to pass them through.

Why? vfio-pci receives interrupt, disables it, handles it, enables interrupt back. Yes, handling is
a bit longer and includes passing interrupt to QEMU and then to the guest (can be optimized to avoid
QEMU) and waiting for EOI notification but this is all the difference.

Does the current kernel use INTx bit for your USB controllers now, without any KVM, etc?

So, is it just an optimization or it is something bigger that I missed?


>>>> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
>>>>
>>>> I have a network adapter:
>>>> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>>>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>
>>>> pci_intx_mask_supported() reports that the feature is supported for this adapter
>>>> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
>>>> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
>>>>
>>>> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
>>>> Status bit check is redundant.
>>>>
>>>> Opened a spec:
>>>> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
>>>> ===
>>>> 3	This read-only bit reflects the state of the interrupt in the
>>>> device/function. Only when the Interrupt Disable bit in the command
>>>> register is a 0 and this Interrupt Status bit is a 1, will the
>>>> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>>>>    Disable bit to a 1 has no effect on the state of this bit.
>>>> ===
>>>> With this adapter, INTx# is asserted but Status bit is still 0.
>>>>
>>>> Is it mandatory for a device to set Status bit if it supports INTx masking?
>>>>
>>>> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
>>>> somehow.
>>>
>>> Since PCI 2.3, this bit is mandatory, and it should be independent of
>>> the masking bit. The question is, if your device is supposed to support
>>> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
>>> basically builds on the assumption that, if we can flip the mask bit,
>>> the feature should be present. I guess that is the best we can do. Maybe
>>> we can augment this with a blacklist of devices that "support" flipping
>>> without actually providing the feature.
>>
>> It is a good moment to start :)
>> Not sure where - in VFIO or along with that PCI INTx API.
> 
> At PCI level as the API is VFIO agnostic (it was introduced for
> "classic" KVM device assignment, in fact).
>> Here is that broken device:
>> aik@vpl2:~$ lspci -s 1:1:0.0
>> 0001:01:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>> aik@vpl2:~$ lspci -ns 1:1:0.0
>> 0001:01:00.0 0200: 1425:0030
> 
> A patch to add the infrastructure as well would be even more welcome. :)
> You could have a look at drivers/pci/quirks.c for patterns how to do this.

I am not sure yet that we need this feature at all ;) I would rather prefer to have some way to
disable it in VFIO rather than to add yet another quirk for the feature which nobody uses at the moment.
Really, this device supports MSI/MSIX and in real life nobody is going to use INTx on it. The only
need for it is testing.
Jan Kiszka May 25, 2012, 10:43 a.m. UTC | #7
On 2012-05-24 23:47, Alexey Kardashevskiy wrote:
> On 25/05/12 12:29, Jan Kiszka wrote:
>> On 2012-05-24 22:18, Alexey Kardashevskiy wrote:
>>> On 24/05/12 22:02, Jan Kiszka wrote:
>>>> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
>>>>> [Found while debugging VFIO on POWER but it is platform independent]
>>>>>
>>>>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>>>>> PCI_STATUS registers.
>>>>
>>>> Yes, 2.3 introduced this. Masking is done via command register, checking
>>>> if the source was the PCI in question via the status register. The
>>>> latter is important for supporting IRQ sharing - and that's why we
>>>> introduced this masking API to the PCI layer.
>>>
>>>
>>> Is not it just a quite small optimization to not to disable interrupts on all devices which share
>>> the same IRQ but just on those who fired an interrupt? If so, do PCI devices really often share
>>> IRQs? Does not supporting this mean real slowdown on such devices?
>>>
>>> As far as I understand, everyone who cares about performance uses MSI/MSIX, no?
>>
>> Not everyone is blessed with MSI-only PCI devices. From my notebook:
>>
>> # cat /proc/interrupts
>> [...]
>>  22: [...] IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2
>>
>> So, if I want to assign one EHCI controller to a guest, I have to
>> disable the other as well. The same can happen quickly if you attach a
>> few legacy PCI adapters to a system and want to pass them through.
> 
> Why? vfio-pci receives interrupt, disables it, handles it, enables interrupt back. Yes, handling is
> a bit longer and includes passing interrupt to QEMU and then to the guest (can be optimized to avoid
> QEMU) and waiting for EOI notification but this is all the difference.

You can disable the complete IRQ line as you cannot predict when the
untrusted device driver that VFIO, KVM, or UIO serves will finally
decide to silence the IRQ reason in hardware. If you did this, you risk
a DoS attack on those other devices.

> 
> Does the current kernel use INTx bit for your USB controllers now, without any KVM, etc?

No, it is only used for KVM device assignment when it grabs a device and
uio_pci_generic. If a host driver uses the device, and can silence
interrupts in a device-specific way.

> 
> So, is it just an optimization or it is something bigger that I missed?

It is not an optimization but an essential feature to support INTx
sharing between an untrusted device driver and some other driver.

> 
> 
>>>>> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
>>>>>
>>>>> I have a network adapter:
>>>>> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>>>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>>>>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>
>>>>> pci_intx_mask_supported() reports that the feature is supported for this adapter
>>>>> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
>>>>> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
>>>>>
>>>>> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
>>>>> Status bit check is redundant.
>>>>>
>>>>> Opened a spec:
>>>>> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
>>>>> ===
>>>>> 3	This read-only bit reflects the state of the interrupt in the
>>>>> device/function. Only when the Interrupt Disable bit in the command
>>>>> register is a 0 and this Interrupt Status bit is a 1, will the
>>>>> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>>>>>    Disable bit to a 1 has no effect on the state of this bit.
>>>>> ===
>>>>> With this adapter, INTx# is asserted but Status bit is still 0.
>>>>>
>>>>> Is it mandatory for a device to set Status bit if it supports INTx masking?
>>>>>
>>>>> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
>>>>> somehow.
>>>>
>>>> Since PCI 2.3, this bit is mandatory, and it should be independent of
>>>> the masking bit. The question is, if your device is supposed to support
>>>> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
>>>> basically builds on the assumption that, if we can flip the mask bit,
>>>> the feature should be present. I guess that is the best we can do. Maybe
>>>> we can augment this with a blacklist of devices that "support" flipping
>>>> without actually providing the feature.
>>>
>>> It is a good moment to start :)
>>> Not sure where - in VFIO or along with that PCI INTx API.
>>
>> At PCI level as the API is VFIO agnostic (it was introduced for
>> "classic" KVM device assignment, in fact).
>>> Here is that broken device:
>>> aik@vpl2:~$ lspci -s 1:1:0.0
>>> 0001:01:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>> aik@vpl2:~$ lspci -ns 1:1:0.0
>>> 0001:01:00.0 0200: 1425:0030
>>
>> A patch to add the infrastructure as well would be even more welcome. :)
>> You could have a look at drivers/pci/quirks.c for patterns how to do this.
> 
> I am not sure yet that we need this feature at all ;) I would rather prefer to have some way to
> disable it in VFIO rather than to add yet another quirk for the feature which nobody uses at the moment.
> Really, this device supports MSI/MSIX and in real life nobody is going to use INTx on it. The only
> need for it is testing.

These are wrong assumptions, both that it has to be addressed at VFIO
level and that it has no serious use case. We will need this feature for
quite a while until legacy PCI finally died. Bets are taken when this
will happen, but I would be careful with any date in this decade. ;)

Jan
Alexey Kardashevskiy May 25, 2012, 11:26 a.m. UTC | #8
25.05.2012 20:43, Jan Kiszka написал:
> On 2012-05-24 23:47, Alexey Kardashevskiy wrote:
>> On 25/05/12 12:29, Jan Kiszka wrote:
>>> On 2012-05-24 22:18, Alexey Kardashevskiy wrote:
>>>> On 24/05/12 22:02, Jan Kiszka wrote:
>>>>> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
>>>>>> [Found while debugging VFIO on POWER but it is platform independent]
>>>>>>
>>>>>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>>>>>> PCI_STATUS registers.
>>>>>
>>>>> Yes, 2.3 introduced this. Masking is done via command register, checking
>>>>> if the source was the PCI in question via the status register. The
>>>>> latter is important for supporting IRQ sharing - and that's why we
>>>>> introduced this masking API to the PCI layer.
>>>>
>>>>
>>>> Is not it just a quite small optimization to not to disable interrupts on all devices which share
>>>> the same IRQ but just on those who fired an interrupt? If so, do PCI devices really often share
>>>> IRQs? Does not supporting this mean real slowdown on such devices?
>>>>
>>>> As far as I understand, everyone who cares about performance uses MSI/MSIX, no?
>>>
>>> Not everyone is blessed with MSI-only PCI devices. From my notebook:
>>>
>>> # cat /proc/interrupts
>>> [...]
>>>  22: [...] IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2
>>>
>>> So, if I want to assign one EHCI controller to a guest, I have to
>>> disable the other as well. The same can happen quickly if you attach a
>>> few legacy PCI adapters to a system and want to pass them through.
>>
>> Why? vfio-pci receives interrupt, disables it, handles it, enables interrupt back. Yes, handling is
>> a bit longer and includes passing interrupt to QEMU and then to the guest (can be optimized to avoid
>> QEMU) and waiting for EOI notification but this is all the difference.
> 
> You can disable the complete IRQ line as you cannot predict when the
> untrusted device driver that VFIO, KVM, or UIO serves will finally
> decide to silence the IRQ reason in hardware. If you did this, you risk
> a DoS attack on those other devices.


Untrusted device still can pull down (or up? do not remember :) )
hardware INT# line, stop other devices on this line and you cannot do
anything about it. How does INTx help if the device is that broken?


>> Does the current kernel use INTx bit for your USB controllers now, without any KVM, etc?
> 
> No, it is only used for KVM device assignment when it grabs a device and
> uio_pci_generic. If a host driver uses the device, and can silence
> interrupts in a device-specific way.
> 
>>
>> So, is it just an optimization or it is something bigger that I missed?
> 
> It is not an optimization but an essential feature to support INTx
> sharing between an untrusted device driver and some other driver.

So you propose to trust every hardware adapter and only some drivers, is
it what you are saying?

And I thought it is all for the kernel to understand what device called
interrupt and disable it without calling every device which uses the
same line, and that's it. Am I wrong?


>>>>>> And there is some API to support that (commit a2e27787f893621c5a6b865acf6b7766f8671328).
>>>>>>
>>>>>> I have a network adapter:
>>>>>> 0001:00:01.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>>>>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>>>>>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>>
>>>>>> pci_intx_mask_supported() reports that the feature is supported for this adapter
>>>>>> BUT the adapter does not set PCI_STATUS_INTERRUPT so pci_check_and_set_intx_mask()
>>>>>> never changes PCI_COMMAND and INTx does not work on it when we use it as VFIO-PCI device.
>>>>>>
>>>>>> If I remove the check of this bit, it works fine as it is called from an interrupt handler and
>>>>>> Status bit check is redundant.
>>>>>>
>>>>>> Opened a spec:
>>>>>> PCI LOCAL BUS SPECIFICATION, REV. 3.0, Table 6-2: Status Register Bits
>>>>>> ===
>>>>>> 3	This read-only bit reflects the state of the interrupt in the
>>>>>> device/function. Only when the Interrupt Disable bit in the command
>>>>>> register is a 0 and this Interrupt Status bit is a 1, will the
>>>>>> device’s/function’s INTx# signal be asserted. Setting the Interrupt
>>>>>>    Disable bit to a 1 has no effect on the state of this bit.
>>>>>> ===
>>>>>> With this adapter, INTx# is asserted but Status bit is still 0.
>>>>>>
>>>>>> Is it mandatory for a device to set Status bit if it supports INTx masking?
>>>>>>
>>>>>> 2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
>>>>>> somehow.
>>>>>
>>>>> Since PCI 2.3, this bit is mandatory, and it should be independent of
>>>>> the masking bit. The question is, if your device is supposed to support
>>>>> 2.3, thus is just buggy, or if our detection algorithm is unreliable. It
>>>>> basically builds on the assumption that, if we can flip the mask bit,
>>>>> the feature should be present. I guess that is the best we can do. Maybe
>>>>> we can augment this with a blacklist of devices that "support" flipping
>>>>> without actually providing the feature.
>>>>
>>>> It is a good moment to start :)
>>>> Not sure where - in VFIO or along with that PCI INTx API.
>>>
>>> At PCI level as the API is VFIO agnostic (it was introduced for
>>> "classic" KVM device assignment, in fact).
>>>> Here is that broken device:
>>>> aik@vpl2:~$ lspci -s 1:1:0.0
>>>> 0001:01:00.0 Ethernet controller: Chelsio Communications Inc T310 10GbE Single Port Adapter
>>>> aik@vpl2:~$ lspci -ns 1:1:0.0
>>>> 0001:01:00.0 0200: 1425:0030
>>>
>>> A patch to add the infrastructure as well would be even more welcome. :)
>>> You could have a look at drivers/pci/quirks.c for patterns how to do this.
>>
>> I am not sure yet that we need this feature at all ;) I would rather prefer to have some way to
>> disable it in VFIO rather than to add yet another quirk for the feature which nobody uses at the moment.
>> Really, this device supports MSI/MSIX and in real life nobody is going to use INTx on it. The only
>> need for it is testing.
> 
> These are wrong assumptions, both that it has to be addressed at VFIO
> level and that it has no serious use case. We will need this feature for
> quite a while until legacy PCI finally died. Bets are taken when this
> will happen, but I would be careful with any date in this decade. ;)

Heh. I bet that legacy PCI will never be a serious target for legacy PCI ;)

I really do not understand. You want to do PCI pass through for old
devices which share INT# line and can screw the devices they share
interrupt with. And you trust drivers of devices which support INTx.
I believe that this is not enough for trust, you need hardware isolation.
At least, you should put all the devices which share the same IRQ to one
IOMMU group.
Jan Kiszka May 25, 2012, 11:31 a.m. UTC | #9
On 2012-05-25 08:26, Alexey Kardashevskiy wrote:
> 25.05.2012 20:43, Jan Kiszka написал:
>> On 2012-05-24 23:47, Alexey Kardashevskiy wrote:
>>> On 25/05/12 12:29, Jan Kiszka wrote:
>>>> On 2012-05-24 22:18, Alexey Kardashevskiy wrote:
>>>>> On 24/05/12 22:02, Jan Kiszka wrote:
>>>>>> On 2012-05-24 04:44, Alexey Kardashevskiy wrote:
>>>>>>> [Found while debugging VFIO on POWER but it is platform independent]
>>>>>>>
>>>>>>> There is a feature in PCI (>=2.3?) to mask/unmask INTx via PCI_COMMAND and
>>>>>>> PCI_STATUS registers.
>>>>>>
>>>>>> Yes, 2.3 introduced this. Masking is done via command register, checking
>>>>>> if the source was the PCI in question via the status register. The
>>>>>> latter is important for supporting IRQ sharing - and that's why we
>>>>>> introduced this masking API to the PCI layer.
>>>>>
>>>>>
>>>>> Is not it just a quite small optimization to not to disable interrupts on all devices which share
>>>>> the same IRQ but just on those who fired an interrupt? If so, do PCI devices really often share
>>>>> IRQs? Does not supporting this mean real slowdown on such devices?
>>>>>
>>>>> As far as I understand, everyone who cares about performance uses MSI/MSIX, no?
>>>>
>>>> Not everyone is blessed with MSI-only PCI devices. From my notebook:
>>>>
>>>> # cat /proc/interrupts
>>>> [...]
>>>>  22: [...] IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2
>>>>
>>>> So, if I want to assign one EHCI controller to a guest, I have to
>>>> disable the other as well. The same can happen quickly if you attach a
>>>> few legacy PCI adapters to a system and want to pass them through.
>>>
>>> Why? vfio-pci receives interrupt, disables it, handles it, enables interrupt back. Yes, handling is
>>> a bit longer and includes passing interrupt to QEMU and then to the guest (can be optimized to avoid
>>> QEMU) and waiting for EOI notification but this is all the difference.
>>
>> You can disable the complete IRQ line as you cannot predict when the
>> untrusted device driver that VFIO, KVM, or UIO serves will finally
>> decide to silence the IRQ reason in hardware. If you did this, you risk
>> a DoS attack on those other devices.
> 
> 
> Untrusted device still can pull down (or up? do not remember :) )
> hardware INT# line, stop other devices on this line and you cannot do
> anything about it. How does INTx help if the device is that broken?

If the untrusted device truly complies to PCI 2.3, we can stop it from
pulling that line by setting the generic INTx mask bit. That's the whole
reason for this exercise here.

Jan
Benjamin Herrenschmidt June 5, 2012, 1:39 a.m. UTC | #10
> Yep, that's what I'd suggest as well, add a blacklist to
> pci_intx_mask_supported() so this device returns false and we require an
> exclusive interrupt for it.  Thanks,

BTW, we should consider supporting an MSI-only option for guests as
well:

LSIs are a problem for virtualization, especially when we start
having things like expander racks with slots behind bridges etc, and
in some case it's better to support an MSI only setup rather than
not support the virtualizing the devices at all (or at least in
different partitions).

However, to do that, we either need to ensure the device can't be
coerced by SW to still assert the LSI and cause trouble. This can be
dealt with two ways I have in mind:

 - If we don't use any of those 4 interrupts lines at all (ie, we use no
LSI on the host bridge and they aren't shared with another bridge
etc...), we can just mask them out in the main PIC. On Power there's no
sharing between interrupt sources from different host bridges so that
would work for us

 - If the intermediary P2P bridge has a feature to block incoming LSIs
from children (I've heard that exists, is that standard ? I haven't
looked in the latest specs)

There's a third one:

 - If you trust the device own mask bit

 ... But this is fishy since many devices -will- have some kind of
backdoor via MMIO to bypass (or alter) the config space setting. In some
cases the driver can even completely replace the firmware inside the
device and do pretty much whatever it wants :-)

The main thing is how do we "represent" both in term of interface to
qemu and qemu -> kernel interface wanting such a "no LSI" setup... not
sure about that one.

Cheers,
Ben.
Benjamin Herrenschmidt June 5, 2012, 1:39 a.m. UTC | #11
On Thu, 2012-05-24 at 09:02 -0300, Jan Kiszka wrote:
> 
> Since PCI 2.3, this bit is mandatory, and it should be independent of
> the masking bit. The question is, if your device is supposed to support
> 2.3, thus is just buggy, 

It's a PCI Express device :-)

> or if our detection algorithm is unreliable. It
> basically builds on the assumption that, if we can flip the mask bit,
> the feature should be present. I guess that is the best we can do. Maybe
> we can augment this with a blacklist of devices that "support" flipping
> without actually providing the feature.

Cheers,
Ben.
Alexander Graf June 5, 2012, 1:44 a.m. UTC | #12
On 05.06.2012, at 03:39, Benjamin Herrenschmidt wrote:

> 
>> Yep, that's what I'd suggest as well, add a blacklist to
>> pci_intx_mask_supported() so this device returns false and we require an
>> exclusive interrupt for it.  Thanks,
> 
> BTW, we should consider supporting an MSI-only option for guests as
> well:
> 
> LSIs are a problem for virtualization, especially when we start
> having things like expander racks with slots behind bridges etc, and
> in some case it's better to support an MSI only setup rather than
> not support the virtualizing the devices at all (or at least in
> different partitions).
> 
> However, to do that, we either need to ensure the device can't be
> coerced by SW to still assert the LSI and cause trouble. This can be
> dealt with two ways I have in mind:
> 
> - If we don't use any of those 4 interrupts lines at all (ie, we use no
> LSI on the host bridge and they aren't shared with another bridge
> etc...), we can just mask them out in the main PIC. On Power there's no
> sharing between interrupt sources from different host bridges so that
> would work for us
> 
> - If the intermediary P2P bridge has a feature to block incoming LSIs
> from children (I've heard that exists, is that standard ? I haven't
> looked in the latest specs)
> 
> There's a third one:
> 
> - If you trust the device own mask bit
> 
> ... But this is fishy since many devices -will- have some kind of
> backdoor via MMIO to bypass (or alter) the config space setting. In some
> cases the driver can even completely replace the firmware inside the
> device and do pretty much whatever it wants :-)
> 
> The main thing is how do we "represent" both in term of interface to
> qemu and qemu -> kernel interface wanting such a "no LSI" setup... not
> sure about that one.

Wouldn't the "no LSI" setting be a flag to the vfio configuration? So when you set up the device group, you say "this group can only do MSI". That way the interface would be sysfs and QEMU wouldn't have anything to do with it :)


Alex
Benjamin Herrenschmidt June 5, 2012, 2:56 a.m. UTC | #13
On Tue, 2012-06-05 at 03:44 +0200, Alexander Graf wrote:
> Wouldn't the "no LSI" setting be a flag to the vfio configuration? So
> when you set up the device group, you say "this group can only do
> MSI". That way the interface would be sysfs and QEMU wouldn't have
> anything to do with it :)

Sure whatever ;-)

There need to be some validity checking in the kernel to see if we can
safely mask those interrupts too, etc...

Cheers,
Ben.
diff mbox

Patch

===
3	This read-only bit reflects the state of the interrupt in the
device/function. Only when the Interrupt Disable bit in the command
register is a 0 and this Interrupt Status bit is a 1, will the
device’s/function’s INTx# signal be asserted. Setting the Interrupt
   Disable bit to a 1 has no effect on the state of this bit.
===
With this adapter, INTx# is asserted but Status bit is still 0.

Is it mandatory for a device to set Status bit if it supports INTx masking?

2 Alex: if it is mandatory, then we need to be able to disable pci_2_3 in VFIO-PCI
somehow.


Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/pci/pci.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ab6c2a6..ee4c804 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2978,60 +2978,60 @@  EXPORT_SYMBOL_GPL(pci_intx_mask_supported);
 
 static bool pci_check_and_set_intx_mask(struct pci_dev *dev, bool mask)
 {
 	struct pci_bus *bus = dev->bus;
 	bool mask_updated = true;
 	u32 cmd_status_dword;
 	u16 origcmd, newcmd;
 	unsigned long flags;
 	bool irq_pending;
 
 	/*
 	 * We do a single dword read to retrieve both command and status.
 	 * Document assumptions that make this possible.
 	 */
 	BUILD_BUG_ON(PCI_COMMAND % 4);
 	BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
 
 	raw_spin_lock_irqsave(&pci_lock, flags);
 
 	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
 
 	irq_pending = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
 
 	/*
 	 * Check interrupt status register to see whether our device
 	 * triggered the interrupt (when masking) or the next IRQ is
 	 * already pending (when unmasking).
 	 */
-	if (mask != irq_pending) {
+/*	if (mask != irq_pending) {
 		mask_updated = false;
 		goto done;
-	}
+	}*/
 
 	origcmd = cmd_status_dword;
 	newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
 	if (mask)
 		newcmd |= PCI_COMMAND_INTX_DISABLE;
 	if (newcmd != origcmd)
 		bus->ops->write(bus, dev->devfn, PCI_COMMAND, 2, newcmd);
 
 done:
 	raw_spin_unlock_irqrestore(&pci_lock, flags);
 
 	return mask_updated;
 }
 
 /**
  * pci_check_and_mask_intx - mask INTx on pending interrupt
  * @dev: the PCI device to operate on
  *
  * Check if the device dev has its INTx line asserted, mask it and
  * return true in that case. False is returned if not interrupt was
  * pending.
  */
 bool pci_check_and_mask_intx(struct pci_dev *dev)
 {
 	return pci_check_and_set_intx_mask(dev, true);
 }
 EXPORT_SYMBOL_GPL(pci_check_and_mask_intx);