[v2,2/3] VFIO: Disable INTx interrupt on EEH reset

Message ID 1426523486-9794-2-git-send-email-gwshan@linux.vnet.ibm.com
State New

Commit Message

Gavin Shan March 16, 2015, 4:31 p.m. UTC
When a Linux guest recovers from an EEH error on the following Emulex
adapter, the MSIx interrupts are disabled and INTx emulation is
enabled. One INTx interrupt is injected into the guest by the host
because pending INTx interrupts are detected on the adapter. In its
INTx interrupt handler, QEMU disables the mmap'ed BAR regions and
starts a timer to re-enable those regions at a later point.
Unfortunately, "VFIOPCIDevice->intx.pending" isn't cleared, so the
disabled mmap'ed BAR regions are never re-enabled. This leads to EEH
recovery failure on the guest side because of hung MMIO accesses.

 # lspci | grep Emulex
 0000:01:00.0 Ethernet controller: Emulex Corporation \
              OneConnect 10Gb NIC (be3) (rev 02)
 0000:01:00.1 Ethernet controller: Emulex Corporation \
              OneConnect 10Gb NIC (be3) (rev 02)

The patch disables the INTx interrupt before doing the EEH reset to
avoid the issue.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 hw/vfio/pci.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
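
To make the failure mode concrete, below is a minimal standalone C model of
the interaction the commit message describes. This is a sketch, not QEMU
code; it assumes (per the commit message) that a slow-path BAR access is
what normally clears "intx.pending", and that the timer only re-enables the
mmap'ed regions once the pending flag is clear.

  /* Standalone model -- illustrative only, not QEMU source. */
  #include <stdbool.h>
  #include <stdio.h>

  static bool intx_pending;
  static bool mmap_enabled = true;

  /* INTx fires: force BAR accesses onto the slow path, arm the timer. */
  static void intx_interrupt(void)
  {
      intx_pending = true;
      mmap_enabled = false;
  }

  /* A slow-path BAR access doubles as an EOI, clearing the pending state. */
  static void bar_slow_path_access(void)
  {
      intx_pending = false;
  }

  /* Timer handler: only re-enable mmap once no INTx is pending. */
  static void mmap_enable_timer(void)
  {
      if (!intx_pending) {
          mmap_enabled = true;
      }
  }

  int main(void)
  {
      /* Intended flow: INTx, guest touches a BAR, timer restores mmap. */
      intx_interrupt();
      bar_slow_path_access();
      mmap_enable_timer();
      printf("intended flow: mmap_enabled=%d\n", mmap_enabled); /* 1 */

      /* EEH case: the BAR access never reaches the slow path, so the
       * pending flag is never cleared and mmap stays disabled forever. */
      intx_interrupt();
      mmap_enable_timer();
      printf("EEH case:      mmap_enabled=%d\n", mmap_enabled); /* 0 */
      return 0;
  }

Compiled with any C compiler, the second case prints mmap_enabled=0: the
stuck state the commit message describes.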

Comments

Alex Williamson March 17, 2015, 9:16 p.m. UTC | #1
On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
> When a Linux guest recovers from an EEH error on the following Emulex
> adapter, the MSIx interrupts are disabled and INTx emulation is
> enabled. One INTx interrupt is injected into the guest by the host
> because pending INTx interrupts are detected on the adapter. In its
> INTx interrupt handler, QEMU disables the mmap'ed BAR regions and
> starts a timer to re-enable those regions at a later point.
> Unfortunately, "VFIOPCIDevice->intx.pending" isn't cleared, so the
> disabled mmap'ed BAR regions are never re-enabled. This leads to EEH
> recovery failure on the guest side because of hung MMIO accesses.
> 
>  # lspci | grep Emulex
>  0000:01:00.0 Ethernet controller: Emulex Corporation \
>               OneConnect 10Gb NIC (be3) (rev 02)
>  0000:01:00.1 Ethernet controller: Emulex Corporation \
>               OneConnect 10Gb NIC (be3) (rev 02)
> 
> The patch disables the INTx interrupt before doing the EEH reset to
> avoid the issue.
> 
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  hw/vfio/pci.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index fca1edc..bfa3d0c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>           * disable it so that it can be reenabled properly. Also,
>           * the cached MSIx table should be cleared as it's not
>           * reflecting the contents in hardware.
> +         *
> +         * We might have an INTx interrupt whose handler disables the
> +         * memory-mapped BARs. The INTx pending state can't be
> +         * cleared by memory BAR accesses taking the slow path, and
> +         * the timer kicked off by the INTx interrupt handler won't
> +         * re-enable the disabled memory-mapped BARs. That leads to
> +         * hung MMIO register accesses and EEH recovery failure, so
> +         * simply disable INTx if it has been enabled.
>           */

This feels like a quick hack for a problem we don't really understand.
Why is INTx being fired through QEMU rather than KVM?  Why isn't the
INTx re-enabling happening since this is exactly the scenario where it's
supposed to work (ie. INTx occurs, BAR mmap disabled, guest accesses
BAR, mmap re-enabled, INTx unmasked)?

>          QLIST_FOREACH(vbasedev, &group->device_list, next) {
>              vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>              }
>  
>              msix_reset(&vdev->pdev);
> +
> +            /* Disable INTx */
> +            if (vdev->interrupt == VFIO_INT_INTx) {
> +                vfio_disable_intx(vdev);
> +            }
>          }
>  
>          break;
Gavin Shan March 18, 2015, 4:54 a.m. UTC | #2
On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
>On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>> When a Linux guest recovers from an EEH error on the following Emulex
>> adapter, the MSIx interrupts are disabled and INTx emulation is
>> enabled. One INTx interrupt is injected into the guest by the host
>> because pending INTx interrupts are detected on the adapter. In its
>> INTx interrupt handler, QEMU disables the mmap'ed BAR regions and
>> starts a timer to re-enable those regions at a later point.
>> Unfortunately, "VFIOPCIDevice->intx.pending" isn't cleared, so the
>> disabled mmap'ed BAR regions are never re-enabled. This leads to EEH
>> recovery failure on the guest side because of hung MMIO accesses.
>> 
>>  # lspci | grep Emulex
>>  0000:01:00.0 Ethernet controller: Emulex Corporation \
>>               OneConnect 10Gb NIC (be3) (rev 02)
>>  0000:01:00.1 Ethernet controller: Emulex Corporation \
>>               OneConnect 10Gb NIC (be3) (rev 02)
>> 
>> The patch disables the INTx interrupt before doing the EEH reset to
>> avoid the issue.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  hw/vfio/pci.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>> 
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index fca1edc..bfa3d0c 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>           * disable it so that it can be reenabled properly. Also,
>>           * the cached MSIx table should be cleared as it's not
>>           * reflecting the contents in hardware.
>> +         *
>> +         * We might have an INTx interrupt whose handler disables the
>> +         * memory-mapped BARs. The INTx pending state can't be
>> +         * cleared by memory BAR accesses taking the slow path, and
>> +         * the timer kicked off by the INTx interrupt handler won't
>> +         * re-enable the disabled memory-mapped BARs. That leads to
>> +         * hung MMIO register accesses and EEH recovery failure, so
>> +         * simply disable INTx if it has been enabled.
>>           */
>
>This feels like a quick hack for a problem we don't really understand.
>Why is INTx being fired through QEMU rather than KVM?  Why isn't the
>INTx re-enabling happening since this is exactly the scenario where it's
>supposed to work (ie. INTx occurs, BAR mmap disabled, guest accesses
>BAR, mmap re-enabled, INTx unmasked)?
>

Indeed. It's a quick hack until I find the root cause of why the slow
path doesn't work when the fast path is disabled. I'm still tracing it
and hopefully I can find something soon. Note that KVM IRQFD isn't
enabled on the system where I was doing the experiments.

Thanks,
Gavin

>>          QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>              vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>              }
>>  
>>              msix_reset(&vdev->pdev);
>> +
>> +            /* Disable INTx */
>> +            if (vdev->interrupt == VFIO_INT_INTx) {
>> +                vfio_disable_intx(vdev);
>> +            }
>>          }
>>  
>>          break;
>
>
>
>
Gavin Shan March 20, 2015, 4:01 a.m. UTC | #3
On Wed, Mar 18, 2015 at 03:54:09PM +1100, Gavin Shan wrote:
>On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
>>On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>>> When a Linux guest recovers from an EEH error on the following Emulex
>>> adapter, the MSIx interrupts are disabled and INTx emulation is
>>> enabled. One INTx interrupt is injected into the guest by the host
>>> because pending INTx interrupts are detected on the adapter. In its
>>> INTx interrupt handler, QEMU disables the mmap'ed BAR regions and
>>> starts a timer to re-enable those regions at a later point.
>>> Unfortunately, "VFIOPCIDevice->intx.pending" isn't cleared, so the
>>> disabled mmap'ed BAR regions are never re-enabled. This leads to EEH
>>> recovery failure on the guest side because of hung MMIO accesses.
>>> 
>>>  # lspci | grep Emulex
>>>  0000:01:00.0 Ethernet controller: Emulex Corporation \
>>>               OneConnect 10Gb NIC (be3) (rev 02)
>>>  0000:01:00.1 Ethernet controller: Emulex Corporation \
>>>               OneConnect 10Gb NIC (be3) (rev 02)
>>> 
>>> The patch disables the INTx interrupt before doing the EEH reset to
>>> avoid the issue.
>>> 
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>  hw/vfio/pci.c | 13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>> 
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index fca1edc..bfa3d0c 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>           * disable it so that it can be reenabled properly. Also,
>>>           * the cached MSIx table should be cleared as it's not
>>>           * reflecting the contents in hardware.
>>> +         *
>>> +         * We might have an INTx interrupt whose handler disables the
>>> +         * memory-mapped BARs. The INTx pending state can't be
>>> +         * cleared by memory BAR accesses taking the slow path, and
>>> +         * the timer kicked off by the INTx interrupt handler won't
>>> +         * re-enable the disabled memory-mapped BARs. That leads to
>>> +         * hung MMIO register accesses and EEH recovery failure, so
>>> +         * simply disable INTx if it has been enabled.
>>>           */
>>
>>This feels like a quick hack for a problem we don't really understand.
>>Why is INTx being fired through QEMU rather than KVM?  Why isn't the
>>INTx re-enabling happening since this is exactly the scenario where it's
>>supposed to work (ie. INTx occurs, BAR mmap disabled, guest accesses
>>BAR, mmap re-enabled, INTx unmasked)?
>>
>
>Indeed. It's a quick hack until I find the root cause of why the slow
>path doesn't work when the fast path is disabled. I'm still tracing it
>and hopefully I can find something soon. Note that KVM IRQFD isn't
>enabled on the system where I was doing the experiments.
>

I'm not sure if there have been further replies to this thread since
then; my mailbox has been broken for quite a while. Also, I don't know
how Ben got dropped from the cc list. Adding him, as he might have more
thoughts. Sorry for the mail flood, Ben :-)

The root cause of the problem appears to be specific to how MMIO
regions are supported by KVM for book3s_hv, where invalid page table
entries (PTEs) are built to track the address mapping between guest
virtual addresses and host physical addresses. Since they're marked
"invalid", accessing the MMIO region causes a page fault in the host
kernel; the PTE table is then searched to locate the address mapping,
and the access is propagated to QEMU for emulation if the mapping
exists in the PTE table. That's how the "slow path" works.

For the "fast path", KVM creates a memory slot and updates the PTE
table with the mappings upon page fault. Those PTEs are "valid" and
are used by the CPU to do address translation automatically. When the
memory region for the "fast path" is disabled by QEMU, the host purges
the TLB entries and PTEs for those mappings. So once the "fast path"
regions are disabled, the PTE table no longer has the entries required
for "slow path" access. When the guest later tries to access the
region, the host simply doesn't know how to handle the fault.
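
Putting the two previous paragraphs together, the host-side fault routing
can be modeled like this (standalone sketch; "pte_state" and
"handle_mmio_fault" are hypothetical names for illustration, not kernel
APIs):

  /* Sketch of the fault routing described above -- hypothetical names. */
  #include <stdio.h>

  enum pte_state {
      PTE_ABSENT,          /* no entry at all                        */
      PTE_VALID_FAST,      /* "fast path": CPU translates directly   */
      PTE_INVALID_MMIO,    /* "slow path": marked invalid on purpose */
  };

  static const char *handle_mmio_fault(enum pte_state pte)
  {
      switch (pte) {
      case PTE_VALID_FAST:
          /* Shouldn't fault at all; the hardware walks this PTE. */
          return "resume guest";
      case PTE_INVALID_MMIO:
          /* Host finds the mapping and forwards the access to QEMU. */
          return "emulate in QEMU";
      case PTE_ABSENT:
      default:
          /* Disabling the fast-path region purged the PTEs, so the
           * slow-path lookup finds nothing: the access just hangs. */
          return "unhandled -- MMIO access hangs";
      }
  }

  int main(void)
  {
      printf("before disable: %s\n", handle_mmio_fault(PTE_INVALID_MMIO));
      printf("after purge:    %s\n", handle_mmio_fault(PTE_ABSENT));
      return 0;
  }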

I already have a couple of patches introducing an additional flag to
identify the "fast path" region, which has properties of both a RAM
region and an MMIO region. Once the region is disabled, the host
purges its mappings from the PTEs/TLB and then updates the PTEs with
the mappings required by the "slow path" region. With that, I didn't
see the problem. Both the host and QEMU need changes. I'll remove this
patch from the series and send out a new revision. The patches
addressing the "fast path" issue will be sent separately, after
checking with Ben or Paul whether they're worth sending out.
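
A possible shape for that host-side change, sketched with purely
hypothetical identifiers (the actual patches were never posted in this
thread):

  /* Hypothetical sketch of the proposed fix.  None of these
   * identifiers are real kernel or QEMU APIs. */
  #define REGION_FLAG_FASTPATH_MMIO 0x1UL

  struct region {
      unsigned long flags;
  };

  void purge_ptes_and_tlb(struct region *r);          /* assumed helper */
  void install_invalid_mmio_ptes(struct region *r);   /* assumed helper */

  void region_disable(struct region *r)
  {
      /* Always drop the valid fast-path translations first. */
      purge_ptes_and_tlb(r);

      if (r->flags & REGION_FLAG_FASTPATH_MMIO) {
          /* Re-create the "invalid" PTEs so later faults can still be
           * routed to QEMU for emulation -- i.e. re-arm the slow path
           * instead of leaving a hole. */
          install_invalid_mmio_ptes(r);
      }
  }

The flag marks regions that must fall back to slow-path ("invalid") PTEs
when disabled, rather than simply losing their mappings.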

Thanks,
Gavin

>
>>>          QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>              vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>              }
>>>  
>>>              msix_reset(&vdev->pdev);
>>> +
>>> +            /* Disable INTx */
>>> +            if (vdev->interrupt == VFIO_INT_INTx) {
>>> +                vfio_disable_intx(vdev);
>>> +            }
>>>          }
>>>  
>>>          break;
>>
>>
>>
>>
Gavin Shan March 20, 2015, 5:57 a.m. UTC | #4
On Fri, Mar 20, 2015 at 03:01:43PM +1100, Gavin Shan wrote:
>On Wed, Mar 18, 2015 at 03:54:09PM +1100, Gavin Shan wrote:
>>On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
>>>On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>>>> When a Linux guest recovers from an EEH error on the following Emulex
>>>> adapter, the MSIx interrupts are disabled and INTx emulation is
>>>> enabled. One INTx interrupt is injected into the guest by the host
>>>> because pending INTx interrupts are detected on the adapter. In its
>>>> INTx interrupt handler, QEMU disables the mmap'ed BAR regions and
>>>> starts a timer to re-enable those regions at a later point.
>>>> Unfortunately, "VFIOPCIDevice->intx.pending" isn't cleared, so the
>>>> disabled mmap'ed BAR regions are never re-enabled. This leads to EEH
>>>> recovery failure on the guest side because of hung MMIO accesses.
>>>> 
>>>>  # lspci | grep Emulex
>>>>  0000:01:00.0 Ethernet controller: Emulex Corporation \
>>>>               OneConnect 10Gb NIC (be3) (rev 02)
>>>>  0000:01:00.1 Ethernet controller: Emulex Corporation \
>>>>               OneConnect 10Gb NIC (be3) (rev 02)
>>>> 
>>>> The patch disables the INTx interrupt before doing the EEH reset to
>>>> avoid the issue.
>>>> 
>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>> ---
>>>>  hw/vfio/pci.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>> 
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index fca1edc..bfa3d0c 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>>           * disable it so that it can be reenabled properly. Also,
>>>>           * the cached MSIx table should be cleared as it's not
>>>>           * reflecting the contents in hardware.
>>>> +         *
>>>> +         * We might have an INTx interrupt whose handler disables the
>>>> +         * memory-mapped BARs. The INTx pending state can't be
>>>> +         * cleared by memory BAR accesses taking the slow path, and
>>>> +         * the timer kicked off by the INTx interrupt handler won't
>>>> +         * re-enable the disabled memory-mapped BARs. That leads to
>>>> +         * hung MMIO register accesses and EEH recovery failure, so
>>>> +         * simply disable INTx if it has been enabled.
>>>>           */
>>>
>>>This feels like a quick hack for a problem we don't really understand.
>>>Why is INTx being fired through QEMU rather than KVM?  Why isn't the
>>>INTx re-enabling happening since this is exactly the scenario where it's
>>>supposed to work (ie. INTx occurs, BAR mmap disabled, guest accesses
>>>BAR, mmap re-enabled, INTx unmasked)?
>>>
>>
>>Indeed. It's a quick hack until I find the root cause of why the slow
>>path doesn't work when the fast path is disabled. I'm still tracing it
>>and hopefully I can find something soon. Note that KVM IRQFD isn't
>>enabled on the system where I was doing the experiments.
>>
>
>I'm not sure if there have been further replies to this thread since
>then; my mailbox has been broken for quite a while. Also, I don't know
>how Ben got dropped from the cc list. Adding him, as he might have more
>thoughts. Sorry for the mail flood, Ben :-)
>
>The root cause of the problem appears to be specific to how MMIO
>regions are supported by KVM for book3s_hv, where invalid page table
>entries (PTEs) are built to track the address mapping between guest
>virtual addresses and host physical addresses. Since they're marked
>"invalid", accessing the MMIO region causes a page fault in the host
>kernel; the PTE table is then searched to locate the address mapping,
>and the access is propagated to QEMU for emulation if the mapping
>exists in the PTE table. That's how the "slow path" works.
>
>For the "fast path", KVM creates a memory slot and updates the PTE
>table with the mappings upon page fault. Those PTEs are "valid" and
>are used by the CPU to do address translation automatically. When the
>memory region for the "fast path" is disabled by QEMU, the host purges
>the TLB entries and PTEs for those mappings. So once the "fast path"
>regions are disabled, the PTE table no longer has the entries required
>for "slow path" access. When the guest later tries to access the
>region, the host simply doesn't know how to handle the fault.
>
>I already have a couple of patches introducing an additional flag to
>identify the "fast path" region, which has properties of both a RAM
>region and an MMIO region. Once the region is disabled, the host
>purges its mappings from the PTEs/TLB and then updates the PTEs with
>the mappings required by the "slow path" region. With that, I didn't
>see the problem. Both the host and QEMU need changes. I'll remove this
>patch from the series and send out a new revision. The patches
>addressing the "fast path" issue will be sent separately, after
>checking with Ben or Paul whether they're worth sending out.
>

Checked with Paul Mackerras. He already has a patch fixing the issue
in a different way :-)

Thanks,
Gavin

>
>>
>>>>          QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>>              vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>>              }
>>>>  
>>>>              msix_reset(&vdev->pdev);
>>>> +
>>>> +            /* Disable INTx */
>>>> +            if (vdev->interrupt == VFIO_INT_INTx) {
>>>> +                vfio_disable_intx(vdev);
>>>> +            }
>>>>          }
>>>>  
>>>>          break;
>>>
>>>
>>>
>>>

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index fca1edc..bfa3d0c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
          * disable it so that it can be reenabled properly. Also,
          * the cached MSIx table should be cleared as it's not
          * reflecting the contents in hardware.
+         *
+         * We might have an INTx interrupt whose handler disables the
+         * memory-mapped BARs. The INTx pending state can't be
+         * cleared by memory BAR accesses taking the slow path, and
+         * the timer kicked off by the INTx interrupt handler won't
+         * re-enable the disabled memory-mapped BARs. That leads to
+         * hung MMIO register accesses and EEH recovery failure, so
+         * simply disable INTx if it has been enabled.
          */
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
@@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
             }
 
             msix_reset(&vdev->pdev);
+
+            /* Disable INTx */
+            if (vdev->interrupt == VFIO_INT_INTx) {
+                vfio_disable_intx(vdev);
+            }
         }
 
         break;
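
For context on why the workaround in this hunk is effective at all:
vfio_disable_intx() in the QEMU of this era tears down the INTx state
wholesale. A simplified paraphrase (not the verbatim source; see the
hw/vfio/pci.c of the time for the authoritative version):

  /* Simplified paraphrase of vfio_disable_intx() -- not verbatim.
   * The relevant point: it cancels the mmap timer, clears the stale
   * pending flag, and turns the mmap'ed ("fast path") BAR regions
   * back on, which is exactly the state the stuck EEH recovery needs. */
  static void vfio_disable_intx(VFIOPCIDevice *vdev)
  {
      timer_del(vdev->intx.mmap_timer);   /* stop the re-enable timer */
      vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
      vdev->intx.pending = false;         /* drop the stale pending state */
      pci_irq_deassert(&vdev->pdev);
      vfio_mmap_set_enabled(vdev, true);  /* fast path back on */
      /* ... eventfd/notifier teardown elided ... */
  }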