Message ID | 1426523486-9794-2-git-send-email-gwshan@linux.vnet.ibm.com
---|---
State | New
On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
> When a Linux guest recovers from an EEH error on the following Emulex
> adapter, the MSIx interrupts are disabled and INTx emulation is enabled.
> One INTx interrupt is injected into the guest by the host because pending
> INTx interrupts are detected on the adapter. QEMU disables the mmap'ed
> BAR regions and, in the INTx interrupt handler, starts a timer to
> re-enable those regions at a later point. Unfortunately,
> "VFIOPCIDevice->intx.pending" isn't cleared, meaning the disabled
> mmap'ed BAR regions won't be re-enabled properly. This leads to EEH
> recovery failure on the guest side because of hung MMIO accesses.
>
>   # lspci | grep Emulex
>   0000:01:00.0 Ethernet controller: Emulex Corporation \
>                OneConnect 10Gb NIC (be3) (rev 02)
>   0000:01:00.1 Ethernet controller: Emulex Corporation \
>                OneConnect 10Gb NIC (be3) (rev 02)
>
> The patch disables the INTx interrupt before doing the EEH reset to
> avoid the issue.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  hw/vfio/pci.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index fca1edc..bfa3d0c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>       * disable it so that it can be reenabled properly. Also,
>       * the cached MSIx table should be cleared as it's not
>       * reflecting the contents in hardware.
> +     *
> +     * We might have an INTx interrupt whose handler disables the
> +     * memory mapped BARs. The INTx pending state can't be cleared
> +     * with a memory BAR access in the slow path. The timer kicked
> +     * off by the INTx interrupt handler won't re-enable those
> +     * disabled memory mapped BARs, which leads to hung MMIO
> +     * register access and EEH recovery failure. We simply disable
> +     * INTx if it has been enabled.
>       */

This feels like a quick hack for a problem we don't really understand.
Why is INTx being fired through QEMU rather than KVM? Why isn't the INTx
re-enabling happening, since this is exactly the scenario where it's
supposed to work (i.e. INTx occurs, BAR mmap disabled, guest accesses
BAR, mmap re-enabled, INTx unmasked)?

>      QLIST_FOREACH(vbasedev, &group->device_list, next) {
>          vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>          }
>
>          msix_reset(&vdev->pdev);
> +
> +        /* Disable INTx */
> +        if (vdev->interrupt == VFIO_INT_INTx) {
> +            vfio_disable_intx(vdev);
> +        }
>      }
>
>      break;
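The scenario Alex expects to work can be reduced to three pieces of state:
the pending flag, the BAR mmap enable, and the deferred re-enable timer.
The self-contained toy model below sketches that intended loop; names and
structure are illustrative only and are not taken from QEMU.

/*
 * Toy model of the intended INTx slow-path recovery loop.
 */
#include <stdbool.h>
#include <stdio.h>

struct intx_state {
    bool pending;       /* level interrupt forwarded to the guest       */
    bool mmap_enabled;  /* fast path: BARs mapped directly to the guest */
    bool timer_armed;   /* deferred re-enable of the mmap'ed BARs       */
};

/* Host signals INTx: trap BAR accesses so the EOI can be observed. */
static void intx_fired(struct intx_state *s)
{
    s->pending = true;
    s->mmap_enabled = false;
}

/* Guest touches a BAR through the slow (trapped) path. */
static void bar_slow_path_access(struct intx_state *s)
{
    if (s->pending) {
        s->pending = false;     /* treated as the EOI           */
        s->timer_armed = true;  /* schedule fast-path re-enable */
    }
}

/* Timer callback: only helps if the pending flag really got cleared. */
static void mmap_timer_expired(struct intx_state *s)
{
    if (!s->pending) {
        s->mmap_enabled = true;
        s->timer_armed = false;
    }
}

int main(void)
{
    struct intx_state s = { .mmap_enabled = true };

    intx_fired(&s);             /* INTx occurs, BAR mmap disabled      */
    bar_slow_path_access(&s);   /* guest access clears the pending bit */
    mmap_timer_expired(&s);     /* ... so the mmap comes back          */
    printf("mmap re-enabled: %d\n", s.mmap_enabled);   /* prints 1 */

    /*
     * The reported failure is the case where bar_slow_path_access()
     * never completes (the slow path itself hangs), so pending stays
     * set and the mmap'ed BARs are never re-enabled.
     */
    return 0;
}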
On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
> On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>> When a Linux guest recovers from an EEH error on the following Emulex
>> adapter, the MSIx interrupts are disabled and INTx emulation is enabled.
>> One INTx interrupt is injected into the guest by the host because pending
>> INTx interrupts are detected on the adapter. QEMU disables the mmap'ed
>> BAR regions and, in the INTx interrupt handler, starts a timer to
>> re-enable those regions at a later point. Unfortunately,
>> "VFIOPCIDevice->intx.pending" isn't cleared, meaning the disabled
>> mmap'ed BAR regions won't be re-enabled properly. This leads to EEH
>> recovery failure on the guest side because of hung MMIO accesses.
>>
>>   # lspci | grep Emulex
>>   0000:01:00.0 Ethernet controller: Emulex Corporation \
>>                OneConnect 10Gb NIC (be3) (rev 02)
>>   0000:01:00.1 Ethernet controller: Emulex Corporation \
>>                OneConnect 10Gb NIC (be3) (rev 02)
>>
>> The patch disables the INTx interrupt before doing the EEH reset to
>> avoid the issue.
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  hw/vfio/pci.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index fca1edc..bfa3d0c 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>       * disable it so that it can be reenabled properly. Also,
>>       * the cached MSIx table should be cleared as it's not
>>       * reflecting the contents in hardware.
>> +     *
>> +     * We might have an INTx interrupt whose handler disables the
>> +     * memory mapped BARs. The INTx pending state can't be cleared
>> +     * with a memory BAR access in the slow path. The timer kicked
>> +     * off by the INTx interrupt handler won't re-enable those
>> +     * disabled memory mapped BARs, which leads to hung MMIO
>> +     * register access and EEH recovery failure. We simply disable
>> +     * INTx if it has been enabled.
>>       */
>
> This feels like a quick hack for a problem we don't really understand.
> Why is INTx being fired through QEMU rather than KVM? Why isn't the INTx
> re-enabling happening, since this is exactly the scenario where it's
> supposed to work (i.e. INTx occurs, BAR mmap disabled, guest accesses
> BAR, mmap re-enabled, INTx unmasked)?
>

Indeed. It's a quick hack, made before finding the root cause of why the
slow path doesn't work when the fast path is disabled. I'm still tracing
it and hopefully I can find something soon. Note that KVM IRQFD isn't
enabled on the system where I was doing the experiments.

Thanks,
Gavin

>>      QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>          vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>          }
>>
>>          msix_reset(&vdev->pdev);
>> +
>> +        /* Disable INTx */
>> +        if (vdev->interrupt == VFIO_INT_INTx) {
>> +            vfio_disable_intx(vdev);
>> +        }
>>      }
>>
>>      break;
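Gavin's note about KVM IRQFD matters because the INTx bypass only applies
when KVM can both inject the interrupt (irqfd) and handle the EOI/unmask
(resamplefd); otherwise every INTx takes the userspace path modelled above.
The snippet below is a toy illustration of that decision, not the real QEMU
check: the two flags are stand-ins for the capability accessors.

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    bool irqfds_enabled = false;      /* per Gavin: IRQFD not enabled here */
    bool resamplefds_enabled = false; /* EOI/unmask in the kernel          */

    if (irqfds_enabled && resamplefds_enabled) {
        printf("INTx handled in KVM: BAR mmaps can stay enabled\n");
    } else {
        printf("INTx handled in QEMU: the pending/mmap dance applies\n");
    }
    return 0;
}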
On Wed, Mar 18, 2015 at 03:54:09PM +1100, Gavin Shan wrote:
> On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
>> On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>>> When a Linux guest recovers from an EEH error on the following Emulex
>>> adapter, the MSIx interrupts are disabled and INTx emulation is enabled.
>>> One INTx interrupt is injected into the guest by the host because pending
>>> INTx interrupts are detected on the adapter. QEMU disables the mmap'ed
>>> BAR regions and, in the INTx interrupt handler, starts a timer to
>>> re-enable those regions at a later point. Unfortunately,
>>> "VFIOPCIDevice->intx.pending" isn't cleared, meaning the disabled
>>> mmap'ed BAR regions won't be re-enabled properly. This leads to EEH
>>> recovery failure on the guest side because of hung MMIO accesses.
>>>
>>>   # lspci | grep Emulex
>>>   0000:01:00.0 Ethernet controller: Emulex Corporation \
>>>                OneConnect 10Gb NIC (be3) (rev 02)
>>>   0000:01:00.1 Ethernet controller: Emulex Corporation \
>>>                OneConnect 10Gb NIC (be3) (rev 02)
>>>
>>> The patch disables the INTx interrupt before doing the EEH reset to
>>> avoid the issue.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>  hw/vfio/pci.c | 13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index fca1edc..bfa3d0c 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>       * disable it so that it can be reenabled properly. Also,
>>>       * the cached MSIx table should be cleared as it's not
>>>       * reflecting the contents in hardware.
>>> +     *
>>> +     * We might have an INTx interrupt whose handler disables the
>>> +     * memory mapped BARs. The INTx pending state can't be cleared
>>> +     * with a memory BAR access in the slow path. The timer kicked
>>> +     * off by the INTx interrupt handler won't re-enable those
>>> +     * disabled memory mapped BARs, which leads to hung MMIO
>>> +     * register access and EEH recovery failure. We simply disable
>>> +     * INTx if it has been enabled.
>>>       */
>>
>> This feels like a quick hack for a problem we don't really understand.
>> Why is INTx being fired through QEMU rather than KVM? Why isn't the INTx
>> re-enabling happening, since this is exactly the scenario where it's
>> supposed to work (i.e. INTx occurs, BAR mmap disabled, guest accesses
>> BAR, mmap re-enabled, INTx unmasked)?
>>
>
> Indeed. It's a quick hack, made before finding the root cause of why the
> slow path doesn't work when the fast path is disabled. I'm still tracing
> it and hopefully I can find something soon. Note that KVM IRQFD isn't
> enabled on the system where I was doing the experiments.
>

I'm not sure whether there have been further replies to this thread since
then; my mailbox has been broken for quite a while. Also, I don't know how
Ben got dropped from the cc list. Adding him, as he might have more
thoughts. Sorry for the mail flood, Ben :-)

The root cause of the problem looks specific to how MMIO regions are
supported by KVM for book3s_hv, where invalid page table entries (PTEs)
are built to track the address mapping between guest virtual addresses
and host physical addresses. Since they're marked as "invalid", an access
to the MMIO region causes a page fault in the host kernel, the PTE table
is then searched to locate the address mapping, and the access is
propagated to QEMU for emulation if the address mapping exists in the PTE
table. That's how the "slow path" works. For the "fast path", KVM creates
a memory slot and updates the PTE table with the mappings upon page fault.
Those PTEs are "valid" and are used by the CPU to do the address
translation automatically. When the memory region backing the "fast path"
is disabled from QEMU, the host purges the TLB entries/PTEs for the
mappings. So once the "fast path" regions are disabled, the PTE table no
longer has the entries required for "slow path" access. When the guest
later tries to access the region, the host simply doesn't know how to
handle the case.

I already have a couple of patches introducing an additional flag to
identify the "fast path" region, which has properties of both a RAM
region and an MMIO region. Once the region is disabled, the host purges
its mappings from the PTEs/TLB and then updates the PTEs with the
mappings required by the "slow path" region. With that, I didn't see the
problem. Both the host and QEMU need changes. I'll remove this patch from
the series and send out a new revision. The patches addressing the "fast
path" issue will be sent separately after checking with Ben or Paul
whether they're worth sending out.

Thanks,
Gavin

>
>>>      QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>          vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>          }
>>>
>>>          msix_reset(&vdev->pdev);
>>> +
>>> +        /* Disable INTx */
>>> +        if (vdev->interrupt == VFIO_INT_INTx) {
>>> +            vfio_disable_intx(vdev);
>>> +        }
>>>      }
>>>
>>>      break;
On Fri, Mar 20, 2015 at 03:01:43PM +1100, Gavin Shan wrote:
> On Wed, Mar 18, 2015 at 03:54:09PM +1100, Gavin Shan wrote:
>> On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
>>> On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>>>> When a Linux guest recovers from an EEH error on the following Emulex
>>>> adapter, the MSIx interrupts are disabled and INTx emulation is enabled.
>>>> One INTx interrupt is injected into the guest by the host because pending
>>>> INTx interrupts are detected on the adapter. QEMU disables the mmap'ed
>>>> BAR regions and, in the INTx interrupt handler, starts a timer to
>>>> re-enable those regions at a later point. Unfortunately,
>>>> "VFIOPCIDevice->intx.pending" isn't cleared, meaning the disabled
>>>> mmap'ed BAR regions won't be re-enabled properly. This leads to EEH
>>>> recovery failure on the guest side because of hung MMIO accesses.
>>>>
>>>>   # lspci | grep Emulex
>>>>   0000:01:00.0 Ethernet controller: Emulex Corporation \
>>>>                OneConnect 10Gb NIC (be3) (rev 02)
>>>>   0000:01:00.1 Ethernet controller: Emulex Corporation \
>>>>                OneConnect 10Gb NIC (be3) (rev 02)
>>>>
>>>> The patch disables the INTx interrupt before doing the EEH reset to
>>>> avoid the issue.
>>>>
>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>> ---
>>>>  hw/vfio/pci.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index fca1edc..bfa3d0c 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>>       * disable it so that it can be reenabled properly. Also,
>>>>       * the cached MSIx table should be cleared as it's not
>>>>       * reflecting the contents in hardware.
>>>> +     *
>>>> +     * We might have an INTx interrupt whose handler disables the
>>>> +     * memory mapped BARs. The INTx pending state can't be cleared
>>>> +     * with a memory BAR access in the slow path. The timer kicked
>>>> +     * off by the INTx interrupt handler won't re-enable those
>>>> +     * disabled memory mapped BARs, which leads to hung MMIO
>>>> +     * register access and EEH recovery failure. We simply disable
>>>> +     * INTx if it has been enabled.
>>>>       */
>>>
>>> This feels like a quick hack for a problem we don't really understand.
>>> Why is INTx being fired through QEMU rather than KVM? Why isn't the INTx
>>> re-enabling happening, since this is exactly the scenario where it's
>>> supposed to work (i.e. INTx occurs, BAR mmap disabled, guest accesses
>>> BAR, mmap re-enabled, INTx unmasked)?
>>>
>>
>> Indeed. It's a quick hack, made before finding the root cause of why the
>> slow path doesn't work when the fast path is disabled. I'm still tracing
>> it and hopefully I can find something soon. Note that KVM IRQFD isn't
>> enabled on the system where I was doing the experiments.
>>
>
> I'm not sure whether there have been further replies to this thread since
> then; my mailbox has been broken for quite a while. Also, I don't know how
> Ben got dropped from the cc list. Adding him, as he might have more
> thoughts. Sorry for the mail flood, Ben :-)
>
> The root cause of the problem looks specific to how MMIO regions are
> supported by KVM for book3s_hv, where invalid page table entries (PTEs)
> are built to track the address mapping between guest virtual addresses
> and host physical addresses. Since they're marked as "invalid", an access
> to the MMIO region causes a page fault in the host kernel, the PTE table
> is then searched to locate the address mapping, and the access is
> propagated to QEMU for emulation if the address mapping exists in the PTE
> table. That's how the "slow path" works. For the "fast path", KVM creates
> a memory slot and updates the PTE table with the mappings upon page fault.
> Those PTEs are "valid" and are used by the CPU to do the address
> translation automatically. When the memory region backing the "fast path"
> is disabled from QEMU, the host purges the TLB entries/PTEs for the
> mappings. So once the "fast path" regions are disabled, the PTE table no
> longer has the entries required for "slow path" access. When the guest
> later tries to access the region, the host simply doesn't know how to
> handle the case.
>
> I already have a couple of patches introducing an additional flag to
> identify the "fast path" region, which has properties of both a RAM
> region and an MMIO region. Once the region is disabled, the host purges
> its mappings from the PTEs/TLB and then updates the PTEs with the
> mappings required by the "slow path" region. With that, I didn't see the
> problem. Both the host and QEMU need changes. I'll remove this patch from
> the series and send out a new revision. The patches addressing the "fast
> path" issue will be sent separately after checking with Ben or Paul
> whether they're worth sending out.
>

I checked with Paul Mackerras. He already has a patch fixing the issue in
a different way :-)

Thanks,
Gavin

>
>>
>>>>      QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>>          vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>>          }
>>>>
>>>>          msix_reset(&vdev->pdev);
>>>> +
>>>> +        /* Disable INTx */
>>>> +        if (vdev->interrupt == VFIO_INT_INTx) {
>>>> +            vfio_disable_intx(vdev);
>>>> +        }
>>>>      }
>>>>
>>>>      break;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index fca1edc..bfa3d0c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
      * disable it so that it can be reenabled properly. Also,
      * the cached MSIx table should be cleared as it's not
      * reflecting the contents in hardware.
+     *
+     * We might have an INTx interrupt whose handler disables the
+     * memory mapped BARs. The INTx pending state can't be cleared
+     * with a memory BAR access in the slow path. The timer kicked
+     * off by the INTx interrupt handler won't re-enable those
+     * disabled memory mapped BARs, which leads to hung MMIO
+     * register access and EEH recovery failure. We simply disable
+     * INTx if it has been enabled.
      */
     QLIST_FOREACH(vbasedev, &group->device_list, next) {
         vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
@@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
         }
 
         msix_reset(&vdev->pdev);
+
+        /* Disable INTx */
+        if (vdev->interrupt == VFIO_INT_INTx) {
+            vfio_disable_intx(vdev);
+        }
     }
 
     break;
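For the added vfio_disable_intx() call to clear the hang, the teardown it
performs has to cover more than the interrupt trigger: the stale
intx.pending flag, the mmap re-enable timer, and the disabled mmap'ed
regions. The toy model below sketches that teardown order; only
vdev->interrupt, VFIO_INT_INTx, vfio_disable_intx() and intx.pending appear
in the patch and commit message above, every other name is a stand-in, not
QEMU code.

#include <stdbool.h>

enum { TOY_INT_NONE, TOY_INT_INTX, TOY_INT_MSIX };

struct toy_vfio_dev {
    int  interrupt;        /* TOY_INT_INTX while INTx emulation is active */
    bool intx_pending;     /* the flag the bug leaves stuck at true       */
    bool mmap_timer_armed; /* deferred re-enable of the mmap'ed BARs      */
    bool mmap_enabled;     /* fast path on/off                            */
};

static void toy_disable_intx(struct toy_vfio_dev *vdev)
{
    vdev->mmap_timer_armed = false; /* no stale re-enable callback        */
    /* ...disable the INTx trigger on the VFIO device fd here...          */
    vdev->intx_pending = false;     /* nothing left to wait for           */
    vdev->mmap_enabled = true;      /* put the fast path back immediately */
    vdev->interrupt = TOY_INT_NONE; /* MSI/MSIx can be re-enabled cleanly */
}

/* Mirrors the added hunk: only tear down INTx if it is actually in use. */
static void toy_eeh_reset_one_device(struct toy_vfio_dev *vdev)
{
    if (vdev->interrupt == TOY_INT_INTX) {
        toy_disable_intx(vdev);
    }
}

int main(void)
{
    struct toy_vfio_dev dev = { .interrupt = TOY_INT_INTX,
                                .intx_pending = true,
                                .mmap_timer_armed = true };
    toy_eeh_reset_one_device(&dev);
    return dev.mmap_enabled ? 0 : 1;   /* fast path restored before reset */
}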
When a Linux guest recovers from an EEH error on the following Emulex
adapter, the MSIx interrupts are disabled and INTx emulation is enabled.
One INTx interrupt is injected into the guest by the host because pending
INTx interrupts are detected on the adapter. QEMU disables the mmap'ed
BAR regions and, in the INTx interrupt handler, starts a timer to
re-enable those regions at a later point. Unfortunately,
"VFIOPCIDevice->intx.pending" isn't cleared, meaning the disabled
mmap'ed BAR regions won't be re-enabled properly. This leads to EEH
recovery failure on the guest side because of hung MMIO accesses.

  # lspci | grep Emulex
  0000:01:00.0 Ethernet controller: Emulex Corporation \
               OneConnect 10Gb NIC (be3) (rev 02)
  0000:01:00.1 Ethernet controller: Emulex Corporation \
               OneConnect 10Gb NIC (be3) (rev 02)

The patch disables the INTx interrupt before doing the EEH reset to
avoid the issue.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 hw/vfio/pci.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)