diff mbox

iommu emulation

Message ID 20170215033452.GB3988@pxdev.xzpeter.org
State New
Headers show

Commit Message

Peter Xu Feb. 15, 2017, 3:34 a.m. UTC
On Wed, Feb 15, 2017 at 10:52:43AM +0800, Peter Xu wrote:

[...]

> > >
> > > Then, I *think* above assertion you encountered would fail only if
> > > prev == 0 here, but I still don't quite sure why was that happening.
> > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > guest?
> > >
> > 
> > Sure. This is from my L1 guest.
> 
> Hmm... I think I found the problem...
> 
> > 
> > root@guest0:~# lspci -vvv -s 00:03.0
> > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > [ConnectX-3]
> > Subsystem: Mellanox Technologies Device 0050
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 23
> > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > Expansion ROM at fea00000 [disabled] [size=1M]
> > Capabilities: [40] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [48] Vital Product Data
> > Product Name: CX354A - ConnectX-3 QSFP
> > Read-only fields:
> > [PN] Part number: MCX354A-FCBT
> > [EC] Engineering changes: A4
> > [SN] Serial number: MT1346X00791
> > [V0] Vendor specific: PCIe Gen3 x8
> > [RV] Reserved: checksum good, 0 byte(s) reserved
> > Read/write fields:
> > [V1] Vendor specific: N/A
> > [YA] Asset tag: N/A
> > [RW] Read-write area: 105 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 252 byte(s) free
> > End
> > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > Vector table: BAR=0 offset=0007c000
> > PBA: BAR=0 offset=0007d000
> > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > ExtTag- RBE+
> > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > Supported
> > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> > Capabilities: [100 v0] #00
> 
> Here we have the head of ecap capability as cap_id==0, then when we
> boot the l2 guest with the same device, we'll first copy this
> cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> problem since pcie_find_capability_list() will thought there is no cap
> at all (cap_id==0 is skipped).
> 
> Do you want to try this "hacky patch" to see whether it works for you?
> 
> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..bacd302 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>  
>      }
>  
> -    /* Cleanup chain head ID if necessary */
> -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> -    }
> -
>      g_free(config);
>      return;
>  }
> ------>8-------
> 
> I don't think it's a good solution (it just used 0xffff instead of 0x0
> for the masked cap_id, then l2 guest would like to co-op with it), but
> it should workaround this temporarily. I'll try to think of a better
> one later and post when proper.
> 
> (Alex, please leave comment if you have any better suggestion before
>  mine :)

Alex, do you like something like below to fix above issue that Jintack
has encountered?

(note: this code is not for compile, only trying show what I mean...)

------8<-------
----->8-----

Since after all we need the assumption that 0xffff is reserved for
cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
which is imho error-prone and hacky.

Thanks,

-- peterx

Comments

Alex Williamson Feb. 15, 2017, 6:15 p.m. UTC | #1
On Wed, 15 Feb 2017 11:34:52 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Feb 15, 2017 at 10:52:43AM +0800, Peter Xu wrote:
> 
> [...]
> 
> > > >
> > > > Then, I *think* above assertion you encountered would fail only if
> > > > prev == 0 here, but I still don't quite sure why was that happening.
> > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > > guest?
> > > >  
> > > 
> > > Sure. This is from my L1 guest.  
> > 
> > Hmm... I think I found the problem...
> >   
> > > 
> > > root@guest0:~# lspci -vvv -s 00:03.0
> > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > [ConnectX-3]
> > > Subsystem: Mellanox Technologies Device 0050
> > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > Stepping- SERR+ FastB2B- DisINTx+
> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > > <MAbort- >SERR- <PERR- INTx-
> > > Latency: 0, Cache Line Size: 64 bytes
> > > Interrupt: pin A routed to IRQ 23
> > > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > > Expansion ROM at fea00000 [disabled] [size=1M]
> > > Capabilities: [40] Power Management version 3
> > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > Capabilities: [48] Vital Product Data
> > > Product Name: CX354A - ConnectX-3 QSFP
> > > Read-only fields:
> > > [PN] Part number: MCX354A-FCBT
> > > [EC] Engineering changes: A4
> > > [SN] Serial number: MT1346X00791
> > > [V0] Vendor specific: PCIe Gen3 x8
> > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > Read/write fields:
> > > [V1] Vendor specific: N/A
> > > [YA] Asset tag: N/A
> > > [RW] Read-write area: 105 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 252 byte(s) free
> > > End
> > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > > Vector table: BAR=0 offset=0007c000
> > > PBA: BAR=0 offset=0007d000
> > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > > ExtTag- RBE+
> > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > > Supported
> > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> > > Capabilities: [100 v0] #00  
> > 
> > Here we have the head of ecap capability as cap_id==0, then when we
> > boot the l2 guest with the same device, we'll first copy this
> > cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> > problem since pcie_find_capability_list() will thought there is no cap
> > at all (cap_id==0 is skipped).
> > 
> > Do you want to try this "hacky patch" to see whether it works for you?
> > 
> > ------8<-------
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 332f41d..bacd302 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >  
> >      }
> >  
> > -    /* Cleanup chain head ID if necessary */
> > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > -    }
> > -
> >      g_free(config);
> >      return;
> >  }  
> > ------>8-------  
> > 
> > I don't think it's a good solution (it just used 0xffff instead of 0x0
> > for the masked cap_id, then l2 guest would like to co-op with it), but
> > it should workaround this temporarily. I'll try to think of a better
> > one later and post when proper.
> > 
> > (Alex, please leave comment if you have any better suggestion before
> >  mine :)  
> 
> Alex, do you like something like below to fix above issue that Jintack
> has encountered?
> 
> (note: this code is not for compile, only trying show what I mean...)
> 
> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..4dca631 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>       */
>      config = g_memdup(pdev->config, vdev->config_size);
> 
> -    /*
> -     * Extended capabilities are chained with each pointing to the next, so we
> -     * can drop anything other than the head of the chain simply by modifying
> -     * the previous next pointer.  For the head of the chain, we can modify the
> -     * capability ID to something that cannot match a valid capability.  ID
> -     * 0 is reserved for this since absence of capabilities is indicated by
> -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
> -     * uses ID 0 as reserved for list management and will incorrectly match and
> -     * assert if we attempt to pre-load the head of the chain with this ID.
> -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> -     * part for identifying absence of capabilities in a root complex register
> -     * block.  If the ID still exists after adding capabilities, switch back to
> -     * zero.  We'll mark this entire first dword as emulated for this purpose.
> -     */
> -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> -
>      for (next = PCI_CONFIG_SPACE_SIZE; next;
>           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
>          header = pci_get_long(config + next);
> @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>          switch (cap_id) {
>          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
>          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
> +            ...
>              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
>              break;
>          default:
> @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> 
>      }
> 
> -    /* Cleanup chain head ID if necessary */
> -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> -    }
> -
>      g_free(config);
>      return;
>  }
> ----->8-----  
> 
> Since after all we need the assumption that 0xffff is reserved for
> cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> which is imho error-prone and hacky.

This doesn't fix the bug, which is that pcie_add_capability() uses a
valid capability ID for it's own internal tracking.  It's only doing
this to find the end of the capability chain, which we could do in a
spec complaint way by looking for a zero next pointer.  Fix that and
then vfio doesn't need to do this set to 0xffff then back to zero
nonsense at all.  Capability ID zero is valid.  Thanks,

Alex
Peter Xu Feb. 16, 2017, 2:28 a.m. UTC | #2
On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:

[...]

> > Alex, do you like something like below to fix above issue that Jintack
> > has encountered?
> > 
> > (note: this code is not for compile, only trying show what I mean...)
> > 
> > ------8<-------
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 332f41d..4dca631 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >       */
> >      config = g_memdup(pdev->config, vdev->config_size);
> > 
> > -    /*
> > -     * Extended capabilities are chained with each pointing to the next, so we
> > -     * can drop anything other than the head of the chain simply by modifying
> > -     * the previous next pointer.  For the head of the chain, we can modify the
> > -     * capability ID to something that cannot match a valid capability.  ID
> > -     * 0 is reserved for this since absence of capabilities is indicated by
> > -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
> > -     * uses ID 0 as reserved for list management and will incorrectly match and
> > -     * assert if we attempt to pre-load the head of the chain with this ID.
> > -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> > -     * part for identifying absence of capabilities in a root complex register
> > -     * block.  If the ID still exists after adding capabilities, switch back to
> > -     * zero.  We'll mark this entire first dword as emulated for this purpose.
> > -     */
> > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> > -
> >      for (next = PCI_CONFIG_SPACE_SIZE; next;
> >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> >          header = pci_get_long(config + next);
> > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >          switch (cap_id) {
> >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> > +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
> > +            ...
> >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
> >              break;
> >          default:
> > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > 
> >      }
> > 
> > -    /* Cleanup chain head ID if necessary */
> > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > -    }
> > -
> >      g_free(config);
> >      return;
> >  }
> > ----->8-----  
> > 
> > Since after all we need the assumption that 0xffff is reserved for
> > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > which is imho error-prone and hacky.
> 
> This doesn't fix the bug, which is that pcie_add_capability() uses a
> valid capability ID for it's own internal tracking.  It's only doing
> this to find the end of the capability chain, which we could do in a
> spec complaint way by looking for a zero next pointer.  Fix that and
> then vfio doesn't need to do this set to 0xffff then back to zero
> nonsense at all.  Capability ID zero is valid.  Thanks,

Yeah I see Michael's fix on the capability list stuff. However, imho
these are two different issues? Or say, even if with that patch, we
should still need this hack (first 0x0, then 0xffff) right? Since
looks like that patch didn't solve the problem if the first pcie ecap
is masked at 0x100.

Please correct me if I missed anything. Thanks,

-- peterx
Alex Williamson Feb. 16, 2017, 2:47 a.m. UTC | #3
On Thu, 16 Feb 2017 10:28:39 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> 
> [...]
> 
> > > Alex, do you like something like below to fix above issue that Jintack
> > > has encountered?
> > > 
> > > (note: this code is not for compile, only trying show what I mean...)
> > > 
> > > ------8<-------
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 332f41d..4dca631 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > >       */
> > >      config = g_memdup(pdev->config, vdev->config_size);
> > > 
> > > -    /*
> > > -     * Extended capabilities are chained with each pointing to the next, so we
> > > -     * can drop anything other than the head of the chain simply by modifying
> > > -     * the previous next pointer.  For the head of the chain, we can modify the
> > > -     * capability ID to something that cannot match a valid capability.  ID
> > > -     * 0 is reserved for this since absence of capabilities is indicated by
> > > -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
> > > -     * uses ID 0 as reserved for list management and will incorrectly match and
> > > -     * assert if we attempt to pre-load the head of the chain with this ID.
> > > -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> > > -     * part for identifying absence of capabilities in a root complex register
> > > -     * block.  If the ID still exists after adding capabilities, switch back to
> > > -     * zero.  We'll mark this entire first dword as emulated for this purpose.
> > > -     */
> > > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> > > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> > > -
> > >      for (next = PCI_CONFIG_SPACE_SIZE; next;
> > >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > >          header = pci_get_long(config + next);
> > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > >          switch (cap_id) {
> > >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> > >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> > > +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
> > > +            ...
> > >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
> > >              break;
> > >          default:
> > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > > 
> > >      }
> > > 
> > > -    /* Cleanup chain head ID if necessary */
> > > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > -    }
> > > -
> > >      g_free(config);
> > >      return;
> > >  }  
> > > ----->8-----    
> > > 
> > > Since after all we need the assumption that 0xffff is reserved for
> > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > > which is imho error-prone and hacky.  
> > 
> > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > valid capability ID for it's own internal tracking.  It's only doing
> > this to find the end of the capability chain, which we could do in a
> > spec complaint way by looking for a zero next pointer.  Fix that and
> > then vfio doesn't need to do this set to 0xffff then back to zero
> > nonsense at all.  Capability ID zero is valid.  Thanks,  
> 
> Yeah I see Michael's fix on the capability list stuff. However, imho
> these are two different issues? Or say, even if with that patch, we
> should still need this hack (first 0x0, then 0xffff) right? Since
> looks like that patch didn't solve the problem if the first pcie ecap
> is masked at 0x100.

I thought the problem was that QEMU in the host exposes a device with a
capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
capability ID of 0 because that's how it finds the end of the chain.
Therefore if we make QEMU not use capability ID 0 for internal
purposes, things work.  vfio using 0xffff and swapping back to 0x0
becomes unnecessary, but doesn't hurt anything.  Thanks,

Alex
Jintack Lim Feb. 21, 2017, 10:33 a.m. UTC | #4
On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <alex.williamson@redhat.com
> wrote:

> On Thu, 16 Feb 2017 10:28:39 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> >
> > [...]
> >
> > > > Alex, do you like something like below to fix above issue that
> Jintack
> > > > has encountered?
> > > >
> > > > (note: this code is not for compile, only trying show what I mean...)
> > > >
> > > > ------8<-------
> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > index 332f41d..4dca631 100644
> > > > --- a/hw/vfio/pci.c
> > > > +++ b/hw/vfio/pci.c
> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >       */
> > > >      config = g_memdup(pdev->config, vdev->config_size);
> > > >
> > > > -    /*
> > > > -     * Extended capabilities are chained with each pointing to the
> next, so we
> > > > -     * can drop anything other than the head of the chain simply by
> modifying
> > > > -     * the previous next pointer.  For the head of the chain, we
> can modify the
> > > > -     * capability ID to something that cannot match a valid
> capability.  ID
> > > > -     * 0 is reserved for this since absence of capabilities is
> indicated by
> > > > -     * 0 for the ID, version, AND next pointer.  However,
> pcie_add_capability()
> > > > -     * uses ID 0 as reserved for list management and will
> incorrectly match and
> > > > -     * assert if we attempt to pre-load the head of the chain with
> this ID.
> > > > -     * Use ID 0xFFFF temporarily since it is also seems to be
> reserved in
> > > > -     * part for identifying absence of capabilities in a root
> complex register
> > > > -     * block.  If the ID still exists after adding capabilities,
> switch back to
> > > > -     * zero.  We'll mark this entire first dword as emulated for
> this purpose.
> > > > -     */
> > > > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> > > > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -    pci_set_long(vdev->emulated_config_bits +
> PCI_CONFIG_SPACE_SIZE, ~0);
> > > > -
> > > >      for (next = PCI_CONFIG_SPACE_SIZE; next;
> > > >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > > >          header = pci_get_long(config + next);
> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >          switch (cap_id) {
> > > >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
> OVMF */
> > > >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
> virtualization */
> > > > +            /* keep this ecap header (4 bytes), but mask cap_id to
> 0xffff */
> > > > +            ...
> > > >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
> cap_id, next);
> > > >              break;
> > > >          default:
> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >
> > > >      }
> > > >
> > > > -    /* Cleanup chain head ID if necessary */
> > > > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
> 0xFFFF) {
> > > > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -    }
> > > > -
> > > >      g_free(config);
> > > >      return;
> > > >  }
> > > > ----->8-----
> > > >
> > > > Since after all we need the assumption that 0xffff is reserved for
> > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > > > which is imho error-prone and hacky.
> > >
> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > > valid capability ID for it's own internal tracking.  It's only doing
> > > this to find the end of the capability chain, which we could do in a
> > > spec complaint way by looking for a zero next pointer.  Fix that and
> > > then vfio doesn't need to do this set to 0xffff then back to zero
> > > nonsense at all.  Capability ID zero is valid.  Thanks,
> >
> > Yeah I see Michael's fix on the capability list stuff. However, imho
> > these are two different issues? Or say, even if with that patch, we
> > should still need this hack (first 0x0, then 0xffff) right? Since
> > looks like that patch didn't solve the problem if the first pcie ecap
> > is masked at 0x100.
>
> I thought the problem was that QEMU in the host exposes a device with a
> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
> capability ID of 0 because that's how it finds the end of the chain.
> Therefore if we make QEMU not use capability ID 0 for internal
> purposes, things work.  vfio using 0xffff and swapping back to 0x0
> becomes unnecessary, but doesn't hurt anything.  Thanks,
>

I've applied Peter's hack and Michael's patch below, but still can't use
the assigned device in L2.
 commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4
    Author: Michael S. Tsirkin <mst@redhat.com>
    Date:   Wed Feb 15 22:37:45 2017 +0200

    pci/pcie: don't assume cap id 0 is reserved

I was able to boot L2 with following qemu warnings,
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.

but then I don't see the network device, which I was trying to assign to
L2, in L2.
This is from L2 dmesg, and it looks like the device is not initialized.

[    5.884115] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[    5.891563] mlx4_core: Initializing 0000:00:03.0
[    5.896947] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23
[    6.913559] mlx4_core 0000:00:03.0: Installed FW has unsupported command
interface revision 0
[    6.920925] mlx4_core 0000:00:03.0: (Installed FW version is 0.0.000)
[    6.926490] mlx4_core 0000:00:03.0: This driver version supports only
revisions 2 to 3
[    6.933300] mlx4_core 0000:00:03.0: QUERY_FW command failed, aborting
[    6.940279] mlx4_core 0000:00:03.0: Failed to init fw, aborting.

This is the full kernel log from L2.
https://paste.ubuntu.com/24039462/

L0, L1 and L2 are using the same kernel, so I think they are using the same
device driver.
This is the L0/L1 kernel log about the network device.

--- From L0 ---
[    8.175533] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[    8.175543] mlx4_core: Initializing 0000:08:00.0
[   14.524093] mlx4_core 0000:08:00.0: PCIe link speed is 8.0GT/s, device
supports 8.0GT/s
[   14.533030] mlx4_core 0000:08:00.0: PCIe link width is x8, device
supports x8
[   14.714296] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
2014)
[   14.722295] mlx4_en 0000:08:00.0: Activating port:2
[   14.735186] mlx4_en: 0000:08:00.0: Port 2: Using 128 TX rings
[   14.741608] mlx4_en: 0000:08:00.0: Port 2: Using 8 RX rings
[   14.747826] mlx4_en: 0000:08:00.0: Port 2:   frag:0 - size:1522 prefix:0
stride:1536
[   14.756698] mlx4_en: 0000:08:00.0: Port 2: Initializing port
[   14.764036] mlx4_en 0000:08:00.0: registered PHC clock

--- From L1 ---
[    3.790302] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[    3.791089] mlx4_core: Initializing 0000:00:03.0
[    9.053077] mlx4_core 0000:00:03.0: Unable to determine PCIe device BW
capabilities
[    9.203290] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
2014)
[    9.204503] mlx4_en 0000:00:03.0: Activating port:2
[    9.212853] mlx4_en: 0000:00:03.0: Port 2: Using 32 TX rings
[    9.213514] mlx4_en: 0000:00:03.0: Port 2: Using 4 RX rings
[    9.214131] mlx4_en: 0000:00:03.0: Port 2:   frag:0 - size:1522 prefix:0
stride:1536
[    9.215260] mlx4_en: 0000:00:03.0: Port 2: Initializing port
[    9.216377] mlx4_en 0000:00:03.0: registered PHC clock
[    9.261518] mlx4_en: eth1: Link Up
[    9.690730] mlx4_core 0000:00:03.0 eth2: renamed from eth1

Any thoughts?


> Alex
>
>
Jintack Lim Feb. 23, 2017, 11:04 p.m. UTC | #5
[cc Bandan]

On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
wrote:

>
>
> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> alex.williamson@redhat.com> wrote:
>
>> On Thu, 16 Feb 2017 10:28:39 +0800
>> Peter Xu <peterx@redhat.com> wrote:
>>
>> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
>> >
>> > [...]
>> >
>> > > > Alex, do you like something like below to fix above issue that
>> Jintack
>> > > > has encountered?
>> > > >
>> > > > (note: this code is not for compile, only trying show what I
>> mean...)
>> > > >
>> > > > ------8<-------
>> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> > > > index 332f41d..4dca631 100644
>> > > > --- a/hw/vfio/pci.c
>> > > > +++ b/hw/vfio/pci.c
>> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >       */
>> > > >      config = g_memdup(pdev->config, vdev->config_size);
>> > > >
>> > > > -    /*
>> > > > -     * Extended capabilities are chained with each pointing to the
>> next, so we
>> > > > -     * can drop anything other than the head of the chain simply
>> by modifying
>> > > > -     * the previous next pointer.  For the head of the chain, we
>> can modify the
>> > > > -     * capability ID to something that cannot match a valid
>> capability.  ID
>> > > > -     * 0 is reserved for this since absence of capabilities is
>> indicated by
>> > > > -     * 0 for the ID, version, AND next pointer.  However,
>> pcie_add_capability()
>> > > > -     * uses ID 0 as reserved for list management and will
>> incorrectly match and
>> > > > -     * assert if we attempt to pre-load the head of the chain with
>> this ID.
>> > > > -     * Use ID 0xFFFF temporarily since it is also seems to be
>> reserved in
>> > > > -     * part for identifying absence of capabilities in a root
>> complex register
>> > > > -     * block.  If the ID still exists after adding capabilities,
>> switch back to
>> > > > -     * zero.  We'll mark this entire first dword as emulated for
>> this purpose.
>> > > > -     */
>> > > > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
>> > > > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
>> > > > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -    pci_set_long(vdev->emulated_config_bits +
>> PCI_CONFIG_SPACE_SIZE, ~0);
>> > > > -
>> > > >      for (next = PCI_CONFIG_SPACE_SIZE; next;
>> > > >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
>> > > >          header = pci_get_long(config + next);
>> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >          switch (cap_id) {
>> > > >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
>> OVMF */
>> > > >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
>> virtualization */
>> > > > +            /* keep this ecap header (4 bytes), but mask cap_id to
>> 0xffff */
>> > > > +            ...
>> > > >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
>> cap_id, next);
>> > > >              break;
>> > > >          default:
>> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >
>> > > >      }
>> > > >
>> > > > -    /* Cleanup chain head ID if necessary */
>> > > > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
>> 0xFFFF) {
>> > > > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -    }
>> > > > -
>> > > >      g_free(config);
>> > > >      return;
>> > > >  }
>> > > > ----->8-----
>> > > >
>> > > > Since after all we need the assumption that 0xffff is reserved for
>> > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
>> > > > which is imho error-prone and hacky.
>> > >
>> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
>> > > valid capability ID for it's own internal tracking.  It's only doing
>> > > this to find the end of the capability chain, which we could do in a
>> > > spec complaint way by looking for a zero next pointer.  Fix that and
>> > > then vfio doesn't need to do this set to 0xffff then back to zero
>> > > nonsense at all.  Capability ID zero is valid.  Thanks,
>> >
>> > Yeah I see Michael's fix on the capability list stuff. However, imho
>> > these are two different issues? Or say, even if with that patch, we
>> > should still need this hack (first 0x0, then 0xffff) right? Since
>> > looks like that patch didn't solve the problem if the first pcie ecap
>> > is masked at 0x100.
>>
>> I thought the problem was that QEMU in the host exposes a device with a
>> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
>> capability ID of 0 because that's how it finds the end of the chain.
>> Therefore if we make QEMU not use capability ID 0 for internal
>> purposes, things work.  vfio using 0xffff and swapping back to 0x0
>> becomes unnecessary, but doesn't hurt anything.  Thanks,
>>
>
> I've applied Peter's hack and Michael's patch below, but still can't use
> the assigned device in L2.
>  commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4
>     Author: Michael S. Tsirkin <mst@redhat.com>
>     Date:   Wed Feb 15 22:37:45 2017 +0200
>
>     pci/pcie: don't assume cap id 0 is reserved
>
> I was able to boot L2 with following qemu warnings,
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
>
> but then I don't see the network device, which I was trying to assign to
> L2, in L2.
> This is from L2 dmesg, and it looks like the device is not initialized.
>
> [    5.884115] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    5.891563] mlx4_core: Initializing 0000:00:03.0
> [    5.896947] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23
> [    6.913559] mlx4_core 0000:00:03.0: Installed FW has unsupported
> command interface revision 0
> [    6.920925] mlx4_core 0000:00:03.0: (Installed FW version is 0.0.000)
> [    6.926490] mlx4_core 0000:00:03.0: This driver version supports only
> revisions 2 to 3
> [    6.933300] mlx4_core 0000:00:03.0: QUERY_FW command failed, aborting
> [    6.940279] mlx4_core 0000:00:03.0: Failed to init fw, aborting.
>
> This is the full kernel log from L2.
> https://paste.ubuntu.com/24039462/
>
> L0, L1 and L2 are using the same kernel, so I think they are using the
> same device driver.
> This is the L0/L1 kernel log about the network device.
>
> --- From L0 ---
> [    8.175533] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    8.175543] mlx4_core: Initializing 0000:08:00.0
> [   14.524093] mlx4_core 0000:08:00.0: PCIe link speed is 8.0GT/s, device
> supports 8.0GT/s
> [   14.533030] mlx4_core 0000:08:00.0: PCIe link width is x8, device
> supports x8
> [   14.714296] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
> 2014)
> [   14.722295] mlx4_en 0000:08:00.0: Activating port:2
> [   14.735186] mlx4_en: 0000:08:00.0: Port 2: Using 128 TX rings
> [   14.741608] mlx4_en: 0000:08:00.0: Port 2: Using 8 RX rings
> [   14.747826] mlx4_en: 0000:08:00.0: Port 2:   frag:0 - size:1522
> prefix:0 stride:1536
> [   14.756698] mlx4_en: 0000:08:00.0: Port 2: Initializing port
> [   14.764036] mlx4_en 0000:08:00.0: registered PHC clock
>
> --- From L1 ---
> [    3.790302] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    3.791089] mlx4_core: Initializing 0000:00:03.0
> [    9.053077] mlx4_core 0000:00:03.0: Unable to determine PCIe device BW
> capabilities
> [    9.203290] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
> 2014)
> [    9.204503] mlx4_en 0000:00:03.0: Activating port:2
> [    9.212853] mlx4_en: 0000:00:03.0: Port 2: Using 32 TX rings
> [    9.213514] mlx4_en: 0000:00:03.0: Port 2: Using 4 RX rings
> [    9.214131] mlx4_en: 0000:00:03.0: Port 2:   frag:0 - size:1522
> prefix:0 stride:1536
> [    9.215260] mlx4_en: 0000:00:03.0: Port 2: Initializing port
> [    9.216377] mlx4_en 0000:00:03.0: registered PHC clock
> [    9.261518] mlx4_en: eth1: Link Up
> [    9.690730] mlx4_core 0000:00:03.0 eth2: renamed from eth1
>
> Any thoughts?
>

I've tried another network device on a different machine. It has "Intel
Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
controller. I got the same problem of getting the network device
initialization failure in L2. I think I'm missing something since I heard
from Bandan that he had no problem to assign a device to L2 with ixgbe.

This is the error message from dmesg in L2.

[    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
version 4.2.1-k
[    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
[    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
[    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12

I checked that L2 indeed had that device.
root@guest0:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
Network Connection (rev 01)

I'm describing steps I took, so if you notice something wrong, PLEASE let
me know.

1. [L0] Check the device with lspci. Result is [1]
2. [L0] Unbind from the original driver and bind to vfio-pci driver
following [2][3]
3. [L0] Start L1 with this script. [4]
4. [L1] L1 is able to use the network device.
5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
the step 2.
6. [L1] Start L2 with this script. [5]
7. [L2] Got the init failure error message above.

[1] https://paste.ubuntu.com/24055745/
[2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
[3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
[4] https://paste.ubuntu.com/24055715/
[5] https://paste.ubuntu.com/24055720/

Thanks,
Jintack


>
>
>> Alex
>>
>>
>
Bandan Das March 2, 2017, 10:20 p.m. UTC | #6
Jintack Lim <jintack@cs.columbia.edu> writes:

> [cc Bandan]
>
> On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
> wrote:
>
>>
>>
>> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
>> alex.williamson@redhat.com> wrote:
...
>>
>
> I've tried another network device on a different machine. It has "Intel
> Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
> controller. I got the same problem of getting the network device
> initialization failure in L2. I think I'm missing something since I heard
> from Bandan that he had no problem to assign a device to L2 with ixgbe.
>
> This is the error message from dmesg in L2.
>
> [    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> version 4.2.1-k
> [    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
> [    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
> [    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
>
> I checked that L2 indeed had that device.
> root@guest0:~# lspci
> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
> Controller
> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> Network Connection (rev 01)

Jintack, any progress with this ?

I am testing on a X540-AT2 and I see a different behavior. It appears
config succeeds but the driver keeps resetting the device due to a Tx
hang:

[ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
resetting adapter
[ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
timeout
[ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter

This may be device specific but I think the actual behavior you see is
also dependent on the ixgbe driver in the guest. Are you on a recent
kernel ? Also, can you point me to the hack (by Peter) that you have
mentioned above ?

Thanks,
Bandan

> I'm describing steps I took, so if you notice something wrong, PLEASE let
> me know.
>
> 1. [L0] Check the device with lspci. Result is [1]
> 2. [L0] Unbind from the original driver and bind to vfio-pci driver
> following [2][3]
> 3. [L0] Start L1 with this script. [4]
> 4. [L1] L1 is able to use the network device.
> 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
> the step 2.
> 6. [L1] Start L2 with this script. [5]
> 7. [L2] Got the init failure error message above.
>
> [1] https://paste.ubuntu.com/24055745/
> [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
> [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
> [4] https://paste.ubuntu.com/24055715/
> [5] https://paste.ubuntu.com/24055720/
>
> Thanks,
> Jintack
>
>
>>
>>
>>> Alex
>>>
>>>
>>
Peter Xu March 3, 2017, 3:43 a.m. UTC | #7
On Thu, Mar 02, 2017 at 05:20:19PM -0500, Bandan Das wrote:
> Jintack Lim <jintack@cs.columbia.edu> writes:
> 
> > [cc Bandan]
> >
> > On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
> > wrote:
> >
> >>
> >>
> >> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> >> alex.williamson@redhat.com> wrote:
> ...
> >>
> >
> > I've tried another network device on a different machine. It has "Intel
> > Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
> > controller. I got the same problem of getting the network device
> > initialization failure in L2. I think I'm missing something since I heard
> > from Bandan that he had no problem to assign a device to L2 with ixgbe.
> >
> > This is the error message from dmesg in L2.
> >
> > [    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> > version 4.2.1-k
> > [    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
> > [    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
> > [    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
> >
> > I checked that L2 indeed had that device.
> > root@guest0:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
> > Controller
> > 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> > Network Connection (rev 01)
> 
> Jintack, any progress with this ?
> 
> I am testing on a X540-AT2 and I see a different behavior. It appears
> config succeeds but the driver keeps resetting the device due to a Tx
> hang:
> 
> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
> resetting adapter
> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
> timeout
> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
> 
> This may be device specific but I think the actual behavior you see is
> also dependent on the ixgbe driver in the guest. Are you on a recent
> kernel ? Also, can you point me to the hack (by Peter) that you have
> mentioned above ?

Hi, Bandan,

Are you using the vtd vfio v7 series or another branch?

If it's the upstream one... just a note that we need to make sure
"-device intel-iommu" be the first device specified in QEMU command
line parameters (need to before "-device vfio-pci,..."). Thanks,

-- peterx
Bandan Das March 3, 2017, 7:45 a.m. UTC | #8
Peter Xu <peterx@redhat.com> writes:

> On Thu, Mar 02, 2017 at 05:20:19PM -0500, Bandan Das wrote:
>> Jintack Lim <jintack@cs.columbia.edu> writes:
>> 
>> > [cc Bandan]
>> >
...
>> 
>> Jintack, any progress with this ?
>> 
>> I am testing on a X540-AT2 and I see a different behavior. It appears
>> config succeeds but the driver keeps resetting the device due to a Tx
>> hang:
>> 
>> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
>> resetting adapter
>> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
>> timeout
>> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
>> 
>> This may be device specific but I think the actual behavior you see is
>> also dependent on the ixgbe driver in the guest. Are you on a recent
>> kernel ? Also, can you point me to the hack (by Peter) that you have
>> mentioned above ?
>
> Hi, Bandan,
>
> Are you using the vtd vfio v7 series or another branch?

Thanks for the tip. Jintack pointed me to your repo and I am using v7
from there.

> If it's the upstream one... just a note that we need to make sure
> "-device intel-iommu" be the first device specified in QEMU command
> line parameters (need to before "-device vfio-pci,..."). Thanks,
>
> -- peterx
diff mbox

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..4dca631 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1877,25 +1877,6 @@  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
      */
     config = g_memdup(pdev->config, vdev->config_size);

-    /*
-     * Extended capabilities are chained with each pointing to the next, so we
-     * can drop anything other than the head of the chain simply by modifying
-     * the previous next pointer.  For the head of the chain, we can modify the
-     * capability ID to something that cannot match a valid capability.  ID
-     * 0 is reserved for this since absence of capabilities is indicated by
-     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
-     * uses ID 0 as reserved for list management and will incorrectly match and
-     * assert if we attempt to pre-load the head of the chain with this ID.
-     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
-     * part for identifying absence of capabilities in a root complex register
-     * block.  If the ID still exists after adding capabilities, switch back to
-     * zero.  We'll mark this entire first dword as emulated for this purpose.
-     */
-    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
-                 PCI_EXT_CAP(0xFFFF, 0, 0));
-    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
-    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
-
     for (next = PCI_CONFIG_SPACE_SIZE; next;
          next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
         header = pci_get_long(config + next);
@@ -1917,6 +1898,8 @@  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
         switch (cap_id) {
         case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
         case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
+            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
+            ...
             trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
             break;
         default:
@@ -1925,11 +1908,6 @@  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)

     }

-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }