iommu emulation

Message ID 20170215025243.GA3988@pxdev.xzpeter.org
State New

Commit Message

Peter Xu Feb. 15, 2017, 2:52 a.m. UTC
On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:

[...]

> > > >> > I misunderstood what you said?
> > > >
> > > > I failed to understand why a vIOMMU could help boost performance. :(
> > > > Could you provide your command line here so that I can try to
> > > > reproduce?
> > >
> > > Sure. This is the command line to launch L1 VM
> > >
> > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > > -smp 4,sockets=4,cores=1,threads=1 \
> > > -device vfio-pci,host=08:00.0,id=net0
> > >
> > > And this is for L2 VM.
> > >
> > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > -m 8G \
> > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > -device vfio-pci,host=00:03.0,id=net0
> >
> > ... these look like the command lines for the L1/L2 guests, rather
> > than for an L1 guest with/without a vIOMMU?
> >
> 
> That's right. I thought you were asking about the command lines for the
> L1/L2 guests :(.
> I think I caused the confusion, and as I said above, I didn't mean to talk
> about the performance of the L1 guest with/without a vIOMMU.
> We can move on!

I see. Sure! :-)

[...]

> >
> > Then, I *think* the assertion you encountered above would fail only if
> > prev == 0 here, but I'm still not quite sure why that was happening.
> > Btw, could you paste the output of "lspci -vvv -s 00:03.0" from your L1
> > guest?
> >
> 
> Sure. This is from my L1 guest.

Hmm... I think I found the problem...

> 
> root@guest0:~# lspci -vvv -s 00:03.0
> 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> [ConnectX-3]
> Subsystem: Mellanox Technologies Device 0050
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 23
> Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> Expansion ROM at fea00000 [disabled] [size=1M]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [48] Vital Product Data
> Product Name: CX354A - ConnectX-3 QSFP
> Read-only fields:
> [PN] Part number: MCX354A-FCBT
> [EC] Engineering changes: A4
> [SN] Serial number: MT1346X00791
> [V0] Vendor specific: PCIe Gen3 x8
> [RV] Reserved: checksum good, 0 byte(s) reserved
> Read/write fields:
> [V1] Vendor specific: N/A
> [YA] Asset tag: N/A
> [RW] Read-write area: 105 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 252 byte(s) free
> End
> Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> Vector table: BAR=0 offset=0007c000
> PBA: BAR=0 offset=0007d000
> Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0
> ExtTag- RBE+
> DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 256 bytes, MaxReadReq 4096 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> Supported
> DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> Capabilities: [100 v0] #00

Here the head of the ecap (extended capability) chain has cap_id==0.
When we boot the L2 guest with the same device, we'll first copy this
cap_id==0 cap; then, when adding the 2nd ecap, we'll probably hit a
problem, since pcie_find_capability_list() will think there is no cap
at all (cap_id==0 is skipped).
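
To make the failure mode concrete, here is a minimal C model of walking
the extended capability chain. It is only a sketch and not the code of
pcie_find_capability_list(): the names cfg_get_long(), cfg_set_ext_cap()
and find_last_ext_cap() are made up for illustration, and the "all-zero
head means empty list" rule is an assumption based on the description
above. The header layout (ID in bits 15:0, version in bits 19:16, next
offset in bits 31:20) follows the PCIe extended capability header format.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SIZE      4096   /* size of PCIe extended config space */
#define EXT_CAP_START 0x100  /* offset of the first extended capability */

/* Read a little-endian 32-bit value from config space. */
uint32_t cfg_get_long(const uint8_t *cfg, uint16_t off)
{
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/* Write an extended capability header (ID, version, next offset) at 'off'. */
void cfg_set_ext_cap(uint8_t *cfg, uint16_t off,
                     uint16_t id, uint8_t ver, uint16_t next)
{
    uint32_t hdr = (uint32_t)id |
                   ((uint32_t)(ver & 0xf) << 16) |
                   ((uint32_t)(next & 0xffc) << 20);

    cfg[off]     = (uint8_t)(hdr & 0xff);
    cfg[off + 1] = (uint8_t)((hdr >> 8) & 0xff);
    cfg[off + 2] = (uint8_t)((hdr >> 16) & 0xff);
    cfg[off + 3] = (uint8_t)((hdr >> 24) & 0xff);
}

/*
 * Return the offset of the last capability in the chain, or 0 when the
 * chain looks empty. Assumption of this sketch: an all-zero head dword is
 * treated as "no extended capabilities at all".
 */
uint16_t find_last_ext_cap(const uint8_t *cfg)
{
    uint16_t prev = 0;
    uint16_t next = EXT_CAP_START;

    if (cfg_get_long(cfg, EXT_CAP_START) == 0) {
        return 0;  /* caller sees prev == 0 and cannot append a new cap */
    }
    while (next != 0) {
        assert(next >= EXT_CAP_START && next <= CFG_SIZE - 8);
        prev = next;
        next = (uint16_t)((cfg_get_long(cfg, next) >> 20) & 0xffc);
    }
    return prev;
}
```

With a head like "[100 v0] #00" and no next pointer, the head dword is
all zeros, so find_last_ext_cap() returns 0. Code that then tries to link
a second capability after "the last one" has nothing to link to, which
matches the prev == 0 condition mentioned earlier in the thread.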

Do you want to try this "hacky patch" to see whether it works for you?

------8<-------
------>8-------

I don't think it's a good solution (it just uses 0xffff instead of 0x0
for the masked cap_id, so that the L2 guest can cooperate with it), but
it should work around this temporarily. I'll try to think of a better
one later and post it when it's ready.
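
As a rough illustration of why the placeholder value matters, the sketch
below packs an extended capability header and checks whether the head
dword would look like an empty list. The helper names ext_cap_header()
and head_looks_empty() are hypothetical, and the "all-zero dword means
empty" check is an assumption of this sketch rather than a statement
about the exact QEMU logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pack an extended capability header: ID in bits 15:0, version in
 * bits 19:16, next offset in bits 31:20 (dword aligned).
 */
uint32_t ext_cap_header(uint16_t id, uint8_t version, uint16_t next)
{
    assert((next & 0x3) == 0);  /* capability offsets are dword aligned */
    return (uint32_t)id |
           ((uint32_t)(version & 0xf) << 16) |
           ((uint32_t)(next & 0xffc) << 20);
}

/* Assumed walker rule: an all-zero head dword means "no capabilities". */
bool head_looks_empty(uint32_t head)
{
    return head == 0;
}
```

Under that assumption, a masked head of ID 0x0, version 0, next 0 packs
to 0x00000000 and looks empty, while the same head with ID 0xffff packs
to 0x0000ffff and keeps the chain visible to the next level of
virtualization.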

(Alex, please comment if you have a better suggestion before
 mine :)

Thanks,

-- peterx

Comments

Jintack Lim Feb. 15, 2017, 10:05 p.m. UTC | #1
On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <peterx@redhat.com> wrote:

> [...]
>
> Here the head of the ecap (extended capability) chain has cap_id==0.
> When we boot the L2 guest with the same device, we'll first copy this
> cap_id==0 cap; then, when adding the 2nd ecap, we'll probably hit a
> problem, since pcie_find_capability_list() will think there is no cap
> at all (cap_id==0 is skipped).
>
> Do you want to try this "hacky patch" to see whether it works for you?
>

Thanks for following this up!

I just tried this, and got a different message this time.

qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available reset mechanism.
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available reset mechanism.


Thanks,
Jintack


Alex Williamson Feb. 15, 2017, 10:50 p.m. UTC | #2
On Wed, 15 Feb 2017 17:05:35 -0500
Jintack Lim <jintack@cs.columbia.edu> wrote:

> [...]
> 
> Thanks for following this up!
> 
> I just tried this, and I got some different message this time.
> 
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.

Possibly very true, it might affect the reliability of the device in
the l2 guest, but shouldn't prevent it from being assigned.  What's the
reset mechanism on the physical device (lspci -vvv from host please).
Thanks,

Alex
Jintack Lim Feb. 15, 2017, 11:25 p.m. UTC | #3
On Wed, Feb 15, 2017 at 5:50 PM, Alex Williamson <alex.williamson@redhat.com> wrote:

> [...]
>
> > I just tried this, and I got some different message this time.
> >
> > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> > reset mechanism.
> > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> > reset mechanism.
>
> Possibly very true, it might affect the reliability of the device in
> the l2 guest, but shouldn't prevent it from being assigned.  What's the
> reset mechanism on the physical device (lspci -vvv from host please).
>

Thanks, Alex.
This is from the host (L0).

08:00.0 Network controller: Mellanox Technologies MT27500 Family
[ConnectX-3]
Subsystem: Mellanox Technologies Device 0050
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 31
Region 0: Memory at d9f00000 (64-bit, non-prefetchable) [disabled] [size=1M]
Region 2: Memory at d5000000 (64-bit, prefetchable) [disabled] [size=8M]
Expansion ROM at d9000000 [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: CX354A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: MCX354A-FCBT
[EC] Engineering changes: A4
[SN] Serial number: MT1346X00624
[V0] Vendor specific: PCIe Gen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 105 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s
unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt-
ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+,
EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-15-51-10
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [18c v1] #19
Kernel driver in use: vfio-pci


Thanks,
Alex Williamson Feb. 16, 2017, 1:17 a.m. UTC | #4
On Wed, 15 Feb 2017 18:25:26 -0500
Jintack Lim <jintack@cs.columbia.edu> wrote:

> [...]
> 
> Thanks, Alex.
> This is from the host (L0).
> 
> 08:00.0 Network controller: Mellanox Technologies MT27500 Family
> [ConnectX-3]
> Subsystem: Mellanox Technologies Device 0050
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin A routed to IRQ 31
> Region 0: Memory at d9f00000 (64-bit, non-prefetchable) [disabled] [size=1M]
> Region 2: Memory at d5000000 (64-bit, prefetchable) [disabled] [size=8M]
> Expansion ROM at d9000000 [disabled] [size=1M]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Does not support reset on D3->D0 transition.

> Capabilities: [60] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-

Does not support PCIe FLR.

No AF capability.  Looks right to me, the only mechanism available to
the host is a bus reset, which isn't available to the VM.  If you were
to configure it downstream of a root port, the VM might think it could
reset the device, but I'm pretty sure it cannot.  Thanks,
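
The checks above can be summarized in a small C sketch. It is an
illustration only: the struct, the enum and pick_reset() are hypothetical
names, and the preference order (FLR, then AF FLR, then PM reset, then
bus reset) is an assumption of this sketch, not necessarily the order the
kernel or vfio uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reset_method {
    RESET_NONE,      /* no usable reset mechanism */
    RESET_PCIE_FLR,  /* PCIe Function Level Reset (DevCap: FLReset+) */
    RESET_AF_FLR,    /* FLR via the Advanced Features capability */
    RESET_PM,        /* reset on D3->D0 transition (PM Status: NoSoftRst-) */
    RESET_BUS        /* secondary bus reset from the upstream bridge */
};

/* What the lspci output tells us about one device. */
struct reset_caps {
    bool pcie_flr;   /* DevCap shows FLReset+ */
    bool af_flr;     /* AF capability present and FLR capable */
    bool pm_reset;   /* PM Status shows NoSoftRst- */
    bool bus_reset;  /* a bus reset is available in this context */
};

/* Pick the first available mechanism in the assumed preference order. */
enum reset_method pick_reset(const struct reset_caps *caps)
{
    assert(caps != NULL);

    if (caps->pcie_flr) {
        return RESET_PCIE_FLR;
    }
    if (caps->af_flr) {
        return RESET_AF_FLR;
    }
    if (caps->pm_reset) {
        return RESET_PM;
    }
    if (caps->bus_reset) {
        return RESET_BUS;
    }
    return RESET_NONE;
}
```

For the ConnectX-3 output above (NoSoftRst+, FLReset-, no AF
capability), every field except bus_reset is false. On the host that
leaves RESET_BUS; inside the VM, where a bus reset is not available, the
result is RESET_NONE, which corresponds to the "no available reset
mechanism" message.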

Alex

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..bacd302 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1925,11 +1925,6 @@  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 
     }
 
-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }