Message ID: 20210818194318.110993-1-peterx@redhat.com
State: New
Series: vl: Prioritize device realizations
On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> QEMU creates -device objects in order as specified by the user's cmdline.
> However that ordering may not be the ideal order. For example, some platform
> devices (vIOMMUs) may want to be created earlier than most of the rest
> of the devices (e.g., vfio-pci, virtio).
>
> This patch orders the QemuOptsList of '-device's so they'll be sorted first
> before kicking off the device realizations. This will allow the device
> realization code to be able to use APIs like pci_device_iommu_address_space()
> correctly, because those functions rely on the platform devices being realized.
>
> For now we rely on vmsd->priority, which is defined as MigrationPriority, to
> provide the ordering, as both VM init and migration completion need such an
> ordering. In the future we can move that priority information out of vmsd.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Can we be 100% sure that changing the ordering of every single device being
created won't affect guest ABI? (I don't think we can)

How many device types in QEMU have non-default vmsd priority?

Can we at least ensure devices with the same priority won't be reordered,
just to be safe? (qsort() doesn't guarantee that)

If very few device types have non-default vmsd priority and devices with the
same priority aren't reordered, the risk of compatibility breakage would be
much smaller.
On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > [patch description snipped]
>
> Can we be 100% sure that changing the ordering of every single
> device being created won't affect guest ABI? (I don't think we can)

That's a good question, however I doubt whether there's any real-world guest
ABI dependency on it. As a developer, I normally specify cmdline parameters in
an ad-hoc way, assuming most parameters are not sensitive to ordering, so I
can tune the ordering as I wish. I'm not sure whether that's common for qemu
users; I would expect so, but I may have missed something that I'm not aware
of.

To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
before "intel-iommu": it was consistently broken before this patchset, while
after this series it will work. It's just that I don't think that "guest ABI"
needs to be kept, and that's exactly what I want to fix with the patchset..

> How many device types in QEMU have non-default vmsd priority?

Not that many; here's the list of priorities and the number of devices using
each:

|--------------------+---------|
| priority           | devices |
|--------------------+---------|
| MIG_PRI_IOMMU      |       3 |
| MIG_PRI_PCI_BUS    |       7 |
| MIG_PRI_VIRTIO_MEM |       1 |
| MIG_PRI_GICV3_ITS  |       1 |
| MIG_PRI_GICV3      |       1 |
|--------------------+---------|

All the remaining devices use the default (0) priority.

> Can we at least ensure devices with the same priority won't be
> reordered, just to be safe? (qsort() doesn't guarantee that)
>
> If very few device types have non-default vmsd priority and
> devices with the same priority aren't reordered, the risk of
> compatibility breakage would be much smaller.

I'm also wondering whether breaking some guest ABI due to this change might
even be a good thing, if it happens.

Imagine something breaks after this is applied. The only reason would be that
qsort() changed the order of some same-priority devices so it's no longer the
order the user specified. But then, wouldn't that also mean there's yet
another ordering requirement that we didn't even notice?

I doubt that will even happen (I'd expect reports already, as the qemu man
page states no requirement on parameter ordering). In any case, instead of
"keeping same-priority devices in the order the user specified", IMHO we
should give the genuinely order-sensitive devices different priorities, so
the ordering is guaranteed by qemu internals rather than by how the user
specified it.

From that point of view, maybe this patchset would do best if accepted and
applied early in a release cycle, so we can figure out what's missing and fix
it within the same release. Again, though, I still doubt any user will break
in a bad way.

Thanks,
On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > [...]
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI? (I don't think we can)
>
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that. As a developer, I normally specify cmdline parameter in an adhoc
> way, so that I assume most parameters are not sensitive to ordering and I can
> tune the ordering as wish. I'm not sure whether that's common for qemu users,
> I would expect so, but I may have missed something that I'm not aware of.

To give just one example:

  $ (echo 'info pci'; echo quit) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
    Bus  0, device   4, function 0:
      Ethernet controller: PCI device 1af4:1000
        PCI subsystem 1af4:0001
        IRQ 0, pin A
        BAR0: I/O at 0xffffffffffffffff [0x001e].
        BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
        BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
        BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
        id ""
    Bus  0, device   5, function 0:
      Ethernet controller: PCI device 8086:10d3
        PCI subsystem 8086:0000
        IRQ 0, pin A
        BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
        BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
        BAR2: I/O at 0xffffffffffffffff [0x001e].
        BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
        BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
        id ""
  (qemu) quit

  $ (echo 'info pci'; echo quit) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
    Bus  0, device   4, function 0:
      Ethernet controller: PCI device 8086:10d3
        PCI subsystem 8086:0000
        IRQ 0, pin A
        BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
        BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
        BAR2: I/O at 0xffffffffffffffff [0x001e].
        BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
        BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
        id ""
    Bus  0, device   5, function 0:
      Ethernet controller: PCI device 1af4:1000
        PCI subsystem 1af4:0001
        IRQ 0, pin A
        BAR0: I/O at 0xffffffffffffffff [0x001e].
        BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
        BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
        BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
        id ""
  (qemu) quit

If the order of the -device arguments changes, the devices are assigned to
different PCI slots.

> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> to be before "intel-iommu": it'll be constantly broken before this patchset,
> while after this series it'll be working. It's just that I don't think those
> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> the patchset..

If the only ordering changes caused by this patch were intentional and
affected only configurations that are known to be broken (like vfio-pci vs
intel-iommu), I would agree. However, if we are reordering every single
-device option in an unspecified way (which is what qsort() does when
elements compare as equal), we are probably breaking guest ABI and creating a
completely different machine (as in the PCI example above).

> > How many device types in QEMU have non-default vmsd priority?
>
> [priority table snipped]
>
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe? (qsort() doesn't guarantee that)
> > [...]
>
> I'm also wondering whether it's a good thing to break some guest ABI due to
> this change, if possible.
>
> Let's imagine something breaks after applied, then the only reason should be
> that qsort() changed the order of some same-priority devices and it's not the
> same as user specified any more. Then, does it also means there's yet another
> ordering requirement that we didn't even notice?

Does the PCI slot assignment example above demonstrate an ordering
requirement? I don't think it does. It just demonstrates that different
orderings are expected to give different results, but both orderings are
valid.

> I doubt whether that'll even happen (or I think there'll be report already, as
> in qemu man page there's no requirement on parameter ordering). In all cases,
> instead of "keeping the same priority devices in the same order as the user has
> specified", IMHO we should make the broken devices to have different priorities
> so the ordering will be guaranteed by qemu internal, rather than how user
> specified it.
>
> [...]
On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> To give just one example:
>
> [full 'info pci' output for the two -device orderings snipped]
>
> If the order of the -device arguments changes, the devices are assigned to
> different PCI slots.

Thanks for the example.

Initially I thought about this and didn't consider it an issue (because
serious users will always specify addr=XXX for -device; I thought libvirt
always does that), but I do remember that a guest OS can identify its
hardware by devfn number, so e.g. nmcli may indeed mix up its config across
this change..

I can use a custom sort to replace qsort() to guarantee that.

Do you have other examples in mind that I may have overlooked, especially
ones I may not be able to fix with a custom sort that only moves priority>=1
devices?

Thanks,
On Mon, Aug 23, 2021 at 05:31:46PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> > [full 'info pci' output snipped]
> >
> > If the order of the -device arguments changes, the devices are assigned to
> > different PCI slots.
>
> Thanks for the example.
>
> [...]
>
> I can use a custom sort to replace qsort() to guarantee that.

You don't have to do that. Simply use the device position on the command
line for comparisons when priority is the same.

> Do you have other examples in mind that I may have overlooked [...]?
>
> Thanks,
On Mon, Aug 23, 2021 at 05:31:46PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> > [full 'info pci' output snipped]
> >
> > If the order of the -device arguments changes, the devices are assigned to
> > different PCI slots.
>
> Thanks for the example.
>
> [...]
>
> I can use a custom sort to replace qsort() to guarantee that.
>
> Do you have other examples in mind that I may have overlooked, especially I may
> not be able to fix by a custom sort with only moving priority>=1 devices?

I don't have any other example, but I assume address assignment based on
ordering is a common pattern in device code.

I would take a very close and careful look at the devices with non-default
vmsd priority. If you can prove that the 13 device types with non-default
priority are all order-insensitive, a custom sort function as you describe
might be safe.
On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > [...]
> > How many device types in QEMU have non-default vmsd priority?
>
> Not so much; here's the list of priorities and the devices using it:
>
> |--------------------+---------|
> | priority           | devices |
> |--------------------+---------|
> | MIG_PRI_IOMMU      |       3 |
> | MIG_PRI_PCI_BUS    |       7 |
> | MIG_PRI_VIRTIO_MEM |       1 |
> | MIG_PRI_GICV3_ITS  |       1 |
> | MIG_PRI_GICV3      |       1 |
> |--------------------+---------|

iommu is probably ok. I think virtio mem is ok too, in that it is normally
created by virtio-mem-pci ...

> All the rest devices are using the default (0) priority.
>
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe? (qsort() doesn't guarantee that)
> > [...]
>
> [...] In all cases,
> instead of "keeping the same priority devices in the same order as the user has
> specified", IMHO we should make the broken devices to have different priorities
> so the ordering will be guaranteed by qemu internal, rather than how user
> specified it.

Well, giving the user control of the guest ABI is a reasonable thing to do;
it is the realize order that users do not really care about.

I guess we could move pci slot allocation out of realize so it does not
depend on realize order?
On Mon, Aug 23, 2021 at 06:05:07PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> > [...]
> > |--------------------+---------|
> > | priority           | devices |
> > |--------------------+---------|
> > | MIG_PRI_IOMMU      |       3 |
> > | MIG_PRI_PCI_BUS    |       7 |
> > | MIG_PRI_VIRTIO_MEM |       1 |
> > | MIG_PRI_GICV3_ITS  |       1 |
> > | MIG_PRI_GICV3      |       1 |
> > |--------------------+---------|
>
> iommu is probably ok. I think virtio mem is ok too,
> in that it is normally created by virtio-mem-pci ...

Hmm, this reminded me that virtio-mem-pci could get a different devfn
allocated after being moved.. But frankly I still doubt whether we should
guarantee that guest ABI when the user does not specify addr=XXX in pci
device parameters - I feel like it's a burden we don't need to carry. (Btw,
trying to keep the order is one thing; declaring it guest ABI would be
another thing to me.)

> > [...]
> > Let's imagine something breaks after applied, then the only reason should be
> > that qsort() changed the order of some same-priority devices and it's not the
> > same as user specified any more. Then, does it also means there's yet another
> > ordering requirement that we didn't even notice?
> > [...]
>
> Well giving user control of guest ABI is a reasonable thing to do,
> it is realize order that users do not really care about.

Makes sense.

> I guess we could move pci slot allocation out of realize
> so it does not depend on realize order?

Yes, that sounds like another approach, but it seems to require more changes.

Thanks,
On Mon, Aug 23, 2021 at 05:54:44PM -0400, Michael S. Tsirkin wrote:
> > I can use a custom sort to replace qsort() to guarantee that.
>
> You don't have to do that. Simply use the device position on the command
> line for comparisons when priority is the same.

Indeed. :)

Thanks,
On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> I don't have any other example, but I assume address assignment
> based on ordering is a common pattern in device code.
>
> I would take a very close and careful look at the devices with
> non-default vmsd priority. If you can prove that the 13 device
> types with non-default priority are all order-insensitive, a
> custom sort function as you describe might be safe.

Besides virtio-mem-pci, there will also be a similar devfn issue with all the
MIG_PRI_PCI_BUS devices, as they are allocated just like other pci devices.
Say, the two cmdlines below will generate different pci topologies too:

  $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
                       -device pcie-root-port,chassis=1 \
                       -device virtio-net-pci

And:

  $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
                       -device virtio-net-pci \
                       -device pcie-root-port,chassis=1

This cannot be solved by keeping the priority==0 ordering.

On second thought, I think I was initially wrong to treat migration priority
and device realization as the same problem. For example, for live migration
we require the PCI_BUS state to be migrated earlier than MIG_PRI_IOMMU
because the IOMMU relies on the bus numbers to find its address spaces.
However, that is definitely not a requirement for device realization; say,
realizing the vIOMMU after the pci buses is fine (bus numbers are assigned
during bios). I've probably conflated the two ideas (though they really look
alike!). Sorry about that.

Since the only ordering constraint so far is vIOMMUs before all the rest of
the devices, I'll introduce a new priority mechanism and only make sure
vIOMMUs are realized earlier. That will also avoid other implications for
pci devfn allocation.

Will rework a new version tomorrow. Thanks a lot for all the comments,
On Tue, Aug 24, 2021 at 3:18 AM Peter Xu <peterx@redhat.com> wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > [...]
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI? (I don't think we can)
>
> [...]
>
> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> to be before "intel-iommu": it'll be constantly broken before this patchset,
> while after this series it'll be working. It's just that I don't think those
> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> the patchset..

Yes, and I wonder whether, if we limit this to new machine types, we would
not even need to care about the ABI stuff.

Thanks

> [rest of the quoted mail snipped]
On Tue, Aug 24, 2021 at 6:37 AM Peter Xu <peterx@redhat.com> wrote: > > On Mon, Aug 23, 2021 at 06:05:07PM -0400, Michael S. Tsirkin wrote: > > On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote: > > > On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote: > > > > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote: > > > > > QEMU creates -device objects in order as specified by the user's cmdline. > > > > > However that ordering may not be the ideal order. For example, some platform > > > > > devices (vIOMMUs) may want to be created earlier than most of the rest > > > > > devices (e.g., vfio-pci, virtio). > > > > > > > > > > This patch orders the QemuOptsList of '-device's so they'll be sorted first > > > > > before kicking off the device realizations. This will allow the device > > > > > realization code to be able to use APIs like pci_device_iommu_address_space() > > > > > correctly, because those functions rely on the platfrom devices being realized. > > > > > > > > > > Now we rely on vmsd->priority which is defined as MigrationPriority to provide > > > > > the ordering, as either VM init and migration completes will need such an > > > > > ordering. In the future we can move that priority information out of vmsd. > > > > > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > > > > > Can we be 100% sure that changing the ordering of every single > > > > device being created won't affect guest ABI? (I don't think we can) > > > > > > That's a good question, however I doubt whether there's any real-world guest > > > ABI for that. As a developer, I normally specify cmdline parameter in an adhoc > > > way, so that I assume most parameters are not sensitive to ordering and I can > > > tune the ordering as wish. I'm not sure whether that's common for qemu users, > > > I would expect so, but I may have missed something that I'm not aware of. > > > > > > Per my knowledge the only "guest ABI" change is e.g. 
when we specify "vfio-pci" > > > to be before "intel-iommu": it'll be constantly broken before this patchset, > > > while after this series it'll be working. It's just that I don't think those > > > "guest ABI" is necessary to be kept, and that's exactly what I want to fix with > > > the patchset.. > > > > > > > > > > > How many device types in QEMU have non-default vmsd priority? > > > > > > Not so much; here's the list of priorities and the devices using it: > > > > > > |--------------------+---------| > > > | priority | devices | > > > |--------------------+---------| > > > | MIG_PRI_IOMMU | 3 | > > > | MIG_PRI_PCI_BUS | 7 | > > > | MIG_PRI_VIRTIO_MEM | 1 | > > > | MIG_PRI_GICV3_ITS | 1 | > > > | MIG_PRI_GICV3 | 1 | > > > |--------------------+---------| > > > > iommu is probably ok. I think virtio mem is ok too, > > in that it is normally created by virtio-mem-pci ... > > Hmm this reminded me whether virtio-mem-pci could have another devfn allocated > after being moved.. > > But frankly I still doubt whether we should guarantee that guest ABI on user > not specifying addr=XXX in pci device parameters - I feel like it's a burden > that we don't need to carry. > > (Btw, trying to keep the order is one thing; declare it guest ABI would be > another thing to me) > > > > > > > > > > All the rest devices are using the default (0) priority. > > > > > > > > > > > Can we at least ensure devices with the same priority won't be > > > > reordered, just to be safe? (qsort() doesn't guarantee that) > > > > > > > > If very few device types have non-default vmsd priority and > > > > devices with the same priority aren't reordered, the risk of > > > > compatibility breakage would be much smaller. > > > > > > I'm also wondering whether it's a good thing to break some guest ABI due to > > > this change, if possible. 
> > > > > > Let's imagine something breaks after applied, then the only reason should be > > > that qsort() changed the order of some same-priority devices and it's not the > > > same as user specified any more. Then, does it also means there's yet another > > > ordering requirement that we didn't even notice? > > > > > > I doubt whether that'll even happen (or I think there'll be report already, as > > > in qemu man page there's no requirement on parameter ordering). In all cases, > > > instead of "keeping the same priority devices in the same order as the user has > > > specified", IMHO we should make the broken devices to have different priorities > > > so the ordering will be guaranteed by qemu internal, rather than how user > > > specified it. > > > > Well giving user control of guest ABI is a reasonable thing to do, > > it is realize order that users do not really care about. > > Makes sense. > > > > > I guess we could move pci slot allocation out of realize > > so it does not depend on realize order? > > Yes that sounds like another approach, but it seems to require more changes. It looks to me this doesn't solve the issue of using virtio-mmio with vhost? Thanks > > Thanks, > > -- > Peter Xu >
On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
> It looks to me this doesn't solve the issue of using virtio-mmio with vhost?
There's no IOMMU support for any of the MMIO devices, right? Or am I wrong? Thanks,
On 24.08.21 00:05, Michael S. Tsirkin wrote: > On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote: >> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote: >>> On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote: >>>> QEMU creates -device objects in order as specified by the user's cmdline. >>>> However that ordering may not be the ideal order. For example, some platform >>>> devices (vIOMMUs) may want to be created earlier than most of the rest >>>> devices (e.g., vfio-pci, virtio). >>>> >>>> This patch orders the QemuOptsList of '-device's so they'll be sorted first >>>> before kicking off the device realizations. This will allow the device >>>> realization code to be able to use APIs like pci_device_iommu_address_space() >>>> correctly, because those functions rely on the platfrom devices being realized. >>>> >>>> Now we rely on vmsd->priority which is defined as MigrationPriority to provide >>>> the ordering, as either VM init and migration completes will need such an >>>> ordering. In the future we can move that priority information out of vmsd. >>>> >>>> Signed-off-by: Peter Xu <peterx@redhat.com> >>> >>> Can we be 100% sure that changing the ordering of every single >>> device being created won't affect guest ABI? (I don't think we can) >> >> That's a good question, however I doubt whether there's any real-world guest >> ABI for that. As a developer, I normally specify cmdline parameter in an adhoc >> way, so that I assume most parameters are not sensitive to ordering and I can >> tune the ordering as wish. I'm not sure whether that's common for qemu users, >> I would expect so, but I may have missed something that I'm not aware of. >> >> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci" >> to be before "intel-iommu": it'll be constantly broken before this patchset, >> while after this series it'll be working. 
It's just that I don't think those >> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with >> the patchset.. >> >>> >>> How many device types in QEMU have non-default vmsd priority? >> >> Not so much; here's the list of priorities and the devices using it: >> >> |--------------------+---------| >> | priority | devices | >> |--------------------+---------| >> | MIG_PRI_IOMMU | 3 | >> | MIG_PRI_PCI_BUS | 7 | >> | MIG_PRI_VIRTIO_MEM | 1 | >> | MIG_PRI_GICV3_ITS | 1 | >> | MIG_PRI_GICV3 | 1 | >> |--------------------+---------| > > iommu is probably ok. I think virtio mem is ok too, > in that it is normally created by virtio-mem-pci ... IIRC: intel-iommu has to be created on the QEMU cmdline before creating virtio-mem-pci. -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \ ... -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,... Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device state has to be migrated before migrating intel-iommu state. I do wonder if migration priorities are really what we want to reuse here. I guess it works right, but just by pure luck (because we ignore the implicit dependency regarding priorities)
On Tue, Aug 24, 2021 at 06:24:27PM +0200, David Hildenbrand wrote: > > > Not so much; here's the list of priorities and the devices using it: > > > > > > |--------------------+---------| > > > | priority | devices | > > > |--------------------+---------| > > > | MIG_PRI_IOMMU | 3 | > > > | MIG_PRI_PCI_BUS | 7 | > > > | MIG_PRI_VIRTIO_MEM | 1 | > > > | MIG_PRI_GICV3_ITS | 1 | > > > | MIG_PRI_GICV3 | 1 | > > > |--------------------+---------| > > > > iommu is probably ok. I think virtio mem is ok too, > > in that it is normally created by virtio-mem-pci ... > > IIRC: > > intel-iommu has to be created on the QEMU cmdline before creating > virtio-mem-pci. > > -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \ > ... > -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,... > > Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device > state has to be migrated before migrating intel-iommu state. Since we're at it.. frankly I didn't fully digest why virtio-mem needs to be migrated before when reading commit 0fd7616e0f1171b. It gives me the feeling more like that virtio-mem has a ordering dependency with vfio-pci not virtio-mem, but I could be wrong. E.g., the IOMMU unit shouldn't be walking page table if without a device using it, then it won't fail until the device walks it in region_add() hooks when memory replay happens. > > I do wonder if migration priorities are really what we want to reuse here. I > guess it works right, but just by pure luck (because we ignore the implicit > dependency regarding priorities) Yep, that's probably not what we want. I'll make it a new list of priorities as I've stated in the other thread. Thanks,
On Tue, Aug 24, 2021 at 11:50 PM Peter Xu <peterx@redhat.com> wrote: > > On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote: > > It looks to me this doesn't solve the issue of using virtio-mmio with vhost? > > No IOMMU supported for any of the MMIO devices, right? Or am I wrong? Thanks, I guess virtio IOMMU has that support, but I might be wrong. Thanks > > -- > Peter Xu >
On 24.08.21 21:52, Peter Xu wrote: > On Tue, Aug 24, 2021 at 06:24:27PM +0200, David Hildenbrand wrote: >>>> Not so much; here's the list of priorities and the devices using it: >>>> >>>> |--------------------+---------| >>>> | priority | devices | >>>> |--------------------+---------| >>>> | MIG_PRI_IOMMU | 3 | >>>> | MIG_PRI_PCI_BUS | 7 | >>>> | MIG_PRI_VIRTIO_MEM | 1 | >>>> | MIG_PRI_GICV3_ITS | 1 | >>>> | MIG_PRI_GICV3 | 1 | >>>> |--------------------+---------| >>> >>> iommu is probably ok. I think virtio mem is ok too, >>> in that it is normally created by virtio-mem-pci ... >> >> IIRC: >> >> intel-iommu has to be created on the QEMU cmdline before creating >> virtio-mem-pci. >> >> -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \ >> ... >> -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,... >> >> Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device >> state has to be migrated before migrating intel-iommu state. > > Since we're at it.. frankly I didn't fully digest why virtio-mem needs to be > migrated before when reading commit 0fd7616e0f1171b. It gives me the feeling > more like that virtio-mem has a ordering dependency with vfio-pci not > virtio-mem, but I could be wrong. "We have to take care of incoming migration: at the point the IOMMUs get restored and start creating mappings in vfio, RamDiscardManager implementations might not be back up and running yet: let's add runstate priorities to enforce the order when restoring. s/runstate/vmstate/ :| I recall that we can see IOMMU_NOTIFIER_MAP events when restoring an IOMMU device. vfio_get_xlat_addr() would be called and could reject mappings targeting virtio-mem memory in case the virtio-mem device has not restored its bitmap from the migration stream such that ram_discard_manager_is_populated() would be reliable. Consequently, we have to restore the virtio-mem device state (not the virtio-mem-pci device state!) 
before restoring an IOMMU. > > E.g., the IOMMU unit shouldn't be walking page table if without a device using > it, then it won't fail until the device walks it in region_add() hooks when > memory replay happens. I recall it happened when switching to the iommu region e.g., in vtd_post_load()->vtd_switch_address_space(). But I forgot the exact call path.
Peter Xu <peterx@redhat.com> writes: > On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote: >> I don't have any other example, but I assume address assignment >> based on ordering is a common pattern in device code. >> >> I would take a very close and careful look at the devices with >> non-default vmsd priority. If you can prove that the 13 device >> types with non-default priority are all order-insensitive, a >> custom sort function as you describe might be safe. > > Besides virtio-mem-pci, there'll also similar devfn issue with all > MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices. Say, > below two cmdlines will generate different pci topology too: > > $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ > -device pcie-root-port,chassis=1 \ > -device virtio-net-pci > > And: > > $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ > -device virtio-net-pci > -device pcie-root-port,chassis=1 \ > > This cannot be solved by keeping priority==0 ordering. > > After a second thought, I think I was initially wrong on seeing migration > priority and device realization the same problem. > > For example, for live migration we have a requirement on PCI_BUS being migrated > earlier than MIG_PRI_IOMMU because there's bus number information required > because IOMMU relies on the bus number to find address spaces. However that's > definitely not a requirement for device realizations, say, realizing vIOMMU > after pci buses are fine (bus assigned during bios). > > I've probably messed up with the ideas (though they really look alike!). Sorry > about that. > > Since the only ordering constraint so far is IOMMU vs all the rest of devices, > I'll introduce a new priority mechanism and only make sure vIOMMUs are realized > earlier. That'll also avoid other implications on pci devfn allocations. > > Will rework a new version tomorrow. 
Thanks a lot for all the comments, Is it really a good idea to magically reorder device realization just to make a non-working command line work? Why can't we just fail the non-working command line in a way that tells users how to get a working one? We have way too much ordering magic already... If we decide we want more magic, then I'd argue for *dependencies* instead of priorities. Dependencies are specific and local: $this needs to go after $that because $reasons. Priorities are unspecific and global.
Markus Armbruster <armbru@redhat.com> writes: > Peter Xu <peterx@redhat.com> writes: > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote: >>> I don't have any other example, but I assume address assignment >>> based on ordering is a common pattern in device code. >>> >>> I would take a very close and careful look at the devices with >>> non-default vmsd priority. If you can prove that the 13 device >>> types with non-default priority are all order-insensitive, a >>> custom sort function as you describe might be safe. >> >> Besides virtio-mem-pci, there'll also similar devfn issue with all >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices. Say, >> below two cmdlines will generate different pci topology too: >> >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ >> -device pcie-root-port,chassis=1 \ >> -device virtio-net-pci >> >> And: >> >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ >> -device virtio-net-pci >> -device pcie-root-port,chassis=1 \ >> >> This cannot be solved by keeping priority==0 ordering. >> >> After a second thought, I think I was initially wrong on seeing migration >> priority and device realization the same problem. >> >> For example, for live migration we have a requirement on PCI_BUS being migrated >> earlier than MIG_PRI_IOMMU because there's bus number information required >> because IOMMU relies on the bus number to find address spaces. However that's >> definitely not a requirement for device realizations, say, realizing vIOMMU >> after pci buses are fine (bus assigned during bios). >> >> I've probably messed up with the ideas (though they really look alike!). Sorry >> about that. >> >> Since the only ordering constraint so far is IOMMU vs all the rest of devices, >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized >> earlier. That'll also avoid other implications on pci devfn allocations. >> >> Will rework a new version tomorrow. 
Thanks a lot for all the comments, > > Is it really a good idea to magically reorder device realization just to > make a non-working command line work? Why can't we just fail the > non-working command line in a way that tells users how to get a working > one? We have way too much ordering magic already... > > If we decide we want more magic, then I'd argue for *dependencies* > instead of priorities. Dependencies are specific and local: $this needs > to go after $that because $reasons. Priorities are unspecific and > global. Having thought about this a bit more... Constraints on realize order are nothing new. For instance, when a device plugs into a bus, it needs to be realized after the device providing the bus. We ensure this by having the device refer to the bus, e.g. bus=pci.0. The reference may be implicit, but it's there. It must resolve for device creation to succeed, and if it resolves, the device providing the bus will be realized in time. I believe what's getting us into trouble with IOMMU is not having such a reference. Or in other words, keeping the dependence between the IOMMU and the devices relying on it *implicit*, and thus hidden from the existing realize-ordering machinery. Instead of inventing another such machinery, let's try to use the one we already have.
On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: > Markus Armbruster <armbru@redhat.com> writes: > > > Peter Xu <peterx@redhat.com> writes: > > > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote: > >>> I don't have any other example, but I assume address assignment > >>> based on ordering is a common pattern in device code. > >>> > >>> I would take a very close and careful look at the devices with > >>> non-default vmsd priority. If you can prove that the 13 device > >>> types with non-default priority are all order-insensitive, a > >>> custom sort function as you describe might be safe. > >> > >> Besides virtio-mem-pci, there'll also similar devfn issue with all > >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices. Say, > >> below two cmdlines will generate different pci topology too: > >> > >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ > >> -device pcie-root-port,chassis=1 \ > >> -device virtio-net-pci > >> > >> And: > >> > >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ > >> -device virtio-net-pci > >> -device pcie-root-port,chassis=1 \ > >> > >> This cannot be solved by keeping priority==0 ordering. > >> > >> After a second thought, I think I was initially wrong on seeing migration > >> priority and device realization the same problem. > >> > >> For example, for live migration we have a requirement on PCI_BUS being migrated > >> earlier than MIG_PRI_IOMMU because there's bus number information required > >> because IOMMU relies on the bus number to find address spaces. However that's > >> definitely not a requirement for device realizations, say, realizing vIOMMU > >> after pci buses are fine (bus assigned during bios). > >> > >> I've probably messed up with the ideas (though they really look alike!). Sorry > >> about that. 
> >> > >> Since the only ordering constraint so far is IOMMU vs all the rest of devices, > >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized > >> earlier. That'll also avoid other implications on pci devfn allocations. > >> > >> Will rework a new version tomorrow. Thanks a lot for all the comments, > > > > Is it really a good idea to magically reorder device realization just to > > make a non-working command line work? Why can't we just fail the > > non-working command line in a way that tells users how to get a working > > one? We have way too much ordering magic already... > > > > If we decide we want more magic, then I'd argue for *dependencies* > > instead of priorities. Dependencies are specific and local: $this needs > > to go after $that because $reasons. Priorities are unspecific and > > global. > > Having thought about this a bit more... > > Constraints on realize order are nothing new. For instance, when a > device plugs into a bus, it needs to be realized after the device > providing the bus. > > We ensure this by having the device refer to the bus, e.g. bus=pci.0. > The reference may be implicit, but it's there. It must resolve for > device creation to succeed, and if it resolves, the device providing the > bus will be realized in time. > > I believe what's getting us into trouble with IOMMU is not having such a > reference. Or in other words, keeping the dependence between the IOMMU > and the devices relying on it *implicit*, and thus hidden from the > existing realize-ordering machinery. > > Instead of inventing another such machinery, let's try to use the one we > already have. Hmm... I just found that we don't have such machinery, do we? 
This does not really work: $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \ -device pcie-root-port,id=pcie.1,bus=pcie.0 qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found While this will: $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \ -device virtio-net-pci,bus=pcie.1 Thanks,
On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote: > On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: > > Markus Armbruster <armbru@redhat.com> writes: > > > > > Peter Xu <peterx@redhat.com> writes: > > > > > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote: > > >>> I don't have any other example, but I assume address assignment > > >>> based on ordering is a common pattern in device code. > > >>> > > >>> I would take a very close and careful look at the devices with > > >>> non-default vmsd priority. If you can prove that the 13 device > > >>> types with non-default priority are all order-insensitive, a > > >>> custom sort function as you describe might be safe. > > >> > > >> Besides virtio-mem-pci, there'll also similar devfn issue with all > > >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices. Say, > > >> below two cmdlines will generate different pci topology too: > > >> > > >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ > > >> -device pcie-root-port,chassis=1 \ > > >> -device virtio-net-pci > > >> > > >> And: > > >> > > >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ > > >> -device virtio-net-pci > > >> -device pcie-root-port,chassis=1 \ > > >> > > >> This cannot be solved by keeping priority==0 ordering. > > >> > > >> After a second thought, I think I was initially wrong on seeing migration > > >> priority and device realization the same problem. > > >> > > >> For example, for live migration we have a requirement on PCI_BUS being migrated > > >> earlier than MIG_PRI_IOMMU because there's bus number information required > > >> because IOMMU relies on the bus number to find address spaces. However that's > > >> definitely not a requirement for device realizations, say, realizing vIOMMU > > >> after pci buses are fine (bus assigned during bios). > > >> > > >> I've probably messed up with the ideas (though they really look alike!). Sorry > > >> about that. 
> > >> > > >> Since the only ordering constraint so far is IOMMU vs all the rest of devices, > > >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized > > >> earlier. That'll also avoid other implications on pci devfn allocations. > > >> > > >> Will rework a new version tomorrow. Thanks a lot for all the comments, > > > > > > Is it really a good idea to magically reorder device realization just to > > > make a non-working command line work? Why can't we just fail the > > > non-working command line in a way that tells users how to get a working > > > one? We have way too much ordering magic already... > > > > > > If we decide we want more magic, then I'd argue for *dependencies* > > > instead of priorities. Dependencies are specific and local: $this needs > > > to go after $that because $reasons. Priorities are unspecific and > > > global. > > > > Having thought about this a bit more... > > > > Constraints on realize order are nothing new. For instance, when a > > device plugs into a bus, it needs to be realized after the device > > providing the bus. > > > > We ensure this by having the device refer to the bus, e.g. bus=pci.0. > > The reference may be implicit, but it's there. It must resolve for > > device creation to succeed, and if it resolves, the device providing the > > bus will be realized in time. > > > > I believe what's getting us into trouble with IOMMU is not having such a > > reference. Or in other words, keeping the dependence between the IOMMU > > and the devices relying on it *implicit*, and thus hidden from the > > existing realize-ordering machinery. [1] > > > > Instead of inventing another such machinery, let's try to use the one we > > already have. > > Hmm... I just found that we don't have such machinery, do we? 
> > This does not really work: > > $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \ > -device pcie-root-port,id=pcie.1,bus=pcie.0 > qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found > > While this will: > > $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \ > -device virtio-net-pci,bus=pcie.1 I think I fully agree with [1]: the iommu is special in that its reference is not only implicit, but also has no default value. Pci buses have default values (the root pci bus; e.g. pcie.0 on q35), so they're actually easier. When parsing the "-device" entry for a pci device, we know for sure which pci bus the device is on, either specified or just the default; we don't need to look at the rest of the "-device"s to be sure of it. We just try to look up the pci bus; if it's there we continue, otherwise we abort. But the IOMMU is different: a device can run without a vIOMMU (in which case there's no dependency), but when a vIOMMU is specified, the dependency appears. That's why I am not sure whether the old machinery and the "early failure" solution would work trivially for IOMMUs. We could add an "-iommu" parameter, then simply parse it before "-device". However I forgot why Marcel (IIRC) made it a "-device" parameter; also, "-iommu" is not ideal in that the IOMMU unit is indeed implemented as a device on the archs I'm aware of, so it would be extra glue that just works around the ordering problem we have right now. Meanwhile that solution won't help the ordering issue between pci buses and devices. That's why I still think the idea of having a global priority for device realization (or describing device dependencies) makes sense. Alongside fixing the IOMMU, we could actually fix the pci bus case too, so that a pci bus can be specified later than a pci device that belongs to it. IMHO a list of global priorities (maybe a realize_priority field in the device class) is just a simpler version of device dependencies. 
Say, at least we don't need to worry about cycle-dependency issues. So far the ordering requirements are still simple, so defining the priorities globally should fix most of the issues already, in a straightforward way and with fewer LOCs. If it gets complicated one day, we can always switch the global priority list to device dependencies easily, because the list is not guest ABI. Any further thoughts would be greatly welcomed. Thanks,
Peter Xu <peterx@redhat.com> writes: > On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: >> Markus Armbruster <armbru@redhat.com> writes: >> >> > Peter Xu <peterx@redhat.com> writes: >> > >> >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote: >> >>> I don't have any other example, but I assume address assignment >> >>> based on ordering is a common pattern in device code. >> >>> >> >>> I would take a very close and careful look at the devices with >> >>> non-default vmsd priority. If you can prove that the 13 device >> >>> types with non-default priority are all order-insensitive, a >> >>> custom sort function as you describe might be safe. >> >> >> >> Besides virtio-mem-pci, there'll also similar devfn issue with all >> >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices. Say, >> >> below two cmdlines will generate different pci topology too: >> >> >> >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ >> >> -device pcie-root-port,chassis=1 \ >> >> -device virtio-net-pci >> >> >> >> And: >> >> >> >> $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \ >> >> -device virtio-net-pci >> >> -device pcie-root-port,chassis=1 \ >> >> >> >> This cannot be solved by keeping priority==0 ordering. >> >> >> >> After a second thought, I think I was initially wrong on seeing migration >> >> priority and device realization the same problem. >> >> >> >> For example, for live migration we have a requirement on PCI_BUS being migrated >> >> earlier than MIG_PRI_IOMMU because there's bus number information required >> >> because IOMMU relies on the bus number to find address spaces. However that's >> >> definitely not a requirement for device realizations, say, realizing vIOMMU >> >> after pci buses are fine (bus assigned during bios). >> >> >> >> I've probably messed up with the ideas (though they really look alike!). Sorry >> >> about that. 
>> >> >> >> Since the only ordering constraint so far is IOMMU vs all the rest of devices, >> >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized >> >> earlier. That'll also avoid other implications on pci devfn allocations. >> >> >> >> Will rework a new version tomorrow. Thanks a lot for all the comments, >> > >> > Is it really a good idea to magically reorder device realization just to >> > make a non-working command line work? Why can't we just fail the >> > non-working command line in a way that tells users how to get a working >> > one? We have way too much ordering magic already... >> > >> > If we decide we want more magic, then I'd argue for *dependencies* >> > instead of priorities. Dependencies are specific and local: $this needs >> > to go after $that because $reasons. Priorities are unspecific and >> > global. >> >> Having thought about this a bit more... >> >> Constraints on realize order are nothing new. For instance, when a >> device plugs into a bus, it needs to be realized after the device >> providing the bus. >> >> We ensure this by having the device refer to the bus, e.g. bus=pci.0. >> The reference may be implicit, but it's there. It must resolve for >> device creation to succeed, and if it resolves, the device providing the >> bus will be realized in time. >> >> I believe what's getting us into trouble with IOMMU is not having such a >> reference. Or in other words, keeping the dependence between the IOMMU >> and the devices relying on it *implicit*, and thus hidden from the >> existing realize-ordering machinery. >> >> Instead of inventing another such machinery, let's try to use the one we >> already have. > > Hmm... I just found that we don't have such machinery, do we? 
> > This does not really work: > > $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \ > -device pcie-root-port,id=pcie.1,bus=pcie.0 > qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found > > While this will: > > $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \ > -device virtio-net-pci,bus=pcie.1 This is exactly what I described. bus=pcie.0 is the explicit reference. It must resolve for device creation to succeed, and if it resolves, the device providing the bus will be realized in time. It resolves in the second example, but not the first. Look ma, no magic! Instead, stupid & predictable.
Peter Xu <peterx@redhat.com> writes: > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote: >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: >> > Having thought about this a bit more... >> > >> > Constraints on realize order are nothing new. For instance, when a >> > device plugs into a bus, it needs to be realized after the device >> > providing the bus. >> > >> > We ensure this by having the device refer to the bus, e.g. bus=pci.0. >> > The reference may be implicit, but it's there. It must resolve for >> > device creation to succeed, and if it resolves, the device providing the >> > bus will be realized in time. >> > >> > I believe what's getting us into trouble with IOMMU is not having such a >> > reference. Or in other words, keeping the dependence between the IOMMU >> > and the devices relying on it *implicit*, and thus hidden from the >> > existing realize-ordering machinery. > > [1] > >> > >> > Instead of inventing another such machinery, let's try to use the one we >> > already have. >> >> Hmm... I just found that we don't have such machinery, do we? >> >> This does not really work: >> >> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \ >> -device pcie-root-port,id=pcie.1,bus=pcie.0 >> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found >> >> While this will: >> >> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \ >> -device virtio-net-pci,bus=pcie.1 > > I think I fully agree at [1], the iommu is special in that it's not only > implicit, but also does not have a default value. Pci buses have default > values (the root pci bus; e.g. pcie.0 on q35), so it's actually easier. To be precise: when device property "bus" is absent, and the device plugs into a bus, QEMU picks a suitable bus. If it can't, device creation fails. It can't when no bus of the required type exists, or the existing buses are all full. Sometimes, the board provides a bus of the required type. 
If not, you have to plug one, and you have to do it before you refer to it, whether the reference is explicit (bus=...) or implicit (bus absent). Example 1: no bus of the required type $ qemu-system-x86_64 -S -display none -device usb-mouse qemu-system-x86_64: -device usb-mouse: No 'usb-bus' bus found for device 'usb-mouse' Example 2: no bus of the required type, yet $ qemu-system-x86_64 -S -display none -device usb-mouse -device qemu-xhci qemu-system-x86_64: -device usb-mouse: No 'usb-bus' bus found for device 'usb-mouse' Example 3: reordered to the bus is there in time $ qemu-system-x86_64 -S -display none -device qemu-xhci -device usb-mouse Example 4: got a bus, but it's full $ qemu-system-x86_64 -S -display none -monitor stdio -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000,id=noslot QEMU 6.1.0 monitor - type 'help' for more information (qemu) qemu-system-x86_64: -device e1000,id=noslot: PCI: no slot/function available for e1000, all in use or reserved > When parsing the "-device" entry for a pci device, we'll 100% sure what's the > pci bus for the device, either specified or it's just the default. We don't > look at the rest of "-device"s to be sure of it. We just try to look up the > pci bus, if it's there we continue, otherwise we abort. Yes. It's a mandatory dependency, optionally explicit. > But IOMMU is different, the device can run without a vIOMMU (in which case > there's no dependency), but when vIOMMU is specified there's the dependency. > > That's why I am not sure whether the old machinery and "early failure" solution > would work trivially for IOMMUs. Yes. It's an optional, implicit dependency. 
> We can add an "-iommu" parameter, then we simply parse it before "-device". > However I forgot why Marcel (IIRC) made it a "-device" parameter, also "-iommu" > is not ideal in that the IOMMU unit is indeed implemented as a device for the > archs I'm aware of, so it's kind of some extra glue that seems to just work > around the ordering problem we have right now. Meanwhile that solution won't > help the ordering issue of pci bus/device. The "device providing bus before device plugging into bus" dependency is not actually a problem: -device and device_add ensure by design it's satisfied. As discussed before, there are two workable ways to process the command line: strictly left to right (leave ordering to the user), and "do the right thing" (order of options doesn't matter). Of course, we do neither. We kind of try to do the right thing, by adding special cases whenever we get bitten. Order doesn't matter, except when it does, and things work, except when they don't. Reordering your command line may or may not get it to work. Fails the basic interface taste test: would explaining it in plain English be impractical and/or embarrassing? How to best get out of this self-dug hole isn't obvious. Switching to strictly left to right will break some command lines. Making order truly not matter looks hard, because the dependencies are complex and not well understood. But this isn't the problem at hand here. It's whether to add more magic, or do without. I'm wary about piling more magic onto a heap of magic that's already wobbling under its weight. So, can we do without? The issue, as I understand it, is that certain devices need to know at realize time whether the machine has an IOMMU. When you add an IOMMU with -device, you morph the machine from one that doesn't have one into one that does. Therefore, it needs to be added before any of the devices that need to know whether the machine has one. 
> That's why I still think the idea of having a global priority for device > realization (or describe the dependency of devices) makes sense. The proposed solution is to magically reorder -device so that IOMMU get created before devices that need to know. This solution falls apart when we use QMP device_add instead of -device, because we can't reorder QMP commands. Right now we can't use device_add for everything, but that's fixable. Maybe it's been fixed already. To keep the proposed solution working, we'd need even more magic. > We can > actually fix both IOMMU and pci bus so we can allow pci bus to be put latter > than the pci device that belongs to the bus alongside of fixing the IOMMU. To "fix" the bus issue, we'd need to delay resolution of property bus until realize time. This might break .instance_init() methods. Of which we have several hundred. > IMHO a list of global priority (maybe a realize_priority field in the class of > devices) is just a simpler version of device dependencies. Say, at least we > don't need to worry about cycle-dependency issues. So far the ordering > requirement is still simple, so globally define them should fix most of the > issues already and in a straightforward way, also less LOCs. If it goes > complicated one day, we can always switch the global priority list into device > dependency easily, because that's not guest ABI. > > Any further thoughts will be greatly welcomed. I'd like to propose a more modest solution: detect the problem and fail. A simple state machine can track "has IOMMU" state. It has three states "no so far", "yes", and "no", and two events "add IOMMU" and "add device that needs to know". State diagram: no so far +--- (start state) ---+ | | add IOMMU | | add device that | | needs to know v v +--> yes no <--+ | | add device that | | +-----+ needs to know +-----+ "Add IOMMU" in state "no" is an error. "Add IOMMU" in state "yes" stays in state "yes" if multiple IOMMU make sense. Else it's an error. 
The state machine could be owned by the machine type.
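A minimal C sketch of the three-state tracker described above (illustrative only; `IommuState` and `iommu_track` are hypothetical names, not actual QEMU code, and it assumes multiple IOMMUs are allowed):

```c
#include <assert.h>

/* Sketch of the "has IOMMU" state machine from the diagram above. */
typedef enum {
    IOMMU_UNDECIDED,   /* "no so far": nothing has committed us yet */
    IOMMU_YES,
    IOMMU_NO
} IommuState;

/* Returns 0 on success, -1 if the event is an error in this state.
 * "Add IOMMU" in state "yes" stays in "yes", assuming multiple IOMMUs
 * make sense. */
int iommu_track(IommuState *s, int adding_iommu)
{
    if (adding_iommu) {
        if (*s == IOMMU_NO) {
            return -1;     /* a device already realized assuming no IOMMU */
        }
        *s = IOMMU_YES;
        return 0;
    }
    /* adding a device that needs to know */
    if (*s == IOMMU_UNDECIDED) {
        *s = IOMMU_NO;     /* locks in "no IOMMU" for this machine */
    }
    return 0;
}
```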
On Thu, 26 Aug 2021 10:01:01 +0200 Markus Armbruster <armbru@redhat.com> wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote: > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: > >> > Having thought about this a bit more... ... > > Any further thoughts will be greatly welcomed. > > I'd like to propose a more modest solution: detect the problem and fail. That's or proper dependency tracking (which stands chance to work with QMP, but like it was said it's complex) > A simple state machine can track "has IOMMU" state. It has three states > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > that needs to know". State diagram: > > no so far > +--- (start state) ---+ > | | > add IOMMU | | add device that > | | needs to know > v v > +--> yes no <--+ > | | add device that | | > +-----+ needs to know +-----+ > > "Add IOMMU" in state "no" is an error. question is how we distinguish "device that needs to know" from device that doesn't need to know, and then recently added feature 'bypass IOMMU' adds more fun to this. > "Add IOMMU" in state "yes" stays in state "yes" if multiple IOMMU make > sense. Else it's an error. > > The state machine could be owned by the machine type. > >
On Thu, Aug 26, 2021 at 01:36:29PM +0200, Igor Mammedov wrote: > On Thu, 26 Aug 2021 10:01:01 +0200 > Markus Armbruster <armbru@redhat.com> wrote: > > > Peter Xu <peterx@redhat.com> writes: > > > > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote: > > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: > > >> > Having thought about this a bit more... > ... > > > Any further thoughts will be greatly welcomed. > > > > I'd like to propose a more modest solution: detect the problem and fail. > That's or proper dependency tracking (which stands chance to work with QMP, > but like it was said it's complex) > > > A simple state machine can track "has IOMMU" state. It has three states > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > > that needs to know". State diagram: > > > > no so far > > +--- (start state) ---+ > > | | > > add IOMMU | | add device that > > | | needs to know > > v v > > +--> yes no <--+ > > | | add device that | | > > +-----+ needs to know +-----+ > > > > "Add IOMMU" in state "no" is an error. > > question is how we distinguish "device that needs to know" > from device that doesn't need to know, and then recently > added feature 'bypass IOMMU' adds more fun to this. Maybe we can start from "no device needs to know"? Then add more into it when the list expands. vfio-pci should be a natural fit because we're sure it won't break any valid old configurations. However we may need to be careful on adding more devices, e.g. when some old configuration might work on old binaries, but I'm not sure. Off-topic: I'm wondering whether bypass_iommu is just a work-around of not using iommu=pt in the guest cmdlines? Is there any performance comparison of using bypass_iommu against having iommu but with iommu=pt? At least intel iommu has pt enabled explicitly, i.e. 
it shouldn't even need a shadow iommu pgtable in the guest but only a single bit in device context entry showing that "this device wants to pass-through iommu", so I would expect the perf can be similar to explicitly bypass iommu in the qemu cmdline. Thanks,
On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote: > > > A simple state machine can track "has IOMMU" state. It has three states > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > > > that needs to know". State diagram: > > > > > > no so far > > > +--- (start state) ---+ > > > | | > > > add IOMMU | | add device that > > > | | needs to know > > > v v > > > +--> yes no <--+ > > > | | add device that | | > > > +-----+ needs to know +-----+ > > > > > > "Add IOMMU" in state "no" is an error. > > > > question is how we distinguish "device that needs to know" > > from device that doesn't need to know, and then recently > > added feature 'bypass IOMMU' adds more fun to this. > > Maybe we can start from "no device needs to know"? Then add more into it when > the list expands. > > vfio-pci should be a natural fit because we're sure it won't break any valid > old configurations. However we may need to be careful on adding more devices, > e.g. when some old configuration might work on old binaries, but I'm not sure. Btw, I think this state machine is indeed bringing some complexity on even understanding it - it is definitely working but it's not obvious to anyone at the first glance, and it's only solving problem for vIOMMU. E.g., do we need yet another state machine for some other ordering constraints? From that POV, I don't like this solution more than the simple "assign priority for device realization" idea..
Peter Xu <peterx@redhat.com> writes: > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote: >> > > A simple state machine can track "has IOMMU" state. It has three states >> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device >> > > that needs to know". State diagram: >> > > >> > > no so far >> > > +--- (start state) ---+ >> > > | | >> > > add IOMMU | | add device that >> > > | | needs to know >> > > v v >> > > +--> yes no <--+ >> > > | | add device that | | >> > > +-----+ needs to know +-----+ >> > > >> > > "Add IOMMU" in state "no" is an error. >> > >> > question is how we distinguish "device that needs to know" >> > from device that doesn't need to know, and then recently >> > added feature 'bypass IOMMU' adds more fun to this. >> >> Maybe we can start from "no device needs to know"? Then add more into it when >> the list expands. >> >> vfio-pci should be a natural fit because we're sure it won't break any valid >> old configurations. However we may need to be careful on adding more devices, >> e.g. when some old configuration might work on old binaries, but I'm not sure. > > Btw, I think this state machine is indeed bringing some complexity on even > understanding it - it is definitely working but it's not obvious to anyone at > the first glance, and it's only solving problem for vIOMMU. E.g., do we need > yet another state machine for some other ordering constraints? > > From that POV, I don't like this solution more than the simple "assign priority > for device realization" idea.. I wouldn't worry about other ordering constraints until we have them. If you do, please tell! I'm hoping you can't, because such implicit constraints are commonly signs of oversimplified / screwed up machine modeling.
On Thu, 26 Aug 2021 09:43:59 -0400 Peter Xu <peterx@redhat.com> wrote: > On Thu, Aug 26, 2021 at 01:36:29PM +0200, Igor Mammedov wrote: > > On Thu, 26 Aug 2021 10:01:01 +0200 > > Markus Armbruster <armbru@redhat.com> wrote: > > > > > Peter Xu <peterx@redhat.com> writes: > > > > > > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote: > > > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote: > > > >> > Having thought about this a bit more... > > ... > > > > Any further thoughts will be greatly welcomed. > > > > > > I'd like to propose a more modest solution: detect the problem and fail. > > That's or proper dependency tracking (which stands chance to work with QMP, > > but like it was said it's complex) > > > > > A simple state machine can track "has IOMMU" state. It has three states > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > > > that needs to know". State diagram: > > > > > > no so far > > > +--- (start state) ---+ > > > | | > > > add IOMMU | | add device that > > > | | needs to know > > > v v > > > +--> yes no <--+ > > > | | add device that | | > > > +-----+ needs to know +-----+ > > > > > > "Add IOMMU" in state "no" is an error. > > > > question is how we distinguish "device that needs to know" > > from device that doesn't need to know, and then recently > > added feature 'bypass IOMMU' adds more fun to this. > > Maybe we can start from "no device needs to know"? Then add more into it when > the list expands. > > vfio-pci should be a natural fit because we're sure it won't break any valid > old configurations. However we may need to be careful on adding more devices, > e.g. when some old configuration might work on old binaries, but I'm not sure. > Off-topic: I'm wondering whether bypass_iommu is just a work-around of not > using iommu=pt in the guest cmdlines? Is there any performance comparison of > using bypass_iommu against having iommu but with iommu=pt? 
At least intel > iommu has pt enabled explicitly, i.e. it shouldn't even need a shadow iommu > pgtable in the guest but only a single bit in device context entry showing that > "this device wants to pass-through iommu", so I would expect the perf can be > similar to explicitly bypass iommu in the qemu cmdline. They wanted to have a mix of iommu and non-iommu devices in VM (last merged revision was: [PATCH v5 0/9] IOMMU: Add support for IOMMU Bypass Feature) But 'why' reasoning was lost somewhere, CCing author. > > Thanks, >
On Mon, 30 Aug 2021 15:02:53 -0400 Peter Xu <peterx@redhat.com> wrote: > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote: > > > > A simple state machine can track "has IOMMU" state. It has three states > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > > > > that needs to know". State diagram: > > > > > > > > no so far > > > > +--- (start state) ---+ > > > > | | > > > > add IOMMU | | add device that > > > > | | needs to know > > > > v v > > > > +--> yes no <--+ > > > > | | add device that | | > > > > +-----+ needs to know +-----+ > > > > > > > > "Add IOMMU" in state "no" is an error. > > > > > > question is how we distinguish "device that needs to know" > > > from device that doesn't need to know, and then recently > > > added feature 'bypass IOMMU' adds more fun to this. > > > > Maybe we can start from "no device needs to know"? Then add more into it when > > the list expands. > > > > vfio-pci should be a natural fit because we're sure it won't break any valid > > old configurations. However we may need to be careful on adding more devices, > > e.g. when some old configuration might work on old binaries, but I'm not sure. > > Btw, I think this state machine is indeed bringing some complexity on even > understanding it - it is definitely working but it's not obvious to anyone at > the first glance, and it's only solving problem for vIOMMU. E.g., do we need > yet another state machine for some other ordering constraints? > > From that POV, I don't like this solution more than the simple "assign priority > for device realization" idea.. It seems simple but implicit dependencies are fragile (good or I'd rather say bad example implicit dependencies is vl.c magic code, where changing order of initialization can easily break QEMU, which happens almost every time it's refactored), and as Markus already mentioned it won't work in QMP case. 
What could work for both cases is explicit dependencies, however it would be hard to describe such dependency in this case, where device can work both with and without IOMMU depending on the bus settings it's attached to. Have you considered another approach, i.e. instead of reordering, change vfio-pci device model to reconfigure DMA address space after realize time (ex: at reset time)?
On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote: > On Mon, 30 Aug 2021 15:02:53 -0400 > Peter Xu <peterx@redhat.com> wrote: > > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote: > > > > > A simple state machine can track "has IOMMU" state. It has three states > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > > > > > that needs to know". State diagram: > > > > > > > > > > no so far > > > > > +--- (start state) ---+ > > > > > | | > > > > > add IOMMU | | add device that > > > > > | | needs to know > > > > > v v > > > > > +--> yes no <--+ > > > > > | | add device that | | > > > > > +-----+ needs to know +-----+ > > > > > > > > > > "Add IOMMU" in state "no" is an error. > > > > > > > > question is how we distinguish "device that needs to know" > > > > from device that doesn't need to know, and then recently > > > > added feature 'bypass IOMMU' adds more fun to this. > > > > > > Maybe we can start from "no device needs to know"? Then add more into it when > > > the list expands. > > > > > > vfio-pci should be a natural fit because we're sure it won't break any valid > > > old configurations. However we may need to be careful on adding more devices, > > > e.g. when some old configuration might work on old binaries, but I'm not sure. > > > > Btw, I think this state machine is indeed bringing some complexity on even > > understanding it - it is definitely working but it's not obvious to anyone at > > the first glance, and it's only solving problem for vIOMMU. E.g., do we need > > yet another state machine for some other ordering constraints? > > > > From that POV, I don't like this solution more than the simple "assign priority > > for device realization" idea.. > It seems simple but implicit dependencies are fragile (good or > I'd rather say bad example implicit dependencies is vl.c magic code, > where changing order of initialization can easily break QEMU, > which happens almost every time it's refactored), True. 
I was never able of telling whether that comes from natural complexity of the piece of software or there's indeed some smarter moves. > and as Markus already mentioned it won't work in QMP case. I didn't ask before, but I do have a question on whether that's a real problem. When QMP interface is ready, we should have a last phase of "preconfig done" command, right? I thought it was "x-exit-preconfig" now, but I'm not sure so am guessing. If so, can we do the reorder there and postpone all device creations, maybe? Then the ordering properties can be applied there too. The other thing is I think only libvirt will use the QMP case even if it'll be ready, and libvirt will need to know the ordering already like vIOMMU and vfio and pci buses - even if a new qemu supported the ordering of all these, libvirt still need to support old qemu binaries (and the code is already there anyway) and it's fairly natural they initiate the QMP commands using the same ordering. IMHO it means ordering for QMP is not as important as cmdline if that's always initiated by another software. > > What could work for both cases is explicit dependencies, > however it would be hard to describe such dependency in this case, > where device can work both with and without IOMMU depending > on the bus settings it's attached to. > > Have you considered another approach, i.e. instead of reordering, > change vfio-pci device model to reconfigure DMA address space > after realize time (ex: at reset time)? Yes, I agree this seems to be the cleanest appproach. It may just need more changes and they'll be on vfio, and I'm not aware of the rest (Jason should know better on virtio/vdpa). The other thing is some dependency cannot be resolved by this, e.g., the pci bus issue where we put the pci bus to be after the device that it owns. But indeed we're already early-fail that now so it seems ok. 
Side note: from my gut feeling I still think at some point we shouldn't do the early failure in the case of pci bus and device - the cmdline obviously contains enough information on the dependency of "this pci device needs that pci bus", and early-fail does seem to be a work-around to me just because they were specified in the wrong order. Thanks,
On Thu, Sep 02, 2021 at 09:45:43AM -0400, Peter Xu wrote: > On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote: > > On Mon, 30 Aug 2021 15:02:53 -0400 > > Peter Xu <peterx@redhat.com> wrote: > > > > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote: > > > > > > A simple state machine can track "has IOMMU" state. It has three states > > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device > > > > > > that needs to know". State diagram: > > > > > > > > > > > > no so far > > > > > > +--- (start state) ---+ > > > > > > | | > > > > > > add IOMMU | | add device that > > > > > > | | needs to know > > > > > > v v > > > > > > +--> yes no <--+ > > > > > > | | add device that | | > > > > > > +-----+ needs to know +-----+ > > > > > > > > > > > > "Add IOMMU" in state "no" is an error. > > > > > > > > > > question is how we distinguish "device that needs to know" > > > > > from device that doesn't need to know, and then recently > > > > > added feature 'bypass IOMMU' adds more fun to this. > > > > > > > > Maybe we can start from "no device needs to know"? Then add more into it when > > > > the list expands. > > > > > > > > vfio-pci should be a natural fit because we're sure it won't break any valid > > > > old configurations. However we may need to be careful on adding more devices, > > > > e.g. when some old configuration might work on old binaries, but I'm not sure. > > > > > > Btw, I think this state machine is indeed bringing some complexity on even > > > understanding it - it is definitely working but it's not obvious to anyone at > > > the first glance, and it's only solving problem for vIOMMU. E.g., do we need > > > yet another state machine for some other ordering constraints? > > > > > > From that POV, I don't like this solution more than the simple "assign priority > > > for device realization" idea.. 
> > It seems simple but implicit dependencies are fragile (good or > > I'd rather say bad example implicit dependencies is vl.c magic code, > > where changing order of initialization can easily break QEMU, > > which happens almost every time it's refactored), > > True. I was never able of telling whether that comes from natural complexity > of the piece of software or there's indeed some smarter moves. > > > and as Markus already mentioned it won't work in QMP case. > > I didn't ask before, but I do have a question on whether that's a real problem. > > When QMP interface is ready, we should have a last phase of "preconfig done" > command, right? I thought it was "x-exit-preconfig" now, but I'm not sure so > am guessing. If so, can we do the reorder there and postpone all device > creations, maybe? Then the ordering properties can be applied there too. > > The other thing is I think only libvirt will use the QMP case even if it'll be > ready, and libvirt will need to know the ordering already like vIOMMU and vfio > and pci buses - even if a new qemu supported the ordering of all these, libvirt > still need to support old qemu binaries (and the code is already there anyway) > and it's fairly natural they initiate the QMP commands using the same ordering. > > IMHO it means ordering for QMP is not as important as cmdline if that's always > initiated by another software. The correct ordering of devices/backends is generally pretty obvious for libvirt to determine. Most of the problems we've had related to ordering are on the QEMU side, because the ARGV given to QEMU made correct sense if parsed left-to-right, but QEMU didn't actually process them in that order. We've patched QEMU to hack around its inability to honour the CLI order repeatedly. 
Being completely self-ordering on the QEMU side using a topological sort would be neat from a conceptual purity POV, but that is quite a challenge to implement and I'm not convinced it is worth it, compared to other problems we want to spend time on. If we don't want to keep hacking up the current QEMU system binaries parsing, then we need to just have new binaries with sane ordering, or we need to willingly break compat in some staged manner. Regards, Daniel
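For illustration, the self-ordering scheme Daniel mentions would amount to a topological sort over a device dependency graph. A minimal Kahn's-algorithm sketch (hypothetical fixed-size graph, not QEMU code) that also detects the cycles such a scheme would have to reject:

```c
#include <assert.h>

#define MAXN 16

/* dep[i][j] != 0 means node i must be realized after node j. */
int topo_sort(int n, int dep[MAXN][MAXN], int out[MAXN])
{
    int indeg[MAXN] = { 0 }, done[MAXN] = { 0 }, k = 0;

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            indeg[i] += dep[i][j] ? 1 : 0;
        }
    }
    while (k < n) {
        int picked = -1;
        for (int i = 0; i < n; i++) {
            if (!done[i] && indeg[i] == 0) {
                picked = i;
                break;
            }
        }
        if (picked < 0) {
            return -1;          /* cycle: nothing left with satisfied deps */
        }
        done[picked] = 1;
        out[k++] = picked;
        for (int i = 0; i < n; i++) {
            if (dep[i][picked]) {
                indeg[i]--;     /* dependency on 'picked' now satisfied */
            }
        }
    }
    return 0;
}
```

The sort itself is simple; as the thread notes, the hard part is declaring the dependencies correctly in the first place, especially optional ones like device-to-IOMMU.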
Hi, Dan, On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote: > The correct ordering of devices/backends is generally pretty obvious > for libvirt to determine. Most of the problems we've had related to > ordering are on the QEMU side, because the ARGV given to QEMU made > correct sense if parsed left-to-right, but QEMU didn't actually process > them in that order. We've patched QEMU to hack around its inability to > honour the CLI order repeatedly. Is there a pointer to the problem? > > Being completely self-ordering on the QEMU side using a topological > sort would be neat from a conceptual purity POV, but that is quite a > challenge to implement and I'm not convinced it is worth it, compared > to other problems we want to spend time on. I just noticed there can also be a dependency between the buses; that cannot be fixed by ordering of classes, as either proposed in this series or by introducing a new priority. Thanks,
Peter Xu <peterx@redhat.com> writes: > Hi, Dan, > > On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote: >> The correct ordering of devices/backends is generally pretty obvious >> for libvirt to determine. Most of the problems we've had related to >> ordering are on the QEMU side, because the ARGV given to QEMU made >> correct sense if parsed left-to-right, but QEMU didn't actually process >> them in that order. We've patched QEMU to hack around its inability to >> honour the CLI order repeatedly. > > Is there a pointer to the problem? Just an example: 9ea18ed25a "vl: Fix -drive / -blockdev persistent reservation management" cda4aa9a5a "vl: Create block backends before setting machine properties" >> Being completely self-ordering on the QEMU side using a topological >> sort would be neat from a conceptual purity POV, but that is quite a >> challenge to implement and I'm not convinced it is worth it, compared >> to other problems we want to spend time on. > I just noticed there can also be dependency between the buses; that cannot be > fixed by ordering of classes indeed as either proposed in this series, or > introduce a new priority. --verbose?
On Thu, Sep 02, 2021 at 10:21:19AM -0400, Peter Xu wrote:
> Hi, Dan,
>
> On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> > The correct ordering of devices/backends is generally pretty obvious
> > for libvirt to determine. Most of the problems we've had related to
> > ordering are on the QEMU side, because the ARGV given to QEMU made
> > correct sense if parsed left-to-right, but QEMU didn't actually process
> > them in that order. We've patched QEMU to hack around its inability to
> > honour the CLI order repeatedly.
>
> Is there a pointer to the problem?

The biggest problem was with -object, where some objects need to be
created before chardevs and some created after chardevs. See the
object_create_early() method in softmmu/vl.c for the gross hardcoded
hacks.

Regards,
Daniel
Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
>> On Mon, 30 Aug 2021 15:02:53 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>>
>> > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
>> > > > > A simple state machine can track "has IOMMU" state. It has three states
>> > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
>> > > > > that needs to know". State diagram:
>> > > > >
>> > > > >                   no so far
>> > > > >            +--- (start state) ---+
>> > > > >            |                     |
>> > > > > add IOMMU  |                     |  add device that
>> > > > >            |                     |  needs to know
>> > > > >            v                     v
>> > > > >     +--> yes                     no <--+
>> > > > >     |     |    add device that  |      |
>> > > > >     +-----+    needs to know    +------+
>> > > > >
>> > > > > "Add IOMMU" in state "no" is an error.
>> > > >
>> > > > question is how we distinguish "device that needs to know"
>> > > > from device that doesn't need to know, and then recently
>> > > > added feature 'bypass IOMMU' adds more fun to this.
>> > >
>> > > Maybe we can start from "no device needs to know"? Then add more into it when
>> > > the list expands.
>> > >
>> > > vfio-pci should be a natural fit because we're sure it won't break any valid
>> > > old configurations. However we may need to be careful on adding more devices,
>> > > e.g. when some old configuration might work on old binaries, but I'm not sure.
>> >
>> > Btw, I think this state machine is indeed bringing some complexity on even
>> > understanding it - it is definitely working but it's not obvious to anyone at
>> > the first glance, and it's only solving problem for vIOMMU. E.g., do we need
>> > yet another state machine for some other ordering constraints?
>> >
>> > From that POV, I don't like this solution more than the simple "assign priority
>> > for device realization" idea..
>> It seems simple but implicit dependencies are fragile (a good, or
>> I'd rather say bad, example of implicit dependencies is vl.c magic code,
>> where changing the order of initialization can easily break QEMU,
>> which happens almost every time it's refactored),
>
> True. I was never able to tell whether that comes from the natural complexity
> of the piece of software or there's indeed some smarter move.
>
>> and as Markus already mentioned it won't work in the QMP case.
>
> I didn't ask before, but I do have a question on whether that's a real problem.
>
> When the QMP interface is ready, we should have a last phase of "preconfig done"
> command, right? I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing. If so, can we do the reorder there and postpone all device
> creations, maybe? Then the ordering properties can be applied there too.

This would regress error reporting.

A significant advantage of doing configuration in basic steps is the
relative ease of pinpointing what exactly fails.  If you change
device_add to merely record the request for later, you delay the
non-trivial failures until later.

Compare:

    -> {'execute':'device_add','arguments':{'driver':'virtio-blk',"drive": "foo"}}
    <- {"error": {"class": "GenericError", "desc": "Device needs media, but drive is empty"}}

to getting the same error in reply to x-exit-preconfig, with a dozen or
two device_add queued up.  The error comes from
virtio_blk_device_realize(), which you propose to delay until
x-exit-preconfig.

In theory, we can rework all the errors in question to provide
sufficient context, and rework libvirt to pick the context apart, so it
can pinpoint what's wrong in the user's configuration.  In practice,
thanks, but no thanks.
> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still needs to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.

This is actually a pretty strong argument for not having QEMU reorder
at all.

> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.
>
>> What could work for both cases is explicit dependencies,
>> however it would be hard to describe such a dependency in this case,
>> where a device can work both with and without an IOMMU depending
>> on the settings of the bus it's attached to.
>>
>> Have you considered another approach, i.e. instead of reordering,
>> change the vfio-pci device model to reconfigure its DMA address space
>> after realize time (ex: at reset time)?
>
> Yes, I agree this seems to be the cleanest approach.  It may just need more
> changes and they'll be on vfio, and I'm not aware of the rest (Jason should
> know better on virtio/vdpa).

> The other thing is some dependencies cannot be resolved by this, e.g., the pci
> bus issue where we put the pci bus after the device that it owns.  But
> indeed we already early-fail that now, so it seems ok.

Yes, that's a complete non-problem :)

> Side note: from my gut feeling I still think at some point we shouldn't do
> the early-failure in the case of pci bus and device - the cmdline obviously
> contains enough information on the dependency ("this pci device needs that
> pci bus"), and early-fail does seem to be a workaround to me just because
> they were specified in the wrong order.

Again, there are two sane command line evaluation orders: (1) strictly
left to right, and (2) order doesn't matter.
QEMU of course does (3) unpredictable for humans without referring back
to the source code.

The difficulty with getting to (1) is mostly compatibility and putting
in the work.  Reporting when things don't get created in time nicely
and reliably is yet more work.  Not particularly challenging, just
work.

Getting to a reliable version of (2) feels like a pipe dream to me.
On Thu, 2 Sep 2021 09:45:43 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> > On Mon, 30 Aug 2021 15:02:53 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> >
> > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > > > A simple state machine can track "has IOMMU" state. It has three states
> > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > > that needs to know". State diagram:
> > > > > >
> > > > > >                   no so far
> > > > > >            +--- (start state) ---+
> > > > > >            |                     |
> > > > > > add IOMMU  |                     |  add device that
> > > > > >            |                     |  needs to know
> > > > > >            v                     v
> > > > > >     +--> yes                     no <--+
> > > > > >     |     |    add device that  |      |
> > > > > >     +-----+    needs to know    +------+
> > > > > >
> > > > > > "Add IOMMU" in state "no" is an error.
> > > > >
> > > > > question is how we distinguish "device that needs to know"
> > > > > from device that doesn't need to know, and then recently
> > > > > added feature 'bypass IOMMU' adds more fun to this.
> > > >
> > > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > > the list expands.
> > > >
> > > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > > old configurations. However we may need to be careful on adding more devices,
> > > > e.g. when some old configuration might work on old binaries, but I'm not sure.
> > >
> > > Btw, I think this state machine is indeed bringing some complexity on even
> > > understanding it - it is definitely working but it's not obvious to anyone at
> > > the first glance, and it's only solving problem for vIOMMU. E.g., do we need
> > > yet another state machine for some other ordering constraints?
> > >
> > > From that POV, I don't like this solution more than the simple "assign priority
> > > for device realization" idea..
> > It seems simple but implicit dependencies are fragile (a good, or
> > I'd rather say bad, example of implicit dependencies is vl.c magic code,
> > where changing the order of initialization can easily break QEMU,
> > which happens almost every time it's refactored),
>
> True. I was never able to tell whether that comes from the natural complexity
> of the piece of software or there's indeed some smarter move.
>
> > and as Markus already mentioned it won't work in the QMP case.
>
> I didn't ask before, but I do have a question on whether that's a real problem.
>
> When the QMP interface is ready, we should have a last phase of "preconfig done"
> command, right? I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing. If so, can we do the reorder there and postpone all device
> creations, maybe? Then the ordering properties can be applied there too.

I don't think "x-exit-preconfig" would help here; ideally devices would
be created before this stage, interactively via QMP.  So whoever is
calling device_add would need to know the proper ordering or get an
error as the result.  And even if we had explicit dependencies, they
would be easier to use with the current error-out policy: if something
is out of the expected order, tell the user to fix the CLI/QMP command
order (and replace the ad hoc checks used currently).

> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still needs to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.
>
> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.

PS: If I'm not wrong, we are slowly working towards composing the
machine from QMP, with the CLI ideally becoming a shim on top of that.
Hence the reason QMP is mentioned in this thread, and the desire to
avoid more magic.

> > What could work for both cases is explicit dependencies,
> > however it would be hard to describe such a dependency in this case,
> > where a device can work both with and without an IOMMU depending
> > on the settings of the bus it's attached to.
> >
> > Have you considered another approach, i.e. instead of reordering,
> > change the vfio-pci device model to reconfigure its DMA address space
> > after realize time (ex: at reset time)?
>
> Yes, I agree this seems to be the cleanest approach.  It may just need more
> changes and they'll be on vfio, and I'm not aware of the rest (Jason should
> know better on virtio/vdpa).

It's also a bit closer to the "real world", where the IOMMU dependency
is configured by firmware/OS.  In addition it doesn't require any
machine-specific code: a device checks if its parent bus is managed by
an IOMMU and is configured accordingly.

> The other thing is some dependencies cannot be resolved by this, e.g., the pci
> bus issue where we put the pci bus after the device that it owns.  But
> indeed we already early-fail that now, so it seems ok.
>
> Side note: from my gut feeling I still think at some point we shouldn't do
> the early-failure in the case of pci bus and device - the cmdline obviously
> contains enough information on the dependency ("this pci device needs that
> pci bus"), and early-fail does seem to be a workaround to me just because
> they were specified in the wrong order.

That's how the QEMU CLI works most of the time (i.e. left-to-right order
of initialization): it errors out on the offending CLI option and tells
the user to fix it, instead of trying to fix up the CLI order.  So I
think the current pci_bus behavior does align with that.

PS: Another, albeit machine-dependent, approach to resolve the IOMMU
ordering problem can be adding an IOMMU-handling pre_plug hook to a
specific machine.
Which is called during IOMMU realize time and checks if existing buses
without bypass enabled (i.e. IOMMU-managed) have any children.  And if
they have devices attached, error out, telling the user to reorder
'-device iommu' before the affected devices/bus.

It should cover the mixed IOMMU+bypass case and doesn't require fixing
vfio-pci address space initialization nor defining any priorities for
PCI devices.

(but I think it's more of a hack compared to the earlier suggested
address space initialization at reset time, and it would need to be
done for every affected machine)

> Thanks,
>
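[Editor's note: the three-state "has IOMMU" tracker quoted in the two messages above can be sketched as a tiny state machine. This is an illustrative, self-contained sketch; the names (`IommuState`, `iommu_sm_*`) are made up for the example and are not QEMU code.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of Markus's three-state tracker: states "no so far", "yes",
 * "no"; events "add IOMMU" and "add device that needs to know".
 */
typedef enum {
    IOMMU_NO_SO_FAR,    /* start state: nothing decided yet */
    IOMMU_YES,          /* an IOMMU has been added */
    IOMMU_NO            /* a device that needs to know came first */
} IommuState;

/* "Add IOMMU" event; returns false in state "no", which is an error. */
static bool iommu_sm_add_iommu(IommuState *s)
{
    if (*s == IOMMU_NO) {
        return false;
    }
    *s = IOMMU_YES;
    return true;
}

/* "Add device that needs to know" event; locks in "no" if undecided. */
static void iommu_sm_add_device(IommuState *s)
{
    if (*s == IOMMU_NO_SO_FAR) {
        *s = IOMMU_NO;
    }
    /* "yes" and "no" stay put: the self-loops in the diagram. */
}
```

As the thread notes, the hard part is not the machine itself but deciding which device classes count as "needs to know", especially with per-bus IOMMU bypass in the picture.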
On Thu, Sep 02, 2021 at 04:57:04PM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Hi, Dan,
> >
> > On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> >> The correct ordering of devices/backends is generally pretty obvious
> >> for libvirt to determine. Most of the problems we've had related to
> >> ordering are on the QEMU side, because the ARGV given to QEMU made
> >> correct sense if parsed left-to-right, but QEMU didn't actually process
> >> them in that order. We've patched QEMU to hack around its inability to
> >> honour the CLI order repeatedly.
> >
> > Is there a pointer to the problem?
>
> Just an example:
>
> 9ea18ed25a "vl: Fix -drive / -blockdev persistent reservation management"
> cda4aa9a5a "vl: Create block backends before setting machine properties"

Thanks, same to Dan.

> >> Being completely self-ordering on the QEMU side using a topological
> >> sort would be neat from a conceptual purity POV, but that is quite a
> >> challenge to implement and I'm not convinced it is worth it, compared
> >> to other problems we want to spend time on.
> >
> > I just noticed there can also be dependency between the buses; that cannot be
> > fixed by ordering of classes indeed as either proposed in this series, or
> > introduce a new priority.
>
> --verbose?

Please ignore it - I just started to realize what kind of a rabbit hole
I'm digging, and I am already prepared to run away. :)
On Fri, Sep 03, 2021 at 03:00:05PM +0200, Igor Mammedov wrote:
> PS:
> Another, albeit machine depended approach to resolve IOMMU ordering problem
> can be adding to a specific machine pre_plug hook, an IOMMU handling.
> Which is called during IOMMU realize time and check if existing buses
> without bypass enabled (iommu managed) have any children. And if they
> have devices attached, error out telling user to reorder '-device iommu'
> before affected devices/bus.
> It should cover mixed IOMMU+bypass case and doesn't require fixing
> vfio-pci address space initialization nor defining any priorities
> for PCI devices.

This sounds the most appealing of the approaches.

Does it need to be a pre_plug hook?  I thought we might just need a flag
in the pci device classes showing that they should come after vIOMMUs,
then in the vIOMMU realize functions we walk the pci bus to make sure no
such device exists?

We could have a base vIOMMU class, and the check could live in the
realize() of that common class.

> (but I think it's more a hack compared earlier suggested
> address space initialization at reset time, and it would need to be
> done for every affected machine)

Agreed.
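[Editor's note: the check Peter proposes above — a per-class flag plus a bus walk at vIOMMU realize time — could look roughly like the following. This is a hypothetical, self-contained sketch; `DemoDevice`, `DemoBus`, and `viommu_realize_check` are invented names, not QEMU's qdev/PCI API.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * At vIOMMU realize time, walk the buses and fail if any non-bypass
 * bus already has a child device whose class is flagged as needing to
 * be created after the vIOMMU (e.g. vfio-pci-like devices).
 */
typedef struct DemoDevice {
    const char *name;
    bool must_follow_iommu;     /* the per-class flag Peter suggests */
} DemoDevice;

typedef struct DemoBus {
    bool bypass_iommu;
    const DemoDevice *children;
    size_t nchildren;
} DemoBus;

/* Returns NULL on success, or the offending device on error. */
static const DemoDevice *viommu_realize_check(const DemoBus *buses,
                                              size_t nbuses)
{
    for (size_t i = 0; i < nbuses; i++) {
        if (buses[i].bypass_iommu) {
            continue;   /* bypass buses are not affected by the vIOMMU */
        }
        for (size_t j = 0; j < buses[i].nchildren; j++) {
            if (buses[i].children[j].must_follow_iommu) {
                return &buses[i].children[j];
            }
        }
    }
    return NULL;
}
```

On failure, the realize function would error out and tell the user to move '-device iommu' before the affected device, matching the error-out policy discussed earlier in the thread. Igor's reply below argues the flag doesn't really belong on the device class at all.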
On Fri, 3 Sep 2021 12:03:06 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Sep 03, 2021 at 03:00:05PM +0200, Igor Mammedov wrote:
> > PS:
> > Another, albeit machine depended approach to resolve IOMMU ordering problem
> > can be adding to a specific machine pre_plug hook, an IOMMU handling.
> > Which is called during IOMMU realize time and check if existing buses
> > without bypass enabled (iommu managed) have any children. And if they
> > have devices attached, error out telling user to reorder '-device iommu'
> > before affected devices/bus.
> > It should cover mixed IOMMU+bypass case and doesn't require fixing
> > vfio-pci address space initialization nor defining any priorities
> > for PCI devices.
>
> This sounds appealing among the approaches.

That's the easy one, compared to moving address space (re)initialization
to reset time (at least to me, since vfio realize looks intimidating at
first glance, but its maintainer(s) probably know enough to implement
the change properly).

> Does it need to be a pre_plug hook? I thought we might just need a flag in the
> pci device classes showing that it should be after vIOMMUs, then in vIOMMU
> realize functions we walk pci bus to make sure no such device exists?
>
> We could have a base vIOMMU class, then that could be in the realize() of the
> common class.

We basically don't know whether a device needs an IOMMU or not; it can
work with or without it just fine.  In this case I'd think about the
IOMMU as a board feature that morphs (some of the) PCI buses (address
space, bus numbers, ...).  So I don't perceive any iommu flag as a
device property at all.
As for realize vs pre_plug, the latter is part of abstract realize (see:
device_set_realized) and is already used by some PCI infrastructure,
e.g. pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug.

Its purpose is to check pre-conditions and possibly pre-configure some
wiring on behalf of the device's parent hotplug handler (bus
owner/machine), and fail cleanly if something is wrong, without leaving
side effects.

See 0ed48fd32eb8 for the boilerplate required to set up a custom
hotplug handler.  You might need only parts of it, but still it's
something that has to be done for each affected machine type, to
implement error checking at the proper layer.

So I'd rather look into the 'reset' approach, and only if that doesn't
look possible, resort to adding a pre_plug/error check.

PS: your d2321d31ff98b & c6cbc29d36f look to me like other candidates
for pre_plug for pci devices, instead of adding a dedicated hook just
for vfio-pci to the generic machine.

> > (but I think it's more a hack compared earlier suggested
> > address space initialization at reset time, and it would need to be
> > done for every affected machine)
>
> Agreed.
>
Hi,

On 8/25/21 6:23 AM, Jason Wang wrote:
> On Tue, Aug 24, 2021 at 11:50 PM Peter Xu <peterx@redhat.com> wrote:
>> On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
>>> It looks to me this doesn't solve the issue of using virtio-mmio with vhost?
>> No IOMMU supported for any of the MMIO devices, right? Or am I wrong? Thanks,
> I guess virtio IOMMU has that support, but I might be wrong.

No, that's not supported yet in QEMU.  There is an in-flight
contribution attempting to support vSMMUv3 protection for sysbus
devices though:

[PATCH v6 0/4] hw/arm/smmuv3: Support non PCI/PCIe devices

Thanks

Eric

>
> Thanks
>
>> --
>> Peter Xu
>>
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 5ca11e7469..3a30dfe27d 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -126,6 +126,7 @@
 #include "qapi/qmp/qerror.h"
 #include "sysemu/iothread.h"
 #include "qemu/guest-random.h"
+#include "migration/vmstate.h"
 
 #include "config-host.h"
 
@@ -2627,6 +2628,35 @@ static void qemu_init_board(void)
     }
 }
 
+/* Return the priority of the device option; zero is the default priority */
+static int qemu_opts_device_priority(QemuOpts *opts)
+{
+    const char *driver;
+    DeviceClass *dc;
+
+    driver = qemu_opt_get(opts, "driver");
+    if (!driver) {
+        return 0;
+    }
+
+    dc = qdev_get_device_class(&driver, NULL);
+    if (!dc) {
+        return 0;
+    }
+
+    if (!dc->vmsd) {
+        return 0;
+    }
+
+    /*
+     * Currently we rely on vmsd priority because that's solving the same
+     * problem for device realization ordering but just for migration.  In the
+     * future, we can move it out of vmsd, but that's not urgently needed.
+     * Return the negative of it so it'll be sorted in descending order.
+     */
+    return -dc->vmsd->priority;
+}
+
 static void qemu_create_cli_devices(void)
 {
     soundhw_init();
@@ -2642,6 +2672,11 @@ static void qemu_create_cli_devices(void)
 
     /* init generic devices */
     rom_set_order_override(FW_CFG_ORDER_OVERRIDE_DEVICE);
+    /*
+     * Sort all the -device parameters; e.g., platform devices like vIOMMU
+     * should be initialized earlier.
+     */
+    qemu_sort_opts("device", qemu_opts_device_priority);
     qemu_opts_foreach(qemu_find_opts("device"), device_init_func,
                       NULL, &error_fatal);
     rom_reset_order_override();
QEMU creates -device objects in the order specified on the user's cmdline.
However that ordering may not be the ideal order.  For example, some
platform devices (vIOMMUs) may want to be created earlier than most of the
other devices (e.g., vfio-pci, virtio).

This patch sorts the QemuOptsList of '-device' entries before kicking off
the device realizations.  This will allow the device realization code to
use APIs like pci_device_iommu_address_space() correctly, because those
functions rely on the platform devices being realized.

For now we rely on vmsd->priority, which is defined as MigrationPriority,
to provide the ordering, as both VM init and migration completion need
such an ordering.  In the future we can move that priority information out
of vmsd.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 softmmu/vl.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
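[Editor's note: Eduardo's review raised that qsort() does not preserve the relative order of equal-priority entries, so sorting -device options purely by priority could silently reorder same-priority devices. A common workaround, sketched below under that assumption, is to record each entry's original position and use it as a tie-breaker. Names (`DevOpt`, `devopt_cmp`) and the priority values are illustrative, not the patch's actual code.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One decorated -device option; in QEMU the priority would come from
 * dc->vmsd->priority and the index from the command-line position. */
typedef struct DevOpt {
    const char *driver;
    int priority;
    size_t index;       /* original command-line position */
} DevOpt;

static int devopt_cmp(const void *pa, const void *pb)
{
    const DevOpt *a = pa, *b = pb;

    if (a->priority != b->priority) {
        /* higher priority sorts first (descending) */
        return (a->priority < b->priority) - (a->priority > b->priority);
    }
    /* Equal priority: preserve the user's original ordering. */
    return (a->index > b->index) - (a->index < b->index);
}
```

With the index tie-breaker the comparator is a total order, so the result is deterministic even though qsort() itself is unstable: same-priority devices keep their command-line order while a high-priority vIOMMU still moves to the front.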